OpenWPM Documentation
What is OpenWPM?
Web Privacy Measurement is the observation of websites and services to detect, characterize and quantify privacy-impacting behaviors. Applications of Web Privacy Measurement include the detection of price discrimination, targeted news articles and new forms of browser fingerprinting. Although originally focused solely on privacy violations, WPM now encompasses measuring security violations on the web as well.
For these studies to be truly large-scale and repeatable, an automated measurement platform is necessary. At least within the academic literature, measurement infrastructures in the field of WPM have largely been one-off efforts that do not comprehensively address the engineering challenges within this realm.
OpenWPM, a flexible, stable, scalable and general web measurement platform, is our solution to this infrastructure vacuum. This tutorial shows how to get started with OpenWPM, gives an overview of its general functionality, and lists some key engineering challenges that are still being solved. We hope that this tool will enable other researchers to perform WPM studies, and we welcome future collaboration.
Core contribution
Our core contribution has been decoupling measurement from browser automation to provide the stability necessary to run web-scale studies, while remaining easy to extend for new measurement studies. For example, FourthParty is an excellent Firefox extension that captures HTTP data and JavaScript calls during a normal Firefox run. However, each researcher who wishes to use FourthParty must write their own code to automate the measurement, handle errors, and process a separate output file for each run. Rinse and repeat for the next measurement project.
Instead, the functionality of each individual project can be built into a common measurement platform. Our goal in releasing OpenWPM has been to provide a platform to do just that (we have already implemented most of the FourthParty functionality). Once study-specific code has been implemented, a researcher can immediately deploy a crawl utilizing multiple browsers in parallel on the same machine, with all browser data aggregated in a central database.
Our primary technical contributions thus far are as follows:
- Parallel browser automation with synchronization
- Browser crash recovery with full profile support
- Ability to set per-browser properties (e.g., screen size, extensions, user-agent string)
- JavaScript emulation of mouse movement and scrolling
- Per-browser HTTP request/response logging
- Scanning of Flash Storage and HTTP Cookie Storage after each page visit (extending to other storage locations is possible)
- Loading and saving of browser profiles for multi-crawl studies
- Full command logging
- Aggregation of measurement data centrally from all browsers
Getting started
This wiki provides information on the platform architecture and set-up, and demonstrates how to run and customize the platform.