OpenWPM Documentation
What is OpenWPM?
Web Privacy Measurement is the observation of websites and services to detect, characterize and quantify privacy-impacting behaviors. Applications of Web Privacy Measurement include the detection of price discrimination, targeted news articles and new forms of browser fingerprinting. Although originally focused solely on privacy violations, WPM now encompasses measuring security violations on the web as well.
For these studies to be truly large-scale and repeatable, an automated measurement platform is necessary. At least within the academic literature, measurement infrastructures in the field of WPM have largely been one-off efforts that do not comprehensively address the engineering challenges within this realm.
OpenWPM, a flexible, stable, scalable and general web measurement platform, is our solution to this infrastructure vacuum. This tutorial shows how to get started with OpenWPM, gives an overview of its general functionality, and lists some key engineering challenges that are still being solved. We hope that this tool will enable other researchers to perform WPM studies, and we welcome future collaboration.
Core contribution
Our core contribution has been decoupling measurement from browser automation to provide the stability necessary to run web-scale studies, while remaining easy to extend for new measurement studies. For example, FourthParty is an excellent Firefox extension that captures HTTP data and JavaScript calls during a normal Firefox run. However, each researcher who wishes to use FourthParty must write their own code to automate the measurement, handle errors, and process a separate output file for each run. Rinse and repeat for the next measurement project.
Instead, the functionality of each individual project can be built into a common measurement platform. Our goal in releasing OpenWPM has been to provide a platform to do just that (we have already implemented most of the FourthParty functionality). Once study-specific code has been implemented, a researcher can immediately deploy a crawl utilizing multiple browsers in parallel on the same machine, with all browser data aggregated in a central database.
Our primary technical contributions thus far are as follows:
- Parallel browser automation with synchronization
- Browser crash recovery with full profile support
- Ability to set per-browser properties (e.g., screen size, extensions, user-agent string)
- JavaScript emulation of mouse movement and scrolling
- Per-browser HTTP request/response logging
- Scanning of Flash Storage and HTTP Cookie Storage after each page visit (extending to other storage locations is possible)
- Loading and saving of browser profiles for multi-crawl studies
- Full command logging
- Aggregation of measurement data centrally from all browsers
Getting started
This wiki provides information on the platform architecture and set-up, and demonstrates how to run and customize the platform.