OpenWPM [![Build Status](https://travis-ci.org/citp/OpenWPM.svg)](https://travis-ci.org/citp/OpenWPM)
=======

OpenWPM is a web privacy measurement framework which makes it easy to collect
data for privacy studies on a scale of thousands to millions of sites. OpenWPM
is built on top of Firefox, with automation provided by Selenium. It includes
several hooks for data collection, including a proxy, a Firefox extension, and
access to Flash cookies. Check out the Instrumentation section below for more
details.

Installation
------------

OpenWPM has been developed and tested on Ubuntu 14.04. An installation script,
`install.sh`, is included to install both the system and Python dependencies
automatically. A few of the Python dependencies require specific versions, so
you should install the dependencies in a virtual environment if you're
installing on a shared machine.

OpenWPM will likely also work on Mac OS X, but this has not been tested. If
you have experience running OpenWPM on other platforms, please let us know!

Quick Start
-----------

Once installed, it's very easy to run a quick test of OpenWPM. Check out
`demo.py` for an example. This will use the default settings specified in
`automation/default_manager_params.json` and
`automation/default_browser_params.json`, with the exception of the changes
specified in `demo.py`.

You can test other configurations by changing the values in these two
dictionaries. `manager_params` is meant to specify the platform-wide settings,
while `browser_params` specifies browser-specific settings (and as such
defaults to a `list` of settings, of length equal to the number of browsers you
are using). We are currently working on full documentation of these settings.

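As a rough illustration of how the two settings objects relate, the sketch
below builds one platform-wide dictionary and a list holding one settings
dictionary per browser. It is not taken from `demo.py`: the helper name
`build_params`, the `num_browsers` argument, and the placeholder values are
assumptions made for this example. In a real setup the defaults would come from
the two JSON files named above.

```python
import copy


def build_params(default_manager, default_browser, num_browsers):
    """Return (manager_params, browser_params) for `num_browsers` browsers.

    Hypothetical helper for illustration only; it is not part of OpenWPM.
    """
    # Platform-wide settings: a single dictionary shared by the whole crawl.
    manager_params = copy.deepcopy(default_manager)
    # Browser-specific settings: one independent dictionary per browser.
    browser_params = [copy.deepcopy(default_browser) for _ in range(num_browsers)]
    return manager_params, browser_params


# Placeholder defaults (assumed values); the real ones live in the JSON files.
default_manager = {'database_name': 'crawl-data.sqlite', 'log_file': 'openwpm.log'}
default_browser = {'save_javascript': False}

manager_params, browser_params = build_params(default_manager, default_browser, 3)

# Override a platform-wide value and a value for the first browser only.
manager_params['data_directory'] = '~/openwpm-data/'
browser_params[0]['save_javascript'] = True
```

Copying the defaults for each browser keeps the per-browser dictionaries
independent, so changing one browser's settings leaves the others untouched.
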
The [wiki](https://github.com/citp/OpenWPM/wiki) provides a more in-depth
tutorial, although it is currently out of date. In particular, it covers
[advanced features](https://github.com/citp/OpenWPM/wiki/Advanced-Features)
and [additional
commands](https://github.com/citp/OpenWPM/wiki/Available-Commands).
You can also take a look at two of our past studies, (1) and (2), which use the
infrastructure.

(1) [The Web Never Forgets](https://github.com/citp/TheWebNeverForgets)
(2) [Cookies that Give You Away](https://github.com/englehardt/cookies-that-give-you-away)

Instrumentation
---------------

OpenWPM includes the following instrumentation by default:

* An HTTP Proxy (mitmproxy)
    * HTTP Requests and Responses
    * Parsing of HTTP Request and Response Cookies
        * NOTE: this will not include cookies set by JavaScript; see our
          Firefox extension option below.
    * De-duplicated content storage
        * Right now we detect and store JavaScript, but this can be expanded
* A Firefox Extension
    * JavaScript calls
    * Cookie setting and access
* Disk Scans
    * Flash cookie setting
    * Cookie access

Data Format
-----------

OpenWPM saves crawl data in several outputs. The bulk of the data is stored
in a SQLite database, but additional data may be stored in the locations
detailed below.

* HTTP, Cookie, JavaScript calls, and meta-data
    * SQLite database specified by `manager_params['database_name']`.
    * Schema specified by `automation/schema.sql`; instrumentation may specify
      additional tables necessary for its measurements.
* JavaScript files
    * Collected when `browser_params['save_javascript'] = True`
    * JavaScript files are stored in `javascript.ldb`. The location of this
      database is specified by `manager_params['data_directory']`.
    * The files are stored with `zlib` compression, keyed by the hash of the
      uncompressed content.
    * The files are stored in a `LevelDB` database, accessed with `plyvel`.
    * This hash is used to reference the scripts from the SQLite database, for
      example the `content_hash` column of HTTP Responses.
* Log Files
    * Stored in the directory specified by `manager_params['data_directory']`.
    * Name specified by `manager_params['log_file']`.
* Browser Profile
    * Contains cookies, Flash objects, and so on that are dumped after a crawl
      is finished.
    * Dumped to the location specified in the `dump_profile` command.

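The JavaScript storage scheme described in the list above (compressed content
keyed by the hash of the uncompressed content) can be sketched roughly as
follows. This is an illustration under stated assumptions, not OpenWPM's
implementation: a plain `dict` stands in for the LevelDB database, the choice of
MD5 is a guess since the hash function is not named here, and `store_script`
and `load_script` are hypothetical helper names.

```python
import hashlib
import zlib


def store_script(db, content):
    """Store `content` (bytes) compressed, keyed by the hash of the raw bytes."""
    # Assumption: an MD5 hex digest is used as the key.
    key = hashlib.md5(content).hexdigest().encode('ascii')
    # Identical content hashes to the same key, so it is only stored once.
    if db.get(key) is None:
        db[key] = zlib.compress(content)
    return key


def load_script(db, content_hash):
    """Return the uncompressed bytes stored under `content_hash`."""
    return zlib.decompress(db[content_hash])


# A plain dict stands in for the LevelDB database in this sketch.
db = {}
key_a = store_script(db, b'console.log("hello");')
key_b = store_script(db, b'console.log("hello");')  # duplicate content
```

Because the key is derived from the content itself, storing the same script
twice leaves a single entry, which is what makes the storage de-duplicated.
With `plyvel`, the dictionary reads and writes would correspond to `DB.get` and
`DB.put`.
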
The database is keyed by the crawler ID and the `top_url` being visited (the
URL typed into the browser address bar).

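To make the keying concrete, the sketch below looks up the rows recorded for one
crawler's visit to one site. It runs against a small in-memory database rather
than real crawl output. Only `top_url` and `content_hash` are names taken from
this README; the table name `http_responses`, the `crawl_id` and `url` columns,
and the helper `responses_for_visit` are assumptions, so check
`automation/schema.sql` for the actual schema.

```python
import sqlite3

# In-memory stand-in for the crawl database; table and column names other than
# `top_url` and `content_hash` are assumptions made for this sketch.
conn = sqlite3.connect(':memory:')
conn.execute(
    'CREATE TABLE http_responses ('
    'crawl_id INTEGER, top_url TEXT, url TEXT, content_hash TEXT)'
)
conn.executemany(
    'INSERT INTO http_responses VALUES (?, ?, ?, ?)',
    [
        (1, 'http://example.com', 'http://example.com/a.js', 'hash-a'),
        (1, 'http://example.org', 'http://example.org/b.js', 'hash-b'),
        (2, 'http://example.com', 'http://example.com/a.js', 'hash-a'),
    ],
)


def responses_for_visit(conn, crawl_id, top_url):
    """Return (url, content_hash) rows for one crawler's visit to one site."""
    cursor = conn.execute(
        'SELECT url, content_hash FROM http_responses '
        'WHERE crawl_id = ? AND top_url = ? ORDER BY url',
        (crawl_id, top_url),
    )
    return cursor.fetchall()


rows = responses_for_visit(conn, 1, 'http://example.com')
```

Filtering on both values matters because several crawlers may visit the same
site, and one crawler visits many sites.
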
Disclaimer
----------

Note that OpenWPM is under active development, and should be considered
experimental software. The repository may contain experimental features that
aren't fully tested. We recommend using a [tagged
release](https://github.com/citp/OpenWPM/releases).

Although OpenWPM is actively used by our group for research studies and we
regularly make use of the data collected, it is still possible there are
unknown bugs in the infrastructure. We are in the process of writing
comprehensive tests to verify the integrity of all included instrumentation.
Prior to using OpenWPM for your own research, we encourage you to write tests
(and submit pull requests!) for any instrumentation that isn't currently
included in our test scripts.

Citation
--------

If you use OpenWPM in your research, please cite our current [Technical
Report](http://randomwalker.info/publications/OpenWPM_1_million_site_tracking_measurement.pdf) on the
infrastructure. You can use the following BibTeX.

    @unpublished{englehardt2015census,
        author = "Steven Englehardt and Arvind Narayanan",
        title = "{Online tracking: A 1-million-site measurement and analysis}",
        month = may,
        year = "2016",
        note = "[Technical Report]"
    }

License
-------

OpenWPM is licensed under GNU GPLv3. Additional code has been included from
[FourthParty](https://github.com/fourthparty/fourthparty) and
[Privacy Badger](https://github.com/EFForg/privacybadgerfirefox), both of which
are licensed under GPLv3+.