This repository contains or links to all assets relevant to the WWW'20 paper: "The representativeness of automated Web crawls as a surrogate for human browsing"


Sarah Bird 558e428168 Merge pull request #2 from mozilla/torrent Add link to bittorrent copy of crawl data		2020-06-07 17:14:11 -05:00
human-browsing-top-sites	Update and clean up READMEs	2020-05-04 15:31:39 -05:00
list-comparison	Add README for list comparison dir	2020-04-29 18:00:37 -05:00
lists	Add lists used for crawls	2020-04-21 12:22:30 -05:00
.gitignore	Add gitignore	2020-04-27 16:36:30 -05:00
3366423.3380104.pdf	Upload ACM formatted manuscript.	2020-04-28 08:30:55 -07:00
CODE_OF_CONDUCT.md	Add code of conduct	2020-04-21 15:51:59 -05:00
LICENSE	Initial commit	2020-01-25 16:38:59 +01:00
README.md	Add link to bittorrent copy of crawl data	2020-06-07 12:43:38 -04:00

README.md

The representativeness of automated Web crawls as a surrogate for human browsing: companion repository

This repository contains or links to all assets relevant to the WWW'20 paper: The representativeness of automated Web crawls as a surrogate for human browsing. All listed assets will be made publicly available pending the internal privacy/trust audit processes required prior to data release. For specific inquiries about data access or collaborations on privacy-enhancing technologies research, please reach out to the corresponding authors listed on the manuscript.

Lists used for crawls: under lists directory
Trexa repo: https://github.com/mozilla/trexa
Crawl preparation (pre crawl and depth crawl code): https://github.com/mozilla/crawl-prep
Crawl database: Google Doc
Crawl downloads: all the crawl data is stored in an S3 bucket. The total size of the data is 184.2GB, comprising:
- 18.4GB for the 44 time sequence crawls
- 36.4GB for the two large companion crawls of ~100k sites
- 129.4GB for the remaining 60 crawls
The compressed crawl data (64GB) is available via BitTorrent on AcademicTorrents.
Alternate orchestration repo: https://github.com/birdsarah/faust-selenium
List comparison analysis: under list-comparison directory
DP-protected top-level domain ranking for opt-in human users [August-2019]: under human-browsing-top-sites directory
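
For anyone planning to download the crawl data, the sizes listed above can be sanity-checked against local disk space before starting. The following is a minimal sketch, not part of the original repository: the function names are hypothetical, "GB" is assumed to mean 10^9 bytes (the README does not say), and it assumes the compressed archive and the extracted data are kept on the same disk at the same time.

```python
import shutil

# Sizes in GB, copied from the breakdown in this README.
CRAWL_SIZES_GB = {
    "44 time sequence crawls": 18.4,
    "two large companion crawls (~100k sites)": 36.4,
    "remaining 60 crawls": 129.4,
}
COMPRESSED_GB = 64.0  # compressed copy distributed via BitTorrent


def total_uncompressed_gb(sizes=CRAWL_SIZES_GB):
    """Sum the per-group sizes, rounded to one decimal place."""
    return round(sum(sizes.values()), 1)


def has_room(path=".", needed_gb=None):
    """Return True if the disk holding `path` has at least `needed_gb` free.

    By default this asks for room for the compressed archive plus the
    extracted data (an assumption about how the download is handled).
    """
    if needed_gb is None:
        needed_gb = COMPRESSED_GB + total_uncompressed_gb()
    free_gb = shutil.disk_usage(path).free / 1e9  # assumes 1 GB = 10^9 bytes
    return free_gb >= needed_gb


if __name__ == "__main__":
    print("Uncompressed total (GB):", total_uncompressed_gb())
    print("Enough free space here:", has_room("."))
```

The per-group sizes sum to the stated 184.2GB total, so the breakdown is internally consistent.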

If you find any of the resources contained in this repository valuable for your research, please cite the original manuscript for which this work was produced:

@inproceedings{10.1145/3366423.3380104,
author = {Zeber, David and Bird, Sarah and Oliveira, Camila and Rudametkin, Walter and Segall, Ilana and Wolls\'{e}n, Fredrik and Lopatka, Martin},
title = {The Representativeness of Automated Web Crawls as a Surrogate for Human Browsing},
year = {2020},
isbn = {9781450370233},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
url = {https://doi.org/10.1145/3366423.3380104},
doi = {10.1145/3366423.3380104},
booktitle = {Proceedings of The Web Conference 2020},
pages = {167–178},
numpages = {12},
keywords = {Web Crawling, Online Privacy, Tracking, Browser Fingerprinting, World Wide Web},
location = {Taipei, Taiwan},
series = {WWW ’20}
}
