OpenWPM/demo.py

78 lines
2.7 KiB
Python

from pathlib import Path
from custom_command import LinkCountingCommand
from openwpm.command_sequence import CommandSequence
from openwpm.commands.browser_commands import GetCommand
from openwpm.config import BrowserParams, ManagerParams
from openwpm.storage.sql_provider import SQLiteStorageProvider
from openwpm.task_manager import TaskManager
# The list of sites that we wish to crawl
NUM_BROWSERS = 1
sites = [
    "http://www.example.com",
    "http://www.princeton.edu",
    "http://citp.princeton.edu/",
]
# Loads the default ManagerParams
# and NUM_BROWSERS copies of the default BrowserParams
manager_params = ManagerParams(num_browsers=NUM_BROWSERS)
browser_params = [BrowserParams(display_mode="headless") for _ in range(NUM_BROWSERS)]
# Update browser configuration (use this for per-browser settings)
for browser_param in browser_params:
    # Record HTTP Requests and Responses
    browser_param.http_instrument = True
    # Record cookie changes
    browser_param.cookie_instrument = True
    # Record Navigations
    browser_param.navigation_instrument = True
    # Record JS Web API calls
    browser_param.js_instrument = True
    # Record the callstack of all WebRequests made
    browser_param.callstack_instrument = True
    # Record DNS resolution
    browser_param.dns_instrument = True
# Update TaskManager configuration (use this for crawl-wide settings)
manager_params.data_directory = Path("./datadir/")
manager_params.log_path = Path("./datadir/openwpm.log")
# memory_watchdog and process_watchdog are useful for large scale cloud crawls.
# Please refer to docs/Configuration.md#platform-configuration-options for more information
# manager_params.memory_watchdog = True
# manager_params.process_watchdog = True
# Commands time out by default after 60 seconds
with TaskManager(
    manager_params,
    browser_params,
    SQLiteStorageProvider(Path("./datadir/crawl-data.sqlite")),
    None,
) as manager:
    # Visits the sites
    for index, site in enumerate(sites):
        def callback(success: bool, val: str = site) -> None:
            print(
                f"CommandSequence for {val} ran {'successfully' if success else 'unsuccessfully'}"
            )
# Parallelize sites over the number of browsers set above.
command_sequence = CommandSequence(
    site,
    site_rank=index,
    callback=callback,
)
# Start by visiting the page
command_sequence.append_command(GetCommand(url=site, sleep=3), timeout=60)
# Have a look at custom_command.py to see how to implement your own command
command_sequence.append_command(LinkCountingCommand())
# Run commands across all browsers (simple parallelization)
manager.execute_command_sequence(command_sequence)