Available Commands
Steven Englehardt edited this page 2020-07-23 09:19:32 -07:00

There are a number of commands that can be sent to the TaskManager.

CommandSequence API

The TaskManager has a method manager.execute_command_sequence(), which accepts a CommandSequence object. The CommandSequence object allows you to concatenate multiple commands together to be executed by one browser in a single logical "site visit." For example:

cs = CommandSequence(url='http://www.example.com', reset=False) # Top level site to visit
cs.get() # Visit the page
cs.dump_flash_cookies() # Record flash cookies from site visit
manager.execute_command_sequence(cs)

The available commands of a CommandSequence are:

get

Using cs.get(sleep, timeout) will execute a visit to the CommandSequence's url. The cookies, flash_cookies, http_request, http_response, localStorage, and site_visits tables will be modified in the database.

browse

Using cs.browse(num_links, sleep, timeout) will load the CommandSequence's url in a tab as cs.get() would. Once the page has loaded, the browser will visit num_links links, returning to the homepage after each one. All selected links will be from the same hostname as the url. This may require a longer timeout, especially when used with bot detection mitigation.
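To illustrate the same-hostname rule, here is a minimal sketch using only the Python standard library. It is not OpenWPM's actual link-selection code: the helper name same_host_links and the way candidates are ordered are assumptions made for this example, and the platform may choose links differently.

```python
from urllib.parse import urljoin, urlparse


def same_host_links(page_url, hrefs, num_links):
    """Return up to num_links absolute URLs that share page_url's hostname.

    Hypothetical helper for illustration; not part of OpenWPM.
    """
    host = urlparse(page_url).hostname
    selected = []
    for href in hrefs:
        if len(selected) >= num_links:
            break
        absolute = urljoin(page_url, href)  # resolve relative links
        if urlparse(absolute).hostname == host and absolute not in selected:
            selected.append(absolute)
    return selected


links = same_host_links(
    "http://www.example.com/",
    ["/news/", "http://other.example.org/", "about.html", "/news/"],
    num_links=2,
)
print(links)
```

The off-site link and the duplicate are skipped, so only pages on www.example.com remain as candidates.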

save_screenshot

cs.save_screenshot(screenshot_name, timeout) will take a screenshot of the current page and save it to [data_directory]/screenshots/screenshot_name.png.

dump_page_source

cs.dump_page_source(dump_name, timeout) will dump the html of the current rendered page to a file in [data_directory]/sources/dump_name.html.
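The output locations described for save_screenshot and dump_page_source can be sketched as a small path helper. This is only an illustration of the directory layout stated above; the function name output_path and its kind argument are hypothetical and do not exist in the platform.

```python
import os


def output_path(data_directory, kind, name):
    """Build the file path for a screenshot or a page-source dump.

    Hypothetical helper mirroring the layout described in this page.
    """
    layout = {
        "screenshot": ("screenshots", ".png"),
        "source": ("sources", ".html"),
    }
    subdir, extension = layout[kind]
    return os.path.join(data_directory, subdir, name + extension)


screenshot_file = output_path("/tmp/openwpm", "screenshot", "homepage")
source_file = output_path("/tmp/openwpm", "source", "homepage")
print(screenshot_file)
print(source_file)
```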

dump_flash_cookies

Calling cs.dump_flash_cookies(timeout) will read from flash storage and save any changes to the state since the URL of the CommandSequence was first retrieved (via 'get' or 'browse'). NOTE: this command closes the tab, so all commands that rely on the page existing should be done before this.

dump_profile

Calling cs.dump_profile(dump_folder, close_webdriver=False, compress, timeout) saves the browser's local state as a gzipped tar into dump_folder. If close_webdriver is set to True, the manager will close the webdriver so all browser data syncs to disk before copying. You will likely want to use this command with close_webdriver=True when saving that state at the end of a crawl.
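For readers unfamiliar with the archive format, the following sketch shows how a directory can be packed into a gzipped tar with Python's tarfile module. It is an assumption-laden illustration, not the platform's implementation: archive_profile, the archive name profile.tar.gz, and the stand-in prefs.js file are all invented for the example.

```python
import os
import tarfile
import tempfile


def archive_profile(profile_dir, dump_folder, archive_name="profile.tar.gz"):
    """Pack profile_dir into a gzipped tar inside dump_folder; return its path.

    Hypothetical helper; the archive name is an assumption.
    """
    os.makedirs(dump_folder, exist_ok=True)
    archive_path = os.path.join(dump_folder, archive_name)
    with tarfile.open(archive_path, "w:gz") as tar:
        # Store everything under a top-level "profile" directory
        tar.add(profile_dir, arcname="profile")
    return archive_path


with tempfile.TemporaryDirectory() as tmp:
    # Build a stand-in browser profile containing one file
    profile_dir = os.path.join(tmp, "browser_profile")
    os.makedirs(profile_dir)
    with open(os.path.join(profile_dir, "prefs.js"), "w") as f:
        f.write("// stand-in profile file\n")

    archive = archive_profile(profile_dir, os.path.join(tmp, "dump"))
    with tarfile.open(archive, "r:gz") as tar:
        members = sorted(tar.getnames())

print(members)
```

Copying files while the browser is still writing to them can capture a partial state, which is why closing the webdriver first is recommended for an end-of-crawl dump.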

run_custom_function

Support for running custom functions was added in #104, and allows a user of the platform to define and run a custom command without having to go through the procedure of adding a new command into the platform. The CommandSequence::run_custom_function command is intended for one-off commands that are unlikely to be re-used between crawl scripts. In particular, it's useful for writing platform tests where the test script needs to take a specific action that wouldn't normally be taken during a crawl (and thus that we wouldn't want to have a command for in the platform).

To use run_custom_function you must first define a new function adhering to the following API:

def my_custom_function(arg1, arg2, ..., argn, **kwargs):
    driver = kwargs['driver']
    ...

where my_custom_function is the function handle, arg1 to argn are positional arguments you'd like to pass to the function, and **kwargs collects keyword arguments that the platform populates with internal state. When the platform calls my_custom_function, **kwargs will contain the following entries, which can be accessed and used within your custom function:

  • kwargs['command'] -- the command tuple submitted by the task manager to the browser manager
  • kwargs['driver'] -- the webdriver instance for that browser
  • kwargs['proxy_queue'] -- the queue used to send commands to the proxy
  • kwargs['extension_socket'] -- the socket information for sending commands to the extension
  • kwargs['browser_settings'] -- the (platform defined) browser settings dictionary
  • kwargs['browser_params'] -- the (user defined) browser configuration parameter dictionary
  • kwargs['manager_params'] -- the (user defined) manager configuration parameter dictionary
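The calling convention above can be tried out without a browser. In this sketch, record_page_title is a hypothetical custom function, and a SimpleNamespace object stands in for the webdriver; title and current_url are real Selenium WebDriver attributes, but everything else (the fake state dictionary, the label argument, the return value) is an assumption for illustration. The source does not say whether the platform uses a custom function's return value; it is returned here only so the stand-in call can be checked.

```python
from types import SimpleNamespace


def record_page_title(label, **kwargs):
    """Hypothetical custom function: describe the page the browser is on."""
    driver = kwargs["driver"]  # webdriver instance for this browser
    return "{}: {} ({})".format(label, driver.title, driver.current_url)


# Stand-in for the state the platform would supply; only a subset of the
# documented keys is included, and the driver is a fake object.
fake_state = {
    "driver": SimpleNamespace(
        title="Example Domain", current_url="http://example.com/"
    ),
    "browser_params": {},
    "manager_params": {},
}

# Positional arguments are unpacked first, then the platform state
args = ("visited",)
result = record_page_title(*args, **fake_state)
print(result)
```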

Second, you must call your custom function from within a command sequence, for example:

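from OpenWPM.automation import TaskManager, CommandSequence

# Custom function: positional arguments first, platform state arrives in **kwargs
def my_custom_function(arg1, arg2, ..., argn, **kwargs):
    driver = kwargs['driver']  # webdriver instance for this browser
    ...
    return

URL = 'http://example.com'
# manager_params and browser_params are the configuration dictionaries, defined beforehand
manager = TaskManager.TaskManager(manager_params, browser_params)
cs = CommandSequence.CommandSequence(URL)
cs.get(sleep=10, timeout=60)  # Visit the page before running the custom function
cs.run_custom_function(my_custom_function, (arg1, arg2, ..., argn))  # Function handle, then a tuple of positional arguments
manager.execute_command_sequence(cs)
manager.close()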

where the command sequence is passed the function handle and the positional arguments you want to pass to the command. You can find an example in the platform tests for the command.

Simple API (deprecated)

get

Using manager.get(URL) will load URL in a tab. The cookies, flash_cookies, http_request, http_response, localStorage, and site_visits tables will be modified in the database.

browse

Using manager.browse(URL, num_links) will load URL in a tab as manager.get() would. Once the page has loaded, the browser will visit num_links links, returning to the homepage after each one. All selected links will be from the same hostname as URL. This may require a longer timeout, especially when used with bot detection mitigation.

NOTE: Every record in the database will carry the visit_id of the top-level site_url in the site_visits table, even for resources requested during visits to sub-links on the page. For example, some HTTP requests may be labeled as coming from http://www.yahoo.com when they were actually made from http://www.yahoo.com/news/ after a homepage link was clicked.
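The attribution behaviour in this note can be reproduced with a toy database. The sketch below uses an in-memory SQLite database with a deliberately simplified schema: the http_requests table name and the url column are assumptions for illustration, and only visit_id and site_url are taken from the text above. The request URLs are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Simplified stand-in schema, not the platform's real one
    CREATE TABLE site_visits (visit_id INTEGER PRIMARY KEY, site_url TEXT);
    CREATE TABLE http_requests (
        id INTEGER PRIMARY KEY, visit_id INTEGER, url TEXT
    );
    """
)

# One top-level visit; both requests are recorded under its visit_id,
# including the one made after a homepage link was clicked.
conn.execute("INSERT INTO site_visits VALUES (1, 'http://www.yahoo.com')")
conn.executemany(
    "INSERT INTO http_requests (visit_id, url) VALUES (?, ?)",
    [
        (1, "http://www.yahoo.com/"),
        (1, "http://www.yahoo.com/news/story.js"),
    ],
)

rows = conn.execute(
    """
    SELECT r.url, v.site_url
    FROM http_requests AS r
    JOIN site_visits AS v ON r.visit_id = v.visit_id
    ORDER BY r.id
    """
).fetchall()

for request_url, site_url in rows:
    print(request_url, "attributed to", site_url)
conn.close()
```

Joining on visit_id therefore recovers the top-level site for a request, not the sub-page that actually issued it.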