There are a number of commands that can be sent to the TaskManager.
CommandSequence API
The TaskManager has a method manager.execute_command_sequence(), which accepts a CommandSequence object. The CommandSequence object allows you to concatenate multiple commands together to be executed by one browser in a single logical "site visit." For example:
cs = CommandSequence(url='http://www.example.com', reset=False) # Top level site to visit
cs.get() # Visit the page
cs.dump_flash_cookies() # Record flash cookies from site visit
manager.execute_command_sequence(cs)
The available commands of a CommandSequence are:
get
Using cs.get(sleep, timeout) will execute a visit to the CommandSequence's url. The cookies, flash_cookies, http_request, http_response, localStorage, and site_visits tables will be modified in the database.
browse
Using cs.browse(num_links, sleep, timeout) will load URL in a tab as cs.get() would. Once loaded, the browser will visit num_links links on the page, returning to the homepage after each link. All links selected will be from the same hostname as URL. This may require a longer timeout, especially when used with bot detection mitigation.
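The same-hostname rule can be illustrated with a short standalone sketch. This is not the platform's own link-selection code; the helper name same_host_links and the sample links are hypothetical, and the sketch assumes that "same hostname" means an exact match on the hostname of the resolved link.

```python
from urllib.parse import urljoin, urlparse


def same_host_links(page_url, hrefs):
    """Return absolute http(s) links from hrefs that share page_url's hostname."""
    page_host = urlparse(page_url).hostname
    selected = []
    for href in hrefs:
        absolute = urljoin(page_url, href)  # resolve relative links against the page
        parsed = urlparse(absolute)
        # Keep only web links whose hostname matches the page exactly.
        if parsed.scheme in ("http", "https") and parsed.hostname == page_host:
            selected.append(absolute)
    return selected


# Hypothetical links found on a page (illustrative data only).
links = same_host_links(
    "http://www.example.com/",
    [
        "/news/",
        "http://www.example.com/about",
        "http://other.example.org/",
        "mailto:someone@example.com",
    ],
)
print(links)
```

Under this assumption a subdomain such as news.example.com would not count as the same hostname as www.example.com.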
save_screenshot
cs.save_screenshot(screenshot_name, timeout) will take a screenshot of the current page and save it to [data_directory]/screenshots/screenshot_name.png.
dump_page_source
cs.dump_page_source(dump_name, timeout) will dump the HTML of the currently rendered page to the file [data_directory]/sources/dump_name.html.
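To make the output locations of save_screenshot and dump_page_source concrete, here is a small sketch of how the two paths are laid out. The helper output_path and the data directory /tmp/openwpm_data are hypothetical; only the screenshots and sources subfolders and the .png and .html extensions come from the descriptions above.

```python
import os


def output_path(data_directory, subfolder, name, extension):
    """Build [data_directory]/<subfolder>/<name><extension>."""
    return os.path.join(data_directory, subfolder, name + extension)


# "/tmp/openwpm_data" stands in for your configured data directory.
screenshot_path = output_path("/tmp/openwpm_data", "screenshots", "homepage", ".png")
source_path = output_path("/tmp/openwpm_data", "sources", "homepage", ".html")
print(screenshot_path)
print(source_path)
```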
dump_flash_cookies
Calling cs.dump_flash_cookies(timeout) will read from Flash storage and save any changes to the state since the URL of the CommandSequence was first retrieved (via get or browse). NOTE: this command closes the tab, so all commands that rely on the page existing should be run before it.
dump_profile
Calling cs.dump_profile(dump_folder, close_webdriver=False, compress, timeout) saves the browser's local state as a gzipped tar into dump_folder. If close_webdriver is set to True, the manager will close the webdriver so all browser data syncs to disk before copying. You will likely want to use this command with close_webdriver=True when saving state at the end of a crawl.
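For readers unfamiliar with the archive format, the following sketch shows what "a gzipped tar of a profile directory" looks like using only the Python standard library. It is not the platform's implementation: the function archive_profile, the archive name profile.tar.gz, and the stand-in prefs.js file are all assumptions made for illustration.

```python
import os
import tarfile
import tempfile


def archive_profile(profile_dir, dump_folder, archive_name="profile.tar.gz"):
    """Write profile_dir into dump_folder as a gzip-compressed tar archive."""
    os.makedirs(dump_folder, exist_ok=True)
    archive_path = os.path.join(dump_folder, archive_name)
    with tarfile.open(archive_path, "w:gz") as tar:
        # Store the directory under a fixed top-level name inside the archive.
        tar.add(profile_dir, arcname="profile")
    return archive_path


with tempfile.TemporaryDirectory() as tmp:
    # Create a stand-in profile directory containing a single file.
    profile_dir = os.path.join(tmp, "browser_profile")
    os.makedirs(profile_dir)
    with open(os.path.join(profile_dir, "prefs.js"), "w") as f:
        f.write("// stand-in profile file\n")

    archive = archive_profile(profile_dir, os.path.join(tmp, "dump"))

    # Read the archive back to list what was stored.
    with tarfile.open(archive, "r:gz") as tar:
        members = sorted(tar.getnames())

print(members)
```

Files that the browser has not yet flushed to disk would be missing from such a copy, which is the reason for closing the webdriver first when the saved state matters.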
run_custom_function
Support for running custom functions was added in #104, and allows a user of the platform to define and run a custom command without having to go through the procedure of adding a new command into the platform. The CommandSequence::run_custom_function command is intended for one-off commands that are unlikely to be re-used between crawl scripts. In particular, it's useful for writing platform tests where the test script needs to take a specific action that wouldn't normally be taken during a crawl (and thus that we wouldn't want to have a command for in the platform).
To use run_custom_function you must first define a new function adhering to the following API:
def my_custom_function(arg1, arg2, ..., argn, **kwargs):
    driver = kwargs['driver']
    ...
where my_custom_function is the function handle, arg1 to argn are positional arguments you'd like to pass to the function, and **kwargs is a keyword argument dictionary that will be populated by the platform with internal state. When my_custom_function is called by the platform, **kwargs will be populated with the following arguments, which can be accessed and used within your custom function:
kwargs['command'] -- the command tuple submitted by the task manager to the browser manager
kwargs['driver'] -- the webdriver instance for that browser
kwargs['proxy_queue'] -- the queue used to send commands to the proxy
kwargs['extension_socket'] -- the socket information for sending commands to the extension
kwargs['browser_settings'] -- the (platform defined) browser settings dictionary
kwargs['browser_params'] -- the (user defined) browser configuration parameter dictionary
kwargs['manager_params'] -- the (user defined) manager configuration parameter dictionary
Second, you must call your custom function from within a command sequence, for example:
from OpenWPM.automation import TaskManager, CommandSequence

def my_custom_function(arg1, arg2, ..., argn, **kwargs):
    driver = kwargs['driver']  # webdriver instance supplied by the platform
    ...
    return

URL = 'http://example.com'
# manager_params and browser_params are your (user defined) configuration dictionaries
manager = TaskManager.TaskManager(manager_params, browser_params)
cs = CommandSequence.CommandSequence(URL)
cs.get(sleep=10, timeout=60)  # visit the page before running the custom function
cs.run_custom_function(my_custom_function, (arg1, arg2, ..., argn))
manager.execute_command_sequence(cs)
manager.close()
where the command sequence is passed the function handle and the positional arguments you want to pass to the command. You can find an example in the platform tests for the command.
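As a concrete instance of the template above, here is a minimal sketch of a custom function that appends the current page's URL and title to a file. The function name save_page_title, the output file name, and the FakeDriver stand-in are hypothetical. The stand-in only mimics the current_url and title attributes of a Selenium webdriver so that the function can be exercised without launching a browser.

```python
import os
import tempfile


def save_page_title(output_file, **kwargs):
    """Append the current page's URL and title to output_file as one TSV line."""
    driver = kwargs['driver']  # webdriver instance supplied by the platform
    with open(output_file, 'a') as f:
        f.write('%s\t%s\n' % (driver.current_url, driver.title))


class FakeDriver(object):
    """Stand-in exposing the two webdriver attributes the function reads."""
    current_url = 'http://example.com/'
    title = 'Example Domain'


with tempfile.TemporaryDirectory() as tmp:
    output_file = os.path.join(tmp, 'titles.tsv')
    # Mimic the call shape: positional arguments first, then keyword state.
    save_page_title(output_file, driver=FakeDriver())
    with open(output_file) as f:
        written = f.read()

print(repr(written))
```

In a crawl script, following the pattern shown earlier, the function would be queued with cs.run_custom_function(save_page_title, (output_file,)). Writing to a file rather than to an in-memory object is a deliberate choice in this sketch: it does not depend on the function sharing memory with the crawl script.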
Simple API (deprecated)
get
Using manager.get(URL) will load URL in a tab. The cookies, flash_cookies, http_request, http_response, localStorage, and site_visits tables will be modified in the database.
browse
Using manager.browse(URL, num_links) will load URL in a tab as manager.get() would. Once loaded, the browser will visit num_links links on the page, returning to the homepage after each link. All links selected will be from the same hostname as URL. This may require a longer timeout, especially when used with bot detection mitigation.
NOTE: All records in the database will have a visit_id that corresponds to the top-level site_url in the site_visits table, even for resources requested during visits to sub-links on the page. For example, some HTTP requests may be labeled as coming from http://www.yahoo.com when they are actually from http://www.yahoo.com/news/ after a homepage link was clicked.
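The effect described in this note can be reproduced with a toy in-memory SQLite database. The schema below is a deliberately simplified assumption and not the platform's real schema: only the table names site_visits and http_request and the column names visit_id and site_url appear in the text above, while the id and url columns and the sample rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Simplified, hypothetical schema for illustration only.
    CREATE TABLE site_visits (visit_id INTEGER PRIMARY KEY, site_url TEXT);
    CREATE TABLE http_request (id INTEGER PRIMARY KEY, visit_id INTEGER, url TEXT);

    -- One top-level visit, with requests made on the homepage and on a sub-link.
    INSERT INTO site_visits VALUES (1, 'http://www.yahoo.com');
    INSERT INTO http_request (visit_id, url) VALUES
        (1, 'http://www.yahoo.com/'),
        (1, 'http://www.yahoo.com/news/');
""")

# Joining on visit_id attributes every request to the top-level site_url.
rows = conn.execute("""
    SELECT r.url, v.site_url
    FROM http_request AS r
    JOIN site_visits AS v ON r.visit_id = v.visit_id
    ORDER BY r.id
""").fetchall()
conn.close()

for url, site_url in rows:
    print(url, "->", site_url)
```

Both rows map to http://www.yahoo.com, including the request issued after navigating to /news/, so the visit_id alone cannot tell you which sub-page triggered a request.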