# Crawler command line

GHCrawler is a utility for walking GitHub APIs and tracking GitHub events. This command line app allows you to control various aspects of a crawler's behavior. There is some overlap in function with the Crawler Dashboard.

This project also has a simple Node client library for talking to a crawler.

## Controlling a crawler

The `cc` utility is in the `bin` directory of this repo. It can be run interactively or as a single command processor (examples of both are shown below). The general format of the command line is

```
node cc [options] [command]
```

where the available options are:

* `-i` -- Run in interactive mode.

* `-s <url>` -- Control the crawler service running at the given URL. Defaults to `http://localhost:3000`. You can also set the `CRAWLER_SERVICE_URL` environment variable.

* `-t <token>` -- The crawler service API token to use. This can also be supplied via the `CRAWLER_SERVICE_AUTH_TOKEN` environment variable. If not defined in either place, the default value `secret` is used.

and the available commands are:

* `start [count]` -- Start the crawler processing with `count` concurrent operations. If `count` is not specified, 1 is the default. On a reasonably fast network a count of 10 to 15 should be sufficient; this also depends on how many tokens you are using.

* `stop` -- Stop crawler processing. The crawler service is left running but it stops pulling requests off the queue.

* `queue <requests...>` -- Queue the given requests for processing. The `requests` parameter is a list of GitHub "org" and/or "org/repo" names.

* `orgs <orgs...>` -- Set the crawler to traverse only the GitHub orgs named in the given list.

* `config` -- Dump the crawler service's configuration to the console.

* `tokens <spec...>` -- Set the GitHub tokens to be used by the crawler when calling GitHub APIs. The `spec` value is a list of token specs. Each spec has the form `<token>#<trait>,<trait>...` where the token is a GitHub OAuth or Personal Access Token and the comma-separated list of traits identifies what permissions the token has. The available traits are: `public`, `admin`, `private`. You can list as many tokens and traits as you like. Note that you can also configure the GitHub tokens using the `CRAWLER_GITHUB_TOKENS` environment variable instead before starting the crawler. For example, `export CRAWLER_GITHUB_TOKENS="<token1>#public <token2>#admin"`.
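
For example, options and a single command can be combined to run `cc` non-interactively. This is a minimal sketch: it assumes a crawler service is already running, and the alternate service URL is a placeholder.

```
# Dump the configuration of the service at the default URL
node bin/cc config

# Queue an org against a service running elsewhere, passing an explicit API token
node bin/cc -s http://crawler.example.com:3000 -t secret queue contoso-d
```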

A typical sequence, shown in the snippet below, configures the crawler with a set of tokens, sets the org filter, and then queues and starts processing of the org.

```
> node bin/cc
http://localhost:3000> tokens 43984b2344ca575d0f0e097efd97#public 972bbdfe098098fa9ce082309#admin
http://localhost:3000> orgs contoso-d
http://localhost:3000> queue contoso-d
http://localhost:3000> start 5
http://localhost:3000> exit
>
```
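
If the service URL and API token rarely change, they can be set once in the environment instead of being passed with `-s` and `-t` on every invocation. A minimal sketch, using the documented defaults as placeholder values:

```
# Point cc at the crawler service and supply the API token via the environment
export CRAWLER_SERVICE_URL="http://localhost:3000"
export CRAWLER_SERVICE_AUTH_TOKEN="secret"

# Then start an interactive session
node bin/cc -i
```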

## API

API documentation is coming.

## Contributing

This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact opencode@microsoft.com with any additional questions or comments.