Crawl GitHub APIs and store the discovered orgs, repos, commits, ...

crawler data github github-api github-webhooks ospo

Jeff McAffer 5c3fdcfcb5 add stargazers, reviews and tweak collaborators		2017-01-24 19:34:51 -08:00
.vscode	robustness improvements	2016-12-03 16:35:59 -08:00
lib	add stargazers, reviews and tweak collaborators	2017-01-24 19:34:51 -08:00
test	revamp traversal so to use maps	2017-01-20 17:49:25 -08:00
.eslintrc.json	massive rework of traversal	2016-11-30 17:54:48 -08:00
.gitignore	Add istanbul code coverage	2016-11-13 15:34:50 -08:00
LICENSE	open source cleanup: copyrights, stale files, ...	2016-12-29 14:58:25 -08:00
README.md	readme update	2016-12-30 11:43:41 -08:00
index.js	open source cleanup: copyrights, stale files, ...	2016-12-29 14:58:25 -08:00
package.json	0.1.19	2017-01-13 10:36:30 -08:00

README.md

GHCrawler

A robust GitHub API crawler that walks a queue of GitHub entities transitively retrieving and storing their contents. GHCrawler is great for:

Retrieving all GitHub entities related to an org, repo, or user
Efficiently storing the retrieved entities
Keeping the stored data up to date when used in conjunction with a GitHub event tracker

GHCrawler focuses on successively retrieving and walking GitHub resources supplied on one or more queues. Each resource is fetched, processed, searched for more resources to fetch, and ultimately stored. Discovered resources are themselves queued for further processing. The crawler is careful not to fetch the same resource repeatedly. It makes heavy use of etags and includes GitHub token pooling and rotation to optimize use of your API tokens.
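The fetch, process, queue, and store cycle described above can be illustrated with a minimal sketch. This is not GHCrawler's implementation: the crawl function, the fakeApi object, and the links field are hypothetical names invented for this example, and the fetch is synchronous and in-memory where a real crawler would make asynchronous HTTP calls.

```javascript
// Hypothetical in-memory stand-in for the GitHub API: URL -> document.
// The "links" field is an assumption used to model resources that a
// processor would discover inside a fetched document.
const fakeApi = {
  '/orgs/contoso': { type: 'org', links: ['/repos/contoso/app', '/repos/contoso/lib'] },
  '/repos/contoso/app': { type: 'repo', links: ['/orgs/contoso', '/repos/contoso/app/commits'] },
  '/repos/contoso/lib': { type: 'repo', links: ['/orgs/contoso'] },
  '/repos/contoso/app/commits': { type: 'commits', links: [] }
};

// Walk the queue until it is empty, storing each document once.
function crawl(seedUrls, fetch) {
  const queue = [...seedUrls];      // resources waiting to be fetched
  const seen = new Set(seedUrls);   // guards against fetching a URL twice
  const store = new Map();          // URL -> stored document

  while (queue.length > 0) {
    const url = queue.shift();
    const document = fetch(url);
    if (!document) continue;        // nothing at this URL, skip it

    // Queue every newly discovered resource for later processing.
    for (const link of document.links) {
      if (!seen.has(link)) {
        seen.add(link);
        queue.push(link);
      }
    }
    store.set(url, document);
  }
  return store;
}

const stored = crawl(['/orgs/contoso'], url => fakeApi[url]);
console.log([...stored.keys()]);
```

Starting from the single org URL, the sketch stores all four documents, and the back-links to /orgs/contoso do not trigger a second fetch because the URL is already in the seen set.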

Usage

The crawler is not runnable on its own. It needs to be configured with:

Queuing infrastructure that can take and supply requests to process the response from an API URL.
A fetcher that queries APIs with the URL in a given request.
One or more processors that handle requests and the fetched API document.
A store used to store the processed documents.
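To make the four pieces above concrete, here is a rough sketch of how they might fit together. Every name in it (MemoryQueue, MapFetcher, repoProcessor, commitsProcessor, MemoryStore, runCrawler, and the commitsUrl field) is hypothetical and chosen for illustration; none of it is GHCrawler's actual API, and the in-memory classes stand in for real queuing, HTTP, and storage services.

```javascript
// All names below are hypothetical; they are not GHCrawler's actual API.

// Queue: takes requests in and supplies them back for processing.
class MemoryQueue {
  constructor() { this.items = []; }
  push(request) { this.items.push(request); }
  pop() { return this.items.shift(); } // undefined when the queue is empty
}

// Fetcher: resolves the URL in a request to an API document.
// Backed by a plain object here instead of HTTP calls.
class MapFetcher {
  constructor(documents) { this.documents = documents; }
  fetch(request) { return this.documents[request.url]; }
}

// Processors: each handles one request type, may queue follow-up
// requests, and returns the document to be stored.
const repoProcessor = {
  type: 'repo',
  process(request, document, queue) {
    queue.push({ type: 'commits', url: document.commitsUrl });
    return { id: document.id, name: document.name };
  }
};
const commitsProcessor = {
  type: 'commits',
  process(request, document) { return { count: document.length }; }
};

// Store: keeps the processed documents, keyed by URL.
class MemoryStore {
  constructor() { this.saved = new Map(); }
  upsert(url, document) { this.saved.set(url, document); }
}

// Wire the four pieces together and drain the queue.
function runCrawler({ queue, fetcher, processors, store }) {
  const byType = new Map(processors.map(p => [p.type, p]));
  let request;
  while ((request = queue.pop()) !== undefined) {
    const processor = byType.get(request.type);
    const document = fetcher.fetch(request);
    if (!processor || document === undefined) continue; // nothing to do
    store.upsert(request.url, processor.process(request, document, queue));
  }
}

const queue = new MemoryQueue();
const store = new MemoryStore();
const fetcher = new MapFetcher({
  '/repos/contoso/app': { id: 1, name: 'app', commitsUrl: '/repos/contoso/app/commits' },
  '/repos/contoso/app/commits': [{ sha: 'abc' }, { sha: 'def' }]
});

queue.push({ type: 'repo', url: '/repos/contoso/app' });
runCrawler({ queue, fetcher, processors: [repoProcessor, commitsProcessor], store });
console.log(store.saved.get('/repos/contoso/app/commits')); // { count: 2 }
```

Keeping the queue, fetcher, processors, and store behind small interfaces like these is one way to swap in different queuing and storage technologies without touching the crawl loop.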

The best way to get running with the crawler is to look at the OSPO-ghcrawler repo. It has integrations for several queuing and storage technologies as well as examples of how to configure and run a crawler.

Contributing

The project team is more than happy to take contributions and suggestions.

To start working, run npm install in the repository folder to install the required dependencies.

This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact opencode@microsoft.com with any additional questions or comments.