GHCrawler

A robust GitHub API crawler that walks a queue of GitHub entities, transitively retrieving and storing their contents. GHCrawler is great for:

  • Retrieving all GitHub entities related to an org, repo, or user
  • Efficiently storing the retrieved entities
  • Keeping the stored data up to date when used in conjunction with a GitHub event tracker (see the sketch below)
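
The last point merits a quick illustration. A separate event tracker that receives GitHub webhook deliveries can translate each event into a request on the crawler's queue, so only the affected entities are refetched. The sketch below assumes a simple queue of { type, url } requests; that shape, and the function itself, are illustrative assumptions, not part of GHCrawler:

    // Illustrative sketch: translating GitHub webhook events into crawl
    // requests so stored data stays current. The queue interface is an
    // assumed shape, not GHCrawler's actual API.
    function onWebhookEvent(eventType, payload, queue) {
      const repoApi = `https://api.github.com/repos/${payload.repository.full_name}`;
      switch (eventType) {
        case 'push':
          // New commits landed; refetch the repo's commit list.
          queue.push({ type: 'commits', url: `${repoApi}/commits` });
          break;
        case 'issues':
          // An issue changed; refetch just that issue document.
          queue.push({ type: 'issue', url: payload.issue.url });
          break;
        default:
          // For anything else, refetch the repository document itself.
          queue.push({ type: 'repo', url: repoApi });
      }
    }

    module.exports = onWebhookEvent;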

GHCrawler focuses on successively retrieving and walking GitHub resources supplied on a queue. Each resource is fetched, analyzed, stored, and plumbed for more resources to fetch. Discovered resources are themselves queued for further processing. The crawler is careful not to fetch the same resource repeatedly.
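
The cycle looks roughly like the sketch below. Every name in it (CrawlLoop, fetcher.fetch, store.upsert, and so on) is an illustrative assumption rather than GHCrawler's real interface; it only shows the shape of the fetch/analyze/queue loop just described:

    // Minimal sketch of the fetch/analyze/queue cycle described above.
    // All names here are illustrative assumptions, not GHCrawler's API.
    class CrawlLoop {
      constructor(queue, fetcher, store) {
        this.queue = queue;       // pending requests, e.g. { url: 'https://api.github.com/orgs/foo' }
        this.fetcher = fetcher;   // performs the GitHub API call, resolves to the JSON document
        this.store = store;       // persists documents keyed by URL
        this.visited = new Set(); // guards against fetching the same resource repeatedly
      }

      processNext() {
        const request = this.queue.shift();
        if (!request || this.visited.has(request.url)) {
          return Promise.resolve(null);
        }
        this.visited.add(request.url);
        return this.fetcher.fetch(request.url).then(document => {
          this.store.upsert(request.url, document);
          // "Plumb" the document: queue every linked resource for later processing.
          this.discover(document).forEach(url => this.queue.push({ url: url }));
          return document;
        });
      }

      discover(document) {
        // GitHub documents carry hypermedia links such as repos_url and
        // commits_url; collect any field that looks like an API URL. (A real
        // crawler would also strip template placeholders like {/sha}.)
        return Object.keys(document)
          .filter(key => key.endsWith('_url') && typeof document[key] === 'string')
          .map(key => document[key]);
      }
    }

    module.exports = CrawlLoop;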

Examples

Coming...

Contributing

The project team is more than happy to take contributions and suggestions.

To start working, run npm install in the repository folder to install the required dependencies.

This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact opencode@microsoft.com with any additional questions or comments.