diff --git a/docs/infrastructure/administration.md b/docs/infrastructure/administration.md index b6d282831..3cfca7bfe 100644 --- a/docs/infrastructure/administration.md +++ b/docs/infrastructure/administration.md @@ -1,6 +1,8 @@ # Infrastructure administration -Treeherder has four apps, and those deployments and their databases are managed by cloudOps: +Treeherder has four apps, and those deployments and their databases are managed by cloudOps. All deployments, with the exception of treeherder-taskcluster-staging, ingest data from the same +Pulse Guardian queues, and the environments are mostly the same with a few key differences: production contains user-generated data from code and performance sheriffs, its database is much larger +than the other deployments', and its ingestion is slower. - [treeherder-prod](https://treeherder.mozilla.org) - [treeherder-stage](https://treeherder.allizom.org) @@ -21,13 +23,22 @@ Production deploys are a manual process that is performed by a Treeherder admin !!! note To access treeherder-prod in [Jenkins](https://ops-master.jenkinsv2.prod.mozaws.net/job/gcp-pipelines/job/treeherder/job/treeherder-production/), cloudOps need to grant you access and you'll need to follow [these steps](https://github.com/mozilla-services/cloudops-deployment/#accessing-jenkins) to set it up before the url will work. +### Using Prototype + +The `prototype` branch is useful and recommended for testing changes that might impact users - such as schema changes, major rewrites or very large PRs - and for modifications to cron jobs or to the data ingestion pipeline. However, note that any schema changes will need to be reset after testing, which might involve having cloudOps manually delete tables or columns, and potentially modify the django_migrations table so that it matches the current state of the Django migration files (in fact, this applies to all deployments).
+ +Access to push to the prototype branch requires special permission; an admin can grant access in the repository's branch settings. + +!!! note + Failed tests that have run on CI will *not* block deployment to the prototype instance. + ### Reverting deployments If a production promotion needs to be reverted, cloudOps can do it (ping whomever is listed as main contact in #treeherder-ops slack channel) or a Treeherder admin can do it in the Jenkins console. Click on the link of a previous commit (far left column) that was deployed to stage and select "rebuild" button on the left side nav. -### Adding or changing scheduled tasks and environment variables +### Managing scheduled tasks, Celery queues and environment variables -Changing either of these involves a kubernetes change; you can either open a pull request with the change to the [cloudops-infra repo](https://github.com/mozilla-services/cloudops-infra) if you have access or file a [bugzilla bug](https://bugzilla.mozilla.org/enter_bug.cgi?product=Cloud%20Services&component=Operations%3A%20Releng) and someone from cloudOps will do it for you. +Changing any of these involves a Kubernetes change; you can either open a pull request with the change to the [cloudops-infra repo](https://github.com/mozilla-services/cloudops-infra) if you have access, or file a [bugzilla bug](https://bugzilla.mozilla.org/enter_bug.cgi?product=Cloud%20Services&component=Operations%3A%20Releng) and someone from cloudOps will do it for you.
## Database Management - cloudOps diff --git a/docs/infrastructure/data_ingestion.md b/docs/infrastructure/data_ingestion.md new file mode 100644 index 000000000..49c5bb422 --- /dev/null +++ b/docs/infrastructure/data_ingestion.md @@ -0,0 +1,19 @@ +# Data Ingestion + +## Ingestion Pipeline + +Treeherder uses the [Celery](https://docs.celeryproject.org/en/stable/index.html) task queue software, with the RabbitMQ broker, to process Taskcluster data that is submitted to the [Pulse Guardian](https://pulseguardian.mozilla.org/) queues. It subscribes only to specific exchanges and only processes pushes and tasks for repositories that are defined in the repository.json [fixture](https://github.com/mozilla/treeherder/blob/master/treeherder/model/fixtures/repository.json). + +All of the code that listens for tasks and pushes, stores them, and kicks off log parsing can be found in the `treeherder/etl` directory. Specific Celery settings, such as pre-defined queues, are defined in [settings.py](https://github.com/mozilla/treeherder/blob/master/treeherder/config/settings.py#L301). + +Treeherder executes the `pulse_listener_pushes` and `pulse_listener_tasks` Django commands in [entrypoint_prod](https://github.com/mozilla/treeherder/blob/master/docker/entrypoint_prod.sh#L27-L30), which listen to both the main firefox-ci cluster and the community clusters. They add tasks to the `store_pulse_pushes` and `store_pulse_jobs` queues for `worker_store_pulse_data` to process. The user credentials (treeherder-prod, treeherder-staging and treeherder-prototype) are stored in the `PULSE_URL` environment variables; `PULSE_SOURCE_PUSHES` and `PULSE_SOURCE_TASKS` contain the URLs and credentials to access both clusters. + +Once tasks are processed, log parsing is scheduled; depending on the status of the task and the type of repository, it will be sent to different types of [log parsing queues](https://github.com/mozilla/treeherder/blob/master/treeherder/etl/jobs.py#L345-L360).
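As a rough illustration of that routing, the sketch below shows how a queue might be selected from a task's result and its repository's tier. This is a hypothetical example, not Treeherder's actual code: the queue names and the tier cutoff are assumptions, so consult the linked `jobs.py` lines for the real logic.

```python
# Hypothetical sketch of log-parsing queue selection (illustrative names,
# not Treeherder's real queue names or tier cutoff).

def select_log_parsing_queue(result: str, tier: int) -> str:
    """Pick a queue name for parsing a completed task's log."""
    if result in ("success", "retry"):
        # Passing tasks go to the regular log parser.
        return "log_parser"
    # Failed tasks are split so that sheriffed (lower-tier) repositories
    # get their failure lines parsed ahead of unsheriffed ones.
    if tier < 3:
        return "log_parser_fail_sheriffed"
    return "log_parser_fail_unsheriffed"


print(select_log_parsing_queue("success", 1))     # log_parser
print(select_log_parsing_queue("testfailed", 3))  # log_parser_fail_unsheriffed
```

The point of the split is prioritization: failure lines feed sheriffs' classification work, so failed-task logs from sheriffed repositories are kept out of the slower general-purpose queues.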
+ +The live backing log is parsed for two main reasons: to extract and store performance data for tests that emit PERFORMANCE_DATA objects in their logs, and to extract and store failure lines for failed tasks. These failure lines are displayed in the job details panel in the Treeherder jobs view, and are used by code sheriffs to classify intermittent failures against Bugzilla bugs. + +All exchange bindings that a Pulse user account is subscribed to can be viewed in the Pulse Guardian admin account (under the queues tab); the RabbitMQ dashboard will show which queues are registered and receiving messages. Troubleshooting steps for various data ingestion problems can be found [here](./troubleshooting.md#scenarios). + +## Adding New Queues or Workers + +Ensure that the docker-compose.yml, entrypoint_prod and settings.py files are updated. You'll also need to ensure that a new worker is added to the cloudOps repo. See [Managing scheduled tasks, Celery queues and environment variables](./administration.md#managing-scheduled-tasks-celery-queues-and-environment-variables). diff --git a/mkdocs.yml b/mkdocs.yml index 1219a4c1a..746409074 100644 --- a/mkdocs.yml +++ b/mkdocs.yml @@ -42,6 +42,7 @@ nav: - Backend tasks: 'backend_tasks.md' - Infrastructure: - Administration: 'infrastructure/administration.md' + - Data Ingestion: 'infrastructure/data_ingestion.md' - Troubleshooting: 'infrastructure/troubleshooting.md' - Data policies: - Accessing data: 'accessing_data.md'
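For the settings.py step in the "Adding New Queues or Workers" section above, pre-defined Celery queues are declared with Kombu `Queue` objects. The fragment below is a minimal, hypothetical sketch of that pattern - the queue and task names are illustrative assumptions, so mirror the real declarations in the linked settings.py rather than copying this verbatim.

```python
# Hypothetical settings.py fragment: declaring a new pre-defined Celery
# queue and routing a task to it. Names are illustrative, not Treeherder's.
from kombu import Queue

CELERY_TASK_QUEUES = [
    # ...existing queues elided; add the new queue alongside them...
    Queue("my_new_queue", routing_key="my_new_queue"),
]

CELERY_TASK_ROUTES = {
    # route the (hypothetical) task to the new queue
    "treeherder.workers.tasks.my_new_task": {"queue": "my_new_queue"},
}
```

A worker process then needs to consume from the new queue (the `-Q my_new_queue` celery worker flag), which is why docker-compose.yml, entrypoint_prod and the cloudOps worker definitions all have to be updated together.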