firefox-translations-training/.taskcluster.yml

# yamllint disable rule:line-length
# This file is rendered via JSON-e by
# - github events - https://github.com/taskcluster/taskcluster/tree/main/services/github
# - cron tasks - https://hg.mozilla.org/ci/ci-admin/file/default/build-decision/
# - action tasks - taskcluster/taskgraph/actions/registry.py
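# Each rendering context supplies `tasks_for` ("github-push", a string
# prefixed with "github-pull-request", "cron", or "action") plus an event
# payload; every $switch/$if/$let below is evaluated against that context.
# Illustrative github-push context (not exhaustive):
#   {"tasks_for": "github-push", "event": {...}, "as_slugid": <function>}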
---
version: 1
reporting: checks-v1
policy:
  pullRequests: public
tasks:
  - $let:
      trustDomain: "translations"
      ownerEmail:
        $switch:
          'tasks_for == "github-push"': '${event.pusher.email}'
          'tasks_for[:19] == "github-pull-request"': '${event.pull_request.user.login}@users.noreply.github.com'
          'tasks_for in ["cron", "action"]': '${tasks_for}@noreply.mozilla.org'
      baseRepoUrl:
        $switch:
          'tasks_for == "github-push"': '${event.repository.html_url}'
          'tasks_for[:19] == "github-pull-request"': '${event.pull_request.base.repo.html_url}'
          'tasks_for in ["cron", "action"]': '${repository.url}'
      repoUrl:
        $switch:
          'tasks_for == "github-push"': '${event.repository.html_url}'
          'tasks_for[:19] == "github-pull-request"': '${event.pull_request.head.repo.html_url}'
          'tasks_for in ["cron", "action"]': '${repository.url}'
      project:
        $switch:
          'tasks_for == "github-push"': '${event.repository.name}'
          'tasks_for[:19] == "github-pull-request"': '${event.pull_request.head.repo.name}'
          'tasks_for in ["cron", "action"]': '${repository.project}'
      head_branch:
        $switch:
          'tasks_for[:19] == "github-pull-request"': ${event.pull_request.head.ref}
          'tasks_for == "github-push"': ${event.ref}
          'tasks_for == "github-release"': '${event.release.target_commitish}'
          'tasks_for in ["action", "cron"]': '${push.branch}'
      base_ref:
        $switch:
          'tasks_for[:19] == "github-pull-request"': ${event.pull_request.base.ref}
          'tasks_for == "github-push" && event.base_ref': ${event.base_ref}
          'tasks_for == "github-push"': ${event.ref}
          'tasks_for in ["cron", "action"]': '${push.branch}'
      head_ref:
        $switch:
          'tasks_for[:19] == "github-pull-request"': ${event.pull_request.head.ref}
          'tasks_for == "github-push"': ${event.ref}
          'tasks_for in ["cron", "action"]': '${push.branch}'
      base_sha:
        $switch:
          'tasks_for == "github-push"': '${event.before}'
          'tasks_for[:19] == "github-pull-request"': '${event.pull_request.base.sha}'
          'tasks_for in ["cron", "action"]': '${push.revision}'
      head_sha:
        $switch:
          'tasks_for == "github-push"': '${event.after}'
          'tasks_for[:19] == "github-pull-request"': '${event.pull_request.head.sha}'
          'tasks_for in ["cron", "action"]': '${push.revision}'
      ownTaskId:
        $switch:
          '"github" in tasks_for': {$eval: as_slugid("decision_task")}
          'tasks_for in ["cron", "action"]': '${ownTaskId}'
      pullRequestAction:
        $switch:
          'tasks_for[:19] == "github-pull-request"': ${event.action}
          $default: 'UNDEFINED'
      isPullRequest:
        $eval: 'tasks_for[:19] == "github-pull-request"'
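    # The bindings above normalize each event-specific payload into a common
    # vocabulary (repo URLs, refs, shas, task ids) used throughout the task
    # definition below.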
    in:
      $if: >
        tasks_for in ["action", "cron", "github-push"]
        || (isPullRequest && pullRequestAction in ["opened", "reopened", "synchronize"])
      then:
        $let:
          level: 1
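          # `level` feeds the schedulerId and provisionerId strings below.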
        in:
          taskId: {$if: 'tasks_for != "action"', then: '${ownTaskId}'}
          taskGroupId:
            $if: 'tasks_for == "action"'
            then:
              '${action.taskGroupId}'
            else:
              '${ownTaskId}'  # same as taskId; this is how automation identifies a decision task
          schedulerId: '${trustDomain}-level-${level}'
          created: {$fromNow: ''}
          deadline: {$fromNow: '1 day'}
          expires: {$fromNow: '1 year 1 second'}  # 1 second so artifacts expire first
          metadata:
            $merge:
              - owner: "${ownerEmail}"
                source: "${repoUrl}/raw/${head_sha}/.taskcluster.yml"
              - $switch:
                  'tasks_for == "github-push" || isPullRequest':
                    name: "Decision Task (${tasks_for[7:]})"  # strip out "github-" from tasks_for
                    description: 'The task that creates all of the other tasks in the task graph'
                  'tasks_for == "action"':
                    name: "Action: ${action.title}"
                    description: |
                      ${action.description}

                      Action triggered by clientID `${clientId}`
                  $default:
                    name: "Decision Task for cron job ${cron.job_name}"
                    description: 'Created by a [cron task](https://firefox-ci-tc.services.mozilla.com/tasks/${cron.task_id})'
          provisionerId: "${trustDomain}-${level}"
          workerType: "decision-gcp"
          tags:
            $switch:
              'tasks_for == "github-push" || isPullRequest':
                createdForUser: "${ownerEmail}"
                kind: decision-task
              'tasks_for == "action"':
                createdForUser: '${ownerEmail}'
                kind: 'action-callback'
              'tasks_for == "cron"':
                kind: cron-task
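          # Treeherder and index routes are only added for non-PR events; pull
          # requests get only the `checks` route.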
          routes:
            $flatten:
              - checks
              - $if: '!isPullRequest'
                then:
                  - tc-treeherder.v2.${project}.${head_sha}
              - $switch:
                  'tasks_for == "github-push"':
                    - "index.${trustDomain}.v2.${project}.latest.taskgraph.decision"
                    - "index.${trustDomain}.v2.${project}.revision.${head_sha}.taskgraph.decision"
                  'tasks_for == "action"':
                    - "index.${trustDomain}.v2.${project}.revision.${head_sha}.taskgraph.actions.${ownTaskId}"
                  'tasks_for == "cron"':
                    - "index.${trustDomain}.v2.${project}.latest.taskgraph.decision-${cron.job_name}"
                    - "index.${trustDomain}.v2.${project}.revision.${head_sha}.taskgraph.decision-${cron.job_name}"
                    # list each cron task on this revision, so actions can find them
                    - 'index.${trustDomain}.v2.${project}.revision.${head_sha}.cron.${ownTaskId}'
                  $default: []
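        # Scopes granted to the decision task, keyed on the triggering event:
        # branch pushes, pull requests, actions, and cron jobs each assume
        # the matching repo role, so a run only gets the permissions
        # configured for that event type.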
scopes:
$switch:
'tasks_for in ["github-push"]':
$let:
short_head_ref:
$if: 'head_ref[:10] == "refs/tags/"'
then: {$eval: 'head_ref[10:]'}
else:
$if: 'head_ref[:11] == "refs/heads/"'
then: {$eval: 'head_ref[11:]'}
else: ${head_ref}
in:
- 'assume:repo:${repoUrl[8:]}:branch:${short_head_ref}'
'isPullRequest':
- 'assume:repo:github.com/${event.pull_request.base.repo.full_name}:${tasks_for[7:]}'
'tasks_for == "action"':
- 'assume:repo:${repoUrl[8:]}:action:${action.action_perm}'
$default:
- 'assume:repo:${repoUrl[8:]}:cron:${cron.job_name}'
dependencies: []
requires: all-completed
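        # Queue priority of the decision task itself, per triggering event.
        # These are standard Taskcluster priority levels; keeping routine CI
        # at the low end is the usual convention so it defers to more
        # urgent work.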
priority:
$switch:
'tasks_for == "cron"': low
          'tasks_for == "github-push" || isPullRequest': very-low
$default: lowest # tasks_for == 'action'
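        # Retries cover infrastructure failures such as the worker
        # disappearing mid-run; a task that fails on its own is not
        # retried by this setting.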
retries: 5
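        # The project name is normalized for use in environment variable
        # names: split on "-" and rejoin with "_", so
        # "firefox-translations-training" becomes
        # "firefox_translations_training", or FIREFOX_TRANSLATIONS_TRAINING
        # once uppercased.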
payload:
$let:
normProject:
$eval: 'join(split(project, "-"), "_")'
normProjectUpper:
$eval: 'uppercase(join(split(project, "-"), "_"))'
in:
env:
# run-task uses these to check out the source; the inputs to
# `taskgraph decision` are all on the command line.
$merge:
- ${normProjectUpper}_BASE_REPOSITORY: '${baseRepoUrl}'
${normProjectUpper}_BASE_REF: '${base_ref}'
${normProjectUpper}_BASE_REV: '${base_sha}'
${normProjectUpper}_HEAD_REPOSITORY: '${repoUrl}'
${normProjectUpper}_HEAD_REF: '${head_ref}'
${normProjectUpper}_HEAD_REV: '${head_sha}'
${normProjectUpper}_REPOSITORY_TYPE: git
${normProjectUpper}_PIP_REQUIREMENTS: taskcluster/requirements.txt
REPOSITORIES:
$json:
${normProject}: ${normProject}
- $if: 'isPullRequest'
then:
${normProjectUpper}_PULL_REQUEST_NUMBER: '${event.pull_request.number}'
- $if: 'tasks_for == "action"'
then:
ACTION_TASK_GROUP_ID: '${action.taskGroupId}' # taskGroupId of the target task
ACTION_TASK_ID: {$json: {$eval: 'taskId'}} # taskId of the target task (JSON-encoded)
ACTION_INPUT: {$json: {$eval: 'input'}}
ACTION_CALLBACK: '${action.cb_name}'
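            # Persistent checkout cache, namespaced by trust domain and
            # level; keeping levels in separate caches is the standard
            # guard against lower-level tasks poisoning a level-3 checkout.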
cache:
"${trustDomain}-level-${level}-checkouts-sparse-v2": /builds/worker/checkouts
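            # taskclusterProxy gives the task authenticated access to
            # Taskcluster APIs (needed to create the rest of the graph)
            # using the task's own scopes, with no credentials in the payload.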
features:
taskclusterProxy: true
image: mozillareleases/taskgraph:decision-5483484ad45a3d27a0f5bd05f1c87d90e08df67a3713605d812b851a8a5bd854@sha256:ef132cc5741539f846a85bbe0cebc3c9ead30b8f24c1da46c55363f2170c3993
maxRunTime: 1800
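            # run-task checks the repository out into the cached workspace,
            # then hands off to taskgraph: `action-callback` for action
            # tasks, or `decision` with the parameters below for pushes,
            # pull requests, and cron.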
command:
- run-task
- '--${normProject}-checkout=/builds/worker/checkouts/src'
- '--'
- bash
- -cx
- $let:
extraArgs: {$if: 'tasks_for == "cron"', then: '${cron.quoted_args}', else: ''}
in:
$if: 'tasks_for == "action"'
then: >
cd /builds/worker/checkouts/src &&
ln -s /builds/worker/artifacts artifacts &&
~/.local/bin/taskgraph action-callback
else: >
cd /builds/worker/checkouts/src &&
ln -s /builds/worker/artifacts artifacts &&
~/.local/bin/taskgraph decision
--pushlog-id='0'
--pushdate='0'
--project='${project}'
--owner='${ownerEmail}'
--level='${level}'
--repository-type=git
--tasks-for='${tasks_for}'
--base-repository='${baseRepoUrl}'
--base-ref='${base_ref}'
--base-rev='${base_sha}'
--head-repository='${repoUrl}'
--head-ref='${head_ref}'
--head-rev='${head_sha}'
${extraArgs}
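# The head-repository/ref/rev flags above are filled from the $let
# bindings resolved earlier in this file. ${extraArgs} is presumably
# bound elsewhere in the file (e.g. to extra CLI flags for cron or
# action runs) and left empty for ordinary pushes -- an assumption,
# as its definition is outside this excerpt of the template.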
artifacts:
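# 'public' holds the decision task's generated outputs; for
# taskgraph-based decision tasks this typically includes
# task-graph.json, parameters.yml, and actions.json (assumed from
# standard taskgraph behaviour, not spelled out in this file).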
'public':
type: 'directory'
path: '/builds/worker/artifacts'
expires: {$fromNow: '1 year'}
'public/docker-contexts':
type: 'directory'
path: '/builds/worker/checkouts/src/docker-contexts'
# This needs to be at least the deadline of the
# decision task plus the deadlines of the docker-image
# tasks. It is set to a week to allow some time for
# debugging, but these contexts are not useful long-term.
expires: {$fromNow: '7 day'}
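# Treeherder reads task.extra.treeherder to decide how to display the
# task: pushes and pull requests get the bare "D" (decision) symbol,
# action callbacks are grouped under "AC", and cron jobs under "cron".
# Action and cron metadata is also attached below so that downstream
# tooling can reconstruct the triggering context.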
extra:
$merge:
- treeherder:
$merge:
- machine:
platform: gecko-decision
- $if: 'tasks_for == "github-push" || isPullRequest'
then:
symbol: D
else:
$if: 'tasks_for == "action"'
then:
groupName: 'action-callback'
groupSymbol: AC
symbol: "${action.symbol}"
else:
groupSymbol: cron
symbol: "${cron.job_symbol}"
- $if: 'tasks_for == "action"'
then:
parent: '${action.taskGroupId}'
action:
name: '${action.name}'
context:
taskGroupId: '${action.taskGroupId}'
taskId: {$eval: 'taskId'}
input: {$eval: 'input'}
clientId: {$eval: 'clientId'}
- $if: 'tasks_for == "cron"'
then:
cron: {$json: {$eval: 'cron'}}
- tasks_for: '${tasks_for}'
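# For illustration (a hedged sketch, not rendered output from a real
# run): for a cron job whose job_symbol is "N", the $merge/$if chain
# above evaluates to roughly:
#
#   extra:
#     treeherder:
#       machine: {platform: gecko-decision}
#       groupSymbol: cron
#       symbol: "N"
#     cron: '{"...the cron parameters, serialized to JSON by $json..."}'
#     tasks_for: cron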