gecko-dev/taskcluster/ci/fetch/kind.yml

15 lines · 463 B · YAML

Bug 1460777 - Taskgraph tasks for retrieving remote content; r=dustin, glandium

Currently, many tasks fetch content from the Internets. A problem with that is fetching from the Internets is unreliable: servers may have outages or be slow; content may disappear or change out from under us. The unreliability of 3rd party services poses a risk to Firefox CI. If services aren't available, we could potentially not run some CI tasks. In the worst case, we might not be able to release Firefox. That would be bad. In fact, as I write this, gmplib.org has been unavailable for ~24 hours and Firefox CI is unable to retrieve the GMP source code. As a result, building GCC toolchains is failing.

A solution to this is to make tasks more hermetic by depending on fewer network services (which by definition aren't reliable over time and therefore introduce instability).

This commit attempts to mitigate some external service dependencies by introducing the *fetch* task kind. The primary goal of the *fetch* kind is to obtain remote content and re-expose it as a task artifact. By making external content available as a cached task artifact, we allow dependent tasks to consume this content without touching the service originally providing that content, thus eliminating a run-time dependency and making tasks more hermetic and reproducible over time.

We introduce a single "fetch-url" "using" flavor to define tasks that fetch single URLs and then re-expose the fetched content as artifacts. Powering this is a new, minimal "fetch" Docker image that contains a "fetch-content" Python script that does the work for us. We have added tasks to fetch source archives used to build the GCC toolchains.

Fetching remote content and re-exposing it as an artifact is not very useful by itself: the value is in having tasks use those artifacts. We introduce a taskgraph transform that allows tasks to define an array of "fetches." Each entry corresponds to the name of a task in the "fetch" kind. When present, the corresponding "fetch" task is added as a dependency, and the task ID and artifact path from that "fetch" task are added to the MOZ_FETCHES environment variable of the task depending on it. Our "fetch-content" script has a "task-artifacts" sub-command that tasks can execute to retrieve all artifacts listed in MOZ_FETCHES.

To prove all of this works, the code for fetching dependencies when building GCC toolchains has been updated to use `fetch-content`. The now-unused legacy code has been deleted.

This commit improves the reliability and efficiency of GCC toolchain tasks. Dependencies now all come from task artifacts and should always be available in the common case. In addition, `fetch-content` downloads and extracts files concurrently, which makes it faster than the serial approach we were previously using.

There are some things I don't like about this commit. First, a new Docker image and Python script for downloading URLs feels a bit heavyweight. The Docker image is definitely overkill as things stand. I can eventually justify it because I want to implement support for fetching and repackaging VCS repositories and for caching Debian packages. These will require more packages than what I'm comfortable installing on the base Debian image, therefore justifying a dedicated image. The `fetch-content static-url` sub-command could definitely be implemented as a shell script. But Python is readily available and is more pleasant to maintain than shell, so I wrote it in Python. `fetch-content task-artifacts` is more advanced and writing it in Python is more justified, IMO. FWIW, the script is Python 3 only, which conveniently gives us access to `concurrent.futures`, which facilitates concurrent download.

`fetch-content` also duplicates functionality found elsewhere. generic-worker's task payload supports a "mounts" feature which facilitates downloading remote content, including from a task artifact. However, this feature doesn't exist on docker-worker. So we have to implement downloading inside the task rather than at the worker level. I concede that if all workers had generic-worker's "mounts" feature and supported concurrent download, `fetch-content` wouldn't need to exist.

`fetch-content` also duplicates functionality of `mach artifact toolchain`. I probably could have used `mach artifact toolchain` instead of writing `fetch-content task-artifacts`. However, I didn't want to introduce the requirement of a VCS checkout. `mach artifact toolchain` has its origins in providing a feature to the build system, and "fetching artifacts from tasks" is a more generic feature than that. I think it should be implemented as a generic feature and not something that is "toolchain" specific.

I think the best place for a generic "fetch content" feature is in the worker, where content can be defined in the task payload. But as explained above, that feature isn't universally available. The next best place is probably run-task. run-task already performs generic, very-early task preparation steps, such as performing a VCS checkout. I would like to fold `fetch-content` into run-task and make it all driven by environment variables. But run-task is currently Python 2 and achieving concurrency would involve a bit of programming (or adding package dependencies). I may very well port run-task to Python 3 and then fold fetch-content into it. Or maybe we leave `fetch-content` as a standalone script.

MozReview-Commit-ID: AGuTcwNcNJR
--HG--
extra : source : 0b941cbdca76fb2fbb98dc5bbc1a0237c69954d0
extra : histedit_source : a3e43bdd8a9a58550bef02fec3be832ca304ea93
2018-06-07 00:37:49 +03:00
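
To make the MOZ_FETCHES mechanism concrete, here is a minimal sketch of the concurrent-download approach described above; it is not the in-tree `fetch-content` script. It assumes, purely for illustration, that MOZ_FETCHES holds a whitespace-separated list of `<task-id>@<artifact-path>` entries, that artifacts are served from the Taskcluster queue, and that a MOZ_FETCHES_DIR variable names the destination directory.

#!/usr/bin/env python3
# Illustrative sketch only -- NOT the in-tree fetch-content script.
# The MOZ_FETCHES entry format, the queue endpoint, and MOZ_FETCHES_DIR
# are all assumptions made for this example.
import concurrent.futures
import os
import pathlib
import shutil
import urllib.request

QUEUE = 'https://queue.taskcluster.net/v1'  # assumed queue endpoint


def fetch_artifact(entry, dest_dir):
    """Download a single task artifact into dest_dir and return its path."""
    task_id, artifact = entry.split('@', 1)
    url = '{}/task/{}/artifacts/{}'.format(QUEUE, task_id, artifact)
    dest = pathlib.Path(dest_dir) / os.path.basename(artifact)
    with urllib.request.urlopen(url) as response, open(dest, 'wb') as fh:
        shutil.copyfileobj(response, fh)
    return dest


def fetch_all(dest_dir):
    """Fetch every artifact listed in MOZ_FETCHES, several at a time."""
    entries = os.environ.get('MOZ_FETCHES', '').split()
    # concurrent.futures (Python 3 only) is what makes concurrent
    # download trivial to express.
    with concurrent.futures.ThreadPoolExecutor(max_workers=4) as pool:
        futures = [pool.submit(fetch_artifact, e, dest_dir) for e in entries]
        for future in concurrent.futures.as_completed(futures):
            print('fetched {}'.format(future.result()))


if __name__ == '__main__':
    fetch_all(os.environ.get('MOZ_FETCHES_DIR', '.'))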
Bug 1467359 - Implement fetch-url jobs using a transform; r=tomprince

Previously, the fetch kind was defined by a job "using" flavor. An upcoming commit will need to derive new tasks from fetch job definitions. In order to do this, we require a transform. And this transform would need access to the original data from the job description. But by the time the "using" transform runs, a lot of this data is thrown away.

It would be possible to stuff the meaningful metadata inside attributes on the created task. But this would result in the fetch logic being fragmented across multiple Python modules (a fetch-specific transform and a job/using-specific transform). I think it is better to keep all that logic in a single Python module.

This commit converts the "using: fetch-url" job transform to a dedicated transform for the fetch kind. Since we're now using a dedicated transform, we no longer need to use the normal job schema for defining fetch jobs, so we refactor the schema a little to make it simpler.

I verified the taskgraph output is nearly identical to before by diffing the JSON output of `mach taskgraph full`. Aside from the additions of ['worker']['implementation'] and ['worker']['os'] fields (which seem to be required by a schema somewhere), everything was identical.

Differential Revision: https://phabricator.services.mozilla.com/D1574
--HG--
rename : taskcluster/taskgraph/transforms/job/fetch.py => taskcluster/taskgraph/transforms/fetch.py
extra : source : d22981772dd995a492b5fdb67c887d55d24654db
extra : amend_source : ab472fd2a8d652092814443b8da94a580a5bfd53

2018-06-21 02:25:34 +03:00

# This Source Code Form is subject to the terms of the Mozilla Public
# License, v. 2.0. If a copy of the MPL was not distributed with this
# file, You can obtain one at http://mozilla.org/MPL/2.0/.
loader: taskgraph.loader.transform:loader
transforms:
- taskgraph.transforms.fetch:transforms
- taskgraph.transforms.try_job:transforms
- taskgraph.transforms.job:transforms
- taskgraph.transforms.task:transforms
jobs-from:
- toolchains.yml
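
The kind.yml above wires `taskgraph.transforms.fetch:transforms` into the pipeline ahead of the generic job and task transforms. As a rough illustration of what such a dedicated kind transform looks like, here is a hedged sketch in the in-tree `TransformSequence` style; the schema fields, worker type, paths, and `fetch-content` arguments below are assumptions for illustration, not the real definitions in `taskcluster/taskgraph/transforms/fetch.py`.

# Sketch of the general shape of a dedicated kind transform such as
# taskcluster/taskgraph/transforms/fetch.py. TransformSequence is the
# real in-tree helper; every field in the job/task dicts below is an
# illustrative assumption, not the actual fetch schema.
from __future__ import absolute_import, print_function, unicode_literals

from taskgraph.transforms.base import TransformSequence

transforms = TransformSequence()


@transforms.add
def make_fetch_description(config, jobs):
    """Turn each fetch job definition into a task description for the
    downstream job/task transforms to fill out."""
    for job in jobs:
        fetch = job['fetch']  # illustrative: {'url': ..., 'sha256': ...}
        artifact = 'public/' + job['name']
        yield {
            'name': job['name'],
            'description': job['description'],
            'worker-type': 'aws-provisioner-v1/gecko-misc',  # assumed
            'worker': {
                'implementation': 'docker-worker',
                'os': 'linux',
                'docker-image': {'in-tree': 'fetch'},
                'max-run-time': 900,
                # Hypothetical invocation of the fetch-content script;
                # the real flags may differ.
                'command': [
                    'fetch-content', 'static-url',
                    '--sha256', fetch['sha256'],
                    fetch['url'],
                    '/builds/worker/artifacts/' + job['name'],
                ],
                'artifacts': [{
                    'type': 'file',
                    'name': artifact,
                    'path': '/builds/worker/artifacts/' + job['name'],
                }],
            },
        }

A downstream task would then name this job in its `fetches` list; the transform described in Bug 1460777 turns that into a dependency edge plus a MOZ_FETCHES entry pointing at the artifact above.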