Commit Graph

344 Commits

Author SHA1 Message Date
Greg Tatum 157fc5e8ae
Update the training continuation docs (#540) 2024-04-29 14:36:14 -05:00
Bastien Abadie 658d6c2d88
Taskcluster publication (#501)
* Configure Taskcluster secret for w&b through taskgraph transform

* Direct secret usage

* Support recursive join on tags

* Do not use taskcluster filtering with --from-stream

* Enable verbose mode

* More debugging lines

* Redirect opustrainer stderr to stdout

* Log marian lines in verbose output

* Fix test

* Log raw lines & add prefix to our own logging

* Correct project name

* Convert Taskcluster trigger from transform to kind

* Fix lint

* Setup tracking on all training tasks

* Set WandB group & run name

* Skip publication on unit tests

* Make perplexity optional

* Update test fixture

* Add training parameter to control publication

* Move trigger control to python

* Use task config to get wandb names

* Bashism

* Use taskcluster group logic to build task name

* Run on WANDB_PUBLICATION=false, but do not publish

* Expose Weights & Biases tags
2024-04-29 10:32:23 -07:00
Ben Hearsum (he/him) 327509c0ff
Revert change to generic-worker for CPU tasks (#536)
We're hitting some odd issues with caches that need to be worked out. E.g.:
error: cache /builds/worker/checkouts is not empty and is missing a .cacherequires file; the cache names for this task are likely mis-configured or TASKCLUSTER_CACHES is not set properly

(from https://firefox-ci-tc.services.mozilla.com/tasks/IvbeCQBuRuKIOaeOIGEfHg/runs/7)
2024-04-25 15:43:36 -04:00
Ben Hearsum (he/him) f8fb37637e
Revert unnecessary change to docker image (#535)
I made this in #533 to ensure I got a full test run in that PR. However, there was no need for this to make it to main. Let's back this out to avoid changing cache digests there.
2024-04-25 13:49:30 -04:00
Ben Hearsum (he/him) d68edc08f3
Switch CPU tasks to generic-worker/d2g images (fixes #473) (#533)
* Switch CPU tasks to generic-worker/d2g images (fixes #473)

This switches us from the deprecated docker-worker to generic-worker. generic-worker provides a translation layer for docker-worker tasks that avoids the need to change any payloads. (It will download specified images and run payload commands in them, rather than on the host machine.)

Upgrading to this new image will give us memory monitoring capabilities on the CPU workers because the new image has the GCP Ops Agent installed on it.

* Invalidate docker image cache to force rebuilds of all docker tasks on d2g workers
2024-04-24 09:07:32 -04:00
Ben Hearsum (he/him) 145a84ace3
Fix parameters to use correct target tasks method (#526)
As things are now, the `small` parameters will never generate training tasks. (The `large` params already use the correct target tasks method.)
2024-04-15 21:45:31 -04:00
Greg Tatum e8c6f2e8d3
Remove the Makefile and replace it with a Taskfile (#510) 2024-04-09 16:11:13 -05:00
Greg Tatum fa56c7b298
Add parallel stats to the analyze task (#500) 2024-04-09 15:38:16 -05:00
Greg Tatum f60c657596
Update the docs for training continuation to use yaml (#516) 2024-04-09 13:54:18 -05:00
Valentin Rigal 48176f6a90
Prevent duplicating group_logs from experiments entrypoint (#511)
Co-authored-by: Evgeny Pavlov <epavlov@mozilla.com>
2024-04-03 12:01:07 -07:00
Valentin Rigal f49e9c35be
Parse logs with Marian 1.12 (#512)
* Add test data

* Update parser

* Add tests
2024-04-03 10:35:37 -07:00
Evgeny Pavlov a3bb87c069
Update docs for OpusTrainer and alignments (#504)
* Update OpusTrainer docs with inline noise

* Update and refactor documentation for pipeline steps
2024-03-28 18:18:23 -07:00
Evgeny Pavlov fab87a7a70
Add support of inline noise data augmentation (#502)
* Add eflomal based aligner

* Use new aligner for shortlist

* Remove old aligner

* Add Taskcluster steps for whitespace tokenized alignments

* Move file to a renamed directory

* Use Tags modifier in training

* Update tests for alignments and shortlist

* Add support of inline noise augmentation in data importer

* Do not use slow inline noise augmentation in devset on CI

* Remove the old alignments task

* Add a test for student alignments

* Fix alignments in training tests

* Return matplotlib module after merge

* Rename functions

* Add more comments in the code

* Remove compression env

* Relock poetry
2024-03-28 18:10:02 -07:00
Evgeny Pavlov 3774779cb7
Add Marian server for model testing (#492)
* Compile marian server

* Add Marian server for testing

* Reformat

* Update utils/marian_client.py

Co-authored-by: Greg Tatum <gregtatum@users.noreply.github.com>

* Make port configurable

* Relock poetry

---------

Co-authored-by: Greg Tatum <gregtatum@users.noreply.github.com>
2024-03-28 15:53:16 -07:00
Evgeny Pavlov 7a15b5e97a
Update Marian to v1.12.14 2d067afb 2024-02-16 (#491) 2024-03-28 15:23:04 -07:00
Valentin Rigal 55ab1fc486
Add argument to override runs (#498)
Co-authored-by: Evgeny Pavlov <epavlov@mozilla.com>
2024-03-28 11:51:06 -07:00
Valentin Rigal 418a1e4d55
Consistent dataset names (#494)
* Consistent evaluation tags parsing

* Add test

* Support backwards training task label

* Support evaluation task with suffix

* Support suffixes with form -1/2

---------

Co-authored-by: Evgeny Pavlov <epavlov@mozilla.com>
2024-03-28 11:46:14 -07:00
Ben Hearsum (he/him) 015a74df64
Bump taskgraph version to 7.4.0. (#497)
This picks up an optimization that should fix #487.
2024-03-28 10:02:13 -04:00
Greg Tatum 830e5b12ac
Add a dockerignore file (#499) 2024-03-26 14:19:50 -05:00
Valentin Rigal 8c7bff6f00
Skip checking existing runs for new projects (#496)
* Skip checking existing runs for new projects

Follow-up to #484.
Otherwise a ValueError was raised by the wandb client

* Suggestion

---------

Co-authored-by: Evgeny Pavlov <epavlov@mozilla.com>
2024-03-26 11:34:47 -07:00
Greg Tatum 78977402e0
Analysis task that provides the word distribution (#477) 2024-03-26 13:28:43 -05:00
EvaBardou 5e99eb34f9
Allow to use taskcluster.Secrets service to retrieve W&B secret API Key (#493)
* Allow to use taskcluster.Secrets service to retrieve W&B secret API Key

* Properly parse TC secret
2024-03-26 10:30:42 -07:00
Evgeny Pavlov 36ff36e534
Add tests for training (#489)
* Install OpenBLAS to run Marian

* Fix Marian's version

* Support running training with run_task

* Add extra args for Marian

* Make train.sh compatible with CPU

* Remove redundant export

* Add tests for training

* Fix formatting

* Fetch cuda libs

* Document regex

* Compile marian to use on CPU for tests

* Fix formatting

* Fix comment

* Make the file names consistent
2024-03-21 14:22:39 -07:00
Evgeny Pavlov 36e56b7bdb
Fix pretraining (#485) 2024-03-20 14:45:46 -07:00
Evgeny Pavlov a359723e41
Fix compatibility (#480) 2024-03-20 14:28:02 -07:00
Evgeny Pavlov 1f12387866
Fix downloading evals (#479) 2024-03-20 14:16:10 -07:00
Evgeny Pavlov a4d25cb760
Run tests with toolchain binaries under Docker (#478)
* Add tests for alignments

* Use toolchain in CI tests

* Compile toolchain locally under Docker

* Add commands to build and run docker locally

* Trigger CI

* Install missing libs

* Clarify exporting variables for ARM processors

* Add a workaround for poetry not seeing python

* Clarify the reason for not installing packages in docker image
2024-03-20 12:57:42 -07:00
Valentin Rigal fb9531f0b5
Avoid duplicated runs during W&B publication (#484)
Co-authored-by: Evgeny Pavlov <epavlov@mozilla.com>
2024-03-19 12:37:42 -07:00
Valentin Rigal 8646fc6b02
Avoid duplicated backward runs publication (#483) 2024-03-19 12:32:31 -07:00
Valentin Rigal 97f7d00fbd
Support failures retrieving task artifacts (#482) 2024-03-18 14:27:20 -07:00
Greg Tatum 89c24d892f
Add a taskcluster downloader for models (#475)
* Drive-by: Fix the logging of parameters in the downloads

* Update utility to provide a default output path, and list all download modes

* Add a taskcluster downloader for models
2024-03-11 17:06:54 -05:00
Greg Tatum 65ca580a16
Add support for custom corpora through remote URLs (#420) 2024-03-06 13:03:40 -06:00
Ben Hearsum (he/him) a17ed9db12
Add non-spot 1tb CPU worker option (#471) 2024-03-05 13:28:48 -05:00
Ben Hearsum (he/him) e227bf8292
Add 1tb CPU-only workers (#470) 2024-03-04 18:44:06 -05:00
Valentin Rigal 8012cd30cf
Update parser documentation (#462)
Co-authored-by: Evgeny Pavlov <epavlov@mozilla.com>
2024-02-29 14:57:22 -08:00
EvaBardou 62a74a9241
Allow group traversal from training tasks dependencies during Taskcluster task group publication (#461)
* Allow group traversal from training tasks dependencies during Taskcluster task group publication

* Fix lint + Apply Valentin's suggestions

* Comment to clarify the usage of a shared set variable

* Do not publish empty groups
2024-02-29 09:04:17 -08:00
Ben Hearsum (he/him) ed41ca2cc1
Set TERM for docker worker images (#464) 2024-02-27 21:30:09 -05:00
Greg Tatum 19e46e5120
Add shufflers as utilities (#467) 2024-02-27 16:11:27 -06:00
Valentin Rigal 3706913c88
Link metrics from labels in addition to TC dependencies (#465)
Co-authored-by: Evgeny Pavlov <epavlov@mozilla.com>
2024-02-26 14:58:53 -08:00
Ben Hearsum (he/him) 9ae9d9ad5f
Use spot instances for PRs (#457) 2024-02-22 19:20:01 -05:00
Valentin Rigal 3f135aa115
Taskcluster task group publication (#406)
* Base taskcluster task group publication

* Move tag parser to utils module

* Support metrics

* Support multiple teacher training

* Fix parsing for evaluation folder

* Generic group logs parser

* Parse extra evaluation tasks and publish group_logs fake run

* Publish Marian config on runs

* Publish marian config on runs instead of experiment config

* Rebase vrigal:publish-experiment-config

* Publish experiment config on group_logs
2024-02-16 09:05:01 -08:00
Evgeny Pavlov 58cce071ef
Support typos and noise modifiers (#428)
* Update opustrainer

* Adjust configs

* Add evaluation modifiers

* Reduce noise

* Add tests for typos and noise

* Fix typos augmenter

* Fix linting issues

* Update docs

* Fix test

* Update opus trainer

* Remove noise parameters from config

* Update opustrainer with fixes

* Run linter

* Fix tests after merge

* Disable noise for student

* Update lockfile

* Fix formatting

* Disable typos for student

* Rename assert functions

* Switch back to faster validation

* Document decision on using augmentations

* Fix typo
2024-02-15 15:33:24 -08:00
Evgeny Pavlov 092fd98deb
Fix random seed (#445)
* Use different random seeds for the teachers

* Fix substitution

* Pass random seed to Marian
2024-02-15 12:49:33 -08:00
Evgeny Pavlov e8ac81d9b8
Work around fastText model downloading failures (#435)
* Decrease max run time

* Add retry

* Remove todo

* Fix indentation
2024-02-15 10:37:08 -08:00
Evgeny Pavlov 190358a923
Fix linting for tracking (#441)
* Fix linting pythonpath for tracking

* Add pythonpath to the rest of the commands

* Remove pythonpath

* Update lockfile

* Fix wandb directory in tests
2024-02-15 09:21:33 -08:00
Ben Hearsum (he/him) 34c4e01bd6
Enable 'train' action for PRs against a supported repository (#447)
* Enable 'train' action for PRs against a supported repository

* Fix scope repo url for actions
2024-02-15 11:13:22 -05:00
Ben Hearsum (he/him) e6ec0d5474
Add support for triggering actions from PR decision tasks. (#442) 2024-02-14 19:09:54 -05:00
Ben Hearsum (he/him) 70fede467f
Add the ability to run starting from a specific task (fixes #227) (#377)
* Add the ability to run starting from a specific task (fixes #227)

A couple of example runs with this:
* https://firefox-ci-tc.services.mozilla.com/tasks/groups/YHAr0HzwSSe4pe5Yh9dIlg uses https://firefox-ci-tc.services.mozilla.com/tasks/groups/JjNp3KcyTUObUtOA9BgK5g as its `previous-group-id` with `start-stage: train-backwards` and `target-stage: train-teacher` - and ends up running `train-backwards`, `translate-mono-trg`, `collect-mono-trg`, and `train-teacher`.
* https://firefox-ci-tc.services.mozilla.com/tasks/groups/Sm0YV_8LQP-EOE8Nz6G5Lw uses the above group as its `previous-group-id` with `start-stage: train-teacher` and `target-stage: all`. Note that it ended up depending on tasks from both the above group and the one that it was based on, and ended up scheduling `train-teacher` and everything after it (I didn't bother letting them all run - I think the scheduling is enough to verify this).

Big thanks to @gabrielBusta for suggesting this implementation!

* Update poetry dependencies to pull in newer taskgraph version
2024-02-14 09:07:07 -05:00
Ben Hearsum (he/him) 4a5fc1f8c7
fix: set --output-file correctly in taskgraph-diff (#413)
At the moment we end up with files that sit beside the output directory rather than in it. In CI, this means that we don't get the diffs uploaded as artifacts.
2024-02-13 13:19:51 -05:00
Greg Tatum ffa6d77902
Add a structured logging script (#437)
* Add a structured logger to bicleaner

* Adjust the pythonpath for the download_pack.py script
2024-02-12 16:08:02 -06:00