Commit graph

8 commits

Author SHA1 Message Date
Greg Tatum f1668c1a1c
Merge corpus rewrite to python (#851)
2024-10-17 13:54:31 -05:00
Greg Tatum fd46f35df5
Add a memory logger (#821)
2024-09-04 13:14:26 -05:00
Greg Tatum 06001d6f8c
Rewrite merge mono and add support for an OPUS monolingual importer (#787)
2024-08-30 10:35:02 -05:00
Greg Tatum 01f9e3cc94
Use a forked OpusCleaner in `release` (#693)
* Use a forked opuscleaner

* Move to mozilla

* Remove hashes as they are not supported for URLs
2024-06-25 15:38:23 -05:00
Evgeny Pavlov b253a1ce6b
Fix install opuscleaner (#350)
* Update and enable opuscleaner

* Remove comment
2024-01-10 12:02:22 -08:00
Evgeny Pavlov 2d4530d0f5
Always split corpus to a fixed number of parts (#308)
* Always split corpus to a fixed number of parts

* Fix splitting

* Rewrite corpus splitting in Python

* Replace in taskcluster

* Add tests

* Unify compression tool with Taskcluster

* Move zstd installation to docker image

* Disable opuscleaner in CI

* Compress chunks

* Fix file names

* Remove zeros from file index

* Start file index with 1

* Fix corpus splitting

* Add a link to an issue

* Generate script description from doc

* Use new test dir

* Use new test dir

* Test command line args

* Clarify expected files

* Add logging
2023-12-19 15:25:33 -08:00
Evgeny Pavlov 1d1b4922bb
Fix OpusCleaner on ccmatrix (#222)
* Update opus cleaner

* Fix skipping bicleaner

* Switch to a small model
2023-10-31 15:13:00 -07:00
Evgeny Pavlov e9102a37ef
Integrate OpusCleaner (#163)
* Initial integration of opus cleaner

* Support custom filters

* Use opus cleaner in pipeline

* Fix env

* Fix filter generation

* Add more rules

* Fix elrc filter

* Fix env

* Fix frequent patterns filter

* Switch to reading from stdin

* Add a feature flag for opus cleaner

* Fix condition

* Add extra test for non empty files

* Integrate with TC

* Run linter

* Fix step config

* Fix step config

* Fix step config

* Fix step config

* Fix command

* Fix path

* Update OpusCleaner

* Remove warning

* Log filtered length

* Add opuscleaner logs

* Add comments

* Fix using custom filters

* Extract function

* Change the CI target back

* Fix file path

* Replace conda with poetry

* Add doc

* Add more comments

* Rename example filter

* Test corpus

* Fix filter name

* Use opus dataset instead of mtdata

* Make CI faster

* Add sections to makefile

* Fix custom filter search

* Redirect stderr to stdout

* Fix usage of custom config

* Fix config name

* Change back to all
2023-09-26 15:29:07 -07:00