From 60852a7e54b5383b26ffa235c9cdd5459b239212 Mon Sep 17 00:00:00 2001
From: "Francesco Lodolo (:flod)"
Date: Thu, 5 Jan 2023 15:23:58 +0000
Subject: [PATCH] Bug 1808224 - Improve scripts and documentation to edit the en-US dictionary, r=RyanVM

## edit-dictionary.sh

Instead of editing the .dic file directly, allow the user to provide a list of
words. The script expands the existing .dic file, adds the new words, and
compresses it again using the affix rules.

Numerals at the beginning of the file and "no suggestion" words need to be
special-cased, since compressing the word list creates different results.

## make-new-dict.sh

Extract suggestion exclusions from the existing Mozilla dictionary, then add
them back to the dictionary generated by SCOWL. This removes the need to
maintain an external list of exclusions (mozilla-exclusions.txt). It also
allows excluding these offensive words from both the lists of words added and
removed by Mozilla.

Also:
- Break if the scowl folder is missing.
- Remove backup folders to make sure the install script can't be run twice.

## install-new-dict.sh

Break if the scowl folder is missing.

Differential Revision: https://phabricator.services.mozilla.com/D165883
---
 extensions/spellcheck/docs/index.rst          | 77 +++++++++++++------
 .../dictionary-sources/edit-dictionary.sh     | 74 ++++++++++++++++--
 .../dictionary-sources/install-new-dict.sh    | 24 +++---
 .../dictionary-sources/make-new-dict.sh       | 64 ++++++++++-----
 .../dictionary-sources/mozilla-exclusions.txt |  0
 5 files changed, 181 insertions(+), 58 deletions(-)
 delete mode 100644 extensions/spellcheck/locales/en-US/hunspell/dictionary-sources/mozilla-exclusions.txt

diff --git a/extensions/spellcheck/docs/index.rst b/extensions/spellcheck/docs/index.rst
index 2f83dc156e83..697b6303ced6 100644
--- a/extensions/spellcheck/docs/index.rst
+++ b/extensions/spellcheck/docs/index.rst
@@ -13,39 +13,82 @@ For more information about Hunspell or the affix file format, you can check
 Requesting to add new words to the en-US dictionary
 ===================================================
 
-If you’d like to add new words to the dictionary, you can `file a bug`_. Try to
-provide information on the terms you want to add, in particular references to
-external sources that confirm the usage of the term.
+If you’d like to add new words to the dictionary, you can `file a bug`_:
+
+* Try to provide information on the terms you want to add, in particular
+  references to external sources that confirm the usage of the term (e.g.
+  Merriam-Webster or Oxford online dictionaries).
+* Include all possible forms, e.g. plural for nouns, different tenses for verbs.
 
 Adding new words to the en-US dictionary
 ========================================
 
-This section describes the process for adding a word to the dictionary:
+This section describes the process for adding new words to the dictionary:
 
 #. Get a clone of mozilla-central (see :ref:`Firefox Contributors' Quick
    Reference`), if you don’t already have one, and make sure you can build it
    successfully.
-#. Get into the dictionary sources directory using this command:
-   ``cd extensions/spellcheck/locales/en-US/hunspell/dictionary-sources``
+#. Move in the dictionary sources directory using this command:
+   ``cd extensions/spellcheck/locales/en-US/hunspell/dictionary-sources``.
+#. Identify the current version of SCOWL by checking the file
+   ``README_en_US.txt`` (at the beginning of the file there is a line similar to
+   ``Generated from SCOWL Version 2020.12.07``, where ``2020.12.07`` is the
+   SCOWL version).
+#. Download the same version of the dictionary from the `SCOWL`_ homepage or
+   `SourceForce`_ as a tarball (tag.gz) and unpack it in the working directory.
+   Rename the resulting folder from ``scowl-YYYY.MM.DD`` to ``scowl``.
 #. There’s a special script used for editing dictionaries. The script only works
    if you have the environment variable ``EDITOR`` set to the executable of an
    editor program; if you don’t have it set, you can use
    ``EDITOR=vim sh edit-dictionary.sh`` to edit using ``vim`` (or you can
    substitute it with another editor), or you can just type
    ``sh edit-dictionary.sh`` if you have an ``EDITOR`` already specified.
-#. Add and remove words in the dictionary file, then quit the editor.
+
+   Copy and paste the full list of words, then save and quit the editor. It’s
+   not necessary to put the words in alphabetical order, as it will be corrected
+   by the script.
+#. Run the script ``sh make-new-dict.sh`` to generate a new dictionary and make
+   sure it runs without errors. For more details on this script, see the
+   `make-new-dict.sh`_ section.
+#. Do a sanity check on the resulting dictionary file ``en_US-mozilla.dic``. For
+   example, make sure that the size is about the same as the original dictionary
+   (or slightly larger).
+#. If everything looks correct, use ``sh install-new-dict.sh`` to copy the
+   generated file in the right position.
 #. Build Firefox and test your updated dictionary. Once you’re satisfied, use
    the process described in :ref:`write_a_patch` to create a patch.
 
-Note that the update script will modify 2 files, and both need to be committed:
+Note that the update script will modify 2 versions of the dictionary, and both
+need to be committed:
 
-* ``en-US.dic``: the dictionary actually shipping in the build and uses
+* ``en-US.dic``: the dictionary actually shipping in the build, it uses
   ISO-8859-1 encoding.
 * ``utf8/en-US.dic``: a version of the same dictionary with UTF-8 encoding. This
   is used to work around issues with Phabricator, and it allows to display
   actual changes in the diff.
 
+Exclude words from suggestions
+==============================
+
+It’s possible to completely exclude words from suggested alternatives by adding
+an affix rule ``!`` at the end of the definition in the ``.dic`` file. For
+example:
+
+* ``bum`` would be changed to ``bum/!`` (note the additional forward slash).
+* ``bum/MS`` would be changed to ``bum/MS!``.
+
+In order to exclude a word from suggestions, follow the instructions available
+in `Adding new words to the en-US dictionary`_. Instead of running the
+``edit-dictionary.sh`` script (point 5), use a text editor to edit the file
+``en-US.dic`` directly, then proceed with the remaining instructions.
+
+.. warning::
+
+  Make sure to open ``en-US.dic`` with the correct encoding. For example, Visual
+  Studio Code will try to open it as ``UTF-8``, and it needs to be reopened with
+  encoding ``Western (ISO 8859-1)``.
+
 Upgrading dictionary to a new upstream version of SCOWL
 =======================================================
 
@@ -56,11 +99,11 @@ used to generate the files for the en-US dictionary.
 The working directory for this process is
 ``extensions/spellcheck/locales/en-US/hunspell/dictionary-sources``.
 
-#. Download the latest version of the dictionary from `SCOWL`_ homepage or
+#. Download the latest version of the dictionary from the `SCOWL`_ homepage or
    `SourceForce`_ as a tarball (tag.gz) and unpack it in the working directory.
    Rename the resulting folder from ``scowl-YYYY.MM.DD`` to ``scowl``.
 #. Run the script ``sh make-new-dict.sh`` to generate a new dictionary and make
-   sure it runs without any errors. For more details on this script, see the
+   sure it runs without errors. For more details on this script, see the
    `make-new-dict.sh`_ section.
 #. Do a sanity check on the resulting dictionary file ``en_US-mozilla.dic``. For
    example, make sure that the size is about the same as the original dictionary
@@ -72,16 +115,6 @@ The working directory for this process is
 Info about the file structure
 =============================
 
-mozilla-exclusions.txt
-----------------------
-
-``mozilla-exclusions.txt`` is used to explicitly exclude some words from
-suggestions. The ``make-new-dict.sh`` script will add them to the dictionary file
-with the ``/!`` flag.
-
-Terms should be added to this file with exactly the same format used in the .dic
-file, including affix rules if available.
-
 mozilla-specific.txt
 --------------------
 
@@ -153,6 +186,6 @@ The script:
 
 
 .. _SCOWL: http://wordlist.aspell.net
-.. _file a bug: https://bugzilla.mozilla.org/enter_bug.cgi?product=Core&component=Spelling%20checker
+.. _file a bug: https://bugzilla.mozilla.org/enter_bug.cgi?product=Core&component=Spelling%20Checker%3A%20en-US%20Dictionary
 .. _SourceForce: https://sourceforge.net/projects/wordlist/files/SCOWL/
 .. _bug 237921: https://bugzilla.mozilla.org/show_bug.cgi?id=237921
diff --git a/extensions/spellcheck/locales/en-US/hunspell/dictionary-sources/edit-dictionary.sh b/extensions/spellcheck/locales/en-US/hunspell/dictionary-sources/edit-dictionary.sh
index 7687f8e72789..e72654e84db2 100755
--- a/extensions/spellcheck/locales/en-US/hunspell/dictionary-sources/edit-dictionary.sh
+++ b/extensions/spellcheck/locales/en-US/hunspell/dictionary-sources/edit-dictionary.sh
@@ -6,11 +6,41 @@
 
 set -e
 
+WKDIR="`pwd`"
+SPELLER="$WKDIR/scowl/speller"
+
+munch() {
+  $SPELLER/munch-list munch $1 | sort -u
+}
+
+expand() {
+  grep -v '^[0-9]\+$' | $SPELLER/munch-list expand $1 | sort -u
+}
+
+if [ ! -d "$SPELLER" ]; then
+  echo "The 'scowl' folder is missing. Check the documentation at"
+  echo "https://firefox-source-docs.mozilla.org/extensions/spellcheck/index.html"
+  exit 1
+fi
+
 if [ -z "$EDITOR" ]; then
   echo 'Need to set the $EDITOR environment variable to your favorite editor.'
   exit 1
 fi
 
+# Open the editor and allow the user to type or paste words
+echo "Editor is going to open, you can add the list of words. Quit the editor to finish editing."
+echo "Press Enter to begin."
+read foo
+$EDITOR temp-list.txt
+
+if [ ! -f temp-list.txt ]; then
+  echo "The content of the editor hasn't been saved."
+  exit 1
+fi
+# Remove empty lines
+sed -i "" "/^$/d" temp-list.txt
+
 # Copy the current en-US dictionary and strip the first line that contains
 # the count.
 tail -n +2 ../en-US.dic > en-US.stripped
@@ -19,16 +49,44 @@ tail -n +2 ../en-US.dic > en-US.stripped
 iconv -f iso-8859-1 -t utf-8 en-US.stripped > en-US.utf8
 rm en-US.stripped
 
-# Open the hunspell dictionary and let the user edit it
-echo "Now the dictionary is going to be opened for you to edit. Quit the editor to finish editing."
-echo "Press Enter to begin."
-read foo
-$EDITOR en-US.utf8
+# Save to a temporary file words excluded from suggestions, and numerals,
+# since the munched result is different for both.
+grep '!$' < utf8/en-US-utf8.dic > en-US-nosug.txt
+grep '^[0-9][a-z/]' < utf8/en-US-utf8.dic > en-US-numerals.txt
+
+# Expand the dictionary to a word list
+expand ../en-US.aff < en-US.utf8 > en-US-wordlist.txt
+rm en-US.utf8
+
+# Add the new words
+cat temp-list.txt >> en-US-wordlist.txt
+rm temp-list.txt
+
+# Remove numerals from the expanded wordlist
+grep -v '^[0-9]' < en-US-wordlist.txt > en-US-wordlist-nonum.txt
+rm en-US-wordlist.txt
+
+# Run the wordlist through the munch script, to compress the dictionary where
+# possible (using affix rules).
+munch ../en-US.aff < en-US-wordlist-nonum.txt > en-US-munched.dic
+rm en-US-wordlist-nonum.txt
+
+# Remove words that should not be suggested
+while IFS='/' read -ra line
+do
+  sed -E -i "" "\:^$line($|/.*):d" en-US-munched.dic
+done < "en-US-nosug.txt"
+
+# Add back suggestion exclusions and numerals from the original .dic file
+cat en-US-nosug.txt >> en-US-munched.dic
+cat en-US-numerals.txt >> en-US-munched.dic
+rm en-US-nosug.txt
+rm en-US-numerals.txt
 
 # Add back the line count and sort the lines
-wc -l < en-US.utf8 | tr -d '[:blank:]' > en-US.dic
-LC_ALL=C sort en-US.utf8 >> en-US.dic
-rm -f en-US.utf8
+wc -l < en-US-munched.dic | tr -d '[:blank:]' > en-US.dic
+LC_ALL=C sort en-US-munched.dic >> en-US.dic
+rm -f en-US-munched.dic
 
 # Convert back to ISO-8859-1
 iconv -f utf-8 -t iso-8859-1 en-US.dic > ../en-US.dic
diff --git a/extensions/spellcheck/locales/en-US/hunspell/dictionary-sources/install-new-dict.sh b/extensions/spellcheck/locales/en-US/hunspell/dictionary-sources/install-new-dict.sh
index 26ce06dec269..9e2f37a16f25 100755
--- a/extensions/spellcheck/locales/en-US/hunspell/dictionary-sources/install-new-dict.sh
+++ b/extensions/spellcheck/locales/en-US/hunspell/dictionary-sources/install-new-dict.sh
@@ -10,14 +10,19 @@ set -e
 
 WKDIR="`pwd`"
 
-export SCOWL="$WKDIR/scowl/"
-SUPPORT_DIR="$WKDIR/support_files/"
-SPELLER="$SCOWL/speller"
+SUPPORT_DIR="$WKDIR/support_files"
+SPELLER="$WKDIR/scowl/speller"
 
-if [ -e "$SUPPORT_DIR/orig-bk" ]; then
-  echo "$0: directory '$SUPPORT_DIR/orig-bk' exists." 1>&2
-  exit 0
-fi
+# Stop if backup folders already exist, because it means that this script
+# has already been run once.
+FOLDERS=( "orig-bk" "mozilla-bk")
+for f in ${FOLDERS[@]}; do
+  if [ -d "$SUPPORT_DIR/$f" ]; then
+    echo "Backup folder already present: $f"
+    echo "Run make-new-dict.sh before running this script."
+    exit 1
+  fi
+done
 
 mv orig "$SUPPORT_DIR/orig-bk"
 mkdir orig
@@ -26,12 +31,13 @@ cp $SPELLER/en_US-custom.dic $SPELLER/en_US-custom.aff $SPELLER/README_en_US-cus
 mkdir "$SUPPORT_DIR/mozilla-bk"
 mv ../en-US.dic ../en-US.aff ../README_en_US.txt "$SUPPORT_DIR/mozilla-bk"
 
-# Convert the affix file to ISO-8859-1
+# The affix file is ISO-8859-1, but still need to change the character set to
+# ISO-8859-1 and remove conversion rules.
 cp en_US-mozilla.aff utf8/en-US-utf8.aff
 sed -i "" -e '/^ICONV/d' -e 's/^SET UTF-8$/SET ISO8859-1/' en_US-mozilla.aff
 
 # Convert the dictionary to ISO-8859-1
-mv en_US-mozilla.dic utf8/en-US-utf8.dic
+cp en_US-mozilla.dic utf8/en-US-utf8.dic
 iconv -f utf-8 -t iso-8859-1 < utf8/en-US-utf8.dic > en_US-mozilla.dic
 
 cp en_US-mozilla.aff ../en-US.aff
diff --git a/extensions/spellcheck/locales/en-US/hunspell/dictionary-sources/make-new-dict.sh b/extensions/spellcheck/locales/en-US/hunspell/dictionary-sources/make-new-dict.sh
index 1a190e8c5b10..de06dfe3d6d2 100755
--- a/extensions/spellcheck/locales/en-US/hunspell/dictionary-sources/make-new-dict.sh
+++ b/extensions/spellcheck/locales/en-US/hunspell/dictionary-sources/make-new-dict.sh
@@ -21,17 +21,23 @@ export LC_CTYPE=C
 export LC_COLLATE=C
 
 WKDIR="`pwd`"
+ORIG="$WKDIR/orig"
+SUPPORT_DIR="$WKDIR/support_files"
+SPELLER="$WKDIR/scowl/speller"
 
+# This is required by scowl scripts
 export SCOWL="$WKDIR/scowl/"
-ORIG="$WKDIR/orig/"
-SUPPORT_DIR="$WKDIR/support_files/"
-SPELLER="$SCOWL/speller"
-
 
 expand() {
   grep -v '^[0-9]\+$' | $SPELLER/munch-list expand $1 | sort -u
 }
 
+if [ ! -d "$SPELLER" ]; then
+  echo "The 'scowl' folder is missing. Check the documentation at"
+  echo "https://firefox-source-docs.mozilla.org/extensions/spellcheck/index.html"
+  exit 1
+fi
+
 mkdir -p $SUPPORT_DIR
 cd $SPELLER
 MK_LIST="../mk-list -v1 --accents=both en_US 60"
@@ -49,8 +55,16 @@ expand $SPELLER/en.aff < $SPELLER/en.dic.supp > $SUPPORT_DIR/0-special.txt
 # Input is UTF-8, expand expects ISO-8859-1 so use iconv
 iconv -f utf-8 -t iso-8859-1 $ORIG/en_US-custom.dic | expand $ORIG/en_US-custom.aff > $SUPPORT_DIR/1-base.txt
 
-# The existing Mozilla dictionary is already in ISO-8859-1
-expand ../en-US.aff < ../en-US.dic > $SUPPORT_DIR/2-mozilla.txt
+# Store suggestion exclusions (ending with !) defined in current Mozilla dictionary.
+# Save both the compressed (munched) and expanded version.
+grep '!$' ../en-US.dic > $SUPPORT_DIR/2-mozilla-nosug-munched.txt
+expand ../en-US.aff < $SUPPORT_DIR/2-mozilla-nosug-munched.txt > $SUPPORT_DIR/2-mozilla-nosug.txt
+
+# Remove suggestion exclusions and expand the existing Mozilla dictionary.
+# The existing Mozilla dictionary is already in ISO-8859-1.
+grep -v '!$' < ../en-US.dic > $SUPPORT_DIR/en-US-nosug.dic
+expand ../en-US.aff < $SUPPORT_DIR/en-US-nosug.dic > $SUPPORT_DIR/2-mozilla.txt
+rm $SUPPORT_DIR/en-US-nosug.dic
 
 # Input is UTF-8, expand expects ISO-8859-1 so use iconv
 iconv -f utf-8 -t iso-8859-1 $SPELLER/en_US-custom.dic | expand $SPELLER/en_US-custom.aff > $SUPPORT_DIR/3-upstream.txt
@@ -72,16 +86,13 @@ comm -23 $SUPPORT_DIR/3-upstream.txt $SUPPORT_DIR/2-mozilla-removed.txt | cat -
 # Note: the output of make-hunspell-dict is UTF-8
 cat $SUPPORT_DIR/4-patched.txt | comm -23 - $SUPPORT_DIR/0-special.txt | $SPELLER/make-hunspell-dict -one en_US-mozilla /dev/null
 
-# Exclude specific words from suggestions
-while IFS= read -r line
-do
-  # If the string already contains an affix, just add !, otherwise add /!
-  if [[ "$line" == *"/"* ]]; then
-    sed -i "" "s|^$line$|$line!|" en_US-mozilla.dic
-  else
-    sed -i "" "s|^$line$|$line/!|" en_US-mozilla.dic
-  fi
-done < "mozilla-exclusions.txt"
+# Add back Mozilla suggestion exclusions. Need to convert the file from
+# ISO-8859-1 to UTF-8 first, then add back the line count and reorder.
+tail -n +2 en_US-mozilla.dic > en_US-mozilla-complete.dic
+iconv -f iso-8859-1 -t utf-8 $SUPPORT_DIR/2-mozilla-nosug-munched.txt >> en_US-mozilla-complete.dic
+wc -l < en_US-mozilla-complete.dic | tr -d '[:blank:]' > en_US-mozilla.dic
+LC_ALL=C sort en_US-mozilla-complete.dic >> en_US-mozilla.dic
+rm -f en_US-mozilla-complete.dic
 
 # Sanity check should yield identical results
 #comm -23 $SUPPORT_DIR/1-base.txt $SUPPORT_DIR/3-upstream.txt > $SUPPORT_DIR/3-upstream-remover.txt
@@ -91,12 +102,27 @@ done < "mozilla-exclusions.txt"
 expand ../en-US.aff < mozilla-specific.txt > 5-mozilla-specific.txt
 
 # Update Mozilla removed and added wordlists based on the new upstream
-# dictionary, save them as UTF-8 and not ISO-8951-1
-comm -12 $SUPPORT_DIR/3-upstream.txt $SUPPORT_DIR/2-mozilla-removed.txt > $SUPPORT_DIR/5-mozilla-removed.txt
+# dictionary, save them as UTF-8 and not ISO-8951-1.
+# Ignore words excluded from suggestions for both files.
+comm -12 $SUPPORT_DIR/3-upstream.txt $SUPPORT_DIR/2-mozilla-removed.txt > $SUPPORT_DIR/5-mozilla-removed-tmp.txt
+comm -23 $SUPPORT_DIR/5-mozilla-removed-tmp.txt $SUPPORT_DIR/2-mozilla-nosug.txt > $SUPPORT_DIR/5-mozilla-removed.txt
+rm $SUPPORT_DIR/5-mozilla-removed-tmp.txt
 iconv -f iso-8859-1 -t utf-8 $SUPPORT_DIR/5-mozilla-removed.txt > 5-mozilla-removed.txt
-comm -13 $SUPPORT_DIR/3-upstream.txt $SUPPORT_DIR/2-mozilla-added.txt > $SUPPORT_DIR/5-mozilla-added.txt
+
+comm -13 $SUPPORT_DIR/3-upstream.txt $SUPPORT_DIR/2-mozilla-added.txt > $SUPPORT_DIR/5-mozilla-added-tmp.txt
+comm -23 $SUPPORT_DIR/5-mozilla-added-tmp.txt $SUPPORT_DIR/2-mozilla-nosug.txt > $SUPPORT_DIR/5-mozilla-added.txt
+rm $SUPPORT_DIR/5-mozilla-added-tmp.txt
 iconv -f iso-8859-1 -t utf-8 $SUPPORT_DIR/5-mozilla-added.txt > 5-mozilla-added.txt
 
 # Clean up some files
 rm hunspell-en_US-mozilla.zip
 rm nosug
+
+# Remove backup folders in preparation for the install-new-dict script
+FOLDERS=( "orig-bk" "mozilla-bk")
+for f in ${FOLDERS[@]}; do
+  if [ -d "$SUPPORT_DIR/$f" ]; then
+    echo "Removing backup folder $f"
+    rm -rf "$SUPPORT_DIR/$f"
+  fi
+done
diff --git a/extensions/spellcheck/locales/en-US/hunspell/dictionary-sources/mozilla-exclusions.txt b/extensions/spellcheck/locales/en-US/hunspell/dictionary-sources/mozilla-exclusions.txt
deleted file mode 100644
index e69de29bb2d1..000000000000
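
For reference, the ``!`` flag rule described in the "Exclude words from suggestions" documentation added by this patch can be sketched as a small shell helper. This is only an illustration under stated assumptions: ``mark_nosug`` is a hypothetical name, not a function from this patch or from SCOWL, and it assumes each argument is a single ``.dic`` entry that does not already end with ``!``.

```shell
# Hypothetical helper (not part of the patch): append the "!" flag to one
# .dic entry, adding the "/" separator when the entry has no flags yet.
mark_nosug() {
  case "$1" in
    */*) printf '%s!\n' "$1" ;;   # entry already has flags, e.g. bum/MS
    *)   printf '%s/!\n' "$1" ;;  # bare word, e.g. bum
  esac
}

mark_nosug "bum"     # prints bum/!
mark_nosug "bum/MS"  # prints bum/MS!
```

The sketch only formats a single entry; editing ``en-US.dic`` itself still needs the ISO-8859-1 handling that the documentation warns about.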