Bug 1808224 - Improve scripts and documentation to edit the en-US dictionary, r=RyanVM

## edit-dictionary.sh

Instead of editing the .dic file directly, allow the user to provide a list of
words. The script expands the existing .dic file, adds the new words, and
compresses it again using the affix rules.

Numerals at the beginning of the file and "no suggestion" words need to be
special-cased, since compressing the word list creates different results.
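
As a rough illustration of that special-casing, the sketch below separates the two kinds of entries from a small word list, using the same `grep` patterns that appear in the script. The sample entries and the temporary file name are made up for the example; real .dic content will differ.

```shell
# Hypothetical sample entries (not taken from the real dictionary).
workdir=$(mktemp -d)
printf '%s\n' '0/nm' '2nd' 'apple/MS' 'bum/MS!' 'zebra' > "$workdir/sample.dic"

# Entries excluded from suggestions end with "!".
nosug=$(grep '!$' "$workdir/sample.dic")
# Numeral entries start with a digit followed by a letter or a slash.
numerals=$(grep '^[0-9][a-z/]' "$workdir/sample.dic")
# Everything else can be expanded and munched normally.
rest=$(grep -v -e '!$' -e '^[0-9]' "$workdir/sample.dic")

printf 'nosug:\n%s\nnumerals:\n%s\nrest:\n%s\n' "$nosug" "$numerals" "$rest"
rm -rf "$workdir"
```

Only the `rest` group would go through the expand and munch cycle; the other two would be appended back unchanged afterwards.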

## make-new-dict.sh

Extract suggestion exclusions from the existing Mozilla dictionary, then
add them back to the dictionary generated by SCOWL. This removes the need
to maintain an external list of exclusions (mozilla-exclusions.txt).
It also makes it possible to exclude these offensive words from both lists of
words added and removed by Mozilla.

Also:
- Break if the scowl folder is missing.
- Remove backup folders to make sure the install script can't be run twice.

## install-new-dict.sh

Break if the scowl folder is missing.

Differential Revision: https://phabricator.services.mozilla.com/D165883
Francesco Lodolo (:flod) 2023-01-05 15:23:58 +00:00
Parent 90c1a75237
Commit 60852a7e54
5 changed files: 181 additions and 58 deletions


@@ -13,39 +13,82 @@ For more information about Hunspell or the affix file format, you can check
Requesting to add new words to the en-US dictionary
===================================================
If you'd like to add new words to the dictionary, you can `file a bug`_. Try to
provide information on the terms you want to add, in particular references to
external sources that confirm the usage of the term.
If you'd like to add new words to the dictionary, you can `file a bug`_:
* Try to provide information on the terms you want to add, in particular
references to external sources that confirm the usage of the term (e.g.
Merriam-Webster or Oxford online dictionaries).
* Include all possible forms, e.g. plural for nouns, different tenses for verbs.
Adding new words to the en-US dictionary
========================================
This section describes the process for adding a word to the dictionary:
This section describes the process for adding new words to the dictionary:
#. Get a clone of mozilla-central (see :ref:`Firefox Contributors' Quick
Reference`), if you don't already have one, and make sure you can build it
successfully.
#. Get into the dictionary sources directory using this command:
``cd extensions/spellcheck/locales/en-US/hunspell/dictionary-sources``
#. Move into the dictionary sources directory using this command:
``cd extensions/spellcheck/locales/en-US/hunspell/dictionary-sources``.
#. Identify the current version of SCOWL by checking the file
``README_en_US.txt`` (at the beginning of the file there is a line similar to
``Generated from SCOWL Version 2020.12.07``, where ``2020.12.07`` is the
SCOWL version).
#. Download the same version of the dictionary from the `SCOWL`_ homepage or
`SourceForce`_ as a tarball (tar.gz) and unpack it in the working directory.
Rename the resulting folder from ``scowl-YYYY.MM.DD`` to ``scowl``.
#. There's a special script used for editing dictionaries. The script
only works if you have the environment variable ``EDITOR`` set to the
executable of an editor program; if you don't have it set, you can use
``EDITOR=vim sh edit-dictionary.sh`` to edit using ``vim`` (or you can
substitute it with another editor), or you can just type
``sh edit-dictionary.sh`` if you have an ``EDITOR`` already specified.
#. Add and remove words in the dictionary file, then quit the editor.
Copy and paste the full list of words, then save and quit the editor. It's
not necessary to put the words in alphabetical order, as it will be corrected
by the script.
#. Run the script ``sh make-new-dict.sh`` to generate a new dictionary and make
sure it runs without errors. For more details on this script, see the
`make-new-dict.sh`_ section.
#. Do a sanity check on the resulting dictionary file ``en_US-mozilla.dic``. For
example, make sure that the size is about the same as the original dictionary
(or slightly larger).
#. If everything looks correct, use ``sh install-new-dict.sh`` to copy the
generated files into place.
#. Build Firefox and test your updated dictionary. Once you're
satisfied, use the process described in :ref:`write_a_patch` to create a
patch.
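For the step that identifies the SCOWL version, a small helper along these lines could read it from the README. This is only a sketch: the function name ``scowl_version`` is made up, and it assumes the README contains a line of the form ``Generated from SCOWL Version YYYY.MM.DD`` as described above.

```shell
# Print the SCOWL version recorded in a README file.
# Assumes a line like "Generated from SCOWL Version 2020.12.07".
scowl_version() {
  sed -n 's/.*Generated from SCOWL Version \([0-9][0-9.]*\).*/\1/p' "$1" | head -n 1
}

# Hypothetical README content, used only to demonstrate the helper.
readme=$(mktemp)
printf '%s\n' 'en_US Hunspell Dictionary' \
  'Generated from SCOWL Version 2020.12.07' > "$readme"

version=$(scowl_version "$readme")
echo "Folder to download and rename: scowl-$version"
rm -f "$readme"
```

In the working directory, the same helper could be pointed at ``README_en_US.txt`` to find out which tarball to download.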
Note that the update script will modify 2 files, and both need to be committed:
Note that the update script will modify 2 versions of the dictionary, and both
need to be committed:
* ``en-US.dic``: the dictionary actually shipping in the build and uses
* ``en-US.dic``: the dictionary actually shipping in the build; it uses
ISO-8859-1 encoding.
* ``utf8/en-US.dic``: a version of the same dictionary with UTF-8 encoding. This
is used to work around issues with Phabricator, and it allows the diff to
display the actual changes.
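Before committing, it can be useful to confirm that the two versions really contain the same entries. The following is a minimal sketch, not part of the scripts: the helper name ``dics_in_sync`` and the sample files are hypothetical, and it assumes ``iconv`` and ``cmp`` are available.

```shell
# Return 0 when the ISO-8859-1 file, converted to UTF-8, is byte-identical
# to the UTF-8 file.
dics_in_sync() {
  iconv -f iso-8859-1 -t utf-8 "$1" | cmp -s - "$2"
}

# Hypothetical sample files standing in for the two dictionary versions.
tmp=$(mktemp -d)
printf '2\ncaf\351\nzebra\n' > "$tmp/iso.dic"        # \351 is "é" in ISO-8859-1
printf '2\ncaf\303\251\nzebra\n' > "$tmp/utf8.dic"   # \303\251 is "é" in UTF-8

if dics_in_sync "$tmp/iso.dic" "$tmp/utf8.dic"; then
  sync_status="in sync"
else
  sync_status="out of sync"
fi
echo "$sync_status"
rm -rf "$tmp"
```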
Exclude words from suggestions
==============================
It's possible to completely exclude words from suggested alternatives by adding
an affix rule ``!`` at the end of the definition in the ``.dic`` file. For
example:
* ``bum`` would be changed to ``bum/!`` (note the additional forward slash).
* ``bum/MS`` would be changed to ``bum/MS!``.
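The two cases above can be sketched as a small shell function. The name ``add_nosug_flag`` is hypothetical and the function is only an illustration of the rule, not something the scripts provide.

```shell
# Append the "!" flag to a single .dic entry.
# - Entries that already end with "!" after a slash are left untouched.
# - Entries with affix flags (containing "/") get "!" appended.
# - Plain entries get "/!" appended.
add_nosug_flag() {
  case "$1" in
    */*!) printf '%s\n' "$1" ;;
    */*)  printf '%s!\n' "$1" ;;
    *)    printf '%s/!\n' "$1" ;;
  esac
}

add_nosug_flag 'bum'       # plain entry
add_nosug_flag 'bum/MS'    # entry with affix flags
```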
In order to exclude a word from suggestions, follow the instructions available
in `Adding new words to the en-US dictionary`_. Instead of running the
``edit-dictionary.sh`` script (point 5), use a text editor to edit the file
``en-US.dic`` directly, then proceed with the remaining instructions.
.. warning::
Make sure to open ``en-US.dic`` with the correct encoding. For example, Visual
Studio Code will try to open it as ``UTF-8``, and it needs to be reopened with
encoding ``Western (ISO 8859-1)``.
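One way to check which encoding a file needs is to test whether its bytes form valid UTF-8. The sketch below assumes an ``iconv`` implementation that exits with an error on invalid input (GNU and BSD versions do); the helper name ``is_valid_utf8`` and the sample file are hypothetical. A file that contains only ASCII characters passes this check under either encoding.

```shell
# Return 0 when the file decodes cleanly as UTF-8.
is_valid_utf8() {
  iconv -f utf-8 -t utf-8 "$1" > /dev/null 2>&1
}

# Hypothetical sample: "café" encoded as ISO-8859-1 (byte \351 for "é").
sample=$(mktemp)
printf 'caf\351\n' > "$sample"

if is_valid_utf8 "$sample"; then
  guess="UTF-8"
else
  guess="not UTF-8 (try ISO-8859-1)"
fi
echo "$guess"
rm -f "$sample"
```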
Upgrading dictionary to a new upstream version of SCOWL
=======================================================
@@ -56,11 +99,11 @@ used to generate the files for the en-US dictionary.
The working directory for this process is
``extensions/spellcheck/locales/en-US/hunspell/dictionary-sources``.
#. Download the latest version of the dictionary from `SCOWL`_ homepage or
#. Download the latest version of the dictionary from the `SCOWL`_ homepage or
`SourceForce`_ as a tarball (tar.gz) and unpack it in the working directory.
Rename the resulting folder from ``scowl-YYYY.MM.DD`` to ``scowl``.
#. Run the script ``sh make-new-dict.sh`` to generate a new dictionary and make
sure it runs without any errors. For more details on this script, see the
sure it runs without errors. For more details on this script, see the
`make-new-dict.sh`_ section.
#. Do a sanity check on the resulting dictionary file ``en_US-mozilla.dic``. For
example, make sure that the size is about the same as the original dictionary
@@ -72,16 +115,6 @@ The working directory for this process is
Info about the file structure
=============================
mozilla-exclusions.txt
----------------------
``mozilla-exclusions.txt`` is used to explicitly exclude some words from
suggestions. The ``make-new-dict.sh`` script will add them to the dictionary file
with the ``/!`` flag.
Terms should be added to this file with exactly the same format used in the .dic
file, including affix rules if available.
mozilla-specific.txt
--------------------
@@ -153,6 +186,6 @@ The script:
.. _SCOWL: http://wordlist.aspell.net
.. _file a bug: https://bugzilla.mozilla.org/enter_bug.cgi?product=Core&component=Spelling%20checker
.. _file a bug: https://bugzilla.mozilla.org/enter_bug.cgi?product=Core&component=Spelling%20Checker%3A%20en-US%20Dictionary
.. _SourceForce: https://sourceforge.net/projects/wordlist/files/SCOWL/
.. _bug 237921: https://bugzilla.mozilla.org/show_bug.cgi?id=237921


@@ -6,11 +6,41 @@
set -e
WKDIR="`pwd`"
SPELLER="$WKDIR/scowl/speller"
munch() {
$SPELLER/munch-list munch $1 | sort -u
}
expand() {
grep -v '^[0-9]\+$' | $SPELLER/munch-list expand $1 | sort -u
}
if [ ! -d "$SPELLER" ]; then
echo "The 'scowl' folder is missing. Check the documentation at"
echo "https://firefox-source-docs.mozilla.org/extensions/spellcheck/index.html"
exit 1
fi
if [ -z "$EDITOR" ]; then
echo 'Need to set the $EDITOR environment variable to your favorite editor.'
exit 1
fi
# Open the editor and allow the user to type or paste words
echo "Editor is going to open, you can add the list of words. Quit the editor to finish editing."
echo "Press Enter to begin."
read foo
$EDITOR temp-list.txt
if [ ! -f temp-list.txt ]; then
echo "The content of the editor hasn't been saved."
exit 1
fi
# Remove empty lines
sed -i "" "/^$/d" temp-list.txt
# Copy the current en-US dictionary and strip the first line that contains
# the count.
tail -n +2 ../en-US.dic > en-US.stripped
@@ -19,16 +49,44 @@ tail -n +2 ../en-US.dic > en-US.stripped
iconv -f iso-8859-1 -t utf-8 en-US.stripped > en-US.utf8
rm en-US.stripped
# Open the hunspell dictionary and let the user edit it
echo "Now the dictionary is going to be opened for you to edit. Quit the editor to finish editing."
echo "Press Enter to begin."
read foo
$EDITOR en-US.utf8
# Save to a temporary file words excluded from suggestions, and numerals,
# since the munched result is different for both.
grep '!$' < utf8/en-US-utf8.dic > en-US-nosug.txt
grep '^[0-9][a-z/]' < utf8/en-US-utf8.dic > en-US-numerals.txt
# Expand the dictionary to a word list
expand ../en-US.aff < en-US.utf8 > en-US-wordlist.txt
rm en-US.utf8
# Add the new words
cat temp-list.txt >> en-US-wordlist.txt
rm temp-list.txt
# Remove numerals from the expanded wordlist
grep -v '^[0-9]' < en-US-wordlist.txt > en-US-wordlist-nonum.txt
rm en-US-wordlist.txt
# Run the wordlist through the munch script, to compress the dictionary where
# possible (using affix rules).
munch ../en-US.aff < en-US-wordlist-nonum.txt > en-US-munched.dic
rm en-US-wordlist-nonum.txt
# Remove words that should not be suggested
while IFS='/' read -ra line
do
sed -E -i "" "\:^$line($|/.*):d" en-US-munched.dic
done < "en-US-nosug.txt"
# Add back suggestion exclusions and numerals from the original .dic file
cat en-US-nosug.txt >> en-US-munched.dic
cat en-US-numerals.txt >> en-US-munched.dic
rm en-US-nosug.txt
rm en-US-numerals.txt
# Add back the line count and sort the lines
wc -l < en-US.utf8 | tr -d '[:blank:]' > en-US.dic
LC_ALL=C sort en-US.utf8 >> en-US.dic
rm -f en-US.utf8
wc -l < en-US-munched.dic | tr -d '[:blank:]' > en-US.dic
LC_ALL=C sort en-US-munched.dic >> en-US.dic
rm -f en-US-munched.dic
# Convert back to ISO-8859-1
iconv -f utf-8 -t iso-8859-1 en-US.dic > ../en-US.dic
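
Since the script writes the number of entries as the first line of the dictionary, one possible sanity check after running it is to compare that number with the actual number of entries. This is a sketch only: the helper name `dic_count_matches` and the sample file are hypothetical.

```shell
# Return 0 when the count on the first line of a .dic file equals the
# number of lines that follow it.
dic_count_matches() {
  declared=$(head -n 1 "$1")
  actual=$(tail -n +2 "$1" | wc -l | tr -d '[:blank:]')
  [ "$declared" = "$actual" ]
}

# Hypothetical sample dictionary with a correct header.
sample=$(mktemp)
printf '%s\n' '3' 'apple/MS' 'bum/MS!' 'zebra' > "$sample"

if dic_count_matches "$sample"; then
  count_status="count matches"
else
  count_status="count mismatch"
fi
echo "$count_status"
```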


@@ -10,14 +10,19 @@
set -e
WKDIR="`pwd`"
export SCOWL="$WKDIR/scowl/"
SUPPORT_DIR="$WKDIR/support_files/"
SPELLER="$SCOWL/speller"
SUPPORT_DIR="$WKDIR/support_files"
SPELLER="$WKDIR/scowl/speller"
if [ -e "$SUPPORT_DIR/orig-bk" ]; then
echo "$0: directory '$SUPPORT_DIR/orig-bk' exists." 1>&2
exit 0
fi
# Stop if backup folders already exist, because it means that this script
# has already been run once.
FOLDERS=( "orig-bk" "mozilla-bk")
for f in ${FOLDERS[@]}; do
if [ -d "$SUPPORT_DIR/$f" ]; then
echo "Backup folder already present: $f"
echo "Run make-new-dict.sh before running this script."
exit 1
fi
done
mv orig "$SUPPORT_DIR/orig-bk"
mkdir orig
@@ -26,12 +31,13 @@ cp $SPELLER/en_US-custom.dic $SPELLER/en_US-custom.aff $SPELLER/README_en_US-cus
mkdir "$SUPPORT_DIR/mozilla-bk"
mv ../en-US.dic ../en-US.aff ../README_en_US.txt "$SUPPORT_DIR/mozilla-bk"
# Convert the affix file to ISO-8859-1
# The content of the affix file is already ISO-8859-1 compatible, but the
# declared character set (SET) still needs to change from UTF-8 to ISO8859-1,
# and the ICONV conversion rules need to be removed.
cp en_US-mozilla.aff utf8/en-US-utf8.aff
sed -i "" -e '/^ICONV/d' -e 's/^SET UTF-8$/SET ISO8859-1/' en_US-mozilla.aff
# Convert the dictionary to ISO-8859-1
mv en_US-mozilla.dic utf8/en-US-utf8.dic
cp en_US-mozilla.dic utf8/en-US-utf8.dic
iconv -f utf-8 -t iso-8859-1 < utf8/en-US-utf8.dic > en_US-mozilla.dic
cp en_US-mozilla.aff ../en-US.aff


@@ -21,17 +21,23 @@ export LC_CTYPE=C
export LC_COLLATE=C
WKDIR="`pwd`"
ORIG="$WKDIR/orig"
SUPPORT_DIR="$WKDIR/support_files"
SPELLER="$WKDIR/scowl/speller"
# This is required by scowl scripts
export SCOWL="$WKDIR/scowl/"
ORIG="$WKDIR/orig/"
SUPPORT_DIR="$WKDIR/support_files/"
SPELLER="$SCOWL/speller"
expand() {
grep -v '^[0-9]\+$' | $SPELLER/munch-list expand $1 | sort -u
}
if [ ! -d "$SPELLER" ]; then
echo "The 'scowl' folder is missing. Check the documentation at"
echo "https://firefox-source-docs.mozilla.org/extensions/spellcheck/index.html"
exit 1
fi
mkdir -p $SUPPORT_DIR
cd $SPELLER
MK_LIST="../mk-list -v1 --accents=both en_US 60"
@@ -49,8 +55,16 @@ expand $SPELLER/en.aff < $SPELLER/en.dic.supp > $SUPPORT_DIR/0-special.txt
# Input is UTF-8, expand expects ISO-8859-1 so use iconv
iconv -f utf-8 -t iso-8859-1 $ORIG/en_US-custom.dic | expand $ORIG/en_US-custom.aff > $SUPPORT_DIR/1-base.txt
# The existing Mozilla dictionary is already in ISO-8859-1
expand ../en-US.aff < ../en-US.dic > $SUPPORT_DIR/2-mozilla.txt
# Store suggestion exclusions (ending with !) defined in current Mozilla dictionary.
# Save both the compressed (munched) and expanded version.
grep '!$' ../en-US.dic > $SUPPORT_DIR/2-mozilla-nosug-munched.txt
expand ../en-US.aff < $SUPPORT_DIR/2-mozilla-nosug-munched.txt > $SUPPORT_DIR/2-mozilla-nosug.txt
# Remove suggestion exclusions and expand the existing Mozilla dictionary.
# The existing Mozilla dictionary is already in ISO-8859-1.
grep -v '!$' < ../en-US.dic > $SUPPORT_DIR/en-US-nosug.dic
expand ../en-US.aff < $SUPPORT_DIR/en-US-nosug.dic > $SUPPORT_DIR/2-mozilla.txt
rm $SUPPORT_DIR/en-US-nosug.dic
# Input is UTF-8, expand expects ISO-8859-1 so use iconv
iconv -f utf-8 -t iso-8859-1 $SPELLER/en_US-custom.dic | expand $SPELLER/en_US-custom.aff > $SUPPORT_DIR/3-upstream.txt
@@ -72,16 +86,13 @@ comm -23 $SUPPORT_DIR/3-upstream.txt $SUPPORT_DIR/2-mozilla-removed.txt | cat -
# Note: the output of make-hunspell-dict is UTF-8
cat $SUPPORT_DIR/4-patched.txt | comm -23 - $SUPPORT_DIR/0-special.txt | $SPELLER/make-hunspell-dict -one en_US-mozilla /dev/null
# Exclude specific words from suggestions
while IFS= read -r line
do
# If the string already contains an affix, just add !, otherwise add /!
if [[ "$line" == *"/"* ]]; then
sed -i "" "s|^$line$|$line!|" en_US-mozilla.dic
else
sed -i "" "s|^$line$|$line/!|" en_US-mozilla.dic
fi
done < "mozilla-exclusions.txt"
# Add back Mozilla suggestion exclusions. Need to convert the file from
# ISO-8859-1 to UTF-8 first, then add back the line count and reorder.
tail -n +2 en_US-mozilla.dic > en_US-mozilla-complete.dic
iconv -f iso-8859-1 -t utf-8 $SUPPORT_DIR/2-mozilla-nosug-munched.txt >> en_US-mozilla-complete.dic
wc -l < en_US-mozilla-complete.dic | tr -d '[:blank:]' > en_US-mozilla.dic
LC_ALL=C sort en_US-mozilla-complete.dic >> en_US-mozilla.dic
rm -f en_US-mozilla-complete.dic
# Sanity check should yield identical results
#comm -23 $SUPPORT_DIR/1-base.txt $SUPPORT_DIR/3-upstream.txt > $SUPPORT_DIR/3-upstream-remover.txt
@@ -91,12 +102,27 @@ done < "mozilla-exclusions.txt"
expand ../en-US.aff < mozilla-specific.txt > 5-mozilla-specific.txt
# Update Mozilla removed and added wordlists based on the new upstream
# dictionary, save them as UTF-8 and not ISO-8859-1
comm -12 $SUPPORT_DIR/3-upstream.txt $SUPPORT_DIR/2-mozilla-removed.txt > $SUPPORT_DIR/5-mozilla-removed.txt
# dictionary, save them as UTF-8 and not ISO-8859-1.
# Ignore words excluded from suggestions for both files.
comm -12 $SUPPORT_DIR/3-upstream.txt $SUPPORT_DIR/2-mozilla-removed.txt > $SUPPORT_DIR/5-mozilla-removed-tmp.txt
comm -23 $SUPPORT_DIR/5-mozilla-removed-tmp.txt $SUPPORT_DIR/2-mozilla-nosug.txt > $SUPPORT_DIR/5-mozilla-removed.txt
rm $SUPPORT_DIR/5-mozilla-removed-tmp.txt
iconv -f iso-8859-1 -t utf-8 $SUPPORT_DIR/5-mozilla-removed.txt > 5-mozilla-removed.txt
comm -13 $SUPPORT_DIR/3-upstream.txt $SUPPORT_DIR/2-mozilla-added.txt > $SUPPORT_DIR/5-mozilla-added.txt
comm -13 $SUPPORT_DIR/3-upstream.txt $SUPPORT_DIR/2-mozilla-added.txt > $SUPPORT_DIR/5-mozilla-added-tmp.txt
comm -23 $SUPPORT_DIR/5-mozilla-added-tmp.txt $SUPPORT_DIR/2-mozilla-nosug.txt > $SUPPORT_DIR/5-mozilla-added.txt
rm $SUPPORT_DIR/5-mozilla-added-tmp.txt
iconv -f iso-8859-1 -t utf-8 $SUPPORT_DIR/5-mozilla-added.txt > 5-mozilla-added.txt
# Clean up some files
rm hunspell-en_US-mozilla.zip
rm nosug
# Remove backup folders in preparation for the install-new-dict script
FOLDERS=( "orig-bk" "mozilla-bk")
for f in ${FOLDERS[@]}; do
if [ -d "$SUPPORT_DIR/$f" ]; then
echo "Removing backup folder $f"
rm -rf "$SUPPORT_DIR/$f"
fi
done
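
The `comm` calls used in this script to update the added and removed word lists rely on its three-column output: `-23` keeps lines found only in the first file, `-12` keeps lines common to both, and `-13` keeps lines found only in the second file. The sketch below demonstrates this on two tiny, made-up word lists; the file names are hypothetical and both inputs must be sorted with the same collation.

```shell
tmp=$(mktemp -d)

# Hypothetical word lists; comm requires sorted input.
printf '%s\n' apple banana cherry | LC_ALL=C sort > "$tmp/upstream.txt"
printf '%s\n' banana durian | LC_ALL=C sort > "$tmp/mozilla.txt"

# Lines only in the first file.
only_upstream=$(LC_ALL=C comm -23 "$tmp/upstream.txt" "$tmp/mozilla.txt")
# Lines present in both files.
common=$(LC_ALL=C comm -12 "$tmp/upstream.txt" "$tmp/mozilla.txt")
# Lines only in the second file.
only_mozilla=$(LC_ALL=C comm -13 "$tmp/upstream.txt" "$tmp/mozilla.txt")

printf 'only upstream:\n%s\ncommon:\n%s\nonly mozilla:\n%s\n' \
  "$only_upstream" "$common" "$only_mozilla"
rm -rf "$tmp"
```

Chaining a second `comm -23` against the list of suggestion exclusions, as the script does, then drops any excluded word from the result.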