Mirror of https://github.com/mozilla/gecko-dev.git
Bug 1808224 - Improve scripts and documentation to edit the en-US dictionary, r=RyanVM
## edit-dictionary.sh

Instead of editing the .dic file directly, allow the user to provide a list of words. The script expands the existing .dic file, adds the new words, and compresses it again using the affix rules.

Numerals at the beginning of the file and "no suggestion" words need to be special-cased, since compressing the word list creates different results.

## make-new-dict.sh

Extract suggestion exclusions from the existing Mozilla dictionary, then add them back to the dictionary generated by SCOWL. This removes the need to maintain an external list of exclusions (mozilla-exclusions.txt). It also makes it possible to exclude these offensive words from both lists of words added and removed by Mozilla.

Also:
- Break if the scowl folder is missing.
- Remove backup folders to make sure the install script can't be run twice.

## install-new-dict.sh

Break if the scowl folder is missing.

Differential Revision: https://phabricator.services.mozilla.com/D165883
Parent: 90c1a75237
Commit: 60852a7e54
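The special-casing described in the commit message for edit-dictionary.sh can be sketched in isolation. This is a minimal sketch under stated assumptions, not the script itself: `rebuild_dic` is a hypothetical helper, it takes an already munched list as input so that it does not depend on SCOWL's `munch-list`, and it filters with `awk` rather than the `sed -i ""` loop used in the diff (that spelling of `-i` is the BSD/macOS one).

```shell
#!/bin/sh
# Hypothetical helper, not part of the commit. Rebuilds a .dic file from:
#   $1: the original dictionary body (count line already stripped)
#   $2: a munched word list (in the script, the munch-list output)
# The result is written to standard output.
rebuild_dic() {
  orig="$1"
  munched="$2"
  work=$(mktemp -d)

  # Keep "no suggestion" entries (ending in "!") and numeral entries exactly
  # as they appear in the original file. grep exits non-zero when nothing
  # matches, so "|| true" keeps this safe under "set -e".
  grep '!$' "$orig" > "$work/nosug" || true
  grep '^[0-9][a-z/]' "$orig" > "$work/numerals" || true

  # Drop munched entries whose stem has a no-suggestion entry, plus any line
  # starting with a digit, then append the preserved entries.
  cut -d/ -f1 "$work/nosug" > "$work/stems"
  awk -F/ -v stems="$work/stems" '
    BEGIN { while ((getline stem < stems) > 0) skip[stem] = 1 }
    !($1 in skip) && $0 !~ /^[0-9]/
  ' "$munched" > "$work/body"
  cat "$work/nosug" "$work/numerals" >> "$work/body"

  # A Hunspell .dic file starts with the entry count, followed by the entries.
  wc -l < "$work/body" | tr -d '[:blank:]'
  LC_ALL=C sort "$work/body"
  rm -rf "$work"
}
```

Reading the stems inside `BEGIN` keeps the filter correct even when the original dictionary has no flagged entries, in which case the stem list is simply empty.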
@@ -13,39 +13,82 @@ For more information about Hunspell or the affix file format, you can check
Requesting to add new words to the en-US dictionary
===================================================

If you’d like to add new words to the dictionary, you can `file a bug`_. Try to
provide information on the terms you want to add, in particular references to
external sources that confirm the usage of the term.
If you’d like to add new words to the dictionary, you can `file a bug`_:

* Try to provide information on the terms you want to add, in particular
  references to external sources that confirm the usage of the term (e.g.
  Merriam-Webster or Oxford online dictionaries).
* Include all possible forms, e.g. plural for nouns, different tenses for verbs.

Adding new words to the en-US dictionary
========================================

This section describes the process for adding a word to the dictionary:
This section describes the process for adding new words to the dictionary:

#. Get a clone of mozilla-central (see :ref:`Firefox Contributors' Quick
   Reference`), if you don’t already have one, and make sure you can build it
   successfully.
#. Get into the dictionary sources directory using this command:
   ``cd extensions/spellcheck/locales/en-US/hunspell/dictionary-sources``
#. Move in the dictionary sources directory using this command:
   ``cd extensions/spellcheck/locales/en-US/hunspell/dictionary-sources``.
#. Identify the current version of SCOWL by checking the file
   ``README_en_US.txt`` (at the beginning of the file there is a line similar to
   ``Generated from SCOWL Version 2020.12.07``, where ``2020.12.07`` is the
   SCOWL version).
#. Download the same version of the dictionary from the `SCOWL`_ homepage or
   `SourceForce`_ as a tarball (tar.gz) and unpack it in the working directory.
   Rename the resulting folder from ``scowl-YYYY.MM.DD`` to ``scowl``.
#. There’s a special script used for editing dictionaries. The script
   only works if you have the environment variable ``EDITOR`` set to the
   executable of an editor program; if you don’t have it set, you can use
   ``EDITOR=vim sh edit-dictionary.sh`` to edit using ``vim`` (or you can
   substitute it with another editor), or you can just type
   ``sh edit-dictionary.sh`` if you have an ``EDITOR`` already specified.
#. Add and remove words in the dictionary file, then quit the editor.

   Copy and paste the full list of words, then save and quit the editor. It’s
   not necessary to put the words in alphabetical order, as it will be corrected
   by the script.
#. Run the script ``sh make-new-dict.sh`` to generate a new dictionary and make
   sure it runs without errors. For more details on this script, see the
   `make-new-dict.sh`_ section.
#. Do a sanity check on the resulting dictionary file ``en_US-mozilla.dic``. For
   example, make sure that the size is about the same as the original dictionary
   (or slightly larger).
#. If everything looks correct, use ``sh install-new-dict.sh`` to copy the
   generated file in the right position.
#. Build Firefox and test your updated dictionary. Once you’re
   satisfied, use the process described in :ref:`write_a_patch` to create a
   patch.

Note that the update script will modify 2 files, and both need to be committed:
Note that the update script will modify 2 versions of the dictionary, and both
need to be committed:

* ``en-US.dic``: the dictionary actually shipping in the build and uses
* ``en-US.dic``: the dictionary actually shipping in the build, it uses
  ISO-8859-1 encoding.
* ``utf8/en-US.dic``: a version of the same dictionary with UTF-8 encoding. This
  is used to work around issues with Phabricator, and it allows to display
  actual changes in the diff.

Exclude words from suggestions
==============================

It’s possible to completely exclude words from suggested alternatives by adding
an affix rule ``!`` at the end of the definition in the ``.dic`` file. For
example:

* ``bum`` would be changed to ``bum/!`` (note the additional forward slash).
* ``bum/MS`` would be changed to ``bum/MS!``.

In order to exclude a word from suggestions, follow the instructions available
in `Adding new words to the en-US dictionary`_. Instead of running the
``edit-dictionary.sh`` script (point 5), use a text editor to edit the file
``en-US.dic`` directly, then proceed with the remaining instructions.

.. warning::

   Make sure to open ``en-US.dic`` with the correct encoding. For example, Visual
   Studio Code will try to open it as ``UTF-8``, and it needs to be reopened with
   encoding ``Western (ISO 8859-1)``.

Upgrading dictionary to a new upstream version of SCOWL
=======================================================

@@ -56,11 +99,11 @@ used to generate the files for the en-US dictionary.
The working directory for this process is
``extensions/spellcheck/locales/en-US/hunspell/dictionary-sources``.

#. Download the latest version of the dictionary from `SCOWL`_ homepage or
#. Download the latest version of the dictionary from the `SCOWL`_ homepage or
   `SourceForce`_ as a tarball (tar.gz) and unpack it in the working directory.
   Rename the resulting folder from ``scowl-YYYY.MM.DD`` to ``scowl``.
#. Run the script ``sh make-new-dict.sh`` to generate a new dictionary and make
   sure it runs without any errors. For more details on this script, see the
   sure it runs without errors. For more details on this script, see the
   `make-new-dict.sh`_ section.
#. Do a sanity check on the resulting dictionary file ``en_US-mozilla.dic``. For
   example, make sure that the size is about the same as the original dictionary
@@ -72,16 +115,6 @@ The working directory for this process is
Info about the file structure
=============================

mozilla-exclusions.txt
----------------------

``mozilla-exclusions.txt`` is used to explicitly exclude some words from
suggestions. The ``make-new-dict.sh`` script will add them to the dictionary file
with the ``/!`` flag.

Terms should be added to this file with exactly the same format used in the .dic
file, including affix rules if available.

mozilla-specific.txt
--------------------

@@ -153,6 +186,6 @@ The script:


.. _SCOWL: http://wordlist.aspell.net
.. _file a bug: https://bugzilla.mozilla.org/enter_bug.cgi?product=Core&component=Spelling%20checker
.. _file a bug: https://bugzilla.mozilla.org/enter_bug.cgi?product=Core&component=Spelling%20Checker%3A%20en-US%20Dictionary
.. _SourceForce: https://sourceforge.net/projects/wordlist/files/SCOWL/
.. _bug 237921: https://bugzilla.mozilla.org/show_bug.cgi?id=237921
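The flag rule from the "Exclude words from suggestions" section in the documentation diff above (add ``/!`` to a bare word, append ``!`` when the entry already has affix flags) can be written as a small function. This is only a sketch: `mark_nosug` is a hypothetical name and is not part of the commit.

```shell
#!/bin/sh
# Hypothetical helper: print one .dic entry with the "!" (no suggestion)
# flag added, following the rule described in the documentation.
mark_nosug() {
  case "$1" in
    *'!') printf '%s\n' "$1" ;;    # already flagged: leave unchanged
    */*)  printf '%s!\n' "$1" ;;   # has affix flags: append "!"
    *)    printf '%s/!\n' "$1" ;;  # bare word: add "/!"
  esac
}
```

For example, `mark_nosug bum` prints `bum/!` and `mark_nosug bum/MS` prints `bum/MS!`.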
@@ -6,11 +6,41 @@

set -e

WKDIR="`pwd`"
SPELLER="$WKDIR/scowl/speller"

munch() {
  $SPELLER/munch-list munch $1 | sort -u
}

expand() {
  grep -v '^[0-9]\+$' | $SPELLER/munch-list expand $1 | sort -u
}

if [ ! -d "$SPELLER" ]; then
  echo "The 'scowl' folder is missing. Check the documentation at"
  echo "https://firefox-source-docs.mozilla.org/extensions/spellcheck/index.html"
  exit 1
fi

if [ -z "$EDITOR" ]; then
  echo 'Need to set the $EDITOR environment variable to your favorite editor.'
  exit 1
fi

# Open the editor and allow the user to type or paste words
echo "Editor is going to open, you can add the list of words. Quit the editor to finish editing."
echo "Press Enter to begin."
read foo
$EDITOR temp-list.txt

if [ ! -f temp-list.txt ]; then
  echo "The content of the editor hasn't been saved."
  exit 1
fi
# Remove empty lines
sed -i "" "/^$/d" temp-list.txt

# Copy the current en-US dictionary and strip the first line that contains
# the count.
tail -n +2 ../en-US.dic > en-US.stripped
@@ -19,16 +49,44 @@ tail -n +2 ../en-US.dic > en-US.stripped
iconv -f iso-8859-1 -t utf-8 en-US.stripped > en-US.utf8
rm en-US.stripped

# Open the hunspell dictionary and let the user edit it
echo "Now the dictionary is going to be opened for you to edit. Quit the editor to finish editing."
echo "Press Enter to begin."
read foo
$EDITOR en-US.utf8
# Save to a temporary file words excluded from suggestions, and numerals,
# since the munched result is different for both.
grep '!$' < utf8/en-US-utf8.dic > en-US-nosug.txt
grep '^[0-9][a-z/]' < utf8/en-US-utf8.dic > en-US-numerals.txt

# Expand the dictionary to a word list
expand ../en-US.aff < en-US.utf8 > en-US-wordlist.txt
rm en-US.utf8

# Add the new words
cat temp-list.txt >> en-US-wordlist.txt
rm temp-list.txt

# Remove numerals from the expanded wordlist
grep -v '^[0-9]' < en-US-wordlist.txt > en-US-wordlist-nonum.txt
rm en-US-wordlist.txt

# Run the wordlist through the munch script, to compress the dictionary where
# possible (using affix rules).
munch ../en-US.aff < en-US-wordlist-nonum.txt > en-US-munched.dic
rm en-US-wordlist-nonum.txt

# Remove words that should not be suggested
while IFS='/' read -ra line
do
  sed -E -i "" "\:^$line($|/.*):d" en-US-munched.dic
done < "en-US-nosug.txt"

# Add back suggestion exclusions and numerals from the original .dic file
cat en-US-nosug.txt >> en-US-munched.dic
cat en-US-numerals.txt >> en-US-munched.dic
rm en-US-nosug.txt
rm en-US-numerals.txt

# Add back the line count and sort the lines
wc -l < en-US.utf8 | tr -d '[:blank:]' > en-US.dic
LC_ALL=C sort en-US.utf8 >> en-US.dic
rm -f en-US.utf8
wc -l < en-US-munched.dic | tr -d '[:blank:]' > en-US.dic
LC_ALL=C sort en-US-munched.dic >> en-US.dic
rm -f en-US-munched.dic

# Convert back to ISO-8859-1
iconv -f utf-8 -t iso-8859-1 en-US.dic > ../en-US.dic
@@ -10,14 +10,19 @@
set -e

WKDIR="`pwd`"
export SCOWL="$WKDIR/scowl/"
SUPPORT_DIR="$WKDIR/support_files/"
SPELLER="$SCOWL/speller"
SUPPORT_DIR="$WKDIR/support_files"
SPELLER="$WKDIR/scowl/speller"

if [ -e "$SUPPORT_DIR/orig-bk" ]; then
  echo "$0: directory '$SUPPORT_DIR/orig-bk' exists." 1>&2
  exit 0
fi
# Stop if backup folders already exist, because it means that this script
# has already been run once.
FOLDERS=( "orig-bk" "mozilla-bk")
for f in ${FOLDERS[@]}; do
  if [ -d "$SUPPORT_DIR/$f" ]; then
    echo "Backup folder already present: $f"
    echo "Run make-new-dict.sh before running this script."
    exit 1
  fi
done

mv orig "$SUPPORT_DIR/orig-bk"
mkdir orig
@@ -26,12 +31,13 @@ cp $SPELLER/en_US-custom.dic $SPELLER/en_US-custom.aff $SPELLER/README_en_US-cus
mkdir "$SUPPORT_DIR/mozilla-bk"
mv ../en-US.dic ../en-US.aff ../README_en_US.txt "$SUPPORT_DIR/mozilla-bk"

# Convert the affix file to ISO-8859-1
# The affix file is ISO-8859-1, but still need to change the character set to
# ISO-8859-1 and remove conversion rules.
cp en_US-mozilla.aff utf8/en-US-utf8.aff
sed -i "" -e '/^ICONV/d' -e 's/^SET UTF-8$/SET ISO8859-1/' en_US-mozilla.aff

# Convert the dictionary to ISO-8859-1
mv en_US-mozilla.dic utf8/en-US-utf8.dic
cp en_US-mozilla.dic utf8/en-US-utf8.dic
iconv -f utf-8 -t iso-8859-1 < utf8/en-US-utf8.dic > en_US-mozilla.dic

cp en_US-mozilla.aff ../en-US.aff
@@ -21,17 +21,23 @@ export LC_CTYPE=C
export LC_COLLATE=C

WKDIR="`pwd`"
ORIG="$WKDIR/orig"
SUPPORT_DIR="$WKDIR/support_files"
SPELLER="$WKDIR/scowl/speller"

# This is required by scowl scripts
export SCOWL="$WKDIR/scowl/"

ORIG="$WKDIR/orig/"
SUPPORT_DIR="$WKDIR/support_files/"
SPELLER="$SCOWL/speller"

expand() {
  grep -v '^[0-9]\+$' | $SPELLER/munch-list expand $1 | sort -u
}

if [ ! -d "$SPELLER" ]; then
  echo "The 'scowl' folder is missing. Check the documentation at"
  echo "https://firefox-source-docs.mozilla.org/extensions/spellcheck/index.html"
  exit 1
fi

mkdir -p $SUPPORT_DIR
cd $SPELLER
MK_LIST="../mk-list -v1 --accents=both en_US 60"
@@ -49,8 +55,16 @@ expand $SPELLER/en.aff < $SPELLER/en.dic.supp > $SUPPORT_DIR/0-special.txt
# Input is UTF-8, expand expects ISO-8859-1 so use iconv
iconv -f utf-8 -t iso-8859-1 $ORIG/en_US-custom.dic | expand $ORIG/en_US-custom.aff > $SUPPORT_DIR/1-base.txt

# The existing Mozilla dictionary is already in ISO-8859-1
expand ../en-US.aff < ../en-US.dic > $SUPPORT_DIR/2-mozilla.txt
# Store suggestion exclusions (ending with !) defined in current Mozilla dictionary.
# Save both the compressed (munched) and expanded version.
grep '!$' ../en-US.dic > $SUPPORT_DIR/2-mozilla-nosug-munched.txt
expand ../en-US.aff < $SUPPORT_DIR/2-mozilla-nosug-munched.txt > $SUPPORT_DIR/2-mozilla-nosug.txt

# Remove suggestion exclusions and expand the existing Mozilla dictionary.
# The existing Mozilla dictionary is already in ISO-8859-1.
grep -v '!$' < ../en-US.dic > $SUPPORT_DIR/en-US-nosug.dic
expand ../en-US.aff < $SUPPORT_DIR/en-US-nosug.dic > $SUPPORT_DIR/2-mozilla.txt
rm $SUPPORT_DIR/en-US-nosug.dic

# Input is UTF-8, expand expects ISO-8859-1 so use iconv
iconv -f utf-8 -t iso-8859-1 $SPELLER/en_US-custom.dic | expand $SPELLER/en_US-custom.aff > $SUPPORT_DIR/3-upstream.txt
@@ -72,16 +86,13 @@ comm -23 $SUPPORT_DIR/3-upstream.txt $SUPPORT_DIR/2-mozilla-removed.txt | cat -
# Note: the output of make-hunspell-dict is UTF-8
cat $SUPPORT_DIR/4-patched.txt | comm -23 - $SUPPORT_DIR/0-special.txt | $SPELLER/make-hunspell-dict -one en_US-mozilla /dev/null

# Exclude specific words from suggestions
while IFS= read -r line
do
  # If the string already contains an affix, just add !, otherwise add /!
  if [[ "$line" == *"/"* ]]; then
    sed -i "" "s|^$line$|$line!|" en_US-mozilla.dic
  else
    sed -i "" "s|^$line$|$line/!|" en_US-mozilla.dic
  fi
done < "mozilla-exclusions.txt"
# Add back Mozilla suggestion exclusions. Need to convert the file from
# ISO-8859-1 to UTF-8 first, then add back the line count and reorder.
tail -n +2 en_US-mozilla.dic > en_US-mozilla-complete.dic
iconv -f iso-8859-1 -t utf-8 $SUPPORT_DIR/2-mozilla-nosug-munched.txt >> en_US-mozilla-complete.dic
wc -l < en_US-mozilla-complete.dic | tr -d '[:blank:]' > en_US-mozilla.dic
LC_ALL=C sort en_US-mozilla-complete.dic >> en_US-mozilla.dic
rm -f en_US-mozilla-complete.dic

# Sanity check should yield identical results
#comm -23 $SUPPORT_DIR/1-base.txt $SUPPORT_DIR/3-upstream.txt > $SUPPORT_DIR/3-upstream-remover.txt
@@ -91,12 +102,27 @@ done < "mozilla-exclusions.txt"
expand ../en-US.aff < mozilla-specific.txt > 5-mozilla-specific.txt

# Update Mozilla removed and added wordlists based on the new upstream
# dictionary, save them as UTF-8 and not ISO-8859-1
comm -12 $SUPPORT_DIR/3-upstream.txt $SUPPORT_DIR/2-mozilla-removed.txt > $SUPPORT_DIR/5-mozilla-removed.txt
# dictionary, save them as UTF-8 and not ISO-8859-1.
# Ignore words excluded from suggestions for both files.
comm -12 $SUPPORT_DIR/3-upstream.txt $SUPPORT_DIR/2-mozilla-removed.txt > $SUPPORT_DIR/5-mozilla-removed-tmp.txt
comm -23 $SUPPORT_DIR/5-mozilla-removed-tmp.txt $SUPPORT_DIR/2-mozilla-nosug.txt > $SUPPORT_DIR/5-mozilla-removed.txt
rm $SUPPORT_DIR/5-mozilla-removed-tmp.txt
iconv -f iso-8859-1 -t utf-8 $SUPPORT_DIR/5-mozilla-removed.txt > 5-mozilla-removed.txt
comm -13 $SUPPORT_DIR/3-upstream.txt $SUPPORT_DIR/2-mozilla-added.txt > $SUPPORT_DIR/5-mozilla-added.txt

comm -13 $SUPPORT_DIR/3-upstream.txt $SUPPORT_DIR/2-mozilla-added.txt > $SUPPORT_DIR/5-mozilla-added-tmp.txt
comm -23 $SUPPORT_DIR/5-mozilla-added-tmp.txt $SUPPORT_DIR/2-mozilla-nosug.txt > $SUPPORT_DIR/5-mozilla-added.txt
rm $SUPPORT_DIR/5-mozilla-added-tmp.txt
iconv -f iso-8859-1 -t utf-8 $SUPPORT_DIR/5-mozilla-added.txt > 5-mozilla-added.txt

# Clean up some files
rm hunspell-en_US-mozilla.zip
rm nosug

# Remove backup folders in preparation for the install-new-dict script
FOLDERS=( "orig-bk" "mozilla-bk")
for f in ${FOLDERS[@]}; do
  if [ -d "$SUPPORT_DIR/$f" ]; then
    echo "Removing backup folder $f"
    rm -rf "$SUPPORT_DIR/$f"
  fi
done
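The two-step `comm` filtering that make-new-dict.sh applies to the removed and added word lists can be tried on small sample files. The sketch below is an illustration only: the function names are hypothetical, the temporary files from the diff are replaced by pipes, and all inputs are assumed to be sorted in the C locale, which `comm` requires.

```shell
#!/bin/sh
# Hypothetical helpers mirroring the comm steps in make-new-dict.sh.
# Arguments: <upstream list> <Mozilla removed or added list> <nosug list>

# Words Mozilla removed that still exist upstream, minus words that are
# excluded from suggestions.
filter_removed() {
  LC_ALL=C comm -12 "$1" "$2" | LC_ALL=C comm -23 - "$3"
}

# Words Mozilla added that are still missing upstream, minus words that are
# excluded from suggestions.
filter_added() {
  LC_ALL=C comm -13 "$1" "$2" | LC_ALL=C comm -23 - "$3"
}
```

`comm -12` keeps lines common to both inputs, `comm -13` keeps lines found only in the second input, and `comm -23` keeps lines found only in the first, so the second stage drops every word that appears in the no-suggestion list.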