# Search
## Development tips
### Adding fields to a live index
Elastic supports adding new fields to an existing mapping,
along with some other operations:
https://www.elastic.co/guide/en/elasticsearch/reference/7.9/mapping.html#add-field-mapping
To check whether a change you make to a Document will work in prod,
try it locally against an index whose mapping has already been set up:
```
./manage.py es_init --limit TestDocument
... make changes to TestDocument ...
./manage.py es_init --limit TestDocument
```
If that fails with an error,
you'll need to create a new index with the new mapping,
and reindex everything into that index.
However, if it succeeds, it should also work on prod.
Once the changes are deployed to prod,
and the mapping is updated with `es_init`,
some documents may need to be reindexed.
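This is because we disable dynamic mapping in `SumoDocument`,
to prevent a dynamic mapping of the wrong type from being set up before `es_init` can be run during a deployment.
As a result, documents indexed between the deploy and the mapping update may be missing the new field.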
So to ensure no data is missing from the index,
run something like:
```
./manage.py es_reindex --limit TestDocument --updated-after <datetime of deploy> --updated-before <datetime of mapping update>
```
### Indexing performance
When adding or editing elastic documents,
you might want to add the `--print-sql-count` argument when testing out your changes,
to see how many SQL queries are being executed:
```sh
CELERY_TASK_ALWAYS_EAGER=True ./manage.py es_reindex --print-sql-count --sql-chunk-size=100 --count=100
```
If the result is much less than 100,
then you have a well optimized document for indexing.
However, if the result is some multiple of 100,
then unfortunately one or more SQL queries are being executed for each instance being indexed.
Consider using some combination of
[`select_related`](https://docs.djangoproject.com/en/dev/ref/models/querysets/#select-related),
[`prefetch_related`](https://docs.djangoproject.com/en/dev/ref/models/querysets/#prefetch-related)
or [annotations](https://docs.djangoproject.com/en/dev/ref/models/querysets/#annotate)
to bring that number down.
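The difference between "one query per instance" and "one query per chunk" can be reproduced outside Django. The sketch below is only an illustration: it uses the standard library's `sqlite3` module as a stand-in for the ORM, and the `author` and `question` tables are hypothetical. The joined variant corresponds roughly to what `select_related` does for a foreign key.

```python
import sqlite3

# In-memory database with two hypothetical tables and 100 rows each.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE author (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE question (
        id INTEGER PRIMARY KEY,
        title TEXT,
        author_id INTEGER REFERENCES author(id)
    );
    """
)
conn.executemany(
    "INSERT INTO author VALUES (?, ?)", [(i, f"user{i}") for i in range(100)]
)
conn.executemany(
    "INSERT INTO question VALUES (?, ?, ?)",
    [(i, f"question {i}", i) for i in range(100)],
)
conn.commit()

# Record every SELECT statement the connection executes.
queries = []


def record(sql):
    if sql.lstrip().upper().startswith("SELECT"):
        queries.append(sql)


conn.set_trace_callback(record)


def count_selects(fn):
    """Run fn and return how many SELECT statements it issued."""
    queries.clear()
    fn()
    return len(queries)


def one_query_per_instance():
    # One query for the chunk, then one more per row for the related author.
    rows = conn.execute("SELECT id, title, author_id FROM question").fetchall()
    for _id, _title, author_id in rows:
        conn.execute(
            "SELECT name FROM author WHERE id = ?", (author_id,)
        ).fetchone()


def single_joined_query():
    # The related author is fetched in the same query via a JOIN.
    conn.execute(
        "SELECT q.id, q.title, a.name "
        "FROM question q JOIN author a ON a.id = q.author_id"
    ).fetchall()


per_instance_count = count_selects(one_query_per_instance)
joined_count = count_selects(single_joined_query)
```

With 100 rows, the first variant issues one query for the chunk plus one per row, which is the "some multiple of 100" pattern described above, while the joined variant issues a single query.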
### Datetimes and timezones
As a first step in our migration to using timezone-aware datetimes throughout the application,
all datetimes stored in Elastic should be timezone-aware,
so as to avoid having to migrate them later.
If inheriting from `SumoDocument`,
any naive datetime set in a `Date` field will be automatically converted into a timezone-aware datetime,
with the naive datetime assumed to be in the application's `TIME_ZONE`.
To avoid ambiguity around DST switches,
set aware datetimes wherever possible.
To generate an aware datetime:
```python
from datetime import datetime, timezone

# Current time as a timezone-aware datetime in UTC.
datetime.now(timezone.utc)
```
This should be used instead of
[`django.utils.timezone.now()`](https://docs.djangoproject.com/en/2.2/ref/utils/#django.utils.timezone.now)
as that returns a naive or aware datetime depending on the value of `USE_TZ`, whereas we want datetimes in Elastic to always be timezone-aware.
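As a rough illustration of why aware datetimes are preferred, the sketch below uses two fixed offsets as stand-ins for the daylight-saving and standard sides of a DST-observing `TIME_ZONE`. The offsets and the date are assumptions chosen for the example, not values taken from the application's settings. The same naive wall-clock time maps to two different instants depending on which offset is assumed, whereas an aware datetime carries its offset with it.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical fixed offsets standing in for a DST-observing zone.
DAYLIGHT = timezone(timedelta(hours=-7), "PDT")
STANDARD = timezone(timedelta(hours=-8), "PST")

# A wall-clock time that occurs twice on a night when the clocks go back.
naive = datetime(2021, 11, 7, 1, 30)

# Interpreting the naive value requires guessing which offset applies.
as_daylight = naive.replace(tzinfo=DAYLIGHT).astimezone(timezone.utc)
as_standard = naive.replace(tzinfo=STANDARD).astimezone(timezone.utc)
gap = as_standard - as_daylight  # the two readings differ by one hour

# An aware datetime needs no such guess.
aware = datetime.now(timezone.utc)
is_aware = aware.tzinfo is not None and aware.utcoffset() == timedelta(0)
```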
### Print Elasticsearch queries in your development console
You can set the following variable in your `.env` file to log the queries that are sent to your local Elasticsearch instance:
```
ES_ENABLE_CONSOLE_LOGGING=True
```
### Simulate slow and out of order query responses
To test how Instant Search behaves with slow and out of order responses you can add a snippet like this:
```python
from time import sleep
from random import randint
sleep(randint(1, 10))
```
to `kitsune.search.views.simple_search`.
### Synonyms
The `kitsune/search/dictionaries/synonyms` path contains a text file for each of our search-enabled locales,
where synonyms are in the
[Solr format](https://www.elastic.co/guide/en/elasticsearch/reference/7.10/analysis-synonym-graph-tokenfilter.html#_solr_synonyms_2).
`expand` defaults to `True`,
so synonyms with no explicit mapping resolve to all elements in the list.
That is to say:
```
start, open, run
```
is equivalent to:
```
start, open, run => start, open, run
```
It's also worth noting that these synonyms are applied at _query_ time,
not index time.
That is to say,
if a document contained the phrase:
> Firefox won't play music.
and we had a synonym set up as:
```
music => music, audio
```
Then the search query:
> firefox won't play audio
would **not** match that document.
#### Hyponyms and hypernyms (subtypes and supertypes)
The synonym files can also be used to define relations between
[hyponyms and hypernyms (subtypes and supertypes)](https://en.wikipedia.org/wiki/Hyponymy_and_hypernymy).
For example,
a user searching for or posting about a problem with Facebook could use the phrase "Facebook isn't working",
or "social media isn't working".
Another user searching for or posting about a problem with Twitter could use the phrase "Twitter isn't working",
or "social media isn't working".
A simple synonym definition like:
```
social, facebook, face book, twitter
```
isn't sufficient here,
as a user querying about a problem with Facebook clearly doesn't have one with Twitter.
Similarly a rule like:
```
social => social, facebook, face book, twitter
```
only captures the case where a user has posted about Facebook not working and searched for social media not working,
not the reverse.
So in this case a set of synonyms should be defined,
like so:
```
social, facebook, face book
social, twitter
```
Here the hypernym (supertype) is repeated on every line,
and each hyponym (subtype), along with its spelling variants, is defined on a line of its own.
This way,
a search for "social" would also become one for "facebook", "face book" and "twitter".
Whereas a search for "twitter" would also become one for "social",
but _not_ "facebook" or "face book".
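The expansion described above can be checked with a toy model of the rule format. This is a minimal sketch and not how Elasticsearch implements synonyms: it treats every comma-separated entry as an opaque term, ignores the rest of the analysis chain, and assumes `expand` is `True`. The function names `parse_synonyms` and `expand_term` are hypothetical.

```python
from collections import defaultdict


def parse_synonyms(lines):
    """Build a term -> replacement-terms table from Solr-style rules."""
    table = defaultdict(set)
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        if "=>" in line:
            # Explicit mapping: left-hand terms are replaced by right-hand terms.
            lhs, rhs = line.split("=>", 1)
            sources = [term.strip() for term in lhs.split(",")]
            targets = [term.strip() for term in rhs.split(",")]
        else:
            # No explicit mapping: every term expands to the whole list.
            sources = targets = [term.strip() for term in line.split(",")]
        for source in sources:
            # Rules sharing a source term are merged.
            table[source].update(targets)
    return table


def expand_term(term, table):
    """Return the set of terms a query term becomes; unknown terms pass through."""
    return table.get(term, {term})


rules = [
    "social, facebook, face book",
    "social, twitter",
]
table = parse_synonyms(rules)

social = expand_term("social", table)
twitter = expand_term("twitter", table)
facebook = expand_term("facebook", table)
```

In this model `social` expands to all four terms, while `twitter` expands only to itself and `social`, and `facebook` never expands to `twitter`.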
#### Interaction with the rest of the analysis chain
All the analyzers above the synonym token filter in the analyzer chain are also applied to the synonyms,
such as our tokenizers, stemmers and stop word filters.
This means it's not necessary to specify the plural or conjugated forms of words,
as post-analysis they _should_ end up as the same token.
Hyphen-separated and space-separated words will analyze to the same set of tokens.
For instance in en-US,
all these synonyms would do nothing at all:
```
de activate, de-activate
load, loading, loaded
bug, bugs
```
##### Stop words
Synonyms containing stop words (such as "in" or "on") must be treated with care,
as the stop words will also be filtered out of the synonyms.
For example,
these two rules produce the same result in the en-US analysis chain:
```
addon, add on
addon, add
```
So a [character mapping](#character-mappings) should be used to turn phrases containing those stop words into ones which don't.
Those resulting phrases can then be used in the synonyms definition.
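The effect of stop word filtering on the two rules above can be sketched with a simplified analyzer. This is an approximation under stated assumptions: the stop word list holds only the two words mentioned here, tokenization is a plain whitespace split with hyphens treated as spaces, and stemming is left out. The names `analyze` and `analyze_rule` are hypothetical.

```python
# Tiny stand-in for the real stop word list.
STOP_WORDS = {"in", "on"}


def analyze(phrase):
    """Lowercase, split on whitespace and hyphens, then drop stop words."""
    tokens = phrase.lower().replace("-", " ").split()
    return [token for token in tokens if token not in STOP_WORDS]


def analyze_rule(rule):
    """Analyze each comma-separated entry of a synonym rule."""
    return [analyze(entry) for entry in rule.split(",")]


with_stop_word = analyze_rule("addon, add on")
without_stop_word = analyze_rule("addon, add")
rules_are_equivalent = with_stop_word == without_stop_word
```

Both rules reduce to the token lists `["addon"]` and `["add"]`, so the rule would also treat a plain "add" as a synonym of "addon".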
#### Applying to all locales
There's also an `_all.txt` file,
which specifies synonyms which should be applied across _all_ locales.
Suitable synonyms here include brand names or specific technical terms which won't tend to be localized.
#### Updating
In development, synonyms can be updated very easily.
Save your changes in the text file and run:
```
./manage.py es_init --reload-search-analyzers
```
If no other changes were made to the index configurations,
then this should apply successfully,
and your already-indexed data will persist within the index and not require any reindexing
(because these synonyms are applied at query time).
##### On production
The synonym files need to be put in a bundle and uploaded to the Elastic Cloud.
Run the `bin/create_elastic_bundle.sh` script to create a zip file with the appropriate directory structure.
(You'll need to have `zip` installed for this command to work.)
Then,
either [create](https://www.elastic.co/guide/en/cloud/current/ec-custom-bundles.html#ec-add-your-plugin) an extension,
or [update](https://www.elastic.co/guide/en/cloud/current/ec-custom-bundles.html#ec-update-bundles-and-plugins) the previously created extension.
And in either case,
[update the deployment configuration](https://www.elastic.co/guide/en/cloud/current/ec-custom-bundles.html#ec-update-bundles)
with the custom extension.
```eval_rst
.. Note::
  When updating the deployment after updating an already-existing extension,
  Elastic Cloud may say that no changes are being applied.
  That isn't true,
  and through testing it seems like the extension is being updated,
  and the search analyzers are being reloaded automatically.
  From testing,
  this seems to be the only approach to update and reload synonyms on the Elastic Cloud.
  Updating the extension,
  restarting the cluster and using the reload-search-analyzers command *won't* work.
  Thankfully there's an open issue upstream to make managing synonyms easier with an API:
  https://github.com/elastic/elasticsearch/issues/38523
```
### Character mappings
Character mappings _cannot_ be dynamically updated,
because they're applied at index time.
So any change to a character mapping requires a reindex.
Taking the addon example from above,
we'd want to create character mappings like:
```
[
"add on => addon",
"add-on => addon",
]
```
Post-tokenization, `addon` doesn't contain an `on` token,
so it is a suitable replacement.
Unlike synonyms,
character mappings are applied before any other part of the analysis chain,
so both space-separated and hyphen-separated phrases need to be added.
In theory plural and conjugated forms of words also need to be specified,
however in practice plural words tend to be covered by the singular replacement as well
(e.g. "add on" is a substring in "add ons",
so "add ons" is replaced by "addons")
and there is marginal benefit to defining _every single_ conjugation of a verb.
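The substring behaviour described above can be sketched with plain string replacement. This only approximates a character mapping: it applies the rules one after another with `str.replace`, which gives the same result for these two rules but is not how Elasticsearch matches them in general. The function name `apply_char_mappings` is hypothetical.

```python
CHAR_MAPPINGS = [
    "add on => addon",
    "add-on => addon",
]


def apply_char_mappings(text, mappings=CHAR_MAPPINGS):
    """Replace each mapped phrase in the raw text, before any tokenization."""
    for rule in mappings:
        source, target = (part.strip() for part in rule.split("=>"))
        text = text.replace(source, target)
    return text


plural = apply_char_mappings("my add ons are broken")
hyphenated = apply_char_mappings("disable the add-on")
```

Because "add on" is a substring of "add ons", the plural form is rewritten to "addons" without a rule of its own.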