
Search

Development tips

Adding fields to a live index

Elastic supports adding new fields to an existing mapping, along with some other operations: https://www.elastic.co/guide/en/elasticsearch/reference/7.9/mapping.html#add-field-mapping

To check whether a change you make to a Document will work in prod, try it locally with the mapping already set up:

./manage.py es_init --limit TestDocument

... make changes to TestDocument ...

./manage.py es_init --limit TestDocument

If that fails with an error, you'll need to create a new index with the new mapping, and reindex everything into that index.

However, if it succeeds, then it should also work in prod.

Once the changes are deployed to prod, and the mapping is updated with es_init, some documents may need to be reindexed. This is because we disable dynamic mapping in SumoDocument, to prevent a dynamic mapping of the wrong type from being set up before es_init can be run during a deployment.

So to ensure no data is missing from the index, run something like:

./manage.py es_reindex --limit TestDocument --updated-after <datetime of deploy> --updated-before <datetime of mapping update>

Indexing performance

When adding or editing elastic documents, you might want to add the --print-sql-count argument when testing out your changes, to see how many SQL queries are being executed:

CELERY_TASK_ALWAYS_EAGER=True ./manage.py es_reindex --print-sql-count --sql-chunk-size=100 --count=100

If the result is much less than 100, then your document is well optimized for indexing. However, if the result is some multiple of 100, then unfortunately one or more SQL queries are being executed for each instance being indexed. Consider using some combination of select_related, prefetch_related or annotations to bring that number down.
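
As a rough sketch of what that optimization can look like (the model and field names below are illustrative, not an exact reflection of the codebase), the queryset feeding the indexer can fetch its relations up front:

```python
from django.db.models import Count

from kitsune.questions.models import Question  # model chosen for illustration only

# Follow relations in bulk so that building each document doesn't trigger
# one or more extra SQL queries per instance.
queryset = (
    Question.objects
    .select_related("creator")               # join the FK in the same query
    .prefetch_related("tags")                # fetch the M2M in one extra query, not one per row
    .annotate(num_answers=Count("answers"))  # compute the aggregate in the database
)
```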

Datetimes and timezones

As a first step in our migration to using timezone-aware datetimes throughout the application, all datetimes stored in Elastic should be timezone-aware, so as to avoid having to migrate them later.

If inheriting from SumoDocument, any naive datetime set in a Date field will be automatically converted into a timezone-aware datetime, with the naive datetime assumed to be in the application's TIME_ZONE.

To avoid loss of precision around DST switches, aware datetimes should be set where possible. To generate an aware datetime, do:

from datetime import datetime, timezone

datetime.now(timezone.utc)

This should be used instead of django.utils.timezone.now() as that returns a naive or aware datetime depending on the value of USE_TZ, whereas we want datetimes in Elastic to always be timezone-aware.
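
For instance (in a configured Django environment), the difference is easy to see:

```python
from datetime import datetime, timezone

from django.utils import timezone as django_tz

aware = datetime.now(timezone.utc)
print(aware.tzinfo)  # timezone.utc -- always aware

# django.utils.timezone.now() is aware only when settings.USE_TZ is True,
# so depending on configuration tzinfo here may be None.
maybe_naive = django_tz.now()
print(maybe_naive.tzinfo)
```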

Print Elasticsearch queries in your development console

You can set the following variable in your .env file to enable logging of the queries sent to your local Elasticsearch instance.

ES_ENABLE_CONSOLE_LOGGING=True

Simulate slow and out-of-order query responses

To test how Instant Search behaves with slow and out-of-order responses, you can add a snippet like this:

from time import sleep
from random import randint
sleep(randint(1, 10))

to kitsune.search.views.simple_search.
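
For example, something along these lines at the top of the view (the exact signature and body of simple_search are elided here and assumed for illustration):

```python
# kitsune/search/views.py -- temporary local change for testing only
from random import randint
from time import sleep


def simple_search(request):  # signature assumed for illustration
    # Delay each response by 1-10 seconds so that overlapping Instant Search
    # requests can return out of order.
    sleep(randint(1, 10))
    ...  # rest of the view unchanged
```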

Synonyms

The kitsune/search/dictionaries/synonyms path contains a text file for each of our search-enabled locales, with synonyms in the Solr format.

expand defaults to True, so synonyms with no explicit mapping resolve to all elements in the list. That is to say:

start, open, run

is equivalent to:

start, open, run => start, open, run

It's also worth noting that these synonyms are applied at query time, not index time.

That is to say, if a document contained the phrase:

Firefox won't play music.

and we had a synonym set up as:

music => music, audio

Then the search query:

firefox won't play audio

would not match that document.
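
To make that concrete, here's a minimal elasticsearch-dsl sketch (the filter, analyzer and field names are made up and don't mirror kitsune's actual configuration) in which the synonym filter is only part of the search analyzer, so it expands queries but never changes the indexed tokens:

```python
from elasticsearch_dsl import Document, Text, analyzer, token_filter

# Hypothetical synonym filter reading one of the Solr-format files.
english_synonyms = token_filter(
    "english_synonyms",
    type="synonym_graph",
    synonyms_path="synonyms/en-US.txt",  # path relative to the Elasticsearch config directory
    updateable=True,                     # lets reload-search-analyzers pick up changes
)

english_search = analyzer(
    "english_search",
    tokenizer="standard",
    filter=["lowercase", english_synonyms],
)


class ExampleDocument(Document):
    # Synonyms are applied at query time only, because the filter appears
    # in search_analyzer and not in the index-time analyzer.
    content = Text(analyzer="standard", search_analyzer=english_search)
```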

Hyponyms and hypernyms (subtypes and supertypes)

The synonym files can also be used to define relations between hyponyms and hypernyms (subtypes and supertypes).

For example, a user searching for or posting about a problem with Facebook could use the phrase "Facebook isn't working", or "social media isn't working". Another user searching for or posting about a problem with Twitter could use the phrase "Twitter isn't working", or "social media isn't working".

A simple synonym definition like:

social, facebook, face book, twitter

isn't sufficient here, as a user querying about a problem with Facebook clearly doesn't have one with Twitter.

Similarly a rule like:

social => social, facebook, face book, twitter

only captures the case where a user has posted about Facebook not working and searched for social media not working, not the reverse.

So in this case a set of synonyms should be defined, like so:

social, facebook, face book
social, twitter

Here the hypernym (supertype) appears on every line, while each hyponym (subtype) appears on only one line.

This way, a search for "social" would also become one for "facebook", "face book" and "twitter", whereas a search for "twitter" would also become one for "social", but not "facebook" or "face book".

Interaction with the rest of the analysis chain

Everything that runs before the synonym token filter in the analysis chain, such as our tokenizers, stemmers and stop word filters, is also applied to the synonyms.

This means it's not necessary to specify the plural or conjugated forms of words, as post-analysis they should end up as the same token.

Hyphen-separated and space-separated words will analyze to the same set of tokens.

For instance in en-US, all these synonyms would do nothing at all:

de activate, de-activate
load, loading, loaded
bug, bugs

Stop words

Synonyms containing stop words (such as "in" or "on") must be treated with care, as the stop words will also be filtered out of the synonyms.

For example, these two rules produce the same result in the en-US analysis chain:

addon, add on
addon, add

So a character mapping should be used to turn phrases containing those stop words into ones which don't contain them. The resulting phrases can then be used in the synonym definitions.

Applying to all locales

There's also an _all.txt file, which specifies synonyms which should be applied across all locales. Suitable synonyms here include brand names or specific technical terms which won't tend to be localized.

Updating

In development synonyms can be updated very easily. Save your changes in the text file and run:

./manage.py es_init --reload-search-analyzers

If no other changes were made to the index configurations, then this should apply successfully, and your already-indexed data will persist in the index without needing to be reindexed (because these synonyms are applied at query time).

On production

The synonym files need to be put in a bundle and uploaded to the Elastic Cloud.

Run the bin/create_elastic_bundle.sh script to create a zip file with the appropriate directory structure. (You'll need to have zip installed for this command to work.)

Then, either create an extension, or update the previously created extension.

In either case, update the deployment configuration with the custom extension.

.. Note::
  When updating the deployment after updating an already-existing extension,
  Elastic Cloud may say that no changes are being applied. That isn't true:
  through testing it seems the extension is updated and the search analyzers
  are reloaded automatically.

  From testing, this seems to be the only approach to update and reload
  synonyms on Elastic Cloud. Updating the extension, restarting the cluster
  and using the reload-search-analyzers command *won't* work.

  Thankfully there's an open issue upstream to make managing synonyms easier
  with an API: https://github.com/elastic/elasticsearch/issues/38523
Character mappings

Character mappings cannot be dynamically updated, because they're applied at index time. So any change to a character mapping requires a reindex.

Taking the addon example from above, we'd want to create character mappings like:

[
  "add on => addon",
  "add-on => addon",
]

After tokenization "addon" doesn't contain an "on" token, so it's a suitable phrase to map to.

Unlike synonyms, character mappings are applied before any other part of the analysis chain, so space-separated and hyphen-separated phrases both need to be added.

In theory, plural and conjugated forms of words also need to be specified. However, in practice plural words tend to be covered by the singular replacement as well (e.g. "add on" is a substring of "add ons", so "add ons" is replaced by "addons"), and there is marginal benefit in defining every single conjugation of a verb.
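
As a rough elasticsearch-dsl sketch (the names here are illustrative, not kitsune's actual configuration), such a mapping is defined as a char_filter of type mapping and wired into the index-time analyzer, which is why changing it forces a reindex:

```python
from elasticsearch_dsl import analyzer, char_filter

# Hypothetical character mapping applied before tokenization.
addon_mappings = char_filter(
    "addon_mappings",
    type="mapping",
    mappings=[
        "add on => addon",
        "add-on => addon",
    ],
)

english_index = analyzer(
    "english_index",
    char_filter=[addon_mappings],  # runs before the tokenizer
    tokenizer="standard",
    filter=["lowercase"],
)
```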