Document how our search works. (#8713)

Fixes #8090

This is simply stating the facts about how search currently works. This doesn't mention any improvement ideas in the docs directly.

In addition to that, personal improvement ideas:

* Remove `listed_authors.name` from the scoring and make searching by author possible only via separate field queries, e.g. `Search addons author:Chris` or so. This may simplify scoring against the add-on name a lot
* https://github.com/mozilla/addons-server/issues/6815 (Remove prefix boosting to simplify scoring)
* Maybe just use https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-query-string-query.html instead of all our custom rules here and there?
* Remove negative boosts https://github.com/mozilla/addons-server/issues/6838
Christopher Grebs 2018-07-06 14:18:55 +02:00, committed by GitHub
Parent 3b1e8bfa04
Commit 224859a641
No known key found for this signature
GPG key ID: 4AEE18F83AFDEB23
4 changed files: 102 additions and 4 deletions

@ -264,8 +264,8 @@ This endpoint allows you to fetch a specific add-on by id, slug or guid.
For backwards-compatibility reasons, the value for type of Theme
currently live on production addons.mozilla.org is ``persona``
(Lightweight Theme). ``theme`` refers to a deprecated XUL Complete Theme.
New webextension packaged non-dynamic themes are ``statictheme``.
============== ==========================================================
Value Description

@ -42,8 +42,8 @@ Detail
This endpoint allows you to fetch a single collection by its ``slug``.
It returns any ``public`` collection by the specified user. You can access
a non-``public`` collection only if it was authored by you, the authenticated user.
If you have ``Admin:Curation`` permission you can see any collection belonging
to the ``mozilla`` user.
.. http:get:: /api/v4/accounts/account/(int:user_id|string:username)/collections/(string:collection_slug)/

@ -19,5 +19,6 @@ Development
services
translations
style
search
docs
../../../README.rst

@ -0,0 +1,97 @@
.. _search:
============================
How does search on AMO work?
============================
.. note::
    This documents how search is currently implemented in addons-server.
    We will use it to plan future improvements, so please note that we are
    aware that the process described below is not perfect and has bugs here and there.
    Please see https://github.com/orgs/mozilla/projects/17#card-10287357 for more planning.
General structure
=================
Our search contains Add-ons (``addons`` index) and Add-on compatibility report documents (``compat`` index).
In addition to that we store the following data:
* Add-on Versions (`Indexer / Serializer <https://github.com/mozilla/addons-server/blob/master/src/olympia/addons/indexers.py#L215-L237>`_)
* Files for each Add-on Version
* Compatibility information for each Add-on Version
As well as
* Authors
* Previews (image links)
* Translations (`Translations mapping generation <https://github.com/mozilla/addons-server/blob/master/src/olympia/amo/indexers.py#L40-L136>`_)
And various other add-on related properties. See the `Add-on Indexer / Serializer <https://github.com/mozilla/addons-server/blob/master/src/olympia/addons/indexers.py#L215-L237>`_ for more details.
Our search can be reached via the API, through :ref:`/api/v4/addons/search/ <addon-search>` or :ref:`/api/v4/addons/autocomplete/ <addon-autocomplete>`, both of which are used by our addons-frontend, or via our legacy pages (used much less).
Both use similar filtering and scoring code, but for legacy reasons they are not identical. We should focus on our API-based search, since the legacy search will be removed once support for Thunderbird and Seamonkey has moved to a new platform.
The legacy search uses ElasticSearch to query the data and then requests the actual model objects from the database. The newer API-based search only hits ElasticSearch and uses the data stored in ES directly, which is much more efficient.
Flow of a search query through AMO
==================================
Let's assume we search on addons-frontend (not legacy). The search query hits the API and gets handled by ``AddonSearchView``, which queries ElasticSearch directly and doesn't involve the database at all.
There are a few filters that are described in the :ref:`/api/v4/addons/search/ docs <addon-search>` but most of them are not very relevant for raw search queries. Examples are filters by guid, platform, category, add-on type or appversion.
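To make the filters concrete, here is a minimal sketch that builds a search URL for the API endpoint mentioned above. The host name and the helper ``build_search_url`` are assumptions made for illustration; ``q`` carries the raw search query and the remaining parameters are the optional filters described in the API docs.

```python
from urllib.parse import urlencode

# Assumption: the public AMO host. The path is the one documented above.
API_BASE = "https://addons.mozilla.org"
SEARCH_PATH = "/api/v4/addons/search/"


def build_search_url(query, **filters):
    """Return a search URL for ``query`` plus optional filters (hypothetical helper)."""
    params = {"q": query}
    # Filters such as ``type`` or ``appversion`` narrow the result set;
    # they do not change how the raw query is scored.
    params.update(filters)
    return f"{API_BASE}{SEARCH_PATH}?{urlencode(params)}"


url = build_search_url("ad blocker", type="extension")
```

Only the ``q`` parameter feeds into the ranking rules described in the rest of this document.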
Much more relevant for raw add-on searches (which is primarily what the search on the frontend performs) is ``SearchQueryFilter``.
It composes various rules to define a more or less usable ranking:
Primary rules
-------------
These are the ones using the strongest boosts, so they are only applied
to a specific set of fields like the name, the slug and authors.
**Applied rules** (merged via ``should``):
1. Prefer ``term`` matches on ``name.raw`` (``boost=100``) - our attempt to implement exact matches
2. Prefer phrase matches that allow swapped terms (``boost=4``, ``slop=1``)
3. If the query is shorter than 20 characters, try a fuzzy match on it (``boost=2``, ``prefix_length=4``, ``fuzziness=AUTO``)
4. Then text matches, using the standard text analyzer (``boost=3``, ``analyzer=standard``, ``operator=and``)
5. Then text matches, using a language-specific analyzer (``boost=2.5``)
6. Then look for the query as a prefix (``boost=1.5``)
7. If we have a matching analyzer, add a query to ``name_l10n_{LANG}`` (``boost=2.5``, ``operator=and``)
Rules 4, 5 and 6 are added for ``name`` and ``listed_authors.name``.
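The primary rules above can be sketched as plain Elasticsearch query DSL dictionaries. This is an illustration only, not the code in ``SearchQueryFilter``: the function name, the ``lang`` parameter, and the choice to apply rules 1 to 3 to ``name`` alone and to skip rule 5 when no analyzer is given are assumptions.

```python
def primary_should_rules(search_query, analyzer=None, lang=None):
    """Sketch of the primary ``should`` clauses (hypothetical helper)."""
    rules = [
        # Rule 1: exact match attempt on the raw name.
        {"term": {"name.raw": {"value": search_query, "boost": 100}}},
        # Rule 2: phrase match that tolerates swapped terms.
        {"match_phrase": {"name": {"query": search_query, "boost": 4, "slop": 1}}},
    ]
    # Rule 3: fuzzy matching only for queries shorter than 20 characters.
    if len(search_query) < 20:
        rules.append({"match": {"name": {
            "query": search_query, "boost": 2,
            "prefix_length": 4, "fuzziness": "AUTO"}}})
    # Rules 4, 5 and 6 are added for both the name and the author names.
    for field in ("name", "listed_authors.name"):
        rules.append({"match": {field: {
            "query": search_query, "boost": 3,
            "analyzer": "standard", "operator": "and"}}})
        if analyzer:  # assumption: rule 5 needs a language analyzer
            rules.append({"match": {field: {
                "query": search_query, "boost": 2.5, "analyzer": analyzer}}})
        rules.append({"prefix": {field: {"value": search_query, "boost": 1.5}}})
    # Rule 7: query the localized name field when an analyzer matches.
    if analyzer and lang:
        rules.append({"match": {f"name_l10n_{lang}": {
            "query": search_query, "boost": 2.5,
            "operator": "and", "analyzer": analyzer}}})
    return rules
```

Because every clause ends up in a ``should`` list, a document matching several rules accumulates their scores, which is why the exact ``name.raw`` match with ``boost=100`` dominates the ranking.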
Secondary rules
---------------
These are the ones using the weakest boosts. They are applied to fields
containing more text, like description, summary and tags.
**Applied rules** (merged via ``should``):
1. Look for phrase matches inside the summary (``boost=0.8``)
2. Look for phrase matches inside the summary using a language-specific analyzer (``boost=0.6``)
3. Look for phrase matches inside the description (``boost=0.3``)
4. Look for phrase matches inside the description using a language-specific analyzer (``boost=0.6``)
5. Look for matches inside tags (``boost=0.1``)
6. Append a separate ``match`` query for every word to boost tag matches (``boost=0.1``)
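A similar sketch for the secondary rules, again as plain query DSL dictionaries. The field names ``summary``, ``description`` and ``tags`` and the way the language analyzer is passed are assumptions for illustration; the boosts are the ones listed above.

```python
def secondary_should_rules(search_query, analyzer=None):
    """Sketch of the secondary ``should`` clauses (hypothetical helper)."""
    rules = [
        # Rules 1 and 3: phrase matches on the longer text fields.
        {"match_phrase": {"summary": {"query": search_query, "boost": 0.8}}},
        {"match_phrase": {"description": {"query": search_query, "boost": 0.3}}},
    ]
    if analyzer:
        # Rules 2 and 4: the same phrase matches with a language analyzer.
        rules.append({"match_phrase": {"summary": {
            "query": search_query, "boost": 0.6, "analyzer": analyzer}}})
        rules.append({"match_phrase": {"description": {
            "query": search_query, "boost": 0.6, "analyzer": analyzer}}})
    # Rule 5: match the whole query against tags.
    rules.append({"match": {"tags": {"query": search_query, "boost": 0.1}}})
    # Rule 6: one extra tag match per word in the query.
    for word in search_query.split():
        rules.append({"match": {"tags": {"query": word, "boost": 0.1}}})
    return rules
```

All of these boosts are below 1, so a secondary match can reorder results with similar primary scores but should rarely outrank a strong name match.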
General query flow:
-------------------
1. Fetch current translation
2. Fetch locale specific analyzer (`List of analyzers <https://github.com/mozilla/addons-server/blob/master/src/olympia/constants/search.py#L15-L61>`_)
3. Merge primary and secondary *should* rules
4. Create a ``function_score`` query that uses a ``field_value_factor`` function on ``boost``
5. Add a specific boost for webextension related add-ons
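Steps 3 to 5 can be sketched as a single request body. This is a hedged illustration of the query shape rather than the actual implementation: the function name, the ``is_webextension`` field and the default weight of ``2.0`` are hypothetical, while ``function_score``, ``field_value_factor`` and ``weight`` are standard Elasticsearch query DSL.

```python
def build_search_body(should_rules, webextension_weight=2.0):
    """Wrap merged ``should`` rules in a ``function_score`` query (sketch)."""
    return {
        "query": {
            "function_score": {
                # Step 3: primary and secondary rules merged into one bool query.
                "query": {"bool": {"should": list(should_rules)}},
                "functions": [
                    # Step 4: scale the text score by the indexed ``boost`` value.
                    {"field_value_factor": {"field": "boost"}},
                    # Step 5: extra weight for webextensions.
                    # Assumption: the filter field name is hypothetical.
                    {
                        "weight": webextension_weight,
                        "filter": {"term": {"is_webextension": True}},
                    },
                ],
            }
        }
    }


body = build_search_body([{"match": {"name": {"query": "tabs"}}}])
```

The resulting dictionary could be sent as the JSON body of a search request against the ``addons`` index.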