[ENG-6189] trovesearch denormalized #828

aaxelb · 2024-10-25T20:37:58Z

add trovesearch_denorm index strategy to speed up /trove/index-card-search and /trove/index-value-search requests
- used by default when the TROVESEARCH_DENORMILY feature flag is up
- stop using nested fields (what a wild choice that was)
  - to support indexing values at arbitrary paths, use dynamic templates
  - to support value-search with cardSearchFilter/cardSearchText params, index additional docs for each iri value in an indexcard (each with a full copy of the card's filterable fields -- this is the denormalization part)
- include integer values (for sorting only, so far)
allow sorting by property-path (not only by a single property)

and assorted improvements from meanderings in #824

move the definition of available index-strategies from settings to static code (to fix having code that depends on settings being just so)
replace static methods on IndexStrategy with functions in share.search.index_strategy
add admin view to view mappings for an index
update abstract Elastic8IndexStrategy to support indexing multiple docs per index-card (required for denormalization above)
make tests reusable for index-strategies supporting trove-search
consolidate some logic shared across trove-y index strategies to reusable utils in share.search.index_strategy._trovesearch_util
more accurate type annotations

coveralls · 2024-10-25T20:43:36Z

coverage: 91.13% (+0.5%) from 90.61%
when pulling ad5a52b on aaxelb:feature/6189-denormal
into ed13c34 on CenterForOpenScience:develop.

mfraezz

Generally looks pretty good. A couple questions and some potential tidying, but no obvious bugs. Good test coverage.

Pass complete

mfraezz · 2024-10-28T16:45:17Z

project/settings.py

-    ELASTICSEARCH8_CLUSTER_SETTINGS = {
-        'URL': ELASTICSEARCH8_URL,
-        'AUTH': (
-            (ELASTICSEARCH8_USERNAME, ELASTICSEARCH8_SECRET)
-            if ELASTICSEARCH8_SECRET is not None
-            else None
-        ),
-        'CERT_PATH': ELASTICSEARCH8_CERT_PATH,
-    }


Do these CLUSTER_SETTINGS not need to be specified?

not anymore -- one of the cleanup changes i kept from #824 was removing the index-strategy config from django settings and instead making (for example) the base elastic8 strategy use settings.ELASTICSEARCH8_URL directly

will be worth a fresh rethink if index strategies are ever abstracted into a reusable form, but as it is seemed worse the more strategies i added

mfraezz · 2024-11-04T16:43:21Z

tests/share/search/index_strategy/test_strategy_selection.py

-            # bad calls:
-            with pytest.raises(IndexStrategyError):
-                IndexStrategy.get_for_sharev2_search('bad-request')
-            with pytest.raises(IndexStrategyError):
-                IndexStrategy.get_for_sharev2_search()
-            with pytest.raises(IndexStrategyError):
-                IndexStrategy.get_for_sharev2_search(requested_name=None)
+        with pytest.raises(IndexStrategyError):
+            get_index_for_sharev2_search('bad-request')


Are any of the other error cases still worth testing?

the other two aren't errors anymore; they now return a default that depends on settings and feature flags

...should add tests for that, tho

mfraezz · 2024-11-04T16:45:17Z

tests/share/search/index_strategy/test_trove_indexcard_flats.py

-
-
-class TestTroveIndexcardFlats(RealElasticTestCase):
+class TestTroveIndexcardFlats(_common_trovesearch_tests.CommonTrovesearchTests):


👍 to DRYing tests

mfraezz · 2024-11-04T16:58:48Z

trove/vocab/trove.py

@@ -235,7 +235,7 @@ def trove_browse_link(iri: str):
            TROVE.valueSearchFilter,
            TROVE.pageSize,
            TROVE.pageCursor,
-            TROVE.sort,
+            # TROVE.sort,


Value-search doesn't support sorting now?

never did, tbh -- as implemented, has always ignored any sort param and sorted from most- to least-used, and changing that seemed like it'd raise more questions and be more difficult than correcting the docs

mfraezz · 2024-11-04T17:08:12Z

share/search/index_strategy/trovesearch_denorm.py

+        # delete all per-value docs (to account for missing values)
+        self.es8_client.delete_by_query(
+            index=list(indexnames),
+            query={'bool': {'must': [
+                {'terms': {'card.card_pk': messages_chunk.target_ids_chunk}},
+                {'exists': {'field': 'iri_value.value_iri'}},
+            ]}},
+        )
+        # (possible optimization: instead, hold onto doc_ids and (in `after_chunk`?)
+        #                         delete_by_query excluding those)


For my own understanding, can you explain what/why this is doing? When messages come in chunked (e.g. for an index backfill), this will delete existing PKs from that incoming chunk if there's a value_iri?

the denorm index has a document for the card itself and additional documents for each iri value in the card contents (with the card object duplicated across all of them, so a value-search can filter by anything on the card)

{card: ...} {card: ..., iri_value: ...} {card: ..., iri_value: ...} {card: ..., iri_value: ...} ...

after indexed like that, if a subsequent update is missing any previous iri_value (say, a subject or contributor is removed), the corresponding value docs need to be deleted -- the easy/naive path was this delete_by_query to drop all value docs for the chunk of cards that are about to be indexed

(note it's not just for backfills -- the urgent queue is chunked as well, tho you'll more often get chunks of size one)

mfraezz · 2024-11-04T17:12:35Z

share/search/index_strategy/trovesearch_denorm.py

+                    index=self.indexname,
+                )
+            except elasticsearch8.TransportError as error:
+                raise exceptions.IndexStrategyError() from error  # TODO: error messaging


Minor: how annoying would it be to add error messaging instead of the TODOs here and above?

mfraezz · 2024-11-04T17:15:21Z

share/search/index_strategy/trovesearch_denorm.py

+        if _iri_aggs:
+            _buckets = _iri_aggs['buckets']
+            _bucket_count = len(_buckets)
+            # WARNING: terribly inefficient pagination (part two)


Any guesses as to how terribly-inefficient -- worth benchmarking? It looks fairly similar to the FlatsIndexStrat's pagination

it is indeed same pagination as in flats -- "terribly inefficient" was a label i applied while coding (and copy-pasted here), but in practice it's not too much slower to load N pages and discard N-1 of them (since N is small in practice)

maybe "terribly hacky" is a better descriptor

mfraezz · 2024-11-04T17:35:02Z

share/search/index_strategy/trovesearch_denorm.py

+# cursor implementations
+
+@dataclasses.dataclass
+class _SimpleCursor:


Some of the other dataclasses in here are very-similar-but-somewhat-different-from some of the dataclasses found in indexcard_flats, but it looks like the only change in this one as compared to index_card_flats._SimpleCursor is the assert in from_page_param.

Should this (and/or others, potentially) be DRY'd up a bit?

my intent was/is to delete trove_indexcard_flats as soon as trovesearch_denorm is in stable use, so tried to avoid putting much effort into updating it further -- but agree there are a few things (like cursors) that would make more sense reused than duplicated, yeah

aaxelb added 6 commits October 25, 2024 14:12

agglomulated changes from quest/minimal-nest

c02d47f

wip (removing unneeded changes)

ba78a9b

add trovesearch_denorm index strategy

a738a7c

tidy, consolidate, and index supplements

be7ad6d

add tests for supplement and sorting

354458e

support sorting along paths and on integers

c65a4d0

update sort docs

b315c98

aaxelb force-pushed the feature/6189-denormal branch from d925cc4 to b315c98 Compare October 25, 2024 21:03

aaxelb marked this pull request as ready for review October 25, 2024 21:08

aaxelb changed the title ~~[wip][ENG-6189] trovesearch denormalized~~ [ENG-6189] trovesearch denormalized Oct 25, 2024

aaxelb mentioned this pull request Oct 25, 2024

[wip/analysis][ENG-6075] trove-search performance analysis #824

Closed

fix: card.pk => card.card_pk

ad5a52b

mfraezz reviewed Nov 4, 2024

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[ENG-6189] trovesearch denormalized #828

[ENG-6189] trovesearch denormalized #828

aaxelb commented Oct 25, 2024 •

edited

Loading

coveralls commented Oct 25, 2024 •

edited

Loading

mfraezz left a comment

mfraezz Oct 28, 2024

aaxelb Nov 4, 2024

mfraezz Nov 4, 2024

aaxelb Nov 4, 2024

mfraezz Nov 4, 2024

mfraezz Nov 4, 2024

aaxelb Nov 4, 2024

mfraezz Nov 4, 2024

aaxelb Nov 4, 2024

mfraezz Nov 4, 2024

mfraezz Nov 4, 2024

aaxelb Nov 4, 2024

mfraezz Nov 4, 2024

aaxelb Nov 4, 2024



		class TestTroveIndexcardFlats(RealElasticTestCase):
		class TestTroveIndexcardFlats(_common_trovesearch_tests.CommonTrovesearchTests):

[ENG-6189] trovesearch denormalized #828

Are you sure you want to change the base?

[ENG-6189] trovesearch denormalized #828

Conversation

aaxelb commented Oct 25, 2024 • edited Loading

coveralls commented Oct 25, 2024 • edited Loading

mfraezz left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

aaxelb commented Oct 25, 2024 •

edited

Loading

coveralls commented Oct 25, 2024 •

edited

Loading