Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[ENG-6189] trovesearch denormalized #828

Open
wants to merge 8 commits into
base: develop
Choose a base branch
from

Conversation

aaxelb
Copy link
Contributor

@aaxelb aaxelb commented Oct 25, 2024

  • add trovesearch_denorm index strategy to speed up /trove/index-card-search and /trove/index-value-search requests
    • used by default when the TROVESEARCH_DENORMILY feature flag is up
    • stop using nested fields (what a wild choice that was)
      • to support indexing values at arbitrary paths, use dynamic templates
      • to support value-search with cardSearchFilter/cardSearchText params, index additional docs for each iri value in an indexcard (each with a full copy of the card's filterable fields -- this is the denormalization part)
    • include integer values (for sorting only, so far)
  • allow sorting by property-path (not only by a single property)

and assorted improvements from meanderings in #824

  • move the definition of available index-strategies from settings to static code (to fix having code that depends on settings being just so)
  • replace static methods on IndexStrategy with functions in share.search.index_strategy
  • add admin view to view mappings for an index
  • update abstract Elastic8IndexStrategy to support indexing multiple docs per index-card (required for denormalization above)
  • make tests reusable for index-strategies supporting trove-search
  • consolidate some logic shared across trove-y index strategies to reusable utils in share.search.index_strategy._trovesearch_util
  • more accurate type annotations

@coveralls
Copy link

coveralls commented Oct 25, 2024

Coverage Status

coverage: 91.13% (+0.5%) from 90.61%
when pulling ad5a52b on aaxelb:feature/6189-denormal
into ed13c34 on CenterForOpenScience:develop.

@aaxelb aaxelb marked this pull request as ready for review October 25, 2024 21:08
@aaxelb aaxelb changed the title [wip][ENG-6189] trovesearch denormalized [ENG-6189] trovesearch denormalized Oct 25, 2024
Copy link
Member

@mfraezz mfraezz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Generally looks pretty good. A couple questions and some potential tidying, but no obvious bugs. Good test coverage.

Pass complete :octocat:

Comment on lines -335 to -343
ELASTICSEARCH8_CLUSTER_SETTINGS = {
'URL': ELASTICSEARCH8_URL,
'AUTH': (
(ELASTICSEARCH8_USERNAME, ELASTICSEARCH8_SECRET)
if ELASTICSEARCH8_SECRET is not None
else None
),
'CERT_PATH': ELASTICSEARCH8_CERT_PATH,
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do these CLUSTER_SETTINGS not need to be specified?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not anymore -- one of the cleanup changes i kept from #824 was removing the index-strategy config from django settings and instead making (for example) the base elastic8 strategy use settings.ELASTICSEARCH8_URL directly

will be worth a fresh rethink if index strategies are ever abstracted into a reusable form, but as it is seemed worse the more strategies i added

Comment on lines -63 to +70
# bad calls:
with pytest.raises(IndexStrategyError):
IndexStrategy.get_for_sharev2_search('bad-request')
with pytest.raises(IndexStrategyError):
IndexStrategy.get_for_sharev2_search()
with pytest.raises(IndexStrategyError):
IndexStrategy.get_for_sharev2_search(requested_name=None)
with pytest.raises(IndexStrategyError):
get_index_for_sharev2_search('bad-request')
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are any of the other error cases still worth testing?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the other two aren't errors anymore; they now return a default that depends on settings and feature flags

...should add tests for that, tho



class TestTroveIndexcardFlats(RealElasticTestCase):
class TestTroveIndexcardFlats(_common_trovesearch_tests.CommonTrovesearchTests):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 to DRYing tests

@@ -235,7 +235,7 @@ def trove_browse_link(iri: str):
TROVE.valueSearchFilter,
TROVE.pageSize,
TROVE.pageCursor,
TROVE.sort,
# TROVE.sort,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Value-search doesn't support sorting now?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

never did, tbh -- as implemented, has always ignored any sort param and sorted from most- to least-used, and changing that seemed like it'd raise more questions and be more difficult than correcting the docs

Comment on lines +150 to +159
# delete all per-value docs (to account for missing values)
self.es8_client.delete_by_query(
index=list(indexnames),
query={'bool': {'must': [
{'terms': {'card.card_pk': messages_chunk.target_ids_chunk}},
{'exists': {'field': 'iri_value.value_iri'}},
]}},
)
# (possible optimization: instead, hold onto doc_ids and (in `after_chunk`?)
# delete_by_query excluding those)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For my own understanding, can you explain what/why this is doing? When messages come in chunked (e.g. for an index backfill), this will delete existing PKs from that incoming chunk if there's a value_iri?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the denorm index has a document for the card itself and additional documents for each iri value in the card contents (with the card object duplicated across all of them, so a value-search can filter by anything on the card)

{card: ...}
{card: ..., iri_value: ...}
{card: ..., iri_value: ...}
{card: ..., iri_value: ...}
...

after indexed like that, if a subsequent update is missing any previous iri_value (say, a subject or contributor is removed), the corresponding value docs need to be deleted -- the easy/naive path was this delete_by_query to drop all value docs for the chunk of cards that are about to be indexed

(note it's not just for backfills -- the urgent queue is chunked as well, tho you'll more often get chunks of size one)

index=self.indexname,
)
except elasticsearch8.TransportError as error:
raise exceptions.IndexStrategyError() from error # TODO: error messaging
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor: how annoying would it be to add error messaging instead of the TODOs here and above?

if _iri_aggs:
_buckets = _iri_aggs['buckets']
_bucket_count = len(_buckets)
# WARNING: terribly inefficient pagination (part two)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any guesses as to how terribly-inefficient -- worth benchmarking? It looks fairly similar to the FlatsIndexStrat's pagination

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it is indeed same pagination as in flats -- "terribly inefficient" was a label i applied while coding (and copy-pasted here), but in practice it's not too much slower to load N pages and discard N-1 of them (since N is small in practice)

maybe "terribly hacky" is a better descriptor

# cursor implementations

@dataclasses.dataclass
class _SimpleCursor:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some of the other dataclasses in here are very-similar-but-somewhat-different-from some of the dataclasses found in indexcard_flats, but it looks like the only change in this one as compared to index_card_flats._SimpleCursor is the assert in from_page_param.

Should this (and/or others, potentially) be DRY'd up a bit?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

my intent was/is to delete trove_indexcard_flats as soon as trovesearch_denorm is in stable use, so tried to avoid putting much effort into updating it further -- but agree there are a few things (like cursors) that would make more sense reused than duplicated, yeah

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants