-
Notifications
You must be signed in to change notification settings - Fork 58
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[ENG-6189] trovesearch denormalized #828
base: develop
Are you sure you want to change the base?
Conversation
d925cc4
to
b315c98
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Generally looks pretty good. A couple questions and some potential tidying, but no obvious bugs. Good test coverage.
Pass complete
ELASTICSEARCH8_CLUSTER_SETTINGS = { | ||
'URL': ELASTICSEARCH8_URL, | ||
'AUTH': ( | ||
(ELASTICSEARCH8_USERNAME, ELASTICSEARCH8_SECRET) | ||
if ELASTICSEARCH8_SECRET is not None | ||
else None | ||
), | ||
'CERT_PATH': ELASTICSEARCH8_CERT_PATH, | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do these CLUSTER_SETTINGS
not need to be specified?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
not anymore -- one of the cleanup changes i kept from #824 was removing the index-strategy config from django settings and instead making (for example) the base elastic8 strategy use settings.ELASTICSEARCH8_URL
directly
will be worth a fresh rethink if index strategies are ever abstracted into a reusable form, but as it is seemed worse the more strategies i added
# bad calls: | ||
with pytest.raises(IndexStrategyError): | ||
IndexStrategy.get_for_sharev2_search('bad-request') | ||
with pytest.raises(IndexStrategyError): | ||
IndexStrategy.get_for_sharev2_search() | ||
with pytest.raises(IndexStrategyError): | ||
IndexStrategy.get_for_sharev2_search(requested_name=None) | ||
with pytest.raises(IndexStrategyError): | ||
get_index_for_sharev2_search('bad-request') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are any of the other error cases still worth testing?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the other two aren't errors anymore; they now return a default that depends on settings and feature flags
...should add tests for that, tho
|
||
|
||
class TestTroveIndexcardFlats(RealElasticTestCase): | ||
class TestTroveIndexcardFlats(_common_trovesearch_tests.CommonTrovesearchTests): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍 to DRYing tests
@@ -235,7 +235,7 @@ def trove_browse_link(iri: str): | |||
TROVE.valueSearchFilter, | |||
TROVE.pageSize, | |||
TROVE.pageCursor, | |||
TROVE.sort, | |||
# TROVE.sort, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Value-search doesn't support sorting now?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
never did, tbh -- as implemented, has always ignored any sort
param and sorted from most- to least-used, and changing that seemed like it'd raise more questions and be more difficult than correcting the docs
# delete all per-value docs (to account for missing values) | ||
self.es8_client.delete_by_query( | ||
index=list(indexnames), | ||
query={'bool': {'must': [ | ||
{'terms': {'card.card_pk': messages_chunk.target_ids_chunk}}, | ||
{'exists': {'field': 'iri_value.value_iri'}}, | ||
]}}, | ||
) | ||
# (possible optimization: instead, hold onto doc_ids and (in `after_chunk`?) | ||
# delete_by_query excluding those) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For my own understanding, can you explain what/why this is doing? When messages come in chunked (e.g. for an index backfill), this will delete existing PKs from that incoming chunk if there's a value_iri
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the denorm
index has a document for the card itself and additional documents for each iri value in the card contents (with the card
object duplicated across all of them, so a value-search can filter by anything on the card)
{card: ...}
{card: ..., iri_value: ...}
{card: ..., iri_value: ...}
{card: ..., iri_value: ...}
...
after indexed like that, if a subsequent update is missing any previous iri_value
(say, a subject or contributor is removed), the corresponding value docs need to be deleted -- the easy/naive path was this delete_by_query
to drop all value docs for the chunk of cards that are about to be indexed
(note it's not just for backfills -- the urgent queue is chunked as well, tho you'll more often get chunks of size one)
index=self.indexname, | ||
) | ||
except elasticsearch8.TransportError as error: | ||
raise exceptions.IndexStrategyError() from error # TODO: error messaging |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Minor: how annoying would it be to add error messaging instead of the TODOs
here and above?
if _iri_aggs: | ||
_buckets = _iri_aggs['buckets'] | ||
_bucket_count = len(_buckets) | ||
# WARNING: terribly inefficient pagination (part two) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Any guesses as to how terribly-inefficient -- worth benchmarking? It looks fairly similar to the FlatsIndexStrat
's pagination
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it is indeed same pagination as in flats
-- "terribly inefficient" was a label i applied while coding (and copy-pasted here), but in practice it's not too much slower to load N pages and discard N-1 of them (since N is small in practice)
maybe "terribly hacky" is a better descriptor
# cursor implementations | ||
|
||
@dataclasses.dataclass | ||
class _SimpleCursor: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some of the other dataclasses
in here are very-similar-but-somewhat-different-from some of the dataclasses
found in indexcard_flats
, but it looks like the only change in this one as compared to index_card_flats._SimpleCursor
is the assert
in from_page_param
.
Should this (and/or others, potentially) be DRY'd up a bit?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
my intent was/is to delete trove_indexcard_flats
as soon as trovesearch_denorm
is in stable use, so tried to avoid putting much effort into updating it further -- but agree there are a few things (like cursors) that would make more sense reused than duplicated, yeah
trovesearch_denorm
index strategy to speed up/trove/index-card-search
and/trove/index-value-search
requestsTROVESEARCH_DENORMILY
feature flag is upnested
fields (what a wild choice that was)cardSearchFilter
/cardSearchText
params, index additional docs for each iri value in an indexcard (each with a full copy of the card's filterable fields -- this is the denormalization part)and assorted improvements from meanderings in #824
IndexStrategy
with functions inshare.search.index_strategy
Elastic8IndexStrategy
to support indexing multiple docs per index-card (required for denormalization above)share.search.index_strategy._trovesearch_util