Get llm_filter to support document structure + similarity sorting for elements #876

baitsguy · 2024-10-04T17:48:14Z

1. Support document model in llm_filter and some performance improvements
llm_filter now works on a DocSet made of original Documents (i.e. in their pre-exploded state) when use_elements=True is set.

Element iteration "early stopping": a record (document) passes the filter as soon as any of its elements passes the llm_filter condition, so we don't evaluate every element. This helps for documents that are a hit, but not for documents that are not, because we still iterate over every element.
Optionally, elements are sorted by a similarity score so that potential LLM hits are reached sooner.
A keep_none flag dictates the behavior for a missing property value.

None of these are perfect, but they align with the use cases we have so far and we can keep iterating.
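
A minimal sketch of the element-level behavior described above, in plain Python rather than the actual sycamore code. The `Element` and `Document` classes, the `similarity` field, and the `document_passes` and `element_passes` names are hypothetical stand-ins. Treating a missing value as a pass when `keep_none` is set is an assumption about how the flag is meant to behave.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Element:
    # Hypothetical stand-in for a document element.
    text: Optional[str]
    similarity: float = 0.0  # assumed to be precomputed elsewhere


@dataclass
class Document:
    # Hypothetical stand-in for an un-exploded document.
    elements: List[Element] = field(default_factory=list)


def document_passes(
    doc: Document,
    element_passes: Callable[[str], bool],
    sort_by_similarity: bool = False,
    keep_none: bool = False,
) -> bool:
    """Return True as soon as one element satisfies the (LLM) predicate."""
    elements = doc.elements
    if sort_by_similarity:
        # Visit the most similar elements first so likely hits come early.
        elements = sorted(elements, key=lambda e: e.similarity, reverse=True)
    for element in elements:
        if element.text is None:
            # Assumption: keep_none decides whether a missing value is a pass.
            if keep_none:
                return True
            continue
        if element_passes(element.text):
            return True  # early stop: remaining elements are not evaluated
    # A miss still costs one predicate call per element with a value.
    return False
```

With the sort enabled, a document whose best-scoring element is a hit costs a single predicate call. A document that does not pass still costs one call per element that has a value, which matches the limitation noted above.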

2. Sorted elements returned by OpenSearchReader in reconstructed documents
Uses the element.element_index field to sort elements when reconstructing a document. This does an explicit sort after each reconstruction, which is acceptable given the magnitude of data involved.
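
The reconstruction step can be sketched as follows. This is an illustration only, not the OpenSearchReader implementation. The flat `hits` list of dicts, the `parent_id` key, and the `reconstruct_documents` name are hypothetical. Only the idea of ordering by `element_index` comes from the description above, and the sketch assumes every element carries that field.

```python
from collections import defaultdict
from typing import Any, Dict, List


def reconstruct_documents(hits: List[Dict[str, Any]]) -> Dict[str, List[Dict[str, Any]]]:
    """Group element records by parent document, then order each group."""
    docs: Dict[str, List[Dict[str, Any]]] = defaultdict(list)
    for hit in hits:
        # "parent_id" is a hypothetical key linking an element to its document.
        docs[hit["parent_id"]].append(hit)
    for elements in docs.values():
        # Explicit sort after each reconstruction, keyed on element_index.
        elements.sort(key=lambda e: e["element_index"])
    return dict(docs)


hits = [
    {"parent_id": "d1", "element_index": 2, "text": "c"},
    {"parent_id": "d2", "element_index": 0, "text": "x"},
    {"parent_id": "d1", "element_index": 0, "text": "a"},
    {"parent_id": "d1", "element_index": 1, "text": "b"},
]
docs = reconstruct_documents(hits)
```

Sorting each document's elements separately keeps the cost at O(n log n) per document, which is small as long as the number of elements per document stays modest.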

Note: the ignore_doc_structure flag is there to keep current usage from breaking; once we update other code to use the document model, we can remove or flip it.

lib/sycamore/sycamore/connectors/opensearch/opensearch_reader.py

lib/sycamore/sycamore/docset.py

mdwelsh

A few nits but LGTM

lib/sycamore/sycamore/docset.py

lib/sycamore/sycamore/connectors/opensearch/opensearch_reader.py

baitsguy added 3 commits October 4, 2024 10:46

Get llm_filter to support document structure + similarity sorting for…

ab4ba87

… elements

doc property filter

17a2a82

PR fix

32ad281

baitsguy requested review from mdwelsh and bsowell October 4, 2024 18:35

baitsguy marked this pull request as ready for review October 4, 2024 18:39

mdwelsh reviewed Oct 4, 2024

pr comments

c6186a8

baitsguy requested a review from mdwelsh October 4, 2024 22:14

mdwelsh approved these changes Oct 4, 2024

pr comments

96ebadd

baitsguy enabled auto-merge (squash) October 4, 2024 23:37

baitsguy merged commit d3c4718 into main Oct 4, 2024
10 of 11 checks passed
