Elasticsearch 'fields' parameter not usable in similarity_search* methods #62

Open
krauhen opened this issue Jan 30, 2025 · 1 comment

krauhen commented Jan 30, 2025

Problem

When I use any of the similarity_search* methods, I cannot retrieve all fields of my Elasticsearch index. A custom doc_builder has no effect, because the fields I want to return are already missing from the hits returned by the underlying search call.

I investigated and found that the fields parameter of the underlying search method is not handled correctly.
This is, for example, one of the places where this happens:

I tested this: if you copy the body of a similarity_search* method, pass the field list to the search call, and specify a custom doc_builder, you can retrieve fields other than just your text_field and metadata.
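To illustrate why the doc_builder alone cannot help, here is a small simulation. It does not call Elasticsearch or langchain; `simulate_source_filter` is a hypothetical stand-in for a search that only returns the requested fields, and the hit layout is an assumption modeled on the `_source` access used in the hot fix.

```python
from typing import Any, Dict, List, Optional


def simulate_source_filter(hit: Dict[str, Any], fields: List[str]) -> Dict[str, Any]:
    """Toy stand-in for a search call that only returns the requested fields."""
    source = {key: value for key, value in hit["_source"].items() if key in fields}
    return {**hit, "_source": source}


def embedding_from_hit(hit: Dict[str, Any]) -> Optional[List[float]]:
    """What a custom doc_builder would see when it asks for the extra field."""
    return hit["_source"].get("EMBEDDING_VECTOR")


stored_hit = {
    "_id": "1",
    "_score": 0.9,
    "_source": {"text": "hello", "metadata": {}, "EMBEDDING_VECTOR": [0.1, 0.2]},
}

# Only text and metadata are requested: the extra field never reaches the builder.
narrow_hit = simulate_source_filter(stored_hit, ["text", "metadata"])
# The field list is extended: the builder can now read the extra field.
wide_hit = simulate_source_filter(stored_hit, ["text", "metadata", "EMBEDDING_VECTOR"])

print(embedding_from_hit(narrow_hit))  # None
print(embedding_from_hit(wide_hit))    # [0.1, 0.2]
```

In the narrow case the builder has nothing to work with, which matches the behavior described above.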

Example code

...
vector_store = ...
query_embedding = ...
k = ...

response = vector_store.similarity_search_by_vector_with_relevance_scores(
    embedding=query_embedding,
    k=k
)

Hot fix

...
def custom_doc_builder(hit: Dict) -> Document:
    doc = Document(
        page_content=hit["_source"].get("text", ""),
        metadata=hit["_source"].get("metadata", {}),
    )
    doc.metadata["EMBEDDING_VECTOR"] = hit["_source"].get("EMBEDDING_VECTOR", "")
    return doc

vector_store = ...
query_embedding = ...
k = ...
filter = ...
custom_query = ...

fields = ["text", "metadata", "EMBEDDING_VECTOR"]

hits = vector_store._store.search(
    query=None,
    query_vector=query_embedding,
    k=k,
    filter=filter,
    fields=fields,
    custom_query=custom_query,
)
docs = _hits_to_docs_scores(
    hits=hits,
    content_field=vector_store.query_field,
    doc_builder=custom_doc_builder,
)

Solution/Fix

So I would suggest updating this line:


to this:

...
def similarity_search_by_vector_with_relevance_scores(
    self,
    embedding: List[float],
    k: int = 4,
    filter: Optional[List[Dict]] = None,
    fields: Optional[List[str]] = None,           # <===
    *,
    custom_query: Optional[
        Callable[[Dict[str, Any], Optional[str]], Dict[str, Any]]
    ] = None,
    doc_builder: Optional[Callable[[Dict], Document]] = None,
    **kwargs: Any,
) -> List[Tuple[Document, float]]:
...
hits = self._store.search(
    query=None,
    query_vector=embedding,
    k=k,
    filter=filter,
    fields=fields,                                      # <===
    custom_query=custom_query,
)
...

in every similarity_search* occurrence in this file.

If you agree, I can create a PR with the fix.
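One possible way to check that a method really forwards the parameter is a small test against a stubbed store. This is only a sketch: `SketchVectorStore` is a hypothetical stand-in written for this example, not the class from the library, and only the forwarding logic is reproduced.

```python
from typing import Any, Dict, List, Optional
from unittest.mock import MagicMock


class SketchVectorStore:
    """Hypothetical stand-in; only the parameter forwarding is shown."""

    def __init__(self, store: Any) -> None:
        self._store = store

    def similarity_search_by_vector(
        self,
        embedding: List[float],
        k: int = 4,
        fields: Optional[List[str]] = None,
    ) -> List[Dict[str, Any]]:
        # Forward `fields` to the underlying search instead of dropping it.
        return self._store.search(
            query=None,
            query_vector=embedding,
            k=k,
            fields=fields,
        )


# Stub the underlying store so no Elasticsearch instance is needed.
stub_store = MagicMock()
stub_store.search.return_value = []

vector_store = SketchVectorStore(stub_store)
vector_store.similarity_search_by_vector(
    [0.1, 0.2], k=2, fields=["text", "EMBEDDING_VECTOR"]
)

# Inspect the keyword arguments the stub received.
forwarded_fields = stub_store.search.call_args.kwargs["fields"]
print(forwarded_fields)
```

The same kind of check could be repeated for each similarity_search* method that gains the parameter.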

EDIT

I thought about it, and I think that in general there should be a standard document builder that looks more like this:

# Specify in user code
fields = ["text", "metadata", "EMBEDDING_VECTOR"]
....
def doc_builder(hit: Dict, fields: List[str]) -> Document:
    doc = Document(
        page_content=hit["_source"].get(content_field, ""),
        metadata=hit["_source"].get("metadata", {}),
    )
    for field_key in fields:
        doc.metadata[field_key] = hit["_source"].get(field_key, None)
    return doc

Then one can simply specify the fields, and the documents come out with all of them. I also think that at this point there should be some way to have the id of each element returned as well.
This is the current implementation:

def default_doc_builder(hit: Dict) -> Document:
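The builder proposed in the EDIT could be made self-contained roughly as follows. This is a sketch under assumptions, not the library's implementation: `Document` here is a minimal stand-in dataclass for the langchain class, `fields_doc_builder` and its `content_field` parameter are hypothetical names, and reading the id from `hit["_id"]` assumes the usual Elasticsearch hit layout.

```python
from dataclasses import dataclass, field
from functools import partial
from typing import Any, Dict, List


@dataclass
class Document:
    """Minimal stand-in for the langchain Document class (assumption)."""

    page_content: str
    metadata: Dict[str, Any] = field(default_factory=dict)


def fields_doc_builder(
    hit: Dict[str, Any], fields: List[str], content_field: str = "text"
) -> Document:
    """Build a Document and copy every requested extra field into its metadata."""
    source = hit.get("_source", {})
    doc = Document(
        page_content=source.get(content_field, ""),
        # Copy so the original hit is not mutated.
        metadata=dict(source.get("metadata", {})),
    )
    for field_key in fields:
        # The content and metadata fields are already represented on the Document.
        if field_key in (content_field, "metadata"):
            continue
        doc.metadata[field_key] = source.get(field_key)
    # Also expose the element id, as suggested above.
    doc.metadata["_id"] = hit.get("_id")
    return doc


fields = ["text", "metadata", "EMBEDDING_VECTOR"]
# Bind the field list once so the result has the single-argument builder shape.
doc_builder = partial(fields_doc_builder, fields=fields, content_field="text")

example_hit = {
    "_id": "1",
    "_source": {
        "text": "hello",
        "metadata": {"lang": "en"},
        "EMBEDDING_VECTOR": [0.1, 0.2],
    },
}
doc = doc_builder(example_hit)
print(doc.page_content, doc.metadata)
```

Unlike the snippet in the EDIT, this variant skips the content and metadata fields when copying, so the text is not duplicated inside the metadata. That is a design choice of this sketch, not something the issue requires.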

krauhen changed the title from "Elasticsearch 'fields' parameter not usable" to "Elasticsearch 'fields' parameter not usable in similarity_search* methods" on Jan 30, 2025

krauhen commented Jan 30, 2025

The changes are online here.
