Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix(memory-store, agents-api): Fix distance and confidence values for vector search #1035

Merged
merged 3 commits into from
Jan 9, 2025

Conversation

Ahmad-mtos
Copy link
Contributor

@Ahmad-mtos Ahmad-mtos commented Jan 9, 2025

PR Type

Bug fix, Enhancement


Description

  • Fixed the calculation of confidence values in vector search queries.

  • Added a new SQL function search_by_vector for vector-based document search.

  • Introduced input validation and improved query logic in the search_by_vector function.

  • Updated migration files to include the new SQL function.


Changes walkthrough 📝

Relevant files
Bug fix
search_docs_by_embedding.py
Fix confidence value in embedding search                                 

agents-api/agents_api/queries/docs/search_docs_by_embedding.py

  • Corrected the confidence value passed to the query.
+1/-1     
search_docs_hybrid.py
Fix confidence value in hybrid search                                       

agents-api/agents_api/queries/docs/search_docs_hybrid.py

  • Corrected the confidence value passed to the hybrid search query.
+1/-1     
Enhancement
000022_vector_search.down.sql
Add migration to drop vector search function                         

memory-store/migrations/000022_vector_search.down.sql

  • Added a migration to drop the search_by_vector function.
+5/-0     
000022_vector_search.up.sql
Add SQL function for vector-based document search               

memory-store/migrations/000022_vector_search.up.sql

  • Added a new SQL function search_by_vector.
  • Included input validation for parameters like k and confidence.
  • Implemented logic for owner and metadata filtering.
  • Defined the query execution for vector-based document search.
  • +91/-0   

    💡 PR-Agent usage: Comment /help "your question" on any pull request to receive relevant information


    Important

    Fixes confidence value in vector search and adds new SQL function for vector-based document search with input validation.

    • Bug Fix:
      • Corrected confidence value in search_docs_by_embedding.py and search_docs_hybrid.py.
    • Enhancement:
      • Added search_by_vector SQL function in 000022_vector_search.up.sql with input validation and query logic for vector-based document search.
      • Added migration to drop search_by_vector function in 000022_vector_search.down.sql.

    This description was created by Ellipsis for fc527fa. It will automatically update as commits are pushed.

    Copy link
    Contributor

    PR Reviewer Guide 🔍

    Here are some key observations to aid the review process:

    ⏱️ Estimated effort to review: 4 🔵🔵🔵🔵⚪
    🧪 No relevant tests
    🔒 Security concerns

    SQL Injection:
    The function uses dynamic SQL with format() to construct the query. While the current usage appears safe since the variables owner_filter_sql and metadata_filter_sql are internally constructed, extra care should be taken to ensure no user input can influence these SQL fragments.

    ⚡ Recommended focus areas for review

    Distance Calculation

    The distance calculation formula ((1 - (d.embedding <=> $1)) + 1) * 0.5 appears twice in the query. Consider extracting it to a CTE or variable to avoid potential inconsistencies and improve maintainability.

        ((1 - (d.embedding <=> $1)) + 1) * 0.5 as distance,
        d.embedding,
        d.metadata,
        doc_owners.owner_type,
        doc_owners.owner_id
    FROM docs_embeddings d
    LEFT JOIN doc_owners ON d.doc_id = doc_owners.doc_id
    WHERE d.developer_id = $7
    AND ((1 - (d.embedding <=> $1)) + 1) * 0.5 >= $2
    Error Handling

    The owner array length validation may have a logic error. The condition checks for array length <= 0 but this should likely be a separate check from the array length equality comparison.

    IF owner_types IS NOT NULL AND owner_ids IS NOT NULL AND
        array_length(owner_types, 1) != array_length(owner_ids, 1) AND
        array_length(owner_types, 1) <= 0 THEN
        RAISE EXCEPTION 'owner_types and owner_ids arrays must have the same length';
    END IF;
    Query Performance

    The query fetches 4 times more candidates than needed (k * 4) before applying DISTINCT. This could impact performance with large result sets. Consider if this multiplication factor is optimal.

        LIMIT ($3 * 4)  -- Get more candidates than needed
    )
    SELECT DISTINCT ON (doc_id) *
    FROM ranked_docs
    ORDER BY doc_id, distance DESC
    LIMIT $3',

    Copy link
    Contributor

    qodo-merge-pro-for-open-source bot commented Jan 9, 2025

    PR Code Suggestions ✨

    Explore these optional code suggestions:

    CategorySuggestion                                                                                                                                    Score
    Possible issue
    Incorrect mathematical transformation of cosine similarity scores leads to inaccurate search results

    The distance calculation formula ((1 - (d.embedding <=> $1)) + 1) * 0.5 is incorrect
    and will produce skewed similarity scores. The cosine distance from the <=> operator is
    already normalized between 0 and 1, so the transformation is unnecessary and
    mathematically wrong.

    memory-store/migrations/000022_vector_search.up.sql [57]

    -((1 - (d.embedding <=> $1)) + 1) * 0.5 as distance,
    +(1 - (d.embedding <=> $1)) as distance,
    • Apply this suggestion
    Suggestion importance[1-10]: 9

    Why: The current transformation of cosine similarity is mathematically incorrect and would significantly distort search results by applying unnecessary scaling to already normalized scores.

    9
    ✅ Array length validation logic contains a contradiction that could allow invalid input
    Suggestion Impact:The commit modified the array length validation logic, but took a different approach by removing the redundant array_length <= 0 check entirely

    code diff:

         IF owner_types IS NOT NULL AND owner_ids IS NOT NULL AND
    -        array_length(owner_types, 1) != array_length(owner_ids, 1) AND
    -        array_length(owner_types, 1) <= 0 THEN
    +        array_length(owner_types, 1) != array_length(owner_ids, 1) THEN

    The owner filter validation has incorrect logic. The condition
    array_length(owner_types, 1) <= 0 should be moved outside the AND clause as it
    contradicts the previous length comparison check.

    memory-store/migrations/000022_vector_search.up.sql [26-28]

    -IF owner_types IS NOT NULL AND owner_ids IS NOT NULL AND
    -    array_length(owner_types, 1) != array_length(owner_ids, 1) AND
    -    array_length(owner_types, 1) <= 0 THEN
    +IF owner_types IS NOT NULL AND owner_ids IS NOT NULL AND (
    +    array_length(owner_types, 1) != array_length(owner_ids, 1) OR
    +    array_length(owner_types, 1) <= 0) THEN
    • Apply this suggestion
    Suggestion importance[1-10]: 8

    Why: The current validation logic is flawed as it combines contradictory conditions with AND, potentially allowing invalid array lengths to pass validation, which could lead to runtime errors.

    8
    General
    Suboptimal sorting in DISTINCT ON clause may exclude better search matches

    The DISTINCT ON clause with ORDER BY doc_id could potentially eliminate better
    matches in favor of worse ones when multiple records exist for the same doc_id. The
    distance should be the primary sorting criterion.

    memory-store/migrations/000022_vector_search.up.sql [71-73]

     SELECT DISTINCT ON (doc_id) *
     FROM ranked_docs
    -ORDER BY doc_id, distance DESC
    +ORDER BY doc_id, distance DESC, owner_type
    • Apply this suggestion
    Suggestion importance[1-10]: 5

    Why: While the current sorting works, the suggested change could provide marginally better results by ensuring the highest similarity matches are retained when duplicates exist.

    5

    Copy link
    Contributor

    @ellipsis-dev ellipsis-dev bot left a comment

    Choose a reason for hiding this comment

    The reason will be displayed to describe this comment to others. Learn more.

    ❌ Changes requested. Reviewed everything up to db3e2f2 in 1 minute and 44 seconds

    More details
    • Looked at 139 lines of code in 4 files
    • Skipped 0 files when reviewing.
    • Skipped posting 2 drafted comments based on config settings.
    1. memory-store/migrations/000022_vector_search.up.sql:57
    • Draft comment:
      Consider calculating the normalized distance once and reusing it to avoid repetition and potential maintenance issues.
    • Reason this comment was not posted:
      Confidence changes required: 50%
      The SQL function search_by_vector uses a formula to calculate distance which seems to be a normalization of the distance metric. This formula is repeated multiple times in the code, which can lead to maintenance issues if it needs to be changed. It would be better to calculate it once and reuse the result.
    2. memory-store/migrations/000022_vector_search.up.sql:3
    • Draft comment:
      This appears to be a duplicate of the existing search_by_vector function. The only differences are in score normalization, which doesn't affect the core functionality. Consider modifying the existing function if changes are needed.

    • function search_by_vector (000018_doc_search.up.sql)

    • Reason this comment was not posted:
      Decided after close inspection that this draft comment was likely wrong and/or not actionable:
      This appears to be a migration file that's deliberately replacing an existing function with an updated version that changes how vector similarity scores are calculated and normalized. Migration files are meant to evolve the database schema and functions over time. The fact that it's using CREATE OR REPLACE suggests this is intentional, not accidental duplication.
      The comment raises a valid point about similarity between functions, but may be missing the purpose of database migrations.
      Since this is in a migration file using CREATE OR REPLACE, the similarity is intentional - it's meant to update the existing function with new score normalization logic.
      This comment should be deleted as it misunderstands the purpose of migration files, which are meant to evolve database functions over time.

    Workflow ID: wflow_CzbnGakj4aDrz7nk


    Want Ellipsis to fix these issues? Tag @ellipsis-dev in a comment. You can customize Ellipsis with 👍 / 👎 feedback, review rules, user-specific overrides, quiet mode, and more.

    memory-store/migrations/000022_vector_search.up.sql Outdated Show resolved Hide resolved
    Base automatically changed from x/openai-embed to dev January 9, 2025 11:01
    Copy link
    Contributor

    @ellipsis-dev ellipsis-dev bot left a comment

    Choose a reason for hiding this comment

    The reason will be displayed to describe this comment to others. Learn more.

    👍 Looks good to me! Incremental review on fc527fa in 34 seconds

    More details
    • Looked at 14 lines of code in 1 files
    • Skipped 0 files when reviewing.
    • Skipped posting 1 drafted comments based on config settings.
    1. memory-store/migrations/000022_vector_search.up.sql:27
    • Draft comment:
      Consider adding a check to ensure owner_types and owner_ids are not empty arrays. This will prevent potential issues when these arrays are empty but not null.
    • Reason this comment was not posted:
      Decided after close inspection that this draft comment was likely wrong and/or not actionable:
      The original code had a check for array length <= 0, which was removed. However, this check was logically incorrect since array_length returns NULL for empty arrays in PostgreSQL, not 0 or negative numbers. The remaining check for length mismatch between owner_types and owner_ids is sufficient since it will handle NULL cases. Empty arrays would be handled gracefully by the ANY operator in the query.
      Maybe empty arrays could cause unexpected behavior in the owner filtering logic? The ANY operator's behavior with empty arrays should be verified.
      In PostgreSQL, the ANY operator with an empty array returns false for all comparisons, which is actually the desired behavior here - it would effectively filter out all results, which makes sense if no owners are specified.
      The comment should be deleted. The removed check was incorrect, and empty arrays are handled appropriately by PostgreSQL's ANY operator in the query.

    Workflow ID: wflow_efGy5W5A6PTQa222


    You can customize Ellipsis with 👍 / 👎 feedback, review rules, user-specific overrides, quiet mode, and more.

    @whiterabbit1983 whiterabbit1983 merged commit cf9ffab into dev Jan 9, 2025
    9 of 10 checks passed
    @whiterabbit1983 whiterabbit1983 deleted the x/search-by-vector branch January 9, 2025 11:31
    Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
    Projects
    None yet
    Development

    Successfully merging this pull request may close these issues.

    2 participants