fix(memory-store, agents-api): Fix distance and confidence values for vector search #1035

Ahmad-mtos · 2025-01-09T10:38:47Z

PR Type

Bug fix, Enhancement

Description

Fixed the calculation of confidence values in vector search queries.
Added a new SQL function search_by_vector for vector-based document search.
Introduced input validation and improved query logic in the search_by_vector function.
Updated migration files to include the new SQL function.

Changes walkthrough 📝

Relevant files

Bug fix

search_docs_by_embedding.py `Fix confidence value in embedding search` agents-api/agents_api/queries/docs/search_docs_by_embedding.py Corrected the confidence value passed to the query.	+1/-1
search_docs_hybrid.py `Fix confidence value in hybrid search` agents-api/agents_api/queries/docs/search_docs_hybrid.py Corrected the confidence value passed to the hybrid search query.	+1/-1

Enhancement

000022_vector_search.down.sql `Add migration to drop vector search function` memory-store/migrations/000022_vector_search.down.sql Added a migration to drop the `search_by_vector` function.	+5/-0
000022_vector_search.up.sql `Add SQL function for vector-based document search` memory-store/migrations/000022_vector_search.up.sql Added a new SQL function `search_by_vector`. Included input validation for parameters like `k` and `confidence`. Implemented logic for owner and metadata filtering. Defined the query execution for vector-based document search.	+91/-0

💡 PR-Agent usage: Comment /help "your question" on any pull request to receive relevant information

Important

Fixes confidence value in vector search and adds new SQL function for vector-based document search with input validation.

Bug Fix:
- Corrected confidence value in search_docs_by_embedding.py and search_docs_hybrid.py.
Enhancement:
- Added search_by_vector SQL function in 000022_vector_search.up.sql with input validation and query logic for vector-based document search.
- Added migration to drop search_by_vector function in 000022_vector_search.down.sql.

^{This description was created by}^{for fc527fa. It will automatically update as commits are pushed.}

qodo-merge-pro-for-open-source · 2025-01-09T10:39:13Z

PR Reviewer Guide 🔍

Here are some key observations to aid the review process:

⏱️ Estimated effort to review: 4 🔵🔵🔵🔵⚪
🧪 No relevant tests
🔒 Security concerns SQL Injection: The function uses dynamic SQL with format() to construct the query. While the current usage appears safe since the variables owner_filter_sql and metadata_filter_sql are internally constructed, extra care should be taken to ensure no user input can influence these SQL fragments.
⚡ Recommended focus areas for review Distance Calculation The distance calculation formula `((1 - (d.embedding <=> $1)) + 1) * 0.5` appears twice in the query. Consider extracting it to a CTE or variable to avoid potential inconsistencies and improve maintainability. ((1 - (d.embedding <=> $1)) + 1) * 0.5 as distance, d.embedding, d.metadata, doc_owners.owner_type, doc_owners.owner_id FROM docs_embeddings d LEFT JOIN doc_owners ON d.doc_id = doc_owners.doc_id WHERE d.developer_id = $7 AND ((1 - (d.embedding <=> $1)) + 1) * 0.5 >= $2 Error Handling The owner array length validation may have a logic error. The condition checks for array length <= 0 but this should likely be a separate check from the array length equality comparison. IF owner_types IS NOT NULL AND owner_ids IS NOT NULL AND array_length(owner_types, 1) != array_length(owner_ids, 1) AND array_length(owner_types, 1) <= 0 THEN RAISE EXCEPTION 'owner_types and owner_ids arrays must have the same length'; END IF; Query Performance The query fetches 4 times more candidates than needed (k * 4) before applying DISTINCT. This could impact performance with large result sets. Consider if this multiplication factor is optimal. LIMIT ($3 * 4) -- Get more candidates than needed ) SELECT DISTINCT ON (doc_id) * FROM ranked_docs ORDER BY doc_id, distance DESC LIMIT $3',

qodo-merge-pro-for-open-source · 2025-01-09T10:39:42Z

PR Code Suggestions ✨

Explore these optional code suggestions:

Category	Suggestion	Score
Possible issue	Incorrect mathematical transformation of cosine similarity scores leads to inaccurate search results The distance calculation formula `((1 - (d.embedding <=> $1)) + 1) * 0.5` is incorrect and will produce skewed similarity scores. The cosine distance from the <=> operator is already normalized between 0 and 1, so the transformation is unnecessary and mathematically wrong. memory-store/migrations/000022_vector_search.up.sql [57] -((1 - (d.embedding <=> $1)) + 1) * 0.5 as distance, +(1 - (d.embedding <=> $1)) as distance, Apply this suggestion Suggestion importance[1-10]: 9 Why: The current transformation of cosine similarity is mathematically incorrect and would significantly distort search results by applying unnecessary scaling to already normalized scores.	9
Possible issue	✅ Array length validation logic contains a contradiction that could allow invalid input Suggestion Impact: The commit modified the array length validation logic, but took a different approach by removing the redundant array_length <= 0 check entirely code diff: IF owner_types IS NOT NULL AND owner_ids IS NOT NULL AND - array_length(owner_types, 1) != array_length(owner_ids, 1) AND - array_length(owner_types, 1) <= 0 THEN + array_length(owner_types, 1) != array_length(owner_ids, 1) THEN The owner filter validation has incorrect logic. The condition `array_length(owner_types, 1) <= 0` should be moved outside the AND clause as it contradicts the previous length comparison check. memory-store/migrations/000022_vector_search.up.sql [26-28] -IF owner_types IS NOT NULL AND owner_ids IS NOT NULL AND - array_length(owner_types, 1) != array_length(owner_ids, 1) AND - array_length(owner_types, 1) <= 0 THEN +IF owner_types IS NOT NULL AND owner_ids IS NOT NULL AND ( + array_length(owner_types, 1) != array_length(owner_ids, 1) OR + array_length(owner_types, 1) <= 0) THEN Apply this suggestion Suggestion importance[1-10]: 8 Why: The current validation logic is flawed as it combines contradictory conditions with AND, potentially allowing invalid array lengths to pass validation, which could lead to runtime errors.	8
General	Suboptimal sorting in DISTINCT ON clause may exclude better search matches The DISTINCT ON clause with ORDER BY doc_id could potentially eliminate better matches in favor of worse ones when multiple records exist for the same doc_id. The distance should be the primary sorting criterion. memory-store/migrations/000022_vector_search.up.sql [71-73] SELECT DISTINCT ON (doc_id) * FROM ranked_docs -ORDER BY doc_id, distance DESC +ORDER BY doc_id, distance DESC, owner_type Apply this suggestion Suggestion importance[1-10]: 5 Why: While the current sorting works, the suggested change could provide marginally better results by ensuring the highest similarity matches are retained when duplicates exist.	5

ellipsis-dev

❌ Changes requested. Reviewed everything up to db3e2f2 in 1 minute and 44 seconds

More details

Looked at 139 lines of code in 4 files
Skipped 0 files when reviewing.
Skipped posting 2 drafted comments based on config settings.

1. memory-store/migrations/000022_vector_search.up.sql:57

Draft comment:
Consider calculating the normalized distance once and reusing it to avoid repetition and potential maintenance issues.
Reason this comment was not posted:
Confidence changes required: 50%
The SQL function search_by_vector uses a formula to calculate distance which seems to be a normalization of the distance metric. This formula is repeated multiple times in the code, which can lead to maintenance issues if it needs to be changed. It would be better to calculate it once and reuse the result.

2. memory-store/migrations/000022_vector_search.up.sql:3

Draft comment:
This appears to be a duplicate of the existing search_by_vector function. The only differences are in score normalization, which doesn't affect the core functionality. Consider modifying the existing function if changes are needed.
function search_by_vector (000018_doc_search.up.sql)
Reason this comment was not posted:
Decided after close inspection that this draft comment was likely wrong and/or not actionable:
This appears to be a migration file that's deliberately replacing an existing function with an updated version that changes how vector similarity scores are calculated and normalized. Migration files are meant to evolve the database schema and functions over time. The fact that it's using CREATE OR REPLACE suggests this is intentional, not accidental duplication.
The comment raises a valid point about similarity between functions, but may be missing the purpose of database migrations.
Since this is in a migration file using CREATE OR REPLACE, the similarity is intentional - it's meant to update the existing function with new score normalization logic.
This comment should be deleted as it misunderstands the purpose of migration files, which are meant to evolve database functions over time.

Workflow ID: wflow_CzbnGakj4aDrz7nk

Want Ellipsis to fix these issues? Tag @ellipsis-dev in a comment. You can customize Ellipsis with 👍 / 👎 feedback, review rules, user-specific overrides, quiet mode, and more.

memory-store/migrations/000022_vector_search.up.sql

ellipsis-dev

👍 Looks good to me! Incremental review on fc527fa in 34 seconds

More details

Looked at 14 lines of code in 1 files
Skipped 0 files when reviewing.
Skipped posting 1 drafted comments based on config settings.

1. memory-store/migrations/000022_vector_search.up.sql:27

Draft comment:
Consider adding a check to ensure owner_types and owner_ids are not empty arrays. This will prevent potential issues when these arrays are empty but not null.
Reason this comment was not posted:
Decided after close inspection that this draft comment was likely wrong and/or not actionable:
The original code had a check for array length <= 0, which was removed. However, this check was logically incorrect since array_length returns NULL for empty arrays in PostgreSQL, not 0 or negative numbers. The remaining check for length mismatch between owner_types and owner_ids is sufficient since it will handle NULL cases. Empty arrays would be handled gracefully by the ANY operator in the query.
Maybe empty arrays could cause unexpected behavior in the owner filtering logic? The ANY operator's behavior with empty arrays should be verified.
In PostgreSQL, the ANY operator with an empty array returns false for all comparisons, which is actually the desired behavior here - it would effectively filter out all results, which makes sense if no owners are specified.
The comment should be deleted. The removed check was incorrect, and empty arrays are handled appropriately by PostgreSQL's ANY operator in the query.

Workflow ID: wflow_efGy5W5A6PTQa222

You can customize Ellipsis with 👍 / 👎 feedback, review rules, user-specific overrides, quiet mode, and more.

fix(memory-store, agents-api): Fix distance and confidence values

db3e2f2

qodo-merge-pro-for-open-source bot added the Review effort [1-5]: 4 label Jan 9, 2025

ellipsis-dev bot reviewed Jan 9, 2025

View reviewed changes

memory-store/migrations/000022_vector_search.up.sql Outdated Show resolved Hide resolved

Base automatically changed from x/openai-embed to dev January 9, 2025 11:01

whiterabbit1983 and others added 2 commits January 9, 2025 14:01

Merge branch 'dev' into x/search-by-vector

ca3cdf3

hotfix

fc527fa

ellipsis-dev bot reviewed Jan 9, 2025

View reviewed changes

whiterabbit1983 merged commit cf9ffab into dev Jan 9, 2025
9 of 10 checks passed

whiterabbit1983 deleted the x/search-by-vector branch January 9, 2025 11:31

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(memory-store, agents-api): Fix distance and confidence values for vector search #1035

fix(memory-store, agents-api): Fix distance and confidence values for vector search #1035

Ahmad-mtos commented Jan 9, 2025 •

edited by ellipsis-dev bot

Loading

qodo-merge-pro-for-open-source bot commented Jan 9, 2025

qodo-merge-pro-for-open-source bot commented Jan 9, 2025 •

edited

Loading

ellipsis-dev bot left a comment

ellipsis-dev bot left a comment

fix(memory-store, agents-api): Fix distance and confidence values for vector search #1035

fix(memory-store, agents-api): Fix distance and confidence values for vector search #1035

Conversation

Ahmad-mtos commented Jan 9, 2025 • edited by ellipsis-dev bot Loading

PR Type

Description

Changes walkthrough 📝

qodo-merge-pro-for-open-source bot commented Jan 9, 2025

PR Reviewer Guide 🔍

qodo-merge-pro-for-open-source bot commented Jan 9, 2025 • edited Loading

PR Code Suggestions ✨

ellipsis-dev bot left a comment

Choose a reason for hiding this comment

ellipsis-dev bot left a comment

Choose a reason for hiding this comment

Ahmad-mtos commented Jan 9, 2025 •

edited by ellipsis-dev bot

Loading

qodo-merge-pro-for-open-source bot commented Jan 9, 2025 •

edited

Loading