Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(agents-api): Add Doc sql queries #979

Merged
merged 20 commits into from
Dec 23, 2024
Merged

Conversation

Vedantsahai18
Copy link
Member

@Vedantsahai18 Vedantsahai18 commented Dec 20, 2024

PR Type

Enhancement, Tests


Description

  • Implemented comprehensive document management system with:
    • Creation, retrieval, listing, and deletion operations
    • Advanced search capabilities including hybrid search, text search, and embedding-based search
    • Maximal Marginal Relevance (MMR) algorithm for diverse results
  • Added extensive test coverage for document operations
  • Optimized database schema and indexes across multiple tables
  • Added document embedding metadata support in models and API
  • Improved database functions with caching and hybrid search capabilities

Changes walkthrough 📝

Relevant files
Tests
1 files
test_docs_queries.py
Rewrite document query tests for PostgreSQL                           

agents-api/tests/test_docs_queries.py

  • Rewrote test cases for document operations using PostgreSQL
  • Added comprehensive tests for user and agent document operations
  • Added tests for document search functionality
  • Improved test coverage with more assertions and edge cases
  • +277/-163
    Enhancement
    19 files
    search_docs_hybrid.py
    Add hybrid document search functionality                                 

    agents-api/agents_api/queries/docs/search_docs_hybrid.py

  • Added hybrid search functionality combining text and embedding search
  • Implemented score normalization and fusion logic
  • Added configurable weighting between text and embedding scores
  • Added type hints and documentation
  • +158/-0 
    create_doc.py
    Implement document creation functionality                               

    agents-api/agents_api/queries/docs/create_doc.py

  • Implemented document creation with metadata and embeddings support
  • Added support for single and batch content creation
  • Added owner association logic
  • Added error handling and validation
  • +181/-0 
    list_docs.py
    Add document listing functionality                                             

    agents-api/agents_api/queries/docs/list_docs.py

  • Added document listing with filtering and pagination
  • Implemented metadata filtering
  • Added sorting options and owner filtering
  • Added error handling and input validation
  • +152/-0 
    mmr.py
    Add Maximal Marginal Relevance search                                       

    agents-api/agents_api/queries/docs/mmr.py

  • Implemented Maximal Marginal Relevance (MMR) algorithm
  • Added cosine similarity calculation
  • Added configurable diversity parameter
  • Added optimization for large result sets
  • +109/-0 
    search_docs_by_text.py
    Add text-based document search                                                     

    agents-api/agents_api/queries/docs/search_docs_by_text.py

  • Implemented text-based document search
  • Added metadata filtering support
  • Added language-specific search capabilities
  • Added error handling and validation
  • +91/-0   
    search_docs_by_embedding.py
    Add embedding-based document search                                           

    agents-api/agents_api/queries/docs/search_docs_by_embedding.py

  • Implemented vector-based document search
  • Added confidence threshold filtering
  • Added metadata filtering support
  • Added error handling and validation
  • +86/-0   
    get_doc.py
    Add document retrieval functionality                                         

    agents-api/agents_api/queries/docs/get_doc.py

  • Implemented document retrieval functionality
  • Added support for retrieving embeddings
  • Added content aggregation logic
  • Added error handling
  • +86/-0   
    delete_doc.py
    Add document deletion functionality                                           

    agents-api/agents_api/queries/docs/delete_doc.py

  • Implemented document deletion with ownership validation
  • Added cascading deletion of related records
  • Added error handling and validation
  • Added soft delete support
  • +79/-0   
    000018_doc_search.up.sql
    Add document search database functions                                     

    memory-store/migrations/000018_doc_search.up.sql

  • Added document search functions and triggers
  • Implemented caching for embeddings
  • Added hybrid search capabilities
  • Added text and vector search functions
  • +71/-87 
    000006_docs.up.sql
    Add document storage database schema                                         

    memory-store/migrations/000006_docs.up.sql

  • Created document tables and indexes
  • Added ownership and metadata support
  • Added full-text search capabilities
  • Added validation triggers
  • +30/-58 
    000008_tools.up.sql
    Optimize tools table schema and constraints                           

    memory-store/migrations/000008_tools.up.sql

  • Removed task_version column from tools table
  • Modified primary key to only include developer_id, agent_id, tool_id
  • Added unique constraint for agent_id, name, task_id
  • Removed redundant indexes
  • +5/-10   
    000009_sessions.up.sql
    Optimize sessions table indexes                                                   

    memory-store/migrations/000009_sessions.up.sql

  • Removed redundant session_id index
  • Modified session_lookup index to include participant_type
  • Removed redundant session_lookup_by_session index
  • +4/-10   
    000003_users.up.sql
    Remove redundant indexes from users table                               

    memory-store/migrations/000003_users.up.sql

  • Removed redundant user_id sorted index
  • Removed redundant developer_id index
  • Cleaned up formatting
  • +4/-9     
    openapi-1.0.0.yaml
    Add document embedding metadata fields                                     

    typespec/tsp-output/@typespec/openapi3/openapi-1.0.0.yaml

  • Added embeddings as required field with null default
  • Added new fields: modality, language, embedding_model,
    embedding_dimensions
  • +20/-0   
    000011_executions.up.sql
    Optimize executions table indexes                                               

    memory-store/migrations/000011_executions.up.sql

  • Removed redundant execution_id sorted index
  • Modified task_id index to include task_version
  • +2/-5     
    models.tsp
    Add document embedding metadata model fields                         

    typespec/docs/models.tsp

  • Modified embeddings field to be nullable with null default
  • Added new document metadata fields for embeddings
  • +18/-2   
    000006_docs.down.sql
    Improve docs table cleanup process                                             

    memory-store/migrations/000006_docs.down.sql

    • Added CASCADE to table drops
    • Removed redundant index drops
    +3/-4     
    000016_entry_relations.up.sql
    Optimize entry relations indexes                                                 

    memory-store/migrations/000016_entry_relations.up.sql

  • Removed redundant components index
  • Simplified leaf index structure
  • +1/-3     
    000015_entries.up.sql
    Simplify entries table index structure                                     

    memory-store/migrations/000015_entries.up.sql

    • Simplified entries index by removing entry_id
    +1/-1     
    Documentation
    1 files
    __init__.py
    Add document module initialization                                             

    agents-api/agents_api/queries/docs/init.py

  • Added module documentation
  • Defined public API exports
  • Added type hints and docstrings
  • +35/-0   
    Configuration changes
    1 files
    000007_ann.up.sql
    Update VoyageAI embedding configuration                                   

    memory-store/migrations/000007_ann.up.sql

    • Added 'document' parameter to embedding_voyageai function
    +2/-2     

    💡 PR-Agent usage: Comment /help "your question" on any pull request to receive relevant information

    Copy link
    Contributor

    qodo-merge-pro-for-open-source bot commented Dec 20, 2024

    PR Reviewer Guide 🔍

    (Review updated until commit 934db8a)

    Here are some key observations to aid the review process:

    ⏱️ Estimated effort to review: 4 🔵🔵🔵🔵⚪
    🧪 PR contains tests
    🔒 Security concerns

    SQL injection:
    The search_docs_by_embedding query manually constructs SQL strings with user-provided UUIDs using string concatenation rather than proper parameterization, which could potentially allow SQL injection attacks if the UUID validation is bypassed.

    ⚡ Recommended focus areas for review

    Performance

    The embedding search implementation manually constructs query strings with string concatenation for UUID arrays which could lead to performance issues with large arrays. Consider using proper parameterization.

        # NOTE: Manually replace uuids list coz asyncpg isnt sending it correctly
        owner_ids_pg_str = f"ARRAY['{'\', \''.join(owner_ids)}']"
        query = search_docs_by_embedding_query.replace("$UUID_LIST", owner_ids_pg_str)
    Error Handling

    The maximal marginal relevance implementation assumes valid input embeddings but lacks proper validation and error handling for edge cases like empty or malformed embeddings.

    def maximal_marginal_relevance(
        query_embedding: np.ndarray,
        embedding_list: list,
        lambda_mult: float = 0.5,
        k: int = 4,
    ) -> list[int]:
        """Calculate maximal marginal relevance.
    
        Args:
            query_embedding: The query embedding.
            embedding_list: A list of embeddings.
            lambda_mult: The lambda parameter for MMR. Default is 0.5.
            k: The number of embeddings to return. Default is 4.
    
        Returns:
            A list of indices of the embeddings to return.
    
        Raises:
            ImportError: If numpy is not installed.
        """
    
        if min(k, len(embedding_list)) <= 0:
            return []
        if query_embedding.ndim == 1:
            query_embedding = np.expand_dims(query_embedding, axis=0)
        similarity_to_query = _cosine_similarity(query_embedding, embedding_list)[0]
        most_similar = int(np.argmax(similarity_to_query))
        idxs = [most_similar]
        selected = np.array([embedding_list[most_similar]])
        while len(idxs) < min(k, len(embedding_list)):
            best_score = -np.inf
            idx_to_add = -1
            similarity_to_selected = _cosine_similarity(embedding_list, selected)
            for i, query_score in enumerate(similarity_to_query):
                if i in idxs:
                    continue
                redundant_score = max(similarity_to_selected[i])
                equation_score = (
                    lambda_mult * query_score - (1 - lambda_mult) * redundant_score
                )
                if equation_score > best_score:
                    best_score = equation_score
                    idx_to_add = i
            idxs.append(idx_to_add)
            selected = np.append(selected, [embedding_list[idx_to_add]], axis=0)
        return idxs
    Validation

    The document creation logic allows empty or invalid metadata without proper validation. Consider adding metadata schema validation to prevent invalid document states.

    async def create_doc(
        *,
        developer_id: UUID,
        doc_id: UUID | None = None,
        data: CreateDocRequest,
        owner_type: Literal["user", "agent"],
        owner_id: UUID,
        modality: Literal["text", "image", "mixed"] | None = "text",
        embedding_model: str | None = "voyage-3",
        embedding_dimensions: int | None = 1024,
        language: str | None = "english",
        index: int | None = 0,
    ) -> list[tuple[str, list, Literal["fetch", "fetchmany", "fetchrow"]]]:

    Copy link
    Contributor

    qodo-merge-pro-for-open-source bot commented Dec 20, 2024

    PR Code Suggestions ✨

    Latest suggestions up to 934db8a
    Explore these optional code suggestions:

    CategorySuggestion                                                                                                                                    Score
    Possible issue
    Missing required function parameter could cause runtime errors

    The function generate_canonical_name() is called without any arguments, but based on
    the removed code it appears it should receive the task name as a parameter. Add the
    task name parameter to avoid potential errors.

    agents-api/agents_api/queries/tasks/create_or_update_task.py [170]

    -canonical_name = data.canonical_name or generate_canonical_name()
    +canonical_name = data.canonical_name or generate_canonical_name(data.name)
    • Apply this suggestion
    Suggestion importance[1-10]: 9

    Why: The suggestion fixes a critical bug where generate_canonical_name() is called without its required parameter, which would cause a runtime error. This is a direct regression from the old code.

    9
    Validate embedding vector dimensions to prevent SQL errors

    Add validation for query_embedding length to ensure it matches embedding_dimensions.

    agents-api/agents_api/queries/docs/search_docs_by_embedding.py [62-63]

     if not query_embedding:
         raise HTTPException(status_code=400, detail="Empty embedding provided")
    +if len(query_embedding) != 1024:  # Based on vector(1024) in SQL
    +    raise HTTPException(
    +        status_code=400,
    +        detail=f"Embedding must have 1024 dimensions, got {len(query_embedding)}"
    +    )
    • Apply this suggestion
    Suggestion importance[1-10]: 9

    Why: The suggestion adds crucial validation to ensure embedding dimensions match the database schema requirements (1024d), preventing SQL errors that could crash the application.

    9
    Improve array validation logic to handle empty arrays and prevent potential SQL errors

    Add error handling for empty arrays in the owner filter validation to prevent SQL
    errors.

    memory-store/migrations/000018_doc_search.up.sql [110-112]

    -IF owner_types IS NOT NULL AND owner_ids IS NOT NULL AND
    -    array_length(owner_types, 1) != array_length(owner_ids, 1) AND
    -    array_length(owner_types, 1) <= 0 THEN
    +IF owner_types IS NOT NULL AND owner_ids IS NOT NULL AND (
    +    array_length(owner_types, 1) != array_length(owner_ids, 1) OR
    +    array_length(owner_types, 1) <= 0 OR
    +    array_length(owner_ids, 1) <= 0) THEN
    • Apply this suggestion
    Suggestion importance[1-10]: 8

    Why: The suggestion improves input validation by also checking for empty owner_ids array and using OR instead of AND for the length checks, which prevents potential SQL errors from invalid array inputs.

    8
    General
    Add validation to prevent empty or excessively long search queries

    Add input validation for the query parameter to ensure it's not empty and has a
    reasonable length limit.

    agents-api/agents_api/queries/docs/search_docs_by_text.py [46-54]

     async def search_docs_by_text(
         *,
         developer_id: UUID,
         owners: list[tuple[Literal["user", "agent"], UUID]],
         query: str,
         k: int = 3,
         metadata_filter: dict[str, Any] = {},
         search_language: str | None = "english",
     ) -> tuple[str, list]:
    +    if not query or len(query.strip()) == 0:
    +        raise HTTPException(status_code=400, detail="Search query cannot be empty")
    +    if len(query) > 1000:
    +        raise HTTPException(status_code=400, detail="Search query too long")
    • Apply this suggestion
    Suggestion importance[1-10]: 8

    Why: The suggestion adds important input validation to prevent potential issues with empty or overly long search queries that could impact performance or cause errors. This is a critical validation for a search endpoint.

    8
    Add type validation for list content elements to prevent data inconsistency

    Add validation to ensure that when content is provided as a list, all elements are
    of the same type (string) to maintain data consistency and prevent potential
    database errors.

    agents-api/agents_api/queries/docs/create_doc.py [120-125]

     if isinstance(data.content, list):
    +    if not all(isinstance(c, str) for c in data.content):
    +        raise HTTPException(status_code=400, detail="All content elements must be strings")
         final_params_doc = []
         final_params_owner = []
     
         for idx, content in enumerate(data.content):
    • Apply this suggestion
    Suggestion importance[1-10]: 7

    Why: The suggestion adds important input validation to prevent potential database errors and maintain data consistency, which is crucial for the document creation functionality.

    7
    Add parameter validation to ensure interpolation weight stays within valid range

    Add input validation for lambda_mult parameter to ensure it stays within valid range
    [0,1] since it's used as an interpolation weight in the MMR calculation.

    agents-api/agents_api/queries/docs/mmr.py [64-69]

     def maximal_marginal_relevance(
         query_embedding: np.ndarray,
         embedding_list: list,
         lambda_mult: float = 0.5,
         k: int = 4,
     ) -> list[int]:
    +    if not 0 <= lambda_mult <= 1:
    +        raise ValueError("lambda_mult must be between 0 and 1")
    • Apply this suggestion
    Suggestion importance[1-10]: 6

    Why: The suggestion adds important validation for the lambda_mult parameter to prevent invalid calculations in the MMR algorithm, improving robustness and preventing potential errors.

    6
    Replace hardcoded dummy content with proper null value until implementation

    Add warning or placeholder for the hardcoded dummy content in file retrieval.

    agents-api/agents_api/queries/files/get_file.py [58]

    -"content": "DUMMY: NEED TO FETCH CONTENT FROM BLOB STORAGE",
    +"content": None,  # TODO: Implement blob storage content retrieval
    • Apply this suggestion
    Suggestion importance[1-10]: 3

    Why: While the suggestion improves code clarity by using None instead of a dummy string, it's a minor improvement that doesn't affect functionality and only changes a placeholder value.

    3

    Previous suggestions

    Suggestions up to commit 6c77490
    CategorySuggestion                                                                                                                                    Score
    Possible issue
    Add validation for vector embedding dimensions to prevent invalid configurations

    Add input validation for embedding_dimensions to ensure it's a positive integer when
    embedding_model is specified, preventing invalid vector dimensions.

    agents-api/agents_api/queries/docs/create_doc.py [122-123]

     data.embedding_model or "none",
    -data.embedding_dimensions or 0,
    +data.embedding_dimensions if data.embedding_model != "none" and data.embedding_dimensions > 0 else 0,
    Suggestion importance[1-10]: 8

    Why: The suggestion adds crucial validation to prevent invalid vector dimensions when an embedding model is specified, which could cause serious issues in downstream vector operations.

    8
    Validate diversity-relevance tradeoff parameter to prevent invalid search results

    Add input validation for lambda_mult parameter to ensure it's between 0 and 1, as
    values outside this range would produce invalid diversity-relevance tradeoffs.

    agents-api/agents_api/queries/docs/mmr.py [64-69]

     def maximal_marginal_relevance(
         query_embedding: np.ndarray,
         embedding_list: list,
         lambda_mult: float = 0.5,
         k: int = 4,
     ) -> list[int]:
    +    if not 0 <= lambda_mult <= 1:
    +        raise ValueError("lambda_mult must be between 0 and 1")
    Suggestion importance[1-10]: 7

    Why: Adding validation for lambda_mult is important as values outside [0,1] would produce invalid diversity-relevance tradeoffs, potentially breaking the MMR algorithm's functionality.

    7
    Validate hybrid search weight parameter to prevent invalid result ranking

    Add validation for alpha parameter to ensure it's between 0 and 1, as this weight
    controls the balance between text and embedding search results.

    agents-api/agents_api/queries/docs/search_docs_hybrid.py [107-114]

     async def search_docs_hybrid(
         developer_id: UUID,
         text_query: str = "",
         embedding: List[float] = None,
         k: int = 10,
         alpha: float = 0.5,
    +) -> List[Doc]:
    +    if not 0 <= alpha <= 1:
    +        raise ValueError("alpha must be between 0 and 1")
    Suggestion importance[1-10]: 7

    Why: The validation ensures alpha stays within [0,1], preventing incorrect weighting between text and embedding search results that could lead to invalid rankings.

    7

    Copy link
    Contributor

    qodo-merge-pro-for-open-source bot commented Dec 20, 2024

    CI Failure Feedback 🧐

    (Checks updated until commit 934db8a)

    Action: Typecheck

    Failed stage: Typecheck [❌]

    Failed test name: pytype

    Failure summary:

    The pytype check failed due to multiple issues:
    1. Missing dependency: Cannot find module pycozo
    which is required in agents_api/common/utils/cozo.py
    2. Type error in
    agents_api/queries/docs/search_docs_hybrid.py:
    - Trying to access model_copy attribute on a None
    value
    - Type annotation mismatch for embedding variable
    3. Stray type comments in
    tests/test_workflow_routes.py

    Relevant error logs:
    1:  ##[group]Operating System
    2:  Ubuntu
    ...
    
    1186:  [9/374] check agents_api.common.utils.datetime
    1187:  [10/374] check agents_api.autogen.Executions
    1188:  [11/374] check agents_api.autogen.Entries
    1189:  [12/374] check agents_api.clients.__init__
    1190:  [13/374] check agents_api.autogen.Agents
    1191:  [14/374] check agents_api.env
    1192:  [15/374] check agents_api.autogen.Tasks
    1193:  [16/374] check agents_api.common.utils.cozo
    1194:  FAILED: /home/runner/work/julep/julep/agents-api/.pytype/pyi/agents_api/common/utils/cozo.pyi 
    1195:  /home/runner/work/julep/julep/agents-api/.venv/bin/python -m pytype.main --disable pyi-error --imports_info /home/runner/work/julep/julep/agents-api/.pytype/imports/agents_api.common.utils.cozo.imports --module-name agents_api.common.utils.cozo --platform linux -V 3.12 -o /home/runner/work/julep/julep/agents-api/.pytype/pyi/agents_api/common/utils/cozo.pyi --analyze-annotated --nofail --none-is-not-bool --quick --strict-none-binding /home/runner/work/julep/julep/agents-api/agents_api/common/utils/cozo.py
    1196:  /home/runner/work/julep/julep/agents-api/agents_api/common/utils/cozo.py:9:1: error: in <module>: Can't find module 'pycozo'. [import-error]
    1197:  from pycozo import Client
    1198:  ~~~~~~~~~~~~~~~~~~~~~~~~~
    1199:  For more details, see https://google.github.io/pytype/errors.html#import-error
    ...
    
    1209:  [26/374] check agents_api.common.utils.types
    1210:  [27/374] check agents_api.common.protocol.tasks
    1211:  [28/374] check agents_api.common.protocol.sessions
    1212:  [29/374] check agents_api.common.utils.messages
    1213:  [30/374] check agents_api.common.nlp
    1214:  [31/374] check agents_api.clients.async_s3
    1215:  [32/374] check agents_api.common.protocol.developers
    1216:  [33/374] check agents_api.queries.utils
    1217:  ERROR:pytype.matcher Invalid type: <class 'pytype.abstract.function.ParamSpecMatch'>
    ...
    
    1314:  [130/374] check agents_api.rec_sum.trim
    1315:  [131/374] check tests.sample_tasks.test_find_selector
    1316:  [132/374] check tests.test_chat_routes
    1317:  [133/374] check agents_api.common.__init__
    1318:  [134/374] check agents_api.routers.internal.__init__
    1319:  [135/374] check agents_api.rec_sum.entities
    1320:  [136/374] check agents_api.queries.developers.__init__
    1321:  [137/374] check agents_api.queries.docs.search_docs_hybrid
    1322:  FAILED: /home/runner/work/julep/julep/agents-api/.pytype/pyi/agents_api/queries/docs/search_docs_hybrid.pyi 
    1323:  /home/runner/work/julep/julep/agents-api/.venv/bin/python -m pytype.main --disable pyi-error --imports_info /home/runner/work/julep/julep/agents-api/.pytype/imports/agents_api.queries.docs.search_docs_hybrid.imports --module-name agents_api.queries.docs.search_docs_hybrid --platform linux -V 3.12 -o /home/runner/work/julep/julep/agents-api/.pytype/pyi/agents_api/queries/docs/search_docs_hybrid.pyi --analyze-annotated --nofail --none-is-not-bool --quick --strict-none-binding /home/runner/work/julep/julep/agents-api/agents_api/queries/docs/search_docs_hybrid.py
    1324:  /home/runner/work/julep/julep/agents-api/agents_api/queries/docs/search_docs_hybrid.py:98:15: error: in fuse_results: No attribute 'model_copy' on None [attribute-error]
    1325:  In Optional[agents_api.autogen.Docs.Doc]
    1326:  doc = doc.model_copy()  # or copy if you are using Pydantic
    1327:  ~~~~~~~~~~~~~~
    1328:  /home/runner/work/julep/julep/agents-api/agents_api/queries/docs/search_docs_hybrid.py:105:1: error: in <module>: Type annotation for embedding does not match type of assignment [annotation-type-mismatch]
    ...
    
    1423:  # fuse them
    1424:  ~~~~~~~~~~~~~~~
    1425:  fused = fuse_results(text_results, embed_results, alpha)
    1426:  ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    1427:  # Then pick top K overall
    1428:  ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    1429:  return fused[:k]
    1430:  ~~~~~~~~~~~~~~~~~~~~
    1431:  For more details, see https://google.github.io/pytype/errors.html
    ...
    
    1439:  [145/374] check agents_api.workflows.__init__
    1440:  [146/374] check agents_api.metrics.__init__
    1441:  [147/374] check agents_api.dependencies.__init__
    1442:  [148/374] check agents_api.common.exceptions.sessions
    1443:  [149/374] check agents_api.common.utils.debug
    1444:  [150/374] check agents_api.queries.docs.mmr
    1445:  [151/374] check tests.test_user_routes
    1446:  [152/374] check tests.test_workflow_routes
    1447:  /home/runner/work/julep/julep/agents-api/tests/test_workflow_routes.py:65:1: error: : Stray type comment: object [ignored-type-comment]
    1448:  #   type: object~~~~~~~~~~~~~~~
    1449:  #   type: object
    1450:  /home/runner/work/julep/julep/agents-api/tests/test_workflow_routes.py:114:1: error: : Stray type comment: object [ignored-type-comment]
    1451:  #   type: object~~~~~~~~~~~~~~~
    1452:  #   type: object
    1453:  For more details, see https://google.github.io/pytype/errors.html#ignored-type-comment
    ...
    
    1466:  [165/374] check agents_api.queries.entries.__init__
    1467:  [166/374] check tests.test_agent_routes
    1468:  [167/374] check agents_api.routers.sessions.exceptions
    1469:  [168/374] check agents_api.rec_sum.summarize
    1470:  [169/374] check tests.test_activities
    1471:  [170/374] check tests.test_sessions
    1472:  [171/374] check agents_api.queries.agents.__init__
    1473:  [172/374] check agents_api.queries.tasks.__init__
    1474:  ninja: build stopped: cannot make progress due to previous errors.
    1475:  Computing dependencies
    1476:  Generated API key since not set in the environment: 51505566278392106819051827099590
    1477:  Sentry DSN not found. Sentry will not be enabled.
    1478:  Analyzing 349 sources with 0 local dependencies
    1479:  Leaving directory '.pytype'
    1480:  ##[error]Process completed with exit code 1.
    

    ✨ CI feedback usage guide:

    The CI feedback tool (/checks) automatically triggers when a PR has a failed check.
    The tool analyzes the failed checks and provides several feedbacks:

    • Failed stage
    • Failed test name
    • Failure summary
    • Relevant error logs

    In addition to being automatically triggered, the tool can also be invoked manually by commenting on a PR:

    /checks "https://github.com/{repo_name}/actions/runs/{run_number}/job/{job_number}"
    

    where {repo_name} is the name of the repository, {run_number} is the run number of the failed check, and {job_number} is the job number of the failed check.

    Configuration options

    • enable_auto_checks_feedback - if set to true, the tool will automatically provide feedback when a check is failed. Default is true.
    • excluded_checks_list - a list of checks to exclude from the feedback, for example: ["check1", "check2"]. Default is an empty list.
    • enable_help_text - if set to true, the tool will provide a help message with the feedback. Default is true.
    • persistent_comment - if set to true, the tool will overwrite a previous checks comment with the new feedback. Default is true.
    • final_update_message - if persistent_comment is true and updating a previous checks message, the tool will also create a new message: "Persistent checks updated to latest commit". Default is true.

    See more information about the checks tool in the docs.

    Copy link
    Contributor

    @ellipsis-dev ellipsis-dev bot left a comment

    Choose a reason for hiding this comment

    The reason will be displayed to describe this comment to others. Learn more.

    ❌ Changes requested. Reviewed everything up to 6c77490 in 1 minute and 30 seconds

    More details
    • Looked at 848 lines of code in 10 files
    • Skipped 0 files when reviewing.
    • Skipped posting 8 drafted comments based on config settings.
    1. agents-api/agents_api/queries/docs/search_docs_hybrid.py:13
    • Draft comment:
      The import statement for run_concurrently is unnecessary since it's not used in the code. Consider removing it to clean up the imports.
    • Reason this comment was not posted:
      Confidence changes required: 10%
      The import statement for run_concurrently is unnecessary since it's not used in the code.
    2. agents-api/agents_api/queries/docs/search_docs_hybrid.py:134
    • Draft comment:
      Appending an empty list to tasks is incorrect. You should append a coroutine or task. Consider using None or a similar placeholder if you intend to skip adding a task.
    • Reason this comment was not posted:
      Confidence changes required: 50%
      The code uses tasks.append([]) which is incorrect for appending tasks. It should append a coroutine instead.
    3. agents-api/agents_api/queries/docs/search_docs_hybrid.py:147
    • Draft comment:
      Appending an empty list to tasks is incorrect. You should append a coroutine or task. Consider using None or a similar placeholder if you intend to skip adding a task.
    • Reason this comment was not posted:
      Confidence changes required: 50%
      The code uses tasks.append([]) which is incorrect for appending tasks. It should append a coroutine instead.
    4. agents-api/agents_api/queries/docs/search_docs_hybrid.py:154
    • Draft comment:
      Ensure that at least one valid task is added to tasks before calling gather. If both text_query and embedding are empty, tasks will contain only empty lists, leading to an error.
    • Reason this comment was not posted:
      Confidence changes required: 50%
      The code does not handle the case where both text_query and embedding are empty, which could lead to an error when calling gather.
    5. agents-api/agents_api/queries/docs/search_docs_hybrid.py:110
    • Draft comment:
      The embedding parameter should have a default value of an empty list [] instead of None to avoid type issues and simplify checks.
    • Reason this comment was not posted:
      Confidence changes required: 50%
      The embedding parameter should have a default value of an empty list instead of None to avoid type issues.
    6. agents-api/agents_api/queries/docs/create_doc.py:69
    • Draft comment:
      Consider adding more exception handling for other potential database errors, such as asyncpg.DataError or asyncpg.SyntaxOrAccessError, to make the function more robust.
    • Reason this comment was not posted:
      Confidence changes required: 50%
      The rewrap_exceptions decorator is used to handle exceptions, but it doesn't cover all possible exceptions that might occur during database operations.
    7. agents-api/agents_api/queries/docs/delete_doc.py:43
    • Draft comment:
      Consider adding more exception handling for other potential database errors, such as asyncpg.DataError or asyncpg.SyntaxOrAccessError, to make the function more robust.
    • Reason this comment was not posted:
      Confidence changes required: 50%
      The rewrap_exceptions decorator is used to handle exceptions, but it doesn't cover all possible exceptions that might occur during database operations.
    8. agents-api/agents_api/queries/docs/get_doc.py:29
    • Draft comment:
      Consider adding exception handling for potential database errors, such as asyncpg.DataError or asyncpg.SyntaxOrAccessError, to make the function more robust.
    • Reason this comment was not posted:
      Confidence changes required: 50%
      The rewrap_exceptions decorator is used to handle exceptions, but it doesn't cover all possible exceptions that might occur during database operations.

    Workflow ID: wflow_0SgPnESfFL0Scwfr


    Want Ellipsis to fix these issues? Tag @ellipsis-dev in a comment. You can customize Ellipsis with 👍 / 👎 feedback, review rules, user-specific overrides, quiet mode, and more.

    agents-api/agents_api/queries/docs/mmr.py Show resolved Hide resolved
    Copy link
    Contributor

    @ellipsis-dev ellipsis-dev bot left a comment

    Choose a reason for hiding this comment

    The reason will be displayed to describe this comment to others. Learn more.

    👍 Looks good to me! Incremental review on 93673b7 in 1 minute and 3 seconds

    More details
    • Looked at 1195 lines of code in 21 files
    • Skipped 0 files when reviewing.
    • Skipped posting 5 drafted comments based on config settings.
    1. agents-api/agents_api/queries/docs/get_doc.py:5
    • Draft comment:
      The import statement for 'asyncpg' is unnecessary and can be removed to clean up the code. This is also applicable in other files like 'list_docs.py', 'search_docs_by_embedding.py', and 'search_docs_by_text.py'.
    • Reason this comment was not posted:
      Confidence changes required: 50%
      The import statement for 'asyncpg' is unnecessary in multiple files as it is not used anywhere in the code. Removing it will clean up the code.
    2. agents-api/agents_api/queries/docs/list_docs.py:5
    • Draft comment:
      The import statement for 'asyncpg' is unnecessary and can be removed to clean up the code. This is also applicable in other files like 'search_docs_by_embedding.py', and 'search_docs_by_text.py'.
    • Reason this comment was not posted:
      Confidence changes required: 50%
      The import statement for 'asyncpg' is unnecessary in multiple files as it is not used anywhere in the code. Removing it will clean up the code.
    3. agents-api/agents_api/queries/docs/search_docs_by_embedding.py:6
    • Draft comment:
      The import statement for 'asyncpg' is unnecessary and can be removed to clean up the code. This is also applicable in 'search_docs_by_text.py'.
    • Reason this comment was not posted:
      Confidence changes required: 50%
      The import statement for 'asyncpg' is unnecessary in multiple files as it is not used anywhere in the code. Removing it will clean up the code.
    4. agents-api/agents_api/queries/docs/search_docs_by_text.py:6
    • Draft comment:
      The import statement for 'asyncpg' is unnecessary and can be removed to clean up the code.
    • Reason this comment was not posted:
      Confidence changes required: 50%
      The import statement for 'asyncpg' is unnecessary in multiple files as it is not used anywhere in the code. Removing it will clean up the code.
    5. agents-api/agents_api/queries/docs/create_doc.py:98
    • Draft comment:
      The 'org' option is removed from the owner_type Literal, but this change is not reflected in the PR description. This should be documented for clarity. This change is also applicable in 'delete_doc.py', 'get_doc.py', and 'list_docs.py'.
    • Reason this comment was not posted:
      Confidence changes required: 50%
      The 'org' option is removed from the owner_type Literal in multiple files, but this change is not reflected in the PR description. This should be documented for clarity.

    Workflow ID: wflow_pnbwiXEmTCJVnbrw


    You can customize Ellipsis with 👍 / 👎 feedback, review rules, user-specific overrides, quiet mode, and more.

    Copy link
    Contributor

    @ellipsis-dev ellipsis-dev bot left a comment

    Choose a reason for hiding this comment

    The reason will be displayed to describe this comment to others. Learn more.

    ❌ Changes requested. Incremental review on dc0ec36 in 48 seconds

    More details
    • Looked at 939 lines of code in 28 files
    • Skipped 0 files when reviewing.
    • Skipped posting 1 drafted comments based on config settings.
    1. agents-api/agents_api/queries/users/delete_user.py:59
    • Draft comment:
      The asyncpg.exceptions.UniqueViolationError should not be handled here as it is not relevant to delete operations. Consider removing this exception handling.
    • Reason this comment was not posted:
      Marked as duplicate.

    Workflow ID: wflow_WUrKdAU41fxw9Cd0


    Want Ellipsis to fix these issues? Tag @ellipsis-dev in a comment. You can customize Ellipsis with 👍 / 👎 feedback, review rules, user-specific overrides, quiet mode, and more.

    Copy link
    Contributor

    @ellipsis-dev ellipsis-dev bot left a comment

    Choose a reason for hiding this comment

    The reason will be displayed to describe this comment to others. Learn more.

    👍 Looks good to me! Incremental review on 831e950 in 51 seconds

    More details
    • Looked at 440 lines of code in 11 files
    • Skipped 0 files when reviewing.
    • Skipped posting 3 drafted comments based on config settings.
    1. agents-api/agents_api/queries/docs/get_doc.py:36
    • Draft comment:
      Using ast.literal_eval on d["content"] can be unsafe if the content is not guaranteed to be a valid Python literal. Consider using a safer method to parse or handle the content.
    • Reason this comment was not posted:
      Comment was on unchanged code.
    2. agents-api/agents_api/queries/docs/embed_snippets.py:10
    • Draft comment:
      vectorizer_query is set to None. This is a placeholder and should be replaced with an actual query before deployment.
    • Reason this comment was not posted:
      Decided after close inspection that this draft comment was likely wrong and/or not actionable:
      The comment is essentially repeating information that's already explicitly stated in a TODO comment one line above. The TODO comment is more visible and serves the same purpose. This makes the PR comment redundant and not adding any new information or value.
      Perhaps the comment is trying to emphasize the importance of not deploying with a None query, which could be a critical issue.
      While deployment concerns are valid, the existing TODO comment already makes it clear this needs to be replaced, and deployment issues would be caught by basic testing since the function would fail immediately.
      Delete the comment as it's redundant with the existing TODO comment and doesn't provide additional actionable value.
    3. agents-api/agents_api/queries/entries/list_entries.py:88
    • Draft comment:
      Ensure sort_by and direction are validated and sanitized to prevent SQL injection, as they are used in string interpolation for SQL queries.
    • Reason this comment was not posted:
      Comment did not seem useful.

    Workflow ID: wflow_TPwdOI1YwvBHnQ9P


    You can customize Ellipsis with 👍 / 👎 feedback, review rules, user-specific overrides, quiet mode, and more.

    Copy link
    Contributor

    @ellipsis-dev ellipsis-dev bot left a comment

    Choose a reason for hiding this comment

    The reason will be displayed to describe this comment to others. Learn more.

    👍 Looks good to me! Incremental review on 249513d in 1 minute and 4 seconds

    More details
    • Looked at 1148 lines of code in 15 files
    • Skipped 0 files when reviewing.
    • Skipped posting 5 drafted comments based on config settings.
    1. agents-api/agents_api/queries/docs/create_doc.py:153
    • Draft comment:
      The function assumes data.content is always a list. Consider adding a check or conversion to handle cases where data.content might be a string or other type to prevent unexpected behavior.
    • Reason this comment was not posted:
      Decided after close inspection that this draft comment was likely wrong and/or not actionable:
      The comment suggests adding a check that already exists in the code. The code already handles both list and non-list cases appropriately through the isinstance() check and separate logic paths. The comment appears to be incorrect in stating that the function "assumes data.content is always a list" when it clearly doesn't make that assumption.
      Could there be other types besides list and non-list that need handling? Could the CreateDocRequest type definition enforce the content type making this check unnecessary?
      The code's else branch handles any non-list type appropriately by treating it as a single content item. The type validation would be handled by FastAPI's request validation via CreateDocRequest if needed.
      The comment should be deleted because it incorrectly suggests adding a check that already exists in the code. The code already properly handles both list and non-list content.
    2. agents-api/agents_api/queries/docs/delete_doc.py:25
    • Draft comment:
      The EXISTS clause in the SQL query might be redundant since the doc_owners entry is already deleted in the deleted_owners CTE. Consider revising the logic to ensure the docs entry is only deleted if the doc_owners entry existed prior to deletion.
    • Reason this comment was not posted:
      Decided after close inspection that this draft comment was likely wrong and/or not actionable:
      The comment raises an interesting point about the query logic - we delete from doc_owners first, then check if that same record existed in doc_owners before deleting from docs. However, this could actually be intentional behavior to ensure atomicity and proper ordering. Without deeper knowledge of the data model and requirements, we can't be certain this is actually a problem vs a deliberate safeguard.
      I may be missing important context about transaction isolation levels and race conditions that could make this pattern necessary. The EXISTS check might serve as an important guard rail.
      While the comment raises an interesting point about query structure, we don't have enough context to confidently say this is incorrect or needs to be changed.
      This comment is speculative and requires more context about the data model and requirements to validate. Following our rules, we should err on the side of removing speculative comments.
    3. agents-api/agents_api/queries/docs/list_docs.py:135
    • Draft comment:
      The metadata_filter is directly appended to the query string, which could lead to SQL injection if not properly handled. Ensure that metadata keys and values are safely included in the query to prevent SQL injection vulnerabilities.
    • Reason this comment was not posted:
      Comment did not seem useful.
    4. agents-api/agents_api/queries/docs/search_docs_by_text.py:19
    • Draft comment:
      The owner_types and owner_ids are passed as JSONB arrays, which might not be correctly handled by the SQL function. Ensure that these arrays are properly converted to UUID arrays to prevent unexpected behavior.
    • Reason this comment was not posted:
      Comment did not seem useful.
    5. agents-api/tests/test_docs_queries.py:11
    • Draft comment:
      Consider adding tests for search_docs_by_embedding and search_docs_hybrid to ensure comprehensive coverage of the search functionalities.
    • Reason this comment was not posted:
      Confidence changes required: 50%
      The test_docs_queries.py file has a test for search_docs_by_text but it lacks tests for search_docs_by_embedding and search_docs_hybrid. Adding these tests would ensure comprehensive coverage of the search functionalities.

    Workflow ID: wflow_qeLRxNLHNTGhklta


    You can customize Ellipsis with 👍 / 👎 feedback, review rules, user-specific overrides, quiet mode, and more.

    @creatorrr creatorrr marked this pull request as draft December 21, 2024 08:24
    @creatorrr creatorrr changed the title wip(agents-api): Add Doc sql queries feat(agents-api): Add Doc sql queries Dec 23, 2024
    @creatorrr creatorrr marked this pull request as ready for review December 23, 2024 05:23
    @creatorrr creatorrr merged commit 791a8a9 into f/switch-to-pg Dec 23, 2024
    1 check passed
    @creatorrr creatorrr deleted the f/doc-queries branch December 23, 2024 05:23
    Copy link
    Contributor

    @ellipsis-dev ellipsis-dev bot left a comment

    Choose a reason for hiding this comment

    The reason will be displayed to describe this comment to others. Learn more.

    👍 Looks good to me! Reviewed everything up to 934db8a in 1 minute and 18 seconds

    More details
    • Looked at 4562 lines of code in 79 files
    • Skipped 0 files when reviewing.
    • Skipped posting 6 drafted comments based on config settings.
    1. agents-api/agents_api/queries/utils.py:32
    • Draft comment:
      The generate_canonical_name function's docstring is outdated. It mentions converting a display name to a canonical name, but the function no longer takes a name parameter. Update the docstring to reflect the current functionality.
    • Reason this comment was not posted:
      Comment looked like it was already resolved.
    2. agents-api/agents_api/queries/docs/create_doc.py:181
    • Draft comment:
      The create_doc function's docstring should mention the default value for the modality parameter, which is 'text'. This will help developers understand the default behavior of the function.
    • Reason this comment was not posted:
      Confidence changes required: 50%
      The create_doc function in create_doc.py has a potential issue with the modality parameter. It defaults to 'text', but the docstring does not mention this default value. This could lead to confusion for developers using this function.
    3. agents-api/agents_api/queries/docs/delete_doc.py:79
    • Draft comment:
      The delete_doc function's docstring should explicitly mention that the owner_type parameter is required. This will help developers understand the necessary inputs for the function.
    • Reason this comment was not posted:
      Confidence changes required: 50%
      The delete_doc function in delete_doc.py has a potential issue with the owner_type parameter. It is a required parameter, but the docstring does not explicitly mention this requirement. This could lead to confusion for developers using this function.
    4. agents-api/agents_api/queries/docs/list_docs.py:152
    • Draft comment:
      The list_docs function's docstring should mention the default value for the include_without_embeddings parameter, which is False. This will help developers understand the default behavior of the function.
    • Reason this comment was not posted:
      Confidence changes required: 50%
      The list_docs function in list_docs.py has a potential issue with the include_without_embeddings parameter. It defaults to False, but the docstring does not mention this default value. This could lead to confusion for developers using this function.
    5. agents-api/agents_api/queries/docs/search_docs_by_embedding.py:86
    • Draft comment:
      The search_docs_by_embedding function's docstring should mention the default value for the confidence parameter, which is 0.5. This will help developers understand the default behavior of the function.
    • Reason this comment was not posted:
      Confidence changes required: 50%
      The search_docs_by_embedding function in search_docs_by_embedding.py has a potential issue with the confidence parameter. It defaults to 0.5, but the docstring does not mention this default value. This could lead to confusion for developers using this function.
    6. agents-api/agents_api/queries/docs/search_docs_by_text.py:91
    • Draft comment:
      The search_docs_by_text function's docstring should mention the default value for the search_language parameter, which is 'english'. This will help developers understand the default behavior of the function.
    • Reason this comment was not posted:
      Confidence changes required: 50%
      The search_docs_by_text function in search_docs_by_text.py has a potential issue with the search_language parameter. It defaults to 'english', but the docstring does not mention this default value. This could lead to confusion for developers using this function.

    Workflow ID: wflow_8XNb0cnAXVtyPRMM


    You can customize Ellipsis with 👍 / 👎 feedback, review rules, user-specific overrides, quiet mode, and more.

    Copy link
    Contributor

    Persistent review updated to latest commit 934db8a

    Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
    Projects
    None yet
    Development

    Successfully merging this pull request may close these issues.

    2 participants