Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: add read signed url of the vector chunks #374

Merged
merged 3 commits into from
Oct 16, 2024

Conversation

mnvsk97
Copy link
Contributor

@mnvsk97 mnvsk97 commented Oct 12, 2024

  • Add a function to fetch artifacts of the data directory by data dir fqn.
  • Add function to get read signed URL of a data dir and local file path
  • Add _signed_url in the Q/A API response to enable UI to reference and provide links to the source files.
  • Handle metadata enrichment for stream and nonstream response types.
  • Refactor MultiModalParser
    • Move default multi-modal parser prompt to constants
    • Move LLM instance creation to __init__
    • Remove unnecessary passing of instance objects like prompt, and llm between functions and use instance vars instead.
  • Upgrade langchain-openai from 0.1.20 to 0.1.25 to fix issues introduced in feat: upgrade langchain libraries #371
  • Remove unused libraries: openai and orjson
  • Closes Parser Registry Problem #363

@mnvsk97 mnvsk97 changed the title feat: add read signed url of the chunk's pile path along with vector … feat: add read signed url of the vector chunks Oct 12, 2024
@mnvsk97 mnvsk97 force-pushed the feat/add-file-urls-to-chunks branch 4 times, most recently from 0fefdae to 53979fc Compare October 12, 2024 14:13
@mnvsk97 mnvsk97 requested a review from S1LV3RJ1NX October 13, 2024 10:21
@mnvsk97 mnvsk97 enabled auto-merge October 13, 2024 15:13
@mnvsk97 mnvsk97 disabled auto-merge October 13, 2024 15:13
Copy link
Contributor

@S1LV3RJ1NX S1LV3RJ1NX left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Need clarity on two things:

  • Need to update chunk metadata with document metadata, if any for all the parsers
  • How does caching work for parsers with same extension but different config, as the caching key is the file extension.

backend/modules/parsers/multi_modal_parser.py Show resolved Hide resolved
backend/modules/parsers/multi_modal_parser.py Show resolved Hide resolved
backend/modules/parsers/parser.py Outdated Show resolved Hide resolved
@mnvsk97
Copy link
Contributor Author

mnvsk97 commented Oct 13, 2024

Need clarity on two things:

  • Need to update chunk metadata with document metadata, if any for all the parsers

  • How does caching work for parsers with same extension but different config, as the caching key is the file extension.

  1. Not sure about this, need an example.
  2. I missed the part around config. I'll add an md5 hash of the config and extension.

@S1LV3RJ1NX
Copy link
Contributor

S1LV3RJ1NX commented Oct 14, 2024

Not sure about this, need an example.

@mnvsk97 - if you would check get_chunks function of any parser, it takes file_path and metadata (lets call this as doc_metadata for sake of explanation) as input argument. So when an individual document chunk generates it's own metadata, we should also add the doc_metadata to the chunk metadata.

Another thing that is missing is,
QueryController's required_metadata list should be updated to send pre-signed urls in the response.
https://github.com/mnvsk97/cognita/blob/16c20fb006e9065cc435439d77a44bd58e3981a6/backend/modules/query_controllers/base.py#L24

@mnvsk97
Copy link
Contributor Author

mnvsk97 commented Oct 15, 2024

Not sure about this, need an example.

@mnvsk97 - if you would check get_chunks function of any parser, it takes file_path and metadata (lets call this as doc_metadata for sake of explanation) as input argument. So when an individual document chunk generates it's own metadata, we should also add the doc_metadata to the chunk metadata.

Another thing that is missing is, QueryController's required_metadata list should be updated to send pre-signed urls in the response. https://github.com/mnvsk97/cognita/blob/16c20fb006e9065cc435439d77a44bd58e3981a6/backend/modules/query_controllers/base.py#L24

Resolved all the above mentioned points in the latest commits

@mnvsk97 mnvsk97 force-pushed the feat/add-file-urls-to-chunks branch 2 times, most recently from 9a963b0 to f3d8b5d Compare October 16, 2024 03:21
@mnvsk97 mnvsk97 enabled auto-merge October 16, 2024 07:14
@mnvsk97 mnvsk97 force-pushed the feat/add-file-urls-to-chunks branch from f3d8b5d to c3e20c1 Compare October 16, 2024 08:22
@mnvsk97 mnvsk97 merged commit 7595c05 into truefoundry:main Oct 16, 2024
1 check passed
S1LV3RJ1NX pushed a commit that referenced this pull request Nov 29, 2024
feat: add read signed url of the vector chunks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Parser Registry Problem
3 participants