Issue when ingesting a file and adding it to a collection right after #1435

Open
jeremi opened this issue Oct 20, 2024 · 2 comments

Comments

@jeremi

jeremi commented Oct 20, 2024

Describe the bug

My code is like this:

    ingest_response = self.r2r.ingest_files(
        file_paths=[document.file_path],
        metadatas=[metadata]
    )
    document_id = ingest_response['results'][0]['document_id']
    try:
        self.r2r.assign_document_to_collection(document_id, self.r2r.collection_id)
    except Exception as e:
        logger.error(f"Error assigning document to collection: {str(e)}")

I often get an error when calling assign_document_to_collection because it cannot find the document in the database.

Looking at the code, the row in document_info is not created before the result of ingest_files is returned. It is created later, inside the ingest-files workflow:

    raw_message: dict[str, Union[str, None]] = await self.orchestration_provider.run_workflow(  # type: ignore
        "ingest-files",
        {"request": workflow_input},
        options={
            "additional_metadata": {
                "document_id": str(document_id),
            }
        },
    )
    raw_message["document_id"] = str(document_id)
    messages.append(raw_message)

So when assign_document_to_collection is called, the document_info record may not exist yet, and this check fails:

    document_check_query = f"""
        SELECT 1 FROM {self._get_table_name('document_info')}
        WHERE document_id = $1
    """
    document_exists = await self.fetchrow_query(
        document_check_query, [document_id]
    )
    if not document_exists:
        raise R2RException(
            status_code=404, message="Document not found"
        )

How can this be solved? Would it be possible to also pass the collection_id when calling ingest_files? I noticed the workflow already adds the file to the default collection.

@emrgnt-cmplxty
Contributor

We are planning on extending the ingest_files endpoint to support exactly the behavior you outline above. Several other developers have requested this exact same functionality.

As for your other question about when the document info is created, there is a specific reason behind this implementation. To properly assign a document to a collection, we must update the collection IDs of the underlying chunks. Making the implementation work the way you describe would have required non-trivial engineering work, so instead we add the document to the collection after ingestion is complete.
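
Until ingest_files accepts a collection ID, one possible client-side workaround is to retry the assignment until the ingestion workflow has created the document record. The sketch below is only an illustration: `retry_until_ready` and its parameters are hypothetical names, not part of R2R, and it catches every exception because the exact error type raised by the client for the 404 is not shown in this thread.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry_until_ready(
    action: Callable[[], T],
    attempts: int = 5,
    initial_delay: float = 0.5,
    backoff: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `action` until it succeeds or `attempts` calls have failed.

    Hypothetical helper, not part of R2R. The wait between attempts
    grows geometrically: initial_delay, initial_delay * backoff, ...
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    delay = initial_delay
    for attempt in range(1, attempts + 1):
        try:
            return action()
        except Exception:
            # Assumption: any failure may mean "document not created yet".
            # Narrow this to the client's 404 error if it can be identified.
            if attempt == attempts:
                raise  # out of attempts: surface the last error
            sleep(delay)
            delay *= backoff
    raise AssertionError("unreachable")
```

With the code from the original report, the call would then look like `retry_until_ready(lambda: self.r2r.assign_document_to_collection(document_id, self.r2r.collection_id))`. The delays are arbitrary defaults; a large file may need more attempts or a longer initial delay.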

@jeremi
Author

jeremi commented Oct 22, 2024

But since the default collection is already added to the chunks during ingestion, couldn't the other collection_id be passed along in the workflow context, like the metadata, and added at the same time?
