Issue when ingesting a file and adding it to a collection right after #1435

jeremi · 2024-10-20T14:04:24Z

Describe the bug

My code is like this:

    ingest_response = self.r2r.ingest_files(
        file_paths=[document.file_path],
        metadatas=[metadata]
    )
    document_id = ingest_response['results'][0]['document_id']
    try:
        self.r2r.assign_document_to_collection(document_id, self.r2r.collection_id)
    except Exception as e:
        logger.error(f"Error assigning document to collection: {str(e)}")

Oftentimes, I'll get an error when calling assign_document_to_collection because it cannot find the document in the database.

Looking at the code, the row in document_info is not created before ingest_files returns its result. It is created later, inside the ingest-files workflow:

R2R/py/core/main/api/ingestion_router.py

Lines 164 to 174 in c9be2c5

    
    raw_message: dict[str, Union[str, None]] = await self.orchestration_provider.run_workflow(  # type: ignore
        "ingest-files",
        {"request": workflow_input},
        options={
            "additional_metadata": {
                "document_id": str(document_id),
            }
        },
    )
    raw_message["document_id"] = str(document_id)
    messages.append(raw_message)

So when assign_document_to_collection is called, the document_info record does not exist yet and this check fails:

R2R/py/core/providers/database/collection.py

Lines 451 to 462 in c9be2c5

    
    document_check_query = f"""
        SELECT 1 FROM {self._get_table_name('document_info')}
        WHERE document_id = $1
    """
    document_exists = await self.fetchrow_query(
        document_check_query, [document_id]
    )
    if not document_exists:
        raise R2RException(
            status_code=404, message="Document not found"
        )

How can this be solved? Would it be possible to also pass the collection_id when calling ingest_files? I noticed that this workflow already adds the file to the default collection.


emrgnt-cmplxty · 2024-10-21T21:14:22Z

We are planning on extending the ingest_files endpoint to support exactly the behavior you outline above. Several other developers have requested this exact same functionality.

As for your other question around document info creation, there is a specific reason behind this implementation. In order to properly assign a document to a collection we must update the collection ids of the underlying chunks. It would have required non-trivial engineering work to have the implementation align with what you describe, so instead we add the document to the collection after ingestion is complete.
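Until ingest_files accepts a collection id, one possible client-side workaround is to retry the assignment until the ingestion workflow has created the document_info row. The following is only a minimal sketch: `call_with_retry` and its parameters are hypothetical names, not part of R2R, and it retries on any exception by default, so in practice you may want to narrow `should_retry` to the "Document not found" case.

```python
import time


def call_with_retry(action, attempts=10, delay=1.0, backoff=2.0,
                    should_retry=lambda exc: True, sleep=time.sleep):
    """Call `action()` until it succeeds or `attempts` calls have failed.

    Hypothetical helper (not an R2R API). Waits `delay` seconds after the
    first failure and multiplies the wait by `backoff` after each retry.
    The last exception is re-raised if every attempt fails.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return action()
        except Exception as exc:
            # Give up on the final attempt or on errors we should not retry.
            if attempt == attempts or not should_retry(exc):
                raise
            sleep(delay)
            delay *= backoff


# Usage with the snippet from the original report (assumes the same
# `self.r2r` client and `document_id` as above):
#
#     call_with_retry(
#         lambda: self.r2r.assign_document_to_collection(
#             document_id, self.r2r.collection_id
#         )
#     )
```

The exponential backoff keeps the number of failed calls low when ingestion takes a while; a fixed delay can be obtained by passing `backoff=1.0`.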

jeremi · 2024-10-22T07:09:31Z

But since the default collection is already added to the chunks, couldn't the other collection_id be passed in a context, like the metadata, and added at the same time?
