
graphrag CLI Update command issue #1600

Open · ajain85 opened this issue Jan 9, 2025 · 3 comments
Labels: backlog (We've confirmed some action is needed on this and will plan it) · enhancement (New feature or request)

Comments


ajain85 commented Jan 9, 2025

Discussed in #1599

Originally posted by ajain85 January 9, 2025
Hi all, I am running the graphrag CLI update command against blob storage but getting the error below. I am using Azure Blob Storage and Azure AI Search to save the parquet files and update the search index. Can anyone suggest a solution? Is this a graphrag library error, or is there something I am missing in my settings .yml file?

Error->
ValueError: Incremental Indexing Error: No new documents to process.

This is how I have configured the update storage section of my settings file:

update_index_storage:
  type: "blob" # or blob
  connection_string: ""
  container_name: "graphrag"
  base_dir: "output"
  storage_account_blob_url: "https://*.blob.core.windows.net/"

error_msg = 'Incremental Indexing Error: No new documents to process.'
is_update_run = True
logger = <graphrag.logger.rich_progress.RichProgressLogger object at 0x000001E99D45B2D0>
progress_logger = <graphrag.logger.rich_progress.RichProgressLogger object at 0x000001E99D45B2D0>
root_dir = 'C:\Users\JAINAB\UNHCR Workspace\test_graphrag\cligraphrag'
run_id = '20250109-135013'
storage = <graphrag.storage.blob_pipeline_storage.BlobPipelineStorage object at 0x000001E99D472250>
storage_config = {
    'type': 'blob',
    'base_dir': 'output',
    'connection_string': 'DefaultEndpointsProtocol=https;AccountName=d1hcrstgenaisharedxfc;AccountKey=K9Ya'…,
    'container_name': 'graphrag',
    'storage_account_blob_url': 'https://d1hcrstgenaisharedxfc.blob.core.windows.net/',
    'cosmosdb_account_url': None
}
update_index_storage = <graphrag.storage.file_pipeline_storage.FilePipelineStorage object at 0x000001E99F017710>
update_storage_config = {
    'type': 'file',
    'base_dir': 'C:\Users\JAINAB\UNHCR Workspace\test_graphrag\cligraphrag\update_output',
    'connection_string': None,
    'container_name': None,
    'storage_account_blob_url': None,
    'cosmosdb_account_url': None
}
workflows = [
    'create_base_text_units',
    'create_final_documents',
    'extract_graph',
    'compute_communities',
    'create_final_entities',
    'create_final_relationships',
    'create_final_nodes',
    'create_final_communities',
    'create_final_text_units',
    'create_final_community_reports',
    ... +1
]

ValueError: Incremental Indexing Error: No new documents to process.

@mmaitre314 (Contributor) commented

We are hitting a similar issue. It looks like the code considers that not having any documents to add is an error:

# Fail on empty delta dataset
if delta_dataset.new_inputs.empty:
    error_msg = "Incremental Indexing Error: No new documents to process."
    raise ValueError(error_msg)


It would be nice if the update instead succeeded and just kept the data as-is. We run indexing on scheduled jobs, and it is OK if there is nothing new since the last run -- the job can be a no-op.
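A minimal sketch of what we have in mind (untested; assumes the enclosing update function can simply return early when the delta is empty, and that a logger is available in scope):

# Instead of raising, treat an empty delta as a successful no-op
if delta_dataset.new_inputs.empty:
    logger.warning(
        "Incremental Indexing: no new documents to process; keeping the existing index as-is."
    )
    return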

If helpful, I could send a PR to update the logic of that piece of code.

Besides failing when there are no new documents to add, it looks like the code will also fail if there are only documents to delete, and it might not handle document deletions at all. That is likely a different problem, so I am opening a separate issue to track it.

@mmaitre314 (Contributor) commented

One workaround in our case has been to replicate the logic in get_delta_docs() and skip indexing if there are no new docs.


import os
import pandas as pd

# Compare the documents currently in the input folder against the documents already indexed
document_current = {doc for doc in os.listdir(f'{root_dir}/input') if doc.endswith('.txt')}
document_previous = set(pd.read_parquet(f'{root_dir}/output/create_final_documents.parquet')['title'].unique())
document_count = len(document_current)
document_added = len(document_current - document_previous)
document_removed = len(document_previous - document_current)

# Nothing new to index: exit the notebook early (mssparkutils is the Synapse/Fabric notebook utility)
if document_added == 0:
    mssparkutils.notebook.exit('{}')
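Outside of a Synapse/Fabric notebook, where mssparkutils is not available, the same check can gate a scheduled job before it runs the update at all. A rough sketch, with has_new_documents as a hypothetical helper name and the same input/output layout as the snippet above:

import os
import sys
import pandas as pd

def has_new_documents(root_dir: str) -> bool:
    # True if the input folder contains .txt files not present in the previous index
    current = {doc for doc in os.listdir(f'{root_dir}/input') if doc.endswith('.txt')}
    previous = set(pd.read_parquet(f'{root_dir}/output/create_final_documents.parquet')['title'].unique())
    return bool(current - previous)

if not has_new_documents('.'):
    sys.exit(0)  # nothing new since the last run; skip the update entirely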

@natoverse added the enhancement (New feature or request) and backlog (We've confirmed some action is needed on this and will plan it) labels on Jan 14, 2025
@natoverse (Collaborator) commented

We've added this to the backlog - we'll make sure updates without new content can exit safely.
