
graphrag CLI Update command issue #1600

Open · ajain85 opened this issue Jan 9, 2025 · 3 comments
Labels: backlog (We've confirmed some action is needed on this and will plan it) · enhancement (New feature or request)

Comments


ajain85 commented Jan 9, 2025

Discussed in #1599

Originally posted by ajain85 January 9, 2025
Hi all, I am running the graphrag CLI update command against blob storage but getting the error below. I am using Azure Blob Storage and Azure AI Search to save the parquet files and update the search index. Can anyone suggest a solution? Is this a graphrag library error, or is there something I am missing in my settings .yml file?

Error->
ValueError: Incremental Indexing Error: No new documents to process.

This is how I have configured the update storage section of my settings file:

update_index_storage:
  type: "blob" # or blob
  connection_string: ""
  container_name: "graphrag"
  base_dir: "output"
  storage_account_blob_url: "https://*.blob.core.windows.net/"

error_msg = 'Incremental Indexing Error: No new documents to process.'
is_update_run = True
logger = <graphrag.logger.rich_progress.RichProgressLogger object at 0x000001E99D45B2D0>
progress_logger = <graphrag.logger.rich_progress.RichProgressLogger object at 0x000001E99D45B2D0>
root_dir = 'C:\Users\JAINAB\UNHCR Workspace\test_graphrag\cligraphrag'
run_id = '20250109-135013'
storage = <graphrag.storage.blob_pipeline_storage.BlobPipelineStorage object at 0x000001E99D472250>
storage_config = {
    'type': 'blob',
    'base_dir': 'output',
    'connection_string': 'DefaultEndpointsProtocol=https;AccountName=d1hcrstgenaisharedxfc;AccountKey=K9Ya'…,
    'container_name': 'graphrag',
    'storage_account_blob_url': 'https://d1hcrstgenaisharedxfc.blob.core.windows.net/',
    'cosmosdb_account_url': None
}
update_index_storage = <graphrag.storage.file_pipeline_storage.FilePipelineStorage object at 0x000001E99F017710>
update_storage_config = {
    'type': 'file',
    'base_dir': 'C:\Users\JAINAB\UNHCR Workspace\test_graphrag\cligraphrag\update_output',
    'connection_string': None,
    'container_name': None,
    'storage_account_blob_url': None,
    'cosmosdb_account_url': None
}
workflows = [
    'create_base_text_units',
    'create_final_documents',
    'extract_graph',
    'compute_communities',
    'create_final_entities',
    'create_final_relationships',
    'create_final_nodes',
    'create_final_communities',
    'create_final_text_units',
    'create_final_community_reports',
    ... +1
]

ValueError: Incremental Indexing Error: No new documents to process.

@mmaitre314 (Contributor) commented

We are hitting a similar issue. It looks like the code considers that not having any documents to add is an error:

# Fail on empty delta dataset
if delta_dataset.new_inputs.empty:
    error_msg = "Incremental Indexing Error: No new documents to process."
    raise ValueError(error_msg)


It would be nice if the update instead succeeded and just kept the data as-is. We run indexing on scheduled jobs, and it is OK if there is nothing new since the last run -- the job can be a no-op.
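A minimal sketch of what we have in mind (untested; assumes the enclosing update function can simply return early when the delta is empty, and that a logger is available in scope):

# Instead of raising, treat an empty delta as a successful no-op
if delta_dataset.new_inputs.empty:
    logger.warning(
        "Incremental Indexing: no new documents to process; keeping the existing index as-is."
    )
    return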

If helpful, I could send a PR to update the logic of that piece of code.

Besides failing when there are no new documents to add, it looks like the code will also fail if there are only documents to delete, and it might not handle document deletions at all. That is likely a different problem, so I am opening a separate issue to track it.

@mmaitre314 (Contributor) commented

One workaround in our case has been to replicate the logic in get_delta_docs() and skip indexing if there are no new docs.


import os
import pandas as pd

# Compare the documents currently in the input folder against the documents already indexed
document_current = {doc for doc in os.listdir(f'{root_dir}/input') if doc.endswith('.txt')}
document_previous = set(pd.read_parquet(f'{root_dir}/output/create_final_documents.parquet')['title'].unique())
document_count = len(document_current)
document_added = len(document_current - document_previous)
document_removed = len(document_previous - document_current)

# Nothing new to index: exit the notebook early (mssparkutils is the Synapse/Fabric notebook utility)
if document_added == 0:
    mssparkutils.notebook.exit('{}')
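Outside of a Synapse/Fabric notebook, where mssparkutils is not available, the same check can gate a scheduled job before it runs the update at all. A rough sketch, with has_new_documents as a hypothetical helper name and the same input/output layout as the snippet above:

import os
import sys
import pandas as pd

def has_new_documents(root_dir: str) -> bool:
    # True if the input folder contains .txt files not present in the previous index
    current = {doc for doc in os.listdir(f'{root_dir}/input') if doc.endswith('.txt')}
    previous = set(pd.read_parquet(f'{root_dir}/output/create_final_documents.parquet')['title'].unique())
    return bool(current - previous)

if not has_new_documents('.'):
    sys.exit(0)  # nothing new since the last run; skip the update entirely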

@natoverse added the enhancement (New feature or request) and backlog (We've confirmed some action is needed on this and will plan it) labels on Jan 14, 2025
@natoverse (Collaborator) commented

We've added this to the backlog - we'll make sure updates without new content can exit safely.
