[Bug]: <title>extract entities by nltk strategy found Error: "Column(s) ['description', 'source_id'] do not exist" #1601

HENScience · 2025-01-09T09:16:07Z

Do you need to file an issue?

I have searched the existing issues and this bug is not already filed.
My model is hosted on OpenAI or Azure. If not, please look at the "model providers" issue and don't file a new one here.
I believe this is a legitimate bug, not just a question. If this is a question, please use the Discussions area.

Describe the bug

When I set the strategy of entity extraction to nltk, the following error occurs during index creation:
KeyError: "Column(s) ['description', 'source_id'] do not exist"
graphrag\index\operations\extract_entities\extract_entities.py", line 171, in _merge_entities
.agg(description=("description", list), text_unit_ids=("source_id", list))
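For reference, pandas produces this message on its own whenever a named aggregation points at columns the grouped frame does not have. The sketch below is a minimal reproduction under assumptions: the frame contents and the helper name `merge_entities` are hypothetical stand-ins, and only the groupby/agg call mirrors the line quoted in the traceback above. The actual output of the nltk strategy is not shown in this issue.

```python
import pandas as pd

# Hypothetical stand-in for the concatenated entity frame: it has the grouping
# keys but lacks the "description" and "source_id" columns.
all_entities = pd.DataFrame(
    {"title": ["ACME", "ACME"], "type": ["organization", "organization"]}
)


def merge_entities(df: pd.DataFrame) -> pd.DataFrame:
    # Same groupby/agg shape as the line quoted in the traceback.
    return (
        df.groupby(["title", "type"], sort=False)
        .agg(description=("description", list), text_unit_ids=("source_id", list))
        .reset_index()
    )


try:
    merge_entities(all_entities)
    error_message = None
except KeyError as exc:
    # pandas reports every aggregation source column that is missing.
    error_message = str(exc)

print(error_message)
```

If this is what happens during indexing, the entities reaching the merge step carry a title and type but no description or source_id, which would point at the strategy output rather than at the aggregation itself.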

Steps to reproduce

No response

Expected Behavior

No response

GraphRAG Config Used

# Paste your config here
entity_extraction:
  prompt: "prompts/entity_extraction.txt"
  entity_types: [organization,person,geo,event]
  max_gleanings: 1
  strategy: 
    type: nltk

Logs and screenshots

No response

Additional Information

GraphRAG Version: v1.1.1
Operating System: Windows 11 Professional
Python Version: 3.10

natoverse · 2025-01-10T21:37:32Z

The nltk strategy has not been used in quite some time, so it may be unreliable. However, we have a new implementation coming in the next few days that will provide the same functionality.

HENScience · 2025-01-14T05:30:39Z

Thanks! Looking forward to the functionality!
I was using the nltk strategy because the graph_intelligence strategy requests LLM cleaning, which takes too long for longer texts. For instance, processing a book with around 300k tokens takes over 1 hour. It might also be an issue with my setup, so I’m investigating further.

Bartola64 · 2025-01-18T15:02:09Z

Hi,
I have a similar problem.
I also configured the entity extraction strategy to nltk and received the following error (from indexing-engine.log):

18:12:24,715 graphrag.storage.file_pipeline_storage INFO Creating file storage at /Users/bartola/Progetti/graphrag/giustizia/exp2/output
18:12:24,715 graphrag.index.input.factory INFO loading input from root_dir=../input
18:12:24,715 graphrag.index.input.factory INFO using file storage for input
18:12:24,716 graphrag.storage.file_pipeline_storage INFO search /Users/bartola/Progetti/graphrag/giustizia/exp2/../input for files matching .*\.txt$
18:12:24,717 graphrag.index.input.text INFO found text files from ../input, found [('N057UP002024X9C72F001_3.txt', {}), ('N057UP002024X9C72F001_2.txt', {}), ('N057UP002024X9C72F001_1.txt', {}), ('N057UP002024X9C79D001_2.txt', {}), ('N057UP002024X9C790001_2.txt', {}), ('N057UP002024X9C72F001.txt', {}), ('N057UP002024X9C79D001_1.txt', {}), ('N057UP002024X9C790001.txt', {}), ('N057UP002024X9C790001_1.txt', {}), ('N057UP002024X9C786001_2.txt', {}), ('N057UP002024X9C786001_1.txt', {}), ('N057UP002024X9C79D001.txt', {}), ('N057UP002024X9C786001.txt', {})]
18:12:24,725 graphrag.index.input.text INFO Found 13 files, loading 13
18:12:24,728 graphrag.index.run.run_workflows INFO Final # of rows loaded: 13
18:12:24,737 graphrag.utils.storage INFO reading table from storage: input.parquet
18:12:25,879 graphrag.utils.storage INFO reading table from storage: input.parquet
18:12:25,884 graphrag.utils.storage INFO reading table from storage: create_base_text_units.parquet
18:12:25,930 graphrag.utils.storage INFO reading table from storage: create_base_text_units.parquet
18:12:39,390 graphrag.index.run.run_workflows ERROR error running workflow extract_graph
Traceback (most recent call last):
  File "/Users/bartola/.virtualenvs/graphrag_env/lib/python3.12/site-packages/graphrag/index/run/run_workflows.py", line 166, in _run_workflows
    result = await run_workflow(
             ^^^^^^^^^^^^^^^^^^^
  File "/Users/bartola/.virtualenvs/graphrag_env/lib/python3.12/site-packages/graphrag/index/workflows/extract_graph.py", line 45, in run_workflow
    base_entity_nodes, base_relationship_edges = await extract_graph(
                                                 ^^^^^^^^^^^^^^^^^^^^
  File "/Users/bartola/.virtualenvs/graphrag_env/lib/python3.12/site-packages/graphrag/index/flows/extract_graph.py", line 33, in extract_graph
    entities, relationships = await extract_entities(
                              ^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/bartola/.virtualenvs/graphrag_env/lib/python3.12/site-packages/graphrag/index/operations/extract_entities/extract_entities.py", line 136, in extract_entities
    entities = _merge_entities(entity_dfs)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/bartola/.virtualenvs/graphrag_env/lib/python3.12/site-packages/graphrag/index/operations/extract_entities/extract_entities.py", line 168, in _merge_entities
    all_entities.groupby(["title", "type"], sort=False)
  File "/Users/bartola/.virtualenvs/graphrag_env/lib/python3.12/site-packages/pandas/core/frame.py", line 9183, in groupby
    return DataFrameGroupBy(
           ^^^^^^^^^^^^^^^^^
  File "/Users/bartola/.virtualenvs/graphrag_env/lib/python3.12/site-packages/pandas/core/groupby/groupby.py", line 1329, in __init__
    grouper, exclusions, obj = get_grouper(
                               ^^^^^^^^^^^^
  File "/Users/bartola/.virtualenvs/graphrag_env/lib/python3.12/site-packages/pandas/core/groupby/grouper.py", line 1043, in get_grouper
    raise KeyError(gpr)
KeyError: 'title'
18:12:39,403 graphrag.callbacks.file_workflow_callbacks INFO Error running pipeline! details=None
18:12:39,431 graphrag.cli.index ERROR Errors occurred during the pipeline run, see logs for more details.
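One way a `KeyError: 'title'` can come out of `groupby` at this point is that the concatenated frame has no columns at all. The sketch below reproduces that in isolation; it assumes, without confirmation from the log, that every per-document extraction returned an empty frame. The variable names `entity_dfs` and `all_entities` follow the traceback, and the data is made up.

```python
import pandas as pd

# Assumption: each per-document result came back with no rows and no columns.
entity_dfs = [pd.DataFrame() for _ in range(3)]
all_entities = pd.concat(entity_dfs, ignore_index=True)

try:
    # Grouping by a column that does not exist fails on the first missing key.
    all_entities.groupby(["title", "type"], sort=False)
    missing_key = None
except KeyError as exc:
    missing_key = exc.args[0]

print(list(all_entities.columns), missing_key)
```

Checking `all_entities.columns` just before the `groupby` call would show whether the frame is empty or merely missing some columns.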

Environment specification:

Graphrag version: 1.2.0
OS: macOS 14.7.2
Python: 3.12

Thanks for your help.

naginoa · 2025-01-22T10:44:52Z

Did you solve it?

Bartola64 · 2025-01-22T11:31:40Z

No, I didn't solve it.
I would like to wait for the new implementation that will provide the same functionality (as @natoverse mentioned).
Otherwise, do you have any suggestions?

naginoa · 2025-01-23T02:34:20Z

I don't think it's an nltk problem.
When I add the nltk strategy like you did:

entity_extraction:
  prompt: "prompts/entity_extraction.txt"
  entity_types: [organization,person,geo,event]
  max_gleanings: 1
  strategy: 
    type: nltk

I got the same error as you.

When I remove the strategy, like this:

entity_extraction:
  prompt: "prompts/entity_extraction.txt"
  entity_types: [organization,person,geo,event]
  max_gleanings: 1

I got the error KeyError: 'title', as in the other comment in this issue.

I debugged the code and found a bug in /graph/index/operations/extract_entities/extract_entities.py.
The variable result, assigned here:

result = await strategy_exec(
    [Document(text=text, id=id)],
    entity_types,
    callbacks,
    cache,
    strategy_config,
)

result is empty, even though the input is correct.

The function strategy_exec appears to hit a recursion error.
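Until the root cause is fixed, a defensive merge can at least stop the pipeline from crashing when the strategy returns nothing. The following is only a sketch under assumptions: `merge_entities_safely` and `REQUIRED_COLUMNS` are hypothetical names that are not part of graphrag, and the column names are taken from the groupby/agg call quoted in this issue.

```python
import pandas as pd

# Column names taken from the groupby/agg call quoted in this issue.
REQUIRED_COLUMNS = ["title", "type", "description", "source_id"]
OUTPUT_COLUMNS = ["title", "type", "description", "text_unit_ids"]


def merge_entities_safely(entity_dfs: list[pd.DataFrame]) -> pd.DataFrame:
    """Hypothetical helper, not part of graphrag: merge only well-formed frames."""
    usable = [
        df
        for df in entity_dfs
        if not df.empty and set(REQUIRED_COLUMNS).issubset(df.columns)
    ]
    if not usable:
        # Nothing was extracted: return an empty frame with the expected schema.
        return pd.DataFrame(columns=OUTPUT_COLUMNS)
    all_entities = pd.concat(usable, ignore_index=True)
    return (
        all_entities.groupby(["title", "type"], sort=False)
        .agg(description=("description", list), text_unit_ids=("source_id", list))
        .reset_index()
    )


# Made-up data: one empty result and one well-formed result.
good = pd.DataFrame(
    {
        "title": ["ACME", "ACME"],
        "type": ["organization", "organization"],
        "description": ["a", "b"],
        "source_id": ["t1", "t2"],
    }
)
empty_result = merge_entities_safely([pd.DataFrame(), pd.DataFrame()])
merged = merge_entities_safely([pd.DataFrame(), good])
print(merged)
```

This only hides the symptom. If strategy_exec returns an empty result for valid input, the entity table will still be empty, so the reason for the empty result needs to be found separately.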

@natoverse @Bartola64 @HENScience

HENScience added the labels bug (Something isn't working) and triage (Default label assignment, indicates new issue needs reviewed by a maintainer) on Jan 9, 2025.
