[Bug]: Entity extraction with the nltk strategy fails with Error: "Column(s) ['description', 'source_id'] do not exist" #1601

Open
3 tasks
HENScience opened this issue Jan 9, 2025 · 6 comments
Labels
bug: Something isn't working
triage: Default label assignment, indicates new issue needs reviewed by a maintainer

Comments

@HENScience

HENScience commented Jan 9, 2025

Do you need to file an issue?

  • I have searched the existing issues and this bug is not already filed.
  • My model is hosted on OpenAI or Azure. If not, please look at the "model providers" issue and don't file a new one here.
  • I believe this is a legitimate bug, not just a question. If this is a question, please use the Discussions area.

Describe the bug

When I set the strategy of entity extraction to nltk, the following error occurs during index creation:
KeyError: "Column(s) ['description', 'source_id'] do not exist"
graphrag\index\operations\extract_entities\extract_entities.py", line 171, in _merge_entities
.agg(description=("description", list), text_unit_ids=("source_id", list))
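For reference, the same message can be reproduced with pandas alone. This is a minimal sketch, assuming (hypothetically) that the entity records reaching `_merge_entities` carry only `title` and `type`; the sample rows are invented for illustration and are not taken from GraphRAG.

```python
import pandas as pd

# Hypothetical entity records: only the grouping columns are present,
# so "description" and "source_id" are missing.
all_entities = pd.DataFrame(
    {
        "title": ["ACME", "ACME", "Alice"],
        "type": ["organization", "organization", "person"],
    }
)

error_message = None
try:
    # Same named aggregation as the line quoted in the traceback above.
    all_entities.groupby(["title", "type"], sort=False).agg(
        description=("description", list),
        text_unit_ids=("source_id", list),
    )
except KeyError as exc:
    error_message = exc.args[0]

print(error_message)
```

If this is what happens in the pipeline, the aggregation itself is fine and the real question is why the strategy output lacks those two columns.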

Steps to reproduce

No response

Expected Behavior

No response

GraphRAG Config Used

# Paste your config here
entity_extraction:
  prompt: "prompts/entity_extraction.txt"
  entity_types: [organization,person,geo,event]
  max_gleanings: 1
  strategy: 
    type: nltk

Logs and screenshots

No response

Additional Information

  • GraphRAG Version: v1.1.1
  • Operating System: Windows 11 Professional
  • Python Version: 3.10
@HENScience HENScience added bug Something isn't working triage Default label assignment, indicates new issue needs reviewed by a maintainer labels Jan 9, 2025
@natoverse
Collaborator

The nltk strategy has not been used in quite some time, so it may be unreliable. However, we have a new implementation coming in the next few days that will provide the same functionality.

@HENScience
Author

Thanks! Looking forward to the functionality!
I was using the nltk strategy because the graph_intelligence strategy requests LLM cleaning, which takes too long for longer texts. For instance, processing a book with around 300k tokens takes over 1 hour. It might also be an issue with my setup, so I’m investigating further.

@Bartola64

Hi,
I have a similar problem.
I also configured the entity extraction strategy to nltk and received the following error (from indexing-engine.log):

18:12:24,715 graphrag.storage.file_pipeline_storage INFO Creating file storage at /Users/bartola/Progetti/graphrag/giustizia/exp2/output
18:12:24,715 graphrag.index.input.factory INFO loading input from root_dir=../input
18:12:24,715 graphrag.index.input.factory INFO using file storage for input
18:12:24,716 graphrag.storage.file_pipeline_storage INFO search /Users/bartola/Progetti/graphrag/giustizia/exp2/../input for files matching .*\.txt$
18:12:24,717 graphrag.index.input.text INFO found text files from ../input, found [('N057UP002024X9C72F001_3.txt', {}), ('N057UP002024X9C72F001_2.txt', {}), ('N057UP002024X9C72F001_1.txt', {}), ('N057UP002024X9C79D001_2.txt', {}), ('N057UP002024X9C790001_2.txt', {}), ('N057UP002024X9C72F001.txt', {}), ('N057UP002024X9C79D001_1.txt', {}), ('N057UP002024X9C790001.txt', {}), ('N057UP002024X9C790001_1.txt', {}), ('N057UP002024X9C786001_2.txt', {}), ('N057UP002024X9C786001_1.txt', {}), ('N057UP002024X9C79D001.txt', {}), ('N057UP002024X9C786001.txt', {})]
18:12:24,725 graphrag.index.input.text INFO Found 13 files, loading 13
18:12:24,728 graphrag.index.run.run_workflows INFO Final # of rows loaded: 13
18:12:24,737 graphrag.utils.storage INFO reading table from storage: input.parquet
18:12:25,879 graphrag.utils.storage INFO reading table from storage: input.parquet
18:12:25,884 graphrag.utils.storage INFO reading table from storage: create_base_text_units.parquet
18:12:25,930 graphrag.utils.storage INFO reading table from storage: create_base_text_units.parquet
18:12:39,390 graphrag.index.run.run_workflows ERROR error running workflow extract_graph
Traceback (most recent call last):
  File "/Users/bartola/.virtualenvs/graphrag_env/lib/python3.12/site-packages/graphrag/index/run/run_workflows.py", line 166, in _run_workflows
    result = await run_workflow(
             ^^^^^^^^^^^^^^^^^^^
  File "/Users/bartola/.virtualenvs/graphrag_env/lib/python3.12/site-packages/graphrag/index/workflows/extract_graph.py", line 45, in run_workflow
    base_entity_nodes, base_relationship_edges = await extract_graph(
                                                 ^^^^^^^^^^^^^^^^^^^^
  File "/Users/bartola/.virtualenvs/graphrag_env/lib/python3.12/site-packages/graphrag/index/flows/extract_graph.py", line 33, in extract_graph
    entities, relationships = await extract_entities(
                              ^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/bartola/.virtualenvs/graphrag_env/lib/python3.12/site-packages/graphrag/index/operations/extract_entities/extract_entities.py", line 136, in extract_entities
    entities = _merge_entities(entity_dfs)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/bartola/.virtualenvs/graphrag_env/lib/python3.12/site-packages/graphrag/index/operations/extract_entities/extract_entities.py", line 168, in _merge_entities
    all_entities.groupby(["title", "type"], sort=False)
  File "/Users/bartola/.virtualenvs/graphrag_env/lib/python3.12/site-packages/pandas/core/frame.py", line 9183, in groupby
    return DataFrameGroupBy(
           ^^^^^^^^^^^^^^^^^
  File "/Users/bartola/.virtualenvs/graphrag_env/lib/python3.12/site-packages/pandas/core/groupby/groupby.py", line 1329, in __init__
    grouper, exclusions, obj = get_grouper(
                               ^^^^^^^^^^^^
  File "/Users/bartola/.virtualenvs/graphrag_env/lib/python3.12/site-packages/pandas/core/groupby/grouper.py", line 1043, in get_grouper
    raise KeyError(gpr)
KeyError: 'title'
18:12:39,403 graphrag.callbacks.file_workflow_callbacks INFO Error running pipeline! details=None
18:12:39,431 graphrag.cli.index ERROR Errors occurred during the pipeline run, see logs for more details.
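The `KeyError: 'title'` at the `groupby` call is what pandas raises when the grouping column is absent. A minimal sketch of one way this can happen, assuming (not confirmed by the log) that every per-document frame was built from an empty result and therefore has no columns; `REQUIRED_COLUMNS` is a hypothetical name used only for this illustration:

```python
import pandas as pd

# Assumption: each per-document extraction returned nothing,
# so every frame is empty and has no columns at all.
entity_dfs = [pd.DataFrame([]), pd.DataFrame([])]
all_entities = pd.concat(entity_dfs, ignore_index=True)

missing_key = None
try:
    # Same grouping call as in the traceback above.
    all_entities.groupby(["title", "type"], sort=False)
except KeyError as exc:
    missing_key = exc.args[0]

# A pre-check like this would name every missing column at once
# instead of failing on the first grouping key.
REQUIRED_COLUMNS = ["title", "type", "description", "source_id"]
missing_columns = [c for c in REQUIRED_COLUMNS if c not in all_entities.columns]

print(missing_key, missing_columns)
```

Checking the columns before grouping does not fix the underlying problem, but it would make the log point at the empty extraction result rather than at pandas.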

Environment specification:

  • Graphrag version: 1.2.0
  • OS: macOS 14.7.2
  • Python: 3.12

Thanks for your help.

@naginoa

naginoa commented Jan 22, 2025

Did you solve it?

@Bartola64

No, I didn't solve it.
I would like to wait for the new implementation that will provide the same functionality (as said by @natoverse).
Otherwise, do you have any suggestions?

@naginoa

naginoa commented Jan 23, 2025

I think it's not an nltk problem.
When I add the nltk strategy like you did:

entity_extraction:
  prompt: "prompts/entity_extraction.txt"
  entity_types: [organization,person,geo,event]
  max_gleanings: 1
  strategy: 
    type: nltk

I got the same error as you.

When I remove the strategy, like this:

entity_extraction:
  prompt: "prompts/entity_extraction.txt"
  entity_types: [organization,person,geo,event]
  max_gleanings: 1

I got the error KeyError: 'title', like the other comment in this issue.

I debugged the code and found that the bug is in graphrag/index/operations/extract_entities/extract_entities.py.
In the call

result = await strategy_exec(
    [Document(text=text, id=id)],
    entity_types,
    callbacks,
    cache,
    strategy_config,
)

result is empty, even though the input is correct.

The function strategy_exec seems to hit a recursion error.
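One way to confirm this while debugging is to wrap the call and fail loudly when nothing comes back. The sketch below is self-contained and uses stand-ins only: `Document`, `ExtractionResult`, `empty_strategy`, and `run_checked` are hypothetical names written for this illustration, not GraphRAG's actual classes or functions.

```python
import asyncio
from dataclasses import dataclass, field


@dataclass
class Document:
    """Hypothetical stand-in for the document type passed to the strategy."""
    text: str
    id: str


@dataclass
class ExtractionResult:
    """Hypothetical stand-in for the strategy's return value."""
    entities: list = field(default_factory=list)


async def empty_strategy(docs, entity_types, callbacks, cache, strategy_config):
    # Stub that mimics the reported behaviour: valid input, empty output.
    return ExtractionResult()


async def run_checked(strategy_exec, text, doc_id, entity_types, strategy_config):
    # Call the strategy with the same argument order as the snippet above,
    # then surface an empty result immediately instead of letting it
    # propagate into the later merge step.
    result = await strategy_exec(
        [Document(text=text, id=doc_id)],
        entity_types,
        None,  # callbacks (unused by the stub)
        None,  # cache (unused by the stub)
        strategy_config,
    )
    if not result.entities:
        raise ValueError(f"strategy returned no entities for document {doc_id!r}")
    return result


def demo():
    try:
        asyncio.run(
            run_checked(
                empty_strategy, "Alice works at ACME.", "doc-1",
                ["person", "organization"], {"type": "nltk"},
            )
        )
    except ValueError as exc:
        return str(exc)
    return None


message = demo()
print(message)
```

Raising is only useful for debugging; a document may legitimately contain no entities, so a real pipeline would more likely log a warning and continue.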

@natoverse @Bartola64 @HENScience


4 participants