[Bug]: <title>extract entities by nltk strategy found Error: "Column(s) ['description', 'source_id'] do not exist" #1601
Comments
The nltk strategy has not been used in quite some time, so it may be unreliable. However, we have a new implementation coming in the next few days that will provide the same functionality.
Thanks! Looking forward to the functionality!
Hi,
18:12:24,715 graphrag.storage.file_pipeline_storage INFO Creating file storage at /Users/bartola/Progetti/graphrag/giustizia/exp2/output
18:12:24,715 graphrag.index.input.factory INFO loading input from root_dir=../input
18:12:24,715 graphrag.index.input.factory INFO using file storage for input
18:12:24,716 graphrag.storage.file_pipeline_storage INFO search /Users/bartola/Progetti/graphrag/giustizia/exp2/../input for files matching .*\.txt$
18:12:24,717 graphrag.index.input.text INFO found text files from ../input, found [('N057UP002024X9C72F001_3.txt', {}), ('N057UP002024X9C72F001_2.txt', {}), ('N057UP002024X9C72F001_1.txt', {}), ('N057UP002024X9C79D001_2.txt', {}), ('N057UP002024X9C790001_2.txt', {}), ('N057UP002024X9C72F001.txt', {}), ('N057UP002024X9C79D001_1.txt', {}), ('N057UP002024X9C790001.txt', {}), ('N057UP002024X9C790001_1.txt', {}), ('N057UP002024X9C786001_2.txt', {}), ('N057UP002024X9C786001_1.txt', {}), ('N057UP002024X9C79D001.txt', {}), ('N057UP002024X9C786001.txt', {})]
18:12:24,725 graphrag.index.input.text INFO Found 13 files, loading 13
18:12:24,728 graphrag.index.run.run_workflows INFO Final # of rows loaded: 13
18:12:24,737 graphrag.utils.storage INFO reading table from storage: input.parquet
18:12:25,879 graphrag.utils.storage INFO reading table from storage: input.parquet
18:12:25,884 graphrag.utils.storage INFO reading table from storage: create_base_text_units.parquet
18:12:25,930 graphrag.utils.storage INFO reading table from storage: create_base_text_units.parquet
18:12:39,390 graphrag.index.run.run_workflows ERROR error running workflow extract_graph
Traceback (most recent call last):
  File "/Users/bartola/.virtualenvs/graphrag_env/lib/python3.12/site-packages/graphrag/index/run/run_workflows.py", line 166, in _run_workflows
    result = await run_workflow(
             ^^^^^^^^^^^^^^^^^^^
  File "/Users/bartola/.virtualenvs/graphrag_env/lib/python3.12/site-packages/graphrag/index/workflows/extract_graph.py", line 45, in run_workflow
    base_entity_nodes, base_relationship_edges = await extract_graph(
                                                 ^^^^^^^^^^^^^^^^^^^^
  File "/Users/bartola/.virtualenvs/graphrag_env/lib/python3.12/site-packages/graphrag/index/flows/extract_graph.py", line 33, in extract_graph
    entities, relationships = await extract_entities(
                              ^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/bartola/.virtualenvs/graphrag_env/lib/python3.12/site-packages/graphrag/index/operations/extract_entities/extract_entities.py", line 136, in extract_entities
    entities = _merge_entities(entity_dfs)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/bartola/.virtualenvs/graphrag_env/lib/python3.12/site-packages/graphrag/index/operations/extract_entities/extract_entities.py", line 168, in _merge_entities
    all_entities.groupby(["title", "type"], sort=False)
  File "/Users/bartola/.virtualenvs/graphrag_env/lib/python3.12/site-packages/pandas/core/frame.py", line 9183, in groupby
    return DataFrameGroupBy(
           ^^^^^^^^^^^^^^^^^
  File "/Users/bartola/.virtualenvs/graphrag_env/lib/python3.12/site-packages/pandas/core/groupby/groupby.py", line 1329, in __init__
    grouper, exclusions, obj = get_grouper(
                               ^^^^^^^^^^^^
  File "/Users/bartola/.virtualenvs/graphrag_env/lib/python3.12/site-packages/pandas/core/groupby/grouper.py", line 1043, in get_grouper
    raise KeyError(gpr)
KeyError: 'title'
18:12:39,403 graphrag.callbacks.file_workflow_callbacks INFO Error running pipeline! details=None
18:12:39,431 graphrag.cli.index ERROR Errors occurred during the pipeline run, see logs for more details.
Environment specification:
Thanks for your help.
Did you solve it?
No, I didn't solve it.
I think it's not an nltk problem.
I got the same error as you when I removed the strategy, like
I got the error KeyError: 'title', like the other comment in this issue. I debugged the code and found that it is a bug in /graph/index/operations/extract_entities/extract_entities.py
The result is empty even though the input is correct; the function strategy_exec seems to run into a recursion error.
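The KeyError: 'title' in the traceback earlier in this thread is consistent with an empty extraction result. A minimal sketch, assuming every extraction call hands back a DataFrame with no rows and no columns (the entity_dfs list below is a hypothetical stand-in, not graphrag's actual data):

```python
import pandas as pd

# Hypothetical stand-in for the per-document extraction results:
# each frame is completely empty, so no "title" or "type" column exists.
entity_dfs = [pd.DataFrame(), pd.DataFrame()]
all_entities = pd.concat(entity_dfs, ignore_index=True)

try:
    # Same grouping call as the one shown in the traceback.
    all_entities.groupby(["title", "type"], sort=False)
    error = None
except KeyError as exc:
    error = exc

print(repr(error))
```

If this assumption holds, the groupby call fails on the first missing key, which would explain why the error names 'title' regardless of which strategy is configured.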
Do you need to file an issue?
Describe the bug
When I set the strategy of entity extraction to nltk, the following error occurs during index creation:
KeyError: "Column(s) ['description', 'source_id'] do not exist"
graphrag\index\operations\extract_entities\extract_entities.py", line 171, in _merge_entities
.agg(description=("description", list), text_unit_ids=("source_id", list))
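This message is what pandas reports when a named aggregation refers to columns the frame does not have. A minimal sketch, assuming the extraction output contains title and type but neither description nor source_id (the sample row and the missing_columns helper are hypothetical and not part of graphrag):

```python
import pandas as pd

# Hypothetical extraction output: the grouping keys exist, but the
# columns that the aggregation expects are absent.
all_entities = pd.DataFrame({"title": ["ACME"], "type": ["ORGANIZATION"]})


def missing_columns(df, required=("title", "type", "description", "source_id")):
    """Return the required column names that df lacks (illustrative helper)."""
    return [name for name in required if name not in df.columns]


try:
    # Same aggregation as the one quoted in the error above.
    (
        all_entities.groupby(["title", "type"], sort=False)
        .agg(description=("description", list), text_unit_ids=("source_id", list))
    )
    error = None
except KeyError as exc:
    error = exc

print(type(error).__name__, error)
print("missing:", missing_columns(all_entities))
```

A check like missing_columns, run before the aggregation, would show which columns the chosen strategy failed to produce instead of failing inside pandas.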
Steps to reproduce
No response
Expected Behavior
No response
GraphRAG Config Used
Logs and screenshots
No response
Additional Information