
[Feature Request]: Merging artefacts parquet files. #958

Closed
k2ai opened this issue Aug 18, 2024 · 3 comments
Labels
awaiting_response Maintainers or community have suggested solutions or requested info, awaiting filer response

Comments


k2ai commented Aug 18, 2024

Do you need to file an issue?

  • I have searched the existing issues and this feature is not already filed.
  • My model is hosted on OpenAI or Azure. If not, please look at the "model providers" issue and don't file a new one here.
  • I believe this is a legitimate feature request, not just a question. If this is a question, please use the Discussions area.

Is your feature request related to a problem? Please describe.

Hi,

I have around 15,000 chunks stored in individual .txt files, and I have been indexing them in batches of 500 to generate graphs. Each run created a new timestamped folder in the output directory. Before processing each new batch of 500 .txt files, I removed the older .txt files from the input folder. I expected the new runs to reuse the cached data for indexing, but that hasn't been the case; answers covering the earlier batches are no longer accurate.

I would like to understand how to merge all the timestamped folders into a single artifacts folder so that I can use the search API effectively.
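
Absent a built-in merge command, one possible workaround is to concatenate matching parquet artifacts across the timestamped run folders with pandas and de-duplicate. This is a rough sketch, not an official GraphRAG utility; the output/<timestamp>/artifacts layout and the "id" de-duplication column are assumptions that should be adjusted to the actual folder structure:

```python
# merge_artifacts.py -- rough sketch, NOT an official GraphRAG utility.
# Assumes each run lives at <root>/output/<timestamp>/artifacts/ and that
# rows can be de-duplicated on an "id" column; adjust both to your layout.
from pathlib import Path
import pandas as pd

output_root = Path("./test/output")              # hypothetical project root
merged_dir = output_root / "merged" / "artifacts"
merged_dir.mkdir(parents=True, exist_ok=True)

# Collect every artifact file name that appears in any timestamped run.
runs = sorted(p for p in output_root.iterdir() if p.is_dir() and p.name != "merged")
names = {f.name for run in runs for f in (run / "artifacts").glob("*.parquet")}

for name in sorted(names):
    frames = [
        pd.read_parquet(run / "artifacts" / name)
        for run in runs
        if (run / "artifacts" / name).exists()
    ]
    merged = pd.concat(frames, ignore_index=True)
    if "id" in merged.columns:                   # assumed de-dup key
        merged = merged.drop_duplicates(subset="id", keep="last")
    merged.to_parquet(merged_dir / name, index=False)
    print(f"{name}: {len(merged)} rows from {len(frames)} run(s)")
```

Note that naively concatenating artifacts cannot reconcile cross-run state such as community assignments or summaries, which is why re-indexing over the full input set (as discussed below) is generally the safer path.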

Describe the solution you'd like

No response

Additional context

No response

k2ai added the enhancement label Aug 18, 2024
natoverse (Collaborator) commented

Is there a particular reason you need to remove the old txt files from your input folder? The cache is invoked when the system encounters content that would produce an identical request - i.e., the same LLM parameters, and the same prompt (which will be identical for text chunks that were already processed). I would not expect the cache to be utilized for your new chunks, so that behavior sounds correct. But if you leave the old files, they should utilize the cache and therefore save on LLM calls.

That said, the later steps will recompute, because presumably your new content will result in updates to the graph, and therefore the community composition and summaries.

There is more info on incremental indexing at #741
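
For intuition only, here is a minimal sketch of the request-level caching described above, where the key covers the LLM parameters plus the prompt; this is an illustration, not GraphRAG's actual cache implementation:

```python
# Simplified illustration of a request-level LLM cache key; this is NOT
# GraphRAG's actual cache code, just the general idea described above.
import hashlib
import json

def cache_key(prompt: str, params: dict) -> str:
    """Key on everything that shapes the request: the prompt and LLM parameters."""
    payload = json.dumps({"prompt": prompt, "params": params}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

params = {"model": "gpt-4o", "temperature": 0.0}  # hypothetical parameters

# A chunk that was already processed produces the same key again -> cache hit.
assert cache_key("previously indexed chunk", params) == cache_key("previously indexed chunk", params)

# A newly added chunk produces a different key -> cache miss, a real LLM call.
assert cache_key("newly added chunk", params) != cache_key("previously indexed chunk", params)
```

Because the key covers both the prompt and the parameters, re-running over unchanged chunks is nearly free, while new chunks always miss the cache and incur LLM calls.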

natoverse added the awaiting_response label and removed the enhancement label Aug 19, 2024

k2ai commented Aug 22, 2024

I was under the impression that the cache built from the old input data would be applied to the new input files, so I removed the old files. Now it's clear; I will keep the old files and continue adding new data.

k2ai closed this as completed Aug 22, 2024

k2ai commented Aug 25, 2024

Should I use the --resume flag in the CLI so that previously generated cache data and parquet files are taken into account?

```
python -m graphrag.index --init --root ./test --resume "20240820-143721"
```
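
For what it's worth, a tiny helper (a sketch; the ./test/output layout and folder naming are assumptions based on the command above) can select the most recent timestamped run to pass to --resume instead of hard-coding it. Note that --init scaffolds a new project configuration, so a resume run would typically omit it, but please verify against the CLI documentation:

```python
# Sketch: print a resume command for the newest timestamped run folder.
# Assumes runs are named like "20240820-143721" under ./test/output,
# so lexicographic order matches chronological order.
from pathlib import Path

runs = sorted(p.name for p in Path("./test/output").iterdir() if p.is_dir())
if runs:
    print(f'python -m graphrag.index --root ./test --resume "{runs[-1]}"')
```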
