
[Feature Request]: Merging artefacts parquet files. #958

Closed
k2ai opened this issue Aug 18, 2024 · 3 comments
Labels
awaiting_response Maintainers or community have suggested solutions or requested info, awaiting filer response

Comments


k2ai commented Aug 18, 2024

Do you need to file an issue?

  • I have searched the existing issues and this feature is not already filed.
  • My model is hosted on OpenAI or Azure. If not, please look at the "model providers" issue and don't file a new one here.
  • I believe this is a legitimate feature request, not just a question. If this is a question, please use the Discussions area.

Is your feature request related to a problem? Please describe.

Hi,

I have around 15,000 chunks stored in individual .txt files, and I have been indexing them in batches of 500 to generate graphs. Each run created a new timestamped folder in the output directory. Before processing each new batch of 500 .txt files, I removed the older .txt files from the input folder. I expected the new runs to reuse the cached data for indexing, but that hasn't been the case; answers covering the earlier batches are no longer accurate.

I would like to understand how to merge all the timestamped folders into a single artifacts folder so that I can use the search API effectively.
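
Absent a built-in merge command, one possible workaround is to concatenate matching parquet artifacts across the timestamped run folders with pandas and de-duplicate. This is a rough sketch, not an official GraphRAG utility; the output/<timestamp>/artifacts layout and the "id" de-duplication column are assumptions that should be adjusted to the actual folder structure:

```python
# merge_artifacts.py -- rough sketch, NOT an official GraphRAG utility.
# Assumes each run lives at <root>/output/<timestamp>/artifacts/ and that
# rows can be de-duplicated on an "id" column; adjust both to your layout.
from pathlib import Path
import pandas as pd

output_root = Path("./test/output")              # hypothetical project root
merged_dir = output_root / "merged" / "artifacts"
merged_dir.mkdir(parents=True, exist_ok=True)

# Collect every artifact file name that appears in any timestamped run.
runs = sorted(p for p in output_root.iterdir() if p.is_dir() and p.name != "merged")
names = {f.name for run in runs for f in (run / "artifacts").glob("*.parquet")}

for name in sorted(names):
    frames = [
        pd.read_parquet(run / "artifacts" / name)
        for run in runs
        if (run / "artifacts" / name).exists()
    ]
    merged = pd.concat(frames, ignore_index=True)
    if "id" in merged.columns:                   # assumed de-dup key
        merged = merged.drop_duplicates(subset="id", keep="last")
    merged.to_parquet(merged_dir / name, index=False)
    print(f"{name}: {len(merged)} rows from {len(frames)} run(s)")
```

Note that naively concatenating artifacts cannot reconcile cross-run state such as community assignments or summaries, which is why re-indexing over the full input set (as discussed below) is generally the safer path.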

Describe the solution you'd like

No response

Additional context

No response

k2ai added the enhancement label Aug 18, 2024
natoverse (Collaborator) commented

Is there a particular reason you need to remove the old txt files from your input folder? The cache is invoked when the system encounters content that would produce an identical request - i.e., the same LLM parameters, and the same prompt (which will be identical for text chunks that were already processed). I would not expect the cache to be utilized for your new chunks, so that behavior sounds correct. But if you leave the old files, they should utilize the cache and therefore save on LLM calls.

That said, the later steps will recompute, because presumably your new content will result in updates to the graph, and therefore the community composition and summaries.

There is more info on incremental indexing at #741
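
For intuition only, here is a minimal sketch of the request-level caching described above, where the key covers the LLM parameters plus the prompt; this is an illustration, not GraphRAG's actual cache implementation:

```python
# Simplified illustration of a request-level LLM cache key; this is NOT
# GraphRAG's actual cache code, just the general idea described above.
import hashlib
import json

def cache_key(prompt: str, params: dict) -> str:
    """Key on everything that shapes the request: the prompt and LLM parameters."""
    payload = json.dumps({"prompt": prompt, "params": params}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

params = {"model": "gpt-4o", "temperature": 0.0}  # hypothetical parameters

# A chunk that was already processed produces the same key again -> cache hit.
assert cache_key("previously indexed chunk", params) == cache_key("previously indexed chunk", params)

# A newly added chunk produces a different key -> cache miss, a real LLM call.
assert cache_key("newly added chunk", params) != cache_key("previously indexed chunk", params)
```

Because the key covers both the prompt and the parameters, re-running over unchanged chunks is nearly free, while new chunks always miss the cache and incur LLM calls.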

natoverse added the awaiting_response label and removed the enhancement label Aug 19, 2024

k2ai commented Aug 22, 2024

I was under the impression that the cache built from the old input data would be applied to the new input files, so I removed the old files. Now it's clear; I will keep the old files and continue adding new data.

k2ai closed this as completed Aug 22, 2024

k2ai commented Aug 25, 2024

Should I use the --resume flag in the CLI so that previously generated cache data and parquet files are taken into account?

```
python -m graphrag.index --init --root ./test --resume "20240820-143721"
```
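
For what it's worth, a tiny helper (a sketch; the ./test/output layout and folder naming are assumptions based on the command above) can select the most recent timestamped run to pass to --resume instead of hard-coding it. Note that --init scaffolds a new project configuration, so a resume run would typically omit it, but please verify against the CLI documentation:

```python
# Sketch: print a resume command for the newest timestamped run folder.
# Assumes runs are named like "20240820-143721" under ./test/output,
# so lexicographic order matches chronological order.
from pathlib import Path

runs = sorted(p.name for p in Path("./test/output").iterdir() if p.is_dir())
if runs:
    print(f'python -m graphrag.index --root ./test --resume "{runs[-1]}"')
```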
