Support html and pdf as input file type #375
-
Integration with Document Intelligence and all its supported types would be excellent too.
-
Hi! Currently we only support txt and csv inputs, as you pointed out. To process other file types, I add a preprocessing step that uses libraries such as pypdf to convert everything into the same format. The same approach works for docx and xlsx, among others. I really like the proposal of adding Document Intelligence support.
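A minimal sketch of such a preprocessing step, assuming the goal is one .txt file per source document in the folder GraphRAG reads from. The function names (`html_to_text`, `pdf_to_text`, `convert_to_txt`) are made up for illustration and are not part of GraphRAG. HTML is handled with the standard library only; the PDF branch assumes pypdf is installed (`pip install pypdf`) and only recovers text that is embedded in the PDF, so scanned pages would still need OCR.

```python
from html.parser import HTMLParser
from pathlib import Path


class _TextExtractor(HTMLParser):
    """Collect visible text, skipping <script> and <style> bodies."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside script/style elements.
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())


def html_to_text(html: str) -> str:
    """Return the visible text of an HTML string, one text node per line."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)


def pdf_to_text(path: Path) -> str:
    """Extract embedded text from a PDF with pypdf (third-party)."""
    from pypdf import PdfReader  # imported lazily so HTML-only use needs no extra install

    reader = PdfReader(str(path))
    return "\n".join((page.extract_text() or "") for page in reader.pages)


def convert_to_txt(src: Path, out_dir: Path) -> Path:
    """Convert one .html/.htm/.pdf/.txt file to a .txt file in out_dir."""
    suffix = src.suffix.lower()
    if suffix in (".html", ".htm"):
        text = html_to_text(src.read_text(encoding="utf-8", errors="replace"))
    elif suffix == ".pdf":
        text = pdf_to_text(src)
    elif suffix == ".txt":
        text = src.read_text(encoding="utf-8", errors="replace")
    else:
        raise ValueError(f"unsupported input type: {suffix}")
    out_dir.mkdir(parents=True, exist_ok=True)
    dest = out_dir / (src.stem + ".txt")
    dest.write_text(text, encoding="utf-8")
    return dest
```

Running `convert_to_txt` over every file in a raw-documents folder and pointing the GraphRAG input directory at the output folder keeps the indexing config on the supported text file type.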
-
+1 for Document Intelligence support. The new version of Document Intelligence is really good. It would be wonderful to have it integrated directly into GraphRAG processing.
-
Use docling to run OCR on the PDF, then feed the extracted text to GraphRAG.
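A rough sketch of that route, assuming docling is installed (`pip install docling`) and that its `DocumentConverter` is used as shown in the docling README. The helper name `pdf_to_graphrag_input` and the `converter` parameter are illustrative, not part of docling or GraphRAG. The output is Markdown text saved with a .txt extension so the existing text input type can pick it up.

```python
from pathlib import Path


def pdf_to_graphrag_input(source, out_dir, converter=None):
    """Convert one PDF to text with docling and save it as <name>.txt in out_dir.

    `converter` may be any object with a docling-style `convert(source)` method;
    when omitted, a docling DocumentConverter is created (third-party dependency).
    """
    if converter is None:
        # Imported lazily: docling is only required when no converter is passed in.
        from docling.document_converter import DocumentConverter

        converter = DocumentConverter()

    result = converter.convert(str(source))
    text = result.document.export_to_markdown()

    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    dest = out_dir / (Path(source).stem + ".txt")
    dest.write_text(text, encoding="utf-8")
    return dest
```

For example, `pdf_to_graphrag_input("report.pdf", "input")` would write `input/report.txt`, which can then be indexed like any other text file.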
-
In the config guide, we can only choose csv or text for "file_type":
https://microsoft.github.io/graphrag/posts/config/json_yaml/
However, HTML files (saved from internal websites) and PDFs are very common now. Does Microsoft GraphRAG support these input file types? If it does, any guide would be appreciated.