Support html and pdf as input file type #375

chenyang-shanghai · 2024-07-05T09:27:05Z

In the config guide, we can only choose .csv or text for the "file_type":
https://microsoft.github.io/graphrag/posts/config/json_yaml/

However, HTML files (saved from an internal web site) and PDFs are very common now, so I'd like to ask whether Microsoft GraphRAG supports these input file types. If it does, any guide would be appreciated.

cendenta · 2024-07-05T14:52:26Z

Integration with Document Intelligence and all its supported types would be excellent too.

1 reply

RichMorin Jul 6, 2024

HTML and PDF are plausible input formats, but they aren't the only ones of interest. To explore this, let's begin by comparing them with the current (CSV and "text") categories. Although CSV is written as a text (as opposed to binary) file, it is optimized for encoding rows of values (e.g., numbers, strings). Typically, this is used to describe a rectangular data structure (e.g., spreadsheet, table).

However, vanilla CSV does not include syntactic support for links, styles, etc. So, HTML and PDF would typically include information that CSV cannot represent without extensions. For example, it seems that Microsoft Excel defines a [=HYPERLINK()](https://stackoverflow.com/questions/7572268/how-to-encode-a-hyperlink-in-csv-formatted-file) extension for links, but GR may not support this.
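To make the CSV limitation concrete, here is a minimal sketch of how a link ends up in a CSV cell when encoded as an Excel-style formula. The helper names (`hyperlink_cell`, `rows_to_csv`) and the example URL are hypothetical, and nothing here implies that GraphRAG interprets the formula; to a plain CSV reader it is just a string.

```python
import csv
import io


def hyperlink_cell(url: str, label: str) -> str:
    """Build an Excel-style =HYPERLINK() formula as a plain string (hypothetical helper)."""
    return f'=HYPERLINK("{url}","{label}")'


def rows_to_csv(rows) -> str:
    """Serialize rows to CSV text; the csv module quotes and escapes cells as needed."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerows(rows)
    return buffer.getvalue()


rows = [
    ["title", "link"],
    ["Config guide", hyperlink_cell("https://example.com/config", "Config guide")],
]
csv_text = rows_to_csv(rows)

# Reading the file back yields the formula as ordinary text, not as a link.
parsed = list(csv.reader(io.StringIO(csv_text)))
```

The link survives only as text inside a cell, so any consumer that does not evaluate spreadsheet formulas sees a string rather than a relationship.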

That said, in the world of LLMs and such, any sequence of characters can be treated as a token and thus serve as an implicit link between the usage instances. So, can someone point me to a description of what GraphRAG considers to be "text" and what sorts of graph information GR can extract from it?

Also, if GR can harvest linkage information from fairly arbitrary text files, perhaps it could make sense of some popular graph representation formats. For example:

ArangoDB is a graph database which is based on JSON.
Graphviz uses the DOT graph description language.
JSON-LD is a standard for encoding linked data.
Neo4j uses the Cypher query language.

Can someone fill me in on the current prospects for this sort of thing?

-r

AlonsoGuevara (Maintainer) · 2024-07-05T23:14:53Z

Hi! Currently we only support txt and csv inputs, as you pointed out.

However, what I have done to process multiple file types is to add a preprocessing step where, using libs like pypdf, I convert everything into the same format. The same goes for docx and xlsx, among others.
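A preprocessing step like this could be sketched as follows. This is only an illustration under stated assumptions, not the maintainer's actual script: the function names are hypothetical, HTML is handled with the standard library's `html.parser`, and PDF extraction relies on the third-party pypdf package (`PdfReader` and `page.extract_text()`), which must be installed separately.

```python
from html.parser import HTMLParser
from pathlib import Path


class _TextExtractor(HTMLParser):
    """Collect visible text, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0 and data.strip():
            self.parts.append(data.strip())


def html_to_text(html: str) -> str:
    """Return the visible text of an HTML document, one text node per line."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.parts)


def pdf_to_text(path) -> str:
    """Extract text from a PDF with pypdf (imported lazily; pip install pypdf)."""
    from pypdf import PdfReader

    reader = PdfReader(str(path))
    return "\n".join((page.extract_text() or "") for page in reader.pages)


def convert_to_txt(src, out_dir) -> Path:
    """Convert one .html/.htm/.pdf file to a .txt file in out_dir."""
    src, out_dir = Path(src), Path(out_dir)
    suffix = src.suffix.lower()
    if suffix in (".html", ".htm"):
        text = html_to_text(src.read_text(encoding="utf-8", errors="replace"))
    elif suffix == ".pdf":
        text = pdf_to_text(src)
    else:
        raise ValueError(f"unsupported file type: {suffix}")
    out_dir.mkdir(parents=True, exist_ok=True)
    out_path = out_dir / (src.stem + ".txt")
    out_path.write_text(text, encoding="utf-8")
    return out_path


sample = (
    "<html><head><style>p{}</style><script>var x=1;</script></head>"
    "<body><h1>Title</h1><p>Body text</p></body></html>"
)
sample_text = html_to_text(sample)
```

The resulting .txt files can then be placed in the directory GraphRAG reads its text input from. Scanned PDFs without a text layer will come out empty with this approach and would need OCR instead.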

I really like the proposal of adding Document Intelligence support.

2 replies

adairgj Jul 11, 2024

What part of the architecture restricts it to txt and csv only? The LLM supports these formats, so I'm curious why it's restricted.

matbee-eth Jul 15, 2024

> What part of the architecture restricts it to txt and csv only? The LLM supports these formats, so I'm curious why it's restricted.

It's probably just its token truncating / token chunking system, to be honest.

Text tends to be formatted with newlines, paragraphs, etc... I'll have to take a deeper look
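To illustrate why paragraph structure matters for chunking, here is a rough sketch of a paragraph-aware chunker. It is not GraphRAG's actual implementation: the function name is hypothetical, "tokens" are approximated by whitespace-separated words, and a single paragraph longer than the budget is kept whole rather than split.

```python
def chunk_paragraphs(text: str, max_tokens: int = 300) -> list[str]:
    """Group blank-line-separated paragraphs into chunks of roughly max_tokens words."""
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks, current, current_len = [], [], 0
    for para in paragraphs:
        n = len(para.split())  # crude token count: whitespace-separated words
        # Start a new chunk when adding this paragraph would exceed the budget.
        if current and current_len + n > max_tokens:
            chunks.append("\n\n".join(current))
            current, current_len = [], 0
        current.append(para)
        current_len += n
    if current:
        chunks.append("\n\n".join(current))
    return chunks


example_chunks = chunk_paragraphs("a b c\n\nd e\n\nf g h i", max_tokens=5)
```

Input that has lost its newlines and paragraph breaks, as raw HTML or PDF extraction often has, gives a chunker like this nothing to split on, which may be part of why plain text is the expected input.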

Bioinf-usr · 2024-08-13T01:57:07Z

+1 for the Document Intelligence support. The new version of Document Intelligence is really good. It would be wonderful to have it integrated directly into GraphRAG processing.

0 replies

dyafu · 2024-12-12T05:20:29Z

Use docling to run OCR on the PDF, then feed the text to GraphRAG.
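A minimal sketch of that workflow, assuming docling is installed (`pip install docling`) and using its `DocumentConverter` with default settings. The helper names, and the choice of an `input` directory for GraphRAG's text files, are assumptions for illustration; whether OCR actually runs depends on docling's pipeline configuration.

```python
from pathlib import Path


def pdf_to_markdown(pdf_path) -> str:
    """Convert a PDF to Markdown text with docling (imported lazily)."""
    from docling.document_converter import DocumentConverter

    converter = DocumentConverter()
    result = converter.convert(str(pdf_path))
    return result.document.export_to_markdown()


def save_for_graphrag(text: str, name: str, input_dir="input") -> Path:
    """Write text as <stem>.txt into input_dir (assumed GraphRAG input folder)."""
    out_dir = Path(input_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    out_path = out_dir / (Path(name).stem + ".txt")
    out_path.write_text(text, encoding="utf-8")
    return out_path


# Hypothetical usage:
# save_for_graphrag(pdf_to_markdown("report.pdf"), "report.pdf")
```

With the output saved as .txt, GraphRAG can index it using the text file type it already supports.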

0 replies
