How do we trace global search result references back to source documents? #931
Replies: 3 comments
-
I think several approaches could work here. Let me propose one of them, which I call the "union set approach": the summary is paired with one set of sources for the whole summary. The opposite approach would be not to build a union source set for the whole summary, but to attach specific source links to each part of the summary. In the following, I follow the union set approach because it seems easier to implement.

**Getting the set of sources of a community / entity**

At a minimum, whenever summarization happens, we can build the union of all sources and declare these the sources of the summary. People do something similar when citing in scientific texts (e.g. [p. 4, pp. 44-46, p. 120]). As an enhancing post-processing step, the sources can be filtered: if the summary no longer contains specific information from one of the source blocks, that block can be removed from the source set. For example, if a page contains a detailed description of a particular engine block and that engine block is no longer mentioned in a higher-level summary of the text, then the page should not be part of the summary's sources. As another post-processing step, we should merge these sources, e.g. if more than 50% of a chapter is a source, we just say "chapter X is a source", or pp. 49-50, pp. 50-51 becomes pp. 49-51.

**Finding the particular part of the sources that is relevant to a given question**

This part of the problem could be solved with an LLM call or maybe some vector search. We do this a lot for our LLM use cases.
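The union-and-merge step above can be sketched as follows. This is a minimal illustration, not an existing API: the `(start, end)` page-tuple representation and the function names `merge_page_ranges` / `union_sources` are assumptions made for the example.

```python
def merge_page_ranges(ranges):
    """Merge overlapping or adjacent (start, end) page ranges,
    e.g. pp. 49-50 and pp. 50-51 become pp. 49-51."""
    merged = []
    for start, end in sorted(ranges):
        # "+ 1" also merges directly adjacent pages (pp. 44-45 + p. 46)
        if merged and start <= merged[-1][1] + 1:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged

def union_sources(parts):
    """Union set approach: collect the source ranges of every summarized
    part, then merge them into one source set for the whole summary."""
    all_ranges = [r for part in parts for r in part]
    return merge_page_ranges(all_ranges)
```

For example, `union_sources([[(49, 50)], [(50, 51)], [(44, 46)]])` yields `[(44, 46), (49, 51)]`. The chapter-level rule ("more than 50% of a chapter is a source") would sit on top of this as a further collapsing pass.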
-
Has anybody come up with an implementation for this?
-
How can we do it for local search?
-
Copied from #930
Here is an outline of the linkage from community reports back to source documents:
Note that there is intermediary summarization that makes it difficult to establish an exact link. For example, an entity's description is summarized from the descriptions extracted from all text units it is found within. We therefore can't pinpoint the exact text unit (and therefore document) that contributed a particular element of the entity's summary, because it is an aggregation. Similarly, the community report itself is summarized from all entity descriptions, so it is difficult to pinpoint the exact entity description that resulted in a particular answer. You may be able to narrow it down with a post-processing cross-check: use a vector store to find the text units that best match a given fragment of the answer.
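The cross-check idea could be sketched like this. To keep it self-contained, a toy bag-of-words similarity stands in for a real embedding model and vector store; `best_text_units` and the text-unit dict shape are hypothetical, not part of any existing API.

```python
import math
from collections import Counter

def embed(text):
    # Toy "embedding": word counts. In practice, use a real
    # embedding model and a vector store instead.
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def best_text_units(answer_fragment, text_units, top_k=1):
    """Rank candidate text units by similarity to a fragment of the
    generated answer, to narrow down its likely source documents."""
    query = embed(answer_fragment)
    ranked = sorted(text_units,
                    key=lambda tu: cosine(query, embed(tu["text"])),
                    reverse=True)
    return ranked[:top_k]
```

Because of the intermediary summarization described above, this only recovers the *most plausible* contributing text units, not a guaranteed provenance chain.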