How do we trace global search result references back to source documents? #931
Replies: 3 comments
-
I think several approaches could work here. Let me propose one of them, which I call the "union set approach": the summary is paired with one set of sources for the whole summary. The opposite approach would be not to build a union source set for the whole summary, but to attach specific source links to each part of the summary. In the following, I follow the union set approach because it seems easier to implement.

**Getting the set of sources of a community / entity**

At a minimum, whenever summarization happens, we can build the union of all sources and declare these the sources of the summary. People do something similar when citing in scientific texts (e.g. [p. 4, pp. 44-46, p. 120]). As an enhancing post-processing step, the sources can be filtered: if the summary no longer contains specific information from one of the source blocks, that block can be removed from the source set. For example, if a page contains a detailed description of a particular engine block and that engine block is no longer mentioned in a higher-level summary of the text, then the page should not be part of the summary's sources. As another post-processing step, we should merge these sources, e.g. if more than 50% of a chapter is a source, we just say "chapter X is a source", or pp. 49-50, pp. 50-51 becomes pp. 49-51.

**Finding the particular part of the sources that is relevant to a given question**

This part of the problem could be solved with an LLM call or maybe some vector search. We do this a lot for our LLM use cases.
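The union-and-merge step above can be sketched as follows. This is a minimal illustration, not an existing API: the `(start, end)` page-tuple representation and the function names `merge_page_ranges` / `union_sources` are assumptions made for the example.

```python
def merge_page_ranges(ranges):
    """Merge overlapping or adjacent (start, end) page ranges,
    e.g. pp. 49-50 and pp. 50-51 become pp. 49-51."""
    merged = []
    for start, end in sorted(ranges):
        # "+ 1" also merges directly adjacent pages (pp. 44-45 + p. 46)
        if merged and start <= merged[-1][1] + 1:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged

def union_sources(parts):
    """Union set approach: collect the source ranges of every summarized
    part, then merge them into one source set for the whole summary."""
    all_ranges = [r for part in parts for r in part]
    return merge_page_ranges(all_ranges)
```

For example, `union_sources([[(49, 50)], [(50, 51)], [(44, 46)]])` yields `[(44, 46), (49, 51)]`. The chapter-level rule ("more than 50% of a chapter is a source") would sit on top of this as a further collapsing pass.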
-
Has anybody come up with an implementation for this?
-
How can we do it for local search?
-
Copied from #930
Here is an outline of the linkage from community reports back to source documents:
Note that there is intermediary summarization that makes it difficult to establish an exact link. For example, an entity's description is summarized from the descriptions extracted from all text units it is found within. We therefore can't pinpoint the exact text unit (and therefore document) that contributed a particular element of the entity's summary, because it is an aggregation. Similarly, the community report itself is summarized from all entity descriptions, so it is difficult to pinpoint the exact entity description that resulted in a particular answer. You may be able to narrow it down with a post-processing cross-check: use a vector store to find the text units that best match a given fragment of the answer.
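The cross-check idea could be sketched like this. To keep it self-contained, a toy bag-of-words similarity stands in for a real embedding model and vector store; `best_text_units` and the text-unit dict shape are hypothetical, not part of any existing API.

```python
import math
from collections import Counter

def embed(text):
    # Toy "embedding": word counts. In practice, use a real
    # embedding model and a vector store instead.
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def best_text_units(answer_fragment, text_units, top_k=1):
    """Rank candidate text units by similarity to a fragment of the
    generated answer, to narrow down its likely source documents."""
    query = embed(answer_fragment)
    ranked = sorted(text_units,
                    key=lambda tu: cosine(query, embed(tu["text"])),
                    reverse=True)
    return ranked[:top_k]
```

Because of the intermediary summarization described above, this only recovers the *most plausible* contributing text units, not a guaranteed provenance chain.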