Web search needs to return more than a snippet. #877

Ben-Pattinson · 2023-05-09T09:55:30Z

At present, both the Bing and Google connectors return only the snippet for each result. This is a problem because, in my experience, the snippet rarely contains anything of substance. If you want an AI to gain knowledge from a web search, this doesn't work.
I appreciate that downloading and parsing pages is more expensive, but that approach is far more likely to yield good content, and hence better overall performance.

For reference, the existing code is here: https://github.com/microsoft/semantic-kernel/blob/main/dotnet/src/Skills/Skills.Web/Bing/BingConnector.cs

and looks thus (with some unimportant lines omitted for brevity):

```csharp
public async Task<IEnumerable<string>> SearchAsync(string query, int count = 1, int offset = 0, CancellationToken cancellationToken = default)
{
    Uri uri = new($"https://api.bing.microsoft.com/v7.0/search?q={Uri.EscapeDataString(query)}&count={count}&offset={offset}");

    HttpResponseMessage response = await this._httpClient.GetAsync(uri, cancellationToken).ConfigureAwait(false);
    response.EnsureSuccessStatusCode();

    string json = await response.Content.ReadAsStringAsync().ConfigureAwait(false);

    BingSearchResponse? data = JsonSerializer.Deserialize<BingSearchResponse>(json);
    WebPage[]? results = data?.WebPages?.Value;

    // Only the snippet of each result is returned, never the page content.
    return results == null ? Enumerable.Empty<string>() : results.Select(x => x.Snippet);
}
```

My approach is instead to:

1. Fetch the full page content.
2. Run it through an HTML parser, extracting just the text.
3. Remove blank lines and collapse whitespace.
4. Split the remaining content into chunks whose token count is small enough to pass to text completion.
5. Pass each chunk to text completion, asking it to clean up the content so that only what is relevant to the search subject remains.
6. Concatenate the cleaned-up chunks.
7. Return the result.

Initial tests show this to be working well.
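To make the approach concrete, here is a minimal sketch of the whitespace clean-up, chunking, per-chunk clean-up and concatenation steps. It is not the code I tested, just an illustration under some assumptions: the page has already been downloaded and reduced to plain text by an HTML parser (not shown), token counts are estimated at roughly 4 characters per token rather than measured with a real tokenizer, and `PageTextPipeline`, `CleanAsync` and the `completeAsync` delegate are hypothetical names standing in for whatever text completion call is used.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Text.RegularExpressions;
using System.Threading;
using System.Threading.Tasks;

public static class PageTextPipeline
{
    // Collapse runs of whitespace inside each line and drop blank lines.
    public static string NormalizeWhitespace(string text)
    {
        IEnumerable<string> lines = text
            .Split('\n')
            .Select(line => Regex.Replace(line, @"\s+", " ").Trim())
            .Where(line => line.Length > 0);
        return string.Join("\n", lines);
    }

    // Pack lines into chunks of at most maxTokens, using a rough estimate of
    // 4 characters per token (an assumption, not a real tokenizer).
    public static List<string> SplitIntoChunks(string text, int maxTokens)
    {
        if (maxTokens <= 0) throw new ArgumentOutOfRangeException(nameof(maxTokens));

        int maxChars = maxTokens * 4;
        var chunks = new List<string>();
        var current = new StringBuilder();

        foreach (string line in text.Split('\n'))
        {
            // A line longer than the budget is hard-split (possibly mid-word).
            for (int start = 0; start < line.Length; start += maxChars)
            {
                string piece = line.Substring(start, Math.Min(maxChars, line.Length - start));
                if (current.Length > 0 && current.Length + 1 + piece.Length > maxChars)
                {
                    chunks.Add(current.ToString());
                    current.Clear();
                }
                if (current.Length > 0) current.Append('\n');
                current.Append(piece);
            }
        }

        if (current.Length > 0) chunks.Add(current.ToString());
        return chunks;
    }

    // Send each chunk to a caller-supplied completion function (hypothetical
    // delegate) and join the cleaned-up results in their original order.
    public static async Task<string> CleanAsync(
        string pageText,
        string query,
        int maxTokens,
        Func<string, CancellationToken, Task<string>> completeAsync,
        CancellationToken cancellationToken = default)
    {
        var cleaned = new List<string>();
        foreach (string chunk in SplitIntoChunks(NormalizeWhitespace(pageText), maxTokens))
        {
            string prompt = $"Keep only the text relevant to \"{query}\" and remove everything else:\n\n{chunk}";
            cleaned.Add(await completeAsync(prompt, cancellationToken).ConfigureAwait(false));
        }
        return string.Join("\n", cleaned);
    }
}
```

The chunks are processed one after another to keep the sketch simple. Since each chunk is independent, the completion calls could also run concurrently, at the cost of more simultaneous requests.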

So, does the team feel this modification belongs in the connector? Should it be an alternative mode or a different method on the connector?
I honestly don't see the value of returning only the snippets, so possibly I'm misunderstanding the use case.

Thanks

IKDH · 2023-07-17T17:37:29Z

I am thinking the same as you, but somehow Bing AI chat gets very good results using only the snippets when you ask it a question. I wonder how they set up their AI to answer so well using only the snippet data. Also, their snippets are more complete than the snippets returned by the Bing search API.

When I use the method you are proposing, the AI (ChatGPT-4) often answers using only the data I give it, and not what it already knows about the subject. I am trying to solve this issue.

For example, if you ask Bing AI to write an article about the best places to visit in England, it will answer using the search results, but it will also add a lot from its own knowledge base. I think Microsoft has done a very good job of integrating search data into their AI, probably by using some optimized prompts.
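One thing to try is a prompt that presents the search data as supporting material and explicitly allows the model to use its own knowledge as well. The sketch below is only a guess at that idea, not how Bing AI actually works; `SearchPromptBuilder`, `Build` and the instruction wording are all hypothetical.

```csharp
using System.Collections.Generic;
using System.Text;

public static class SearchPromptBuilder
{
    // Hypothetical helper: builds a prompt that offers the search excerpts as
    // supporting material and tells the model it may also use what it knows.
    public static string Build(string question, IEnumerable<string> excerpts)
    {
        var sb = new StringBuilder();
        sb.AppendLine("Answer the question below.");
        sb.AppendLine("Use the web excerpts as supporting, up-to-date material, and also draw on your own general knowledge of the subject.");
        sb.AppendLine();

        // Number the excerpts so the model can tell them apart.
        int index = 1;
        foreach (string excerpt in excerpts)
        {
            sb.AppendLine($"[Excerpt {index}] {excerpt}");
            index++;
        }

        sb.AppendLine();
        sb.Append($"Question: {question}");
        return sb.ToString();
    }
}
```

Whether this wording actually changes the model's behaviour would need testing against the kind of questions described above.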

Mano1192 · 2023-08-03T23:00:02Z

I believe @craigomatic is already working on another repo to solve this, using the Playwright API to return the text of the body tag to the LLM for each URL. I ran into exactly the same issue as you and found that the Google API, which returns only snippets, does not provide enough data on a topic for an LLM to formulate a valid response. I plan to look through what Craig has already built here: https://github.com/craigomatic/webscraper-aiplugin/ and expand on it. Hope that helps!
