
Internet search #10

Open
lyie28 opened this issue Sep 20, 2023 · 2 comments
Labels: BlindChat, enhancement (New feature or request)

Comments

lyie28 (Collaborator) commented Sep 20, 2023

No description provided.

lyie28 added the enhancement (New feature or request) label on Sep 20, 2023
clauverjat commented Oct 24, 2023

Integrating search into the chat while preserving user privacy is no small task. I've broken down my thoughts about the relevant challenges and how we could implement the feature.

But first, a quick recap of how search is typically integrated into an AI chat (a sketch of this flow follows the list below):

  • The user message is first analyzed by the LLM to produce a search query.
  • This query is then sent to a (regular) search engine (ChatGPT uses Bing, Bard uses Google Search).
  • The top links returned by the search engine are fetched (meaning the system must download/browse these web pages).
  • The contents of these web pages, along with the initial user message, are fed into the LLM to produce a response.
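A minimal sketch of that four-step loop, assuming hypothetical helpers (`llm_generate`, `web_search`, `fetch_page_text`) that stand in for the model call, the search-engine API, and the page fetcher; none of these names come from BlindChat's codebase:

```python
# Hypothetical stubs: each stands in for a real component of the pipeline.
def llm_generate(prompt: str) -> str:
    raise NotImplementedError("call the local or enclave-hosted LLM here")

def web_search(query: str) -> list[str]:
    raise NotImplementedError("call a search engine API here")

def fetch_page_text(url: str) -> str:
    raise NotImplementedError("download and extract the page text here")

def answer_with_search(user_message: str, top_k: int = 3) -> str:
    # 1. The user message is analyzed by the LLM to produce a search query.
    query = llm_generate(f"Write a web search query for: {user_message}")
    # 2. The query is sent to a (regular) search engine.
    urls = web_search(query)[:top_k]
    # 3. The top links returned by the search engine are fetched.
    pages = [fetch_page_text(u) for u in urls]
    # 4. Page contents plus the original message go back into the LLM.
    context = "\n\n".join(pages)
    return llm_generate(
        f"Answer using these sources:\n{context}\n\nQuestion: {user_message}"
    )
```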

Now, let's evaluate the privacy challenges with this approach:
Dependence on a search engine: Since user data (the search query derived from the message) is sent to the search engine, users need to trust the search engine, and not only the local model or the enclave. The search feature should therefore ship with a prominent warning so that users understand the privacy implications of enabling it. When they do choose to use search, we should consider privacy-friendly engines like DuckDuckGo in order to maximize privacy.
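For illustration, a call to DuckDuckGo's public Instant Answer API could look like the sketch below. Note that this endpoint returns instant answers and abstracts rather than a full list of web results, so a real integration would likely need a different search API; this only shows the shape of a privacy-friendlier query:

```python
import requests

# format=json asks for a JSON payload; no_html strips HTML from the answer.
resp = requests.get(
    "https://api.duckduckgo.com/",
    params={"q": "confidential computing", "format": "json", "no_html": 1},
    timeout=10,
)
data = resp.json()
print(data.get("AbstractText") or "no instant answer for this query")
```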

Fetching web page contents: The challenge here is twofold. First, fetching the content of the returned pages exposes us to the sites' various trackers and cookies, and sadly there is little we can do about trackers embedded in the pages themselves. Second, there is metadata leakage: anyone inspecting the network traffic of the service fetching the content can see which websites are queried. Here we basically have two options: do the fetching locally, or in an enclave (a sketch of such a fetcher follows).
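One possible shape for the fetching step, filling in the `fetch_page_text` stub from the earlier sketch and assuming `requests` and `beautifulsoup4` are installed. A plain HTTP client never executes the page's JavaScript, so script-based trackers do not run, but the site still sees the request itself and the network metadata still leaks:

```python
import requests
from bs4 import BeautifulSoup

def fetch_page_text(url: str, max_chars: int = 4000) -> str:
    # Plain HTTP fetch: the page's JavaScript is never executed.
    resp = requests.get(url, timeout=10)
    resp.raise_for_status()
    soup = BeautifulSoup(resp.text, "html.parser")
    # Drop script and style tags before extracting the visible text.
    for tag in soup(["script", "style"]):
        tag.decompose()
    return soup.get_text(separator=" ", strip=True)[:max_chars]
```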

Option 1: Local implementation of the search?
I believe it's not technically feasible to do the search locally (in-browser), because the browser restricts what a web page from one domain can fetch (CORS / Same-Origin Policy). If we still wanted to do it locally, we would need to ship either a browser extension or a desktop app, both of which present challenges from an adoption perspective. So I don't think we should go down that road (at least for now).
In any case, if the search were done locally, the loaded web pages would be subject to the same metadata-leakage issues as the rest of the user's browsing activity. Notably, a network administrator could infer which websites/domains were searched or accessed from the destination IPs in the network packets. Still, this concern applies to the user's entire browsing history, so users worried about this risk will need a VPN (or Tor/I2P if a VPN isn't enough).

Option 2: Use an enclave
As with the models, we could call the search engine and load pages from within an enclave.
This option does not have the feasibility problem of the local implementation; we could implement it and integrate it into our existing web app.
However, as with local search, network-related metadata, which can often be revealing, remains exposed. It might actually be more problematic here, since a malicious administrator could spy on the enclave's network interface and analyze all its traffic. If there are many users, the concept of an "anonymity set" offers some relief: a particular website can't be linked to a particular user, since multiple users share the enclave at the same time. But I think we should go further. To address the issue, and to provide greater privacy even against us (the service operator), we could route the enclave's traffic through a trustworthy VPN like Mullvad. They are known to take privacy seriously (Mozilla partnered with them for their VPN) and, like us, they use remote attestation to attest to their servers' software stack. That way, we really couldn't tell which websites are queried even from the traffic, which is very nice!
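To make the idea concrete, here is a hedged sketch of routing the enclave's outbound fetches through a VPN-provided SOCKS5 proxy. The proxy address is a placeholder (Mullvad's actual configuration differs), and this assumes `requests[socks]` (PySocks) is installed:

```python
import requests

# Placeholder address; a real deployment would use the VPN provider's
# documented in-tunnel proxy. socks5h:// resolves DNS on the proxy side,
# so destination hostnames don't leak through local DNS queries.
PROXIES = {
    "http": "socks5h://proxy.example.internal:1080",
    "https": "socks5h://proxy.example.internal:1080",
}

resp = requests.get("https://example.com", proxies=PROXIES, timeout=10)
print(resp.status_code)
```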

dhuynh95 (Contributor) commented

Interesting insights @clauverjat

My feedback:

  • Generating the search query from the user message need not be done remotely; I guess you could do some local processing or use a small LLM for keyword extraction, but that's a minor detail (see the sketch after this list).
  • I guess we could ask users what level of privacy they want, and explain the implications of using external search, i.e. what is revealed to whom.
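A minimal sketch of that first point, using a naive stopword filter purely as a stand-in for local processing (a small local model could replace it); the stopword list and helper are illustrative, not BlindChat code:

```python
# Tiny illustrative stopword list; a real one would be much larger.
STOPWORDS = {
    "the", "a", "an", "is", "are", "was", "what", "how", "why", "of",
    "to", "in", "on", "for", "and", "or", "do", "does", "can", "i",
}

def extract_query(message: str, max_terms: int = 6) -> str:
    # Lowercase, strip punctuation, and keep the first few content words.
    words = [w.strip(".,?!").lower() for w in message.split()]
    keywords = [w for w in words if w and w not in STOPWORDS]
    return " ".join(keywords[:max_terms])

print(extract_query("What are the privacy risks of web search in a chat?"))
# -> "privacy risks web search chat"
```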

My assumption is that, given that people already use web search a lot, for both professional and personal purposes, and that good privacy solutions exist (VPNs, as you mentioned, or DuckDuckGo), the question is more: can the use of our service with web search expose data to us, and in some way to the other services we rely on?

If we assume we only do synthesis, i.e. the user themselves retrieves the content that the LLM in the enclave then synthesizes, there is no more exposure to us than usual.

I think the easiest way to move forward is to ask our community what they want for search, and what they want to protect from whom. Good mapping of the problem, though :)
