Figure out how inference will work #7

Open
0x000011b opened this issue Jan 23, 2023 · 2 comments
Labels
planning Stuff we need to think about

Comments

@0x000011b
Contributor

I don't plan on shelling out money for inference at the moment, so the initial plan is to have users bring their own "inference back-end" with them - likely Colab for now. Some points about this though:

  • Who will be responsible for creating the prompt, sending it off to the inference backend and parsing the resulting generation?
    • The initial plan is to implement that here: the front-end will simply POST user messages to an endpoint and receive responses (maybe via WebSockets? Not sure holding on to a connection for 10+ seconds is a good idea). A rough sketch of this flow is at the end of this comment.
    • Pros:
      • We'll have real-world data on inference requests, which we can use to calculate how much $ it would cost to actually run inference ourselves (I've gotten many users suggesting I open a Patreon to cover hosting expenses - I'm unsure how well that'd pan out, but with real data the decision could be made a little more clearly)
      • We can automatically push new prompting code by just updating the server
    • Cons:
      • Increased server load, since we'll be acting as a proxy for inference requests.
  • How will inference work for group chats? How do we decide which characters should speak and when?
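
To make the first bullet concrete, here's a very rough sketch of what the "server as inference proxy" flow could look like, assuming users bring their own back-end. Flask, the endpoint name, and the response shape of the user's back-end are all placeholders, not decisions:

```python
# Rough sketch of the "server as inference proxy" idea.
# Flask, /chat, and the back-end's {"text": ...} response shape are placeholders.
import requests
from flask import Flask, jsonify, request

app = Flask(__name__)

# Hypothetical URL of the user-supplied inference back-end (e.g. a Colab tunnel).
INFERENCE_BACKEND_URL = "https://example-colab-tunnel.example/generate"


def build_prompt(persona: str, history: list[dict]) -> str:
    """Assemble the character persona and chat history into a single prompt."""
    lines = [persona] + [f"{m['speaker']}: {m['text']}" for m in history]
    lines.append("Bot:")
    return "\n".join(lines)


@app.post("/chat")
def chat():
    body = request.get_json()
    prompt = build_prompt(body["persona"], body["history"])

    # Forward the prompt to the user's back-end and wait for the generation.
    resp = requests.post(INFERENCE_BACKEND_URL, json={"prompt": prompt}, timeout=60)
    resp.raise_for_status()
    generation = resp.json()["text"]

    # Parse the raw generation: keep only the first line (the bot's reply),
    # dropping any hallucinated follow-up turns.
    reply = generation.split("\n")[0].strip()
    return jsonify({"reply": reply})
```
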
@0x000011b 0x000011b added the planning Stuff we need to think about label Jan 23, 2023
@TearGosling

Adding onto the last bullet point, the naive route would be to just let the LLM handle deciding which characters should speak. Of course, this would cause the model to forget who is speaking very quickly. It may be worth looking into datasets with multiple speakers at once, so as to get the model more used to the idea. You could even have some sort of control/sentinel token which dictates whether the LLM should be in "one-on-one" mode or "group chat" mode, which would double as a testing ground for future modules.
If, even after this, the LLM still forgets to bring characters back, you could have a system that keeps track of how many responses have passed since a given character last spoke (counting characters, not users, so that multi-user chats don't break the system), and if that count is above a particular number, "force" the character to speak in the next reply. If multiple bots have the same count, pick randomly from a list.
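
For illustration, that "how long has this character been silent" bookkeeping could be as simple as the sketch below (the threshold and message format are made up, Python just for the example):

```python
import random

# Illustrative sketch of the "force a silent character to speak" heuristic.
# SILENCE_THRESHOLD and the message format are assumptions for the example.
SILENCE_THRESHOLD = 6  # replies since a character last spoke


def pick_forced_speaker(history: list[dict], characters: list[str]) -> str | None:
    """Return a character to force into the next reply, or None if nobody is overdue.

    `history` is the chat log, oldest first; each entry has a "speaker" field.
    Only bot characters are tracked (users are never in `characters`), so
    multi-user chats don't trip the rule.
    """
    silence = {}
    for name in characters:
        # Count how many messages have passed since this character last spoke
        # (the whole history if they never have).
        since = 0
        for msg in reversed(history):
            if msg["speaker"] == name:
                break
            since += 1
        silence[name] = since

    overdue = [name for name in characters if silence[name] >= SILENCE_THRESHOLD]
    if not overdue:
        return None  # let the model decide who speaks next

    # If multiple characters are equally overdue, pick randomly from the list.
    longest = max(silence[name] for name in overdue)
    return random.choice([name for name in overdue if silence[name] == longest])
```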

My biggest concern, though, is not the bot forgetting characters, but it introducing new ones out of nowhere. We may need some sort of system to assist in this regard. Perhaps RLHF would make the model better at keeping track of who is and isn't in the conversation? Unsure.

@conanak99

conanak99 commented Feb 22, 2023

Can I suggest we use the same approach as TavernAI (https://github.com/TavernAI/TavernAI) for inference?

  1. Users need to prepare their own KoboldAI back-end and paste its API URL into a textbox. (They can use Colab or run the back-end locally.)

  2. Paphos will create the prompt, send it off to the Kobold API endpoint provided by the user, and parse the resulting generation (a rough sketch of this call is included after this list).

  3. The front end will simply POST user messages to an endpoint and receive responses. No WebSocket is needed; the front end can just poll every 2-4 seconds to see whether the inference has finished.
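
For reference, step 2 boils down to a single HTTP call against the user-provided Kobold URL. The payload and response fields below follow KoboldAI's /api/v1/generate route as far as I know - worth double-checking against the Kobold API docs:

```python
import requests

# Rough sketch of step 2: calling a user-supplied KoboldAI back-end.
# Field names follow KoboldAI's /api/v1/generate route as I understand it;
# treat them as assumptions to verify against the Kobold API docs.

def generate_via_kobold(kobold_url: str, prompt: str, max_length: int = 120) -> str:
    """Send the assembled prompt to the user's Kobold back-end and return the raw text."""
    resp = requests.post(
        f"{kobold_url.rstrip('/')}/api/v1/generate",
        json={"prompt": prompt, "max_length": max_length},
        timeout=120,
    )
    resp.raise_for_status()
    # Kobold returns a list of generations; take the first one.
    return resp.json()["results"][0]["text"]
```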
