implement streaming for assistants #215

Merged
merged 22 commits into main on Jan 17, 2024
Conversation


@pmeier pmeier commented Nov 20, 2023

Addresses the Python API part of #185.


pmeier commented Dec 13, 2023

The current state implements the idea in #185 (comment). The UX is now as follows:

main

import asyncio
from ragna import Rag, assistants, source_storages

async def main():
    async with Rag().chat(
        documents=["ragna.txt"],
        source_storage=source_storages.Chroma,
        assistant=assistants.ClaudeInstant,
    ) as chat:
        message = await chat.answer("What is Ragna in ten words or less?")
        print(message)

asyncio.run(main())
 Ragna is an open source framework for building RAG-based AI applications.

PR

import asyncio
from ragna import Rag, assistants, source_storages

async def main():
    async with Rag().chat(
        documents=["ragna.txt"],
        source_storage=source_storages.Chroma,
        assistant=assistants.ClaudeInstant,
    ) as chat:
        message1 = await chat.answer("What is Ragna in ten words or less?")
        print(message1)
        print()
        print(await message1.content())
        print()
        print(message1)

        print("#" * 20)

        message2 = await chat.answer("What is Ragna in ten words or less?")
        print(message2)
        print()
        async for chunk in message2:
            print(chunk, end="")
        print()
        print()
        print(message2)

asyncio.run(main())
...

 Ragna is an open source framework for building RAG-based AI applications.

 Ragna is an open source framework for building RAG-based AI applications.
####################
...

 Ragna is an open source framework for building RAG-based AI applications.

 Ragna is an open source framework for building RAG-based AI applications.

So the trade-off for the ability to stream the answer is that the user needs to await twice: once to get the message object from the chat, and a second time to get its content (either await message.content() or async for chunk in message). Printing the message when it hasn't been awaited yet currently just shows ..., but that could also be a source of confusion.

I'm not 100% happy with the result. One option might be to put a stream: bool = False parameter on Chat.answer() and, unless the user explicitly sets it to True, exhaust the stream inside the function, thus removing the requirement to await again.
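A minimal sketch of that idea; the Message internals and the _answer_stream helper are hypothetical stand-ins, not Ragna's actual implementation:

from typing import AsyncIterator

class Message:
    def __init__(self, chunks: AsyncIterator[str]) -> None:
        self._chunks = chunks
        self._content = None

    async def read(self) -> str:
        # Exhaust the chunk stream once and cache the full content.
        if self._content is None:
            self._content = "".join([chunk async for chunk in self._chunks])
        return self._content

class Chat:
    async def answer(self, prompt: str, stream: bool = False) -> Message:
        message = Message(self._answer_stream(prompt))
        if not stream:
            # Exhaust the stream eagerly so the caller only has to await once.
            await message.read()
        return message

    async def _answer_stream(self, prompt: str) -> AsyncIterator[str]:
        # Stand-in for the actual assistant call.
        for chunk in ["Ragna ", "is ", "an ", "open ", "source ", "framework."]:
            yield chunk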

Thoughts @nenb?


pmeier commented Dec 13, 2023

One thing that also bugs me is that we can no longer access the message.content attribute from the outside. After the message is awaited, we can only get to the content by "awaiting Message.content() again" (which just returns the stored value) or via Message.__str__. I'll have another look at how the response classes of HTTP clients handle this.
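For reference, httpx already follows a similar pattern: accessing the body of a streamed response before it has been read raises httpx.ResponseNotRead. A minimal sketch:

import asyncio
import httpx

async def main():
    async with httpx.AsyncClient() as client:
        async with client.stream("GET", "https://example.com") as response:
            # Accessing response.text here would raise httpx.ResponseNotRead.
            await response.aread()  # consume the stream
            print(response.text)    # now a plain attribute access

asyncio.run(main())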


pmeier commented Dec 13, 2023

Ok, now I'm actually happy with it 🚀

import asyncio
from ragna import Rag, assistants, source_storages

async def main():
    async with Rag().chat(
        documents=["ragna.txt"],
        source_storage=source_storages.Chroma,
        assistant=assistants.ClaudeInstant,
    ) as chat:
        print("Without stream=True")
        message = await chat.answer("What is Ragna in ten words or less?")
        print(message)

        print("#" * 20)

        print("With stream=True and iterating")
        message = await chat.answer("What is Ragna in ten words or less?", stream=True)

        try:
            print(message)
        except RuntimeError as error:
            print(f"{type(error).__name__}: {error}")

        async for chunk in message:
            print(chunk, end="")
        print()
        print(message)
        print(message.content)

        print("#" * 20)

        print("With stream=True but no iterating")

        message = await chat.answer("What is Ragna in ten words or less?", stream=True)
        print(await message.read())
        print(message)
        print(message.content)

asyncio.run(main())

Compared to #215 (comment), by default users can simply print the message and get its content without awaiting a second time. With stream=True, users now get control over how they want to read the message: either in chunks, i.e. by iterating over it, or by simply .read()ing everything in one go.

Plus, #215 (comment) is resolved as well. We now have the Message.content property again. Accessing it when the content stream has not been consumed yet will raise an error. The same goes for Message.__str__ and Message.__repr__, which internally rely on Message.content.
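A condensed sketch of that pattern; this is a simplified illustration, not the actual implementation:

from typing import AsyncIterator

class Message:
    def __init__(self, content_stream: AsyncIterator[str]) -> None:
        self._content_stream = content_stream
        self._content = None

    async def __aiter__(self):
        if self._content is None:
            chunks = []
            async for chunk in self._content_stream:
                chunks.append(chunk)
                yield chunk
            self._content = "".join(chunks)
        else:
            # Stream already consumed; yield the cached content in one go.
            yield self._content

    async def read(self) -> str:
        # Consume the whole stream in one go.
        async for _ in self:
            pass
        return self._content

    @property
    def content(self) -> str:
        if self._content is None:
            raise RuntimeError(
                "Message content is not available yet. "
                "Use `await message.read()` or `async for chunk in message`."
            )
        return self._content

    def __str__(self) -> str:
        return self.content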


nenb commented Dec 17, 2023

@pmeier Nice work. The code snippet looks quite sensible and intuitive to me.

Let me know if you need me to look over the PR in more depth (but a quick skim seemed fine to me!)


pmeier commented Dec 18, 2023

@peachkeel I could use some input from you on this. You rightfully advocated in #242 for trimming the output of the answer REST API endpoint to avoid sending unnecessary data. I'm currently in the process of deciding what the "message chunk" object should look like when one streams from the answer endpoint. As of the latest commit we are using

class Message(BaseModel):
    id: uuid.UUID = Field(default_factory=uuid.uuid4)
    content: str
    role: ragna.core.MessageRole
    sources: list[Source] = Field(default_factory=list)
    timestamp: datetime.datetime = Field(
        default_factory=lambda: datetime.datetime.utcnow()
    )

in which the content attribute is just the message chunk. That means we transmit the full source objects with every chunk. This is already wasteful, and it will be greatly amplified by #248, which we agreed to implement. So we need to come up with a better scheme.

One option could be to transmit the chunk on its own for most of the stream and only transmit the remaining message data together with the first or last chunk. If we go for this, I would prefer sending the message data with the first chunk (see the sketch below), for two reasons:

  1. All the other information is available when we get the first chunk. Thus, there is no reason not to send it to the client right away.
  2. This would match the Python API, where we have direct access to the .sources etc. attributes as soon as the object is created.
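To make that concrete, the chunk models could be split along these lines; this is illustrative only, with plain stand-ins for Ragna's own types:

import datetime
import uuid

from pydantic import BaseModel, Field

class MessageChunk(BaseModel):
    # Sent for every chunk after the first: only the new content.
    content: str

class FirstMessageChunk(MessageChunk):
    # Sent once, at the start of the stream, with the full message metadata.
    id: uuid.UUID = Field(default_factory=uuid.uuid4)
    role: str  # stand-in for ragna.core.MessageRole
    sources: list[dict] = Field(default_factory=list)  # stand-in for list[Source]
    timestamp: datetime.datetime = Field(
        default_factory=lambda: datetime.datetime.utcnow()
    )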

Happy for other opinions on this or completely different suggestions on how to handle streaming for the REST API.

@peachkeel (Contributor)

Your preference for how to implement this feature is eminently reasonable. The only reason not to send sources with the first chunk might be the added latency. If that's a factor for people, maybe having an option like /chats/{id}/answer?sources=none or /chats/{id}/answer?sources=last would be the best way to go so all bases are covered.


pmeier commented Dec 18, 2023

If that's a factor for people, maybe having an option like /chats/{id}/answer?sources=none or /chats/{id}/answer?sources=last would be the best way to go so all bases are covered.

Let's wait for demand on this. Thanks for the input.

@pmeier pmeier mentioned this pull request Dec 21, 2023

pmeier commented Jan 4, 2024

With #261 finally merged, adding streaming support for the UI was a breeze. Or so I thought. It kinda works, but we have a horrible "blinking" due to the rapid updates:

Screencast.from.2024-01-05.00-13-32.webm

@ahuang11 any idea where this is coming from or how to fix it?

@pmeier pmeier marked this pull request as ready for review January 4, 2024 23:22
@pierrotsmnrd (Contributor)

I have had a look and my understanding is that the flickering is due to the repeated calls to ChatMessage._update_object_pane when updating message.object here.

I've tried various things to fix this but haven't found a proper solution yet.


ahuang11 commented Jan 9, 2024

My first thought is that it looks like the CSS is being reset and then reloaded every time, so it jumps around. Can you check the browser console to see if that's true?

Also, this is how we stream internally: https://github.com/holoviz/panel/blob/main/panel/chat/feed.py#L379-L387


pmeier commented Jan 9, 2024

@ahuang11

Can you check the browser console to see if that's true?

How would I do that? What are you expecting there to be in case you are right?


ahuang11 commented Jan 9, 2024

On Chrome, if the stylesheets are updated, it should show flashing / highlighting, I think?

(flashing like this, but for the stylesheets):

Screen.Recording.2024-01-09.at.1.04.37.PM.mov

I am expecting the styles to be cached, but if not, it could be a bug in Panel.


ahuang11 commented Jan 9, 2024

Also, I find it strange that your Copy / Source Info buttons disappear and re-appear. Are they implemented in ReactiveHTML?


pmeier commented Jan 10, 2024

@ahuang11

If on Chrome, if the stylesheets are updated, it should show flashing / highlighting I think?

The previous screengrab was from Firefox. The one below is from Chrome. I see some flashing style attributes. More importantly, on Chrome the message is not rendered at all until the very end:

Screencast.from.2024-01-10.21-21-47.webm

Also, I find it strange that your Copy / Source Info buttons disappear and re-appear. Are they implemented in ReactiveHTML?

They are:

class CopyToClipboardButton(ReactiveHTML):
    title = param.String(default=None, doc="The title of the button")
    value = param.String(default=None, doc="The text to copy to the clipboard.")

    _template = """
    <div type="button"
         id="copy-button"
         onclick="${script('copy_to_clipboard')}"
         class="container"
         style="cursor: pointer;"
    >
        <svg xmlns="http://www.w3.org/2000/svg" class="icon icon-tabler icon-tabler-clipboard" width="16" height="16"
             viewBox="0 0 24 24" stroke-width="2" stroke="gray" fill="none" stroke-linecap="round" stroke-linejoin="round">
            <path stroke="none" d="M0 0h24v24H0z" fill="none"/>
            <path d="M9 5h-2a2 2 0 0 0 -2 2v12a2 2 0 0 0 2 2h10a2 2 0 0 0 2 -2v-12a2 2 0 0 0 -2 -2h-2" />
            <path d="M9 3m0 2a2 2 0 0 1 2 -2h2a2 2 0 0 1 2 2v0a2 2 0 0 1 -2 2h-2a2 2 0 0 1 -2 -2z" />
        </svg>
        <span>${title}</span>
    </div>
    """

    _scripts = {
        "copy_to_clipboard": """navigator.clipboard.writeText(`${data.value}`);"""
    }

    _stylesheets = [
        """
        div.container {
            display: flex;
            flex-direction: row;
            margin: 7px 10px;
            color: gray;
        }
        svg {
            margin-right: 5px;
        }
        """
    ]

and

def _copy_and_source_view_buttons(self) -> pn.Row:
    return pn.Row(
        CopyToClipboardButton(
            value=self.object,
            title="Copy",
            stylesheets=[
                ui.CHAT_INTERFACE_CUSTOM_BUTTON,
            ],
        ),
        pn.widgets.Button(
            name="Source Info",
            icon="info-circle",
            stylesheets=[
                ui.CHAT_INTERFACE_CUSTOM_BUTTON,
            ],
            on_click=lambda event: self.on_click_source_info_callback(
                event, self.sources
            ),
        ),
    )

We didn't find a better way to move them below our message than by replacing the original object pane with a column that includes the message and the buttons:

def _update_object_pane(self, event=None):
    super()._update_object_pane(event)
    if self.sources:
        assert self.role == "assistant"
        css_class = "message-content-assistant-with-buttons"
        self._object_panel = self._center_row[0] = pn.Column(
            self._object_panel,
            self._copy_and_source_view_buttons(),
            css_classes=["message", css_class],
            stylesheets=ui.stylesheets(
                (f":host(.{css_class})", self._content_style_declarations)
            ),
        )


ahuang11 commented Jan 12, 2024

Thanks! Can you expand that flashing style attribute?

If I understand Ragna's codebase correctly, rather than calling _update_object_pane repeatedly through message.object += chunk["content"], I think you can store a _contents_pane internally:

_contents_pane = pn.pane.Markdown()
message.object = pn.Column(_contents_pane, ...)  # include the other two things you have

Then, call:

message._contents_pane.object += ...

That way you're only updating the contents pane instead of re-rendering the entire column each time.

For reference, here's how we stream:
https://github.com/holoviz/panel/blob/main/panel/chat/feed.py#L428-L452
https://github.com/holoviz/panel/blob/main/panel/chat/feed.py#L370-L411


pmeier commented Jan 16, 2024

I've refactored the message class a little after an offline chat with @ahuang11. Previously, we created the message with a str as object and had a renderer set that turned it into a pn.pane.Markdown down the line. Now, we do this inside the constructor and pass the result up to the base class constructor. This has two benefits:

  1. We don't need to overwrite private functionality of the base class to get our buttons in. In case we have a message from an assistant, we just wrap the markdown into a pn.Column right away.
  2. We can create a self.content_pane object that can be updated while streaming the message without re-rendering anything else (see the sketch below).
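Roughly, the pattern looks like this; a sketch under assumptions about Panel's ChatMessage API, not the exact code:

import panel as pn

class RagnaChatMessage(pn.chat.ChatMessage):
    def __init__(self, content: str, *, role: str, **params):
        # The pane we mutate while streaming; nothing else is re-rendered.
        self.content_pane = pn.pane.Markdown(content)

        if role == "assistant":
            # Wrap the buttons in right away instead of patching the layout
            # in _update_object_pane later.
            obj = pn.Column(self.content_pane)  # plus the button row
        else:
            obj = self.content_pane

        super().__init__(object=obj, **params)

# While streaming, only the markdown pane is updated:
#     message.content_pane.object += chunk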

This is the result:

Screencast.from.2024-01-16.23-56-44.webm

@pierrotsmnrd pierrotsmnrd left a comment


Looks good to me. Tested locally and it works.

@pmeier pmeier merged commit c08a223 into main Jan 17, 2024
10 checks passed
@pmeier pmeier deleted the streaming branch January 17, 2024 14:06
@pmeier pmeier mentioned this pull request Mar 19, 2024