
Screen Rendering Slows Down Towards the End of Streaming from LLM Server #1388

Open
calycekr opened this issue Aug 6, 2024 · 7 comments
Labels: enhancement (New feature or request), front (This issue is related to the front-end of the app.)

Comments

calycekr commented Aug 6, 2024

The initial part of streaming from the LLM server renders fine, but the on-screen display slows down as the response progresses, particularly towards the end of the stream. By that point the LLM server has already finished sending the data; only the screen display keeps catching up.

// Send the update to the client
controller.enqueue(JSON.stringify(event) + "\n");
// Send 4096 spaces of padding to make sure the browser doesn't hold the response in a blocking buffer
if (event.type === "finalAnswer") {
    controller.enqueue(" ".repeat(4096));
}

I suspect the code section above. Could it be a buffer shortage?

evalstate (Contributor) commented Aug 8, 2024

This has been troubling me too. My quick investigation is leading me here:

const anomalyThresholdMS = 2000;
const anomalyDurationMS = sampledTimesMS
    .map((time, i, times) => time - times[i - 1])
    .slice(1)
    .filter((time) => time > anomalyThresholdMS)
    .reduce((a, b) => a + b, 0);
// eslint-disable-next-line @typescript-eslint/no-non-null-assertion
const totalTimeMSBetweenValues = sampledTimesMS.at(-1)! - sampledTimesMS[0];
const timeMSBetweenValues = totalTimeMSBetweenValues - anomalyDurationMS;
const averageTimeMSBetweenValues = Math.min(
    200,
    timeMSBetweenValues / (sampledTimesMS.length - 1)
);
const timeSinceLastEmitMS = performance.now() - timeOfLastEmitMS;
// Emit after waiting duration or cancel if "next" event is emitted
const gotNext = await Promise.race([
    sleep(Math.max(5, averageTimeMSBetweenValues - timeSinceLastEmitMS)),
    waitForEvent(eventTarget, "next"),
]);

Looks like the buffer on the browser side fills up, and the delay calculations don't work quite right in that situation. Inference with bigger contexts is very bursty, which I think skews the calculation.
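For example, feeding the calculation above a made-up set of bursty arrival timestamps (illustrative numbers only, not measured data) shows how a long pause that stays under the 2000 ms threshold drags the average up to the 200 ms cap:

// Illustrative sample: three quick tokens, a pause just under the anomaly threshold, then more quick tokens.
const sampledTimesMS = [0, 30, 60, 1900, 1930, 1960];
const anomalyThresholdMS = 2000;

const deltas = sampledTimesMS.map((time, i, times) => time - times[i - 1]).slice(1);
const anomalyDurationMS = deltas.filter((time) => time > anomalyThresholdMS).reduce((a, b) => a + b, 0);
const totalTimeMSBetweenValues = sampledTimesMS[sampledTimesMS.length - 1] - sampledTimesMS[0];
const averageTimeMSBetweenValues = Math.min(
    200,
    (totalTimeMSBetweenValues - anomalyDurationMS) / (sampledTimesMS.length - 1)
);

console.log(deltas); // [30, 30, 1840, 30, 30]
console.log(anomalyDurationMS); // 0, because the 1840 ms gap is below the threshold and isn't excluded
console.log(averageTimeMSBetweenValues); // 200, so every remaining buffered token waits ~200 ms

With a few hundred tokens still buffered when generation finishes, a 200 ms per-token delay means the tail of the message keeps "typing" for tens of seconds after the server is already done, which matches what's reported above.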

nsarrazin added the enhancement and front labels on Aug 15, 2024
nsarrazin (Collaborator) commented:

I suspect the code section above. Could it be a buffer shortage?
I don't think that's the issue; IIRC the smoothing only applies to tokens, not raw updates, so that code path shouldn't be the problem.

I think we need to tweak the smoothing function so it gets faster as the current buffer grows. That way, if there are a lot of words waiting to be displayed, they come out faster.
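Roughly something like this, as a sketch only (adaptiveDelayMS and pendingTokenCount are made-up names, not the actual implementation):

// Scale the per-token delay down as the number of buffered, not-yet-rendered tokens grows.
function adaptiveDelayMS(averageTimeMSBetweenValues: number, pendingTokenCount: number): number {
    // Divide the baseline delay by (1 + pendingTokenCount / 20), but never go below 5 ms.
    const speedup = 1 + pendingTokenCount / 20;
    return Math.max(5, averageTimeMSBetweenValues / speedup);
}

// With a 200 ms baseline: 200 ms when nothing is queued, ~18 ms with 200 tokens queued.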


Erquint commented Aug 22, 2024

I implore you to add an option to never throttle token output to the document.
With longer chats it slows down to way beyond a crawl, and for the entire message, not just towards the end.
Whatever the aesthetic idea was behind stalling output after the server endpoint has already streamed all the tokens, I cannot possibly get behind it.
Most dubiously, if I hit the abort button, the entire buffer is dumped into the document at once. But that invites guesswork about when the endpoint has invisibly finished streaming into the local buffer: abort too early and you get a cut-off response.
Just remove the entire abstraction, please.

The way the buffer gets dumped on abort should give an idea of where to look for a possible bug in the code.

P.S.
I noticed that some other platforms, figgs.ai for example, also do this fake aesthetic typing, but as soon as the endpoint stream is over, the entire buffer is flushed. That's something that could help here.
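Something like the following sketch is what I mean (the event shape, the finalAnswer check and the emit callback are assumptions on my part, not this project's actual code): keep the fake typing while the stream is live, then flush everything the moment it finishes.

// Drain the network stream into a local buffer, emit one token per tick while streaming,
// and dump the whole remaining buffer as soon as the stream has ended.
async function renderSmoothed(
    events: AsyncIterable<{ type: string; token?: string }>,
    emit: (text: string) => void,
    delayMS = 30
) {
    const buffer: string[] = [];
    let finished = false;

    const reader = (async () => {
        for await (const event of events) {
            if (event.token) buffer.push(event.token);
            if (event.type === "finalAnswer") finished = true;
        }
        finished = true;
    })();

    while (!finished || buffer.length > 0) {
        if (finished) {
            // Stream is over: flush everything that's still buffered in one go.
            emit(buffer.splice(0).join(""));
            break;
        }
        if (buffer.length > 0) emit(buffer.shift() as string);
        await new Promise((resolve) => setTimeout(resolve, delayMS));
    }
    await reader;
}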


Erquint commented Aug 28, 2024

I notice a change has been made that alters the batching of token output but aggravates the cumbersomeness of JS sleeping. Could the idling be decoupled from global tab sleep? That would be a good compromise to start with.

evalstate (Contributor) commented:

This commit added a PUBLIC_SMOOTH_UPDATES flag:

a59bc5e

Set it to true to enable the existing behaviour.
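For anyone running locally, something like this in your env file should toggle it (assuming the repo's usual .env.local override convention):

# .env.local (assumed filename)
PUBLIC_SMOOTH_UPDATES=true # keeps the existing smoothing behaviour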


Erquint commented Sep 11, 2024

I have to correct myself.
I made a favorable interpretation, foolishly assuming that the browser-side sleep you implemented was blocking the main thread and was being used intentionally for that effect. Having now tested it, I realize it isn't blocking, and the hard hanging of the tabs during the increasingly long bursts (literally a dead tab for over a minute at worst) is instead down to horrid performance.
There is no world in which printing a few words should take a minute, even if you flung every possible framework into the project and Svelte-Vite-Buzzworded all of it.

I've spent days trying to debug and profile, but it's a fool's errand when everything is obfuscated.
Everything seems to point towards pendingMessage.DJWnRfGi.js with its 800 KB (!) of minified and obfuscated code, which you'd expect to be compiled from the pendingMessage.ts found in this repo, but that file is just an empty husk with no meaningful contents or even imports.
So I'm left staring at…

dl=function(L){let le=null,Ee=null;if(za)L="<remove></remove>"+L;else{const Gt=k(L,/^[\r\n\t ]+/);Ee=Gt&&Gt[0]}Fr==="application/xhtml+xml"&&dr===B0&&(L='<html xmlns="http://www.w3.org/1999/xhtml"><head></head><body>'+L+"</body></html>");const vt=Kt?Kt.createHTML(L):L;if(dr===B0)try{le=new Wt().parseFromString(vt,Fr)}catch{}if(!le||!le.documentElement){le=Aa.createDocument(dr,"template",null);try{le.documentElement.innerHTML=Ra?Er:vt}catch{}}const Xt=le.body||le.documentElement;return L&&Ee&&Xt.insertBefore(X.createTextNode(Ee),Xt.childNodes[0]||null),dr===B0?u1.call(le,tr?"html":"body")[0]:tr?le.documentElement:Xt}

👀 Marvellous!
And even when I try searching for nearby inline string constants that survive obfuscation, they're simply not present in this repo.
So file references are useless, function references are useless, and inline constants are useless for cross-referencing against the repo.

Was obfuscation really necessary in an open-source project?

Here's a profile screenshot…
[profile screenshot]
I ran this profile for less time than the message actually took to fake-stream, to keep the profile buffer manageable.
Digging into the call trees is useless due to obfuscation.
[call tree screenshot]
This is modern web development: extremely hostile to contributors by design.

evalstate (Contributor) commented:

I've spent a bit of time instrumenting this; if you run npm run dev (which serves the unminified source) you should be all set?
