
Text-generation-inference (TPU) container fixes #65

Closed
Michellehbn opened this issue Jun 27, 2024 · 4 comments

@Michellehbn
Member

As part of the support of TPU in Inference Endpoints and for a better user experience:

  • Resolve hanging on the server side
  • Fix the very small generation length

cc @tengomucho @mfuntowicz

@tengomucho tengomucho self-assigned this Jun 27, 2024
@tengomucho
Collaborator

tengomucho commented Jun 28, 2024

This can be separated into several smaller tasks. I'll list them here to track progress.

  • if "health" is called before any prefill call, it hangs.
  • "warmup" fails. This is because it tries to do prefill with a very long sequence (ignoring truncate info in the request), and apparently that ends up in a RESOURCE_EXHAUSTED error.
  • I believe "warmup" isn't really doing anything to make prefill/decode smoother afterwards in TPU, and it's just taking time compiling and making the inference, filling TPU memory uselessly. We might consider better handling warmup calls.
  • sequence length and parameters are low. We should investigate if we could increase this by bucketing or what is causing this issue.
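For the bucketing idea in the last point, here is a minimal sketch of what length bucketing typically looks like: each prefill sequence is padded up to the nearest bucket size, so the XLA compiler only ever sees a small set of distinct shapes instead of recompiling for every input length. The bucket sizes and the padding token below are illustrative assumptions, not the values used in the actual branch.

```python
# Hypothetical length-bucketing sketch (bucket sizes and pad token are
# assumptions for illustration, not the repository's actual values).

BUCKETS = [8, 16, 32, 64, 128, 256, 512]
PAD_TOKEN_ID = 0  # assumed padding token id

def bucket_length(seq_len: int) -> int:
    """Return the smallest bucket that fits seq_len."""
    for bucket in BUCKETS:
        if seq_len <= bucket:
            return bucket
    # Sequences longer than the largest bucket would be truncated upstream.
    return BUCKETS[-1]

def pad_to_bucket(token_ids: list[int]) -> list[int]:
    """Right-pad the sequence with PAD_TOKEN_ID up to its bucket size."""
    target = bucket_length(len(token_ids))
    return token_ids + [PAD_TOKEN_ID] * (target - len(token_ids))
```

The point of rounding up rather than padding to one fixed maximum is that short requests stay cheap while the number of compiled graphs stays bounded by the number of buckets.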

I have now fixed the health issue. The problem was an incorrect CachedBatch serialization. Progress is in branch debug-tgi-ie.

@tengomucho
Collaborator

Daily update: warmup now works on the branch, and truncate works too. I am currently working on increasing the input length, trying to do that by bucketing prefilled inputs.

@tengomucho
Collaborator

I have almost fixed everything: truncate works as it should, and bucketing and warmup are in place. But I also introduced a bug, because I padded incorrectly when bucketing prefills. I will fix that tomorrow.
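The padding bug mentioned above is a classic pitfall when bucketing prefills: the attention mask must be padded consistently with the input ids, otherwise the model attends to pad positions and the output degrades. A minimal sketch of the correct pairing, with the function name and the left-padding choice being assumptions for illustration only:

```python
# Hypothetical sketch: pad a prefill up to a bucket size together with a
# matching attention mask. Getting the mask (or the pad side) wrong is the
# kind of bug described above. Names here are illustrative assumptions.

PAD_TOKEN_ID = 0  # assumed padding token id

def pad_prefill(token_ids: list[int], bucket: int) -> tuple[list[int], list[int]]:
    """Left-pad ids to `bucket` length and return (ids, attention_mask)."""
    pad = bucket - len(token_ids)
    ids = [PAD_TOKEN_ID] * pad + token_ids
    mask = [0] * pad + [1] * len(token_ids)  # 0 = ignore pad, 1 = real token
    return ids, mask
```

Whether padding goes on the left or the right depends on the model's position handling, but ids and mask must always be padded on the same side and to the same length.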

@tengomucho
Collaborator

Fixes are in #67, #68, and #69. Closing this!
