-
Notifications
You must be signed in to change notification settings - Fork 232
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fast PA teardown #674
base: main
Are you sure you want to change the base?
Fast PA teardown #674
Conversation
1c55334
to
3604a7d
Compare
{ | ||
// Make sure no requests carry over to the next test | ||
while (stats_->num_active_infer_calls) { | ||
std::this_thread::sleep_for(std::chrono::milliseconds(1)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you want a configurable timeout? Otherwise, it's going to be hard to know why the test times out for someone seeing it fail for that reason.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Submitting part-way through the minimize emails to you but still give you something to respond to because I couldn't finish the review in one sitting.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, it was 100% after the final results were printed. Old behavior is that it just sat there
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In the future, can you suggest an order of files to review along with some line-specific PR comments explaining at a high level what the changes are for?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One high-level thing that crossed my mind is that: do we want to just drop requests like this PR implements?
Maybe we can instead add a print statement explaining that profiling is complete, but that PA is waiting for unfinished requests and that user can ^C
early if they don't mind that the server is still potentially processing said requests.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Longer term what we need is a proper way to cancel requests. Short term, this is holding up some GenAI-Perf scripting, where they are running with different request-rates in a bash script. ^C isn't going to help them.
7f3bbfa
to
bee15ad
Compare
Adds the ability for PA to exit quicker.
Old behavior was that PA would always wait for all requests to finish before exiting. For the case of async request-rate combined with a slow model, this could add up to many minutes of waiting.
New behavior is that for non-shared-memory cases, PA will exit immediately and drop remaining requests on the floor.
For the case of PA sweeping through multiple values (request-rate 10:20:10 for example), it WILL still wait for all requests from the first experiment to finish before going to the next step.
Something messy is that the LoadManager now needs to remember if it is shared memory or not. It was already utilizing that information, so not the end of the world.
Something important to note is that running PA back to back immediately may result in lower results on the 2nd run, as the actual server may still be draining the abandoned requests from the first run.
Here is a before/after. Note that both of these include a change (that won't be part of this story) to make sure that request-rate can actually issue the requested rate.
Before:
After: