
Fast PA teardown #674

Open · wants to merge 3 commits into main from tgerdes-faster-teardown
Conversation

@tgerdesnv (Collaborator) commented May 24, 2024

Adds the ability for PA to exit more quickly.

Old behavior was that PA would always wait for all requests to finish before exiting. For the case of async request-rate combined with a slow model, this could add up to many minutes of waiting.

New behavior is that for non-shared-memory cases, PA will exit immediately and drop remaining requests on the floor.

For the case of PA sweeping through multiple values (request-rate 10:20:10 for example), it WILL still wait for all requests from the first experiment to finish before going to the next step.

One messy aspect is that the LoadManager now needs to remember whether or not it is using shared memory. It was already utilizing that information, so it's not the end of the world.
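As a rough illustration of the policy described above, here is a minimal sketch of what the teardown decision could look like. The class and member names (`LoadManager`, `using_shared_memory_`, `num_active_infer_calls_`, `Teardown`) are illustrative, not the actual PA identifiers: only the shared-memory path drains outstanding requests, since freeing shared-memory regions while the server may still read them is unsafe; the non-shared-memory path returns immediately and abandons in-flight requests.

```cpp
#include <atomic>
#include <chrono>
#include <thread>

// Sketch only -- names are hypothetical, not the real PA code.
class LoadManager {
 public:
  explicit LoadManager(bool using_shared_memory)
      : using_shared_memory_(using_shared_memory) {}

  void OnRequestStart() { ++num_active_infer_calls_; }
  void OnRequestComplete() { --num_active_infer_calls_; }
  int active() const { return num_active_infer_calls_; }

  // Called once profiling is done and PA wants to exit.
  void Teardown() {
    if (!using_shared_memory_) {
      // Fast path: exit immediately, dropping in-flight requests.
      return;
    }
    // Shared memory: wait until the server has finished every request
    // before the shared-memory regions can be safely released.
    while (num_active_infer_calls_ > 0) {
      std::this_thread::sleep_for(std::chrono::milliseconds(1));
    }
  }

 private:
  bool using_shared_memory_;
  std::atomic<int> num_active_infer_calls_{0};
};
```

In a sweep (e.g. request-rate 10:20:10), the draining wait would still run between experiments regardless of this flag, so results from one step do not bleed into the next.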

Something important to note is that running PA back to back may produce lower numbers on the second run, since the server may still be draining the abandoned requests from the first run.

Here is a before/after comparison. Note that both runs include a change (that won't be part of this story) to make sure that request-rate can actually issue the requested rate.

Before:
faster_teardown_without_changes

After:
faster_teardown_with_changes

@tgerdesnv tgerdesnv force-pushed the tgerdes-faster-teardown branch 2 times, most recently from 1c55334 to 3604a7d Compare May 29, 2024 16:43
@tgerdesnv tgerdesnv marked this pull request as ready for review May 31, 2024 15:11
{
  // Make sure no requests carry over to the next test
  while (stats_->num_active_infer_calls) {
    std::this_thread::sleep_for(std::chrono::milliseconds(1));
  }
}
Contributor:

Do you want a configurable timeout? Otherwise, it's going to be hard to know why the test times out for someone seeing it fail for that reason.
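One way the suggested configurable timeout could look, sketched as a free function (the name `WaitForActiveCalls` and its signature are hypothetical, not the actual PA code): poll as before, but give up after a deadline and return whether the drain completed, so the caller can log a clear reason instead of hanging silently.

```cpp
#include <atomic>
#include <chrono>
#include <thread>

// Sketch of a bounded drain wait. Returns true if all active calls
// completed before the timeout, false if the deadline was hit.
bool WaitForActiveCalls(const std::atomic<int>& num_active_infer_calls,
                        std::chrono::milliseconds timeout) {
  const auto deadline = std::chrono::steady_clock::now() + timeout;
  while (num_active_infer_calls > 0) {
    if (std::chrono::steady_clock::now() >= deadline) {
      return false;  // caller can report "timed out waiting for requests"
    }
    std::this_thread::sleep_for(std::chrono::milliseconds(1));
  }
  return true;
}
```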

@matthewkotila (Contributor) left a comment:

Submitting part-way through to minimize emails to you but still give you something to respond to, since I couldn't finish the review in one sitting.

Contributor:

The first run is 1m54s:

faster_teardown_without_changes

The second run is 24s.

Is the additional time for the first run mainly after the last pass? Or is it distributed among the passes?

Collaborator (Author):

Yes, it was 100% after the final results were printed. The old behavior was that it just sat there.

Contributor:

In the future, can you suggest an order of files to review along with some line-specific PR comments explaining at a high level what the changes are for?

Contributor:

One high-level question that crossed my mind: do we want to just drop requests, as this PR implements?

Maybe we could instead print a message explaining that profiling is complete, but that PA is waiting for unfinished requests and the user can ^C early if they don't mind that the server may still be processing those requests.

Collaborator (Author):

Longer term what we need is a proper way to cancel requests. Short term, this is holding up some GenAI-Perf scripting, where they are running with different request-rates in a bash script. ^C isn't going to help them.
