rfcs: add proposal for collecting and analyzing synthetic data #1999
base: rfcs
Conversation
Force-pushed from f2734b8 to ae021b6
The second modification that will be needed for benchdnn is a new low-fidelity batched benchmarking mode. This mode would only execute each workload once, with the expectation that run-to-run variability is handled by executing multiple similar workloads. The benchmarking is expected to be performed as follows on GPU, although some tweaks may be required to get performance stability
I think this functionality is currently present in benchdnn via the --fix-times-per-prb knob; we could set it to 1 or some other small value.
The problem with the --fix-times-per-prb knob is that it is not set up to keep the device saturated with work. This could cause issues with benchmarking if the target device enters some kind of power-saving mode while oneDNN primitives are being created. To avoid that, benchdnn currently performs a warmup phase for each workload. This warmup should be unnecessary in a batched benchmarking mode, and since the goal here is to execute each primitive exactly once, the existing warmup phase would significantly reduce benchmarking throughput.
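For illustration only, here is a minimal sketch (not benchdnn code) of what such a batched, warmup-free measurement loop could look like with the oneDNN C++ API; the workload list is assumed to be produced elsewhere by the generator, and only the aggregate batch time is measured:

```cpp
#include <chrono>
#include <unordered_map>
#include <utility>
#include <vector>

#include "dnnl.hpp"

// Hypothetical batched measurement: every pre-created primitive is executed
// exactly once, back to back on a single stream, so the device stays
// saturated and no per-workload warmup phase is needed. Only the aggregate
// time for the whole batch is measured here.
double run_batch_once(dnnl::engine &eng,
        const std::vector<std::pair<dnnl::primitive,
                std::unordered_map<int, dnnl::memory>>> &workloads) {
    dnnl::stream strm(eng);
    auto start = std::chrono::steady_clock::now();
    for (const auto &w : workloads)
        w.first.execute(strm, w.second); // enqueue without intermediate waits
    strm.wait(); // single synchronization point for the whole batch
    auto end = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(end - start).count();
}
```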
* Automated Diffing: As one of the relevant metrics is a comparison between oneDNN builds, this tool needs a way to execute multiple oneDNN libraries. As a consequence, we can add a testing API for comparing implementations. When implementations match between oneDNN builds, we can often skip benchmarking. As oneDNN primitive creation and dispatching can be quite complicated, this is beneficial for developers, as they cannot always predict the effect of an implementation change. In addition, if functionality validation is supported, this provides a much faster way to initially validate correctness by skipping unchanged implementations.
You're suggesting that the tool could be used to compare performance (or correctness) differences between two versions of oneDNN? How would you handle changes to the API (breaking or otherwise) between the versions? It seems to me like this feature would add unnecessary complexity.
> How would you handle changes to the API (breaking or otherwise) between the versions?
As things currently stand, this would not be supported. The main intention here is a fast-to-run regression test on a PR, not testing between something like oneDNN releases. The problem, as I see it, is that after an optimization is made, a (most likely small) percentage of workloads actually change, and we need to validate that optimization. If we can collect performance for 200 changed cases, we have effective evidence for how the PR performs in practice. This can be done in under a minute, but only if we can quickly identify what has changed. In addition, by using an automated search, we remove the accidental bias caused by developers restricting the search space to cases they believe are affected, which may not align with reality.
* Option 2: Hash relevant primitive data to generate a UUID

Given that we are looking at generating millions of primitives, and serialized primitive data is likely to be in the tens of kilobytes in size, Option 1
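As an illustration of Option 2 (the excerpt does not prescribe a specific hash; 64-bit FNV-1a is used here purely as an example), the serialized primitive data could be folded into a fixed-width identifier like this:

```cpp
#include <cstdint>
#include <vector>

// Illustrative only: fold arbitrarily large serialized primitive data
// (tens of kilobytes) into a fixed-size 64-bit identifier with FNV-1a.
uint64_t primitive_uuid(const std::vector<uint8_t> &serialized) {
    uint64_t h = 0xcbf29ce484222325ULL; // FNV-1a 64-bit offset basis
    for (uint8_t byte : serialized) {
        h ^= byte;
        h *= 0x100000001b3ULL; // FNV-1a 64-bit prime
    }
    return h;
}
```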
There is an API to query a cache blob ID from primitive descriptors to distinguish different primitives. Would it work here and have you checked how big it is?
I had not considered that API. The biggest issue I see is that primitive descriptors do not contain enough information. For example, consider the recent GEMM fix in commit 25f6896. This only affects the GPU kernel binary generated by the jit:gemm implementation, so a kernel changed by this commit cannot be caught within the scope of the primitive descriptor.
On the other hand, the previous change in commit e8e05b4 would affect the primitive descriptor, so in theory it could be caught by a cache blob ID (with appropriate modifications for this task). I had considered having a mode that filters on primitive descriptor changes, which boils down to a performance/accuracy tradeoff that developers can choose based on the work being performed. I decided not to include it until we know generating primitives is too slow in practice. While this is definitely a potential issue, as we can only generate around ~100 OpenCL kernels per second for the GPU, it shouldn't be an issue with the move to reusable implementations unless searching lots of post-op combinations becomes important.
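For reference, a minimal sketch of checking the size question raised above via the persistent-cache API, assuming a oneDNN version that exposes get_cache_blob_id() on primitive descriptors; problem setup is left to the caller:

```cpp
#include <cstdint>
#include <cstdio>
#include <vector>

#include "dnnl.hpp"

// Sketch: given an already-created primitive descriptor, report how large
// its cache blob ID is. Relies on the persistent-cache API
// (primitive_desc_base::get_cache_blob_id) in recent oneDNN versions.
void report_cache_blob_id_size(const dnnl::primitive_desc_base &pd) {
    std::vector<uint8_t> id = pd.get_cache_blob_id();
    std::printf("cache blob ID size: %zu bytes\n", id.size());
}
```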
Force-pushed from d4a1b20 to 21cef93
Force-pushed from 21cef93 to 029767c
Document Link
This RFC proposes new tooling to enable a data pipeline that uses synthetic data to analyze and validate oneDNN performance.