Implement MSELoss Function #3156
base: develop
Conversation
Please note: this PR shares some code (particularly warp-level reduction, tensor views, etc.) with other PRs in this group of Moreh's upstream requests. We'll consolidate them as they get closer to being merged, to avoid merge conflicts.
Also, is there any way for us to view the runner's output for easier debugging? We could just ask you for the failing test, but it would be easier and less work for you if we could see what is wrong on our own.
The Windows build is not passing, but that is to be expected (please check #2970; previous conversations suggest it was the cause).
This algorithm is very similar to #3143. Could you explain why you use a different indexing scheme? @BuiChiTrung, could you help too? Also, could you remove the GPU-specific parts from the CPU implementation? (More details in this comment: #3143 (comment).)
The greatest enemy here is reduction, particularly because floating-point computations are handled differently across processors and across summation orders. You can see in our downstream conversations here why I had to add a whole section that literally mimics how parallel warp-level reduction behaves on GPUs, just to make verification pass.
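To make the ordering issue concrete, here is an illustrative sketch (plain NumPy for demonstration, not MIOpen code) contrasting serial accumulation with a pairwise, warp-style tree reduction in float32. The two orders round differently, so they generally disagree at the level of float32 rounding error even though both are valid summations:

```python
import numpy as np

def serial_sum(values):
    # Sequential accumulation in float32 -- the order a naive CPU loop uses.
    acc = np.float32(0.0)
    for v in values:
        acc = np.float32(acc + v)
    return acc

def tree_sum(values):
    # Pairwise (tree) reduction in float32 -- roughly the order in which a
    # GPU warp-level reduction combines its partial sums.
    vals = [np.float32(v) for v in values]
    while len(vals) > 1:
        half = (len(vals) + 1) // 2
        vals = [np.float32(vals[i] + vals[i + half]) if i + half < len(vals)
                else vals[i] for i in range(half)]
    return vals[0]

rng = np.random.default_rng(0)
data = rng.standard_normal(1 << 12).astype(np.float32)
serial = serial_sum(data)
tree = tree_sum(data)
reference = np.sum(data.astype(np.float64))  # high-precision reference
# Both results sit within float32 rounding error of the float64 reference,
# but they need not equal each other bit-for-bit.
```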
Such a huge (really huge?) error means the kernel does not perform the reduction in an acceptable way. The algorithm implies finding an average across some dimension; mathematically, it should just accumulate all the numbers and divide by the number of elements. If during accumulation you get an error so large that you have to mimic GPU behavior instead of using straightforward accumulation, then the algorithm is not computing a mean; it is computing something else, and if you alter, for example, the block size, it will compute yet another thing. There are only two ways to fix this, and both require getting rid of the GPU-specific code.
Technically speaking, it's not unacceptable; it's just a side effect of doing parallel reduction this way (the ordering of the floating-point additions differs from straight serial addition). If you want it to "match" what a naive summation would do, the only "acceptable" way would literally be to do just that (pull all the values back to the host and add them there). Then it has a chance of matching.
I'm not sure this really is computing something else. It is, for what it's worth, adding all the values together and then dividing by another value (or the reverse in this case: dividing within each thread, then adding the partial sums). Again, the issue here is what level of floating-point-induced error we are willing to accept.
Probably this one, but I think it would be difficult to come up with a tolerance that both covers all cases and is not comically huge.
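For reference, one common heuristic for such a tolerance (an assumption on my part, not what MIOpen's verifier actually does) is to scale it with the input size: the rounding error of a length-n float32 reduction grows roughly like sqrt(n)·eps for random data, and like n·eps in the worst case, so a single fixed constant either fails large tensors or is absurdly loose for small ones:

```python
import numpy as np

def reduction_tolerance(n, dtype=np.float32):
    # Hypothetical size-aware tolerance: sqrt(n) * machine epsilon.
    # Grows with tensor size instead of being one fixed constant.
    return float(np.sqrt(n) * np.finfo(dtype).eps)
```

For float32 this grows from roughly 4e-6 at n = 1024 to roughly 1.2e-4 at n = 2^20, which illustrates why one constant cannot cover all cases.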
I'm perfectly aware of parallel-reduction pitfalls and floating-point operations. Luckily, we are not using atomics here. But again: the verification algorithm must be algorithm-agnostic and as generic as possible.
If there is no acceptable tolerance for some input data, it just means the algorithm is not applicable to that case, and this problem must not be hidden by a perfectly tuned verification algorithm. We can implement another version of MSELoss, and we will use the same verification algorithm for it, because it is still MSELoss. That is also how we compare precision and stability between algorithms: against the same naive implementation, as close to the mathematical meaning as possible, probably with even higher-precision accumulators.
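A minimal sketch of such an algorithm-agnostic reference, assuming a NumPy-style verifier (`mse_loss_ref` is a hypothetical name, not MIOpen's actual verification routine):

```python
import numpy as np

def mse_loss_ref(pred, target):
    # Naive reference: promote inputs to float64 before accumulating, so
    # the check does not depend on any particular reduction order, block
    # size, or other implementation detail of the kernel under test.
    diff = pred.astype(np.float64) - target.astype(np.float64)
    return (diff * diff).sum() / diff.size

pred = np.array([1.0, 2.0, 3.0], dtype=np.float32)
target = np.array([1.5, 2.0, 2.0], dtype=np.float32)
# ((-0.5)**2 + 0**2 + 1**2) / 3 = 1.25 / 3
loss = mse_loss_ref(pred, target)
```

Because the accumulator is higher precision than any float32 kernel's, the same reference can score different kernel implementations on equal footing.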
I'm also working on integrating @long10024070's MIOpenReduceSum into the reduction part and removing the duplicated code. Due to some reorganization, please expect some delays on that.
@CAHEK7, please take a look at the latest changes and comment if you have any concerns.
The git tree became unreadable after the last merge attempt; I think I will just squash and rebase everything. That makes final reviews easier.
CI test failed log:

```
[2024-10-02T03:20:25.464Z] WARNING:GTest name checker:Name: MSELossTest/MSELossTestFloat. Mismatch types: Prefix Hw Datatype
[2024-10-02T03:20:25.464Z] WARNING:GTest name checker:Name: MSELossTest/MSELossTestHalf. Mismatch types: Prefix Hw Datatype
[2024-10-02T03:20:25.464Z] WARNING:GTest name checker:Name: MSELossTest/MSELossTestBfloat16. Mismatch types: Prefix Hw Datatype
[2024-10-02T03:20:25.464Z] CRITICAL:GTest name checker:Tests do not match to the test naming scheme (see https://github.com/ROCm/MIOpen/wiki/GTest-development#naming )
```
Please update the reduction part in your code, following the implementation in #3166.
This PR ports the MSELoss family of loss functions to MIOpen:
- MSELoss
- MSELossUnreduced
Performance measurements suggest that, in general, we outperform ROCm on the forward, reduced operation (mostly thanks to parallel reduction).
Sample performance measurements (average performance for MSELoss and MSELossUnreduced, in float32, float16, and bfloat16) — table data not captured in this extract.
Code paths that do not yield sufficient performance gains have been cordoned off and made unavailable.