Implement MSELoss Function #3156

Open · wants to merge 19 commits into base: develop
Conversation

o2buzzle (Collaborator)

This PR ports the MSELoss family of loss functions to MIOpen:

  • MSELoss
  • MSELossUnreduced

Performance measurements suggest that, in general, we perform better than ROCm on forward reduced operations (mostly thanks to parallel reduction).

Sample performance measurements

float32
| op_name | dtype | Size-reduction | contiguous | model | direction | ROCm kernel time | MIOpen kernel time | MIOpenDriver cmdline | ROCm/MIOpen |
|---|---|---|---|---|---|---|---|---|---|
| MSELoss | float32 | [32 80]-sum | noncontiguous | tacotron2 | bwd | 20784 | 11804 | `mseloss -I 32x80 -r sum -Z 0_bwd` | 1.76075906472382 |
| MSELoss | float32 | [25 100 100]-mean | noncontiguous | random-md | fwd | 27568 | 16551 | `mseloss -I 25x100x100 -r mean -Z 0` | 1.66563953839647 |
| MSELoss | float32 | [25 300]-sum | noncontiguous | random-md | bwd | 15888 | 10204 | `mseloss -I 25x300 -r sum -Z 0_bwd` | 1.55703645629165 |
| MSELoss | float32 | [10 10 50 100 100]-none | noncontiguous | random-lg | bwd | 14407533 | 9301430 | `mseloss -I 10x10x50x100x100 -r none -Z 0_bwd` | 1.54895892352036 |
| MSELoss | float32 | [32 80]-mean | contiguous | tacotron2 | fwd | 16544 | 10791 | `mseloss -I 32x80 -r mean -Z 1_contig` | 1.53312945973496 |
| MSELoss | float32 | [25 300]-mean | noncontiguous | random-md | bwd | 15632 | 10400 | `mseloss -I 25x300 -r mean -Z 0_bwd` | 1.50307692307692 |
| MSELoss | float32 | [32 80]-none | noncontiguous | tacotron2 | bwd | 17760 | 12089 | `mseloss -I 32x80 -r none -Z 0_bwd` | 1.46910414426338 |
| MSELoss | float32 | [32 80]-sum | contiguous | tacotron2 | fwd | 15952 | 10862 | `mseloss -I 32x80 -r sum -Z 1_contig` | 1.46860614988032 |
| MSELoss | float32 | [10 100 100 100]-none | noncontiguous | random-lg | bwd | 2423129 | 1661950 | `mseloss -I 10x100x100x100 -r none -Z 0_bwd` | 1.45800355004663 |
| MSELoss | float32 | [2000 3000]-none | noncontiguous | random-lg | bwd | 547674 | 376328 | `mseloss -I 2000x3000 -r none -Z 0_bwd` | 1.45531026126145 |
| MSELoss | float32 | [25 100 100]-sum | contiguous | random-md | fwd | 23745 | 16676 | `mseloss -I 25x100x100 -r sum -Z 1_contig` | 1.42390261453586 |
| MSELoss | float32 | [25 100 100]-mean | contiguous | random-md | fwd | 23760 | 16711 | `mseloss -I 25x100x100 -r mean -Z 1_contig` | 1.42181796421519 |
| MSELoss | float32 | [2000 3000]-sum | contiguous | random-lg | fwd | 96514 | 68268 | `mseloss -I 2000x3000 -r sum -Z 1_contig` | 1.41375168453741 |
| MSELoss | float32 | [1000000]-mean | noncontiguous | random-lg | fwd | 33825 | 24231 | `mseloss -I 1000000 -r mean -Z 0` | 1.39593908629442 |
| MSELoss | float32 | [32 80 870]-sum | contiguous | t5 | fwd | 48673 | 34880 | `mseloss -I 32x80x870 -r sum -Z 1_contig` | 1.39544151376147 |
| MSELoss | float32 | [100 20 20 20]-none | noncontiguous | random-md | bwd | 83490 | 59877 | `mseloss -I 100x20x20x20 -r none -Z 0_bwd` | 1.39435843479132 |
| MSELoss | float32 | [32 80 870]-mean | contiguous | t5 | fwd | 48273 | 34934 | `mseloss -I 32x80x870 -r mean -Z 1_contig` | 1.38183431613901 |
| MSELoss | float32 | [100 10 10 10 10]-none | noncontiguous | random-md | bwd | 98626 | 71770 | `mseloss -I 100x10x10x10x10 -r none -Z 0_bwd` | 1.37419534624495 |
| MSELoss | float32 | [128 80 870]-sum | contiguous | tacotron2 | fwd | 129890 | 94687 | `mseloss -I 128x80x870 -r sum -Z 1_contig` | 1.37178282129543 |
| MSELoss | float32 | [2000 3000]-mean | contiguous | random-lg | fwd | 94226 | 68694 | `mseloss -I 2000x3000 -r mean -Z 1_contig` | 1.37167729350453 |
float16
| op_name | dtype | Size-reduction | contiguous | model | direction | ROCm kernel time | MIOpen kernel time | MIOpenDriver cmdline | ROCm/MIOpen |
|---|---|---|---|---|---|---|---|---|---|
| MSELoss | float16 | [25 100 100]-mean | noncontiguous | random-md | fwd | 30241 | 16124 | `mselossfp16 -I 25x100x100 -r mean -Z 0` | 1.87552716447532 |
| MSELoss | float16 | [25 300]-none | noncontiguous | random-md | bwd | 18624 | 10080 | `mselossfp16 -I 25x300 -r none -Z 0_bwd` | 1.84761904761905 |
| MSELoss | float16 | [25 100 100]-sum | noncontiguous | random-md | fwd | 28785 | 16160 | `mselossfp16 -I 25x100x100 -r sum -Z 0` | 1.78125 |
| MSELoss | float16 | [32 80]-mean | contiguous | tacotron2 | fwd | 16880 | 10506 | `mselossfp16 -I 32x80 -r mean -Z 1_contig` | 1.6067009328003 |
| MSELoss | float16 | [32 80]-sum | contiguous | tacotron2 | fwd | 17008 | 10649 | `mselossfp16 -I 32x80 -r sum -Z 1_contig` | 1.59714527185651 |
| MSELoss | float16 | [25 300]-sum | noncontiguous | random-md | bwd | 16368 | 10275 | `mselossfp16 -I 25x300 -r sum -Z 0_bwd` | 1.59299270072993 |
| MSELoss | float16 | [32 80]-sum | noncontiguous | tacotron2 | bwd | 19920 | 12515 | `mselossfp16 -I 32x80 -r sum -Z 0_bwd` | 1.59168997203356 |
| MSELoss | float16 | [25 300]-mean | noncontiguous | random-md | bwd | 16448 | 10453 | `mselossfp16 -I 25x300 -r mean -Z 0_bwd` | 1.5735195637616 |
| MSELoss | float16 | [32 80]-none | noncontiguous | tacotron2 | bwd | 18128 | 12035 | `mselossfp16 -I 32x80 -r none -Z 0_bwd` | 1.50627336933943 |
| MSELoss | float16 | [25 100 100]-sum | contiguous | random-md | fwd | 23392 | 15911 | `mselossfp16 -I 25x100x100 -r sum -Z 1_contig` | 1.47017786437056 |
| MSELoss | float16 | [32 80]-mean | noncontiguous | tacotron2 | bwd | 17344 | 12125 | `mselossfp16 -I 32x80 -r mean -Z 0_bwd` | 1.43043298969072 |
| MSELoss | float16 | [10 10 50 100 100]-none | noncontiguous | random-lg | bwd | 11586974 | 8183960 | `mselossfp16 -I 10x10x50x100x100 -r none -Z 0_bwd` | 1.41581508218515 |
| MSELoss | float16 | [25 100 100]-mean | contiguous | random-md | fwd | 22784 | 16107 | `mselossfp16 -I 25x100x100 -r mean -Z 1_contig` | 1.41454026199789 |
| MSELoss | float16 | [25 100 100]-none | noncontiguous | random-md | bwd | 30769 | 22134 | `mselossfp16 -I 25x100x100 -r none -Z 0_bwd` | 1.39012379145206 |
| MSELoss | float16 | [100 20 20 20]-none | noncontiguous | random-md | bwd | 75393 | 57352 | `mselossfp16 -I 100x20x20x20 -r none -Z 0_bwd` | 1.31456618775282 |
| MSELoss | float16 | [100 10 10 10 10]-none | noncontiguous | random-md | bwd | 89410 | 68943 | `mselossfp16 -I 100x10x10x10x10 -r none -Z 0_bwd` | 1.29686842754159 |
| MSELoss | float16 | [2000 3000]-none | noncontiguous | random-lg | bwd | 463896 | 360150 | `mselossfp16 -I 2000x3000 -r none -Z 0_bwd` | 1.28806330695544 |
| MSELoss | float16 | [10 100 100 100]-none | noncontiguous | random-lg | bwd | 1869280 | 1504310 | `mselossfp16 -I 10x100x100x100 -r none -Z 0_bwd` | 1.24261621607249 |
| MSELoss | float16 | [25 100 100]-mean | noncontiguous | random-md | bwd | 26752 | 21618 | `mselossfp16 -I 25x100x100 -r mean -Z 0_bwd` | 1.23748727911925 |
| MSELoss | float16 | [1000000]-mean | contiguous | random-lg | fwd | 29456 | 24249 | `mselossfp16 -I 1000000 -r mean -Z 1_contig` | 1.21473050435069 |
bfloat16
| op_name | dtype | Size-reduction | contiguous | model | direction | ROCm kernel time | MIOpen kernel time | MIOpenDriver cmdline | ROCm/MIOpen |
|---|---|---|---|---|---|---|---|---|---|
| MSELoss | bfloat16 | [100 20 20 20]-sum | noncontiguous | random-md | fwd | 43889 | 22329 | `mselossbfp16 -I 100x20x20x20 -r sum -Z 0` | 1.96556048188454 |
| MSELoss | bfloat16 | [25 300]-sum | contiguous | random-md | fwd | 19904 | 10169 | `mselossbfp16 -I 25x300 -r sum -Z 1_contig` | 1.95732127052808 |
| MSELoss | bfloat16 | [100 10 10 10 10]-mean | noncontiguous | random-md | fwd | 48049 | 24693 | `mselossbfp16 -I 100x10x10x10x10 -r mean -Z 0` | 1.94585510063581 |
| MSELoss | bfloat16 | [10000]-mean | noncontiguous | random-md | fwd | 19664 | 10151 | `mselossbfp16 -I 10000 -r mean -Z 0` | 1.93714904935474 |
| MSELoss | bfloat16 | [100 20 20 20]-mean | noncontiguous | random-md | fwd | 42273 | 22347 | `mselossbfp16 -I 100x20x20x20 -r mean -Z 0` | 1.89166331051148 |
| MSELoss | bfloat16 | [25 300]-none | noncontiguous | random-md | bwd | 19216 | 10293 | `mselossbfp16 -I 25x300 -r none -Z 0_bwd` | 1.86689983483921 |
| MSELoss | bfloat16 | [25 100 100]-sum | noncontiguous | random-md | fwd | 29441 | 16302 | `mselossbfp16 -I 25x100x100 -r sum -Z 0` | 1.80597472702736 |
| MSELoss | bfloat16 | [25 100 100]-mean | noncontiguous | random-md | fwd | 28848 | 16107 | `mselossbfp16 -I 25x100x100 -r mean -Z 0` | 1.79102253678525 |
| MSELoss | bfloat16 | [32 80]-mean | contiguous | tacotron2 | fwd | 17456 | 10240 | `mselossbfp16 -I 32x80 -r mean -Z 1_contig` | 1.7046875 |
| MSELoss | bfloat16 | [32 80]-sum | contiguous | tacotron2 | fwd | 17616 | 10507 | `mselossbfp16 -I 32x80 -r sum -Z 1_contig` | 1.67659655467783 |
| MSELoss | bfloat16 | [32 80]-sum | noncontiguous | tacotron2 | bwd | 21296 | 13013 | `mselossbfp16 -I 32x80 -r sum -Z 0_bwd` | 1.63651732882502 |
| MSELoss | bfloat16 | [25 100 100]-sum | contiguous | random-md | fwd | 24832 | 15982 | `mselossbfp16 -I 25x100x100 -r sum -Z 1_contig` | 1.55374796646227 |
| MSELoss | bfloat16 | [25 100 100]-mean | contiguous | random-md | fwd | 24592 | 15911 | `mselossbfp16 -I 25x100x100 -r mean -Z 1_contig` | 1.5455973854566 |
| MSELoss | bfloat16 | [10 100 100 100]-none | noncontiguous | random-lg | bwd | 1945089 | 1296260 | `mselossbfp16 -I 10x100x100x100 -r none -Z 0_bwd` | 1.50053924367025 |
| MSELoss | bfloat16 | [10 10 50 100 100]-none | noncontiguous | random-lg | bwd | 11904466 | 8223780 | `mselossbfp16 -I 10x10x50x100x100 -r none -Z 0_bwd` | 1.44756620434885 |
| MSELoss | bfloat16 | [25 300]-sum | noncontiguous | random-md | bwd | 15104 | 10595 | `mselossbfp16 -I 25x300 -r sum -Z 0_bwd` | 1.42557810287872 |
| MSELoss | bfloat16 | [25 300]-mean | noncontiguous | random-md | bwd | 14992 | 10649 | `mselossbfp16 -I 25x300 -r mean -Z 0_bwd` | 1.40783172128838 |
| MSELoss | bfloat16 | [25 100 100]-none | noncontiguous | random-md | bwd | 31489 | 22685 | `mselossbfp16 -I 25x100x100 -r none -Z 0_bwd` | 1.38809786202336 |
| MSELoss | bfloat16 | [32 80]-none | noncontiguous | tacotron2 | bwd | 17792 | 13155 | `mselossbfp16 -I 32x80 -r none -Z 0_bwd` | 1.35248954770049 |
| MSELoss | bfloat16 | [32 80]-mean | noncontiguous | tacotron2 | bwd | 16928 | 12835 | `mselossbfp16 -I 32x80 -r mean -Z 0_bwd` | 1.3188936501753 |

Average performance

  • MSELoss

| dtype | forward | backward |
|---|---|---|
| float32 | 2.52 | 0.74 |
| float16 | 2.08 | 0.66 |
| bfloat16 | 2.12 | 0.66 |

  • MSELossUnreduced

| dtype | forward | backward |
|---|---|---|
| float32 | 0.53 | 0.92 |
| float16 | 0.45 | 0.82 |
| bfloat16 | 0.49 | 0.86 |

Codepaths that do not yield sufficient performance gains have been cordoned off and made unavailable.

@o2buzzle (Collaborator, Author)

Please note: this PR shares some code (particularly warp-level reduction, tensor views, etc.) with other PRs in this group of Moreh's upstream requests. We'll consolidate them as they get closer to being merged, to avoid merge conflicts.

@o2buzzle (Collaborator, Author)

Also, is there a way we can view the runner's output for easier debugging? I guess we can just ask you for the failing test, but it might be easier and less work for you if we can see what is wrong on our own.

@o2buzzle (Collaborator, Author) commented Aug 1, 2024

Hmmm... is the Static build the same as setting -DMIOPEN_EMBED_BUILD=On? If I set that on my local build, even develop seems to fail to build. Never mind, it's something else.

@o2buzzle (Collaborator, Author) commented Aug 2, 2024

The Windows build is not passing, but that is to be expected (please check #2970; previous conversations suggest it was the cause).

@o2buzzle o2buzzle marked this pull request as ready for review August 2, 2024 02:40
@CAHEK7 (Contributor) commented Aug 9, 2024

This algorithm is very similar to #3143; could you explain why you use a different indexing scheme? @BuiChiTrung, could you help too?

Also, could you remove the GPU-specific parts from the CPU implementation (more details in this comment: #3143 (comment))?
And may I ask you to align the test with the latest test design document? (https://github.com/ROCm/MIOpen/wiki/GTest-development)

@CAHEK7 CAHEK7 mentioned this pull request Aug 9, 2024
@o2buzzle (Collaborator, Author) commented Aug 9, 2024

> Also could you remove GPU specific parts from CPU implementation (more details and in this comment #3143 (comment)) And may I ask you to align the test to the latest test design document? (https://github.com/ROCm/MIOpen/wiki/GTest-development)

The greatest enemy here is reduction, particularly because floating-point computations depend on the processor and on evaluation order. You can see in our downstream conversations here why I had to write a whole section that literally mimics how parallel warp-level reduction behaves on GPUs, just to make verification work.

@CAHEK7 (Contributor) commented Aug 9, 2024

> Also could you remove GPU specific parts from CPU implementation (more details and in this comment #3143 (comment)) And may I ask you to align the test to the latest test design document? (https://github.com/ROCm/MIOpen/wiki/GTest-development)

> Greatest enemy here is reduction, particularly because floating point computations are handled differently between processors and ordering. You can check why I had to have a whole section to literally just mimic how parallel warp-level reduction on GPUs would behave in our downstream conversations here in order for verification to work.

Such a huge (really huge?) error means that the kernel does not perform the reduction in an acceptable way. The algorithm implies finding an average across some dimension; mathematically, it should just accumulate all the numbers and divide by the number of elements. If during the accumulation you get an error so large that you have to mimic GPU behavior instead of straightforward accumulation, the algorithm is not performing a mean calculation: it calculates something else, and if you alter, for example, the block size, it will calculate something else again.

There are only two ways to fix this, and both require getting rid of GPU-specific code:

  1. Increase the error tolerance, if that is acceptable. We are comparing against the mathematical meaning of the algorithm.
  2. If increasing the tolerance is not possible, the algorithm must be improved, because it does not produce a mathematically correct result within acceptable error.

@o2buzzle (Collaborator, Author) commented Aug 9, 2024

> Such huge (really huge?) error means that the kernel doesn't perform reduction in acceptable way.

Technically speaking, it's not unacceptable; it's just a side effect of doing parallel reduction in this manner (the ordering of the added floating-point values differs from straight serial addition). If you want it to "match" what a naive addition would do, the only "acceptable" way would be literally just that (pull all values back to the host and add them there). Then it has a chance of matching up.

> it calculates something else

I'm not sure this really is calculating something else. It is, for what it's worth, adding all the values together and then dividing by another value (or the reverse in this case: dividing within each thread, then adding up the partial results). Again, the issue here is what kind of floating-point-induced error we are willing to accept.

> increase error tolerance, if it is acceptable. We are comparing against mathematical meaning of the algorithm.

Probably this one, but I think it would be slightly difficult to come up with a number that both covers all cases and is not comically huge.

@CAHEK7 (Contributor) commented Aug 9, 2024

> Such huge (really huge?) error means that the kernel doesn't perform reduction in acceptable way.

> It’s not technically speaking, unacceptable, it’s just a side effect of when doing parallel reduction in this manner (causes the ordering of the added floating point values to differ from straight serial addition). If you want it to “match” what a naive addition would be doing, the only “acceptable” way would literally be just that (pull all values back to host, add them there). Then it has a chance of matching up.

I'm perfectly aware of parallel-reduction pitfalls and floating-point operations. Luckily we are not using atomics here. But again: the verification algorithm must be algorithm-agnostic and as generic as possible.

> Probably this one, but I think it would be slightly difficult to come up with a number that would both cover all cases and not be comically huge.

If there is no acceptable tolerance for some input data, it just means the algorithm is not applicable to that case, and this problem must not be hidden by a perfectly tuned verification algorithm.

We can implement another version of MSELoss, and we will use the same verification algorithm for it, because it is still MSELoss. That is also how to compare precision and stability between algorithms: against the same naive implementation, as close as possible to the mathematical meaning, probably with even higher-precision accumulators.

@long10024070 long10024070 requested a review from CAHEK7 August 17, 2024 15:29
@o2buzzle (Collaborator, Author)

I'm also working on integrating @long10024070's MIOpenReduceSum into the reduction part and removing that duplicated code. Due to some reorganization, please expect some delays on that.

@iq136boy (Contributor)

@CAHEK7, please take a look at the latest changes and comments if you have any concerns.

@o2buzzle (Collaborator, Author)

The git tree became unreadable after the last merge attempt; I think I will just squash and rebase everything. That makes final reviews easier.

@o2buzzle o2buzzle force-pushed the o2buzzle/impl_MSELoss branch from de47aa0 to b619047 Compare September 30, 2024 03:21
@o2buzzle o2buzzle force-pushed the o2buzzle/impl_MSELoss branch from b619047 to 8a1f508 Compare September 30, 2024 06:54
@iq136boy (Contributor) left a comment


CI test failed log:

[2024-10-02T03:20:25.464Z] WARNING:GTest name checker:Name: MSELossTest/MSELossTestFloat. Mismatch types: Prefix Hw Datatype

[2024-10-02T03:20:25.464Z] WARNING:GTest name checker:Name: MSELossTest/MSELossTestHalf. Mismatch types: Prefix Hw Datatype

[2024-10-02T03:20:25.464Z] WARNING:GTest name checker:Name: MSELossTest/MSELossTestBfloat16. Mismatch types: Prefix Hw Datatype

[2024-10-02T03:20:25.464Z] CRITICAL:GTest name checker:Tests do not match to the test naming scheme (see https://github.com/ROCm/MIOpen/wiki/GTest-development#naming )

@long10024070 (Collaborator) left a comment


Please update the reduction part in your code, following the implementation in #3166.

@long10024070 long10024070 marked this pull request as draft November 4, 2024 13:02
@long10024070 long10024070 changed the title Implement MSELoss loss function Implement MSELoss Function Feb 16, 2025
@long10024070 long10024070 marked this pull request as ready for review February 17, 2025 09:53
4 participants