Implement MSELoss Function #3156

Open · wants to merge 19 commits into base: develop
Conversation

o2buzzle (Collaborator)

This PR ports the MSELoss family of loss functions to MIOpen:

  • MSELoss
  • MSELossUnreduced

Performance measurements suggest that, in general, we perform better than ROCm on forward reduced operations (mostly thanks to parallel reduction).

Sample performance measurements

float32
| op_name | dtype | Size-reduction | contiguous | model | direction | ROCm kernel time | MIOpen kernel time | MIOpenDriver cmdline | ROCm/MIOpen |
|---|---|---|---|---|---|---|---|---|---|
| MSELoss | float32 | [32 80]-sum | noncontiguous | tacotron2 | bwd | 20784 | 11804 | `mseloss -I 32x80 -r sum -Z 0_bwd` | 1.76075906472382 |
| MSELoss | float32 | [25 100 100]-mean | noncontiguous | random-md | fwd | 27568 | 16551 | `mseloss -I 25x100x100 -r mean -Z 0` | 1.66563953839647 |
| MSELoss | float32 | [25 300]-sum | noncontiguous | random-md | bwd | 15888 | 10204 | `mseloss -I 25x300 -r sum -Z 0_bwd` | 1.55703645629165 |
| MSELoss | float32 | [10 10 50 100 100]-none | noncontiguous | random-lg | bwd | 14407533 | 9301430 | `mseloss -I 10x10x50x100x100 -r none -Z 0_bwd` | 1.54895892352036 |
| MSELoss | float32 | [32 80]-mean | contiguous | tacotron2 | fwd | 16544 | 10791 | `mseloss -I 32x80 -r mean -Z 1_contig` | 1.53312945973496 |
| MSELoss | float32 | [25 300]-mean | noncontiguous | random-md | bwd | 15632 | 10400 | `mseloss -I 25x300 -r mean -Z 0_bwd` | 1.50307692307692 |
| MSELoss | float32 | [32 80]-none | noncontiguous | tacotron2 | bwd | 17760 | 12089 | `mseloss -I 32x80 -r none -Z 0_bwd` | 1.46910414426338 |
| MSELoss | float32 | [32 80]-sum | contiguous | tacotron2 | fwd | 15952 | 10862 | `mseloss -I 32x80 -r sum -Z 1_contig` | 1.46860614988032 |
| MSELoss | float32 | [10 100 100 100]-none | noncontiguous | random-lg | bwd | 2423129 | 1661950 | `mseloss -I 10x100x100x100 -r none -Z 0_bwd` | 1.45800355004663 |
| MSELoss | float32 | [2000 3000]-none | noncontiguous | random-lg | bwd | 547674 | 376328 | `mseloss -I 2000x3000 -r none -Z 0_bwd` | 1.45531026126145 |
| MSELoss | float32 | [25 100 100]-sum | contiguous | random-md | fwd | 23745 | 16676 | `mseloss -I 25x100x100 -r sum -Z 1_contig` | 1.42390261453586 |
| MSELoss | float32 | [25 100 100]-mean | contiguous | random-md | fwd | 23760 | 16711 | `mseloss -I 25x100x100 -r mean -Z 1_contig` | 1.42181796421519 |
| MSELoss | float32 | [2000 3000]-sum | contiguous | random-lg | fwd | 96514 | 68268 | `mseloss -I 2000x3000 -r sum -Z 1_contig` | 1.41375168453741 |
| MSELoss | float32 | [1000000]-mean | noncontiguous | random-lg | fwd | 33825 | 24231 | `mseloss -I 1000000 -r mean -Z 0` | 1.39593908629442 |
| MSELoss | float32 | [32 80 870]-sum | contiguous | t5 | fwd | 48673 | 34880 | `mseloss -I 32x80x870 -r sum -Z 1_contig` | 1.39544151376147 |
| MSELoss | float32 | [100 20 20 20]-none | noncontiguous | random-md | bwd | 83490 | 59877 | `mseloss -I 100x20x20x20 -r none -Z 0_bwd` | 1.39435843479132 |
| MSELoss | float32 | [32 80 870]-mean | contiguous | t5 | fwd | 48273 | 34934 | `mseloss -I 32x80x870 -r mean -Z 1_contig` | 1.38183431613901 |
| MSELoss | float32 | [100 10 10 10 10]-none | noncontiguous | random-md | bwd | 98626 | 71770 | `mseloss -I 100x10x10x10x10 -r none -Z 0_bwd` | 1.37419534624495 |
| MSELoss | float32 | [128 80 870]-sum | contiguous | tacotron2 | fwd | 129890 | 94687 | `mseloss -I 128x80x870 -r sum -Z 1_contig` | 1.37178282129543 |
| MSELoss | float32 | [2000 3000]-mean | contiguous | random-lg | fwd | 94226 | 68694 | `mseloss -I 2000x3000 -r mean -Z 1_contig` | 1.37167729350453 |
float16
| op_name | dtype | Size-reduction | contiguous | model | direction | ROCm kernel time | MIOpen kernel time | MIOpenDriver cmdline | ROCm/MIOpen |
|---|---|---|---|---|---|---|---|---|---|
| MSELoss | float16 | [25 100 100]-mean | noncontiguous | random-md | fwd | 30241 | 16124 | `mselossfp16 -I 25x100x100 -r mean -Z 0` | 1.87552716447532 |
| MSELoss | float16 | [25 300]-none | noncontiguous | random-md | bwd | 18624 | 10080 | `mselossfp16 -I 25x300 -r none -Z 0_bwd` | 1.84761904761905 |
| MSELoss | float16 | [25 100 100]-sum | noncontiguous | random-md | fwd | 28785 | 16160 | `mselossfp16 -I 25x100x100 -r sum -Z 0` | 1.78125 |
| MSELoss | float16 | [32 80]-mean | contiguous | tacotron2 | fwd | 16880 | 10506 | `mselossfp16 -I 32x80 -r mean -Z 1_contig` | 1.6067009328003 |
| MSELoss | float16 | [32 80]-sum | contiguous | tacotron2 | fwd | 17008 | 10649 | `mselossfp16 -I 32x80 -r sum -Z 1_contig` | 1.59714527185651 |
| MSELoss | float16 | [25 300]-sum | noncontiguous | random-md | bwd | 16368 | 10275 | `mselossfp16 -I 25x300 -r sum -Z 0_bwd` | 1.59299270072993 |
| MSELoss | float16 | [32 80]-sum | noncontiguous | tacotron2 | bwd | 19920 | 12515 | `mselossfp16 -I 32x80 -r sum -Z 0_bwd` | 1.59168997203356 |
| MSELoss | float16 | [25 300]-mean | noncontiguous | random-md | bwd | 16448 | 10453 | `mselossfp16 -I 25x300 -r mean -Z 0_bwd` | 1.5735195637616 |
| MSELoss | float16 | [32 80]-none | noncontiguous | tacotron2 | bwd | 18128 | 12035 | `mselossfp16 -I 32x80 -r none -Z 0_bwd` | 1.50627336933943 |
| MSELoss | float16 | [25 100 100]-sum | contiguous | random-md | fwd | 23392 | 15911 | `mselossfp16 -I 25x100x100 -r sum -Z 1_contig` | 1.47017786437056 |
| MSELoss | float16 | [32 80]-mean | noncontiguous | tacotron2 | bwd | 17344 | 12125 | `mselossfp16 -I 32x80 -r mean -Z 0_bwd` | 1.43043298969072 |
| MSELoss | float16 | [10 10 50 100 100]-none | noncontiguous | random-lg | bwd | 11586974 | 8183960 | `mselossfp16 -I 10x10x50x100x100 -r none -Z 0_bwd` | 1.41581508218515 |
| MSELoss | float16 | [25 100 100]-mean | contiguous | random-md | fwd | 22784 | 16107 | `mselossfp16 -I 25x100x100 -r mean -Z 1_contig` | 1.41454026199789 |
| MSELoss | float16 | [25 100 100]-none | noncontiguous | random-md | bwd | 30769 | 22134 | `mselossfp16 -I 25x100x100 -r none -Z 0_bwd` | 1.39012379145206 |
| MSELoss | float16 | [100 20 20 20]-none | noncontiguous | random-md | bwd | 75393 | 57352 | `mselossfp16 -I 100x20x20x20 -r none -Z 0_bwd` | 1.31456618775282 |
| MSELoss | float16 | [100 10 10 10 10]-none | noncontiguous | random-md | bwd | 89410 | 68943 | `mselossfp16 -I 100x10x10x10x10 -r none -Z 0_bwd` | 1.29686842754159 |
| MSELoss | float16 | [2000 3000]-none | noncontiguous | random-lg | bwd | 463896 | 360150 | `mselossfp16 -I 2000x3000 -r none -Z 0_bwd` | 1.28806330695544 |
| MSELoss | float16 | [10 100 100 100]-none | noncontiguous | random-lg | bwd | 1869280 | 1504310 | `mselossfp16 -I 10x100x100x100 -r none -Z 0_bwd` | 1.24261621607249 |
| MSELoss | float16 | [25 100 100]-mean | noncontiguous | random-md | bwd | 26752 | 21618 | `mselossfp16 -I 25x100x100 -r mean -Z 0_bwd` | 1.23748727911925 |
| MSELoss | float16 | [1000000]-mean | contiguous | random-lg | fwd | 29456 | 24249 | `mselossfp16 -I 1000000 -r mean -Z 1_contig` | 1.21473050435069 |
bfloat16
| op_name | dtype | Size-reduction | contiguous | model | direction | ROCm kernel time | MIOpen kernel time | MIOpenDriver cmdline | ROCm/MIOpen |
|---|---|---|---|---|---|---|---|---|---|
| MSELoss | bfloat16 | [100 20 20 20]-sum | noncontiguous | random-md | fwd | 43889 | 22329 | `mselossbfp16 -I 100x20x20x20 -r sum -Z 0` | 1.96556048188454 |
| MSELoss | bfloat16 | [25 300]-sum | contiguous | random-md | fwd | 19904 | 10169 | `mselossbfp16 -I 25x300 -r sum -Z 1_contig` | 1.95732127052808 |
| MSELoss | bfloat16 | [100 10 10 10 10]-mean | noncontiguous | random-md | fwd | 48049 | 24693 | `mselossbfp16 -I 100x10x10x10x10 -r mean -Z 0` | 1.94585510063581 |
| MSELoss | bfloat16 | [10000]-mean | noncontiguous | random-md | fwd | 19664 | 10151 | `mselossbfp16 -I 10000 -r mean -Z 0` | 1.93714904935474 |
| MSELoss | bfloat16 | [100 20 20 20]-mean | noncontiguous | random-md | fwd | 42273 | 22347 | `mselossbfp16 -I 100x20x20x20 -r mean -Z 0` | 1.89166331051148 |
| MSELoss | bfloat16 | [25 300]-none | noncontiguous | random-md | bwd | 19216 | 10293 | `mselossbfp16 -I 25x300 -r none -Z 0_bwd` | 1.86689983483921 |
| MSELoss | bfloat16 | [25 100 100]-sum | noncontiguous | random-md | fwd | 29441 | 16302 | `mselossbfp16 -I 25x100x100 -r sum -Z 0` | 1.80597472702736 |
| MSELoss | bfloat16 | [25 100 100]-mean | noncontiguous | random-md | fwd | 28848 | 16107 | `mselossbfp16 -I 25x100x100 -r mean -Z 0` | 1.79102253678525 |
| MSELoss | bfloat16 | [32 80]-mean | contiguous | tacotron2 | fwd | 17456 | 10240 | `mselossbfp16 -I 32x80 -r mean -Z 1_contig` | 1.7046875 |
| MSELoss | bfloat16 | [32 80]-sum | contiguous | tacotron2 | fwd | 17616 | 10507 | `mselossbfp16 -I 32x80 -r sum -Z 1_contig` | 1.67659655467783 |
| MSELoss | bfloat16 | [32 80]-sum | noncontiguous | tacotron2 | bwd | 21296 | 13013 | `mselossbfp16 -I 32x80 -r sum -Z 0_bwd` | 1.63651732882502 |
| MSELoss | bfloat16 | [25 100 100]-sum | contiguous | random-md | fwd | 24832 | 15982 | `mselossbfp16 -I 25x100x100 -r sum -Z 1_contig` | 1.55374796646227 |
| MSELoss | bfloat16 | [25 100 100]-mean | contiguous | random-md | fwd | 24592 | 15911 | `mselossbfp16 -I 25x100x100 -r mean -Z 1_contig` | 1.5455973854566 |
| MSELoss | bfloat16 | [10 100 100 100]-none | noncontiguous | random-lg | bwd | 1945089 | 1296260 | `mselossbfp16 -I 10x100x100x100 -r none -Z 0_bwd` | 1.50053924367025 |
| MSELoss | bfloat16 | [10 10 50 100 100]-none | noncontiguous | random-lg | bwd | 11904466 | 8223780 | `mselossbfp16 -I 10x10x50x100x100 -r none -Z 0_bwd` | 1.44756620434885 |
| MSELoss | bfloat16 | [25 300]-sum | noncontiguous | random-md | bwd | 15104 | 10595 | `mselossbfp16 -I 25x300 -r sum -Z 0_bwd` | 1.42557810287872 |
| MSELoss | bfloat16 | [25 300]-mean | noncontiguous | random-md | bwd | 14992 | 10649 | `mselossbfp16 -I 25x300 -r mean -Z 0_bwd` | 1.40783172128838 |
| MSELoss | bfloat16 | [25 100 100]-none | noncontiguous | random-md | bwd | 31489 | 22685 | `mselossbfp16 -I 25x100x100 -r none -Z 0_bwd` | 1.38809786202336 |
| MSELoss | bfloat16 | [32 80]-none | noncontiguous | tacotron2 | bwd | 17792 | 13155 | `mselossbfp16 -I 32x80 -r none -Z 0_bwd` | 1.35248954770049 |
| MSELoss | bfloat16 | [32 80]-mean | noncontiguous | tacotron2 | bwd | 16928 | 12835 | `mselossbfp16 -I 32x80 -r mean -Z 0_bwd` | 1.3188936501753 |

Average performance

  • MSELoss

| dtype | forward | backward |
|---|---|---|
| float32 | 2.52 | 0.74 |
| float16 | 2.08 | 0.66 |
| bfloat16 | 2.12 | 0.66 |

  • MSELossUnreduced

| dtype | forward | backward |
|---|---|---|
| float32 | 0.53 | 0.92 |
| float16 | 0.45 | 0.82 |
| bfloat16 | 0.49 | 0.86 |

Codepaths that do not yield sufficient performance gains have been cordoned off and made unavailable.

@o2buzzle (Collaborator, Author)

Please note: this PR shares some code (particularly warp-level reduction, tensor views, etc.) with other PRs in this group of Moreh's upstream requests. We'll consolidate them as they get closer to being merged, to avoid merge conflicts.

@o2buzzle (Collaborator, Author)

Also, is there a way we can view the runner's output for easier debugging? I guess we can just ask you for the failing test, but it might be easier and less work for you if we can see what is wrong on our own.

@o2buzzle (Collaborator, Author) commented Aug 1, 2024

Hmmm... is the Static build the same as setting -DMIOPEN_EMBED_BUILD=On? If I set that on my local build, even develop seems to fail to build. Never mind, it's something else.

@o2buzzle (Collaborator, Author) commented Aug 2, 2024

The Windows build is not passing, but that is to be expected (please check #2970; previous conversations suggest it was the cause).

@o2buzzle o2buzzle marked this pull request as ready for review August 2, 2024 02:40
@CAHEK7 (Contributor) commented Aug 9, 2024

This algorithm is very similar to #3143; could you explain why you use a different indexing scheme? @BuiChiTrung, could you help too?

Also, could you remove the GPU-specific parts from the CPU implementation (more details in this comment: #3143 (comment))?
And may I ask you to align the test with the latest test design document? (https://github.com/ROCm/MIOpen/wiki/GTest-development)

@CAHEK7 CAHEK7 mentioned this pull request Aug 9, 2024
@o2buzzle (Collaborator, Author) commented Aug 9, 2024

> Also could you remove GPU specific parts from CPU implementation (more details and in this comment #3143 (comment)) And may I ask you to align the test to the latest test design document? (https://github.com/ROCm/MIOpen/wiki/GTest-development)

The greatest enemy here is reduction, particularly because floating-point computations depend on the processor and on evaluation order. You can see in our downstream conversations here why I had to write a whole section that literally mimics how parallel warp-level reduction behaves on GPUs, just to make verification work.

@CAHEK7 (Contributor) commented Aug 9, 2024

> Also could you remove GPU specific parts from CPU implementation (more details and in this comment #3143 (comment)) And may I ask you to align the test to the latest test design document? (https://github.com/ROCm/MIOpen/wiki/GTest-development)

> Greatest enemy here is reduction, particularly because floating point computations are handled differently between processors and ordering. You can check why I had to have a whole section to literally just mimic how parallel warp-level reduction on GPUs would behave in our downstream conversations here in order for verification to work.

Such a huge (really huge?) error means that the kernel does not perform the reduction in an acceptable way. The algorithm implies finding an average across some dimension; mathematically, it should just accumulate all the numbers and divide by the number of elements. If during the accumulation you get an error so large that you have to mimic GPU behavior instead of straightforward accumulation, the algorithm is not performing a mean calculation: it calculates something else, and if you alter, for example, the block size, it will calculate something else again.

There are only two ways to fix this, and both require getting rid of GPU-specific code:

  1. Increase the error tolerance, if that is acceptable. We are comparing against the mathematical meaning of the algorithm.
  2. If increasing the tolerance is not possible, the algorithm must be improved, because it does not produce a mathematically correct result within acceptable error.

@o2buzzle (Collaborator, Author) commented Aug 9, 2024

> Such huge (really huge?) error means that the kernel doesn't perform reduction in acceptable way.

Technically speaking, it's not unacceptable; it's just a side effect of doing parallel reduction in this manner (the ordering of the added floating-point values differs from straight serial addition). If you want it to "match" what a naive addition would do, the only "acceptable" way would be literally just that (pull all values back to the host and add them there). Then it has a chance of matching up.

> it calculates something else

I'm not sure this really is calculating something else. It is, for what it's worth, adding all the values together and then dividing by another value (or the reverse in this case: dividing within each thread, then adding up the partial results). Again, the issue here is what kind of floating-point-induced error we are willing to accept.

> increase error tolerance, if it is acceptable. We are comparing against mathematical meaning of the algorithm.

Probably this one, but I think it would be slightly difficult to come up with a number that both covers all cases and is not comically huge.

@CAHEK7 (Contributor) commented Aug 9, 2024

> Such huge (really huge?) error means that the kernel doesn't perform reduction in acceptable way.

> It’s not technically speaking, unacceptable, it’s just a side effect of when doing parallel reduction in this manner (causes the ordering of the added floating point values to differ from straight serial addition). If you want it to “match” what a naive addition would be doing, the only “acceptable” way would literally be just that (pull all values back to host, add them there). Then it has a chance of matching up.

I'm perfectly aware of parallel-reduction pitfalls and floating-point operations. Luckily we are not using atomics here. But again: the verification algorithm must be algorithm-agnostic and as generic as possible.

> Probably this one, but I think it would be slightly difficult to come up with a number that would both cover all cases and not be comically huge.

If there is no acceptable tolerance for some input data, it just means the algorithm is not applicable to that case, and this problem must not be hidden by a perfectly tuned verification algorithm.

We can implement another version of MSELoss, and we will use the same verification algorithm for it, because it is still MSELoss. That is also how to compare precision and stability between algorithms: against the same naive implementation, as close as possible to the mathematical meaning, probably with even higher-precision accumulators.

@long10024070 long10024070 requested a review from CAHEK7 August 17, 2024 15:29
@o2buzzle (Collaborator, Author)

I'm also working on integrating @long10024070's MIOpenReduceSum into the reduction part and removing that duplicated code. Due to some reorganization, please expect some delays on that.

@iq136boy (Contributor)

@CAHEK7, please take a look at the latest changes and comments if you have any concerns.

@o2buzzle (Collaborator, Author)

The git tree became unreadable after the last merge attempt; I think I will just squash and rebase everything. That makes final reviews easier.

@o2buzzle o2buzzle force-pushed the o2buzzle/impl_MSELoss branch from de47aa0 to b619047 Compare September 30, 2024 03:21
@o2buzzle o2buzzle force-pushed the o2buzzle/impl_MSELoss branch from b619047 to 8a1f508 Compare September 30, 2024 06:54
@iq136boy (Contributor) left a comment


CI test failed log:

[2024-10-02T03:20:25.464Z] WARNING:GTest name checker:Name: MSELossTest/MSELossTestFloat. Mismatch types: Prefix Hw Datatype

[2024-10-02T03:20:25.464Z] WARNING:GTest name checker:Name: MSELossTest/MSELossTestHalf. Mismatch types: Prefix Hw Datatype

[2024-10-02T03:20:25.464Z] WARNING:GTest name checker:Name: MSELossTest/MSELossTestBfloat16. Mismatch types: Prefix Hw Datatype

[2024-10-02T03:20:25.464Z] CRITICAL:GTest name checker:Tests do not match to the test naming scheme (see https://github.com/ROCm/MIOpen/wiki/GTest-development#naming )

@long10024070 (Collaborator) left a comment


Please update the reduction part in your code, following the implementation in #3166.

@long10024070 long10024070 marked this pull request as draft November 4, 2024 13:02
@long10024070 long10024070 changed the title Implement MSELoss loss function Implement MSELoss Function Feb 16, 2025
@long10024070 long10024070 marked this pull request as ready for review February 17, 2025 09:53
4 participants