remove redundant memory traffic #7100

FindHao · 2020-12-10T00:51:45Z

In the make_convolutional_layer function, l.output is zero-initialized by xcalloc. It is then copied to l.output_gpu and, conditionally, to l.x_gpu and l.x_norm_gpu.

We can use cuda_memset to zero these three arrays on the GPU side instead of copying all zeros over from the CPU side. This optimization removes a lot of memory traffic. In my simple test on dog.jpg, it saves about 20% of the memory copy traffic.
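A minimal sketch of the two patterns being compared, written directly against the CUDA runtime API. The helper names (`upload_zeros_by_copy`, `make_zeroed_on_device`, `device_buffer_is_zero`) are hypothetical and are not darknet functions; the actual change in this PR may be structured differently.

```cuda
#include <cuda_runtime.h>
#include <stdlib.h>

/* Hypothetical helper (not darknet's API): the pattern this PR removes.
   The host buffer is already all zeros, yet all n floats are still
   transferred from host memory to device memory. */
static float *upload_zeros_by_copy(const float *host_zeros, size_t n)
{
    float *dev = NULL;
    if (cudaMalloc((void **)&dev, n * sizeof(float)) != cudaSuccess) return NULL;
    if (cudaMemcpy(dev, host_zeros, n * sizeof(float),
                   cudaMemcpyHostToDevice) != cudaSuccess) {
        cudaFree(dev);
        return NULL;
    }
    return dev;
}

/* Hypothetical helper: the proposed pattern. The buffer is zeroed on the
   device, so no host-to-device transfer is needed. An all-zero byte
   pattern is 0.0f for IEEE-754 floats, so a byte-wise memset is enough. */
static float *make_zeroed_on_device(size_t n)
{
    float *dev = NULL;
    if (cudaMalloc((void **)&dev, n * sizeof(float)) != cudaSuccess) return NULL;
    if (cudaMemset(dev, 0, n * sizeof(float)) != cudaSuccess) {
        cudaFree(dev);
        return NULL;
    }
    return dev;
}

/* Copies the device buffer back and returns 1 if every element is 0.0f,
   0 otherwise (including on allocation or copy failure). */
static int device_buffer_is_zero(const float *dev, size_t n)
{
    float *host = (float *)malloc(n * sizeof(float));
    if (!host) return 0;
    int ok = cudaMemcpy(host, dev, n * sizeof(float),
                        cudaMemcpyDeviceToHost) == cudaSuccess;
    for (size_t i = 0; ok && i < n; ++i) ok = (host[i] == 0.0f);
    free(host);
    return ok;
}
```

Both helpers leave the device buffer in the same state; the difference is only whether `n * sizeof(float)` bytes cross the host-to-device link first.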

FindHao · 2020-12-10T01:09:06Z

l.output is never modified between its initialization and these copies, so the data being copied is always all zeros.

FindHao · 2020-12-14T21:57:01Z

I found more examples of the same issue and switched cudaMemset to its async version.
On the simple test with data/dog.jpg, this gives a 1.02x speedup on an RTX 2080 Ti. The speedup is a free lunch, with no loss in accuracy.
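As a rough illustration of the async variant, the sketch below zeroes several device buffers with `cudaMemsetAsync` on a caller-supplied stream. The function name `zero_buffers_async` and the way the stream is obtained are assumptions made for this example and are not taken from the PR.

```cuda
#include <cuda_runtime.h>
#include <stddef.h>

/* Hypothetical helper (not darknet's API): enqueue a zero fill for each
   non-NULL buffer on `stream`. Optional buffers can be passed as NULL
   and are skipped. Returns the first CUDA error encountered, or
   cudaSuccess if every fill was enqueued. */
static cudaError_t zero_buffers_async(float *const *bufs, int count,
                                      size_t n, cudaStream_t stream)
{
    for (int i = 0; i < count; ++i) {
        if (!bufs[i]) continue;
        cudaError_t err = cudaMemsetAsync(bufs[i], 0, n * sizeof(float), stream);
        if (err != cudaSuccess) return err;
    }
    return cudaSuccess;
}
```

Work submitted later to the same stream is ordered after these fills. Host code that reads the buffers back should first call `cudaStreamSynchronize(stream)`, or perform the copy on that same stream.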

remove redundant memory traffic

fbb0cf3

remove more unnecessary memory copy and change memset to async

95fc755

cenit added 3 commits September 1, 2023 01:47

Merge branch 'master' into pr/7100

c028640

restore build compatibility

a589476

Merge branch 'master' into pr/7100

aa19871
