
support gpu send/recv thunk #1

Merged: 1 commit, Dec 4, 2023
Conversation


@wbmc wbmc commented Dec 4, 2023

No description provided.

@wbmc wbmc merged commit 5b044b1 into main Dec 4, 2023
3 checks passed
wbmc pushed a commit that referenced this pull request Dec 5, 2023
Imported from GitHub PR openxla#6599

FP8 cublasLt matmul uses fast accumulation when both operands' precisions are DEFAULT; otherwise it falls back to high-precision accumulation. Issue openxla#6168

This PR is closely related to Flax PR [3416](google/flax#3416).
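A minimal sketch of the cublasLt descriptor attribute this change toggles (not XLA's actual GemmConfig/BlasLt plumbing; the helper name and error handling are simplified for illustration): CUBLASLT_MATMUL_DESC_FAST_ACCUM is requested only when both FP8 operands use DEFAULT precision, otherwise accumulation stays in high precision.

```cpp
#include <cublasLt.h>

// Hypothetical helper illustrating the precision check described above; the
// real code builds and owns the matmul descriptor elsewhere.
cublasLtMatmulDesc_t MakeFp8MatmulDesc(bool lhs_default_precision,
                                       bool rhs_default_precision) {
  cublasLtMatmulDesc_t desc = nullptr;
  // FP8 GEMMs accumulate in FP32; the scale type is FP32 as well.
  cublasLtMatmulDescCreate(&desc, CUBLAS_COMPUTE_32F, CUDA_R_32F);

  // Fast accumulation trades accuracy for speed, so it is requested only when
  // neither operand asked for higher-than-DEFAULT precision.
  const int8_t fast_accum =
      (lhs_default_precision && rhs_default_precision) ? 1 : 0;
  cublasLtMatmulDescSetAttribute(desc, CUBLASLT_MATMUL_DESC_FAST_ACCUM,
                                 &fast_accum, sizeof(fast_accum));
  return desc;
}
```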
Copybara import of the project:

--
a4140da by shuw <[email protected]>:

Add FP8 fast accumulation support for cublasLt.

--
9684568 by shuw <[email protected]>:

Improve based on review #1

--
e906d76 by shuw <[email protected]>:

Improve based on review #2

Merging this change closes openxla#6599

COPYBARA_INTEGRATE_REVIEW=openxla#6599 from wenscarl:fp8_fast_accumulation e906d76
PiperOrigin-RevId: 578948593
mars1248 pushed a commit to mars1248/xla that referenced this pull request Dec 22, 2023
Imported from GitHub PR openxla#7751

Due to fast accumulation being turned on in the forward mode, the cublasLt FP8 gemm with gelu epilogue can run as a single fused kernel. Compared against the XLA-generated gelu kernel on H100, performance shows some improvement for a [8192, 4096] x [4096, 16384] matmul + gelu:

Execution time for matmul using cublasLt and gelu (XLA): 1.28ms
Execution time for matmul_gelu using cublasLt: 1.25ms
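A minimal sketch of the cublasLt descriptor configuration behind the fused path (not the PR's actual runtime code; error handling omitted and the helper name is made up): forward-mode fast accumulation plus the approximate-GELU epilogue, so gelu(A * B) runs as one cublasLt kernel instead of a matmul followed by a separate XLA gelu kernel.

```cpp
#include <cublasLt.h>

// Hypothetical helper showing the two descriptor attributes involved in the
// fused FP8 matmul + GELU path.
cublasLtMatmulDesc_t MakeFp8MatmulGeluDesc() {
  cublasLtMatmulDesc_t desc = nullptr;
  cublasLtMatmulDescCreate(&desc, CUBLAS_COMPUTE_32F, CUDA_R_32F);

  // Forward-mode fast accumulation (both operands at DEFAULT precision).
  const int8_t fast_accum = 1;
  cublasLtMatmulDescSetAttribute(desc, CUBLASLT_MATMUL_DESC_FAST_ACCUM,
                                 &fast_accum, sizeof(fast_accum));

  // Fuse the approximate (tanh) GELU into the matmul epilogue.
  cublasLtEpilogue_t epilogue = CUBLASLT_EPILOGUE_GELU;
  cublasLtMatmulDescSetAttribute(desc, CUBLASLT_MATMUL_DESC_EPILOGUE,
                                 &epilogue, sizeof(epilogue));
  return desc;
}
```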
Copybara import of the project:

--
e8abce3 by Shu Wang <[email protected]>:

Support cublasLt Fp8 Approx Gelu epilogue fusion.

--
818127c by shuw <[email protected]>:

Remove F32 check

--
5ce3108 by shuw <[email protected]>:

Improve based on review intelligent-machine-learning#1

Merging this change closes openxla#7751

COPYBARA_INTEGRATE_REVIEW=openxla#7751 from wenscarl:cublaslt_fp8_gelu 5ce3108
PiperOrigin-RevId: 591236441
ApsarasX pushed a commit that referenced this pull request Mar 28, 2024
…execution scope

Instead of always constructing the while operation conditional in the default scope, use the scope of the while operation itself.

This generates correct CUDA graph: https://gist.github.com/ezhulenev/a84192fe8b46a4bf1a934a8baa08ea60

A memset operation launched in scope #1 is not synchronized with the initial condition handle update.
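A minimal host-side sketch, using CUDA 12.3+ conditional graph nodes rather than XLA's actual command-buffer code (the predicate buffer and node layout are made up for illustration), of the kind of dependency at stake: the memset that initializes the loop predicate must be ordered before the while-conditional node, which is what building the conditional in the while operation's own execution scope is meant to guarantee.

```cpp
#include <cuda_runtime.h>

int main() {
  cudaGraph_t graph;
  cudaGraphCreate(&graph, 0);

  // Handle the loop body later updates on device via cudaGraphSetConditional().
  cudaGraphConditionalHandle handle;
  cudaGraphConditionalHandleCreate(&handle, graph, /*defaultLaunchValue=*/1,
                                   cudaGraphCondAssignDefault);

  // Memset node standing in for the initial condition update.
  int* pred = nullptr;
  cudaMalloc(&pred, sizeof(int));
  cudaMemsetParams ms = {};
  ms.dst = pred;
  ms.elementSize = sizeof(int);
  ms.width = 1;
  ms.height = 1;
  ms.value = 1;
  cudaGraphNode_t init;
  cudaGraphAddMemsetNode(&init, graph, nullptr, 0, &ms);

  // The while-conditional node depends on `init`; constructing it in a
  // different execution scope would drop this edge and leave the first
  // iteration racing against the predicate initialization.
  cudaGraphNodeParams cp = {};
  cp.type = cudaGraphNodeTypeConditional;
  cp.conditional.handle = handle;
  cp.conditional.type = cudaGraphCondTypeWhile;
  cp.conditional.size = 1;
  cudaGraphNode_t while_node;
  cudaGraphAddNode(&while_node, graph, &init, 1, &cp);

  // cp.conditional.phGraph_out[0] is the body graph; the body would launch a
  // kernel that recomputes the predicate and calls
  // cudaGraphSetConditional(handle, value).
  cudaGraphDestroy(graph);
  cudaFree(pred);
  return 0;
}
```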

PiperOrigin-RevId: 609742672