Questions: Clarifying the use of FP8 for Training #99
Hi @jon-chuang, I am sorry for the late reply.

1. Performance
Yes. Our first step is to apply the FP8 format as much as possible to reduce the memory footprint while maintaining accuracy; the second step is to optimize performance in MS-AMP. MS-AMP can be combined with TransformerEngine to invoke optimized operators in TE. (Related PR: #98)

2. Weight Update
a) It is applied after the entire backward pass is complete. The FP8 weights are updated in the optimizer (https://github.com/Azure/MS-AMP/blob/main/msamp/optim/adamw.py#L193).
b) Good idea. I had tried using an additional CUDA stream for the weight update, but it did not achieve the desired acceleration, probably because my implementation was not optimal :) However, I still believe it is effective to schedule weight updates concurrently, since the weight update does not affect the backpropagation computation. It is possible to update multiple FP8 weights in a single CUDA kernel, but note that an FP8 tensor with a scaling factor must be treated as a whole: the maximum absolute value of the entire tensor has to be computed before quantizing a high-precision tensor to an FP8 tensor.
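To illustrate that last point, here is a minimal sketch of per-tensor FP8 quantization; it is not MS-AMP's actual kernel, and it assumes a PyTorch build that ships the `torch.float8_e4m3fn` dtype. The amax is reduced over the whole tensor first, and only then is every element scaled and cast.

```python
import torch

def quantize_to_fp8(hp_tensor: torch.Tensor, fp8_max: float = 448.0):
    """Illustrative per-tensor FP8 (e4m3) quantization with a single scaling factor."""
    # The whole tensor shares one scale, so amax must be computed over
    # the entire tensor before any element is converted.
    amax = hp_tensor.abs().max()
    scale = fp8_max / torch.clamp(amax, min=1e-12)
    fp8_data = (hp_tensor * scale).to(torch.float8_e4m3fn)
    return fp8_data, scale

def dequantize_from_fp8(fp8_data: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    # Recover an approximation of the original high-precision values.
    return fp8_data.to(torch.float32) / scale
```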
3. More Accurate Scaling Factors

4. Adaptive Precision
No. This approach requires reserving enough memory in earlier epochs to store the high-precision weights needed in later stages, which may not be as efficient as using high-precision weights with low-bit computation.
Closing this issue since there has been no activity for more than 9 months.
@tocean @wkcn
In line with the investigation in NVIDIA/TransformerEngine#424, it would be great to get insights from the team at Microsoft on using FP8 in aspects of training besides matmuls.
Questions
1. Performance
The repo only mentions training accuracy and memory savings. However, the kernels may not be very optimized, and the majority is implemented in Torch. I guess that performance is still unexplored.
2. Weight Update
a) Is the FP8 weight update applied after the entire backward pass is complete, or on-the-fly during the backward pass?
b) Could weight updates be scheduled in batches, e.g. by accumulating outstanding_weight_updates_bytes until it exceeds a threshold?
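A hypothetical sketch of the batching idea in (b); the class, the `outstanding_weight_updates_bytes` counter, and the 64 MB threshold are illustrative names and values, not anything that exists in MS-AMP.

```python
import torch

class DeferredWeightUpdater:
    """Hypothetical scheduler: buffer weight updates during backprop and
    flush them in one batch once enough bytes have accumulated."""

    def __init__(self, threshold_bytes: int = 64 * 1024 * 1024):
        self.threshold_bytes = threshold_bytes
        self.outstanding_weight_updates_bytes = 0
        self.pending = []  # (param, grad, lr) triples waiting to be applied

    def enqueue(self, param: torch.Tensor, grad: torch.Tensor, lr: float) -> None:
        self.pending.append((param, grad, lr))
        self.outstanding_weight_updates_bytes += grad.numel() * grad.element_size()
        if self.outstanding_weight_updates_bytes >= self.threshold_bytes:
            self.flush()

    def flush(self) -> None:
        # A real implementation would launch one fused CUDA kernel here
        # (and re-quantize the FP8 weights); a plain loop keeps the sketch simple.
        for param, grad, lr in self.pending:
            param.data.add_(grad, alpha=-lr)
        self.pending.clear()
        self.outstanding_weight_updates_bytes = 0
```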
3. More Accurate Scaling Factors
Is there a way to maintain a more accurate amax by estimating:
- scaling_factor_weights_t = amax_weights_t-1 + amax_grad_t - this is an accurate upper bound (no need for a-priori knowledge)
- amax_weights_t = max(abs(weights_t)) - this is only used for the next iteration
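A minimal sketch of that estimate, assuming the per-step weight change is bounded by the gradient amax; the function names are illustrative.

```python
import torch

def estimate_weight_scale(amax_weights_prev: float, amax_grad_t: float,
                          fp8_max: float = 448.0) -> float:
    """Scaling factor from the proposed bound amax_weights_{t-1} + amax_grad_t.

    |w_t| <= |w_{t-1}| + |delta_w_t|, and if |delta_w_t| <= amax_grad_t the sum
    is a valid upper bound known *before* the updated weights are materialized.
    """
    amax_bound = amax_weights_prev + amax_grad_t
    return fp8_max / amax_bound

def exact_amax(weights_t: torch.Tensor) -> float:
    # The exact amax is only needed for the *next* iteration's estimate.
    return weights_t.abs().max().item()
```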
e5m2 might be able to help with the dynamic range for v (same dynamic range as FP16). Storing sqrt_v rather than v may help the precision; update rule: see appendix (the square root halves the exponent range: 2^16 -> 2^8, 2^-16 -> 2^-8). Hence we perform the sqrt in fp32/fp16 and quantize that as fp8, thus preserving the dynamic range.
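A sketch of storing the second moment as sqrt_v in e5m2 under that scheme; the helper names are illustrative, and `torch.float8_e5m2` is assumed to be available in the PyTorch build.

```python
import torch

E5M2_MAX = 57344.0  # largest finite e5m2 value

def quantize_sqrt_v(v: torch.Tensor):
    """Take sqrt in high precision, then quantize sqrt(v) to FP8 e5m2."""
    sqrt_v = v.float().sqrt()                        # halves the exponent range
    amax = sqrt_v.max()                              # v >= 0, so no abs needed
    scale = E5M2_MAX / torch.clamp(amax, min=1e-12)
    return (sqrt_v * scale).to(torch.float8_e5m2), scale

def dequantize_v(sqrt_v_fp8: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    sqrt_v = sqrt_v_fp8.to(torch.float32) / scale
    return sqrt_v * sqrt_v                           # square to recover v
```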
With a more accurate scaling_factor_weights_t, it may be possible to use more of the dynamic range. Hence, storing the weights as FP8 (rather than FP16 as in the MS-AMP repo) might be possible, since amax on a per-batch basis is more bounded.

4. Adaptive Precision
Has it been explored to use lower precision (FP8) at high learning rates (in earlier epochs) and higher precision (e.g. FP32, FP16) at lower learning rates (in later epochs)?
Appendix
Update Rule for sqrt_v_fp8
Notes:
- If amax_sqrt_v_fp8 = 448.0, then the scaling factor is 1. This is captured in margin bits: MS-AMP/msamp/common/tensor/meta.py, line 39 (commit aed29d6).
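For reference, a rough sketch of how a margin-based, power-of-two scaling factor is commonly derived; this is illustrative only, and the authoritative formula is the one at the referenced line of msamp/common/tensor/meta.py.

```python
import math

def compute_scaling_factor(amax: float, fp8_max: float = 448.0, margin: int = 0) -> float:
    """Power-of-two scale chosen so amax * scale stays within fp8_max,
    backed off by `margin` extra bits of headroom."""
    exp = math.floor(math.log2(fp8_max / amax)) - margin
    return 2.0 ** exp

# With amax == 448.0 and margin == 0: log2(448/448) == 0, so the scale is 1,
# matching the note above.
```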