❓ The question
I've recently extended the pretraining of OLMO on data in a different language for an additional 30K steps. After converting the checkpoint to Hugging Face format, I observed that the model was generating gibberish.
After an extensive debugging session, I identified the root cause: the issue arises when splitting the `attn_proj` linear layer into three separate `q`, `k`, and `v` layers. Specifically, the problem stems from floating-point non-determinism in cuBLAS matrix multiplication: the fused and split projections are GEMMs of different shapes, so cuBLAS may select different kernels with different accumulation orders. The operation `hidden_states @ attn_proj` followed by a torch split yields results that differ from the individual operations `hidden_states @ q`, `hidden_states @ k`, and `hidden_states @ v` starting around the 4th decimal digit.
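For reference, here is a minimal sketch of the comparison described above. The dimensions and names (`hidden_size`, `attn_proj`) are hypothetical stand-ins for illustration, not OLMO's actual configuration:

```python
import torch

torch.manual_seed(0)

# Hypothetical dimensions for illustration only.
hidden_size = 4096
batch, seq = 1, 8

device = "cuda" if torch.cuda.is_available() else "cpu"
dtype = torch.float16 if device == "cuda" else torch.float32

hidden_states = torch.randn(batch, seq, hidden_size, device=device, dtype=dtype)

# Fused projection: one weight matrix producing q, k, v concatenated.
attn_proj = torch.nn.Linear(hidden_size, 3 * hidden_size, bias=False,
                            device=device, dtype=dtype)

# Split the fused weight into three separate projections, as the
# checkpoint conversion does. Linear weights are (out_features, in_features),
# so we split along dim 0.
q_w, k_w, v_w = attn_proj.weight.split(hidden_size, dim=0)

# Path 1: fused matmul, then split the output.
fused = attn_proj(hidden_states)
q_fused, k_fused, v_fused = fused.split(hidden_size, dim=-1)

# Path 2: three separate matmuls against the split weights.
q_split = hidden_states @ q_w.T
k_split = hidden_states @ k_w.T
v_split = hidden_states @ v_w.T

# On GPU the two paths are GEMMs of different shapes, so cuBLAS may pick
# different kernels with different accumulation orders; the outputs then
# agree only approximately (exactly or near-exactly on CPU float32).
print((q_fused - q_split).abs().max())
print((k_fused - k_split).abs().max())
print((v_fused - v_split).abs().max())
```

In half precision the maximum difference between the two paths can land around the magnitude described above, which is enough to change sampled tokens even though each individual matmul is numerically "correct".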