Commit 0d8fb83

add modulo cpx2 requirement
Signed-off-by: Charlene Yang <[email protected]>
1 parent bbccfe3 commit 0d8fb83

File tree

1 file changed (+6 / -4 lines)

transformer_engine/pytorch/attention/dot_product_attention/context_parallel.py

Lines changed: 6 additions & 4 deletions
@@ -3490,10 +3490,12 @@ def attn_forward_func_with_cp(
 
     Context parallelism distributes chunks of the sequence onto different GPUs. To help with
     load balancing, users are expected to reorder their tokens before entering this function.
-    For example, given cp_size = 2, we divide each sequence into 4 chunks, and distribute chunk 0
-    and chunk 3 onto GPU 0, and chunk 1 and chunk 2 onto GPU 1. If all transformer layers use
-    the same context parallel configuration, this reordering can happen only once, i.e. before
-    the first layer. An example of the reordering is in Megatron-LM (please see `get_batch_on_this_cp_rank
+    For example, given cp_size = 2, we divide each sequence in a batch into 4 chunks, and
+    distribute chunk 0 and chunk 3 onto GPU 0, and chunk 1 and chunk 2 onto GPU 1. This requires
+    sequence lengths to be divisible by (cp_size * 2), and if not, sequences need to be padded to
+    meet this requirement. When all transformer layers use the same context parallelism configuration,
+    token reordering can happen in the dataloader, i.e. only once for all the layers. An example of
+    the reordering is in Megatron-LM (see `get_batch_on_this_cp_rank
     <https://github.com/NVIDIA/Megatron-LM/blob/d6eb60b5ea1efca47401c0be97f456fbe3a55bcd/megatron/core/utils.py#L1725>`_).
 
     """

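To make the reordering described in the updated docstring concrete, here is a minimal, hypothetical Python sketch. The helper name reorder_for_cp and the [batch, seq_len] token layout are assumptions for illustration, not Transformer Engine or Megatron-LM APIs; the reference implementation is the get_batch_on_this_cp_rank function linked in the diff. Each sequence is split into 2 * cp_size chunks, and CP rank r keeps chunk r together with its mirror chunk 2 * cp_size - 1 - r, which is why sequence lengths must be divisible by cp_size * 2.

import torch

def reorder_for_cp(tokens: torch.Tensor, cp_size: int, cp_rank: int) -> torch.Tensor:
    """Hypothetical sketch of load-balanced token selection for one CP rank.

    tokens: [batch, seq_len] token ids; seq_len must be divisible by cp_size * 2
    (pad sequences beforehand if it is not).
    """
    batch, seq_len = tokens.shape
    assert seq_len % (cp_size * 2) == 0, "pad sequences to a multiple of cp_size * 2"
    # Split every sequence into 2 * cp_size chunks along the sequence dimension.
    chunks = tokens.view(batch, 2 * cp_size, seq_len // (2 * cp_size))
    # Rank r keeps chunk r and its mirror chunk; e.g. for cp_size = 2,
    # rank 0 gets chunks 0 and 3, rank 1 gets chunks 1 and 2.
    index = torch.tensor([cp_rank, 2 * cp_size - 1 - cp_rank], device=tokens.device)
    return chunks.index_select(1, index).reshape(batch, -1)

# Example: cp_size = 2, seq_len = 8 -> rank 0 keeps tokens [0, 1, 6, 7],
# rank 1 keeps tokens [2, 3, 4, 5].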
0 commit comments
