[Bug] Hang when using ProxyChan and one GPU is sending zero bytes. #432

Closed
FC-Li opened this issue Dec 30, 2024 · 3 comments

FC-Li commented Dec 30, 2024

The GPU that sends zero bytes pushes a ProxyTrigger whose fst field is zero.

The proxy's loop in proxy.cc skips this trigger instead of handling it:

      trigger = fifo.poll();
      if (trigger.fst == 0 || trigger.snd == 0) {  // TODO: this check is a potential pitfall for custom triggers
        continue;                                  // there is one in progress
      }
      trigger.snd ^= ((uint64_t)1 << (uint64_t)63);  // this is where the last bit of snd is reverted.

      ProxyHandlerResult result = handler(trigger);  // SKIPPED!!!!!!!!!!!!!!!!!!!

When the trigger is skipped, this GPU's counterpart is never signaled, because Host2DeviceSemaphore::signal -> IBConnection::updateAndSync is never called. The counterpart then hangs in Host2DeviceSemaphoreDeviceHandle's wait:

  MSCCLPP_DEVICE_INLINE void wait(int64_t maxSpinCount = 100000000) {
    (*expectedInboundSemaphoreId) += 1;
    POLL_MAYBE_JAILBREAK((atomicLoad(inboundSemaphoreId, memoryOrderAcquire) < (*expectedInboundSemaphoreId)),
                         maxSpinCount);
  }
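
The chain of events above can be modeled on the host. The following is a minimal sketch, not MSCCL++ code: `TriggerModel`, `makePutWithSignal`, `proxyWouldHandle`, `PeerSemaphoreModel`, and `sendAndWait` are hypothetical names, and the assumption that `fst` carries only the byte count is a simplification made for illustration. It only shows that a trigger rejected by the `fst == 0` check never advances the peer's semaphore, so a bounded wait on the peer side times out.

```cpp
#include <cassert>
#include <cstdint>

// Simplified stand-in for a proxy trigger: two 64-bit words, as in the issue.
struct TriggerModel {
  uint64_t fst;
  uint64_t snd;
};

// ASSUMPTION: fst holds just the byte count, and snd holds some nonzero
// request descriptor. The real bit layout is not reproduced here.
TriggerModel makePutWithSignal(uint64_t sendBytes) {
  return TriggerModel{sendBytes, /*snd=*/1};
}

// Mirrors the check quoted from proxy.cc: a zero word means "skip".
bool proxyWouldHandle(const TriggerModel& t) {
  return !(t.fst == 0 || t.snd == 0);
}

// Single-threaded model of the peer-side semaphore counters.
struct PeerSemaphoreModel {
  uint64_t inbound = 0;
  uint64_t expected = 0;

  // Stands in for the proxy signaling the peer after handling a trigger.
  void signalFromProxy() { ++inbound; }

  // Bounded spin; returns false on timeout instead of hanging.
  bool wait(int64_t maxSpinCount) {
    ++expected;
    for (int64_t i = 0; i < maxSpinCount; ++i) {
      if (inbound >= expected) return true;
    }
    return false;
  }
};

// Returns true if the peer's wait completes, false if it would hang.
bool sendAndWait(uint64_t sendBytes) {
  PeerSemaphoreModel peer;
  TriggerModel t = makePutWithSignal(sendBytes);
  if (proxyWouldHandle(t)) {
    peer.signalFromProxy();  // handler ran, so the peer gets its signal
  }
  return peer.wait(1000);
}
```

In this model, `sendAndWait(16)` completes while `sendAndWait(0)` times out, which corresponds to the hang reported here.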

FC-Li commented Dec 31, 2024

@Binyang2014 Is this a known issue?


chhwang commented Jan 2, 2025

@FC-Li This is expected: we don't define the behavior of put-ing zero bytes (which makes fst == 0). We may need to handle this in a better way, but do you have any use case for put-ing zero bytes, which is a no-op by definition?


FC-Li commented Jan 4, 2025

@chhwang It's just a corner case.

I worked around it by sending only a signal when there are no bytes to put:

if (send_bytes > 0) {
    proxyChan.putWithSignal(....);
} else {
    proxyChan.signal(...);
}
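
The same branch can be factored into a small helper so that every call site gets the zero-byte guard. This is only a sketch: `putOrSignal` and `MockChan` are hypothetical names, and the `putWithSignal(dstOffset, srcOffset, size)` and `signal()` signatures on the mock are assumptions made so the example compiles on the host. Inside a kernel, the helper would also need the appropriate device qualifiers.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: Chan is any type exposing
// putWithSignal(dstOffset, srcOffset, size) and signal().
template <typename Chan>
void putOrSignal(Chan& chan, uint64_t dstOffset, uint64_t srcOffset,
                 uint64_t sendBytes) {
  if (sendBytes > 0) {
    // Normal path: transfer the data, then signal the peer.
    chan.putWithSignal(dstOffset, srcOffset, sendBytes);
  } else {
    // Zero-byte path: skip the put but still signal, so the peer's
    // wait() is released.
    chan.signal();
  }
}

// Host-side mock that records which call was made (illustration only).
struct MockChan {
  int puts = 0;
  int signals = 0;
  uint64_t lastSize = 0;

  void putWithSignal(uint64_t /*dstOffset*/, uint64_t /*srcOffset*/,
                     uint64_t size) {
    ++puts;
    lastSize = size;
  }
  void signal() { ++signals; }
};
```

With the mock, a call with `sendBytes == 0` increments only `signals`, and a call with a positive size increments only `puts`, so the peer is signaled exactly once per call in both cases.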

FC-Li closed this as completed Jan 16, 2025