Metadata requests for a topic with many partitions #3916
Unanswered
travisdowns asked this question in Q&A
Consider a scenario where 50k distinct consumer processes using librdkafka consume from a single topic with 50k partitions, using a consumer group. Since there is a 1:1 ratio of consumers to partitions, each consumer ultimately consumes from ~1 partition (with some small variation as consumers die or rejoin the group).
However, empirically, the metadata requests ask about all partitions in the topic, so that's 50k partitions * 50k clients = 2.5 billion partitions' worth of metadata sent every refresh interval (ignoring additional refreshes triggered by events such as a partition leader change). If each partition takes ~100 bytes (a reasonable value, empirically, with 3 replicas per partition), that's 250 GB of traffic every 300 s (the default refresh interval), or ~6.7 Gbps of constant load just from the periodic metadata refreshes.
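
A quick back-of-envelope check of those figures (the inputs are just the values stated above, not measurements):

```c
/* Sanity check of the traffic estimate in the question above.
 * All inputs are the assumed values from the text, not measured data. */
#include <stdio.h>

int main(void) {
    double consumers           = 50e3;   /* consumer processes                          */
    double partitions          = 50e3;   /* partitions in the topic                     */
    double bytes_per_partition = 100.0;  /* ~observed per-partition metadata, 3 replicas */
    double refresh_interval_s  = 300.0;  /* default metadata refresh interval            */

    double partition_entries = consumers * partitions;                  /* 2.5e9   */
    double bytes_per_refresh = partition_entries * bytes_per_partition; /* 250 GB  */
    double gbps = bytes_per_refresh * 8.0 / refresh_interval_s / 1e9;   /* ~6.7    */

    printf("%.1f billion partition entries per refresh\n", partition_entries / 1e9);
    printf("%.0f GB per refresh, ~%.1f Gbps sustained\n",
           bytes_per_refresh / 1e9, gbps);
    return 0;
}
```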
Is there any way around this? The Metadata API does not seem to offer any way to ask about only a subset of the partitions in a topic: you may provide a list of topics, but not a list of partitions within those topics. So each client retrieves metadata for all 50k partitions even though it only cares about one.
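
To illustrate the API surface, here is a minimal librdkafka sketch (placeholder broker address and topic name). `rd_kafka_metadata()` can narrow the request to one topic via `only_rkt`, but there is no parameter to name individual partitions, so the reply still describes every partition of that topic:

```c
/* Sketch: the finest scoping librdkafka's metadata call offers is per-topic.
 * "localhost:9092" and "big-topic" are placeholders. */
#include <stdio.h>
#include <librdkafka/rdkafka.h>

int main(void) {
    char errstr[512];
    rd_kafka_conf_t *conf = rd_kafka_conf_new();
    if (rd_kafka_conf_set(conf, "bootstrap.servers", "localhost:9092",
                          errstr, sizeof(errstr)) != RD_KAFKA_CONF_OK) {
        fprintf(stderr, "%s\n", errstr);
        return 1;
    }

    rd_kafka_t *rk = rd_kafka_new(RD_KAFKA_CONSUMER, conf, errstr, sizeof(errstr));
    if (!rk) {
        fprintf(stderr, "%s\n", errstr);
        return 1;
    }

    /* Restrict the request to a single topic: this is the narrowest scope
     * the Metadata API allows. There is no way to ask for one partition. */
    rd_kafka_topic_t *rkt = rd_kafka_topic_new(rk, "big-topic", NULL);
    const struct rd_kafka_metadata *md = NULL;
    rd_kafka_resp_err_t err =
        rd_kafka_metadata(rk, 0 /* all_topics=false */, rkt, &md, 5000);

    if (!err && md->topic_cnt == 1) {
        /* Even though this consumer owns one partition, the response
         * carries metadata for every partition of the topic. */
        printf("topic %s: %d partitions returned\n",
               md->topics[0].topic, md->topics[0].partition_cnt);
    }

    if (md) rd_kafka_metadata_destroy(md);
    rd_kafka_topic_destroy(rkt);
    rd_kafka_destroy(rk);
    return 0;
}
```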
The easy answer is "don't do that [50k partitions in one topic]", but those of us building infrastructure aren't always in a position to choose what users do.