Loading video...

Video Failed to Load

Go Home

In Kafka, we have topics. Producers send data to a topic, and consumers pull data from it. Each topic has multiple partitions. You might have heard your team lead say that we can simply increase the partition count to make our cluster more scalable. There is no ordering guarantee...

16,942 views • 4 months ago •via X (Twitter)

0 Comments

No comments available

Comments from the original post will appear here

Related Videos

Something amazing is coming to Apache Kafka… Consumer Groups v2! If you’ve ever used consumer groups in production at any non-trivial scale, you probably know all the problems with it: - ⛔️ Group-wide synchronization barrier acts as a cap on scalability A single misbehaving consumer can disturb the whole group. Even if you have cooperative rebalancing and static membership enabled, you will still have rebalances happen. 🤷‍♂️ It’s a fact of life I've heard - death, taxes & consumer group rebalances. And the problem is that even with cooperative rebalancing (which helped a lot!), you’re bound on waiting for the slowest member of the group to complete the rebalance(s). 🐌 The problem is that no consumer can commit offsets while a rebalance is in progress. ❌ Another subtle thing is that with cooperative rebalancing, a rebalance will take longer than usual. Why? Because consumers are allowed to process partitions during rebalances. They will call the poll() method more infrequently -- they're busy processing records after all. Thus, the overall rebalance time will increase. This makes it pretty hard to scale to 1000s of members. - 🤯 Complexity There’s a reason you’re reading this! The current protocol is pretty complex and hard to understand. It's used for a bunch of stuff, including metadata propagation in Kafka Streams. This compexity results in more: - 🐛 bugs The harder to reason about and the more moving parts you have - the greater chance for bugs. There have been quite a few in the protocol. And due to the fact that a lot of the protocol’s logic lives on the clients, that results in: - 🐌 slow fixes Bugs require client-side fixes, which are slow to be adopted. If you run a Kafka ops team, you know how hard it is to get all of your clients' teams to upgrade! If you're using a cloud service, you need to wait for a new Kafka release to go out. Can't have the cloud provider handle it for you behind the scenes! - 🔍 hard to debug Debugging is harder because you need client logs. In the cloud, that's hard to do again. On-prem, it requires reading through a lot of logs and collecting a lot of files. - ⚙️ very extendable There’s a reusable embedded protocol within the rebalance protocol, where clients are free to attach raw bytes that only they can then parse themselves. It's challenging to build compatible software for this cross-client protocol, as well as near-impossible to inspect from the broker side. - 😢 inconsistent metadata Clients are responsible for triggering rebalances based on the metadata, but different clients can have different views of the metadata. - 😵‍💫 interoperability Different implementations of the clients (i.e. anything besides the Java client) may have bugs. The complex logic needs to be re-implemented quite a few times. This usually means more bugs and slower time to ship features in your favorite client. A combination nobody likes. ... So what should an open source community do? Move the logic to the broker! Then? Simplify it. The new protocol is very elegant - it streamlines all of the regular Kafka consumer logic inside a new heartbeat API. It has the broker decide what partition assignments the consumers should have, and totally omits the notion of a Group Leader client. Another major change is that the notion of a group-wide rebalance is removed now. The rebalance is more fine-grained now. 👌 When you think about it, a rebalance is simply a reassignment of some partitions from some consumers to others. 💡 Why does the whole group need to stop and know about this? It had to before because the logic was on the client. It doesn’t now. 🎂 The new protocol is fine-grained in its assignments. It maintains per-member epochs, as well as separate epochs for the general group membership and the global assignment. The goal is simple - get all of those epoch numbers to be the same. The order is the following: 1. the group-wide epoch is bumped. 2. the target assignment epoch is bumped. 3. individual consumers catch up to the epoch via the heartbeat request, individually. (fine-grained) In general, what you have is a simple state machine inside the Group Coordinator broker that’s running a constant reconciliation loop. 💥 Because every member converges to the target state independently, the coordinator is free to simplify that convergence member by member. 👍 It has the logic to resolve dependencies between members - the act of: 1. revoking one member’s partition. 2. confirming the success of that. 3. bumping that member's epoch before assigning that partition to another consumer. Here is an example visual of what happens when two members join a consumer group one by one:

Stanislav Kozlovski

52,814 views • 2 years ago