Apache Kafka introduced Queues in version 4.0. This is a powerful feature that offers a completely different perspective on how Kafka topics can be used. In this blog, let us explore Kafka Queues in detail.

What are Kafka Queues?

An interesting thing about Kafka queues is that there is no component called a Kafka Queue. What makes a Kafka topic behave like a queue is the way in which data is consumed from it. Kafka 4.0 introduced the concept of share groups as an alternative to consumer groups. It is share groups that make consuming from a Kafka topic look like consuming from a queue. With share groups, consumers cooperatively consume records without partition assignment. This means that multiple consumers in the same share group can consume data from the same partition, and the number of consumers in a share group can exceed the number of partitions in the topic.

As multiple consumers in the share group can consume from the same partition, you can no longer track progress with a single consumer offset per partition. Instead, records are acknowledged individually, although delivery is optimized for batches. Share groups also retry automatically to handle bad records. Retries are counted as delivery attempts, and by default a record gets at most 5 delivery attempts.
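
The delivery-attempt limit can be illustrated with a small, broker-free sketch. This is only a model of the behaviour described above, not Kafka client code: the function `deliver_until_done`, the state names, and the `process` callback are all hypothetical, and the limit of 5 is the default mentioned in the text.

```python
MAX_DELIVERY_ATTEMPTS = 5  # default limit mentioned in the text


def deliver_until_done(record, process, max_attempts=MAX_DELIVERY_ATTEMPTS):
    """Deliver one record until it is processed or its attempts run out.

    `process` is a hypothetical callback that raises on failure.
    Returns a (final_state, delivery_count) tuple.
    """
    delivery_count = 0
    while delivery_count < max_attempts:
        delivery_count += 1  # every delivery counts as one attempt
        try:
            process(record)
        except Exception:
            continue  # failed: the record becomes eligible for another attempt
        return "ACKNOWLEDGED", delivery_count
    # Attempts exhausted: the record is no longer redelivered.
    return "ARCHIVED", delivery_count
```

A record whose processing always fails ends up archived after five deliveries, while a record that succeeds on a later attempt is acknowledged with the number of deliveries it needed.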

Use-case: Scaling up Kafka for peak loads

When there is a sudden surge of traffic onto a Kafka topic, the consumer lag increases. This can result in an SLA breach and, in general, a bad experience. Such scenarios are usually handled by increasing the number of topic partitions. With more partitions, more consumers can be added to the consumer group to handle the increased load. When the traffic drops again, the number of consumers can be reduced to whatever is sufficient for the incoming records, but the partitions remain. Overall, this leads to "over-partitioning" of the topic. Not only that, any change in the number of consumers in the consumer group triggers a consumer rebalance, which adds to consumer latency.

This use-case is beautifully handled with share groups. As multiple consumers of the share group can read from the same partition, you can simply increase the number of consumers in the share group to handle the peak load. As every consumer is capable of consuming and processing any record in the Kafka topic, there is no concept of consumer rebalance in the share group, and also the problem of over-partitioning of the Kafka topic is eliminated.

Figure: Share groups help avoid over-partitioning
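
To put rough numbers on this, here is a back-of-the-envelope sketch. It assumes a deliberately simplified model that is not from Kafka itself: every record takes one time unit, records are spread evenly across partitions, and the function name `drain_time` is made up for illustration.

```python
import math


def drain_time(backlog, partitions, consumers, shared):
    """Time units needed to drain a backlog of records.

    Simplified model: each record takes one time unit and every active
    consumer handles one record per time unit.
    """
    # Consumer group: a partition is read by only one consumer in the group,
    # so consumers beyond the partition count sit idle.
    # Share group: every consumer can take records from any partition.
    active = consumers if shared else min(partitions, consumers)
    return math.ceil(backlog / active)


# A backlog of 1200 records on a 3-partition topic, with 12 consumers.
print(drain_time(1200, 3, 12, shared=False))  # 400
print(drain_time(1200, 3, 12, shared=True))   # 100
```

In this model, adding consumers to a consumer group stops helping once there is one consumer per partition, while a share group keeps scaling with the number of consumers.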

Use-case: Work Queue

When records undergo heavy processing at the consumer end, the processing time varies from record to record. Some records might get processed in a matter of a few seconds, while others can take minutes. Such situations are handled much better if the messages are consumed as a queue rather than as a topic. With share groups, a pool of consumers freely shares the load across all the partitions. No partition is kept waiting until its previous message is completely processed.

When the Kafka topic is consumed by a share group, each consumer takes a small number of messages and processes them to completion before acknowledging. As of Apache Kafka 4.0, you cannot control the batch size. This will, however, be possible with Apache Kafka 4.1.

Figure: Work queue functioning with a Kafka queue
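
The effect of uneven processing times can be seen in a small simulation. This is a sketch under stated assumptions rather than a model of the real broker: the two function names are hypothetical, each consumer takes one record at a time (real share consumers receive small batches), and the pooled stream simply interleaves the partitions.

```python
import heapq
from itertools import zip_longest


def makespan_consumer_group(partitions):
    """One consumer per partition: each works through its own partition in order."""
    return max(sum(durations) for durations in partitions)


def makespan_share_group(partitions, num_consumers):
    """Pool of consumers: whichever consumer is free next takes the next record."""
    # Interleave the partitions into one pooled stream (an assumption of this sketch).
    pooled = [d for group in zip_longest(*partitions) for d in group if d is not None]
    free_at = [0] * num_consumers  # time at which each consumer becomes free
    heapq.heapify(free_at)
    for duration in pooled:
        start = heapq.heappop(free_at)  # earliest available consumer
        heapq.heappush(free_at, start + duration)
    return max(free_at)


# Processing times in seconds: one partition holds slow records, the other fast ones.
partitions = [[30, 30, 30, 30], [1, 1, 1, 1]]
print(makespan_consumer_group(partitions))  # 120
print(makespan_share_group(partitions, 2))  # 62
```

With partition-bound consumers, the consumer on the slow partition needs 120 seconds while the other one is idle after 4 seconds. With a shared pool of the same two consumers, all the work finishes in 62 seconds in this example.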

An Introduction to Share-Partition

A Share-Partition is a share group's view of a topic-partition, owned by the partition's leader. It represents how a share group sees a topic and its partitions.

Figure: Share group's view of a Kafka topic

Each record is assignable to any consumer of a share group. The key responsibilities of a share-partition include:

Fetching from the log: When a consumer is ready to process the next set of records, it makes a fetch call, and the batch of records needs to be picked up and assigned to the consumer.

Managing in-flight records and delivery counts: With share groups, the state of each record in the Kafka topic needs to be maintained. This includes whether a consumer has already picked the record up for processing, whether it was processed successfully or failed, and, if it failed, how many times it has been delivered (the delivery count).

Managing acquisition lock timeouts: One important aspect of any design is the handling of failures. One should account for what happens if any component in the system goes down. Here, we need to account for the consumer going down while it is processing a record. The acquisition lock timeout defines how long to wait for an acknowledgement before the lock held by the consumer is released.
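
The acquisition lock can be sketched with a tiny in-memory model. Everything here is illustrative: the `InFlightRecord` class, the helper functions and the 30-second lock duration are assumptions made for this example, and time is passed in explicitly so that the sketch does not depend on a real clock.

```python
from dataclasses import dataclass
from typing import Optional

LOCK_DURATION_MS = 30_000  # illustrative value, assumed for this sketch


@dataclass
class InFlightRecord:
    offset: int
    state: str = "AVAILABLE"
    delivery_count: int = 0
    lock_expires_at_ms: Optional[int] = None


def acquire(record, now_ms, lock_duration_ms=LOCK_DURATION_MS):
    """Hand the record to a consumer and start its acquisition lock."""
    if record.state != "AVAILABLE":
        raise ValueError(f"record {record.offset} is {record.state}, not AVAILABLE")
    record.state = "ACQUIRED"
    record.delivery_count += 1
    record.lock_expires_at_ms = now_ms + lock_duration_ms


def expire_lock_if_due(record, now_ms):
    """Release the lock if the consumer did not respond in time.

    Returns True when the record was made available again.
    """
    if record.state == "ACQUIRED" and now_ms >= record.lock_expires_at_ms:
        record.state = "AVAILABLE"
        record.lock_expires_at_ms = None
        return True
    return False
```

If the consumer holding a record crashes, nothing acknowledges the record, the lock expires, and the record becomes available to another consumer with its delivery count preserved.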

In-flight records on a Share-Partition

The share group manages the delivery of records to the consumers. Every record capable of being delivered to a consumer is called an in-flight record. The in-flight records are bounded by the share-partition start offset (SPSO) and the share-partition end offset (SPEO). Each in-flight record has a state and a delivery count.

Let us understand this with the help of an example. Below is the share-partition view of a Kafka topic.

Figure: In-flight records on a Share-Partition

From the image above, records 2, 3 and 4 have been processed once and were not successful as can be seen from their delivery count which is 1. Records 2 and 4 are being processed by some consumer for the second time now as can be seen from their state which is "Acquired". Record 3 is "Available" for being processed by some consumer. Record 5 was processed successfully by some consumer as can be seen from its state "Acked". Record 6 was processed by a consumer, but not successfully, and was also marked as a non-retriable failure by changing its state to "Archived". No consumer will retry processing the "Archived" message. Record 7 is available for being processed by the consumer for the very first time.

As the delivery of the records completes, the SPSO advances, and the SPEO is permitted to advance too.
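
How the start offset moves can be sketched with the records from the example. This is a simplified model, not broker code: the function `advance_spso` is hypothetical, and it assumes that the window starts at record 2 and that the SPSO moves past every leading record that has reached a final state (acknowledged or archived).

```python
TERMINAL_STATES = {"ACKNOWLEDGED", "ARCHIVED"}


def advance_spso(states, spso):
    """Move the start offset past leading records that are completely done.

    `states` maps an offset to the state of the in-flight record at that offset.
    """
    while states.get(spso) in TERMINAL_STATES:
        spso += 1
    return spso


# States of records 2 to 7 as described in the example above.
states = {
    2: "ACQUIRED",
    3: "AVAILABLE",
    4: "ACQUIRED",
    5: "ACKNOWLEDGED",
    6: "ARCHIVED",
    7: "AVAILABLE",
}
print(advance_spso(states, 2))  # 2: record 2 is still being processed

# Once records 2, 3 and 4 are acknowledged, the window jumps past 5 and 6 as well.
for offset in (2, 3, 4):
    states[offset] = "ACKNOWLEDGED"
print(advance_spso(states, 2))  # 7
```

Records 5 and 6 were already done, but they stay inside the window until every record before them is done too.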

Following is the state machine which dictates the state transition of the records consumed by the share group:

Figure: State machine for records consumed by a share group

Every record starts in the "Available" state. When the record is given to a consumer for processing, it transitions into the "Acquired" state. If the acquisition lock timeout elapses or the consumer releases the lock, the record transitions back into the "Available" state so that it can be retried. If the retries are exhausted, the record transitions into the "Archived" state instead. Apart from that, the consumer moves successfully processed records into the "Acknowledged" state, and failed non-retriable records into the "Archived" state.
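
The transitions described above can be written down as a small lookup table. This is a sketch of the state machine as this article describes it. The names `ALLOWED_TRANSITIONS` and `transition` are made up for illustration and are not part of any Kafka API.

```python
# Allowed transitions, as described in the paragraph above.
ALLOWED_TRANSITIONS = {
    "AVAILABLE": {"ACQUIRED"},
    "ACQUIRED": {"AVAILABLE", "ACKNOWLEDGED", "ARCHIVED"},
    "ACKNOWLEDGED": set(),  # final state
    "ARCHIVED": set(),      # final state
}


def transition(current, target):
    """Return the new state, or raise if the state machine forbids the move."""
    if target not in ALLOWED_TRANSITIONS[current]:
        raise ValueError(f"illegal transition: {current} -> {target}")
    return target


state = "AVAILABLE"
state = transition(state, "ACQUIRED")      # handed to a consumer
state = transition(state, "AVAILABLE")     # lock timed out, eligible for retry
state = transition(state, "ACQUIRED")      # second delivery
state = transition(state, "ACKNOWLEDGED")  # processed successfully
print(state)  # ACKNOWLEDGED
```

Keeping the transitions in one table makes illegal moves, such as acknowledging a record that was never acquired, fail loudly.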

Potential features to enhance Share Groups

An important point to note here is that Queues have been released only as an Early Access feature in Apache Kafka 4.0. It is NOT ready for production use. The feature will be released in Preview mode in Apache Kafka 4.1. And it will only be in Apache Kafka 4.2 or beyond, that Queues will be ready for production use-cases.

That being said, it is good to know some of the feature gaps that currently exist in Queues, and what you can expect in the upcoming releases:

Dead-letter queues: Share groups are powerful enough to transition the record into "Archived" state. However, it could be more helpful to redirect such records into another Kafka topic which can then be processed later in a different fashion.

Delayed retry for failed records: At present, records whose processing attempt was not successful are immediately made available again. If the processing depends on third-party calls, it makes more sense to wait between retries, for example with an exponential backoff.

Rack-aware assignments: While Apache Kafka brokers are rack aware in general, share groups, which involve many different consumers, do not leverage this functionality at the moment.

Exactly-once semantics: This can be helpful to support transactions for share consumers.

Key-based ordering: Each record is independent and is free to be assigned to any consumer. Taking the record's key into account can help ensure ordering by delivering records with the same key to the same consumer.
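
To make the delayed-retry idea from the list above concrete, here is what an exponential backoff schedule could look like. This is purely hypothetical: share groups do not offer this today, and the function name, the 1-second base and the 30-second cap are assumptions chosen for illustration.

```python
def backoff_delay_ms(delivery_count, base_ms=1_000, cap_ms=30_000):
    """Delay before the next retry, doubling with every failed delivery."""
    if delivery_count < 1:
        raise ValueError("delivery_count starts at 1")
    return min(cap_ms, base_ms * 2 ** (delivery_count - 1))


# Delays after the 1st, 2nd, 3rd, 4th and 5th failed delivery.
print([backoff_delay_ms(n) for n in range(1, 6)])
# [1000, 2000, 4000, 8000, 16000]
```

The cap keeps the wait bounded, so a dependency that stays down for a long time does not push retries out indefinitely.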

Conclusion

This blog focuses on introducing Apache Kafka Queues, its use-cases, how it works and the state transitions involved. While the feature is not ready for production use, it helps introduce you to this new concept and helps you gear up to the upcoming changes in Apache Kafka.

If you liked this article and want to learn about Kafka Queues in depth with a practical example, feel free to comment, and I can put up another article with a practical demonstration of Apache Kafka Queues.