When naval warships sail together, they move in a precise formation, each vessel maintaining constant communication and acting as a single, cohesive unit. Similarly, a Kafka cluster and its surrounding components operate in strict coordination. Each node plays a synchronized role in keeping the system healthy and responsive. But what happens when a ship in the formation, a node in our case, needs to be pulled out due to a fault or a scheduled replacement? The transition must be seamless, with a new ship promptly joining the formation to maintain operational integrity.

In this post, I'll walk you through how we safely replaced and rejoined nodes for a key component in our Kafka ecosystem: the KRaft Controller. I'll show how we kept the formation intact while swapping out critical parts, just like disciplined ships at sea, and I'll also share a real-life operational challenge we faced along the way. You'll learn how we troubleshot a metadata fetch timeout that occurred during one of these operations and the steps we took to resolve it.

As the Trendyol Data Streaming team, we provide a data streaming platform powered by more than 40 Kafka clusters, serving over 2,000 developers, and we make sure teams use it in line with security and availability best practices. At Trendyol, we manage our Kafka clusters on virtual machines, which brings several operational challenges, particularly when infrastructure components such as the OS, network configurations, or VM images need to be updated. Instead of provisioning an entirely new Kafka cluster in such cases, we opt for a node-by-node replacement strategy. This minimizes disruption for our internal software teams and helps keep the total number of managed clusters under control.

Previously, Yalın Doğu Şahin published a detailed Medium article explaining how we replace Kafka brokers. As he pointed out, other Kafka components such as the KRaft Controller, Kafka Connect, ksqlDB, and Schema Registry are designed to be stateless and replaceable. But how can we be sure a replacement has been successful? This article explores in detail how we handle this process for KRaft controllers, with real-life insights from our operational experience.

Replacing a KRaft Controller Node

KRaft controllers are essential for maintaining metadata consistency and cluster coordination in modern Kafka clusters. When we need to replace a controller node, the process starts by stopping the controller service on the targeted VM.

Before terminating the VM, we back up the meta.properties file located at /var/lib/kafka/data/controller/, noting down the values of node.id and cluster.id. These values must match when provisioning the replacement node to ensure a smooth reintegration.
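
A minimal sketch of this step, assuming a systemd-managed install where the controller service is called confluent-kcontroller (the service name and backup location are assumptions; adjust them to your own layout):

# Stop the controller before touching its data directory
sudo systemctl stop confluent-kcontroller

# Back up meta.properties and note node.id / cluster.id for the replacement node
sudo cp /var/lib/kafka/data/controller/meta.properties /tmp/meta.properties.bak
grep -E '^(node\.id|cluster\.id)=' /tmp/meta.properties.bak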

In our setup, all nodes are referred to via DNS hostnames rather than IP addresses. This allows us to easily update DNS A records, without requiring configuration changes on every broker node.
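
Purely as an illustration, assuming a BIND-style DNS server that accepts dynamic updates (the nameserver, hostname, TTL, and IP address below are made up; most environments will do this through their DNS provider's own tooling):

# Repoint the controller's A record at the replacement VM;
# brokers keep using the same hostname and need no configuration change.
nsupdate <<'EOF'
server ns1.example.internal
update delete kraft-controller-5.kafka.example.internal A
update add kraft-controller-5.kafka.example.internal 300 A 10.0.0.42
send
EOF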

We use cp-ansible to redeploy the controller node. By default, node.id is assigned based on the order of nodes defined in the hosts.yml file, while cluster.id can be specified explicitly using the clusterid variable.

⚠️ Although hardcoding cluster.id has been debated because of the risk of brokers accidentally joining the wrong cluster, you can still set it dynamically:

ansible-playbook -i hosts.yml confluent.platform.all \
  --limit kraft-controller-new-node \
  --tags kafka_controller \
  -e "clusterid=MySuperDuperClusterId"

After the replacement node joins the cluster, it will show up in the quorum list. Monitoring the quorum lag is essential. If everything is working as expected, the new node will have Lag = 0, and its Status will be Follower or Leader, depending on the current cluster state.
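
For reference, this is roughly the check we run (the bootstrap address is an assumption; point it at one of your own brokers):

# List the quorum members and their replication status;
# the replacement node should show Lag = 0 once it has caught up.
kafka-metadata-quorum --bootstrap-server kafka-broker-1.example.internal:9092 \
  describe --replication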

Example output from kafka-metadata-quorum's describe command:

NodeId  LogEndOffset  Lag  Status
9991    88183624      0    Leader
9992    88183624      0    Follower
9993    88183624      0    Follower
9994    88183624      0    Follower
9995    88183624      0    Follower

Troubleshooting a KRaft Controller Node Join

What happens if you encounter a problem with the metadata quorum? Let's say the node you added didn't join the cluster.

In this case, identifying the root cause of the problem is crucial. You'll need to examine the logs to understand why the node failed to join the cluster. During one of our operational experiences, we encountered this exact situation with a single node in a cluster. To diagnose the issue, we elevated the log levels to DEBUG.
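
As a rough sketch, assuming a log4j.properties-based setup (the config path and service name are assumptions; with cp-ansible you would normally roll such a change out through the playbook rather than by hand):

# Raise the raft and network client loggers to DEBUG on the new controller,
# then restart it so the change takes effect.
cat <<'EOF' | sudo tee -a /etc/kafka/log4j.properties
log4j.logger.org.apache.kafka.raft=DEBUG
log4j.logger.org.apache.kafka.clients.NetworkClient=DEBUG
EOF
sudo systemctl restart confluent-kcontroller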

We discovered that the newly provisioned controller node was timing out while attempting to fetch metadata from the other controllers.

[2025-07-24 11:44:16,615] DEBUG [RaftManager id=9995] Sending FETCH_SNAPSHOT request with header RequestHeader(apiKey=FETCH_SNAPSHOT, apiVersion=0, clientId=raft-client-9995, correlationId=2063, headerVersion=2) and timeout 2000 to node 9993: FetchSnapshotRequestData(clusterId='MySuperDuperClusterId', replicaId=9995, maxBytes=2147483647, topics=[TopicSnapshot(name='__cluster_metadata', partitions=[PartitionSnapshot(partition=0, currentLeaderEpoch=123371, snapshotId=SnapshotId(endOffset=89372885, epoch=123021), position=0)])]) (org.apache.kafka.clients.NetworkClient)

⚠️ Important Note: A misbehaving node that cannot join the cluster poses a significant risk to the entire cluster. Its repeated attempts to join can negatively impact other controller nodes. We have observed instances in non-production environments where this behavior led to a complete Kafka cluster outage due to KRaft cluster quorum issues.

To solve this, we increased controller.quorum.request.timeout.ms and controller.quorum.fetch.timeout.ms from their 2-second defaults to 10 seconds. With this change, the new node was able to fetch the metadata and seamlessly join the cluster. It turned out we had missed the deadline by just one second: the snapshot fetch took about 3 seconds, while the timeout was 2.
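
A minimal sketch of that change, assuming the controller reads its configuration from the path below (the path and service name are assumptions; in practice we apply this through our cp-ansible configuration rather than by editing files directly):

# Give quorum requests and metadata fetches more headroom
# (both settings default to 2000 ms; our snapshot fetch needed roughly 3 s).
cat <<'EOF' | sudo tee -a /etc/kafka/server.properties
controller.quorum.request.timeout.ms=10000
controller.quorum.fetch.timeout.ms=10000
EOF
sudo systemctl restart confluent-kcontroller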

[2025-07-24 12:04:40,782] DEBUG [RaftManager id=9995] Sending FETCH_SNAPSHOT request with header RequestHeader(apiKey=FETCH_SNAPSHOT, apiVersion=0, clientId=raft-client-9995, correlationId=29, headerVersion=2) and timeout 10000 to node 9991: FetchSnapshotRequestData(clusterId='MySuperDuperClusterId', replicaId=9995, maxBytes=2147483647, topics=[TopicSnapshot(name='__cluster_metadata', partitions=[PartitionSnapshot(partition=0, currentLeaderEpoch=123376, snapshotId=SnapshotId(endOffset=89372903, epoch=123021), position=0)])]) (org.apache.kafka.clients.NetworkClient)
[2025-07-24 12:04:43,952] DEBUG [RaftManager id=9995] Received FETCH_SNAPSHOT response from node 9991 for request with header RequestHeader(apiKey=FETCH_SNAPSHOT, apiVersion=0, clientId=raft-client-9995, correlationId=29, headerVersion=2): FetchSnapshotResponseData(throttleTimeMs=0, errorCode=0, topics=[TopicSnapshot(name='__cluster_metadata', partitions=[PartitionSnapshot(index=0, errorCode=0, snapshotId=SnapshotId(endOffset=89372903, epoch=123021), currentLeader=LeaderIdAndEpoch(leaderId=9991, leaderEpoch=123376), size=6606192, position=0, unalignedRecords=MemoryRecords(size=6606192, buffer=java.nio.HeapByteBuffer[pos=0 lim=6606192 cap=6606206]))])]) (org.apache.kafka.clients.NetworkClient)

Root Cause and Long-Term Solution

Further investigation revealed that the size of the metadata snapshot was a key factor in these timeouts. To address this, a long-term solution was proposed to the community: enabling compression for metadata requests, which would help alleviate timeout issues in environments with slower network connections. We opened a support ticket with Confluent, and they subsequently raised a bug report and a feature request with the Apache Kafka community.

We confirmed that the controller node had fetched the metadata and was in sync with the quorum by using the kafka-metadata-quorum command. Additionally, we can read the metadata directly with kafka-metadata-shell --directory /var/lib/controller/data/__cluster_metadata-0. For troubleshooting, this shell also lets us inspect ACLs, brokers, controllers, the metadata version, and topics.

$ kafka-metadata-shell --directory /var/lib/controller/data/__cluster_metadata-0
Loading…
Starting…
[ Kafka Metadata Shell ]
>> tree .
image:
acls:
byId:
cells:
clientQuotas:
cluster:
brokers:
controllers:
…

Why Not Just Add and Remove Controllers Dynamically?

You might be wondering, "But isn't adding and removing controllers a new feature of Apache Kafka 3.9 and Confluent Kafka 7.9?" That's a great question, and it's a feature we're eagerly waiting for.

At the time of this operation, we were running our clusters on Confluent Kafka 7.8.3. While Confluent Platform 7.9 and later allow a dynamic KRaft metadata quorum on newly created clusters, they don't support upgrading an existing static quorum to a dynamic one. This means we couldn't simply use the dynamic controller removal feature.

We've been in touch with Confluent on this topic, and they've informed us that the ability to upgrade a static quorum to a dynamic one is planned for Confluent Platform 8.1. The corresponding fix is expected to be available in Apache Kafka 4.1.

Since our KRaft cluster was upgraded from 7.8.2 to 7.9.2 rather than created fresh on 7.9, its quorum remained static, and we couldn't use the kafka-metadata-quorum command to remove controllers. This is precisely why we developed and documented the manual, node-by-node replacement strategy you just read about.

Can Process Status Be Monitored Through Metrics?

KRaft controller nodes obtain metadata from the leader. This process can be monitored using the fetch-records-rate metric within the raft-metrics group. Additionally, the "last applied" metrics, detailed under KRaft quorum monitoring, should consistently show progress.
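
One way to spot-check this, assuming remote JMX is enabled on port 9999 and the metric is exposed under a raft-metrics MBean (the host, port, and object name are assumptions; a Prometheus JMX exporter dashboard works just as well):

# Sample fetch-records-rate from the new controller every 5 seconds;
# during catch-up it should be clearly non-zero.
kafka-run-class org.apache.kafka.tools.JmxTool \
  --jmx-url service:jmx:rmi:///jndi/rmi://kraft-controller-5.example.internal:9999/jmxrmi \
  --object-name 'kafka.server:type=raft-metrics' \
  --attributes fetch-records-rate \
  --reporting-interval 5000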

Once a new controller is provisioned, it immediately begins fetching records.

Conclusion

Replacing components in a Kafka cluster, especially critical ones like KRaft controllers, doesn't have to be risky or complex. With proper configuration management, solid metrics observability, and automation via tools like cp-ansible, AWX, and Terraform, we've developed an approach that allows us to scale and upgrade with confidence.

The key is understanding the statefulness of each component and monitoring the correct metrics during and after the replacement.

Before a controller upgrade, I recommend the following:

  • Make sure your cluster is healthy (check indicators such as misconfigured topics and under-replicated partitions); a quick pre-flight sketch follows this list.
  • Confirm that the follower lags in the metadata quorum are zero using the kafka-metadata-quorum command.
  • Review the controller.quorum.request.timeout.ms and controller.quorum.fetch.timeout.ms values based on your metadata size.
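
Here's a quick pre-flight sketch covering the first two checks (the bootstrap address is an assumption; point it at one of your own brokers):

# 1. Cluster health: the under-replicated partitions listing should be empty.
kafka-topics --bootstrap-server kafka-broker-1.example.internal:9092 \
  --describe --under-replicated-partitions

# 2. Quorum health: every follower should report Lag = 0.
kafka-metadata-quorum --bootstrap-server kafka-broker-1.example.internal:9092 \
  describe --replication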

At Trendyol, we view our infrastructure as a living system, capable of adapting and evolving without disruption. With every successful node replacement, we move closer to a platform that is as resilient as it is scalable.

At Trendyol, as the Data Streaming Team, we enjoy applying up-to-date best practices and sharing our hands-on experiences. If you're curious to learn more about how we approach distributed systems and streaming infrastructure, feel free to explore the other articles written by our team; I've linked them below!

Join Us

Do you want to be a part of our growing company? We're hiring! Check out our open positions from the links below.