If you "just flip the auth switch" in Kafka with thousands of clients, you'll trigger Permission Denied storms and unpredictable latency. In 2025, success comes from phased migration, clear observability, and local permission caches. Here is my practical plan to add authentication and authorization without stopping the world.

I've run Kafka under heavy load for a long time: many producers and consumers, tens of thousands of partitions, and steady multi-million message throughput. Early on I ran it "open": any service could read or write anywhere. That's great for speed of integration — and terrible for operations. Random writes into the wrong topic, fragile contracts, and endless investigations. In 2025 I finished migrating to authentication and authorization without downtime. This is the playbook that worked for me.

Apache Kafka Ecosystem

Why bother with authorization

Three payoffs justified the effort:

Controlled access. Explicit read/write permissions per topic cut an entire class of accidental mistakes — typos, misrouted messages, or format drift.

Visibility of integrations. A living graph of "who writes where, and who reads what" helps plan schema changes and infrastructure work.

Faster incident work. When something goes wrong, I can answer "who sent this message, and in what context" in minutes, not hours.

How to authenticate

Two common paths:

Client certificates. Simple "one cert = one client," but expensive lifecycle work: issuance, rotation, revocation, and strict bookkeeping.

SASL with tokens. Authentication via JWT and an external identity system; language-agnostic; flexible attributes.

I chose tokens. In production I still had to provide a validator that turns a verified token into a principal with attributes the authorizer can read. That gave me control over caching and over which token fields matter for permissions.
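The validator's job can be sketched in a few lines: verify the signature, then project only the claims the authorizer needs into a principal. Here is a minimal HS256 sketch using only the standard library; the `Principal` shape and the `team` claim are illustrative, not from any real deployment or Kafka API.

```python
import base64
import hashlib
import hmac
import json
from dataclasses import dataclass

@dataclass(frozen=True)
class Principal:
    name: str
    attributes: dict

def _b64url_decode(part: str) -> bytes:
    # JWT segments are base64url without padding; restore it before decoding.
    return base64.urlsafe_b64decode(part + "=" * (-len(part) % 4))

def validate_token(token: str, secret: bytes) -> Principal:
    """Verify an HS256 JWT and map it to a principal the authorizer can read."""
    header_b64, payload_b64, sig_b64 = token.split(".")
    expected = hmac.new(secret, f"{header_b64}.{payload_b64}".encode(),
                        hashlib.sha256).digest()
    if not hmac.compare_digest(expected, _b64url_decode(sig_b64)):
        raise PermissionError("invalid token signature")
    claims = json.loads(_b64url_decode(payload_b64))
    # Only the fields the authorizer cares about become principal attributes.
    return Principal(name=claims["sub"], attributes={"team": claims.get("team")})
```

Keeping the claim-to-attribute projection in one place is what gives you control over which token fields actually matter for permissions.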

ACLs or roles

ACLs are fine — until they aren't. With many users and many topics, every update becomes synchronization overhead and a risk of fat-fingered rules. I switched to roles: minimal, topic-scoped capabilities like read:topic-X and write:topic-Y. Administrative powers are granted separately and sparingly. Roles are created when a topic appears and removed when it's deleted. Authorization runs locally on the broker against the authenticated principal—fast and predictable.
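A sketch of what these topic-scoped checks look like, assuming the `read:topic` / `write:topic` naming from above. This stands in for a custom authorizer, not Kafka's built-in ACL machinery.

```python
# Roles are created with the topic and removed with it; checks are a set
# lookup against the authenticated principal's roles -- fast and local.

def make_topic_roles(topic: str) -> set:
    """Minimal capabilities minted when a topic appears."""
    return {f"read:{topic}", f"write:{topic}"}

def authorize(principal_roles: set, operation: str, topic: str) -> bool:
    """In-memory check on the broker: no network call per request."""
    return f"{operation}:{topic}" in principal_roles
```

Because the check is a local set membership test, authorization latency stays flat no matter how many topics or clients exist.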

Where to keep roles

I tried both ways.

Inside the token. It's fast and avoids external calls, but roles live as long as the token. Emergency revocation is awkward, and big tokens collide with proxies and increase overhead.

Outside the token (my choice). The token only carries the client identity; roles are fetched at session establishment and cached in memory on the broker. It's compact, centrally governed, and safe to change. To protect the identity system, I use batching, exponential backoff, and jitter so refresh timers don't fire in sync.
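The retry schedule above can be sketched as exponential backoff with full jitter: each retry waits a random amount within an exponentially growing window, so broker refresh timers never fire in lockstep. The `base` and `cap` values are illustrative.

```python
import random

def refresh_delay(attempt: int, base: float = 1.0, cap: float = 60.0) -> float:
    """Exponential backoff with full jitter for role-refresh retries.

    The delay is drawn uniformly from [0, min(cap, base * 2**attempt)],
    spreading load on the identity system instead of hammering it in sync.
    """
    return random.uniform(0, min(cap, base * (2 ** attempt)))
```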

Example 1: No-downtime migration with two listeners

Situation. You can't flip every client overnight.

What I did. I ran two listeners side by side: the legacy port (unauthenticated) and a new port with SASL/OAuth. My client wrapper exposes a single feature flag; once enabled, it picks the new endpoints and uses tokens. The old port remains briefly for stragglers.
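The flag logic in the wrapper can be sketched like this. The ports and host are illustrative; the setting names follow common Kafka client conventions but are an assumption, not a specific library's API.

```python
# Hypothetical client-wrapper helper: one boolean flag decides which
# listener and which security settings a client uses.

def client_config(use_auth_listener: bool) -> dict:
    if use_auth_listener:
        return {
            "bootstrap_servers": "broker:9093",   # new SASL/OAuth listener
            "security_protocol": "SASL_SSL",
            "sasl_mechanism": "OAUTHBEARER",
        }
    return {
        "bootstrap_servers": "broker:9092",       # legacy open listener
        "security_protocol": "PLAINTEXT",
    }
```

Toggling clients back is the same one-line flag flip, which is what makes rollback instant.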

Why it works. Phased migration and instant rollback. If a subset behaves oddly, I toggle them back, fix, and try again — no global outage.

Fine print. Permissions must exist before the switch. If a client moves early with missing rights, it fails immediately. I precomputed and preloaded baseline roles per topic to avoid that trap.

Example 2: Smoothing IAM load and speeding up startup

Situation. After network or platform events, many clients reconnect. Brokers refresh roles in bursts; if refresh timers align, the identity system gets hammered. Rate limiting returns errors, and some clients freeze.

What I did.

  1. I added jitter to every permission refresh timer, so load spreads out naturally.
  2. I placed a lightweight local permissions proxy on the same host as each broker. It batches lookups, caches results in memory and on disk, and serves brokers over a local socket.
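The proxy's cache-plus-batching behavior can be sketched as follows; `fetch_many` stands in for the batched call to the identity system, and the TTL is an illustrative value.

```python
import time

class PermissionCache:
    """In-memory TTL cache for the local permissions proxy.

    Hits are served locally; misses are collected and fetched in a single
    batched upstream call instead of one call per principal.
    """

    def __init__(self, fetch_many, ttl: float = 300.0, clock=time.monotonic):
        self._fetch_many = fetch_many   # callable: list[str] -> dict[str, set]
        self._ttl = ttl
        self._clock = clock
        self._store = {}                # principal -> (fetched_at, roles)

    def roles_for(self, principals):
        now = self._clock()
        result, misses = {}, []
        for p in principals:
            entry = self._store.get(p)
            if entry and now - entry[0] < self._ttl:
                result[p] = entry[1]
            else:
                misses.append(p)
        if misses:
            fetched = self._fetch_many(misses)  # one upstream call for all misses
            for p, roles in fetched.items():
                self._store[p] = (now, roles)
                result[p] = roles
        return result
```

If the identity system is down, stale entries can also be served past their TTL under an incident procedure; that is the "cache buys time" behavior, omitted here for brevity.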

Why it helps. Local hops are consistent and fast, and calls to the identity system become even, predictable traffic. If the identity system struggles, the cache buys time; I get alerts and resolve calmly instead of firefighting.

Safety levers. The proxy has two controlled modes for emergencies: "proxy everything" (if caching logic misbehaves) and "allow-all for valid tokens" (strictly under incident procedure and auditing). I've never needed them, but knowing they exist lowers operational risk.

Operational notes. I update the proxy with zero interruption: two processes share one port, the kernel balances new connections, and the service reloads config on signal. Cutover is just "start new → confirm healthy → retire old."
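The "two processes share one port" trick is the kernel's `SO_REUSEPORT` option (Linux-specific); new connections are distributed across whichever processes hold the port. A minimal sketch:

```python
import socket

def listening_socket(port: int) -> socket.socket:
    """Bind with SO_REUSEPORT so a new proxy process can share the port
    with the old one during cutover; the kernel balances new connections."""
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEPORT, 1)
    s.bind(("127.0.0.1", port))
    s.listen()
    return s
```

Cutover is then exactly "start new → confirm healthy → retire old": both processes accept during the overlap, and closing the old socket drains it naturally.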

Example 3: The sneaky client pattern that caused a metadata storm

Situation. Some client libraries call metadata or partition leader discovery before the authenticated session is fully established, i.e., without attaching a token. That produces immediate authorization failures and triggers aggressive retries — flooding brokers with metadata requests.

What I did.

  • I added an explicit warm-up step in the client wrapper: an operation that always builds a proper authenticated session first.
  • I enabled metrics and selective logging for permission-denied events on the broker. That made it obvious when calls arrived "naked."
  • In hot paths, I changed the order of operations: authenticate, then fetch metadata.
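The ordering fix can be sketched with a thin wrapper that forces authentication before any metadata call; the `session` object here is a hypothetical client interface, not a real library class.

```python
class WarmupClient:
    """Guarantees an authenticated session exists before metadata requests,
    so no call ever arrives at the broker "naked" and triggers retry storms."""

    def __init__(self, session):
        self._session = session
        self._ready = False

    def _warm_up(self):
        # Any cheap operation that completes the full authenticated
        # handshake before metadata or leader discovery can run.
        self._session.authenticate()
        self._ready = True

    def metadata(self, topic):
        if not self._ready:
            self._warm_up()
        return self._session.fetch_metadata(topic)
```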

Why it matters. Without this, the protection looks like random failures, and clients spin in pointless retries that stress the entire cluster. A tiny initialization change eliminated a whole class of incidents.

How I built the access map — without heavy logging

Full request logging on brokers is a fast path to bottlenecks. I embedded a lightweight counter in the central request handler: per client IP and per operation, with a short TTL. Data aggregates in memory and flushes periodically. With minimal storage I got a usable picture: "which clients touch which topics and how." In a containerized world, I matched IPs to services via orchestration telemetry and service discovery. It's not perfect, but it's good enough to pre-seed roles, and I validated edge cases manually.
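The counter can be sketched as a windowed in-memory aggregate that flushes periodically instead of logging every request; the 60-second window is an illustrative value.

```python
import time
from collections import Counter

class AccessCounter:
    """Lightweight per-(client IP, operation, topic) counter with a short TTL.

    Sits in the central request handler; data aggregates in memory and is
    flushed as one small snapshot per window instead of per-request logs.
    """

    def __init__(self, ttl: float = 60.0, clock=time.monotonic):
        self._ttl = ttl
        self._clock = clock
        self._window_start = clock()
        self._counts = Counter()

    def record(self, client_ip: str, operation: str, topic: str) -> None:
        self._counts[(client_ip, operation, topic)] += 1

    def flush_if_due(self):
        """Return and reset the window's counts once the TTL elapses, else None."""
        if self._clock() - self._window_start < self._ttl:
            return None
        snapshot = dict(self._counts)
        self._counts = Counter()
        self._window_start = self._clock()
        return snapshot
```

Each snapshot is a handful of tuples per window, which is what keeps the storage cost negligible while still answering "which clients touch which topics".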

This approach also surfaced rare consumers — like monthly report jobs — that conventional app logs or team surveys would miss. Those would have silently broken when I shut the legacy port.

Cutting the old entry point — safely

The final act is to close the unauthenticated listener. Repackaging brokers just to toggle a port is slow and risky. I blocked the old port at the host firewall, left a local loopback exception for operational tools, and rolled the rule out in fault-domain-sized batches. That kept leader moves and write availability safe while making rollback trivial.

What turned out critical

Observability. I track authorization error codes, the number of clients per listener, permission refresh latencies, and local-proxy hit rates. Without these, I'd be flying blind.

Jitter and batching. Even load on the identity system beats "freshest cache at all times." Predictability wins.

Session warm-up. Ensure the first authenticated operation happens before everything else. It erases entire categories of weird failures.

Two listeners and a client flag. The cheapest, safest migration path I've found.

Local permission cache next to the broker. Another "nine" of availability and faster restarts after any disturbance.

Takeaway

Kafka authorization isn't a checkbox — it's a sequence: dual listeners, role-based access, a local permission cache, deliberate session warm-up, solid telemetry, and a careful shutdown of legacy entry points. In 2025 this is basic hygiene for large fan-out/fan-in topologies. I shipped it with no downtime because I built visibility and caching first and tightened controls later.
