When your business depends on processing millions of events per second, building a reliable and performant Kafka consumer-producer pipeline becomes critical. In this article, I'll walk you through the architecture, implementation, and lessons we learned scaling Kafka with Go to consistently handle 1 million messages per second — and how we got there without burning out our team or machines.
This isn't a theoretical guide. This is what we actually built.
Problem Statement
We needed to ingest and process data at >1M messages/sec:
- Events coming from IoT devices and user interactions
- High fan-out to multiple downstream services
- Durable and observable architecture
- Real-time processing with minimal latency
High-Level Architecture
+---------------+       +-----------------+
| Kafka Topic A | ----> | Go Consumer App |
+---------------+       +-----------------+
                                 |
                                 v
                       +-------------------+
                       |    Worker Pool    |
                       |   (Goroutines)    |
                       +-------------------+
                                 |
                                 v
                       +-------------------+
                       |   Output Layer    |
                       | (Kafka, DB, etc.) |
                       +-------------------+
Key Goals
- Throughput: Sustain 1M msg/sec with headroom.
- Backpressure handling: No message drops under load.
- Crash resilience: Auto-recovery with offset commits.
- Observability: Track lag, latency, and throughput.
Our Kafka Stack
We used:
- Apache Kafka 3.x
- Confluent Go client (github.com/confluentinc/confluent-kafka-go)
- 6 partitions minimum per topic (more on this later)
- Kafka Connect for data export
- Go 1.21+
Code Setup: High Throughput Consumer
import "github.com/confluentinc/confluent-kafka-go/kafka"
func createConsumer(brokers, groupID, topic string) *kafka.Consumer {
c, err := kafka.NewConsumer(&kafka.ConfigMap{
"bootstrap.servers": brokers,
"group.id": groupID,
"auto.offset.reset": "earliest",
"enable.auto.commit": false,
"go.events.channel.enable": true,
"queued.max.messages.kbytes": 20480, // 20 MB
})
if err != nil {
panic(err)
}
err = c.SubscribeTopics([]string{topic}, nil)
if err != nil {
panic(err)
}
return c
}
Worker Pool for Message Handling
func startWorkerPool(workerCount int, messages <-chan *kafka.Message, wg *sync.WaitGroup) {
	for i := 0; i < workerCount; i++ {
		wg.Add(1)
		go func(id int) {
			defer wg.Done()
			for msg := range messages {
				process(msg) // handle message logic
			}
		}(i)
	}
}
Dispatcher Pattern
We separate consumption from processing:
func dispatchMessages(c *kafka.Consumer, out chan<- *kafka.Message) {
	for ev := range c.Events() {
		switch e := ev.(type) {
		case *kafka.Message:
			out <- e
		case kafka.Error:
			log.Printf("Kafka error: %v\n", e)
		}
	}
}
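To show how these pieces fit together end to end, here's a minimal wiring sketch under a few assumptions: the broker address, group ID, topic, channel capacity, and worker count are placeholder values, and it reuses the createConsumer, dispatchMessages, and startWorkerPool functions above.

func main() {
	// Placeholder broker/group/topic values.
	c := createConsumer("localhost:9092", "ingest-group", "events")
	defer c.Close()

	// Bounded channel between dispatcher and workers; a full buffer
	// blocks the dispatcher and applies backpressure upstream.
	messages := make(chan *kafka.Message, 10000)

	var wg sync.WaitGroup
	startWorkerPool(64, messages, &wg) // worker count is workload-dependent

	// Blocks for as long as the consumer's events channel stays open.
	dispatchMessages(c, messages)

	close(messages)
	wg.Wait()
}

A production version would also commit offsets only after processing succeeds (lesson 2 below) and feed the batch writer described next.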
Batching for Output Efficiency
When writing to downstream services (like another Kafka topic or DB), we batch messages:
func batchWriter(input <-chan *kafka.Message) {
	batch := make([]*kafka.Message, 0, 1000)
	ticker := time.NewTicker(10 * time.Millisecond)
	defer ticker.Stop()
	for {
		select {
		case msg, ok := <-input:
			if !ok { // input closed: flush the remainder and exit
				if len(batch) > 0 {
					flush(batch)
				}
				return
			}
			batch = append(batch, msg)
			if len(batch) >= 1000 { // size-based flush
				flush(batch)
				batch = batch[:0]
			}
		case <-ticker.C: // time-based flush keeps tail latency bounded
			if len(batch) > 0 {
				flush(batch)
				batch = batch[:0]
			}
		}
	}
}
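flush is intentionally left abstract above. As one illustration, here's what it might look like if the output layer is another Kafka topic; the producer and outputTopic variables are assumed to be set up elsewhere and are placeholders, not part of the pipeline code shown so far.

// Hypothetical flush target: republish the batch to an output topic.
// `producer` (*kafka.Producer) and `outputTopic` are assumed to exist.
func flush(batch []*kafka.Message) {
	for _, msg := range batch {
		// Produce is asynchronous; delivery reports arrive on producer.Events().
		err := producer.Produce(&kafka.Message{
			TopicPartition: kafka.TopicPartition{Topic: &outputTopic, Partition: kafka.PartitionAny},
			Key:            msg.Key,
			Value:          msg.Value,
		}, nil)
		if err != nil {
			log.Printf("produce failed: %v", err)
		}
	}
	producer.Flush(1000) // wait up to 1s for outstanding deliveries
}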
Performance Benchmarks
Hardware:
- 8 vCPUs
- 32 GB RAM
- 1Gbps NIC
- Kafka and Go app on separate VMs
Settings:
- 12 partitions
- 6 consumers
- GOMAXPROCS = runtime.NumCPU()
| Setup                  | Messages/sec | Avg Latency | CPU Usage |
| ---------------------- | ------------ | ----------- | --------- |
| Single Consumer        | ~150K        | ~12ms       | ~30%      |
| 6 Consumer Goroutines  | ~950K        | ~6ms        | ~85%      |
| Worker Pool + Batching | **1.2M**     | **<3ms**    | ~90%      |
Observability: Monitoring Lag and Throughput
We used:
- Prometheus + Grafana
- consumer_lag via Kafka exporter
- runtime.NumGoroutine() for goroutine counts
- Custom histograms for per-stage latency
messagesProcessed := prometheus.NewCounterVec(...)
latencyHistogram := prometheus.NewHistogramVec(...)
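To sketch what those metric declarations might look like in full (metric names, labels, and buckets here are illustrative placeholders, not our exact definitions):

import "github.com/prometheus/client_golang/prometheus"

// Illustrative metric definitions; names and labels are placeholders.
var (
	messagesProcessed = prometheus.NewCounterVec(
		prometheus.CounterOpts{Name: "pipeline_messages_processed_total", Help: "Messages processed per stage."},
		[]string{"stage"},
	)
	latencyHistogram = prometheus.NewHistogramVec(
		prometheus.HistogramOpts{Name: "pipeline_stage_latency_seconds", Help: "Per-stage latency.", Buckets: prometheus.DefBuckets},
		[]string{"stage"},
	)
)

func init() {
	prometheus.MustRegister(messagesProcessed, latencyHistogram)
}

// Example usage inside a worker:
//   start := time.Now()
//   process(msg)
//   messagesProcessed.WithLabelValues("worker").Inc()
//   latencyHistogram.WithLabelValues("worker").Observe(time.Since(start).Seconds())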
Lessons Learned
1. Partitions Matter
Kafka partitions are the unit of parallelism. Start with 6–12, scale as needed. Each partition maps to one goroutine in our setup.
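If you create topics from code rather than the CLI, the same client library exposes an admin API; the sketch below uses placeholder broker address, topic name, and replication factor.

import (
	"context"
	"time"

	"github.com/confluentinc/confluent-kafka-go/kafka"
)

// Illustrative topic creation with 12 partitions; all values are placeholders.
func createTopic(brokers string) error {
	admin, err := kafka.NewAdminClient(&kafka.ConfigMap{"bootstrap.servers": brokers})
	if err != nil {
		return err
	}
	defer admin.Close()

	ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
	defer cancel()

	_, err = admin.CreateTopics(ctx, []kafka.TopicSpecification{{
		Topic:             "events",
		NumPartitions:     12,
		ReplicationFactor: 3,
	}})
	return err
}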
2. Don't Auto-Commit Blindly
Manual commit ensures we only ack messages we've fully processed.
consumer.CommitMessage(msg) // after success
3. Backpressure Control
Bounded channels and batching helped prevent goroutine explosion.
4. Memory Pressure
High-throughput means buffering; avoid large allocations per message. Reuse buffers if possible.
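One standard way to do that (shown here as an illustrative sketch, not our exact code) is a sync.Pool of scratch buffers that workers borrow per message instead of allocating fresh ones:

import (
	"bytes"
	"sync"
)

// Illustrative buffer pool for per-message scratch space.
var bufPool = sync.Pool{
	New: func() any { return new(bytes.Buffer) },
}

func processWithBuffer(msg *kafka.Message) {
	buf := bufPool.Get().(*bytes.Buffer)
	buf.Reset()
	defer bufPool.Put(buf) // return the buffer for reuse

	// ... decode/transform msg.Value using buf as scratch space ...
	buf.Write(msg.Value)
}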
5. Avoid Global Locks
Lock contention in shared state (e.g., metrics) can throttle throughput.
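One pattern that avoids per-message locking (sketched here in simplified, illustrative form) is to keep hot-path counters in sync/atomic values and publish them to the metrics system from a single goroutine:

import (
	"sync/atomic"
	"time"
)

// Hot path: each processed message does a lock-free atomic increment.
var processed atomic.Int64

func onProcessed() {
	processed.Add(1)
}

// Cold path: one goroutine periodically reads the counter and exports it,
// so workers never contend on a shared mutex.
func publishMetrics(interval time.Duration) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for range ticker.C {
		total := processed.Load()
		_ = total // export `total` to your metrics backend here
	}
}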
Final Architecture Diagram
                 Kafka Topic (12 Partitions)
                              |
      +-----------------------+-----------------------+
      |                       |                       |
+-----------+           +-----------+           +-----------+
| Consumer1 |           | Consumer2 |           | Consumer6 |
+-----------+           +-----------+           +-----------+
      |                       |                       |
      v                       v                       v
+------------+          +------------+          +------------+
| WorkerPool |          | WorkerPool |          | WorkerPool |
+------------+          +------------+          +------------+
      |                       |                       |
      v                       v                       v
  +--------+             +---------+            +-------------+
  | Output |             | Metrics |            | DB / Kafka  |
  +--------+             +---------+            +-------------+
Summary
We didn't throw servers at the problem. We designed for scale with Go's strengths:
- Goroutines made concurrency cheap.
- Channel-based pipelines gave structure.
- Backpressure and batching gave control.
If you're trying to hit 1M messages/sec, Go and Kafka are more than capable — if you design carefully.