When your business depends on processing millions of events per second, building a reliable and performant Kafka consumer-producer pipeline becomes critical. In this article, I'll walk you through the architecture, implementation, and lessons we learned scaling Kafka with Go to consistently handle 1 million messages per second — and how we got there without burning out our team or machines.

This isn't a theoretical guide. This is what we actually built.

Problem Statement

We needed to ingest and process data at >1M messages/sec:

  • Events coming from IoT devices and user interactions
  • High fan-out to multiple downstream services
  • Durable and observable architecture
  • Real-time processing with minimal latency

High-Level Architecture

              +---------------+        +-----------------+
              | Kafka Topic A | ---->  | Go Consumer App |
              +---------------+        +-----------------+
                                                 |
                                                 v
                                        +------------------+
                                        | Worker Pool       |
                                        | (Goroutines)      |
                                        +------------------+
                                                 |
                                                 v
                                        +------------------+
                                        | Output Layer      |
                                        | (Kafka, DB, etc.) |
                                        +------------------+

Key Goals

  • Throughput: Sustain 1M msg/sec with headroom.
  • Backpressure handling: No message drops under load.
  • Crash resilience: Auto-recovery with offset commits.
  • Observability: Track lag, latency, and throughput.

Our Kafka Stack

We used:

  • Apache Kafka 3.x
  • Confluent Go client (github.com/confluentinc/confluent-kafka-go)
  • 6 partitions minimum per topic (more on partition counts below)
  • Kafka Connect for data export
  • Go 1.21+

Code Setup: High Throughput Consumer

import "github.com/confluentinc/confluent-kafka-go/kafka"

func createConsumer(brokers, groupID, topic string) *kafka.Consumer {
    c, err := kafka.NewConsumer(&kafka.ConfigMap{
        "bootstrap.servers":  brokers,
        "group.id":           groupID,
        "auto.offset.reset":  "earliest",
        "enable.auto.commit": false,
        "go.events.channel.enable": true,
        "queued.max.messages.kbytes": 20480, // 20 MB
    })
    if err != nil {
        panic(err)
    }
    err = c.SubscribeTopics([]string{topic}, nil)
    if err != nil {
        panic(err)
    }
    return c
}

Worker Pool for Message Handling

func startWorkerPool(workerCount int, messages <-chan *kafka.Message, wg *sync.WaitGroup) {
    for i := 0; i < workerCount; i++ {
        wg.Add(1)
        go func(id int) {
            defer wg.Done()
            for msg := range messages {
                process(msg) // handle message logic
            }
        }(i)
    }
}
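
In the worker above, process(msg) is just a placeholder. A minimal sketch of what it might look like, assuming JSON payloads and the encoding/json and log packages; the Event type and its fields are illustrative, not our exact schema:

// Event is a hypothetical payload shape; adapt it to your own schema.
type Event struct {
    DeviceID string `json:"device_id"`
    Payload  string `json:"payload"`
}

func process(msg *kafka.Message) {
    var ev Event
    if err := json.Unmarshal(msg.Value, &ev); err != nil {
        // Bad messages are logged and skipped so one poison pill
        // doesn't stall the whole partition.
        log.Printf("decode error at offset %v: %v", msg.TopicPartition.Offset, err)
        return
    }
    // ... business logic, then hand the result to the output layer ...
}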

Dispatcher Pattern

We separate consumption from processing:

func dispatchMessages(c *kafka.Consumer, out chan<- *kafka.Message) {
    for ev := range c.Events() {
        switch e := ev.(type) {
        case *kafka.Message:
            out <- e
        case kafka.Error:
            log.Printf("Kafka error: %v\n", e)
        }
    }
}
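
Wired together, the pipeline is a short main function. The sketch below is illustrative: the broker address, group ID, topic, worker count, channel capacity, and the signal-based shutdown are placeholder choices, and it assumes the os/signal, sync, and syscall packages are imported:

func main() {
    c := createConsumer("localhost:9092", "ingest-group", "events")

    // Bounded channel between dispatcher and workers: when workers fall
    // behind, sends block here instead of buffering without limit.
    messages := make(chan *kafka.Message, 10000)

    var wg sync.WaitGroup
    startWorkerPool(64, messages, &wg)

    // Closing the consumer ends its Events() channel, which lets
    // dispatchMessages return so we can drain and exit cleanly.
    sig := make(chan os.Signal, 1)
    signal.Notify(sig, syscall.SIGINT, syscall.SIGTERM)
    go func() {
        <-sig
        c.Close()
    }()

    dispatchMessages(c, messages) // blocks until the consumer is closed
    close(messages)               // let workers finish what's buffered
    wg.Wait()
}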

Batching for Output Efficiency

When writing to downstream services (like another Kafka topic or DB), we batch messages:

func batchWriter(input <-chan *kafka.Message) {
    batch := make([]*kafka.Message, 0, 1000)
    ticker := time.NewTicker(10 * time.Millisecond)
    defer ticker.Stop()
    for {
        select {
        case msg, ok := <-input:
            if !ok { // input closed: flush what's left and exit
                if len(batch) > 0 {
                    flush(batch)
                }
                return
            }
            batch = append(batch, msg)
            if len(batch) >= 1000 {
                flush(batch) // flush must complete (or copy) before the slice is reused
                batch = batch[:0]
            }
        case <-ticker.C:
            if len(batch) > 0 {
                flush(batch)
                batch = batch[:0]
            }
        }
    }
}
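
flush is deliberately left abstract. If the downstream happens to be another Kafka topic, one possible implementation uses the confluent-kafka-go producer; the producer variable, topic name, and flush timeout below are assumptions for illustration:

var producer *kafka.Producer // created once at startup with kafka.NewProducer(...)

func flush(batch []*kafka.Message) {
    outTopic := "events-out" // hypothetical downstream topic
    for _, msg := range batch {
        // Produce is asynchronous; librdkafka batches these internally.
        err := producer.Produce(&kafka.Message{
            TopicPartition: kafka.TopicPartition{Topic: &outTopic, Partition: kafka.PartitionAny},
            Key:            msg.Key,
            Value:          msg.Value,
        }, nil)
        if err != nil {
            log.Printf("produce error: %v", err)
        }
    }
    // Wait (up to 1s) for in-flight deliveries before the caller reuses the batch slice.
    producer.Flush(1000)
}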

Performance Benchmarks

Hardware:

  • 8 vCPUs
  • 32 GB RAM
  • 1Gbps NIC
  • Kafka and Go app on separate VMs

Settings:

  • 12 partitions
  • 6 consumers
  • GOMAXPROCS = runtime.NumCPU() (the Go default since 1.5)

| Setup                  | Messages/sec | Avg Latency | CPU Usage |
| ---------------------- | ------------ | ----------- | --------- |
| Single Consumer        | ~150K        | ~12ms       | ~30%      |
| 6 Consumer Goroutines  | ~950K        | ~6ms        | ~85%      |
| Worker Pool + Batching | **1.2M**     | **<3ms**    | ~90%      |

Observability: Monitoring Lag and Throughput

We used:

  • Prometheus + Grafana
  • consumer_lag via Kafka exporter
  • runtime.NumGoroutine()
  • Custom histograms for per-stage latency

messagesProcessed := prometheus.NewCounterVec(...)
latencyHistogram := prometheus.NewHistogramVec(...)
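
A fuller version of those definitions might look like the sketch below; metric names, label sets, and buckets are illustrative, and it assumes github.com/prometheus/client_golang/prometheus:

var (
    messagesProcessed = prometheus.NewCounterVec(
        prometheus.CounterOpts{
            Name: "pipeline_messages_processed_total",
            Help: "Messages processed, by worker and outcome.",
        },
        []string{"worker", "outcome"},
    )
    latencyHistogram = prometheus.NewHistogramVec(
        prometheus.HistogramOpts{
            Name:    "pipeline_stage_latency_seconds",
            Help:    "Per-stage processing latency in seconds.",
            Buckets: prometheus.ExponentialBuckets(0.0005, 2, 12), // 0.5ms up to ~1s
        },
        []string{"stage"},
    )
)

func init() {
    prometheus.MustRegister(messagesProcessed, latencyHistogram)
}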

Lessons Learned

1. Partitions Matter

Kafka partitions are the unit of parallelism. Start with 6–12, scale as needed. Each partition maps to one goroutine in our setup.

2. Don't Auto-Commit Blindly

Manual commit ensures we only ack messages we've fully processed.

consumer.CommitMessage(msg) // after success
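
One way to wire this into the worker path, with handle standing in for the business logic (a sketch, not our exact code; committing every message is the simplest correct option, and batched commits are a common follow-up optimization):

func processAndCommit(c *kafka.Consumer, msg *kafka.Message) {
    if err := handle(msg); err != nil {
        // Do not commit: the message will be redelivered after a restart or rebalance.
        log.Printf("processing failed at offset %v: %v", msg.TopicPartition.Offset, err)
        return
    }
    if _, err := c.CommitMessage(msg); err != nil {
        log.Printf("commit failed: %v", err)
    }
}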

3. Backpressure Control

Bounded (buffered) channels between the dispatcher and the workers, plus output batching, helped prevent goroutine explosion: when workers fall behind, sends on the channel block instead of piling up unbounded work in memory.

4. Memory Pressure

High-throughput means buffering; avoid large allocations per message. Reuse buffers if possible.
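
One common way to do that is a sync.Pool of scratch buffers used during decode or re-encode, instead of allocating a fresh slice per message. A sketch, with an arbitrary 4 KB starting capacity:

var scratchPool = sync.Pool{
    New: func() any {
        b := make([]byte, 0, 4096)
        return &b
    },
}

func withScratch(msg *kafka.Message) {
    bp := scratchPool.Get().(*[]byte)
    buf := (*bp)[:0]

    // Use buf as temporary working space for this message.
    buf = append(buf, msg.Value...)
    // ... transform buf ...

    *bp = buf // keep any growth for the next borrower
    scratchPool.Put(bp)
}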

5. Avoid Global Locks

Lock contention in shared state (e.g., metrics) can throttle throughput.
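
One way to avoid it: give each worker its own counters via sync/atomic and sum them periodically in a reporter goroutine. A sketch (the slice size would match the worker count; assumes sync/atomic):

type workerStats struct {
    processed atomic.Uint64
    failed    atomic.Uint64
}

// One entry per worker: the hot path only touches its own counters,
// so there is no shared lock to contend on.
var stats = make([]workerStats, 64)

func recordSuccess(workerID int) {
    stats[workerID].processed.Add(1)
}

func totalProcessed() uint64 {
    var sum uint64
    for i := range stats {
        sum += stats[i].processed.Load()
    }
    return sum
}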

Final Architecture Diagram

                    Kafka Topic (12 Partitions)
                                 |
     +---------------------------+---------------------------+
     |                           |                           |
+-----------+            +-----------+              +-----------+
| Consumer1 |            | Consumer2 |              | Consumer6 |
+-----------+            +-----------+              +-----------+
     |                         |                          |
     v                         v                          v
+------------+         +------------+            +-------------+
| WorkerPool |         | WorkerPool |            | WorkerPool  |
+------------+         +------------+            +-------------+
     |                         |                          |
     v                         v                          v
  +--------+              +---------+              +-------------+
  | Output |              | Metrics |              | DB / Kafka  |
  +--------+              +---------+              +-------------+

Summary

We didn't throw servers at the problem. We designed for scale with Go's strengths:

  • Goroutines made concurrency cheap.
  • Channel-based pipelines gave structure.
  • Backpressure and batching gave control.

If you're trying to hit 1M messages/sec, Go and Kafka are more than capable — if you design carefully.