Traffic just… doubled.

And within minutes, a system that had been "stable for months" started falling apart in ways we didn't expect.

This is a real postmortem-style story: not a benchmark flex, not a blame game. Just what actually broke first, and what we learned the hard way.

🚨 The Symptom (What Everyone Noticed)

At first, the alert looked harmless:

  • p95 latency: 300ms → 1.8s
  • Error rate: still low
  • CPU: under 40%
  • Memory: fine

The dashboards didn't scream outage.

But users were already complaining:

  • "App feels slow"
  • "Requests randomly hang"
  • "Refreshing sometimes helps"

That's when you know something ugly is coming.

πŸ” What We Checked First (And Why It Misled Us)

Like most teams, we checked the usual suspects:

  • ❌ CPU: fine
  • ❌ Memory: fine
  • ❌ GC: normal
  • ❌ Redis: healthy
  • ❌ Database CPU: surprisingly low

Everything looked… calm.

And that was the first red flag.

💣 What Actually Broke First: Threads

Not the database. Not Redis. Not autoscaling.

Threads.

More specifically: 👉 Threads waiting on other threads.

We took a thread dump.
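
(If you've never pulled one: both commands below ship with the JDK, no agent needed. <pid> is your Java process id.)

jstack -l <pid> > threads.txt
# or: jcmd <pid> Thread.print > threads.txt
# rough count of parked threads:
grep -c "java.lang.Thread.State: WAITING" threads.txt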

🧵 The Thread Dump That Changed Everything

Here's a simplified version of what we saw:

"http-nio-8080-exec-47"
  WAITING on java.util.concurrent.FutureTask
"http-nio-8080-exec-52"
  WAITING on org.springframework.jdbc.datasource.DataSourceUtils
"http-nio-8080-exec-61"
  WAITING on com.zaxxer.hikari.pool.HikariPool

Hundreds of threads. All alive. All waiting. None doing useful work.

The app wasn't slow.

👉 It was stuck.

🧠 The Real Root Cause (Not Obvious at All)

Traffic doubled → More concurrent requests → More DB calls → Connection pool exhausted → Threads start waiting → Waiting threads block other work → Latency explodes.

And here's the twist:

❗ The database was NOT overloaded

The database could handle more queries.

But HikariCP could not hand out connections fast enough under the new concurrency pattern.
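
To see how "database fine" and "nobody gets a connection" can both be true, here's a toy model of the crunch. A Semaphore stands in for the pool; this is not Hikari's internals, just the shape of the queueing:

import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;

class PoolCrunchDemo {
    // 10 permits standing in for a 10-connection pool
    static final Semaphore pool = new Semaphore(10);

    static void handleRequest() throws InterruptedException {
        // like Hikari's connection-timeout: wait briefly, then give up
        if (!pool.tryAcquire(2, TimeUnit.SECONDS)) {
            throw new IllegalStateException("no connection available");
        }
        try {
            Thread.sleep(150); // stand-in for the three sequential queries
        } finally {
            pool.release();
        }
    }
}

Push enough concurrent callers through handleRequest() and the "database" stays idle while callers pile up at tryAcquire. That is exactly what our thread dump showed.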

⚠️ The Code That Looked Innocent

public OrderSummary getSummary(Long userId) {
    // each call borrows a connection from the pool, blocks until the
    // query returns, then hands it back: three round trips in a row
    User user = userRepo.find(userId);
    Orders orders = orderRepo.findByUser(userId);
    Payments payments = paymentRepo.findByUser(userId);
    return build(user, orders, payments);
}

Three queries. Sequential. Blocking.

At normal traffic → fine. At double traffic → thread starvation.
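
The back-of-envelope math makes the cliff visible (round numbers, purely illustrative, not our production figures):

3 queries × 50 ms        ≈ 150 ms of pool time per request
10 connections ÷ 0.15 s  ≈ 66 requests/s before the pool saturates

Past that point, extra requests don't fail. They queue.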

🔥 Why Autoscaling Made It Worse

Here's the cruel part.

Kubernetes saw latency → Scaled pods → Each pod created its own connection pool → Total DB connections exploded → More contention → Even slower.

Autoscaling didn't save us.

👉 It amplified the failure.
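
The multiplication is the killer (illustrative numbers again):

4 pods  × 10 connections per pod =  40 DB connections
12 pods × 10 connections per pod = 120 DB connections, same database

Per-pod pools scale with your pod count. The database's connection limit and lock contention do not.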

πŸ› οΈ The First Fix (That Actually Helped)

We didn't touch the database.

We changed timeouts.

spring:
  datasource:
    hikari:
      maximum-pool-size: 10    # per pod, not per cluster
      connection-timeout: 2000 # ms; the default is 30000, i.e. 30s of silent queueing

Why?

Because fast failure beats slow death.

Threads now failed quickly instead of waiting forever.

Latency stabilized almost immediately.
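
If you haven't seen fail-fast in action: when the 2s budget runs out, Hikari throws a SQLTransientConnectionException with a message along the lines of "Connection is not available, request timed out after 2000ms". That error is a feature here. It turns invisible queueing into something you can count and alert on.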

🧪 The Second Fix: Parallelism (Carefully)

We stopped doing everything sequentially.

CompletableFuture<User> user =
    CompletableFuture.supplyAsync(() -> userRepo.find(userId), executor);
CompletableFuture<Orders> orders =
    CompletableFuture.supplyAsync(() -> orderRepo.findByUser(userId), executor);
// both queries run at once; join() waits for the slower of the two,
// after which user.join() and orders.join() return immediately
CompletableFuture.allOf(user, orders).join();
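
The executor in that snippet matters as much as the futures. Without one, supplyAsync runs on the shared ForkJoinPool.commonPool(), which is exactly where you don't want blocking JDBC calls. A minimal sketch of what we mean, with illustrative sizing:

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

class DbExecutors {
    // dedicated, bounded pool for blocking DB work;
    // keep it smaller than the connection pool or the starvation just moves
    static final ExecutorService executor = Executors.newFixedThreadPool(8);
}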

Not everywhere. Only where it mattered.

📉 The Metric That Finally Told the Truth

Not CPU. Not memory.

👉 Active threads vs waiting threads.

Once we graphed:

  • Active HTTP threads
  • Hikari waiting threads

The problem became obvious in minutes.
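
If you're on Spring Boot with Actuator and Micrometer, these gauges most likely already exist; we just weren't graphing them:

hikaricp.connections.active    connections currently in use
hikaricp.connections.pending   threads waiting for a connection
tomcat.threads.busy            HTTP worker threads actually working

When pending climbs while database CPU stays flat, you're looking at this exact failure.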

🧠 What We Learned (The Hard Way)

  1. Latency increases before errors
  2. Thread starvation looks like "random slowness"
  3. Connection pools fail silently
  4. Autoscaling hides architectural limits
  5. The first bottleneck is rarely the loudest one

🧯 The One Rule We Follow Now

If traffic doubles, something must fail fast, not wait.

Timeouts are not a failure. Waiting forever is.
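
Concretely, "fail fast" for us is a handful of explicit settings rather than defaults (values illustrative; tune for your stack):

spring:
  datasource:
    hikari:
      connection-timeout: 2000   # ms to wait for a connection
      validation-timeout: 1000   # ms to validate a borrowed connection
      max-lifetime: 600000       # ms before a connection is retired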

✋ Final Thought

When traffic doubled, nothing "crashed."

The app politely waited itself to death.

That's how most real production failures happen: quietly, slowly, and confusingly.

If your system only works when everything is calm, it's not resilient. It's just lucky.