Traffic just… doubled.
And within minutes, a system that had been "stable for months" started falling apart in ways we didn't expect.
This is a real postmortem-style story: not a benchmark flex, not a blame game. Just what actually broke first, and what we learned the hard way.
🚨 The Symptom (What Everyone Noticed)
At first, the alert looked harmless:
- p95 latency: 300ms → 1.8s
- Error rate: still low
- CPU: under 40%
- Memory: fine
The dashboards didn't scream outage.
But users were already complaining:
- "App feels slow"
- "Requests randomly hang"
- "Refreshing sometimes helps"
That's when you know something ugly is coming.
🔍 What We Checked First (And Why It Misled Us)
Like most teams, we checked the usual suspects:
- ✅ CPU → fine
- ✅ Memory → fine
- ✅ GC → normal
- ✅ Redis → healthy
- ✅ Database CPU → surprisingly low
Everything looked… calm.
And that was the first red flag.
💣 What Actually Broke First: Threads
Not the database. Not Redis. Not autoscaling.
Threads.
More specifically: 👉 Threads waiting on other threads.
We took a thread dump.
🧵 The Thread Dump That Changed Everything
Here's a simplified version of what we saw:
"http-nio-8080-exec-47"
WAITING on java.util.concurrent.FutureTask
"http-nio-8080-exec-52"
WAITING on org.springframework.jdbc.datasource.DataSourceUtils
"http-nio-8080-exec-61"
WAITING on com.zaxxer.hikari.pool.HikariPool

Hundreds of threads. All alive. All waiting. None doing useful work.
The app wasn't slow.
👉 It was stuck.
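If you want to spot this state without wading through a full dump, the JVM will count it for you. Here's a minimal, illustrative sketch that groups live threads by state using the standard ThreadMXBean API; it has to run inside the same JVM, e.g. behind an internal debug endpoint:

import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.util.Arrays;
import java.util.Map;
import java.util.stream.Collectors;

// Illustrative helper: returns how many threads are in each state right now.
public class ThreadStateSnapshot {
    public static Map<Thread.State, Long> countByState() {
        ThreadInfo[] dump = ManagementFactory.getThreadMXBean().dumpAllThreads(false, false);
        // A healthy service is mostly RUNNABLE; a starved one is a wall of WAITING / TIMED_WAITING.
        return Arrays.stream(dump)
                .collect(Collectors.groupingBy(ThreadInfo::getThreadState, Collectors.counting()));
    }
}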
🧠 The Real Root Cause (Not Obvious at All)
Traffic doubled → More concurrent requests → More DB calls → Connection pool exhausted → Threads start waiting → Waiting threads block other work → Latency explodes.
And here's the twist:
The database was NOT overloaded.
The database could handle more queries.
But HikariCP could not hand out connections fast enough under the new concurrency pattern.
⚠️ The Code That Looked Innocent
public OrderSummary getSummary(Long userId) {
    // each call below blocks the HTTP request thread until the DB responds
    User user = userRepo.find(userId);
    Orders orders = orderRepo.findByUser(userId);
    Payments payments = paymentRepo.findByUser(userId);
    return build(user, orders, payments);
}

Three queries. Sequential. Blocking.
At normal traffic → fine. At double traffic → thread starvation.
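To see why, here's the back-of-the-envelope math with deliberately made-up numbers (ours differed, but the shape is the same): with a pool of 10 connections, no surrounding transaction, and each query taking ~50ms, every request borrows a connection three times for ~50ms each. The pool can hand out at most 10 / 0.05 = 200 connection leases per second, which works out to roughly 66 of these three-query requests per second. Push concurrency past that ceiling and every extra request just queues on the pool, which is exactly the wall of WAITING threads in the dump.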
🔥 Why Autoscaling Made It Worse
Here's the cruel part.
Kubernetes saw latency → Scaled pods → Each pod created its own connection pool → Total DB connections exploded → More contention → Even slower.
Autoscaling didn't save us.
👉 It amplified the failure.
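To put hypothetical numbers on the multiplication (again, not our real figures): 4 pods with a pool of 20 connections each means 80 connections at the database. Scale to 12 pods and you're suddenly holding 240 connections, most of them owned by requests that are themselves stuck. The database spends more effort juggling connections for the same useful work, and no individual pod's pool gets any healthier.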
🛠️ The First Fix (That Actually Helped)
We didn't touch the database.
We changed timeouts.
spring:
  datasource:
    hikari:
      maximum-pool-size: 10
      connection-timeout: 2000

Why?
Because fast failure beats slow death.
Threads now failed quickly instead of waiting forever.
Latency stabilized almost immediately.
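For context, connection-timeout is in milliseconds, so 2000 means a thread waits at most two seconds for a connection before HikariCP gives up with an exception; the default is 30 seconds, which is exactly the "waiting forever" behaviour. One way to turn that exception into a clean response is an edge handler like the sketch below (illustrative, not necessarily our exact code; with JPA the wrapped exception type can differ):

import org.springframework.http.HttpStatus;
import org.springframework.http.ResponseEntity;
import org.springframework.jdbc.CannotGetJdbcConnectionException;
import org.springframework.web.bind.annotation.ExceptionHandler;
import org.springframework.web.bind.annotation.RestControllerAdvice;

// Sketch: when the pool times out, answer 503 immediately instead of letting
// the request thread hang. Spring's JDBC layer typically surfaces Hikari's
// timeout as CannotGetJdbcConnectionException.
@RestControllerAdvice
public class PoolTimeoutHandler {

    @ExceptionHandler(CannotGetJdbcConnectionException.class)
    public ResponseEntity<String> poolExhausted(CannotGetJdbcConnectionException ex) {
        // Fast, explicit failure: the client can retry, and the thread is free again.
        return ResponseEntity.status(HttpStatus.SERVICE_UNAVAILABLE)
                .body("Temporarily overloaded, please retry");
    }
}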
🧪 The Second Fix: Parallelism (Carefully)
We stopped doing everything sequentially.
CompletableFuture<User> user =
    CompletableFuture.supplyAsync(() -> userRepo.find(id), executor);
CompletableFuture<Orders> orders =
    CompletableFuture.supplyAsync(() -> orderRepo.findByUser(id), executor);

// wait for both; each lookup runs on the dedicated executor, not on the HTTP thread
CompletableFuture.allOf(user, orders).join();

Not everywhere. Only where it mattered.
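The "carefully" part is mostly about that executor. Fanning out onto an unbounded pool just moves the starvation somewhere else, so the async lookups belong on a small, dedicated pool sized with the connection pool in mind. A sketch of what that can look like (bean name and size are illustrative, not our exact configuration):

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;

@Configuration
public class AsyncLookupConfig {

    // Bounded on purpose: the parallel calls still compete for the same Hikari
    // connections, so this pool should stay in the same ballpark as maximum-pool-size.
    @Bean(destroyMethod = "shutdown")
    public ExecutorService dbLookupExecutor() {
        return Executors.newFixedThreadPool(8);
    }
}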
📊 The Metric That Finally Told the Truth
Not CPU. Not memory.
👉 Active threads vs waiting threads.
Once we graphed:
- Active HTTP threads
- Hikari waiting threads
The problem became obvious in minutes.
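If you're on Spring Boot with Micrometer, the hikaricp.connections.* gauges already give you this picture. If not, HikariCP will tell you directly; a minimal sketch, assuming the Spring-managed DataSource can be unwrapped to a HikariDataSource:

import com.zaxxer.hikari.HikariDataSource;
import com.zaxxer.hikari.HikariPoolMXBean;

// Sketch: log pool pressure on demand (or on a schedule).
public class PoolPressureLogger {

    private final HikariPoolMXBean pool;

    public PoolPressureLogger(HikariDataSource dataSource) {
        this.pool = dataSource.getHikariPoolMXBean();
    }

    public void logOnce() {
        // "waiting" climbing while "active" sits pinned at maximum-pool-size
        // is the smoking gun this whole post is about.
        System.out.printf("active=%d idle=%d waiting=%d%n",
                pool.getActiveConnections(),
                pool.getIdleConnections(),
                pool.getThreadsAwaitingConnection());
    }
}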
🧠 What We Learned (The Hard Way)
- Latency increases before errors
- Thread starvation looks like "random slowness"
- Connection pools fail silently
- Autoscaling hides architectural limits
- The first bottleneck is rarely the loudest one
🧯 The One Rule We Follow Now
If traffic doubles, something must fail fast, not wait.
Timeouts are not a failure. Waiting forever is.
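The same idea extends beyond the pool: anywhere a request can block, it needs a deadline. As one more illustration (not part of this incident's fix, and the two-second value is made up), Spring can cap how long a transaction is allowed to run, so a slow unit of work gets aborted instead of quietly holding a thread and a connection:

import org.springframework.stereotype.Service;
import org.springframework.transaction.annotation.Transactional;

@Service
public class SummaryWithDeadline {

    // Illustrative: a 2-second budget for the whole unit of work. If it blows
    // past that, Spring aborts the transaction instead of letting the thread
    // and its connection wait indefinitely.
    @Transactional(timeout = 2)
    public void loadSummary(Long userId) {
        // the user / orders / payments lookups from earlier would go here
    }
}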
✅ Final Thought
When traffic doubled, nothing "crashed."
The app politely waited itself to death.
That's how most real production failures happen: quietly, slowly, and confusingly.
If your system only works when everything is calm, it's not resilient. It's just lucky.