The download stopped midway. When I refreshed the page, it didn't start from zero — it resumed.

That small moment made me pause.

How does this actually work? And more importantly — how would I design something similar for large file uploads to S3?

I went through AWS documentation, multiple blog posts, and different architectural approaches. What follows is a compiled mental model of how resumable uploads can be designed using pre-signed URLs, chunking, fingerprinting, and a "trust but verify" approach — intentionally without relying on S3 Multipart Uploads or S3 Notifications.

This is not a tutorial, and it's not a production claim. It's how I reasoned through the problem and connected the dots.

The Problem I Was Trying to Understand

I wanted a solution that could handle:

  • Large file uploads
  • Unreliable networks
  • Browser refreshes or crashes
  • Resume uploads without re-uploading data
  • Minimal backend bandwidth usage

A single PUT upload clearly doesn't survive these conditions.
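
For context, that baseline looks roughly like the sketch below: the backend signs one pre-signed PUT URL and the browser uploads the entire file in a single request. This is only a minimal sketch with the AWS SDK for JavaScript v3; the bucket name, region, and expiry are placeholders I chose for illustration.

```typescript
import { S3Client, PutObjectCommand } from "@aws-sdk/client-s3";
import { getSignedUrl } from "@aws-sdk/s3-request-presigner";

// Backend side: sign a single PUT URL for the whole object.
// Bucket name, region, and expiry are placeholders.
const s3 = new S3Client({ region: "us-east-1" });

export async function signWholeFileUpload(key: string): Promise<string> {
  return getSignedUrl(
    s3,
    new PutObjectCommand({ Bucket: "example-upload-bucket", Key: key }),
    { expiresIn: 3600 } // URL stays valid for one hour
  );
}

// Browser side: one big PUT. If the connection drops at 95%, everything restarts.
export async function uploadWholeFile(url: string, file: File): Promise<void> {
  const res = await fetch(url, { method: "PUT", body: file });
  if (!res.ok) throw new Error(`Upload failed with status ${res.status}`);
}
```

There is no partial progress to recover here, which is exactly the limitation the rest of this post reasons around.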

The Obvious AWS Way

The first thing AWS documentation points to is:

  • S3 Multipart Upload
  • S3 Event Notifications

On paper, this looks ideal:

  • Native support
  • Highly scalable
  • Battle-tested

So naturally, the next question was — do I really need all of this?

Why I Looked Beyond S3 Multipart Upload

To be clear: S3 Multipart Upload is not bad.

It's actually a great solution.

But while reading through it, I realized it introduces:

  • Multipart lifecycle management
  • Handling abandoned uploads
  • AWS-side upload state
  • Cleanup responsibilities
  • Less flexibility in defining custom verification logic

For my understanding, I wanted something more explicit and application-controlled.

So I explored an alternative mental model.

The Mental Model (Flow → Chunk → Verify → Resume)

Instead of thinking in terms of AWS features, I simplified the problem:

1. Flow: The client asks the backend, "I want to upload this file. Where do I begin?"

2. Chunk: The file is split into fixed-size chunks (for example, 5 MB). Each chunk is uploaded independently using a pre-signed URL (a rough client-side sketch follows the diagram below).

3. Verify: After each upload, the backend checks:

  • Does the chunk exist in S3?
  • Does its size or checksum match expectations?

4. Resume: If anything fails, the client asks, "Which chunks are already valid?" and continues from there.

Client
  ↓
Split file into chunks
  ↓
Upload chunk via pre-signed URL
  ↓
Backend verifies chunk in S3
  ↓
Mark chunk as valid
  ↓
Resume from last verified chunk

Reducing it to these four steps made the entire problem much easier to reason about.
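
To make the client side concrete, here is a rough TypeScript sketch of steps 1-3 for a single chunk. The endpoint paths (`/uploads/:id/chunks/:index/url` and `.../complete`), the 5 MB chunk size, and the SHA-256 fingerprint are all my own assumptions for illustration, not part of any prescribed API.

```typescript
const CHUNK_SIZE = 5 * 1024 * 1024; // 5 MB, as in the example above

// Fingerprint a chunk so the backend can later verify it (SHA-256 hex is my assumption).
async function fingerprint(chunk: Blob): Promise<string> {
  const digest = await crypto.subtle.digest("SHA-256", await chunk.arrayBuffer());
  return Array.from(new Uint8Array(digest))
    .map((b) => b.toString(16).padStart(2, "0"))
    .join("");
}

// Flow + Chunk + Verify for one chunk, against hypothetical backend endpoints.
export async function uploadSingleChunk(
  uploadId: string,
  file: File,
  index: number
): Promise<void> {
  const chunk = file.slice(index * CHUNK_SIZE, (index + 1) * CHUNK_SIZE);

  // Flow: ask the backend for a pre-signed PUT URL for this chunk.
  const { url } = await (
    await fetch(`/uploads/${uploadId}/chunks/${index}/url`, { method: "POST" })
  ).json();

  // Chunk: upload this piece straight to S3 via the pre-signed URL.
  const put = await fetch(url, { method: "PUT", body: chunk });
  if (!put.ok) throw new Error(`Chunk ${index} failed with status ${put.status}`);

  // Verify: report what we think we uploaded; the backend checks S3 itself.
  await fetch(`/uploads/${uploadId}/chunks/${index}/complete`, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ size: chunk.size, sha256: await fingerprint(chunk) }),
  });
}

// Upload every chunk in order (the Resume variant appears later in the post).
export async function uploadInChunks(uploadId: string, file: File): Promise<void> {
  const totalChunks = Math.ceil(file.size / CHUNK_SIZE);
  for (let index = 0; index < totalChunks; index++) {
    await uploadSingleChunk(uploadId, file, index);
  }
}
```

The missing piece here is Resume, which comes back after the failure-scenario list below.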

Trust, but Verify

One idea that kept coming up while reading was this:

Never blindly trust the client.

Even if the client says "this chunk is uploaded", the backend should verify:

  • Existence in S3
  • Size or checksum

Only then should the chunk be marked as valid.

This single principle explains how resumability can work reliably.
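
Here's a minimal sketch of what that backend check could look like with the AWS SDK for JavaScript v3, where "verify" is simply a HeadObject call on the chunk's key. The ChunkRecord shape and the decision to compare only the size are assumptions I made to keep the example small; a checksum comparison would slot into the same place.

```typescript
import { S3Client, HeadObjectCommand } from "@aws-sdk/client-s3";

const s3 = new S3Client({}); // region and credentials come from the environment

// Hypothetical record the application keeps for each expected chunk.
interface ChunkRecord {
  key: string;          // S3 key the pre-signed URL pointed at
  expectedSize: number; // size the client declared before uploading
  verified: boolean;
}

// Trust, but verify: only mark the chunk valid if S3 itself confirms it.
export async function verifyChunk(bucket: string, chunk: ChunkRecord): Promise<boolean> {
  try {
    const head = await s3.send(
      new HeadObjectCommand({ Bucket: bucket, Key: chunk.key })
    );

    // Existence is implied by a successful HEAD; now check the size.
    if (head.ContentLength !== chunk.expectedSize) return false;

    chunk.verified = true;
    return true;
  } catch {
    // NotFound (or any other failure) means the chunk is not valid yet.
    return false;
  }
}
```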

Failure Scenarios This Model Handles Well

Thinking in terms of verification makes it clear how the system behaves when things go wrong:

  • Browser refresh mid-upload
  • Network drops
  • Partial chunk uploads
  • Duplicate retries
  • Client crashes

Each failure leads to a retry, not a restart.
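
In code terms, resume is just "ask which chunks are already trusted, then run the same per-chunk flow for the rest." This sketch reuses `uploadSingleChunk` and `CHUNK_SIZE` from the earlier client-side example, and the query endpoint is again a hypothetical one:

```typescript
// Ask the backend which chunk indexes it has already verified (hypothetical endpoint).
async function fetchVerifiedChunks(uploadId: string): Promise<Set<number>> {
  const res = await fetch(`/uploads/${uploadId}/chunks?status=verified`);
  const indexes: number[] = await res.json();
  return new Set(indexes);
}

// Retry means: redo only what the backend does not already trust.
export async function resumeUpload(uploadId: string, file: File): Promise<void> {
  const done = await fetchVerifiedChunks(uploadId);
  const totalChunks = Math.ceil(file.size / CHUNK_SIZE);

  for (let index = 0; index < totalChunks; index++) {
    if (done.has(index)) continue; // already uploaded and verified, nothing to redo
    await uploadSingleChunk(uploadId, file, index);
  }
}
```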

Why I Didn't Rely on S3 Notifications (Conceptually)

Instead of relying on asynchronous notifications:

  • Verification happens synchronously
  • Upload state lives in the application
  • Completion becomes deterministic

This reduces hidden behavior and makes reasoning easier.
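
One way to picture "deterministic completion": since every chunk is verified synchronously, the backend can answer "is this upload done?" purely from its own records, without waiting on any event delivery. A tiny sketch, reusing the verified flag from the earlier verification example:

```typescript
// Completion is a question the application can answer from its own state:
// are all expected chunks present and verified against S3?
export function isUploadComplete(
  chunks: { verified: boolean }[],
  expectedCount: number
): boolean {
  return chunks.length === expectedCount && chunks.every((c) => c.verified);
}
```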

When S3 Multipart Upload Is Actually the Right Choice

While reading further, it became clear that Multipart Upload is the right choice when:

  • You want AWS to manage upload state
  • You don't need custom resumability logic
  • Cleanup complexity is acceptable
  • You prefer minimal application logic

In other words: Multipart Upload is excellent — just not always necessary.

Trade-offs of This Mental Model

This approach assumes:

  • More application-side logic
  • Additional state tracking
  • Responsibility for verification and cleanup

But in return, it offers clarity and control, which was my primary goal while learning.

What I Took Away From This

This understanding may not be perfect. There are many valid ways to solve this problem.

But one insight stuck with me:

Sometimes the most useful system isn't the one you blindly adopt — it's the one you can fully explain when something breaks.

If you've approached resumable uploads differently, I'd genuinely love to learn from your perspective.