From Snapshots to Synapses
Over the last few years, "table formats" quietly rewired the lakehouse. We began with a simple wish: make huge, file-backed datasets safe and fast to read. Iceberg answered with atomic snapshots, hidden partitioning, and schema evolution, turning folders of Parquet into a consistent database-like surface.
Then the goalposts moved. Real systems don't just append, they update, upsert, and delete continuously. Hudi and Delta pushed lakes toward streaming semantics: merge-on-read vs copy-on-write, transaction logs, incremental reads, and routine compaction. Paimon went further and asked: what if a lake table behaved like a streaming system by design — LSM-style layout, native CDC ingestion, and configurable merge engines for the "latest state" or full audit?
Now a new idea is landing: maybe metadata shouldn't be a pile of JSON in object storage at all. DuckLake (and cousins in research) flip the model: put metadata in a real database with multi-table ACID, treat the object store as dumb bytes, and let the catalog become the lake's brain.
This article traces that flow of ideas: from batch guarantees to stream-native tables to metadata-first lakes, and offers a practical lens for choosing what fits your workload right now. No vendor cheerleading, just the mental models, trade-offs, and a pick-by-intent playbook you can use on Monday morning. By the end, you'll see why the question isn't "Which format is best?" but
"Which idea do I need today and what will I need next?"
The Iceberg That Sank Hive-tanic

The Hive-era problem: "Tables" were folders of files. Writers copied data into partition paths, readers guessed layout from directory names. At scale that meant race conditions, half-visible loads, and schema changes that shattered downstream jobs.
Iceberg's idea shift: treat the lake like a database for readers. Engines don't scan a directory, they read a snapshot, a tiny piece of metadata that names the exact files for a consistent view. Writes become "prepare files, then atomically flip the pointer." If two writers race, one cleanly loses and retries.

What that unlocked (the big three):
- Atomic snapshots for consistent reads and painless rollbacks/time travel.
- Hidden partitioning & schema/partition evolution (predicates are logical and column IDs survive renames).
- Multi-engine safety so Spark can compact, Flink can ingest, Trino/StarRocks can query against the same snapshot contract.
Quick note on this "big three." Iceberg shipped far more than these, of course. I'm deliberately not turning this into yet another "which lake format to use" post or a killer-feature bake-off. The focus here is the ideas and the progression of thought (how the mental model changed) rather than a list of every feature.
Operational impact: suddenly you can scale out readers aggressively without coordination. Ingestion stopped being scary: prepare → swap, no half-visible data. Path tricks died, so intent moved into table metadata.
A lake table isn't a directory. It's a versioned index over files: a SELECT is a read of a snapshot.
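To make that concrete, here's a minimal PySpark sketch of the snapshot contract. It assumes a Spark session already wired to an Iceberg catalog named lake; the database, table, and snapshot ID are placeholders.

```python
from pyspark.sql import SparkSession

# Assumes the session is configured with an Iceberg catalog named "lake";
# "sales.orders" and the snapshot ID below are placeholders.
spark = SparkSession.builder.appName("iceberg-snapshots").getOrCreate()

# A plain SELECT reads the current snapshot: a consistent, versioned view
# of the table, not whatever files happen to sit in a directory right now.
spark.sql("SELECT count(*) FROM lake.sales.orders").show()

# The snapshot history is itself queryable metadata.
spark.sql(
    "SELECT snapshot_id, committed_at, operation FROM lake.sales.orders.snapshots"
).show()

# Time travel: read the table exactly as it was at an earlier snapshot.
spark.sql(
    "SELECT count(*) FROM lake.sales.orders VERSION AS OF 4925623127156795177"
).show()
```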
New responsibilities:
- Metadata hygiene: expire old snapshots, then compact small files into query-sized chunks (see the sketch after this list).
- Planning cost awareness: on truly enormous tables, cache manifests, keep target file sizes healthy, and partition sensibly.
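Both chores usually run as scheduled jobs. Here's a minimal sketch using Iceberg's Spark maintenance procedures; the catalog and table names are placeholders, and the exact procedure options vary by Iceberg version.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("iceberg-maintenance").getOrCreate()

# Compact small files into query-sized ones (~512 MB target here).
spark.sql("""
    CALL lake.system.rewrite_data_files(
        table => 'sales.orders',
        options => map('target-file-size-bytes', '536870912')
    )
""")

# Expire old snapshots so metadata stays lean and unreferenced files
# can be garbage-collected.
spark.sql("""
    CALL lake.system.expire_snapshots(
        table => 'sales.orders',
        older_than => TIMESTAMP '2024-01-01 00:00:00'
    )
""")
```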
Mental model: Batch analytics on object storage — now with transactions.

The Upsert Strikes Back (Apache Hudi & Delta)
Batch wasn't the whole story. Real systems leak change constantly: orders get updated, users churn, GDPR deletes arrive at 2 a.m. The next wave didn't try to bolt streaming onto the lake, it changed what a "table" meant in the face of continuous updates.
Hudi's lens was pragmatic: choose your poison. With Copy-on-Write, reads stay simple and fast because data files are rewritten on update. With Merge-on-Read, writes are cheap and near-instant, and readers reconcile base files with small delta logs. Either way, the table offers incremental views ("what changed since snapshot N?"), so pipelines can flow without full rescans. Compaction stopped being a janitor job and became part of the table's contract: ingest now, normalize later.
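Here's roughly what that contract looks like from Spark, as a hedged sketch: the option keys follow Hudi's Spark datasource config (check your version), and the paths, keys, and timestamps are placeholders.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hudi-upsert-incremental").getOrCreate()

# Upsert a batch of changed rows into a Merge-on-Read table: writes land as
# cheap delta logs and get reconciled with base files at read/compaction time.
changes = spark.read.parquet("s3://staging/orders_changes/")
(changes.write.format("hudi")
    .option("hoodie.table.name", "orders")
    .option("hoodie.datasource.write.table.type", "MERGE_ON_READ")
    .option("hoodie.datasource.write.operation", "upsert")
    .option("hoodie.datasource.write.recordkey.field", "order_id")
    .option("hoodie.datasource.write.precombine.field", "updated_at")
    .mode("append")
    .save("s3://lake/orders/"))

# Incremental view: "what changed since commit N?" -- no full rescan.
incremental = (spark.read.format("hudi")
    .option("hoodie.datasource.query.type", "incremental")
    .option("hoodie.datasource.read.begin.instanttime", "20240101000000")
    .load("s3://lake/orders/"))
incremental.show()
```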

The trade-offs in practice:
- MOR reads are slower before compaction: readers must merge base and delta files, paying extra I/O until compaction folds deltas into new bases.
- COW writes can cost more: updates rewrite whole files, causing write amplification and extra object-store operations on update-heavy workloads.
Delta's push was conceptual clarity: a transaction log as the single source of truth. Instead of treating metadata as static, the log narrates every change (adds, removes, schema tweaks), so the same table can serve both streaming and batch without a split brain. Data skipping and layout tricks followed, but the idea that stuck was simpler: one table, two tempos.
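"One table, two tempos" in a minimal PySpark sketch; it assumes a Spark session with the Delta Lake package enabled, and the path and version number are placeholders.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("delta-two-tempos").getOrCreate()
path = "s3://lake/orders_delta/"  # placeholder

# Batch tempo: read the table's current state.
batch_df = spark.read.format("delta").load(path)

# Streaming tempo: tail the same table's transaction log as a feed of new
# data -- no separate "streaming copy" of the table required.
stream_df = spark.readStream.format("delta").load(path)

# Time travel: because the log narrates every change, old versions stay queryable.
v42 = spark.read.format("delta").option("versionAsOf", 42).load(path)
```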

_delta_log/ stores the transaction log (JSON table versions); partitions hold the Parquet data files. The log is the source of truth.
What mattered about this era wasn't a checklist of features, it was the permission to think incrementally. ETL turned into continuous ELT. "Upsert into the lake" stopped sounding heretical. And teams began to accept that compaction, retention, and small-file control are not chores but levers: they are how you dial between freshness and read cost.
The mental shift: a lake table isn't just a pile of immutable files, it's a history of intent you can replay or snapshot.
This set the stage for the next step. If updates are perpetual and compaction is policy, why pretend the lake is batch-first at all? The idea waiting in the wings was to build the table as a streaming system by design, not an afterthought — enter Paimon.

The Paimon Impact: LSM Turn
If Hudi and Delta taught lakes to live with change, Paimon asked a bolder question: what if change is the default physics? Instead of pretending we append forever and tidy up later, Paimon designs the table like a streaming system, then lets batch benefit as a side effect.
The big idea is an LSM mindset. New facts land fast as deltas, old facts settle into bases, and lightweight background compaction keeps the surface smooth. Reads don't negotiate with a thousand tiny files, they see a view that's continuously maintained. You pick a merge engine to express intent (deduplicate to the latest state, apply partial updates, or just append for audit), and the table enforces that contract as data flows.
Mental model: the lake table is a materialized view that never stops materializing.
This flips a few habits on their head. Upserts and deletes aren't chores bolted to batch, they're the native verbs. Freshness isn't a gamble, it's budgeted by compaction policy. And "incremental" isn't a special mode, it's the standard: every commit is both a change log you can replay and a state you can query.
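As a sketch of "merge engine as intent" (Flink SQL via PyFlink; assumes the Paimon Flink connector is on the classpath, and the warehouse path, schema, and option values are illustrative):

```python
from pyflink.table import EnvironmentSettings, TableEnvironment

t_env = TableEnvironment.create(EnvironmentSettings.in_streaming_mode())

# Register a Paimon catalog (warehouse path is a placeholder).
t_env.execute_sql("""
    CREATE CATALOG lake WITH (
        'type' = 'paimon',
        'warehouse' = 's3://lake/warehouse'
    )
""")
t_env.execute_sql("USE CATALOG lake")

# A primary-keyed table whose merge engine states the intent: keep the
# latest row per key, and produce a changelog that readers can replay.
t_env.execute_sql("""
    CREATE TABLE IF NOT EXISTS orders_latest (
        order_id   BIGINT,
        status     STRING,
        amount     DECIMAL(10, 2),
        updated_at TIMESTAMP(3),
        PRIMARY KEY (order_id) NOT ENFORCED
    ) WITH (
        'merge-engine' = 'deduplicate',
        'changelog-producer' = 'lookup'
    )
""")
```

Swap 'deduplicate' for 'partial-update' or an aggregation engine and the same table expresses a different contract, without changing the pipelines that feed it.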

There's an honest trade-off: freshness vs. OLAP cost. Push compaction too lazily and query engines will feel the LSM layers; push it too hard and you'll spend more on rewrite work than you gain in latency. Paimon frames that dial explicitly, which is the real innovation. You reason about policy, not post-hoc cleanups.
Operationally, this stream-native stance pairs beautifully with Flink and CDC — ingest changes from Kafka (e.g., Debezium), apply merge semantics, and emit tables that are always "caught up." Batch still works: snapshots, time travel, backfills — but it's no longer the center of gravity.
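Continuing the PyFlink sketch above (same t_env and orders_latest table), a Debezium-formatted Kafka topic can feed that table directly; the connector options shown are the standard Flink Kafka ones, and the topic and broker names are placeholders.

```python
# A changelog source: Debezium CDC events arriving on Kafka.
t_env.execute_sql("""
    CREATE TEMPORARY TABLE orders_cdc (
        order_id   BIGINT,
        status     STRING,
        amount     DECIMAL(10, 2),
        updated_at TIMESTAMP(3)
    ) WITH (
        'connector' = 'kafka',
        'topic' = 'dbserver1.shop.orders',
        'properties.bootstrap.servers' = 'kafka:9092',
        'scan.startup.mode' = 'earliest-offset',
        'format' = 'debezium-json'
    )
""")

# Streaming insert: inserts, updates, and deletes from the changelog are
# applied with the table's merge semantics, not bolted on as batch jobs.
t_env.execute_sql("INSERT INTO orders_latest SELECT * FROM orders_cdc")
```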
And once tables behave like living views, another friction point comes into focus: metadata. If the data plane is streaming and the table's state is always in motion, why is the brain still a pile of manifest files in object storage? The next turn in the story is to move that brain where it belongs, into a database, so multi-table consistency and planning stay fast even as everything flows. Enter DuckLake.

Quack to the Future: DuckLake's Catalog-First Turn
File-based metadata got us far, but at scale it bites. Listing manifests to plan a query, coordinating cross-table changes with ad-hoc jobs, and nursing "multi-table consistency" with runbooks… it all adds drag. If the table is a living view, why is the brain still a pile of JSON in S3?
DuckLake flips the model: put metadata in a database. Treat the catalog as the system of record (transactions, indexes, branches, constraints) and the object store as dumb bytes. A commit becomes a fast, transactional update to rows in a catalog; planning becomes indexed lookups, not recursive scans. Need to update three tables atomically? That's just a multi-table transaction. Need a branch for a backfill? Create it, test, merge, all without manifest storms.
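To give a flavor, here's a tiny sketch against DuckDB's ducklake extension. Treat the exact ATTACH syntax and options as an assumption to check against the current docs; paths and table names are placeholders.

```python
import duckdb

con = duckdb.connect()
con.install_extension("ducklake")
con.load_extension("ducklake")

# The catalog is a real database (a local DuckDB file here; server-backed
# catalogs are also possible); the data path just holds Parquet bytes.
con.sql("ATTACH 'ducklake:metadata.ducklake' AS lake (DATA_PATH 'lake_files/')")

con.sql("CREATE TABLE lake.orders (order_id BIGINT, amount DECIMAL(10, 2))")
con.sql("CREATE TABLE lake.order_events (order_id BIGINT, event VARCHAR)")

# A commit is a transactional update to catalog rows, so changes to several
# tables can share one transaction instead of an ad-hoc runbook.
con.sql("BEGIN TRANSACTION")
con.sql("INSERT INTO lake.orders VALUES (1, 99.90)")
con.sql("INSERT INTO lake.order_events VALUES (1, 'created')")
con.sql("COMMIT")
```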

This isn't a feature checklist so much as a simplification: move coordination where databases excel (concurrency, integrity, history) and let storage focus on throughput. You trade in a new piece of infrastructure (the catalog must be highly available), but you buy clearer failure modes, consistent recovery/rollback, and planning that scales with rows in a catalog rather than objects in a bucket.
Mental model: metadata is the database and files are the heap.
The broader current points the same way: versioned catalogs (branches/tags), multi-table ACID, and policy-driven compaction as first-class catalog concepts. The lake stops acting like a filesystem with extras and starts behaving like a database fronting an object store. Next, we'll turn this into a practical playbook: how to choose by intent today and how to evolve without repainting your whole lake.

Hitchhiker's Guide to DataLakes
The point isn't to crown a winner, but to choose the idea you need right now and know your next move. Formats are just carriers for these ideas.
Choose your ride (intent over brand):
- Need stable, open analytics across engines? Start with Iceberg's snapshot model.
- Need continuous upserts/deletes with low-lag views? Add Paimon/Hudi's incremental/LSM ideas where it matters.
- Need multi-table ACID and fast planning? Move the brain into a catalog database (DuckLake-style).

Formats keep borrowing ideas: Iceberg, for example, supports merge-on-read semantics via delete files (and similar mechanisms), while Hudi/Delta also offer snapshot reads. So think design center, not feature bingo.
Operability dials (the only ones that really matter):
- Compaction policy = your latency vs. cost knob. Spend it where queries live.
- File sizing & partitioning = planning speed. Fewer, right-sized files win.
- Snapshot/retention hygiene = predictable storage and rollback.
- Governance at the catalog = who can change what, and how you undo it.
The answer is "42": don't shop formats, shop mental models.
- Iceberg made snapshots safe
- Hudi/Delta made change continuous
- Paimon made "latest state" native
- DuckLake makes the catalog the brain

Do Catalogs Dream of Electrosheep?
In a world of replicas, reality is whatever the catalog points to. But the moral of the system isn't the pointer — it's the policy. Treat PII erasure, retention, lineage, and reproducibility as first-class, and encode that empathy in metadata, not runbooks. We moved from files to brains; the next leap is giving the brain a conscience.
- Multi-table ACID, everywhere. Open stacks still juggle cross-table consistency across engines. Catalog-first designs are promising, but truly portable multi-table transactions are early days.
- Autopilot compaction. We need workload-aware compaction and file sizing that tunes itself (per table, per hour), balancing freshness vs. cost without human babysitting.
- Format bridges, not rewrites. Seamless moves between Delta ↔ Iceberg ↔ Paimon that preserve lineage, deletes, and history are rarer than they should be.
- Unified streaming semantics. Exactly-once across CDC → lake → OLAP, with standardized deletion vectors and late-event repair that's policy-driven, not ad-hoc.
- Metadata at planetary scale. Planning latency must depend on indexed metadata, not bucket walks. Hierarchical/versioned catalogs need to become the norm.
- Governance as code. Lineage, tags/branches, and access control belong in the catalog, reviewed, diffed, and rolled back like software.
Single thing to remember:
Formats evolve, but ideas persist. Pick the mental model that unblocks you today: snapshots, stream-native tables, or catalog-first, and keep your lake ready for the next idea to land.
Disclaimer:
Features omitted for dramatic effect. If I simplified something you ship in production, that's because nuance doesn't fit in a paragraph. Corrections, memes, and furious footnotes encouraged.
A thank-you to the builders
Huge thanks to the communities behind Apache Iceberg, Apache Hudi, Delta Lake, Apache Paimon, and DuckDB/DuckLake — and to the catalog folks powering this world (Project Nessie, AWS Glue, Hive Metastore, and the various REST catalog implementers). You shipped the primitives (snapshots, manifests, transaction logs, COW/MOR, deletion vectors, LSM compaction, branches/tags), and you keep cross-pollinating them so the rest of us get saner, faster lakes.
Also cheers to the practitioners who file bugs, write docs, publish benchmarks, and share war stories. Ideas move because people do. If I bent or blurred any nuance here, consider it an invitation to correct me. PRs to reality always welcome.
