Apache Iceberg has quickly become the de facto standard for building LakeHouses on AWS. It brings reliability and ACID guarantees to object stores like Amazon S3, enabling multiple engines to work seamlessly on the same data. From Spark and Flink to Athena and Snowflake, Iceberg tables act as a shared contract for analytics and governance.

But not every analytics problem is "big data." In fact, most workloads don't need a large distributed cluster.

That's where DuckDB shines. Often called the "SQLite for analytics," DuckDB is an embedded OLAP database that runs blazingly fast on a single machine. With vectorized execution, efficient caching, and zero server overhead, DuckDB handles surprisingly large datasets interactively. Think: data exploration, prototyping, dashboards, or any workload where the startup overhead of Spark would be overkill.

And instead of wiring all of this up manually with an Amazon S3 bucket and AWS Glue as the catalog, we are going to use Amazon S3 Tables. S3 Tables deliver the first cloud object store with built-in Apache Iceberg support and streamline storing tabular data at scale with automatic maintenance. This means that tools like DuckDB can plug into Iceberg on S3 Tables with just an ARN and instantly interoperate with other engines (Athena, EMR, Spark, Flink, etc.).

In this post, I'll walk you through an end-to-end example of writing data into S3 Tables using DuckDB 1.4.0 with nightly extensions. All you need is:

  • An S3 Tables bucket in your AWS account.
  • A namespace called duckdb on that bucket (created via the AWS console).
  • The ARN of your S3 Tables bucket.

Let's get started.

Open the DuckDB CLI by running:

duckdb

Step 1. Install the Extensions

DuckDB ships with a rich extension system. Since Iceberg support is still stabilising, we'll use nightly builds of the extensions:

FORCE INSTALL aws FROM core_nightly;
FORCE INSTALL httpfs FROM core_nightly;
FORCE INSTALL iceberg FROM core_nightly;
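
If you want to double-check what DuckDB picked up before moving on, the built-in duckdb_extensions() table function lists install and load status. A quick, optional sanity check:

-- List install/load status for the three extensions we just installed
SELECT extension_name, installed, loaded
FROM duckdb_extensions()
WHERE extension_name IN ('aws', 'httpfs', 'iceberg');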

Step 2. Load AWS and Set Up Credentials

We'll use the built-in AWS extension to authenticate via the standard credential chain (env vars, IAM roles, or config files):

LOAD aws;
CREATE SECRET (
      TYPE s3,
      PROVIDER credential_chain
);
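
If the default chain doesn't resolve to the credentials you want (for example, you juggle multiple AWS profiles), you can pin a specific profile explicitly. A hedged variant, assuming a profile named my_profile exists in your AWS config; the secret name is arbitrary:

-- Named secret that reads credentials from a specific profile in your AWS config
CREATE OR REPLACE SECRET s3_tables_secret (
      TYPE s3,
      PROVIDER credential_chain,
      CHAIN 'config',
      PROFILE 'my_profile'
);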

Step 3. Attach the Catalog

Next, attach your S3 Tables-backed Iceberg catalog. Replace <<CUSTOM_ARN>> with the ARN of your S3 Tables bucket:

ATTACH '<<CUSTOM_ARN>>' AS s3_tables (
   TYPE iceberg,
   ENDPOINT_TYPE s3_tables
);

Step 4. Validate the Connection

Check that your catalog is live:

SHOW ALL TABLES;

If your namespace duckdb is configured correctly, you'll see it listed.
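
If the namespace is missing, you may be able to create it straight from DuckDB instead of the console; whether this works depends on your build of the iceberg extension, since namespace creation against remote catalogs is still evolving. A hedged sketch:

-- May not be supported by every iceberg extension build; if this statement
-- errors out, create the namespace in the AWS console as described in the
-- prerequisites
CREATE SCHEMA s3_tables.duckdb;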

Step 5. Create a Table with Real Data

Let's create a table by pulling in some real CSV data from the DuckDB public datasets:

CREATE TABLE s3_tables.duckdb.services AS
      FROM 'https://blobs.duckdb.org/nl-railway/services-2025-03.csv.gz';

This writes an Iceberg table into our S3 Tables bucket directly from DuckDB.
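
Before moving on, it's worth a quick sanity check that the load worked: inspect the schema DuckDB inferred and count the rows. For example:

-- Inspect the column names and types inferred from the CSV
DESCRIBE s3_tables.duckdb.services;

-- Confirm the table actually contains rows
SELECT count(*) AS row_count
FROM s3_tables.duckdb.services;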

Step 6. Explore with the DuckDB UI

DuckDB also ships with a simple web UI for interactive exploration:

CALL start_ui();

Open the provided URL in your browser and start querying your freshly created Iceberg table.

You can create a notebook and issue this query to see the inserted data. Note that you need to change the database to memory for each cell, as shown in the image below (do this for all the cells you run).

(Image: Change database for the cell)
FROM s3_tables.duckdb.services
SELECT
    "Service:RDT-ID",
    "Service:Date",
    "Service:Type",
    "Service:Company",
    "Service:Train number",
    "Service:Completely cancelled",
    "Service:Partly cancelled",
    "Service:Maximum delay",
    "Stop:RDT-ID",
    "Stop:Station code",
    "Stop:Station name",
    "Stop:Arrival time",
    "Stop:Arrival delay",
    "Stop:Arrival cancelled",
    "Stop:Departure time",
    "Stop:Departure delay",
    "Stop:Departure cancelled"
LIMIT 100;

This shows the data we loaded earlier. But we are here to write some data too, so let's create a new table derived from the existing one.

CREATE TABLE s3_tables.duckdb.services_per_month AS
     SELECT
         month("Service:Date") AS month,
         "Stop:Station name" AS station,
         count(*) AS num_services
     FROM s3_tables.duckdb.services
     GROUP BY ALL;

Then we can query our newly created table, which is also available to any Iceberg-compliant engine, for example Amazon Athena, AWS Glue, Amazon EMR, Spark, and Flink.

SELECT * FROM s3_tables.duckdb.services_per_month;
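
You can slice the aggregate however you like from here. For instance, a quick top-10 of the busiest stations:

-- Ten stations with the most services in the loaded month
SELECT station, num_services
FROM s3_tables.duckdb.services_per_month
ORDER BY num_services DESC
LIMIT 10;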

You can explore the column statistics in the DuckDB UI, along with the ability to download query results.

(Image: Results from our newly created table)
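
If you prefer to keep a local copy rather than downloading from the UI, you can also export a result set directly from SQL. For example, to a Parquet file (the filename here is just illustrative):

-- Export the aggregated table to a local Parquet file
COPY (
    SELECT * FROM s3_tables.duckdb.services_per_month
) TO 'services_per_month.parquet' (FORMAT parquet);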

Why This Matters

This workflow demonstrates the sweet spot of DuckDB: quick, local (or remote, if you wish) analytics with the ability to interoperate with your LakeHouse through Iceberg. You don't need to spin up Spark or EMR to publish clean, structured datasets back to S3 or S3 Tables.

  • Speed: DuckDB is optimized for vectorized columnar execution.
  • Simplicity: No servers, no daemons, just a single binary.
  • Interoperability: Iceberg tables you write with DuckDB can be read and extended by Spark, Flink, Athena, and beyond.
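
To make the interoperability point concrete: once your table bucket is integrated with the AWS analytics services (so it appears in the Glue Data Catalog under s3tablescatalog/<bucket-name>), other engines can read the very table we just wrote. A hedged example of what that might look like in Amazon Athena, assuming a table bucket named my-table-bucket:

-- Query the DuckDB-written Iceberg table from Athena (bucket name is illustrative)
SELECT station, num_services
FROM "s3tablescatalog/my-table-bucket"."duckdb"."services_per_month"
ORDER BY num_services DESC
LIMIT 10;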

This makes DuckDB an increasingly important part of the modern analytics ecosystem: not a replacement for Spark, but a perfect companion for "small data" and interactive workloads.