I love experimenting with new tech, but I'm not about to hand over my credit card to some cloud provider and wake up to a $500 bill because I forgot to shut down a service. Been there, done that, learned my lesson.

For production, I absolutely prefer managed services: the scalability, monitoring, and automatic upgrades are worth it. But while I'm still evaluating technology components and learning how they work together, Docker is my go-to solution.

In this post, we're building a modern Data Lakehouse — that is, a system that combines the scalability of data lakes with the ACID guarantees of data warehouses — with Apache Iceberg, Apache Polaris, Trino, and MinIO, all running locally in Docker containers. No cloud accounts, no billing surprises, no "oops I left that cluster running over the weekend" moments.

The Architecture

For this setup, I had three key requirements:

  1. Open-source only. I decided the stack would use only free, open-source software, so you can experiment, break things, switch versions, and iterate without ever worrying about costs spiraling out of control.
  2. No Hive or Hadoop. I was especially curious to explore Apache Polaris, the new Iceberg REST catalog donated by Snowflake, so I decided to integrate Polaris instead of Hive. It's still a young project and the examples are a bit scarce, so getting it to work with the rest of the stack took me some effort.
  3. Local only. Finally, everything had to run locally in Docker, with no need to even think about AWS, Azure, or any other cloud provider accounts.

Here's what the architecture looks like:


Apache Iceberg is a popular open table format that brings modern database features to your data lake: ACID transactions, schema evolution, partition evolution, time travel, tagging and branching.

Apache Polaris is an Iceberg REST catalog. You can think of it as the central brain that keeps track of the Iceberg metadata layer.

Trino is a fast, distributed SQL query engine. It can write to and read from Iceberg tables thanks to its Iceberg connector.

MinIO provides S3-compatible object storage without needing AWS (which is definitely "A Good Thing"®).

Here's the typical interaction:

  1. Trino talks to Polaris when it needs to locate the tables and their latest metadata.
  2. Polaris stores all Iceberg metadata in MinIO and serves it from there too.
  3. Trino writes and reads Iceberg data using MinIO directly (for performance).

Alright, let's get cracking.

Getting This Thing Running

In a new directory, copy the following to a docker-compose.yml file.

services:

  polaris:
    image: apache/polaris:latest
    platform: linux/amd64
    ports:
      - "8181:8181"
      - "8182:8182"
    networks:
      - local-iceberg-lakehouse
    environment:
      AWS_ACCESS_KEY_ID: admin
      AWS_SECRET_ACCESS_KEY: password
      AWS_REGION: dummy-region
      AWS_ENDPOINT_URL_S3: http://minio:9000
      AWS_ENDPOINT_URL_STS: http://minio:9000
      POLARIS_BOOTSTRAP_CREDENTIALS: default-realm,root,secret
      polaris.features.DROP_WITH_PURGE_ENABLED: "true" # allow dropping tables from the SQL client
      polaris.realm-context.realms: default-realm
    healthcheck:
      test: ["CMD", "curl", "http://localhost:8181/healthcheck"]
      interval: 5s
      timeout: 10s
      retries: 5

  trino:
    image: trinodb/trino:latest
    ports:
      - "8080:8080"
    environment:
      - TRINO_JVM_OPTS=-Xmx2G
    networks:
      - local-iceberg-lakehouse
    volumes:
      - ./trino/catalog:/etc/trino/catalog

  minio:
    image: minio/minio:latest
    environment:
      AWS_ACCESS_KEY_ID: admin
      AWS_SECRET_ACCESS_KEY: password
      AWS_REGION: dummy-region
      MINIO_ROOT_USER: admin
      MINIO_ROOT_PASSWORD: password
      MINIO_DOMAIN: minio
    networks:
      local-iceberg-lakehouse:
        aliases:
          - warehouse.minio
    ports:
      - "9001:9001"
      - "9000:9000"
    command: ["server", "/data", "--console-address", ":9001"]

  minio-client:
    image: minio/mc:latest
    depends_on:
      - minio
    networks:
      - local-iceberg-lakehouse
    volumes:
      - /tmp:/tmp
    environment:
      AWS_ACCESS_KEY_ID: admin
      AWS_SECRET_ACCESS_KEY: password
      AWS_REGION: dummy-region
    entrypoint: >
      /bin/sh -c "
      until (mc alias set minio http://minio:9000 admin password) do echo '...waiting...' && sleep 1; done;
      mc rm -r --force minio/warehouse;
      mc mb minio/warehouse;
      mc anonymous set public minio/warehouse;
      tail -f /dev/null
      " 
   
networks:
  local-iceberg-lakehouse:
    name: local-iceberg-lakehouse

Don't start the stack just yet. Before Trino can connect to Polaris, its Iceberg connector needs to be configured. Create a subdirectory for the configuration file:

mkdir -p ./trino/catalog/

To create the iceberg.properties file with the necessary configuration, run the following command in your terminal:

cat << 'EOF' > ./trino/catalog/iceberg.properties
connector.name=iceberg
iceberg.catalog.type=rest
iceberg.rest-catalog.uri=http://polaris:8181/api/catalog/
iceberg.rest-catalog.warehouse=polariscatalog
iceberg.rest-catalog.vended-credentials-enabled=true
iceberg.rest-catalog.security=OAUTH2
iceberg.rest-catalog.oauth2.credential=root:secret
iceberg.rest-catalog.oauth2.scope=PRINCIPAL_ROLE:ALL

# required for Trino to read from/write to S3
fs.native-s3.enabled=true
s3.endpoint=http://minio:9000
s3.region=dummy-region
EOF

That's everything you need. Now let's launch the stack:

docker compose up

You'll have these services running on your machine: Polaris (the Iceberg REST catalog) on ports 8181 and 8182, Trino on port 8080, MinIO on ports 9000 (S3 API) and 9001 (web console), plus a short-lived MinIO client that creates the warehouse bucket.
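
Before moving on, it's worth confirming that everything actually came up. Here's a minimal sanity check, assuming curl is installed on your host; the Polaris endpoint is the same one the compose healthcheck uses, and the Trino and MinIO URLs are their standard info/liveness endpoints:

# List the containers and their state
docker compose ps

# Polaris health endpoint
curl -s http://localhost:8181/healthcheck

# Trino coordinator info (includes a "starting" flag)
curl -s http://localhost:8080/v1/info

# MinIO liveness probe, should print 200
curl -s -o /dev/null -w "%{http_code}\n" http://localhost:9000/minio/health/live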

Setting Up Polaris (don't skip this)

Polaris needs some initial configuration. It's a bit verbose, but you only do this once.

Create an Iceberg catalog

You need to create an Iceberg catalog in Polaris, but to do that, you first need an access token. In a new terminal, run this command:

ACCESS_TOKEN=$(curl -X POST \
  http://localhost:8181/api/catalog/v1/oauth/tokens \
  -d 'grant_type=client_credentials&client_id=root&client_secret=secret&scope=PRINCIPAL_ROLE:ALL' \
  | jq -r '.access_token')
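
The command above assumes jq is installed on your machine. A quick check, just a sketch, catches an empty token early (for example when Polaris isn't up yet or the bootstrap credentials don't match):

# Fails loudly if the token is empty; otherwise prints its first characters
echo "${ACCESS_TOKEN:?token request failed}" | cut -c1-20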

When creating the catalog, you need to tell Polaris where to store the data and how to access it. In our case, everything will be stored in MinIO (note that we're going to use a bogus IAM role and a bogus region).

curl -i -X POST \
  -H "Authorization: Bearer $ACCESS_TOKEN" \
  http://localhost:8181/api/management/v1/catalogs \
  --json '{
    "name": "polariscatalog",
    "type": "INTERNAL",
    "properties": {
      "default-base-location": "s3://warehouse",
      "s3.endpoint": "http://minio:9000",
      "s3.path-style-access": "true",
      "s3.access-key-id": "admin",
      "s3.secret-access-key": "password",
      "s3.region": "dummy-region"
    },
    "storageConfigInfo": {
      "roleArn": "arn:aws:iam::000000000000:role/minio-polaris-role",
      "storageType": "S3",
      "allowedLocations": [
        "s3://warehouse/*"
      ]
    }
  }'

Check that the catalog was correctly created in Polaris:

curl -X GET http://localhost:8181/api/management/v1/catalogs \
  -H "Authorization: Bearer $ACCESS_TOKEN" | jq

The result should look like this:

{
  "catalogs": [
    {
      "type": "INTERNAL",
      "name": "polariscatalog",
      "properties": {
        "s3.path-style-access": "true",
        "s3.access-key-id": "admin",
        "s3.secret-access-key": "password",
        "default-base-location": "s3://warehouse",
        "s3.region": "dummy-region",
        "s3.endpoint": "http://minio:9000"
      },
      "createTimestamp": 1750257800389,
      "lastUpdateTimestamp": 1750257800389,
      "entityVersion": 1,
      "storageConfigInfo": {
        "roleArn": "arn:aws:iam::000000000000:role/minio-polaris-role",
        "externalId": null,
        "userArn": null,
        "region": null,
        "storageType": "S3",
        "allowedLocations": [
          "s3://warehouse/*",
          "s3://warehouse"
        ]
      }
    }
  ]
}

Set Up Permissions

Yeah, security is boring, but you'll thank me later when you're not debugging permission issues.

Polaris has a pretty sophisticated role-based access control model that honestly threw me off at first. But once I got how catalog roles and principal roles work together, I realized it's actually a clean and effective way to enforce least-privilege access without making things too complicated.

If your access token expired, you can create a new one with the command in the previous section.


# Grant the catalog_admin role full control over the catalog's content
curl -X PUT http://localhost:8181/api/management/v1/catalogs/polariscatalog/catalog-roles/catalog_admin/grants \
  -H "Authorization: Bearer $ACCESS_TOKEN" \
  --json '{"grant":{"type":"catalog", "privilege":"CATALOG_MANAGE_CONTENT"}}'

# Create a data engineer role
curl -X POST http://localhost:8181/api/management/v1/principal-roles \
  -H "Authorization: Bearer $ACCESS_TOKEN" \
  --json '{"principalRole":{"name":"data_engineer"}}'

# Assign the catalog_admin catalog role to the data_engineer principal role
curl -X PUT http://localhost:8181/api/management/v1/principal-roles/data_engineer/catalog-roles/polariscatalog \
  -H "Authorization: Bearer $ACCESS_TOKEN" \
  --json '{"catalogRole":{"name":"catalog_admin"}}'

# Give root the data engineer role
curl -X PUT http://localhost:8181/api/management/v1/principals/root/principal-roles \
  -H "Authorization: Bearer $ACCESS_TOKEN" \
  --json '{"principalRole": {"name":"data_engineer"}}'

Check that the role was correctly assigned to the root principal:

curl -X GET http://localhost:8181/api/management/v1/principals/root/principal-roles -H "Authorization: Bearer $ACCESS_TOKEN" | jq

It should return:

{
  "roles": [
    {
      "name": "service_admin",
      "federated": false,
      "properties": {},
      "createTimestamp": 1751733238263,
      "lastUpdateTimestamp": 1751733238263,
      "entityVersion": 1
    },
    {
      "name": "data_engineer",
      "federated": false,
      "properties": {},
      "createTimestamp": 1751733315678,
      "lastUpdateTimestamp": 1751733315678,
      "entityVersion": 1
    }
  ]
}

Everything is now ready, so let's put the lakehouse to work.

Actually Using This Thing

It's time to create an Iceberg table and run some queries. Open a session in Trino, which will connect to the Polaris Iceberg catalog (note the --catalog iceberg part).

docker compose exec -it trino trino --server localhost:8080 --catalog iceberg

In the Trino prompt, create a schema (which maps to a namespace in Polaris), then activate it.

-- Create a schema first (a namespace in Polaris).
CREATE SCHEMA db;

-- Activate the schema
USE db;

Next, create a simple table:

CREATE TABLE customers (
  customer_id BIGINT,
  first_name VARCHAR,
  last_name VARCHAR,
  email VARCHAR
);

Let's insert a few records in that table:

INSERT INTO customers (customer_id, first_name, last_name, email) 
VALUES (1, 'Rey', 'Skywalker', 'rey@resistance.org'),
       (2, 'Hermione', 'Granger', 'hermione@hogwarts.edu'),
       (3, 'Tony', 'Stark', 'tony@starkindustries.com');

When we query the table, the records will show up.

SELECT * FROM customers;

Iceberg is Cool

(did you get it?)

If you're not yet deeply familiar with Apache Iceberg, one of its standout features is time travel. Iceberg maintains a versioned history of your data through snapshots, which are automatically created whenever you insert, update, or delete records. You can view the snapshot history using the following command:

-- Check out your table's history
SELECT snapshot_id, committed_at, summary
FROM "customers$snapshots"
ORDER BY committed_at DESC;

Here's the result on my machine:

   snapshot_id     |        committed_at         |                                                                                                                                                                                                                                                  summary
---------------------+-----------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 9159391083335207972 | 2025-07-05 17:08:37.171 UTC | {engine-version=476, added-data-files=1, total-equality-deletes=0, added-records=3, trino_query_id=20250705_170836_00010_k8me9, total-records=3, changed-partition-count=1, engine-name=trino, total-position-deletes=0, added-files-size=770, total-delete-files=0, iceberg-version=Apache Iceberg 1.9.1 (commit f40208ae6fb2f33e578c2637d3dea1db18739f31), trino_user=trino, total-files-size=770, total-data-files=1}
 1159208684170116500 | 2025-07-05 17:08:34.410 UTC | {changed-partition-count=0, engine-version=476, total-equality-deletes=0, engine-name=trino, trino_query_id=20250705_170834_00009_k8me9, total-position-deletes=0, total-delete-files=0, iceberg-version=Apache Iceberg 1.9.1 (commit f40208ae6fb2f33e578c2637d3dea1db18739f31), trino_user=trino, total-files-size=0, total-records=0, total-data-files=0}
(2 rows)

Let's update Hermione's last name:

UPDATE customers
SET last_name = 'Granger-Weasley'
WHERE customer_id = 2;

If we query the table again, we can see that the record has been updated accordingly:

 customer_id | first_name |    last_name    |          email
-------------+------------+-----------------+--------------------------
           1 | Rey        | Skywalker       | rey@resistance.org
           2 | Hermione   | Granger-Weasley | hermione@hogwarts.edu
           3 | Tony       | Stark           | tony@starkindustries.com

And if you list the snapshots again, you should now have an additional entry:

   snapshot_id     |        committed_at         |                                                                                                                                                                                                                                                  summary
---------------------+-----------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 8439500909258006696 | 2025-07-05 17:26:59.122 UTC | {engine-version=476, added-data-files=1, added-position-deletes=1, total-equality-deletes=0, added-records=1, trino_query_id=20250705_172658_00016_k8me9, added-position-delete-files=1, added-delete-files=1, total-records=4, changed-partition-count=1, engine-name=trino, total-position-deletes=1, added-files-size=1810, total-delete-files=1, iceberg-version=Apache Iceberg 1.9.1 (commit f40208ae6fb2f33e578c2637d3dea1db18739f31), trino_user=trino, total-files-size=2580, total-data-files=2}
 9159391083335207972 | 2025-07-05 17:08:37.171 UTC | {engine-version=476, added-data-files=1, total-equality-deletes=0, added-records=3, trino_query_id=20250705_170836_00010_k8me9, total-records=3, changed-partition-count=1, engine-name=trino, total-position-deletes=0, added-files-size=770, total-delete-files=0, iceberg-version=Apache Iceberg 1.9.1 (commit f40208ae6fb2f33e578c2637d3dea1db18739f31), trino_user=trino, total-files-size=770, total-data-files=1}
 1159208684170116500 | 2025-07-05 17:08:34.410 UTC | {changed-partition-count=0, engine-version=476, total-equality-deletes=0, engine-name=trino, trino_query_id=20250705_170834_00009_k8me9, total-position-deletes=0, total-delete-files=0, iceberg-version=Apache Iceberg 1.9.1 (commit f40208ae6fb2f33e578c2637d3dea1db18739f31), trino_user=trino, total-files-size=0, total-records=0, total-data-files=0}
(3 rows)

Now, we may not have Professor McGonagall's Time-Turner handy, but with Iceberg's AS OF keyword, we can still travel through time (well, at least through our data):

-- Go back in time (use the timestamp from before you updated the record)
SELECT * FROM customers FOR TIMESTAMP AS OF TIMESTAMP '2025-07-05 17:20:00.000 UTC';

The query returns the previous state of the table:

 customer_id | first_name | last_name |          email
-------------+------------+-----------+--------------------------
           1 | Rey        | Skywalker | rey@resistance.org
           2 | Hermione   | Granger   | hermione@hogwarts.edu
           3 | Tony       | Stark     | tony@starkindustries.com
(3 rows)

This opens up a bunch of useful use cases, like handling compliance requests, debugging ingestion issues, or digging into time-based data analysis.
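
Time travel also works by snapshot ID instead of timestamp. Here's a small sketch that reuses the first snapshot ID from the history above (yours will be different) and runs the query non-interactively through the Trino CLI:

docker compose exec trino trino --server localhost:8080 --catalog iceberg --schema db \
  --execute "SELECT * FROM customers FOR VERSION AS OF 9159391083335207972"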

If you are curious, you can check out the various Iceberg files written to the metadata and data layers by pointing your browser to the MinIO console at http://localhost:9001.
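
If you'd rather stay in the terminal, the minio-client container from the compose file already has an alias pointing at MinIO, so a recursive listing of the warehouse bucket shows the same metadata and data files. A minimal sketch:

docker compose exec minio-client mc ls --recursive minio/warehouse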


There are many more powerful features in Iceberg, but don't just take my word for it: go check out how Apache Iceberg can improve your data lakehouse architecture.
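
To give you one more taste: schema evolution, mentioned at the start of this post, is a single statement in Trino and doesn't rewrite any existing data files. A small sketch, with loyalty_tier as a purely illustrative column name:

docker compose exec trino trino --server localhost:8080 --catalog iceberg --schema db \
  --execute "ALTER TABLE customers ADD COLUMN loyalty_tier VARCHAR"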

DuckDB Alternative

If Trino feels like overkill for querying your data, DuckDB can also read your Iceberg tables (but not write).

Since DuckDB isn't a long-running server process, it doesn't need to be included in the Docker Compose file. Instead, use a simple docker run command when needed:

docker run -it --network=local-iceberg-lakehouse datacatering/duckdb:v1.3.0

Once inside the DuckDB prompt, install and load the Iceberg extension, create a secret for Polaris, attach the catalog, and point DuckDB at MinIO:

INSTALL iceberg;
LOAD iceberg;

CREATE SECRET iceberg_secret (
    TYPE ICEBERG,
    CLIENT_ID 'root',
    CLIENT_SECRET 'secret',
    OAUTH2_SERVER_URI 'http://polaris:8181/api/catalog/v1/oauth/tokens',
    OAUTH2_SCOPE 'PRINCIPAL_ROLE:ALL',
    OAUTH2_GRANT_TYPE 'client_credentials'
);

ATTACH 'polariscatalog' AS iceberg_catalog (
    TYPE iceberg,
    SECRET iceberg_secret,
    ENDPOINT 'http://polaris:8181/api/catalog'
);

USE iceberg_catalog.db;

SET s3_endpoint='minio:9000';
SET s3_use_ssl='false';
SET s3_access_key_id='admin';
SET s3_secret_access_key='password';
SET s3_region='dummy-region';
SET s3_url_style='path';

DuckDB is now fully set up to query your Iceberg data:

SELECT * FROM customers;
┌─────────────┬────────────┬─────────────────┬──────────────────────────┐
│ customer_id │ first_name │    last_name    │          email           │
│    int64    │  varchar   │     varchar     │         varchar          │
├─────────────┼────────────┼─────────────────┼──────────────────────────┤
│           2 │ Hermione   │ Granger-Weasley │ hermione@hogwarts.edu    │
│           1 │ Rey        │ Skywalker       │ rey@resistance.org       │
│           3 │ Tony       │ Stark           │ tony@starkindustries.com │
└─────────────┴────────────┴─────────────────┴──────────────────────────┘

What You've Built

You now have a legitimate test Data Lakehouse running entirely on your machine, with zero cloud dependencies. You learned how to:

✅ Use Apache Polaris as a REST catalog (goodbye Hive!)
✅ Query and time-travel with Apache Iceberg
✅ Use both Trino and DuckDB

Obviously, it's just a basic example. For production workloads, you'd need Kubernetes for scaling, AWS S3 (or another managed object store) instead of MinIO, and properly secured access to Polaris with more granular roles.

But, in my opinion, it's the perfect platform to learn how things work, without any risk of unexpected bills.

Coming Up Next

In the next post in this series, we're going to add real-time streaming to this setup. We'll bring in Apache Flink for stream processing and Kafka for event streaming. All still running locally, all still completely safe from surprise charges.