These days, the Big Data space is dominated by AI hype. But behind the scenes, there's still plenty of development happening on another hot topic: table formats.
Once upon a time, there was the Hive table format. Many complained about its rigid features and limitations, but to be fair, Hive tables were the first in the Big Data market. Every pioneer comes with its share of issues.
Now, we have multiple table formats, both open-source and proprietary. Among them, one truly stands out in the open-source world: Apache Iceberg. This format rethinks and fixes many of the pain points Hive tables had. Even Apache Hive itself has become one of the most active adopters of the Iceberg specification. In my opinion, Hive currently has the second-richest feature set for Apache Iceberg.
But there's one feature that really caught my attention lately: Branching in SQL.
This is one of the coolest things to happen in the SQL world since somebody invented Spotify. But why am I so excited? Let me explain.
Why Branching Matters in ETL
Apache Hive has supported INSERT, UPDATE, and DELETE for years, making it a reliable choice for ETL workloads.
Let's start with a simple raw logs external table:
CREATE EXTERNAL TABLE raw_logs (
ip STRING,
timestamp STRING,
request STRING,
status INT,
bytes INT
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
LOCATION '/data/raw_logs/';
As a Data Engineer, I want to clean this data and put it into my data lake. So I create the destination table:
CREATE TABLE cleaned_logs (
ip STRING,
request STRING,
status INT,
bytes INT,
date STRING
)
PARTITIONED BY (year INT, month INT, day INT)
STORED AS PARQUET;
Now, let's insert the cleaned data (here's a deliberately ugly SQL statement just for the demo; it hurts my eyes too):
INSERT OVERWRITE TABLE cleaned_logs PARTITION (year, month, day)
SELECT
ip,
request,
status,
bytes,
from_unixtime(unix_timestamp(timestamp, 'dd/MMM/yyyy:HH:mm:ss Z'), 'yyyy-MM-dd') AS date,
year(from_unixtime(unix_timestamp(timestamp, 'dd/MMM/yyyy:HH:mm:ss Z'))) AS year,
month(from_unixtime(unix_timestamp(timestamp, 'dd/MMM/yyyy:HH:mm:ss Z'))) AS month,
day(from_unixtime(unix_timestamp(timestamp, 'dd/MMM/yyyy:HH:mm:ss Z'))) AS day
FROM raw_logs
WHERE status = 200;
It works, but every time new data arrives, we have to run it again. And new data isn't always clean: maybe the IP is missing, maybe the status code is invalid, or maybe there's an error message mixed in.
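For instance, before loading, you might probe the raw table for rows that would pollute the target. A quick sketch; the validity rules here are purely illustrative:
-- count raw rows that would fail our (illustrative) quality rules
SELECT count(*) AS bad_rows
FROM raw_logs
WHERE ip IS NULL OR ip = ''
   OR status NOT BETWEEN 100 AND 599;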
If we insert bad data into the target table, my data consumers could see it instantly. That could break reports or cause bugs downstream. Sure, I could clean it afterward or use a temporary table, but that often means copying data, increasing storage costs (especially at terabyte scale), and cluttering the data warehouse.
I've even seen customers with over 250,000 tables in their warehouse because no one wanted to delete unused temp tables. This caused serious metastore performance issues.
Clearly, this approach is heavy and inefficient.
NOTE: There are many patterns, practices, and especially tools that can mitigate this heavy approach, but I chose the simplest way to show the contrast.
The industry has been asking for something lightweight, elegant, and easy to use, and that's where Apache Iceberg Branching comes in.
So, what is a branch in Iceberg? Branches are independent lineages of snapshots that point to the head of that lineage. Sounds complicated?
Every Iceberg table has a main branch by default. When you create a table, for example an audit table, it starts with main. You can create a branch from the table's current HEAD or from a specific snapshot. In this article, we'll focus on creating branches from the HEAD.
Create branch:
ALTER TABLE table_name CREATE BRANCH branch_name;
Dropping a branch:
ALTER TABLE table_name DROP BRANCH branch_name;
NOTE: Branching is only supported for Iceberg tables, and the Hive version has to be 4.0 or above!
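We'll stick to head-based branches in this article, but for completeness, the branch DDL also lets you pin a branch to a specific snapshot and attach retention settings. A minimal sketch based on the Hive 4 branch DDL (the snapshot id is hypothetical; verify the exact clauses against your Hive version):
-- create a branch pinned to a specific (hypothetical) snapshot,
-- keeping the branch reference around for 7 days
ALTER TABLE table_name CREATE BRANCH branch_name
FOR SYSTEM_VERSION AS OF 6416958171399394992
RETAIN 7 DAYS;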
To check the branches (and tags), you can execute the following SQL command, selecting the table's references by their full name with the refs keyword.
SELECT * FROM database_name.table_name.refs;
Sample result:
+-----------------+---------------+----------------------+----------------------------------+--------------------------------+---------------------------------+
| payroll.name | payroll.type | payroll.snapshot_id | payroll.max_reference_age_in_ms | payroll.min_snapshots_to_keep | payroll.max_snapshot_age_in_ms |
+-----------------+---------------+----------------------+----------------------------------+--------------------------------+---------------------------------+
| main | BRANCH | 6416958171399394992 | NULL | NULL | NULL |
| audit_2025 | BRANCH | 335492812108130379 | NULL | NULL | NULL |
+-----------------+---------------+----------------------+----------------------------------+--------------------------------+---------------------------------+
Write-Audit-Publish
Now we know the syntax, but where does this fit into an ETL pipeline? This is where the WAP (Write–Audit–Publish) pattern comes to the rescue.
With WAP, we can maintain a main branch that contains only clean, high-quality data. When we need to ingest new data, we create a branch, which is just a lightweight snapshot pointer, and ingest the data there first.
Once the batch ingestion is complete, we use this staging branch to clean, filter, and validate the data. After the data is ready, we simply merge it back to the main branch. This way, consumers never see dirty data, and we avoid polluting the main table with temporary or invalid records.

- No need to copy data files.
- No impact on main branch data during processing.
- Every branch has the same ACID-compliant feature set as the main branch.
With this approach, we can build an elegant, lightweight, and reliable data pipeline.
Let's visualize the flow:

- We have a table, and we create a branch from snapshot S35. The branch is named Staging, and it points to S35.
- We start the ingestion process — let's say we load 200K rows. In the next step, we clean out invalid rows, such as those with empty or null values, or rows that fail business logic checks. After the cleanup, the branch moves to snapshot S36 (keeping it simple here).
- At this point, we might run a few SELECT queries to check for duplicates or perform cross-checks. If everything looks good, we can merge the changes back into the main branch. Once the branch is no longer needed, we simply drop it.
- This approach, especially with these lightweight snapshot pointers, is far more elegant than creating a CTAS table and manually playing around with it. A sketch of the whole loop follows below.
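To make the pattern concrete, here's a minimal WAP sketch in HiveQL. The logs_iceberg table, the staging branch name, and the validation rules are all hypothetical; I'm assuming an Iceberg v2 table and a Hive build where row-level DML works on branches:
-- a hypothetical Iceberg flavor of the cleaned logs table
-- (format-version 2 enables row-level deletes)
CREATE TABLE logs_iceberg (ip string, request string, status int, bytes int)
STORED BY ICEBERG
TBLPROPERTIES ('format-version'='2');

-- WRITE: stage the new data on a lightweight branch
ALTER TABLE logs_iceberg CREATE BRANCH staging;
INSERT INTO default.logs_iceberg.branch_staging
SELECT ip, request, status, bytes FROM raw_logs;

-- AUDIT: clean and validate on the branch; main stays untouched
DELETE FROM default.logs_iceberg.branch_staging
WHERE ip IS NULL OR status NOT BETWEEN 100 AND 599;

-- PUBLISH: fast-forward main to the audited state, then drop the branch
ALTER TABLE logs_iceberg EXECUTE FAST-FORWARD 'main' 'staging';
ALTER TABLE logs_iceberg DROP BRANCH staging;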
Branching in Practice
Let's try a quick demo: we'll mimic a payroll table with a minimal ingestion flow.
1. Create an Iceberg table. The table name is payroll:
CREATE TABLE payroll (id int, name string, amount int) STORED BY ICEBERG;
2. Insert sample data:
INSERT INTO payroll VALUES (1, "Daniel", 5000);
INSERT INTO payroll VALUES (2, "Attila", 10000);
INSERT INTO payroll VALUES (3, "Tom", 2900);
3. Check the table:
SELECT * FROM payroll;
+-------------+---------------+-----------------+
| payroll.id | payroll.name | payroll.amount |
+-------------+---------------+-----------------+
| 3 | Tom | 2900 |
| 1 | Daniel | 5000 |
| 2 | Attila | 10000 |
+-------------+---------------+-----------------+
4. Create an audit branch:
ALTER TABLE payroll CREATE BRANCH audit;
5. Query the audit branch:
NOTE: To query a branch, you need to use the full database_name.table_name.branch_branchname format.
SELECT * FROM default.payroll.branch_audit;
+-------------+---------------+-----------------+
| payroll.id | payroll.name | payroll.amount |
+-------------+---------------+-----------------+
| 3 | Tom | 2900 |
| 1 | Daniel | 5000 |
| 2 | Attila | 10000 |
+-------------+---------------+-----------------+
6. Now that the branch is created, let's insert just one row into the audit branch:
INSERT INTO default.payroll.branch_audit VALUES (4, "Bianca", 10000);
7. For demo purposes, query the branch table:
SELECT * FROM default.payroll.branch_audit;
+-------------+---------------+-----------------+
| payroll.id | payroll.name | payroll.amount |
+-------------+---------------+-----------------+
| 2 | Attila | 10000 |
| 3 | Tom | 2900 |
| 4 | Bianca | 10000 |
| 1 | Daniel | 5000 |
+-------------+---------------+-----------------+
8. We're done with the ingestion, so let's merge it back to the main branch:
ALTER TABLE payroll EXECUTE FAST-FORWARD 'main' 'audit';
9. Validate that the merge executed:
SELECT * FROM payroll;
+-------------+---------------+-----------------+
| payroll.id | payroll.name | payroll.amount |
+-------------+---------------+-----------------+
| 2 | Attila | 10000 |
| 3 | Tom | 2900 |
| 4 | Bianca | 10000 |
| 1 | Daniel | 5000 |
+-------------+---------------+-----------------+
10. Now it's time to drop the branch:
ALTER TABLE payroll DROP BRANCH audit;
It was easy, right? Again, branches let you modify data freely without touching the main branch until you're ready. If your table is 100 GB and you add 3 rows on a branch, the storage overhead is just those 3 rows, thanks to Iceberg's fantastic snapshot model.
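If you want to see that for yourself, Iceberg exposes metadata tables that Hive can query directly, just like the refs table we used earlier. A small sketch, assuming the payroll demo table; the exposed metadata tables and their columns can vary by version:
-- each snapshot records its operation and a summary of what it changed
SELECT snapshot_id, operation, summary
FROM default.payroll.snapshots;

-- confirm the audit branch reference is gone after the DROP
SELECT name, type, snapshot_id
FROM default.payroll.refs;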
Syntactic sugar
I personally like how quickly Apache Hive integrated the Iceberg table format, but I have an issue with its syntax. Hive generally follows the SQL standard, and some keywords, like CREATE BRANCH, don't belong to classical SQL syntax. Because of this, most new Hive syntax starts with ALTER TABLE. That keyword is a bit overloaded in many SQL engines, and that's especially true for Apache Hive. But recently my friend Ayush Saxena came up with a more modern syntax for branching:
CREATE BRANCH audit FROM payroll;
NOTE: Available in Hive 4.2 and above.
Limitations
It's worth mentioning some significant limitations of branching that you may notice when you try it.
1. Even though branching works like Git, just for tables, there's a catch. If the main branch changes (even a single row inserted, updated, or deleted) after a branch is created, the HEADs will no longer match, and you won't be able to merge your changes back. In other words, the system isn't designed to resolve conflicts. If a conflict occurs, you'll see an error like: Cannot fast-forward: main is not an ancestor of branch_name. See the sketch after this list.

2. When you create a branch, you can try out many things in isolation. BUT there's an exception! You can't experiment with different table properties. Table properties in Hive are global, not branch-specific. Any change to table properties in a branch also affects the main table, because these settings are shared across all branches. This limitation makes branches unsuitable for experimenting with configuration-level changes.
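Here's a minimal sketch of the first limitation, reusing the payroll table from the demo (the extra rows and the exact error text are illustrative):
ALTER TABLE payroll CREATE BRANCH audit;

-- writes land on both sides, so the lineages diverge
INSERT INTO default.payroll.branch_audit VALUES (5, "Eva", 7500);
INSERT INTO payroll VALUES (6, "Mark", 6200); -- main moves past the branch point

-- the merge is rejected because main is no longer an ancestor of audit
ALTER TABLE payroll EXECUTE FAST-FORWARD 'main' 'audit';
-- Error: Cannot fast-forward: main is not an ancestor of audit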
Apache Hive is one of the fastest adopters of Apache Iceberg, and branching opens a new era for cleaner, lighter, and safer ETL pipelines. If you're creating a new pipeline, consider using Iceberg branches, whether in Hive or Spark, as they can bring more flexibility and elegance than ever before.