Boost Iceberg Performance and Cut Compute Costs with Well-Scoped MERGE Statements

Avoid full-table scans, slash costs, and simplify ETL pipelines with partition-targeted merges.

Vincent DANIEL

~7 min read · July 2, 2025 (Updated: August 29, 2025) · Free: Yes

Apache Iceberg has transformed data lake management by offering ACID transaction support and SQL-friendly operations like MERGE INTO. This has made upserts and slowly changing dimension (SCD) workloads more manageable at scale. However, there is a subtle yet significant performance pitfall when using the WHEN NOT MATCHED BY SOURCE clause in MERGE statements. While it might seem like a convenient way to delete rows that don't match the source dataset, this clause can trigger full table scans, leading to significant increases in both runtime and cost.

In this article, we'll explore why this happens, how a well-scoped MERGE operation can avoid this issue, and why this approach is often more efficient than splitting the logic into a MERGE followed by a separate DELETE.

The Problem: Full Table Scans with "WHEN NOT MATCHED BY SOURCE"

In Iceberg's 1.4.0 release, the WHEN NOT MATCHED BY SOURCE clause was introduced to facilitate the deletion of rows in the target table that don't have a corresponding entry in the source dataset. This feature can be particularly useful in data synchronization tasks.

Consider the following example:

-- 🔄 MERGE statement illustrating common performance pitfall in Iceberg
MERGE INTO target_table AS target
USING source_table AS source
ON target.id = source.id
AND target.event_date = CURRENT_DATE
-- ⚠️ This predicate *looks* like it limits processing to today's partition...
-- ❌ BUT: Iceberg ignores this filter when using WHEN NOT MATCHED BY SOURCE clause,
--      causing a FULL TABLE SCAN of the entire target table regardless of the predicate
-- 🔥 This leads to:
--    - Excessive I/O and CPU usage
--    - Out-of-memory errors on large datasets
--    - Severe scalability and performance issues

WHEN MATCHED THEN
-- 🔁 Update existing rows when matching IDs exist in both source and target
  UPDATE SET target.value = source.value

WHEN NOT MATCHED BY TARGET THEN
-- ➕ Insert new rows from source that don't exist in the target
  INSERT *

WHEN NOT MATCHED BY SOURCE AND target.event_date = CURRENT_DATE THEN
-- ❌ Delete rows from the target that are not in the source (e.g., obsolete) 
  DELETE

The Issue:

When you use the WHEN NOT MATCHED BY SOURCE clause, Iceberg does not push down the predicates from the ON clause or the WHEN NOT MATCHED BY SOURCE condition itself. As a result, Iceberg ends up scanning the entire target table, even if only a small subset of data (e.g., a specific date range or partition) is relevant.

As data volume grows, the cost of scanning large tables increases significantly — not only in terms of execution time but also in resource consumption. Full table scans for wide-scope merges can lead to memory pressure on compute engines, potentially causing out-of-memory errors. This makes such approaches increasingly inefficient at scale and highlights the importance of minimizing the scope of operations whenever possible.

For more context or ongoing discussion around this topic, see the related GitHub issue #11248.

Cost of Full-Table-Scan by Table Size

In Contrast:

If you remove the WHEN NOT MATCHED BY SOURCE clause and adjust the logic accordingly, Iceberg can apply filtering earlier in the process. This enables partition pruning, meaning only the relevant partitions are scanned. The result is a significant improvement in performance and a reduction in resource consumption.

Why Not Just Use a Separate DELETE?

A common workaround to avoid the full table scan caused by the WHEN NOT MATCHED BY SOURCE clause is to split the operation into two steps:

Use MERGE for upserts (insert and update operations).
Execute a separate DELETE for obsolete rows, often with explicit filtering conditions.

While this avoids the full table scan, it introduces additional challenges:

Increased Complexity: Splitting the logic into two operations complicates the pipeline. Both MERGE and DELETE operations must be coordinated, which increases the risk of errors.
Consistency Issues: In streaming or concurrent environments, changes may occur between the two operations, leading to potential data inconsistencies.
Atomicity: Managing two separate operations makes ensuring atomicity more difficult. If one operation fails, the pipeline may end up in an inconsistent state, requiring additional error handling and retries.

The Efficient Solution: Well-Scoped MERGE

Rather than relying on the WHEN NOT MATCHED BY SOURCE clause or splitting the operations, the most efficient approach is to scope the MERGE operation to the relevant partitions. This allows Iceberg to avoid a full table scan while still achieving the desired result—upserting and deleting rows in a single, atomic operation.

What Makes This Efficient?

Partition Pruning: By scoping the MERGE to a specific date range (e.g., event_date BETWEEN ...), Iceberg will only scan the relevant partitions of the target table, significantly reducing the data that needs to be processed.
No "WHEN NOT MATCHED BY SOURCE" Clause: By avoiding this clause, Iceberg can apply filtering early and push down the filter, thus bypassing the need to scan irrelevant data.
All-in-One Operation: Using a single MERGE operation handles both upserts and deletes atomically, simplifying the logic and reducing the complexity of your pipeline.

Example Code:

-- 1. Scope your target to just today's partition for efficient partition pruning
WITH target AS (
  SELECT
    t.*,
    MD5(concat_ws('|', t.id, t.column_a, t.event_date)) AS checksum -- Remove if present in the table
  FROM my_table t
  WHERE t.event_date = CURRENT_DATE  -- limits processed data to today's partition
),
-- 2. Prepare the source dataset, computing a checksum for efficient change detection
source AS (
  SELECT 
    s.*,
    MD5(concat_ws('|', s.id, s.column_a, s.event_date)) AS checksum
  FROM source_table s
  WHERE s.event_date = CURRENT_DATE  -- only relevant source rows
),
-- 3. Compute the change set using a full outer join to identify inserts, updates, and deletes
changes AS (
  SELECT
    COALESCE(s.event_date, t.event_date) AS event_date,
    COALESCE(s.id, t.id)               AS id,
    s.column_a,
    s.checksum,
    CASE
      WHEN s.checksum IS NULL THEN 'DELETE'  -- present in target but not in source
      WHEN t.checksum IS NULL THEN 'INSERT'  -- new row in source
      ELSE 'UPDATE'                    -- existing row with changed data
    END AS cdc
  FROM target t
  FULL OUTER JOIN source s
    ON t.checksum = s.checksum
   AND t.event_date = s.event_date
  WHERE NOT (s.checksum <=> t.checksum)  -- filter out unchanged rows
)
-- 4. Perform a single MERGE: scoped to today's partition, without scanning full table
MERGE INTO my_table AS t
USING changes c
  ON t.checksum = c.checksum
 AND t.event_date = CURRENT_DATE -- Filter pushdown on the partition — now working.
WHEN MATCHED AND c.cdc = 'DELETE' THEN
  DELETE                                -- remove obsolete rows
WHEN MATCHED AND c.cdc = 'UPDATE' THEN
  UPDATE SET *                          -- update changed rows
WHEN NOT MATCHED BY TARGET THEN
  INSERT *                              -- insert new rows
;

In this example:

The MERGE operation is scoped to a specific date, ensuring only relevant partitions are scanned.
By omitting WHEN NOT MATCHED BY SOURCE, Iceberg avoids the full table scan and processes only the relevant partitions.
Upserts and deletes are efficiently handled in a single operation.

The Role of Checksum

The checksum in this example is used as a lightweight, efficient way to detect changes between the source and target records. Instead of performing a deep comparison of all columns in the record, the checksum allows for a fast comparison of the data content. If the checksum differs between the source and target, it signifies that at least one of the relevant fields has changed.

Efficient Change Detection: The checksum comparison (s.checksum <=> t.checksum) ensures that only records with actual content changes (in id, column_a, or event_date) are considered for upsert operations. This saves computation by skipping rows that haven't changed.
Performance Improvement: By leveraging the checksum, the query avoids performing full row comparisons. Instead, it compares a compact hash value, which is computationally cheaper. This is especially beneficial when dealing with large datasets, where a direct column-by-column comparison would be inefficient.
Integrity Check: The checksum also acts as a safeguard to detect any discrepancies or corruption in the data. If two records have the same checksum, it can be assumed that they are identical in terms of the selected columns (though note that there is a small risk of checksum collisions).

Real-World Impact

By using a well-scoped MERGE like the one described, we've observed significant performance improvements in our production pipelines. Specifically:

Query Times Reduced: What used to take hours can now be executed in just minutes.
Cost Savings: In most cases, we saw a 10x reduction in compute costs, as Iceberg no longer needs to scan & "merge" the entire target table.
Improved Pipeline Simplicity: By keeping the logic in a single, efficient operation, we reduced the complexity of the pipeline and maintained consistent performance over time.

Conclusion

While the WHEN NOT MATCHED BY SOURCE clause in Iceberg's MERGE INTO is a convenient feature for handling deletions, its current implementation can lead to expensive full table scans. Instead, by scoping your MERGE to the relevant partitions and avoiding the WHEN NOT MATCHED BY SOURCE clause, you can significantly improve the efficiency of your operations.

This well-scoped MERGE approach helps keep your data lake fast, scalable, and cost-effective—without the added complexity and risks of managing separate delete operations. It's a simple yet powerful pattern that minimizes resource usage while ensuring that your pipelines remain efficient and easy to maintain.

Bug Reference & Further Reading:

#apache-iceberg #data-engineering #spark #sql #performance