If you've worked with data lakes, you know the story: they start clean and fast, but as data piles up, queries slow down and costs creep higher. Apache Iceberg helps by bringing structure — schema evolution, ACID transactions, and powerful metadata — to those lakes. It's why so many teams are adopting it as the foundation of their lakehouse.
But Iceberg doesn't magically keep itself optimized. Over time, you'll still hit familiar pain points: millions of tiny files that bog down planning, snapshots that never get cleaned up, or tables that don't align with how your queries actually run. Left unchecked, these issues lead to wasted CPU cycles, ballooning storage bills, and frustrated engineers.
In this guide, we'll walk through 11 tools and techniques that can help you:
- Keep queries fast as your tables grow.
- Control storage and compute costs with smarter file layouts.
- Automate the housekeeping work that Iceberg needs to stay efficient.
There are also automation-focused solutions such as LakeOps, which continuously optimize Iceberg tables in the background to cut query time and operation costs without any manual tuning.
By the end, you'll have a practical playbook for keeping Iceberg tables running smoothly. Let's jump in.
1. LakeOps — Data Lake Control Plane with Smart Optimization
LakeOps is a powerful yet simple solution for automating optimization.
It's a control plane for your data lake that continuously optimizes your tables to significantly reduce costs and improve performance. Using AI, it learns real usage patterns and tailors data organization to your system's specific needs. It is natively built on Apache Iceberg.
Key features:
- Smart control plane for your data lake
- Automated compaction that optimizes cost and performance
- AI-Powered optimization based on real data usage patterns
- Enterprise-grade analytics and features

You can fully automate table maintenance, including the hard parts: compaction, manifest rewrites, orphan-file cleanup, snapshot expiration, and schema management. You can also trigger jobs on events such as table-size growth, query-latency regressions, and cost spikes.
Within about 10 minutes, and with zero risk to your data, LakeOps starts optimizing your tables, cutting around 50–80% of CPU cost almost instantly and saving roughly 25% on storage. Queries run about 4x–8x faster because the data is organized to match how you actually query it.

Learn more: https://lakeops.dev
2. Dremio OPTIMIZE Command
One of the biggest performance killers in Iceberg is the "small files problem." As new data is ingested — especially from streaming sources or micro-batches — you end up with thousands or even millions of tiny Parquet files. Each query then has to plan and open all of those files, which makes execution slower and drives up compute cost.
Dremio addresses this with the OPTIMIZE command, a built-in way to reorganize Iceberg tables. Under the hood, it rewrites data and metadata files into fewer, more evenly sized files, while also consolidating manifests. The result is faster queries and more efficient use of resources.
You can choose different strategies depending on your workload:
- BIN_PACK — Combines small files into larger ones that hit your target size, reducing overhead.
- Sort-based compaction — Rewrites files while sorting rows by chosen columns, improving data locality and pruning efficiency.
- Manifest consolidation — Merges fragmented metadata files so query planning doesn't choke on too many manifests.
When to use it:
- After heavy streaming ingestion that produces lots of small files.
- When query latency suddenly jumps due to metadata bloat.
- As part of a scheduled maintenance job (e.g., nightly or weekly).
Example:
-- Compact small files in a table
OPTIMIZE TABLE sales.iceberg_orders BIN_PACK;
-- Optimize while clustering by customer_id for faster lookups
OPTIMIZE TABLE sales.iceberg_orders SORT (customer_id);
3. AWS Glue Table Optimizers
Keeping Iceberg tables healthy in production usually means running regular compaction jobs, cleaning up snapshots, and dealing with orphaned files. On AWS, that work can now be automated.
Glue has built-in table optimizers for Iceberg that quietly run in the background and take care of the routine maintenance most teams would otherwise script themselves.
The optimizers cover the main pain points:
- Compaction: By default, Glue runs binpack compaction to merge lots of small files into fewer, well-sized ones. You can also switch to sort or Z-order strategies if you want to cluster data for faster filtering.
- Snapshot cleanup: Old snapshots get expired automatically so metadata doesn't grow out of control.
- Orphan file removal: Glue safely deletes files that are no longer referenced by any snapshot, keeping storage costs under control.
Configuration is straightforward. You can enable optimizers through the AWS console, CLI, or API, and choose the right strategy per table. For example, a large fact table with time-based queries might benefit from sort compaction on the date column, while a wide analytics table could perform better with Z-order clustering.
For teams running Iceberg on S3 with Glue catalogs, this is one of the easiest ways to ensure queries stay fast without scheduling and maintaining custom compaction pipelines.
Example (CLI):
aws glue update-table-optimizer \
--database-name mydb \
--table-name sales_iceberg \
--compaction-strategy sort \
--columns "order_date,customer_id"
4. Sort & Z-Order Compaction in Amazon S3 Tables
Compaction isn't just about merging small files. The way your data is physically ordered on disk has a huge impact on query performance. If your users often filter on the same columns (event_date, region, or customer_id, for example), keeping those rows clustered together means queries can skip more files and scan less data.
AWS recently introduced sort and Z-order compaction for Iceberg tables in Amazon S3. These strategies go beyond basic bin-packing:
- Sort compaction reorganizes data files by sorting on one or more columns. This works best for time-based queries or workloads where a single dimension (like order_date) is the main filter.
- Z-order compaction clusters rows across multiple columns at once, which is useful when queries commonly filter on a combination of fields, say customer_id and region. It's a multi-dimensional clustering technique that helps reduce the number of files scanned without relying on a single sort key.
Both approaches are fully integrated into Glue and S3 Iceberg tables. You can choose the right strategy per table depending on how your queries behave. For example, a log table queried mostly by time might use sort compaction on timestamp, while an ad-tech dataset with many dimensions could benefit more from Z-ordering.
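Sort compaction typically works off the sort order recorded in the table's metadata. As a quick sketch (the catalog, table, and column names here are placeholders), you can declare that order with Iceberg's Spark SQL extensions so the compaction service knows how to cluster rewritten files:
-- Record a table-level sort order; sort compaction can then cluster files by these columns
ALTER TABLE my_catalog.sales.iceberg_orders WRITE ORDERED BY order_date, customer_id;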
For analytics teams, this feature means you can shape your Iceberg tables to match real query patterns instead of relying on generic compaction. The payoff is faster queries and lower compute cost, especially in wide analytics workloads where data skipping really matters.
5. Native Iceberg Stored Procedures
Iceberg itself ships with a set of built-in stored procedures that act like a lightweight toolbox for keeping tables clean. They're engine-agnostic, meaning you can call them from Spark, Flink, Trino, Dremio, or any engine that supports Iceberg procedures. For many teams, these are the first optimizations worth scheduling because they address the most common maintenance pain points:
- CALL system.remove_orphan_files – Cleans up files that are no longer referenced by any snapshot, reclaiming storage and preventing clutter.
- CALL system.expire_snapshots – Trims old snapshots and metadata, keeping query planning fast and avoiding metadata bloat.
- CALL system.rewrite_data_files – Compacts small files into larger ones for more efficient query execution.
These procedures don't require new infrastructure or external services — you can run them directly within your compute engine. A typical setup is to schedule them as part of a nightly or weekly job, depending on how much data you ingest and how sensitive your workloads are to query latency.
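For example, a nightly maintenance job in Spark SQL might look roughly like this (the catalog name, table name, and cutoff timestamp are placeholders; exact procedure syntax can vary slightly by engine):
-- Compact small data files into larger ones
CALL my_catalog.system.rewrite_data_files(table => 'sales.iceberg_orders');
-- Expire snapshots older than a cutoff, keeping the five most recent
CALL my_catalog.system.expire_snapshots(table => 'sales.iceberg_orders', older_than => TIMESTAMP '2025-06-01 00:00:00', retain_last => 5);
-- Delete files no longer referenced by any snapshot
CALL my_catalog.system.remove_orphan_files(table => 'sales.iceberg_orders', older_than => TIMESTAMP '2025-06-01 00:00:00');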
6. Partitioning, Sorting, and Metadata Pruning
Not all Iceberg optimizations come from external tools — some of the most powerful ones are baked into the format itself. If you understand how Iceberg handles partitions and metadata, you can unlock big performance wins without adding any new services.
- Hidden partitioning: In traditional Hive-style tables, you had to manage partitions manually and remember to include them in queries. Iceberg hides that complexity. Queries don't need to reference partitions explicitly, but the engine still prunes files behind the scenes. This reduces both query errors and unnecessary scans.
- Partition evolution: Workloads change, and the way you partition data today might not work a year from now. Iceberg lets you evolve partitioning strategies over time — say, moving from daily to hourly partitions — without rewriting all historical data.
- Rich metadata statistics: Iceberg stores column-level statistics in its manifests, including min/max values, null counts, and bounds. Query engines use this to skip over irrelevant files entirely. The result is faster planning and execution, especially on wide tables with billions of rows.
In practice, these features mean you can design tables around query patterns instead of just ingestion speed. A customer-facing analytics table might be partitioned by customer_id and sorted by event_time, while an operational log table might evolve from daily partitions to hourly as volume grows. Meanwhile, the built-in metadata pruning ensures queries scan only what's needed.
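As a sketch of what that evolution looks like in Spark SQL with Iceberg's DDL extensions (table and column names are illustrative):
-- Evolve from daily to hourly partitions; existing data keeps its old layout
ALTER TABLE my_catalog.logs.events ADD PARTITION FIELD hours(event_time);
ALTER TABLE my_catalog.logs.events DROP PARTITION FIELD days(event_time);
-- Record a sort order so future writes and compactions cluster rows by event_time
ALTER TABLE my_catalog.logs.events WRITE ORDERED BY event_time;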
These aren't just "nice-to-haves." They're the foundation that other optimizations build on. Even if you later layer in compaction jobs, Glue optimizers, or a control plane like LakeOps, good partitioning and metadata design will make every other optimization more effective.
7. Cloudera Iceberg Optimization Strategies
Cloudera has been an early adopter of Iceberg and has published a set of best practices that apply no matter what engine or cloud you're running on. While some optimizations come from tools and automation, these practices focus on operational discipline — the things you should regularly schedule and monitor to keep Iceberg in good shape.
The three big areas they emphasize are:
- Snapshot management — Iceberg keeps a snapshot for every commit. That's great for time travel, but if you never expire them, metadata grows quickly and queries take longer just to plan. Cloudera recommends expiring old snapshots on a predictable schedule — daily or weekly depending on your data volume.
- File compaction — Streaming pipelines in particular create tons of small files. Without regular compaction, query planning slows down and costs rise. Even simple bin-pack compaction can make a noticeable difference in query times.
- Metadata cleanup — Manifests and other metadata files can also accumulate over time. Running cleanup jobs ensures your planning layer isn't wasting resources parsing hundreds or thousands of stale manifests.
The value of these strategies is that they're engine-agnostic. Whether you're running Spark, Hive, Impala, or another query engine, they apply universally. Many enterprises mix batch and real-time workloads on the same Iceberg tables, and without these practices, performance can degrade unevenly across workloads.
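For the metadata side specifically, Iceberg ships a procedure that consolidates fragmented manifests; a minimal Spark SQL sketch (the table name is a placeholder) looks like:
-- Rewrite manifests so planning reads fewer, larger metadata files
CALL my_catalog.system.rewrite_manifests(table => 'sales.iceberg_orders');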
Cloudera's approach is a reminder that Iceberg performance isn't just about advanced features like Z-ordering or AI-driven compaction. Sometimes, the basics — expire old snapshots, compact files, clean metadata — are what make the difference between a system that feels "self-healing" and one that requires constant firefighting.
8. Tinybird Real-Time Optimization Practices
When Iceberg is used for real-time analytics, performance bottlenecks tend to appear much faster than in batch-only systems. Tinybird's approach is to focus on three fundamentals: partitioning, sorting, and compaction. Get these right, and you'll avoid most of the pitfalls that cause dashboards or APIs to slow down.
- Partition around query patterns. If users always filter by time and customer, design your table with time-based partitions and evolve as data grows — from daily to hourly, for example. Avoid over-partitioning, since too many small partitions will backfire and create file sprawl.
- Use sorting to boost pruning. Within partitions, sorting data on the columns that queries filter on most often (like event_time or customer_id) makes Iceberg's metadata more effective. The engine can then skip entire files instead of scanning them, which keeps latency low.
- Compact continuously. Streaming data pipelines naturally generate small files. Instead of waiting until the table becomes unmanageable, run rolling compaction on recent partitions (for example, the last 24–48 hours). This keeps file counts in check without competing with active writers.
A typical setup might look like this: partition by day, sort within partitions by event_time, and run an hourly job to compact small files for the last two days. The result is a table that stays fresh for real-time queries while still being efficient to scan at scale.
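A rough sketch of that rolling compaction job with Iceberg's Spark procedure, using a predicate so only recent data is rewritten (the names and cutoff timestamp are placeholders; adjust the filter to your schema):
-- Sort-compact only recent files, leaving cold partitions untouched
CALL my_catalog.system.rewrite_data_files(
  table => 'analytics.events',
  strategy => 'sort',
  sort_order => 'event_time',
  where => 'event_time >= TIMESTAMP ''2025-06-01 00:00:00'''
);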
The lesson here is simple: before reaching for advanced optimization tools, make sure the basics are covered. Smart partitioning, helpful sorting, and steady compaction will solve 80% of real-time Iceberg performance issues.
9. AutoComp: Automated Compaction Framework
Most Iceberg teams eventually realize that manual or scheduled compaction jobs don't always line up with how the data is actually queried. Sometimes you compact too often and waste compute; other times you don't compact enough and queries slow down. This is where AutoComp comes in — a framework designed to make compaction workload-aware.
AutoComp automatically decides when and what to compact by weighing the cost of rewriting files against the performance benefits for queries. Instead of blindly compacting everything, it looks at usage patterns and focuses effort where it matters most. For example, if a certain partition is queried frequently and has a high number of tiny files, AutoComp will prioritize it over cold data that rarely gets touched.
Key advantages:
- Smarter scheduling — avoids wasting resources on unnecessary compaction.
- Workload awareness — aligns optimization with how data is actually being queried.
- Hands-off operation — reduces the need for manual tuning or over-provisioning.
AutoComp has been evaluated in large-scale environments, including production systems at LinkedIn, where it showed meaningful reductions in small files and query latency without driving up compaction costs.
While it's more of a research-driven framework than a plug-and-play tool, AutoComp is a strong proof point that the future of Iceberg optimization is adaptive automation — compaction that responds to workload patterns instead of being rigidly scheduled.
10. Upsolver for Iceberg-Optimized Ingestion
Upsolver tackles the problem from the other end: instead of fixing tables after the fact, it optimizes data on the way in. As a managed ingestion service, it writes streaming and CDC data into Iceberg while handling the details that usually cause trouble later, such as sizing output files sensibly, managing partitions, and running ongoing compaction and snapshot cleanup on the tables it manages.
For teams whose main source of small-file sprawl is continuous ingestion, this means tables land in Iceberg already in good shape, so downstream maintenance jobs have far less work to do.
11. Estuary Flow for Real-Time Iceberg Pipelines
Real-time pipelines often fall apart at ingestion: late or out-of-order events, schema drift, and a constant stream of tiny files that fragment your tables. Estuary Flow slots in between your sources (CDC from databases, Kafka, SaaS APIs) and Iceberg to keep writes tidy from the start.
What it helps with:
- CDC and upserts: turns change streams into clean inserts/updates/deletes so Iceberg tables reflect current state without ad-hoc merge jobs.
- Schema evolution: propagates compatible source changes (new columns, type relaxations) so pipelines don't break on the first drift.
- Right-sized files: buffers by partition/time window and writes target-sized Parquet files to avoid small-file blowups.
- Ordering and dedupe: idempotent delivery and checkpointing keep duplicates and gaps out of your tables.
A simple operating pattern (see the sketch after this list):
- Define keys and a partition spec that mirrors reads (start with days(event_time), evolve to hours(event_time) if volume demands).
- Add a sort order within partitions (event_time, optionally a high-selectivity dimension).
- Tune flush windows (e.g., 5–15 minutes) to hit 256–512 MB target file sizes while keeping freshness high.
- Allow bounded lateness so late events land in the right partitions without creating straggler files.
- Schedule light maintenance (rewrite_data_files on the most recent partitions; expire_snapshots on a rolling basis).
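As a sketch of the table side of that pattern (Spark SQL; the names and property values are illustrative assumptions, not Estuary-specific settings):
-- Partition by day and target ~512 MB data files
CREATE TABLE my_catalog.analytics.events (
  event_time TIMESTAMP,
  customer_id BIGINT,
  payload STRING
)
USING iceberg
PARTITIONED BY (days(event_time))
TBLPROPERTIES ('write.target-file-size-bytes' = '536870912');
-- Cluster rows within partitions by event_time for better pruning
ALTER TABLE my_catalog.analytics.events WRITE ORDERED BY event_time;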
Used this way, Flow reduces downstream compaction pressure and keeps manifests lean, so queries stay fast without constant cleanup work.
Closing Thoughts
Keeping Apache Iceberg fast isn't just about solving today's performance issues — it's about staying ahead of file sprawl, metadata growth, and shifting workloads. The good news is you don't need to babysit every table by hand. Platforms like LakeOps provide continuous smart optimization without manual work or changes to your infrastructure.
Between Iceberg's built-in procedures, cloud-managed optimizers, and external tools, you've got plenty of ways to keep things lean and cost-efficient. The common thread across all 11 approaches is balance:
- Automate the routine stuff — snapshot expiration, orphan file cleanup, rolling compaction.
- Design tables with intent — partitions and sort orders should reflect how data is actually queried, not just how it's ingested.
- Match the tool to the workload — use native procedures for light maintenance, cloud services like Glue for managed operations, or platforms like LakeOps.
As Iceberg adoption grows, the path forward will be about smarter automation and workload-aware optimization. The more you align table structure and housekeeping with real usage, the less firefighting you'll have to do.
I'd love to hear how your team approaches Iceberg optimization. What's worked best for you — and what's been painful? Drop your thoughts in the comments, and let's compare notes.
Thanks for reading 🍻