
They say curiosity killed the cat, but in my case, it just killed my weekend. It started innocently enough: "How is Iceberg's Merge-on-Read implemented in StarRocks?" I asked myself, blissfully unaware that I was about to descend into a rabbit hole of queues, bitmaps, and anti-joins that would make Alice's adventure look like a casual stroll.
You see, most people encounter Iceberg MoR tables and think, "Cool, it handles deletes efficiently." Normal people. Reasonable people. But not me. No, I had to know how StarRocks actually pulls this off. What started as a simple grep for "iceberg" in the codebase turned into a three-day odyssey through Java optimizers, Thrift serialization, and C++ scanners that left me questioning my life choices and developing an unhealthy relationship with queue data structures.
The thing about MoR is that it sounds deceptively simple: "Just track the deletes and apply them when reading." Sure, just like building a house is "just stacking some bricks." But when you have position deletes that need bitmaps, equality deletes that need hash joins, multiple equality ID combinations that need separate queues, and all of this needs to work at scale while being parallelizable… well, suddenly you're not in Shoreditch anymore — you're deep in the code wilderness.
So grab your coffee (you'll need it), and let me take you on the journey I wish someone had documented before I started. We'll trace how a simple SELECT statement becomes an elaborate dance of frontend orchestration and backend execution, where delete files are first-class citizens and queues are the unsung heroes of the story.
Warning:
Side effects may include newfound appreciation for query optimizers and an irresistible urge to explain queue-based architectures at parties. All code in this article relates to StarRocks 3.5.4.
Deletes are different
Iceberg's merge-on-read mode revolves around two kinds of delete files: positional deletes and equality deletes. StarRocks treats them fundamentally differently.
Two flavours
Positional deletes are handled immediately during scanning. They encode (file_path, pos) pairs that tell the scanner exactly which rows to skip. The backend builds a bitmap and checks it inline - deleted rows never even materialize into chunks.
Equality deletes take a completely different path. They don't point at positions but carry key values (like user_id = 42). These become separate scan ranges that later participate in anti-joins to filter out matching rows.
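To make the two shapes concrete, here is an illustrative Java sketch (not StarRocks code; the paths, field IDs, and values are made up) of what a single row of each delete-file flavour logically carries:
// Illustrative only: the logical payload of the two delete-file flavours.
record PositionDelete(String filePath, long pos) {}   // "skip row `pos` of this exact data file"
record EqualityDelete(int fieldId, Object value) {}   // "drop every row where this field equals this value"

var posDelete = new PositionDelete("s3://bucket/data/00001.parquet", 42L);
var eqDelete = new EqualityDelete(5, 42L);            // field 5 could be user_id, for example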

Why the different treatment?
Position deletes are cheap: just a bitmap check in the hot path. Equality deletes require matching on arbitrary columns, which needs the full power of the hash join machinery. The frontend decides which path each file takes:
// IcebergMORParams.java - The routing decision
enum ScanTaskType {
    DATA_FILE_WITHOUT_EQ_DELETE, // Fast path - no equality deletes
    DATA_FILE_WITH_EQ_DELETE,    // Needs anti-join treatment
    EQ_DELETE,                   // The equality delete files themselves
    ORIGIN                       // Tables without identifier columns
}
Schema evolution resilience
Because Iceberg supports schema and partition spec evolution, delete files reference columns by field IDs rather than names. If you rename a column from customer_id to client_id, the equality deletes still work because they reference field ID 5, not the column name. StarRocks preserves these IDs throughout.
These IDs and spec metadata are then shipped to the backend via Thrift:
// gensrc/thrift/PlanNodes.thrift#L311
enum TIcebergFileContent {
    DATA,
    POSITION_DELETES,
    EQUALITY_DELETES,
}
// gensrc/thrift/PlanNodes.thrift#L317
struct TIcebergDeleteFile {
    1: optional string full_path
    2: optional Descriptors.THdfsFileFormat file_format
    3: optional TIcebergFileContent file_content
    4: optional i64 length
}
// gensrc/thrift/PlanNodes.thrift#L340
struct THdfsScanRange {
    21: optional list<Types.TSlotId> delete_column_slot_ids;
    28: optional map<Types.TSlotId, Exprs.TExpr> extended_columns; // e.g. spec_id
}
This ensures that equality deletes remain valid even if columns are renamed or repartitioned.
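As a quick illustration of why field IDs matter, here is a small sketch using the Apache Iceberg Java API (the field ID 5 and the column names are illustrative): an equality delete written before a rename still resolves to the right column.
import org.apache.iceberg.Schema;
import org.apache.iceberg.types.Types;

// The table's current schema after renaming customer_id -> client_id;
// the column keeps field ID 5 across the rename.
Schema current = new Schema(
        Types.NestedField.required(5, "client_id", Types.LongType.get()));

// The equality delete file references field ID 5, not a column name:
String column = current.findField(5).name();   // -> "client_id"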
In short:
- Positional deletes are consumed directly in backend scanners with a skip bitmap.
- Equality deletes are not; they're carried as separate ranges and applied later.
- Field IDs and spec IDs make equality deletes robust to schema evolution.
So, the rest is optional; feel free to stop here and save your mental health.

Frontend: constructing MoR scan ranges
When StarRocks plans a query against an Iceberg merge-on-read table, the frontend decides how to split work between data file scans with positional deletes and stand-alone equality delete ranges.
Data files with positional deletes attached
When the frontend encounters a data file, it attaches only the positional delete files and skips equality ones:
// IcebergConnectorScanRangeSource.java#L155
private List<TScanRangeLocations> buildScanRanges(FileScanTask task, Long partitionId)
        throws AnalysisException {
    THdfsScanRange hdfsScanRange = buildScanRange(task, task.file(), partitionId);
    List<TIcebergDeleteFile> posDeleteFiles = new ArrayList<>();
    for (DeleteFile deleteFile : task.deletes()) {
        FileContent content = deleteFile.content();
        if (content == FileContent.EQUALITY_DELETES) {
            continue; // Skip these - they'll be handled separately
        }
        TIcebergDeleteFile target = new TIcebergDeleteFile();
        target.setFull_path(deleteFile.path().toString());
        target.setFile_content(TIcebergFileContent.POSITION_DELETES);
        target.setLength(deleteFile.fileSizeInBytes());
        posDeleteFiles.add(target);
    }
    if (!posDeleteFiles.isEmpty()) {
        hdfsScanRange.setDelete_files(posDeleteFiles);
    }
    return Lists.newArrayList(buildTScanRangeLocations(hdfsScanRange));
}
The Queue Orchestra: Managing scan tasks at scale
Here's where things get interesting. StarRocks doesn't just blindly process files one by one. Instead, it orchestrates scan tasks through a sophisticated queue system that categorizes and routes work based on delete file associations.

Enter the IcebergRemoteSourceTrigger
Think of IcebergRemoteSourceTrigger as a traffic controller at a busy intersection. It receives file scan tasks from Iceberg and routes them to the appropriate processing lanes:
// IcebergRemoteSourceTrigger.java#L39
public class IcebergRemoteSourceTrigger {
    private final RemoteFileInfoSource delegate;
    // The three main highways for our scan tasks
    private Optional<Deque<RemoteFileInfo>> dataFileWithEqDeleteQueue = Optional.empty();
    private Optional<Deque<RemoteFileInfo>> dataFileWithoutEqDeleteQueue = Optional.empty();
    private Optional<Deque<RemoteFileInfo>> eqDeleteQueue = Optional.empty();
    // For the complicated case: multiple equality ID combinations
    private final Map<IcebergMORParams, Deque<RemoteFileInfo>> queues = new HashMap<>();
}
Why separate queues? Performance. Clean data files (no equality deletes) can zoom through the fast lane. Files with equality deletes need to merge at the anti-join intersection. And equality delete files themselves? They're building the roadblock that the anti-join will use.
The dispatch logic
The trigger's dispatch method is where the routing happens. It's like a sorting hat for scan tasks:
// IcebergRemoteSourceTrigger.java#L99
public synchronized void trigger() {
    if (!delegate.hasMoreOutput()) {
        return;
    }
    IcebergRemoteFileInfo remoteFileInfo = delegate.getOutput().cast();
    FileScanTask scanTask = remoteFileInfo.getFileScanTask();
    // Check what kind of deletes we're dealing with
    List<DeleteFile> eqDeleteFiles = scanTask.deletes().stream()
            .filter(f -> f.content() == FileContent.EQUALITY_DELETES)
            .collect(Collectors.toList());
    // Route to the appropriate queue
    if (eqDeleteFiles.isEmpty()) {
        // Fast lane - no equality deletes!
        dataFileWithoutEqDeleteQueue.map(queue -> queue.add(remoteFileInfo));
    } else {
        // Needs anti-join processing
        dataFileWithEqDeleteQueue.map(queue -> queue.add(remoteFileInfo));
        if (!needToCheckEqualityIds) {
            // Simple case: all equality deletes use the same columns
            eqDeleteQueue.map(queue -> queue.add(remoteFileInfo));
        } else {
            // Complex case: different equality ID combinations
            // A data file might have multiple eq delete files with different IDs
            // We need to fill ALL corresponding queues
            // (more on this nightmare scenario below)
        }
    }
}
The multiple equality IDs nightmare
Here's a fun scenario that'll keep you up at night. Imagine a table where different equality delete files use different identifier columns:
- Delete file 1: uses column user_id
- Delete file 2: uses columns user_id and timestamp
- Delete file 3: uses column transaction_id
A single data file might need to be checked against all three! Miss one queue, and you've got phantom data appearing in your results. The comment in the code puts it perfectly:
// IcebergRemoteSourceTrigger.java#L119-122
// now the different equality ids have its own file info queue.
// but a data file may corresponding to many eq delete files.
// and these eq delete files may not have the same equality ids,
// we should fill in all the corresponding queue.
// otherwise delete table scan with certain equality ids can get no results
and miss the join result.
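To make that fan-out concrete, here is a JDK-only sketch (the real code keys its queues by IcebergMORParams; reducing the key to the set of equality field IDs is a simplification of mine):
import java.util.*;

// Sketch: route one data file into every queue whose equality-ID combination
// appears among that file's equality delete files.
Map<Set<Integer>, Deque<String>> queuesByEqualityIds = Map.of(
        Set.of(5), new ArrayDeque<>(),        // deletes keyed on user_id
        Set.of(5, 7), new ArrayDeque<>(),     // deletes keyed on user_id + timestamp
        Set.of(9), new ArrayDeque<>());       // deletes keyed on transaction_id

String dataFile = "s3://bucket/data/00001.parquet";
List<Set<Integer>> idsOfItsEqDeletes = List.of(Set.of(5), Set.of(9));

for (Set<Integer> ids : idsOfItsEqDeletes) {
    // Missing one of these adds is exactly the "phantom data" failure mode described above.
    queuesByEqualityIds.get(ids).add(dataFile);
}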
When does the rewrite trigger?
The IcebergEqualityDeleteRewriteRule has specific conditions before it rewrites anything:
// IcebergEqualityDeleteRewriteRule.java#L75
@Override
public boolean check(final OptExpression input, OptimizerContext context) {
    Operator op = input.getOp();
    if (!op.isLogical()) {
        return false;
    }
    LogicalIcebergScanOperator scan = (LogicalIcebergScanOperator) op;
    // Already processed? Skip it
    if (scan.isFromEqDeleteRewriteRule()) {
        return false;
    }
    // No equality deletes? Nothing to do here
    if (!scan.hasEqualityDeletes()) {
        return false;
    }
    return true;
}
Incremental processing: Not drowning in metadata
For tables with thousands of files, loading all metadata upfront would be like downloading an entire Netflix series when you just want to watch one episode. StarRocks implements incremental scan range generation to avoid this:

The batching mechanism
// IcebergConnectorScanRangeSource.java#L121
@Override
public List<TScanRangeLocations> getSourceOutputs(int maxSize) {
    try (Timer ignored = Tracers.watchScope(EXTERNAL, "ICEBERG.getScanFiles")) {
        List<TScanRangeLocations> res = new ArrayList<>();
        while (hasMoreOutput() && res.size() < maxSize) {
            RemoteFileInfo remoteFileInfo = remoteFileInfoSource.getOutput();
            IcebergRemoteFileInfo icebergRemoteFileInfo = remoteFileInfo.cast();
            FileScanTask fileScanTask = icebergRemoteFileInfo.getFileScanTask();
            res.addAll(toScanRanges(fileScanTask));
        }
        return res;
    }
}
The maxSize parameter (controlled by connector_incremental_scan_ranges_size) determines how many scan tasks to generate per batch. Too small, and you're constantly asking for more work. Too large, and you risk OOM on metadata.
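A hypothetical driver loop (dispatchToBackends is a name I made up, not StarRocks code) shows how such batching is meant to be consumed:
// Hypothetical consumer: drain the scan-range source in bounded batches.
int batchSize = 500;   // connector_incremental_scan_ranges_size
while (scanRangeSource.hasMoreOutput()) {
    List<TScanRangeLocations> batch = scanRangeSource.getSourceOutputs(batchSize);
    dispatchToBackends(batch);   // hypothetical: hand this batch to the scheduler
}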
Async metadata fetching
When enable_connector_incremental_scan_ranges is enabled, StarRocks can fetch metadata asynchronously:
// IcebergScanNode.java#L163-172
if (enableIncrementalScanRanges) {
    remoteFileInfoSource = GlobalStateMgr.getCurrentState()
            .getMetadataMgr().getRemoteFilesAsync(icebergTable, params);
} else {
    List<RemoteFileInfo> splits = GlobalStateMgr.getCurrentState()
            .getMetadataMgr().getRemoteFiles(icebergTable, params);
    remoteFileInfoSource = new RemoteFileInfoDefaultSource(splits);
}
The async version returns immediately with a handle, letting the query start executing while metadata loads in the background. It's like ordering your coffee on the app while you're still driving to the store.
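Conceptually it boils down to something like this JDK-only sketch (loadManifests is a made-up stand-in for the metadata listing, not a StarRocks API):
import java.util.concurrent.*;

// Sketch: load file metadata in the background while planning starts
// consuming whatever has already arrived.
BlockingQueue<RemoteFileInfo> buffer = new LinkedBlockingQueue<>();
ExecutorService pool = Executors.newSingleThreadExecutor();
pool.submit(() -> {
    for (RemoteFileInfo info : loadManifests(icebergTable)) {   // hypothetical loader
        buffer.add(info);
    }
});
// The planner never waits for the full listing; it polls incrementally
// (null simply means nothing has arrived yet):
RemoteFileInfo next = buffer.poll();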

Smart optimizations: Not all deletes are created equal
Position delete pruning
Here's a clever trick. Not every position delete file affects every data file. StarRocks checks if a position delete actually references the current data file:
// IcebergConnectorScanRangeSource.java#L201-210
for (DeleteFile del : fileScanTask.deletes()) {
    if (del.content() == FileContent.POSITION_DELETES) {
        int filePathId = 2147483546; // Field ID of file_path in position delete files (Iceberg spec)
        // No referenced data file and differing file_path bounds?
        // Then this is a partition-level delete spanning multiple files - it might affect this one.
        if (del.referencedDataFile() == null &&
                del.lowerBounds() != null &&
                !del.lowerBounds().get(filePathId).equals(
                        del.upperBounds().get(filePathId))) {
            res.add(fileScanTask);
        }
        // File-level delete that targets a different file? Skip it!
    }
}
This optimization can save massive amounts of I/O when position deletes are file-specific rather than partition-wide.
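The test itself is simple; here is a simplified sketch (plain strings stand in for Iceberg's serialized bounds, and the paths are made up):
// Sketch: a position delete whose file_path lower and upper bounds are equal
// can only reference that single data file; skip it for every other file.
String currentDataFile = "s3://bucket/data/00007.parquet";
String lowerBound = "s3://bucket/data/00001.parquet";
String upperBound = "s3://bucket/data/00001.parquet";

boolean targetsSingleFile = lowerBound.equals(upperBound);
boolean mayAffectCurrentFile = !targetsSingleFile || lowerBound.equals(currentDataFile);
if (!mayAffectCurrentFile) {
    // Prune: no need to read this delete file at all for currentDataFile.
}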
Manifest pruning
Before even looking at files, StarRocks can prune entire manifests:
// Session variable: enable_prune_iceberg_manifest
// Skips manifests that don't overlap with query predicates
It's like checking the table of contents before reading every chapter of a book.
The Physical Plan: IcebergScanNode
After the optimizer splits the logical scan into multiple operations, each becomes an IcebergScanNode in the physical plan. This is where the MoR strategy becomes concrete:
// fe/fe-core/src/main/java/com/starrocks/planner/IcebergScanNode.java#L60
public class IcebergScanNode extends ScanNode {
    private final IcebergTableMORParams tableFullMORParams;
    private final IcebergMORParams morParams;
    // ...
}
Each IcebergScanNode instance knows its specific role through these parameters. The critical setup happens in two methods:
Setting up the queue-specific view
// IcebergScanNode.java#L130
if (morParams != IcebergMORParams.EMPTY) {
    boolean needToCheckEqualityIds = tableFullMORParams.size() != 3;
    IcebergRemoteSourceTrigger trigger = new IcebergRemoteSourceTrigger(
            remoteFileInfoSource, morParams, needToCheckEqualityIds);
    Deque<RemoteFileInfo> remoteFileInfoDeque = trigger.getQueue(morParams);
    remoteFileInfoSource = new QueueIcebergRemoteFileInfoSource(trigger, remoteFileInfoDeque);
}
This wraps the file source with a queue-specific view. Each scan node only sees files relevant to its role — a clean separation of concerns.
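Conceptually, the wrapper looks something like the following minimal sketch (not the actual QueueIcebergRemoteFileInfoSource; hasMoreInput() is a stand-in name for however the trigger reports pending work):
import java.util.Deque;

// Minimal sketch: expose a plain file-info source while draining one specific
// queue, pulling the shared trigger whenever that queue runs dry.
class QueueBackedSource {
    private final IcebergRemoteSourceTrigger trigger;
    private final Deque<RemoteFileInfo> queue;

    QueueBackedSource(IcebergRemoteSourceTrigger trigger, Deque<RemoteFileInfo> queue) {
        this.trigger = trigger;
        this.queue = queue;
    }

    boolean hasMoreOutput() {
        while (queue.isEmpty() && trigger.hasMoreInput()) {   // stand-in for the real check
            trigger.trigger();   // may fill this queue or a sibling's
        }
        return !queue.isEmpty();
    }

    RemoteFileInfo getOutput() {
        return queue.poll();
    }
}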
The incremental vs batch decision
The scan node can operate in two modes:
// IcebergScanNode.java#L119
if (enableIncrementalScanRanges) {
    // Async mode - returns immediately with a handle
    remoteFileInfoSource = GlobalStateMgr.getCurrentState().getMetadataMgr()
            .getRemoteFilesAsync(icebergTable, params);
} else {
    // Batch mode - loads all metadata upfront
    List<RemoteFileInfo> splits = GlobalStateMgr.getCurrentState().getMetadataMgr()
            .getRemoteFiles(icebergTable, params);
    remoteFileInfoSource = new RemoteFileInfoDefaultSource(splits);
    // ... then wraps with queue if needed
}
Backend Execution: Where Theory Meets Reality
The BE doesn't know what an "IcebergScanNode" is — and it doesn't need to. Through Thrift serialization, the Java planner's sophisticated MoR strategy becomes simple HDFS scan ranges with delete files attached.

The Translation Layer
When the FE's IcebergScanNode serializes itself, it becomes an ordinary HDFS_SCAN_NODE:
// IcebergScanNode.java#L292
@Override
protected void toThrift(TPlanNode msg) {
    msg.node_type = TPlanNodeType.HDFS_SCAN_NODE;
    THdfsScanNode tHdfsScanNode = new THdfsScanNode();
    tHdfsScanNode.setTuple_id(desc.getId().asInt());
    msg.hdfs_scan_node = tHdfsScanNode;
}
The Iceberg-specific magic travels in the scan ranges:
// PlanNodes.thrift#L340
struct THdfsScanRange {
    21: optional list<Types.TSlotId> delete_column_slot_ids;
    22: optional list<TIcebergDeleteFile> delete_files; // The secret sauce
    28: optional map<Types.TSlotId, Exprs.TExpr> extended_columns;
}
IcebergScanNode produces a TPlanNode, which is serialized into THdfsScanNode and a list of TScanRange objects. Each scan range carries metadata for positional deletes, equality deletes, and extended columns before being dispatched for execution.
Position Deletes: The Bitmap Tarantella
When a scanner encounters position delete files, IcebergDeleteBuilder springs into action:
// iceberg_delete_builder.cpp#L66-72
Status IcebergDeleteBuilder::fill_skip_rowids(const ChunkPtr& chunk) const {
    const ColumnPtr& file_path = chunk->get_column_by_slot_id(k_delete_file_path.id);
    const ColumnPtr& pos = chunk->get_column_by_slot_id(k_delete_file_pos.id);
    for (int i = 0; i < chunk->num_rows(); i++) {
        if (file_path->get(i).get_slice() == _params.path) {
            _deletion_bitmap->add_value(pos->get(i).get_int64());
        }
    }
    return Status::OK();
}
The builder reads the delete file (Parquet or ORC), extracts file_path and pos columns, and builds a bitmap. These special columns are identified by magic slot IDs:
// iceberg_delete_builder.cpp#L35
static const IcebergColumnMeta k_delete_file_path{
        .id = INT32_MAX - 101, // Magic number recognized by BE
        .col_name = "file_path",
        .type = TPrimitiveType::VARCHAR
};
The bitmap is then checked during data file scanning — deleted positions simply never materialize into chunks. It's elegant: a simple bitmap check in the hot path instead of complex join logic.
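The hot-path check reduces to something like this JDK sketch (a java.util.BitSet standing in for the real deletion bitmap):
import java.util.BitSet;

// Sketch: positions present in the bitmap never make it into an output chunk.
BitSet deleted = new BitSet();
deleted.set(3);    // row 3 of this data file was deleted
deleted.set(17);

long[] positions = {0, 1, 2, 3, 4, 17, 18};
for (long pos : positions) {
    if (deleted.get((int) pos)) {
        continue;  // skip the deleted row; no join machinery involved
    }
    // emit the row at `pos` into the output chunk
}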
Equality Deletes: The Hash Join Tango
For equality deletes, the BE leverages its existing hash join machinery, but with a twist. The HashJoiner recognizes it's handling MoR through a special parameter:
// hash_joiner.cpp#L82 (constructor)
HashJoiner::HashJoiner(const HashJoinerParam& param)
        : _mor_reader_mode(param._mor_reader_mode), // MoR awareness!
          // ... other initialization
The hash table parameters get MoR-specific configuration:
// hash_joiner.cpp#L148
void HashJoiner::_init_hash_table_param(HashTableParam* param, RuntimeState* state) {
    param->mor_reader_mode = _mor_reader_mode; // Propagate MoR mode
    // ... rest of configuration
}
The join operates as an anti-join: data files on the build side, equality delete files on the probe side. Rows that match equality delete keys get filtered out.
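Whichever side builds the hash table, the observable semantics reduce to key membership, as in this JDK-only sketch (values are illustrative):
import java.util.List;
import java.util.Set;

// Sketch: equality delete keys form the filter set; data rows whose key
// matches are dropped, everything else survives (anti-join semantics).
Set<Long> deletedUserIds = Set.of(42L, 1001L);           // from equality delete files
List<Long> dataUserIds = List.of(7L, 42L, 99L, 1001L);   // from the data file

List<Long> survivors = dataUserIds.stream()
        .filter(id -> !deletedUserIds.contains(id))
        .toList();                                        // [7, 99]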
The Metrics Trail
The BE tracks MoR overhead separately from regular scan metrics:
// iceberg_delete_builder.cpp#L229
void IcebergDeleteBuilder::update_delete_file_io_counter(RuntimeProfile* parent_profile, ...) {
    const std::string ICEBERG_TIMER = "ICEBERG_V2_MOR";
    ADD_COUNTER(parent_profile, ICEBERG_TIMER, TUnit::NONE);
    // Track delete file I/O separately
    ADD_CHILD_COUNTER(parent_profile, "MOR_AppIOBytesRead", TUnit::BYTES, prefix);
    ADD_CHILD_TIMER(parent_profile, "MOR_AppIOTime", prefix);
    // Track bitmap operations
    ADD_CHILD_COUNTER(parent_profile, "MOR_FSIOBytesRead", TUnit::BYTES, prefix);
    // ... more metrics
}
For hash joins handling equality deletes:
// hash_joiner.cpp#L40
void HashJoinProbeMetrics::prepare(RuntimeProfile* runtime_profile) {
    search_ht_timer = ADD_TIMER(runtime_profile, "SearchHashTableTime");
    other_join_conjunct_evaluate_timer = ADD_TIMER(runtime_profile, "OtherJoinConjunctEvaluateTime");
    // Tracks the anti-join overhead
}
Position deletes go down the IcebergDeleteBuilder path, while equality deletes pass through a hash joiner. Both lanes produce cleaned ranges that feed into result chunks and execution metrics.
Why This Design Works
The BE's approach is pragmatic:
- Reuse existing infrastructure: HDFS scanners for file reading, hash joiners for equality deletes
- Simple abstractions: Delete files are just metadata attached to scan ranges
- Performance isolation: Position deletes use fast bitmaps, equality deletes use proven hash join code
- Clear metrics: MoR overhead is tracked separately, making performance tuning straightforward
The BE doesn't need to understand Iceberg's complex MoR semantics. It just sees:
- Files to scan with bitmaps to check (position deletes)
- Two sets of files to join (equality deletes)
This separation of concerns — FE handles complexity, BE handles execution — is what makes StarRocks' MoR implementation both sophisticated and efficient.
The Layered Abstraction
What makes StarRocks' design distinctive is this clean layering:
- Optimization Layer (IcebergEqualityDeleteRewriteRule): Decides on the MoR strategy
- Planning Layer (IcebergScanNode): Creates queue-based file sources
- Execution Layer (IcebergDeleteBuilder and HashJoin): Consumes pre-organized scan ranges
The QueueIcebergRemoteFileInfoSource wrapper is the key abstraction - it presents a unified interface while actually pulling from specific queues. The backend executors don't need to understand the full MoR complexity; they just process scan ranges.
This contrasts with other engines:
- Spark: Makes delete handling decisions at task runtime
- Trino: Builds delete filters on-demand during execution
- Dremio: Integrates delete handling into vectorized execution
StarRocks trades runtime flexibility for predictability and simpler backend execution.
Putting it all together: The complete pipeline
Let's trace a query through the entire system:

The FE applies IcebergEqualityDeleteRewriteRule, generates TPlanNode variants for data with and without equality deletes, and serializes them into THdfsScanRange. In the BE/CN, positional deletes are applied with IcebergDeleteBuilder, equality deletes flow through a hash joiner, and the final cleaned chunks are merged into query results.
- Query arrives: SELECT * FROM iceberg_table WHERE date = '2025-01-01'
- Optimizer kicks in: IcebergEqualityDeleteRewriteRule examines the table's delete files
- Scan node splits: If equality deletes exist, creates multiple scan operators: one for data files without equality deletes, one for data files with equality deletes, one for each unique equality delete column combination
- Queue initialization: IcebergRemoteSourceTrigger sets up the appropriate queues
- Metadata fetching begins: Async if enabled, pulling manifest files
- Scan task routing: Each FileScanTask gets sorted into its queue
- Incremental generation: Scan ranges produced in batches of 500 (or configured size)
- Backend execution: Fast-lane files processed directly with position delete bitmaps, equality delete files scanned for their key values, anti-joins applied for data files with equality deletes
- Results merged: All paths converge to produce the final result set
Configuration knobs you should know
# Enable incremental processing for large tables
enable_connector_incremental_scan_ranges
# Batch size for scan range generation (memory vs latency tradeoff)
connector_incremental_scan_ranges_size
# Async queue size for parallel metadata fetching
connector_remote_file_async_queue_size
# Prune manifests that don't match predicates
enable_prune_iceberg_manifest
# Handle tables with partition evolution
enable_read_iceberg_equality_delete_with_partition_evolution
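For example, assuming the first two are exposed as session variables in your deployment, they can be toggled per session over the MySQL protocol (host, port, and credentials below are placeholders):
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

// Toggle incremental scan-range generation for one session, then run the query.
try (Connection conn = DriverManager.getConnection(
        "jdbc:mysql://fe-host:9030/iceberg_catalog.iceberg_db", "user", "password");
     Statement stmt = conn.createStatement()) {
    stmt.execute("SET enable_connector_incremental_scan_ranges = true");
    stmt.execute("SET connector_incremental_scan_ranges_size = 500");
    stmt.executeQuery("SELECT * FROM iceberg_table WHERE date = '2025-01-01'");
}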
The beauty of this design
What makes StarRocks' approach elegant is the separation of concerns:
- Frontend handles all the metadata complexity and routing logic
- Backend just executes simple operations: scan with bitmap, or scan for anti-join
- Queues provide natural parallelism and work distribution
- Incremental processing keeps memory bounded regardless of table size
It turns Iceberg's complex MoR semantics into well-defined, parallelizable operations. The backend doesn't need to understand the full complexity of Iceberg's specification — it just needs to execute bitmap checks for position deletes and hash joins for equality deletes, using work packages the frontend has carefully prepared and labeled with MoR parameters.
The key insight is that the BE understands what to do (check bitmaps, perform anti-joins) but doesn't need to understand why (Iceberg's MoR specification, time-travel semantics, etc.). The FE handles the complexity of deciding which files need which treatment.
