Released July 31, 2025, Flink 2.1 dropped a bombshell on the streaming world: DeltaJoin and MultiJoin. These aren't your typical join operators; they're surgical instruments designed to excise the "state tumor" that's been metastasizing in production Flink clusters for years.
Every Flink team knows the dirty secret: join state inflates faster than a London rental listing. You start with gigabytes, graduate to terabytes, then watch helplessly as checkpoints stretch from seconds to minutes to "let's just restart the job." Traditional streaming joins hoard everything — every customer record, every order, every update, because they might need it someday. It's digital hoarding at datacenter scale.
The numbers are sobering. For many teams, a modest e-commerce platform joining orders with customers accumulates tens of gigabytes of state per billion events. Add product catalogs, inventory, shipping — suddenly you're managing terabytes of state across your topology. Checkpoints that started at 30 seconds stretch to 5 minutes, then 20. Recovery time follows the same trajectory. Your on-call rotation becomes a game of Russian roulette: whose turn is it when the next OOM brings down production?
Sure, you can fight back — state TTLs to expire old data, interval joins with time bounds, broadcasting small dimension tables, or even falling back to batch reprocessing. Some teams pre-partition data to reduce per-operator state or switch to temporal joins where possible. But these workarounds either limit functionality or add operational complexity. That's why the industry is racing toward fundamental solutions: DeltaJoin externalizes state entirely, RisingWave uses cloud-native shared storage, Feldera pre-materializes everything. Different philosophies, same recognition that traditional join state is broken.
DeltaJoin flips the script: why store when you can fetch? Instead of maintaining partner history in operator state, it queries an external indexed store at emit time. Because it's eventually consistent by design, each probe sees the latest counterpart version at emit time; there's no snapshot pinning. The payoff is dramatic: minimal checkpoint overhead, lightning-fast recovery, and jobs that actually scale elastically instead of just claiming to.
Streaming is being reinvented in front of us.
Flink/Fluss, RisingWave, and Feldera are not rivals so much as different philosophies of state, and paying attention now is how we learn which ideas will define the next decade.
I'll focus mainly on Flink and Fluss, but the story wouldn't be complete without glancing at how RisingWave and Feldera tackle the same challenge from other angles.
Enter Fluss: The Storage Layer That Gets It
This lookup-based rebellion demands a cooperative storage layer. Apache Fluss (Incubating) isn't just another streaming store; it's purpose-built for this exact pattern. While other stores make you choose between streaming updates and efficient lookups, Fluss delivers both through its prefix lookup capability on primary-key tables.
The magic is in the prefix. Most lookup stores demand exact keys, meaning you need the complete primary key or you're out of luck. Fluss lets you probe with just a prefix. Got a composite key (customer_id, order_id, item_id) but only know the customer? Traditional stores would force a scan or secondary index if supported. Fluss says "no problem" as your prefix lookup hits a tight RocksDB range scan on a single tablet. Paimon's PK-based lookup works great when your join keys match the primary key exactly; Fluss's prefix lookup makes composite-key joins practical when only a prefix is available, and it's the real-world case in enrichment pipelines.
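To make that concrete, here's a rough sketch of what such a table could look like (names invented): the primary key is the full composite, but bucket.key is just the customer prefix, so a probe by customer_id alone stays on one tablet.
CREATE TABLE fluss_order_items (
    customer_id BIGINT,
    order_id BIGINT,
    item_id BIGINT,
    quantity INT,
    PRIMARY KEY (customer_id, order_id, item_id) NOT ENFORCED
) WITH (
    'bucket.num' = '16',
    'bucket.key' = 'customer_id' -- prefix of the PK: probes by customer_id stay on one tablet
);
A join that only carries customer_id can then be answered by a tight range scan inside a single RocksDB tablet instead of a cluster-wide scatter.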

Under the hood, each Fluss table bucket maps to a KV tablet backed by RocksDB, plus a child log tablet for the changelog. When your lookup hits the tablet leader, it translates directly into RocksDB operations. The dual structure (a mutable KV store plus its changelog) gives you both point-in-time lookups and CDC streams. This isn't accidental; it's architected specifically for patterns like DeltaJoin.
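A rough illustration of that duality in Flink SQL, assuming the fluss_customers table defined further down and an orders stream with a processing-time column; the lookup-join syntax here is the standard Flink one, worth double-checking against the Fluss connector docs.
-- Scanning the PK table streams its changelog (inserts, updates, deletes)
SELECT customer_id, customer_data FROM fluss_customers;

-- The same table can answer point lookups, e.g. via a classic Flink lookup join
SELECT o.order_id, c.customer_data
FROM orders AS o
JOIN fluss_customers FOR SYSTEM_TIME AS OF o.proc_time AS c
    ON o.customer_id = c.customer_id AND o.region_id = c.region_id;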

DeltaJoin builds on two critical innovations. FLIP-486 provides the core operator, StreamingDeltaJoinOperator, with bilateral LRU caches and the planner intelligence to know when to use them. When a record arrives on either side, the operator checks its cache and triggers an async probe on a miss. Two AsyncDeltaJoinRunner instances handle the bilateral lookups, each maintaining its own cache to avoid constant external calls.
FLIP-519 solves the harder problem: async ordering chaos. The KeyedAsyncWaitOperator ensures updates for the same key process sequentially while different keys run concurrently. Without this, your upserts would corrupt faster than a politician's promises. It's the difference between eventual consistency and eventual catastrophe.

KeyedAsyncWaitOperator serializes updates per key while letting different keys run concurrently—turning async chaos into ordered streams.
Configuration: Where Dreams Meet Reality
Your setup starts simple:
SET 'table.optimizer.delta-join.strategy' = 'AUTO';
SET 'table.exec.async-lookup.key-ordered-enabled' = 'true';
SET 'table.exec.async-lookup.output-mode' = 'ALLOW_UNORDERED';
-- DeltaJoin caches are configurable per side; tune per workload
But the optimizer is picky. Between source and join, only stateless operators survive the rewrite: scans with pushdowns, key-preserving calcs, watermarks, exchanges. One stateful operator in the chain? Back to traditional joins and their state baggage. No cascaded joins yet — the planner stays conservative.
Changelog streams face tighter scrutiny. Streams heavy in UPDATE_BEFORE operations get rejected—DeltaJoin can't reconstruct "before" values from point lookups. Delete operations follow strict type-based rules: INNER joins tolerate deletes from one side only; LEFT/RIGHT joins only from the outer side. The planner stays conservative, choosing correctness over optimization every time.

Your Fluss table design determines performance:
CREATE TABLE fluss_customers (
    customer_id BIGINT,
    region_id INT,
    order_count BIGINT,
    customer_data STRING,
    PRIMARY KEY (customer_id, region_id) NOT ENFORCED
) WITH (
    'bucket.num' = '16',
    'bucket.key' = 'customer_id', -- single-tablet path for lookups by customer
    'table.log.ttl' = '7d'        -- log retention
);
PK tables default to hash bucketing by primary key (excluding partition keys) unless you set bucket.key. This configuration is mission-critical—set bucket.key = 'customer_id' and every probe hits a single tablet. Skip this step, and Fluss distributes your lookups across the cluster.
The Architecture Wars: Different Philosophies, Same Problem
Here's where the streaming world splits into camps. RisingWave takes the "streaming database" approach: it's PostgreSQL-wire-compatible and maintains all join state in shared storage with tiered caching. When you write a join in RisingWave, you're essentially creating a materialized view that updates incrementally. The engine manages multi-version concurrency control (MVCC) internally, giving you snapshot consistency by default. Your joins always see a consistent view of both sides, but you pay for it with storage overhead and checkpoint coordination across actors.
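For flavor, here's roughly what that looks like on the RisingWave side, in its PostgreSQL-style dialect (table names invented): the join is just a materialized view, and the engine keeps it incrementally up to date.
-- RisingWave: the join lives as an incrementally maintained materialized view
CREATE MATERIALIZED VIEW enriched_orders AS
SELECT o.order_id, o.amount, c.customer_data
FROM orders AS o
JOIN customers AS c ON o.customer_id = c.customer_id;

-- Downstream queries read a snapshot-consistent result
SELECT * FROM enriched_orders WHERE amount > 100;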
RisingWave's architecture is fascinating: compute nodes are stateless, with all state living in Hummock shared storage (typically S3-compatible object stores). Hot data stays in multi-level caches: block cache, meta cache, and optional disk cache (NVMe/EBS). When a join processes a record, it might hit memory, local SSD, or remote storage. The consistency protocol ensures all actors see the same epoch, preventing temporal anomalies. But this coordination has a cost: checkpoint barriers must traverse all actors, with latency bounded by the slowest operator and storage I/O.
Their Disk Cache shows impressive results: up to 94% fewer remote reads and ~75% fewer S3 GETs in their tests, drastically reducing both latency and cloud costs. Atome, a BNPL payments company, migrated parts of their Flink-based pipeline to RisingWave specifically for operational simplicity and consistency guarantees.
Feldera goes even further with incremental everything. Built on DBSP, a formal model of incremental computation in the spirit of Differential Dataflow, Feldera doesn't just maintain state — it maintains the entire computation graph as incremental operators. Every join is a Z-set transformation that tracks insertions and deletions as weighted updates. When you query a join in Feldera, you're reading from a pre-computed, constantly updating index. Feldera recently rounded this out with first-class backfill orchestration, labeled connectors and staged historical-then-realtime ingest — so bootstrapping big state is part of the product, not a runbook.
Feldera's advantage comes from maintaining the entire computation graph as incremental state. When you join the result of Join A with Join B (multi-hop), Feldera doesn't recompute Join A from scratch; it maintains Join A's output as a differential index that updates incrementally. Each downstream join operates on pre-materialized, indexed results from upstream joins. For multi-hop and recursive queries like supply chain tracing or graph algorithms, this pre-computation model can outperform lookup-based approaches, because each iteration reuses incrementally maintained indexes instead of issuing fresh lookups or recomputing upstream joins.
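A hedged sketch of the same idea in Feldera's SQL dialect (names invented): each view is maintained incrementally, and downstream views build on upstream ones instead of recomputing them.
-- "Join A": maintained as an incremental index
CREATE VIEW order_customers AS
SELECT o.order_id, o.customer_id, c.region_id
FROM orders AS o
JOIN customers AS c ON o.customer_id = c.customer_id;

-- "Join B": consumes Join A's maintained output rather than recomputing it
CREATE VIEW order_regions AS
SELECT oc.order_id, r.region_name
FROM order_customers AS oc
JOIN regions AS r ON oc.region_id = r.region_id;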

DeltaJoin + Fluss takes the opposite bet: externalize everything. No internal state management, no MVCC overhead, no incremental index maintenance. Just probes. The philosophical triangle is clear:
- Flink+Fluss (DeltaJoin): "probe on demand, minimal job state, eventual consistency"
- RisingWave: "MVCC + materialized views over shared storage for snapshot-consistent results"
- Feldera: "pre-materialize and index every intermediate; strongest for multi-hop/recursive"
The Performance Reality Check
RisingWave checkpoints can reach 100–200MB/s throughput with consistent latency, but checkpoint size grows linearly with state. A billion-row dimension table means gigabytes of checkpoint data, even with compression. Recovery requires rebuilding consistent state across all actors; think minutes for large deployments. However, the shared storage model enables parallel recovery: new actors can start processing immediately while warming their caches asynchronously.
Feldera's checkpoints are incremental diffs, often smaller than RisingWave's, but the storage footprint is significant. That same billion-row table requires substantial storage to maintain efficient Z-set indexes, and that's before considering join output cardinality. The payoff? Sub-millisecond query latency on pre-computed results. Feldera can process millions of updates per second on a single node for moderately complex queries, but each join multiplies storage requirements.
DeltaJoin checkpoints are tiny compared to stream-stream joins: just cache entries and in-flight requests. Recovery means warming caches, not rebuilding state. The billion-row table lives in Fluss, accessed on-demand. The trade-off: uncached lookups add low single-digit milliseconds of latency, and cache misses during traffic spikes can cascade into backpressure. With proper cache configuration and hot-key patterns, high cache hit rates keep most lookups fast.
The implementation uses pooled result handlers and bounded in-flight async probing to manage resources efficiently. Failed lookups retry with exponential backoff, while partial caching helps balance memory usage against lookup frequency.
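The DeltaJoin-specific cache options aren't spelled out here, but Flink's generic async-lookup knobs give a feel for the tuning surface; whether and how they apply to DeltaJoin, and what the Fluss connector exposes for caching, is an assumption to verify against the docs.
-- Bound in-flight async probes and how long a single probe may take
SET 'table.exec.async-lookup.buffer-capacity' = '256';
SET 'table.exec.async-lookup.timeout' = '3 min';
-- Cache sizing is per side / per connector; option names vary, check the Fluss docs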

Real-World Patterns and Anti-Patterns
Choose RisingWave when you need SQL compatibility and epoch-based snapshot consistency for analytics and dashboards. Financial dashboards, operational analytics, anywhere business users write SQL directly. The PostgreSQL compatibility means your BI tools just work. The MVCC consistency means your reports always balance. The trade-off is operational complexity: you're running a distributed database, not just a stream processor.
Choose Feldera for complex, multi-hop incremental queries. Graph analytics, recursive CTEs, anything where you're joining join results with other join results. Picture a logistics company tracking packages through its network: packages join with trucks, trucks join with routes, routes join with weather, all joined with real-time GPS. In Feldera, this entire graph updates incrementally. In DeltaJoin, you'd need millions of lookups per second, each adding latency.
Choose DeltaJoin + Fluss for high-volume enrichment with bounded dimensions. User profile lookups, feature assembly, entity resolution, essentially the cases where one side is orders of magnitude larger than the other. Taobao enriching millions of events per second with user profiles and advertiser configs sees dramatic state reduction. Traditional join state would be massive. With DeltaJoin + Fluss? Just cache overhead, with dimension data living in Fluss's optimized storage.

But know the anti-patterns. DeltaJoin fails catastrophically with high-cardinality, high-mutation dimensions. If your dimension table has millions of rows changing every second, your cache becomes useless, every join becomes a lookup, and your network becomes the bottleneck. Imagine real-time inventory tracking: every sale updates inventory, invalidating caches instantly. In that case you're better off sticking with traditional joins and aggressive state TTLs, as sketched below.
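A minimal sketch of that fallback, with illustrative values and aliases; table.exec.state.ttl caps join state globally, while the STATE_TTL hint (Flink 1.18+) scopes it per join input.
-- Cap how long regular join state may live (value is workload-dependent)
SET 'table.exec.state.ttl' = '6 h';
-- Or scope the TTL per join input with a hint
SELECT /*+ STATE_TTL('o' = '6 h', 'inv' = '30 min') */ o.order_id, inv.stock
FROM orders AS o JOIN inventory AS inv ON o.item_id = inv.item_id;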

The Uncomfortable Consistency Question
Here's what the documentation confirms: DeltaJoin delivers eventual, not snapshot consistency by design. Each probe sees the latest row version at lookup time. Concurrent updates to both join sides create transient mismatches that eventually converge.
Consider this scenario: An order for customer C arrives at T1. DeltaJoin looks up customer C, gets version V1. At T2, customer C updates their address. At T3, another order arrives and gets version V2. Your output stream now has two orders for the same customer with different addresses, even though they were placed seconds apart. RisingWave and Feldera users might scoff — they get consistency guarantees built-in. But those guarantees aren't free — they're paid in checkpoint size, recovery time, and operational overhead.

Your downstream must be built for this reality. Upsert sinks that handle duplicates gracefully. Idempotent operations that survive replays. Business logic that tolerates temporary inconsistencies. Monitoring that tracks convergence time, not just throughput. For many real-time analytics use cases, this trade-off is brilliant: you exchange perfect consistency for operational sanity. For financial reconciliation requiring ACID guarantees? Keep walking.
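One concrete way to build the downstream for this reality is a keyed upsert sink, so a late or duplicate enrichment simply overwrites the earlier version of the same key; the upsert-kafka connector below is just one option, and the names are illustrative.
-- Keyed upsert sink: later versions of the same order overwrite earlier ones
CREATE TABLE enriched_orders_sink (
    order_id BIGINT,
    customer_id BIGINT,
    customer_data STRING,
    PRIMARY KEY (order_id) NOT ENFORCED
) WITH (
    'connector' = 'upsert-kafka',
    'topic' = 'enriched-orders',
    'properties.bootstrap.servers' = 'kafka:9092',
    'key.format' = 'json',
    'value.format' = 'json'
);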
The Future State

While Fluss 0.6 introduced prefix lookups, version 0.8 targets full Flink 2.1/DeltaJoin integration as a key milestone. The roadmap includes potential optimizations like automatic bucket key inference and adaptive cache sizing, though specifics remain subject to change.
RisingWave is doubling down on cloud-native architecture, with its Elastic Disk Cache delivering significant S3 I/O reductions, and is investing heavily in streaming Iceberg. Feldera continues pushing incremental computation boundaries, with its backfill orchestration making it easier to bootstrap complex stateful pipelines.
The streaming world isn't converging on a single solution — it's diverging into specialized tools for different problems. Traditional approaches like Flink's disaggregated state still treat state as job-private, offloaded but isolated. Materialize, Feldera, and RisingWave maintain rich indexed state within the engine, offering stronger consistency and incremental processing with smart intermediate caches, at higher operational cost and with a whole different set of tradeoffs.
DeltaJoin + Fluss isn't trying to be RisingWave or Feldera. It's solving a specific problem: making traditional Flink joins scale without the traditional state explosion. For teams already invested in Flink, facing the state management wall, this combination offers a pragmatic escape route without switching engines entirely.
The constraints are real: eventual consistency, strict topology requirements, and the need for cooperative storage. But for teams willing to embrace these trade-offs, the payoff is substantial: joins that actually scale, checkpoints measured in seconds not hours, and operators who can sleep through the night.
Shoutout to the Flink, Fluss, Feldera, and RisingWave teams for quietly revolutionizing how we think about data in motion.
While the rest of tech argues about AI, these folks are solving the unglamorous problems that actually keep the world's data flowing and making it look elegant in the process.