This is the second article in my DARE series, where I share how to move data between S3 and R2 using Spark (don't be afraid, not EMR this time - I know the reaction when readers see "I… used an EMR cluster!!… for my experiment").
D.A.R.E. (Data and AI Repository of Experiments) is my safe space to create, break, learn from chaos, and document what works. These articles discuss Data and AI Engineering experiments that matter.
In my previous article, I demonstrated how to migrate Apache Iceberg tables from standard S3 storage to S3 Tables, staying entirely within the AWS ecosystem. This article demonstrates how to migrate Apache Iceberg tables from the AWS Glue catalog to the Cloudflare R2 Data Catalog using Spark.
This experiment addresses a critical question: Are data lakes portable between cloud providers? Everyone talks about "open standards," but does that marketing hold up when moving real workloads from AWS to Cloudflare infrastructure?
The migration uses Spark's dual catalog configuration to read from Glue and write to R2. I'll show you how to handle the authentication complexity, manage JAR dependencies, and process large datasets efficiently. The experiment uses the Glue-managed Iceberg tables already sitting in standard S3 (source: Backblaze hard drive data), but this time the data moves between two completely different cloud providers. Here's how to make Spark work across AWS and Cloudflare infrastructure.
Before all that, what is R2 Data Catalog?
R2 Data Catalog is a managed Apache Iceberg catalog built directly into the R2 bucket, exposing a standard Iceberg REST catalog interface that works with existing engines such as DuckDB, Spark, Snowflake, and PyIceberg. It eliminates vendor lock-in while maintaining full Iceberg compatibility. Check out more about how R2 stands out and saves storage costs here.
The Migration Experiment Setup
I already had the Backblaze hard drive dataset as Iceberg tables in AWS S3, managed by the Glue catalog. The goal here is to migrate this 584 million row dataset to Cloudflare R2 Data Catalog while preserving all Iceberg features and partitioning.
Source Configuration:
- Apache Iceberg tables in AWS S3
- Glue catalog managing metadata
- Year-based partitioning with quarterly subdivisions
Destination Configuration:
- Cloudflare R2 bucket with Data Catalog enabled
- REST catalog authentication via API tokens
- Same partitioning strategy to preserve query performance
Infrastructure:
- EC2 instance (I used c6a.4xlarge) for local Spark session
- Dual catalog configuration reading from Glue, writing to R2
- JAR dependency management for cross-cloud compatibility

The challenge is configuring Spark to authenticate with both AWS Glue and Cloudflare R2, then processing 13 years of data efficiently without running out of memory.
Implementation
Compared to the previous migration (S3 to S3 Tables), this script brings its own set of challenges. This section explains how to handle them.
Dual Catalog Configuration
The first step, as always, is configuring the Glue catalog and the R2 Data Catalog. Each catalog requires different authentication and endpoint configuration. The key difference is that R2 uses a REST catalog with OAuth2 authentication, while Glue uses AWS SDK authentication. Both catalogs use S3FileIO, but with different endpoints and credentials.
# Glue catalog configuration
.config("spark.sql.catalog.glue_catalog", "org.apache.iceberg.spark.SparkCatalog")
.config("spark.sql.catalog.glue_catalog.catalog-impl", "org.apache.iceberg.aws.glue.GlueCatalog")
.config("spark.sql.catalog.glue_catalog.io-impl", "org.apache.iceberg.aws.s3.S3FileIO")
.config("spark.sql.catalog.glue_catalog.warehouse", os.getenv("S3_WAREHOUSE"))
# R2 catalog configuration
.config("spark.sql.catalog.r2_catalog", "org.apache.iceberg.spark.SparkCatalog")
.config("spark.sql.catalog.r2_catalog.type", "rest")
.config("spark.sql.catalog.r2_catalog.uri", os.getenv("R2_CATALOG_URI"))
.config("spark.sql.catalog.r2_catalog.token", os.getenv("R2_CATALOG_API_TOKEN"))
.config("spark.sql.catalog.r2_catalog.oauth2-server-uri", f"{os.getenv('R2_CATALOG_URI')}/oauth/tokens")The type: rest config tells Iceberg to communicate with Cloudflare's catalog via HTTP calls rather than direct AWS service integration. It's the difference between having a direct phone line versus using the internet to make calls.
JAR Dependencies
We need specific JAR files that aren't bundled with the standard Spark setup. I use a script to download the following JARs into the working directory.
required_jars = [
"iceberg-spark-runtime-3.5_2.12-1.6.1.jar",
"bundle-2.20.18.jar",
"url-connection-client-2.20.18.jar",
"hadoop-aws-3.3.4.jar",
"aws-java-sdk-bundle-1.12.262.jar"
]
The script validates that all required dependencies are present in the given location before starting.
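For reference, a minimal sketch of such a download-and-validate helper; the Maven Central paths are my assumption of where these artifacts live, not something taken from the original script.
import os
import urllib.request

MAVEN = "https://repo1.maven.org/maven2"  # assumed source repository
JAR_URLS = {
    "iceberg-spark-runtime-3.5_2.12-1.6.1.jar":
        f"{MAVEN}/org/apache/iceberg/iceberg-spark-runtime-3.5_2.12/1.6.1/iceberg-spark-runtime-3.5_2.12-1.6.1.jar",
    "bundle-2.20.18.jar":
        f"{MAVEN}/software/amazon/awssdk/bundle/2.20.18/bundle-2.20.18.jar",
    "url-connection-client-2.20.18.jar":
        f"{MAVEN}/software/amazon/awssdk/url-connection-client/2.20.18/url-connection-client-2.20.18.jar",
    "hadoop-aws-3.3.4.jar":
        f"{MAVEN}/org/apache/hadoop/hadoop-aws/3.3.4/hadoop-aws-3.3.4.jar",
    "aws-java-sdk-bundle-1.12.262.jar":
        f"{MAVEN}/com/amazonaws/aws-java-sdk-bundle/1.12.262/aws-java-sdk-bundle-1.12.262.jar",
}

def ensure_jars(target_dir="jars"):
    # Download any JAR that is not already present, then fail fast if any are still missing
    os.makedirs(target_dir, exist_ok=True)
    for name, url in JAR_URLS.items():
        path = os.path.join(target_dir, name)
        if not os.path.exists(path):
            print(f"Downloading {name} ...")
            urllib.request.urlretrieve(url, path)
    missing = [n for n in JAR_URLS if not os.path.exists(os.path.join(target_dir, n))]
    if missing:
        raise RuntimeError(f"Missing required JARs: {missing}")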
Year-by-Year strategy
Instead of bulk transfer (which would be expensive and risky across cloud boundaries), the script processes data in annual batches.
from pyspark.sql.functions import col, month, when

def migrate_table_batch_by_year(spark, source_table, dest_table, table_info):
    # Auto-detect available years from source
    years_df = spark.sql(f"SELECT DISTINCT year FROM {source_table} ORDER BY year")
    years = [row.year for row in years_df.collect()]
    for i, year in enumerate(years):
        # Read specific year with predicate pushdown
        df = spark.sql(f"SELECT * FROM {source_table} WHERE year = {year}")
        # Add quarter partitioning for R2 optimization
        df = df.withColumn("quarter",
            when(month(col("date")).between(1, 3), "Q1")
            .when(month(col("date")).between(4, 6), "Q2")
            .when(month(col("date")).between(7, 9), "Q3")
            .when(month(col("date")).between(10, 12), "Q4"))
The year-by-year approach provides three key benefits: minimized cross-cloud data transfer through predicate pushdown, granular recovery points if specific years fail, and controlled memory usage for large datasets. If 2019 data fails to migrate, you don't lose 2013–2018 progress.
Schema Evolution and Partitioning Strategy
R2 migration preserves the original schema while adding enhanced partitioning for Cloudflare's global edge optimization. The migration keeps your data structure exactly the same but reorganizes how files are split up to work better with Cloudflare's distributed storage network.
# First batch - create table with partitioning
if i == 0:
    df.writeTo(f"r2_catalog.default.{dest_table}") \
        .partitionedBy("year", "quarter") \
        .tableProperty("format-version", "2") \
        .tableProperty("write.target-file-size-bytes", "268435456") \
        .tableProperty("write.parquet.compression-codec", "snappy") \
        .createOrReplace()
else:
    # Subsequent batches - append with schema merging
    df.writeTo(f"r2_catalog.default.{dest_table}") \
        .option("mergeSchema", "true") \
        .append()
The script creates 256MB files with snappy compression because this size works well for both fast queries and efficient storage across Cloudflare's global network. Too small and you get too many files to manage; too large and data transfer becomes slow.
Memory Management
Cross-cloud migrations are memory-intensive, so the script includes automatic memory allocation and cleanup.
import gc
import psutil

# Auto-detect available memory
available_memory_gb = psutil.virtual_memory().available // (1024**3)
driver_memory_gb = max(4, min(28, int(available_memory_gb * 0.8)))
# Memory cleanup after each batch
df.unpersist()
del df
gc.collect()
Each year batch is processed independently, with aggressive memory cleanup between batches to prevent OOM errors during large cross-cloud transfers. The 80% memory allocation leaves headroom for Spark's internal operations, while explicit garbage collection prevents memory leaks that accumulate across 13 annual batches.
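The auto-detected value only helps if it is applied before the session starts, since driver memory cannot be changed once the JVM is running. A minimal sketch, assuming the builder shown earlier:
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    # driver_memory_gb comes from the psutil calculation above
    .config("spark.driver.memory", f"{driver_memory_gb}g")
    # ... plus the catalog and JAR configs from the earlier snippet ...
    .getOrCreate()
)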
Optimization
The optimization step is crucial for R2 performance, as it rewrites files for optimal distribution across Cloudflare's global edge network.
spark.sql(f"CALL r2_catalog.system.rewrite_data_files(table => 'default.{table_name}')")
spark.sql(f"CALL r2_catalog.system.rewrite_manifests('default.{table_name}')")The rewrite_data_files operation consolidates small files created during the migration into optimal 256MB chunks, while rewrite_manifests optimizes metadata for faster query planning. These operations are specific to Iceberg and run after migration completes.
This implementation handles the complexity of coordinating two different catalog systems while maintaining data integrity and performance. The year-by-year batch processing proved essential for both memory management and cost control during cross-cloud transfers.
The logs below show how this configuration performs during actual migration execution.

The entire 584 million row migration completed in 1 hour 48 minutes on a single node setup. Running this on an EMR cluster with multiple executors would likely reduce processing time significantly, but that's an experiment for another day.
These code snippets highlight the critical configuration points, but the complete migration script includes additional error handling, verification steps, and optimization routines.
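As one example of what those verification steps can look like, here is a hedged sketch that compares per-year row counts between the Glue source and the R2 destination; the table names in the usage comment are illustrative.
def verify_row_counts(spark, source_table, dest_table):
    # Compare per-year row counts between the Glue and R2 copies of a table
    src = spark.sql(f"SELECT year, COUNT(*) AS cnt FROM {source_table} GROUP BY year")
    dst = spark.sql(f"SELECT year, COUNT(*) AS cnt FROM {dest_table} GROUP BY year")
    src_counts = {r.year: r.cnt for r in src.collect()}
    dst_counts = {r.year: r.cnt for r in dst.collect()}
    mismatches = {
        y: (src_counts[y], dst_counts.get(y, 0))
        for y in src_counts
        if src_counts[y] != dst_counts.get(y, 0)
    }
    if mismatches:
        raise ValueError(f"Row count mismatch by year: {mismatches}")
    print(f"Verified {sum(src_counts.values()):,} rows across {len(src_counts)} years")

# Example usage (illustrative table names):
# verify_row_counts(spark, "glue_catalog.default.hdd_iceberg",
#                   "r2_catalog.default.hdd_iceberg_r2")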
Querying with DuckDB
Let's verify the data accessibility and test query performance. DuckDB provides the perfect tool for this with its native R2 Data Catalog support.
Follow the basic setup instructions from the documentation (ensure you have DuckDB version 1.3.0 or later). Start by installing and loading the required extensions, then configure a DuckDB secret to store the R2 API token, and finally attach the R2 Data Catalog (read-only), as sketched below.
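To keep everything in Python, the setup can be scripted with the duckdb package; this sketch mirrors the documented extension, secret, and attach steps, with placeholders for the token, warehouse name, and catalog URI that you fill in from your own account.
import duckdb

con = duckdb.connect()  # an in-memory database is enough for ad hoc queries

# Install and load the extensions needed for Iceberg REST catalogs and HTTPS reads
con.sql("INSTALL iceberg; LOAD iceberg;")
con.sql("INSTALL httpfs; LOAD httpfs;")

# Store the R2 API token as a DuckDB secret (placeholder value)
con.sql("""
    CREATE SECRET r2_secret (
        TYPE ICEBERG,
        TOKEN '<R2_CATALOG_API_TOKEN>'
    );
""")

# Attach the R2 Data Catalog under the alias hdd_r2_catalog
con.sql("""
    ATTACH '<WAREHOUSE_NAME>' AS hdd_r2_catalog (
        TYPE ICEBERG,
        ENDPOINT '<R2_CATALOG_URI>'
    );
""")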

First, you need the name of your Iceberg warehouse in R2; hdd_r2_catalog is the alias we will use to reference this catalog in DuckDB. You also need the catalog URI. You can get both of these from the R2 console for your bucket, as shown below:

Now, let's check the tables present in our R2 Catalog.

We will query hdd_iceberg_r2 from the default schema. Starting with the basics: a row count. This confirms that all our data has been migrated properly, showing 584 million rows.
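In the same Python session, listing the tables and counting rows looks roughly like this (hdd_iceberg_r2 is the table created during the migration):
# List the tables exposed through the attached catalog
con.sql("SHOW ALL TABLES;").show()

# Row count for the migrated table; this should report roughly 584 million rows
con.sql("SELECT COUNT(*) AS row_count FROM hdd_r2_catalog.default.hdd_iceberg_r2;").show()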

Now let's try something complex. We will use the same query that we used earlier in Athena to query our S3 Table. This query analyzes hard drive reliability by model for Q2 2024, calculating failure rates and assigning reliability grades for drive models with significant sample sizes (>1000 drives).
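I'm not reproducing the exact Athena query here, but a representative version of that analysis, assuming the standard Backblaze columns (serial_number, model, failure) plus the year and quarter partitions added during migration, and with illustrative grade thresholds, looks like this:
con.sql("""
    SELECT
        model,
        COUNT(DISTINCT serial_number) AS drive_count,
        SUM(failure) AS failures,
        ROUND(100.0 * SUM(failure) / COUNT(*), 4) AS failure_rate_pct,
        CASE
            WHEN 100.0 * SUM(failure) / COUNT(*) < 0.01 THEN 'A'
            WHEN 100.0 * SUM(failure) / COUNT(*) < 0.05 THEN 'B'
            ELSE 'C'
        END AS reliability_grade
    FROM hdd_r2_catalog.default.hdd_iceberg_r2
    WHERE year = 2024 AND quarter = 'Q2'
    GROUP BY model
    HAVING COUNT(DISTINCT serial_number) > 1000
    ORDER BY failure_rate_pct
""").show()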

DuckDB's integration with R2 Data Catalog was smooth. No performance overhead from the REST catalog interface, and all Iceberg features (time travel, schema evolution) worked as expected.
This confirms that the migrated data maintains full Iceberg compatibility while benefiting from Cloudflare's global distribution network. The 256MB file sizes created during migration proved optimal for both DuckDB's columnar processing and R2's edge performance.
What did I learn?
This cross-cloud migration experiment taught me three practical things about moving Iceberg data to R2 Data Catalog:
- REST catalog integration works smoothly. Once configured correctly, DuckDB and Spark treat R2 Data Catalog exactly like any other Iceberg catalog. The OAuth2 authentication setup was straightforward, and query performance matched expectations.
- Year-by-year processing is essential for cross-cloud migrations. The batch approach prevented memory issues, controlled costs, and provided recovery points when network hiccups occurred. Processing 584 million rows took 1 hour 48 minutes on a single node, which felt reasonable for moving data between AWS and Cloudflare.
- JAR dependency management matters more than expected. Missing or incompatible JARs caused authentication failures that only surfaced during actual data processing.
The cross-cloud migration proved that vendor lock-in is more about operational convenience than technical impossibility. Sometimes, the biggest barrier to data portability is just the effort required to figure out all the configuration details. The best part about experimenting with multiple cloud providers? You realize they're all just different flavors of the same complexity.
Each experiment answers one question and raises three more. Some experiments work brilliantly, some fail spectacularly, but all teach me something I didn't know yesterday. That is exactly how I like it.
(My cloud bill occasionally disagrees.)
DARE Continues…