In today's data-driven world, organizations face the challenge of building infrastructure that can handle both real-time streaming analytics and traditional batch processing while maintaining cost efficiency and operational simplicity.

At Fresha, we tackled this challenge by architecting a sophisticated Data Lakehouse platform on AWS, built on EKS and cutting-edge technologies including Apache Paimon, StarRocks, and Apache Flink.

This article explores how we designed a multi-account AWS architecture with dedicated data engineering infrastructure, cross-account VPC peering, and advanced Kubernetes orchestration to create a unified, scalable data platform that serves as the backbone for our entire analytics ecosystem.

Data Lakehouse: The Best of Both Worlds

Imagine you're building a data platform and you're torn between two approaches: storing everything cheaply in a data lake (like dumping files in S3) or organizing everything perfectly in a data warehouse (like a well-structured database). What if you didn't have to choose?

A Data Lakehouse solves this dilemma by combining the cost-effective storage of data lakes with the query performance and ACID transactions of data warehouses. You get to store petabytes of raw, structured or unstructured data while still running lightning-fast SQL queries and real-time analytics on the same platform.

For more insights, check out our article How We Accidentally Became One of the UK's First StarRocks Production Pioneers.

Architecture: Core Components

[Diagram: Data Lakehouse Components]

PostgreSQL Database receives all transactional data from our applications in real time.

Kafka Connect with CDC streams every database change from our PostgreSQL databases to Amazon MSK in Avro format. Schema Registry ensures data consistency by managing Avro schemas across the entire pipeline.
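To make the CDC stream a bit more concrete, here is a minimal Python sketch of a consumer reading those Avro-encoded change events from MSK with the help of Schema Registry (using the confluent-kafka client). The broker address, Schema Registry URL, and topic name are placeholders, not our actual configuration.

```python
# Illustrative consumer for a CDC topic on MSK; broker, Schema Registry URL,
# and topic name below are placeholders, not our real configuration.
from confluent_kafka import DeserializingConsumer
from confluent_kafka.schema_registry import SchemaRegistryClient
from confluent_kafka.schema_registry.avro import AvroDeserializer

schema_registry = SchemaRegistryClient({"url": "https://schema-registry.internal:8081"})
avro_deserializer = AvroDeserializer(schema_registry)  # resolves the writer schema per message

consumer = DeserializingConsumer({
    "bootstrap.servers": "msk-broker-1:9092",  # placeholder MSK bootstrap broker
    "group.id": "cdc-debug-reader",
    "auto.offset.reset": "earliest",
    "value.deserializer": avro_deserializer,
})
consumer.subscribe(["postgres.public.bookings"])  # hypothetical CDC topic

while True:
    msg = consumer.poll(1.0)
    if msg is None or msg.error():
        continue
    change_event = msg.value()  # decoded change event as a dict
    print(change_event)
```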

Apache Flink processes streaming data in real time, applying transformations and business logic.
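As an illustration of the kind of job this involves, here is a minimal PyFlink sketch that reads a hypothetical CDC topic and produces a per-minute rollup. The table schemas, topic names, and endpoints are invented for the example, and it assumes the Flink Kafka and Avro connectors are available on the classpath.

```python
# Minimal PyFlink sketch of a streaming transformation; all table definitions,
# topics, and endpoints are placeholders invented for the example.
from pyflink.table import EnvironmentSettings, TableEnvironment

t_env = TableEnvironment.create(EnvironmentSettings.in_streaming_mode())

# Source: CDC events arriving on Kafka as Avro (resolved via Schema Registry).
t_env.execute_sql("""
    CREATE TABLE bookings (
        booking_id BIGINT,
        amount     DECIMAL(10, 2),
        created_at TIMESTAMP(3),
        WATERMARK FOR created_at AS created_at - INTERVAL '5' SECOND
    ) WITH (
        'connector' = 'kafka',
        'topic' = 'postgres.public.bookings',
        'properties.bootstrap.servers' = 'msk-broker-1:9092',
        'format' = 'avro-confluent',
        'avro-confluent.url' = 'https://schema-registry.internal:8081'
    )
""")

# Sink: an aggregated stream (written back to Kafka here for simplicity).
t_env.execute_sql("""
    CREATE TABLE bookings_per_minute (
        window_start TIMESTAMP(3),
        bookings     BIGINT
    ) WITH (
        'connector' = 'kafka',
        'topic' = 'bookings_per_minute',
        'properties.bootstrap.servers' = 'msk-broker-1:9092',
        'format' = 'json'
    )
""")

# The "business logic": a simple per-minute rollup of incoming bookings.
t_env.execute_sql("""
    INSERT INTO bookings_per_minute
    SELECT TUMBLE_START(created_at, INTERVAL '1' MINUTE), COUNT(*)
    FROM bookings
    GROUP BY TUMBLE(created_at, INTERVAL '1' MINUTE)
""").wait()
```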

StarRocks serves as one of our analytical databases, optimized for fast OLAP queries across terabytes of data.

The Infrastructure That Powers Our Data Lakehouse

Building a Data Lakehouse isn't just about choosing the right tools — it's about architecting a platform that can scale, secure, and serve data at enterprise levels. This is the story of how we containerized our entire data platform on EKS, implemented cross-account secret management, and created a unified infrastructure that handles both real-time streaming and batch processing workloads.

EKS Cluster: data-engineering

We deployed our entire Data Lakehouse platform on a dedicated EKS cluster. This isn't just any Kubernetes cluster — it's specifically tuned for data workloads with custom node groups, storage classes, and resource limits optimized for our streaming and analytical components.

Why a separate EKS cluster? Data engineering workloads (Flink, Spark, and StarRocks) are fundamentally different from our application workloads. They need higher compute resources, specialized storage classes, and extended access for experimentation. By isolating these components, we can scale independently, apply different monitoring strategies, and maintain clear cost attribution.

External Secrets: Cross-Account Secret Management

Here's where things get interesting. Our data platform needs access to secrets from multiple AWS accounts, but we don't want to duplicate credentials or create security vulnerabilities.

We achieved this by configuring External Secrets resources in our EKS cluster to store:

  1. Database connection strings and configuration parameters from our production account
  2. Credentials for Kafka authentication
  3. Internal configuration and service-specific parameters

The magic happens through cross-account IAM roles. Our EKS cluster can assume roles in the production account to fetch secrets while maintaining the principle of least privilege, and our applications consume those secrets transparently.
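Stripped of the operator machinery, the underlying flow is a plain STS role assumption followed by a Secrets Manager read. The boto3 sketch below shows that flow with a hypothetical role ARN, region, and secret name.

```python
# Sketch of the cross-account flow: assume a role in the production account,
# then read a secret with the temporary credentials. Role ARN, region, and
# secret name are hypothetical placeholders.
import boto3

sts = boto3.client("sts")
assumed = sts.assume_role(
    RoleArn="arn:aws:iam::111122223333:role/data-platform-secrets-reader",
    RoleSessionName="data-lakehouse-external-secrets",
)
creds = assumed["Credentials"]

secrets = boto3.client(
    "secretsmanager",
    region_name="eu-west-1",
    aws_access_key_id=creds["AccessKeyId"],
    aws_secret_access_key=creds["SecretAccessKey"],
    aws_session_token=creds["SessionToken"],
)

secret = secrets.get_secret_value(SecretId="prod/postgres/connection-string")
print(secret["SecretString"])  # surfaced to workloads as env vars or mounted files
```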

This approach eliminates the need to manually manage secrets across accounts while maintaining the security boundaries that enterprise environments demand.

ALB Ingress: Unified Access to Our Data Platform

Instead of creating separate load balancers for each service, we designed a unified approach using a single Application Load Balancer called data-lakehouse-alb. This ALB acts as the single entry point for all our data platform services, using host-based routing to direct traffic to the appropriate backend services.

Why a single ALB? Cost optimization and operational simplicity. Managing one load balancer with multiple listener rules is far more efficient than maintaining separate ALBs for each service. Plus, it gives us a single point of control for SSL termination, access logging, and security policies.

[Diagram: Traffic Routing]

The data-lakehouse-alb is configured as an internal load balancer, meaning it's only accessible from within our VPC. This isn't a limitation — it's a security feature. Our data platform contains sensitive business data, so we don't want it exposed to the public internet.

Access is controlled through two channels:

  • VPN Users (Data Engineers) connect via our Client VPN to access management UIs
  • Fresha Applications connect via VPC peering to query data from StarRocks

StarRocks gets special treatment with its own Network Load Balancer (NLB) because it needs to handle high-throughput database connections from our consumer applications. These applications connect to StarRocks using the MySQL protocol, requiring TCP-level load balancing for optimal performance.
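Because StarRocks speaks the MySQL wire protocol, consumer applications need nothing more exotic than a standard MySQL client. Here is a minimal sketch using PyMySQL, with a hypothetical NLB hostname, credentials, database, and table.

```python
# Connect to StarRocks through the NLB using the MySQL protocol.
# Hostname, credentials, database, and table below are placeholders.
import pymysql

conn = pymysql.connect(
    host="starrocks-nlb.internal.example",  # internal NLB in front of the StarRocks frontends
    port=9030,                              # default StarRocks FE MySQL port
    user="analytics_reader",
    password="********",
    database="analytics",
)

with conn.cursor() as cur:
    cur.execute(
        """
        SELECT booking_date, COUNT(*) AS bookings
        FROM bookings
        GROUP BY booking_date
        ORDER BY booking_date DESC
        LIMIT 7
        """
    )
    for booking_date, bookings in cur.fetchall():
        print(booking_date, bookings)

conn.close()
```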

Infrastructure as Code

Terraform manages all AWS infrastructure — EKS clusters, VPC peering, ALBs, and IAM roles — with full version control and repeatability.

ArgoCD handles GitOps deployments, automatically syncing Kubernetes manifests from Git repositories to our EKS cluster.

Custom StarRocks Terraform Provider manages database schemas, RBAC policies, and user permissions directly from Terraform.

Result: Complete infrastructure and application lifecycle managed through code, versioned in Git, and deployed automatically via CI/CD pipelines.

Lessons Learned: Building a Modern Data Platform

Building this Data Lakehouse platform taught us that modern data infrastructure isn't just about choosing the right tools — it's about creating a cohesive ecosystem that can evolve with your business needs.

Key Takeaways:

Challenge Existing Patterns: We didn't just containerize existing workloads; we fundamentally reimagined how data platforms should be architected for the AI/ML era, breaking away from traditional data warehouse approaches.

Cross-Account Security is an Art: Designing seamless cross-account networking and IAM access patterns while maintaining enterprise-grade security required us to think beyond standard AWS patterns and create custom solutions that most organizations shy away from.

Multi-Account Strategy Enables Innovation: By isolating data engineering workloads, we created a sandbox for rapid experimentation with cutting-edge technologies like Apache Iceberg, StarRocks, and real-time stream processing, something impossible in a shared production environment.

The Result:

A robust, enterprise-grade data platform that demonstrates the power of strategic architectural thinking. By establishing a dedicated AWS account with meticulously designed cross-account networking and IAM access patterns, we've created a secure, scalable foundation that not only meets today's requirements but positions us perfectly for future growth and innovation.

Acknowledgments

This Data Lakehouse platform wouldn't have been possible without the incredible collaboration between our Platform Engineering and Data Engineering teams. Their shared vision, technical expertise, and relentless pursuit of excellence transformed a complex architectural challenge into a robust, scalable solution that serves as the backbone of our data infrastructure.

Special recognition goes to the team members who pushed the boundaries of what's possible with modern cloud infrastructure and showed us that the best solutions often come from thinking beyond conventional approaches.

[Image: Dreaming about data :)]