At my current company, one of Australia's leading beverage companies, data is at the heart of nearly everything we do. In fact, about 80% of all the data we handle comes directly from SAP systems. These SAP sources aren't limited to just one system; we're talking about multiple interconnected subsystems like SAP S4/HANA for core transactions, Cloud for Customer (C4C) for CRM and sales data, Integrated Business Planning (IBP) for forecasting and supply chain, SAP Concur for travel and expense management and more. Each subsystem generates critical insights, but they also present unique integration challenges.
Here's the story of how we built a scalable, AWS-based Data & Analytics (D&A) platform from scratch, the architectural decisions we made, and how it all comes together behind the scenes.
Why is a unified data platform so important for us?
A few years back, our analytics setup relied primarily on a business warehouse hosted within the SAP environment, which was excellent for structured, SAP-only data but struggled significantly when faced with broader integration requirements. The limitations became clear as we started receiving increasing volumes of data from non-SAP sources:
- Third-party vendor market research data,
- Static files via SharePoint,
- And more
We needed a solution flexible enough to seamlessly handle structured, semi-structured, and unstructured data alike, whether sourced from SAP or elsewhere.

Designing the right architecture
Our initial considerations revolved around two central challenges:
#1 - How could we reliably ingest data from diverse sources into a centralized repository?
As I mentioned from the get-go, our team deals with SAP and non-SAP data alike, and every source comes with its own ingestion mechanism.
For instance, within the SAP landscape alone, we pull transactional data from SAP S4/HANA and CRM data from SAP Cloud for Customer (C4C) using OData services, SAP's standard API layer for exposing data securely to other systems. But that's just part of the story!
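To make that concrete, here's a minimal sketch of what a single OData pull looks like. The service path, entity set, and credentials below are placeholders for illustration; our real extractions are paginated, incremental, and authenticate through stored secrets rather than hard-coded values.

```python
import requests

# Hypothetical S4/HANA OData v2 endpoint and entity set, for illustration only.
BASE_URL = "https://s4hana.example.com/sap/opu/odata/sap/API_BUSINESS_PARTNER"

resp = requests.get(
    f"{BASE_URL}/A_BusinessPartner",
    params={"$format": "json", "$top": 1000},
    auth=("SERVICE_USER", "SERVICE_PASSWORD"),  # placeholder credentials
    timeout=120,
)
resp.raise_for_status()

# SAP's OData v2 services wrap the payload under "d" -> "results".
rows = resp.json()["d"]["results"]
print(f"Fetched {len(rows)} records from the entity set")
```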
SAP Concur, another key SAP subsystem we rely on for travel and expense management data, doesn't send data via API at all. Instead, we pull encrypted files directly from an SFTP server.
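That pull looks less like an API call and more like plain file transfer. Here's a rough sketch with paramiko, assuming key-based authentication and placeholder host, user, and path names (decryption happens in a later step):

```python
import paramiko

# Placeholder host, user, key path, and remote directory for illustration.
key = paramiko.RSAKey.from_private_key_file("/secrets/concur_sftp_key.pem")
transport = paramiko.Transport(("sftp.example.com", 22))
transport.connect(username="concur_extract", pkey=key)
sftp = paramiko.SFTPClient.from_transport(transport)

# Download every encrypted extract; we push them to S3 afterwards.
for name in sftp.listdir("/outbound"):
    if name.endswith(".pgp"):
        sftp.get(f"/outbound/{name}", f"/tmp/{name}")

sftp.close()
transport.close()
```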
On top of that, external vendors push their market data directly to our platform. Other teams might share data with us through static files uploaded to SharePoint, while yet others store their data in simple CSV files hosted on third-party SFTP servers or sent to us via email.
Now, if the variety in ingestion mechanisms wasn't challenging enough, the types of files we receive are just as diverse. We handle everything from strictly structured data (csv, xlsx) to semi-structured data (think json or xml).
One project I'm personally working on receives data in an encrypted format called .jsonl.pgp, which, in layman's terms, is a JSON Lines file encrypted with Pretty Good Privacy (PGP).
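For the curious, here's a rough sketch of how such a file can be decrypted and parsed with python-gnupg. The file paths and passphrase are placeholders, and it assumes the private key has already been imported into the GPG home directory:

```python
import json
import gnupg

# The GPG home directory already contains our imported private key (placeholder path).
gpg = gnupg.GPG(gnupghome="/tmp/gnupg")

with open("/tmp/vendor_feed.jsonl.pgp", "rb") as encrypted:
    result = gpg.decrypt_file(
        encrypted,
        passphrase="***",                  # placeholder passphrase
        output="/tmp/vendor_feed.jsonl",
    )

if not result.ok:
    raise RuntimeError(f"Decryption failed: {result.status}")

# Each line of the decrypted file is one standalone JSON document (JSON Lines).
with open("/tmp/vendor_feed.jsonl") as jsonl:
    records = [json.loads(line) for line in jsonl if line.strip()]
```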
Each unique scenario forced us to thoughtfully design our ingestion workflows. Whether we were pushing or pulling data, structured or unstructured, encrypted or plain text, we had to ensure every ingestion pipeline reliably moved data into a central Data Lake.
#2 - Once we landed data into AWS, how would we transform and prepare it efficiently for business users?
Now, once we had successfully landed all this data in the Data Lake, another big question remained: How do we turn this raw, often messy data into something genuinely useful and understandable for the business?
Let's take SAP S4/HANA data as an example. If you've ever peeked at raw SAP tables, you probably know exactly what I mean. Technical field names can feel cryptic and confusing at best. Instead of friendly terms like "Customer ID" or "Material ID," SAP tends to spit out table headers like kunnr for customer ID or matnr for material number.
While these abbreviations might be perfectly logical for SAP consultants or anyone fluent in German (because yes, they're mostly acronyms based on German terminology), they can leave business users scratching their heads. I mean, no offense to our German friends, but MATNR doesn't exactly scream "material number" if you're not familiar with the term, right?
So part of our data processing challenge involved transforming these obscure, technical column names into clear, business-friendly language. It wasn't enough just to land the data safely in the Data Lake; we had to actively interpret and reshape it.
AWS emerged as our top choice precisely because it addressed these critical pain points. However, creating an effective architecture wasn't as simple as plugging in a few AWS services. To handle these challenges, our platform architecture was divided into clear stages: Data Ingestion, Data Processing/Loading, and Data Warehousing.
And the story begins from here.

Data Ingestion: Gathering Data from Everywhere
The raw-data pipeline is organised around three independent state machines: Ingestion, Processing, and Loading. Each state machine runs on AWS Step Functions, which gives us deterministic task ordering, built-in retries, and centralised error handling while keeping the JSON definition of every workflow in version control.

Each ingestion workflow is defined as a state machine that spells out the exact sequence of tasks, the success criteria for each task, and the error-handling branch to follow if any step fails.
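To give a feel for what such a definition looks like, here is a trimmed sketch of an ingestion state machine in Amazon States Language, written as a Python dict. The state names, job names, and ARNs are illustrative placeholders, not our production definition:

```python
import json

ingestion_definition = {
    "Comment": "Simplified ingestion workflow (illustrative names and ARNs)",
    "StartAt": "CollectMetadata",
    "States": {
        "CollectMetadata": {
            "Type": "Task",
            "Resource": "arn:aws:lambda:ap-southeast-2:123456789012:function:collect-metadata",
            "Next": "RunGlueExtract",
        },
        "RunGlueExtract": {
            "Type": "Task",
            # Synchronous Glue integration: the state waits for the job to finish.
            "Resource": "arn:aws:states:::glue:startJobRun.sync",
            "Parameters": {"JobName": "sap-odata-extract"},
            "Retry": [{"ErrorEquals": ["States.ALL"], "IntervalSeconds": 60, "MaxAttempts": 2}],
            "Catch": [{"ErrorEquals": ["States.ALL"], "Next": "NotifyFailure"}],
            "Next": "ValidateResult",
        },
        "ValidateResult": {
            "Type": "Task",
            "Resource": "arn:aws:lambda:ap-southeast-2:123456789012:function:validate-ingestion",
            "Next": "Done",
        },
        "Done": {"Type": "Succeed"},
        "NotifyFailure": {"Type": "Fail", "Cause": "Ingestion failed"},
    },
}

print(json.dumps(ingestion_definition, indent=2))  # the JSON we keep in version control
```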

When a new load is triggered, the state machine starts by invoking an AWS Lambda function that collects connection metadata: endpoint URLs, authentication tokens, and the specific objects or tables requested for that run. The Lambda function captures the run-time metadata and passes a reference to the next state.
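As a rough sketch (the event shape, parameter names, and return payload are assumptions for illustration), that first Lambda might look like this:

```python
import boto3

ssm = boto3.client("ssm")

def lambda_handler(event, context):
    """Collect connection metadata for the requested source and hand it to the next state."""
    source = event["source_system"]        # e.g. "s4hana" or "c4c" (illustrative)
    objects = event.get("objects", [])     # tables / OData entities requested for this run

    # Endpoints and credential references live in Parameter Store / Secrets Manager,
    # so the Glue job receives pointers rather than raw secrets.
    endpoint = ssm.get_parameter(Name=f"/dna/{source}/endpoint")["Parameter"]["Value"]

    return {
        "run_id": event["run_id"],
        "source_system": source,
        "endpoint": endpoint,
        "objects": objects,
        "secret_ref": f"/dna/{source}/credentials",
    }
```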

The next task in the state machine launches an AWS Glue job. Glue connects to the source system using the metadata provided, extracts the data, and writes the raw payload to an S3 prefix named ingested. All payloads are stored in JSON to preserve schema details exactly as received.
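In Glue terms, the extract-and-land step boils down to something like the sketch below. Bucket, entity, and argument names are placeholders, and the real job additionally handles pagination, authentication, and incremental loads:

```python
import sys
import requests
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext

# Arguments passed in by the state machine (names are illustrative).
args = getResolvedOptions(sys.argv, ["JOB_NAME", "endpoint", "entity", "bucket", "run_id"])

glue_context = GlueContext(SparkContext.getOrCreate())
spark = glue_context.spark_session

# Pull one entity set from the source system.
resp = requests.get(f"{args['endpoint']}/{args['entity']}", params={"$format": "json"}, timeout=300)
resp.raise_for_status()
records = resp.json()["d"]["results"]

# Land the raw payload as JSON under the ingested/ prefix, exactly as received.
df = spark.createDataFrame(records)
df.write.mode("append").json(
    f"s3://{args['bucket']}/ingested/{args['entity']}/run_id={args['run_id']}/"
)
```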
After Glue finishes, Step Functions routes execution to another Lambda function that validates the job result.

The validator checks row counts, file size, and schema conformance, then updates an execution log in DynamoDB.
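A trimmed version of that validator, with a hypothetical DynamoDB table name and simplified checks, could look like this:

```python
import boto3

s3 = boto3.client("s3")
log_table = boto3.resource("dynamodb").Table("pipeline_execution_log")  # hypothetical table

def lambda_handler(event, context):
    """Validate the Glue output for this run and record the outcome."""
    listing = s3.list_objects_v2(Bucket=event["bucket"], Prefix=event["ingested_prefix"])
    objects = listing.get("Contents", [])
    total_bytes = sum(obj["Size"] for obj in objects)

    # Real checks also compare row counts against the source and verify schema conformance.
    passed = len(objects) > 0 and total_bytes > 0

    log_table.put_item(Item={
        "run_id": event["run_id"],
        "stage": "ingestion",
        "object_count": len(objects),
        "total_bytes": total_bytes,
        "status": "SUCCEEDED" if passed else "FAILED",
    })
    return {**event, "validation_passed": passed}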

If validation fails, or the Glue job errors or times out, the state machine switches to a failure branch that publishes an alert through Amazon SNS. The alert contains the run ID and error details so support engineers can diagnose the issue quickly.

If validation succeeds, Step Functions marks the ingestion workflow as complete and emits a success event on Amazon EventBridge. Downstream processing pipelines subscribe to this event and begin the processing stage.
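Whether these branches call a small Lambda or use Step Functions' native SNS and EventBridge integrations, the two outcomes boil down to something like this sketch (the topic ARN, event source, and detail type are placeholders):

```python
import json
import boto3

sns = boto3.client("sns")
events = boto3.client("events")

def notify(run_id: str, succeeded: bool, error: str = "") -> None:
    """Publish a failure alert or a success event for downstream pipelines."""
    if not succeeded:
        sns.publish(
            TopicArn="arn:aws:sns:ap-southeast-2:123456789012:dna-pipeline-alerts",
            Subject=f"Ingestion failed: {run_id}",
            Message=json.dumps({"run_id": run_id, "error": error}),
        )
        return

    events.put_events(Entries=[{
        "Source": "dna.ingestion",
        "DetailType": "IngestionCompleted",
        "Detail": json.dumps({"run_id": run_id}),
    }])
```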
Processing: same state machine, new purpose
The Processing pipeline re-uses the exact Step Functions template we built for Ingestion — metadata Lambda, Glue task, completion Lambda, then a split to SNS (failure) or EventBridge (success). Nothing changes in the orchestration layer; the distinction lives entirely inside the Glue script.

The script does three things (a simplified sketch follows the list):
- First, it scans every JSON object we just landed under the ingested/ prefix.
- Second, it appends a single column, ingested_time, stamped at UTC so we can trace every record back to a specific run without touching the source system.
- Third, it writes the result to the formatted/ prefix as compressed Parquet, preserving column types and order while cutting query latency on Redshift by more than half.
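Under stated assumptions (placeholder bucket and entity names, Snappy compression), the core of that Glue script is roughly:

```python
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from pyspark.sql import functions as F

glue_context = GlueContext(SparkContext.getOrCreate())
spark = glue_context.spark_session

BUCKET = "dna-data-lake"        # placeholder bucket
ENTITY = "A_BusinessPartner"    # placeholder entity

# 1. Read everything the ingestion stage landed as JSON.
raw = spark.read.json(f"s3://{BUCKET}/ingested/{ENTITY}/")

# 2. Stamp each record with the processing timestamp (stored as UTC) for lineage.
stamped = raw.withColumn("ingested_time", F.current_timestamp())

# 3. Rewrite as compressed Parquet under formatted/, keeping the schema intact.
(stamped.write
    .mode("overwrite")
    .option("compression", "snappy")
    .parquet(f"s3://{BUCKET}/formatted/{ENTITY}/"))
```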
All retry policies, timeouts, and alert routes inherit from the shared state-machine definition, which keeps operational behaviour consistent across every stage of the raw-data pipeline.
Loading: finishing the trip in Redshift
The third state machine, Loading, mirrors the same Step Functions skeleton we used for Ingestion and Processing: an initial Lambda for run-time metadata, a single Glue task, a completion Lambda, then a fan-out to SNS or EventBridge. Again, orchestration is unchanged; only the Glue logic shifts.

The loader starts by picking up the Parquet files we just wrote to the formatted/ prefix. The cleaned dataset is streamed directly into a Redshift staging table, staging.<table_name>, using Amazon Redshift's native COPY command behind the scenes. All temporary operations land in an S3 scratch directory, so we avoid cluttering the warehouse with intermediate files.
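Behind the scenes, the COPY itself is about as simple as it gets. Here's a sketch using the Redshift Data API, with placeholder cluster, role, and table names:

```python
import boto3

redshift_data = boto3.client("redshift-data")

# Placeholder identifiers; the real job derives the table name from the source entity.
copy_sql = """
    COPY staging.business_partner
    FROM 's3://dna-data-lake/formatted/A_BusinessPartner/'
    IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-copy-role'
    FORMAT AS PARQUET;
"""

redshift_data.execute_statement(
    ClusterIdentifier="dna-redshift",
    Database="analytics",
    DbUser="etl_user",
    Sql=copy_sql,
)
```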
Operational behaviour stays consistent with the earlier pillars. Any error triggers SNS with a run ID and the stack trace so on-call can jump straight to the failing task.
With this final load step in place, the raw-data pipeline closes its loop: JSON lands in ingested/, gets normalised to Parquet in formatted/, and arrives in Redshift staging ready for type casts, SCD logic, and business modelling. End-to-end latency for a typical source drop is now measured in minutes, not hours, and alignment across the three state machines lets us operate the stack with one shared runbook.
Lessons Learned from Building the Raw-Data Pipeline
1. One state-machine per stage keeps troubleshooting narrow
Splitting the flow into three independent Step Functions state machines (Ingestion, Processing, Loading) gave us clean blast-radius boundaries. If Loading fails, we re-run only that state machine; Ingestion and Processing stay green. Sainsbury's Data & Analytics adopted the same approach after discovering that chaining ten Lambdas in one giant workflow made it "very difficult to find single points of failure" and polluted their S3 buckets with half-finished files.
2. Treat metadata capture as a first-class task
Our first Lambda in every state machine writes a snapshot of connection details, source object names, and run IDs to S3. That decision paid off during audits: we can reproduce any historical load without digging through CloudWatch logs. AWS prescriptive guidance flags the same pattern as a prerequisite for reconstructing failed runs and for partitioning large data sets later.
3. Visual orchestration beats log digging
Step Functions' runtime graph immediately shows which state failed (green vs red). That visibility slashed mean-time-to-diagnose during the first month of go-live. Sainsbury's team called the visual trace "the lovely graphic" that pinpoints failing states faster than combing through Lambda logs.
What's Next
Part 2 will cover how we take the raw Redshift staging tables you've just read about and turn them into fully modelled, Slowly-Changing-Dimension-aware business layers — with dbt. Stay tuned!
Gratitude corner 💙
A huge thank-you to Harold and Jamie, if you guys are reading this :) hands-down the best managers a data nerd could hope for. Your support and mentorship made staying up late to finish this blog post on a Friday night feel more like a passion project.
I'm equally grateful to Gurpreet, Michael, Praveen, and Kazuma for answering every "Is this an AWS thing or am I just tired?" question on my zero-to-hero cloud journey. Couldn't have done it or kept my sanity without all of you.