A look inside Guidewire's scalable Data Integrity Service — built to detect data loss in real time, validate CDC events, and run lean with Lakehouse architecture.
Authors: Lokesh Bandi (Senior Software Engineer) & Raghav Jha (Software Engineer III)
What if a single missing database event could derail your quarterly report — or your compliance posture? In petabyte-scale data systems, even the smallest data gap can ripple through business logic, reporting pipelines, and regulatory checks.
That was the challenge we faced while building Guidewire's big data pipelines. Performance and cost constraints were real — but so was the risk of silent data loss. Our solution: the Data Integrity Service (DIS), a system purpose-built to detect and recover missing Change Data Capture (CDC) events across distributed systems.
DIS continuously monitors data flow from InsuranceSuite applications, such as ClaimCenter and PolicyCenter, into the Guidewire Data Platform. If any event goes missing, DIS flags it in near real time — before it can disrupt downstream systems.
What the Redesigned Architecture Achieved: 80% Lower Operating Costs
When we first deployed DIS, it ensured real-time data integrity, but it came at a cost. Running even a single pipeline was expensive, primarily due to managed services such as Kinesis Firehose and Redshift.
The Data Integrity Service — redesigned from the ground up on a Data Lakehouse-based architecture — reduces cost and increases flexibility.
We replaced costly managed services with custom-built components, optimized data formats, and introduced lakehouse-style storage for long-term analytics.
The result? An 80% reduction in operating costs, improved system scalability, and faster data validation. In the sections that follow, we'll show you exactly how we achieved those results — step by step.
So what powers this system beneath the surface — and how did we engineer such a leap in efficiency?
Tracking Data from Source to Stream in Guidewire Data Platform
To capture changes as they occur, Guidewire's ingestion pipeline begins with the Debezium connector, an open-source tool that captures change events from databases. Debezium creates a replication slot in PostgreSQL, which ensures that database changes are recorded and delivered in the correct sequence, so no updates are missed.
Each time a database record is added, updated, or deleted, Debezium generates a CDC event and publishes it to Apache Kafka, a distributed messaging system used to stream data between services, where it becomes available for downstream processing.
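For readers unfamiliar with Debezium, the sketch below shows roughly what registering a PostgreSQL connector with Kafka Connect looks like. The connector name, hostnames, credentials, slot name, topic prefix, and table list are all illustrative placeholders, not our production configuration.

```python
# Hypothetical Debezium connector registration; every value below is a placeholder.
import requests

connector = {
    "name": "claimcenter-cdc",  # illustrative connector name
    "config": {
        "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
        "database.hostname": "claimcenter-db.internal",
        "database.port": "5432",
        "database.user": "cdc_reader",
        "database.password": "${secret}",          # resolved from a secrets manager in practice
        "database.dbname": "claimcenter",
        "plugin.name": "pgoutput",                 # PostgreSQL logical decoding plugin
        "slot.name": "debezium_claimcenter",       # the replication slot Debezium creates
        "topic.prefix": "cdp.claimcenter",         # Kafka topics become <prefix>.<schema>.<table>
        "table.include.list": "public.cc_claim,public.cc_policy",
    },
}

# Register the connector through the Kafka Connect REST API.
resp = requests.post("http://kafka-connect:8083/connectors", json=connector, timeout=30)
resp.raise_for_status()
```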
DIS monitors the pipeline and ensures that no data is silently dropped, automatically recovering gaps where needed.
But how does DIS actually track what has been sent and what is missing? That's where its architecture and validation flow come into play.

How the Data Integrity Service Tracks and Validates Data
In its original architecture, the Data Integrity Service used two independent checkpoints and a precise system of identifiers to ensure every record was accounted for from source to stream.
When something changes in the PostgreSQL database, it's written to a special log called the Write-Ahead Log (WAL). This log acts like a black box recorder, capturing every change in sequence. Each entry gets a unique Log Sequence Number (LSN), a sequential bookmark that tells us exactly where in the log a change was recorded.
DIS creates its own replication slot — a kind of "save point" that tracks which changes it has already seen. This works in parallel with Debezium, allowing DIS to independently verify that the data pipeline is complete and functioning correctly.
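To make the WAL, LSN, and replication-slot mechanics concrete, here is a minimal sketch using psycopg2. The connection string, slot name, and the test_decoding output plugin are assumptions for illustration; the real monitor consumes changes continuously rather than peeking at them ad hoc.

```python
# Minimal sketch of replication-slot mechanics; DSN, slot name, and plugin are illustrative.
import psycopg2

conn = psycopg2.connect("dbname=claimcenter user=dis_monitor host=localhost")
conn.autocommit = True

with conn.cursor() as cur:
    # Create an independent logical replication slot for DIS (runs once; errors if it already exists).
    cur.execute(
        "SELECT pg_create_logical_replication_slot(%s, %s)",
        ("dis_monitor", "test_decoding"),
    )

    # The current end of the WAL: the latest LSN the database has written.
    cur.execute("SELECT pg_current_wal_lsn()")
    print("current WAL position:", cur.fetchone()[0])

    # Peek at changes the slot has not yet confirmed, without consuming them.
    cur.execute(
        "SELECT lsn, xid, data FROM pg_logical_slot_peek_changes(%s, NULL, NULL)",
        ("dis_monitor",),
    )
    for lsn, xid, data in cur.fetchmany(10):
        print(lsn, xid, data)
```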
To do this, DIS uses two microservices:
- The DB Transaction Monitor reads changes directly from the database and sends summary data to Amazon Kinesis Data Firehose. This managed service reliably streams data to storage and analytics tools.
- The Kafka Message Processor reads the same data from Kafka and also sends metadata to Firehose.
The metadata is processed by a Lambda function (a lightweight compute function triggered automatically in AWS) and stored in:
- Amazon S3, a cloud storage service used for long-term retention
- Amazon Redshift, a cloud-based data warehouse for running analytics queries
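The transformation step follows Firehose's standard record-transformation contract: the Lambda receives base64-encoded records and returns them, transformed, with a per-record result. The sketch below shows that pattern only; the metadata fields it keeps (lsn, table_name, commit_ts) are illustrative, not our exact schema.

```python
# Minimal sketch of a Firehose record-transformation Lambda; field names are illustrative.
import base64
import json

def handler(event, context):
    output = []
    for record in event["records"]:
        payload = json.loads(base64.b64decode(record["data"]))

        # Keep only the fields the downstream gap comparison needs.
        slimmed = {
            "lsn": payload.get("lsn"),
            "table_name": payload.get("table"),
            "commit_ts": payload.get("ts"),
        }

        output.append({
            "recordId": record["recordId"],
            "result": "Ok",
            "data": base64.b64encode((json.dumps(slimmed) + "\n").encode()).decode(),
        })
    return {"records": output}
```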
Then, a scheduled DataGap job (a Lambda function that compares the metadata captured along the two paths) automatically checks for gaps: any LSN recorded on the source path but missing from the Kafka path indicates a lost record.
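Conceptually, the gap check is an anti-join between the two metadata sets. The sketch below runs such a query through Athena with boto3; the database, table, column, and bucket names are assumptions for illustration.

```python
# Conceptual sketch of the gap check as an Athena anti-join; all names are placeholders.
import time
import boto3

athena = boto3.client("athena")

GAP_QUERY = """
SELECT s.lsn, s.table_name
FROM dis.source_metadata s
LEFT JOIN dis.kafka_metadata k
  ON s.lsn = k.lsn AND s.table_name = k.table_name
WHERE k.lsn IS NULL                                          -- present at the source, missing from Kafka
  AND s.dt >= cast(current_date - interval '1' day as varchar)  -- only recent partitions
"""

qid = athena.start_query_execution(
    QueryString=GAP_QUERY,
    QueryExecutionContext={"Database": "dis"},
    ResultConfiguration={"OutputLocation": "s3://dis-athena-results/datagap/"},
)["QueryExecutionId"]

# Poll until the query finishes, then read back any missing LSNs (first row is the header).
while athena.get_query_execution(QueryExecutionId=qid)["QueryExecution"]["Status"]["State"] in ("QUEUED", "RUNNING"):
    time.sleep(2)
missing = athena.get_query_results(QueryExecutionId=qid)["ResultSet"]["Rows"][1:]
print(f"{len(missing)} potentially missing events")
```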
Finally, Amazon QuickSight, a dashboard and data visualization tool within the AWS ecosystem, creates visual dashboards that show whether records were lost, and where, so teams can act quickly if something's wrong.
Together, these components enable DIS to detect and report data loss automatically, before it impacts downstream systems or business performance.
While the system delivered on its promise of real-time data integrity, it came at a cost. As usage scaled, the price of maintaining high-frequency validation pipelines became increasingly hard to justify. This led us to investigate what was driving our infrastructure costs and how we could optimize them.
Performance Challenges and AWS Cost Pressures
The biggest challenge? Scaling introduced significant cost pressures.
As usage grew, the AWS managed services at the core of the pipeline, especially Amazon Kinesis Data Firehose and Amazon Redshift, handled essential tasks but ran into performance limits and rising costs that made the original DIS design unsustainable at scale.
To reduce costs, we closely examined the pipeline's spending and identified ways to make it more efficient, without compromising data integrity or real-time visibility.
Firehose Quotas: What We Hit, What It Cost
As described earlier, Amazon Kinesis Data Firehose is used to stream data reliably into destinations like S3 and Redshift. It can also trigger Lambda functions to transform records in real time before delivery. If a delivery fails, Firehose backs up the data to Amazon S3 by default.
But like any managed service, Firehose comes with quotas and limits that can impact performance, cost, and scale.
Firehose Pricing: Volume, Format, and Overhead
Kinesis Firehose charges separately for the data you send and for any file format conversions. Ingestion is billed by total data volume in gigabytes, with each record rounded up to a minimum of 5 KB, so a stream of small records can cost far more than its raw size suggests.
Limits on Throughput and Batch Size
- PutRecordBatch (an API operation used to send data records to Firehose in batches) supports a maximum of 500 records or 4 MiB per call, whichever comes first
- Maximum record size (before it's encoded for transmission): 1,000 KiB
- Write capacity: up to 1 MiB per second or 1,000 records per second
- Read capacity: up to 5 read transactions per second
Limits on Buffer Timing
- Buffering intervals can range from 60 to 900 seconds.
- Firehose can't transmit data more frequently than every 60 seconds, which can introduce latency for real-time use cases.
Throttling: When Firehose Slowed Us Down
In high-throughput environments, we experienced throttling (AWS temporarily slowing down the service) when Firehose's default limits were exceeded. In those cases, we had to request increases to AWS-imposed soft limits to maintain throughput and stability.
How We Optimized Firehose for Cost
To reduce AWS costs, we began by batching multiple JSON records into a single payload, maximizing the data sent in each transaction. This improved the efficiency of the PutRecordBatch operation and significantly lowered Firehose-related costs.
For more details, we published a separate deep dive: How Data Integrity Service Reduced AWS Firehose Costs by 7x.
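Because each Firehose record is billed with a 5 KB minimum, packing many small JSON records into fewer, larger records before calling PutRecordBatch is what cuts the billed volume. The sketch below shows that packing idea; the stream name and the 900 KB target size are illustrative assumptions.

```python
# Sketch of packing small JSON records into fewer, larger Firehose records.
# Stream name and the 900 KB target size are illustrative, not production settings.
import json
import boto3

firehose = boto3.client("firehose")
TARGET_RECORD_BYTES = 900 * 1024        # stay under the ~1,000 KiB per-record limit

def pack(records):
    """Greedily pack JSON-serializable dicts into newline-delimited blobs."""
    blob, blobs = b"", []
    for rec in records:
        line = (json.dumps(rec) + "\n").encode()
        if blob and len(blob) + len(line) > TARGET_RECORD_BYTES:
            blobs.append(blob)
            blob = b""
        blob += line
    if blob:
        blobs.append(blob)
    return blobs

def send(records, stream="dis-metadata-stream"):
    blobs = pack(records)
    # PutRecordBatch accepts up to 500 records or 4 MiB per call.
    for i in range(0, len(blobs), 500):
        firehose.put_record_batch(
            DeliveryStreamName=stream,
            Records=[{"Data": b} for b in blobs[i : i + 500]],
        )
```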
We also needed to store data in Parquet format on Amazon S3 to speed up Athena queries, particularly for the DataGap Detection Lambda job. However, enabling format conversion directly in Firehose would have added extra AWS charges. To avoid those fees, we implemented a custom S3 upload mechanism in the client-side logic of the ECS container (a Docker container managed by Amazon's Elastic Container Service), which writes optimized Parquet files directly and sidesteps Firehose's format conversion charges.
While these optimizations helped reduce costs, Firehose still imposed limits on performance and flexibility. That led us to consider a more customized approach — one that would give us complete control over how data was uploaded and stored.
Transitioning from Kinesis Firehose to a Custom S3 Upload Client
After optimizing our use of Kinesis Firehose, we decided to take it a step further: replacing it entirely with a custom Amazon S3 upload client. This transition gave us more flexibility, control, and long-term cost savings — without compromising performance.
Why We Switched to a Custom S3 Client
- Customization: Firehose works well for standard streaming use cases, but a custom client enables us to tailor the upload process to meet our exact needs, particularly in terms of buffering, formatting, and S3 integration.
- Cost Efficiency: By reducing reliance on managed services and controlling how data is batched and stored, we lowered both data transfer and storage costs.
- Greater Control: A custom solution provided us with more precise control over retries, error management, and file output — something not easily tunable with Firehose.
What the Custom Client Does Differently
We built the new upload client from the ground up to support more advanced control and cost-saving strategies. Key features, illustrated in the sketch after this list, include:
- Client-side buffering using a configurable threshold parameter. This allows incoming data to accumulate before upload, giving us control over file size and transmission frequency.
- Support for Parquet format with SNAPPY compression, resulting in files up to 20× smaller than JSON, drastically reducing S3 storage and Athena query costs.
- Custom file size limits: We can define the maximum file size for uploads, optimizing for downstream performance and S3 efficiency.
- Improved throughput in the redesigned architecture by embedding buffering directly into the client logic, removing delays introduced by external dependencies.
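Below is a minimal sketch of this buffering-and-upload pattern using pyarrow and boto3. The threshold, bucket name, key layout, and class name are assumptions for illustration, not the production implementation.

```python
# Sketch of a client-side buffer that flushes Parquet (SNAPPY) files to S3.
# Threshold, bucket, and key prefix are illustrative.
import io
import time
import uuid

import boto3
import pyarrow as pa
import pyarrow.parquet as pq

class S3ParquetUploader:
    def __init__(self, bucket="dis-metadata", prefix="kafka", flush_threshold=50_000):
        self.s3 = boto3.client("s3")
        self.bucket, self.prefix = bucket, prefix
        self.flush_threshold = flush_threshold    # records buffered before an upload
        self.buffer = []

    def add(self, record: dict):
        self.buffer.append(record)
        if len(self.buffer) >= self.flush_threshold:
            self.flush()

    def flush(self):
        if not self.buffer:
            return
        # Convert the buffered JSON-like dicts to a columnar table and compress with SNAPPY.
        table = pa.Table.from_pylist(self.buffer)
        sink = io.BytesIO()
        pq.write_table(table, sink, compression="snappy")
        # Hive-style date partition in the key keeps downstream queries prunable.
        key = f"{self.prefix}/dt={time.strftime('%Y-%m-%d')}/{uuid.uuid4()}.parquet"
        self.s3.put_object(Bucket=self.bucket, Key=key, Body=sink.getvalue())
        self.buffer.clear()
```

A byte-based threshold works the same way as the record count shown here; the key point is that the client, not Firehose, decides when a file is large enough to write.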

With a more efficient upload system in place, our next challenge was optimizing how we stored and queried that data, especially as Redshift costs began to rise.
From Redshift to Lakehouse: Reducing Cost and Improving Scale
Why Redshift Was Holding Us Back
In Redshift, improving query performance for long tables (those with many rows, not many columns) was difficult, especially when handling high-cardinality columns (those with a large number of unique values) or multi-tenant data. Redshift didn't support skipping previously processed rows, which led to inefficient scans. It also became costly at scale: processing large volumes meant scaling up the cluster, adding nodes and inflating costs.
To solve this, we adopted a lakehouse architecture, storing incoming data in partitioned Hive tables (a format that organizes data into directories based on column values, allowing faster filtering). This lets queries skip unused data and significantly reduces S3 GET requests.
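For illustration, the sketch below shows what a Hive-partitioned external table and a pruned query might look like in Athena; the database, table, partition columns, and bucket are assumptions.

```python
# Illustrative Hive-style table definition and a partition-pruned Athena query.
# Database, table, columns, and S3 location are assumptions, not production names.

CREATE_TABLE = """
CREATE EXTERNAL TABLE IF NOT EXISTS dis.kafka_metadata (
    lsn        string,
    table_name string,
    commit_ts  timestamp
)
PARTITIONED BY (tenant string, dt string)   -- one S3 directory per tenant and day
STORED AS PARQUET
LOCATION 's3://dis-metadata/kafka/'
"""
# New partitions still need to be registered (ALTER TABLE ... ADD PARTITION or partition projection).

# Because the WHERE clause names both partition columns, Athena only lists and reads
# s3://dis-metadata/kafka/tenant=acme/dt=2024-06-01/ rather than the whole table.
PRUNED_QUERY = """
SELECT count(*) FROM dis.kafka_metadata
WHERE tenant = 'acme' AND dt = '2024-06-01'
"""
```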
What Is a Lakehouse?
A lakehouse combines the low-cost flexibility of data lakes with the performance and structure of data warehouses. It enables teams to store raw data in formats such as Parquet while still supporting structured queries, governance, and versioning.
We used Apache Iceberg, an open table format that provides ACID guarantees (atomicity, consistency, isolation, and durability) for Athena queries. By combining Amazon S3 for storage, Iceberg for table management, and Athena for querying, we built a scalable, modern analytics platform.
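As a hedged illustration of what that looks like in practice, the statements below create an Iceberg table through Athena and then run an ACID update against it; the table, columns, and S3 location are assumptions.

```python
# Illustrative Athena statements for an Iceberg table; names and location are assumptions.
CREATE_ICEBERG_TABLE = """
CREATE TABLE dis.gap_events (
    tenant      string,
    lsn         string,
    detected_at timestamp,
    resolved    boolean
)
PARTITIONED BY (tenant)
LOCATION 's3://dis-lakehouse/gap_events/'
TBLPROPERTIES ('table_type' = 'ICEBERG')
"""

# Unlike a plain Hive table, an Iceberg table accepts ACID DML through Athena.
MARK_RESOLVED = """
UPDATE dis.gap_events
SET resolved = true
WHERE tenant = 'acme' AND lsn = '0/16B3748'
"""
```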

Benefits of the Lakehouse Approach
Partitioning in Hive tables enables us to avoid scanning irrelevant data, which significantly improves query speed and reduces costs.
Instead of validating the entire dataset, we validate only a small rolling window: the partitions relevant to the most recent timeframe.
To track progress, we use Iceberg tables for Data Manipulation Language (DML) operations such as inserts and updates, and we store the last processed window time in a checkpoint table. This eliminates the need to reprocess already validated data, making the architecture leaner, faster, and less expensive to scale.
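A minimal sketch of that checkpoint pattern is shown below: read the last validated window end, validate only the partitions inside the new window, then advance the checkpoint with an Iceberg MERGE. Table names, columns, and the window placeholders are illustrative; the statements would run through Athena the same way as the gap query shown earlier.

```python
# Sketch of rolling-window validation with an Iceberg checkpoint table; all names are illustrative.

READ_CHECKPOINT = """
SELECT max(window_end_ts) FROM dis.validation_checkpoint WHERE pipeline = 'claimcenter'
"""

# Validate only the partitions inside the new window instead of the full history.
VALIDATE_WINDOW = """
SELECT s.lsn
FROM dis.source_metadata s
LEFT JOIN dis.kafka_metadata k ON s.lsn = k.lsn
WHERE k.lsn IS NULL
  AND s.dt = '{dt}'                                       -- partition filter keeps the scan small
  AND s.commit_ts > timestamp '{start}' AND s.commit_ts <= timestamp '{end}'
"""

# Advance the checkpoint so the next run skips everything already validated.
ADVANCE_CHECKPOINT = """
MERGE INTO dis.validation_checkpoint t
USING (SELECT 'claimcenter' AS pipeline, timestamp '{end}' AS window_end_ts) s
ON t.pipeline = s.pipeline
WHEN MATCHED THEN UPDATE SET window_end_ts = s.window_end_ts
WHEN NOT MATCHED THEN INSERT (pipeline, window_end_ts) VALUES (s.pipeline, s.window_end_ts)
"""
```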
Like any powerful system, the Lakehouse architecture introduced new challenges, particularly in metadata management.
Managing Iceberg Metadata Growth
Iceberg introduced some new challenges, especially with small or frequent inserts, which caused metadata in S3 to grow rapidly.
After VACUUM in Athena proved ineffective at controlling metadata bloat, we transitioned to using Iceberg cleanup procedures in Glue Spark jobs. This approach kept metadata growth manageable and significantly improved query performance.
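For readers curious what such a cleanup job can look like, here is a minimal sketch of Iceberg maintenance procedures invoked from a Glue Spark job. The catalog name, table names, retention period, and warehouse location are assumptions; in Glue these configs are usually supplied as job parameters rather than in code.

```python
# Sketch of Iceberg table maintenance from a Glue Spark job; names and retention are illustrative.
from datetime import datetime, timedelta
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .config("spark.sql.extensions", "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    .config("spark.sql.catalog.glue", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.glue.catalog-impl", "org.apache.iceberg.aws.glue.GlueCatalog")
    .config("spark.sql.catalog.glue.warehouse", "s3://dis-lakehouse/")
    .getOrCreate()
)

cutoff = (datetime.utcnow() - timedelta(days=7)).strftime("%Y-%m-%d %H:%M:%S")

for table in ["dis.gap_events", "dis.validation_checkpoint"]:
    # Drop snapshots older than the retention window, plus the data files only they reference.
    spark.sql(f"CALL glue.system.expire_snapshots(table => '{table}', older_than => TIMESTAMP '{cutoff}')")
    # Delete files in the table location that no snapshot references at all.
    spark.sql(f"CALL glue.system.remove_orphan_files(table => '{table}')")
    # Compact the many small data files produced by frequent inserts into fewer, larger ones.
    spark.sql(f"CALL glue.system.rewrite_data_files(table => '{table}')")
```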
Data Lakehouse Architecture: Streamlined, Scalable, and Cost-Efficient
The Data Integrity Service's redesigned architecture reduces complexity and cost while improving flexibility across the pipeline.

At the top layer, the Cloud Data Platform (CDP) ingestion pipeline now bypasses Kinesis Firehose entirely. Instead, buffered Parquet data is uploaded directly to Amazon S3 via client-side code running in an ECS container. In parallel, the Kafka Message Processor buffers Kafka data, converts it from JSON to Parquet, and uploads it to S3.
The analytics layer has also evolved. We've replaced Redshift with a Lakehouse architecture using:
- Apache Iceberg for ACID-compliant updates and deletes
- Hive for read-only operations
- Athena as the query engine
- AWS Glue Data Catalog to manage table metadata
A scheduled maintenance job removes old or unused metadata files from S3 to keep storage lean. Finally, Amazon QuickSight dashboards run on top of Athena queries to visualize insights from DataGap jobs.
How We Cut AWS Costs
With a redesigned architecture built on client-side uploads, efficient data formats, and a Lakehouse query stack, we now have tighter control over processing and storage — at a fraction of the cost.
Summary
In real-time data pipelines, integrity isn't optional — it's essential. But maintaining trust in your data while managing cost at scale demands thoughtful trade-offs, lean design, and the courage to challenge default patterns.
That's precisely what we did with the Data Lakehouse architecture — simplifying the architecture, streamlining ingestion, and rethinking how data is stored and validated to drive performance and efficiency.
The lesson? You don't need more infrastructure. You need smarter architecture.
If you're scaling a data platform today, the opportunity isn't just to make it faster — it's to make it better.
Further Reading
- Ensuring Data Integrity in Guidewire's Petabyte-Scale Big Data Pipelines: a technical deep dive by Architect Rajeeve Kuriakose on the foundations and evolution of the Data Integrity Service.
- Introducing Guidewire Data Platform: discover how Guidewire's cloud-native Data Platform was engineered for scale, performance, and extensibility.
If you are interested in joining our Engineering teams to develop innovative cloud-distributed systems and large-scale data platforms that enable a wide range of AI/ML SaaS applications, apply at Guidewire Careers.