The Entertaining Pitfalls of Replicating Data
Every day hundreds of millions of actions occur on our clients’ websites. New users sign up, browse pages, send emails, and place orders. These actions generate the raw data our clients need to piece together relevant and personalized campaigns. To make this happen we need to replicate our clients’ data into our platform.
This isn’t as easy as it looks.
We currently have two types of data integrations with our customers:
- Stream updates: We have a direct connection to their data sources and we receive fresh data throughout the day.
- Batch updates: Our client drops us their latest data at some regular interval.
The data we receive falls into three categories, each of which we handle differently:
- Snapshot tables: We overwrite the table on update. Snapshot tables are typically smaller and don’t have metadata indicating when a row was last created or updated.
- Versioned tables: Updates contain a mix of brand new rows and updates for existing rows. To handle this potpourri of data, we require that each table have a unique primary key and a version key that identifies the latest updated row. Using these keys we scan the data set and pick the most recent (by version key) row for each primary key.
- Append-only tables: Some tables describe events which are immutable. This means there is no need to maintain versions as each primary key corresponds to a single row. For those tables, we take the daily delta and append it to the bottom of the table.
For the most part it’s been a smooth and seamless process, but we’ve had our share of bumps and bruises.
Here are some of the issues we’ve run into:
- Misconfigured ETL jobs: There was a short period of time where we were receiving both production and test data from a customer. Because the test data looked “real” it was very difficult to isolate this as the source of erroneous data.
- Duplicate data drops: We’ve also received multiple copies of the same data. This wouldn’t cause problems for versioned tables, since we would only select a single row with the latest version key. Since snapshot tables are ingested as-is, we would read the duplicate files and generate a snapshot with duplicate rows..
- “Primary keys” aren’t unique: Our versioning logic rests on the assumption that a table’s primary key is unique. This isn’t always true – especially with grandfathered systems that many IT teams have to support. We’ve had situations where our data-quality checks alerted us to these errors.
- Latest timestamps are not properly maintained: AIQ gets our customers’ data based on their newest update time. We’ve seen instances where ad-hoc updates to their databases did not increment the update time for those rows.
- Timezones: The bane of every programmer’s existence. Because of inconsistencies with timezones, daylight savings, unix timestamps being converted into milliseconds in our customers’ data, we’ve seen simple queries return bizarre results.
– One concrete example of a problem was when daylight savings hit earlier this year. Most of our clients track sales on a day-by-day basis, midnight to midnight. When daylight savings hit, some of our business variables were recording sales on a 1am to 1am basis.
And here is how we’ve built a resilient system to solve these problems:
- Daily data size checker:
- Before we run our ingest job on our customers’ daily dump of data, we check the size of the delta and ensure it’s within a certain range based on the historical sizes of each delta. If it is, we continue with the ingest job. If it’s not, the ingest job is paused and we get alerted via PagerDuty so we can determine what the cause of the data size spike is.
- We specifically designed this check to happen before any data is ingested. It is much much easier to diagnose and resolve the issue before your environment is contaminated with bad data.
- Table count comparison script
- We have a daily job that gathers row counts for each table on our customer’s Data Warehouse as well as the corresponding tables in our system. When the counts don’t match up, our field engineers are alerted to look into the data discrepancy.
- Data rollback scripts
- No matter what preventive mechanisms are put in place, there is always a possibility that we’ll need to clean out some data. So we’re implementing ways to keep our versioned data very modular. By placing a flag by each batch of data, we’ll easily be able to remove or include specific pieces of data.
- We are also implementing a periodic snapshot of each table, so we can always have a safe baseline to fall back to in case of an emergency.
Having the correct and most up-to-date data for our customers stands as an absolute priority for our team. If the replication process fails – so does everything else. Our engineering team is working everyday to make this process more robust and fail-proof. However in our world where billions of rows across multiple data sources are being transferred daily, there is no such thing as a 0% fail-rate. So with that in mind, everything we build is designed to allow a painless transition to a correct state.
The two priorities for this core process are to minimize the chance of ingesting wrong or incomplete data, and reduce to a minimum the time to recover from the ingestion of bad data. The critical architecture here is the framework enabling a growing number of tests on the data coming inbefore ingestion in order to validate it as thoroughly as possible.