When a client asks us what data accuracy we deliver, our answer is 99.8%. That figure is not drawn from a best-case scenario or a particularly clean source. It is the average field-level accuracy rate across all active client feeds, measured continuously and reported in every delivery summary. This article explains precisely how we achieve and maintain it.
The key insight is that accuracy at this level is not achieved by having better scrapers. It is achieved by having a systematic process that catches errors before they leave our pipeline. Four stages. Every project. No exceptions.
Stage 1: Source Validation
Before a single data point is extracted, we assess the quality and reliability of the sources themselves. Poor-quality sources produce poor-quality data regardless of how sophisticated your extraction logic is.
Identifying Reliable Data Sources
Not all publicly accessible data is equally trustworthy. A product price on a retailer's own website is authoritative; the same price scraped from an aggregator site may be hours or days stale. We evaluate each proposed source against a set of reliability criteria: update frequency, historical consistency, structural stability, and the degree to which the source publisher has an incentive to keep the data accurate.
Checking for Stale Data
Many websites display content that has not been refreshed in line with their stated update frequency. Before a source enters our pipeline, we run a freshness audit: we capture timestamps embedded in pages, compare them against our extraction time, and establish a staleness baseline. Sources that consistently lag well behind their stated cadence are flagged and either supplemented with alternatives or deprioritised.
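A freshness audit of this kind reduces to a simple comparison. The sketch below is illustrative, not our production code; the function names and the 2x tolerance multiplier are assumptions chosen for the example.

```python
from datetime import datetime, timedelta

def staleness(page_timestamp: datetime, extracted_at: datetime) -> timedelta:
    """Lag between the timestamp embedded in the page and our extraction time."""
    return extracted_at - page_timestamp

def is_stale(page_timestamp: datetime, extracted_at: datetime,
             stated_update_interval: timedelta, tolerance: float = 2.0) -> bool:
    """Flag a source whose observed lag exceeds its stated update interval
    by a configurable multiple (here, a hypothetical 2x)."""
    return staleness(page_timestamp, extracted_at) > stated_update_interval * tolerance
```

A source claiming hourly updates but observed ten hours behind would be flagged; one observed an hour behind would not.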
Source Redundancy
For data points that are critical to a client's use case, we identify at least one secondary source. If the primary source becomes unavailable — due to downtime, blocking, or structural changes — the secondary source maintains data continuity. This redundancy adds engineering overhead upfront but prevents the gaps in historical feeds that frustrate downstream analytics.
Stage 2: Extraction Validation
Once data is extracted from a source, it passes through a suite of automated checks before being written to our staging database. These checks are defined per-project based on the agreed data schema and run on every record, every collection cycle.
Schema Validation
Every extracted record is validated against a strict schema definition. Fields that are required must be present. Fields with defined data types — string, integer, decimal, date — must conform to those types. Any record that fails schema validation is rejected from the pipeline and logged for review rather than silently passed through with missing or malformed data.
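In outline, a per-record schema check looks like the following. This is a minimal sketch, assuming a hypothetical three-field schema; real project schemas are defined per-client and are considerably richer.

```python
# Illustrative schema: required field name -> expected Python type.
REQUIRED_FIELDS = {"sku": str, "price": float, "last_seen": str}

def validate_record(record: dict) -> list[str]:
    """Return a list of schema violations; an empty list means the record passes.
    Any record with violations is rejected and logged, never passed through."""
    errors = []
    for field, expected_type in REQUIRED_FIELDS.items():
        if field not in record or record[field] is None:
            errors.append(f"missing required field: {field}")
        elif not isinstance(record[field], expected_type):
            errors.append(f"wrong type for {field}: expected {expected_type.__name__}")
    return errors
```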
Type Checking
Web pages frequently present numeric data as formatted strings — prices with currency symbols, quantities with commas, dates in inconsistent formats. Our extraction layer normalises all values to their canonical types and validates the result. A price field that returns a non-numeric string after normalisation indicates an extraction failure, not a valid price, and is treated accordingly.
Range Checks
For fields where expected value ranges can be defined — prices, quantities, percentages, geographic coordinates — we apply automated range checks. A product price of £0.00 or £999,999 on a dataset where prices ordinarily fall between £5 and £500 triggers an anomaly flag. Range thresholds are set conservatively to catch genuine outliers without suppressing legitimately unusual but accurate values.
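One way to widen thresholds conservatively is to multiply the observed band by a margin, as in this sketch; the band and the 3x margin here are hypothetical, and in practice thresholds are tuned per dataset.

```python
def range_flag(price: float, low: float = 5.0, high: float = 500.0,
               margin: float = 3.0) -> bool:
    """Anomaly flag for prices far outside the usual band. The band is
    widened by a margin multiplier so legitimately unusual but accurate
    values are not suppressed; only genuine outliers trip the flag."""
    return price < low / margin or price > high * margin
```

A £0.00 or £999,999 price on a £5-£500 dataset trips the flag; a £1,200 price, unusual but plausible, does not.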
Null Handling
We treat unexpected nulls as errors, not as acceptable outcomes. If a field is expected to be populated based on the source structure and it is absent, the system logs the specific field, the record identifier, and the page URL from which extraction was attempted. This granular logging is what enables our error rate transparency reports.
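The logging discipline described above can be sketched as follows; the logger name and field list are illustrative.

```python
import logging

logger = logging.getLogger("pipeline.null_check")

def check_expected_fields(record: dict, expected: list[str],
                          record_id: str, page_url: str) -> list[str]:
    """Log every expected-but-absent field together with the record
    identifier and the page URL extraction was attempted from, and
    return the list of missing field names."""
    missing = [f for f in expected if record.get(f) is None]
    for field in missing:
        logger.error("null field=%s record=%s url=%s", field, record_id, page_url)
    return missing
```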
Stage 3: Cross-Referencing
Stage three is where the multi-source architecture pays dividends. Having validated individual records in isolation, we now compare them across sources and against historical data to detect anomalies that single-source validation cannot catch.
Comparing Against Secondary Sources
Where secondary sources are available, extracted values from the primary source are compared against them programmatically. For numeric fields, we apply a configurable tolerance threshold — a price that differs by more than 5% between sources, for example, may indicate that one source has not updated or that an extraction error has occurred on one side. These discrepancies are queued for human review rather than automatically resolved in favour of either source.
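The tolerance comparison reduces to a relative-difference check, sketched here with the 5% example threshold from above; records that trip it are queued for review, never auto-resolved.

```python
def price_discrepancy(primary: float, secondary: float,
                      tolerance: float = 0.05) -> bool:
    """True when the two sources disagree by more than the tolerance,
    measured relative to the primary value. Such records are queued for
    human review rather than resolved in favour of either source."""
    if primary == 0:
        return secondary != 0
    return abs(primary - secondary) / primary > tolerance
```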
Anomaly Detection
We maintain rolling historical baselines for every active data feed. Each new collection run is compared against the baseline to identify statistical outliers: values that fall outside expected distributions, metrics that change by more than a defined percentage between runs, or fields that suddenly shift from populated to null across a significant proportion of records. Anomaly detection catches errors that pass schema and range validation because they look syntactically correct but are semantically implausible in context.
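One common way to express "outside the expected distribution" is a z-score against the rolling baseline; this sketch uses that technique as an illustration, with a hypothetical threshold of three standard deviations.

```python
from statistics import mean, stdev

def is_outlier(history: list[float], value: float,
               z_threshold: float = 3.0) -> bool:
    """Compare a new value against a rolling baseline: flag it when it
    falls more than z_threshold standard deviations from the historical
    mean. Such values may be syntactically valid yet implausible."""
    if len(history) < 2:
        return False  # not enough history to form a baseline
    mu, sigma = mean(history), stdev(history)
    if sigma == 0:
        return value != mu
    return abs(value - mu) / sigma > z_threshold
```

A price series hovering around £100 would not flag £100.20, but a sudden jump to £150 would.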
Stage 4: Delivery QA
The final stage occurs immediately before data is delivered to the client. At this point, the data has passed three automated validation layers, but we apply one further set of checks specific to the client's output requirements.
Structured Output Testing
Every delivery runs through an output test suite that verifies the data conforms to the agreed delivery format — whether that is a JSON schema, a CSV structure, a database table definition, or an API response contract. Field names, ordering, encoding, and delimiter handling are all validated programmatically.
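For a CSV delivery, such a test reduces to checking the header contract and the shape of every row. The column names below are illustrative, not a real client contract.

```python
import csv
import io

# Illustrative delivery contract: exact column names, in order.
EXPECTED_COLUMNS = ["sku", "price", "currency", "captured_at"]

def validate_csv_delivery(payload: str) -> list[str]:
    """Verify a CSV delivery against the agreed column contract:
    header names, ordering, and field count on every data row."""
    rows = list(csv.reader(io.StringIO(payload)))
    errors = []
    if not rows or rows[0] != EXPECTED_COLUMNS:
        errors.append(f"header mismatch: {rows[0] if rows else 'empty file'}")
    for i, row in enumerate(rows[1:], start=2):
        if len(row) != len(EXPECTED_COLUMNS):
            errors.append(f"row {i}: expected {len(EXPECTED_COLUMNS)} fields, got {len(row)}")
    return errors
```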
Client-Specific Format Validation
Many clients have downstream systems with specific expectations about data format. A product identifier that should be a zero-padded eight-digit string must not arrive as a plain integer. A date field used as a partition key in a data warehouse must use the exact format the warehouse expects. We maintain per-client output profiles that capture these requirements and validate against them on every delivery.
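A per-client output profile can be modelled as a set of per-field format rules, sketched here as regular expressions; the field names and patterns are hypothetical examples matching the two cases above.

```python
import re

# Illustrative per-client profile: field name -> pattern the value must match.
CLIENT_PROFILE = {
    "product_id": r"^\d{8}$",                  # zero-padded eight-digit string
    "partition_date": r"^\d{4}-\d{2}-\d{2}$",  # warehouse partition-key format
}

def validate_profile(record: dict) -> list[str]:
    """Check each profiled field against the client's format rules,
    returning one violation message per failing field."""
    return [
        f"{field} fails format {pattern}"
        for field, pattern in CLIENT_PROFILE.items()
        if not re.match(pattern, str(record.get(field, "")))
    ]
```

Under this profile, a plain integer 12345 fails the product_id rule, and a localised date such as 31/01/2025 fails the partition-key rule.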
Delivery Confirmation
Every delivery generates a confirmation record that includes a timestamp, record count, field-level error summary, and a hash of the delivered file or dataset. Clients receive this confirmation alongside their data. If a delivery is delayed, interrupted, or incomplete for any reason, the client is notified proactively rather than discovering the issue themselves.
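In shape, a confirmation record of this kind looks like the following sketch, using SHA-256 as the hash; the exact field names are illustrative.

```python
import hashlib
from datetime import datetime, timezone

def build_confirmation(payload: bytes, record_count: int,
                       field_errors: dict) -> dict:
    """Confirmation record: delivery timestamp, record count, field-level
    error summary, and a SHA-256 hash of the delivered payload, so the
    client can verify exactly what was received."""
    return {
        "delivered_at": datetime.now(timezone.utc).isoformat(),
        "record_count": record_count,
        "field_errors": field_errors,
        "sha256": hashlib.sha256(payload).hexdigest(),
    }
```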
What 0.2% Error Means in Practice
A 99.8% accuracy rate means that, on average, 2 out of every 1,000 field-level data points contain an error. Understanding what that means operationally helps clients set realistic expectations.
How Errors Are Caught
The majority of errors in the 0.2% are caught before delivery by our pipeline. They appear in our internal error logs as rejected records or flagged anomalies. Of errors that do reach the delivered dataset, most are minor formatting inconsistencies or edge cases in value normalisation rather than fundamentally incorrect values.
Client Notification
When errors are detected post-delivery — either by our monitoring systems or reported by the client — we acknowledge the report within two business hours and provide an initial assessment within four. Our error notification includes the specific fields affected, the probable cause, and an estimated time to remediation.
Remediation SLA
Our standard remediation SLA is 24 hours for errors affecting less than 1% of a delivered dataset and 4 hours for errors affecting more than 1%. For clients on enterprise agreements, expedited remediation windows of 2 hours and 1 hour respectively are available. Remediated data is redelivered in the same format as the original, with a clear notation of which records were corrected and what change was made.
Case Study: E-Commerce Competitor Pricing Feed at 99.8%
To illustrate how these four stages function on a real project, consider a feed we have operated for an e-commerce client since late 2024. The brief was to deliver daily competitor pricing data for approximately 12,000 SKUs across nine competitor websites, formatted for direct ingestion into their pricing engine.
Stage 1 identified that two of the nine competitor sites were aggregators with intermittent freshness issues. We introduced a third primary-source alternative for the affected product categories and downgraded the aggregators to secondary reference sources.
Stage 2 caught a recurring issue with one competitor's price display: promotional prices were being presented in a non-standard markup that our initial extractor misidentified as the regular price. The type and range checks flagged a statistically unusual number of prices below a defined minimum threshold, which surfaced the issue within the first collection run. The extractor was corrected the same day.
Stage 3's anomaly detection flagged a three-day period during which one competitor's prices appeared frozen — identical values across consecutive daily runs. Cross-referencing against the secondary source confirmed the competitor's site had experienced a pricing engine outage. The client was notified and the affected data was held rather than delivered as though it were live pricing.
Stage 4's delivery confirmation caught one instance in which the pricing engine's expected date format changed from ISO 8601 to a localised UK format following a client-side system update. The mismatch was detected before the delivery reached the pricing engine and corrected within the same delivery window.
The result across twelve months of operation: a measured field-level accuracy rate of 99.81%, with zero instances of the pricing engine receiving data that caused an incorrect automated price change.
Accuracy You Can Measure and Rely On
Data accuracy at 99.8% does not happen by chance. It is the product of a rigorous, stage-gated pipeline that treats errors as engineering problems to be systematically eliminated rather than statistical noise to be tolerated. If your current data supplier cannot show you field-level accuracy metrics and a documented remediation process, it is worth asking why not.
Ready to discuss your data accuracy requirements? We will walk you through our validation process and show you how it applies to your specific use case.
Request a Quote
Explore Our Services