Regulatory & legal data extraction
Law firms, compliance teams, and competition economists come to us when a public source holds the data they need but will not give it up cleanly. The register sits behind a bot wall, the decisions are locked in PDFs, the API is undocumented or does not exist. We build the pipeline that pulls the complete record, proves nothing is missing, and hands it to you as structured data you can stand behind.
If a partner, a regulator, or a tribunal can ask where a figure came from, an approximate dataset is a liability. We work with teams who need the whole record and a way to show their working.
A partial dataset is worse than none, because you act on it without knowing what is absent. Every pipeline reconciles against the source's own index, so we can show you the count matches and name anything that did not come through.
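The reconciliation idea can be illustrated in a few lines. This is a minimal sketch, not our production code: it assumes the source's index and the extracted records can both be reduced to lists of identifiers, and the function name `reconcile` and the `case-00x` identifiers are hypothetical.

```python
def reconcile(index_ids, extracted_ids):
    """Compare the source's own index against what the pipeline captured."""
    expected = set(index_ids)
    captured = set(extracted_ids)
    return {
        "expected": len(expected),
        "captured": len(expected & captured),
        # Listed in the source index but not extracted.
        "missing": sorted(expected - captured),
        # Extracted but absent from the source index.
        "unexpected": sorted(captured - expected),
        "complete": expected == captured,
    }


# Hypothetical identifiers, for illustration only.
report = reconcile(
    index_ids=["case-001", "case-002", "case-003"],
    extracted_ids=["case-001", "case-003"],
)
print(report["missing"])
```

The point of the report is that a gap is named rather than hidden: a count that does not match comes with the list of records that account for the difference.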
Every record traces back to the page or document it came from, with the date it was captured. When someone asks where a figure originated, you have the answer in the data itself.
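As a rough sketch of what that can look like in the data, each record might carry a small provenance block alongside its fields. The field names, the `with_provenance` helper, and the content hash are assumptions made for this illustration, and the URL is a placeholder.

```python
import hashlib
from dataclasses import asdict, dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class Provenance:
    source_url: str      # page or document the record was taken from
    captured_at: str     # capture time, ISO 8601 in UTC
    content_sha256: str  # hash of the raw bytes as captured (an assumption of this sketch)


def with_provenance(fields, source_url, raw, captured_at):
    """Return a copy of the extracted fields with provenance attached."""
    prov = Provenance(
        source_url=source_url,
        captured_at=captured_at.astimezone(timezone.utc).isoformat(),
        content_sha256=hashlib.sha256(raw).hexdigest(),
    )
    return {**fields, "provenance": asdict(prov)}


# Placeholder values, for illustration only.
record = with_provenance(
    fields={"case_ref": "EXAMPLE-1"},
    source_url="https://example.org/cases/1",
    raw=b"<html>...</html>",
    captured_at=datetime(2024, 1, 2, 3, 4, 5, tzinfo=timezone.utc),
)
print(record["provenance"]["captured_at"])
```

Storing a hash of the captured bytes is one possible way to show later that a record reflects the page as it stood on the capture date.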
We do not put a language model between you and the source. The extraction follows fixed rules, so the same input gives the same output every time, and a colleague can re-run it and get an identical result.
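To make "fixed rules" concrete, here is a minimal sketch of rule-based extraction using regular expressions. The labels ("Case reference:", "Decision date:"), the field names, and the sample text are invented for illustration; real sources need rules written against their actual layout.

```python
import re

# Each field has exactly one fixed rule. Nothing here is probabilistic,
# so the same input text always produces the same output.
RULES = {
    "case_ref": re.compile(r"Case reference:\s*(\S+)"),
    "decision_date": re.compile(r"Decision date:\s*(\d{4}-\d{2}-\d{2})"),
}


def extract(text):
    """Apply every rule to the text; a field with no match is None."""
    result = {}
    for field, pattern in RULES.items():
        match = pattern.search(text)
        result[field] = match.group(1) if match else None
    return result


# Hypothetical document text, for illustration only.
sample = "Case reference: EX/0001/24\nDecision date: 2024-03-15\n"
first_run = extract(sample)
second_run = extract(sample)
print(first_run == second_run)
```

A field that cannot be matched comes back as `None` rather than a guess, which keeps gaps visible to whoever checks the output.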
We can run the pipeline on servers you control and deliver into systems you own. Nothing has to pass through a shared platform or a third-party cloud.
We work under NDA as standard. We do not publish client names, reference engagements in marketing, or reuse your data. The work we do for you stays yours.
Sources change their markup, move pages, and restructure without warning. We watch for that and fix the extraction before your feed goes quietly stale.
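One simple way to notice a markup change is to fingerprint a page's structure and compare it with a stored baseline. The sketch below is an assumption about how such a check could work, not a description of our monitoring: it uses Python's standard `html.parser`, and the function names and sample HTML are hypothetical.

```python
import hashlib
from html.parser import HTMLParser


class _TagCollector(HTMLParser):
    """Collects each start tag with its class attribute, ignoring text."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        css_class = dict(attrs).get("class") or ""
        self.tags.append(f"{tag}.{css_class}")


def structure_fingerprint(html):
    """Hash the sequence of tags and classes, not the text content."""
    collector = _TagCollector()
    collector.feed(html)
    collector.close()
    joined = "|".join(collector.tags)
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()


def markup_changed(baseline_html, current_html):
    return structure_fingerprint(baseline_html) != structure_fingerprint(current_html)


# Hypothetical pages, for illustration only.
baseline = '<div class="case"><span class="ref">A-1</span></div>'
new_text = '<div class="case"><span class="ref">B-2</span></div>'
new_markup = '<div class="case"><span class="reference">A-1</span></div>'
print(markup_changed(baseline, new_text), markup_changed(baseline, new_markup))
```

Because the fingerprint ignores text, new content on an unchanged template does not raise an alert, while a renamed class or moved element does. A listing page whose number of rows varies would need a narrower fingerprint than this one.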
We work across UK and international sources. Common ones include competition authority case decisions, financial regulator enforcement notices, court and tribunal judgments, official gazettes, and public registers. The hard part is rarely the data itself. It is the source: pages rendered by JavaScript, registers behind Cloudflare or other bot protection, decisions published only as scanned PDFs, and APIs that exist but are undocumented. Those are the jobs we take. For a worked example using an openly accessible source, see our complete dataset of every UK CMA case.
Most engagements are either a one-off extraction or an ongoing feed we maintain. Either way the shape is the same, and we tell you whether a source is reachable before you commit to anything.
You tell us the source, the fields, the date range, and how often you need it refreshed. We confirm what is reachable and flag anything legally or technically awkward before any work starts.
We build the extraction and check it against the source by hand before anything goes live. You see a sample and sign off on the format and the field definitions.
Data lands where you work: a database, a CSV drop, an SFTP folder, or a server you own. We deliver it as a single dataset or on a schedule you set.
For ongoing feeds we monitor the source for changes and keep the pipeline running, so you are not the one who finds out it broke.
We will tell you whether it is extractable, what the complete record looks like, and what it takes to deliver. UK-based, ICO-registered, working under NDA.