Regulatory & legal data extraction

Regulatory and legal data, extracted in full.

Law firms, compliance teams, and competition economists come to us when a public source holds the data they need but will not give it up cleanly. The register sits behind a bot wall, the decisions are locked in PDFs, the API is undocumented or does not exist. We build the pipeline that pulls the complete record, proves nothing is missing, and hands it to you as structured data you can stand behind.

Who this is for

Built for people who have to defend the data

If a partner, a regulator, or a tribunal can ask where a figure came from, an approximate dataset is a liability. We work with teams who need the whole record and a way to show their working.

Completeness you can prove

A partial dataset is worse than none, because you act on it without knowing what is absent. Every pipeline reconciles against the source's own index, so we can show you the count matches and name anything that did not come through.
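To make the idea concrete, here is a minimal sketch of a reconciliation check, written in Python. It is an illustration under stated assumptions rather than our production code: it assumes the source's index yields one identifier per record, and the names `Reconciliation` and `reconcile` are hypothetical.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class Reconciliation:
    """Outcome of comparing the source's index with what was extracted."""
    expected: int                 # distinct identifiers listed in the source index
    extracted: int                # distinct identifiers present in the dataset
    missing: tuple[str, ...]      # listed by the source but not extracted
    unexpected: tuple[str, ...]   # extracted but not listed by the source

    @property
    def complete(self) -> bool:
        # Complete only when both sides agree exactly.
        return not self.missing and not self.unexpected


def reconcile(index_ids: Iterable[str], extracted_ids: Iterable[str]) -> Reconciliation:
    """Compare identifiers from the source index with extracted identifiers."""
    index_set = set(index_ids)
    extracted_set = set(extracted_ids)
    return Reconciliation(
        expected=len(index_set),
        extracted=len(extracted_set),
        missing=tuple(sorted(index_set - extracted_set)),
        unexpected=tuple(sorted(extracted_set - index_set)),
    )
```

A result whose `missing` tuple is empty shows the counts match. A non-empty one names exactly which records did not come through.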

A full audit trail

Every record traces back to the page or document it came from, with the date it was captured. When someone asks where a figure originated, you have the answer in the data itself.
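As a rough sketch of what record-level provenance can look like, the Python below attaches the source URL, the capture time, and a hash of the raw page or document to each extracted record. The field names (`_provenance`, `source_url`, `captured_at`, `sha256`) and the function `stamp` are assumptions made for this example, not a fixed schema.

```python
import hashlib
from datetime import datetime, timezone
from typing import Optional


def stamp(fields: dict, source_url: str, raw: bytes,
          captured_at: Optional[datetime] = None) -> dict:
    """Return a copy of `fields` with provenance metadata attached.

    `raw` is the exact bytes of the page or document the fields came from.
    `captured_at` defaults to the current time in UTC.
    """
    when = captured_at if captured_at is not None else datetime.now(timezone.utc)
    return {
        **fields,
        "_provenance": {
            "source_url": source_url,
            "captured_at": when.isoformat(),
            # The hash lets anyone confirm later which version of the source was read.
            "sha256": hashlib.sha256(raw).hexdigest(),
        },
    }
```

Because the provenance travels with the record, the answer to "where did this figure come from" sits next to the figure.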

Deterministic, not generative

We do not put a language model between you and the source. The extraction follows fixed rules, so the same input gives the same output every time, and a colleague can re-run it and get an identical result.
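A small Python sketch of rule-based extraction shows what "fixed rules" means in practice. The labels `Case reference:` and `Decision date:` and the field names are invented for the example. Real sources need their own rules.

```python
import re

# Hypothetical rules: each output field is bound to one fixed pattern.
RULES = {
    "case_ref": re.compile(r"Case reference:\s*(\S+)"),
    "decision_date": re.compile(r"Decision date:\s*(\d{4}-\d{2}-\d{2})"),
}


def extract(text: str) -> dict:
    """Apply every rule to `text`; a field with no match is None, never guessed."""
    record = {}
    for field, pattern in RULES.items():
        match = pattern.search(text)
        record[field] = match.group(1) if match else None
    return record
```

There is no randomness and no model in the loop, so running `extract` twice on the same text gives the same record, and a missing field shows up as `None` instead of a plausible-looking guess.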

Your data, your infrastructure

We can run the pipeline on servers you control and deliver into systems you own. Nothing has to pass through a shared platform or a third-party cloud.

Confidential by default

We work under NDA as standard. We do not publish client names, reference engagements in marketing, or reuse your data. The work we do for you stays yours.

Built to keep running

Sources change their markup, move pages, and restructure without warning. We watch for that and fix the extraction before your feed goes quietly stale.
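One way to notice a markup change, sketched in Python with the standard library only: fingerprint the page's tag structure and compare it with a stored baseline. This is an illustrative approach, not a description of our monitoring. The function name `structure_fingerprint` is hypothetical, and a real check would also look at record counts and field fill rates.

```python
import hashlib
from html.parser import HTMLParser


class _TagCollector(HTMLParser):
    """Collects each opening tag with its class attribute, ignoring text content."""

    def __init__(self) -> None:
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        classes = dict(attrs).get("class") or ""
        self.tags.append(f"{tag}.{classes}")


def structure_fingerprint(html: str) -> str:
    """Hash the sequence of tags and classes, so changes to text alone do not alter it."""
    collector = _TagCollector()
    collector.feed(html)
    collector.close()
    joined = "\n".join(collector.tags)
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()
```

If today's fingerprint differs from the baseline, the page layout has changed and the extraction rules should be reviewed before the next delivery.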

Sources we handle

If you can reach it in a browser, we can usually extract it

We work across UK and international sources. Common ones are competition authority case decisions, financial regulator enforcement notices, court and tribunal judgments, official gazettes, and public registers. The hard part is rarely the data itself. It is the source: pages rendered by JavaScript, registers behind Cloudflare or other bot protection, decisions published only as scanned PDFs, and APIs that exist but are undocumented. Those are the jobs we take. For a worked example built on an openly accessible source, see our complete dataset of every UK CMA case.

How an engagement works

From source to dataset

Most engagements are either a one-off extraction or an ongoing feed we maintain. Either way the shape is the same, and we tell you whether a source is reachable before you commit to anything.

1. Scope

You tell us the source, the fields, the date range, and how often you need it refreshed. We confirm what is reachable and flag anything legally or technically awkward before any work starts.

2. Build and validate

We build the extraction and check it against the source by hand before anything goes live. You see a sample and sign off on the format and the field definitions.

3. Deliver

Data lands where you work: a database, a CSV drop, an SFTP folder, or a server you own. It arrives as a single dataset or on a schedule you set.

4. Maintain

For ongoing feeds we monitor the source for changes and keep the pipeline running, so you are not the one who finds out it broke.

Tell us the source and what you need from it

We will tell you whether it is extractable, what the complete record looks like, and what it takes to deliver. UK-based, ICO-registered, working under NDA.