Methodology

How we extract complete, auditable regulatory data

These are the questions law firms, compliance teams, and economists ask us before a project starts. Plain answers, no sales gloss. If yours is not here, ask us directly.

How do you extract data from a source that has no API?

We work from the same pages a person sees in a browser. The extraction reads the page structure directly, follows the source's own pagination and index, and writes each value to a named field in a structured record. An API is convenient when it exists, but it is not required to get a complete, reliable dataset.
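A minimal sketch of the idea, using only Python's standard library. The page markup (an `li` with class `case`, `span` fields, a `rel="next"` link), the URLs, and the `fetch` callable are all hypothetical stand-ins for a real source, not a description of any particular register.

```python
from html.parser import HTMLParser


class ListingParser(HTMLParser):
    """Collects records and the 'next page' link from one listing page.

    Assumed (hypothetical) markup: each record is <li class="case"> holding
    <span class="FIELD">value</span> items; pagination is <a rel="next">.
    """

    def __init__(self):
        super().__init__()
        self.records = []
        self.next_url = None
        self._current = None  # record being built, or None outside a record
        self._field = None    # field name of the span being read, if any

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "li" and attrs.get("class") == "case":
            self._current = {}
        elif tag == "span" and self._current is not None:
            self._field = attrs.get("class")
        elif tag == "a" and attrs.get("rel") == "next":
            self.next_url = attrs.get("href")

    def handle_data(self, data):
        # Text may arrive in several chunks, so append rather than overwrite.
        if self._current is not None and self._field:
            previous = self._current.get(self._field, "")
            self._current[self._field] = previous + data

    def handle_endtag(self, tag):
        if tag == "span":
            self._field = None
        elif tag == "li" and self._current is not None:
            cleaned = {k: v.strip() for k, v in self._current.items()}
            self.records.append(cleaned)
            self._current = None


def crawl(fetch, start_url):
    """Follow the source's own 'next' links until there are none left."""
    url, seen, records = start_url, set(), []
    while url and url not in seen:  # 'seen' guards against pagination loops
        seen.add(url)
        parser = ListingParser()
        parser.feed(fetch(url))
        parser.close()
        records.extend(parser.records)
        url = parser.next_url
    return records


# Two in-memory pages stand in for a live site.
PAGES = {
    "/cases?page=1": (
        '<ul><li class="case"><span class="ref">A-1</span>'
        '<span class="title">First</span></li></ul>'
        '<a rel="next" href="/cases?page=2">Next</a>'
    ),
    "/cases?page=2": (
        '<ul><li class="case"><span class="ref">A-2</span>'
        '<span class="title">Second</span></li></ul>'
    ),
}
records = crawl(PAGES.__getitem__, "/cases?page=1")
```

Passing `fetch` in as a parameter keeps the traversal logic separate from how pages are loaded, so the same loop works whether pages come from a plain HTTP client or a rendered browser session.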


How do you handle sources behind Cloudflare or other bot protection?

We use a real browser environment that loads the page the way a person's machine would, including JavaScript and challenge handling, then read the data once the page has fully rendered. We keep request rates low and steady so the extraction behaves like ordinary use rather than a flood of traffic.
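The "low and steady" part can be illustrated with a small pacing helper. This is a sketch under assumptions: the class name `SteadyPacer` and the two-second interval in the usage note are illustrative choices, not figures from any real project.

```python
import time


class SteadyPacer:
    """Enforces a minimum interval between consecutive requests.

    clock and sleep are injectable so the pacing can be tested
    without actually waiting.
    """

    def __init__(self, min_interval_s, clock=time.monotonic, sleep=time.sleep):
        self.min_interval_s = min_interval_s
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous request, if any

    def wait(self):
        """Block until at least min_interval_s has passed since the last call."""
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval_s - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now
```

In use, a loop would call `pacer.wait()` before each page load, for example with `pacer = SteadyPacer(2.0)`, so that requests are spaced evenly however quickly each page is processed.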


How do you get data out of PDFs?

Regulators often publish decisions only as PDFs, and many are scanned images rather than text. We extract text-based PDFs directly and run OCR on scanned ones, then parse the result into fields such as parties, dates, case numbers, and outcomes. Each value stays linked to the document it came from.
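The choice between direct extraction and OCR can be sketched per page. The example below assumes the text layer of each page has already been read by some PDF library and that `ocr_page` is a caller-supplied function wrapping an OCR engine; both the function names and the 20-character threshold are hypothetical.

```python
from typing import Callable, List, Tuple


def page_texts(
    text_layers: List[str],
    ocr_page: Callable[[int], str],
    min_chars: int = 20,
) -> List[Tuple[str, str]]:
    """Return (method, text) for each page.

    A page whose embedded text layer has fewer than min_chars characters
    is treated as a scanned image and sent to OCR; otherwise the text
    layer is used directly.
    """
    results = []
    for index, layer in enumerate(text_layers):
        stripped = layer.strip()
        if len(stripped) >= min_chars:
            results.append(("text", stripped))
        else:
            results.append(("ocr", ocr_page(index).strip()))
    return results
```

Recording the method alongside the text makes it possible to review OCR-derived values separately, since those are the ones most likely to contain character-level errors.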


How do you know a dataset is complete?

We reconcile what we extracted against the source's own count. Most registers and archives state how many items they hold or how many pages of results exist. We compare our record count to that figure, and where they differ we identify each missing item rather than reporting a round number and hoping it is right. This is the step that separates a usable regulatory dataset from a partial one. You can see it in a complete worked example: our open dataset of every UK CMA case, reconciled against the source and published with full provenance.
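The reconciliation step reduces to a set comparison plus a count check. A minimal sketch, assuming each item has a stable identifier in the source's index; the function name and the report keys are illustrative.

```python
def reconcile(source_ids, extracted_ids, stated_total=None):
    """Compare extracted records against the source's own index.

    source_ids:    identifiers listed in the source's index
    extracted_ids: identifiers of the records actually extracted
    stated_total:  the count the source itself reports, if it gives one
    """
    source, extracted = set(source_ids), set(extracted_ids)
    missing = sorted(source - extracted)      # listed by the source, not extracted
    unexpected = sorted(extracted - source)   # extracted, not listed by the source
    count_ok = stated_total is None or stated_total == len(extracted)
    return {
        "stated_total": stated_total,
        "index_count": len(source),
        "extracted_count": len(extracted),
        "missing": missing,
        "unexpected": unexpected,
        "complete": not missing and not unexpected and count_ok,
    }


report = reconcile(["c1", "c2", "c3"], ["c1", "c3"], stated_total=3)
```

Here `report["missing"]` names the specific item that was not extracted, which is more useful than a bare difference between two totals.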


How do you stop a regulatory data feed from breaking when the source changes?

Sources change their layout without notice, and a fragile extractor can return empty results while still appearing to run normally. We compare each run against the expected shape of the data, so a structural change triggers an alert instead of a silent gap. We then fix the extraction before the next delivery.
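One way to picture a shape check: a run is compared against a minimum record count and a list of fields that should never be empty. The thresholds, field names, and function name below are assumptions for illustration.

```python
def check_shape(records, required_fields, min_records):
    """Return a list of problems; an empty list means the run looks normal."""
    problems = []
    if len(records) < min_records:
        problems.append(
            f"expected at least {min_records} records, got {len(records)}"
        )
    for field in required_fields:
        # A missing key and an empty value are both treated as empty.
        empty = sum(1 for record in records if not record.get(field))
        if empty:
            problems.append(
                f"field '{field}' empty in {empty} of {len(records)} records"
            )
    return problems
```

A scheduler would raise an alert whenever the returned list is non-empty, so a layout change shows up as a failed check rather than as a quietly shrinking dataset.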


Can every figure be traced back to its source?

Yes. Each record carries the URL or document it came from and the date it was captured. If a partner or a regulator questions a number, you can point to exactly where it originated, which matters when the dataset supports advice or a filing.
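As an illustration of what carrying provenance can look like, the sketch below attaches a source URL and a capture timestamp to a value. The class and field names are hypothetical, and `https://example.org/cases/1` is a placeholder address.

```python
from dataclasses import dataclass, asdict
from datetime import datetime, timezone


@dataclass(frozen=True)
class SourcedValue:
    """One extracted value together with where and when it was captured."""
    name: str          # field name, e.g. "outcome"
    value: str
    source_url: str    # page or document the value came from
    captured_at: str   # ISO 8601 timestamp in UTC


def capture(name, value, source_url, now=None):
    """Build a SourcedValue, stamping it with the current UTC time by default."""
    now = now or datetime.now(timezone.utc)
    return SourcedValue(name, value, source_url, now.isoformat())


example = capture(
    "outcome",
    "Cleared",
    "https://example.org/cases/1",
    now=datetime(2024, 1, 2, tzinfo=timezone.utc),
)
```

Making the record immutable (`frozen=True`) means the provenance cannot be separated from the value or edited after capture without creating a new record.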


Why don't you use AI to read the documents?

A language model can paraphrase a document, but it can also invent a value that was never there, and it may not give the same answer twice. For data that supports legal or regulatory work, that is the wrong trade-off. We use fixed extraction rules so the output is reproducible and every field can be checked against the source.
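A small sketch of what fixed rules mean in practice, using regular expressions. The patterns and the document wording they expect ("Case No.", "Decision date:", "Outcome:") are invented for the example and would differ for any real source.

```python
import re

# Hypothetical rules: one fixed pattern per field.
RULES = {
    "case_number": re.compile(r"Case No\.\s*([A-Z]{2}/\d+/\d{2})"),
    "decision_date": re.compile(r"Decision date:\s*(\d{1,2} \w+ \d{4})"),
    "outcome": re.compile(r"Outcome:\s*(.+)"),
}


def extract_fields(text):
    """Apply every rule to the text; a field with no match is None, never guessed."""
    fields = {}
    for name, pattern in RULES.items():
        match = pattern.search(text)
        fields[name] = match.group(1).strip() if match else None
    return fields


sample = "Case No. ME/1234/21\nDecision date: 3 March 2021\n"
first = extract_fields(sample)
second = extract_fields(sample)
```

The same input always yields the same output, and a value that is absent from the document (here, the outcome) comes back as `None` instead of being filled in.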


Where does the finished data go, and who can see it?

It goes wherever you work: a database, a CSV drop, an SFTP folder, or a server you own. We can run the whole pipeline on infrastructure you control so the data never sits on a shared platform. We work under NDA and do not reuse or publish what we collect for you.

Have a source in mind?

Tell us what it is and what you need from it. We will tell you whether it is extractable and what it takes to deliver the complete record.