Worked example · open dataset
We extracted all 2,562 cases the Competition and Markets Authority has published since 1999, structured them into a single file, and checked the count against the source so you can see nothing is missing. It is free to download. We built it to show how we work, on a public source anyone can verify.
Each record carries the case title, type, state, outcome, market sector, the dates it opened and closed, when it was last updated, and a link back to the case on GOV.UK. The types span the CMA's full remit: 1,988 mergers, 169 Competition Act and civil cartel cases, 131 subsidy control referrals, 108 consumer enforcement cases, 70 market studies and investigations, and the remaining 96 across regulatory appeals, criminal cartels, and director disqualifications. The data comes as CSV and JSON.
Anyone can scrape a list of cases. The work is in proving it is the whole list and that every value traces back to the source.
The source reports how many cases it holds. We collected 2,562 and the source says 2,562, so the dataset is provably complete on the day it was captured. That check is recorded in the provenance file, not just asserted here.
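As a rough illustration of that check, the sketch below compares a list of collected case identifiers with the total the source reports and builds a small record that could be written to a provenance file. The function name, the field names, and the example paths are all hypothetical; the published provenance file may be laid out differently.

```python
import json
from datetime import datetime, timezone


def completeness_record(collected_ids, reported_total):
    """Compare what was collected with the total the source reports."""
    unique_ids = set(collected_ids)
    return {
        "captured_at": datetime.now(timezone.utc).isoformat(),
        "reported_total": reported_total,
        "collected": len(collected_ids),
        "unique": len(unique_ids),
        "duplicates": len(collected_ids) - len(unique_ids),
        # Complete only if the number of distinct records equals
        # the total the source itself reports.
        "complete": len(unique_ids) == reported_total,
    }


if __name__ == "__main__":
    # Toy input: three hypothetical case paths and a reported total of three.
    record = completeness_record(
        ["/cma-cases/a", "/cma-cases/b", "/cma-cases/c"], 3
    )
    print(json.dumps(record, indent=2))
```

Counting distinct identifiers rather than rows means a duplicated record cannot mask a missing one.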
The extractor is a single script that reads the public GOV.UK Search API. Run it yourself and you get the same dataset. We publish the script alongside the data, so the method is open to inspection.
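For readers who want a feel for the approach before opening the published script, here is a minimal sketch of paging through a search endpoint until the reported total is reached. It is not the published extractor. The filter value `cma_case`, the page size, and the function names are assumptions made for illustration; the `total` and `results` keys are what the sketch expects each response page to contain.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

SEARCH_URL = "https://www.gov.uk/api/search.json"


def fetch_page(start, count):
    """Fetch one page of results. The filter value is an assumption."""
    query = urlencode({
        "filter_format": "cma_case",  # assumed filter for CMA case pages
        "start": start,
        "count": count,
    })
    with urlopen(f"{SEARCH_URL}?{query}") as response:
        return json.load(response)


def collect_all(fetch=fetch_page, page_size=500):
    """Page through results until the reported total has been collected."""
    results, total, start = [], None, 0
    while total is None or start < total:
        page = fetch(start, page_size)
        total = page["total"]
        batch = page["results"]
        if not batch:  # stop rather than loop forever on an empty page
            break
        results.extend(batch)
        start += len(batch)
    return results, total
```

Passing the fetch function in as a parameter keeps the paging logic testable offline: a stub that returns canned pages can stand in for the live API.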
Cases are only the top layer. We also enumerate the PDF attachments behind each case and extract their text. A 125-page final report came out as 388,000 characters of searchable text, parsed straight from the source PDF.
When a decision exists only as a scan, OCR reads it. On a controlled test, rasterising a real CMA decision to image-only pages and running OCR recovered 99.5% of the words from the original.
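One plausible way to score a test like this is word-level recovery: the share of words in the original text that also turn up in the OCR output. The sketch below assumes that definition; the exact metric behind the 99.5% figure is not specified here, and the tokenisation rule and function names are illustrative choices.

```python
import re
from collections import Counter


def words(text):
    """Lowercase word tokens; punctuation and layout are ignored."""
    return re.findall(r"[a-z0-9]+(?:'[a-z]+)?", text.lower())


def word_recovery(original, ocr_output):
    """Share of the original's words that also appear in the OCR output."""
    expected = Counter(words(original))
    found = Counter(words(ocr_output))
    # Counter intersection keeps the minimum count per word, so a word
    # repeated in the original must be recovered that many times.
    recovered = sum((expected & found).values())
    total = sum(expected.values())
    return recovered / total if total else 1.0


if __name__ == "__main__":
    # Made-up sentence with one typical OCR confusion ("m" read as "rn").
    print(word_recovery(
        "The merger was cleared at Phase 1",
        "The rnerger was cleared at Phase 1",
    ))
```

In the toy example, six of the seven original words survive, so the score is 6/7. This measure ignores word order, which makes it tolerant of layout differences but blind to scrambled text.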
An honest dataset reports its own holes. Of the 2,562 cases, 1,240 have no opened date and 314 closed cases have no recorded outcome. Almost all of these are pre-2014 records inherited from the Office of Fair Trading and the Competition Commission, which never carried those fields. The gaps are in the source, not the extraction, and the provenance file counts every one. That is the difference between a dataset you can rely on and a number someone rounded up and hoped you would not check.
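Gap counts like these are straightforward to reproduce from the CSV. The sketch below shows one way to do it, assuming hypothetical column names (`opened_date`, `state`, `outcome`) and a state value of `closed`; the rows in `SAMPLE` are invented for illustration, and the real file may use different headers.

```python
import csv
import io


def count_gaps(rows):
    """Count records with no opened date, and closed cases with no outcome."""
    gaps = {"no_opened_date": 0, "closed_without_outcome": 0}
    for row in rows:
        if not (row.get("opened_date") or "").strip():
            gaps["no_opened_date"] += 1
        is_closed = (row.get("state") or "").strip().lower() == "closed"
        if is_closed and not (row.get("outcome") or "").strip():
            gaps["closed_without_outcome"] += 1
    return gaps


# Invented rows, standing in for the real CSV.
SAMPLE = """title,state,outcome,opened_date
Example merger A,closed,Phase 1 clearance,2019-03-01
Example merger B,closed,,
Example market study C,open,,2023-06-12
"""

if __name__ == "__main__":
    print(count_gaps(csv.DictReader(io.StringIO(SAMPLE))))
```

An open case with no outcome is not counted as a gap, since an outcome is only expected once a case has closed.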
The dataset is free to use under the Open Government Licence, the same licence as the source. The scripts are included so you can verify the method or re-run it for the latest cases.
GOV.UK is open. Most regulators are not: bot walls, JavaScript-rendered pages, PDF-only publications, no API. Getting reliable data out of sources like those is the work we do. Tell us the source and what you need from it.