Worked example · open dataset

Every UK CMA case, in one reconciled dataset.

We extracted all 2,562 cases on the Competition and Markets Authority's public case list, the earliest dating from 1999, structured them into a single file, and reconciled the count against the source's own total so you can see that nothing is missing. It is free to download. We built it to show how we work, on a public source anyone can verify.

2,562 cases, reconciled

1999 earliest case

12 case types

100% of the source total

What is in it

One row per case, ready to use

Each record carries the case title, type, state, outcome, market sector, the dates it opened and closed, when it was last updated, and a link back to the case on GOV.UK. The types span the CMA's full remit: 1,988 mergers, 169 Competition Act and civil cartel cases, 131 subsidy control referrals, 108 consumer enforcement cases, 70 market studies and investigations, and the rest across regulatory appeals, criminal cartels, and director disqualifications. The data comes as CSV and JSON.
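The breakdown can be checked with simple arithmetic. The sketch below uses only the counts quoted in this section; the dictionary keys are illustrative labels, not field values taken from the dataset.

```python
# Counts quoted in the paragraph above. Keys are illustrative labels only,
# not the case-type values used in the published CSV or JSON.
TOTAL_CASES = 2562

named_types = {
    "mergers": 1988,
    "competition_act_and_civil_cartels": 169,
    "subsidy_control_referrals": 131,
    "consumer_enforcement": 108,
    "market_studies_and_investigations": 70,
}

named_total = sum(named_types.values())

# Whatever is left falls under regulatory appeals, criminal cartels,
# and director disqualifications.
remainder = TOTAL_CASES - named_total

print(named_total, remainder)  # prints: 2466 96
```

The five named groups account for 2,466 cases, which leaves 96 across the remaining types.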

How it was built

The parts that make it trustworthy

Anyone can scrape a list of cases. The work is in proving it is the whole list and that every value traces back to the source.

Completeness, reconciled

The source reports how many cases it holds. We collected 2,562 and the source says 2,562, so the dataset is complete against the source's own count on the day it was captured. That check is recorded in the provenance file, not just asserted here.
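A reconciliation check of this kind might look like the following minimal sketch. The function name, the result keys, and the sample links are hypothetical; the published provenance file defines the real structure.

```python
def reconcile(collected_links, reported_total):
    """Compare what was collected against the total the source reports.

    `collected_links` is assumed to hold one identifier per collected case
    (here, its GOV.UK path). Counting unique values guards against a total
    that only matches because some cases were collected twice.
    """
    unique = set(collected_links)
    return {
        "reported_total": reported_total,
        "collected": len(collected_links),
        "unique": len(unique),
        "duplicates": len(collected_links) - len(unique),
        "complete": len(unique) == reported_total,
    }


# Made-up example: three links collected, one of them twice,
# against a source that reports three cases.
sample = ["/cma-cases/a", "/cma-cases/b", "/cma-cases/b"]
check = reconcile(sample, reported_total=3)
print(check["complete"], check["duplicates"])  # prints: False 1
```

Comparing unique identifiers rather than raw row counts matters because a duplicate and a missing case would otherwise cancel out.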

Reproducible by design

The extractor is a single script that reads the public GOV.UK Search API. Run it yourself and you get the same dataset, plus any cases published since our capture. We publish the script alongside the data, so the method is open to inspection.
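To give a feel for what paging through the Search API involves, here is a minimal sketch. It is not the published extractor: the `filter_format=cma_case` filter, the page size of 500, and the `total` and `results` response keys are assumptions to check against the GOV.UK Search API documentation and the script in the download.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

SEARCH_URL = "https://www.gov.uk/api/search.json"


def page_url(start, count=500, doc_format="cma_case"):
    """Build the URL for one page of results (filter value is an assumption)."""
    params = {"filter_format": doc_format, "count": count, "start": start}
    return f"{SEARCH_URL}?{urlencode(params)}"


def page_starts(total, count=500):
    """Offsets needed to cover `total` results in pages of `count`."""
    return list(range(0, total, count))


def fetch_all(count=500):
    """Fetch every page and return (total reported by the source, results).

    Assumes each response is JSON with a `total` and a `results` key.
    """
    with urlopen(page_url(0, count)) as response:
        first = json.load(response)
    total = first["total"]
    results = list(first["results"])
    for start in page_starts(total, count)[1:]:
        with urlopen(page_url(start, count)) as response:
            results.extend(json.load(response)["results"])
    return total, results
```

With 2,562 cases and pages of 500, `page_starts(2562)` gives six offsets, so the whole list is six requests. Returning the reported total alongside the results is what makes the completeness check possible.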

Down to the documents

Cases are only the top layer. We also enumerate the PDF attachments behind each case and extract their text. A 125-page final report came out as 388,000 characters of searchable text, parsed straight from the source PDF.
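As an illustration of the text-extraction step, the sketch below uses the third-party `pypdf` library. That choice is an assumption made for the example; this page does not say which PDF parser the pipeline uses, and the function names are hypothetical.

```python
def extract_pages(pdf_path):
    """Return the text of each page of a text-based PDF.

    Uses pypdf (pip install pypdf), imported here so the rest of the
    module works without it. Image-only pages come back empty.
    """
    from pypdf import PdfReader

    reader = PdfReader(pdf_path)
    return [page.extract_text() or "" for page in reader.pages]


def summarise(pages):
    """Page count, character count, and how many pages yielded no text."""
    return {
        "pages": len(pages),
        "characters": sum(len(text) for text in pages),
        "empty_pages": sum(1 for text in pages if not text.strip()),
    }


# Made-up page texts standing in for a real report.
print(summarise(["Summary of findings", "", "Annex A"]))
# prints: {'pages': 3, 'characters': 26, 'empty_pages': 1}
```

A non-zero `empty_pages` count is a useful signal that a document, or part of it, is a scan and needs OCR instead.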

Scanned documents too

When a decision exists only as a scan, OCR reads it. In a controlled test, we rasterised a real CMA decision to image-only pages and ran OCR on them, which recovered 99.5% of the words in the original.
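One way to score such a test is the share of the original's words that reappear in the OCR output. The sketch below is one plausible definition, not necessarily the one behind the 99.5% figure; the tokenisation rule and function names are assumptions.

```python
import re
from collections import Counter


def words(text):
    """Lower-cased runs of letters and digits; punctuation is ignored."""
    return re.findall(r"[a-z0-9]+", text.lower())


def word_recovery(original, ocr_output):
    """Fraction of the original's words found in the OCR output.

    Words are matched as a multiset, so a word that appears three times
    in the original must appear three times in the OCR text to count fully.
    """
    want = Counter(words(original))
    got = Counter(words(ocr_output))
    total = sum(want.values())
    if total == 0:
        return 1.0
    matched = sum((want & got).values())
    return matched / total


# Made-up sentence with one typical OCR confusion (letter l read as digit 1).
score = word_recovery(
    "The merger was cleared in Phase 1",
    "The merger was c1eared in Phase 1",
)
print(round(score, 3))  # prints: 0.857
```

This measure ignores word order, so it checks that the vocabulary survived OCR rather than that the layout did.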

What we did not hide

The gaps are named, not smoothed over

An honest dataset reports its own holes. Of the 2,562 cases, 1,240 have no opened date and 314 closed cases have no recorded outcome. Almost all of these are pre-2014 records inherited from the Office of Fair Trading and the Competition Commission, which never carried those fields. The gaps are in the source, not the extraction, and the provenance file counts every one. That is the difference between a dataset you can rely on and a number someone rounded up and hoped you would not check.
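Counting gaps like these is straightforward once the records are loaded. The sketch below assumes each record is a dictionary with `opened_date`, `state`, and `outcome` keys and that closed cases carry the state value `closed`; those names are hypothetical, so check them against the CSV header before use.

```python
def count_gaps(rows):
    """Count records with no opened date and closed cases with no outcome.

    Field names and the "closed" state value are assumptions for this sketch.
    Empty strings, whitespace, and None are all treated as missing.
    """
    missing_opened = 0
    closed_without_outcome = 0
    for row in rows:
        if not (row.get("opened_date") or "").strip():
            missing_opened += 1
        is_closed = (row.get("state") or "").strip().lower() == "closed"
        if is_closed and not (row.get("outcome") or "").strip():
            closed_without_outcome += 1
    return {
        "total": len(rows),
        "missing_opened_date": missing_opened,
        "closed_without_outcome": closed_without_outcome,
    }


# Made-up records, not taken from the dataset.
rows = [
    {"opened_date": "2021-03-01", "state": "closed", "outcome": "cleared"},
    {"opened_date": "", "state": "closed", "outcome": ""},
    {"opened_date": "", "state": "open", "outcome": ""},
    {"opened_date": "2009-06-15", "state": "closed", "outcome": None},
]
print(count_gaps(rows))
# prints: {'total': 4, 'missing_opened_date': 2, 'closed_without_outcome': 2}
```

Run against the full file, the same two counts can be compared with the figures recorded in the provenance file.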

Download

Take it, check it, use it

Free to use under the Open Government Licence, the same licence as the source. The scripts are included so you can verify the method or re-run it for the latest cases.

Everything, zipped: data, provenance, scripts (ZIP)

CMA cases (CSV)

CMA cases (JSON)

Provenance and completeness check (JSON)

The extractor script (PY)