Data Profiling: What It Is, How It Works and Why It Matters Before Any Data Project
A complete guide to data profiling for UK businesses — what it reveals about your data's structure, quality and relationships, and why it should precede any migration or integration project.
Most data quality problems could have been anticipated — if anyone had looked carefully at the data before they built something on top of it. Data profiling is the discipline of doing exactly that: systematically analysing a dataset to understand its structure, content, and quality before committing to a project that depends on it. It's the equivalent of a structural survey before buying a property — not glamorous, but the thing that prevents expensive surprises later.
What Is Data Profiling?
Data profiling is the process of examining data from an existing source and collecting statistics and information about that data. The aim is to produce a clear, evidence-based picture of what the data looks like — its format, distribution, completeness, consistency, and relationships — without yet changing anything.
Profiling is analytical and descriptive, not corrective. It answers the question "what do we have?" rather than "how do we fix it?" That distinction matters because the output of profiling directly informs what fixing is needed, how much effort it will require, and whether the project it's intended to support is even feasible with the data as it stands.
Data Profiling vs Data Cleansing
These two activities are related but distinct and sequential. Profiling should always come first. Cleansing without profiling means you're making assumptions about what problems exist — and those assumptions are frequently wrong. A team might spend days writing scripts to deduplicate a database, only to discover that the real problem is a 60% null rate on the email address field, which renders the deduplication largely irrelevant for the email campaign that prompted the project.
Profiling informs cleansing priorities, effort estimates, and approach. It also creates an audit baseline — a documented record of the data's state before intervention — which is valuable for measuring the improvement achieved and for governance purposes.
What Data Profiling Reveals
Null Rates and Population Rates
For every field in a dataset, profiling calculates how often it is populated vs null or blank. A contact database might show that 94% of records have an email address, 71% have a direct dial telephone number, but only 38% have a valid postcode. These population rates immediately signal where data collection processes are failing and which use cases will be constrained by the data as it stands.
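Population rates like these take only a few lines of pandas to compute. A minimal sketch, using a made-up contact table (the field names and values are illustrative, not from any real dataset):

```python
import pandas as pd

# Hypothetical contact records with gaps typical of a CRM export
df = pd.DataFrame({
    "email":    ["a@example.co.uk", "b@example.co.uk", None, "d@example.co.uk"],
    "phone":    ["07700 900123", None, None, "07700 900456"],
    "postcode": [None, "SW1A 1AA", None, None],
})

# Population rate per field: percentage of non-null values
population = df.notna().mean().mul(100).round(1)
print(population)
```

Running this on the sample above reports email at 75%, phone at 50% and postcode at 25%, which is exactly the shape of finding described in the text, just at toy scale.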
Format Distributions
Even populated fields can hold data in inconsistent or invalid formats. Profiling analyses the actual values in each field and identifies the range of formats present. A phone number field might reveal values stored as 07xxx xxxxxx, +44 7xxx xxxxxx, (07xxx) xxxxxx, and plain 07xxxxxxxxx — all representing the same format of number but requiring standardisation before they can be used reliably. It might also reveal values like N/A, TBC, or see above — entries that were technically entered into the field but aren't phone numbers at all.
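One common way to profile formats is to classify each value against a small set of regular expressions and count the matches. A sketch under that approach, with hypothetical pattern names and sample values mirroring the phone number example above:

```python
import re
import pandas as pd

phones = pd.Series([
    "07700 900123", "+44 7700 900456", "(07700) 900789",
    "07700900111", "N/A", "see above",
])

# Hypothetical format classes; real profiling would enumerate more patterns
patterns = {
    "uk_mobile_spaced":  r"^07\d{3} \d{6}$",
    "intl_plus44":       r"^\+44 7\d{3} \d{6}$",
    "bracketed":         r"^\(07\d{3}\) \d{6}$",
    "uk_mobile_compact": r"^07\d{9}$",
}

def classify(value: str) -> str:
    for name, pattern in patterns.items():
        if re.match(pattern, value):
            return name
    return "not_a_phone_number"  # catches N/A, see above, and other free text

print(phones.map(classify).value_counts())
```

The `not_a_phone_number` bucket is often the most valuable output: it surfaces the placeholder entries that a simple "is the field populated?" check would miss.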
Statistical Outliers
For numeric and date fields, profiling computes descriptive statistics: minimum, maximum, mean, median, standard deviation, and distribution. These reveal outliers that may indicate data entry errors or system problems. A customer spend field with a maximum value 1,000 times the mean suggests a data entry error or unit inconsistency (pence vs pounds, perhaps). A date field containing records from 1900 suggests a system default is being written as a real value.
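In pandas, `Series.describe()` produces these descriptive statistics in one call, and a simple z-score flags extreme values. A sketch with invented spend figures, including one value that looks like pence recorded as pounds:

```python
import pandas as pd

# Hypothetical customer spend values; the last entry is suspiciously large
spend = pd.Series([120.0, 85.5, 230.0, 99.0, 150.0, 150000.0])

print(spend.describe())  # count, mean, std, min, quartiles, max

# Flag values more than 2 standard deviations from the mean
z = (spend - spend.mean()).abs() / spend.std()
print(spend[z > 2])
```

The threshold of 2 is a judgment call, not a standard: with a small sample and a single extreme outlier, stricter cut-offs can fail to trigger, so the right threshold depends on the distribution being profiled.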
Referential Integrity
In relational databases or datasets that are supposed to link to one another, profiling checks whether the links are intact. If a CRM has a contacts table and an accounts table, referential integrity checking reveals how many contact records have an account ID that doesn't exist in the accounts table — orphaned records that will cause problems in any reporting or matching operation. In UK B2B datasets, company registration numbers are a common referential field; profiling might reveal that 15% of records have a Companies House number that doesn't match any registered company, suggesting data entry errors or dissolved companies.
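Outside a database, the same check can be done in pandas with `isin`. A minimal sketch with invented contact and account tables (column names are assumptions, not a real schema):

```python
import pandas as pd

accounts = pd.DataFrame({"account_id": ["A1", "A2", "A3"]})
contacts = pd.DataFrame({
    "contact_id": ["C1", "C2", "C3", "C4"],
    "account_id": ["A1", "A2", "A9", None],  # A9 has no matching account
})

# Orphans: contacts whose account_id is populated but absent from accounts
orphans = contacts[
    contacts["account_id"].notna()
    & ~contacts["account_id"].isin(accounts["account_id"])
]
print(orphans)
```

Note the separate handling of nulls: a missing account ID is a population problem, while an ID that points nowhere is a referential integrity problem, and the two usually need different fixes.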
Value Frequency and Cardinality
Profiling enumerates the distinct values in categorical fields and their frequencies. For a job title field, this might reveal that the top 10 values account for 70% of records, or that there are 847 distinct values suggesting the field was free-text rather than controlled vocabulary. For a country field, it might reveal that 98% of records are "United Kingdom" stored 12 different ways — UK, U.K., United Kingdom, Great Britain, England, GB, and so on — pointing to an immediate standardisation task.
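In pandas, `nunique()` gives cardinality and `value_counts()` gives the frequency table. A sketch using an invented country field with the UK stored several ways, as in the example above:

```python
import pandas as pd

# Hypothetical country field: one country, many spellings
country = pd.Series([
    "United Kingdom", "UK", "U.K.", "GB", "United Kingdom",
    "Great Britain", "UK", "France",
])

print(country.nunique())        # cardinality: number of distinct values
print(country.value_counts())   # frequency of each distinct value

# Share of records covered by the top 3 values
top3_share = country.value_counts().head(3).sum() / len(country)
print(f"{top3_share:.0%}")
```

High cardinality on a field that should be a controlled vocabulary (six distinct values for what is really two countries here) is the signal that a standardisation task is waiting.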
Tools for Data Profiling
Several tools are commonly used for data profiling work:
- Python with pandas: The most flexible option for technical users. The `df.describe()`, `df.isnull().sum()`, and `df.value_counts()` methods provide rapid profiling of any dataset loaded into a DataFrame. Libraries like `ydata-profiling` (formerly `pandas-profiling`) generate comprehensive HTML profiling reports automatically.
- OpenRefine: A free, open-source tool with a graphical interface that makes it accessible to non-developers. Excellent for exploring text field distributions, identifying format variations, and understanding cardinality.
- SQL: For data held in relational databases, SQL queries — `COUNT(*)`, `COUNT(DISTINCT)`, `MIN`, `MAX`, `GROUP BY` — can profile data directly without extraction. This is often the most efficient approach for large production databases.
- Commercial data quality platforms: Talend, Informatica, and similar enterprise platforms have dedicated profiling modules that produce structured reports and integrate with downstream cleansing workflows.
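The SQL approach can be sketched end to end with Python's built-in `sqlite3` module — an in-memory stand-in for a production database, with an invented contacts table:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE contacts (id INTEGER, email TEXT, country TEXT)")
con.executemany(
    "INSERT INTO contacts VALUES (?, ?, ?)",
    [
        (1, "a@example.co.uk", "UK"),
        (2, None, "UK"),
        (3, "c@example.co.uk", "GB"),
        (4, None, "UK"),
    ],
)

# Null rate and cardinality in a single pass, with no data extraction:
# COUNT(email) counts only non-null values, so total - emails_populated
# is the null count
row = con.execute("""
    SELECT COUNT(*)                AS total,
           COUNT(email)            AS emails_populated,
           COUNT(DISTINCT country) AS distinct_countries
    FROM contacts
""").fetchone()
print(row)
```

The same query shape runs unchanged on most relational databases, which is why profiling in place is usually preferable to exporting a large production table first.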
When to Profile: Key Triggers
Pre-Migration
A CRM migration — moving from one platform to another — is one of the highest-risk data operations a business undertakes. Profiling the source data thoroughly before migration design begins is essential. It reveals which fields can be mapped directly, which require transformation, which have quality issues that will cause import failures in the target system, and whether volumes are as expected. Migrations that skip profiling routinely encounter problems mid-project that were entirely foreseeable.
Pre-Integration
When connecting a new data feed — a lead generation supplier, a new e-commerce platform, a partner data share — profiling the incoming data before it touches your master database prevents quality problems at source. A week of profiling a new supplier's data format can prevent months of remediation after contaminated records have spread through a CRM.
Pre-Campaign
Before investing significant budget in a marketing campaign, profiling the target list confirms that the data is fit for purpose. It answers questions like: what proportion of these records actually have a valid email address? How many are likely duplicates? When were these records last verified?
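Two of those questions — valid email rate and likely duplicates — can be answered with a quick pass over the target list. A sketch with invented campaign data and a deliberately loose validity regex (real email validation is more involved):

```python
import pandas as pd

campaign = pd.DataFrame({
    "email": [
        "a@example.co.uk", "a@example.co.uk",  # duplicate
        "bad-address", None, "b@example.co.uk",
    ],
})

# Rough shape check only: something@something.something
looks_valid = campaign["email"].str.match(r"^[^@\s]+@[^@\s]+\.[^@\s]+$", na=False)

# Records sharing an email with at least one other populated record
duplicated = campaign["email"].duplicated(keep=False) & campaign["email"].notna()

print(f"valid email rate: {looks_valid.mean():.0%}")
print(f"likely duplicate records: {duplicated.sum()}")
```

If the valid email rate comes back at 60% rather than the 95% the campaign budget assumed, that is the kind of finding profiling exists to surface before the spend is committed.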
Interpreting Profiling Reports
A profiling report is only useful if it informs decisions. When reviewing profiling output, the key questions are: which findings represent blockers (problems that must be fixed before the project can proceed), which are material risks (problems that will degrade results if not addressed), and which are known acceptable compromises? The answers will differ by project type, business context, and risk appetite — but making those decisions explicitly, informed by evidence from the profiling, is far better than discovering the problems halfway through a project.
Need Help Cleaning Your Data?
UK Data Services handles data cleansing, deduplication and quality improvement projects for UK businesses. See our data cleaning services or get in touch for a no-obligation consultation.
Get a Free Consultation