Types of Data Problems
1. Missing Values
Missing data appears as empty cells, null, NaN, or placeholder values like "N/A".
Dirty Data:
| Name | Age | Score |
|---|---|---|
| Alice | 25 | 95 |
| Bob | | 88 |
| Carol | 22 | N/A |
| Dave | 30 | 92 |
Clean Data (imputed):
| Name | Age | Score |
|---|---|---|
| Alice | 25 | 95 |
| Bob | 26 | 88 |
| Carol | 22 | 92 |
| Dave | 30 | 92 |
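A minimal sketch of imputation on the table above, using pandas. Here each missing value is filled with the column median (one common strategy); the exact imputed numbers depend on the strategy you choose, so they may differ slightly from the cleaned table shown.

```python
import pandas as pd

# Rebuild the "dirty" table: Bob's age and Carol's score are missing
df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Carol", "Dave"],
    "Age": [25, None, 22, 30],
    "Score": [95, 88, None, 92],
})

# Fill each numeric column's gaps with that column's median
df["Age"] = df["Age"].fillna(df["Age"].median())
df["Score"] = df["Score"].fillna(df["Score"].median())
```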
2. Duplicate Records
The same information appears multiple times, inflating statistics and creating bias.
Dirty Data:
| Customer ID | Name | Purchase |
|---|---|---|
| 001 | John Smith | $50 |
| 001 | John Smith | $50 |
| 002 | Jane Doe | $75 |
| 002 | Jane Doe | $75 |
Clean Data (deduplicated):
| Customer ID | Name | Purchase |
|---|---|---|
| 001 | John Smith | $50 |
| 002 | Jane Doe | $75 |
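Deduplication like the example above can be sketched in pandas. `drop_duplicates()` removes fully identical rows; passing `subset` keeps one record per unique ID even when other columns differ.

```python
import pandas as pd

# Rebuild the "dirty" table with repeated purchases
df = pd.DataFrame({
    "CustomerID": ["001", "001", "002", "002"],
    "Name": ["John Smith", "John Smith", "Jane Doe", "Jane Doe"],
    "Purchase": [50, 50, 75, 75],
})

clean = df.drop_duplicates()  # remove fully identical rows

# Or keep the first record per unique ID, regardless of other columns
clean_by_id = df.drop_duplicates(subset="CustomerID", keep="first")
```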
3. Outliers and Errors
Values that are impossibly high or low, or obviously wrong due to data-entry errors.
Dirty Data:
| Person | Age | Height (cm) |
|---|---|---|
| Alice | 25 | 165 |
| Bob | 999 | 180 |
| Carol | 22 | 158 |
| Dave | 30 | 5 |
Clean Data (corrected):
| Person | Age | Height (cm) |
|---|---|---|
| Alice | 25 | 165 |
| Bob | 29 | 180 |
| Carol | 22 | 158 |
| Dave | 30 | 175 |
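One way to handle the errors above: apply simple domain rules (a human age is 0–120, a height is roughly 50–250 cm) to blank out impossible values, then impute. This sketch fills with the median of the remaining plausible values, so the corrected numbers differ slightly from the table above, which used other estimates.

```python
import pandas as pd

# Rebuild the "dirty" table: Bob's age and Dave's height are data-entry errors
df = pd.DataFrame({
    "Person": ["Alice", "Bob", "Carol", "Dave"],
    "Age": [25, 999, 22, 30],
    "Height_cm": [165, 180, 158, 5],
})

# Domain rules: values outside a plausible range become missing (NaN)
df["Age"] = df["Age"].where(df["Age"].between(0, 120))
df["Height_cm"] = df["Height_cm"].where(df["Height_cm"].between(50, 250))

# Then impute with the median of the plausible values
df["Age"] = df["Age"].fillna(df["Age"].median())
df["Height_cm"] = df["Height_cm"].fillna(df["Height_cm"].median())
```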
4. Formatting Inconsistencies
The same information represented in different formats makes analysis difficult.
Dirty Data:
| Name | Date | Phone |
|---|---|---|
| john smith | 03/15/2024 | 555-1234 |
| JANE DOE | 2024-03-16 | (555) 5678 |
| Bob Jones | March 17, 2024 | 5559012 |
Clean Data (standardized):
| Name | Date | Phone |
|---|---|---|
| John Smith | 2024-03-15 | 555-1234 |
| Jane Doe | 2024-03-16 | 555-5678 |
| Bob Jones | 2024-03-17 | 555-9012 |
5. Inconsistent Categories
The same category written in different ways creates artificial category splits in the data.
Dirty Data:
| Product | Category | Price |
|---|---|---|
| Shirt | Clothing | $25 |
| Pants | clothing | $40 |
| Shoes | Clothes | $60 |
| Hat | Apparel | $15 |
Clean Data (standardized):
| Product | Category | Price |
|---|---|---|
| Shirt | Clothing | $25 |
| Pants | Clothing | $40 |
| Shoes | Clothing | $60 |
| Hat | Clothing | $15 |
Why Data Cleaning Matters
The Danger of Dirty Data
Dirty data leads to incorrect conclusions, biased AI models, and poor business decisions. An AI model is only as good as the data it learns from!
Example 1: Medical AI Makes Wrong Diagnosis
Scenario: A hospital uses AI to predict patient readmission risk.
Dirty Data: Patient ages have duplicates, some heights recorded in inches vs centimeters,
and missing blood pressure values filled with zeros.
Result: The AI learns that "age 0" and "blood pressure 0" are common, treats them as normal,
and makes dangerously inaccurate predictions. Patients who need monitoring are sent home!
Impact: Lives at risk due to poor data quality.
Example 2: Hiring AI Becomes Biased
Scenario: A company uses AI to screen job applications.
Dirty Data: Historical hiring data has many duplicate entries for successful candidates,
inconsistent job titles ("Developer" vs "Software Developer" vs "Programmer"), and missing education data
for many qualified candidates.
Result: The AI over-weights certain profiles due to duplicates, unfairly rejects candidates
with "Programmer" title thinking it's different from "Developer", and discriminates against candidates
without education data even though many are qualified.
Impact: Unfair hiring practices and loss of talented candidates.
Example 3: Sales Predictions Wildly Inaccurate
Scenario: A retail company predicts next quarter's sales using AI.
Dirty Data: Sales data has outliers (someone typed $99999 instead of $99.99),
duplicate transactions weren't removed, and product categories are inconsistent ("Electronics" vs "electronic" vs "Tech").
Result: The AI thinks certain products sell for $99,999, double-counts many sales,
and treats Electronics/electronic/Tech as three separate categories with different patterns.
Impact: Company orders wrong inventory, wastes millions in overstock or lost sales.
Example 4: Spam Filter Blocks Important Emails
Scenario: An AI spam filter learns from user-labeled emails.
Dirty Data: Training data has many duplicate spam examples (making spam seem more common),
legitimate emails with typos marked as spam by mistake, and inconsistent capitalization in sender names.
Result: The AI becomes overly aggressive, blocks legitimate emails with any typos,
and can't recognize the same sender if capitalization differs.
Impact: Users miss important business communications and job offers.
Data Cleaning Techniques
1. Handling Missing Values
- Remove: Delete rows with missing data (if few)
- Mean/Median: Fill with average value
- Mode: Fill with most common value
- Forward Fill: Use previous valid value
- Predict: Use other columns to estimate
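The forward-fill strategy from the list above is worth a quick sketch, since it behaves differently from mean or mode imputation: each gap takes the most recent valid value, which suits ordered data like sensor readings (the values here are hypothetical).

```python
import pandas as pd

# Hypothetical daily temperature readings with gaps
readings = pd.Series([20.5, None, None, 22.0, None, 21.0])

# Forward fill: each missing entry copies the previous valid value
filled = readings.ffill()
```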
2. Removing Duplicates
- Exact Matches: Remove identical rows
- Key-Based: Keep one record per unique ID
- Fuzzy Matching: Find near-duplicates
- Time-Based: Keep most recent entry
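Fuzzy matching from the list above can be sketched with Python's standard-library `difflib`; the names and the 0.85 threshold are illustrative choices, not fixed rules.

```python
from difflib import SequenceMatcher

# Hypothetical names with near-duplicates
names = ["John Smith", "Jon Smith", "Jane Doe", "john smith"]

def similarity(a: str, b: str) -> float:
    """Similarity ratio in [0, 1]; 1.0 means identical after lowercasing."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

# Flag pairs above a similarity threshold as candidate duplicates
pairs = [
    (a, b)
    for i, a in enumerate(names)
    for b in names[i + 1:]
    if similarity(a, b) > 0.85
]
```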
3. Detecting Outliers
- Statistical: Values beyond 3 standard deviations
- IQR Method: Beyond 1.5 × the interquartile range
- Domain Rules: Age > 150 is impossible
- Visual: Plot data to spot anomalies
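The IQR method above can be sketched in a few lines: compute the first and third quartiles, and flag anything beyond 1.5 × IQR from them (the data here is hypothetical).

```python
import pandas as pd

# Hypothetical ages with one obvious data-entry error
ages = pd.Series([25, 22, 30, 28, 999])

q1, q3 = ages.quantile(0.25), ages.quantile(0.75)
iqr = q3 - q1
low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Anything outside [low, high] is flagged as an outlier
outliers = ages[(ages < low) | (ages > high)]
```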
4. Standardizing Formats
- Dates: Convert to standard format (YYYY-MM-DD)
- Text: Consistent capitalization
- Numbers: Remove currency symbols, commas
- Categories: Map to standard values
5. Correcting Errors
- Spell Check: Fix typos in text fields
- Validation: Check against known valid values
- Range Checks: Ensure values in valid range
- Cross-Reference: Verify against other sources
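Validation and range checks from the list above can be sketched as boolean masks; the valid-country set and the sample rows are hypothetical.

```python
import pandas as pd

# Hypothetical records, one with an invalid country and an impossible age
df = pd.DataFrame({"country": ["US", "UK", "Narnia"], "age": [34, 29, 250]})

# Validation: flag values not in a known-valid set
valid_countries = {"US", "UK", "DE", "FR"}
bad_country = ~df["country"].isin(valid_countries)

# Range check: flag ages outside a plausible range
bad_age = ~df["age"].between(0, 120)
```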
6. Dealing with Inconsistencies
- Mapping: Create lookup tables for variations
- Normalization: Convert to common units
- Encoding: Standardize categorical values
- Trim Whitespace: Remove extra spaces
# Python Example: Basic Data Cleaning

```python
import pandas as pd

# Load data
df = pd.read_csv('messy_data.csv')

# 1. Remove duplicates
df = df.drop_duplicates()

# 2. Handle missing values
df['age'] = df['age'].fillna(df['age'].median())  # fill with median
df = df.dropna(subset=['important_column'])       # drop rows missing critical data

# 3. Remove outliers (age example)
df = df[(df['age'] >= 0) & (df['age'] <= 120)]

# 4. Standardize text
df['category'] = df['category'].str.lower().str.strip()

# 5. Standardize dates
df['date'] = pd.to_datetime(df['date'], errors='coerce')

# 6. Fix data types: strip currency symbols and commas, then convert
df['price'] = (df['price'].str.replace('$', '', regex=False)
                          .str.replace(',', '', regex=False)
                          .astype(float))

print("Data cleaned!")
```
Best Practices:
- Always keep a backup of original data before cleaning
- Document all changes you make to the dataset
- Validate assumptions - check if imputed values make sense
- Use domain knowledge - understand what values are realistic
- Clean early - don't wait until you're building the model
- Automate - create scripts for repeatable cleaning tasks
Ready to Practice?
Test your data cleaning skills with the "Fix the Dataset" game! Find and fix errors in messy datasets from the real world.
Play Fix the Dataset Game →
Real-World Applications:
Cleaning patient records for accurate diagnoses and treatment recommendations
Ensuring transaction data is accurate for fraud detection and risk analysis
Standardizing product data across platforms for better recommendations
Validating experimental data to ensure reproducible results