Types of Data Problems
1. Missing Values
Missing data appears as empty cells, null, NaN, or placeholder values like "N/A".
Dirty Data:
| Name | Age | Score |
|---|---|---|
| Alice | 25 | 95 |
| Bob | | 88 |
| Carol | 22 | N/A |
| Dave | 30 | 92 |
Clean Data (imputed):
| Name | Age | Score |
|---|---|---|
| Alice | 25 | 95 |
| Bob | 26 | 88 |
| Carol | 22 | 92 |
| Dave | 30 | 92 |
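A minimal sketch of imputation on the table above, using pandas. Here each missing value is filled with the column median (one common strategy); the exact imputed numbers depend on the strategy you choose, so they may differ slightly from the cleaned table shown.

```python
import pandas as pd

# Rebuild the "dirty" table: Bob's age and Carol's score are missing
df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Carol", "Dave"],
    "Age": [25, None, 22, 30],
    "Score": [95, 88, None, 92],
})

# Fill each numeric column's gaps with that column's median
df["Age"] = df["Age"].fillna(df["Age"].median())
df["Score"] = df["Score"].fillna(df["Score"].median())
```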
2. Duplicate Records
The same information appears multiple times, inflating statistics and creating bias.
Dirty Data:
| Customer ID | Name | Purchase |
|---|---|---|
| 001 | John Smith | $50 |
| 001 | John Smith | $50 |
| 002 | Jane Doe | $75 |
| 002 | Jane Doe | $75 |
Clean Data (deduplicated):
| Customer ID | Name | Purchase |
|---|---|---|
| 001 | John Smith | $50 |
| 002 | Jane Doe | $75 |
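Deduplication like the example above can be sketched in pandas. `drop_duplicates()` removes fully identical rows; passing `subset` keeps one record per unique ID even when other columns differ.

```python
import pandas as pd

# Rebuild the "dirty" table with repeated purchases
df = pd.DataFrame({
    "CustomerID": ["001", "001", "002", "002"],
    "Name": ["John Smith", "John Smith", "Jane Doe", "Jane Doe"],
    "Purchase": [50, 50, 75, 75],
})

clean = df.drop_duplicates()  # remove fully identical rows

# Or keep the first record per unique ID, regardless of other columns
clean_by_id = df.drop_duplicates(subset="CustomerID", keep="first")
```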
3. Outliers and Errors
Values that are impossibly high or low, or obviously wrong due to data-entry errors.
Dirty Data:
| Person | Age | Height (cm) |
|---|---|---|
| Alice | 25 | 165 |
| Bob | 999 | 180 |
| Carol | 22 | 158 |
| Dave | 30 | 5 |
Clean Data (corrected):
| Person | Age | Height (cm) |
|---|---|---|
| Alice | 25 | 165 |
| Bob | 29 | 180 |
| Carol | 22 | 158 |
| Dave | 30 | 175 |
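One way to handle the errors above: apply simple domain rules (a human age is 0–120, a height is roughly 50–250 cm) to blank out impossible values, then impute. This sketch fills with the median of the remaining plausible values, so the corrected numbers differ slightly from the table above, which used other estimates.

```python
import pandas as pd

# Rebuild the "dirty" table: Bob's age and Dave's height are data-entry errors
df = pd.DataFrame({
    "Person": ["Alice", "Bob", "Carol", "Dave"],
    "Age": [25, 999, 22, 30],
    "Height_cm": [165, 180, 158, 5],
})

# Domain rules: values outside a plausible range become missing (NaN)
df["Age"] = df["Age"].where(df["Age"].between(0, 120))
df["Height_cm"] = df["Height_cm"].where(df["Height_cm"].between(50, 250))

# Then impute with the median of the plausible values
df["Age"] = df["Age"].fillna(df["Age"].median())
df["Height_cm"] = df["Height_cm"].fillna(df["Height_cm"].median())
```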
4. Formatting Inconsistencies
The same information represented in different formats makes analysis difficult.
Dirty Data:
| Name | Date | Phone |
|---|---|---|
| john smith | 03/15/2024 | 555-1234 |
| JANE DOE | 2024-03-16 | (555) 5678 |
| Bob Jones | March 17, 2024 | 5559012 |
Clean Data (standardized):
| Name | Date | Phone |
|---|---|---|
| John Smith | 2024-03-15 | 555-1234 |
| Jane Doe | 2024-03-16 | 555-5678 |
| Bob Jones | 2024-03-17 | 555-9012 |
5. Inconsistent Categories
The same category written in different ways creates artificial category splits in the data.
Dirty Data:
| Product | Category | Price |
|---|---|---|
| Shirt | Clothing | $25 |
| Pants | clothing | $40 |
| Shoes | Clothes | $60 |
| Hat | Apparel | $15 |
Clean Data (standardized):
| Product | Category | Price |
|---|---|---|
| Shirt | Clothing | $25 |
| Pants | Clothing | $40 |
| Shoes | Clothing | $60 |
| Hat | Clothing | $15 |
Why Data Cleaning Matters
The Danger of Dirty Data
Dirty data leads to incorrect conclusions, biased AI models, and poor business decisions. An AI model is only as good as the data it learns from!
Example 1: Medical AI Makes Wrong Diagnosis
Scenario: A hospital uses AI to predict patient readmission risk.
Dirty Data: Patient ages have duplicates, some heights recorded in inches vs centimeters,
and missing blood pressure values filled with zeros.
Result: The AI learns that "age 0" and "blood pressure 0" are common, treats them as normal,
and makes dangerously inaccurate predictions. Patients who need monitoring are sent home!
Impact: Lives at risk due to poor data quality.
Example 2: Hiring AI Becomes Biased
Scenario: A company uses AI to screen job applications.
Dirty Data: Historical hiring data has many duplicate entries for successful candidates,
inconsistent job titles ("Developer" vs "Software Developer" vs "Programmer"), and missing education data
for many qualified candidates.
Result: The AI over-weights certain profiles due to duplicates, unfairly rejects candidates
with "Programmer" title thinking it's different from "Developer", and discriminates against candidates
without education data even though many are qualified.
Impact: Unfair hiring practices and loss of talented candidates.
Example 3: Sales Predictions Wildly Inaccurate
Scenario: A retail company predicts next quarter's sales using AI.
Dirty Data: Sales data has outliers (someone typed $99999 instead of $99.99),
duplicate transactions weren't removed, and product categories are inconsistent ("Electronics" vs "electronic" vs "Tech").
Result: The AI thinks certain products sell for $99,999, double-counts many sales,
and treats Electronics/electronic/Tech as three separate categories with different patterns.
Impact: Company orders wrong inventory, wastes millions in overstock or lost sales.
Example 4: Spam Filter Blocks Important Emails
Scenario: An AI spam filter learns from user-labeled emails.
Dirty Data: Training data has many duplicate spam examples (making spam seem more common),
legitimate emails with typos marked as spam by mistake, and inconsistent capitalization in sender names.
Result: The AI becomes overly aggressive, blocks legitimate emails with any typos,
and can't recognize the same sender if capitalization differs.
Impact: Users miss important business communications and job offers.
Data Cleaning Techniques
1. Handling Missing Values
- Remove: Delete rows with missing data (if few)
- Mean/Median: Fill with average value
- Mode: Fill with most common value
- Forward Fill: Use previous valid value
- Predict: Use other columns to estimate
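The forward-fill strategy from the list above is worth a quick sketch, since it behaves differently from mean or mode imputation: each gap takes the most recent valid value, which suits ordered data like sensor readings (the values here are hypothetical).

```python
import pandas as pd

# Hypothetical daily temperature readings with gaps
readings = pd.Series([20.5, None, None, 22.0, None, 21.0])

# Forward fill: each missing entry copies the previous valid value
filled = readings.ffill()
```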
2. Removing Duplicates
- Exact Matches: Remove identical rows
- Key-Based: Keep one record per unique ID
- Fuzzy Matching: Find near-duplicates
- Time-Based: Keep most recent entry
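Fuzzy matching from the list above can be sketched with Python's standard-library `difflib`; the names and the 0.85 threshold are illustrative choices, not fixed rules.

```python
from difflib import SequenceMatcher

# Hypothetical names with near-duplicates
names = ["John Smith", "Jon Smith", "Jane Doe", "john smith"]

def similarity(a: str, b: str) -> float:
    """Similarity ratio in [0, 1]; 1.0 means identical after lowercasing."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

# Flag pairs above a similarity threshold as candidate duplicates
pairs = [
    (a, b)
    for i, a in enumerate(names)
    for b in names[i + 1:]
    if similarity(a, b) > 0.85
]
```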
3. Detecting Outliers
- Statistical: Values beyond 3 standard deviations
- IQR Method: Beyond 1.5 × the interquartile range
- Domain Rules: Age > 150 is impossible
- Visual: Plot data to spot anomalies
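The IQR method above can be sketched in a few lines: compute the first and third quartiles, and flag anything beyond 1.5 × IQR from them (the data here is hypothetical).

```python
import pandas as pd

# Hypothetical ages with one obvious data-entry error
ages = pd.Series([25, 22, 30, 28, 999])

q1, q3 = ages.quantile(0.25), ages.quantile(0.75)
iqr = q3 - q1
low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Anything outside [low, high] is flagged as an outlier
outliers = ages[(ages < low) | (ages > high)]
```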
4. Standardizing Formats
- Dates: Convert to standard format (YYYY-MM-DD)
- Text: Consistent capitalization
- Numbers: Remove currency symbols, commas
- Categories: Map to standard values
5. Correcting Errors
- Spell Check: Fix typos in text fields
- Validation: Check against known valid values
- Range Checks: Ensure values in valid range
- Cross-Reference: Verify against other sources
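Validation and range checks from the list above can be sketched as boolean masks; the valid-country set and the sample rows are hypothetical.

```python
import pandas as pd

# Hypothetical records, one with an invalid country and an impossible age
df = pd.DataFrame({"country": ["US", "UK", "Narnia"], "age": [34, 29, 250]})

# Validation: flag values not in a known-valid set
valid_countries = {"US", "UK", "DE", "FR"}
bad_country = ~df["country"].isin(valid_countries)

# Range check: flag ages outside a plausible range
bad_age = ~df["age"].between(0, 120)
```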
6. Dealing with Inconsistencies
- Mapping: Create lookup tables for variations
- Normalization: Convert to common units
- Encoding: Standardize categorical values
- Trim Whitespace: Remove extra spaces
# Python Example: Basic Data Cleaning

```python
import pandas as pd

# Load data
df = pd.read_csv('messy_data.csv')

# 1. Remove duplicates
df = df.drop_duplicates()

# 2. Handle missing values
df['age'] = df['age'].fillna(df['age'].median())  # fill with median
df = df.dropna(subset=['important_column'])       # drop rows missing critical data

# 3. Remove outliers (age example)
df = df[(df['age'] >= 0) & (df['age'] <= 120)]

# 4. Standardize text
df['category'] = df['category'].str.lower().str.strip()

# 5. Standardize dates
df['date'] = pd.to_datetime(df['date'], errors='coerce')

# 6. Fix data types: strip currency symbols and commas, then convert
df['price'] = (df['price'].str.replace('$', '', regex=False)
                          .str.replace(',', '', regex=False)
                          .astype(float))

print("Data cleaned!")
```
Best Practices:
- Always keep a backup of original data before cleaning
- Document all changes you make to the dataset
- Validate assumptions - check if imputed values make sense
- Use domain knowledge - understand what values are realistic
- Clean early - don't wait until you're building the model
- Automate - create scripts for repeatable cleaning tasks
Ready to Practice?
Test your data cleaning skills with the "Fix the Dataset" game! Find and fix errors in messy datasets from the real world.
Play Fix the Dataset Game →
Real-World Applications:
Cleaning patient records for accurate diagnoses and treatment recommendations
Ensuring transaction data is accurate for fraud detection and risk analysis
Standardizing product data across platforms for better recommendations
Validating experimental data to ensure reproducible results