Types of Data Problems

1. Missing Values

Missing data appears as empty cells, null, NaN, or placeholder values like "N/A".

Dirty Data:

| Name  | Age | Score |
|-------|-----|-------|
| Alice | 25  | 95    |
| Bob   |     | 88    |
| Carol | 22  | N/A   |
| Dave  | 30  | 92    |

Clean Data (imputed):

| Name  | Age | Score |
|-------|-----|-------|
| Alice | 25  | 95    |
| Bob   | 26  | 88    |
| Carol | 22  | 92    |
| Dave  | 30  | 92    |
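
The imputation above can be sketched in pandas: Bob's missing age is filled with the rounded mean of the known ages, and Carol's missing score with the median of the known scores. The DataFrame below just recreates the small example table.

```python
import pandas as pd

# Recreate the example table (Bob's age and Carol's score are missing).
df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Carol", "Dave"],
    "Age": [25, None, 22, 30],
    "Score": [95, 88, None, 92],
})

# Fill missing Age with the rounded mean of the known ages (26).
mean_age = round(float(df["Age"].mean()))
df["Age"] = df["Age"].fillna(mean_age).astype("int64")

# Fill missing Score with the median of the known scores (92).
df["Score"] = df["Score"].fillna(df["Score"].median()).astype("int64")

print(df)
```

Mean imputation for Age and median imputation for Score are just two of the strategies covered later; which one fits depends on the column's distribution.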

2. Duplicate Records

The same information appears multiple times, inflating statistics and creating bias.

Dirty Data:

| Customer ID | Name       | Purchase |
|-------------|------------|----------|
| 001         | John Smith | $50      |
| 001         | John Smith | $50      |
| 002         | Jane Doe   | $75      |
| 002         | Jane Doe   | $75      |

Clean Data (deduplicated):

| Customer ID | Name       | Purchase |
|-------------|------------|----------|
| 001         | John Smith | $50      |
| 002         | Jane Doe   | $75      |

(Duplicates removed.)
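
Exact duplicates like these can be dropped with a single pandas call; the DataFrame below mirrors the example table.

```python
import pandas as pd

# Recreate the example purchases table with exact duplicate rows.
purchases = pd.DataFrame({
    "Customer ID": ["001", "001", "002", "002"],
    "Name": ["John Smith", "John Smith", "Jane Doe", "Jane Doe"],
    "Purchase": ["$50", "$50", "$75", "$75"],
})

# drop_duplicates keeps the first copy of each identical row.
clean = purchases.drop_duplicates()

print(clean)  # 2 rows remain
```

Note this only catches rows that match exactly; near-duplicates ("John Smith" vs "J. Smith") need fuzzy matching, covered later.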

3. Outliers and Errors

Values that are impossibly high or low, or obviously wrong, usually due to data-entry errors.

Dirty Data:

| Person | Age | Height (cm) |
|--------|-----|-------------|
| Alice  | 25  | 165         |
| Bob    | 999 | 180         |
| Carol  | 22  | 158         |
| Dave   | 30  | 5           |

Clean Data (corrected):

| Person | Age | Height (cm) |
|--------|-----|-------------|
| Alice  | 25  | 165         |
| Bob    | 29  | 180         |
| Carol  | 22  | 158         |
| Dave   | 30  | 175         |

4. Formatting Inconsistencies

The same information represented in different formats makes analysis difficult.

Dirty Data:

| Name       | Date           | Phone      |
|------------|----------------|------------|
| john smith | 03/15/2024     | 555-1234   |
| JANE DOE   | 2024-03-16     | (555) 5678 |
| Bob Jones  | March 17, 2024 | 5559012    |

Clean Data (standardized):

| Name       | Date       | Phone    |
|------------|------------|----------|
| John Smith | 2024-03-15 | 555-1234 |
| Jane Doe   | 2024-03-16 | 555-5678 |
| Bob Jones  | 2024-03-17 | 555-9012 |
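
This standardization can be sketched in pandas (assuming pandas 2.x for `format="mixed"`, which lets `to_datetime` parse each date format independently):

```python
import pandas as pd

# Recreate the dirty example table.
df = pd.DataFrame({
    "Name": ["john smith", "JANE DOE", "Bob Jones"],
    "Date": ["03/15/2024", "2024-03-16", "March 17, 2024"],
    "Phone": ["555-1234", "(555) 5678", "5559012"],
})

# Names: consistent Title Case.
df["Name"] = df["Name"].str.title()

# Dates: parse mixed formats, then render all as YYYY-MM-DD.
df["Date"] = pd.to_datetime(df["Date"], format="mixed").dt.strftime("%Y-%m-%d")

# Phones: strip non-digits, then re-insert a dash as NNN-NNNN.
df["Phone"] = (df["Phone"]
               .str.replace(r"\D", "", regex=True)
               .str.replace(r"^(\d{3})(\d{4})$", r"\1-\2", regex=True))

print(df)
```

Reducing phone numbers to digits first, then reformatting, avoids having to enumerate every input style.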

5. Inconsistent Categories

The same category written in different ways creates artificial splits in the data.

Dirty Data:

| Product | Category | Price |
|---------|----------|-------|
| Shirt   | Clothing | $25   |
| Pants   | clothing | $40   |
| Shoes   | Clothes  | $60   |
| Hat     | Apparel  | $15   |

Clean Data (standardized):

| Product | Category | Price |
|---------|----------|-------|
| Shirt   | Clothing | $25   |
| Pants   | Clothing | $40   |
| Shoes   | Clothing | $60   |
| Hat     | Clothing | $15   |
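
A small lookup table handles this kind of consolidation; lowercasing first means each variant only needs one entry. A sketch using the example above:

```python
import pandas as pd

# Recreate the dirty example table.
df = pd.DataFrame({
    "Product": ["Shirt", "Pants", "Shoes", "Hat"],
    "Category": ["Clothing", "clothing", "Clothes", "Apparel"],
})

# Map every known variant (lowercased) to one canonical category.
category_map = {
    "clothing": "Clothing",
    "clothes": "Clothing",
    "apparel": "Clothing",
}
df["Category"] = df["Category"].str.lower().map(category_map)

print(df)
```

One caveat: `.map` returns NaN for variants missing from the lookup table, which is useful for spotting spellings you have not mapped yet.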

Why Data Cleaning Matters

The Danger of Dirty Data

Dirty data leads to incorrect conclusions, biased AI models, and poor business decisions. An AI model is only as good as the data it learns from!

Example 1: Medical AI Makes Wrong Diagnosis

Scenario: A hospital uses AI to predict patient readmission risk.

Dirty Data: Patient ages have duplicates, some heights recorded in inches vs centimeters, and missing blood pressure values filled with zeros.

Result: The AI learns that "age 0" and "blood pressure 0" are common, treats them as normal, and makes dangerously inaccurate predictions. Patients who need monitoring are sent home!

Impact: Lives at risk due to poor data quality.

Example 2: Hiring AI Becomes Biased

Scenario: A company uses AI to screen job applications.

Dirty Data: Historical hiring data has many duplicate entries for successful candidates, inconsistent job titles ("Developer" vs "Software Developer" vs "Programmer"), and missing education data for many qualified candidates.

Result: The AI over-weights certain profiles due to duplicates, unfairly rejects candidates with "Programmer" title thinking it's different from "Developer", and discriminates against candidates without education data even though many are qualified.

Impact: Unfair hiring practices and loss of talented candidates.

Example 3: Sales Predictions Wildly Inaccurate

Scenario: A retail company predicts next quarter's sales using AI.

Dirty Data: Sales data has outliers (someone typed $99999 instead of $99.99), duplicate transactions weren't removed, and product categories are inconsistent ("Electronics" vs "electronic" vs "Tech").

Result: The AI thinks certain products sell for $99,999, double-counts many sales, and treats Electronics/electronic/Tech as three separate categories with different patterns.

Impact: Company orders wrong inventory, wastes millions in overstock or lost sales.

Example 4: Spam Filter Blocks Important Emails

Scenario: An AI spam filter learns from user-labeled emails.

Dirty Data: Training data has many duplicate spam examples (making spam seem more common), legitimate emails with typos marked as spam by mistake, and inconsistent capitalization in sender names.

Result: The AI becomes overly aggressive, blocks legitimate emails with any typos, and can't recognize the same sender if capitalization differs.

Impact: Users miss important business communications and job offers.

Data Cleaning Techniques

1. Handling Missing Values

  • Remove: Delete rows with missing data (if few)
  • Mean/Median: Fill with average value
  • Mode: Fill with most common value
  • Forward Fill: Use previous valid value
  • Predict: Use other columns to estimate

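Forward fill deserves a quick illustration, since it behaves differently from the averaging strategies: it carries the last valid observation forward, which suits ordered data like time series. A minimal sketch with hypothetical sensor readings:

```python
import pandas as pd

# Hypothetical hourly temperature readings with gaps.
readings = pd.Series([20.5, None, None, 21.0, None], name="temp_c")

# Forward fill: each gap takes the most recent valid value.
filled = readings.ffill()

print(filled.tolist())  # [20.5, 20.5, 20.5, 21.0, 21.0]
```

Forward fill assumes the value was roughly constant between observations; for data with no meaningful order, mean, median, or mode imputation is the safer choice.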
2. Removing Duplicates

  • Exact Matches: Remove identical rows
  • Key-Based: Keep one record per unique ID
  • Fuzzy Matching: Find near-duplicates
  • Time-Based: Keep most recent entry
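
Key-based and time-based deduplication combine naturally: sort by timestamp, then keep the last row per unique ID. A sketch with hypothetical customer records (IDs, dates, and emails invented for illustration):

```python
import pandas as pd

# Hypothetical records: customer "A" appears twice with different timestamps.
df = pd.DataFrame({
    "id": ["A", "A", "B"],
    "updated": pd.to_datetime(["2024-01-01", "2024-02-01", "2024-01-15"]),
    "email": ["old@example.com", "new@example.com", "b@example.com"],
})

# Sort oldest-to-newest, then keep the last (most recent) row per id.
latest = df.sort_values("updated").drop_duplicates(subset="id", keep="last")

print(latest)
```

`subset="id"` makes the comparison key-based (rows count as duplicates if the ID matches, even when other columns differ), and `keep="last"` after the sort makes it time-based.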

3. Detecting Outliers

  • Statistical: Values beyond 3 standard deviations
  • IQR Method: Beyond 1.5 × interquartile range
  • Domain Rules: Age > 150 is impossible
  • Visual: Plot data to spot anomalies
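
The IQR method from the list above can be sketched in a few lines (the data is an invented example with one obvious outlier):

```python
import pandas as pd

# Hypothetical measurements: 95 is an obvious outlier.
values = pd.Series([10, 12, 11, 13, 12, 95])

# IQR method: flag values beyond 1.5 x IQR from the quartiles.
q1, q3 = values.quantile(0.25), values.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

outliers = values[(values < lower) | (values > upper)]
print(outliers.tolist())  # [95]
```

Unlike the 3-standard-deviations rule, the IQR method uses quartiles, so the outlier itself does not inflate the threshold used to detect it.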

4. Standardizing Formats

  • Dates: Convert to standard format (YYYY-MM-DD)
  • Text: Consistent capitalization
  • Numbers: Remove currency symbols, commas
  • Categories: Map to standard values

5. Correcting Errors

  • Spell Check: Fix typos in text fields
  • Validation: Check against known valid values
  • Range Checks: Ensure values in valid range
  • Cross-Reference: Verify against other sources

6. Dealing with Inconsistencies

  • Mapping: Create lookup tables for variations
  • Normalization: Convert to common units
  • Encoding: Standardize categorical values
  • Trim Whitespace: Remove extra spaces
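
Normalization to common units is worth a concrete sketch, since mixed units silently corrupt statistics. The heights, units, and column names below are invented for illustration:

```python
import pandas as pd

# Hypothetical heights recorded in a mix of inches and centimetres.
df = pd.DataFrame({
    "height": [70, 178, 66],
    "unit": ["in", "cm", "in"],
})

# Convert everything to centimetres (1 in = 2.54 cm),
# keeping cm values as-is, then drop the now-redundant columns.
df["height_cm"] = df["height"].where(df["unit"] == "cm", df["height"] * 2.54)
df = df.drop(columns=["height", "unit"])

print(df)
```

This is the same mistake behind the medical-AI example earlier (heights in inches vs centimetres); converting to one unit before analysis removes it at the source.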

```python
# Python Example: Basic Data Cleaning
import pandas as pd

# Load data
df = pd.read_csv('messy_data.csv')

# 1. Remove duplicates
df = df.drop_duplicates()

# 2. Handle missing values
df['age'] = df['age'].fillna(df['age'].median())  # fill with the median
df = df.dropna(subset=['important_column'])       # drop rows missing critical data

# 3. Remove outliers (age example)
df = df[(df['age'] >= 0) & (df['age'] <= 120)]

# 4. Standardize text
df['category'] = df['category'].str.lower().str.strip()

# 5. Standardize dates
df['date'] = pd.to_datetime(df['date'], errors='coerce')

# 6. Fix data types
df['price'] = (df['price']
               .str.replace('$', '', regex=False)
               .str.replace(',', '', regex=False)
               .astype(float))

print("Data cleaned!")
```

Best Practices:

  1. Always keep a backup of original data before cleaning
  2. Document all changes you make to the dataset
  3. Validate assumptions - check if imputed values make sense
  4. Use domain knowledge - understand what values are realistic
  5. Clean early - don't wait until you're building the model
  6. Automate - create scripts for repeatable cleaning tasks

Real-World Applications:

๐Ÿฅ Healthcare

Cleaning patient records for accurate diagnoses and treatment recommendations

💰 Finance

Ensuring transaction data is accurate for fraud detection and risk analysis

🛒 E-Commerce

Standardizing product data across platforms for better recommendations

🔬 Scientific Research

Validating experimental data to ensure reproducible results