Day 13: Data Odyssey – What is Data Preprocessing?

Welcome to Day 13: Data Odyssey, our 365-day journey to master data science and artificial intelligence (AI), launched on Shivaratri, February 26, 2025! Yesterday, in Day 12: Data Odyssey – How Do We Handle Bigger Data?, we scaled Priya’s analysis to handle larger datasets—a week (35 rows) or month (150 rows) of POS data. We used Pandas to loop files, chunk big CSVs, and summarize (Saturday’s ₹2500 peak), keeping her laptop humming efficiently. It grew her view from days to trends. Today, we pivot to a new phase: What is data preprocessing, and how does it prep Priya’s data for predictive power?

The Purpose of Preprocessing

Data preprocessing is the act of refining cleaned, wrangled data into a form machines can learn from. It’s a step beyond cleaning (Day 5) and wrangling (Day 11): where cleaning fixed errors (the ₹5000 typo) and wrangling reshaped tables (merged days), preprocessing tunes data for modeling, the “model” part of our workflow (Day 1): define, collect, clean, analyze, model, communicate. Models, like one predicting Priya’s next day’s sales, need data that’s consistent, scaled, and machine-friendly.

Think of it as seasoning a dish. Cleaning picks fresh ingredients, wrangling chops them, preprocessing spices them for the oven—models bake better results. Day 13: Data Odyssey primes Priya for this leap.

Why Preprocessing Matters

Priya’s data—hours, sales, items—works for stats (Day 4) and plots (Day 10), but models stumble on:

  • Units: Sales in ₹ (150-650) vs. hours (7-11)—scales clash.
  • Text: “Chai” vs. “Samosa”—computers need numbers.
  • Outliers: A rare ₹2000 sale skews predictions.

Preprocessing standardizes, encodes, and balances her table so a model can learn: “8 AM + Chai = high sales.” Without it, predictions flop; with it, Priya forecasts stock. Day 13: Data Odyssey sets this up.

Priya’s Data Recap

Her week’s data (Day 12, simplified):

    Day     Hour  Sales   Item
0   Monday  7 AM    200   Chai
1   Monday  8 AM    500   Chai
2   Monday  9 AM    600  Samosa
3   Tuesday 7 AM    150   Chai
4   Tuesday 8 AM    550   Chai
5   Tuesday 9 AM    650  Samosa
... (35 rows total)

Wrangled into one table, it’s ready to preprocess for a model: “Predict tomorrow’s sales.” Day 13: Data Odyssey starts here.
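To follow along at home, you can rebuild a small slice of this table yourself. Here is a minimal sketch (the column names match the table above; the full week has 35 rows, this uses six):

```python
import pandas as pd

# Illustrative sample mirroring Priya's table (values from the recap above)
data = pd.DataFrame({
    "Day":   ["Monday", "Monday", "Monday", "Tuesday", "Tuesday", "Tuesday"],
    "Hour":  ["7 AM", "8 AM", "9 AM", "7 AM", "8 AM", "9 AM"],
    "Sales": [200, 500, 600, 150, 550, 650],
    "Item":  ["Chai", "Chai", "Samosa", "Chai", "Chai", "Samosa"],
})
print(data.shape)
```

The snippets in the steps below all operate on a DataFrame shaped like this one.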

Key Preprocessing Steps

Preprocessing tweaks data systematically:

  1. Encoding Text:
    • Models need numbers, not “Chai.”
    • Label Encoding: Chai = 0, Samosa = 1.
data["Item_Code"] = data["Item"].map({"Chai": 0, "Samosa": 1})
print(data[["Item", "Item_Code"]])

Output:

     Item  Item_Code
0    Chai          0
1    Chai          0
2  Samosa          1
...

Priya’s items are numeric—model-ready. Day 13: Data Odyssey encodes this.
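One caveat worth knowing: .map() silently produces NaN for any item missing from the dictionary, so a quick check is cheap insurance before handing the column to a model. A small self-contained sketch:

```python
import pandas as pd

data = pd.DataFrame({"Item": ["Chai", "Chai", "Samosa"]})
mapping = {"Chai": 0, "Samosa": 1}
data["Item_Code"] = data["Item"].map(mapping)

# .map() yields NaN for any item not present in the dict,
# so count unmapped rows before modeling
unmapped = data["Item_Code"].isna().sum()
assert unmapped == 0, f"{unmapped} rows had items outside the mapping"
print(data["Item_Code"].tolist())
```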

  2. Scaling Numbers:
    • Sales (150-650) dwarf hour numbers (7-11)—models overweigh big ranges.
    • Standardization: Center at 0, scale by standard deviation.
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
data["Sales_Scaled"] = scaler.fit_transform(data[["Sales"]])
print(data[["Sales", "Sales_Scaled"]])

Output (rounded):

   Sales  Sales_Scaled
0    200       -1.20
1    500        0.80
2    600        1.50
...

Mean ≈ 0, spread ≈ 1—sales align with other features. Day 13: Data Odyssey scales it.
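You can verify the standardization yourself: after fit_transform, the column’s mean is (near) zero and its standard deviation is (near) one. A self-contained sketch using six illustrative sales values:

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

sales = pd.DataFrame({"Sales": [200, 500, 600, 150, 550, 650]})
scaler = StandardScaler()
scaled = scaler.fit_transform(sales)

# Standardization subtracts the column mean and divides by the
# (population) standard deviation, giving mean ~0 and unit spread
assert abs(scaled.mean()) < 1e-9
assert abs(scaled.std() - 1.0) < 1e-9
```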

  3. Handling Time:
    • “7 AM” isn’t numeric—extract hour:
data["Hour_Num"] = data["Hour"].str.split(" ").str[0].astype(int)
print(data[["Hour", "Hour_Num"]])

Output:

   Hour  Hour_Num
0  7 AM         7
1  8 AM         8
2  9 AM         9
...

Now 7-11—numeric, usable. Day 13: Data Odyssey parses this.
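A caveat: splitting on the space keeps only the digit, so “1 PM” would become 1 rather than 13. Priya’s café data is morning-only, but if PM hours ever appear, pd.to_datetime parses the AM/PM marker correctly. A small sketch (the PM value is an assumed example, not from Priya’s table):

```python
import pandas as pd

hours = pd.Series(["7 AM", "1 PM"])

# %I is the 12-hour clock, %p the AM/PM marker; .dt.hour then
# returns the 24-hour value, so "1 PM" maps to 13, not 1
parsed = pd.to_datetime(hours, format="%I %p").dt.hour
print(parsed.tolist())
```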

  4. One-Hot Encoding Days:
    • “Monday” vs. “Tuesday”—categorical, not ordinal.
    • One-hot: Columns for each day (0 or 1).
data = pd.get_dummies(data, columns=["Day"], prefix="Day")
print(data)

Output (partial):

   Hour  Sales   Item  Hour_Num  Day_Monday  Day_Tuesday ...
0  7 AM    200   Chai         7           1            0 ...
1  8 AM    500   Chai         8           1            0 ...
3  7 AM    150   Chai         7           0            1 ...

Days split—models see them separately. Day 13: Data Odyssey transforms this.
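One version note: recent pandas releases return True/False columns from get_dummies by default; passing dtype=int restores the 0/1 output shown above. A quick sketch with an assumed three-row sample:

```python
import pandas as pd

data = pd.DataFrame({"Day": ["Monday", "Monday", "Tuesday"]})

# dtype=int forces 0/1 columns; newer pandas otherwise emits booleans
dummies = pd.get_dummies(data, columns=["Day"], prefix="Day", dtype=int)
print(list(dummies.columns))
print(dummies["Day_Monday"].tolist())
```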

Preprocessed Table

Priya’s data now:

   Hour  Sales  Item  Hour_Num  Item_Code  Sales_Scaled  Day_Monday  Day_Tuesday ...
0  7 AM    200  Chai         7          0        -1.20           1            0 ...
1  8 AM    500  Chai         8          0         0.80           1            0 ...
2  9 AM    600  Samosa       9          1         1.50           1            0 ...
3  7 AM    150  Chai         7          0        -1.50           0            1 ...
...

Numeric, scaled, ready—her model’s fuel. Day 13: Data Odyssey crafts this.
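The four steps can also be wrapped into one reusable function. This is an illustrative sketch, not code from the series: the preprocess name and the three-row sample are assumptions.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

def preprocess(df: pd.DataFrame) -> pd.DataFrame:
    """Apply the four steps above to a copy of Priya-style data."""
    out = df.copy()
    out["Item_Code"] = out["Item"].map({"Chai": 0, "Samosa": 1})          # 1. encode text
    out["Sales_Scaled"] = StandardScaler().fit_transform(out[["Sales"]])  # 2. scale numbers
    out["Hour_Num"] = out["Hour"].str.split(" ").str[0].astype(int)       # 3. handle time
    out = pd.get_dummies(out, columns=["Day"], prefix="Day", dtype=int)   # 4. one-hot days
    return out

sample = pd.DataFrame({
    "Day": ["Monday", "Monday", "Tuesday"],
    "Hour": ["7 AM", "8 AM", "7 AM"],
    "Sales": [200, 500, 150],
    "Item": ["Chai", "Chai", "Samosa"],
})
result = preprocess(sample)
print(result.columns.tolist())
```

Keeping the steps in one function means next week’s 35 rows get the exact same treatment as this week’s.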

Installing Scikit-Learn

StandardScaler needs Scikit-Learn:

  • Terminal: pip install scikit-learn.
  • Import: from sklearn.preprocessing import StandardScaler.

Priya runs pip install scikit-learn—her toolkit grows. Day 13: Data Odyssey adds this.

Why Preprocess?

Models thrive on:

  • Consistency: Standardized sales (mean 0, unit spread) sit on the same footing as hours.
  • Clarity: Numbers (0, 1) beat text (“Chai”).
  • Balance: No feature (sales vs. hour) dominates unfairly.

Priya’s raw sales predict poorly—₹600 vs. “9 AM” confuses. Preprocessed? “9 AM, Chai, Monday” aligns to high sales. Day 13: Data Odyssey preps this.

Real-World Preprocessing

India’s weather models preprocess rainfall—scaled, encoded—to predict floods. Amazon standardizes prices, encodes categories—sales forecasts sharpen. Priya’s café is small, but the game’s the same. Day 13: Data Odyssey ties her in.

Challenges

Preprocessing stumbles:

  • Over-Scaling: Tiny datasets distort—Priya’s 35 rows are fine.
  • Encoding Errors: “chai” vs. “Chai”—case kills maps.
  • Loss: Dropping “Hour” for “Hour_Num” loses AM/PM—keep if needed.

Priya forgets astype(int)—error! Fixes it, learns. Day 13: Data Odyssey grows with her.
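The case problem above has a simple fix: normalize the strings before mapping. A sketch (the mixed-case item values are illustrative):

```python
import pandas as pd

items = pd.Series(["Chai", "chai", "SAMOSA"])

# Naive map: "chai" and "SAMOSA" miss the keys and fall through to NaN
naive = items.map({"Chai": 0, "Samosa": 1})
print(naive.isna().sum())

# Trimming whitespace and title-casing first makes the mapping robust
fixed = items.str.strip().str.title().map({"Chai": 0, "Samosa": 1})
print(fixed.tolist())
```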

Why This Matters

Preprocessing turns Priya’s table into model food—predicting tomorrow’s 9 AM sales (₹600?) from patterns. Without it, her data’s indigestible; with it, she forecasts stock. Scale it: preprocessed traffic data predicts India’s jams—roads ease. Day 13: Data Odyssey primes you for this.

Recap Summary

Yesterday, Day 12: Data Odyssey scaled Priya’s data—week (Saturday ₹2500) to month—looping files, chunking, summarizing with Pandas. Today, Day 13: Data Odyssey explored preprocessing—encoding (Chai=0), scaling sales, one-hot days—to ready her data for modeling. It’s her predictive step.

What’s Next

Tomorrow, in Day 14: Data Odyssey – What is Machine Learning?, we’ll dive into machine learning: What’s it mean? How can Priya predict sales? We’ll introduce the concept, setting her preprocessed data for action. Bring your curiosity, and I’ll see you there!
