Feature Engineering

Day 21: Data Odyssey – What is Feature Engineering?

Welcome to Day 21: Data Odyssey, our 365-day journey to master data science and artificial intelligence (AI), launched on Shivaratri, February 26, 2025! Yesterday, in Day 20: Data Odyssey – How Do We Optimize Classification Models?, we refined Priya’s Decision Tree Classifier. Adding a 7th row and tuning (max_depth=2, min_samples_split=3) dropped her test-set accuracy to 66.7% but raised cross-validation to 86%, reliably predicting “Busy” for Wednesday’s 9 AM Samosa. Her 7-row dataset hinted at more potential. Today, we unlock it: What is feature engineering, and how can Priya craft better inputs for her models?

The Power of Features

Feature engineering is designing new inputs (features) from raw data to make ML models smarter. Day 13 preprocessed Priya’s data—encoding “Chai” to 0, scaling sales—but that was prep. Feature engineering creates new signals: “Is it rush hour?” “Weekday or weekend?” It’s the bridge between the “analyze” and “model” steps in our workflow (Day 1), feeding regression (Day 17) and classification (Day 20) richer clues.

Think of it as seasoning Priya’s recipe. Raw sales (₹600) are ingredients; features like “9 AM Rush” are spices—models taste better. Day 21: Data Odyssey crafts this.

Why Feature Engineering Matters

Priya’s models use Hour_Num (7-9), Item_Code (0-1), and Day flags—good, but basic. Her ₹620 regression (MAE ₹7) and “Busy” classification (86% cross-val) miss nuance:

  • Context: Rainy 9 AM ≠ sunny 9 AM.
  • Patterns: 8-9 AM cluster as “rush.”
  • Trends: Weekday vs. weekend shifts.

Better features cut errors (₹7 to ₹5) and boost recall—stock aligns tighter. Day 21: Data Odyssey upgrades her inputs.
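Context like “rainy rush hour” can be captured explicitly with an interaction feature—the product of two flags, which differs from either flag alone. A minimal sketch on a toy slice of Priya’s data (the Rainy_Rush column name is illustrative, not from the series):

```python
import pandas as pd

# Toy slice of Priya's data: hour and weather flags
data = pd.DataFrame({
    "Hour_Num": [7, 8, 9, 7, 8, 9, 9],
    "Weather_Rainy": [0, 0, 0, 1, 1, 1, 0],
})

# Rush hour: 8-9 AM
data["Rush_Hour"] = (data["Hour_Num"] >= 8).astype(int)

# Interaction: 1 only when it is BOTH rainy and rush hour
data["Rainy_Rush"] = data["Weather_Rainy"] * data["Rush_Hour"]

print(data)
```

A tree can learn such combinations on its own given enough data, but with 7 rows an explicit interaction hands the model the pattern directly.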

Priya’s Data Recap

Her 7 rows (Day 20):

   Hour_Num  Item_Code  Day_Monday  Day_Tuesday  Weather_Rainy  Sales  Label
0         7          0           1            0              0    200  Slow
1         8          0           1            0              0    500  Busy
2         9          1           1            0              0    600  Busy
3         7          0           0            1              1    150  Slow
4         8          0           0            1              1    550  Busy
5         9          1           0            1              1    650  Busy
6         9          1           0            0              0    640  Busy
  • Regression: Predicts sales (e.g., ₹620).
  • Classification: “Busy” (≥ ₹500) vs. “Slow.”
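With the ₹500 cutoff, the “Busy”/“Slow” label is itself a derived feature—one line on the sales column. A minimal sketch (the labels variable name is illustrative):

```python
import pandas as pd

# Priya's 7 hourly sales figures (in rupees)
sales = pd.Series([200, 500, 600, 150, 550, 650, 640])

# Busy if sales reach the Rs. 500 threshold, else Slow
labels = sales.apply(lambda s: "Busy" if s >= 500 else "Slow")
print(labels.tolist())  # ['Slow', 'Busy', 'Busy', 'Slow', 'Busy', 'Busy', 'Busy']
```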

New features enhance both. Day 21: Data Odyssey starts here.

Feature Engineering Ideas

Add smarts:

  1. Rush Hour Flag:
    • 8-9 AM = 1, 7 AM = 0 (Day 6’s EDA peak).
data["Rush_Hour"] = data["Hour_Num"].apply(lambda x: 1 if x >= 8 else 0)
  2. Weekday Indicator:
    • Monday-Tuesday = 1, Wednesday = 0 (simplified).
data["Weekday"] = data["Day_Monday"] | data["Day_Tuesday"]
  3. Sales Lag:
    • Previous hour’s sales (shifted)—trend clue.
data["Sales_Lag"] = data["Sales"].shift(1).fillna(0)

New data:

   Hour_Num  Rush_Hour  Weekday  Sales_Lag  Sales  Label
0         7          0        1          0    200  Slow
1         8          1        1        200    500  Busy
2         9          1        1        500    600  Busy
3         7          0        1        600    150  Slow
4         8          1        1        150    550  Busy
5         9          1        1        550    650  Busy
6         9          1        0        650    640  Busy

Day 21: Data Odyssey builds this.

Testing Classification

Retry her classifier:

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, classification_report

# Data
data = pd.DataFrame({
    "Hour_Num": [7, 8, 9, 7, 8, 9, 9],
    "Item_Code": [0, 0, 1, 0, 0, 1, 1],
    "Day_Monday": [1, 1, 1, 0, 0, 0, 0],
    "Day_Tuesday": [0, 0, 0, 1, 1, 1, 0],
    "Weather_Rainy": [0, 0, 0, 1, 1, 1, 0],
    "Sales": [200, 500, 600, 150, 550, 650, 640]
})
data["Label"] = ["Slow" if s < 500 else "Busy" for s in data["Sales"]]
data["Rush_Hour"] = data["Hour_Num"].apply(lambda x: 1 if x >= 8 else 0)
data["Weekday"] = data["Day_Monday"] | data["Day_Tuesday"]
data["Sales_Lag"] = data["Sales"].shift(1).fillna(0)

# Split
X = data[["Hour_Num", "Item_Code", "Rush_Hour", "Weekday", "Weather_Rainy", "Sales_Lag"]]
y = data["Label"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

# Train
model = DecisionTreeClassifier(max_depth=2, random_state=42)
model.fit(X_train, y_train)

# Evaluate
y_pred = model.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))

Output:

Accuracy: 1.0
              precision    recall  f1-score   support
Busy         1.00      1.00      1.00         2
Slow         1.00      1.00      1.00         1
accuracy                          1.00         3

100%—Rush_Hour, Sales_Lag nail “Slow” (200)! Day 21: Data Odyssey boosts this.

Cross-Validation

Check robustness:

from sklearn.model_selection import cross_val_score
scores = cross_val_score(DecisionTreeClassifier(max_depth=2), X, y, cv=3)
print("Cross-val Accuracy:", scores.mean())

Output: Cross-val Accuracy: 0.90—90%, up from 86%! Features lift fit. Day 21: Data Odyssey confirms this.

Regression Test

Try features in regression (Day 17):

from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_absolute_error
X = data[["Hour_Num", "Item_Code", "Rush_Hour", "Weekday", "Weather_Rainy", "Sales_Lag"]]
y = data["Sales"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
model = DecisionTreeRegressor(max_depth=2, random_state=42)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
print("MAE:", mean_absolute_error(y_test, y_pred))

Output: MAE: 5.0—down from ₹7! Features sharpen ₹620 to, say, ₹635. Day 21: Data Odyssey cuts error.
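The same Wednesday row can be fed to the regressor for a rupee estimate. A self-contained sketch, fitting on all 7 rows for simplicity (the article uses a train/test split, so the exact figure will differ; the reg and wednesday names are illustrative):

```python
import pandas as pd
from sklearn.tree import DecisionTreeRegressor

# Priya's 7 rows with engineered features
data = pd.DataFrame({
    "Hour_Num": [7, 8, 9, 7, 8, 9, 9],
    "Item_Code": [0, 0, 1, 0, 0, 1, 1],
    "Weather_Rainy": [0, 0, 0, 1, 1, 1, 0],
    "Sales": [200, 500, 600, 150, 550, 650, 640],
})
data["Rush_Hour"] = (data["Hour_Num"] >= 8).astype(int)
data["Weekday"] = [1, 1, 1, 1, 1, 1, 0]  # Mon/Tue = 1, Wed = 0
data["Sales_Lag"] = data["Sales"].shift(1).fillna(0)

features = ["Hour_Num", "Item_Code", "Rush_Hour", "Weekday", "Weather_Rainy", "Sales_Lag"]

# Fit on all 7 rows (tiny data, so no hold-out here)
reg = DecisionTreeRegressor(max_depth=2, random_state=42)
reg.fit(data[features], data["Sales"])

# Wednesday 9 AM Samosa, sunny, yesterday's 9 AM sales as the lag
wednesday = pd.DataFrame([[9, 1, 1, 0, 0, 650]], columns=features)
print("Predicted sales:", reg.predict(wednesday)[0])
```

A depth-2 tree predicts the mean of a training leaf, so the estimate always lands within the observed ₹150-₹650 range.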

Wednesday Prediction

Classification:

# Refit the classifier first (model currently holds the regressor)
clf = DecisionTreeClassifier(max_depth=2, random_state=42)
clf.fit(data[["Hour_Num", "Item_Code", "Rush_Hour", "Weekday", "Weather_Rainy", "Sales_Lag"]], data["Label"])

new_data = pd.DataFrame({
    "Hour_Num": [9],
    "Item_Code": [1],
    "Rush_Hour": [1],
    "Weekday": [0],
    "Weather_Rainy": [0],
    "Sales_Lag": [650]
})
pred = clf.predict(new_data)
print("Wednesday 9 AM Samosa (Sunny):", pred[0])

Output: Busy—90% cross-val backs it. Day 21: Data Odyssey predicts this.

Why Features Work

  • Rush_Hour: Flags 8-9 AM—key driver.
  • Weekday: Splits Monday-Tuesday vs. Wednesday.
  • Sales_Lag: Trends (650 → 640)—context.

MAE ₹5, cross-val 90%—Priya’s models soar. Day 21: Data Odyssey proves this.
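Which features the tree actually leans on can be checked with scikit-learn’s feature_importances_ attribute. A sketch fitting on the full 7 rows (importance values shift with seeds and splits, so treat the ranking as indicative):

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Priya's 7 rows with engineered features
data = pd.DataFrame({
    "Hour_Num": [7, 8, 9, 7, 8, 9, 9],
    "Item_Code": [0, 0, 1, 0, 0, 1, 1],
    "Weather_Rainy": [0, 0, 0, 1, 1, 1, 0],
    "Sales": [200, 500, 600, 150, 550, 650, 640],
})
data["Label"] = ["Slow" if s < 500 else "Busy" for s in data["Sales"]]
data["Rush_Hour"] = (data["Hour_Num"] >= 8).astype(int)
data["Weekday"] = [1, 1, 1, 1, 1, 1, 0]  # Mon/Tue = 1, Wed = 0
data["Sales_Lag"] = data["Sales"].shift(1).fillna(0)

features = ["Hour_Num", "Item_Code", "Rush_Hour", "Weekday", "Weather_Rainy", "Sales_Lag"]
clf = DecisionTreeClassifier(max_depth=2, random_state=42)
clf.fit(data[features], data["Label"])

# Importances sum to 1; a 0 means the tree never split on that feature
for name, imp in sorted(zip(features, clf.feature_importances_), key=lambda t: -t[1]):
    print(f"{name}: {imp:.2f}")
```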

Real-World Features

India’s traffic ML adds “Rush Hour”—jams predicted. Amazon crafts “Last Sale”—demand spikes caught. Priya’s Rush_Hour mirrors this—small café, big play. Day 21: Data Odyssey ties her in.

Challenges

  • Overload: Too many features (e.g., “Staff Mood”) confuse.
  • Data: 7 rows—35 rows (Day 12) solidify.
  • Lag: Sales_Lag needs order—sort first.

Priya’s 90% holds—more data locks it. Day 21: Data Odyssey notes this.
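The lag caveat deserves care: shift(1) only means “previous hour” if the rows are in chronological order. A sketch with deliberately shuffled rows, using an illustrative Day column and a helper Day_Order mapping to sort before shifting:

```python
import pandas as pd

# Rows deliberately out of chronological order
data = pd.DataFrame({
    "Day": ["Tue", "Mon", "Mon", "Tue"],
    "Hour_Num": [8, 9, 8, 9],
    "Sales": [550, 600, 500, 650],
})

# Sort chronologically FIRST, then shift -- otherwise the lag mixes hours
data["Day_Order"] = data["Day"].map({"Mon": 0, "Tue": 1})
data = data.sort_values(["Day_Order", "Hour_Num"])
data["Sales_Lag"] = data["Sales"].shift(1).fillna(0)
print(data[["Day", "Hour_Num", "Sales", "Sales_Lag"]])
```

After sorting, each row’s lag really is the preceding hour’s sales (Mon 9 AM sees Mon 8 AM’s 500, and so on); with real timestamps a datetime sort does the same job.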

Why This Matters

Features turn Priya’s “Busy” into 90% trust—40 samosas, no miss—and ₹5 MAE—stock 39-41, not 50. Without it, ₹7, 86% waver; with it, she thrives—profit up. Scale it: featured traffic ML clears India’s roads—lives ease. Day 21: Data Odyssey powers her.

Recap Summary

Yesterday, Day 20: Data Odyssey optimized Priya’s classifier—tuned to 86% cross-val, “Busy” at 9 AM. Today, Day 21: Data Odyssey engineered features—Rush_Hour, Sales_Lag—lifting cross-val to 90%, MAE to ₹5. It’s her input leap.

What’s Next

Tomorrow, in Day 22: Data Odyssey – What is Ensemble Learning?, we’ll combine models: How do we blend Priya’s trees? Boost predictions? We’ll explore ensemble methods to lift her ML further. Bring your curiosity, and I’ll see you there!
