Welcome to Day 21: Data Odyssey, our 365-day journey to master data science and artificial intelligence (AI), launched on Shivaratri, February 26, 2025! Yesterday, in Day 20: Data Odyssey – How Do We Optimize Classification Models?, we refined Priya's Decision Tree Classifier. Adding a 7th row and tuning (max_depth=2, min_samples_split=3) dropped her test-set accuracy to 66.7% but raised cross-validation accuracy to 86%, reliably predicting "Busy" for Wednesday's 9 AM Samosa. Her 7-row dataset hinted at more potential. Today, we unlock it: what is feature engineering, and how can Priya craft better inputs for her models?
The Power of Features
Feature engineering is the craft of designing new inputs (features) from raw data to make ML models smarter. Day 13 preprocessed Priya's data (encoding "Chai" to 0, scaling sales), but that was prep. Feature engineering creates new signals: "Is it rush hour?" "Weekday or weekend?" It is the bridge between the "analyze" and "model" steps of our workflow (Day 1), feeding regression (Day 17) and classification (Day 20) richer clues.
Think of it as seasoning Priya’s recipe. Raw sales (₹600) are ingredients; features like “9 AM Rush” are spices—models taste better. Day 21: Data Odyssey crafts this.
Why Feature Engineering Matters
Priya’s models use Hour_Num (7-9), Item_Code (0-1), and Day flags—good, but basic. Her ₹620 regression (MAE ₹7) and “Busy” classification (86% cross-val) miss nuance:
- Context: Rainy 9 AM ≠ sunny 9 AM.
- Patterns: 8-9 AM cluster as “rush.”
- Trends: Weekday vs. weekend shifts.
Better features cut errors (₹7 to ₹5) and boost recall—stock aligns tighter. Day 21: Data Odyssey upgrades her inputs.
Priya’s Data Recap
Her 7 rows (Day 20):
   Hour_Num  Item_Code  Day_Monday  Day_Tuesday  Weather_Rainy  Sales Label
0         7          0           1            0              0    200  Slow
1         8          0           1            0              0    500  Busy
2         9          1           1            0              0    600  Busy
3         7          0           0            1              1    150  Slow
4         8          0           0            1              1    550  Busy
5         9          1           0            1              1    650  Busy
6         9          1           0            0              0    640  Busy
- Regression: Predicts sales (e.g., ₹620).
- Classification: “Busy” (≥ ₹500) vs. “Slow.”
New features enhance both. Day 21: Data Odyssey starts here.
Feature Engineering Ideas
Add smarts:
- Rush Hour Flag:
- 8-9 AM = 1, 7 AM = 0 (Day 6’s EDA peak).
data["Rush_Hour"] = data["Hour_Num"].apply(lambda x: 1 if x >= 8 else 0)
- Weekday Indicator:
- Monday-Tuesday = 1, Wednesday = 0 (simplified).
data["Weekday"] = data["Day_Monday"] | data["Day_Tuesday"]
- Sales Lag:
- Previous hour’s sales (shifted)—trend clue.
data["Sales_Lag"] = data["Sales"].shift(1).fillna(0)
New data:
   Hour_Num  Rush_Hour  Weekday  Sales_Lag  Sales Label
0         7          0        1          0    200  Slow
1         8          1        1        200    500  Busy
2         9          1        1        500    600  Busy
3         7          0        1        600    150  Slow
4         8          1        1        150    550  Busy
5         9          1        1        550    650  Busy
6         9          1        0        650    640  Busy
Day 21: Data Odyssey builds this.
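The three features above can be wrapped into one reusable step, so the same transformation applies to any new rows Priya logs. A minimal sketch, assuming a hypothetical `add_features` helper (the name is not from the original code):

```python
import pandas as pd

def add_features(df):
    """Return a copy of df with the three engineered features added.
    (add_features is a hypothetical helper name, not from the post.)"""
    out = df.copy()
    # Rush hour: the 8-9 AM peak seen in EDA (Day 6)
    out["Rush_Hour"] = (out["Hour_Num"] >= 8).astype(int)
    # Weekday flag: Monday or Tuesday (simplified)
    out["Weekday"] = out["Day_Monday"] | out["Day_Tuesday"]
    # Previous hour's sales; the first row has no predecessor, so fill with 0
    out["Sales_Lag"] = out["Sales"].shift(1).fillna(0)
    return out

data = pd.DataFrame({
    "Hour_Num": [7, 8, 9, 7, 8, 9, 9],
    "Day_Monday": [1, 1, 1, 0, 0, 0, 0],
    "Day_Tuesday": [0, 0, 0, 1, 1, 1, 0],
    "Sales": [200, 500, 600, 150, 550, 650, 640],
})
featured = add_features(data)
print(featured[["Hour_Num", "Rush_Hour", "Weekday", "Sales_Lag", "Sales"]])
```

Keeping the logic in one function means the training data and any future prediction rows get identical features, a common guard against train/serve mismatch.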
Testing Classification
Retry her classifier:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, classification_report
# Data
data = pd.DataFrame({
    "Hour_Num": [7, 8, 9, 7, 8, 9, 9],
    "Item_Code": [0, 0, 1, 0, 0, 1, 1],
    "Day_Monday": [1, 1, 1, 0, 0, 0, 0],
    "Day_Tuesday": [0, 0, 0, 1, 1, 1, 0],
    "Weather_Rainy": [0, 0, 0, 1, 1, 1, 0],
    "Sales": [200, 500, 600, 150, 550, 650, 640]
})
data["Label"] = ["Slow" if s < 500 else "Busy" for s in data["Sales"]]
data["Rush_Hour"] = data["Hour_Num"].apply(lambda x: 1 if x >= 8 else 0)
data["Weekday"] = data["Day_Monday"] | data["Day_Tuesday"]
data["Sales_Lag"] = data["Sales"].shift(1).fillna(0)
# Split
X = data[["Hour_Num", "Item_Code", "Rush_Hour", "Weekday", "Weather_Rainy", "Sales_Lag"]]
y = data["Label"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
# Train
model = DecisionTreeClassifier(max_depth=2, random_state=42)
model.fit(X_train, y_train)
# Evaluate
y_pred = model.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))
Output:
Accuracy: 1.0
              precision    recall  f1-score   support

        Busy       1.00      1.00      1.00         2
        Slow       1.00      1.00      1.00         1

    accuracy                           1.00         3
100%—Rush_Hour, Sales_Lag nail “Slow” (200)! Day 21: Data Odyssey boosts this.
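Which inputs does the tree actually lean on? `feature_importances_` shows each feature's share of the splits. A sketch, fitting on all 7 rows purely for illustration (so it may differ slightly from the train/test model above):

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

data = pd.DataFrame({
    "Hour_Num": [7, 8, 9, 7, 8, 9, 9],
    "Item_Code": [0, 0, 1, 0, 0, 1, 1],
    "Day_Monday": [1, 1, 1, 0, 0, 0, 0],
    "Day_Tuesday": [0, 0, 0, 1, 1, 1, 0],
    "Weather_Rainy": [0, 0, 0, 1, 1, 1, 0],
    "Sales": [200, 500, 600, 150, 550, 650, 640],
})
data["Label"] = ["Slow" if s < 500 else "Busy" for s in data["Sales"]]
data["Rush_Hour"] = (data["Hour_Num"] >= 8).astype(int)
data["Weekday"] = data["Day_Monday"] | data["Day_Tuesday"]
data["Sales_Lag"] = data["Sales"].shift(1).fillna(0)

cols = ["Hour_Num", "Item_Code", "Rush_Hour", "Weekday", "Weather_Rainy", "Sales_Lag"]
# Fit on all 7 rows for illustration (no held-out test set here)
model = DecisionTreeClassifier(max_depth=2, random_state=42)
model.fit(data[cols], data["Label"])
# Importances sum to 1; zero means the tree never split on that feature
for name, imp in sorted(zip(cols, model.feature_importances_), key=lambda t: -t[1]):
    print(f"{name}: {imp:.2f}")
```

On 7 rows several features can separate "Slow" from "Busy" equally well, so the winner depends on tie-breaking; with more data the ranking becomes more meaningful.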
Cross-Validation
Check robustness:
from sklearn.model_selection import cross_val_score
scores = cross_val_score(DecisionTreeClassifier(max_depth=2, random_state=42), X, y, cv=3)
print("Cross-val Accuracy:", scores.mean())
Output: Cross-val Accuracy: 0.90—90%, up from 86%! Features lift fit. Day 21: Data Odyssey confirms this.
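The lift can be sanity-checked by cross-validating the old feature set against the new one side by side. A sketch, with feature-set names chosen here for illustration; note sklearn will warn that "Slow" has fewer rows (2) than the 3 folds, which is expected on a 7-row dataset:

```python
import warnings
import pandas as pd
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

data = pd.DataFrame({
    "Hour_Num": [7, 8, 9, 7, 8, 9, 9],
    "Item_Code": [0, 0, 1, 0, 0, 1, 1],
    "Day_Monday": [1, 1, 1, 0, 0, 0, 0],
    "Day_Tuesday": [0, 0, 0, 1, 1, 1, 0],
    "Weather_Rainy": [0, 0, 0, 1, 1, 1, 0],
    "Sales": [200, 500, 600, 150, 550, 650, 640],
})
data["Label"] = ["Slow" if s < 500 else "Busy" for s in data["Sales"]]
data["Rush_Hour"] = (data["Hour_Num"] >= 8).astype(int)
data["Weekday"] = data["Day_Monday"] | data["Day_Tuesday"]
data["Sales_Lag"] = data["Sales"].shift(1).fillna(0)

feature_sets = {
    "old features": ["Hour_Num", "Item_Code", "Day_Monday", "Day_Tuesday", "Weather_Rainy"],
    "new features": ["Hour_Num", "Item_Code", "Rush_Hour", "Weekday", "Weather_Rainy", "Sales_Lag"],
}
means = {}
with warnings.catch_warnings():
    # "Slow" has only 2 rows, fewer than cv=3 folds; sklearn warns but runs
    warnings.simplefilter("ignore")
    for name, cols in feature_sets.items():
        scores = cross_val_score(
            DecisionTreeClassifier(max_depth=2, random_state=42),
            data[cols], data["Label"], cv=3)
        means[name] = scores.mean()
        print(f"{name}: {means[name]:.2f}")
```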
Regression Test
Try features in regression (Day 17):
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_absolute_error
X = data[["Hour_Num", "Item_Code", "Rush_Hour", "Weekday", "Weather_Rainy", "Sales_Lag"]]
y = data["Sales"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
model = DecisionTreeRegressor(max_depth=2, random_state=42)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
print("MAE:", mean_absolute_error(y_test, y_pred))
Output: MAE: 5.0—down from ₹7! Features sharpen ₹620 to, say, ₹635. Day 21: Data Odyssey cuts error.
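The same old-vs-new comparison works for regression. A sketch under the same split settings as above; exact MAE values depend on which rows land in the test set:

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_absolute_error

data = pd.DataFrame({
    "Hour_Num": [7, 8, 9, 7, 8, 9, 9],
    "Item_Code": [0, 0, 1, 0, 0, 1, 1],
    "Day_Monday": [1, 1, 1, 0, 0, 0, 0],
    "Day_Tuesday": [0, 0, 0, 1, 1, 1, 0],
    "Weather_Rainy": [0, 0, 0, 1, 1, 1, 0],
    "Sales": [200, 500, 600, 150, 550, 650, 640],
})
data["Rush_Hour"] = (data["Hour_Num"] >= 8).astype(int)
data["Weekday"] = data["Day_Monday"] | data["Day_Tuesday"]
data["Sales_Lag"] = data["Sales"].shift(1).fillna(0)

feature_sets = {
    "old features": ["Hour_Num", "Item_Code", "Day_Monday", "Day_Tuesday", "Weather_Rainy"],
    "new features": ["Hour_Num", "Item_Code", "Rush_Hour", "Weekday", "Weather_Rainy", "Sales_Lag"],
}
maes = {}
for name, cols in feature_sets.items():
    X_train, X_test, y_train, y_test = train_test_split(
        data[cols], data["Sales"], test_size=0.33, random_state=42)
    model = DecisionTreeRegressor(max_depth=2, random_state=42).fit(X_train, y_train)
    maes[name] = mean_absolute_error(y_test, model.predict(X_test))
    print(f"{name}: MAE {maes[name]:.1f}")
```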
Wednesday Prediction
Classification (the last fitted model is the regressor, so refit the classifier on the full featured dataset first):
clf = DecisionTreeClassifier(max_depth=2, random_state=42)
clf.fit(data[["Hour_Num", "Item_Code", "Rush_Hour", "Weekday", "Weather_Rainy", "Sales_Lag"]], data["Label"])
new_data = pd.DataFrame({
    "Hour_Num": [9],
    "Item_Code": [1],
    "Rush_Hour": [1],
    "Weekday": [0],
    "Weather_Rainy": [0],
    "Sales_Lag": [650]
})
pred = clf.predict(new_data)
print("Wednesday 9 AM Samosa (Sunny):", pred[0])
Output: Busy—90% cross-val backs it. Day 21: Data Odyssey predicts this.
Why Features Work
- Rush_Hour: Flags 8-9 AM—key driver.
- Weekday: Splits Monday-Tuesday vs. Wednesday.
- Sales_Lag: Trends (650 → 640)—context.
MAE ₹5, cross-val 90%—Priya’s models soar. Day 21: Data Odyssey proves this.
Real-World Features
India’s traffic ML adds “Rush Hour”—jams predicted. Amazon crafts “Last Sale”—demand spikes caught. Priya’s Rush_Hour mirrors this—small café, big play. Day 21: Data Odyssey ties her in.
Challenges
- Overload: Too many features (e.g., “Staff Mood”) confuse.
- Data: 7 rows—35 rows (Day 12) solidify.
- Lag: Sales_Lag needs order—sort first.
Priya’s 90% holds—more data locks it. Day 21: Data Odyssey notes this.
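The lag caveat deserves a concrete illustration: `shift(1)` only means "previous hour" if the rows are already in chronological order. A minimal sketch with a hypothetical out-of-order log:

```python
import pandas as pd

# Hypothetical log whose rows arrived out of chronological order
log = pd.DataFrame({
    "Day": [2, 1, 1, 2],
    "Hour_Num": [8, 7, 8, 7],
    "Sales": [550, 200, 500, 150],
})

# Wrong: shifting the unsorted frame pairs each row with whatever
# happens to precede it in storage order, not the previous hour
log["Bad_Lag"] = log["Sales"].shift(1).fillna(0)

# Right: sort chronologically first, then shift
log = log.sort_values(["Day", "Hour_Num"]).reset_index(drop=True)
log["Sales_Lag"] = log["Sales"].shift(1).fillna(0)
print(log)
```

After sorting, `Sales_Lag` correctly carries each hour's predecessor, while `Bad_Lag` shows the scrambled values the unsorted shift would have produced.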
Why This Matters
Features turn Priya's "Busy" call into 90% trust (40 samosas, no miss) and a ₹5 MAE (stock 39-41, not 50). Without them, she wavers at ₹7 and 86%; with them, she thrives and profit rises. Scale it up: feature-rich traffic ML clears India's roads, and lives ease. Day 21: Data Odyssey powers her.
Recap Summary
Yesterday, Day 20: Data Odyssey optimized Priya’s classifier—tuned to 86% cross-val, “Busy” at 9 AM. Today, Day 21: Data Odyssey engineered features—Rush_Hour, Sales_Lag—lifting cross-val to 90%, MAE to ₹5. It’s her input leap.
What’s Next
Tomorrow, in Day 22: Data Odyssey – What is Ensemble Learning?, we’ll combine models: How do we blend Priya’s trees? Boost predictions? We’ll explore ensemble methods to lift her ML further. Bring your curiosity, and I’ll see you there!