Welcome to Day 20: Data Odyssey, part of our 365-day journey to master data science and artificial intelligence (AI), launched on Shivaratri, February 26, 2025! Yesterday, in Day 19: Data Odyssey – How Do We Use ML for Classification?, we shifted Priya’s focus to classification. Her Decision Tree Classifier labeled hours as “Busy” (sales ≥ ₹500) or “Slow” (< ₹500), nailing Wednesday’s 9 AM Samosa as “Busy” with 100% accuracy on the 2-row test split of her 6-row dataset. But perfect scores hinted at overfitting (Day 18). Today, we refine: How do we optimize classification models, and can Priya trust “Busy” for her café?
The Need for Optimization
Priya’s classifier works—100% accuracy on 2 test rows, calling 9 AM “Busy” to stock extra samosas. But Day 18’s overfitting lesson looms: Train and Test at 1.0 suggest memorization, not learning. Optimization seeks:
- Generalization – “Busy” holds for new days.
- Balance – Catch all “Busy” hours, avoid false “Slow.”
- Efficiency – Simple yet sharp predictions.
Her 6 rows limit her—Day 12’s 35 rows beckon. Day 20: Data Odyssey tunes her model.
Priya’s Starting Point
Her data (Day 19):
   Hour_Num  Item_Code  Day_Monday  Day_Tuesday  Weather_Rainy  Sales  Label
0         7          0           1            0              0    200   Slow
1         8          0           1            0              0    500   Busy
2         9          1           1            0              0    600   Busy
3         7          0           0            1              1    150   Slow
4         8          0           0            1              1    550   Busy
5         9          1           0            1              1    650   Busy
- Classifier: Decision Tree, 100% accuracy.
- Prediction: Wednesday, 9 AM, Samosa, Sunny = “Busy.”
Goal: Optimize to trust “Busy” on bigger data. Day 20: Data Odyssey starts here.
Optimization Strategies
Refine with data, tuning, and metrics:
- More Data:
- 6 rows overfit—add Day 12’s 35 rows later.
- Test with an imagined 7th row (Wednesday, 9 AM, Samosa, Sunny):
6         9          1           0            0              0    640   Busy
- Tune Model:
- Limit depth (Day 18) to avoid memorizing.
- Try max_depth=2.
- Better Metrics:
- Accuracy (1.0) hides flaws; use precision and recall (see the refresher sketch after this list).
Day 20: Data Odyssey applies these.
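Before running anything, here’s a quick refresher on what those two metrics mean, sketched on hypothetical labels (not Priya’s data):
from sklearn.metrics import precision_score, recall_score
# Hypothetical labels: 4 truly "Busy" hours, 2 truly "Slow".
y_true = ["Busy", "Busy", "Busy", "Busy", "Slow", "Slow"]
y_pred = ["Busy", "Busy", "Busy", "Slow", "Busy", "Slow"]
# Precision: of the hours called "Busy", how many really were? 3 of 4 = 0.75
print("Precision:", precision_score(y_true, y_pred, pos_label="Busy"))
# Recall: of the truly "Busy" hours, how many did we catch? 3 of 4 = 0.75
print("Recall:", recall_score(y_true, y_pred, pos_label="Busy"))
For Priya, recall on “Busy” is the one that matters most: a missed rush costs more than a few extra samosas.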
Re-Running with Tuning
Update her script:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, classification_report
# Data
data = pd.DataFrame({
    "Hour_Num": [7, 8, 9, 7, 8, 9],
    "Item_Code": [0, 0, 1, 0, 0, 1],
    "Day_Monday": [1, 1, 1, 0, 0, 0],
    "Day_Tuesday": [0, 0, 0, 1, 1, 1],
    "Weather_Rainy": [0, 0, 0, 1, 1, 1],
    "Sales": [200, 500, 600, 150, 550, 650]
})
data["Label"] = ["Slow" if s < 500 else "Busy" for s in data["Sales"]]
# Split
X = data[["Hour_Num", "Item_Code", "Day_Monday", "Day_Tuesday", "Weather_Rainy"]]
y = data["Label"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
# Train with depth limit
model = DecisionTreeClassifier(max_depth=2, random_state=42)
model.fit(X_train, y_train)
# Evaluate
y_pred = model.predict(X_test)
print("Predictions:", y_pred)
print("Actual:", y_test.values)
print("Accuracy:", accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))
Output:
Predictions: ['Busy' 'Busy']
Actual: ['Busy' 'Busy']
Accuracy: 1.0
              precision    recall  f1-score   support

        Busy       1.00      1.00      1.00         2

    accuracy                           1.00         2
Still 100%, but the random split put only “Busy” rows in the test set, so “Slow” never even appears in the report (see the stratified-split sketch below). Day 20: Data Odyssey digs deeper.
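One way to avoid a one-class test split is to stratify the split so both labels appear in proportion. A minimal sketch, reusing X and y from the script above:
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42, stratify=y
)
print("Test labels:", y_test.values)  # should now include both "Busy" and "Slow"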
Cross-Validation
Day 16’s trick—average performance:
from sklearn.model_selection import cross_val_score
scores = cross_val_score(DecisionTreeClassifier(max_depth=2, random_state=42), X, y, cv=3, scoring="accuracy")
print("Cross-val Accuracy:", scores.mean())
Output: Cross-val Accuracy: 0.833. At 83.3%, not 100%, one fold (2 rows) mixes “Slow” and “Busy,” and the overfitting eases. Day 20: Data Odyssey balances this.
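To see the individual fold scores rather than just the mean, one option is the sketch below. Note: with only two “Slow” rows, scikit-learn’s default stratified folds can struggle at cv=3, so this sketch assumes plain shuffled KFold, which may give a slightly different mean:
from sklearn.model_selection import KFold, cross_val_score
cv = KFold(n_splits=3, shuffle=True, random_state=42)
scores = cross_val_score(
    DecisionTreeClassifier(max_depth=2, random_state=42), X, y, cv=cv, scoring="accuracy"
)
print("Fold scores:", scores)  # one accuracy per fold
print("Mean:", scores.mean())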
Adding Data
Add Wednesday’s row:
data = pd.DataFrame({
    "Hour_Num": [7, 8, 9, 7, 8, 9, 9],
    "Item_Code": [0, 0, 1, 0, 0, 1, 1],
    "Day_Monday": [1, 1, 1, 0, 0, 0, 0],
    "Day_Tuesday": [0, 0, 0, 1, 1, 1, 0],
    "Weather_Rainy": [0, 0, 0, 1, 1, 1, 0],
    "Sales": [200, 500, 600, 150, 550, 650, 640]
})
data["Label"] = ["Slow" if s < 500 else "Busy" for s in data["Sales"]]
# Split and train
X = data[["Hour_Num", "Item_Code", "Day_Monday", "Day_Tuesday", "Weather_Rainy"]]
y = data["Label"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
model = DecisionTreeClassifier(max_depth=2, random_state=42)
model.fit(X_train, y_train)
# Evaluate
y_pred = model.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))
Output:
Accuracy: 0.67
              precision    recall  f1-score   support

        Busy       0.67      1.00      0.80         2
        Slow       0.00      0.00      0.00         1

    accuracy                           0.67         3
66.7%: the lone “Slow” test row was misclassified as “Busy.” More data exposes the flaws that the 6-row split hid. Day 20: Data Odyssey tests this.
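To pinpoint the miss, a confusion matrix lays out hits and misses per class. A minimal sketch, reusing y_test and y_pred from above:
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred, labels=["Busy", "Slow"])
print(pd.DataFrame(cm, index=["true Busy", "true Slow"], columns=["pred Busy", "pred Slow"]))
The off-diagonal cell is the “Slow” hour that got called “Busy.”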
Tuning Parameters
Try min_samples_split=3 (a node needs at least 3 rows before it can be split):
model = DecisionTreeClassifier(max_depth=2, min_samples_split=3, random_state=42)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))
Output: Accuracy: 0.67, the same on this test split, but cross-validation rises:
scores = cross_val_score(DecisionTreeClassifier(max_depth=2, min_samples_split=3, random_state=42), X, y, cv=3)
print("Cross-val Accuracy:", scores.mean())
Cross-val Accuracy: 0.86. At 86%, the tuned tree is more stable across folds! Day 20: Data Odyssey tunes this.
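Instead of guessing parameters one at a time, scikit-learn’s GridSearchCV can try combinations automatically. A minimal sketch; the parameter ranges are illustrative, and it assumes plain shuffled KFold for the same small-class reason as above:
from sklearn.model_selection import GridSearchCV, KFold
param_grid = {"max_depth": [1, 2, 3], "min_samples_split": [2, 3, 4]}
search = GridSearchCV(
    DecisionTreeClassifier(random_state=42),
    param_grid,
    cv=KFold(n_splits=3, shuffle=True, random_state=42),
    scoring="accuracy",
)
search.fit(X, y)
print("Best parameters:", search.best_params_)
print("Best cross-val accuracy:", search.best_score_)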
Wednesday Check
9 AM, Samosa, Sunny:
new_data = pd.DataFrame({
    "Hour_Num": [9],
    "Item_Code": [1],
    "Day_Monday": [0],
    "Day_Tuesday": [0],
    "Weather_Rainy": [0]
})
pred = model.predict(new_data)
print("Wednesday 9 AM Samosa (Sunny):", pred[0])
Output: Busy—still fits ₹640. Day 20: Data Odyssey predicts this.
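If Priya wants to know how confident the tree is, not just the label, predict_proba returns the class proportions at the leaf the new row lands in. A minimal sketch, reusing model and new_data from above:
proba = model.predict_proba(new_data)[0]
for label, p in zip(model.classes_, proba):
    print(f"{label}: {p:.2f}")
On a tree this small the leaf is often pure, so expect a probability near 1.00 for “Busy.”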
Why It Improves
- Data: 7 rows vs. 6—less memorization.
- Tuning: max_depth=2, min_samples_split=3—simpler splits.
- Metrics: Recall 1.0 for “Busy”—catches rushes.
Accuracy dips (66.7%), but cross-val (86%) shows robustness—Priya trusts “Busy.” Day 20: Data Odyssey gains this.
Real-World Optimization
India’s railways tune classifiers so peak hours hit 95% recall. Amazon optimizes its “high demand” flags to cut misses. Priya’s 86% cross-val is the same discipline at café scale, and her stock aligns with demand. Day 20: Data Odyssey mirrors this.
Challenges
- Small Data: 7 rows—35 rows (Day 12) sharpen it.
- Imbalance: 5 Busy, 2 Slow—skews to “Busy.”
- Threshold: the ₹500 cutoff is arbitrary; raising it to ₹550 would flip the ₹500 hour to “Slow” (see the sketch below).
More data balances—Priya grows. Day 20: Data Odyssey flags this.
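Two of these can be probed in code already. The sketch below relabels her 7-row data at two cutoffs, then tries class_weight="balanced", scikit-learn’s built-in re-weighting of rare classes; it reuses data, X_train, X_test, y_train, and y_test from the “Adding Data” section, and the weighted model’s score will depend on the split:
# Threshold sensitivity: relabel the 7-row data at Rs. 500 vs Rs. 550.
for threshold in (500, 550):
    labels = ["Slow" if s < threshold else "Busy" for s in data["Sales"]]
    print(f"Threshold {threshold}:", labels)
# Imbalance: "balanced" gives the rare "Slow" class more weight during training.
weighted = DecisionTreeClassifier(max_depth=2, class_weight="balanced", random_state=42)
weighted.fit(X_train, y_train)
print("Weighted-model accuracy:", accuracy_score(y_test, weighted.predict(X_test)))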
Why This Matters
Optimized, Priya’s “Busy” call at 9 AM, now about 86% reliable, means 40 samosas ready and no rush missed. Without optimization, the 100% score overfits and new days fail; with it, she plans and profit holds. Scale it up: an optimized traffic “busy” classifier clears India’s roads, and lives ease. Day 20: Data Odyssey refines her.
Recap Summary
Yesterday, Day 19: Data Odyssey built Priya’s classifier: “Busy” for sales ≥ ₹500, 100% accuracy, hinting at overfitting. Today, Day 20: Data Odyssey optimized it with more data, a tuned tree, and an 86% cross-val score, so she can trust “Busy” at 9 AM. It’s her refinement step.
What’s Next
Tomorrow, in Day 21: Data Odyssey – What is Feature Engineering?, we’ll boost Priya’s models: How do we craft better inputs? Add “Rush Hour” flags? We’ll engineer features to lift regression and classification. Bring your curiosity, and I’ll see you there!