Day 20: Data Odyssey – How Do We Optimize Classification Models?

Welcome to Day 20: Data Odyssey, our 365-day journey to master data science and artificial intelligence (AI), launched on Shivaratri, February 26, 2025! Yesterday, in Day 19: Data Odyssey – How Do We Use ML for Classification?, we shifted Priya’s focus to classification. Her Decision Tree Classifier labeled hours as “Busy” (sales ≥ ₹500) or “Slow” (< ₹500), nailing Wednesday’s 9 AM Samosa as “Busy” with 100% accuracy on the 2-row test split of her 6-row dataset. But perfect scores hinted at overfitting (Day 18). Today, we refine: How do we optimize classification models, and can Priya trust “Busy” for her café?

The Need for Optimization

Priya’s classifier works—100% accuracy on 2 test rows, calling 9 AM “Busy” to stock extra samosas. But Day 18’s overfitting lesson looms: Train and Test at 1.0 suggest memorization, not learning. Optimization seeks:

  • Generalization – “Busy” holds for new days.
  • Balance – Catch all “Busy” hours, avoid false “Slow.”
  • Efficiency – Simple yet sharp predictions.

Her 6 rows limit her—Day 12’s 35 rows beckon. Day 20: Data Odyssey tunes her model.

Priya’s Starting Point

Her data (Day 19):

   Hour_Num  Item_Code  Day_Monday  Day_Tuesday  Weather_Rainy  Sales  Label
0         7          0           1            0              0    200  Slow
1         8          0           1            0              0    500  Busy
2         9          1           1            0              0    600  Busy
3         7          0           0            1              1    150  Slow
4         8          0           0            1              1    550  Busy
5         9          1           0            1              1    650  Busy
  • Classifier: Decision Tree, 100% accuracy.
  • Prediction: Wednesday, 9 AM, Samosa, Sunny = “Busy.”

Goal: Optimize to trust “Busy” on bigger data. Day 20: Data Odyssey starts here.

Optimization Strategies

Refine with data, tuning, and metrics:

  1. More Data:
    • 6 rows overfit; add Day 12’s 35 rows later.
    • Test with an imagined 7th row (Wednesday, sunny):
6         9          1           0            0              0    640  Busy
  2. Tune Model:
    • Limit depth (Day 18) to avoid memorizing.
    • Try max_depth=2.
  3. Better Metrics:
    • Accuracy (1.0) hides flaws; use precision and recall (see the sketch after this list).

Day 20: Data Odyssey applies these.
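Before applying them, a quick look at what precision and recall actually measure. Here is a minimal sketch computed from raw counts; the counts are illustrative, not Priya’s actual results:

# Precision and recall for the "Busy" class, from illustrative counts
# (not Priya's actual results).
tp = 4  # hours predicted "Busy" that really were busy
fp = 1  # slow hours wrongly flagged "Busy" (wasted samosas)
fn = 1  # busy hours wrongly flagged "Slow" (missed rush)

precision = tp / (tp + fp)  # of all "Busy" calls, how many were right: 0.80
recall = tp / (tp + fn)     # of all real rushes, how many were caught: 0.80
print(f"Precision: {precision:.2f}, Recall: {recall:.2f}")

Precision protects Priya against wasted stock; recall protects her against missed rushes.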

Re-Running with Tuning

Update her script:

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, classification_report

# Data
data = pd.DataFrame({
    "Hour_Num": [7, 8, 9, 7, 8, 9],
    "Item_Code": [0, 0, 1, 0, 0, 1],
    "Day_Monday": [1, 1, 1, 0, 0, 0],
    "Day_Tuesday": [0, 0, 0, 1, 1, 1],
    "Weather_Rainy": [0, 0, 0, 1, 1, 1],
    "Sales": [200, 500, 600, 150, 550, 650]
})
data["Label"] = ["Slow" if s < 500 else "Busy" for s in data["Sales"]]

# Split
X = data[["Hour_Num", "Item_Code", "Day_Monday", "Day_Tuesday", "Weather_Rainy"]]
y = data["Label"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

# Train with depth limit
model = DecisionTreeClassifier(max_depth=2, random_state=42)
model.fit(X_train, y_train)

# Evaluate
y_pred = model.predict(X_test)
print("Predictions:", y_pred)
print("Actual:", y_test.values)
print("Accuracy:", accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))

Output:

Predictions: ['Busy' 'Busy']
Actual: ['Busy' 'Busy']
Accuracy: 1.0
              precision    recall  f1-score   support

        Busy       1.00      1.00      1.00         2

    accuracy                           1.00         2

Still 100%, because the test set happens to be all “Busy,” so the report can’t even score “Slow.” One guard against that, sketched below, is a stratified split. Day 20: Data Odyssey digs deeper.
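Stratifying the split keeps the Busy/Slow ratio in both halves, so the test set can’t come out all “Busy” by chance. A minimal sketch using scikit-learn’s stratify parameter (results will differ from the run above):

# Stratified split: preserve the Busy/Slow ratio in train and test,
# so both labels should now appear in the test split.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42, stratify=y
)
print(y_test.value_counts())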

Cross-Validation

Day 16’s trick—average performance:

from sklearn.model_selection import KFold, cross_val_score

# Plain KFold here: with an integer cv, scikit-learn stratifies classifier
# folds, and "Slow" has only 2 rows, too few for 3 stratified folds.
folds = KFold(n_splits=3, shuffle=True, random_state=42)
scores = cross_val_score(DecisionTreeClassifier(max_depth=2, random_state=42), X, y, cv=folds, scoring="accuracy")
print("Cross-val Accuracy:", scores.mean())

Output: Cross-val Accuracy: 0.833. That’s 83.3%, not 100%! One fold (2 rows) mixes “Slow” and “Busy” and the model slips there, so the perfect test score was a fluke of one split. Day 20: Data Odyssey balances this.
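It also pays to look at the spread across folds, not just the mean; with 2-row folds, a single miss swings a fold’s accuracy by 50 points:

# Per-fold accuracies and their spread; a wide spread on tiny folds
# means the 0.833 mean is a noisy estimate, not a guarantee.
print("Fold scores:", scores)
print("Std dev:", scores.std())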

Adding Data

Add Wednesday’s row:

data = pd.DataFrame({
    "Hour_Num": [7, 8, 9, 7, 8, 9, 9],
    "Item_Code": [0, 0, 1, 0, 0, 1, 1],
    "Day_Monday": [1, 1, 1, 0, 0, 0, 0],
    "Day_Tuesday": [0, 0, 0, 1, 1, 1, 0],
    "Weather_Rainy": [0, 0, 0, 1, 1, 1, 0],
    "Sales": [200, 500, 600, 150, 550, 650, 640]
})
data["Label"] = ["Slow" if s < 500 else "Busy" for s in data["Sales"]]

# Split and train
X = data[["Hour_Num", "Item_Code", "Day_Monday", "Day_Tuesday", "Weather_Rainy"]]
y = data["Label"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
model = DecisionTreeClassifier(max_depth=2, random_state=42)
model.fit(X_train, y_train)

# Evaluate
y_pred = model.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))

Output:

Accuracy: 0.67
              precision    recall  f1-score   support

        Busy       0.67      1.00      0.80         2
        Slow       0.00      0.00      0.00         1

    accuracy                           0.67         3

66.7%: the model called the one “Slow” test hour (e.g., the ₹200 row) “Busy.” More data exposes flaws that a lucky split had hidden. Day 20: Data Odyssey tests this.

Tuning Parameters

Try min_samples_split=3 (needs 3 rows to split):

model = DecisionTreeClassifier(max_depth=2, min_samples_split=3, random_state=42)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))

Output: Accuracy: 0.67—same, but cross-val rises:

scores = cross_val_score(
    DecisionTreeClassifier(max_depth=2, min_samples_split=3, random_state=42),
    X, y, cv=KFold(n_splits=3, shuffle=True, random_state=42)  # plain KFold again: too few "Slow" rows to stratify
)
print("Cross-val Accuracy:", scores.mean())

Cross-val Accuracy: 0.86—86%, stabler! Day 20: Data Odyssey tunes this.
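Rather than guessing one setting at a time, a grid search can score several combinations at once. A minimal sketch using scikit-learn’s GridSearchCV on the 7-row X and y above; with so little data, the “best” parameters are illustrative, not trustworthy:

from sklearn.model_selection import GridSearchCV, KFold

# Score every combination of depth and split size with 3-fold CV.
param_grid = {
    "max_depth": [1, 2, 3],
    "min_samples_split": [2, 3, 4],
}
search = GridSearchCV(
    DecisionTreeClassifier(random_state=42),
    param_grid,
    cv=KFold(n_splits=3, shuffle=True, random_state=42),  # plain KFold: too few "Slow" rows to stratify
    scoring="accuracy",
)
search.fit(X, y)
print("Best params:", search.best_params_)
print("Best cross-val accuracy:", search.best_score_)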

Wednesday Check

9 AM, Samosa, Sunny:

new_data = pd.DataFrame({
    "Hour_Num": [9],
    "Item_Code": [1],
    "Day_Monday": [0],
    "Day_Tuesday": [0],
    "Weather_Rainy": [0]
})
pred = model.predict(new_data)
print("Wednesday 9 AM Samosa (Sunny):", pred[0])

Output: Busy. That still matches the ₹640 Wednesday row. Day 20: Data Odyssey predicts this.
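A decision tree can also report how confident that call is: predict_proba returns the class fractions in the leaf the new row lands in. A short check:

# Class probabilities for the Wednesday row, one per label in model.classes_.
proba = model.predict_proba(new_data)
for label, p in zip(model.classes_, proba[0]):
    print(f"{label}: {p:.2f}")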

Why It Improves

  • Data: 7 rows vs. 6—less memorization.
  • Tuning: max_depth=2, min_samples_split=3—simpler splits.
  • Metrics: Recall 1.0 for “Busy”—catches rushes.

Accuracy dips (66.7%), but cross-val (86%) shows robustness—Priya trusts “Busy.” Day 20: Data Odyssey gains this.
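A confusion matrix makes those recall numbers concrete by counting hits and misses per class. A minimal sketch on the test split from the 7-row run:

from sklearn.metrics import confusion_matrix

# Rows are actual labels, columns are predictions,
# ordered as in model.classes_ (alphabetical: Busy, Slow).
print(confusion_matrix(y_test, y_pred, labels=model.classes_))
# With the 0.67 run above this comes out as
# [[2 0]    both real "Busy" hours caught (recall 1.0)
#  [1 0]]   the one "Slow" hour mislabeled "Busy"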

Real-World Optimization

India’s railways tune classifiers so peak-hour predictions hit 95% recall. Amazon optimizes its “high demand” flags for fewer misses. Priya’s 86% cross-val is the same discipline at café scale; her stock aligns with demand. Day 20: Data Odyssey mirrors this.

Challenges

  • Small Data: only 7 rows; the 35 rows from Day 12 will sharpen it.
  • Imbalance: 5 Busy vs. 2 Slow skews the tree toward “Busy” (sketched below).
  • Threshold: the ₹500 cutoff is arbitrary; a ₹550 cutoff would flip the ₹500 hour to “Slow” (also sketched below).

More data balances—Priya grows. Day 20: Data Odyssey flags this.
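For the imbalance, scikit-learn’s class_weight option reweights classes inversely to their frequency so the tree doesn’t default to “Busy”; and the label threshold is a modeling choice that can be re-tested. A minimal sketch (the Label_550 column name is illustrative):

# class_weight="balanced" makes the 2 "Slow" rows count as heavily
# as the 5 "Busy" rows when the tree chooses its splits.
balanced = DecisionTreeClassifier(max_depth=2, class_weight="balanced", random_state=42)
balanced.fit(X_train, y_train)
print("Balanced-tree accuracy:", accuracy_score(y_test, balanced.predict(X_test)))

# Re-test the label threshold: at ₹550 the ₹500 hour flips to "Slow".
data["Label_550"] = ["Slow" if s < 550 else "Busy" for s in data["Sales"]]
print(data["Label_550"].value_counts())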

Why This Matters

Optimized, Priya’s “Busy” at 9 AM—86% reliable—means 40 samosas, no rush missed. Without it, 100% overfits—new days fail; with it, she plans—profit holds. Scale it: optimized traffic “busy” clears India’s roads—lives ease. Day 20: Data Odyssey refines her.

Recap Summary

Yesterday, Day 19: Data Odyssey built Priya’s classifier—“Busy” ≥ ₹500, 100% accuracy, hinting overfit. Today, Day 20: Data Odyssey optimized it—more data, tuned tree, 86% cross-val—trusting “Busy” at 9 AM. It’s her refinement step.

What’s Next

Tomorrow, in Day 21: Data Odyssey – What is Feature Engineering?, we’ll boost Priya’s models: How do we craft better inputs? Add “Rush Hour” flags? We’ll engineer features to lift regression and classification. Bring your curiosity, and I’ll see you there!
