Welcome to Day 19: Data Odyssey, our 365-day journey to master data science and artificial intelligence (AI), launched on Shivaratri, February 26, 2025! Yesterday, in Day 18: Data Odyssey – What is Overfitting and Underfitting?, we tackled pitfalls in Priya’s Decision Tree model. Her ₹620 prediction for Wednesday’s 9 AM Samosa sales showed overfitting (Train MAE 0, Test MAE 7): the tree had memorized her 6 rows. Limiting depth balanced it to a ₹9 MAE via cross-validation, steadying her stock plans. Today, we pivot: How do we use ML for classification, and can Priya classify her hours as “busy” or “slow”?
From Numbers to Labels
Priya’s models (Days 15-18) predicted sales—regression, guessing ₹620. Classification predicts categories, not numbers—e.g., “busy” (rush) vs. “slow” (quiet). Day 14 introduced supervised ML’s two flavors: regression (sales) and classification (labels). Why classify?
- Simpler Decisions: “Busy” = stock extra, “Slow” = ease up.
- Patterns: Spot rush hours without exact ₹.
Her café thrives on timing—classification flags 8-9 AM as “busy,” 7 AM as “slow.” Day 19: Data Odyssey shifts to this.
Priya’s Classification Task
Her data (Day 17):
   Hour_Num  Item_Code  Day_Monday  Day_Tuesday  Weather_Rainy  Sales
0         7          0           1            0              0    200
1         8          0           1            0              0    500
2         9          1           1            0              0    600
3         7          0           0            1              1    150
4         8          0           0            1              1    550
5         9          1           0            1              1    650
Define “busy” as sales ≥ ₹500 (rush threshold), “slow” < ₹500:
   Sales  Label
0    200   Slow
1    500   Busy
2    600   Busy
3    150   Slow
4    550   Busy
5    650   Busy
Goal: Predict if Wednesday, 9 AM, Samosa, Sunny is “busy” or “slow.” Day 19: Data Odyssey sets this up.
Building a Classifier
Use a Decision Tree Classifier (like Day 17’s regressor, but for labels):
- Prep Data:
import pandas as pd
data = pd.DataFrame({
    "Hour_Num": [7, 8, 9, 7, 8, 9],
    "Item_Code": [0, 0, 1, 0, 0, 1],
    "Day_Monday": [1, 1, 1, 0, 0, 0],
    "Day_Tuesday": [0, 0, 0, 1, 1, 1],
    "Weather_Rainy": [0, 0, 0, 1, 1, 1],
    "Sales": [200, 500, 600, 150, 550, 650]
})
data["Label"] = ["Slow" if s < 500 else "Busy" for s in data["Sales"]]
- Split:
from sklearn.model_selection import train_test_split
X = data[["Hour_Num", "Item_Code", "Day_Monday", "Day_Tuesday", "Weather_Rainy"]]
y = data["Label"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
- Train:
from sklearn.tree import DecisionTreeClassifier
model = DecisionTreeClassifier(random_state=42)
model.fit(X_train, y_train)
- Predict:
y_pred = model.predict(X_test)
print("Predictions:", y_pred)
print("Actual:", y_test.values)
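Output (hypothetical):
Predictions: ['Busy' 'Busy']
Actual: ['Busy' 'Busy']
If the test split holds, say, the 8 AM (₹500) and 9 AM (₹600) rows, both predictions are spot on. Which rows actually land in the test set depends on the shuffle. Day 19: Data Odyssey classifies this.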
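Because the output above is hypothetical, it helps to look at which rows the shuffle actually held out. The sketch below rebuilds the same six rows and the same split, then prints the held-out rows and the label counts left for training. The variable name feature_cols is introduced here for illustration only.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Priya's six rows, copied from the data above
data = pd.DataFrame({
    "Hour_Num": [7, 8, 9, 7, 8, 9],
    "Item_Code": [0, 0, 1, 0, 0, 1],
    "Day_Monday": [1, 1, 1, 0, 0, 0],
    "Day_Tuesday": [0, 0, 0, 1, 1, 1],
    "Weather_Rainy": [0, 0, 0, 1, 1, 1],
    "Sales": [200, 500, 600, 150, 550, 650],
})
data["Label"] = ["Slow" if s < 500 else "Busy" for s in data["Sales"]]

# feature_cols is a helper name assumed for this sketch
feature_cols = ["Hour_Num", "Item_Code", "Day_Monday", "Day_Tuesday", "Weather_Rainy"]
X = data[feature_cols]
y = data["Label"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42
)

# The index of X_test points back to the original rows that were held out
print("Held-out rows:")
print(data.loc[X_test.index, ["Hour_Num", "Sales", "Label"]])

# Label counts left for training show whether both classes are represented
print("Training label counts:")
print(y_train.value_counts())
```

With 6 rows and test_size=0.33, two rows are held out and four are used for training, so a single unlucky shuffle can leave one label out of either set.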
Wednesday Prediction
9 AM, Samosa, Sunny:
new_data = pd.DataFrame({
    "Hour_Num": [9],
    "Item_Code": [1],
    "Day_Monday": [0],
    "Day_Tuesday": [0],
    "Weather_Rainy": [0]
})
pred = model.predict(new_data)
print("Wednesday 9 AM Samosa (Sunny):", pred[0])
Output: Busy, which matches the ₹600-650 sales seen at 9 AM on Monday and Tuesday. Stock extra! Day 19: Data Odyssey calls this.
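To see why the tree says “Busy,” its learned rules can be printed as text. This is a minimal sketch, not the exact model trained above: it assumes the tree is fit on all six rows (rather than the training split) so the rules do not depend on the shuffle, and it uses scikit-learn's export_text helper.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

# Priya's six rows, copied from the data above
data = pd.DataFrame({
    "Hour_Num": [7, 8, 9, 7, 8, 9],
    "Item_Code": [0, 0, 1, 0, 0, 1],
    "Day_Monday": [1, 1, 1, 0, 0, 0],
    "Day_Tuesday": [0, 0, 0, 1, 1, 1],
    "Weather_Rainy": [0, 0, 0, 1, 1, 1],
    "Sales": [200, 500, 600, 150, 550, 650],
})
data["Label"] = ["Slow" if s < 500 else "Busy" for s in data["Sales"]]

X = data[["Hour_Num", "Item_Code", "Day_Monday", "Day_Tuesday", "Weather_Rainy"]]
y = data["Label"]

# Assumption for this sketch: fit on all six rows so the rules are stable
model = DecisionTreeClassifier(random_state=42)
model.fit(X, y)

# Print the if/else rules the tree learned
rules = export_text(model, feature_names=list(X.columns))
print(rules)

# Wednesday, 9 AM, Samosa, Sunny
new_data = pd.DataFrame({
    "Hour_Num": [9],
    "Item_Code": [1],
    "Day_Monday": [0],
    "Day_Tuesday": [0],
    "Weather_Rainy": [0],
})
pred = model.predict(new_data)
print("Wednesday 9 AM Samosa (Sunny):", pred[0])
```

In these six rows both 7 AM hours are “Slow” and every 8 AM and 9 AM hour is “Busy,” so the printed rules should hinge on Hour_Num. That also means the Wednesday call rests on the hour alone, not on the day or the weather.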
Full Script
Priya’s classifier:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, classification_report
# Data
data = pd.DataFrame({
    "Hour_Num": [7, 8, 9, 7, 8, 9],
    "Item_Code": [0, 0, 1, 0, 0, 1],
    "Day_Monday": [1, 1, 1, 0, 0, 0],
    "Day_Tuesday": [0, 0, 0, 1, 1, 1],
    "Weather_Rainy": [0, 0, 0, 1, 1, 1],
    "Sales": [200, 500, 600, 150, 550, 650]
})
data["Label"] = ["Slow" if s < 500 else "Busy" for s in data["Sales"]]
# Split
X = data[["Hour_Num", "Item_Code", "Day_Monday", "Day_Tuesday", "Weather_Rainy"]]
y = data["Label"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
# Train
model = DecisionTreeClassifier(random_state=42)
model.fit(X_train, y_train)
# Evaluate
y_pred = model.predict(X_test)
print("Predictions:", y_pred)
print("Actual:", y_test.values)
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)
# Wednesday
new_data = pd.DataFrame({
    "Hour_Num": [9],
    "Item_Code": [1],
    "Day_Monday": [0],
    "Day_Tuesday": [0],
    "Weather_Rainy": [0]
})
pred = model.predict(new_data)
print("Wednesday 9 AM Samosa (Sunny):", pred[0])
Output (hypothetical; the test rows depend on the shuffle):
Predictions: ['Busy' 'Busy']
Actual: ['Busy' 'Busy']
Accuracy: 1.0
Wednesday 9 AM Samosa (Sunny): Busy
100% on the test set, but with only 2 test rows that is weak evidence. Day 19: Data Odyssey runs this.
Evaluation Metrics
Regression used MAE (Day 16). Classification uses:
- Accuracy: Correct predictions / total predictions (1.0 = 100%).
- Precision: Correctly predicted “Busy” / all predicted “Busy” (high precision avoids overstock).
- Recall: Correctly predicted “Busy” / all actual “Busy” (high recall catches every rush).
Add:
print(classification_report(y_test, y_pred))
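Output (hypothetical, assuming both test rows are “Busy”):
              precision    recall  f1-score   support

        Busy       1.00      1.00      1.00         2

    accuracy                           1.00         2
   macro avg       1.00      1.00      1.00         2
weighted avg       1.00      1.00      1.00         2
Perfect, but only “Busy” appears in the test set. The report lists only labels found in y_test or y_pred, so there is no “Slow” row and no evidence of how well the model spots slow hours. Day 19: Data Odyssey scores this.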
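A perfect report hides how precision and recall can differ from accuracy. The worked example below counts them by hand and cross-checks against scikit-learn. The six labels in y_true and y_hat are made up for illustration; they are not Priya's results.

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Hypothetical labels for six hours (invented for this example)
y_true = ["Busy", "Busy", "Busy", "Slow", "Slow", "Busy"]
y_hat = ["Busy", "Busy", "Slow", "Busy", "Slow", "Busy"]

pairs = list(zip(y_true, y_hat))

# Count outcomes, treating "Busy" as the positive class
tp = sum(1 for t, p in pairs if t == "Busy" and p == "Busy")  # rushes caught
fp = sum(1 for t, p in pairs if t == "Slow" and p == "Busy")  # false alarms (overstock)
fn = sum(1 for t, p in pairs if t == "Busy" and p == "Slow")  # missed rushes (shortage)

precision = tp / (tp + fp)
recall = tp / (tp + fn)
accuracy = sum(1 for t, p in pairs if t == p) / len(pairs)

print("TP:", tp, "FP:", fp, "FN:", fn)
print("Precision:", precision)
print("Recall:", recall)
print("Accuracy:", round(accuracy, 2))

# Cross-check the hand counts against scikit-learn
sk_precision = precision_score(y_true, y_hat, pos_label="Busy")
sk_recall = recall_score(y_true, y_hat, pos_label="Busy")
sk_accuracy = accuracy_score(y_true, y_hat)
```

Here one false alarm and one missed rush give precision 3/4 and recall 3/4, while accuracy is 4/6. For a café, a missed rush (lost sales) and a false alarm (wasted samosas) cost different amounts, which is why the two are tracked separately.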
Overfitting Risk
Day 18 warned about a train/test gap (Train MAE 0, Test MAE 7). Check the same gap with accuracy:
train_pred = model.predict(X_train)
print("Train Accuracy:", accuracy_score(y_train, train_pred))
print("Test Accuracy:", accuracy_score(y_test, y_pred))
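Output (hypothetical):
Train Accuracy: 1.0
Test Accuracy: 1.0
Both 100%, but on 6 rows that proves little: an unrestricted tree can usually memorize its training rows, and 2 test rows cannot confirm that it generalizes. Day 12’s 35 rows test this. Day 19: Data Odyssey flags it.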
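With so few rows, one train/test split is a noisy check. One option, sketched here as an assumption rather than the method used in this series, is leave-one-out cross-validation: train on five rows, test on the sixth, and repeat for every row. The max_depth=2 limit echoes Day 18's depth cap and is likewise an illustrative choice.

```python
import pandas as pd
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Priya's six rows, copied from the data above
data = pd.DataFrame({
    "Hour_Num": [7, 8, 9, 7, 8, 9],
    "Item_Code": [0, 0, 1, 0, 0, 1],
    "Day_Monday": [1, 1, 1, 0, 0, 0],
    "Day_Tuesday": [0, 0, 0, 1, 1, 1],
    "Weather_Rainy": [0, 0, 0, 1, 1, 1],
    "Sales": [200, 500, 600, 150, 550, 650],
})
data["Label"] = ["Slow" if s < 500 else "Busy" for s in data["Sales"]]

X = data[["Hour_Num", "Item_Code", "Day_Monday", "Day_Tuesday", "Weather_Rainy"]]
y = data["Label"]

# A shallow tree (assumed depth limit) is harder to overfit than an unrestricted one
model = DecisionTreeClassifier(max_depth=2, random_state=42)

# Each of the 6 folds holds out exactly one row, so each score is 0.0 or 1.0
scores = cross_val_score(model, X, y, cv=LeaveOneOut())
print("Per-row scores:", scores)
print("Mean accuracy:", scores.mean())
```

Every row gets a turn as the test row, so the mean uses all six labels instead of two. Even a high mean here would mostly show that the hour separates these particular six rows; more data is still needed to trust it.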
Why Classification?
- Simpler: “Busy” vs. ₹620—easier call.
- Actionable: “Busy” = 40 samosas, no math.
- Flexible: Adjust threshold (₹400?).
Priya’s “Busy” at 9 AM—stock up, no guesswork. Day 19: Data Odyssey shifts her.
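The “adjust threshold” point can be tried directly. This sketch relabels the same six sales figures at a few cutoffs and counts the labels; the helper name label_hours and the cutoffs other than ₹500 are assumptions made for illustration.

```python
import pandas as pd

# Sales for Priya's six hours, copied from the data above
sales = pd.Series([200, 500, 600, 150, 550, 650], name="Sales")


def label_hours(sales, threshold):
    """Label each hour 'Busy' if sales >= threshold, else 'Slow'."""
    return sales.apply(lambda s: "Busy" if s >= threshold else "Slow")


# Count labels at each candidate cutoff (400 and 600 are illustrative)
busy_counts = {}
for threshold in (400, 500, 600):
    labels = label_hours(sales, threshold)
    busy_counts[threshold] = int((labels == "Busy").sum())
    slow = len(labels) - busy_counts[threshold]
    print(f"Threshold {threshold}: {busy_counts[threshold]} Busy, {slow} Slow")
```

At ₹400 and ₹500 the split stays 4 Busy to 2 Slow, because no sale falls between the two cutoffs. At ₹600 it flips to 2 Busy and 4 Slow, so the cutoff should follow how much extra stock a “Busy” hour really needs.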
Real-World Classification
India’s railways classify “peak” hours—extra trains. Amazon tags “high demand” items—stock rises. Priya’s “Busy” hours mirror this—small scale, big play. Day 19: Data Odyssey connects her.
Challenges
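- Small Data: 6 rows, so overfitting looms.
- Threshold: ₹500 is arbitrary. Moving it to ₹450 changes no label here (no sale falls between ₹450 and ₹499), but ₹550 would turn the 8 AM ₹500 hour “Slow.”
- Balance: 4 Busy, 2 Slow, so the labels are skewed toward “Busy.”
More data (35 rows) and tuning help with this. Day 19: Data Odyssey notes it.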
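For the balance problem, one small hedge is a stratified split, which keeps the Busy/Slow ratio roughly the same in the training and test sets. The sketch below uses the stratify parameter of train_test_split; applying it here is a suggestion, not something the earlier script does.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Priya's six rows, copied from the data above
data = pd.DataFrame({
    "Hour_Num": [7, 8, 9, 7, 8, 9],
    "Item_Code": [0, 0, 1, 0, 0, 1],
    "Day_Monday": [1, 1, 1, 0, 0, 0],
    "Day_Tuesday": [0, 0, 0, 1, 1, 1],
    "Weather_Rainy": [0, 0, 0, 1, 1, 1],
    "Sales": [200, 500, 600, 150, 550, 650],
})
data["Label"] = ["Slow" if s < 500 else "Busy" for s in data["Sales"]]

X = data[["Hour_Num", "Item_Code", "Day_Monday", "Day_Tuesday", "Weather_Rainy"]]
y = data["Label"]

# stratify=y asks the split to preserve the label proportions in both sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42, stratify=y
)

print("Train label counts:")
print(y_train.value_counts())
print("Test label counts:")
print(y_test.value_counts())
```

With 4 Busy and 2 Slow rows, the stratified test set of two rows holds one of each label, so the classification report gets a “Slow” row to score. It does not add information, though; only more rows do that.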
Why This Matters
Classification flags Priya’s 9 AM as “Busy”: 40 samosas, no waste, no shortage. Without it, she guesses from ₹620; with it, she acts, and profit holds. Scale it up: classifying traffic as “busy” helps clear India’s roads and ease daily life. Day 19: Data Odyssey labels her success.
Recap Summary
Yesterday, Day 18: Data Odyssey balanced Priya’s model—overfitting cut to ₹9 MAE. Today, Day 19: Data Odyssey switched to classification—Decision Tree tagged “Busy” (≥₹500), nailing 9 AM with 100% accuracy. It’s her new lens.
What’s Next
Tomorrow, in Day 20: Data Odyssey – How Do We Optimize Classification Models?, we’ll optimize Priya’s classifier: How do we cut overfit? Boost recall? We’ll tune her Decision Tree and test more data, refining “Busy.” Bring your curiosity, and I’ll see you there!