Day 19: Data Odyssey – How Do We Use ML for Classification?

Welcome to Day 19: Data Odyssey, our 365-day journey to master data science and artificial intelligence (AI), launched on Shivaratri, February 26, 2025! Yesterday, in Day 18: Data Odyssey – What is Overfitting and Underfitting?, we tackled pitfalls in Priya’s Decision Tree model. Her ₹620 prediction for Wednesday’s 9 AM Samosa sales showed overfitting (Train MAE 0, Test MAE 7): the tree had memorized her 6 rows. Limiting depth and checking with cross-validation settled it at a ₹9 MAE, steadying her stock plans. Today, we pivot: How do we use ML for classification, and can Priya classify her hours as “busy” or “slow”?

From Numbers to Labels

Priya’s models (Days 15-18) predicted sales. That is regression: the output is a number, like ₹620. Classification predicts a category instead, e.g., “busy” (rush) vs. “slow” (quiet). Day 14 introduced these as supervised ML’s two flavors: regression (sales) and classification (labels). Why classify?

Simpler Decisions: “Busy” = stock extra, “Slow” = ease up.
Patterns: Spot rush hours without exact ₹.

Her café thrives on timing—classification flags 8-9 AM as “busy,” 7 AM as “slow.” Day 19: Data Odyssey shifts to this.

Priya’s Classification Task

Her data (Day 17):

   Hour_Num  Item_Code  Day_Monday  Day_Tuesday  Weather_Rainy  Sales
0         7          0           1            0              0    200
1         8          0           1            0              0    500
2         9          1           1            0              0    600
3         7          0           0            1              1    150
4         8          0           0            1              1    550
5         9          1           0            1              1    650

Define “busy” as sales ≥ ₹500 (rush threshold), “slow” < ₹500:

   Sales  Label
0    200  Slow
1    500  Busy
2    600  Busy
3    150  Slow
4    550  Busy
5    650  Busy

Goal: Predict if Wednesday, 9 AM, Samosa, Sunny is “busy” or “slow.” Day 19: Data Odyssey sets this up.

Building a Classifier

Use a Decision Tree Classifier (like Day 17’s regressor, but for labels):

Prep Data:

import pandas as pd
data = pd.DataFrame({
    "Hour_Num": [7, 8, 9, 7, 8, 9],
    "Item_Code": [0, 0, 1, 0, 0, 1],
    "Day_Monday": [1, 1, 1, 0, 0, 0],
    "Day_Tuesday": [0, 0, 0, 1, 1, 1],
    "Weather_Rainy": [0, 0, 0, 1, 1, 1],
    "Sales": [200, 500, 600, 150, 550, 650]
})
data["Label"] = ["Slow" if s < 500 else "Busy" for s in data["Sales"]]

Split:

from sklearn.model_selection import train_test_split
X = data[["Hour_Num", "Item_Code", "Day_Monday", "Day_Tuesday", "Weather_Rainy"]]
y = data["Label"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
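
With only six rows, a plain random split can leave the test set holding a single class. The sketch below shows one way to guard against that. It assumes the same six rows, and the `stratify=y` argument is an addition that the original script does not use; the `_s` variable names are only there to keep it separate from the walkthrough.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Same six rows as in the walkthrough
data = pd.DataFrame({
    "Hour_Num": [7, 8, 9, 7, 8, 9],
    "Item_Code": [0, 0, 1, 0, 0, 1],
    "Day_Monday": [1, 1, 1, 0, 0, 0],
    "Day_Tuesday": [0, 0, 0, 1, 1, 1],
    "Weather_Rainy": [0, 0, 0, 1, 1, 1],
    "Sales": [200, 500, 600, 150, 550, 650],
})
data["Label"] = ["Slow" if s < 500 else "Busy" for s in data["Sales"]]

X = data[["Hour_Num", "Item_Code", "Day_Monday", "Day_Tuesday", "Weather_Rainy"]]
y = data["Label"]

# stratify=y (an assumption, not in the original script) asks the splitter
# to keep the Busy/Slow ratio roughly the same in train and test
X_train_s, X_test_s, y_train_s, y_test_s = train_test_split(
    X, y, test_size=0.33, random_state=42, stratify=y
)

print("Train labels:", sorted(y_train_s))
print("Test labels:", sorted(y_test_s))
```

With both labels in the test set, precision and recall for “Slow” become measurable instead of undefined.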

Train:

from sklearn.tree import DecisionTreeClassifier
model = DecisionTreeClassifier(random_state=42)
model.fit(X_train, y_train)

Predict:

y_pred = model.predict(X_test)
print("Predictions:", y_pred)
print("Actual:", y_test.values)

Output (hypothetical):

Predictions: ['Busy', 'Busy']
Actual: ['Busy', 'Busy']

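If the two test rows are, say, 8 AM (₹500) and 9 AM (₹600), both are predicted correctly. Which rows land in the test set depends on the shuffle, so your output may differ. Day 19: Data Odyssey classifies this.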

Wednesday Prediction

9 AM, Samosa, Sunny:

new_data = pd.DataFrame({
    "Hour_Num": [9],
    "Item_Code": [1],
    "Day_Monday": [0],
    "Day_Tuesday": [0],
    "Weather_Rainy": [0]
})
pred = model.predict(new_data)
print("Wednesday 9 AM Samosa (Sunny):", pred[0])

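Output: Busy, consistent with the ₹600-650 seen at 9 AM on Monday and Tuesday. Note that the model has never seen a Wednesday (both day flags are 0), so it is leaning on the hour and item. Stock extra! Day 19: Data Odyssey calls this.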
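
To see why the tree says “Busy,” it can help to print its rules. This is a rough sketch using scikit-learn’s `export_text`. For illustration it fits on all six rows rather than the training split, so the tree may differ from the one in the walkthrough; the variable names `tree`, `rules`, and `wednesday` are my own.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

data = pd.DataFrame({
    "Hour_Num": [7, 8, 9, 7, 8, 9],
    "Item_Code": [0, 0, 1, 0, 0, 1],
    "Day_Monday": [1, 1, 1, 0, 0, 0],
    "Day_Tuesday": [0, 0, 0, 1, 1, 1],
    "Weather_Rainy": [0, 0, 0, 1, 1, 1],
    "Sales": [200, 500, 600, 150, 550, 650],
})
data["Label"] = ["Slow" if s < 500 else "Busy" for s in data["Sales"]]
features = ["Hour_Num", "Item_Code", "Day_Monday", "Day_Tuesday", "Weather_Rainy"]

# Fit on all six rows (illustration only; the walkthrough fits on the training split)
tree = DecisionTreeClassifier(random_state=42).fit(data[features], data["Label"])

# Plain-text view of the learned if/else rules
rules = export_text(tree, feature_names=features)
print(rules)

# Wednesday, 9 AM, Samosa, Sunny
wednesday = pd.DataFrame([[9, 1, 0, 0, 0]], columns=features)
print("Wednesday 9 AM:", tree.predict(wednesday)[0])
```

On these six rows, the hour is the only feature that cleanly separates the labels (both 7 AM rows are “Slow,” everything later is “Busy”), so the printed rules should come down to a single check on Hour_Num.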

Full Script

Priya’s classifier:

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, classification_report

# Data
data = pd.DataFrame({
    "Hour_Num": [7, 8, 9, 7, 8, 9],
    "Item_Code": [0, 0, 1, 0, 0, 1],
    "Day_Monday": [1, 1, 1, 0, 0, 0],
    "Day_Tuesday": [0, 0, 0, 1, 1, 1],
    "Weather_Rainy": [0, 0, 0, 1, 1, 1],
    "Sales": [200, 500, 600, 150, 550, 650]
})
data["Label"] = ["Slow" if s < 500 else "Busy" for s in data["Sales"]]

# Split
X = data[["Hour_Num", "Item_Code", "Day_Monday", "Day_Tuesday", "Weather_Rainy"]]
y = data["Label"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

# Train
model = DecisionTreeClassifier(random_state=42)
model.fit(X_train, y_train)

# Evaluate
y_pred = model.predict(X_test)
print("Predictions:", y_pred)
print("Actual:", y_test.values)
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)

# Wednesday
new_data = pd.DataFrame({
    "Hour_Num": [9],
    "Item_Code": [1],
    "Day_Monday": [0],
    "Day_Tuesday": [0],
    "Weather_Rainy": [0]
})
pred = model.predict(new_data)
print("Wednesday 9 AM Samosa (Sunny):", pred[0])

Output (hypothetical, as above):

Predictions: ['Busy', 'Busy']
Actual: ['Busy', 'Busy']
Accuracy: 1.0
Wednesday 9 AM Samosa (Sunny): Busy

100% on the test set, but that set is only two rows, so read it as a sanity check rather than proof that the model generalizes. Day 19: Data Odyssey runs this.

Evaluation Metrics

Regression used MAE (Day 16)—classification uses:

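Accuracy: Correct predictions / total predictions (1.0 = 100%).
Precision: Correctly predicted “Busy” / all predicted “Busy” (high precision avoids overstocking).
Recall: Correctly predicted “Busy” / all actual “Busy” (high recall catches every rush).

Add this line to the script: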

print(classification_report(y_test, y_pred))

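Output (hypothetical, matching the run above):

              precision    recall  f1-score   support

        Busy       1.00      1.00      1.00         2

    accuracy                           1.00         2
   macro avg       1.00      1.00      1.00         2
weighted avg       1.00      1.00      1.00         2

Perfect on paper, but only “Busy” appears in this test set, so the report has no “Slow” row and says nothing about how well quiet hours are caught. Day 19: Data Odyssey scores this.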
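
A perfect score hides how precision and recall are actually counted. The worked example below uses made-up labels for six hours (hypothetical, not Priya’s results) with a few deliberate mistakes, computes both metrics by hand, and compares them with scikit-learn’s `precision_score` and `recall_score`.

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Hypothetical labels for six hours (illustration only, not Priya's results)
y_true = ["Busy", "Busy", "Slow", "Busy", "Slow", "Slow"]
y_hat = ["Busy", "Slow", "Slow", "Busy", "Busy", "Slow"]

pairs = list(zip(y_true, y_hat))
tp = sum(t == "Busy" and p == "Busy" for t, p in pairs)  # rush correctly flagged
fp = sum(t == "Slow" and p == "Busy" for t, p in pairs)  # false alarm: overstock
fn = sum(t == "Busy" and p == "Slow" for t, p in pairs)  # missed rush: shortage

precision_manual = tp / (tp + fp)
recall_manual = tp / (tp + fn)

print("TP, FP, FN:", tp, fp, fn)
print("Precision (manual):", precision_manual)
print("Recall (manual):", recall_manual)

# The library versions should agree when "Busy" is treated as the positive class
print("Precision (sklearn):", precision_score(y_true, y_hat, pos_label="Busy"))
print("Recall (sklearn):", recall_score(y_true, y_hat, pos_label="Busy"))
print("Accuracy:", accuracy_score(y_true, y_hat))
```

Here one false alarm and one missed rush pull both precision and recall down to 2/3, even though 4 of the 6 hours are labeled correctly.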

Overfitting Risk

Day 18’s warning sign was a gap between training and test error (Train MAE 0, Test MAE 7). Run the same check with accuracy:

train_pred = model.predict(X_train)
print("Train Accuracy:", accuracy_score(y_train, train_pred))
print("Test Accuracy:", accuracy_score(y_test, y_pred))

Output:

Train Accuracy: 1.0
Test Accuracy: 1.0

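Both 100%, so there is no train/test gap this time. That does not rule out overfitting: with 4 training rows and 2 test rows, a perfect score is easy to hit by chance. Day 12’s 35 rows are a fairer test. Day 19: Data Odyssey flags it.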
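
A two-row test set is a thin basis for any conclusion. One option, in the spirit of Day 18’s cross-validation, is leave-one-out: train on five rows, test on the sixth, and repeat for every row. This is a minimal sketch assuming the same six rows; using `LeaveOneOut` here is my suggestion, not something the original script does.

```python
import pandas as pd
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.tree import DecisionTreeClassifier

data = pd.DataFrame({
    "Hour_Num": [7, 8, 9, 7, 8, 9],
    "Item_Code": [0, 0, 1, 0, 0, 1],
    "Day_Monday": [1, 1, 1, 0, 0, 0],
    "Day_Tuesday": [0, 0, 0, 1, 1, 1],
    "Weather_Rainy": [0, 0, 0, 1, 1, 1],
    "Sales": [200, 500, 600, 150, 550, 650],
})
data["Label"] = ["Slow" if s < 500 else "Busy" for s in data["Sales"]]

X = data[["Hour_Num", "Item_Code", "Day_Monday", "Day_Tuesday", "Weather_Rainy"]]
y = data["Label"]

# Each fold holds out exactly one row, so every score is either 0.0 or 1.0
scores = cross_val_score(
    DecisionTreeClassifier(random_state=42), X, y, cv=LeaveOneOut()
)
print("Per-row accuracy:", scores)
print("Mean leave-one-out accuracy:", scores.mean())
```

Every row gets a turn as unseen data, which makes fuller use of six rows than a single 4/2 split. It still cannot stand in for collecting more data.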

Why Classification?

Simpler: “Busy” vs. ₹620—easier call.
Actionable: “Busy” = 40 samosas, no math.
Flexible: Adjust threshold (₹400?).

Priya’s “Busy” at 9 AM—stock up, no guesswork. Day 19: Data Odyssey shifts her.
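
The “adjust the threshold” idea is easy to try out. The sketch below wraps the labeling rule in a small helper (`label_hours` is a hypothetical name, not part of the original script) and counts the labels at three cutoffs.

```python
import pandas as pd

# Priya's six sales figures
sales = pd.Series([200, 500, 600, 150, 550, 650], name="Sales")

def label_hours(sales, threshold=500):
    """Hypothetical helper: 'Busy' where sales >= threshold, else 'Slow'."""
    return sales.apply(lambda s: "Busy" if s >= threshold else "Slow")

for t in (400, 500, 600):
    counts = label_hours(sales, t).value_counts().to_dict()
    print(f"Threshold ₹{t}: {counts}")
```

On these six rows, ₹400 and ₹500 give identical labels because no sale falls between them, while ₹600 moves both 8 AM rows (₹500 and ₹550) to “Slow.”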

Real-World Classification

India’s railways classify “peak” hours—extra trains. Amazon tags “high demand” items—stock rises. Priya’s “Busy” hours mirror this—small scale, big play. Day 19: Data Odyssey connects her.

Challenges

Small Data: 6 rows—overfit looms.
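Threshold: ₹500 is arbitrary. A different cutoff relabels any hour whose sales fall between the old and new values (none of the six rows sit between ₹450 and ₹500, but future rows could).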
Balance: 4 Busy, 2 Slow—skewed.

More data (35 rows) and tuning help with the first and third; the threshold stays a business decision. Day 19: Data Odyssey notes it.
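
Because the labels are skewed 4 to 2, any accuracy figure needs a reference point. This rough sketch uses scikit-learn’s `DummyClassifier` to check what a model would score if it always guessed the majority label. Fitting and scoring on all six rows is a simplification for illustration.

```python
import pandas as pd
from sklearn.dummy import DummyClassifier

data = pd.DataFrame({
    "Hour_Num": [7, 8, 9, 7, 8, 9],
    "Item_Code": [0, 0, 1, 0, 0, 1],
    "Day_Monday": [1, 1, 1, 0, 0, 0],
    "Day_Tuesday": [0, 0, 0, 1, 1, 1],
    "Weather_Rainy": [0, 0, 0, 1, 1, 1],
    "Sales": [200, 500, 600, 150, 550, 650],
})
data["Label"] = ["Slow" if s < 500 else "Busy" for s in data["Sales"]]

X = data[["Hour_Num", "Item_Code", "Day_Monday", "Day_Tuesday", "Weather_Rainy"]]
y = data["Label"]

# How skewed are the labels?
print(y.value_counts())

# Baseline that ignores the features and always predicts the most common label
baseline = DummyClassifier(strategy="most_frequent").fit(X, y)
baseline_acc = baseline.score(X, y)
print("Always-guess-majority accuracy:", round(baseline_acc, 3))
```

Always answering “Busy” is already right 4 times out of 6, so a classifier only earns its keep by beating that figure, ideally on rows it has not seen.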

Why This Matters

Classification flags Priya’s 9 AM as “Busy”: 40 samosas, less waste, fewer shortages. Without it, she has to translate a ₹620 estimate into a stocking decision; with it, the label is the decision. Scale it up: flagging traffic as “busy” helps route cars around India’s jams. Day 19: Data Odyssey labels her success.

Recap Summary

Yesterday, Day 18: Data Odyssey balanced Priya’s model, with overfitting cut to a ₹9 MAE. Today, Day 19: Data Odyssey switched to classification: a Decision Tree tagged hours as “Busy” (≥₹500), scored 100% on a two-row test set, and called Wednesday 9 AM “Busy.” It’s her new lens.

What’s Next

Tomorrow, in Day 20: Data Odyssey – How Do We Optimize Classification Models?, we’ll optimize Priya’s classifier: How do we cut overfit? Boost recall? We’ll tune her Decision Tree and test more data, refining “Busy.” Bring your curiosity, and I’ll see you there!

Author

Vinay Karanam
