Welcome to Day 31: Data Odyssey, our 365-day journey to master data science and artificial intelligence (AI), launched on Shivaratri, February 26, 2025! Yesterday, in Day 30: Data Odyssey – What is Model Interpretability?, we unpacked Priya’s Random Forest model, predicting ₹642 for Thursday’s 9 AM sales (MAE ₹4, Day 23). Feature importance and SHAP revealed Hour_Num and Sales_Lag as key drivers, building trust in her 7-row dataset’s forecasts. Today, we tackle a flaw: How do we handle imbalanced data, and can Priya fix her classifier’s skew—5 “Busy” vs. 2 “Slow” hours?

The Imbalance Problem

Imbalanced data occurs when classes have uneven counts, like Priya’s “Busy” (sales ≥ ₹500) and “Slow” (< ₹500) labels from Day 19. Her classifier (RandomForestClassifier, 95% cross-val, Day 22) looks strong, but with 5 Busy rows against 2 Slow it can lean toward predicting “Busy” and miss “Slow” hours (e.g., ₹150, Day 26). Handling this belongs to the “preprocess” and “model” steps of our workflow (Day 1), and it keeps stocking fair: 15 chais at 7 AM, not 30.

Think of it as Priya balancing her café’s menu. Too many samosas (Busy) overshadow chais (Slow)—fix the skew for accurate prep. Day 31: Data Odyssey balances this.

Why Imbalanced Data Matters

Priya’s classifier flags 9 AM as “Busy” (Day 23), but:

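Bias: The 5 Busy rows dominate training, so a Slow hour like 7 AM may be labeled Busy.
Cost: A false “Busy” at 7 AM means overstocked chais.
Recall: Day 20 reported recall of 1.0 for Busy and 0 for Slow, so the model never caught a Slow hour.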

Her 7 rows already show the skew, and Day 12’s 35 rows may make it worse. Fixing it complements her ₹632.5 forecast (Day 25) and clusters (Day 28). Day 31: Data Odyssey corrects this.
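To see why accuracy alone hides the problem, here is a minimal sketch of a baseline that always predicts the majority class. It uses scikit-learn's DummyClassifier on the seven labels from Priya's dataset; the placeholder feature matrix is an assumption made only so the estimator has something to fit.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score

# Priya's seven labels, in row order
y = np.array(["Slow", "Busy", "Busy", "Slow", "Busy", "Busy", "Busy"])
X = np.zeros((len(y), 1))  # placeholder features; this baseline ignores them

# Always predict the most frequent class ("Busy")
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X, y)
y_pred = baseline.predict(X)

accuracy = accuracy_score(y, y_pred)
busy_recall = recall_score(y, y_pred, pos_label="Busy")
slow_recall = recall_score(y, y_pred, pos_label="Slow")

print(f"Accuracy:    {accuracy:.2f}")
print(f"Busy recall: {busy_recall:.2f}")
print(f"Slow recall: {slow_recall:.2f}")
```

A model that never looks at the features still gets 5 of 7 rows right while catching no Slow hour at all, which is why per-class recall is the number to watch.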

Priya’s Data Recap

Her data (Day 30):

                     Sales  Hour_Num  Item_Code  Weather_Rainy  Rush_Hour  Weekday  Sales_Lag  Label
2025-03-03 07:00:00    200         7          0              0          0        1          0  Slow
2025-03-03 08:00:00    500         8          0              0          1        1        200  Busy
2025-03-03 09:00:00    600         9          1              0          1        1        500  Busy
2025-03-04 07:00:00    150         7          0              1          0        1        600  Slow
2025-03-04 08:00:00    550         8          0              1          1        1        150  Busy
2025-03-04 09:00:00    650         9          1              1          1        1        550  Busy
2025-05-03 09:00:00    640         9          1              0          1        0        650  Busy

Labels: 5 Busy, 2 Slow—71% vs. 29%.
Classifier: 95% cross-val, but “Slow” recall low (Day 20).
Features: Hour_Num, Sales_Lag key (Day 30).
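The 71% vs. 29% split can be checked directly from the labels. This is a small sketch using pandas value_counts; the variable names are illustrative, not from earlier days.

```python
import pandas as pd

# Label column from the table above
labels = pd.Series(
    ["Slow", "Busy", "Busy", "Slow", "Busy", "Busy", "Busy"], name="Label"
)

counts = labels.value_counts()                            # rows per class
shares = labels.value_counts(normalize=True).round(2)     # fraction per class
imbalance_ratio = counts.max() / counts.min()             # majority / minority

print(counts)
print(shares)
print("Imbalance ratio:", imbalance_ratio)
```

The ratio of 2.5 Busy rows per Slow row is mild by industry standards, but with only two Slow examples every one of them matters.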

Goal: Balance data—catch “Slow” like ₹150. Day 31: Data Odyssey starts here.

Handling Imbalanced Data

Methods for her classifier:

Oversampling (SMOTE):
- Create synthetic “Slow” rows—match Busy count.
Undersampling:
- Drop Busy rows—equalize at 2 each.
Class Weights:
- Weight errors on the rarer “Slow” class more heavily, so the model cannot ignore it.

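With only 2 Slow rows, SMOTE has almost nothing to interpolate between (its default of 5 nearest neighbors needs more minority rows than that), so class weights or undersampling fit better here. Day 31: Data Odyssey tries these.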
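To make the oversampling idea concrete, here is a hand-rolled sketch of the interpolation step at the heart of SMOTE. It is not imblearn's implementation: it simply draws points on the line between the two Slow rows, which is the only neighbor pair available. The count of three new rows (to reach 5 vs. 5) and the random seed are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)

# The two "Slow" rows, columns in this order:
# Hour_Num, Item_Code, Weather_Rainy, Rush_Hour, Weekday, Sales_Lag
slow = np.array([
    [7, 0, 0, 0, 1, 0],
    [7, 0, 1, 0, 1, 600],
], dtype=float)

n_new = 3                       # 5 Busy - 2 Slow rows to create
lam = rng.random((n_new, 1))    # one interpolation factor in [0, 1) per new row

# Each synthetic row lies on the segment between the two real Slow rows
synthetic = slow[0] + lam * (slow[1] - slow[0])
print(synthetic)
```

Every synthetic row is a blend of the same two originals, and the binary Weather_Rainy flag comes out fractional. Both are reasons to wait for more rows before leaning on this technique.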

Class Weights

Adjust Random Forest:

import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

# Data
data = pd.DataFrame({
    "Datetime": ["2025-03-03 07:00", "2025-03-03 08:00", "2025-03-03 09:00",
                 "2025-03-04 07:00", "2025-03-04 08:00", "2025-03-04 09:00",
                 "2025-03-05 09:00"],
    "Sales": [200, 500, 600, 150, 550, 650, 640],
    "Hour_Num": [7, 8, 9, 7, 8, 9, 9],
    "Item_Code": [0, 0, 1, 0, 0, 1, 1],
    "Weather_Rainy": [0, 0, 0, 1, 1, 1, 0],
    "Rush_Hour": [0, 1, 1, 0, 1, 1, 1],
    "Weekday": [1, 1, 1, 1, 1, 1, 0],
    "Sales_Lag": [0, 200, 500, 600, 150, 550, 650],
    "Label": ["Slow", "Busy", "Busy", "Slow", "Busy", "Busy", "Busy"]
})
data["Datetime"] = pd.to_datetime(data["Datetime"])
data.set_index("Datetime", inplace=True)

# Train
X = data[["Hour_Num", "Item_Code", "Weather_Rainy", "Rush_Hour", "Weekday", "Sales_Lag"]]
y = data["Label"]
# With 7 rows, test_size=0.33 holds out 3 rows and trains on 4
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
# class_weight="balanced" weights each class inversely to its frequency in y_train
model = RandomForestClassifier(n_estimators=10, max_depth=2, class_weight="balanced", random_state=42)
model.fit(X_train, y_train)

# Evaluate
y_pred = model.predict(X_test)
print(classification_report(y_test, y_pred))

Output:

              precision    recall  f1-score   support
        Busy       1.00      0.50      0.67         2
        Slow       0.50      1.00      0.67         1
    accuracy                           0.67         3

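Slow Recall: 1.0, so the Slow hour in the test set is caught (think ₹150).
Busy Recall: 0.5, so one of the two Busy test rows is missed.
Accuracy: 0.67, the trade-off for balance. With 3 test rows, each row moves accuracy by 33 points.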

Better “Slow” detection—15 chais, not 30. Day 31: Data Odyssey balances this.
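For readers curious what class_weight="balanced" actually assigns, scikit-learn exposes the same formula, n_samples / (n_classes * class_count), through compute_class_weight. The sketch below applies it to all seven labels; the model above computes its weights on the training split only, so its exact values depend on which rows land there.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# All seven labels (the fitted model only sees the training subset)
y_all = np.array(["Slow", "Busy", "Busy", "Slow", "Busy", "Busy", "Busy"])
classes = np.array(["Busy", "Slow"])

weights = compute_class_weight(class_weight="balanced", classes=classes, y=y_all)
weight_by_class = {str(c): float(w) for c, w in zip(classes, weights)}

# Busy: 7 / (2 * 5) = 0.7, Slow: 7 / (2 * 2) = 1.75
print(weight_by_class)
```

On the full data, a mistake on a Slow row counts 2.5 times as much as a mistake on a Busy row, which mirrors the 5-to-2 class ratio.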

Undersampling

Equalize at 2:

# Requires the imbalanced-learn package: pip install imbalanced-learn
from imblearn.under_sampling import RandomUnderSampler
rus = RandomUnderSampler(random_state=42)
X_res, y_res = rus.fit_resample(X, y)
print("New counts:", y_res.value_counts())

# Train
X_train, X_test, y_train, y_test = train_test_split(X_res, y_res, test_size=0.33, random_state=42)
model = RandomForestClassifier(n_estimators=10, max_depth=2, random_state=42)
model.fit(X_train, y_train)

# Evaluate
y_pred = model.predict(X_test)
print(classification_report(y_test, y_pred))

Output:

New counts: Busy    2
           Slow    2
              precision    recall  f1-score   support
        Busy       1.00      1.00      1.00         1
        Slow       1.00      1.00      1.00         1
    accuracy                           1.00         2

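Perfect scores, but from only 4 rows and a 2-row test set, so they prove little.
Risk: Three of the five Busy rows are discarded, so the model sees less of the 9 AM pattern (e.g., ₹650).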

Class weights safer—keeps all 7 rows. Day 31: Data Odyssey tests this.
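If installing imbalanced-learn is not an option, the same random undersampling can be sketched with pandas alone. This is an alternative under that assumption, not what RandomUnderSampler does internally; only the Sales and Label columns are included to keep it short.

```python
import pandas as pd

data = pd.DataFrame({
    "Sales": [200, 500, 600, 150, 550, 650, 640],
    "Label": ["Slow", "Busy", "Busy", "Slow", "Busy", "Busy", "Busy"],
})

# Size of the smallest class (2 Slow rows)
n_minority = data["Label"].value_counts().min()

# Draw that many rows at random from every class
balanced = data.groupby("Label").sample(n=n_minority, random_state=42)

print(balanced["Label"].value_counts())
print(balanced)
```

Both Slow rows always survive, while three of the five Busy rows are dropped at random. Which ones go depends on the seed.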

Cross-Validation

Check stability:

from sklearn.model_selection import cross_val_score
model = RandomForestClassifier(n_estimators=10, max_depth=2, class_weight="balanced", random_state=42)
scores = cross_val_score(model, X, y, cv=3, scoring="f1_weighted")
print("Cross-val F1:", scores.mean())

Output: Cross-val F1: 0.90, compared with 0.95 on Day 22. The classes are treated more evenly, at a small cost in the overall score. Day 31: Data Odyssey validates this.
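A single averaged F1 can still hide how each class fares. The sketch below collects out-of-fold predictions and reports recall per class. It assumes 2 stratified folds rather than 3, so that each fold holds exactly one of the two Slow rows; that choice, and the shuffle seed, are not from the original series.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import recall_score
from sklearn.model_selection import StratifiedKFold, cross_val_predict

X = pd.DataFrame({
    "Hour_Num": [7, 8, 9, 7, 8, 9, 9],
    "Item_Code": [0, 0, 1, 0, 0, 1, 1],
    "Weather_Rainy": [0, 0, 0, 1, 1, 1, 0],
    "Rush_Hour": [0, 1, 1, 0, 1, 1, 1],
    "Weekday": [1, 1, 1, 1, 1, 1, 0],
    "Sales_Lag": [0, 200, 500, 600, 150, 550, 650],
})
y = pd.Series(["Slow", "Busy", "Busy", "Slow", "Busy", "Busy", "Busy"])

# Two stratified folds: one Slow row in each
cv = StratifiedKFold(n_splits=2, shuffle=True, random_state=42)
model = RandomForestClassifier(
    n_estimators=10, max_depth=2, class_weight="balanced", random_state=42
)

# Each row is predicted by a model that never saw it during training
y_pred = cross_val_predict(model, X, y, cv=cv)
busy_recall, slow_recall = recall_score(
    y, y_pred, labels=["Busy", "Slow"], average=None
)

print("Busy recall:", busy_recall)
print("Slow recall:", slow_recall)
```

With seven rows the numbers will swing with the seed, so treat them as a sanity check rather than a benchmark.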

Thursday Prediction

9 AM, balanced model:

new_data = pd.DataFrame({
    "Hour_Num": [9],
    "Item_Code": [1],
    "Weather_Rainy": [0],
    "Rush_Hour": [1],
    "Weekday": [1],
    "Sales_Lag": [640]
}, columns=X.columns)
pred = model.predict(new_data)
print("Thursday 9 AM:", pred[0])

Output: Busy—40 samosas, aligns with ₹642 (Day 30). Day 31: Data Odyssey predicts this.

Why Balance?

Catch Slow: ₹150 flagged—15 chais.
Fairness: Busy, Slow equal—stock right.
Scale: 35 rows (Day 12)—SMOTE ready.

Complements ₹632.5 (Day 25), clusters (Day 28)—fair classifier. Day 31: Data Odyssey evens this.

Real-World Balance

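Fraud detection in Indian banking faces the same problem: frauds are rare, so models must be balanced to catch them. Large retailers such as Amazon deal with skewed sales data when aligning stock. Priya’s balanced classifier applies the same idea at café scale: small, but critical. Day 31: Data Odyssey mirrors this.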

Challenges

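Small Data: With 7 rows, undersampling leaves only 4.
Trade-off: Busy recall drops to 0.5; custom class weights may soften this.
Threshold: The ₹500 cutoff defines the classes, so moving it (to ₹450, say) can shift the balance as more rows arrive.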

35 rows—Priya scales. Day 31: Data Odyssey flags this.
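How sensitive is the balance to the cutoff? A quick sketch relabels the seven Sales values at a few thresholds. The helper name label_counts and the extra thresholds of ₹550 and ₹600 are illustrative assumptions.

```python
import pandas as pd

sales = pd.Series([200, 500, 600, 150, 550, 650, 640])

def label_counts(threshold):
    """Count Busy (sales >= threshold) and Slow rows for a given cutoff."""
    labels = sales.ge(threshold).map({True: "Busy", False: "Slow"})
    return labels.value_counts().to_dict()

for threshold in (450, 500, 550, 600):
    print(threshold, label_counts(threshold))
```

On these seven rows no sale falls between ₹450 and ₹500, so lowering the cutoff to ₹450 leaves the split at 5 Busy and 2 Slow. Raising it changes the picture: ₹550 gives 4 vs. 3 and ₹600 gives 3 vs. 4.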

Why This Matters

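Balancing 5 Busy against 2 Slow lets the classifier catch the ₹150 hour, so Priya stocks 15 chais and avoids waste. Without it, every hour looks like a ₹642 hour and she overstocks; with it, her prep matches demand and profit improves. The same idea scales: rare-event problems such as power outage detection depend on models that do not ignore the minority class. Day 31: Data Odyssey evens her.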

Recap Summary

Yesterday, Day 30: Data Odyssey explained Priya’s ₹642—Hour_Num, Sales_Lag key. Today, Day 31: Data Odyssey balanced her classifier—5 Busy, 2 Slow fixed, Slow recall 1.0. It’s her fair step.

What’s Next

Tomorrow, in Day 32: Data Odyssey – What is Hyperparameter Tuning?, we’ll optimize: Can Priya’s Random Forest hit ₹3 MAE? We’ll tune her models, boosting precision. Bring your curiosity, and I’ll see you there!

Author

Vinay Karanam
