
Day 31: Data Odyssey – How Do We Handle Imbalanced Data?

Welcome to Day 31: Data Odyssey, our 365-day journey to master data science and artificial intelligence (AI), launched on Shivaratri, February 26, 2025! Yesterday, in Day 30: Data Odyssey – What is Model Interpretability?, we unpacked Priya’s Random Forest model, predicting ₹642 for Thursday’s 9 AM sales (MAE ₹4, Day 23). Feature importance and SHAP revealed Hour_Num and Sales_Lag as key drivers, building trust in her 7-row dataset’s forecasts. Today, we tackle a flaw: How do we handle imbalanced data, and can Priya fix her classifier’s skew—5 “Busy” vs. 2 “Slow” hours?

The Imbalance Problem

Imbalanced data occurs when classes have uneven counts. Priya’s classes are “Busy” (sales ≥ ₹500) and “Slow” (sales < ₹500), defined on Day 19. Her classifier (RandomForestClassifier, 95% cross-validation score, Day 22) looks strong, but with 5 Busy rows against 2 Slow it can lean toward predicting “Busy” and miss Slow hours such as the ₹150 one from Day 26. Handling this belongs to the “preprocess” and “model” steps of our workflow (Day 1), and it matters for stocking: 15 chais at 7 AM, not 30.

Think of it as Priya balancing her café’s menu. If samosas (Busy) crowd out chais (Slow), prep goes wrong; fixing the skew keeps it accurate.

Why Imbalanced Data Matters

Priya’s classifier flags 9 AM as “Busy” (Day 23), but the skew causes problems:

  • Bias: the 5 Busy rows dominate training, so a Slow 7 AM can be missed.
  • Cost: a false “Busy” at 7 AM means overstocked chais.
  • Recall: on Day 20, recall was 1.0 for Busy and 0 for Slow, so Slow predictions are unreliable.

Her 7 rows already show the skew, and Day 12’s 35 rows may make it worse. Fixing it complements her ₹632.5 forecast (Day 25) and her clusters (Day 28).
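To see why a recall of 1.0 for Busy and 0 for Slow is a warning sign, here is a minimal sketch of a hypothetical baseline that always predicts the majority class. The `recall` helper and the always-Busy predictor are illustrations, not Priya’s actual model; the labels are her 7 rows (5 Busy, 2 Slow).

```python
# Labels for Priya's 7 rows, in table order.
y_true = ["Slow", "Busy", "Busy", "Slow", "Busy", "Busy", "Busy"]
# Hypothetical baseline (an assumption for illustration): always predict "Busy".
y_pred = ["Busy"] * len(y_true)

def recall(label, y_true, y_pred):
    """Share of rows whose true class is `label` that were predicted as `label`."""
    preds_for_label = [p for t, p in zip(y_true, y_pred) if t == label]
    return sum(p == label for p in preds_for_label) / len(preds_for_label)

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
busy_recall = recall("Busy", y_true, y_pred)
slow_recall = recall("Slow", y_true, y_pred)

print(f"accuracy={accuracy:.2f} busy_recall={busy_recall:.1f} slow_recall={slow_recall:.1f}")
# accuracy=0.71 busy_recall=1.0 slow_recall=0.0
```

A model that never learns “Slow” still scores 71% accuracy here, which is why per-class recall is the number to watch.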

Priya’s Data Recap

Her data (Day 30):

                     Sales  Hour_Num  Item_Code  Weather_Rainy  Rush_Hour  Weekday  Sales_Lag  Label
2025-03-03 07:00:00    200         7          0              0          0        1          0  Slow
2025-03-03 08:00:00    500         8          0              0          1        1        200  Busy
2025-03-03 09:00:00    600         9          1              0          1        1        500  Busy
2025-03-04 07:00:00    150         7          0              1          0        1        600  Slow
2025-03-04 08:00:00    550         8          0              1          1        1        150  Busy
2025-03-04 09:00:00    650         9          1              1          1        1        550  Busy
2025-03-05 09:00:00    640         9          1              0          1        0        650  Busy

  • Labels: 5 Busy, 2 Slow (71% vs. 29%).
  • Classifier: 95% cross-validation score, but low “Slow” recall (Day 20).
  • Features: Hour_Num and Sales_Lag are the key drivers (Day 30).
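The 71% vs. 29% split can be checked directly from the Sales column. This is a small sketch that assumes the Day 19 rule (Busy means sales ≥ ₹500); the variable names are illustrative.

```python
from collections import Counter

THRESHOLD = 500  # Busy means sales >= Rs 500 (Day 19 rule)
sales = [200, 500, 600, 150, 550, 650, 640]  # Sales column from the table

labels = ["Busy" if s >= THRESHOLD else "Slow" for s in sales]
counts = Counter(labels)
shares = {label: n / len(labels) for label, n in counts.items()}
imbalance_ratio = max(counts.values()) / min(counts.values())

for label in sorted(counts):
    print(f"{label}: {counts[label]} rows ({shares[label]:.0%})")
print(f"majority:minority ratio = {imbalance_ratio:.1f}")
# Busy: 5 rows (71%)
# Slow: 2 rows (29%)
# majority:minority ratio = 2.5
```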

Goal: balance the data so the classifier catches Slow hours like ₹150.

Handling Imbalanced Data

Three methods for her classifier:

  1. Oversampling (SMOTE):
    • Create synthetic “Slow” rows until they match the Busy count.
  2. Undersampling:
    • Drop Busy rows until both classes have 2 rows each.
  3. Class Weights:
    • Weight errors on the rare “Slow” class more heavily, so the model cannot ignore it.

With only 2 Slow rows, SMOTE has almost nothing to interpolate between, so class weights or undersampling fit better for now.
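To make the SMOTE limitation concrete, here is a hand-rolled sketch of SMOTE-style interpolation using NumPy. It is not imbalanced-learn’s implementation: the `smote_like` function is hypothetical, and with only 2 Slow rows each row’s single “neighbour” is simply the other row.

```python
import numpy as np

rng = np.random.default_rng(42)

# The two Slow rows from Priya's table, columns in this order:
# Hour_Num, Item_Code, Weather_Rainy, Rush_Hour, Weekday, Sales_Lag
slow = np.array([
    [7, 0, 0, 0, 1, 0],
    [7, 0, 1, 0, 1, 600],
], dtype=float)

def smote_like(minority, n_new, rng):
    """Create rows on the line segment between two different minority rows."""
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(len(minority))
        # Pick any other minority row as the neighbour (only one exists here).
        j = rng.choice([k for k in range(len(minority)) if k != i])
        lam = rng.random()  # position along the segment, in [0, 1)
        synthetic.append(minority[i] + lam * (minority[j] - minority[i]))
    return np.array(synthetic)

# 2 real + 3 synthetic Slow rows would match the 5 Busy rows.
new_slow = smote_like(slow, n_new=3, rng=rng)
print(new_slow.round(2))
```

Every synthetic row lies on the same segment: Weather_Rainy becomes a fraction between 0 and 1, and Sales_Lag slides between 0 and 600, so the new rows add little real information. Note also that imbalanced-learn’s SMOTE defaults to `k_neighbors=5`, which is more neighbours than 2 Slow rows can supply.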

Class Weights

Adjust the Random Forest with class_weight="balanced":

import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

# Data
data = pd.DataFrame({
    "Datetime": ["2025-03-03 07:00", "2025-03-03 08:00", "2025-03-03 09:00",
                 "2025-03-04 07:00", "2025-03-04 08:00", "2025-03-04 09:00",
                 "2025-03-05 09:00"],
    "Sales": [200, 500, 600, 150, 550, 650, 640],
    "Hour_Num": [7, 8, 9, 7, 8, 9, 9],
    "Item_Code": [0, 0, 1, 0, 0, 1, 1],
    "Weather_Rainy": [0, 0, 0, 1, 1, 1, 0],
    "Rush_Hour": [0, 1, 1, 0, 1, 1, 1],
    "Weekday": [1, 1, 1, 1, 1, 1, 0],
    "Sales_Lag": [0, 200, 500, 600, 150, 550, 650],
    "Label": ["Slow", "Busy", "Busy", "Slow", "Busy", "Busy", "Busy"]
})
data["Datetime"] = pd.to_datetime(data["Datetime"])
data.set_index("Datetime", inplace=True)

# Train: hold out 33% of the data (3 of 7 rows) for testing
X = data[["Hour_Num", "Item_Code", "Weather_Rainy", "Rush_Hour", "Weekday", "Sales_Lag"]]
y = data["Label"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
# class_weight="balanced" weights each class inversely to its frequency in y_train
model = RandomForestClassifier(n_estimators=10, max_depth=2, class_weight="balanced", random_state=42)
model.fit(X_train, y_train)

# Evaluate: per-class precision and recall on the held-out rows
y_pred = model.predict(X_test)
print(classification_report(y_test, y_pred))

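Output:

              precision    recall  f1-score   support
        Busy       1.00      0.50      0.67         2
        Slow       0.50      1.00      0.67         1
    accuracy                           0.67         3

  • Slow recall: 1.0. The Slow hour in the test set is caught, which is what hours like ₹150 need.
  • Busy recall: 0.5. One of the two Busy hours is misread as Slow.
  • Accuracy: 0.67. That is the trade-off for balance, measured on only 3 test rows.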

Better “Slow” detection means 15 chais, not 30.
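What “balanced” means numerically can be sketched with scikit-learn’s documented heuristic, n_samples / (n_classes × class count). The calculation below uses all 7 labels for illustration; the model above is fit on the training split only, so the weights it actually uses depend on which rows land in that split.

```python
from collections import Counter

y = ["Slow", "Busy", "Busy", "Slow", "Busy", "Busy", "Busy"]
counts = Counter(y)

# "balanced" heuristic: n_samples / (n_classes * count_of_class)
n_samples, n_classes = len(y), len(counts)
weights = {label: n_samples / (n_classes * n) for label, n in counts.items()}

for label in sorted(weights):
    print(f"{label}: weight {weights[label]:.2f}")
# Busy: weight 0.70
# Slow: weight 1.75
```

Each Slow row counts 2.5 times as much as a Busy row (1.75 / 0.70), which is exactly the inverse of the 5:2 class ratio.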

Undersampling

Equalize at 2:

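# RandomUnderSampler comes from the imbalanced-learn package (pip install imbalanced-learn)
from imblearn.under_sampling import RandomUnderSampler
rus = RandomUnderSampler(random_state=42)
X_res, y_res = rus.fit_resample(X, y)  # randomly drops Busy rows until both classes have 2
print("New counts:", y_res.value_counts())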

# Train
X_train, X_test, y_train, y_test = train_test_split(X_res, y_res, test_size=0.33, random_state=42)
model = RandomForestClassifier(n_estimators=10, max_depth=2, random_state=42)
model.fit(X_train, y_train)

# Evaluate
y_pred = model.predict(X_test)
print(classification_report(y_test, y_pred))

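Output:

New counts: Busy    2
            Slow    2
              precision    recall  f1-score   support
        Busy       1.00      1.00      1.00         1
        Slow       1.00      1.00      1.00         1
    accuracy                           1.00         2

  • Perfect scores, but from only 4 rows: 2 to train on and 2 to test on.
  • Risk: three Busy rows are thrown away, so Busy hours like 9 AM at ₹650 are learned from less data.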

Class weights are safer because they keep all 7 rows.

Cross-Validation

Check stability:

from sklearn.model_selection import cross_val_score
model = RandomForestClassifier(n_estimators=10, max_depth=2, class_weight="balanced", random_state=42)
scores = cross_val_score(model, X, y, cv=3, scoring="f1_weighted")
print("Cross-val F1:", scores.mean())

Output: Cross-val F1: 0.90, against 0.95 on Day 22. The balanced model scores slightly lower overall in exchange for paying attention to Slow hours.
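With only 2 Slow rows, it helps to look at what the 3 folds contain. For a classifier, an integer `cv` in scikit-learn uses `StratifiedKFold`, so this sketch inspects those folds directly. The dummy feature matrix is an assumption made only to keep the example short; fold membership depends on the labels.

```python
import warnings

import numpy as np
from sklearn.model_selection import StratifiedKFold

y = np.array(["Slow", "Busy", "Busy", "Slow", "Busy", "Busy", "Busy"])
X_dummy = np.zeros((len(y), 1))  # features do not affect how the folds are built

skf = StratifiedKFold(n_splits=3)  # what cv=3 resolves to for a classifier
slow_per_fold = []
with warnings.catch_warnings():
    # scikit-learn warns that "Slow" has fewer members (2) than folds (3)
    warnings.simplefilter("ignore")
    for fold, (train_idx, test_idx) in enumerate(skf.split(X_dummy, y), start=1):
        n_slow = int(np.sum(y[test_idx] == "Slow"))
        slow_per_fold.append(n_slow)
        print(f"fold {fold}: test size={len(test_idx)}, Slow rows in test={n_slow}")
```

Two Slow rows cannot cover three test folds, so at least one fold is scored on Busy rows alone. The mean F1 is therefore a rough signal on 7 rows and should firm up as Priya collects more data.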

Thursday Prediction

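9 AM on Thursday, using the balanced model:

# cross_val_score only fits internal copies, so fit the model itself before predicting
model.fit(X, y)  # assumption: refit on all 7 rows

new_data = pd.DataFrame({
    "Hour_Num": [9],
    "Item_Code": [1],
    "Weather_Rainy": [0],
    "Rush_Hour": [1],
    "Weekday": [1],
    "Sales_Lag": [640]
}, columns=X.columns)
pred = model.predict(new_data)
print("Thursday 9 AM:", pred[0])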

Output: Busy. That means prepping 40 samosas, and it agrees with the ₹642 forecast (Day 30).

Why Balance?

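  • Catch Slow: hours like ₹150 get flagged, so Priya preps 15 chais.
  • Fairness: Busy and Slow get equal attention, so stock matches demand.
  • Scale: with 35 rows (Day 12), there should be enough Slow rows for SMOTE to become practical.

A fair classifier complements the ₹632.5 forecast (Day 25) and the clusters (Day 28).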

Real-World Balance

Fraud detection in India works on the same principle: frauds are rare, so models are rebalanced to catch them. Retailers such as Amazon face skewed sales data and correct for it so that stock matches demand. Priya’s balanced classifier is the café-sized version: small, but critical.

Challenges

  • Small data: with 7 rows, undersampling is risky.
  • Trade-off: Busy recall drops, so the weights may need tuning.
  • Threshold: the ₹500 cut-off defines the classes, and moving it (say, to ₹450) can change the balance.

Scaling to 35 rows is Priya’s next step.
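How much the cut-off matters can be checked by recounting the classes at a few thresholds. In this sketch, ₹450 and ₹500 come from the text above, while ₹550 and ₹600 are hypothetical values added for comparison.

```python
sales = [200, 500, 600, 150, 550, 650, 640]  # Sales column from Priya's table

def class_counts(sales, threshold):
    """Count Busy (sales >= threshold) and Slow (sales < threshold) rows."""
    busy = sum(s >= threshold for s in sales)
    return {"Busy": busy, "Slow": len(sales) - busy}

# 550 and 600 are hypothetical cut-offs, included only for comparison.
results = {t: class_counts(sales, t) for t in (450, 500, 550, 600)}
for t, c in results.items():
    print(f"threshold {t}: {c['Busy']} Busy, {c['Slow']} Slow")
# threshold 450: 5 Busy, 2 Slow
# threshold 500: 5 Busy, 2 Slow
# threshold 550: 4 Busy, 3 Slow
# threshold 600: 3 Busy, 4 Slow
```

On the current 7 rows, ₹450 and ₹500 give the same split because no sale falls between them; the balance only moves once new rows land in that gap or the cut-off rises.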

Why This Matters

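Balancing 5 Busy against 2 Slow lets Priya catch hours like ₹150, stock 15 chais, and avoid waste. Without it, she would stock every hour as if it were a ₹642 hour; with it, stocking follows real demand and profit improves. At a larger scale, the same idea helps models detect rare events such as outages in India, where a missed case costs far more than a wasted chai.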

Recap Summary

Yesterday, Day 30: Data Odyssey explained Priya’s ₹642 forecast, with Hour_Num and Sales_Lag as the key drivers. Today, Day 31: Data Odyssey balanced her classifier: the 5 Busy vs. 2 Slow skew is handled with class weights, and Slow recall reaches 1.0.

What’s Next

Tomorrow, in Day 32: Data Odyssey – What is Hyperparameter Tuning?, we’ll optimize: Can Priya’s Random Forest hit ₹3 MAE? We’ll tune her models, boosting precision. Bring your curiosity, and I’ll see you there!
