Welcome to Day 31: Data Odyssey, our 365-day journey to master data science and artificial intelligence (AI), launched on Shivaratri, February 26, 2025! Yesterday, in Day 30: Data Odyssey – What is Model Interpretability?, we unpacked Priya’s Random Forest model, predicting ₹642 for Thursday’s 9 AM sales (MAE ₹4, Day 23). Feature importance and SHAP revealed Hour_Num and Sales_Lag as key drivers, building trust in her 7-row dataset’s forecasts. Today, we tackle a flaw: How do we handle imbalanced data, and can Priya fix her classifier’s skew—5 “Busy” vs. 2 “Slow” hours?
The Imbalance Problem
Imbalanced data occurs when classes, like Priya’s “Busy” (sales ≥ ₹500) and “Slow” (< ₹500) from Day 19, have uneven counts. Her classifier (RandomForestClassifier, 95% cross-val, Day 22) scores well overall, but with 5 Busy vs. 2 Slow it risks leaning toward “Busy” and missing “Slow” hours (e.g., ₹150, Day 26). This belongs to the “preprocess” and “model” steps of our workflow (Day 1), and getting it right matters for stocking: 15 chais at 7 AM, not 30.
Think of it as Priya balancing her café’s menu. Too many samosas (Busy) overshadow chais (Slow)—fix the skew for accurate prep. Day 31: Data Odyssey balances this.
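To make the skew concrete, here is a minimal sketch that counts the labels and computes the imbalance ratio. It uses only the seven labels from Priya's dataset; the variable names are illustrative, not from earlier days.

```python
from collections import Counter

# Labels from Priya's 7-row dataset (Busy = sales >= 500, Slow = sales < 500)
labels = ["Slow", "Busy", "Busy", "Slow", "Busy", "Busy", "Busy"]

counts = Counter(labels)
total = len(labels)

# Share of the dataset held by each class
shares = {label: count / total for label, count in counts.items()}

# Imbalance ratio: majority count divided by minority count
imbalance_ratio = max(counts.values()) / min(counts.values())

for label, count in counts.most_common():
    print(f"{label}: {count} rows ({shares[label]:.0%})")
print(f"Imbalance ratio: {imbalance_ratio:.1f} to 1")
```

A ratio of 2.5 to 1 is mild by industry standards, but with only 2 minority rows every single "Slow" example carries a lot of weight.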
Why Imbalanced Data Matters
Priya’s classifier flags 9 AM as “Busy” (Day 23), but:
- Bias: 5 Busy dominate—7 AM “Slow” missed?
- Cost: False “Busy” at 7 AM—overstock chais.
- Recall: Day 20’s 1.0 for Busy, 0 for Slow—unreliable.
Her 7 rows already show the skew, and Day 12’s 35 rows may make it worse. Fixing it complements her ₹632.5 forecast (Day 25) and clusters (Day 28). Day 31: Data Odyssey corrects this.
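Why accuracy alone hides the problem: the sketch below scores a hypothetical classifier that always answers "Busy". This do-nothing model is an assumption for illustration, not Priya's actual Random Forest.

```python
# True labels from the 7-row dataset
y_true = ["Slow", "Busy", "Busy", "Slow", "Busy", "Busy", "Busy"]

# Hypothetical worst case: a model that always predicts the majority class
y_pred = ["Busy"] * len(y_true)


def recall(y_true, y_pred, label):
    """Fraction of rows truly in `label` that were also predicted as `label`."""
    relevant = [p for t, p in zip(y_true, y_pred) if t == label]
    return sum(p == label for p in relevant) / len(relevant)


accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
busy_recall = recall(y_true, y_pred, "Busy")
slow_recall = recall(y_true, y_pred, "Slow")

print(f"Accuracy:    {accuracy:.2f}")
print(f"Busy recall: {busy_recall:.2f}")
print(f"Slow recall: {slow_recall:.2f}")
```

The model looks respectable at 71% accuracy while never flagging a single "Slow" hour, which is exactly the pattern from Day 20.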
Priya’s Data Recap
Her data (Day 30):
Datetime             Sales  Hour_Num  Item_Code  Weather_Rainy  Rush_Hour  Weekday  Sales_Lag  Label
2025-03-03 07:00:00    200         7          0              0          0        1          0  Slow
2025-03-03 08:00:00    500         8          0              0          1        1        200  Busy
2025-03-03 09:00:00    600         9          1              0          1        1        500  Busy
2025-03-04 07:00:00    150         7          0              1          0        1        600  Slow
2025-03-04 08:00:00    550         8          0              1          1        1        150  Busy
2025-03-04 09:00:00    650         9          1              1          1        1        550  Busy
2025-03-05 09:00:00    640         9          1              0          1        0        650  Busy
- Labels: 5 Busy, 2 Slow—71% vs. 29%.
- Classifier: 95% cross-val, but “Slow” recall low (Day 20).
- Features: Hour_Num, Sales_Lag key (Day 30).
Goal: Balance data—catch “Slow” like ₹150. Day 31: Data Odyssey starts here.
Handling Imbalanced Data
Methods for her classifier:
- Oversampling (SMOTE): Create synthetic “Slow” rows to match the Busy count.
- Undersampling: Drop Busy rows to equalize at 2 each.
- Class Weights: Weight mistakes on “Slow” rows more heavily so the model cannot ignore the minority class.
With only 2 “Slow” rows, SMOTE has very few neighbors to interpolate from, so class weights or undersampling fit better. Day 31: Data Odyssey tries these.
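To show what "synthetic rows" means, here is a simplified sketch of the interpolation idea behind SMOTE. It is not imbalanced-learn's implementation: it uses only two features, skips the nearest-neighbor search, and the helper `interpolate` is a made-up name. As far as I know, imblearn's `SMOTE` defaults to `k_neighbors=5`, which needs more minority rows than Priya has.

```python
import random

# The two "Slow" rows, reduced to (Hour_Num, Sales_Lag) for illustration
slow_rows = [(7, 0), (7, 600)]


def interpolate(a, b, rng):
    """Return a point at a random position on the segment between a and b."""
    gap = rng.random()  # random fraction in [0, 1)
    return tuple(x + gap * (y - x) for x, y in zip(a, b))


rng = random.Random(42)

# 5 Busy - 2 Slow = 3 synthetic rows needed to match the Busy count
synthetic = [interpolate(slow_rows[0], slow_rows[1], rng) for _ in range(3)]

for row in synthetic:
    print(row)
```

Every synthetic row sits on the line between the same two real rows, so it adds no genuinely new information. That is the practical reason SMOTE waits for the 35-row dataset.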
Class Weights
Adjust Random Forest:
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
# Data
data = pd.DataFrame({
"Datetime": ["2025-03-03 07:00", "2025-03-03 08:00", "2025-03-03 09:00",
"2025-03-04 07:00", "2025-03-04 08:00", "2025-03-04 09:00",
"2025-03-05 09:00"],
"Sales": [200, 500, 600, 150, 550, 650, 640],
"Hour_Num": [7, 8, 9, 7, 8, 9, 9],
"Item_Code": [0, 0, 1, 0, 0, 1, 1],
"Weather_Rainy": [0, 0, 0, 1, 1, 1, 0],
"Rush_Hour": [0, 1, 1, 0, 1, 1, 1],
"Weekday": [1, 1, 1, 1, 1, 1, 0],
"Sales_Lag": [0, 200, 500, 600, 150, 550, 650],
"Label": ["Slow", "Busy", "Busy", "Slow", "Busy", "Busy", "Busy"]
})
data["Datetime"] = pd.to_datetime(data["Datetime"])
data.set_index("Datetime", inplace=True)
# Train
X = data[["Hour_Num", "Item_Code", "Weather_Rainy", "Rush_Hour", "Weekday", "Sales_Lag"]]
y = data["Label"]
# 7 rows with test_size=0.33 gives 4 training rows and 3 test rows; the split is not stratified
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
model = RandomForestClassifier(n_estimators=10, max_depth=2, class_weight="balanced", random_state=42)
model.fit(X_train, y_train)
# Evaluate
y_pred = model.predict(X_test)
print(classification_report(y_test, y_pred))
Output:
              precision    recall  f1-score   support
        Busy       1.00      0.50      0.67         2
        Slow       0.50      1.00      0.67         1
    accuracy                           0.67         3
- Slow Recall: 1.0—catches ₹150!
- Busy Recall: 0.5—one Busy missed.
- Accuracy: 0.67—trade-off for balance.
Better “Slow” detection—15 chais, not 30. Day 31: Data Odyssey balances this.
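What does class_weight="balanced" actually compute? scikit-learn documents the heuristic as n_samples / (n_classes * count of each class). The sketch below reproduces that formula in plain Python on all 7 labels; `balanced_weights` is an illustrative helper, and in the code above the weights would be derived from the training labels only, so the exact values there may differ.

```python
from collections import Counter


def balanced_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {label: n_samples / (n_classes * count) for label, count in counts.items()}


labels = ["Slow", "Busy", "Busy", "Slow", "Busy", "Busy", "Busy"]
weights = balanced_weights(labels)

for label, weight in sorted(weights.items()):
    print(f"{label}: {weight:.2f}")
```

Each "Slow" row counts 2.5 times as much as a "Busy" row (1.75 vs. 0.70), which is the same 5-to-2 ratio flipped.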
Undersampling
Equalize at 2:
from imblearn.under_sampling import RandomUnderSampler
rus = RandomUnderSampler(random_state=42)
X_res, y_res = rus.fit_resample(X, y)
print("New counts:", y_res.value_counts())
# Train
X_train, X_test, y_train, y_test = train_test_split(X_res, y_res, test_size=0.33, random_state=42)
model = RandomForestClassifier(n_estimators=10, max_depth=2, random_state=42)
model.fit(X_train, y_train)
# Evaluate
y_pred = model.predict(X_test)
print(classification_report(y_test, y_pred))
Output:
New counts: Busy 2
Slow 2
              precision    recall  f1-score   support
        Busy       1.00      1.00      1.00         1
        Slow       1.00      1.00      1.00         1
    accuracy                           1.00         2
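- Perfect, but only 4 rows remain and the test set has just 2, so the score says little.
- Risk: Dropping Busy rows discards information, so calls like 9 AM ₹650 rest on less evidence.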
Class weights safer—keeps all 7 rows. Day 31: Data Odyssey tests this.
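For intuition, random undersampling can be sketched in a few lines of plain Python. This is an illustrative stand-in, not how RandomUnderSampler is implemented, and the rows it keeps will not necessarily match the ones imblearn picks with the same seed. Only the Sales column is carried along to keep the example short.

```python
import random
from collections import defaultdict


def undersample(rows, labels, seed=42):
    """Randomly keep as many rows per class as the smallest class has."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for row, label in zip(rows, labels):
        by_class[label].append(row)

    n_min = min(len(group) for group in by_class.values())

    kept_rows, kept_labels = [], []
    for label, group in by_class.items():
        for row in rng.sample(group, n_min):
            kept_rows.append(row)
            kept_labels.append(label)
    return kept_rows, kept_labels


sales = [200, 500, 600, 150, 550, 650, 640]
labels = ["Slow", "Busy", "Busy", "Slow", "Busy", "Busy", "Busy"]

kept_sales, kept_labels = undersample(sales, labels)
print(list(zip(kept_sales, kept_labels)))
```

Both "Slow" rows always survive, while three of the five "Busy" rows are thrown away at random. On 7 rows that is a steep price.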
Cross-Validation
Check stability:
from sklearn.model_selection import cross_val_score
model = RandomForestClassifier(n_estimators=10, max_depth=2, class_weight="balanced", random_state=42)
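# With only 2 Slow rows, cv=3 leaves one fold without a Slow test row (scikit-learn warns about this)
scores = cross_val_score(model, X, y, cv=3, scoring="f1_weighted")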
print("Cross-val F1:", scores.mean())
Output: Cross-val F1: 0.90, compared with 0.95 on Day 22. The balanced model scores slightly lower overall in exchange for better “Slow” coverage. Day 31: Data Odyssey validates this.
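The "f1_weighted" score averages each class's F1, weighted by how many true rows that class has. Here is a small worked sketch in plain Python. The `y_true` and `y_pred` lists are an assumption: one set of predictions consistent with the class-weights report earlier (one Busy row mistaken for Slow), not the model's recorded output.

```python
def f1_for(label, y_true, y_pred):
    """F1 for one class, from true positives, false positives, false negatives."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == label and p == label for t, p in pairs)
    fp = sum(t != label and p == label for t, p in pairs)
    fn = sum(t == label and p != label for t, p in pairs)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def weighted_f1(y_true, y_pred):
    """Average of per-class F1, weighted by each class's share of true rows."""
    total = len(y_true)
    return sum(
        f1_for(label, y_true, y_pred) * y_true.count(label) / total
        for label in set(y_true)
    )


# Assumed predictions matching the earlier 3-row test report
y_true = ["Busy", "Busy", "Slow"]
y_pred = ["Busy", "Slow", "Slow"]

score = weighted_f1(y_true, y_pred)
print(f"Busy F1: {f1_for('Busy', y_true, y_pred):.2f}")
print(f"Slow F1: {f1_for('Slow', y_true, y_pred):.2f}")
print(f"Weighted F1: {score:.2f}")
```

Because the weights follow class size, a weighted F1 still leans toward "Busy". Checking per-class recall alongside it keeps "Slow" visible.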
Thursday Prediction
9 AM, balanced model:
new_data = pd.DataFrame({
"Hour_Num": [9],
"Item_Code": [1],
"Weather_Rainy": [0],
"Rush_Hour": [1],
"Weekday": [1],
"Sales_Lag": [640]
}, columns=X.columns)
model.fit(X, y)  # cross_val_score only fits internal copies, so train on all 7 rows first
pred = model.predict(new_data)
print("Thursday 9 AM:", pred[0])
Output: Busy, so Priya preps 40 samosas, which aligns with the ₹642 forecast (Day 30). Day 31: Data Odyssey predicts this.
Why Balance?
- Catch Slow: ₹150 flagged—15 chais.
- Fairness: Busy, Slow equal—stock right.
- Scale: 35 rows (Day 12)—SMOTE ready.
Complements ₹632.5 (Day 25), clusters (Day 28)—fair classifier. Day 31: Data Odyssey evens this.
Real-World Balance
India’s fraud ML balances rare frauds—catches cheats. Amazon fixes skewed sales—stock aligns. Priya’s balanced classifier is her café’s fairness—small, critical. Day 31: Data Odyssey mirrors this.
Challenges
- Small Data: 7 rows—undersampling risky.
- Trade-off: Busy recall drops—tune weights?
- Threshold: The ₹500 cutoff defines the classes, so moving it (say, to ₹450) can change the balance.
35 rows—Priya scales. Day 31: Data Odyssey flags this.
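The threshold point is easy to check directly. This sketch relabels the seven sales values at a few cutoffs; the 550 and 600 cutoffs are extra values chosen for illustration, and `label_counts` is a hypothetical helper.

```python
# Sales from Priya's 7-row dataset
sales = [200, 500, 600, 150, 550, 650, 640]


def label_counts(sales, threshold):
    """Count Busy (sales >= threshold) and Slow (sales < threshold) rows."""
    busy = sum(s >= threshold for s in sales)
    return {"Busy": busy, "Slow": len(sales) - busy}


results = {t: label_counts(sales, t) for t in (450, 500, 550, 600)}

for threshold, counts in results.items():
    print(f"Cutoff {threshold}: {counts}")
```

On these 7 rows, ₹450 and ₹500 give the same 5 to 2 split because no sale falls between them, while ₹550 and ₹600 move rows into "Slow". With 35 rows the cutoff is likely to matter more.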
Why This Matters
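Balancing 5 Busy and 2 Slow helps catch hours like ₹150, so Priya stocks 15 chais and avoids waste. Without it, she risks overstocking slow hours; with it, her prep is fair to both classes and her profit is protected. At scale, the same idea helps ML systems detect rare events such as power outages in India. Day 31: Data Odyssey evens her.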
Recap Summary
Yesterday, Day 30: Data Odyssey explained Priya’s ₹642—Hour_Num, Sales_Lag key. Today, Day 31: Data Odyssey balanced her classifier—5 Busy, 2 Slow fixed, Slow recall 1.0. It’s her fair step.
What’s Next
Tomorrow, in Day 32: Data Odyssey – What is Hyperparameter Tuning?, we’ll optimize: Can Priya’s Random Forest hit ₹3 MAE? We’ll tune her models, boosting precision. Bring your curiosity, and I’ll see you there!