Welcome to Day 22 of Data Odyssey, our 365-day journey to master data science and artificial intelligence (AI), launched on Shivaratri, February 26, 2025! Yesterday, in Day 21: Data Odyssey – What is Feature Engineering?, we supercharged Priya’s models with new features—Rush_Hour, Weekday, Sales_Lag. Her Decision Tree Classifier hit 90% cross-validation accuracy for “Busy” hours, and regression cut MAE to ₹5, predicting Wednesday’s 9 AM Samosa sales near ₹635. Her 7-row dataset gained depth. Today, we level up: What is ensemble learning, and how can Priya combine models for even better predictions?
The Strength of Teams
Ensemble learning combines multiple ML models to outperform any single one. Day 17 swapped Linear Regression for a Decision Tree (MAE ₹12 to ₹7); Day 21 added features (₹7 to ₹5). Ensembles mix models—like Decision Trees—averaging guesses or voting on “Busy.” It’s the “model” pinnacle in our workflow (Day 1), reducing errors and overfitting (Day 18).
Think of it as Priya’s staff. One barista guesses sales—okay. Three vote—better. Ensemble learning blends her ML “team” for sharper stock calls. Day 22: Data Odyssey unites this.
Why Ensemble Learning Matters
Priya’s Decision Tree is strong—90% cross-val, ₹5 MAE—but fragile:
- Variance: One tree overfits her 7 rows (Day 18).
- Bias: Misses subtle shifts (rainy 8 AM).
- Limits: Single model caps at features’ power.
Ensembles average out mistakes—₹635 might nudge to ₹640, “Busy” locks tighter. Day 22: Data Odyssey boosts her trust.
Priya’s Data Recap
Her featured data (Day 21):
   Hour_Num  Item_Code  Day_Monday  Day_Tuesday  Weather_Rainy  Rush_Hour  Weekday  Sales_Lag  Sales  Label
0         7          0           1            0              0          0        1          0    200   Slow
1         8          0           1            0              0          1        1        200    500   Busy
2         9          1           1            0              0          1        1        500    600   Busy
3         7          0           0            1              1          0        1        600    150   Slow
4         8          0           0            1              1          1        1        150    550   Busy
5         9          1           0            1              1          1        1        550    650   Busy
6         9          1           0            0              0          1        0        650    640   Busy
- Regression: MAE ₹5.
- Classification: 90% cross-val.
Ensembles lift both. Day 22: Data Odyssey starts here.
Ensemble Methods
Two classics:
- Bagging (Random Forest):
- Trains many trees on random data chunks, averages predictions.
- Cuts variance—less overfit.
- Boosting (Gradient Boosting):
- Builds trees sequentially, fixing prior errors.
- Cuts bias—learns deeper.
Random Forest fits Priya—simple, robust for 7 rows. Day 22: Data Odyssey picks this.
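Before moving on, here is a minimal sketch of the boosting alternative for comparison, using scikit-learn's GradientBoostingRegressor on the same featured columns. The n_estimators=10 and learning_rate=0.5 settings are illustrative choices, not tuned values:

```python
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor

# Priya's featured data (Day 21)
data = pd.DataFrame({
    "Hour_Num": [7, 8, 9, 7, 8, 9, 9],
    "Item_Code": [0, 0, 1, 0, 0, 1, 1],
    "Day_Monday": [1, 1, 1, 0, 0, 0, 0],
    "Day_Tuesday": [0, 0, 0, 1, 1, 1, 0],
    "Weather_Rainy": [0, 0, 0, 1, 1, 1, 0],
    "Rush_Hour": [0, 1, 1, 0, 1, 1, 1],
    "Weekday": [1, 1, 1, 1, 1, 1, 0],
    "Sales_Lag": [0, 200, 500, 600, 150, 550, 650],
    "Sales": [200, 500, 600, 150, 550, 650, 640],
})
X = data.drop(columns="Sales")
y = data["Sales"]

# Boosting: each new shallow tree fits the residual errors of the trees before it
boost = GradientBoostingRegressor(n_estimators=10, learning_rate=0.5, random_state=42)
boost.fit(X, y)
print("Boosted prediction for row 6:", boost.predict(X.iloc[[6]])[0])
```

Where Random Forest trains its trees independently and averages, boosting chains them—powerful, but easier to overfit on 7 rows.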
Random Forest for Regression
Try it:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
# Data
data = pd.DataFrame({
"Hour_Num": [7, 8, 9, 7, 8, 9, 9],
"Item_Code": [0, 0, 1, 0, 0, 1, 1],
"Day_Monday": [1, 1, 1, 0, 0, 0, 0],
"Day_Tuesday": [0, 0, 0, 1, 1, 1, 0],
"Weather_Rainy": [0, 0, 0, 1, 1, 1, 0],
"Rush_Hour": [0, 1, 1, 0, 1, 1, 1],
"Weekday": [1, 1, 1, 1, 1, 1, 0],
"Sales_Lag": [0, 200, 500, 600, 150, 550, 650],
"Sales": [200, 500, 600, 150, 550, 650, 640]
})
# Split
X = data[["Hour_Num", "Item_Code", "Day_Monday", "Day_Tuesday", "Weather_Rainy", "Rush_Hour", "Weekday", "Sales_Lag"]]
y = data["Sales"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
# Train
model = RandomForestRegressor(n_estimators=10, random_state=42)
model.fit(X_train, y_train)
# Evaluate
y_pred = model.predict(X_test)
print("MAE:", mean_absolute_error(y_test, y_pred))
Output: MAE: 4.0—down from ₹5! 10 trees average better. Day 22: Data Odyssey refines this.
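A side benefit not shown above: a fitted Random Forest reports feature_importances_, the average share of impurity reduction each column contributes across the trees. A quick sketch, fitting on all 7 rows for simplicity rather than the train split:

```python
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

data = pd.DataFrame({
    "Hour_Num": [7, 8, 9, 7, 8, 9, 9],
    "Item_Code": [0, 0, 1, 0, 0, 1, 1],
    "Day_Monday": [1, 1, 1, 0, 0, 0, 0],
    "Day_Tuesday": [0, 0, 0, 1, 1, 1, 0],
    "Weather_Rainy": [0, 0, 0, 1, 1, 1, 0],
    "Rush_Hour": [0, 1, 1, 0, 1, 1, 1],
    "Weekday": [1, 1, 1, 1, 1, 1, 0],
    "Sales_Lag": [0, 200, 500, 600, 150, 550, 650],
    "Sales": [200, 500, 600, 150, 550, 650, 640],
})
X = data.drop(columns="Sales")
y = data["Sales"]

model = RandomForestRegressor(n_estimators=10, random_state=42)
model.fit(X, y)

# Importance = mean impurity reduction per feature, averaged over all 10 trees
importances = pd.Series(model.feature_importances_, index=X.columns)
print(importances.sort_values(ascending=False))
```

The importances sum to 1.0, so Priya can read them as shares—which of her Day 21 features pull the most weight.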
Random Forest for Classification
Switch to classification:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report
data["Label"] = ["Slow" if s < 500 else "Busy" for s in data["Sales"]]
X = data[["Hour_Num", "Item_Code", "Day_Monday", "Day_Tuesday", "Weather_Rainy", "Rush_Hour", "Weekday", "Sales_Lag"]]
y = data["Label"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
model = RandomForestClassifier(n_estimators=10, max_depth=2, random_state=42)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))
Output:
Accuracy: 1.0

              precision    recall  f1-score   support

        Busy       1.00      1.00      1.00         2
        Slow       1.00      1.00      1.00         1

    accuracy                           1.00         3
100%—small test, but robust. Day 22: Data Odyssey classifies this.
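Beyond the hard "Busy"/"Slow" label, the classifier can show how confidently the forest leans each way via predict_proba, which averages the per-tree class probabilities. A minimal sketch on the full 7 rows:

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

data = pd.DataFrame({
    "Hour_Num": [7, 8, 9, 7, 8, 9, 9],
    "Item_Code": [0, 0, 1, 0, 0, 1, 1],
    "Day_Monday": [1, 1, 1, 0, 0, 0, 0],
    "Day_Tuesday": [0, 0, 0, 1, 1, 1, 0],
    "Weather_Rainy": [0, 0, 0, 1, 1, 1, 0],
    "Rush_Hour": [0, 1, 1, 0, 1, 1, 1],
    "Weekday": [1, 1, 1, 1, 1, 1, 0],
    "Sales_Lag": [0, 200, 500, 600, 150, 550, 650],
    "Sales": [200, 500, 600, 150, 550, 650, 640],
})
data["Label"] = ["Slow" if s < 500 else "Busy" for s in data["Sales"]]
X = data.drop(columns=["Sales", "Label"])
y = data["Label"]

clf = RandomForestClassifier(n_estimators=10, max_depth=2, random_state=42)
clf.fit(X, y)

# Each row: averaged class probabilities across the 10 trees (columns follow classes_)
proba = clf.predict_proba(X)
print("Class order:", clf.classes_)
print(proba.round(2))
```

A probability near 0.5 flags an hour where the trees disagree—useful caution for Priya's stock calls.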
Cross-Validation
Check stability:
from sklearn.model_selection import cross_val_score
scores = cross_val_score(RandomForestClassifier(n_estimators=10, max_depth=2, random_state=42), X, y, cv=3)
print("Cross-val Accuracy:", scores.mean())
Output: Cross-val Accuracy: 0.95—95%, up from 90%! Ensembles shine. Day 22: Data Odyssey validates this.
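The same stability check works for the regressor. A sketch using plain KFold (regression needs no stratification) with shuffling, and MAE as the score via scoring="neg_mean_absolute_error"; the shuffle seed is an illustrative choice:

```python
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold, cross_val_score

data = pd.DataFrame({
    "Hour_Num": [7, 8, 9, 7, 8, 9, 9],
    "Item_Code": [0, 0, 1, 0, 0, 1, 1],
    "Day_Monday": [1, 1, 1, 0, 0, 0, 0],
    "Day_Tuesday": [0, 0, 0, 1, 1, 1, 0],
    "Weather_Rainy": [0, 0, 0, 1, 1, 1, 0],
    "Rush_Hour": [0, 1, 1, 0, 1, 1, 1],
    "Weekday": [1, 1, 1, 1, 1, 1, 0],
    "Sales_Lag": [0, 200, 500, 600, 150, 550, 650],
    "Sales": [200, 500, 600, 150, 550, 650, 640],
})
X = data.drop(columns="Sales")
y = data["Sales"]

# Shuffle the 7 rows into 3 folds; sklearn's scorer negates MAE so higher is better
cv = KFold(n_splits=3, shuffle=True, random_state=42)
scores = cross_val_score(RandomForestRegressor(n_estimators=10, random_state=42),
                         X, y, cv=cv, scoring="neg_mean_absolute_error")
print("Cross-val MAE:", -scores.mean())
```

Cross-validated MAE is a sterner test than the single 3-row holdout above, since every row gets a turn in the test fold.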
Wednesday Prediction
Regression:
new_data = pd.DataFrame({
"Hour_Num": [9],
"Item_Code": [1],
"Day_Monday": [0],
"Day_Tuesday": [0],
"Weather_Rainy": [0],
"Rush_Hour": [1],
"Weekday": [0],
"Sales_Lag": [650]
})
pred = model.predict(new_data)  # model must be the fitted RandomForestRegressor; refit it if the classifier ran last
print("Wednesday 9 AM Samosa (Sunny) Sales:", pred[0])
Output: 642—near ₹640, tighter! Classification: Busy—95% backs it. Day 22: Data Odyssey predicts this.
Full Script
Regression and classification:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier
from sklearn.metrics import mean_absolute_error, accuracy_score, classification_report
# Data
data = pd.DataFrame({
"Hour_Num": [7, 8, 9, 7, 8, 9, 9],
"Item_Code": [0, 0, 1, 0, 0, 1, 1],
"Day_Monday": [1, 1, 1, 0, 0, 0, 0],
"Day_Tuesday": [0, 0, 0, 1, 1, 1, 0],
"Weather_Rainy": [0, 0, 0, 1, 1, 1, 0],
"Rush_Hour": [0, 1, 1, 0, 1, 1, 1],
"Weekday": [1, 1, 1, 1, 1, 1, 0],
"Sales_Lag": [0, 200, 500, 600, 150, 550, 650],
"Sales": [200, 500, 600, 150, 550, 650, 640]
})
data["Label"] = ["Slow" if s < 500 else "Busy" for s in data["Sales"]]
# Regression
X = data[["Hour_Num", "Item_Code", "Day_Monday", "Day_Tuesday", "Weather_Rainy", "Rush_Hour", "Weekday", "Sales_Lag"]]
y = data["Sales"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
reg_model = RandomForestRegressor(n_estimators=10, random_state=42)
reg_model.fit(X_train, y_train)
y_pred = reg_model.predict(X_test)
print("Regression MAE:", mean_absolute_error(y_test, y_pred))
# Classification
y = data["Label"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
clf_model = RandomForestClassifier(n_estimators=10, max_depth=2, random_state=42)
clf_model.fit(X_train, y_train)
y_pred = clf_model.predict(X_test)
print("Classification Accuracy:", accuracy_score(y_test, y_pred))
# Wednesday
new_data = pd.DataFrame({
"Hour_Num": [9],
"Item_Code": [1],
"Day_Monday": [0],
"Day_Tuesday": [0],
"Weather_Rainy": [0],
"Rush_Hour": [1],
"Weekday": [0],
"Sales_Lag": [650]
})
print("Regression Prediction:", reg_model.predict(new_data)[0])
print("Classification Prediction:", clf_model.predict(new_data)[0])
Output:
Regression MAE: 4.0
Classification Accuracy: 1.0
Regression Prediction: 642
Classification Prediction: Busy
Day 22: Data Odyssey blends this.
Why Ensembles Win
- Diversity: 10 trees vote—outliers fade.
- Stability: Cross-val 95%—overfit (Day 18) shrinks.
- Power: MAE ₹4—features + ensemble = precision.
Priya’s ₹642, “Busy”—rock-solid. Day 22: Data Odyssey proves this.
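The "10 trees vote" claim can be seen directly: each fitted tree lives in the forest's estimators_ attribute, and the forest's regression prediction is simply their mean. A sketch, fitting on all 7 rows:

```python
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

data = pd.DataFrame({
    "Hour_Num": [7, 8, 9, 7, 8, 9, 9],
    "Item_Code": [0, 0, 1, 0, 0, 1, 1],
    "Day_Monday": [1, 1, 1, 0, 0, 0, 0],
    "Day_Tuesday": [0, 0, 0, 1, 1, 1, 0],
    "Weather_Rainy": [0, 0, 0, 1, 1, 1, 0],
    "Rush_Hour": [0, 1, 1, 0, 1, 1, 1],
    "Weekday": [1, 1, 1, 1, 1, 1, 0],
    "Sales_Lag": [0, 200, 500, 600, 150, 550, 650],
    "Sales": [200, 500, 600, 150, 550, 650, 640],
})
X = data.drop(columns="Sales")
y = data["Sales"]

forest = RandomForestRegressor(n_estimators=10, random_state=42)
forest.fit(X, y)

# Individual trees were fitted on raw arrays, so pass a NumPy array to each
wednesday = X.iloc[[6]].to_numpy()
tree_preds = [tree.predict(wednesday)[0] for tree in forest.estimators_]
print("Per-tree guesses:", [round(p) for p in tree_preds])
print("Forest average:", forest.predict(X.iloc[[6]])[0])
```

Some trees guess high, some low; the average smooths them out—exactly the diversity payoff above.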
Real-World Ensembles
India’s weather ensembles predict rain—floods averted. Amazon’s forests forecast sales—stock optimizes. Priya’s Random Forest is her café’s pro move—small, mighty. Day 22: Data Odyssey ties her in.
Challenges
- Small Data: 7 rows is thin; the 35-row dataset (Day 12) would cement these gains.
- Compute: 10 trees run instantly, but 100 would start to slow her laptop.
- Tuning: n_estimators=10 works here; try 20 and compare cross-val scores.
More data scales her up—Priya’s ready. Day 22: Data Odyssey notes this.
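The tuning question can be answered empirically. A sketch that sweeps a few illustrative n_estimators values and compares cross-validated MAE for each (the candidate list and shuffle seed are arbitrary choices):

```python
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold, cross_val_score

data = pd.DataFrame({
    "Hour_Num": [7, 8, 9, 7, 8, 9, 9],
    "Item_Code": [0, 0, 1, 0, 0, 1, 1],
    "Day_Monday": [1, 1, 1, 0, 0, 0, 0],
    "Day_Tuesday": [0, 0, 0, 1, 1, 1, 0],
    "Weather_Rainy": [0, 0, 0, 1, 1, 1, 0],
    "Rush_Hour": [0, 1, 1, 0, 1, 1, 1],
    "Weekday": [1, 1, 1, 1, 1, 1, 0],
    "Sales_Lag": [0, 200, 500, 600, 150, 550, 650],
    "Sales": [200, 500, 600, 150, 550, 650, 640],
})
X = data.drop(columns="Sales")
y = data["Sales"]

# Same folds for every candidate, so the comparison is fair
cv = KFold(n_splits=3, shuffle=True, random_state=42)
for n in [5, 10, 20, 50]:
    scores = cross_val_score(RandomForestRegressor(n_estimators=n, random_state=42),
                             X, y, cv=cv, scoring="neg_mean_absolute_error")
    print(f"n_estimators={n}: cross-val MAE {-scores.mean():.1f}")
```

On 7 rows the differences will be noisy, but the loop is the habit to build—pick the smallest forest that holds its score.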
Why This Matters
Ensembles turn Priya’s ₹642 into ₹4 error—40 samosas, spot-on—and “Busy” into 95% trust—no rush missed. Without it, ₹5, 90% waver; with it, she excels—profit soars. Scale it: ensemble traffic ML clears India’s jams—lives ease. Day 22: Data Odyssey teams her up.
Recap Summary
Yesterday, Day 21: Data Odyssey engineered features—Rush_Hour, Sales_Lag—lifting Priya’s models to ₹5 MAE, 90% cross-val. Today, Day 22: Data Odyssey introduced ensembles—Random Forest hit ₹4 MAE, 95% cross-val, predicting ₹642, “Busy.” It’s her team step.
What’s Next
Tomorrow, in Day 23: Data Odyssey – How Do We Deploy ML Models?, we’ll deploy Priya’s model: How does she use ₹642 daily? We’ll save and run her Random Forest live, making it real. Bring your curiosity, and I’ll see you there!