Ensemble Learning Model

Day 22: Data Odyssey – What is Ensemble Learning?

Welcome to Day 22: Data Odyssey, our 365-day journey to master data science and artificial intelligence (AI), launched on Shivaratri, February 26, 2025! Yesterday, in Day 21: Data Odyssey – What is Feature Engineering?, we supercharged Priya’s models with new features—Rush_Hour, Weekday, Sales_Lag. Her Decision Tree Classifier hit 90% cross-validation accuracy for “Busy” hours, and regression cut MAE to ₹5, predicting Wednesday’s 9 AM Samosa sales near ₹635. Her 7-row dataset gained depth. Today, we level up: What is ensemble learning, and how can Priya combine models for even better predictions?

The Strength of Teams

Ensemble learning combines multiple ML models so the group outperforms any single one. Day 17 swapped Linear Regression for a Decision Tree (MAE ₹12 to ₹7); Day 21 added features (₹7 to ₹5). Ensembles mix models, like several Decision Trees, averaging their sales guesses or voting on “Busy.” It’s the pinnacle of the “model” step in our workflow (Day 1), reducing both error and overfitting (Day 18).

Think of it as Priya’s staff. One barista guesses sales—okay. Three vote—better. Ensemble learning blends her ML “team” for sharper stock calls. Day 22: Data Odyssey unites this.
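
The intuition fits in a few lines of plain Python, no libraries needed. The guesses below are made up purely for illustration:

# Three "baristas" (models) guess 9 AM Samosa sales (made-up numbers)
guesses = [610, 655, 640]

# Regression ensemble: average the guesses
print("Averaged sales guess:", sum(guesses) / len(guesses))  # 635.0

# Classification ensemble: majority vote on "Busy" vs "Slow"
votes = ["Busy", "Busy", "Slow"]
print("Vote result:", max(set(votes), key=votes.count))  # Busy

Averaging smooths one bad guess; voting drowns out one wrong call. A Random Forest does essentially this, with Decision Trees as the guessers.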

Why Ensemble Learning Matters

Priya’s Decision Tree is strong—90% cross-val, ₹5 MAE—but fragile:

  • Variance: a single tree overfits her 7 rows (Day 18).
  • Bias: it misses subtle shifts, like a rainy 8 AM.
  • Limits: one model can only squeeze so much out of the features.

Ensembles average out individual mistakes: the ₹635 estimate might nudge toward ₹640, and the “Busy” call locks in tighter. Day 22: Data Odyssey boosts her trust.

Priya’s Data Recap

Her featured data (Day 21):

   Hour_Num  Item_Code  Day_Monday  Day_Tuesday  Weather_Rainy  Rush_Hour  Weekday  Sales_Lag  Sales  Label
0         7          0           1            0              0          0        1          0    200  Slow
1         8          0           1            0              0          1        1        200    500  Busy
2         9          1           1            0              0          1        1        500    600  Busy
3         7          0           0            1              1          0        1        600    150  Slow
4         8          0           0            1              1          1        1        150    550  Busy
5         9          1           0            1              1          1        1        550    650  Busy
6         9          1           0            0              0          1        0        650    640  Busy
  • Regression: MAE ₹5.
  • Classification: 90% cross-val.

Ensembles lift both. Day 22: Data Odyssey starts here.

Ensemble Methods

Two classics:

  1. Bagging (Random Forest):
    • Trains many trees in parallel, each on a random bootstrap sample of the data, then averages their predictions.
    • Cuts variance, so less overfitting.
  2. Boosting (Gradient Boosting):
    • Builds trees sequentially, each one correcting the previous trees’ errors.
    • Cuts bias, learning deeper patterns.

Random Forest fits Priya—simple, robust for 7 rows. Day 22: Data Odyssey picks this.
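
In scikit-learn the two classics share the same fit/predict API, so swapping between them is a one-line change. A quick side-by-side sketch (GradientBoostingRegressor appears here only for comparison; the rest of the post sticks with Random Forest):

from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor

# Bagging: independent trees on random bootstrap samples, predictions averaged
bagging = RandomForestRegressor(n_estimators=10, random_state=42)

# Boosting: trees built one after another, each correcting the last one's errors
boosting = GradientBoostingRegressor(n_estimators=10, learning_rate=0.1, random_state=42)

# Either way: model.fit(X_train, y_train), then model.predict(X_new)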

Random Forest for Regression

Try it:

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error

# Data
data = pd.DataFrame({
    "Hour_Num": [7, 8, 9, 7, 8, 9, 9],
    "Item_Code": [0, 0, 1, 0, 0, 1, 1],
    "Day_Monday": [1, 1, 1, 0, 0, 0, 0],
    "Day_Tuesday": [0, 0, 0, 1, 1, 1, 0],
    "Weather_Rainy": [0, 0, 0, 1, 1, 1, 0],
    "Rush_Hour": [0, 1, 1, 0, 1, 1, 1],
    "Weekday": [1, 1, 1, 1, 1, 1, 0],
    "Sales_Lag": [0, 200, 500, 600, 150, 550, 650],
    "Sales": [200, 500, 600, 150, 550, 650, 640]
})

# Split
X = data[["Hour_Num", "Item_Code", "Day_Monday", "Day_Tuesday", "Weather_Rainy", "Rush_Hour", "Weekday", "Sales_Lag"]]
y = data["Sales"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

# Train
# Train (named reg_model so the classifier later doesn't overwrite it)
reg_model = RandomForestRegressor(n_estimators=10, random_state=42)
reg_model.fit(X_train, y_train)

# Evaluate
y_pred = reg_model.predict(X_test)
print("MAE:", mean_absolute_error(y_test, y_pred))

Output: MAE: 4.0, down from ₹5! Ten trees averaging together beat one tree alone. Day 22: Data Odyssey refines this.
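
A bonus of the forest: it reports which features its trees lean on. A quick peek, reusing reg_model and X from above:

# Impurity-based importance of each feature, averaged over the trees
for name, score in zip(X.columns, reg_model.feature_importances_):
    print(f"{name}: {score:.2f}")

With only 7 rows the numbers will bounce around, but expect Sales_Lag and Hour_Num to rank high.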

Random Forest for Classification

Switch to classification:

from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report

data["Label"] = ["Slow" if s < 500 else "Busy" for s in data["Sales"]]
X = data[["Hour_Num", "Item_Code", "Day_Monday", "Day_Tuesday", "Weather_Rainy", "Rush_Hour", "Weekday", "Sales_Lag"]]
y = data["Label"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

clf_model = RandomForestClassifier(n_estimators=10, max_depth=2, random_state=42)
clf_model.fit(X_train, y_train)

y_pred = clf_model.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))

Output:

Accuracy: 1.0
              precision    recall  f1-score   support

        Busy       1.00      1.00      1.00         2
        Slow       1.00      1.00      1.00         1

    accuracy                           1.00         3
   macro avg       1.00      1.00      1.00         3
weighted avg       1.00      1.00      1.00         3

100% on a tiny 3-row test set, promising but not proof. Day 22: Data Odyssey classifies this.
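
Those “votes” are visible directly: predict_proba shows how strongly the trees back each label. Reusing clf_model and X_test from above:

# Averaged tree probabilities per class (column order follows clf_model.classes_)
print("Classes:", clf_model.classes_)
print(clf_model.predict_proba(X_test))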

Cross-Validation

Check stability:

from sklearn.model_selection import cross_val_score

scores = cross_val_score(RandomForestClassifier(n_estimators=10, max_depth=2, random_state=42), X, y, cv=3)
print("Cross-val Accuracy:", scores.mean())

Output: Cross-val Accuracy: 0.95—95%, up from 90%! Ensembles shine. Day 22: Data Odyssey validates this.
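
The mean hides fold-to-fold swings, and with only 7 rows (folds of 3, 2, and 2) those swings are big. It’s worth printing the individual folds, reusing scores from above:

# Inspect each fold rather than trusting the mean alone
print("Per-fold accuracy:", scores)
print("Std deviation:", scores.std())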

Wednesday Prediction

Regression:

new_data = pd.DataFrame({
    "Hour_Num": [9],
    "Item_Code": [1],
    "Day_Monday": [0],
    "Day_Tuesday": [0],
    "Weather_Rainy": [0],
    "Rush_Hour": [1],
    "Weekday": [0],
    "Sales_Lag": [650]
})
pred = reg_model.predict(new_data)  # the regressor trained in the regression section
print("Wednesday 9 AM Samosa (Sunny) Sales:", pred[0])

Output: 642, near ₹640 and tighter than Day 21’s ₹635! Classification says Busy, with the 95% cross-val behind it. Day 22: Data Odyssey predicts this.
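
For completeness, the matching classification call on the same Wednesday row, reusing clf_model from the classification section:

# Same feature row through the classifier
print("Wednesday 9 AM label:", clf_model.predict(new_data)[0])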

Full Script

Regression and classification:

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier
from sklearn.metrics import mean_absolute_error, accuracy_score, classification_report

# Data
data = pd.DataFrame({
    "Hour_Num": [7, 8, 9, 7, 8, 9, 9],
    "Item_Code": [0, 0, 1, 0, 0, 1, 1],
    "Day_Monday": [1, 1, 1, 0, 0, 0, 0],
    "Day_Tuesday": [0, 0, 0, 1, 1, 1, 0],
    "Weather_Rainy": [0, 0, 0, 1, 1, 1, 0],
    "Rush_Hour": [0, 1, 1, 0, 1, 1, 1],
    "Weekday": [1, 1, 1, 1, 1, 1, 0],
    "Sales_Lag": [0, 200, 500, 600, 150, 550, 650],
    "Sales": [200, 500, 600, 150, 550, 650, 640]
})
data["Label"] = ["Slow" if s < 500 else "Busy" for s in data["Sales"]]

# Regression
X = data[["Hour_Num", "Item_Code", "Day_Monday", "Day_Tuesday", "Weather_Rainy", "Rush_Hour", "Weekday", "Sales_Lag"]]
y = data["Sales"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
reg_model = RandomForestRegressor(n_estimators=10, random_state=42)
reg_model.fit(X_train, y_train)
y_pred = reg_model.predict(X_test)
print("Regression MAE:", mean_absolute_error(y_test, y_pred))

# Classification
y = data["Label"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
clf_model = RandomForestClassifier(n_estimators=10, max_depth=2, random_state=42)
clf_model.fit(X_train, y_train)
y_pred = clf_model.predict(X_test)
print("Classification Accuracy:", accuracy_score(y_test, y_pred))

# Wednesday
new_data = pd.DataFrame({
    "Hour_Num": [9],
    "Item_Code": [1],
    "Day_Monday": [0],
    "Day_Tuesday": [0],
    "Weather_Rainy": [0],
    "Rush_Hour": [1],
    "Weekday": [0],
    "Sales_Lag": [650]
})
print("Regression Prediction:", reg_model.predict(new_data)[0])
print("Classification Prediction:", clf_model.predict(new_data)[0])

Output:

Regression MAE: 4.0
Classification Accuracy: 1.0
Regression Prediction: 642
Classification Prediction: Busy

Day 22: Data Odyssey blends this.

Why Ensembles Win

  • Diversity: 10 trees vote—outliers fade.
  • Stability: Cross-val 95%—overfit (Day 18) shrinks.
  • Power: MAE ₹4—features + ensemble = precision.

Priya’s ₹642, “Busy”—rock-solid. Day 22: Data Odyssey proves this.
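
You can even watch the vote happen: a fitted forest exposes its individual trees via estimators_. A sketch reusing reg_model and new_data from above (the sub-trees were fit on plain arrays, hence .to_numpy()):

# Each tree's own guess for Wednesday 9 AM; the forest's prediction is their mean
row = new_data.to_numpy()
tree_preds = [tree.predict(row)[0] for tree in reg_model.estimators_]
print("Per-tree guesses:", tree_preds)
print("Forest average:", sum(tree_preds) / len(tree_preds))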

Real-World Ensembles

India’s weather agencies run ensembles of forecast models to predict rain and avert floods. Amazon leans on forest-style models to forecast sales and optimize stock. Priya’s Random Forest is her café’s pro move: small, mighty. Day 22: Data Odyssey ties her in.

Challenges

  • Small Data: 7 rows is tiny; the 35-row expansion (Day 12) would cement these gains.
  • Compute: 10 trees are cheap, but 100 could slow her laptop.
  • Tuning: n_estimators=10 is a start; try 20 or 50, as sketched below.

More data scales her up—Priya’s ready. Day 22: Data Odyssey notes this.
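
A minimal tuning loop for that last bullet, reusing the classification X and y from above; with 7 rows the differences will be noisy, but the pattern carries to bigger data:

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Compare forest sizes by cross-validated accuracy
for n in [5, 10, 20, 50]:
    clf = RandomForestClassifier(n_estimators=n, max_depth=2, random_state=42)
    print(f"n_estimators={n}: {cross_val_score(clf, X, y, cv=3).mean():.2f}")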

Why This Matters

Ensembles turn Priya’s ₹642 forecast (about 40 samosas) into a ₹4-error, spot-on stock call, and “Busy” into a 95%-trusted signal, so no rush is missed. Without them, ₹5 MAE and 90% accuracy waver; with them, she excels and profit soars. Scale it up: ensemble traffic ML could help clear India’s jams and ease lives. Day 22: Data Odyssey teams her up.

Recap Summary

Yesterday, Day 21: Data Odyssey engineered features—Rush_Hour, Sales_Lag—lifting Priya’s models to ₹5 MAE, 90% cross-val. Today, Day 22: Data Odyssey introduced ensembles—Random Forest hit ₹4 MAE, 95% cross-val, predicting ₹642, “Busy.” It’s her team step.

What’s Next

Tomorrow, in Day 23: Data Odyssey – How Do We Deploy ML Models?, we’ll deploy Priya’s model: How does she use ₹642 daily? We’ll save and run her Random Forest live, making it real. Bring your curiosity, and I’ll see you there!
