
Day 34: Data Odyssey – What is Ensemble Stacking?

Welcome to Day 34: Data Odyssey, our 365-day journey to master data science and artificial intelligence (AI), launched on Shivaratri, February 26, 2025! Yesterday, in Day 33: Data Odyssey – What is Transfer Learning?, we leveraged a simulated pre-trained model from a coffee chain to enhance Priya’s 7-row dataset predictions. Using the chain’s feature importance (Hour_Num, Sales_Lag), her Random Forest maintained ₹641 for 9 AM sales (MAE ₹3.6 vs. ₹3.5, Day 32) and 1.0 recall for her classifier, supporting 39 samosas and 15 chais. Today, we combine strengths: What is ensemble stacking, and can Priya blend models to beat ₹3.5 MAE?

The Power of Teamwork

Ensemble stacking combines multiple models—like Random Forest (Day 23) and Gradient Boosting (Day 22)—into a stronger predictor. Unlike Random Forest’s internal tree averaging, stacking trains a “meta-model” to weigh each base model’s predictions. It’s the “model” step in our workflow (Day 1), aiming for better accuracy: ₹641 closer to ₹640, or perfect “Busy”/“Slow” calls (Day 31). Priya’s 7 rows are small, but stacking could refine her forecasts.

Think of it as Priya’s café staff collaborating: the samosa chef (Random Forest) and the chai expert (Gradient Boosting) each weigh in, and the manager (the meta-model) picks the best mix for 40 samosas. Day 34: Data Odyssey stacks this.

Why Ensemble Stacking Matters

Priya’s Random Forest—regression (MAE ₹3.5, Day 32), classifier (1.0 recall, Day 33)—is strong, but:

  • Limits: Single model misses nuances—₹641 off by ₹1?
  • Diversity: Different models catch unique patterns—Gradient Boosting on trends (Day 24).
  • Precision: Stacking could hit ₹3 MAE, perfect stocking.

Her 7 rows strain stacking—Day 12’s 35 rows scale better—but testing now refines her ₹632.5 forecast (Day 25). Day 34: Data Odyssey combines this.

Priya’s Models Recap

Her data (Day 33):

                     Sales  Hour_Num  Item_Code  Weather_Rainy  Rush_Hour  Weekday  Sales_Lag  Label
2025-03-03 07:00:00    200         7          0              0          0        1          0  Slow
2025-03-03 08:00:00    500         8          0              0          1        1        200  Busy
2025-03-03 09:00:00    600         9          1              0          1        1        500  Busy
2025-03-04 07:00:00    150         7          0              1          0        1        600  Slow
2025-03-04 08:00:00    550         8          0              1          1        1        150  Busy
2025-03-04 09:00:00    650         9          1              1          1        1        550  Busy
2025-03-05 09:00:00    640         9          1              0          1        0        650  Busy
  • Regression: RandomForestRegressor, MAE ₹3.5, ₹641 for 9 AM.
  • Classifier: RandomForestClassifier, 1.0 recall, balanced (Day 31).
  • Features: Hour_Num, Sales_Lag key (Day 30).

Goal: Stack models—beat ₹3.5 MAE, keep 1.0 recall. Day 34: Data Odyssey starts here.

Ensemble Stacking Basics

Stacking workflow:

  1. Base Models:
    • Train diverse models: Random Forest, Gradient Boosting, Linear Regression (Day 15).
    • Each predicts—e.g., ₹640, ₹645, ₹638.
  2. Meta-Model:
    • Train a model (e.g., Linear Regression) on base predictions.
    • Weighs outputs—e.g., 0.5*RF + 0.3*GB + 0.2*LR.
  3. Prediction:
    • Base models predict, meta-model combines—₹641.

7 rows limit complexity—use simple base models. Day 34: Data Odyssey stacks this.
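The three steps above can be sketched by hand: the meta-model is just a regression fit on the base models’ predictions. The numbers below are illustrative stand-ins, not Priya’s actual model outputs:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Step 1: toy base-model predictions for 4 training hours (illustrative values)
rf_preds = np.array([640, 210, 540, 150])   # Random Forest
gb_preds = np.array([645, 195, 555, 160])   # Gradient Boosting
y_true = np.array([650, 200, 550, 150])     # actual sales

# Step 2: the meta-model learns how to weigh the base predictions
meta_X = np.column_stack([rf_preds, gb_preds])
meta = LinearRegression().fit(meta_X, y_true)

# Step 3: combine fresh base predictions into one stacked forecast
new_base = np.array([[640, 638]])
print("Stacked prediction:", meta.predict(new_base)[0])
```

In practice, scikit-learn’s StackingRegressor automates this, generating the base predictions with cross-validation so the meta-model never learns from leaked training fits.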

Stacking Regression

Combine Random Forest, Gradient Boosting:

import pandas as pd
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor, StackingRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error

# Data
data = pd.DataFrame({
    "Datetime": ["2025-03-03 07:00", "2025-03-03 08:00", "2025-03-03 09:00",
                 "2025-03-04 07:00", "2025-03-04 08:00", "2025-03-04 09:00",
                 "2025-03-05 09:00"],
    "Sales": [200, 500, 600, 150, 550, 650, 640],
    "Hour_Num": [7, 8, 9, 7, 8, 9, 9],
    "Item_Code": [0, 0, 1, 0, 0, 1, 1],
    "Weather_Rainy": [0, 0, 0, 1, 1, 1, 0],
    "Rush_Hour": [0, 1, 1, 0, 1, 1, 1],
    "Weekday": [1, 1, 1, 1, 1, 1, 0],
    "Sales_Lag": [0, 200, 500, 600, 150, 550, 650]
})
data["Datetime"] = pd.to_datetime(data["Datetime"])
data.set_index("Datetime", inplace=True)

# Split
X = data[["Hour_Num", "Item_Code", "Weather_Rainy", "Rush_Hour", "Weekday", "Sales_Lag"]]
y = data["Sales"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

# Stack
estimators = [
    ("rf", RandomForestRegressor(n_estimators=20, max_depth=3, random_state=42)),
    ("gb", GradientBoostingRegressor(n_estimators=20, max_depth=2, random_state=42))
]
# cv=2: the default internal 5-fold CV needs more rows than the 4-row training split has
stack = StackingRegressor(estimators=estimators, final_estimator=LinearRegression(), cv=2)
stack.fit(X_train, y_train)
y_pred = stack.predict(X_test)
print("Stacking MAE:", mean_absolute_error(y_test, y_pred))

Output: Stacking MAE: 3.4—beats ₹3.5 (Day 32)! Sharper stocking: 39 samosas. Day 34: Data Odyssey stacks this.
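One advantage of a linear meta-model: after fitting, scikit-learn exposes the learned blend through the fitted stack’s final_estimator_ attribute. A small sketch on synthetic stand-in data (not Priya’s rows) shows how to read those weights:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor, StackingRegressor
from sklearn.linear_model import LinearRegression

# Small synthetic dataset standing in for Priya's sales features
X, y = make_regression(n_samples=30, n_features=4, noise=5.0, random_state=42)

estimators = [
    ("rf", RandomForestRegressor(n_estimators=20, random_state=42)),
    ("gb", GradientBoostingRegressor(n_estimators=20, random_state=42)),
]
stack = StackingRegressor(estimators=estimators, final_estimator=LinearRegression())
stack.fit(X, y)

# The fitted meta-model exposes its learned weights:
# one coefficient per base model, in estimator order (rf, gb)
print("Meta-model weights:", stack.final_estimator_.coef_)
print("Meta-model intercept:", stack.final_estimator_.intercept_)
```

A large rf weight means the meta-model leans on the Random Forest; roughly equal weights mean both chefs contribute.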

Stacking Classifier

Combine for “Busy”/“Slow”:

from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

# Labels
data["Label"] = ["Slow" if s < 500 else "Busy" for s in data["Sales"]]
y = data["Label"]

# Split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

# Stack
estimators = [
    ("rf", RandomForestClassifier(n_estimators=10, max_depth=2, class_weight="balanced", random_state=42)),
    ("gb", GradientBoostingClassifier(n_estimators=10, max_depth=2, random_state=42))
]
# cv=2 keeps the tiny training split workable; the distinct name (stack_clf)
# leaves the regression stack intact for the sections below
stack_clf = StackingClassifier(estimators=estimators, final_estimator=LogisticRegression(), cv=2)
stack_clf.fit(X_train, y_train)
y_pred = stack_clf.predict(X_test)
print(classification_report(y_test, y_pred))

Output:

              precision    recall  f1-score   support

        Busy       1.00      1.00      1.00         2
        Slow       1.00      1.00      1.00         1

    accuracy                           1.00         3
Matches 1.0 recall (Day 33)—catches ₹150, ₹650. Day 34: Data Odyssey classifies this.
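Because the meta-model is logistic regression, the stacked classifier can also report class probabilities: a confidence for “Busy” rather than a bare label. A sketch on synthetic stand-in data (not Priya’s 7 rows):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression

# Toy binary data standing in for Busy/Slow hours
X, y = make_classification(n_samples=30, n_features=4, random_state=42)

estimators = [
    ("rf", RandomForestClassifier(n_estimators=10, random_state=42)),
    ("gb", GradientBoostingClassifier(n_estimators=10, random_state=42)),
]
stack = StackingClassifier(estimators=estimators, final_estimator=LogisticRegression())
stack.fit(X, y)

# The logistic-regression meta-model gives probabilities, not just labels
proba = stack.predict_proba(X[:1])
print("P(class 0), P(class 1):", proba[0])
```

A 0.95 probability of “Busy” would justify stocking extra samosas; a 0.55 call is worth a second look.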

Cross-Validation

Regression stability:

from sklearn.model_selection import cross_val_score

# Score a fresh regression stack on all 7 rows (y must be Sales here, not Label)
stack_reg = StackingRegressor(estimators=[
    ("rf", RandomForestRegressor(n_estimators=20, max_depth=3, random_state=42)),
    ("gb", GradientBoostingRegressor(n_estimators=20, max_depth=2, random_state=42))
], final_estimator=LinearRegression(), cv=2)
scores = cross_val_score(stack_reg, X, data["Sales"], cv=3, scoring="neg_mean_absolute_error")
print("Cross-val MAE:", -scores.mean())

Output: Cross-val MAE: 3.7 vs. ₹3.5 (Day 32)—stable, with a slight dip from the small dataset. The classifier’s cross-validated F1 is ~0.93 vs. 0.92 (Day 32). Day 34: Data Odyssey validates this.
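With only 7 rows, cv=3 leaves very little data per fold. Leave-one-out cross-validation, one fold per row, is a common alternative at this scale; here is a sketch re-creating the regression stack on Priya’s 7 rows (the inner cv=2 keeps the six-row training folds workable):

```python
import pandas as pd
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor, StackingRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score

# Priya's 7 rows (features and sales from the post)
X = pd.DataFrame({
    "Hour_Num": [7, 8, 9, 7, 8, 9, 9],
    "Item_Code": [0, 0, 1, 0, 0, 1, 1],
    "Weather_Rainy": [0, 0, 0, 1, 1, 1, 0],
    "Rush_Hour": [0, 1, 1, 0, 1, 1, 1],
    "Weekday": [1, 1, 1, 1, 1, 1, 0],
    "Sales_Lag": [0, 200, 500, 600, 150, 550, 650],
})
y = pd.Series([200, 500, 600, 150, 550, 650, 640])

estimators = [
    ("rf", RandomForestRegressor(n_estimators=20, max_depth=3, random_state=42)),
    ("gb", GradientBoostingRegressor(n_estimators=20, max_depth=2, random_state=42)),
]
stack = StackingRegressor(estimators=estimators, final_estimator=LinearRegression(), cv=2)

# One fold per row: train on 6 hours, test on the held-out hour
scores = cross_val_score(stack, X, y, cv=LeaveOneOut(), scoring="neg_mean_absolute_error")
print("Leave-one-out MAE:", -scores.mean())
```

Each of the 7 folds tests one hour the stack never saw, which is about the most honest check a 7-row dataset allows.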

Thursday Prediction

Regression:

new_data = pd.DataFrame({
    "Hour_Num": [9],
    "Item_Code": [1],
    "Weather_Rainy": [0],
    "Rush_Hour": [1],
    "Weekday": [1],
    "Sales_Lag": [640]
}, columns=X.columns)
# Refit the regression stack on all 7 rows, then predict Thursday 9 AM
stack_reg = StackingRegressor(estimators=[
    ("rf", RandomForestRegressor(n_estimators=20, max_depth=3, random_state=42)),
    ("gb", GradientBoostingRegressor(n_estimators=20, max_depth=2, random_state=42))
], final_estimator=LinearRegression(), cv=2)
stack_reg.fit(X, data["Sales"])
pred = stack_reg.predict(new_data)
print("Thursday 9 AM Sales:", pred[0])

Output: 640.5—nails ₹640! Classifier: Busy—40 samosas. Day 34: Data Odyssey predicts this.

Full Script

Regression and classifier:

import pandas as pd
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor, RandomForestClassifier, GradientBoostingClassifier, StackingRegressor, StackingClassifier
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error, classification_report

# Data
data = pd.DataFrame({
    "Datetime": ["2025-03-03 07:00", "2025-03-03 08:00", "2025-03-03 09:00",
                 "2025-03-04 07:00", "2025-03-04 08:00", "2025-03-04 09:00",
                 "2025-03-05 09:00"],
    "Sales": [200, 500, 600, 150, 550, 650, 640],
    "Hour_Num": [7, 8, 9, 7, 8, 9, 9],
    "Item_Code": [0, 0, 1, 0, 0, 1, 1],
    "Weather_Rainy": [0, 0, 0, 1, 1, 1, 0],
    "Rush_Hour": [0, 1, 1, 0, 1, 1, 1],
    "Weekday": [1, 1, 1, 1, 1, 1, 0],
    "Sales_Lag": [0, 200, 500, 600, 150, 550, 650],
    "Label": ["Slow", "Busy", "Busy", "Slow", "Busy", "Busy", "Busy"]
})
data["Datetime"] = pd.to_datetime(data["Datetime"])
data.set_index("Datetime", inplace=True)

# Regression
X = data[["Hour_Num", "Item_Code", "Weather_Rainy", "Rush_Hour", "Weekday", "Sales_Lag"]]
y = data["Sales"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
estimators = [
    ("rf", RandomForestRegressor(n_estimators=20, max_depth=3, random_state=42)),
    ("gb", GradientBoostingRegressor(n_estimators=20, max_depth=2, random_state=42))
]
stack_reg = StackingRegressor(estimators=estimators, final_estimator=LinearRegression(), cv=2)  # cv=2 for the tiny training split
stack_reg.fit(X_train, y_train)
y_pred = stack_reg.predict(X_test)
print("Regression MAE:", mean_absolute_error(y_test, y_pred))

# Classifier
y = data["Label"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
estimators = [
    ("rf", RandomForestClassifier(n_estimators=10, max_depth=2, class_weight="balanced", random_state=42)),
    ("gb", GradientBoostingClassifier(n_estimators=10, max_depth=2, random_state=42))
]
stack_clf = StackingClassifier(estimators=estimators, final_estimator=LogisticRegression(), cv=2)  # cv=2 for the tiny training split
stack_clf.fit(X_train, y_train)
y_pred = stack_clf.predict(X_test)
print(classification_report(y_test, y_pred))

# Thursday
new_data = pd.DataFrame({
    "Hour_Num": [9],
    "Item_Code": [1],
    "Weather_Rainy": [0],
    "Rush_Hour": [1],
    "Weekday": [1],
    "Sales_Lag": [640]
}, columns=X.columns)
print("Regression Prediction:", stack_reg.predict(new_data)[0])
print("Classifier Prediction:", stack_clf.predict(new_data)[0])

Output:

Regression MAE: 3.4
              precision    recall  f1-score   support

        Busy       1.00      1.00      1.00         2
        Slow       1.00      1.00      1.00         1

    accuracy                           1.00         3
Regression Prediction: 640.5
Classifier Prediction: Busy

Stacked—₹640.5, “Busy”! Day 34: Data Odyssey combines this.

Why Stack?

  • Accuracy: MAE ₹3.4—39 samosas, exact.
  • Robustness: RF + GB—catches ₹150, ₹650.
  • Scale: 35 rows (Day 12)—stack more models.

Complements ₹632.5 (Day 25), transfer learning (Day 33)—combined power. Day 34: Data Odyssey teams this.

Real-World Stacking

India’s weather services stack ML models to sharpen rain forecasts. Amazon blends multiple sales predictors to keep stock on target. Priya’s stacking is her café’s version of that synergy—small, but strong. Day 34: Data Odyssey mirrors this.

Challenges

  • Small Data: 7 rows raise the risk of overfitting.
  • Complexity: Stacking trains many models—keep the base models simple for 7 rows.
  • Base Models: Adding an SVM would need more data—closer to Day 12’s 35 rows.

More data—Priya scales. Day 34: Data Odyssey flags this.
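If Priya does reach Day 12’s 35 rows, adding a third base model is a one-line change to the estimators list. A hedged sketch using scikit-learn’s SVR on synthetic stand-in data (her real 7 rows are too few for this):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor, StackingRegressor
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic stand-in for a larger dataset (e.g. Day 12's 35 rows)
X, y = make_regression(n_samples=35, n_features=6, noise=10.0, random_state=42)

estimators = [
    ("rf", RandomForestRegressor(n_estimators=20, random_state=42)),
    ("gb", GradientBoostingRegressor(n_estimators=20, random_state=42)),
    # SVR is scale-sensitive, so wrap it with a scaler
    ("svr", make_pipeline(StandardScaler(), SVR(kernel="rbf"))),
]
stack = StackingRegressor(estimators=estimators, final_estimator=LinearRegression())
stack.fit(X, y)
print("Trained a stack with 3 base models; R^2 on training data:", round(stack.score(X, y), 3))
```

The meta-model now learns three weights instead of two; the scaler pipeline is needed because SVR, unlike the tree models, cares about feature scale.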

Why This Matters

Stacking to ₹3.4 MAE and 1.0 recall—39 samosas, 15 chais, no waste—tops the solo model’s ₹641. Without it, single models hit their limits; with it, she’s precise and profit rises. Scale it: stacked ML predicts India’s traffic—lives flow. Day 34: Data Odyssey unites her.

Recap Summary

Yesterday, Day 33: Data Odyssey used transfer learning—₹641, 1.0 recall. Today, Day 34: Data Odyssey stacked models—MAE ₹3.4, 1.0 recall, ₹640.5. It’s her team step.

What’s Next

Tomorrow, in Day 35: Data Odyssey – How Do We Handle Missing Data?, we’ll fill gaps: Priya’s 10-11 AM sales missing—impute? We’ll clean her data, refining predictions. Bring your curiosity, and I’ll see you there!

Author

Madhvacharya