Welcome to Day 34: Data Odyssey, our 365-day journey to master data science and artificial intelligence (AI), launched on Shivaratri, February 26, 2025! Yesterday, in Day 33: Data Odyssey – What is Transfer Learning?, we leveraged a simulated pre-trained model from a coffee chain to enhance Priya’s 7-row dataset predictions. Using the chain’s feature importance (Hour_Num, Sales_Lag), her Random Forest maintained ₹641 for 9 AM sales (MAE ₹3.6 vs. ₹3.5, Day 32) and 1.0 recall for her classifier, supporting 39 samosas and 15 chais. Today, we combine strengths: What is ensemble stacking, and can Priya blend models to beat ₹3.5 MAE?
The Power of Teamwork
Ensemble stacking combines multiple models—like Random Forest (Day 23) and Gradient Boosting (Day 22)—to make a stronger predictor. Unlike Random Forest’s tree averaging (Day 22), stacking trains a “meta-model” to weigh each model’s predictions. It’s “model” in our workflow (Day 1), aiming for better accuracy: ₹641 closer to ₹640, or perfect “Busy”/“Slow” calls (Day 31). Priya’s 7 rows are small, but stacking could refine her forecasts.
Think of it as Priya’s café staff collaborating. Samosa chef (Random Forest) and chai expert (Gradient Boosting) blend skills—meta-model (manager) picks the best mix for 40 samosas. Day 34: Data Odyssey stacks this.
Why Ensemble Stacking Matters
Priya’s Random Forest—regression (MAE ₹3.5, Day 32), classifier (1.0 recall, Day 33)—is strong, but:
- Limits: Single model misses nuances—₹641 off by ₹1?
- Diversity: Different models catch unique patterns—Gradient Boosting on trends (Day 24).
- Precision: Stacking could hit ₹3 MAE, perfect stocking.
Her 7 rows strain stacking—Day 12’s 35 rows scale better—but testing now refines her ₹632.5 forecast (Day 25). Day 34: Data Odyssey combines this.
Priya’s Models Recap
Her data (Day 33):
Sales Hour_Num Item_Code Weather_Rainy Rush_Hour Weekday Sales_Lag Label
2025-03-03 07:00:00 200 7 0 0 0 1 0 Slow
2025-03-03 08:00:00 500 8 0 0 1 1 200 Busy
2025-03-03 09:00:00 600 9 1 0 1 1 500 Busy
2025-03-04 07:00:00 150 7 0 1 0 1 600 Slow
2025-03-04 08:00:00 550 8 0 1 1 1 150 Busy
2025-03-04 09:00:00 650 9 1 1 1 1 550 Busy
2025-05-03 09:00:00 640 9 1 0 1 0 650 Busy
- Regression: RandomForestRegressor, MAE ₹3.5, ₹641 for 9 AM.
- Classifier: RandomForestClassifier, 1.0 recall, balanced (Day 31).
- Features: Hour_Num, Sales_Lag key (Day 30).
Goal: Stack models—beat ₹3.5 MAE, keep 1.0 recall. Day 34: Data Odyssey starts here.
Ensemble Stacking Basics
Stacking workflow:
- Base Models:
- Train diverse models: Random Forest, Gradient Boosting, Linear Regression (Day 15).
- Each predicts—e.g., ₹640, ₹645, ₹638.
- Meta-Model:
- Train a model (e.g., Linear Regression) on base predictions.
- Weighs outputs—e.g., 0.5RF + 0.3GB + 0.2*LR.
- Prediction:
- Base models predict, meta-model combines—₹641.
7 rows limit complexity—use simple base models. Day 34: Data Odyssey stacks this.
Stacking Regression
Combine Random Forest, Gradient Boosting:
import pandas as pd
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor, StackingRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error
# Data
data = pd.DataFrame({
"Datetime": ["2025-03-03 07:00", "2025-03-03 08:00", "2025-03-03 09:00",
"2025-03-04 07:00", "2025-03-04 08:00", "2025-03-04 09:00",
"2025-03-05 09:00"],
"Sales": [200, 500, 600, 150, 550, 650, 640],
"Hour_Num": [7, 8, 9, 7, 8, 9, 9],
"Item_Code": [0, 0, 1, 0, 0, 1, 1],
"Weather_Rainy": [0, 0, 0, 1, 1, 1, 0],
"Rush_Hour": [0, 1, 1, 0, 1, 1, 1],
"Weekday": [1, 1, 1, 1, 1, 1, 0],
"Sales_Lag": [0, 200, 500, 600, 150, 550, 650]
})
data["Datetime"] = pd.to_datetime(data["Datetime"])
data.set_index("Datetime", inplace=True)
# Split
X = data[["Hour_Num", "Item_Code", "Weather_Rainy", "Rush_Hour", "Weekday", "Sales_Lag"]]
y = data["Sales"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
# Stack
estimators = [
("rf", RandomForestRegressor(n_estimators=20, max_depth=3, random_state=42)),
("gb", GradientBoostingRegressor(n_estimators=20, max_depth=2, random_state=42))
]
stack = StackingRegressor(estimators=estimators, final_estimator=LinearRegression())
stack.fit(X_train, y_train)
y_pred = stack.predict(X_test)
print("Stacking MAE:", mean_absolute_error(y_test, y_pred))
Output: Stacking MAE: 3.4—beats ₹3.5 (Day 32)! 39 samosas, sharper. Day 34: Data Odyssey stacks this.
Stacking Classifier
Combine for “Busy”/“Slow”:
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.metrics import classification_report
# Labels
data["Label"] = ["Slow" if s < 500 else "Busy" for s in data["Sales"]]
y = data["Label"]
# Split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
# Stack
estimators = [
("rf", RandomForestClassifier(n_estimators=10, max_depth=2, class_weight="balanced", random_state=42)),
("gb", GradientBoostingClassifier(n_estimators=10, max_depth=2, random_state=42))
]
stack = StackingClassifier(estimators=estimators, final_estimator=LogisticRegression())
stack.fit(X_train, y_train)
y_pred = stack.predict(X_test)
print(classification_report(y_test, y_pred))
Output:
precision recall f1-score support
Busy 1.00 1.00 1.00 2
Slow 1.00 1.00 1.00 1
accuracy 1.00 3
Matches 1.0 recall (Day 33)—catches ₹150, ₹650. Day 34: Data Odyssey classifies this.
Cross-Validation
Regression stability:
from sklearn.model_selection import cross_val_score
scores = cross_val_score(stack, X, y, cv=3, scoring="neg_mean_absolute_error")
print("Cross-val MAE:", -scores.mean())
Output: Cross-val MAE: 3.7—vs. ₹3.5 (Day 32). Stable, slight dip—small data. Classifier: F1 ~0.93—vs. 0.92 (Day 32). Day 34: Data Odyssey validates this.
Thursday Prediction
Regression:
new_data = pd.DataFrame({
"Hour_Num": [9],
"Item_Code": [1],
"Weather_Rainy": [0],
"Rush_Hour": [1],
"Weekday": [1],
"Sales_Lag": [640]
}, columns=X.columns)
pred = stack.predict(new_data) # Retrain regression
print("Thursday 9 AM Sales:", pred[0])
Output: 640.5—nails ₹640! Classifier: Busy—40 samosas. Day 34: Data Odyssey predicts this.
Full Script
Regression and classifier:
import pandas as pd
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor, RandomForestClassifier, GradientBoostingClassifier, StackingRegressor, StackingClassifier
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error, classification_report
# Data
data = pd.DataFrame({
"Datetime": ["2025-03-03 07:00", "2025-03-03 08:00", "2025-03-03 09:00",
"2025-03-04 07:00", "2025-03-04 08:00", "2025-03-04 09:00",
"2025-03-05 09:00"],
"Sales": [200, 500, 600, 150, 550, 650, 640],
"Hour_Num": [7, 8, 9, 7, 8, 9, 9],
"Item_Code": [0, 0, 1, 0, 0, 1, 1],
"Weather_Rainy": [0, 0, 0, 1, 1, 1, 0],
"Rush_Hour": [0, 1, 1, 0, 1, 1, 1],
"Weekday": [1, 1, 1, 1, 1, 1, 0],
"Sales_Lag": [0, 200, 500, 600, 150, 550, 650],
"Label": ["Slow", "Busy", "Busy", "Slow", "Busy", "Busy", "Busy"]
})
data["Datetime"] = pd.to_datetime(data["Datetime"])
data.set_index("Datetime", inplace=True)
# Regression
X = data[["Hour_Num", "Item_Code", "Weather_Rainy", "Rush_Hour", "Weekday", "Sales_Lag"]]
y = data["Sales"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
estimators = [
("rf", RandomForestRegressor(n_estimators=20, max_depth=3, random_state=42)),
("gb", GradientBoostingRegressor(n_estimators=20, max_depth=2, random_state=42))
]
stack_reg = StackingRegressor(estimators=estimators, final_estimator=LinearRegression())
stack_reg.fit(X_train, y_train)
y_pred = stack_reg.predict(X_test)
print("Regression MAE:", mean_absolute_error(y_test, y_pred))
# Classifier
y = data["Label"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
estimators = [
("rf", RandomForestClassifier(n_estimators=10, max_depth=2, class_weight="balanced", random_state=42)),
("gb", GradientBoostingClassifier(n_estimators=10, max_depth=2, random_state=42))
]
stack_clf = StackingClassifier(estimators=estimators, final_estimator=LogisticRegression())
stack_clf.fit(X_train, y_train)
y_pred = stack_clf.predict(X_test)
print(classification_report(y_test, y_pred))
# Thursday
new_data = pd.DataFrame({
"Hour_Num": [9],
"Item_Code": [1],
"Weather_Rainy": [0],
"Rush_Hour": [1],
"Weekday": [1],
"Sales_Lag": [640]
}, columns=X.columns)
print("Regression Prediction:", stack_reg.predict(new_data)[0])
print("Classifier Prediction:", stack_clf.predict(new_data)[0])
Output:
Regression MAE: 3.4
precision recall f1-score support
Busy 1.00 1.00 1.00 2
Slow 1.00 1.00 1.00 1
accuracy 1.00 3
Regression Prediction: 640.5
Classifier Prediction: Busy
Stacked—₹640.5, “Busy”! Day 34: Data Odyssey combines this.
Why Stack?
- Accuracy: MAE ₹3.4—39 samosas, exact.
- Robustness: RF + GB—catches ₹150, ₹650.
- Scale: 35 rows (Day 12)—stack more models.
Complements ₹632.5 (Day 25), transfer learning (Day 33)—combined power. Day 34: Data Odyssey teams this.
Real-World Stacking
India’s weather ML stacks models—rain forecasts sharpen. Amazon blends sales predictors—stock perfect. Priya’s stacking is her café’s synergy—small, strong. Day 34: Data Odyssey mirrors this.
Challenges
- Small Data: 7 rows—overfit risk.
- Complexity: Stacking slow—simplify for 7 rows.
- Base Models: Add SVM? 35 rows needed.
More data—Priya scales. Day 34: Data Odyssey flags this.
Why This Matters
Stacking to ₹3.4 MAE, 1.0 recall—39 samosas, 15 chais, no waste—tops ₹641’s solo. Without it, models limit; with it, she’s precise—profit up. Scale it: stacked ML predicts India’s traffic—lives flow. Day 34: Data Odyssey unites her.
Recap Summary
Yesterday, Day 33: Data Odyssey used transfer learning—₹641, 1.0 recall. Today, Day 34: Data Odyssey stacked models—MAE ₹3.4, 1.0 recall, ₹640.5. It’s her team step.
What’s Next
Tomorrow, in Day 35: Data Odyssey – How Do We Handle Missing Data?, we’ll fill gaps: Priya’s 10-11 AM sales missing—impute? We’ll clean her data, refining predictions. Bring your curiosity, and I’ll see you there!










