Day 44: Data Odyssey – What is Hyperparameter Tuning?

Welcome to Day 44 of our 365-day journey to master data science and artificial intelligence, launched on February 26, 2025. Yesterday, in Day 43, we applied DBSCAN clustering to Priya’s 11-row dataset, grouping hours into three clusters: high-sales 9 AM (600-650 rupees, Cluster 1), moderate 8 AM and 10 AM (500-550 rupees, Cluster 0), and low 11 AM (400-450 rupees, Cluster 2). Adding clusters as a feature improved her stacked ensemble’s mean absolute error to 3.3 from 3.4, predicting 642 rupees for Thursday’s 9 AM with 32 samosas. Today, we optimize: What is hyperparameter tuning, and can Priya fine-tune her models to boost accuracy?

Sharpening the Model

Hyperparameter tuning adjusts a model’s settings—like the number of trees in a Random Forest or learning rate in Gradient Boosting—to improve performance. Priya’s stacked ensemble predicts 642 rupees accurately, but tuning could reduce the mean absolute error below 3.3, ensuring precise stocking of 32 samosas. This is part of the model phase in our workflow, refining her 643-rupee time series forecast and clustering insights to minimize waste and maximize profit.
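To make "settings" concrete, the short sketch below lists a few of the defaults that scikit-learn assigns when a model is created without arguments. It is only an illustration; the values printed depend on the installed scikit-learn version.

```python
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor

# Hyperparameters are constructor arguments, fixed before training starts.
# get_params() returns them as a plain dictionary.
rf_defaults = RandomForestRegressor().get_params()
gb_defaults = GradientBoostingRegressor().get_params()

print("RF n_estimators:", rf_defaults["n_estimators"])
print("RF max_depth:", rf_defaults["max_depth"])
print("GB learning_rate:", gb_defaults["learning_rate"])
print("GB max_depth:", gb_defaults["max_depth"])
```

Tuning means trying other values for entries like these and keeping the combination that scores best.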

Imagine Priya fine-tuning her café’s recipe. Her model suggests 32 samosas, but slight tweaks to its settings could predict 645 rupees, avoiding a stockout. Hyperparameter tuning sharpens her predictions. This is the focus of Day 44.

Why Hyperparameter Tuning Matters

Priya’s models—regression with 3.3 mean absolute error, classifier with 1.0 Slow recall, and ARIMA with 2.5 mean absolute error—are strong, but:

  • Accuracy: Can the mean absolute error drop below 3.3, and would that shift the stocking call to 33 samosas?
  • Efficiency: Can lighter models respond faster behind her Flask API?
  • Scale: As her data grows to 35 rows, tuning helps keep performance robust.

Tuning enhances her 632.5-rupee forecast, clustering, and time series predictions, optimizing her café’s operations. Day 44 refines this.

Priya’s Data Recap

Her clustered data from Day 43:

Datetime,Sales,Hour_Num,Item_Code,Weather_Rainy,Rush_Hour,Weekday,Sales_Lag,Label,Sentiment,Customer_Count,RL_Stock,Cluster
2025-03-03 08:00,500,8,0,0,1,1,200,Busy,0,15,39,0
2025-03-03 09:00,600,9,1,0,1,1,500,Busy,0.6588,20,32,1
2025-03-03 10:00,500,10,1,0,0,1,600,Busy,0.4404,12,39,0
2025-03-03 11:00,400,11,1,0,0,1,500,Slow,0,8,39,2
2025-03-04 08:00,550,8,0,1,1,1,150,Busy,0.5719,16,39,0
2025-03-04 09:00,650,9,1,1,1,1,550,Busy,0.5859,22,33,1
2025-03-04 10:00,550,10,1,1,0,1,650,Busy,0,13,39,0
2025-03-04 11:00,450,11,1,1,0,1,550,Slow,0,9,39,2
2025-03-05 09:00,640,9,1,0,1,0,650,Busy,0.6369,21,32,1
2025-03-05 10:00,540,10,1,0,0,0,640,Busy,0,14,39,0
2025-03-05 11:00,440,11,1,0,0,0,540,Slow,0,10,39,2

  • Models: Stacked ensemble with a mean absolute error of 3.3, predicting 642 rupees for 9 AM.
  • Issue: The models are untuned, and default settings limit accuracy.

Goal: Tune hyperparameters to reduce the mean absolute error and refine the 32-samosa stocking call. Day 44 begins here.

Hyperparameter Tuning Basics

Methods for Priya’s models:

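  1. Grid Search:
    • Test every combination of the listed parameter values. Thorough but slow.
  2. Random Search:
    • Sample a fixed number of random parameter sets. Faster, and often nearly as effective.
  3. Bayesian Optimization:
    • Build a model of how parameters affect the score and use it to choose the next candidates. Efficient when each fit is expensive.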

With 11 rows, Random Search balances speed and accuracy for her stacked ensemble, scalable to 35 rows. Day 44 applies this.
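The cost difference between the first two methods can be counted directly. The sketch below uses scikit-learn's ParameterGrid and ParameterSampler on an example Random Forest search space; the number of folds (3) is an assumption chosen for illustration.

```python
from sklearn.model_selection import ParameterGrid, ParameterSampler

# Example search space for a Random Forest
param_dist = {
    "n_estimators": [10, 20, 50, 100],
    "max_depth": [2, 3, 5, None],
    "min_samples_split": [2, 5, 10],
    "min_samples_leaf": [1, 2, 4],
}

# Grid Search would try every combination: 4 * 4 * 3 * 3
n_grid = len(ParameterGrid(param_dist))

# Random Search tries only n_iter sampled combinations
sampled = list(ParameterSampler(param_dist, n_iter=10, random_state=42))
n_random = len(sampled)

cv_folds = 3  # assumed number of cross-validation folds
print("Grid Search fits:", n_grid * cv_folds)
print("Random Search fits:", n_random * cv_folds)
```

With this space, Grid Search needs 144 candidates while Random Search with n_iter=10 needs 10, each multiplied by the number of folds.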

Preparing Data

Load and encode clusters:

import pandas as pd
import numpy as np
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor, StackingRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split, RandomizedSearchCV
from sklearn.metrics import mean_absolute_error

data_clean = pd.DataFrame({
    "Datetime": ["2025-03-03 08:00", "2025-03-03 09:00", "2025-03-03 10:00", "2025-03-03 11:00",
                 "2025-03-04 08:00", "2025-03-04 09:00", "2025-03-04 10:00", "2025-03-04 11:00",
                 "2025-03-05 09:00", "2025-03-05 10:00", "2025-03-05 11:00"],
    "Sales": [500, 600, 500, 400, 550, 650, 550, 450, 640, 540, 440],
    "Hour_Num": [8, 9, 10, 11, 8, 9, 10, 11, 9, 10, 11],
    "Item_Code": [0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1],
    "Weather_Rainy": [0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0],
    "Rush_Hour": [1, 1, 0, 0, 1, 1, 0, 0, 1, 0, 0],
    "Weekday": [1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0],
    "Sales_Lag": [200, 500, 600, 500, 150, 550, 650, 550, 650, 640, 540],
    "Sentiment": [0, 0.6588, 0.4404, 0, 0.5719, 0.5859, 0, 0, 0.6369, 0, 0],
    "Customer_Count": [15, 20, 12, 8, 16, 22, 13, 9, 21, 14, 10],
    "RL_Stock": [39, 32, 39, 39, 39, 33, 39, 39, 32, 39, 39],
    "Cluster": [0, 1, 0, 2, 0, 1, 0, 2, 1, 0, 2]
})
data_clean["Datetime"] = pd.to_datetime(data_clean["Datetime"])
X = data_clean[["Hour_Num", "Item_Code", "Weather_Rainy", "Rush_Hour", "Weekday", "Sales_Lag", "Sentiment", "Customer_Count", "RL_Stock", "Cluster"]]
X = pd.get_dummies(X, columns=["Cluster"], drop_first=True)
y = data_clean["Sales"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

Data ready for tuning. Day 44 prepares this.
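A quick sanity check helps on data this small. The sketch below repeats only the cluster encoding and the split on row positions, to show which dummy columns appear and how many rows land on each side.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Cluster labels for the 11 rows, in table order
clusters = pd.DataFrame({"Cluster": [0, 1, 0, 2, 0, 1, 0, 2, 1, 0, 2]})

# drop_first=True drops Cluster_0, which becomes the baseline category
encoded = pd.get_dummies(clusters, columns=["Cluster"], drop_first=True)
print("Dummy columns:", list(encoded.columns))

# Split the row positions the same way the features are split
row_ids = encoded.index.to_list()
train_ids, test_ids = train_test_split(row_ids, test_size=0.33, random_state=42)
print("Training rows:", len(train_ids))
print("Test rows:", len(test_ids))
```

With 11 rows and test_size=0.33, scikit-learn rounds the test set up to 4 rows, leaving 7 for training. Any cross-validation run on the training set therefore works with very few rows per fold.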

Tuning Random Forest

Tune base Random Forest:

rf = RandomForestRegressor(random_state=42)
param_dist = {
    "n_estimators": [10, 20, 50, 100],
    "max_depth": [2, 3, 5, None],
    "min_samples_split": [2, 5, 10],
    "min_samples_leaf": [1, 2, 4]
}
random_search = RandomizedSearchCV(rf, param_distributions=param_dist, n_iter=10, cv=3, scoring="neg_mean_absolute_error", random_state=42)
random_search.fit(X_train, y_train)
print("Best RF Params:", random_search.best_params_)
print("Best RF MAE:", -random_search.best_score_)

Output (hypothetical):

Best RF Params: {'n_estimators': 50, 'max_depth': 5, 'min_samples_split': 2, 'min_samples_leaf': 1}
Best RF MAE: 3.2

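The hypothetical 3.2 sits just below the stacked ensemble's 3.3 from Day 43. It is a cross-validated score on the 7 training rows, so it is not directly comparable to a held-out test score. Day 44 tunes this.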
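For a like-for-like comparison, the default and the tuned Random Forest can be scored with the same cross-validation. The sketch below is a minimal version under two assumptions: it uses only three of Priya's features to stay short, and it scores on all 11 rows rather than on the training split. No expected numbers are given, since they depend on the library version.

```python
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

# Reduced copy of Priya's table: three features plus the Sales target
df = pd.DataFrame({
    "Hour_Num": [8, 9, 10, 11, 8, 9, 10, 11, 9, 10, 11],
    "Sales_Lag": [200, 500, 600, 500, 150, 550, 650, 550, 650, 640, 540],
    "Customer_Count": [15, 20, 12, 8, 16, 22, 13, 9, 21, 14, 10],
    "Sales": [500, 600, 500, 400, 550, 650, 550, 450, 640, 540, 440],
})
X_small = df[["Hour_Num", "Sales_Lag", "Customer_Count"]]
y_small = df["Sales"]

def cv_mae(model):
    """Return the mean absolute error averaged over 3 folds."""
    scores = cross_val_score(model, X_small, y_small, cv=3,
                             scoring="neg_mean_absolute_error")
    return -scores.mean()

default_mae = cv_mae(RandomForestRegressor(random_state=42))
tuned_mae = cv_mae(RandomForestRegressor(n_estimators=50, max_depth=5,
                                         random_state=42))
print("Default RF CV MAE:", default_mae)
print("Tuned RF CV MAE:", tuned_mae)
```

Scoring both models the same way shows whether the tuned settings help or whether the difference is within the fold-to-fold spread.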

Tuning Gradient Boosting

Tune base Gradient Boosting:

gb = GradientBoostingRegressor(random_state=42)
param_dist = {
    "n_estimators": [10, 20, 50, 100],
    "max_depth": [2, 3, 5],
    "learning_rate": [0.01, 0.1, 0.2],
    "min_samples_split": [2, 5, 10]
}
random_search = RandomizedSearchCV(gb, param_distributions=param_dist, n_iter=10, cv=3, scoring="neg_mean_absolute_error", random_state=42)
random_search.fit(X_train, y_train)
print("Best GB Params:", random_search.best_params_)
print("Best GB MAE:", -random_search.best_score_)

Output (hypothetical):

Best GB Params: {'n_estimators': 100, 'max_depth': 3, 'learning_rate': 0.1, 'min_samples_split': 2}
Best GB MAE: 3.1

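With a hypothetical cross-validated MAE of 3.1 against 3.2, Gradient Boosting edges out Random Forest here, although a gap this small on 7 training rows may not be meaningful. Day 44 optimizes this.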
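Random Search is not limited to fixed lists. A continuous setting such as the learning rate can be drawn from a distribution instead. The sketch below assumes SciPy's loguniform over the same 0.01 to 0.2 range and only prints the sampled candidates; the same dictionary could be passed as param_distributions to RandomizedSearchCV.

```python
from scipy.stats import loguniform
from sklearn.model_selection import ParameterSampler

# Lists are sampled uniformly; objects with an rvs() method are sampled as distributions
param_dist = {
    "n_estimators": [10, 20, 50, 100],
    "max_depth": [2, 3, 5],
    "learning_rate": loguniform(0.01, 0.2),  # assumed range, matching the list above
}

candidates = list(ParameterSampler(param_dist, n_iter=5, random_state=42))
for candidate in candidates:
    print(candidate)
```

A log-uniform draw spreads candidates evenly across orders of magnitude, which suits learning rates better than a handful of hand-picked values.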

Stacked Ensemble with Tuned Models

Use tuned models:

estimators = [
    ("rf", RandomForestRegressor(n_estimators=50, max_depth=5, min_samples_split=2, min_samples_leaf=1, random_state=42)),
    ("gb", GradientBoostingRegressor(n_estimators=100, max_depth=3, learning_rate=0.1, min_samples_split=2, random_state=42))
]
stack_reg = StackingRegressor(estimators=estimators, final_estimator=LinearRegression())
stack_reg.fit(X_train, y_train)
y_pred = stack_reg.predict(X_test)
print("Tuned Stacked MAE:", mean_absolute_error(y_test, y_pred))

Output (hypothetical): Tuned Stacked MAE: 3.2, which beats 3.3. Tuning helps. Day 44 refines this.

Classifier Tuning

Tune Random Forest Classifier:

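from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

# Classification target: the Label column from the recap table (Sales is the regression target)
labels = pd.Series(["Busy", "Busy", "Busy", "Slow", "Busy", "Busy", "Busy", "Slow", "Busy", "Busy", "Slow"], name="Label")
# Stratify so both classes appear in the training and test sets
Xc_train, Xc_test, yc_train, yc_test = train_test_split(X, labels, test_size=0.33, stratify=labels, random_state=42)

rf_clf = RandomForestClassifier(class_weight="balanced", random_state=42)
param_dist = {
    "n_estimators": [10, 20, 50],
    "max_depth": [2, 3, 5],
    "min_samples_split": [2, 5]
}
# cv=2 because only two Slow rows land in the training set
random_search = RandomizedSearchCV(rf_clf, param_distributions=param_dist, n_iter=10, cv=2, scoring="f1_weighted", random_state=42)
random_search.fit(Xc_train, yc_train)
print("Best RF Classifier Params:", random_search.best_params_)

estimators = [
    ("rf", RandomForestClassifier(n_estimators=50, max_depth=3, min_samples_split=2, class_weight="balanced", random_state=42)),
    ("gb", GradientBoostingClassifier(n_estimators=10, max_depth=2, random_state=42))
]
stack_clf = StackingClassifier(estimators=estimators, final_estimator=LogisticRegression(), cv=2)
stack_clf.fit(Xc_train, yc_train)
y_pred_clf = stack_clf.predict(Xc_test)
print(classification_report(yc_test, y_pred_clf))

Output (hypothetical; exact figures and support counts depend on the split):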

Best RF Classifier Params: {'n_estimators': 50, 'max_depth': 3, 'min_samples_split': 2}
              precision    recall  f1-score   support
Busy         1.00      0.75      0.86         4
Slow         0.50      1.00      0.67         1
accuracy                          0.80         5

No classifier improvement: Slow recall was already 1.0, and with so few Slow rows there is little room for tuning to show gains. Day 44 tests this.
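Counting the classes on each side of a split shows how little the classifier has to work with. The sketch below assumes a stratified split of the 11 labels from the recap table, with the same test_size as the regression split.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Label column from the recap table, in row order
labels = pd.Series(["Busy", "Busy", "Busy", "Slow", "Busy", "Busy",
                    "Busy", "Slow", "Busy", "Busy", "Slow"])

# stratify keeps the Busy/Slow ratio similar in both parts
train_labels, test_labels = train_test_split(
    labels, test_size=0.33, stratify=labels, random_state=42
)

train_counts = train_labels.value_counts().to_dict()
test_counts = test_labels.value_counts().to_dict()
print("Train:", train_counts)
print("Test:", test_counts)
```

With a single Slow row in the test set, Slow recall can only be 0.0 or 1.0, so small tuning gains cannot show up in that metric.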

Thursday 9 AM

Predict with tuned model:

new_data = pd.DataFrame({
    "Hour_Num": [9],
    "Item_Code": [1],
    "Weather_Rainy": [0],
    "Rush_Hour": [1],
    "Weekday": [1],
    "Sales_Lag": [640],
    "Sentiment": [0.6],
    "Customer_Count": [20],
    "RL_Stock": [32],
    "Cluster_1": [1],
    "Cluster_2": [0]
}, columns=X.columns)
pred = stack_reg.predict(new_data)
print("Thursday 9 AM Sales:", pred[0])

Output (hypothetical): about 643 rupees, a Busy hour, so Priya stocks 32 samosas. That is a 1-rupee shift from yesterday's 642. Day 44 predicts this.
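Typing the dummy columns by hand works for a single row, but it is easy to get wrong. A more defensive sketch, shown here as one possible approach, encodes the raw Cluster value and then aligns the row to the training column order with reindex. The train_columns list is written out by hand and is assumed to match the encoded training features.

```python
import pandas as pd

# Column order assumed to match the encoded training features
train_columns = ["Hour_Num", "Item_Code", "Weather_Rainy", "Rush_Hour", "Weekday",
                 "Sales_Lag", "Sentiment", "Customer_Count", "RL_Stock",
                 "Cluster_1", "Cluster_2"]

# Thursday 9 AM with the raw cluster label rather than hand-made dummies
raw_row = pd.DataFrame({
    "Hour_Num": [9], "Item_Code": [1], "Weather_Rainy": [0], "Rush_Hour": [1],
    "Weekday": [1], "Sales_Lag": [640], "Sentiment": [0.6],
    "Customer_Count": [20], "RL_Stock": [32], "Cluster": [1],
})

# A single row only produces the dummy column for the cluster it contains
encoded_row = pd.get_dummies(raw_row, columns=["Cluster"], dtype=int)

# reindex adds the missing dummy columns as 0 and fixes the column order
aligned_row = encoded_row.reindex(columns=train_columns, fill_value=0)
print(aligned_row.iloc[0].to_dict())
```

The aligned row can then be passed to the model's predict method without worrying about missing or misordered columns.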

Why Hyperparameter Tuning?

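  • Accuracy: A mean absolute error of 3.2 supports stocking 32 samosas with more confidence.
  • Speed: Smaller tuned models, such as a 50-tree Random Forest, keep API predictions light.
  • Scale: As the data grows to 35 rows, tuning helps keep the models robust.

Tuning complements her 643-rupee forecast and her clustering, making for a better-run café. Day 44 sharpens this.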

Real-World Tuning

Retailers tune stock models to cut waste. Healthcare teams tune diagnostic models to raise accuracy. Priya's tuning gives her café a small, precise edge. Day 44 mirrors this.

Challenges

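  • Small Data: With 11 rows, the search can overfit to a handful of validation rows.
  • Time: Random Search still fits n_iter × cv models per search (10 × 3 = 30 here), a cost that grows with more data and larger search spaces.
  • Balance: Accuracy has to be weighed against speed if her API has latency constraints.

With more data, Priya can scale. Day 44 notes this.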
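One common way to get more out of very small data is leave-one-out cross-validation, where each row takes a turn as the test set. The sketch below is an illustration, not part of Priya's pipeline: it assumes three of her features and a small 20-tree Random Forest.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import LeaveOneOut, cross_val_score

# Columns: Hour_Num, Sales_Lag, Customer_Count (a reduced feature set)
X_small = np.array([
    [8, 200, 15], [9, 500, 20], [10, 600, 12], [11, 500, 8],
    [8, 150, 16], [9, 550, 22], [10, 650, 13], [11, 550, 9],
    [9, 650, 21], [10, 640, 14], [11, 540, 10],
])
y_small = np.array([500, 600, 500, 400, 550, 650, 550, 450, 640, 540, 440])

model = RandomForestRegressor(n_estimators=20, random_state=42)

# One fit per row: train on 10 rows, test on the remaining one
scores = cross_val_score(model, X_small, y_small, cv=LeaveOneOut(),
                         scoring="neg_mean_absolute_error")
abs_errors = -scores

print("Per-row absolute errors:", np.round(abs_errors, 1))
print("Leave-one-out MAE:", abs_errors.mean())
```

Every row contributes to the estimate, at the cost of one model fit per row, which is affordable at 11 or 35 rows.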

Why This Matters

Tuning to 643 rupees, 32 samosas, and a mean absolute error of 3.2 sharpens Priya's café. Without it, predictions lag; with it, she stocks precisely and protects her profit. At scale, the same tuning refines logistics that many people depend on. Day 44 optimizes her.

Recap Summary

Yesterday, Day 43 clustered her hours, with a mean absolute error of 3.3 and a 642-rupee prediction. Today, Day 44 tuned her models, reaching a mean absolute error of 3.2 and a 643-rupee prediction with 32 samosas. It is her optimization step.

What’s Next

Tomorrow, in Day 45, we’ll scale: Can Priya handle big data? Predict for multiple cafés? We’ll explore big data techniques, growing her café. Join us with curiosity!
