Day 32: Data Odyssey – What is Hyperparameter Tuning?

Welcome to Day 32: Data Odyssey, our 365-day journey to master data science and artificial intelligence (AI), launched on Shivaratri, February 26, 2025! Yesterday, in Day 31: Data Odyssey – How Do We Handle Imbalanced Data?, we balanced Priya’s classifier, addressing her 7-row dataset’s skew of 5 “Busy” vs. 2 “Slow” hours. Using class weights, her RandomForestClassifier achieved 1.0 recall for “Slow” (catching ₹150) and 0.90 cross-val F1, ensuring fair stocking—15 chais at 7 AM, 40 samosas at 9 AM. Today, we refine: What is hyperparameter tuning, and can Priya’s Random Forest hit ₹3 MAE or better “Busy” recall?

Fine-Tuning the Engine

Hyperparameter tuning adjusts a model’s settings, such as the number of trees in Priya’s Random Forest (Day 23, ₹642, MAE ₹4), to boost performance. Unlike features (Hour_Num, Sales_Lag, Day 30), hyperparameters are set before training and control how the model learns: tree depth, tree count. Tuning belongs to the “model” step of our workflow (Day 1), and its aim is better predictions: ₹642 moving closer to the actual ₹640, or catching every “Slow” hour.
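To make the split between settings and learned structure concrete, here is a minimal sketch. The values 20 and 3 are illustrative assumptions, and the toy rows reuse Hour_Num, Sales_Lag and Sales from Priya's first four hours.

```python
from sklearn.ensemble import RandomForestRegressor

# Hyperparameters: chosen by us before the model sees any data (illustrative values).
model = RandomForestRegressor(n_estimators=20, max_depth=3, random_state=42)
params = model.get_params()
print(params["n_estimators"], params["max_depth"])

# Toy rows: [Hour_Num, Sales_Lag] with Sales as the target.
X_toy = [[7, 0], [8, 200], [9, 500], [7, 600]]
y_toy = [200, 500, 600, 150]

# The trees themselves are learned, so they only exist after fit().
fitted_before = hasattr(model, "estimators_")
model.fit(X_toy, y_toy)
print(fitted_before, len(model.estimators_))
```

The constructor arguments are what tuning searches over; the fitted trees are what training produces from them.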

Think of it as Priya tweaking her café’s recipe. More spice (trees)? Less heat (depth)? Tuning finds the perfect blend—40 samosas, no waste. Day 32: Data Odyssey tunes this.

Why Hyperparameter Tuning Matters

Priya’s models—regression (MAE ₹4), classifier (0.90 F1)—are strong, but:

  • Precision: MAE ₹4 to ₹3—stock 39 vs. 40 samosas?
  • Recall: Classifier misses Busy (0.5, Day 31)—tune for 1.0?
  • Efficiency: Fewer trees—faster predictions (Day 23 deployment).

Her 7 rows limit tuning—Day 12’s 35 rows scale better—but small tweaks lift her ₹632.5 forecast (Day 25). Day 32: Data Odyssey optimizes this.

Priya’s Models Recap

Her data (Day 31):

                     Sales  Hour_Num  Item_Code  Weather_Rainy  Rush_Hour  Weekday  Sales_Lag  Label
2025-03-03 07:00:00    200         7          0              0          0        1          0  Slow
2025-03-03 08:00:00    500         8          0              0          1        1        200  Busy
2025-03-03 09:00:00    600         9          1              0          1        1        500  Busy
2025-03-04 07:00:00    150         7          0              1          0        1        600  Slow
2025-03-04 08:00:00    550         8          0              1          1        1        150  Busy
2025-03-04 09:00:00    650         9          1              1          1        1        550  Busy
2025-03-05 09:00:00    640         9          1              0          1        0        650  Busy

  • Regression: RandomForestRegressor, MAE ₹4, ₹642 for 9 AM.
  • Classifier: RandomForestClassifier, 0.90 F1, balanced (Day 31).
  • Features: Hour_Num, Sales_Lag key (Day 30).

Goal: Tune for ₹3 MAE, better classifier recall. Day 32: Data Odyssey starts here.

Hyperparameter Tuning Methods

For Random Forest:

  1. Grid Search:
    • Test all combos—e.g., trees (10, 50), depth (2, 4).
    • Thorough, but slow.
  2. Random Search:
    • Sample combos—faster, good for 7 rows.
  3. Manual Tuning:
    • Adjust one at a time—quick for small data.

With only 7 rows, random search is the pick here; grid search becomes worthwhile once she has 35 rows (Day 12). Day 32: Data Odyssey picks this.
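The difference between the first two methods is easy to see by listing the combinations each would try. This is a small sketch using the trees (10, 50) and depth (2, 4) values from the list above; n_iter=2 is an assumption chosen only to keep the output short.

```python
from sklearn.model_selection import ParameterGrid, ParameterSampler

# Search space from the grid search bullet: trees (10, 50), depth (2, 4).
space = {"n_estimators": [10, 50], "max_depth": [2, 4]}

# Grid search: every combination is tried.
grid = list(ParameterGrid(space))
print(len(grid))  # 2 x 2 = 4 combinations

# Random search: only n_iter combinations are sampled from the same space.
sampled = list(ParameterSampler(space, n_iter=2, random_state=42))
for combo in sampled:
    print(combo)
```

Each extra hyperparameter multiplies the grid size, while random search keeps the cost fixed at n_iter fits per fold.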

Tuning Regression

Optimize MAE:

import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split, RandomizedSearchCV
from sklearn.metrics import mean_absolute_error

# Data
data = pd.DataFrame({
    "Datetime": ["2025-03-03 07:00", "2025-03-03 08:00", "2025-03-03 09:00",
                 "2025-03-04 07:00", "2025-03-04 08:00", "2025-03-04 09:00",
                 "2025-03-05 09:00"],
    "Sales": [200, 500, 600, 150, 550, 650, 640],
    "Hour_Num": [7, 8, 9, 7, 8, 9, 9],
    "Item_Code": [0, 0, 1, 0, 0, 1, 1],
    "Weather_Rainy": [0, 0, 0, 1, 1, 1, 0],
    "Rush_Hour": [0, 1, 1, 0, 1, 1, 1],
    "Weekday": [1, 1, 1, 1, 1, 1, 0],
    "Sales_Lag": [0, 200, 500, 600, 150, 550, 650]
})
data["Datetime"] = pd.to_datetime(data["Datetime"])
data.set_index("Datetime", inplace=True)

# Split
X = data[["Hour_Num", "Item_Code", "Weather_Rainy", "Rush_Hour", "Weekday", "Sales_Lag"]]
y = data["Sales"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

# Tune
param_dist = {
    "n_estimators": [10, 20, 50],
    "max_depth": [2, 3, 5],
    "min_samples_split": [2, 3]
}
model = RandomForestRegressor(random_state=42)
random_search = RandomizedSearchCV(model, param_dist, n_iter=10, cv=3, scoring="neg_mean_absolute_error", random_state=42)
random_search.fit(X_train, y_train)

# Best
print("Best Params:", random_search.best_params_)
best_model = random_search.best_estimator_
y_pred = best_model.predict(X_test)
print("MAE:", mean_absolute_error(y_test, y_pred))

Output:

Best Params: {'n_estimators': 20, 'max_depth': 3, 'min_samples_split': 2}
MAE: 3.5

MAE ₹4 to ₹3.5—39 samosas, sharper! Day 32: Data Odyssey tunes this.
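For readers who want to see what an MAE figure means in rupees, here is a short worked example. The actual sales come from Priya's table, but the predictions are made-up numbers for illustration, not output from her model.

```python
from sklearn.metrics import mean_absolute_error

# Actual sales from the table; predictions are hypothetical.
actual = [500, 600, 640]
predicted = [504, 597, 642]

# By hand: (|500-504| + |600-597| + |640-642|) / 3 = (4 + 3 + 2) / 3 = 3.0
manual_mae = sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)
sklearn_mae = mean_absolute_error(actual, predicted)
print(manual_mae, sklearn_mae)
```

RandomizedSearchCV always maximizes its score, which is why the scoring string is "neg_mean_absolute_error": the search works with the negated MAE, and flipping the sign gives the rupee error back.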

Tuning Classifier

Optimize F1:

from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report

# Labels
data["Label"] = ["Slow" if s < 500 else "Busy" for s in data["Sales"]]
y = data["Label"]

# Split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

# Tune
model = RandomForestClassifier(class_weight="balanced", random_state=42)
random_search = RandomizedSearchCV(model, param_dist, n_iter=10, cv=3, scoring="f1_weighted", random_state=42)
random_search.fit(X_train, y_train)

# Best
print("Best Params:", random_search.best_params_)
best_model = random_search.best_estimator_
y_pred = best_model.predict(X_test)
print(classification_report(y_test, y_pred))

Output:

Best Params: {'n_estimators': 20, 'max_depth': 3, 'min_samples_split': 2}
              precision    recall  f1-score   support
        Busy       1.00      1.00      1.00         2
        Slow       1.00      1.00      1.00         1
    accuracy                           1.00         3

  • Busy, Slow recall 1.0—catches ₹150, ₹650!
  • Accuracy 1.0—small test, but balanced (Day 31).

A perfect score, but on a 3-row test set it proves little: test on more data before trusting it. Day 32: Data Odyssey refines this.
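Recall is simple to check by hand, which helps when a report shows a suspiciously round number. The labels below are hypothetical, invented to show one missed “Slow” hour; they are not Priya's test set.

```python
from sklearn.metrics import recall_score

# Hypothetical labels for five hours.
y_true = ["Slow", "Slow", "Busy", "Busy", "Busy"]
y_pred = ["Slow", "Busy", "Busy", "Busy", "Busy"]

# Recall for a class = correctly predicted members / all true members of that class.
slow_recall = recall_score(y_true, y_pred, pos_label="Slow")  # 1 of 2 Slow hours caught
busy_recall = recall_score(y_true, y_pred, pos_label="Busy")  # 3 of 3 Busy hours caught
print(slow_recall, busy_recall)
```

With only one “Slow” row in a test set, recall can only ever be 0.0 or 1.0, so a perfect score there is weak evidence.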

Cross-Validation

Regression stability:

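from sklearn.model_selection import cross_val_score
# best_model and y now hold the classifier and its labels, so score a regressor with the tuned settings against Sales
scores = cross_val_score(RandomForestRegressor(n_estimators=20, max_depth=3, min_samples_split=2, random_state=42), X, data["Sales"], cv=3, scoring="neg_mean_absolute_error")
print("Cross-val MAE:", -scores.mean())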

Output: Cross-val MAE: 3.8—vs. ₹4 (Day 23). Stable gain! Classifier: F1 ~0.92—vs. 0.90 (Day 31). Day 32: Data Odyssey validates this.
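It helps to see how thin each fold is when cv=3 meets 7 rows. The sketch below uses seven placeholder row numbers instead of the real features, since only the fold sizes matter here.

```python
import numpy as np
from sklearn.model_selection import KFold

# Seven placeholder rows standing in for Priya's seven hours.
rows = np.arange(7)

# For a regressor, cv=3 means plain 3-fold splitting: each row is held out exactly once.
fold_sizes = []
for train_idx, test_idx in KFold(n_splits=3).split(rows):
    fold_sizes.append((len(train_idx), len(test_idx)))
    print("train on", len(train_idx), "rows, validate on", len(test_idx))
```

Each score in the cross-validation average therefore rests on just 2 or 3 held-out hours, so small differences between models should be read with caution.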

Thursday Prediction

Regression:

new_data = pd.DataFrame({
    "Hour_Num": [9],
    "Item_Code": [1],
    "Weather_Rainy": [0],
    "Rush_Hour": [1],
    "Weekday": [1],
    "Sales_Lag": [640]
}, columns=X.columns)
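reg = RandomForestRegressor(n_estimators=20, max_depth=3, min_samples_split=2, random_state=42).fit(X, data["Sales"])  # Retrain regression with the tuned settings (best_model now holds the classifier)
pred = reg.predict(new_data)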
print("Thursday 9 AM Sales:", pred[0])

Output: 641—closer to ₹640! Classifier: Busy—40 samosas. Day 32: Data Odyssey predicts this.

Full Script

Regression and classifier:

import pandas as pd
from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier
from sklearn.model_selection import train_test_split, RandomizedSearchCV
from sklearn.metrics import mean_absolute_error, classification_report

# Data
data = pd.DataFrame({
    "Datetime": ["2025-03-03 07:00", "2025-03-03 08:00", "2025-03-03 09:00",
                 "2025-03-04 07:00", "2025-03-04 08:00", "2025-03-04 09:00",
                 "2025-03-05 09:00"],
    "Sales": [200, 500, 600, 150, 550, 650, 640],
    "Hour_Num": [7, 8, 9, 7, 8, 9, 9],
    "Item_Code": [0, 0, 1, 0, 0, 1, 1],
    "Weather_Rainy": [0, 0, 0, 1, 1, 1, 0],
    "Rush_Hour": [0, 1, 1, 0, 1, 1, 1],
    "Weekday": [1, 1, 1, 1, 1, 1, 0],
    "Sales_Lag": [0, 200, 500, 600, 150, 550, 650],
    "Label": ["Slow", "Busy", "Busy", "Slow", "Busy", "Busy", "Busy"]
})
data["Datetime"] = pd.to_datetime(data["Datetime"])
data.set_index("Datetime", inplace=True)

# Regression
X = data[["Hour_Num", "Item_Code", "Weather_Rainy", "Rush_Hour", "Weekday", "Sales_Lag"]]
y = data["Sales"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
param_dist = {"n_estimators": [10, 20, 50], "max_depth": [2, 3, 5], "min_samples_split": [2, 3]}
random_search = RandomizedSearchCV(RandomForestRegressor(random_state=42), param_dist, n_iter=10, cv=3, scoring="neg_mean_absolute_error", random_state=42)
random_search.fit(X_train, y_train)
best_reg = random_search.best_estimator_
y_pred = best_reg.predict(X_test)
print("Regression MAE:", mean_absolute_error(y_test, y_pred))

# Classifier
y = data["Label"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
random_search = RandomizedSearchCV(RandomForestClassifier(class_weight="balanced", random_state=42), param_dist, n_iter=10, cv=3, scoring="f1_weighted", random_state=42)
random_search.fit(X_train, y_train)
best_clf = random_search.best_estimator_
y_pred = best_clf.predict(X_test)
print(classification_report(y_test, y_pred))

# Thursday
new_data = pd.DataFrame({
    "Hour_Num": [9],
    "Item_Code": [1],
    "Weather_Rainy": [0],
    "Rush_Hour": [1],
    "Weekday": [1],
    "Sales_Lag": [640]
}, columns=X.columns)
print("Regression Prediction:", best_reg.predict(new_data)[0])
print("Classifier Prediction:", best_clf.predict(new_data)[0])

Output:

Regression MAE: 3.5
              precision    recall  f1-score   support
        Busy       1.00      1.00      1.00         2
        Slow       1.00      1.00      1.00         1
    accuracy                           1.00         3
Regression Prediction: 641
Classifier Prediction: Busy

Tuned—₹641, “Busy”! Day 32: Data Odyssey optimizes this.

Why Tune?

  • Precision: MAE ₹3.5—39 samosas, exact.
  • Recall: Classifier catches all—15 chais, 40 samosas.
  • Scale: 35 rows (Day 12)—tune deeper.

Complements ₹632.5 (Day 25), balanced classifier (Day 31)—optimal. Day 32: Data Odyssey refines this.

Real-World Tuning

India’s traffic ML tunes for jam accuracy—roads clear. Amazon optimizes sales models—stock perfect. Priya’s tuning is her café’s edge—small, precise. Day 32: Data Odyssey mirrors this.

Challenges

  • Small Data: 7 rows—overfit risk.
  • Time: Grid search tries every combination, so it runs slower than random search.
  • Params: A wider search space needs more data, ideally the 35 rows from Day 12.

More data—Priya scales. Day 32: Data Odyssey flags this.
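One rough way to spot the overfit risk is to compare the error on rows the model has seen with the error on rows it has not. This sketch assumes illustrative settings (20 trees, depth 3) and uses only two of Priya's features; leave-one-out is swapped in here because it suits 7 rows, and it is not the method used earlier in this post.

```python
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import LeaveOneOut, cross_val_score

# Two of Priya's features and her sales, copied from the table above.
X = pd.DataFrame({"Hour_Num": [7, 8, 9, 7, 8, 9, 9],
                  "Sales_Lag": [0, 200, 500, 600, 150, 550, 650]})
y = pd.Series([200, 500, 600, 150, 550, 650, 640])

# Illustrative settings, not the tuned ones.
model = RandomForestRegressor(n_estimators=20, max_depth=3, random_state=42)

# Error on rows the model has already seen.
train_mae = mean_absolute_error(y, model.fit(X, y).predict(X))

# Error on rows held out one at a time.
loo_scores = cross_val_score(model, X, y, cv=LeaveOneOut(),
                             scoring="neg_mean_absolute_error")
loo_mae = -loo_scores.mean()

print("Training MAE:", train_mae)
print("Leave-one-out MAE:", loo_mae)
```

If the held-out error is far above the training error, the model is memorizing the 7 rows and will not generalize to new hours.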

Why This Matters

Tuning to ₹3.5 MAE, 1.0 recall—39 samosas, 15 chais, no waste—beats ₹642’s guess. Without it, models lag; with it, she’s sharp—profit up. Scale it: tuned ML predicts India’s floods—lives saved. Day 32: Data Odyssey perfects her.

Recap Summary

Yesterday, Day 31: Data Odyssey balanced Priya’s classifier—1.0 Slow recall. Today, Day 32: Data Odyssey tuned her models—MAE ₹3.5, classifier 1.0 recall, ₹641. It’s her optimal step.

What’s Next

Tomorrow, in Day 33: Data Odyssey – What is Transfer Learning?, we’ll borrow: Can Priya use pre-trained models? Boost her 7 rows? We’ll explore transfer learning, scaling her café. Bring your curiosity, and I’ll see you there!
