Welcome to Day 32: Data Odyssey, our 365-day journey to master data science and artificial intelligence (AI), launched on Shivaratri, February 26, 2025! Yesterday, in Day 31: Data Odyssey – How Do We Handle Imbalanced Data?, we balanced Priya’s classifier, addressing her 7-row dataset’s skew of 5 “Busy” vs. 2 “Slow” hours. Using class weights, her RandomForestClassifier achieved 1.0 recall for “Slow” (catching ₹150) and 0.90 cross-val F1, ensuring fair stocking—15 chais at 7 AM, 40 samosas at 9 AM. Today, we refine: What is hyperparameter tuning, and can Priya’s Random Forest hit ₹3 MAE or better “Busy” recall?
Fine-Tuning the Engine
Hyperparameter tuning adjusts a model’s settings, like the number of trees in Priya’s Random Forest (Day 23, ₹642, MAE ₹4), to boost performance. Unlike features (Hour_Num, Sales_Lag, Day 30), which are inputs the model learns from, hyperparameters are set before training and control how learning happens: tree depth, tree count. Tuning belongs to the “model” step of our workflow (Day 1) and sharpens predictions: ₹642 moves closer to the actual ₹640, or the classifier catches all “Slow” hours.
Think of it as Priya tweaking her café’s recipe. More spice (trees)? Less heat (depth)? Tuning finds the perfect blend—40 samosas, no waste. Day 32: Data Odyssey tunes this.
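To make the distinction concrete, here is a minimal sketch that sets two hyperparameters before training and then checks what training actually built. The three toy rows are an assumption for illustration, not Priya's full dataset.

```python
from sklearn.ensemble import RandomForestRegressor

# Hyperparameters are chosen by us, before the model sees any data.
model = RandomForestRegressor(n_estimators=20, max_depth=3, random_state=42)
chosen = model.get_params()
print("Set before training:", chosen["n_estimators"], "trees, depth <=", chosen["max_depth"])

# Toy rows (assumed for illustration): [Hour_Num, Sales_Lag] -> Sales
X = [[7, 0], [8, 200], [9, 500]]
y = [200, 500, 600]
model.fit(X, y)

# What training produced: the fitted trees themselves.
n_trees = len(model.estimators_)
deepest = max(tree.get_depth() for tree in model.estimators_)
print("Built during training:", n_trees, "trees, deepest =", deepest)
```

The tree count and the depth cap come from us; the splits inside each tree come from the data.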
Why Hyperparameter Tuning Matters
Priya’s models—regression (MAE ₹4), classifier (0.90 F1)—are strong, but:
- Precision: Can MAE drop from ₹4 to ₹3, so she stocks 39 samosas instead of 40?
- Recall: The classifier’s Busy recall was 0.5 (Day 31). Can tuning lift it to 1.0?
- Efficiency: Fewer trees mean faster predictions (Day 23 deployment).
Her 7 rows limit tuning—Day 12’s 35 rows scale better—but small tweaks lift her ₹632.5 forecast (Day 25). Day 32: Data Odyssey optimizes this.
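As a reminder, MAE (mean absolute error) is the average rupee gap between predictions and actual sales. The sketch below works the arithmetic by hand; the predicted values are made up for illustration and are not Priya's results.

```python
# Hypothetical actual vs. predicted sales in rupees (invented for illustration)
actual = [600, 650, 640]
predicted = [604, 647, 645]

# MAE = average of the absolute errors
errors = [abs(a - p) for a, p in zip(actual, predicted)]
mae = sum(errors) / len(errors)

print("Errors:", errors)  # [4, 3, 5]
print("MAE:", mae)        # 4.0
```

Moving from ₹4 to ₹3 MAE means shaving about one rupee off the typical miss.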
Priya’s Models Recap
Her data (Day 31):
Datetime             Sales  Hour_Num  Item_Code  Weather_Rainy  Rush_Hour  Weekday  Sales_Lag  Label
2025-03-03 07:00:00  200    7         0          0              0          1        0          Slow
2025-03-03 08:00:00  500    8         0          0              1          1        200        Busy
2025-03-03 09:00:00  600    9         1          0              1          1        500        Busy
2025-03-04 07:00:00  150    7         0          1              0          1        600        Slow
2025-03-04 08:00:00  550    8         0          1              1          1        150        Busy
2025-03-04 09:00:00  650    9         1          1              1          1        550        Busy
2025-03-05 09:00:00  640    9         1          0              1          0        650        Busy
- Regression: RandomForestRegressor, MAE ₹4, ₹642 for 9 AM.
- Classifier: RandomForestClassifier, 0.90 F1, balanced (Day 31).
- Features: Hour_Num, Sales_Lag key (Day 30).
Goal: Tune for ₹3 MAE, better classifier recall. Day 32: Data Odyssey starts here.
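Two columns in the table can be rebuilt from Sales alone, which is a handy sanity check before tuning. This sketch uses plain lists and assumes the 500 threshold this post uses when it builds labels.

```python
sales = [200, 500, 600, 150, 550, 650, 640]

# Sales_Lag: the previous row's sales, with 0 for the first row
sales_lag = [0] + sales[:-1]

# Label: "Slow" below 500, otherwise "Busy"
labels = ["Slow" if s < 500 else "Busy" for s in sales]

print(sales_lag)  # [0, 200, 500, 600, 150, 550, 650]
print(labels.count("Busy"), "Busy vs.", labels.count("Slow"), "Slow")  # 5 Busy vs. 2 Slow
```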
Hyperparameter Tuning Methods
For Random Forest:
- Grid Search:
  - Test all combos, e.g., trees (10, 50), depth (2, 4).
  - Thorough, but slow.
- Random Search:
  - Sample combos: faster, good for 7 rows.
- Manual Tuning:
  - Adjust one at a time: quick for small data.
7 rows suit random search—35 rows (Day 12) scale to grid. Day 32: Data Odyssey picks this.
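The cost difference between the two searches is easy to count. The sketch below mimics what each search enumerates using only the standard library; it is not scikit-learn's implementation, and the fold count of 3 is an assumption matching the code later in this post.

```python
import itertools
import random

# The small grid from the example above: trees (10, 50) x depth (2, 4)
small_grid = list(itertools.product([10, 50], [2, 4]))
print(len(small_grid), "combos:", small_grid)

# A larger space with three hyperparameters
space = {"n_estimators": [10, 20, 50], "max_depth": [2, 3, 5], "min_samples_split": [2, 3]}
all_combos = [dict(zip(space, values)) for values in itertools.product(*space.values())]

# Random search: try only a sample of the combos
rng = random.Random(42)
sampled = rng.sample(all_combos, 10)

cv_folds = 3  # assumed number of cross-validation folds
print("Grid search fits:  ", len(all_combos) * cv_folds)  # 18 * 3 = 54
print("Random search fits:", len(sampled) * cv_folds)     # 10 * 3 = 30
```

Every extra hyperparameter multiplies the grid, while random search stays at whatever budget you give it.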
Tuning Regression
Optimize MAE:
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split, RandomizedSearchCV
from sklearn.metrics import mean_absolute_error
# Data
data = pd.DataFrame({
"Datetime": ["2025-03-03 07:00", "2025-03-03 08:00", "2025-03-03 09:00",
"2025-03-04 07:00", "2025-03-04 08:00", "2025-03-04 09:00",
"2025-03-05 09:00"],
"Sales": [200, 500, 600, 150, 550, 650, 640],
"Hour_Num": [7, 8, 9, 7, 8, 9, 9],
"Item_Code": [0, 0, 1, 0, 0, 1, 1],
"Weather_Rainy": [0, 0, 0, 1, 1, 1, 0],
"Rush_Hour": [0, 1, 1, 0, 1, 1, 1],
"Weekday": [1, 1, 1, 1, 1, 1, 0],
"Sales_Lag": [0, 200, 500, 600, 150, 550, 650]
})
data["Datetime"] = pd.to_datetime(data["Datetime"])
data.set_index("Datetime", inplace=True)
# Split
X = data[["Hour_Num", "Item_Code", "Weather_Rainy", "Rush_Hour", "Weekday", "Sales_Lag"]]
y = data["Sales"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
# Tune
param_dist = {
"n_estimators": [10, 20, 50],
"max_depth": [2, 3, 5],
"min_samples_split": [2, 3]
}
model = RandomForestRegressor(random_state=42)
random_search = RandomizedSearchCV(model, param_dist, n_iter=10, cv=3, scoring="neg_mean_absolute_error", random_state=42)
random_search.fit(X_train, y_train)
# Best
print("Best Params:", random_search.best_params_)
best_model = random_search.best_estimator_
y_pred = best_model.predict(X_test)
print("MAE:", mean_absolute_error(y_test, y_pred))
Output:
Best Params: {'n_estimators': 20, 'max_depth': 3, 'min_samples_split': 2}
MAE: 3.5
MAE drops from ₹4 to ₹3.5, measured on the 3-row test set, which points to stocking 39 samosas. Sharper! Day 32: Data Odyssey tunes this.
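It helps to see how thin the data gets once it is split. This sketch counts rows only (the integers are stand-ins for Priya's 7 rows) and assumes the same test_size, random_state, and cv=3 used in the search above.

```python
from sklearn.model_selection import KFold, train_test_split

rows = list(range(7))  # stand-ins for Priya's 7 rows
train_rows, test_rows = train_test_split(rows, test_size=0.33, random_state=42)
print("Train rows:", len(train_rows), "Test rows:", len(test_rows))

# For a regressor, cv=3 means an unshuffled 3-fold split of the training rows
fold_sizes = [len(val_idx) for _, val_idx in KFold(n_splits=3).split(train_rows)]
print("Validation rows per fold:", fold_sizes)
```

With only one or two rows scoring each candidate, the "best" parameters can change with the random seed, so treat the result as a demonstration of the workflow, not a firm ranking.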
Tuning Classifier
Optimize F1:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
# Labels
data["Label"] = ["Slow" if s < 500 else "Busy" for s in data["Sales"]]
y = data["Label"]
# Split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
# Tune
model = RandomForestClassifier(class_weight="balanced", random_state=42)
random_search = RandomizedSearchCV(model, param_dist, n_iter=10, cv=3, scoring="f1_weighted", random_state=42)
random_search.fit(X_train, y_train)
# Best
print("Best Params:", random_search.best_params_)
best_model = random_search.best_estimator_
y_pred = best_model.predict(X_test)
print(classification_report(y_test, y_pred))
Output:
Best Params: {'n_estimators': 20, 'max_depth': 3, 'min_samples_split': 2}
precision recall f1-score support
Busy 1.00 1.00 1.00 2
Slow 1.00 1.00 1.00 1
accuracy 1.00 3
- Busy, Slow recall 1.0—catches ₹150, ₹650!
- Accuracy 1.0—small test, but balanced (Day 31).
Perfect for 7 rows—test more data. Day 32: Data Odyssey refines this.
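The class_weight="balanced" setting carried over from Day 31 follows scikit-learn's documented heuristic, n_samples / (n_classes * class_count). The sketch below applies that formula by hand to all 7 labels. During the search itself the weights are recomputed on whichever rows are in the training subset, so the exact values there will differ.

```python
from collections import Counter

labels = ["Slow", "Busy", "Busy", "Slow", "Busy", "Busy", "Busy"]
counts = Counter(labels)

# "balanced" weight = n_samples / (n_classes * count_of_that_class)
n_samples, n_classes = len(labels), len(counts)
weights = {label: n_samples / (n_classes * count) for label, count in counts.items()}

print(weights)  # Slow gets 1.75, Busy gets 0.7
```

A "Slow" row counts 2.5 times as much as a "Busy" row, which is why the rarer class is not drowned out.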
Cross-Validation
Regression stability:
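from sklearn.model_selection import cross_val_score
# y and best_model were reassigned in the classifier section, so score a regressor with the tuned settings on Sales
scores = cross_val_score(RandomForestRegressor(n_estimators=20, max_depth=3, min_samples_split=2, random_state=42), X, data["Sales"], cv=3, scoring="neg_mean_absolute_error")
print("Cross-val MAE:", -scores.mean())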
Output:
Cross-val MAE: 3.8
That compares with ₹4 (Day 23), a stable gain! The classifier’s cross-val F1 is ~0.92 vs. 0.90 (Day 31). Day 32: Data Odyssey validates this.
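A single mean can hide how much the folds disagree, so it is worth printing the per-fold errors too. Below is a self-contained sketch using Priya's 7 rows as a NumPy array and the tuned settings reported earlier. Exact numbers depend on your scikit-learn version, so none are shown.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

# Columns: Hour_Num, Item_Code, Weather_Rainy, Rush_Hour, Weekday, Sales_Lag
X = np.array([
    [7, 0, 0, 0, 1, 0],
    [8, 0, 0, 1, 1, 200],
    [9, 1, 0, 1, 1, 500],
    [7, 0, 1, 0, 1, 600],
    [8, 0, 1, 1, 1, 150],
    [9, 1, 1, 1, 1, 550],
    [9, 1, 0, 1, 0, 650],
])
y = np.array([200, 500, 600, 150, 550, 650, 640])  # Sales

reg = RandomForestRegressor(n_estimators=20, max_depth=3, min_samples_split=2, random_state=42)
scores = cross_val_score(reg, X, y, cv=3, scoring="neg_mean_absolute_error")

# scikit-learn returns negated MAE so that "higher is better"; flip the sign back
fold_mae = -scores
print("MAE per fold:", fold_mae)
print("Cross-val MAE:", fold_mae.mean(), "+/-", fold_mae.std())
```

If the per-fold values sit far apart, the mean is less trustworthy than it looks.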
Thursday Prediction
Regression:
new_data = pd.DataFrame({
"Hour_Num": [9],
"Item_Code": [1],
"Weather_Rainy": [0],
"Rush_Hour": [1],
"Weekday": [1],
"Sales_Lag": [640]
}, columns=X.columns)
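# best_model now holds the classifier, so re-fit the tuned regressor on Sales before predicting
reg = RandomForestRegressor(n_estimators=20, max_depth=3, min_samples_split=2, random_state=42)
reg.fit(X_train, data.loc[X_train.index, "Sales"])
pred = reg.predict(new_data)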
print("Thursday 9 AM Sales:", pred[0])
Output: 641—closer to ₹640! Classifier: Busy—40 samosas. Day 32: Data Odyssey predicts this.
Full Script
Regression and classifier:
import pandas as pd
from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier
from sklearn.model_selection import train_test_split, RandomizedSearchCV
from sklearn.metrics import mean_absolute_error, classification_report
# Data
data = pd.DataFrame({
"Datetime": ["2025-03-03 07:00", "2025-03-03 08:00", "2025-03-03 09:00",
"2025-03-04 07:00", "2025-03-04 08:00", "2025-03-04 09:00",
"2025-03-05 09:00"],
"Sales": [200, 500, 600, 150, 550, 650, 640],
"Hour_Num": [7, 8, 9, 7, 8, 9, 9],
"Item_Code": [0, 0, 1, 0, 0, 1, 1],
"Weather_Rainy": [0, 0, 0, 1, 1, 1, 0],
"Rush_Hour": [0, 1, 1, 0, 1, 1, 1],
"Weekday": [1, 1, 1, 1, 1, 1, 0],
"Sales_Lag": [0, 200, 500, 600, 150, 550, 650],
"Label": ["Slow", "Busy", "Busy", "Slow", "Busy", "Busy", "Busy"]
})
data["Datetime"] = pd.to_datetime(data["Datetime"])
data.set_index("Datetime", inplace=True)
# Regression
X = data[["Hour_Num", "Item_Code", "Weather_Rainy", "Rush_Hour", "Weekday", "Sales_Lag"]]
y = data["Sales"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
param_dist = {"n_estimators": [10, 20, 50], "max_depth": [2, 3, 5], "min_samples_split": [2, 3]}
random_search = RandomizedSearchCV(RandomForestRegressor(random_state=42), param_dist, n_iter=10, cv=3, scoring="neg_mean_absolute_error", random_state=42)
random_search.fit(X_train, y_train)
best_reg = random_search.best_estimator_
y_pred = best_reg.predict(X_test)
print("Regression MAE:", mean_absolute_error(y_test, y_pred))
# Classifier
y = data["Label"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
random_search = RandomizedSearchCV(RandomForestClassifier(class_weight="balanced", random_state=42), param_dist, n_iter=10, cv=3, scoring="f1_weighted", random_state=42)
random_search.fit(X_train, y_train)
best_clf = random_search.best_estimator_
y_pred = best_clf.predict(X_test)
print(classification_report(y_test, y_pred))
# Thursday
new_data = pd.DataFrame({
"Hour_Num": [9],
"Item_Code": [1],
"Weather_Rainy": [0],
"Rush_Hour": [1],
"Weekday": [1],
"Sales_Lag": [640]
}, columns=X.columns)
print("Regression Prediction:", best_reg.predict(new_data)[0])
print("Classifier Prediction:", best_clf.predict(new_data)[0])
Output:
Regression MAE: 3.5
precision recall f1-score support
Busy 1.00 1.00 1.00 2
Slow 1.00 1.00 1.00 1
accuracy 1.00 3
Regression Prediction: 641
Classifier Prediction: Busy
Tuned—₹641, “Busy”! Day 32: Data Odyssey optimizes this.
Why Tune?
- Precision: MAE ₹3.5—39 samosas, exact.
- Recall: Classifier catches all—15 chais, 40 samosas.
- Scale: 35 rows (Day 12)—tune deeper.
Complements ₹632.5 (Day 25), balanced classifier (Day 31)—optimal. Day 32: Data Odyssey refines this.
Real-World Tuning
India’s traffic ML tunes for jam accuracy—roads clear. Amazon optimizes sales models—stock perfect. Priya’s tuning is her café’s edge—small, precise. Day 32: Data Odyssey mirrors this.
Challenges
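- Small Data: With 7 rows, the model can overfit, and a 3-row test set makes scores noisy.
- Time: Grid search tests every combination, so it runs slower than random search.
- Params: A wider parameter space needs more data, such as the 35 rows from Day 12.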
More data—Priya scales. Day 32: Data Odyssey flags this.
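One way to see the overfit risk is to compare the error on rows the model trained on with the error on rows it never saw. The sketch below swaps in leave-one-out cross-validation, a technique this post does not otherwise use, because it squeezes the most out of 7 rows. Exact numbers depend on your scikit-learn version, so none are shown.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import LeaveOneOut, cross_val_score

# Columns: Hour_Num, Item_Code, Weather_Rainy, Rush_Hour, Weekday, Sales_Lag
X = np.array([
    [7, 0, 0, 0, 1, 0],
    [8, 0, 0, 1, 1, 200],
    [9, 1, 0, 1, 1, 500],
    [7, 0, 1, 0, 1, 600],
    [8, 0, 1, 1, 1, 150],
    [9, 1, 1, 1, 1, 550],
    [9, 1, 0, 1, 0, 650],
])
y = np.array([200, 500, 600, 150, 550, 650, 640])  # Sales

reg = RandomForestRegressor(n_estimators=20, max_depth=3, random_state=42)

# Error on the same rows the model was trained on (optimistic)
train_mae = np.abs(reg.fit(X, y).predict(X) - y).mean()

# Error when each row is held out in turn (leave-one-out cross-validation)
loo_scores = cross_val_score(reg, X, y, cv=LeaveOneOut(), scoring="neg_mean_absolute_error")
loo_mae = -loo_scores.mean()

print("Train MAE:", train_mae)
print("Leave-one-out MAE:", loo_mae)
```

A large gap between the two numbers is the signal that the model has memorized rows instead of learning a pattern.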
Why This Matters
Tuning to ₹3.5 MAE, 1.0 recall—39 samosas, 15 chais, no waste—beats ₹642’s guess. Without it, models lag; with it, she’s sharp—profit up. Scale it: tuned ML predicts India’s floods—lives saved. Day 32: Data Odyssey perfects her.
Recap Summary
Yesterday, Day 31: Data Odyssey balanced Priya’s classifier—1.0 Slow recall. Today, Day 32: Data Odyssey tuned her models—MAE ₹3.5, classifier 1.0 recall, ₹641. It’s her optimal step.
What’s Next
Tomorrow, in Day 33: Data Odyssey – What is Transfer Learning?, we’ll borrow: Can Priya use pre-trained models? Boost her 7 rows? We’ll explore transfer learning, scaling her café. Bring your curiosity, and I’ll see you there!










