Welcome to Day 18 of Data Odyssey, our 365-day journey to master data science and artificial intelligence (AI), launched on Shivaratri, February 26, 2025! Yesterday, in Day 17: Data Odyssey – How Do We Improve ML Models?, we refined Priya’s machine learning model. Adding weather features and switching to a Decision Tree cut her mean absolute error (MAE) from ₹12 to ₹7, predicting ₹620 for Wednesday’s 9 AM Samosa sales—sharper than her Linear Regression’s ₹630. Her 6-row dataset grew smarter, but risks lurk. Today, we tackle two pitfalls: What are overfitting and underfitting, and how do they threaten Priya’s predictions?
The Balance of Learning
Priya’s model predicts well—₹620 near Tuesday’s ₹650, MAE ₹7—but evaluation (Day 16) and improvement (Day 17) don’t guarantee success on new days. Machine learning seeks a sweet spot:
- Overfitting: Memorizing her 6 rows, failing on Wednesday.
- Underfitting: Oversimplifying, missing patterns like 9 AM’s spike.
Both skew her stock—too many samosas wasted, or too few sold. Day 18: Data Odyssey diagnoses these traps.
What is Overfitting?
Overfitting is when a model learns too well—it nails Priya’s 6 rows (e.g., ₹600 for Monday’s 9 AM) but flops on new data (Wednesday’s real ₹610?). It’s like a barista memorizing Monday’s orders, clueless on Tuesday’s rush. Signs:
- Perfect Training: MAE near 0 on her 6 rows.
- Poor Test: MAE jumps on new days.
Her Decision Tree (Day 17) splits every detail—“9 AM, Samosa, Sunny = ₹600”—but Wednesday’s slight shift (new customers) trips it. Day 18: Data Odyssey spots this.
What is Underfitting?
Underfitting is the opposite—too simple, missing key patterns. Priya’s Linear Regression (Day 15) assumed a straight line (sales rise with hour), averaging out 9 AM’s jump—MAE ₹12. It’s a barista guessing “all hours are ₹400,” ignoring 8-9 AM’s rush. Signs:
- High Error Everywhere: MAE ₹12 on train and test.
- Missed Trends: Ignores item or weather impact.
Both hurt—overfitting’s erratic, underfitting’s blind. Day 18: Data Odyssey contrasts them.
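The two extremes can be sketched on Priya’s own six rows. Here `DummyRegressor` stands in for the “all hours are the same” barista—it is just an illustration of the extremes, not a model from earlier days:

```python
import pandas as pd
from sklearn.dummy import DummyRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_absolute_error

# Priya's six rows with the Day 17 features
X = pd.DataFrame({
    "Hour_Num": [7, 8, 9, 7, 8, 9],
    "Item_Code": [0, 0, 1, 0, 0, 1],
    "Day_Monday": [1, 1, 1, 0, 0, 0],
    "Day_Tuesday": [0, 0, 0, 1, 1, 1],
    "Weather_Rainy": [0, 0, 0, 1, 1, 1],
})
y = pd.Series([200, 500, 600, 150, 550, 650], name="Sales")

# Underfitting extreme: always predict the overall average, ignoring every pattern
underfit = DummyRegressor(strategy="mean").fit(X, y)
print("Mean-guess MAE:", mean_absolute_error(y, underfit.predict(X)))  # large error

# Overfitting extreme: an unconstrained tree memorizes all six rows exactly
overfit = DecisionTreeRegressor(random_state=42).fit(X, y)
print("Deep-tree train MAE:", mean_absolute_error(y, overfit.predict(X)))  # 0.0
```

The mean-guesser misses every rush and lull; the deep tree scores a perfect 0 on data it has seen—both are traps.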
Priya’s Data Recap
Her improved data (Day 17):
   Hour_Num  Item_Code  Day_Monday  Day_Tuesday  Weather_Rainy  Sales
0         7          0           1            0              0    200
1         8          0           1            0              0    500
2         9          1           1            0              0    600
3         7          0           0            1              1    150
4         8          0           0            1              1    550
5         9          1           0            1              1    650
- Decision Tree: MAE ₹7 on test (2 rows).
- Prediction: Wednesday, 9 AM, Samosa, Sunny = ₹620.
Is it overfitting her 6 rows or underfitting trends? Day 18: Data Odyssey tests this.
Checking Fit
Run her model, compare train vs. test:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_absolute_error
# Data
data = pd.DataFrame({
"Hour_Num": [7, 8, 9, 7, 8, 9],
"Item_Code": [0, 0, 1, 0, 0, 1],
"Day_Monday": [1, 1, 1, 0, 0, 0],
"Day_Tuesday": [0, 0, 0, 1, 1, 1],
"Weather_Rainy": [0, 0, 0, 1, 1, 1],
"Sales": [200, 500, 600, 150, 550, 650]
})
# Split
X = data[["Hour_Num", "Item_Code", "Day_Monday", "Day_Tuesday", "Weather_Rainy"]]
y = data["Sales"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
# Train
model = DecisionTreeRegressor(random_state=42)
model.fit(X_train, y_train)
# Train error
y_train_pred = model.predict(X_train)
train_mae = mean_absolute_error(y_train, y_train_pred)
print("Train MAE:", train_mae)
# Test error
y_test_pred = model.predict(X_test)
test_mae = mean_absolute_error(y_test, y_test_pred)
print("Test MAE:", test_mae)
Output:
Train MAE: 0.0
Test MAE: 7.0
- Train MAE 0: Perfect on 4 rows—overfit?
- Test MAE 7: Decent, but the gap hints at overfitting.
Her tree memorized training—Wednesday’s ₹620 risks drift. Day 18: Data Odyssey flags this.
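That train-versus-test comparison can be turned into a tiny diagnostic helper. The function name and thresholds here are hypothetical, just to make the rule of thumb concrete:

```python
# Heuristic fit check: diagnose_fit and its thresholds are illustrative,
# not part of Priya's pipeline.
def diagnose_fit(train_mae, test_mae, gap_tol=5.0, high_err=50.0):
    """Label the fit from the train/test MAE pattern."""
    if train_mae > high_err and test_mae > high_err:
        return "underfitting: high error on both sets"
    if test_mae - train_mae > gap_tol:
        return "overfitting: train looks far better than test"
    return "balanced: similar, acceptable error on both"

print(diagnose_fit(0.0, 7.0))  # Priya's tree: flagged as overfitting
```

With Train MAE 0 and Test MAE 7, the gap alone is enough to raise the flag.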
Visualizing Fit
Plot train vs. test predictions:
import matplotlib.pyplot as plt
plt.scatter(y_train, y_train_pred, color="blue", label="Train")
plt.scatter(y_test, y_test_pred, color="teal", label="Test")
plt.plot([150, 650], [150, 650], color="red", linestyle="--")
plt.xlabel("Actual Sales (₹)")
plt.ylabel("Predicted Sales (₹)")
plt.title("Train vs. Test Predictions")
plt.legend()
plt.show()
- Blue (train): Dots on red line—perfect.
- Teal (test): Near line (500 vs. 510)—slight miss.
Overfit signs—train’s too good. Day 18: Data Odyssey sees this.
Fixing Overfitting
1. More Data:
   - 6 rows overfit easily—Day 12’s 35 or 150 rows dilute the memorization.
   - Add Wednesday’s ₹640—retrain.
2. Simplify Model:
   - Limit tree depth:
model = DecisionTreeRegressor(max_depth=2, random_state=42)
model.fit(X_train, y_train)
train_mae = mean_absolute_error(y_train, model.predict(X_train))
test_mae = mean_absolute_error(y_test, model.predict(X_test))
print("Train MAE (depth 2):", train_mae)
print("Test MAE (depth 2):", test_mae)
Output:
Train MAE (depth 2): 10.0
Test MAE (depth 2): 8.5
   - Smaller gap—more stable.
3. Regularization:
   - Linear Regression with Ridge (penalizes large coefficients):
from sklearn.linear_model import Ridge
model = Ridge(alpha=1.0)
   - Balances the fit—we’ll try it later.
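The “More Data” fix can be sketched directly. This assumes Wednesday maps to Day_Monday = 0 and Day_Tuesday = 0 under the Day 17 encoding, with the sunny 9 AM Samosa sale of ₹640 mentioned above:

```python
import pandas as pd
from sklearn.tree import DecisionTreeRegressor

# Priya's six rows
data = pd.DataFrame({
    "Hour_Num": [7, 8, 9, 7, 8, 9],
    "Item_Code": [0, 0, 1, 0, 0, 1],
    "Day_Monday": [1, 1, 1, 0, 0, 0],
    "Day_Tuesday": [0, 0, 0, 1, 1, 1],
    "Weather_Rainy": [0, 0, 0, 1, 1, 1],
    "Sales": [200, 500, 600, 150, 550, 650],
})

# New observation: Wednesday, 9 AM, Samosa, sunny -> both day dummies 0 (assumed encoding)
wednesday = pd.DataFrame({
    "Hour_Num": [9], "Item_Code": [1], "Day_Monday": [0],
    "Day_Tuesday": [0], "Weather_Rainy": [0], "Sales": [640],
})
data = pd.concat([data, wednesday], ignore_index=True)

# Retrain the depth-limited tree on the larger dataset
X = data.drop(columns="Sales")
y = data["Sales"]
model = DecisionTreeRegressor(max_depth=2, random_state=42).fit(X, y)
print("Training rows:", len(X))  # 7 rows now
```

Every added day gives the tree less room to memorize and more pattern to generalize from.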
Day 18: Data Odyssey tames her tree.
Fixing Underfitting
Linear Regression (Day 15) underfit—MAE ₹12:
- Add Features: Weather helped (Day 17).
- Complex Model: Decision Tree caught 9 AM’s jump.
Her current tree fits better—underfitting’s past. Day 18: Data Odyssey confirms this.
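The contrast is easy to reproduce by fitting the Day 15-style straight line and the tree on the same six rows. This is a rough sketch using training error only—exact numbers depend on the features used:

```python
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_absolute_error

X = pd.DataFrame({
    "Hour_Num": [7, 8, 9, 7, 8, 9],
    "Item_Code": [0, 0, 1, 0, 0, 1],
    "Day_Monday": [1, 1, 1, 0, 0, 0],
    "Day_Tuesday": [0, 0, 0, 1, 1, 1],
    "Weather_Rainy": [0, 0, 0, 1, 1, 1],
})
y = pd.Series([200, 500, 600, 150, 550, 650])

# A straight-line model cannot hit all six rows exactly...
linear = LinearRegression().fit(X, y)
linear_mae = mean_absolute_error(y, linear.predict(X))

# ...while the tree can carve out the 9 AM spike and memorize them
tree = DecisionTreeRegressor(random_state=42).fit(X, y)
tree_mae = mean_absolute_error(y, tree.predict(X))

print("Linear train MAE:", round(linear_mae, 2))
print("Tree train MAE:", round(tree_mae, 2))
```

The line’s leftover error is underfitting in miniature; the tree’s zero is the overfitting risk from earlier.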
Cross-Validation Check
Day 16’s trick—average the error across folds:
from sklearn.model_selection import cross_val_score
scores = cross_val_score(DecisionTreeRegressor(max_depth=2), X, y, cv=3, scoring="neg_mean_absolute_error")
print("Cross-val MAE:", -scores.mean())
Output:
Cross-val MAE: 9.0
₹9 average error—balanced. Overfitting eased—₹620 holds. Day 18: Data Odyssey stabilizes this.
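The same check can compare the unconstrained and depth-limited trees side by side. Fold assignment on six rows is coarse, so treat the printed numbers as indicative only:

```python
import pandas as pd
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

X = pd.DataFrame({
    "Hour_Num": [7, 8, 9, 7, 8, 9],
    "Item_Code": [0, 0, 1, 0, 0, 1],
    "Day_Monday": [1, 1, 1, 0, 0, 0],
    "Day_Tuesday": [0, 0, 0, 1, 1, 1],
    "Weather_Rainy": [0, 0, 0, 1, 1, 1],
})
y = pd.Series([200, 500, 600, 150, 550, 650])

# Cross-validate both trees: lower average MAE means a better-balanced fit
for name, model in [
    ("full depth", DecisionTreeRegressor(random_state=42)),
    ("max_depth=2", DecisionTreeRegressor(max_depth=2, random_state=42)),
]:
    scores = cross_val_score(model, X, y, cv=3, scoring="neg_mean_absolute_error")
    print(name, "cross-val MAE:", round(-scores.mean(), 1))
```

If the shallower tree scores no worse across folds, the extra depth was memorization, not signal.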
Real-World Fit
India’s traffic-prediction models can overfit rush-hour patterns—then fail on holidays. Amazon’s forecasts would underfit without seasonal features—missing Christmas demand. Priya’s 6-row overfit is the small-scale version—more data fixes it. Day 18: Data Odyssey mirrors this.
Why This Matters
Overfitting risks Priya’s ₹620—Wednesday’s ₹610 wastes 2 samosas. Underfitting (₹400 guess) shorts her rush. Balancing—₹9 MAE—stocks 39-41, not 50 or 30. Scale it: fit ML predicts India’s floods—lives hinge on it. Day 18: Data Odyssey steadies her.
Recap Summary
Yesterday, Day 17: Data Odyssey improved Priya’s model—weather and Decision Tree cut MAE to ₹7, predicting ₹620. Today, Day 18: Data Odyssey explored overfitting (Train MAE 0) and underfitting—tuning her tree to ₹9 MAE. It’s her balance step.
What’s Next
Tomorrow, in Day 19: Data Odyssey – How Do We Use ML for Classification?, we’ll shift gears: How can Priya classify “busy” vs. “slow” hours? We’ll try a Decision Tree Classifier on her data, adding a new ML flavor. Bring your curiosity, and I’ll see you there!