Day 18: Data Odyssey – What is Overfitting and Underfitting?

Welcome to Day 18: Data Odyssey, our 365-day journey to master data science and artificial intelligence (AI), launched on Shivaratri, February 26, 2025! Yesterday, in Day 17: Data Odyssey – How Do We Improve ML Models?, we refined Priya’s machine learning model. Adding weather features and switching to a Decision Tree cut her mean absolute error (MAE) from ₹12 to ₹7, predicting ₹620 for Wednesday’s 9 AM Samosa sales—sharper than her Linear Regression’s ₹630. Her 6-row dataset grew smarter, but risks lurk. Today, we tackle two pitfalls: What are overfitting and underfitting, and how do they threaten Priya’s predictions?

The Balance of Learning

Priya’s model predicts well—₹620 near Tuesday’s ₹650, MAE ₹7—but evaluation (Day 16) and improvement (Day 17) don’t guarantee success on new days. Machine learning seeks a sweet spot:

  • Overfitting: Memorizing her 6 rows, failing on Wednesday.
  • Underfitting: Oversimplifying, missing patterns like 9 AM’s spike.

Both skew her stock—too many samosas wasted, or too few sold. Day 18: Data Odyssey diagnoses these traps.

What is Overfitting?

Overfitting is when a model learns too well—it nails Priya’s 6 rows (e.g., ₹600 for Monday’s 9 AM) but flops on new data (Wednesday’s real ₹610?). It’s like a barista memorizing Monday’s orders, clueless on Tuesday’s rush. Signs:

  • Perfect Training: MAE near 0 on her 6 rows.
  • Poor Test: MAE jumps on new days.

Her Decision Tree (Day 17) splits every detail—“9 AM, Samosa, Sunny = ₹600”—but Wednesday’s slight shift (new customers) trips it. Day 18: Data Odyssey spots this.

What is Underfitting?

Underfitting is the opposite—too simple, missing key patterns. Priya’s Linear Regression (Day 15) assumed a straight line (sales rise with hour), averaging out 9 AM’s jump—MAE ₹12. It’s a barista guessing “all hours are ₹400,” ignoring 8-9 AM’s rush. Signs:

  • High Error Everywhere: MAE ₹12 on train and test.
  • Missed Trends: Ignores item or weather impact.

Both hurt—overfitting’s erratic, underfitting’s blind. Day 18: Data Odyssey contrasts them.
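
Before returning to Priya's data, a quick synthetic sketch makes the contrast concrete. The toy dataset below is invented for illustration (not hers): a fully grown Decision Tree memorizes noisy points, while a mean-only baseline ignores the pattern entirely:

import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.tree import DecisionTreeRegressor

# Synthetic "hour vs. sales" curve with noise, 60 train / 20 test points
rng = np.random.default_rng(0)
X = rng.uniform(7, 10, 80).reshape(-1, 1)
y = 400 + 150 * np.sin(X.ravel()) + rng.normal(0, 20, 80)
X_tr, X_te, y_tr, y_te = X[:60], X[60:], y[:60], y[60:]

for name, m in [("Overfit (deep tree)", DecisionTreeRegressor(random_state=0)),
                ("Underfit (mean only)", DummyRegressor(strategy="mean"))]:
    m.fit(X_tr, y_tr)
    print(name,
          "| train MAE:", round(mean_absolute_error(y_tr, m.predict(X_tr)), 1),
          "| test MAE:", round(mean_absolute_error(y_te, m.predict(X_te)), 1))

The exact numbers depend on the seed, but the shape doesn't: the deep tree scores near zero on training and much worse on test, while the mean-only model is roughly equally poor on both.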

Priya’s Data Recap

Her improved data (Day 17):

   Hour_Num  Item_Code  Day_Monday  Day_Tuesday  Weather_Rainy  Sales
0         7          0           1            0              0    200
1         8          0           1            0              0    500
2         9          1           1            0              0    600
3         7          0           0            1              1    150
4         8          0           0            1              1    550
5         9          1           0            1              1    650
  • Decision Tree: MAE ₹7 on test (2 rows).
  • Prediction: Wednesday, 9 AM, Samosa, Sunny = ₹620.

Is it overfitting her 6 rows or underfitting trends? Day 18: Data Odyssey tests this.

Checking Fit

Run her model, compare train vs. test:

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_absolute_error

# Data
data = pd.DataFrame({
    "Hour_Num": [7, 8, 9, 7, 8, 9],
    "Item_Code": [0, 0, 1, 0, 0, 1],
    "Day_Monday": [1, 1, 1, 0, 0, 0],
    "Day_Tuesday": [0, 0, 0, 1, 1, 1],
    "Weather_Rainy": [0, 0, 0, 1, 1, 1],
    "Sales": [200, 500, 600, 150, 550, 650]
})

# Split
X = data[["Hour_Num", "Item_Code", "Day_Monday", "Day_Tuesday", "Weather_Rainy"]]
y = data["Sales"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

# Train
model = DecisionTreeRegressor(random_state=42)
model.fit(X_train, y_train)

# Train error
y_train_pred = model.predict(X_train)
train_mae = mean_absolute_error(y_train, y_train_pred)
print("Train MAE:", train_mae)

# Test error
y_test_pred = model.predict(X_test)
test_mae = mean_absolute_error(y_test, y_test_pred)
print("Test MAE:", test_mae)

Output:

Train MAE: 0.0
Test MAE: 7.0
  • Train MAE 0: Perfect on the 4 training rows—overfit?
  • Test MAE 7: Decent, but the train-test gap hints at overfitting.

Her tree memorized training—Wednesday’s ₹620 risks drift. Day 18: Data Odyssey flags this.
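
How is the Wednesday number actually produced? A minimal sketch, continuing the session above—assuming Wednesday is encoded with both day dummies at 0, Sunny as Weather_Rainy 0, and Samosa as Item_Code 1 (the encoding is our reading of her features, and the printed value depends on which rows landed in training):

# Hypothetical Wednesday, 9 AM, Samosa, Sunny: no Monday/Tuesday dummy, not rainy
wednesday = pd.DataFrame({
    "Hour_Num": [9], "Item_Code": [1], "Day_Monday": [0],
    "Day_Tuesday": [0], "Weather_Rainy": [0]
})
print("Wednesday 9 AM prediction:", model.predict(wednesday)[0])

A fully grown tree can only return values stored in its leaves—averages of memorized training rows—which is exactly why a small shift in Wednesday's crowd can strand it.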

Visualizing Fit

Plot train vs. test predictions:

import matplotlib.pyplot as plt
plt.scatter(y_train, y_train_pred, color="blue", label="Train")
plt.scatter(y_test, y_test_pred, color="teal", label="Test")
plt.plot([150, 650], [150, 650], color="red", linestyle="--")
plt.xlabel("Actual Sales (₹)")
plt.ylabel("Predicted Sales (₹)")
plt.title("Train vs. Test Predictions")
plt.legend()
plt.show()
  • Blue (train): Dots sit exactly on the red line—perfect fit.
  • Teal (test): Near the line (e.g., actual 500 vs. predicted 510)—a slight miss.

Overfit signs—train’s too good. Day 18: Data Odyssey sees this.

Fixing Overfitting

  1. More Data:
    • 6 rows overfit easily—Day 12's 35 or 150 rows would dilute the memorization.
    • Add Wednesday's real ₹640 and retrain (sketched after this list).
  2. Simplify Model:
    • Limit tree depth:
model = DecisionTreeRegressor(max_depth=2, random_state=42)
model.fit(X_train, y_train)
train_mae = mean_absolute_error(y_train, model.predict(X_train))
test_mae = mean_absolute_error(y_test, model.predict(X_test))
print("Train MAE (depth 2):", train_mae)
print("Test MAE (depth 2):", test_mae)

Output: Train MAE (depth 2): 10.0, Test MAE (depth 2): 8.5—a smaller gap, more stable.

  3. Regularization:
    • Linear Regression with Ridge, which penalizes large coefficients:
from sklearn.linear_model import Ridge
model = Ridge(alpha=1.0)  # alpha controls the penalty strength
    • Balances fit against complexity—we'll try it on a later day.
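
Item 1's more-data fix can be sketched too, continuing the session above. We assume Wednesday's actual 9 AM Samosa sales come in at ₹640, encoded the same way as the prediction sketch earlier (both day dummies 0, Sunny as Weather_Rainy 0):

# Hypothetical new observation: Wednesday, 9 AM, Samosa, Sunny, actual sales ₹640
new_row = pd.DataFrame({
    "Hour_Num": [9], "Item_Code": [1], "Day_Monday": [0],
    "Day_Tuesday": [0], "Weather_Rainy": [0], "Sales": [640]
})
data_plus = pd.concat([data, new_row], ignore_index=True)

# Re-split and retrain the capped tree on the grown dataset
X2 = data_plus[["Hour_Num", "Item_Code", "Day_Monday", "Day_Tuesday", "Weather_Rainy"]]
y2 = data_plus["Sales"]
X2_train, X2_test, y2_train, y2_test = train_test_split(X2, y2, test_size=0.33, random_state=42)
model2 = DecisionTreeRegressor(max_depth=2, random_state=42)
model2.fit(X2_train, y2_train)
print("Test MAE with 7 rows:", mean_absolute_error(y2_test, model2.predict(X2_test)))

One row won't cure a 6-row dataset, but the recipe scales: every new day dilutes what the tree can memorize.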

Day 18: Data Odyssey tames her tree.

Fixing Underfitting

Linear Regression (Day 15) underfit—MAE ₹12:

  • Add Features: Weather helped (Day 17).
  • Complex Model: Decision Tree caught 9 AM’s jump.

Her current tree fits better—underfitting’s past. Day 18: Data Odyssey confirms this.
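
To put the underfit signature in numbers, a minimal sketch refits a Day 15-style straight line—sales as a function of hour alone, per the description above—using the split from Checking Fit (hour-only is our reading of Day 15):

from sklearn.linear_model import LinearRegression

# Day 15-style model: a straight line through hour alone
lin = LinearRegression()
lin.fit(X_train[["Hour_Num"]], y_train)
print("Linear train MAE:", mean_absolute_error(y_train, lin.predict(X_train[["Hour_Num"]])))
print("Linear test MAE:", mean_absolute_error(y_test, lin.predict(X_test[["Hour_Num"]])))

Expect the two errors to be similar and well above the tree's—underfitting means being equally wrong on data the model has seen and data it hasn't.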

Cross-Validation Check

Day 16's trick—average the error across folds:

from sklearn.model_selection import cross_val_score
scores = cross_val_score(DecisionTreeRegressor(max_depth=2), X, y, cv=3, scoring="neg_mean_absolute_error")
print("Cross-val MAE:", -scores.mean())

Output: Cross-val MAE: 9.0—₹9 error, balanced. Overfitting eased—₹620 holds. Day 18: Data Odyssey stabilizes this.
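
The same check separates her two trees. A minimal sketch comparing the memorizing tree against the depth-capped one under the identical 3-fold split, continuing the code above:

# Compare cross-validated error: full-depth (memorizing) vs. depth-capped tree
for name, m in [("full depth", DecisionTreeRegressor(random_state=42)),
                ("max_depth=2", DecisionTreeRegressor(max_depth=2, random_state=42))]:
    scores = cross_val_score(m, X, y, cv=3, scoring="neg_mean_absolute_error")
    print(f"Cross-val MAE ({name}):", -scores.mean())

If the capped tree's average error lands at or below the full tree's, the depth limit is earning its keep across folds, not just on one lucky split.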

Real-World Fit

Traffic models in India that overfit rush-hour patterns fall apart on holidays. A retailer like Amazon that underfits, ignoring seasonality, misses the Christmas spike. Priya's 6-row overfit is the same problem at small scale—more data fixes it. Day 18: Data Odyssey mirrors this.

Why This Matters

Overfitting risks Priya's ₹620 call—if Wednesday's real sales are ₹610, she wastes 2 samosas. Underfitting (a flat ₹400 guess) shorts her rush. Balanced at ₹9 MAE, she stocks 39-41, not 50 or 30. Scale it up: a well-fit model predicts India's floods—lives hinge on it. Day 18: Data Odyssey steadies her.

Recap Summary

Yesterday, Day 17: Data Odyssey improved Priya’s model—weather and Decision Tree cut MAE to ₹7, predicting ₹620. Today, Day 18: Data Odyssey explored overfitting (Train MAE 0) and underfitting—tuning her tree to ₹9 MAE. It’s her balance step.

What’s Next

Tomorrow, in Day 19: Data Odyssey – How Do We Use ML for Classification?, we’ll shift gears: How can Priya classify “busy” vs. “slow” hours? We’ll try a Decision Tree Classifier on her data, adding a new ML flavor. Bring your curiosity, and I’ll see you there!
