
Day 17: Data Odyssey – How Do We Improve ML Models?

Welcome to Day 17: Data Odyssey, our 365-day journey to master data science and artificial intelligence (AI), launched on Shivaratri, February 26, 2025! Yesterday, in Day 16: Data Odyssey – How Do We Evaluate ML Models?, we evaluated Priya’s Linear Regression model, which predicted ₹630 for Wednesday’s 9 AM Samosa sales. With a mean absolute error (MAE) of ₹10-12, MSE of 100, and R² of 0.95 on her 6-row dataset, it’s solid but limited by size. Cross-validation confirmed a ₹12 error—decent, not perfect. Today, we push forward: How do we improve ML models, and how can Priya sharpen her predictions?

The Need for Improvement

Priya’s model works: ₹630 is close to Tuesday’s ₹650, and being ₹12 off isn’t a disaster. But ₹12 on a ₹600 hour is a 2% miss, and multiplied across hours and days it turns into wasted stock or lost sales. Improvement aims to:

  • Cut Error – Shrinking the miss from ₹12 to ₹5 means less money lost to over- or under-stocking.
  • Generalize – Predict new days, not just memorize.
  • Adapt – Handle rain, weekends, growth.

Her 6 rows are the main limit for now; Day 12’s 35 rows, or a month’s 150, would give the model far more to learn from. Day 17: Data Odyssey refines her ML craft.

Priya’s Starting Point

Her data (Day 15):

   Hour_Num  Item_Code  Day_Monday  Day_Tuesday  Sales
0         7          0           1            0    200
1         8          0           1            0    500
2         9          1           1            0    600
3         7          0           0            1    150
4         8          0           0            1    550
5         9          1           0            1    650

  • Model: Linear Regression, MAE ₹12 (Day 16).
  • Prediction: Wednesday, 9 AM, Samosa = ₹630.

Goal: Lower that ₹12—stock smarter. Day 17: Data Odyssey starts here.

Improvement Strategies

ML improves via data, features, and models:

  1. More Data:
    • With only 6 rows the model is prone to overfitting; 35 rows (Day 12) or 150 (a month) would smooth it out.
    • Imagine adding a Wednesday row:

   Hour_Num  Item_Code  Day_Monday  Day_Tuesday  Sales
6         9          1           0            0    640

    • Retrain on the larger table; MAE tends to drop as the data gains variety.
  2. Better Features:
    • Add weather (Day 11’s rainy Tuesday):

   Hour_Num  Item_Code  Day_Monday  Day_Tuesday  Weather_Rainy  Sales
0         7          0           1            0              0    200
1         8          0           1            0              0    500
2         9          1           1            0              0    600
3         7          0           0            1              1    150
4         8          0           0            1              1    550
5         9          1           0            1              1    650

    • Rain boosts samosas (₹650 on rainy Tuesday vs ₹600 on sunny Monday at 9 AM), and the new column gives the model a way to learn this.
  3. Better Model:
    • Linear Regression assumes straight-line relationships, but sales jump non-linearly (the 9 AM spike).
    • Try a Decision Tree: it splits the data with rules (e.g., “if 9 AM, then…”).
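
The "more data" route can be sketched as follows. This is a minimal illustration, not Priya’s actual pipeline: the variable names (monday_tuesday, wednesday, extended) are assumptions, and the ₹640 figure is the imagined Wednesday value from the list above, not a real observation.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Original 6 rows (Monday and Tuesday)
monday_tuesday = pd.DataFrame({
    "Hour_Num": [7, 8, 9, 7, 8, 9],
    "Item_Code": [0, 0, 1, 0, 0, 1],
    "Day_Monday": [1, 1, 1, 0, 0, 0],
    "Day_Tuesday": [0, 0, 0, 1, 1, 1],
    "Sales": [200, 500, 600, 150, 550, 650],
})

# Imagined Wednesday 9 AM samosa row (both day flags are 0)
wednesday = pd.DataFrame({
    "Hour_Num": [9],
    "Item_Code": [1],
    "Day_Monday": [0],
    "Day_Tuesday": [0],
    "Sales": [640],
})

# Stack the new row under the old ones and rebuild a clean 0..6 index
extended = pd.concat([monday_tuesday, wednesday], ignore_index=True)

# Retrain on all 7 rows
features = ["Hour_Num", "Item_Code", "Day_Monday", "Day_Tuesday"]
model = LinearRegression().fit(extended[features], extended["Sales"])
print("Rows used for training:", len(extended))
```

Each new day Priya logs can be appended the same way before retraining.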

Day 17: Data Odyssey tests these.

Adding Features

Update with weather:

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error

# Data with weather
data = pd.DataFrame({
    "Hour_Num": [7, 8, 9, 7, 8, 9],
    "Item_Code": [0, 0, 1, 0, 0, 1],
    "Day_Monday": [1, 1, 1, 0, 0, 0],
    "Day_Tuesday": [0, 0, 0, 1, 1, 1],
    "Weather_Rainy": [0, 0, 0, 1, 1, 1],
    "Sales": [200, 500, 600, 150, 550, 650]
})

# Split
X = data[["Hour_Num", "Item_Code", "Day_Monday", "Day_Tuesday", "Weather_Rainy"]]
y = data["Sales"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

# Train
model = LinearRegression()
model.fit(X_train, y_train)

# Predict and evaluate
y_pred = model.predict(X_test)
mae = mean_absolute_error(y_test, y_pred)
print("MAE with weather:", mae)

Output (hypothetical): MAE with weather: 8.5, down from ₹12. Rain appears to help, but treat the figure as illustrative: with test_size=0.33, only 2 of the 6 rows are held out for testing.
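
Before crediting a new feature, it is worth checking whether it duplicates a column the model already has. The sketch below compares Weather_Rainy with the day flags in the 6-row table; the variable names same_as_tuesday and corr are illustrative.

```python
import pandas as pd

data = pd.DataFrame({
    "Hour_Num": [7, 8, 9, 7, 8, 9],
    "Item_Code": [0, 0, 1, 0, 0, 1],
    "Day_Monday": [1, 1, 1, 0, 0, 0],
    "Day_Tuesday": [0, 0, 0, 1, 1, 1],
    "Weather_Rainy": [0, 0, 0, 1, 1, 1],
    "Sales": [200, 500, 600, 150, 550, 650],
})

# Does the weather column match the Tuesday flag on every row?
same_as_tuesday = bool((data["Weather_Rainy"] == data["Day_Tuesday"]).all())
print("Weather_Rainy identical to Day_Tuesday:", same_as_tuesday)

# Pairwise correlation between the day flags and the weather flag
corr = data[["Day_Monday", "Day_Tuesday", "Weather_Rainy"]].corr()
print(corr)
```

In this table the two columns match on every row, so the model cannot tell rain apart from "Tuesday". The weather feature starts carrying its own information once the data includes a rainy Monday or a dry Tuesday.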

Trying a Decision Tree

Switch to Decision Tree:

from sklearn.tree import DecisionTreeRegressor

# Same data
model = DecisionTreeRegressor(random_state=42)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
mae = mean_absolute_error(y_test, y_pred)
print("Decision Tree MAE:", mae)

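Output (hypothetical): Decision Tree MAE: 7.0, so about ₹7 off. The tree learns rules such as “9 AM, Samosa, Rainy = high.” On a real run with 2 held-out rows, the MAE will come out in coarse steps, so read ₹7 as the kind of gain to aim for, not a measured result.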
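
To see which rules a tree actually learned, scikit-learn’s export_text prints them as nested if/else lines. The sketch below fits on all 6 rows purely for inspection (an assumption made here so the printed tree covers the whole table); the names tree and rules are illustrative.

```python
import pandas as pd
from sklearn.tree import DecisionTreeRegressor, export_text

data = pd.DataFrame({
    "Hour_Num": [7, 8, 9, 7, 8, 9],
    "Item_Code": [0, 0, 1, 0, 0, 1],
    "Day_Monday": [1, 1, 1, 0, 0, 0],
    "Day_Tuesday": [0, 0, 0, 1, 1, 1],
    "Weather_Rainy": [0, 0, 0, 1, 1, 1],
    "Sales": [200, 500, 600, 150, 550, 650],
})
features = ["Hour_Num", "Item_Code", "Day_Monday", "Day_Tuesday", "Weather_Rainy"]

# Fit on every row so the printed rules describe the full table
tree = DecisionTreeRegressor(random_state=42)
tree.fit(data[features], data["Sales"])

# Render the learned splits as readable text
rules = export_text(tree, feature_names=features)
print(rules)
```

Reading the printout is a quick way to confirm whether the tree is splitting on hour, item, day, or weather.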

New Prediction

Wednesday, 9 AM, Samosa, Sunny:

new_data = pd.DataFrame({
    "Hour_Num": [9],
    "Item_Code": [1],
    "Day_Monday": [0],
    "Day_Tuesday": [0],
    "Weather_Rainy": [0]
})
pred = model.predict(new_data)
print("Decision Tree Wednesday 9 AM Samosa (Sunny):", pred[0])

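Output (hypothetical): 620, which sits between sunny Monday’s ₹600 and rainy Tuesday’s ₹650. Note that a default DecisionTreeRegressor returns a Sales value taken from its training rows, so on 6 rows a real run lands on one of the observed values rather than exactly 620.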
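
The sketch below checks how a fully grown tree answers the Wednesday question. It fits on all 6 rows (an assumption made here for illustration, so both 9 AM samosa rows are available), and the names tree, wednesday and seen are illustrative. Wednesday is encoded with both day flags set to 0, a combination the tree has never seen.

```python
import pandas as pd
from sklearn.tree import DecisionTreeRegressor

data = pd.DataFrame({
    "Hour_Num": [7, 8, 9, 7, 8, 9],
    "Item_Code": [0, 0, 1, 0, 0, 1],
    "Day_Monday": [1, 1, 1, 0, 0, 0],
    "Day_Tuesday": [0, 0, 0, 1, 1, 1],
    "Weather_Rainy": [0, 0, 0, 1, 1, 1],
    "Sales": [200, 500, 600, 150, 550, 650],
})
features = ["Hour_Num", "Item_Code", "Day_Monday", "Day_Tuesday", "Weather_Rainy"]

# Default settings grow the tree until every leaf holds a single row
tree = DecisionTreeRegressor(random_state=42).fit(data[features], data["Sales"])

# Wednesday, 9 AM, Samosa, Sunny
wednesday = pd.DataFrame({
    "Hour_Num": [9],
    "Item_Code": [1],
    "Day_Monday": [0],
    "Day_Tuesday": [0],
    "Weather_Rainy": [0],
})
pred = float(tree.predict(wednesday)[0])

# The prediction is copied from a leaf, so it is one of the training values
seen = set(data["Sales"].tolist())
print("Prediction:", pred)
print("Is a training value:", pred in seen)
```

The tree routes Wednesday to either the Monday or the Tuesday 9 AM leaf, depending on which day or weather column it happened to split on. A figure between the two needs more rows or a model that averages.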

Full Improved Script

Combine features and model:

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_absolute_error

# Data
data = pd.DataFrame({
    "Hour_Num": [7, 8, 9, 7, 8, 9],
    "Item_Code": [0, 0, 1, 0, 0, 1],
    "Day_Monday": [1, 1, 1, 0, 0, 0],
    "Day_Tuesday": [0, 0, 0, 1, 1, 1],
    "Weather_Rainy": [0, 0, 0, 1, 1, 1],
    "Sales": [200, 500, 600, 150, 550, 650]
})

# Split
X = data[["Hour_Num", "Item_Code", "Day_Monday", "Day_Tuesday", "Weather_Rainy"]]
y = data["Sales"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

# Train
model = DecisionTreeRegressor(random_state=42)
model.fit(X_train, y_train)

# Evaluate
y_pred = model.predict(X_test)
mae = mean_absolute_error(y_test, y_pred)
print("Decision Tree MAE:", mae)

# Predict Wednesday
new_data = pd.DataFrame({
    "Hour_Num": [9],
    "Item_Code": [1],
    "Day_Monday": [0],
    "Day_Tuesday": [0],
    "Weather_Rainy": [0]
})
pred = model.predict(new_data)
print("Wednesday 9 AM Samosa (Sunny):", pred[0])

Output (hypothetical):

Decision Tree MAE: 7.0
Wednesday 9 AM Samosa (Sunny): 620

Priya’s error drops, and ₹620 feels sharp!

Why It Improves

  • Features: Weather adds context—rain lifts samosas.
  • Model: Decision Tree catches jumps (9 AM spike) Linear Regression smooths over.
  • Data: Still just 6 rows; 35 rows would likely cut MAE further.

₹12 to ₹7 is ₹5 less error on every hourly forecast, and that adds up over a day.
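
With so few rows, a single train/test split is noisy, so one hedged way to compare the two models is leave-one-out cross-validation: train on 5 rows, test on the 6th, and repeat for every row. The sketch below is an illustration under that assumption, not the evaluation used on Day 16; the names candidates and results are illustrative.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.tree import DecisionTreeRegressor

data = pd.DataFrame({
    "Hour_Num": [7, 8, 9, 7, 8, 9],
    "Item_Code": [0, 0, 1, 0, 0, 1],
    "Day_Monday": [1, 1, 1, 0, 0, 0],
    "Day_Tuesday": [0, 0, 0, 1, 1, 1],
    "Weather_Rainy": [0, 0, 0, 1, 1, 1],
    "Sales": [200, 500, 600, 150, 550, 650],
})
X = data[["Hour_Num", "Item_Code", "Day_Monday", "Day_Tuesday", "Weather_Rainy"]]
y = data["Sales"]

candidates = {
    "Linear Regression": LinearRegression(),
    "Decision Tree": DecisionTreeRegressor(random_state=42),
}

results = {}
for name, estimator in candidates.items():
    # scikit-learn reports MAE as a negative score, so flip the sign
    scores = cross_val_score(
        estimator, X, y,
        cv=LeaveOneOut(),
        scoring="neg_mean_absolute_error",
    )
    results[name] = -scores.mean()
    print(f"{name}: leave-one-out MAE = {results[name]:.1f}")
```

Whichever model scores lower here is the better bet on this table, and the comparison is worth repeating as rows are added.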

Real-World Improvement

Traffic-prediction systems get more accurate when road data is added, and online retailers such as Amazon refine their models with customer click data. Priya’s weather feature and Decision Tree follow the same pattern on a small scale.

Challenges

Improvement stumbles:

  • Overfit: An unpruned Decision Tree can memorize the 6 rows and then flop when tested on 35.
  • Features: Bad ones (e.g., “Staff Mood”) confuse.
  • Data: Still tiny—more days needed.

Priya’s ₹620 is shaky on so little data; more rows will stabilize it.
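
The memorization risk is easy to see by scoring a tree on the same rows it was trained on. The sketch below compares a fully grown tree with one capped at max_depth=1; the cap is an illustrative choice, not a recommended setting, and the variable names are assumptions.

```python
import pandas as pd
from sklearn.metrics import mean_absolute_error
from sklearn.tree import DecisionTreeRegressor

data = pd.DataFrame({
    "Hour_Num": [7, 8, 9, 7, 8, 9],
    "Item_Code": [0, 0, 1, 0, 0, 1],
    "Day_Monday": [1, 1, 1, 0, 0, 0],
    "Day_Tuesday": [0, 0, 0, 1, 1, 1],
    "Weather_Rainy": [0, 0, 0, 1, 1, 1],
    "Sales": [200, 500, 600, 150, 550, 650],
})
X = data[["Hour_Num", "Item_Code", "Day_Monday", "Day_Tuesday", "Weather_Rainy"]]
y = data["Sales"]

# Fully grown tree: one leaf per row, so it reproduces the training data exactly
full_tree = DecisionTreeRegressor(random_state=42).fit(X, y)
train_mae = mean_absolute_error(y, full_tree.predict(X))

# Depth-limited tree: a single split, so it can only predict two averages
shallow_tree = DecisionTreeRegressor(max_depth=1, random_state=42).fit(X, y)
shallow_mae = mean_absolute_error(y, shallow_tree.predict(X))

print("Training MAE, full tree:", train_mae)
print("Training MAE, max_depth=1:", round(shallow_mae, 1))
```

A training error of zero says nothing about new days; it only shows the tree has stored the 6 rows. Limiting depth trades some training accuracy for rules that are less tied to individual rows.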

Why This Matters

Improving the model narrows Priya’s forecast: ₹620, give or take ₹7, lets her stock close to what she will actually sell instead of over-frying or running short. A ₹12 miss costs her more on every forecast; a smaller one protects her profit. At scale, the same discipline applies to higher-stakes models such as flood prediction.

Recap Summary

Yesterday, Day 16: Data Odyssey evaluated Priya’s model: MAE ₹12, R² 0.95, solid for 6 rows. Today, Day 17: Data Odyssey improved it: a weather feature and a Decision Tree brought the illustrative MAE down to ₹7 and predicted ₹620 for Wednesday. It’s her refinement step.

What’s Next

Tomorrow, in Day 18: Data Odyssey – What is Overfitting and Underfitting?, we’ll explore pitfalls: Why might ₹620 fail? How do we balance? We’ll diagnose her model’s fit with Scikit-Learn, ensuring it lasts. Bring your curiosity, and I’ll see you there!
