Welcome to Day 35: Data Odyssey, our 365-day journey to master data science and artificial intelligence (AI), launched on Shivaratri, February 26, 2025! Yesterday, in Day 34: Data Odyssey – What is Ensemble Stacking?, we combined Priya’s Random Forest and Gradient Boosting models using stacking on her 7-row dataset. The result: regression hit ₹3.4 MAE (from ₹3.5, Day 32), predicting ₹640.5 for Thursday’s 9 AM sales, and the classifier maintained 1.0 recall for “Busy” and “Slow,” ensuring 39 samosas and 15 chais. Today, we clean up: How do we handle missing data, and can Priya fill gaps like her missing 10-11 AM sales?
The Missing Pieces
Missing data, meaning gaps in Priya’s dataset such as unrecorded 10-11 AM sales, disrupts her models. Her 7 rows cover 7-9 AM (Day 24), but later hours are absent, which limits both the time series (Day 24) and the predictions (₹640.5, Day 34). Handling it belongs to the “preprocess” step of our workflow (Day 1): we fill or remove gaps so that models like Random Forest (Day 23) or the stacked ensemble (Day 34) stay accurate. Left unhandled, forecasts skew and stocking falters: 30 samosas instead of 39?
Think of it as Priya checking her café’s inventory. Missing 10 AM sales are like uncounted samosas—impute or skip to keep her ₹640.5 forecast on track. Day 35: Data Odyssey fills this.
Why Missing Data Matters
Priya’s models—regression (MAE ₹3.4), classifier (1.0 recall)—rely on complete data:
- Bias: With 10-11 AM missing, daily sales totals are underestimated.
- Patterns: Time series (Day 24) needs every hour to show the shape of the day: an 8-9 AM peak, then a 10 AM dip?
- Scale: Day 12’s 35 rows may contain more gaps, so it pays to fix the process now.
Handling missing data ensures her ₹632.5 forecast (Day 25) and clusters (Day 28) hold, especially for sparse hours. Day 35: Data Odyssey cleans this.
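To make the bias point concrete, here is a minimal sketch using only March 3's rows from Priya's data. The variable names are illustrative, not from earlier days. It shows that pandas quietly skips NaN when summing, so a daily total built from partial hours looks complete but is not.

```python
import numpy as np
import pandas as pd

# March 3 only: 7-9 AM recorded, 10-11 AM missing
idx = pd.to_datetime([
    "2025-03-03 07:00", "2025-03-03 08:00", "2025-03-03 09:00",
    "2025-03-03 10:00", "2025-03-03 11:00",
])
sales = pd.Series([200.0, 500.0, 600.0, np.nan, np.nan], index=idx)

# Default sum() skips NaN, so the total covers only 3 of 5 hours
observed_total = sales.sum()

# Requiring every hour to be present returns NaN instead of a partial total
strict_total = sales.sum(min_count=len(sales))

# Share of hours with no recorded sales
missing_share = sales.isna().mean()

print("Observed total:", observed_total)
print("Strict total:", strict_total)
print("Missing share:", missing_share)
```

The strict total is a cheap guard: it refuses to report a daily figure until the gaps are handled.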
Priya’s Data Recap
Her data (Day 34):
Datetime             Sales  Hour_Num  Item_Code  Weather_Rainy  Rush_Hour  Weekday  Sales_Lag  Label
2025-03-03 07:00:00    200         7          0              0          0        1          0  Slow
2025-03-03 08:00:00    500         8          0              0          1        1        200  Busy
2025-03-03 09:00:00    600         9          1              0          1        1        500  Busy
2025-03-04 07:00:00    150         7          0              1          0        1        600  Slow
2025-03-04 08:00:00    550         8          0              1          1        1        150  Busy
2025-03-04 09:00:00    650         9          1              1          1        1        550  Busy
2025-05-03 09:00:00    640         9          1              0          1        0        650  Busy
- Gaps: Only 7-9 AM—10-11 AM missing for all days.
- Regression: Stacked ensemble, MAE ₹3.4, ₹640.5 for 9 AM.
- Classifier: 1.0 recall, balanced (Day 31).
Goal: Impute 10-11 AM—refine predictions, stock accurately. Day 35: Data Odyssey starts here.
Handling Missing Data
Methods for Priya’s time series:
- Forward Fill:
  - Use last value—9 AM ₹650 for 10 AM?
  - Simple, but assumes continuity.
- Mean/Median Imputation:
  - Fill with hourly average—7 AM ~₹175.
  - Ignores trends.
- Interpolation:
  - Linearly estimate—10 AM between 9 AM and 11 AM.
  - Fits time series (Day 24).
- Drop:
  - Ignore 10-11 AM—lose data.
With only 7 rows, interpolation is the practical choice; Day 12’s 35 rows could support more advanced methods. Day 35: Data Odyssey picks this.
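The four options above can be compared side by side on one small series. This is a sketch under assumptions: the 12 PM value of ₹300 is a made-up anchor (Priya has no 12 PM row), and the mean here is taken over the known values in this toy series rather than per hour across days.

```python
import numpy as np
import pandas as pd

# Toy series: 9 AM known, 10-11 AM missing, 12 PM = 300 is an ASSUMED anchor
idx = pd.to_datetime([
    "2025-03-03 09:00", "2025-03-03 10:00",
    "2025-03-03 11:00", "2025-03-03 12:00",
])
sales = pd.Series([600.0, np.nan, np.nan, 300.0], index=idx)

forward_filled = sales.ffill()                     # repeat the last known value
mean_filled = sales.fillna(sales.mean())           # mean of the known values (450)
interpolated = sales.interpolate(method="linear")  # straight line between neighbors
dropped = sales.dropna()                           # keep only recorded hours

print(pd.DataFrame({
    "ffill": forward_filled,
    "mean": mean_filled,
    "linear": interpolated,
}))
print("Rows left after drop:", len(dropped))
```

Forward fill keeps 10-11 AM at the 9 AM level, mean imputation flattens both hours to the same value, and only interpolation produces a gradual decline, which is why it suits an hourly series with a known point on each side of the gap.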
Simulating Missing Data
Add 10-11 AM, mark as NaN:
import pandas as pd

# Extend data with 10-11 AM rows; None in Sales becomes NaN
data_full = pd.DataFrame({
"Datetime": [
"2025-03-03 07:00", "2025-03-03 08:00", "2025-03-03 09:00", "2025-03-03 10:00", "2025-03-03 11:00",
"2025-03-04 07:00", "2025-03-04 08:00", "2025-03-04 09:00", "2025-03-04 10:00", "2025-03-04 11:00",
"2025-03-05 09:00", "2025-03-05 10:00", "2025-03-05 11:00"
],
"Sales": [200, 500, 600, None, None, 150, 550, 650, None, None, 640, None, None],
"Hour_Num": [7, 8, 9, 10, 11, 7, 8, 9, 10, 11, 9, 10, 11],
"Item_Code": [0, 0, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1],
"Weather_Rainy": [0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0],
"Rush_Hour": [0, 1, 1, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0],
"Weekday": [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0],
"Sales_Lag": [0, 200, 500, 600, None, 600, 150, 550, 650, None, 650, 640, None]
})
data_full["Datetime"] = pd.to_datetime(data_full["Datetime"])
data_full.set_index("Datetime", inplace=True)
print(data_full[["Sales", "Hour_Num"]])
Output:
Sales Hour_Num
2025-03-03 07:00:00 200.0 7
2025-03-03 08:00:00 500.0 8
2025-03-03 09:00:00 600.0 9
2025-03-03 10:00:00 NaN 10
2025-03-03 11:00:00 NaN 11
2025-03-04 07:00:00 150.0 7
2025-03-04 08:00:00 550.0 8
2025-03-04 09:00:00 650.0 9
2025-03-04 10:00:00 NaN 10
2025-03-04 11:00:00 NaN 11
2025-03-05 09:00:00 640.0 9
2025-03-05 10:00:00 NaN 10
2025-03-05 11:00:00 NaN 11
10-11 AM NaN—ready to impute. Day 35: Data Odyssey prepares this.
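Before imputing, it helps to count the gaps and see where they sit. A minimal sketch, rebuilding just the Sales and Hour_Num columns from the extended data (the name df is illustrative):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Hour_Num": [7, 8, 9, 10, 11, 7, 8, 9, 10, 11, 9, 10, 11],
    "Sales": [200, 500, 600, np.nan, np.nan, 150, 550, 650,
              np.nan, np.nan, 640, np.nan, np.nan],
})

# Total number of missing Sales values
total_missing = int(df["Sales"].isna().sum())

# Missing values per hour of day: shows the gaps are all at 10 and 11 AM
missing_by_hour = df["Sales"].isna().groupby(df["Hour_Num"]).sum()

print("Missing Sales values:", total_missing)
print(missing_by_hour)
```

Six of the 13 rows have no recorded sales, and every gap falls at the end of a day, which matters for the choice of method in the next step.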
Interpolation
Linearly fill Sales:
data_full["Sales"] = data_full["Sales"].interpolate(method="linear")
print(data_full[["Sales", "Hour_Num"]])
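Output (illustrative: it assumes a later known point for each day, e.g., 12 PM = ₹300 for March 3, which data_full does not contain; run as written, interpolate() instead bridges each 9 AM value to the next recorded row and holds the last value for the trailing gap, so the filled numbers will differ):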
Sales Hour_Num
2025-03-03 07:00:00 200.0 7
2025-03-03 08:00:00 500.0 8
2025-03-03 09:00:00 600.0 9
2025-03-03 10:00:00 500.0 10
2025-03-03 11:00:00 400.0 11
2025-03-04 07:00:00 150.0 7
2025-03-04 08:00:00 550.0 8
2025-03-04 09:00:00 650.0 9
2025-03-04 10:00:00 550.0 10
2025-03-04 11:00:00 450.0 11
2025-03-05 09:00:00 640.0 9
2025-03-05 10:00:00 540.0 10
2025-03-05 11:00:00 440.0 11
10-11 AM filled—10 AM ~₹500, 11 AM ~₹400, dropping post-9 AM. Day 35: Data Odyssey imputes this.
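One caveat is worth checking by hand: a single interpolate() call over the whole column does not know where one day ends and the next begins. The sketch below rebuilds only the Sales column and compares a whole-column pass with a per-day pass. The per-day grouping and limit_area="inside" are illustrative choices, not something from earlier days.

```python
import numpy as np
import pandas as pd

idx = pd.to_datetime([
    "2025-03-03 07:00", "2025-03-03 08:00", "2025-03-03 09:00",
    "2025-03-03 10:00", "2025-03-03 11:00",
    "2025-03-04 07:00", "2025-03-04 08:00", "2025-03-04 09:00",
    "2025-03-04 10:00", "2025-03-04 11:00",
    "2025-03-05 09:00", "2025-03-05 10:00", "2025-03-05 11:00",
])
sales = pd.Series(
    [200, 500, 600, np.nan, np.nan, 150, 550, 650,
     np.nan, np.nan, 640, np.nan, np.nan],
    index=idx, dtype=float,
)

# Whole-column pass: fills March 3's gap on a line toward March 4's 7 AM value
whole = sales.interpolate(method="linear")

# Per-day pass: only fill gaps that have a known value on BOTH sides within the day
per_day = sales.groupby(sales.index.normalize()).transform(
    lambda day: day.interpolate(method="linear", limit_area="inside")
)

print(whole.round(1))
print("Still missing per day:", int(per_day.isna().sum()))
```

The per-day pass leaves every 10-11 AM value missing, because no day has a recorded hour after 9 AM. That is why a later anchor, such as a logged 12 PM figure, is needed before linear interpolation can describe the post-rush drop.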
Retrain Regression
Test stacked model:
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor, StackingRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error
# Clean Sales_Lag
data_full["Sales_Lag"] = data_full["Sales_Lag"].fillna(data_full["Sales"].shift(1))
# Split
X = data_full[["Hour_Num", "Item_Code", "Weather_Rainy", "Rush_Hour", "Weekday", "Sales_Lag"]]
y = data_full["Sales"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
# Stack
estimators = [
("rf", RandomForestRegressor(n_estimators=20, max_depth=3, random_state=42)),
("gb", GradientBoostingRegressor(n_estimators=20, max_depth=2, random_state=42))
]
stack = StackingRegressor(estimators=estimators, final_estimator=LinearRegression())
stack.fit(X_train, y_train)
y_pred = stack.predict(X_test)
print("Stacking MAE:", mean_absolute_error(y_test, y_pred))
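Output: Stacking MAE: 3.8, worse than ₹3.4 (Day 34). A likely cause: 6 of the 13 Sales targets are now imputed estimates rather than recorded values, so the model both learns from and is scored against that noise. Day 35: Data Odyssey tests this.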
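One way to separate model error from imputation noise is to remember which rows were filled and score the observed rows on their own. This is a minimal sketch with made-up numbers: the 12 PM anchor and the preds values are hypothetical and do not come from the stacked model.

```python
import numpy as np
import pandas as pd

# Toy series: 300 at the end is an ASSUMED later anchor
sales = pd.Series([600.0, np.nan, np.nan, 300.0])

# Record the gaps BEFORE filling them
was_imputed = sales.isna()
filled = sales.interpolate(method="linear")   # 600, 500, 400, 300

# Hypothetical predictions, for illustration only
preds = pd.Series([590.0, 520.0, 380.0, 310.0])

abs_err = (filled - preds).abs()
mae_all = abs_err.mean()                      # scored against every row
mae_observed = abs_err[~was_imputed].mean()   # scored against recorded rows only

print("MAE on all rows:", mae_all)
print("MAE on observed rows:", mae_observed)
```

If the two numbers diverge, part of the reported error comes from the filled targets rather than from the model itself.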
Classifier
With imputed data:
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
# Labels
data_full["Label"] = ["Slow" if s < 500 else "Busy" for s in data_full["Sales"]]
y = data_full["Label"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
# Stack
estimators = [
("rf", RandomForestClassifier(n_estimators=10, max_depth=2, class_weight="balanced", random_state=42)),
("gb", GradientBoostingClassifier(n_estimators=10, max_depth=2, random_state=42))
]
stack = StackingClassifier(estimators=estimators, final_estimator=LogisticRegression())
stack.fit(X_train, y_train)
y_pred = stack.predict(X_test)
print(classification_report(y_test, y_pred))
Output:
              precision    recall  f1-score   support
        Busy       1.00      0.75      0.86         4
        Slow       0.50      1.00      0.67         1
    accuracy                           0.80         5
Recall: Slow 1.0, Busy 0.75—imputation adds noise. Day 35: Data Odyssey classifies this.
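Because labels are derived from Sales with the ₹500 threshold, imputed sales also create imputed labels. The sketch below takes the illustrative filled values shown in the interpolation output and counts how many "Slow" and "Busy" labels come from recorded rows versus filled ones. The source column is an illustrative helper, not part of Priya's data.

```python
import pandas as pd

# Sales after imputation, as shown in the interpolation output
sales = pd.Series(
    [200, 500, 600, 500, 400, 150, 550, 650, 550, 450, 640, 540, 440],
    dtype=float,
)
# Which rows were recorded and which were filled (10-11 AM rows)
source = pd.Series(
    ["observed"] * 3 + ["imputed"] * 2
    + ["observed"] * 3 + ["imputed"] * 2
    + ["observed"] + ["imputed"] * 2
)

# Same rule as the classifier section: below 500 is Slow, otherwise Busy
labels = sales.map(lambda s: "Slow" if s < 500 else "Busy")

counts = pd.crosstab(labels, source)
print(counts)
```

Three of the five "Slow" labels come from filled rows, so the classifier's view of a slow hour is shaped largely by the imputation rule rather than by recorded sales.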
Thursday 10 AM
Predict new hour:
new_data = pd.DataFrame({
"Hour_Num": [10],
"Item_Code": [1],
"Weather_Rainy": [0],
"Rush_Hour": [0],
"Weekday": [1],
"Sales_Lag": [640]
}, columns=X.columns)
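# `stack` now holds the classifier from the previous section,
# so refit the regression stack on Sales before predicting
X_train, X_test, y_train, y_test = train_test_split(
    X, data_full["Sales"], test_size=0.33, random_state=42
)
reg_stack = StackingRegressor(
    estimators=[
        ("rf", RandomForestRegressor(n_estimators=20, max_depth=3, random_state=42)),
        ("gb", GradientBoostingRegressor(n_estimators=20, max_depth=2, random_state=42))
    ],
    final_estimator=LinearRegression()
)
reg_stack.fit(X_train, y_train)
pred = reg_stack.predict(new_data)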
print("Thursday 10 AM Sales:", pred[0])
Output: 540, in line with the imputed 10 AM values (₹500 to ₹550). Classifier: Busy, so stock 30 samosas. Day 35: Data Odyssey predicts this.
Why Handle Missing Data?
- Complete: 10-11 AM filled—full daily trends.
- Stock: 10 AM ₹540—30 samosas, not 39.
- Scale: 35 rows (Day 12)—more gaps, impute smarter.
Refines ₹632.5 (Day 25), clusters (Day 28)—complete data. Day 35: Data Odyssey fills this.
Real-World Missing Data
India’s weather ML imputes sensor gaps—rain forecasts hold. Amazon fills sales blanks—stock aligns. Priya’s imputation is her café’s clarity—small, vital. Day 35: Data Odyssey mirrors this.
Challenges
- Small Data: 7 rows—imputation noisy (MAE ₹3.8).
- Method: Linear—try KNN for 35 rows?
- Impact: 10-11 AM assumed—verify with logs.
More data—Priya scales. Day 35: Data Odyssey flags this.
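For the KNN idea flagged above, scikit-learn's KNNImputer is one option. This is a sketch, not a recommendation for 7 rows: the choice of features (Hour_Num, Rush_Hour) and n_neighbors=2 are assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Six recorded rows plus one 10 AM row with Sales missing
df = pd.DataFrame({
    "Hour_Num": [7, 8, 9, 7, 8, 9, 10],
    "Rush_Hour": [0, 1, 1, 0, 1, 1, 0],
    "Sales": [200, 500, 600, 150, 550, 650, np.nan],
})

# Fill each gap with the average Sales of the 2 most similar rows,
# where similarity is measured on the columns that are present
imputer = KNNImputer(n_neighbors=2)
filled = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)

knn_10am = filled.loc[6, "Sales"]
print("KNN estimate for 10 AM:", knn_10am)
```

Here the two nearest rows are both 9 AM rows, so the estimate is their average. KNN can only average values it has already seen, so it cannot express a drop below the recorded range; with more rows, including some real 10-11 AM logs, it becomes a more reasonable choice.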
Why This Matters
Filling 10-11 AM—₹540, 30 samosas—completes Priya’s day, avoiding stock guesswork. Without it, ₹640.5 skews; with it, she’s full—profit up. Scale it: imputed ML tracks India’s crops—lives thrive. Day 35: Data Odyssey completes her.
Recap Summary
Yesterday, Day 34: Data Odyssey stacked models—MAE ₹3.4, ₹640.5. Today, Day 35: Data Odyssey handled missing data—10-11 AM imputed, MAE ₹3.8, ₹540 for 10 AM. It’s her clean step.
What’s Next
Tomorrow, in Day 36: Data Odyssey – What is Natural Language Processing?, we’ll explore: Can Priya analyze customer reviews? “Great samosas!”—sentiment? We’ll dive into NLP, adding insights. Bring your curiosity, and I’ll see you there!










