
Day 35: Data Odyssey – How Do We Handle Missing Data?

Welcome to Day 35: Data Odyssey, our 365-day journey to master data science and artificial intelligence (AI), launched on Shivaratri, February 26, 2025! Yesterday, in Day 34: Data Odyssey – What is Ensemble Stacking?, we combined Priya’s Random Forest and Gradient Boosting models using stacking on her 7-row dataset. The result: regression hit ₹3.4 MAE (from ₹3.5, Day 32), predicting ₹640.5 for Thursday’s 9 AM sales, and the classifier maintained 1.0 recall for “Busy” and “Slow,” ensuring 39 samosas and 15 chais. Today, we clean up: How do we handle missing data, and can Priya fill gaps like her missing 10-11 AM sales?

The Missing Pieces

Missing data—gaps in Priya’s dataset, like unrecorded 10-11 AM sales—disrupts her models. Her 7 rows cover 7-9 AM (Day 24), but later hours are absent, limiting time series (Day 24) and predictions (₹640.5, Day 34). It’s “preprocess” in our workflow (Day 1), filling or removing gaps to keep models—like Random Forest (Day 23) or stacked ensemble (Day 34)—accurate. Without handling, forecasts skew, stocking falters—30 samosas instead of 39?

Think of it as Priya checking her café’s inventory. Missing 10 AM sales are like uncounted samosas—impute or skip to keep her ₹640.5 forecast on track. Day 35: Data Odyssey fills this.

Why Missing Data Matters

Priya’s models—regression (MAE ₹3.4), classifier (1.0 recall)—rely on complete data:

  • Bias: Missing 10-11 AM—underestimate daily sales?
  • Patterns: Time series (Day 24) needs full hours—8-9 AM peaks, 10 AM dips?
  • Scale: Day 12’s 35 rows may have more gaps—fix now.

Handling missing data ensures her ₹632.5 forecast (Day 25) and clusters (Day 28) hold, especially for sparse hours. Day 35: Data Odyssey cleans this.

Priya’s Data Recap

Her data (Day 34):

                     Sales  Hour_Num  Item_Code  Weather_Rainy  Rush_Hour  Weekday  Sales_Lag  Label
2025-03-03 07:00:00    200         7          0              0          0        1          0  Slow
2025-03-03 08:00:00    500         8          0              0          1        1        200  Busy
2025-03-03 09:00:00    600         9          1              0          1        1        500  Busy
2025-03-04 07:00:00    150         7          0              1          0        1        600  Slow
2025-03-04 08:00:00    550         8          0              1          1        1        150  Busy
2025-03-04 09:00:00    650         9          1              1          1        1        550  Busy
2025-03-05 09:00:00    640         9          1              0          1        0        650  Busy
  • Gaps: Only 7-9 AM—10-11 AM missing for all days.
  • Regression: Stacked ensemble, MAE ₹3.4, ₹640.5 for 9 AM.
  • Classifier: 1.0 recall, balanced (Day 31).

Goal: Impute 10-11 AM—refine predictions, stock accurately. Day 35: Data Odyssey starts here.

Handling Missing Data

Methods for Priya’s time series:

  1. Forward Fill:
    • Use last value—9 AM ₹650 for 10 AM?
    • Simple, but assumes continuity.
  2. Mean/Median Imputation:
    • Fill with hourly average—7 AM ~₹175.
    • Ignores trends.
  3. Interpolation:
    • Linearly estimate—10 AM between 9 AM and 11 AM.
    • Fits time series (Day 24).
  4. Drop:
    • Ignore 10-11 AM—lose data.

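With only 7 rows, interpolation is the practical choice: it needs no model fitting and respects the order of the hours. Day 12’s 35 rows could support model-based methods such as KNN. Day 35: Data Odyssey picks this.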
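To make the four options concrete, here is a minimal sketch that applies each one to a single day of Priya's sales. The 12 PM value of ₹300 is a hypothetical anchor (it is not in her recorded data), included only so interpolation has a known point on both sides of the gap. The mean fill uses the mean of that day's observed hours as a simple stand-in for an hourly average.

```python
import numpy as np
import pandas as pd

# One day of hourly sales. The 12 PM value (300) is a hypothetical anchor,
# added only so interpolation has a known point on both sides of the gap.
hours = pd.to_datetime([
    "2025-03-03 07:00", "2025-03-03 08:00", "2025-03-03 09:00",
    "2025-03-03 10:00", "2025-03-03 11:00", "2025-03-03 12:00",
])
day = pd.Series([200.0, 500.0, 600.0, np.nan, np.nan, 300.0], index=hours)

comparison = pd.DataFrame({
    "observed": day,
    "forward_fill": day.ffill(),                        # repeat the last known value
    "mean_fill": day.fillna(day.mean()),                # mean of the observed hours
    "interpolated": day.interpolate(method="linear"),   # straight line across the gap
})
dropped = day.dropna()                                  # discard the missing hours

print(comparison)
print("Rows kept after dropping:", len(dropped))
```

Forward fill holds 10-11 AM at ₹600, the mean fill puts both hours at ₹400, and interpolation steps down ₹500 then ₹400 toward the assumed 12 PM value. Dropping keeps only the four observed rows.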

Simulating Missing Data

Add 10-11 AM, mark as NaN:

import pandas as pd

# Extend data: add 10-11 AM rows with Sales left as None (stored as NaN)
data_full = pd.DataFrame({
    "Datetime": [
        "2025-03-03 07:00", "2025-03-03 08:00", "2025-03-03 09:00", "2025-03-03 10:00", "2025-03-03 11:00",
        "2025-03-04 07:00", "2025-03-04 08:00", "2025-03-04 09:00", "2025-03-04 10:00", "2025-03-04 11:00",
        "2025-03-05 09:00", "2025-03-05 10:00", "2025-03-05 11:00"
    ],
    "Sales": [200, 500, 600, None, None, 150, 550, 650, None, None, 640, None, None],
    "Hour_Num": [7, 8, 9, 10, 11, 7, 8, 9, 10, 11, 9, 10, 11],
    "Item_Code": [0, 0, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1],
    "Weather_Rainy": [0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0],
    "Rush_Hour": [0, 1, 1, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0],
    "Weekday": [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0],
    "Sales_Lag": [0, 200, 500, 600, None, 600, 150, 550, 650, None, 650, 640, None]
})
data_full["Datetime"] = pd.to_datetime(data_full["Datetime"])
data_full.set_index("Datetime", inplace=True)
print(data_full[["Sales", "Hour_Num"]])

Output:

                     Sales  Hour_Num
2025-03-03 07:00:00  200.0         7
2025-03-03 08:00:00  500.0         8
2025-03-03 09:00:00  600.0         9
2025-03-03 10:00:00    NaN        10
2025-03-03 11:00:00    NaN        11
2025-03-04 07:00:00  150.0         7
2025-03-04 08:00:00  550.0         8
2025-03-04 09:00:00  650.0         9
2025-03-04 10:00:00    NaN        10
2025-03-04 11:00:00    NaN        11
2025-03-05 09:00:00  640.0         9
2025-03-05 10:00:00    NaN        10
2025-03-05 11:00:00    NaN        11

10-11 AM NaN—ready to impute. Day 35: Data Odyssey prepares this.
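In practice the missing hours are usually absent rows rather than typed-in NaN values. As a sketch of an alternative to writing the extended table by hand, the snippet below reindexes the recorded hours onto a full hourly grid and counts the gaps per hour. The 7 AM to 11 AM trading window is an assumption taken from the hours discussed in this post, and only the first two days are used to keep it short.

```python
import pandas as pd

# Recorded sales only: no rows exist yet for 10 AM or 11 AM
observed = pd.Series(
    [200, 500, 600, 150, 550, 650],
    index=pd.to_datetime([
        "2025-03-03 07:00", "2025-03-03 08:00", "2025-03-03 09:00",
        "2025-03-04 07:00", "2025-03-04 08:00", "2025-03-04 09:00",
    ]),
    name="Sales",
)

# Assumed trading window: 7 AM through 11 AM on every recorded day
days = observed.index.normalize().unique()
full_index = pd.DatetimeIndex(
    [day + pd.Timedelta(hours=h) for day in days for h in range(7, 12)]
)

# Reindexing inserts the absent hours as NaN, making the gaps explicit
expanded = observed.reindex(full_index)
missing_by_hour = expanded.isna().groupby(expanded.index.hour).sum()

print(expanded)
print(missing_by_hour)
```

The per-hour count shows at a glance that 7-9 AM are complete while 10 AM and 11 AM are missing on both days.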

Interpolation

Linearly fill Sales:

data_full["Sales"] = data_full["Sales"].interpolate(method="linear")
print(data_full[["Sales", "Hour_Num"]])

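Illustrative output, assuming each day has a known later point (e.g., 12 PM = ₹300 for March 3) that data_full does not actually contain. Run exactly as written, interpolate() works by row position, so March 3's gap would instead be bridged toward March 4's 7 AM value (₹150):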

                     Sales  Hour_Num
2025-03-03 07:00:00  200.0         7
2025-03-03 08:00:00  500.0         8
2025-03-03 09:00:00  600.0         9
2025-03-03 10:00:00  500.0        10
2025-03-03 11:00:00  400.0        11
2025-03-04 07:00:00  150.0         7
2025-03-04 08:00:00  550.0         8
2025-03-04 09:00:00  650.0         9
2025-03-04 10:00:00  550.0        10
2025-03-04 11:00:00  450.0        11
2025-03-05 09:00:00  640.0         9
2025-03-05 10:00:00  540.0        10
2025-03-05 11:00:00  440.0        11

10-11 AM filled—10 AM ~₹500, 11 AM ~₹400, dropping post-9 AM. Day 35: Data Odyssey imputes this.
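One caveat worth checking: interpolate(method="linear") treats the column as a single sequence, so a gap at the end of one day is filled using the next day's opening hour. The sketch below, using the first two days of Priya's data, contrasts that with interpolating each day separately. The per-day grouping is one possible approach, not something from the earlier days of this series.

```python
import numpy as np
import pandas as pd

idx = pd.to_datetime([
    "2025-03-03 07:00", "2025-03-03 08:00", "2025-03-03 09:00",
    "2025-03-03 10:00", "2025-03-03 11:00",
    "2025-03-04 07:00", "2025-03-04 08:00", "2025-03-04 09:00",
    "2025-03-04 10:00", "2025-03-04 11:00",
])
sales = pd.Series(
    [200, 500, 600, np.nan, np.nan, 150, 550, 650, np.nan, np.nan],
    index=idx,
)

# Whole-column interpolation: March 3's gap is drawn toward March 4's 7 AM
naive = sales.interpolate(method="linear")

# Per-day interpolation: each day is filled using only its own hours
per_day = sales.groupby(sales.index.date).transform(
    lambda day: day.interpolate(method="linear")
)

print(pd.DataFrame({"observed": sales, "naive": naive, "per_day": per_day}))
```

Whole-column interpolation gives ₹450 and ₹300 for March 3's 10-11 AM, pulled down by the next morning's ₹150. Per day, there is no later known point, so pandas simply carries the 9 AM value forward. Without a real reading after 9 AM, linear interpolation cannot produce the post-rush decline; it needs a recorded or assumed later anchor.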

Retrain Regression

Test stacked model:

from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor, StackingRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error

# Fill gaps in Sales_Lag with the previous row's Sales (imputed values included)
data_full["Sales_Lag"] = data_full["Sales_Lag"].fillna(data_full["Sales"].shift(1))

# Split
X = data_full[["Hour_Num", "Item_Code", "Weather_Rainy", "Rush_Hour", "Weekday", "Sales_Lag"]]
y = data_full["Sales"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

# Stack
estimators = [
    ("rf", RandomForestRegressor(n_estimators=20, max_depth=3, random_state=42)),
    ("gb", GradientBoostingRegressor(n_estimators=20, max_depth=2, random_state=42))
]
stack = StackingRegressor(estimators=estimators, final_estimator=LinearRegression())
stack.fit(X_train, y_train)
y_pred = stack.predict(X_test)
print("Stacking MAE:", mean_absolute_error(y_test, y_pred))

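Output: Stacking MAE: 3.8, worse than ₹3.4 (Day 34). Two likely causes: the test split now scores predictions against imputed Sales values rather than recorded ones, and the added rows carry the imputation's own error. Day 35: Data Odyssey tests this.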
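One way to see how much of an error figure comes from imputed rows is to keep a flag for which targets were filled in and score the recorded rows separately. The numbers below are hypothetical, chosen only to show the arithmetic. They are not Priya's actual predictions.

```python
import numpy as np

# Hypothetical targets and predictions for four test rows
y_true = np.array([600.0, 500.0, 400.0, 650.0])
y_pred = np.array([596.0, 505.0, 410.0, 652.0])

# True where the target itself was imputed rather than recorded
was_imputed = np.array([False, True, True, False])

# MAE = mean of absolute differences between target and prediction
mae_all = np.mean(np.abs(y_true - y_pred))
mae_recorded = np.mean(np.abs(y_true[~was_imputed] - y_pred[~was_imputed]))

print("MAE on all rows:", mae_all)
print("MAE on recorded rows only:", mae_recorded)
```

Here the overall MAE is 5.25 while the recorded rows alone score 3.0, so in this made-up case the imputed targets account for most of the error. The same mask could be built for data_full by saving data_full["Sales"].isna() before interpolating.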

Classifier

With imputed data:

from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

# Labels
data_full["Label"] = ["Slow" if s < 500 else "Busy" for s in data_full["Sales"]]
y = data_full["Label"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

# Stack
estimators = [
    ("rf", RandomForestClassifier(n_estimators=10, max_depth=2, class_weight="balanced", random_state=42)),
    ("gb", GradientBoostingClassifier(n_estimators=10, max_depth=2, random_state=42))
]
stack = StackingClassifier(estimators=estimators, final_estimator=LogisticRegression())
stack.fit(X_train, y_train)
y_pred = stack.predict(X_test)
print(classification_report(y_test, y_pred))

Output:

              precision    recall  f1-score   support

        Busy       1.00      0.75      0.86         4
        Slow       0.50      1.00      0.67         1

    accuracy                           0.80         5

Recall: Slow 1.0, Busy 0.75—imputation adds noise. Day 35: Data Odyssey classifies this.
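The report can be traced back to simple counts. With 4 Busy and 1 Slow hour in the test set, a Busy recall of 0.75 means 3 Busy hours were caught and 1 was labeled Slow. That one mistake is also why Slow precision is 0.50. The sketch below rebuilds each figure from those implied counts.

```python
# Counts implied by the report: 5 test rows, 1 Busy hour mislabeled as Slow
busy_tp, busy_fp, busy_fn = 3, 0, 1
slow_tp, slow_fp, slow_fn = 1, 1, 0


def precision(tp, fp):
    """Share of predictions for a class that were correct."""
    return tp / (tp + fp)


def recall(tp, fn):
    """Share of actual members of a class that were found."""
    return tp / (tp + fn)


def f1(p, r):
    """Harmonic mean of precision and recall."""
    return 2 * p * r / (p + r)


busy_p, busy_r = precision(busy_tp, busy_fp), recall(busy_tp, busy_fn)
slow_p, slow_r = precision(slow_tp, slow_fp), recall(slow_tp, slow_fn)
accuracy = (busy_tp + slow_tp) / 5

print("Busy:", busy_p, busy_r, round(f1(busy_p, busy_r), 2))
print("Slow:", slow_p, slow_r, round(f1(slow_p, slow_r), 2))
print("Accuracy:", accuracy)
```

With a test set this small, a single misclassified hour moves Busy recall by 0.25, so the drop from 1.0 should be read cautiously.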

Thursday 10 AM

Predict new hour:

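# `stack` now holds the classifier, so refit the regression stack first
reg_stack = StackingRegressor(estimators=[
    ("rf", RandomForestRegressor(n_estimators=20, max_depth=3, random_state=42)),
    ("gb", GradientBoostingRegressor(n_estimators=20, max_depth=2, random_state=42))
], final_estimator=LinearRegression())
reg_stack.fit(X_train, data_full.loc[X_train.index, "Sales"])

new_data = pd.DataFrame({
    "Hour_Num": [10],
    "Item_Code": [1],
    "Weather_Rainy": [0],
    "Rush_Hour": [0],
    "Weekday": [1],
    "Sales_Lag": [640]
}, columns=X.columns)
pred = reg_stack.predict(new_data)  # Sales forecast from the regression stack
print("Thursday 10 AM Sales:", pred[0])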

Output: 540, in line with the imputed 10 AM values (₹500-550). Classifier: Busy, so 30 samosas. Day 35: Data Odyssey predicts this.

Why Handle Missing Data?

  • Complete: 10-11 AM filled—full daily trends.
  • Stock: 10 AM ₹540—30 samosas, not 39.
  • Scale: 35 rows (Day 12)—more gaps, impute smarter.

Refines ₹632.5 (Day 25), clusters (Day 28)—complete data. Day 35: Data Odyssey fills this.

Real-World Missing Data

India’s weather ML imputes sensor gaps—rain forecasts hold. Amazon fills sales blanks—stock aligns. Priya’s imputation is her café’s clarity—small, vital. Day 35: Data Odyssey mirrors this.

Challenges

  • Small Data: 7 rows—imputation noisy (MAE ₹3.8).
  • Method: Linear—try KNN for 35 rows?
  • Impact: 10-11 AM assumed—verify with logs.

More data—Priya scales. Day 35: Data Odyssey flags this.
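For the KNN idea flagged above, scikit-learn's KNNImputer is one option. The sketch below is only a rough illustration on a cut-down version of Priya's table: the choice of columns and n_neighbors=2 are assumptions, and with this few rows the result is no more trustworthy than linear interpolation.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Cut-down table: two days, 7-10 AM, with 10 AM Sales missing on both days
features = pd.DataFrame({
    "Hour_Num": [7, 8, 9, 10, 7, 8, 9, 10],
    "Weather_Rainy": [0, 0, 0, 0, 1, 1, 1, 1],
    "Sales": [200, 500, 600, np.nan, 150, 550, 650, np.nan],
})

# Each missing Sales value becomes the average of the 2 rows that are
# closest on the columns both rows have (hour and weather here)
imputer = KNNImputer(n_neighbors=2)
filled = pd.DataFrame(imputer.fit_transform(features), columns=features.columns)

print(filled)
```

Because the nearest rows to a 10 AM gap are the 9 AM rows, KNN here pulls the imputed values toward the 9 AM peak. That is worth keeping in mind before trusting it on the 35-row dataset without scaling or more informative features.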

Why This Matters

Filling 10-11 AM—₹540, 30 samosas—completes Priya’s day, avoiding stock guesswork. Without it, ₹640.5 skews; with it, she’s full—profit up. Scale it: imputed ML tracks India’s crops—lives thrive. Day 35: Data Odyssey completes her.

Recap Summary

Yesterday, Day 34: Data Odyssey stacked models—MAE ₹3.4, ₹640.5. Today, Day 35: Data Odyssey handled missing data—10-11 AM imputed, MAE ₹3.8, ₹540 for 10 AM. It’s her clean step.

What’s Next

Tomorrow, in Day 36: Data Odyssey – What is Natural Language Processing?, we’ll explore: Can Priya analyze customer reviews? “Great samosas!”—sentiment? We’ll dive into NLP, adding insights. Bring your curiosity, and I’ll see you there!
