Welcome to Day 39: Data Odyssey, our 365-day journey to master data science and artificial intelligence (AI), launched on Shivaratri, February 26, 2025! Yesterday, in Day 38: Data Odyssey – What is Reinforcement Learning?, we applied Q-learning to Priya’s 13-row dataset, dynamically optimizing her 9 AM samosa stock to 32 (from 39) based on sales feedback. Her stacked ensemble improved to ₹3.5 MAE (from ₹3.6, Day 37), predicting ₹640 for Thursday’s 9 AM, maintaining 1.0 “Slow” recall. Today, we investigate: What is anomaly detection, and are Priya’s ₹150 sales at 7 AM outliers or fraud?
Spotting the Odd One Out
Anomaly detection identifies unusual patterns—like Priya’s ₹150 sales at 7 AM (Day 37) against typical ₹500-650 for “Busy” hours. Her models predict sales (₹640, Day 38) and classify “Busy”/“Slow” (Day 31), but outliers could signal errors, theft, or rare events (e.g., a power outage). It’s the “analyze” step in our workflow (Day 1), ensuring her ₹632.5 forecast (Day 25) isn’t skewed. Should she stock 15 chais, not 30, for 7 AM?
Think of it as Priya checking her café’s pulse. ₹150 at 7 AM stands out—data glitch or real issue? Anomaly detection flags it, guiding her 32-samosa plan (Day 38). Day 39: Data Odyssey spots this.
Why Anomaly Detection Matters
Priya’s models—regression (MAE ₹3.5), classifier (1.0 “Slow” recall)—perform well, but:
- Errors: ₹150—recording mistake or theft?
- Trends: 7 AM “Slow” (Day 37)—one-off or pattern?
- Trust: Outliers skew forecasts—₹632.5 off?
Detecting anomalies refines her customer counts (Day 37), sentiment (Day 36), and RL stock (Day 38), scaling for Day 12’s 35 rows. Day 39: Data Odyssey investigates this.
Priya’s Data Recap
Her data with RL stock (Day 38):
Sales Hour_Num Item_Code Weather_Rainy Rush_Hour Weekday Sales_Lag Label Sentiment Customer_Count RL_Stock
2025-03-03 07:00:00 200.0 7 0 0 0 1 0.0 Slow -0.4767 5.0 39
2025-03-03 08:00:00 500.0 8 0 0 1 1 200.0 Busy 0.0000 15.0 39
2025-03-03 09:00:00 600.0 9 1 0 1 1 500.0 Busy 0.6588 20.0 32
2025-03-03 10:00:00 500.0 10 1 0 0 1 600.0 Busy 0.4404 12.0 39
2025-03-03 11:00:00 400.0 11 1 0 0 1 500.0 Slow 0.0000 8.0 39
2025-03-04 07:00:00 150.0 7 0 1 0 1 600.0 Slow 0.2263 4.0 39
2025-03-04 08:00:00 550.0 8 0 1 1 1 150.0 Busy 0.5719 16.0 39
2025-03-04 09:00:00 650.0 9 1 1 1 1 550.0 Busy 0.5859 22.0 33
2025-03-04 10:00:00 550.0 10 1 1 0 1 650.0 Busy 0.0000 13.0 39
2025-03-04 11:00:00 450.0 11 1 1 0 1 550.0 Slow 0.0000 9.0 39
2025-03-05 09:00:00 640.0 9 1 0 1 0 650.0 Busy 0.6369 21.0 32
2025-03-05 10:00:00 540.0 10 1 0 0 0 640.0 Busy 0.0000 14.0 39
2025-03-05 11:00:00 440.0 11 1 0 0 0 540.0 Slow 0.0000 10.0 39
- Models: Stacked ensemble, MAE ₹3.5, ₹640 for 9 AM.
- Issue: ₹150 at 7 AM—outlier?
Goal: Detect anomalies—investigate ₹150, refine stocking. Day 39: Data Odyssey starts here.
Anomaly Detection Basics
Methods for Priya’s sales:
- Statistical Rules:
- Flag values beyond mean ± 2*std (Z-score).
- Simple, assumes normality.
- Isolation Forest:
- ML isolates anomalies—fewer splits for outliers.
- Fits small datasets.
- Clustering:
- K-Means (Day 28)—points far from clusters.
13 rows suit Isolation Forest—Day 12’s 35 rows scale to clustering. Day 39: Data Odyssey picks this.
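Before reaching for ML, the statistical rule is simple enough to sketch directly. A minimal Z-score check on Priya’s 13 Sales values (typed in from the table above as a stand-in, since the full frame isn’t reproduced here):

```python
import numpy as np
import pandas as pd

# Priya's 13 hourly Sales values, copied from the table above
sales = pd.Series([200, 500, 600, 500, 400, 150, 550, 650, 550, 450, 640, 540, 440])

# Flag values beyond mean ± 2*std
z = (sales - sales.mean()) / sales.std()
outliers = sales[np.abs(z) > 2]
print(outliers)  # only the ₹150 row clears the 2-sigma bar
```

On 13 points, the 2σ rule catches only ₹150 (z ≈ -2.13); ₹200 sits at z ≈ -1.80 and slips through. That blind spot on small, non-normal data is one reason to try Isolation Forest.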
Isolation Forest
Detect outliers:
import pandas as pd
from sklearn.ensemble import IsolationForest
import numpy as np
# Features
X = data_full[["Sales", "Hour_Num", "Customer_Count", "Sales_Lag", "Sentiment"]]
# Isolation Forest
iso = IsolationForest(contamination=0.1, random_state=42)
data_full["Anomaly"] = iso.fit_predict(X)
data_full["Anomaly"] = data_full["Anomaly"].map({1: "Normal", -1: "Anomaly"})
print(data_full[["Sales", "Hour_Num", "Customer_Count", "Label", "Anomaly"]])
Output:
Sales Hour_Num Customer_Count Label Anomaly
2025-03-03 07:00:00 200.0 7 5.0 Slow Anomaly
2025-03-03 08:00:00 500.0 8 15.0 Busy Normal
2025-03-03 09:00:00 600.0 9 20.0 Busy Normal
2025-03-03 10:00:00 500.0 10 12.0 Busy Normal
2025-03-03 11:00:00 400.0 11 8.0 Slow Normal
2025-03-04 07:00:00 150.0 7 4.0 Slow Anomaly
2025-03-04 08:00:00 550.0 8 16.0 Busy Normal
2025-03-04 09:00:00 650.0 9 22.0 Busy Normal
2025-03-04 10:00:00 550.0 10 13.0 Busy Normal
2025-03-04 11:00:00 450.0 11 9.0 Slow Normal
2025-03-05 09:00:00 640.0 9 21.0 Busy Normal
2025-03-05 10:00:00 540.0 10 14.0 Busy Normal
2025-03-05 11:00:00 440.0 11 10.0 Slow Normal
₹150, ₹200 (7 AM)—anomalies! Low sales, low counts (4-5). Day 39: Data Odyssey flags this.
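fit_predict gives a hard Normal/Anomaly label, but Isolation Forest also scores how isolated each row is via decision_function (lower means more anomalous). A sketch on a two-feature stand-in for data_full, with values copied from the table above:

```python
import pandas as pd
from sklearn.ensemble import IsolationForest

# Two-feature stand-in for data_full, values from the table above
X = pd.DataFrame({
    "Sales": [200.0, 500.0, 600.0, 500.0, 400.0, 150.0, 550.0,
              650.0, 550.0, 450.0, 640.0, 540.0, 440.0],
    "Customer_Count": [5.0, 15.0, 20.0, 12.0, 8.0, 4.0, 16.0,
                       22.0, 13.0, 9.0, 21.0, 14.0, 10.0],
})

iso = IsolationForest(contamination=0.15, random_state=42)
labels = iso.fit_predict(X)        # -1 = anomaly, 1 = normal
scores = iso.decision_function(X)  # lower = easier to isolate

for s, lab, sc in zip(X["Sales"], labels, scores):
    print(f"₹{s:.0f}: label={lab:+d}, score={sc:.3f}")
```

The scores let Priya rank suspects instead of only labeling them—the ₹150 row sits well below the average score, confirming it is the easiest row to isolate.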
Investigate Anomalies
- March 4, 7 AM (₹150): Weather_Rainy=1, Sentiment=0.2263 (“chai okay”), 4 customers. Rain or staff issue?
- March 3, 7 AM (₹200): Sentiment=-0.4767 (“cold chai”), 5 customers. Service complaint?
Possible causes: Weather, service, or data error. Day 39: Data Odyssey probes this.
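Programmatically, this investigation is just a filter on the flagged rows. A sketch with a mini stand-in for data_full (three rows copied from the table above; in practice Priya would run this on the full frame once its Anomaly column exists):

```python
import pandas as pd

# Mini stand-in: the two flagged 7 AM rows plus one normal row, from the table above
data_full = pd.DataFrame({
    "Sales": [200.0, 150.0, 600.0],
    "Weather_Rainy": [0, 1, 0],
    "Sentiment": [-0.4767, 0.2263, 0.6588],
    "Customer_Count": [5.0, 4.0, 20.0],
    "Anomaly": ["Anomaly", "Anomaly", "Normal"],
})

# Pull the flagged rows with their context columns side by side
suspects = data_full[data_full["Anomaly"] == "Anomaly"]
print(suspects[["Sales", "Weather_Rainy", "Sentiment", "Customer_Count"]])
```

Seeing weather, sentiment, and counts next to each flagged sale is what turns a label into a hypothesis—rain on March 4, a service complaint on March 3.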
Retrain Without Anomalies
Drop anomalies:
data_clean = data_full[data_full["Anomaly"] == "Normal"]
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor, StackingRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error
X = data_clean[["Hour_Num", "Item_Code", "Weather_Rainy", "Rush_Hour", "Weekday", "Sales_Lag", "Sentiment", "Customer_Count", "RL_Stock"]]
y = data_clean["Sales"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
estimators = [
("rf", RandomForestRegressor(n_estimators=20, max_depth=3, random_state=42)),
("gb", GradientBoostingRegressor(n_estimators=20, max_depth=2, random_state=42))
]
stack = StackingRegressor(estimators=estimators, final_estimator=LinearRegression())
stack.fit(X_train, y_train)
y_pred = stack.predict(X_test)
print("Clean MAE:", mean_absolute_error(y_test, y_pred))
Output: Clean MAE: 3.4—beats ₹3.5 (Day 38)! Cleaner data helps. Day 39: Data Odyssey refines this.
Classifier
Without anomalies:
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
y = data_clean["Label"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
estimators = [
("rf", RandomForestClassifier(n_estimators=10, max_depth=2, class_weight="balanced", random_state=42)),
("gb", GradientBoostingClassifier(n_estimators=10, max_depth=2, random_state=42))
]
stack = StackingClassifier(estimators=estimators, final_estimator=LogisticRegression())
stack.fit(X_train, y_train)
y_pred = stack.predict(X_test)
print(classification_report(y_test, y_pred))
Output:
              precision    recall  f1-score   support
        Busy       1.00      0.67      0.80         3
        Slow       0.50      1.00      0.67         1
    accuracy                           0.75         4
Similar to Day 38—“Slow” recall holds at 1.0, but with only 11 clean rows the tiny 4-row test split drops “Busy” recall to 0.67. Day 39: Data Odyssey tests this.
Thursday 9 AM
Predict with the anomalies removed:
new_data = pd.DataFrame({
"Hour_Num": [9],
"Item_Code": [1],
"Weather_Rainy": [0],
"Rush_Hour": [1],
"Weekday": [1],
"Sales_Lag": [640],
"Sentiment": [0.6],
"Customer_Count": [20],
"RL_Stock": [32]
}, columns=X.columns)
pred = stack.predict(new_data)  # use the regression stack from "Retrain Without Anomalies" (refit it if the classifier stack has overwritten the stack variable)
print("Thursday 9 AM Sales:", pred[0])
Output: 641—“Busy,” 32 samosas. Stable post-cleaning. Day 39: Data Odyssey predicts this.
Why Anomaly Detection?
- Trust: ₹150 flagged—check logs, staff.
- Accuracy: Clean data—MAE ₹3.4, sharper.
- Scale: 35 rows (Day 12)—more outliers?
Refines ₹632.5 (Day 25), RL (Day 38)—clean predictions. Day 39: Data Odyssey spots this.
Real-World Anomaly Detection
India’s banking ML flags fraud—transactions safe. Amazon detects sales errors—stock aligns. Priya’s detection is her café’s guard—small, critical. Day 39: Data Odyssey mirrors this.
Challenges
- Small Data: 13 rows—overflag risk.
- Causes: ₹150—weather or theft?
- Balance: Dropping anomalies—lose “Slow” data.
More data—Priya scales. Day 39: Data Odyssey flags this.
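The overflag risk is easy to probe: sweep contamination and count flags. A sketch on a two-feature stand-in (Sales and Customer_Count from the table above), not the full data_full frame:

```python
import pandas as pd
from sklearn.ensemble import IsolationForest

X = pd.DataFrame({
    "Sales": [200.0, 500.0, 600.0, 500.0, 400.0, 150.0, 550.0,
              650.0, 550.0, 450.0, 640.0, 540.0, 440.0],
    "Customer_Count": [5.0, 15.0, 20.0, 12.0, 8.0, 4.0, 16.0,
                       22.0, 13.0, 9.0, 21.0, 14.0, 10.0],
})

flag_counts = []
for c in [0.05, 0.1, 0.2, 0.3]:
    iso = IsolationForest(contamination=c, random_state=42)
    n = int((iso.fit_predict(X) == -1).sum())
    flag_counts.append(n)
    print(f"contamination={c}: {n} of 13 rows flagged")
```

With a fixed random_state the forest and its scores are identical across fits, so raising contamination only lowers the bar—the flag count can only grow. On 13 rows, contamination=0.3 can sweep up genuinely normal hours, which is the overflag risk in miniature.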
Why This Matters
Flagging ₹150—clean MAE ₹3.4, 32 samosas—avoids skewed stock. Without it, ₹641 trusts errors; with it, she’s accurate—profit up. Scale it: anomaly detection saves India’s grids—lives hold. Day 39: Data Odyssey guards her.
Recap Summary
Yesterday, Day 38: Data Odyssey used RL—MAE ₹3.5, ₹640. Today, Day 39: Data Odyssey detected anomalies—MAE ₹3.4, ₹641, ₹150 flagged. It’s her guard step.
What’s Next
Tomorrow, in Day 40: Data Odyssey – How Do We Deploy Models?, we’ll launch: Can Priya run ₹641 predictions live? Serve stock daily? We’ll explore model deployment, scaling her café. Bring your curiosity, and I’ll see you there!