Welcome to Day 39: Data Odyssey, our 365-day journey to master data science and artificial intelligence (AI), launched on Shivaratri, February 26, 2025! Yesterday, in Day 38: Data Odyssey – What is Reinforcement Learning?, we applied Q-learning to Priya’s 13-row dataset, dynamically optimizing her 9 AM samosa stock to 32 (from 39) based on sales feedback. Her stacked ensemble improved to ₹3.5 MAE (from ₹3.6, Day 37), predicting ₹640 for Thursday’s 9 AM, maintaining 1.0 “Slow” recall. Today, we investigate: What is anomaly detection, and are Priya’s ₹150 sales at 7 AM outliers or fraud?
Spotting the Odd One Out
Anomaly detection identifies unusual patterns—like Priya’s ₹150 sales at 7 AM (Day 37) against typical ₹500-650 for “Busy” hours. Her models predict sales (₹640, Day 38) and classify “Busy”/“Slow” (Day 31), but outliers could signal errors, theft, or rare events (e.g., a power outage). It’s the “analyze” step in our workflow (Day 1), ensuring her ₹632.5 forecast (Day 25) isn’t skewed. Should she stock 15 chais, not 30, for 7 AM?
Think of it as Priya checking her café’s pulse. ₹150 at 7 AM stands out—data glitch or real issue? Anomaly detection flags it, guiding her 32-samosa plan (Day 38). Day 39: Data Odyssey spots this.
Why Anomaly Detection Matters
Priya’s models—regression (MAE ₹3.5), classifier (1.0 “Slow” recall)—perform well, but:
- Errors: ₹150—recording mistake or theft?
- Trends: 7 AM “Slow” (Day 37)—one-off or pattern?
- Trust: Outliers skew forecasts—₹632.5 off?
Detecting anomalies refines her customer counts (Day 37), sentiment (Day 36), and RL stock (Day 38), scaling for Day 12’s 35 rows. Day 39: Data Odyssey investigates this.
Priya’s Data Recap
Her data with RL stock (Day 38):
Sales Hour_Num Item_Code Weather_Rainy Rush_Hour Weekday Sales_Lag Label Sentiment Customer_Count RL_Stock
2025-03-03 07:00:00 200.0 7 0 0 0 1 0.0 Slow -0.4767 5.0 39
2025-03-03 08:00:00 500.0 8 0 0 1 1 200.0 Busy 0.0000 15.0 39
2025-03-03 09:00:00 600.0 9 1 0 1 1 500.0 Busy 0.6588 20.0 32
2025-03-03 10:00:00 500.0 10 1 0 0 1 600.0 Busy 0.4404 12.0 39
2025-03-03 11:00:00 400.0 11 1 0 0 1 500.0 Slow 0.0000 8.0 39
2025-03-04 07:00:00 150.0 7 0 1 0 1 600.0 Slow 0.2263 4.0 39
2025-03-04 08:00:00 550.0 8 0 1 1 1 150.0 Busy 0.5719 16.0 39
2025-03-04 09:00:00 650.0 9 1 1 1 1 550.0 Busy 0.5859 22.0 33
2025-03-04 10:00:00 550.0 10 1 1 0 1 650.0 Busy 0.0000 13.0 39
2025-03-04 11:00:00 450.0 11 1 1 0 1 550.0 Slow 0.0000 9.0 39
2025-03-05 09:00:00 640.0 9 1 0 1 0 650.0 Busy 0.6369 21.0 32
2025-03-05 10:00:00 540.0 10 1 0 0 0 640.0 Busy 0.0000 14.0 39
2025-03-05 11:00:00 440.0 11 1 0 0 0 540.0 Slow 0.0000 10.0 39
- Models: Stacked ensemble, MAE ₹3.5, ₹640 for 9 AM.
- Issue: ₹150 at 7 AM—outlier?
Goal: Detect anomalies—investigate ₹150, refine stocking. Day 39: Data Odyssey starts here.
Anomaly Detection Basics
Methods for Priya’s sales:
- Statistical Rules:
- Flag values beyond mean ± 2*std (Z-score).
- Simple, assumes normality.
- Isolation Forest:
- ML isolates anomalies—fewer splits for outliers.
- Fits small datasets.
- Clustering:
- K-Means (Day 28)—points far from clusters.
13 rows suit Isolation Forest—Day 12’s 35 rows scale to clustering. Day 39: Data Odyssey picks this.
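Before reaching for ML, the statistical rule is simple enough to sketch directly. A minimal Z-score check on Priya’s 13 Sales values (typed in from the table above as a stand-in, since the full frame isn’t reproduced here):

```python
import numpy as np
import pandas as pd

# Priya's 13 hourly Sales values, copied from the table above
sales = pd.Series([200, 500, 600, 500, 400, 150, 550, 650, 550, 450, 640, 540, 440])

# Flag values beyond mean ± 2*std
z = (sales - sales.mean()) / sales.std()
outliers = sales[np.abs(z) > 2]
print(outliers)  # only the ₹150 row clears the 2-sigma bar
```

On 13 points, the 2σ rule catches only ₹150 (z ≈ -2.13); ₹200 sits at z ≈ -1.80 and slips through. That blind spot on small, non-normal data is one reason to try Isolation Forest.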
Isolation Forest
Detect outliers:
import pandas as pd
from sklearn.ensemble import IsolationForest
import numpy as np
# Features
X = data_full[["Sales", "Hour_Num", "Customer_Count", "Sales_Lag", "Sentiment"]]
# Isolation Forest
iso = IsolationForest(contamination=0.1, random_state=42)
data_full["Anomaly"] = iso.fit_predict(X)
data_full["Anomaly"] = data_full["Anomaly"].map({1: "Normal", -1: "Anomaly"})
print(data_full[["Sales", "Hour_Num", "Customer_Count", "Label", "Anomaly"]])
Output:
Sales Hour_Num Customer_Count Label Anomaly
2025-03-03 07:00:00 200.0 7 5.0 Slow Anomaly
2025-03-03 08:00:00 500.0 8 15.0 Busy Normal
2025-03-03 09:00:00 600.0 9 20.0 Busy Normal
2025-03-03 10:00:00 500.0 10 12.0 Busy Normal
2025-03-03 11:00:00 400.0 11 8.0 Slow Normal
2025-03-04 07:00:00 150.0 7 4.0 Slow Anomaly
2025-03-04 08:00:00 550.0 8 16.0 Busy Normal
2025-03-04 09:00:00 650.0 9 22.0 Busy Normal
2025-03-04 10:00:00 550.0 10 13.0 Busy Normal
2025-03-04 11:00:00 450.0 11 9.0 Slow Normal
2025-03-05 09:00:00 640.0 9 21.0 Busy Normal
2025-03-05 10:00:00 540.0 10 14.0 Busy Normal
2025-03-05 11:00:00 440.0 11 10.0 Slow Normal
₹150, ₹200 (7 AM)—anomalies! Low sales, low counts (4-5). Day 39: Data Odyssey flags this.
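fit_predict gives a hard Normal/Anomaly label, but Isolation Forest also scores how isolated each row is via decision_function (lower means more anomalous). A sketch on a two-feature stand-in for data_full, with values copied from the table above:

```python
import pandas as pd
from sklearn.ensemble import IsolationForest

# Two-feature stand-in for data_full, values from the table above
X = pd.DataFrame({
    "Sales": [200.0, 500.0, 600.0, 500.0, 400.0, 150.0, 550.0,
              650.0, 550.0, 450.0, 640.0, 540.0, 440.0],
    "Customer_Count": [5.0, 15.0, 20.0, 12.0, 8.0, 4.0, 16.0,
                       22.0, 13.0, 9.0, 21.0, 14.0, 10.0],
})

iso = IsolationForest(contamination=0.15, random_state=42)
labels = iso.fit_predict(X)        # -1 = anomaly, 1 = normal
scores = iso.decision_function(X)  # lower = easier to isolate

for s, lab, sc in zip(X["Sales"], labels, scores):
    print(f"₹{s:.0f}: label={lab:+d}, score={sc:.3f}")
```

The scores let Priya rank suspects instead of only labeling them—the ₹150 row sits well below the average score, confirming it is the easiest row to isolate.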
Investigate Anomalies
- March 4, 7 AM (₹150): Weather_Rainy=1, Sentiment=0.2263 (“chai okay”), 4 customers. Rain or staff issue?
- March 3, 7 AM (₹200): Sentiment=-0.4767 (“cold chai”), 5 customers. Service complaint?
Possible causes: Weather, service, or data error. Day 39: Data Odyssey probes this.
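Programmatically, this investigation is just a filter on the flagged rows. A sketch with a mini stand-in for data_full (three rows copied from the table above; in practice Priya would run this on the full frame once its Anomaly column exists):

```python
import pandas as pd

# Mini stand-in: the two flagged 7 AM rows plus one normal row, from the table above
data_full = pd.DataFrame({
    "Sales": [200.0, 150.0, 600.0],
    "Weather_Rainy": [0, 1, 0],
    "Sentiment": [-0.4767, 0.2263, 0.6588],
    "Customer_Count": [5.0, 4.0, 20.0],
    "Anomaly": ["Anomaly", "Anomaly", "Normal"],
})

# Pull the flagged rows with their context columns side by side
suspects = data_full[data_full["Anomaly"] == "Anomaly"]
print(suspects[["Sales", "Weather_Rainy", "Sentiment", "Customer_Count"]])
```

Seeing weather, sentiment, and counts next to each flagged sale is what turns a label into a hypothesis—rain on March 4, a service complaint on March 3.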
Retrain Without Anomalies
Drop anomalies:
data_clean = data_full[data_full["Anomaly"] == "Normal"]
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor, StackingRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error
X = data_clean[["Hour_Num", "Item_Code", "Weather_Rainy", "Rush_Hour", "Weekday", "Sales_Lag", "Sentiment", "Customer_Count", "RL_Stock"]]
y = data_clean["Sales"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
estimators = [
("rf", RandomForestRegressor(n_estimators=20, max_depth=3, random_state=42)),
("gb", GradientBoostingRegressor(n_estimators=20, max_depth=2, random_state=42))
]
stack = StackingRegressor(estimators=estimators, final_estimator=LinearRegression())
stack.fit(X_train, y_train)
y_pred = stack.predict(X_test)
print("Clean MAE:", mean_absolute_error(y_test, y_pred))
Output: Clean MAE: 3.4—beats ₹3.5 (Day 38)! Cleaner data helps. Day 39: Data Odyssey refines this.
Classifier
Without anomalies:
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
y = data_clean["Label"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
estimators = [
("rf", RandomForestClassifier(n_estimators=10, max_depth=2, class_weight="balanced", random_state=42)),
("gb", GradientBoostingClassifier(n_estimators=10, max_depth=2, random_state=42))
]
stack = StackingClassifier(estimators=estimators, final_estimator=LogisticRegression())
stack.fit(X_train, y_train)
y_pred = stack.predict(X_test)
print(classification_report(y_test, y_pred))
Output:
              precision    recall  f1-score   support
        Busy       1.00      0.67      0.80         3
        Slow       0.50      1.00      0.67         1
    accuracy                           0.75         4
Similar to Day 38—“Slow” recall holds at 1.0, but with only 11 clean rows the tiny 4-row test split drops “Busy” recall to 0.67. Day 39: Data Odyssey tests this.
Thursday 9 AM
Predict with the anomalies removed:
new_data = pd.DataFrame({
"Hour_Num": [9],
"Item_Code": [1],
"Weather_Rainy": [0],
"Rush_Hour": [1],
"Weekday": [1],
"Sales_Lag": [640],
"Sentiment": [0.6],
"Customer_Count": [20],
"RL_Stock": [32]
}, columns=X.columns)
pred = stack.predict(new_data)  # use the regression stack from "Retrain Without Anomalies" (refit it if the classifier stack has overwritten the stack variable)
print("Thursday 9 AM Sales:", pred[0])
Output: 641—“Busy,” 32 samosas. Stable post-cleaning. Day 39: Data Odyssey predicts this.
Why Anomaly Detection?
- Trust: ₹150 flagged—check logs, staff.
- Accuracy: Clean data—MAE ₹3.4, sharper.
- Scale: 35 rows (Day 12)—more outliers?
Refines ₹632.5 (Day 25), RL (Day 38)—clean predictions. Day 39: Data Odyssey spots this.
Real-World Anomaly Detection
India’s banking ML flags fraud—transactions safe. Amazon detects sales errors—stock aligns. Priya’s detection is her café’s guard—small, critical. Day 39: Data Odyssey mirrors this.
Challenges
- Small Data: 13 rows—overflag risk.
- Causes: ₹150—weather or theft?
- Balance: Dropping anomalies—lose “Slow” data.
More data—Priya scales. Day 39: Data Odyssey flags this.
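The overflag risk is easy to probe: sweep contamination and count flags. A sketch on a two-feature stand-in (Sales and Customer_Count from the table above), not the full data_full frame:

```python
import pandas as pd
from sklearn.ensemble import IsolationForest

X = pd.DataFrame({
    "Sales": [200.0, 500.0, 600.0, 500.0, 400.0, 150.0, 550.0,
              650.0, 550.0, 450.0, 640.0, 540.0, 440.0],
    "Customer_Count": [5.0, 15.0, 20.0, 12.0, 8.0, 4.0, 16.0,
                       22.0, 13.0, 9.0, 21.0, 14.0, 10.0],
})

flag_counts = []
for c in [0.05, 0.1, 0.2, 0.3]:
    iso = IsolationForest(contamination=c, random_state=42)
    n = int((iso.fit_predict(X) == -1).sum())
    flag_counts.append(n)
    print(f"contamination={c}: {n} of 13 rows flagged")
```

With a fixed random_state the forest and its scores are identical across fits, so raising contamination only lowers the bar—the flag count can only grow. On 13 rows, contamination=0.3 can sweep up genuinely normal hours, which is the overflag risk in miniature.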
Why This Matters
Flagging ₹150—clean MAE ₹3.4, 32 samosas—avoids skewed stock. Without it, ₹641 trusts errors; with it, she’s accurate—profit up. Scale it: anomaly detection saves India’s grids—lives hold. Day 39: Data Odyssey guards her.
Recap Summary
Yesterday, Day 38: Data Odyssey used RL—MAE ₹3.5, ₹640. Today, Day 39: Data Odyssey detected anomalies—MAE ₹3.4, ₹641, ₹150 flagged. It’s her guard step.
What’s Next
Tomorrow, in Day 40: Data Odyssey – How Do We Deploy Models?, we’ll launch: Can Priya run ₹641 predictions live? Serve stock daily? We’ll explore model deployment, scaling her café. Bring your curiosity, and I’ll see you there!