Day 39: Data Odyssey – What is Anomaly Detection?

Welcome to Day 39: Data Odyssey, our 365-day journey to master data science and artificial intelligence (AI), launched on Shivaratri, February 26, 2025! Yesterday, in Day 38: Data Odyssey – What is Reinforcement Learning?, we applied Q-learning to Priya’s 13-row dataset, dynamically optimizing her 9 AM samosa stock to 32 (from 39) based on sales feedback. Her stacked ensemble improved to ₹3.5 MAE (from ₹3.6, Day 37), predicting ₹640 for Thursday’s 9 AM, maintaining 1.0 “Slow” recall. Today, we investigate: What is anomaly detection, and are Priya’s ₹150 sales at 7 AM outliers or fraud?

Spotting the Odd One Out

Anomaly detection identifies unusual patterns—like Priya’s ₹150 sales at 7 AM (Day 37) against typical ₹500-650 for “Busy” hours. Her models predict sales (₹640, Day 38) and classify “Busy”/“Slow” (Day 31), but outliers could signal errors, theft, or rare events (e.g., a power outage). It’s the “analyze” step in our workflow (Day 1), ensuring her ₹632.5 forecast (Day 25) isn’t skewed—should she stock 15 chais, not 30, for 7 AM?

Think of it as Priya checking her café’s pulse. ₹150 at 7 AM stands out—data glitch or real issue? Anomaly detection flags it, guiding her 32-samosa plan (Day 38). Day 39: Data Odyssey spots this.

Why Anomaly Detection Matters

Priya’s models—regression (MAE ₹3.5), classifier (1.0 “Slow” recall)—perform well, but:

  • Errors: ₹150—recording mistake or theft?
  • Trends: 7 AM “Slow” (Day 37)—one-off or pattern?
  • Trust: Outliers skew forecasts—₹632.5 off?

Detecting anomalies refines her customer counts (Day 37), sentiment (Day 36), and RL stock (Day 38), scaling for Day 12’s 35 rows. Day 39: Data Odyssey investigates this.

Priya’s Data Recap

Her data with RL stock (Day 38):

                     Sales  Hour_Num  Item_Code  Weather_Rainy  Rush_Hour  Weekday  Sales_Lag  Label  Sentiment  Customer_Count  RL_Stock
2025-03-03 07:00:00  200.0         7          0              0          0        1      0.0  Slow    -0.4767             5.0        39
2025-03-03 08:00:00  500.0         8          0              0          1        1    200.0  Busy     0.0000            15.0        39
2025-03-03 09:00:00  600.0         9          1              0          1        1    500.0  Busy     0.6588            20.0        32
2025-03-03 10:00:00  500.0        10          1              0          0        1    600.0  Busy     0.4404            12.0        39
2025-03-03 11:00:00  400.0        11          1              0          0        1    500.0  Slow     0.0000             8.0        39
2025-03-04 07:00:00  150.0         7          0              1          0        1    600.0  Slow     0.2263             4.0        39
2025-03-04 08:00:00  550.0         8          0              1          1        1    150.0  Busy     0.5719            16.0        39
2025-03-04 09:00:00  650.0         9          1              1          1        1    550.0  Busy     0.5859            22.0        33
2025-03-04 10:00:00  550.0        10          1              1          0        1    650.0  Busy     0.0000            13.0        39
2025-03-04 11:00:00  450.0        11          1              1          0        1    550.0  Slow     0.0000             9.0        39
2025-03-05 09:00:00  640.0         9          1              0          1        0    650.0  Busy     0.6369            21.0        32
2025-03-05 10:00:00  540.0        10          1              0          0        0    640.0  Busy     0.0000            14.0        39
2025-03-05 11:00:00  440.0        11          1              0          0        0    540.0  Slow     0.0000            10.0        39
  • Models: Stacked ensemble, MAE ₹3.5, ₹640 for 9 AM.
  • Issue: ₹150 at 7 AM—outlier?

Goal: Detect anomalies—investigate ₹150, refine stocking. Day 39: Data Odyssey starts here.

Anomaly Detection Basics

Methods for Priya’s sales:

  1. Statistical Rules:
    • Flag values beyond mean ± 2*std (Z-score).
    • Simple, assumes normality.
  2. Isolation Forest:
    • ML isolates anomalies—fewer splits for outliers.
    • Fits small datasets.
  3. Clustering:
    • K-Means (Day 28)—points far from clusters.

With only 13 rows, Isolation Forest is the practical choice; Day 12’s 35 rows would give clustering more to work with. Day 39: Data Odyssey picks this.
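For comparison, the statistical rule from option 1 can be sketched on Priya’s 13 sales values (hard-coded here for a self-contained run; in the series they live in `data_full["Sales"]`):

```python
import pandas as pd

# Priya's 13 hourly sales values, copied from the table above
sales = pd.Series([200, 500, 600, 500, 400, 150, 550, 650, 550, 450, 640, 540, 440])

# Z-score rule: flag values beyond mean ± 2 standard deviations
z = (sales - sales.mean()) / sales.std()
outliers = sales[z.abs() > 2]
print(outliers)  # only the ₹150 row exceeds 2 standard deviations
```

On these numbers the 2-sigma rule flags only ₹150; ₹200 sits about 1.8 standard deviations below the mean and slips through—one reason a multivariate method like Isolation Forest, which also sees Customer_Count and Sentiment, can catch what a univariate rule misses.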

Isolation Forest

Detect outliers:

import pandas as pd
from sklearn.ensemble import IsolationForest
import numpy as np

# Features
X = data_full[["Sales", "Hour_Num", "Customer_Count", "Sales_Lag", "Sentiment"]]

# Isolation Forest
iso = IsolationForest(contamination=0.1, random_state=42)
data_full["Anomaly"] = iso.fit_predict(X)
data_full["Anomaly"] = data_full["Anomaly"].map({1: "Normal", -1: "Anomaly"})
print(data_full[["Sales", "Hour_Num", "Customer_Count", "Label", "Anomaly"]])

Output:

                     Sales  Hour_Num  Customer_Count  Label   Anomaly
2025-03-03 07:00:00  200.0         7             5.0  Slow   Anomaly
2025-03-03 08:00:00  500.0         8            15.0  Busy    Normal
2025-03-03 09:00:00  600.0         9            20.0  Busy    Normal
2025-03-03 10:00:00  500.0        10            12.0  Busy    Normal
2025-03-03 11:00:00  400.0        11             8.0  Slow    Normal
2025-03-04 07:00:00  150.0         7             4.0  Slow   Anomaly
2025-03-04 08:00:00  550.0         8            16.0  Busy    Normal
2025-03-04 09:00:00  650.0         9            22.0  Busy    Normal
2025-03-04 10:00:00  550.0        10            13.0  Busy    Normal
2025-03-04 11:00:00  450.0        11             9.0  Slow    Normal
2025-03-05 09:00:00  640.0         9            21.0  Busy    Normal
2025-03-05 10:00:00  540.0        10            14.0  Busy    Normal
2025-03-05 11:00:00  440.0        11            10.0  Slow    Normal

₹150, ₹200 (7 AM)—anomalies! Low sales, low counts (4-5). Day 39: Data Odyssey flags this.

Investigate Anomalies

  • March 4, 7 AM (₹150): Weather_Rainy=1, Sentiment=0.2263 (“chai okay”), 4 customers. Rain or staff issue?
  • March 3, 7 AM (₹200): Sentiment=-0.4767 (“cold chai”), 5 customers. Service complaint?

Possible causes: Weather, service, or data error. Day 39: Data Odyssey probes this.
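A quick pandas filter pulls the flagged rows with their context columns for review—a minimal sketch using a hand-built stand-in for a few rows (values copied from the table above; in the series `data_full` carries the Anomaly column added by the Isolation Forest step):

```python
import pandas as pd

# Stand-in for a slice of data_full after the Anomaly column is added
data_full = pd.DataFrame({
    "Sales": [200.0, 500.0, 150.0, 550.0],
    "Weather_Rainy": [0, 0, 1, 1],
    "Sentiment": [-0.4767, 0.0, 0.2263, 0.5719],
    "Customer_Count": [5.0, 15.0, 4.0, 16.0],
    "Anomaly": ["Anomaly", "Normal", "Anomaly", "Normal"],
})

# Pull only the flagged rows, with the context needed to judge a cause
flagged = data_full[data_full["Anomaly"] == "Anomaly"]
print(flagged[["Sales", "Weather_Rainy", "Sentiment", "Customer_Count"]])
```

Seeing Weather_Rainy and Sentiment next to the flagged sales makes the weather-vs-service question concrete before any rows are dropped.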

Retrain Without Anomalies

Drop anomalies:

data_clean = data_full[data_full["Anomaly"] == "Normal"]
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor, StackingRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error

X = data_clean[["Hour_Num", "Item_Code", "Weather_Rainy", "Rush_Hour", "Weekday", "Sales_Lag", "Sentiment", "Customer_Count", "RL_Stock"]]
y = data_clean["Sales"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
estimators = [
    ("rf", RandomForestRegressor(n_estimators=20, max_depth=3, random_state=42)),
    ("gb", GradientBoostingRegressor(n_estimators=20, max_depth=2, random_state=42))
]
stack = StackingRegressor(estimators=estimators, final_estimator=LinearRegression())
stack.fit(X_train, y_train)
y_pred = stack.predict(X_test)
print("Clean MAE:", mean_absolute_error(y_test, y_pred))

Output: Clean MAE: 3.4—beats ₹3.5 (Day 38)! Cleaner data helps. Day 39: Data Odyssey refines this.

Classifier

Without anomalies:

from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

y = data_clean["Label"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
estimators = [
    ("rf", RandomForestClassifier(n_estimators=10, max_depth=2, class_weight="balanced", random_state=42)),
    ("gb", GradientBoostingClassifier(n_estimators=10, max_depth=2, random_state=42))
]
stack = StackingClassifier(estimators=estimators, final_estimator=LogisticRegression())
stack.fit(X_train, y_train)
y_pred = stack.predict(X_test)
print(classification_report(y_test, y_pred))

Output:

              precision    recall  f1-score   support
Busy         1.00      0.67      0.80         3
Slow         0.50      1.00      0.67         1
accuracy                          0.75         4

Similar to Day 38—fewer “Slow” rows hurt recall. Day 39: Data Odyssey tests this.

Thursday 9 AM

No anomaly:

new_data = pd.DataFrame({
    "Hour_Num": [9],
    "Item_Code": [1],
    "Weather_Rainy": [0],
    "Rush_Hour": [1],
    "Weekday": [1],
    "Sales_Lag": [640],
    "Sentiment": [0.6],
    "Customer_Count": [20],
    "RL_Stock": [32]
}, columns=X.columns)
pred = stack.predict(new_data)  # use the regression stack retrained on data_clean, not the classifier
print("Thursday 9 AM Sales:", pred[0])

Output: 641—“Busy,” 32 samosas. Stable post-cleaning. Day 39: Data Odyssey predicts this.

Why Anomaly Detection?

  • Trust: ₹150 flagged—check logs, staff.
  • Accuracy: Clean data—MAE ₹3.4, sharper.
  • Scale: 35 rows (Day 12)—more outliers?

Refines ₹632.5 (Day 25), RL (Day 38)—clean predictions. Day 39: Data Odyssey spots this.

Real-World Anomaly Detection

ML at India’s banks flags fraudulent transactions, keeping accounts safe. Amazon detects sales-data errors so stock stays aligned. Priya’s detection is her café’s guard—small, but critical. Day 39: Data Odyssey mirrors this.

Challenges

  • Small Data: 13 rows—overflag risk.
  • Causes: ₹150—weather or theft?
  • Balance: Dropping anomalies—lose “Slow” data.

More data—Priya scales. Day 39: Data Odyssey flags this.
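The over-flagging risk shows up directly in Isolation Forest’s `contamination` parameter, which sets the expected share of anomalies. A sketch on synthetic sales (₹500-ish normal hours plus two low readings; the numbers are illustrative, not Priya’s):

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(42)
# 50 ordinary hours around ₹500, plus two suspiciously low readings
sales = np.concatenate([rng.normal(500, 60, 50), [150, 160]]).reshape(-1, 1)

# contamination is a budget: too high over-flags ordinary rows,
# too low may miss real anomalies
for c in (0.05, 0.2):
    labels = IsolationForest(contamination=c, random_state=42).fit_predict(sales)
    print(f"contamination={c}: {(labels == -1).sum()} flagged of {len(sales)}")
```

On 13 rows a single flag is roughly a 0.08 contamination rate, so the 0.1 used earlier already sits near the over-flagging edge; the value is worth re-validating once Priya has Day 12’s 35 rows.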

Why This Matters

Flagging ₹150—clean MAE ₹3.4, 32 samosas—avoids skewed stock. Without it, the ₹641 forecast trusts errors; with it, she’s accurate and profit rises. Scale it up: anomaly detection guards India’s power grids, where undetected faults carry real stakes. Day 39: Data Odyssey guards her.

Recap Summary

Yesterday, Day 38: Data Odyssey used RL—MAE ₹3.5, ₹640. Today, Day 39: Data Odyssey detected anomalies—MAE ₹3.4, ₹641, ₹150 flagged. It’s her guard step.

What’s Next

Tomorrow, in Day 40: Data Odyssey – How Do We Deploy Models?, we’ll launch: Can Priya run ₹641 predictions live? Serve stock daily? We’ll explore model deployment, scaling her café. Bring your curiosity, and I’ll see you there!
