Day 26: Data Odyssey – What is Anomaly Detection?

Welcome to Day 26: Data Odyssey, our 365-day journey to master data science and artificial intelligence (AI), launched on Shivaratri, February 26, 2025! Yesterday, in Day 25: Data Odyssey – How Do We Forecast with Time Series?, we forecasted Priya’s Thursday 9 AM sales using her 7-row time series. Moving average gave ₹630, exponential smoothing ₹632.5, and trend extension ₹650—settling near ₹632.5 to stock 41 samosas, aligning with her Random Forest’s ₹642 (Day 23). Today, we shift focus: What is anomaly detection, and why did Tuesday’s 7 AM sales drop to ₹150?

Spotting the Unusual

Anomaly detection finds oddities in data—points that don’t fit patterns. Day 24’s time series showed Priya’s 9 AM peak (₹630 avg) and 7 AM low (₹175 avg), but Tuesday’s ₹150 at 7 AM stands out against Monday’s ₹200. It’s “analyze” in our workflow (Day 1), flagging outliers that skew forecasts (Day 25) or signal issues—rain, a late start?

Think of it as Priya checking her café’s pulse. Most hours hum—₹500-650 at 9 AM—but ₹150 jars. Is it noise or a clue? Day 26: Data Odyssey hunts this.

Why Anomaly Detection Matters

Priya’s forecasts (₹632.5) and models (₹642, MAE ₹4) assume consistency. Anomalies like ₹150:

  • Skew: Pull averages down—7 AM ₹175 vs. ₹200 expected.
  • Signal: Rainy Tuesday (Day 21’s feature)—fewer chai sales?
  • Fix: Clean data or adjust stock—15 chais, not 20.

Her 7 rows hide few oddities—Day 12’s 35 rows reveal more. Day 26: Data Odyssey spots this.
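
The skew is easy to verify by hand. A minimal sketch using only the two 7 AM readings from Priya’s table (the variable names are illustrative, not from earlier days):

```python
from statistics import mean

# The two 7 AM readings from Priya's table: Monday and Tuesday.
seven_am = {"Mon": 200, "Tue": 150}

avg_with = mean(seven_am.values())  # 7 AM average including Tuesday's dip
avg_without = seven_am["Mon"]       # what remains if Tuesday is set aside

print("7 AM average with the dip:", avg_with)
print("7 AM average without it:  ", avg_without)
print("Pulled down by:           ", avg_without - avg_with)
```

One low reading out of two moves the hourly average by ₹25, which is why a single odd point matters so much in a small dataset.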

Priya’s Time Series Recap

Her data (Day 24):

                     Sales
2025-03-03 07:00:00    200
2025-03-03 08:00:00    500
2025-03-03 09:00:00    600
2025-03-04 07:00:00    150
2025-03-04 08:00:00    550
2025-03-04 09:00:00    650
2025-03-05 09:00:00    640
  • Pattern: 7 AM low (₹150-200), 8-9 AM high (₹500-650).
  • Oddity: Tuesday 7 AM ₹150—below Monday’s ₹200, hourly avg ₹175.

Goal: Flag ₹150—why? Day 26: Data Odyssey starts here.

Anomaly Detection Methods

Simple tricks for 7 rows:

  1. Threshold:
    • Mean ± 2 standard deviations; anything outside is odd.
  2. Rolling Statistics:
    • Compare each point to a moving average; big deviations get flagged.
  3. Isolation Forest:
    • A tree-based ML model that isolates outliers with random splits; scalable later.
Her sparse data suits basics—35 rows (Day 12) unlock ML. Day 26: Data Odyssey tries these.

Threshold Method

Mean and std for all sales:

import pandas as pd

data = pd.DataFrame({
    "Datetime": ["2025-03-03 07:00", "2025-03-03 08:00", "2025-03-03 09:00",
                 "2025-03-04 07:00", "2025-03-04 08:00", "2025-03-04 09:00",
                 "2025-03-05 09:00"],
    "Sales": [200, 500, 600, 150, 550, 650, 640]
})
data["Datetime"] = pd.to_datetime(data["Datetime"])
data.set_index("Datetime", inplace=True)

mean = data["Sales"].mean()  # ~470
std = data["Sales"].std()    # ~208
lower = mean - 2 * std       # ~53
upper = mean + 2 * std       # ~887
data["Anomaly"] = (data["Sales"] < lower) | (data["Sales"] > upper)
print(data[["Sales", "Anomaly"]])

Output:

                     Sales  Anomaly
2025-03-03 07:00:00    200    False
2025-03-03 08:00:00    500    False
2025-03-03 09:00:00    600    False
2025-03-04 07:00:00    150    False
2025-03-04 08:00:00    550    False
2025-03-04 09:00:00    650    False
2025-03-05 09:00:00    640    False

₹150 sits inside the ₹53-887 band, so nothing is flagged. The band is too broad because it pools the quiet 7 AM hour with the busy 8-9 AM hours. Day 26: Data Odyssey adjusts this.
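
To see how far each sale sits from the overall mean, the same check can be written as a z-score, the distance from the mean in standard deviations. A minimal sketch that rebuilds Priya’s seven rows as a Series (the names sales and z are illustrative):

```python
import pandas as pd

sales = pd.Series(
    [200, 500, 600, 150, 550, 650, 640],
    index=pd.to_datetime([
        "2025-03-03 07:00", "2025-03-03 08:00", "2025-03-03 09:00",
        "2025-03-04 07:00", "2025-03-04 08:00", "2025-03-04 09:00",
        "2025-03-05 09:00",
    ]),
    name="Sales",
)

# z-score: how many sample standard deviations each sale is from the mean.
z = (sales - sales.mean()) / sales.std()
print(z.round(2))

# A 2 x std rule only flags |z| > 2.
print("Most extreme |z|:", round(z.abs().max(), 2))
```

Tuesday’s 7 AM reading is the most extreme point at about −1.5, still short of the 2 needed for a flag.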

Hourly Threshold

Group by hour:

# Per-hour mean and standard deviation of sales
hourly = data.groupby(data.index.hour)["Sales"].agg(["mean", "std"])
hourly.index.name = "Hour"
hourly["Lower"] = hourly["mean"] - 2 * hourly["std"]
hourly["Upper"] = hourly["mean"] + 2 * hourly["std"]
print(hourly.round(1))

# Attach each row's hourly band; join(on=...) keeps the rows in time order
data["Hour"] = data.index.hour
data = data.join(hourly[["Lower", "Upper"]], on="Hour")
data["Anomaly"] = (data["Sales"] < data["Lower"]) | (data["Sales"] > data["Upper"])
print(data[["Sales", "Anomaly"]])

Output:

       mean   std  Lower  Upper
Hour                           
7     175.0  35.4  104.3  245.7
8     525.0  35.4  454.3  595.7
9     630.0  26.5  577.1  682.9

                     Sales  Anomaly
2025-03-03 07:00:00    200    False
2025-03-03 08:00:00    500    False
2025-03-03 09:00:00    600    False
2025-03-04 07:00:00    150    False
2025-03-04 08:00:00    550    False
2025-03-04 09:00:00    650    False
2025-03-05 09:00:00    640    False
  • 7 AM: ₹150 vs. 104.3-245.7, inside the band.
  • 9 AM: ₹600 and ₹650 vs. 577.1-682.9, inside the band.

Nothing is flagged. With only two or three readings per hour, every point pulls its own group’s mean and std along with it; in groups this small, no value can sit 2 standard deviations from the group mean. Day 26: Data Odyssey narrows this.

Rolling Statistics

Moving average deviation:

# Rolling mean and std over the current reading and the two before it
rolling_mean = data["Sales"].rolling(window=3, min_periods=1).mean()
rolling_std = data["Sales"].rolling(window=3, min_periods=1).std()

# In a 3-point window no value can sit more than ~1.15 std from the
# window mean, so a 2 * std band never fires; use 1 * std here
k = 1
data["Lower"] = rolling_mean - k * rolling_std
data["Upper"] = rolling_mean + k * rolling_std
data["Anomaly"] = (data["Sales"] < data["Lower"]) | (data["Sales"] > data["Upper"])
print(data[["Sales", "Anomaly"]])

Output:

                     Sales  Anomaly
2025-03-03 07:00:00    200    False
2025-03-03 08:00:00    500    False
2025-03-03 09:00:00    600    False
2025-03-04 07:00:00    150     True
2025-03-04 08:00:00    550    False
2025-03-04 09:00:00    650    False
2025-03-05 09:00:00    640    False
  • ₹500: Jumps from ₹200 but stays inside its band (about ₹138-562), so no flag.
  • ₹150: Drops from ₹600 and lands below its window’s lower bound (about ₹180), so it is flagged!

₹150 caught: Tuesday’s dip! Day 26: Data Odyssey rolls this.
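
Since the hour of day drives most of the swing in Priya’s sales, one complementary check (an addition here, not one of the three methods listed earlier) is to compare each reading with the previous day’s reading at the same hour. A rough sketch, with sales, prev, and change as illustrative names:

```python
import pandas as pd

sales = pd.Series(
    [200, 500, 600, 150, 550, 650, 640],
    index=pd.to_datetime([
        "2025-03-03 07:00", "2025-03-03 08:00", "2025-03-03 09:00",
        "2025-03-04 07:00", "2025-03-04 08:00", "2025-03-04 09:00",
        "2025-03-05 09:00",
    ]),
    name="Sales",
)

# Previous reading at the same hour of day (NaN for each hour's first reading).
prev = sales.groupby(sales.index.hour).shift(1)

# Relative change versus that same-hour reading.
change = (sales - prev) / prev
print((change * 100).round(1))

# The largest same-hour drop and when it happened.
print("Biggest drop:", change.idxmin(), f"{change.min():.0%}")
```

On these seven rows the largest same-hour drop is Tuesday 7 AM, down 25% from Monday. With only one or two comparisons per hour, that is a hint rather than a verdict.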

Plotting Anomalies

Visualize:

import matplotlib.pyplot as plt
plt.plot(data.index, data["Sales"], marker="o", color="teal", label="Sales")
plt.fill_between(data.index, data["Lower"], data["Upper"], color="gray", alpha=0.2, label="Normal Range")
plt.scatter(data[data["Anomaly"]].index, data[data["Anomaly"]]["Sales"], color="red", label="Anomaly")
plt.title("Priya’s Sales with Anomalies")
plt.xlabel("Date and Hour")
plt.ylabel("Sales (₹)")
plt.legend()
plt.show()

Red dots mark the flagged points, with the ₹150 drop sitting clear of the gray band. Day 26: Data Odyssey sees this.

Why ₹150?

Check features (Day 21):

  • Tuesday 7 AM: Weather_Rainy = 1.
  • Rain slows chai—₹150 vs. ₹200 sunny.

Anomaly signals rain—adjust stock! Day 26: Data Odyssey explains this.
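
Looking up the context for an odd row can be as simple as putting the feature next to the sale. A minimal sketch using only the two 7 AM rows; Tuesday’s Weather_Rainy = 1 comes from the text, while Monday’s 0 is an assumption inferred from the “₹200 sunny” remark:

```python
import pandas as pd

# Only the two 7 AM rows. Tuesday's flag (1) is stated in the text;
# Monday's flag (0) is inferred from the "₹200 sunny" remark.
seven_am = pd.DataFrame(
    {"Sales": [200, 150], "Weather_Rainy": [0, 1]},
    index=pd.to_datetime(["2025-03-03 07:00", "2025-03-04 07:00"]),
)

# Average 7 AM sales for dry (0) and rainy (1) mornings.
by_weather = seven_am.groupby("Weather_Rainy")["Sales"].mean()
print(by_weather)

# Gap between a dry and a rainy 7 AM, based on one reading each.
gap = by_weather.loc[0] - by_weather.loc[1]
print("Dry minus rainy:", gap)
```

One reading per condition cannot prove that rain caused the dip; it only says the explanation is consistent with the data so far.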

Why Detect?

  • Clean: ₹150 skews ₹632.5—remove or adjust.
  • Insight: Rain dips 7 AM—15 chais, not 20.
  • Scale: 35 rows (Day 12)—catch ₹5000 typos.

Priya’s ₹150—fix forecasts. Day 26: Data Odyssey flags this.
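
The typo case is where the plain mean ± 2×std rule does its job. A small sketch, assuming a hypothetical eighth entry where ₹500 was mistyped as ₹5000 (this row is invented for illustration and is not in Priya’s data):

```python
import pandas as pd

# Priya's seven sales plus one hypothetical mistyped entry (5000 for 500).
sales = pd.Series([200, 500, 600, 150, 550, 650, 640, 5000], name="Sales")

mean, std = sales.mean(), sales.std()
lower, upper = mean - 2 * std, mean + 2 * std

# Keep only the values that fall outside the 2 x std band.
flagged = sales[(sales < lower) | (sales > upper)]
print(flagged)
```

Only the mistyped value falls outside the band. The typo also inflates the mean and std it is judged against, so gross errors like this are caught while subtler dips such as ₹150 are not.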

Real-World Anomalies

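Power grid operators, in India and elsewhere, watch for unusual usage spikes so they can act before outages. Online retailers like Amazon watch for sudden sales drops so they can restock fast. Priya’s ₹150 is her café’s alert: small but critical. Day 26: Data Odyssey ties this.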

Challenges

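  • Sparse: 7 rows give shaky means and standard deviations, so flags can be false alarms or misses.
  • Threshold: The multiplier on std (2, 1.5, 1) is a judgment call, and tiny windows or groups need a smaller one.
  • Context: Rain explains the dip, so features are key.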

More data (35 rows) refines—Priya grows. Day 26: Data Odyssey notes this.
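
One way to make the sparse-data and threshold points concrete: in a sample of n values, no point can sit more than (n − 1)/√n sample standard deviations from the sample mean (a result known as Samuelson’s inequality). A short sketch, where max_abs_z is an illustrative helper name:

```python
from math import sqrt


def max_abs_z(n: int) -> float:
    """Largest |z| any single point can reach in a sample of n values,
    using the sample standard deviation (ddof=1), as pandas .std() does."""
    return (n - 1) / sqrt(n)


# Group and window sizes that appear in Priya's data, plus the 35-row set.
for n in (2, 3, 7, 35):
    print(f"n={n:>2}: max |z| = {max_abs_z(n):.2f}")
```

For groups or windows of 2 or 3 readings the cap is below 2, so a 2×std band computed from the same readings can never flag anything. At 7 rows the cap is only a little above 2, and at 35 rows there is real room for outliers to stand out.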

Why This Matters

Detecting the ₹150 rainy-day dip means brewing 15 chais instead of wasting 20, and it keeps outliers from distorting forecasts like ₹632.5. Without it, forecasts drift; with it, she adapts and protects her profit. At scale, the same idea helps power grids catch faults early. Day 26: Data Odyssey guards her.

Recap Summary

Yesterday, Day 25: Data Odyssey forecasted Priya’s 9 AM sales at ₹632.5 via exponential smoothing. Today, Day 26: Data Odyssey detected anomalies: ₹150 at 7 AM was flagged, and rain explained it. It’s her alert step.

What’s Next

Tomorrow, in Day 27: Data Odyssey – What is Clustering?, we’ll group: How do Priya’s hours cluster? 7 AM vs. 9 AM? We’ll use her data to find patterns, no labels needed. Bring your curiosity, and I’ll see you there!
