Day 12: Data Odyssey – How Do We Handle Bigger Data?

Welcome to Day 12: Data Odyssey, our 365-day journey to master data science and artificial intelligence (AI), launched on Shivaratri, February 26, 2025! Yesterday, in Day 11: Data Odyssey – What is Data Wrangling?, we explored data wrangling, transforming Priya’s scattered POS data into a unified form. We merged her Monday and Tuesday sales, pivoted to see 9 AM’s ₹1250 peak, and enriched it with weather (sunny vs. rainy), all using Pandas. It turned her two-day snapshot into a broader story. Today, we scale up: How do we handle bigger data—like a month of Priya’s sales—and keep it manageable?

The Challenge of Bigger Data

Priya’s two days—10 rows—worked fine in Pandas (Day 9) and Matplotlib (Day 10). But a month? 30 days × 5 hours = 150 rows. A year? 1825 rows. Add items, weather, customers—it balloons fast. Bigger data brings:

  • Volume – More rows, columns, files.
  • Speed – Loading, calculating, plotting slows.
  • Memory – Laptops choke on millions of rows.

Wrangling (Day 11) merged two days, but a month needs efficiency—smart loading, trimming, processing. Day 12: Data Odyssey equips Priya (and you) for this leap.

Priya’s Growing Data

Imagine Priya’s POS now tracks a week:

  • Monday-Tuesday: 10 rows (Day 11).
  • Wednesday-Friday: 15 more rows.
  • Saturday-Sunday: 10 rows (weekend hours).

Total: 35 rows in separate CSVs (monday.csv, tuesday.csv, etc.). She wants: “What’s my weekly trend?” “Peak days?” Day 11’s merge works, but loading seven files one by one is tedious and error-prone. Day 12: Data Odyssey scales smarter.

Loading Big Data Efficiently

Pandas’ pd.read_csv() loads one file—fine for two days. For many:

  1. Loop Files:
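import glob
from pathlib import Path

import pandas as pd

# List the daily CSVs in a stable order (matches every CSV in the folder)
files = sorted(glob.glob("*.csv"))  # Grabs monday.csv, tuesday.csv, etc.
data_list = []

# Load each file and tag its rows with the day name
for file in files:
    day_data = pd.read_csv(file)
    day_name = Path(file).stem  # "monday.csv" -> "monday"
    day_data["Day"] = day_name.capitalize()
    data_list.append(day_data)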

# Combine
data = pd.concat(data_list, ignore_index=True)
print(data.shape)  # Rows, columns

For 7 days, 5 hours each: (35, 4)—35 rows, 4 columns (Hour, Sales, Item, Day). Priya runs this—her week’s in one DataFrame! Day 12: Data Odyssey automates this.
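If you don't have Priya's files on hand, here is a minimal sketch you can run as-is. It writes three tiny CSVs into a temporary folder and then applies the same loop-and-concat idea. The sales numbers are made-up placeholders, not Priya's real data, and the three-day, three-hour shape is just an assumption to keep it short.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Made-up placeholder rows (not Priya's real sales), same columns as the text.
SAMPLE_ROWS = "Hour,Sales,Item\n8,400,Chai\n9,650,Chai\n10,300,Samosa\n"

with tempfile.TemporaryDirectory() as tmp:
    folder = Path(tmp)

    # Write one small CSV per day so there is something to load.
    for day in ["monday", "tuesday", "wednesday"]:
        (folder / f"{day}.csv").write_text(SAMPLE_ROWS)

    # Load every CSV in the folder and tag its rows with the day name.
    frames = []
    for path in sorted(folder.glob("*.csv")):
        day_data = pd.read_csv(path)
        day_data["Day"] = path.stem.capitalize()  # "monday.csv" -> "Monday"
        frames.append(day_data)

    # Stack the per-day tables into one DataFrame.
    data = pd.concat(frames, ignore_index=True)

print(data.shape)
```

Three files of three rows each give nine rows, with the Day column added as the fourth column.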

  2. Selective Columns: If her POS adds fluff (e.g., “Customer ID”), load only what’s needed:
day_data = pd.read_csv(file, usecols=["Hour", "Sales", "Item"])

Saves memory—Priya skips junk. Day 12: Data Odyssey trims fat.
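To see the saving for yourself, a rough sketch like the one below compares the memory footprint with and without usecols. The rows and the customer IDs are invented for illustration, and the CSV is read from an in-memory string so nothing needs to exist on disk.

```python
import io

import pandas as pd

# Made-up rows with an extra "Customer ID" column the analysis does not need.
csv_text = (
    "Hour,Sales,Item,Customer ID\n"
    "8,400,Chai,C001\n"
    "9,650,Chai,C002\n"
    "10,300,Samosa,C003\n"
)

# Load everything, then load only the three columns we care about.
full = pd.read_csv(io.StringIO(csv_text))
trimmed = pd.read_csv(io.StringIO(csv_text), usecols=["Hour", "Sales", "Item"])

# deep=True counts the actual bytes used by text columns too.
full_bytes = full.memory_usage(deep=True).sum()
trimmed_bytes = trimmed.memory_usage(deep=True).sum()

print(list(trimmed.columns))
print(full_bytes, trimmed_bytes)
```

On three rows the difference is tiny, but the same gap grows with every row and every unused column.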

Chunking Big Files

One giant CSV—say, month.csv with 150 rows? Load in chunks:

chunks = pd.read_csv("month.csv", chunksize=50)
data = pd.concat(chunks, ignore_index=True)
print(data.shape)  # (150, 4)

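This reads 50 rows at a time. Note that concatenating every chunk still ends with the full table in memory; the real savings come when each chunk is filtered or summarized before it is kept. Day 12: Data Odyssey chunks it.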
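Chunking pays off most when each chunk is reduced before it is kept. Here is a minimal sketch of that pattern, assuming a file with Day, Hour and Sales columns. The six rows are invented, the in-memory string stands in for month.csv, and the chunk size of 2 is only there to force several chunks on such a small table.

```python
import io

import pandas as pd

# Made-up stand-in for month.csv: 6 rows, read 2 at a time.
csv_text = (
    "Day,Hour,Sales\n"
    "Monday,8,400\n"
    "Monday,9,650\n"
    "Tuesday,8,350\n"
    "Tuesday,9,600\n"
    "Monday,10,300\n"
    "Tuesday,10,250\n"
)

# Summarize each chunk on its own, keeping only the small per-day totals.
partials = []
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=2):
    partials.append(chunk.groupby("Day")["Sales"].sum())

# A day can appear in several chunks, so add the partial totals together.
daily_totals = pd.concat(partials).groupby(level=0).sum()
print(daily_totals)
```

Only the small partial totals are held between chunks, so peak memory depends on the chunk size rather than on the file size.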

Summarizing to Save Space

150 rows slow stats and plots? Aggregate first:

weekly = data.groupby("Day")["Sales"].sum().reset_index()
print(weekly)

Output (imagined):

         Day  Sales
0     Friday   2200
1     Monday   2000
2   Saturday   2500
3     Sunday   1800
4   Thursday   2100
5    Tuesday   1950
6  Wednesday   2050

7 rows, not 35—faster to plot or analyze. Priya sees Saturday’s ₹2500 lead. Day 12: Data Odyssey shrinks smartly.
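One wrinkle: groupby sorts the days alphabetically, so Friday lands before Monday. A small sketch of one way to restore calendar order is below, using an ordered categorical. The totals are the imagined figures from the output above, and day_order is a name introduced here for illustration.

```python
import pandas as pd

# The imagined weekly totals from the article, in groupby's alphabetical order.
weekly = pd.DataFrame({
    "Day": ["Friday", "Monday", "Saturday", "Sunday",
            "Thursday", "Tuesday", "Wednesday"],
    "Sales": [2200, 2000, 2500, 1800, 2100, 1950, 2050],
})

# Tell Pandas the calendar order so sorting follows the week, not the alphabet.
day_order = ["Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday"]
weekly["Day"] = pd.Categorical(weekly["Day"], categories=day_order, ordered=True)
weekly = weekly.sort_values("Day").reset_index(drop=True)

print(weekly)
```

Sorted this way, a bar chart of the week reads Monday to Sunday from left to right.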

Plotting Bigger Data

Day 10’s bar chart for 5 hours scales poorly for 35. Summarize, then plot:

import matplotlib.pyplot as plt

plt.bar(weekly["Day"], weekly["Sales"], color="teal")
plt.title("Weekly Sales by Day")
plt.xlabel("Day")
plt.ylabel("Sales (₹)")
plt.xticks(rotation=45)
plt.show()

Bars show Saturday’s peak—readable, quick. For hourly trends:

hourly = data.groupby("Hour")["Sales"].mean().reset_index()
plt.plot(hourly["Hour"], hourly["Sales"], marker="o")
plt.title("Average Sales by Hour (Week)")
plt.xlabel("Hour")
plt.ylabel("Avg Sales (₹)")
plt.show()

Line peaks at 9 AM—her rush holds. Day 12: Data Odyssey visualizes big.

Filtering for Focus

Too much noise? Filter:

  • Rush Hours: rush = data[data["Sales"] > 500].
    • Output: 8-9 AM rows across days.
  • Chai Only: chai = data[data["Item"] == "Chai"].
    • Smaller table: chai’s story.

Priya filters rush hours—focuses stock there. Day 12: Data Odyssey narrows it.
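The two filters can be tried on a tiny table, and combined with & when both conditions should hold. This is a sketch on invented rows (two days, three hours each); the name rush_chai is introduced here just for the combined filter.

```python
import pandas as pd

# Made-up rows for two days, three hours each.
data = pd.DataFrame({
    "Day": ["Monday"] * 3 + ["Tuesday"] * 3,
    "Hour": [8, 9, 10, 8, 9, 10],
    "Sales": [520, 650, 300, 510, 600, 250],
    "Item": ["Chai", "Chai", "Samosa", "Chai", "Samosa", "Chai"],
})

# Keep only rows where sales pass 500.
rush = data[data["Sales"] > 500]

# Keep only chai rows.
chai = data[data["Item"] == "Chai"]

# Combine both conditions; each one needs its own parentheses.
rush_chai = data[(data["Sales"] > 500) & (data["Item"] == "Chai")]

print(rush)
print(chai)
print(rush_chai)
```

The parentheses around each condition matter, because & binds more tightly than the comparison operators in Python.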

Memory Tricks

Big data hogs RAM:

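  • Dtypes: data["Sales"] = data["Sales"].astype("int32"). Half the bytes of the default int64, as long as the values fit.
  • Drop Columns: data = data.drop(columns=["Item"]), if unneeded now. drop returns a new table, so assign the result back.
  • Sample: sample = data.sample(frac=0.1). A random 10% for quick tests.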

Priya’s 150 rows fit fine, but a year’s 1825? These save her laptop. Day 12: Data Odyssey optimizes.
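To check whether these tricks actually help, measure before and after. The sketch below builds an invented year-sized table (1,825 rows) and compares memory_usage around the dtype changes. Converting Item to the category dtype is an extra trick not covered above, added here as an assumption about what suits a column with few distinct values.

```python
import pandas as pd

# Made-up data: 365 days x 5 hours = 1,825 rows, mimicking a year.
n_rows = 365 * 5
data = pd.DataFrame({
    "Hour": [8, 9, 10, 11, 12] * 365,
    "Sales": [400, 650, 300, 350, 500] * 365,
    "Item": ["Chai", "Chai", "Samosa", "Chai", "Samosa"] * 365,
})

# Bytes used before any changes (deep=True includes the text column).
before = data.memory_usage(deep=True).sum()

# Shrink the numeric columns and store repeated item names as categories.
data["Sales"] = data["Sales"].astype("int32")
data["Hour"] = data["Hour"].astype("int8")
data["Item"] = data["Item"].astype("category")

after = data.memory_usage(deep=True).sum()

# A reproducible 10% sample for quick experiments.
sample = data.sample(frac=0.1, random_state=0)

print(before, after, len(sample))
```

The exact byte counts depend on your Pandas version and platform, so compare the two numbers rather than expecting specific values.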

Priya’s Month Test

A month’s CSV (150 rows, imagined):

data = pd.read_csv("month.csv", usecols=["Day", "Hour", "Sales", "Item"])
daily_totals = data.groupby("Day")["Sales"].sum().reset_index()
plt.figure(figsize=(10, 6))
plt.bar(daily_totals["Day"], daily_totals["Sales"], color="teal")
plt.title("Monthly Sales by Day")
plt.xlabel("Day")
plt.ylabel("Sales (₹)")
plt.xticks(rotation=90)
plt.show()

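30 bars, one per date, assuming the Day column holds a distinct label for each date rather than just the weekday name; the Saturdays are often tallest. Hourly: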

hourly_avg = data.groupby("Hour")["Sales"].mean().reset_index()
plt.plot(hourly_avg["Hour"], hourly_avg["Sales"], marker="o")
plt.title("Monthly Avg Sales by Hour")
plt.xlabel("Hour")
plt.ylabel("Avg Sales (₹)")
plt.show()

9 AM reigns—her pattern scales. Day 12: Data Odyssey handles it.

Real-World Scale

India’s traffic data—millions of rows—chunks into Pandas, filters rush hours, plots jams. Amazon’s sales—billions—group by day, visualize peaks. Priya’s month is small, but the tricks are pro. Day 12: Data Odyssey bridges her.

Challenges

Bigger data bites:

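  • Slowdown: Loading and plotting lag as rows pile up; chunk or summarize.
  • Errors: File missing? Check paths.
  • Memory: Crashes come when the data outgrows RAM, typically at millions of rows rather than thousands; use dtypes, sample.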

Priya forgets usecols—lags, fixes it. Day 12: Data Odyssey learns with her.
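For the missing-file case, one simple guard is to check the path before reading. Below is a minimal sketch; load_day is a hypothetical helper name, and no_such_day.csv is a deliberately nonexistent file used to show the skip path.

```python
from pathlib import Path

import pandas as pd


def load_day(path):
    """Return the day's DataFrame, or None if the file is missing."""
    path = Path(path)
    if not path.exists():
        # Report the problem and let the caller decide what to do next.
        print(f"Skipping missing file: {path}")
        return None
    return pd.read_csv(path, usecols=["Hour", "Sales", "Item"])


# Hypothetical file name that should not exist in the working folder.
result = load_day("no_such_day.csv")
print(result)
```

Inside a file loop, you would append only the results that are not None, so one missing day doesn't stop the whole week from loading.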

Why This Matters

Handling bigger data lets Priya see a month (Saturday’s ₹2500, 9 AM’s average peak) and plan stock and staff smarter. Without it, she’s stuck at two days; with it, she grows. Scale it: India’s census covers more than a billion people, and summaries like these feed policy shifts. Day 12: Data Odyssey scales you up.

Recap Summary

Yesterday, Day 11: Data Odyssey wrangled Priya’s data—merged days, pivoted 9 AM’s ₹1250, added weather—with Pandas. Today, Day 12: Data Odyssey scaled to bigger data—looping files, chunking, summarizing her week (Saturday ₹2500)—keeping it efficient. It’s her growth step.

What’s Next

Tomorrow, in Day 13: Data Odyssey – What is Data Preprocessing?, we’ll explore preprocessing: How do we prep Priya’s data for modeling? Standardize sales, encode items? We’ll tweak her table for machine learning’s next leap. Bring your curiosity, and I’ll see you there!
