Welcome to Day 12: Data Odyssey, our 365-day journey to master data science and artificial intelligence (AI), launched on Shivaratri, February 26, 2025! Yesterday, in Day 11: Data Odyssey – What is Data Wrangling?, we explored data wrangling, transforming Priya’s scattered POS data into a unified form. We merged her Monday and Tuesday sales, pivoted to see 9 AM’s ₹1250 peak, and enriched it with weather (sunny vs. rainy), all using Pandas. It turned her two-day snapshot into a broader story. Today, we scale up: How do we handle bigger data—like a month of Priya’s sales—and keep it manageable?
The Challenge of Bigger Data
Priya’s two days—10 rows—worked fine in Pandas (Day 9) and Matplotlib (Day 10). But a month? 30 days × 5 hours = 150 rows. A year? 1825 rows. Add items, weather, customers—it balloons fast. Bigger data brings:
- Volume – More rows, columns, files.
- Speed – Loading, calculating, plotting slows.
- Memory – Laptops choke on millions of rows.
Wrangling (Day 11) merged two days, but a month needs efficiency—smart loading, trimming, processing. Day 12: Data Odyssey equips Priya (and you) for this leap.
Priya’s Growing Data
Imagine Priya’s POS now tracks a week:
- Monday-Tuesday: 10 rows (Day 11).
- Wednesday-Friday: 15 more rows.
- Saturday-Sunday: 10 rows (weekend hours).
Total: 35 rows in separate CSVs (monday.csv, tuesday.csv, etc.). She wants: “What’s my weekly trend?” “Peak days?” Day 11’s merge works, but loading seven files one by one is tedious and easy to get wrong. Day 12: Data Odyssey scales smarter.
Loading Big Data Efficiently
Pandas’ pd.read_csv() loads one file—fine for two days. For many:
- Loop Files:
import pandas as pd
import glob
# List CSVs (sorted so the load order is predictable)
files = sorted(glob.glob("*.csv")) # Grabs monday.csv, tuesday.csv, etc.
data_list = []
# Load each file and tag its rows with the day taken from the file name
for file in files:
    day_data = pd.read_csv(file)
    day_name = file.replace(".csv", "")
    day_data["Day"] = day_name.capitalize()
    data_list.append(day_data)
# Combine
data = pd.concat(data_list, ignore_index=True)
print(data.shape) # Rows, columns
For 7 days, 5 hours each: (35, 4)—35 rows, 4 columns (Hour, Sales, Item, Day). Priya runs this—her week’s in one DataFrame! Day 12: Data Odyssey automates this.
- Selective Columns: If her POS adds fluff (e.g., “Customer ID”), load only what’s needed:
day_data = pd.read_csv(file, usecols=["Hour", "Sales", "Item"])
Saves memory—Priya skips junk. Day 12: Data Odyssey trims fat.
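To see what usecols actually saves, here is a minimal sketch. The three rows and the customer IDs are made up for illustration, and io.StringIO stands in for a real file so the snippet runs anywhere:

```python
import io
import pandas as pd

# Hypothetical POS export with an extra "Customer ID" column (made-up rows).
raw_csv = """Hour,Sales,Item,Customer ID
8 AM,600,Chai,C001
9 AM,1250,Chai,C002
10 AM,400,Samosa,C003
"""

# Load everything, then load only the three columns Priya needs.
full = pd.read_csv(io.StringIO(raw_csv))
trimmed = pd.read_csv(io.StringIO(raw_csv), usecols=["Hour", "Sales", "Item"])

print(full.shape)     # (3, 4)
print(trimmed.shape)  # (3, 3)

# deep=True measures the real size of text columns, not just references.
print(full.memory_usage(deep=True).sum())
print(trimmed.memory_usage(deep=True).sum())
```

On three rows the difference is a handful of bytes, but the same call on a wide export with many unused columns skips them before they ever reach memory.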
Chunking Big Files
One giant CSV—say, month.csv with 150 rows? Load in chunks:
chunks = pd.read_csv("month.csv", chunksize=50)
data = pd.concat(chunks, ignore_index=True)
print(data.shape) # (150, 4)
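read_csv now hands over 50 rows at a time, so each step stays small. Note that pd.concat still assembles all 150 rows in memory at the end; the real saving comes when each chunk is filtered or summarized before it is combined. Day 12: Data Odyssey chunks it.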
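Chunking pays off most when each chunk is reduced before the next one is read. A minimal sketch of that pattern, assuming a file with a Day column like Priya's month.csv (the six rows below are made up, and io.StringIO stands in for the file):

```python
import io
import pandas as pd

# Made-up stand-in for month.csv: 6 rows instead of 150.
raw_csv = """Day,Hour,Sales,Item
Monday,8 AM,600,Chai
Monday,9 AM,1250,Chai
Tuesday,8 AM,550,Chai
Tuesday,9 AM,1200,Samosa
Monday,10 AM,400,Samosa
Tuesday,10 AM,450,Chai
"""

partial_totals = []
# With chunksize set, read_csv returns an iterator of DataFrames.
for chunk in pd.read_csv(io.StringIO(raw_csv), chunksize=2):
    # Reduce each chunk to per-day sums before moving on to the next one.
    partial_totals.append(chunk.groupby("Day")["Sales"].sum())

# A day can span several chunks, so add the partial sums together.
daily = pd.concat(partial_totals).groupby(level=0).sum()
print(daily)
```

Only the small per-chunk summaries are kept, so the full table never has to sit in memory at once.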
Summarizing to Save Space
Do all those rows slow stats and plots? Aggregate first:
weekly = data.groupby("Day")["Sales"].sum().reset_index()
print(weekly)
Output (imagined):
         Day  Sales
0     Friday   2200
1     Monday   2000
2   Saturday   2500
3     Sunday   1800
4   Thursday   2100
5    Tuesday   1950
6  Wednesday   2050
7 rows, not 35—faster to plot or analyze. Priya sees Saturday’s ₹2500 lead. Day 12: Data Odyssey shrinks smartly.
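One wrinkle: groupby sorts day names alphabetically, so Friday comes first. If Priya wants Monday-to-Sunday order, one option (not the only one) is an ordered categorical. This sketch reuses the imagined totals above; day_order is a name introduced here for illustration:

```python
import pandas as pd

# The imagined weekly totals, in the alphabetical order groupby produces.
weekly = pd.DataFrame({
    "Day": ["Friday", "Monday", "Saturday", "Sunday",
            "Thursday", "Tuesday", "Wednesday"],
    "Sales": [2200, 2000, 2500, 1800, 2100, 1950, 2050],
})

# Calendar order for the week (an illustrative helper list).
day_order = ["Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday"]

# An ordered categorical makes sort_values follow day_order, not the alphabet.
weekly["Day"] = pd.Categorical(weekly["Day"], categories=day_order, ordered=True)
weekly = weekly.sort_values("Day").reset_index(drop=True)
print(weekly)

# The best day is the row with the largest total.
top_day = weekly.loc[weekly["Sales"].idxmax(), "Day"]
print(top_day)  # Saturday
```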
Plotting Bigger Data
Day 10’s bar chart for 5 hours scales poorly for 35. Summarize, then plot:
import matplotlib.pyplot as plt
plt.bar(weekly["Day"], weekly["Sales"], color="teal")
plt.title("Weekly Sales by Day")
plt.xlabel("Day")
plt.ylabel("Sales (₹)")
plt.xticks(rotation=45)
plt.show()
Bars show Saturday’s peak—readable, quick. For hourly trends:
hourly = data.groupby("Hour")["Sales"].mean().reset_index()
plt.plot(hourly["Hour"], hourly["Sales"], marker="o")
plt.title("Average Sales by Hour (Week)")
plt.xlabel("Hour")
plt.ylabel("Avg Sales (₹)")
plt.show()
Line peaks at 9 AM—her rush holds. Day 12: Data Odyssey visualizes big.
Filtering for Focus
Too much noise? Filter:
- Rush Hours: rush = data[data["Sales"] > 500].
- Output: 8-9 AM rows across days.
- Chai Only: chai = data[data["Item"] == "Chai"].
- Smaller table—chai’s story.
Priya filters rush hours—focuses stock there. Day 12: Data Odyssey narrows it.
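Filters can also be combined. A minimal sketch on a made-up five-row table (only the ₹1250 at 9 AM comes from Priya's story; the other values are invented):

```python
import pandas as pd

# Made-up sample rows in the same shape as Priya's table.
data = pd.DataFrame({
    "Day": ["Monday", "Monday", "Monday", "Tuesday", "Tuesday"],
    "Hour": ["8 AM", "9 AM", "10 AM", "9 AM", "10 AM"],
    "Sales": [600, 1250, 400, 1200, 450],
    "Item": ["Chai", "Chai", "Samosa", "Samosa", "Chai"],
})

# The two single-condition filters from the list above.
rush = data[data["Sales"] > 500]
chai = data[data["Item"] == "Chai"]

# Combine conditions with & (and) or | (or); each needs its own parentheses.
chai_rush = data[(data["Sales"] > 500) & (data["Item"] == "Chai")]

# isin() keeps rows whose value appears in a list.
peak_hours = data[data["Hour"].isin(["8 AM", "9 AM"])]

print(len(rush), len(chai), len(chai_rush), len(peak_hours))  # 3 3 2 3
```

The parentheses matter: & binds tighter than > and ==, so leaving them out raises an error or gives the wrong mask.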
Memory Tricks
Big data hogs RAM:
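- Dtypes: data["Sales"] = data["Sales"].astype("int32") stores each value in 4 bytes instead of the default 8.
- Drop Columns: data = data.drop(columns=["Item"]) if the column is unneeded now (drop returns a new DataFrame, so assign it back).
- Sample: sample = data.sample(frac=0.1) takes a random 10% for quick tests.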
Priya’s 150 rows fit fine, and so would a year’s 1825; these habits start to pay off when tables reach millions of rows. Day 12: Data Odyssey optimizes.
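It helps to measure before and after. The sketch below builds a made-up year of data (1825 rows) and compares memory use. The "category" dtype for repeated text is an extra trick not covered above, offered here as one more option:

```python
import pandas as pd

# Made-up year: 5 hourly rows repeated for 365 days = 1825 rows.
data = pd.DataFrame({
    "Hour": ["8 AM", "9 AM", "10 AM", "11 AM", "12 PM"] * 365,
    "Sales": [600, 1250, 400, 350, 500] * 365,
    "Item": ["Chai", "Chai", "Samosa", "Chai", "Samosa"] * 365,
})

before = data.memory_usage(deep=True).sum()

# Smaller integer type for sales figures that fit comfortably in 32 bits.
data["Sales"] = data["Sales"].astype("int32")
# Repeated text values are stored once each, plus a small code per row.
data["Item"] = data["Item"].astype("category")
data["Hour"] = data["Hour"].astype("category")

after = data.memory_usage(deep=True).sum()
print(before, after)
```

The exact byte counts depend on the pandas version, so none are quoted here; the second number should come out smaller than the first.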
Priya’s Month Test
A month’s CSV (150 rows, imagined):
data = pd.read_csv("month.csv", usecols=["Day", "Hour", "Sales", "Item"])
daily_totals = data.groupby("Day")["Sales"].sum().reset_index()
plt.figure(figsize=(10, 6))
plt.bar(daily_totals["Day"], daily_totals["Sales"], color="teal")
plt.title("Monthly Sales by Day")
plt.xlabel("Day")
plt.ylabel("Sales (₹)")
plt.xticks(rotation=90)
plt.show()
One bar per distinct Day value: 30 bars if the Day column holds dates, with Saturdays often the tallest. Hourly:
hourly_avg = data.groupby("Hour")["Sales"].mean().reset_index()
plt.plot(hourly_avg["Hour"], hourly_avg["Sales"], marker="o")
plt.title("Monthly Avg Sales by Hour")
plt.xlabel("Hour")
plt.ylabel("Avg Sales (₹)")
plt.show()
9 AM reigns—her pattern scales. Day 12: Data Odyssey handles it.
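If the month file's Day column holds calendar dates, spotting a weekday pattern takes one more step: turn each date into its weekday name. A minimal sketch under that assumption, with made-up dates and sales figures:

```python
import pandas as pd

# Made-up slice of a month; assumes "Day" holds ISO date strings.
data = pd.DataFrame({
    "Day": ["2025-03-01", "2025-03-01", "2025-03-02", "2025-03-03", "2025-03-08"],
    "Sales": [1300, 1200, 1800, 2000, 2400],
})

# Parse the text into real dates so date helpers become available.
data["Day"] = pd.to_datetime(data["Day"])

# Total per calendar date (one bar per date in the chart above).
daily_totals = data.groupby("Day")["Sales"].sum()

# Average daily total per weekday name, e.g. all Saturdays together.
weekday_avg = daily_totals.groupby(daily_totals.index.day_name()).mean()
print(weekday_avg)
print(weekday_avg.idxmax())  # Saturday
```

March 1 and March 8, 2025 are both Saturdays, so their totals (2500 and 2400) average to 2450.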
Real-World Scale
India’s traffic data—millions of rows—chunks into Pandas, filters rush hours, plots jams. Amazon’s sales—billions—group by day, visualize peaks. Priya’s month is small, but the tricks are pro. Day 12: Data Odyssey bridges her.
Challenges
Bigger data bites:
- Slowdown: Loading and plotting lag as rows pile up. Chunk or summarize.
- Errors: File missing? Check paths.
- Memory: A typical laptop runs short of RAM once tables reach millions of rows. Use dtypes, sample.
Priya forgets usecols, sees the load lag, and fixes it. Day 12: Data Odyssey learns with her.
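A missing file need not stop the whole load. The sketch below is one way to guard against it; load_days is a hypothetical helper written for this post, and a temporary folder stands in for Priya's real one:

```python
import tempfile
from pathlib import Path

import pandas as pd


def load_days(folder, day_names):
    """Load <day>.csv for each name, skipping files that are missing."""
    frames, missing = [], []
    for day in day_names:
        path = Path(folder) / f"{day}.csv"
        if not path.exists():
            # Remember the gap instead of crashing mid-loop.
            missing.append(path.name)
            continue
        frame = pd.read_csv(path, usecols=["Hour", "Sales", "Item"])
        frame["Day"] = day.capitalize()
        frames.append(frame)
    if not frames:
        raise FileNotFoundError(f"No day files found in {folder}")
    return pd.concat(frames, ignore_index=True), missing


# Demo: only monday.csv exists, so tuesday.csv is reported as missing.
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "monday.csv").write_text("Hour,Sales,Item\n9 AM,1250,Chai\n")
    data, missing = load_days(tmp, ["monday", "tuesday"])

print(data.shape)  # (1, 4)
print(missing)     # ['tuesday.csv']
```

Returning the list of missing files lets Priya decide whether a gap matters, rather than finding out from a half-empty chart.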
Why This Matters
Handling bigger data lets Priya see a month—Saturday’s ₹2500, 9 AM’s avg peak—planning stock and staff smarter. Without it, she’s stuck at two days; with it, she grows. Scale it: India’s census chunks billions—policy shifts. Day 12: Data Odyssey scales you up.
Recap Summary
Yesterday, Day 11: Data Odyssey wrangled Priya’s data—merged days, pivoted 9 AM’s ₹1250, added weather—with Pandas. Today, Day 12: Data Odyssey scaled to bigger data—looping files, chunking, summarizing her week (Saturday ₹2500)—keeping it efficient. It’s her growth step.
What’s Next
Tomorrow, in Day 13: Data Odyssey – What is Data Preprocessing?, we’ll explore preprocessing: How do we prep Priya’s data for modeling? Standardize sales, encode items? We’ll tweak her table for machine learning’s next leap. Bring your curiosity, and I’ll see you there!










