Day 11: Data Odyssey – What is Data Wrangling?

Welcome to Day 11: Data Odyssey, our 365-day journey to master data science and artificial intelligence (AI), launched on Shivaratri, February 26, 2025! Yesterday, in Day 10: Data Odyssey – How Do We Visualize Data in Python?, we added visualization to Priya’s toolkit with Matplotlib. We plotted her cleaned POS data—bar charts showed her 8-9 AM rush, line plots compared Monday vs. Tuesday, and a pie chart highlighted chai’s 70% dominance. Paired with Pandas (Day 9), it turned her numbers into vivid insights. Today, we tackle a new skill: What is data wrangling, and how can Priya reshape her data for bigger questions?

The Art of Data Wrangling

Data wrangling is the process of transforming raw or messy data into a tidy, usable form. It’s an evolution of cleaning (Day 5): where cleaning fixes errors such as typos (₹5000 corrected to ₹500), wrangling reshapes the data to fit your question. In our workflow (Day 1: define, collect, clean, analyze, model, communicate) it is the “prep” work that sits between cleaning and analysis, and it often overlaps with cleaning and exploration (Day 6). Wrangling bends data to answer questions like “What’s my weekly trend?” or “How do items vary by hour?”

Think of it as tailoring a suit. Cleaning mends holes; wrangling cuts, stitches, and fits it to Priya’s frame—merging days, splitting items, adding context. With Pandas, it’s her next power-up. Day 11: Data Odyssey dives into this craft.

Why Wrangling Matters

Priya’s data—Monday’s sales (Day 9), Tuesday’s (Day 10)—lives in separate tables or CSVs. Visuals showed daily peaks, but what about a week? Seasons? She needs:

Combination – Merge Monday-Tuesday into one dataset.
Reshaping – Pivot sales by hour and item.
Enrichment – Add weather or day type (weekday/weekend).

Without wrangling, she’s stuck with silos: Monday’s ₹400 mean doesn’t talk to Tuesday’s ₹420. Wrangling unites them, prepping for deeper stats or models. Day 11: Data Odyssey makes her data whole.

Priya’s Data Recap

Her cleaned Monday (from monday.csv):

    Hour  Sales    Item
0   7 AM    200    Chai
1   8 AM    500    Chai
2   9 AM    600  Samosa
3  10 AM    400    Chai
4  11 AM    300    Chai

Tuesday (from tuesday.csv):

    Hour  Sales    Item
0   7 AM    150    Chai
1   8 AM    550    Chai
2   9 AM    650  Samosa
3  10 AM    500    Chai
4  11 AM    250    Chai

Day 10 plotted these separately. Now, Priya asks: “What’s my two-day trend?” “Chai vs. samosa by hour?” Wrangling’s her answer. Day 11: Data Odyssey starts here.

Merging Data

Combine days into one table:

import pandas as pd

# Load
mon = pd.read_csv("monday.csv")
tue = pd.read_csv("tuesday.csv")

# Add day column
mon["Day"] = "Monday"
tue["Day"] = "Tuesday"

# Merge
data = pd.concat([mon, tue], ignore_index=True)
print(data)

Output:

    Hour  Sales    Item      Day
0   7 AM    200    Chai   Monday
1   8 AM    500    Chai   Monday
2   9 AM    600  Samosa   Monday
3  10 AM    400    Chai   Monday
4  11 AM    300    Chai   Monday
5   7 AM    150    Chai  Tuesday
6   8 AM    550    Chai  Tuesday
7   9 AM    650  Samosa  Tuesday
8  10 AM    500    Chai  Tuesday
9  11 AM    250    Chai  Tuesday

pd.concat() stacks them—Priya’s got 10 rows, two days, one dataset! Day 11: Data Odyssey unites her world.
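Once the days live in one table, per-day questions become a single groupby. Here is a minimal sketch, assuming the two tables from the recap are rebuilt inline instead of being read from monday.csv and tuesday.csv; the names `hours`, `items`, and `per_day` are illustrative only.

```python
import pandas as pd

# Inline stand-ins for monday.csv and tuesday.csv (values copied from the recap)
hours = ["7 AM", "8 AM", "9 AM", "10 AM", "11 AM"]
items = ["Chai", "Chai", "Samosa", "Chai", "Chai"]
mon = pd.DataFrame({"Hour": hours, "Sales": [200, 500, 600, 400, 300], "Item": items})
tue = pd.DataFrame({"Hour": hours, "Sales": [150, 550, 650, 500, 250], "Item": items})

# Tag each table with its day, then stack them into one dataset
data = pd.concat(
    [mon.assign(Day="Monday"), tue.assign(Day="Tuesday")],
    ignore_index=True,
)

# One groupby now gives the total and mean sales for each day
per_day = data.groupby("Day")["Sales"].agg(["sum", "mean"])
print(per_day)
```

Using assign(Day=...) inside the concat call is a design choice: it leaves the original mon and tue tables unchanged, which helps if they are reused later.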

Pivoting Data

What’s total sales by hour across days? Pivot it:

pivot = data.pivot_table(values="Sales", index="Hour", aggfunc="sum")
print(pivot)

Output:

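       Sales
Hour
10 AM    900
11 AM    550
7 AM     350
8 AM    1050
9 AM    1250

9 AM’s ₹1250 (600 + 650) tops all, so her rush holds across days. Note the row order: the Hour labels are text, so pivot_table sorts them alphabetically and 10 AM and 11 AM print first. pivot_table sums sales per hour: wrangling in action. Day 11: Data Odyssey reshapes this.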
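pivot_table sorts row labels by default, and because Hour holds text such as "10 AM", that sort is alphabetical, not chronological. Below is a minimal sketch of one way to restore clock order, using an inline copy of the combined sales; the `hours` list is an assumption about Priya's opening hours.

```python
import pandas as pd

# Assumed opening hours, in clock order
hours = ["7 AM", "8 AM", "9 AM", "10 AM", "11 AM"]

# Inline copy of the combined Monday + Tuesday sales
data = pd.DataFrame({
    "Hour": hours * 2,
    "Sales": [200, 500, 600, 400, 300, 150, 550, 650, 500, 250],
})

pivot = data.pivot_table(values="Sales", index="Hour", aggfunc="sum")
# String labels sort alphabetically, so "10 AM" comes before "7 AM"
print(list(pivot.index))

# Reindexing against the known hour list puts the rows back in clock order
pivot_ordered = pivot.reindex(hours)
print(pivot_ordered)
```

Another option is to store Hour as a real time or an ordered categorical, so that every later sort is chronological without a manual reindex.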

Item Breakdown

Chai vs. samosa by hour? Pivot with items:

pivot_items = data.pivot_table(values="Sales", index="Hour", columns="Item", aggfunc="sum", fill_value=0)
print(pivot_items)

Output:

Item   Chai  Samosa
Hour
10 AM   900       0
11 AM   550       0
7 AM    350       0
8 AM   1050       0
9 AM      0    1250

Chai rules all but 9 AM, where samosas spike. (Rows appear in alphabetical order because Hour is stored as text.) Wrangling reveals this split. Day 11: Data Odyssey digs deeper.
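The same call answers Priya's two-day trend question if Day goes in the columns argument instead of Item. This is a sketch with the combined data rebuilt inline; the `Change` column and the `hours` list are illustrative additions, not part of her POS export.

```python
import pandas as pd

hours = ["7 AM", "8 AM", "9 AM", "10 AM", "11 AM"]
data = pd.DataFrame({
    "Hour": hours * 2,
    "Sales": [200, 500, 600, 400, 300, 150, 550, 650, 500, 250],
    "Day": ["Monday"] * 5 + ["Tuesday"] * 5,
})

# One row per hour, one column per day; reindex keeps clock order
by_day = data.pivot_table(
    values="Sales", index="Hour", columns="Day", aggfunc="sum"
).reindex(hours)

# Hour-by-hour difference between the two days (illustrative extra column)
by_day["Change"] = by_day["Tuesday"] - by_day["Monday"]
print(by_day)
```

With the days side by side, the biggest gain shows up at 10 AM (400 to 500), while 7 AM and 11 AM dip by 50 each.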

Adding Context

Enrich with weather (imagined):

weather = pd.DataFrame({
    "Day": ["Monday", "Tuesday"],
    "Weather": ["Sunny", "Rainy"]
})
data = data.merge(weather, on="Day")
print(data)

Output:

    Hour  Sales    Item      Day  Weather
0   7 AM    200    Chai   Monday    Sunny
1   8 AM    500    Chai   Monday    Sunny
2   9 AM    600  Samosa   Monday    Sunny
3  10 AM    400    Chai   Monday    Sunny
4  11 AM    300    Chai   Monday    Sunny
5   7 AM    150    Chai  Tuesday    Rainy
6   8 AM    550    Chai  Tuesday    Rainy
7   9 AM    650  Samosa  Tuesday    Rainy
8  10 AM    500    Chai  Tuesday    Rainy
9  11 AM    250    Chai  Tuesday    Rainy

merge adds weather—does rain boost samosas? Day 11: Data Odyssey enriches her view.
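A plain merge keeps only the days that appear in both tables, so a day with no weather entry would quietly disappear. Here is a hedged sketch of a more defensive version, using a small made-up sales table; the Wednesday row is hypothetical and exists only to show an unmatched day.

```python
import pandas as pd

# Small illustrative sales table; the Wednesday row is hypothetical
sales = pd.DataFrame({
    "Day": ["Monday", "Monday", "Tuesday", "Wednesday"],
    "Sales": [200, 500, 150, 300],
})
weather = pd.DataFrame({
    "Day": ["Monday", "Tuesday"],
    "Weather": ["Sunny", "Rainy"],
})

# how="left" keeps every sales row even when that day has no weather entry
# validate="many_to_one" raises an error if weather lists the same day twice
# indicator=True adds a _merge column showing which rows found a match
merged = sales.merge(
    weather, on="Day", how="left", validate="many_to_one", indicator=True
)
print(merged)

# Rows that found no weather entry
unmatched = merged[merged["_merge"] == "left_only"]
print(unmatched)
```

Checking the unmatched rows right after the merge is cheap, and it catches missing lookup entries before they skew a later pivot.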

Cleaning in Wrangling

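Wrangling often cleans too. Duplicates? data = data.drop_duplicates(). Odd formats? Standardize “7 AM” to “07:00” by parsing each label as a time and formatting it back as zero-padded 24-hour text. (A plain Series.replace(" AM", ":00") only matches whole cell values, so it would leave “7 AM” unchanged.)

# Parse the 12-hour label, then write it back as HH:MM
data["Hour"] = pd.to_datetime(data["Hour"], format="%I %p").dt.strftime("%H:%M")
print(data["Hour"].tolist())

Output: ['07:00', '08:00', '09:00', …]. Priya’s data’s tidier. Day 11: Data Odyssey blends these steps.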
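The two cleaning steps can be sketched together on a small made-up table. The repeated 8 AM row and the "1 PM" row are hypothetical, added to show that duplicates are dropped and that parsing the label as a time also handles afternoon hours.

```python
import pandas as pd

# Illustrative raw rows: one exact duplicate and one hypothetical PM entry
raw = pd.DataFrame({
    "Hour": ["7 AM", "8 AM", "8 AM", "1 PM"],
    "Sales": [200, 500, 500, 350],
    "Item": ["Chai", "Chai", "Chai", "Samosa"],
})

# drop_duplicates returns a new DataFrame, so keep the result
clean = raw.drop_duplicates().reset_index(drop=True)

# Parse the 12-hour label, then format it as zero-padded 24-hour text
clean["Hour"] = pd.to_datetime(clean["Hour"], format="%I %p").dt.strftime("%H:%M")
print(clean)
```

A side benefit of the 24-hour text is that it sorts correctly as a string, which the original "7 AM" style labels do not.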

Why Wrangle?

Wrangling turns Priya’s scattered CSVs into a unified story—9 AM’s ₹1250 peak, chai’s hourly reign, weather’s hint. Without it, she’s juggling files; with it, she’s asking: “Rainy days mean more samosas?” It preps stats (Day 4) and visuals (Day 10) for scale. Day 11: Data Odyssey makes this hers.

Real-World Wrangling

India’s census wrangles hundreds of millions of records, merging states and pivoting demographics, to inform policy. A retailer like Amazon can blend sales and weather data to ask whether rain spikes umbrella purchases. Priya’s two days are small, but the skill’s the same. Day 11: Data Odyssey ties her in.

Challenges

Wrangling trips up:

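Mismatch: “Monday” and “monday” don’t match in a merge because case matters. The default inner merge silently drops those rows; it does not raise an error.
Gaps: A pivot only shows hours that have rows, so a missing hour simply vanishes. Reindex against the full list of hours to fill it.
Complexity: Big merges get slow and memory-hungry. Priya’s ten rows are fine for now.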
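The first two pitfalls can be sketched on a deliberately messy table. The inconsistent Day spellings and the missing 9 AM row are made up for illustration, and `all_hours` is an assumed list of opening hours.

```python
import pandas as pd

# Deliberately messy input: mixed-case day names, a stray space, and no 9 AM row
sales = pd.DataFrame({
    "Hour": ["7 AM", "8 AM", "10 AM"],
    "Sales": [200, 500, 400],
    "Day": ["monday", "Monday ", "MONDAY"],
})
weather = pd.DataFrame({"Day": ["Monday"], "Weather": ["Sunny"]})

# Mismatch: normalise the key before merging so every spelling matches
sales["Day"] = sales["Day"].str.strip().str.title()
merged = sales.merge(weather, on="Day", how="left")

# Gaps: reindex the pivot against the full hour list and fill missing hours with 0
all_hours = ["7 AM", "8 AM", "9 AM", "10 AM"]
pivot = merged.pivot_table(values="Sales", index="Hour", aggfunc="sum")
pivot = pivot.reindex(all_hours, fill_value=0)
print(pivot)
```

Filling with 0 assumes a missing hour means no sales. If it might instead mean the till was offline, leaving the gap as NaN is the safer choice.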

She mistypes pivot_table as pivott_table and gets an AttributeError; correcting the spelling fixes it. Day 11: Data Odyssey expects stumbles.

Why This Matters

Wrangling gives Priya a combined two-day view: 9 AM’s king, chai’s steady, weather’s a clue. Without it, she’s stuck in daily silos; with it, she can plan stock smarter. Scale it up: wrangled traffic data can help India’s cities ease their jams. Day 11: Data Odyssey hands you this craft.

Recap Summary

Yesterday, Day 10: Data Odyssey plotted Priya’s sales—bar charts (8-9 AM rush), pies (chai 70%)—with Matplotlib and Pandas. Today, Day 11: Data Odyssey explored wrangling—merging her days, pivoting sales (9 AM ₹1250), adding weather—with Pandas. It’s her data, reshaped.

What’s Next

Tomorrow, in Day 12: Data Odyssey – How Do We Handle Bigger Data?, we’ll tackle bigger data: How do we scale Priya’s week to a month? Manage memory? We’ll use Pandas tricks to handle her growing POS files efficiently. Bring your curiosity, and I’ll see you there!

Author

Vincent Mathews
