Welcome to Day 45 of our 365-day journey to master data science and artificial intelligence, launched on February 26, 2025. Yesterday, in Day 44, we tuned Priya’s stacked ensemble using Random Search on her 11-row dataset, reducing the mean absolute error to 3.2 from 3.3. The tuned model predicted 643 rupees for Thursday’s 9 AM sales, guiding 32 samosas for a Busy hour, while the classifier maintained 1.0 Slow recall. Today, we scale: What are big data techniques, and can Priya handle large datasets to predict sales for multiple cafés?

Scaling to Millions

Big data techniques process and analyze large, complex datasets, such as sales from multiple cafés, using distributed systems and efficient algorithms. Priya’s 11 rows predict 643 rupees for one café, but scaling to 1000 rows across three cafés, and eventually far more, is where tools like Spark or Dask start to help with volume and speed. This belongs to the collect and model phases of our workflow: it extends her 643-rupee forecast to multiple locations and raises new questions, such as whether to stock 100 samosas across all cafés.

Imagine Priya opening two new cafés. Her model handles 11 rows, but thousands of hourly sales need faster processing to predict 9 AM peaks. Big data techniques enable this. This is the focus of Day 45.

Why Big Data Techniques Matter

Priya’s models—regression with 3.2 mean absolute error, classifier with 1.0 Slow recall, and ARIMA with 2.5 mean absolute error—are accurate for one café, but:

Volume: Can she process 1000 rows for three cafés?
Speed: Real-time 643-rupee predictions across locations?
Scale: With 35 rows now, handle millions later?

Big data techniques enhance her 632.5-rupee forecast, tuned models, and clustering, enabling growth. Day 45 scales this.

Priya’s Data Recap

Her tuned data from Day 44:

Datetime,Sales,Hour_Num,Item_Code,Weather_Rainy,Rush_Hour,Weekday,Sales_Lag,Label,Sentiment,Customer_Count,RL_Stock,Cluster
2025-03-03 08:00,500,8,0,0,1,1,200,Busy,0,15,39,0
2025-03-03 09:00,600,9,1,0,1,1,500,Busy,0.6588,20,32,1
2025-03-03 10:00,500,10,1,0,0,1,600,Busy,0.4404,12,39,0
2025-03-03 11:00,400,11,1,0,0,1,500,Slow,0,8,39,2
2025-03-04 08:00,550,8,0,1,1,1,150,Busy,0.5719,16,39,0
2025-03-04 09:00,650,9,1,1,1,1,550,Busy,0.5859,22,33,1
2025-03-04 10:00,550,10,1,1,0,1,650,Busy,0,13,39,0
2025-03-04 11:00,450,11,1,1,0,1,550,Slow,0,9,39,2
2025-03-05 09:00,640,9,1,0,1,0,650,Busy,0.6369,21,32,1
2025-03-05 10:00,540,10,1,0,0,0,640,Busy,0,14,39,0
2025-03-05 11:00,440,11,1,0,0,0,540,Slow,0,10,39,2

Models: Stacked ensemble, mean absolute error 3.2, 643 rupees for 9 AM.
Issue: Small dataset—cannot scale to multiple cafés.

Goal: Apply big data techniques—predict sales for three cafés, stock 32 samosas per location. Day 45 begins here.

Big Data Techniques Basics

Tools for Priya’s scaling:

Distributed Computing:
- Apache Spark—processes large datasets across clusters.
Parallel Processing:
- Dask—scales pandas for big data.
Data Storage:
- Hadoop HDFS—stores millions of rows.

With 11 rows, we simulate scaling using Dask, preparing for 1000 rows across cafés. Day 45 applies this.
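To make partitioned processing concrete, here is a minimal sketch in plain Python rather than Dask or Spark. It assumes the sales are split into three hypothetical partitions, one per café, built from the Day 44 sales with assumed 1.1x and 0.9x factors for the two new cafés. Each partition is summarized on its own, and the small summaries are then combined. The names `partial_stats` and `combine` are illustrative and not part of any library.

```python
from concurrent.futures import ThreadPoolExecutor

# Day 44 sales for Café 1; the other two partitions are simulated by scaling
base_sales = [500, 600, 500, 400, 550, 650, 550, 450, 640, 540, 440]
partitions = [
    base_sales,                       # Café 1
    [s * 1.1 for s in base_sales],    # Café 2 (assumed 10% higher)
    [s * 0.9 for s in base_sales],    # Café 3 (assumed 10% lower)
]

def partial_stats(rows):
    """Map step: summarize one partition without seeing the others."""
    return sum(rows), len(rows)

def combine(stats):
    """Reduce step: merge per-partition (sum, count) pairs into one mean."""
    total = sum(s for s, _ in stats)
    count = sum(n for _, n in stats)
    return total / count

# Each partition is summarized by its own worker thread
with ThreadPoolExecutor(max_workers=3) as pool:
    stats = list(pool.map(partial_stats, partitions))

overall_mean = combine(stats)
print("Rows processed:", sum(n for _, n in stats))
print("Mean sales across cafés:", round(overall_mean, 2))
```

Only the small (sum, count) pairs move between workers, which is why this pattern keeps working when the partitions live on different machines.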

Simulating Big Data

Expand dataset to 33 rows (three cafés):

import pandas as pd
import dask.dataframe as dd
import numpy as np

data_clean = pd.DataFrame({
    "Datetime": ["2025-03-03 08:00", "2025-03-03 09:00", "2025-03-03 10:00", "2025-03-03 11:00",
                 "2025-03-04 08:00", "2025-03-04 09:00", "2025-03-04 10:00", "2025-03-04 11:00",
                 "2025-03-05 09:00", "2025-03-05 10:00", "2025-03-05 11:00"],
    "Sales": [500, 600, 500, 400, 550, 650, 550, 450, 640, 540, 440],
    "Hour_Num": [8, 9, 10, 11, 8, 9, 10, 11, 9, 10, 11],
    "Item_Code": [0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1],
    "Weather_Rainy": [0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0],
    "Rush_Hour": [1, 1, 0, 0, 1, 1, 0, 0, 1, 0, 0],
    "Weekday": [1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0],
    "Sales_Lag": [200, 500, 600, 500, 150, 550, 650, 550, 650, 640, 540],
    "Sentiment": [0, 0.6588, 0.4404, 0, 0.5719, 0.5859, 0, 0, 0.6369, 0, 0],
    "Customer_Count": [15, 20, 12, 8, 16, 22, 13, 9, 21, 14, 10],
    "RL_Stock": [39, 32, 39, 39, 39, 33, 39, 39, 32, 39, 39],
    "Cluster": [0, 1, 0, 2, 0, 1, 0, 2, 1, 0, 2]
})
data_clean["Datetime"] = pd.to_datetime(data_clean["Datetime"])

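# Simulate three cafés by scaling Café 1's sales and customer counts
data_cafe1 = data_clean.copy()
data_cafe2 = data_clean.copy()
data_cafe2["Sales"] = data_cafe2["Sales"] * 1.1  # 10% higher sales
data_cafe2["Customer_Count"] += 2
data_cafe3 = data_clean.copy()
data_cafe3["Sales"] = data_cafe3["Sales"] * 0.9  # 10% lower sales
data_cafe3["Customer_Count"] -= 2

# ignore_index=True gives the combined frame a unique 0..32 index
data_big = pd.concat(
    [data_cafe1.assign(Cafe="Cafe1"), data_cafe2.assign(Cafe="Cafe2"), data_cafe3.assign(Cafe="Cafe3")],
    ignore_index=True,
)
ddf = dd.from_pandas(data_big, npartitions=3)
print(ddf.head())  # head() reads only from the first partition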

Output (hypothetical; selected rows from each café, not the literal first five):

Datetime,Sales,Hour_Num,Item_Code,Weather_Rainy,Rush_Hour,Weekday,Sales_Lag,Sentiment,Customer_Count,RL_Stock,Cluster,Cafe
2025-03-03 08:00,500,8,0,0,1,1,200,0,15,39,0,Cafe1
2025-03-03 09:00,600,9,1,0,1,1,500,0.6588,20,32,1,Cafe1
2025-03-03 08:00,550,8,0,0,1,1,200,0,17,39,0,Cafe2
2025-03-03 09:00,660,9,1,0,1,1,500,0.6588,22,32,1,Cafe2
2025-03-03 08:00,450,8,0,0,1,1,200,0,13,39,0,Cafe3

33 rows, three cafés—Dask handles partitioning. Day 45 scales this.

Training with Dask

Use Dask-ML:

from dask_ml.model_selection import train_test_split
from dask_ml.wrappers import ParallelPostFit
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error

X = ddf[["Hour_Num", "Item_Code", "Weather_Rainy", "Rush_Hour", "Weekday", "Sales_Lag", "Sentiment", "Customer_Count", "RL_Stock", "Cluster"]]
# dd.get_dummies needs a categorical column with known categories
X = X.assign(Cluster=X["Cluster"].astype("category").cat.as_known())
X = dd.get_dummies(X, columns=["Cluster"], prefix="Cluster")
y = ddf["Sales"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, shuffle=True, random_state=42)

# RandomForestRegressor has no partial_fit, so Incremental cannot wrap it.
# ParallelPostFit trains on in-memory data and parallelizes predict().
rf = ParallelPostFit(RandomForestRegressor(n_estimators=50, max_depth=5, random_state=42))
rf.fit(X_train.compute(), y_train.compute())
y_pred = rf.predict(X_test)
# Metrics are computed on in-memory results with scikit-learn
mae = mean_absolute_error(y_test.compute(), y_pred.compute())
print("Dask RF MAE:", mae)

Output (hypothetical): Dask RF MAE: 3.2, matching the single-café performance; the exact value depends on the random split. Day 45 trains this.
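For reference, mean absolute error is the average of |actual - predicted| over the test rows. A minimal sketch in plain Python, using made-up actual and predicted values purely for illustration (they are not outputs of Priya’s model); `mean_abs_error` is a hypothetical helper name:

```python
def mean_abs_error(actual, predicted):
    """Average absolute difference between paired actual and predicted values."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

# Illustrative values only, not results from the post's model
actual = [600, 650, 440]
predicted = [604, 647, 442]
mae_example = mean_abs_error(actual, predicted)
print("Example MAE:", mae_example)  # (4 + 3 + 2) / 3 = 3.0
```

An MAE of about 3 therefore means the forecast is off by roughly 3 rupees per hour on average, whichever direction the miss goes.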

Stacked Ensemble

Simulate stacking:

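from dask_ml.wrappers import ParallelPostFit
from sklearn.metrics import mean_absolute_error

# Tree ensembles lack partial_fit, so wrap them in ParallelPostFit and train in memory
X_train_pd, y_train_pd = X_train.compute(), y_train.compute()
X_test_pd, y_test_pd = X_test.compute(), y_test.compute()
rf = ParallelPostFit(RandomForestRegressor(n_estimators=50, max_depth=5, random_state=42))
gb = ParallelPostFit(GradientBoostingRegressor(n_estimators=100, max_depth=3, learning_rate=0.1, random_state=42))
rf.fit(X_train_pd, y_train_pd)
gb.fit(X_train_pd, y_train_pd)

# Base-model predictions become the two input columns of the meta-model
meta_X_train = np.column_stack([rf.predict(X_train_pd), gb.predict(X_train_pd)])
meta_X_test = np.column_stack([rf.predict(X_test_pd), gb.predict(X_test_pd)])
meta_model = LinearRegression()
meta_model.fit(meta_X_train, y_train_pd)
y_pred = meta_model.predict(meta_X_test)
mae = mean_absolute_error(y_test_pd, y_pred)
print("Dask Stacked MAE:", mae)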

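Output (hypothetical): Dask Stacked MAE: 3.1, slightly better than 3.2. On 33 simulated rows this shows that the pipeline runs end to end; it says little yet about behavior on truly large data. Day 45 stacks this.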
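The meta-model step is a weighted combination of the base predictions. The sketch below shows that arithmetic with hypothetical weights, intercept, and base forecasts; a fitted `LinearRegression` would learn its own coefficients from the training data, and `stack_predict` is an illustrative name.

```python
def stack_predict(p_rf, p_gb, w_rf, w_gb, intercept):
    """Combine two base-model predictions the way a linear meta-model does."""
    return w_rf * p_rf + w_gb * p_gb + intercept

# Hypothetical numbers chosen for illustration, not learned from Priya's data
p_rf, p_gb = 640.0, 650.0       # base-model forecasts in rupees
w_rf, w_gb, b = 0.5, 0.5, 0.0   # assumed meta-model coefficients
stacked = stack_predict(p_rf, p_gb, w_rf, w_gb, b)
print("Stacked forecast:", stacked)  # 0.5 * 640 + 0.5 * 650 = 645.0
```

With equal weights the stack is a plain average; unequal weights let the meta-model lean on whichever base model has been more reliable.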

Thursday 9 AM Across Cafés

Predict for Café 1:

new_data = pd.DataFrame({
    "Hour_Num": [9],
    "Item_Code": [1],
    "Weather_Rainy": [0],
    "Rush_Hour": [1],
    "Weekday": [1],
    "Sales_Lag": [640],
    "Sentiment": [0.6],
    "Customer_Count": [20],
    "RL_Stock": [32],
    "Cluster_0": [0],  # get_dummies creates one column per cluster (0, 1, 2)
    "Cluster_1": [1],
    "Cluster_2": [0]
})
# Feed both base-model predictions to the meta-model
base_preds = np.array([[rf.predict(new_data)[0], gb.predict(new_data)[0]]])
pred = meta_model.predict(base_preds)
print("Café 1 Thursday 9 AM Sales:", pred[0])

Output (hypothetical): 644 rupees, a Busy hour, so 32 samosas. Scaling by the simulation factors gives about 708 rupees for Café 2 (1.1x) and about 580 rupees for Café 3 (0.9x). Day 45 predicts this.
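The Café 2 and Café 3 figures come from applying the simulation factors to the Café 1 forecast. A small sketch of that arithmetic, assuming the same 1.1x and 0.9x factors used when the dataset was built (the dictionary names are illustrative):

```python
cafe1_forecast = 644  # rupees, Café 1 Thursday 9 AM forecast from the post
scale_factors = {"Cafe1": 1.0, "Cafe2": 1.1, "Cafe3": 0.9}

# Scale the Café 1 forecast by each café's factor and round to whole rupees
forecasts = {cafe: round(cafe1_forecast * f) for cafe, f in scale_factors.items()}
for cafe, rupees in forecasts.items():
    print(cafe, rupees)
```

A model trained on real per-café data would predict each location directly instead of relying on a fixed factor.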

Why Big Data Techniques?

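Scale: 33 rows across three cafés, with 644 rupees forecast for Café 1 and scaled estimates for the others.
Speed: Dask parallelizes work across partitions, which pays off once the data outgrows what pandas handles comfortably.
Growth: 35 rows to millions, supporting multi-café plans.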

Complements 643-rupee forecast, tuning—scaled café. Day 45 expands this.

Real-World Big Data

Retail processes millions of sales—stock optimized. Cities analyze traffic data—congestion eased. Priya’s big data is her café’s growth—small, scalable. Day 45 mirrors this.

Challenges

Small Simulation: 33 rows—not true big data.
Infrastructure: Dask local—cloud for millions?
Features: Café-specific—add location data?

More data—Priya grows. Day 45 notes this.
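On the Features challenge: the model above never sees which café a row belongs to. One common way to add that information is one-hot encoding of the café label. A minimal sketch in plain Python, where the `one_hot` helper and the sample labels are hypothetical and the `Cafe_` column prefix is an assumed naming choice:

```python
def one_hot(values):
    """Return sorted categories and one indicator row per input value."""
    categories = sorted(set(values))
    rows = [[1 if v == c else 0 for c in categories] for v in values]
    return categories, rows

cafes = ["Cafe1", "Cafe2", "Cafe3", "Cafe2"]  # sample labels for illustration
categories, encoded = one_hot(cafes)
for label, row in zip(cafes, encoded):
    print(label, dict(zip([f"Cafe_{c}" for c in categories], row)))
```

With such columns in the feature matrix, a single model could learn café-specific offsets instead of relying on fixed scaling factors.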

Why This Matters

Scaling to 644 rupees—32 samosas per café—grows Priya’s business. Without it, predictions stall; with it, she expands—profit up. Scaled, big data manages cities—lives thrive. Day 45 scales her.

Recap Summary

Yesterday, Day 44 tuned—mean absolute error 3.2, 643 rupees. Today, Day 45 scaled—mean absolute error 3.1, 644 rupees, multi-café. It’s her scale step.

What’s Next

Tomorrow, in Day 46, we’ll secure: Can Priya protect her data? Ensure privacy? We’ll explore data security, safeguarding her café. Join us with curiosity!

Author

Vinay Karanam
