Welcome to Day 36: Data Odyssey, our 365-day journey to master data science and artificial intelligence (AI), launched on Shivaratri, February 26, 2025! Yesterday, in Day 35: Data Odyssey – How Do We Handle Missing Data?, we filled gaps in Priya’s 7-row dataset by imputing 10-11 AM sales (e.g., ₹540 for 10 AM) using linear interpolation. Her stacked ensemble hit ₹3.8 MAE (up from ₹3.4, Day 34) but maintained 1.0 “Slow” recall, guiding 30 samosas at 10 AM and 15 chais at 7 AM. Today, we shift to text: What is natural language processing, and can Priya analyze customer reviews like “Great samosas!” to boost her café?

Unlocking Words

Natural Language Processing (NLP) enables computers to understand and generate human language—text like customer reviews or social media posts. Priya’s models predict sales (₹640.5, Day 34) and classify hours (Day 31), but reviews reveal sentiment: Are samosas a hit? It’s “collect” and “analyze” in our workflow (Day 1), turning words into insights—stock more samosas or tweak chai? Unlike numerical sales (Day 24), NLP handles unstructured text.

Think of it as Priya listening to her café’s buzz. “Amazing samosas, slow service” guides her—40 samosas, faster staff. Day 36: Data Odyssey listens to this.

Why NLP Matters

Priya’s models—regression (MAE ₹3.8), classifier (1.0 recall)—optimize stock, but:

Feedback: Reviews show why ₹150 at 7 AM—bad chai?
Demand: “Love samosas!”—stock 45, not 39?
Growth: Day 12’s 35 rows—pair with review insights.

NLP adds context to her ₹632.5 forecast (Day 25) and clusters (Day 28), driving customer-focused decisions. Day 36: Data Odyssey interprets this.

Priya’s Data Recap

Her sales data (Day 35):

                     Sales  Hour_Num  Item_Code  Weather_Rainy  Rush_Hour  Weekday  Sales_Lag  Label
2025-03-03 07:00:00  200.0         7          0              0          0        1      0.0  Slow
2025-03-03 08:00:00  500.0         8          0              0          1        1    200.0  Busy
2025-03-03 09:00:00  600.0         9          1              0          1        1    500.0  Busy
2025-03-03 10:00:00  500.0        10          1              0          0        1    600.0  Busy
2025-03-03 11:00:00  400.0        11          1              0          0        1    500.0  Slow
2025-03-04 07:00:00  150.0         7          0              1          0        1    600.0  Slow
2025-03-04 08:00:00  550.0         8          0              1          1        1    150.0  Busy
2025-03-04 09:00:00  650.0         9          1              1          1        1    550.0  Busy
2025-03-04 10:00:00  550.0        10          1              1          0        1    650.0  Busy
2025-03-04 11:00:00  450.0        11          1              1          0        1    550.0  Slow
2025-03-05 09:00:00  640.0         9          1              0          1        0    650.0  Busy
2025-03-05 10:00:00  540.0        10          1              0          0        0    640.0  Busy
2025-03-05 11:00:00  440.0        11          1              0          0        0    540.0  Slow

Issue: No text data—simulate reviews.
Models: Stacked ensemble, MAE ₹3.8, 1.0 “Slow” recall.

Goal: Analyze simulated reviews—sentiment for samosas, chai? Day 36: Data Odyssey starts here.

Simulating Reviews

Add 7 reviews tied to hours:

import pandas as pd

# Seven simulated reviews, each tied to an hour in the sales data
reviews = pd.DataFrame({
    "Datetime": [
        "2025-03-03 07:00", "2025-03-03 09:00", "2025-03-04 07:00",
        "2025-03-04 09:00", "2025-03-05 09:00", "2025-03-03 10:00",
        "2025-03-04 08:00"
    ],
    "Review": [
        "Chai was cold, slow service", "Great samosas, quick!", "Rainy, chai okay",
        "Samosas amazing, busy vibe", "Best samosas ever!", "Samosas good, bit slow",
        "Chai decent, fast service"
    ],
    "Item": ["Chai", "Samosa", "Chai", "Samosa", "Samosa", "Samosa", "Chai"]
})
reviews["Datetime"] = pd.to_datetime(reviews["Datetime"])
# Left merge keeps every sales hour; hours without a review get NaN.
# Assumes data_full (Day 35) keeps its timestamps under the name "Datetime".
data_full = data_full.merge(reviews, on="Datetime", how="left")
print(data_full[["Sales", "Hour_Num", "Review", "Item"]])

Output:

                     Sales  Hour_Num                       Review    Item
2025-03-03 07:00:00  200.0         7  Chai was cold, slow service    Chai
2025-03-03 08:00:00  500.0         8                          NaN     NaN
2025-03-03 09:00:00  600.0         9        Great samosas, quick!  Samosa
2025-03-03 10:00:00  500.0        10       Samosas good, bit slow  Samosa
2025-03-03 11:00:00  400.0        11                          NaN     NaN
2025-03-04 07:00:00  150.0         7             Rainy, chai okay    Chai
2025-03-04 08:00:00  550.0         8    Chai decent, fast service    Chai
2025-03-04 09:00:00  650.0         9   Samosas amazing, busy vibe  Samosa
2025-03-04 10:00:00  550.0        10                          NaN     NaN
2025-03-04 11:00:00  450.0        11                          NaN     NaN
2025-03-05 09:00:00  640.0         9           Best samosas ever!  Samosa
2025-03-05 10:00:00  540.0        10                          NaN     NaN
2025-03-05 11:00:00  440.0        11                          NaN     NaN

Reviews sparse—NLP to extract sentiment. Day 36: Data Odyssey processes this.

NLP Basics

Steps for sentiment:

Preprocess Text:
- Lowercase, remove punctuation, tokenize.
Vectorize:
- Convert to numbers—TF-IDF or embeddings.
Classify Sentiment:
- Positive (“Great samosas!”) vs. negative (“Cold chai”).

Seven reviews suit simple, rule-based NLP; embeddings start to pay off with more text, such as reviews covering Day 12’s 35 rows. Day 36: Data Odyssey analyzes this.
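
The first two steps above can be sketched with the standard library alone. This is a minimal illustration, not the pipeline used in this series: the helper names `preprocess` and `tf_idf` are made up for the example, only three of Priya's reviews are used, and the weighting is the plain textbook TF-IDF (term frequency times log of N over document frequency), which differs from the smoothed variant that libraries such as scikit-learn apply.

```python
import math
import re
from collections import Counter

# Three of the simulated reviews, used as a tiny corpus
reviews = [
    "Chai was cold, slow service",
    "Great samosas, quick!",
    "Best samosas ever!",
]

def preprocess(text):
    """Lowercase the text, drop punctuation, and split into word tokens."""
    return re.findall(r"[a-z]+", text.lower())

tokens = [preprocess(r) for r in reviews]

# Document frequency: in how many reviews does each word appear?
doc_freq = Counter(word for doc in tokens for word in set(doc))

def tf_idf(doc, n_docs):
    """Textbook TF-IDF: (count / doc length) * log(n_docs / document frequency)."""
    counts = Counter(doc)
    return {
        word: (count / len(doc)) * math.log(n_docs / doc_freq[word])
        for word, count in counts.items()
    }

vectors = [tf_idf(doc, len(tokens)) for doc in tokens]
print(tokens[1])
print(vectors[1])
```

In the second review, "samosas" gets a lower weight than "great" or "quick" because it also appears in another review, which is the point of the IDF term: words shared across reviews say less about any one of them.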

Sentiment Analysis

Use VADER (simple, rule-based):

import pandas as pd
from nltk.sentiment.vader import SentimentIntensityAnalyzer
import nltk
nltk.download("vader_lexicon")

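# Initialize the rule-based analyzer
sia = SentimentIntensityAnalyzer()

# Score each review with VADER's compound score (-1 to 1).
# Hours without a review get 0, the same value as a neutral review.
data_full["Sentiment"] = data_full["Review"].apply(
    lambda x: sia.polarity_scores(x)["compound"] if pd.notna(x) else 0
)
print(data_full[["Sales", "Hour_Num", "Review", "Item", "Sentiment"]])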

Output:

                     Sales  Hour_Num                       Review    Item  Sentiment
2025-03-03 07:00:00  200.0         7  Chai was cold, slow service    Chai    -0.4767
2025-03-03 08:00:00  500.0         8                          NaN     NaN     0.0000
2025-03-03 09:00:00  600.0         9        Great samosas, quick!  Samosa     0.6588
2025-03-03 10:00:00  500.0        10       Samosas good, bit slow  Samosa     0.4404
2025-03-03 11:00:00  400.0        11                          NaN     NaN     0.0000
2025-03-04 07:00:00  150.0         7             Rainy, chai okay    Chai     0.2263
2025-03-04 08:00:00  550.0         8    Chai decent, fast service    Chai     0.5719
2025-03-04 09:00:00  650.0         9   Samosas amazing, busy vibe  Samosa     0.5859
2025-03-04 10:00:00  550.0        10                          NaN     NaN     0.0000
2025-03-04 11:00:00  450.0        11                          NaN     NaN     0.0000
2025-03-05 09:00:00  640.0         9           Best samosas ever!  Samosa     0.6369
2025-03-05 10:00:00  540.0        10                          NaN     NaN     0.0000
2025-03-05 11:00:00  440.0        11                          NaN     NaN     0.0000

Samosa: Consistently positive (0.44 to 0.66, mean about 0.58). Stock 40!
Chai: Mixed (-0.48 to 0.57), with the one negative score at 7 AM. Improve the 7 AM chai?

Day 36: Data Odyssey scores this.
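
To make the per-item reading concrete, here is a small sketch that averages the compound scores from the table above and labels each one. The ±0.05 cut-offs are the thresholds commonly cited for VADER's compound score; the `label` helper and the hard-coded score list are assumptions made for this illustration, not part of Priya's pipeline.

```python
from statistics import mean

# Compound scores copied from the output table (reviewed hours only)
scores = [
    ("Chai", -0.4767), ("Samosa", 0.6588), ("Samosa", 0.4404),
    ("Chai", 0.2263), ("Chai", 0.5719), ("Samosa", 0.5859), ("Samosa", 0.6369),
]

def label(compound, threshold=0.05):
    """Map a compound score to positive / negative / neutral."""
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"

# Group scores by item
by_item = {}
for item, score in scores:
    by_item.setdefault(item, []).append(score)

# Average score and per-review labels for each item
summary = {item: round(mean(vals), 4) for item, vals in by_item.items()}
labels = {item: [label(v) for v in vals] for item, vals in by_item.items()}
print(summary)
print(labels)
```

The averages come out at roughly 0.58 for samosas and 0.11 for chai: all four samosa reviews are labeled positive, while chai has one negative review pulling its average toward neutral.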

Enhance Regression

Add Sentiment as feature:

from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor, StackingRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error

# Fill NaN Sentiment
data_full["Sentiment"] = data_full["Sentiment"].fillna(0)

# Split
X = data_full[["Hour_Num", "Item_Code", "Weather_Rainy", "Rush_Hour", "Weekday", "Sales_Lag", "Sentiment"]]
y = data_full["Sales"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

# Stack
estimators = [
    ("rf", RandomForestRegressor(n_estimators=20, max_depth=3, random_state=42)),
    ("gb", GradientBoostingRegressor(n_estimators=20, max_depth=2, random_state=42))
]
stack = StackingRegressor(estimators=estimators, final_estimator=LinearRegression())
stack.fit(X_train, y_train)
y_pred = stack.predict(X_test)
print("Stacking MAE:", mean_absolute_error(y_test, y_pred))

Output: Stacking MAE: 3.7, slightly better than ₹3.8 (Day 35). Sentiment seems to help, though ₹0.1 on a 5-row test set is a small margin. Day 36: Data Odyssey predicts this.
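
For readers who want to see what the MAE figure measures, here is a hand-rolled version of the metric. The `mae` helper and the three actual/predicted pairs are hypothetical, chosen only so the arithmetic is easy to follow; they are not Priya's test rows.

```python
def mae(y_true, y_pred):
    """Mean absolute error: the average of |actual - predicted|."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

# Hypothetical hourly sales (in rupees) and predictions, for illustration only
y_true = [500.0, 650.0, 440.0]
y_pred = [496.0, 653.0, 444.1]

# Absolute errors are about 4, 3, and 4.1, so the average is about 3.7
print(round(mae(y_true, y_pred), 2))
```

An MAE of ₹3.7 means the predictions miss the actual hourly sales by ₹3.7 on average, regardless of direction.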

Classifier

With Sentiment:

from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

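# Same features X as the regression, but the target is now the Busy/Slow label
y = data_full["Label"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
estimators = [
    ("rf", RandomForestClassifier(n_estimators=10, max_depth=2, class_weight="balanced", random_state=42)),
    ("gb", GradientBoostingClassifier(n_estimators=10, max_depth=2, random_state=42))
]
# With so few rows, the default 5-fold stratified CV inside StackingClassifier can fail
# when each class has fewer than 5 training rows; a smaller cv (e.g. cv=2) avoids that.
stack = StackingClassifier(estimators=estimators, final_estimator=LogisticRegression())
stack.fit(X_train, y_train)
y_pred = stack.predict(X_test)
print(classification_report(y_test, y_pred))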

Output:

              precision    recall  f1-score   support
Busy         1.00      0.75      0.86         4
Slow         0.50      1.00      0.67         1
accuracy                          0.80         5

Same as Day 35—Sentiment doesn’t lift classifier. Day 36: Data Odyssey tests this.
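
The numbers in the report follow from a handful of counts. The sketch below recomputes them under the reading the report implies: of the 4 Busy test rows, 3 were predicted Busy and 1 was predicted Slow, and the single Slow row was predicted Slow. The `prf` helper is written for this example.

```python
def prf(tp, fp, fn):
    """Precision, recall, and F1 from true-positive, false-positive, false-negative counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Busy: 3 caught, 1 missed, and nothing wrongly called Busy
busy = prf(tp=3, fp=0, fn=1)
# Slow: the 1 Slow row caught, plus 1 Busy row wrongly called Slow
slow = prf(tp=1, fp=1, fn=0)
# 4 of the 5 test rows were classified correctly
accuracy = (3 + 1) / 5

print([round(v, 2) for v in busy])
print([round(v, 2) for v in slow])
print(accuracy)
```

This is why "Slow" recall is 1.0 while its precision is only 0.50: every slow hour is caught, but one busy hour is flagged as slow along the way.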

Thursday 9 AM

With Sentiment:

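new_data = pd.DataFrame({
    "Hour_Num": [9],
    "Item_Code": [1],
    "Weather_Rainy": [0],
    "Rush_Hour": [1],
    "Weekday": [1],
    "Sales_Lag": [640],
    "Sentiment": [0.6]
}, columns=X.columns)

# `stack` now holds the classifier, so refit the regression stack on Sales before predicting
X_train, X_test, y_train, y_test = train_test_split(X, data_full["Sales"], test_size=0.33, random_state=42)
reg_stack = StackingRegressor(
    estimators=[
        ("rf", RandomForestRegressor(n_estimators=20, max_depth=3, random_state=42)),
        ("gb", GradientBoostingRegressor(n_estimators=20, max_depth=2, random_state=42))
    ],
    final_estimator=LinearRegression()
)
reg_stack.fit(X_train, y_train)
pred = reg_stack.predict(new_data)
print("Thursday 9 AM Sales:", pred[0])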

Output: 641—“Busy,” 39 samosas. Sentiment aligns with samosa love. Day 36: Data Odyssey predicts this.

Why NLP?

Insights: Samosas shine—stock 40; chai weak—fix 7 AM.
Features: Sentiment lowers MAE from ₹3.8 to ₹3.7.
Scale: 35 rows (Day 12)—more reviews, deeper NLP.

Enhances ₹632.5 (Day 25), clusters (Day 28)—customer-driven. Day 36: Data Odyssey listens to this.

Real-World NLP

India’s social media NLP tracks crop sentiment—farmers plan. Amazon analyzes reviews—stock adjusts. Priya’s NLP is her café’s ear—small, sharp. Day 36: Data Odyssey mirrors this.

Challenges

Sparse Reviews: 7—more needed.
VADER: Simple—BERT for 35 rows?
Noise: “Okay” chai—neutral or negative?

More data—Priya scales. Day 36: Data Odyssey flags this.

Why This Matters

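NLP reveals that customers love the samosas (39 for the ₹641 forecast) and that the chai needs fixing. Without it, the cause of the ₹150 hour stays hidden; with it, she can tune stock and service to what customers say, and profit follows. Scale it up, and the same techniques can track health trends across India. Day 36: Data Odyssey hears her.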

Recap Summary

Yesterday, Day 35: Data Odyssey imputed 10-11 AM sales (₹540 at 10 AM), with MAE at ₹3.8. Today, Day 36: Data Odyssey used NLP: Sentiment lowered MAE to ₹3.7, and samosas shine. It’s her listen step.

What’s Next

Tomorrow, in Day 37: Data Odyssey – What is Computer Vision?, we’ll see: Can Priya count customers via cameras? Busy hours? We’ll explore computer vision, adding visuals. Bring your curiosity, and I’ll see you there!

Author

Vinay Karanam
