Welcome to Day 36: Data Odyssey, our 365-day journey to master data science and artificial intelligence (AI), launched on Shivaratri, February 26, 2025! Yesterday, in Day 35: Data Odyssey – How Do We Handle Missing Data?, we filled gaps in Priya’s 7-row dataset by imputing 10-11 AM sales (e.g., ₹540 for 10 AM) using linear interpolation. Her stacked ensemble hit ₹3.8 MAE (up from ₹3.4, Day 34) but maintained 1.0 “Slow” recall, guiding 30 samosas at 10 AM and 15 chais at 7 AM. Today, we shift to text: What is natural language processing, and can Priya analyze customer reviews like “Great samosas!” to boost her café?
Unlocking Words
Natural Language Processing (NLP) enables computers to understand and generate human language—text like customer reviews or social media posts. Priya’s models predict sales (₹640.5, Day 34) and classify hours (Day 31), but reviews reveal sentiment: Are samosas a hit? It’s “collect” and “analyze” in our workflow (Day 1), turning words into insights—stock more samosas or tweak chai? Unlike numerical sales (Day 24), NLP handles unstructured text.
Think of it as Priya listening to her café’s buzz. “Amazing samosas, slow service” guides her—40 samosas, faster staff. Day 36: Data Odyssey listens to this.
Why NLP Matters
Priya’s models—regression (MAE ₹3.8), classifier (1.0 recall)—optimize stock, but:
- Feedback: Reviews show why ₹150 at 7 AM—bad chai?
- Demand: “Love samosas!”—stock 45, not 39?
- Growth: Day 12’s 35 rows—pair with review insights.
NLP adds context to her ₹632.5 forecast (Day 25) and clusters (Day 28), driving customer-focused decisions. Day 36: Data Odyssey interprets this.
Priya’s Data Recap
Her sales data (Day 35):
                     Sales  Hour_Num  Item_Code  Weather_Rainy  Rush_Hour  Weekday  Sales_Lag  Label
2025-03-03 07:00:00  200.0         7          0              0          0        1        0.0   Slow
2025-03-03 08:00:00  500.0         8          0              0          1        1      200.0   Busy
2025-03-03 09:00:00  600.0         9          1              0          1        1      500.0   Busy
2025-03-03 10:00:00  500.0        10          1              0          0        1      600.0   Busy
2025-03-03 11:00:00  400.0        11          1              0          0        1      500.0   Slow
2025-03-04 07:00:00  150.0         7          0              1          0        1      600.0   Slow
2025-03-04 08:00:00  550.0         8          0              1          1        1      150.0   Busy
2025-03-04 09:00:00  650.0         9          1              1          1        1      550.0   Busy
2025-03-04 10:00:00  550.0        10          1              1          0        1      650.0   Busy
2025-03-04 11:00:00  450.0        11          1              1          0        1      550.0   Slow
2025-03-05 09:00:00  640.0         9          1              0          1        0      650.0   Busy
2025-03-05 10:00:00  540.0        10          1              0          0        0      640.0   Busy
2025-03-05 11:00:00  440.0        11          1              0          0        0      540.0   Slow
- Issue: No text data—simulate reviews.
- Models: Stacked ensemble, MAE ₹3.8, 1.0 “Slow” recall.
Goal: Analyze simulated reviews—sentiment for samosas, chai? Day 36: Data Odyssey starts here.
Simulating Reviews
Add 7 reviews tied to hours:
reviews = pd.DataFrame({
"Datetime": [
"2025-03-03 07:00", "2025-03-03 09:00", "2025-03-04 07:00",
"2025-03-04 09:00", "2025-03-05 09:00", "2025-03-03 10:00",
"2025-03-04 08:00"
],
"Review": [
"Chai was cold, slow service", "Great samosas, quick!", "Rainy, chai okay",
"Samosas amazing, busy vibe", "Best samosas ever!", "Samosas good, bit slow",
"Chai decent, fast service"
],
"Item": ["Chai", "Samosa", "Chai", "Samosa", "Samosa", "Samosa", "Chai"]
})
reviews["Datetime"] = pd.to_datetime(reviews["Datetime"])
data_full = data_full.merge(reviews, on="Datetime", how="left")
print(data_full[["Sales", "Hour_Num", "Review", "Item"]])
Output:
                     Sales  Hour_Num                       Review    Item
2025-03-03 07:00:00  200.0         7  Chai was cold, slow service    Chai
2025-03-03 08:00:00  500.0         8                          NaN     NaN
2025-03-03 09:00:00  600.0         9        Great samosas, quick!  Samosa
2025-03-03 10:00:00  500.0        10       Samosas good, bit slow  Samosa
2025-03-03 11:00:00  400.0        11                          NaN     NaN
2025-03-04 07:00:00  150.0         7             Rainy, chai okay    Chai
2025-03-04 08:00:00  550.0         8    Chai decent, fast service    Chai
2025-03-04 09:00:00  650.0         9   Samosas amazing, busy vibe  Samosa
2025-03-04 10:00:00  550.0        10                          NaN     NaN
2025-03-04 11:00:00  450.0        11                          NaN     NaN
2025-03-05 09:00:00  640.0         9           Best samosas ever!  Samosa
2025-03-05 10:00:00  540.0        10                          NaN     NaN
2025-03-05 11:00:00  440.0        11                          NaN     NaN
Reviews are sparse: only 7 of the 13 hours have one. NLP extracts sentiment from those. Day 36: Data Odyssey processes this.
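To make the sparsity concrete, here is a minimal sketch that repeats the left merge on a cut-down copy of the data and counts how many hours actually received a review. The frame name `sales` and the use of `indicator=True` are assumptions for illustration; `Datetime` is kept as an ordinary column in both frames so the merge key is unambiguous.

```python
import pandas as pd

# Hourly timestamps and sales copied from the recap table
sales = pd.DataFrame({
    "Datetime": pd.to_datetime([
        "2025-03-03 07:00", "2025-03-03 08:00", "2025-03-03 09:00",
        "2025-03-03 10:00", "2025-03-03 11:00", "2025-03-04 07:00",
        "2025-03-04 08:00", "2025-03-04 09:00", "2025-03-04 10:00",
        "2025-03-04 11:00", "2025-03-05 09:00", "2025-03-05 10:00",
        "2025-03-05 11:00",
    ]),
    "Sales": [200.0, 500.0, 600.0, 500.0, 400.0, 150.0, 550.0,
              650.0, 550.0, 450.0, 640.0, 540.0, 440.0],
})

# The 7 simulated reviews (only the key and the item are needed here)
reviews = pd.DataFrame({
    "Datetime": pd.to_datetime([
        "2025-03-03 07:00", "2025-03-03 09:00", "2025-03-04 07:00",
        "2025-03-04 09:00", "2025-03-05 09:00", "2025-03-03 10:00",
        "2025-03-04 08:00",
    ]),
    "Item": ["Chai", "Samosa", "Chai", "Samosa", "Samosa", "Samosa", "Chai"],
})

# how="left" keeps every sales hour; indicator=True adds a "_merge" column
# that says whether a matching review row was found
merged = sales.merge(reviews, on="Datetime", how="left", indicator=True)
reviewed = int((merged["_merge"] == "both").sum())
coverage = reviewed / len(merged)
print(f"{reviewed} of {len(merged)} hours have a review ({coverage:.0%})")
# prints: 7 of 13 hours have a review (54%)
```

The six unmatched hours are the ones that show NaN for Review and Item.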
NLP Basics
Steps for sentiment:
- Preprocess Text:
- Lowercase, remove punctuation, tokenize.
- Vectorize:
- Convert to numbers—TF-IDF or embeddings.
- Classify Sentiment:
- Positive (“Great samosas!”) vs. negative (“Cold chai”).
7 reviews suit simple NLP—Day 12’s 35 rows scale to embeddings. Day 36: Data Odyssey analyzes this.
Sentiment Analysis
Use VADER (simple, rule-based):
import pandas as pd
from nltk.sentiment.vader import SentimentIntensityAnalyzer
import nltk
nltk.download("vader_lexicon")
# Initialize
sia = SentimentIntensityAnalyzer()
# Score reviews
data_full["Sentiment"] = data_full["Review"].apply(lambda x: sia.polarity_scores(x)["compound"] if pd.notna(x) else 0)
print(data_full[["Sales", "Hour_Num", "Review", "Item", "Sentiment"]])
Output:
                     Sales  Hour_Num                       Review    Item  Sentiment
2025-03-03 07:00:00  200.0         7  Chai was cold, slow service    Chai    -0.4767
2025-03-03 08:00:00  500.0         8                          NaN     NaN     0.0000
2025-03-03 09:00:00  600.0         9        Great samosas, quick!  Samosa     0.6588
2025-03-03 10:00:00  500.0        10       Samosas good, bit slow  Samosa     0.4404
2025-03-03 11:00:00  400.0        11                          NaN     NaN     0.0000
2025-03-04 07:00:00  150.0         7             Rainy, chai okay    Chai     0.2263
2025-03-04 08:00:00  550.0         8    Chai decent, fast service    Chai     0.5719
2025-03-04 09:00:00  650.0         9   Samosas amazing, busy vibe  Samosa     0.5859
2025-03-04 10:00:00  550.0        10                          NaN     NaN     0.0000
2025-03-04 11:00:00  450.0        11                          NaN     NaN     0.0000
2025-03-05 09:00:00  640.0         9           Best samosas ever!  Samosa     0.6369
2025-03-05 10:00:00  540.0        10                          NaN     NaN     0.0000
2025-03-05 11:00:00  440.0        11                          NaN     NaN     0.0000
- Samosa: Consistently positive (0.44 to 0.66), so stock 40!
- Chai: Mixed (-0.48 to 0.57). Improve the 7 AM chai?
Day 36: Data Odyssey scores this.
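The per-item reading above can be checked with a quick groupby. This sketch copies the seven compound scores from the output table into a small frame (the name `scored` is an assumption for illustration) and summarizes them by item.

```python
import pandas as pd

# Compound scores copied from the output above, reviewed hours only
scored = pd.DataFrame({
    "Item": ["Chai", "Samosa", "Samosa", "Chai", "Chai", "Samosa", "Samosa"],
    "Sentiment": [-0.4767, 0.6588, 0.4404, 0.2263, 0.5719, 0.5859, 0.6369],
})

# Count, mean, and range of sentiment per item
summary = (
    scored.groupby("Item")["Sentiment"]
    .agg(["count", "mean", "min", "max"])
    .round(3)
)
print(summary)
```

Samosa averages about 0.58 across 4 reviews, while chai averages about 0.11 across 3, pulled down by the cold 7 AM cup.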
Enhance Regression
Add Sentiment as feature:
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor, StackingRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error
# Safeguard: Sentiment is already 0 for hours without a review, so this is a no-op here
data_full["Sentiment"] = data_full["Sentiment"].fillna(0)
# Split
X = data_full[["Hour_Num", "Item_Code", "Weather_Rainy", "Rush_Hour", "Weekday", "Sales_Lag", "Sentiment"]]
y = data_full["Sales"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
# Stack
estimators = [
("rf", RandomForestRegressor(n_estimators=20, max_depth=3, random_state=42)),
("gb", GradientBoostingRegressor(n_estimators=20, max_depth=2, random_state=42))
]
stack = StackingRegressor(estimators=estimators, final_estimator=LinearRegression())
stack.fit(X_train, y_train)
y_pred = stack.predict(X_test)
print("Stacking MAE:", mean_absolute_error(y_test, y_pred))
Output: Stacking MAE: 3.7, slightly better than ₹3.8 (Day 35). Sentiment appears to help, though a ₹0.1 gain measured on a 5-row test set is too small to be conclusive. Day 36: Data Odyssey predicts this.
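With only 13 rows, a single train/test split is a noisy way to judge one extra feature. One option is leave-one-out cross-validation, sketched below under a few assumptions: a single `RandomForestRegressor` stands in for the full stack to keep the example short, only three of the original features are used, and `loo_mae` is a hypothetical helper. The numbers it prints will not match the stacked MAE above.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import LeaveOneOut, cross_val_score

# Values copied from the tables above (13 hourly rows)
data = pd.DataFrame({
    "Hour_Num": [7, 8, 9, 10, 11, 7, 8, 9, 10, 11, 9, 10, 11],
    "Rush_Hour": [0, 1, 1, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0],
    "Sales_Lag": [0.0, 200.0, 500.0, 600.0, 500.0, 600.0, 150.0,
                  550.0, 650.0, 550.0, 650.0, 640.0, 540.0],
    "Sentiment": [-0.4767, 0.0, 0.6588, 0.4404, 0.0, 0.2263, 0.5719,
                  0.5859, 0.0, 0.0, 0.6369, 0.0, 0.0],
    "Sales": [200.0, 500.0, 600.0, 500.0, 400.0, 150.0, 550.0,
              650.0, 550.0, 450.0, 640.0, 540.0, 440.0],
})

def loo_mae(features):
    """Mean absolute error when each row is held out once and predicted."""
    model = RandomForestRegressor(n_estimators=20, max_depth=3, random_state=42)
    scores = cross_val_score(
        model, data[features], data["Sales"],
        cv=LeaveOneOut(), scoring="neg_mean_absolute_error",
    )
    return -scores.mean()  # scikit-learn reports negated MAE

base = ["Hour_Num", "Rush_Hour", "Sales_Lag"]
mae_without = loo_mae(base)
mae_with = loo_mae(base + ["Sentiment"])
print(f"LOO MAE without Sentiment: {mae_without:.1f}")
print(f"LOO MAE with Sentiment:    {mae_with:.1f}")
```

If the two figures are close, the honest conclusion is that 7 reviews are not yet enough to tell whether Sentiment helps the forecast.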
Classifier
With Sentiment:
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
y = data_full["Label"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
estimators = [
("rf", RandomForestClassifier(n_estimators=10, max_depth=2, class_weight="balanced", random_state=42)),
("gb", GradientBoostingClassifier(n_estimators=10, max_depth=2, random_state=42))
]
stack = StackingClassifier(estimators=estimators, final_estimator=LogisticRegression())
stack.fit(X_train, y_train)
y_pred = stack.predict(X_test)
print(classification_report(y_test, y_pred))
Output:
              precision    recall  f1-score   support
        Busy       1.00      0.75      0.86         4
        Slow       0.50      1.00      0.67         1
    accuracy                           0.80         5
Same as Day 35—Sentiment doesn’t lift classifier. Day 36: Data Odyssey tests this.
Thursday 9 AM
With Sentiment:
new_data = pd.DataFrame({
"Hour_Num": [9],
"Item_Code": [1],
"Weather_Rainy": [0],
"Rush_Hour": [1],
"Weekday": [1],
"Sales_Lag": [640],
"Sentiment": [0.6]
}, columns=X.columns)
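# `stack`, `estimators`, and `y_train` now belong to the classifier, so rebuild
# the regression stack and refit it on the sales values for the same training rows
reg_stack = StackingRegressor(
    estimators=[
        ("rf", RandomForestRegressor(n_estimators=20, max_depth=3, random_state=42)),
        ("gb", GradientBoostingRegressor(n_estimators=20, max_depth=2, random_state=42)),
    ],
    final_estimator=LinearRegression(),
)
reg_stack.fit(X_train, data_full.loc[X_train.index, "Sales"])
pred = reg_stack.predict(new_data)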
print("Thursday 9 AM Sales:", pred[0])
Output: 641—“Busy,” 39 samosas. Sentiment aligns with samosa love. Day 36: Data Odyssey predicts this.
Why NLP?
- Insights: Samosas shine—stock 40; chai weak—fix 7 AM.
- Features: Sentiment lowers MAE from ₹3.8 to ₹3.7.
- Scale: 35 rows (Day 12)—more reviews, deeper NLP.
Enhances ₹632.5 (Day 25), clusters (Day 28)—customer-driven. Day 36: Data Odyssey listens to this.
Real-World NLP
India’s social media NLP tracks crop sentiment—farmers plan. Amazon analyzes reviews—stock adjusts. Priya’s NLP is her café’s ear—small, sharp. Day 36: Data Odyssey mirrors this.
Challenges
- Sparse Reviews: 7—more needed.
- VADER: Simple—BERT for 35 rows?
- Noise: “Okay” chai—neutral or negative?
More data—Priya scales. Day 36: Data Odyssey flags this.
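The "okay" question comes down to where the cut-offs sit. VADER's authors suggest treating a compound score of 0.05 or above as positive and -0.05 or below as negative; the sketch below applies that convention and then a stricter, purely illustrative threshold of 0.3. The function name `label_sentiment` is an assumption.

```python
def label_sentiment(compound, threshold=0.05):
    """Map a VADER compound score to a coarse label using symmetric cut-offs."""
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"

okay_chai = 0.2263  # compound score for "Rainy, chai okay" from the output above

print(label_sentiment(okay_chai))                  # positive
print(label_sentiment(okay_chai, threshold=0.3))   # neutral
print(label_sentiment(-0.4767))                    # negative ("Chai was cold, slow service")
```

Under the default cut-off the lukewarm review counts as positive, so Priya may want a wider neutral band before treating "okay" as praise.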
Why This Matters
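NLP reveals the samosa love behind the 39-samosa plan, points to chai fixes, and adds context to the ₹641 forecast. Without it, the cause of the ₹150 hour stays hidden; with it, Priya can act on what customers actually say. At scale, the same idea applies to tracking public health trends from text. Day 36: Data Odyssey hears her.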
Recap Summary
Yesterday, Day 35: Data Odyssey imputed 10-11 AM sales (₹540 at 10 AM) with MAE ₹3.8. Today, Day 36: Data Odyssey used NLP: Sentiment lowered MAE to ₹3.7, and samosas shine. It’s her listen step.
What’s Next
Tomorrow, in Day 37: Data Odyssey – What is Computer Vision?, we’ll see: Can Priya count customers via cameras? Busy hours? We’ll explore computer vision, adding visuals. Bring your curiosity, and I’ll see you there!