DE101-PD05 — Performance & Production Patterns (From Pandas to Scale)
~90–120 minutes
🎯 Why This Chapter Matters
Most Pandas tutorials stop at:
“Here is how it works”
Real engineers must ask:
“Will this survive production?”
Performance mistakes silently:
- Waste memory
- Slow pipelines
- Break models
- Crash jobs
This chapter teaches you how professionals think.
🧠 Mental Model: Pandas Is Vectorized Python
Pandas is fast only when you let it operate on whole columns.
❌ Bad:

```python
for i in range(len(df)):
    df.loc[i, "x"] += 1
```

✅ Good:

```python
df["x"] += 1
```
Vectorization = speed + clarity.
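You can feel the gap by timing both versions yourself. A minimal sketch (the 100,000-row frame and column name are arbitrary choices for illustration):

```python
import time

import numpy as np
import pandas as pd

df = pd.DataFrame({"x": np.random.rand(100_000)})

# Row-by-row loop: every iteration crosses the Python/C boundary.
start = time.perf_counter()
for i in range(len(df)):
    df.loc[i, "x"] += 1
print(f"loop:       {time.perf_counter() - start:.3f}s")

# Vectorized: one call, the whole column is updated in C.
start = time.perf_counter()
df["x"] += 1
print(f"vectorized: {time.perf_counter() - start:.3f}s")
```

Expect a gap of several orders of magnitude on any recent machine.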
🚫 Avoid Python Loops (Unless Necessary)
Loops move work from C → Python.
| Approach | Speed |
|---|---|
| Python loop | ❌ slow |
| `.apply()` | ⚠️ medium |
| Vectorized ops | ✅ fastest |
🧪 Example: Conditional Logic
❌ Slow:

```python
df["flag"] = df["value"].apply(lambda x: 1 if x > 0 else 0)
```

✅ Fast:

```python
df["flag"] = (df["value"] > 0).astype(int)
```
🧠 Use the Right Data Types
Wrong dtypes waste memory.
Before converting, check the current footprint:

```python
df.info()
```

Convert low-cardinality string columns to categories:

```python
df["country"] = df["country"].astype("category")
df["device"] = df["device"].astype("category")
```

When a column has few unique values relative to its row count, this can cut memory 10×–100×.
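Worth verifying on your own data before relying on it. A minimal sketch (the synthetic `country` column stands in for real data):

```python
import pandas as pd

# Low cardinality: 3 unique values across 300,000 rows.
df = pd.DataFrame({"country": ["US", "DE", "IN"] * 100_000})

before = df["country"].memory_usage(deep=True)
df["country"] = df["country"].astype("category")
after = df["country"].memory_usage(deep=True)
print(f"{before / after:.0f}x smaller")
```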
🧠 Numeric Downcasting
df["age"] = pd.to_numeric(df["age"], downcast="integer")
df["revenue"] = pd.to_numeric(df["revenue"], downcast="float")
Used in large ETL pipelines.
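To apply the same idea across a whole frame, a small helper can walk every numeric column. A sketch; `downcast_numerics` is our own name, not a Pandas API:

```python
import numpy as np
import pandas as pd

def downcast_numerics(df: pd.DataFrame) -> pd.DataFrame:
    """Downcast every numeric column to the smallest safe dtype (hypothetical helper)."""
    out = df.copy()
    for col in out.select_dtypes(include=[np.integer]).columns:
        out[col] = pd.to_numeric(out[col], downcast="integer")
    for col in out.select_dtypes(include=[np.floating]).columns:
        out[col] = pd.to_numeric(out[col], downcast="float")
    return out
```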
📦 Memory Profiling
```python
# total footprint in MB; deep=True also counts Python string contents
df.memory_usage(deep=True).sum() / 1024**2
```
Always measure before optimizing.
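Dropping the `.sum()` shows where the bytes actually live, which tells you which columns to attack first:

```python
# per-column breakdown, largest consumers first (in bytes)
print(df.memory_usage(deep=True).sort_values(ascending=False))
```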
🧠 Copy vs View (Danger Zone)
df2 = df[df["age"] > 18]
df2["flag"] = 1 # ⚠️ SettingWithCopyWarning
Safer:
df2 = df.loc[df["age"] > 18].copy()
df2["flag"] = 1
Silent bugs happen here.
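If you are on pandas 2.x, enabling copy-on-write removes this class of bug: every selection behaves like a copy, and chained assignment fails loudly instead of silently. A minimal sketch (copy-on-write becomes the default in pandas 3.0):

```python
import pandas as pd

pd.set_option("mode.copy_on_write", True)  # pandas 2.x opt-in; default in 3.0

df = pd.DataFrame({"age": [15, 22, 34]})
df2 = df[df["age"] > 18]
df2["flag"] = 1  # safe: under copy-on-write, df2 never aliases df
```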
⚙️ Chunk Processing (Large Files)
```python
chunks = pd.read_csv("large.csv", chunksize=100_000)  # an iterator, not a DataFrame
for chunk in chunks:
    process(chunk)
```
Used when data barely fits in memory.
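A fuller sketch that aggregates across chunks; the filename, column names, and the revenue-per-country metric are assumptions for illustration:

```python
import pandas as pd

# Only one 100k-row chunk is in memory at a time; totals accumulate across chunks.
totals = {}
for chunk in pd.read_csv("large.csv", chunksize=100_000):
    for country, revenue in chunk.groupby("country")["revenue"].sum().items():
        totals[country] = totals.get(country, 0.0) + revenue

result = pd.Series(totals).sort_values(ascending=False)
print(result.head())
```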
🧠 Efficient Joins
- Join on indexed columns
- Avoid object dtype keys
- Filter before joining (sketch below)

```python
users = users.set_index("user_id")
orders = orders.set_index("user_id")
users.join(orders, how="left")
```
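Filtering first shrinks both sides before the join does any work. A sketch with tiny hypothetical `users` and `orders` frames (all column names are invented):

```python
import pandas as pd

users = pd.DataFrame({"user_id": [1, 2, 3], "is_active": [True, True, False]})
orders = pd.DataFrame({
    "user_id": [1, 1, 2, 3],
    "order_date": pd.to_datetime(
        ["2024-02-01", "2023-05-01", "2024-03-10", "2023-12-01"]),
})

# Reduce each side first, then join on the indexed key.
active_users = users[users["is_active"]]
recent_orders = orders[orders["order_date"] >= "2024-01-01"]

result = active_users.set_index("user_id").join(
    recent_orders.set_index("user_id"), how="inner"
)
```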
🧠 Sorting Is Expensive
```python
df.sort_values("timestamp")
```

Sorting is O(n log n); avoid repeated sorts. Sort once and reuse the result.
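A common trap is sorting the whole frame just to take the top rows; `nlargest` avoids the full sort. A minimal sketch:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"revenue": np.random.rand(1_000_000)})

top = df.nlargest(10, "revenue")  # partial selection, no full O(n log n) sort
# equivalent but slower:
# top = df.sort_values("revenue", ascending=False).head(10)
```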
🧠 Use query() for Readability
```python
df.query("country == 'US' and revenue > 100")
```

Usually clearer than chained boolean masks, and with `numexpr` installed it can also be faster on large frames.
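`query` also interpolates local variables with `@`, which keeps thresholds out of string literals. A minimal sketch:

```python
import pandas as pd

df = pd.DataFrame({"country": ["US", "DE"], "revenue": [250, 80]})

min_revenue = 100
result = df.query("country == 'US' and revenue > @min_revenue")
```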
🧪 Profiling Runtime
```python
%timeit df["revenue"].sum()
```
In production:
- Measure
- Compare
- Decide
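`%timeit` only exists in IPython/Jupyter; in a plain production script, the standard-library `timeit` module measures the same thing. A minimal sketch:

```python
import timeit

import numpy as np
import pandas as pd

df = pd.DataFrame({"revenue": np.random.rand(1_000_000)})

# Mean wall time over 100 calls, no notebook magic needed.
seconds = timeit.timeit(lambda: df["revenue"].sum(), number=100) / 100
print(f"{seconds * 1e3:.3f} ms per call")
```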
🧠 Parallelism: Pandas Is Mostly Single-Threaded
Workarounds:
- Split data
- Use multiprocessing (sketch below)
- Or switch tools
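A minimal multiprocessing sketch: split the rows, run a CPU-bound function per piece, and concatenate. The `score` function and its column names are hypothetical:

```python
import multiprocessing as mp

import numpy as np
import pandas as pd

def score(part: pd.DataFrame) -> pd.DataFrame:
    # Hypothetical CPU-bound, row-independent transformation.
    part = part.copy()
    part["score"] = np.sqrt(part["value"]) * 2
    return part

if __name__ == "__main__":
    df = pd.DataFrame({"value": np.random.rand(1_000_000)})

    # Contiguous row slices, one per CPU core.
    n = mp.cpu_count()
    size = -(-len(df) // n)  # ceiling division
    parts = [df.iloc[i * size:(i + 1) * size] for i in range(n)]

    with mp.Pool() as pool:
        df = pd.concat(pool.map(score, parts))
```

Each piece is pickled to a worker process, so this only pays off when the per-row work genuinely dominates the copying.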
🚀 When Pandas Is Not Enough
When does Pandas become slow?
When data no longer fits in memory.
At this point, professionals move to:
| Tool | When |
|---|---|
| Dask | Larger-than-memory Pandas |
| Spark | Distributed clusters |
| DuckDB | Fast SQL analytics |
| Polars | Modern fast DataFrame |
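As a taste of the handoff, DuckDB can run SQL directly over an in-memory Pandas DataFrame. A minimal sketch, assuming `duckdb` is installed (`pip install duckdb`):

```python
import duckdb
import pandas as pd

df = pd.DataFrame({"country": ["US", "DE", "US"], "revenue": [250, 80, 120]})

# DuckDB resolves local DataFrames by name inside the SQL string.
result = duckdb.sql(
    "SELECT country, SUM(revenue) AS total FROM df GROUP BY country"
).df()
```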
🧠 Google / Meta / OpenAI Pattern
- Prototype in Pandas
- Validate logic
- Optimize
- Scale to Spark / SQL
- Monitor in production
Same logic, different engine.
🧪 Real Interview Question
“Your Pandas pipeline is slow. What do you do?”
Expected answer:
- Measure memory
- Check dtypes
- Remove loops
- Vectorize
- Reduce joins
- Scale tool if needed
🏁 Golden Rules
- Measure before optimizing
- Avoid Python loops
- Control memory
- Validate joins
- Think about scale early
🚀 Final Thought
Pandas is not slow. Bad usage is slow.
Engineers who understand performance write code that survives real systems.
🔜 Next Chapter
👉 DE101-PD06 — Time Series, Rolling Windows & Feature Engineering
This is where ML & analytics meet.