DE101-PD05 — Performance & Production Patterns (From Pandas to Scale)

~90–120 minutes

🎯 Why This Chapter Matters

Most Pandas tutorials stop at:

“Here is how it works”

Real engineers must ask:

“Will this survive production?”

Performance mistakes silently:

  • Waste memory
  • Slow pipelines
  • Break models
  • Crash jobs

This chapter teaches you how professionals think.


🧠 Mental Model: Pandas Is Vectorized Python

Pandas is fast only when you let it operate on whole columns.

❌ Bad:

for i in range(len(df)):
    df.loc[i, "x"] += 1

✅ Good:

df["x"] += 1

Vectorization = speed + clarity.
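A minimal timing sketch to see the gap for yourself (the sizes and column name are illustrative):

import timeit

import numpy as np
import pandas as pd

df = pd.DataFrame({"x": np.random.randn(10_000)})

def loop_add():
    for i in range(len(df)):
        df.loc[i, "x"] += 1   # one slow indexed assignment per row

def vectorized_add():
    df["x"] += 1              # a single C-level operation on the whole column

print(timeit.timeit(vectorized_add, number=10))  # fractions of a millisecond per run
print(timeit.timeit(loop_add, number=1))         # orders of magnitude slower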


🚫 Avoid Python Loops (Unless Necessary)

Loops move work from C → Python.

Approach          Speed
Python loop       ❌ slow
.apply()          ⚠️ medium
Vectorized ops    ✅ fastest

🧪 Example: Conditional Logic

❌ Slow:

df["flag"] = df["value"].apply(lambda x: 1 if x > 0 else 0)

✅ Fast:

df["flag"] = (df["value"] > 0).astype(int)

🧠 Use the Right Data Types

Wrong dtypes waste memory.

Before

df.info()

Convert strings to categories

df["country"] = df["country"].astype("category")
df["device"] = df["device"].astype("category")

For low-cardinality columns (few unique values relative to the row count), this can cut memory by 10× or more.
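A quick self-contained check of the effect (the column values are illustrative):

import pandas as pd

df = pd.DataFrame({"country": ["US", "DE", "US", "IN"] * 250_000})
before = df.memory_usage(deep=True).sum()
df["country"] = df["country"].astype("category")
after = df.memory_usage(deep=True).sum()
print(f"{before / after:.0f}x smaller")  # big savings for low-cardinality columns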


🧠 Numeric Downcasting

df["age"] = pd.to_numeric(df["age"], downcast="integer")
df["revenue"] = pd.to_numeric(df["revenue"], downcast="float")

Downcasting picks the smallest dtype that can hold the current values; it is a standard trick in large ETL pipelines. Just make sure future values will still fit in the smaller type.
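A minimal sketch of what downcasting actually does:

import pandas as pd

s = pd.Series([1, 2, 3], dtype="int64")
print(pd.to_numeric(s, downcast="integer").dtype)  # int8: the smallest integer type that fits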


📦 Memory Profiling

df.memory_usage(deep=True).sum() / 1024**2

Always measure before optimizing.
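A per-column breakdown shows where the memory actually goes:

df.memory_usage(deep=True).sort_values(ascending=False)  # bytes per column, largest first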


🧠 Copy vs View (Danger Zone)

df2 = df[df["age"] > 18]
df2["flag"] = 1  # ⚠️ SettingWithCopyWarning

Safer:

df2 = df.loc[df["age"] > 18].copy()
df2["flag"] = 1

Whether the assignment in the first version reaches the original df is ambiguous; this is exactly where silent bugs happen.
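On pandas 2.x you can also opt in to copy-on-write, which removes this ambiguity entirely (it becomes the default behavior in pandas 3.0). A sketch, reusing the frame from above:

import pandas as pd

pd.options.mode.copy_on_write = True  # pandas >= 2.0

df2 = df[df["age"] > 18]
df2["flag"] = 1  # always modifies df2 only, never df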


⚙️ Chunk Processing (Large Files)

chunks = pd.read_csv("large.csv", chunksize=100_000)

for chunk in chunks:
    process(chunk)

Used when the data is too large to load into memory all at once.
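A common pattern: aggregate per chunk and combine at the end, so only the running total lives in memory (file and column names are illustrative):

import pandas as pd

total = 0.0
for chunk in pd.read_csv("large.csv", chunksize=100_000):
    total += chunk["revenue"].sum()  # keep only the running aggregate, discard the chunk
print(total)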


🧠 Efficient Joins

  • Join on indexed columns
  • Avoid object dtype keys
  • Filter before join (see the sketch below)

Joining on a shared index is typically faster than merging on object-dtype key columns:

users = users.set_index("user_id")
orders = orders.set_index("user_id")

users.join(orders, how="left")
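Filtering before the join shrinks both sides early. A sketch continuing the frames above (the amount column and the filter are hypothetical):

recent_orders = orders[orders["amount"] > 0]  # hypothetical filter: drop rows the report never needs
report = users.join(recent_orders, how="left")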

🧠 Sorting Is Expensive

df.sort_values("timestamp")

Sorting is O(n log n). Sort once and reuse the result instead of re-sorting at every step.
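A sketch of the sort-once pattern (the column names are illustrative):

df = df.sort_values("timestamp", ignore_index=True)  # pay the O(n log n) cost once

latest = df.tail(10)                            # most recent events
first_per_user = df.groupby("user_id").head(1)  # earliest row per user, order already guaranteed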


🧠 Use query() for Readability

df.query("country == 'US' and revenue > 100")

Often clearer than long chained boolean masks; with the numexpr package installed it can also be faster on large frames.
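The boolean-mask equivalent, for comparison:

mask = (df["country"] == "US") & (df["revenue"] > 100)
df[mask]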


🧪 Profiling Runtime

%timeit df["revenue"].sum()

In production:

  • Measure
  • Compare
  • Decide

🧠 Parallelism: Pandas Is Mostly Single-Threaded

Workarounds:

  • Split data
  • Use multiprocessing (see the sketch below)
  • Or switch tools
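A minimal multiprocessing sketch: split the frame, transform partitions in worker processes, concatenate the results (the per-partition work here is illustrative):

import numpy as np
import pandas as pd
from multiprocessing import Pool

def transform(part: pd.DataFrame) -> pd.DataFrame:
    part = part.copy()
    part["flag"] = (part["value"] > 0).astype(int)  # illustrative per-partition work
    return part

if __name__ == "__main__":
    df = pd.DataFrame({"value": np.random.randn(1_000_000)})
    parts = np.array_split(df, 8)  # 8 partitions, one per worker
    with Pool(processes=8) as pool:
        df = pd.concat(pool.map(transform, parts), ignore_index=True)

This only pays off when the per-partition work outweighs the cost of pickling data between processes.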

🚀 When Pandas Is Not Enough

When does Pandas become slow?

When the data no longer fits in one machine's memory, or a single process can no longer keep up.

At this point, professionals move to:

Tool      When
Dask      Larger-than-memory data, Pandas-like API
Spark     Distributed clusters
DuckDB    Fast SQL analytics
Polars    Modern, fast DataFrame library
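As a taste, DuckDB can run SQL directly against an in-memory Pandas DataFrame (a sketch, assuming the duckdb package is installed; DuckDB finds the local variable df by name):

import duckdb
import pandas as pd

df = pd.DataFrame({"country": ["US", "US", "DE"], "revenue": [120, 80, 200]})

result = duckdb.query("SELECT country, SUM(revenue) AS rev FROM df GROUP BY country").df()
print(result)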

🧠 Google / Meta / OpenAI Pattern

  1. Prototype in Pandas
  2. Validate logic
  3. Optimize
  4. Scale to Spark / SQL
  5. Monitor in production

Same logic, different engine.


🧪 Real Interview Question

“Your Pandas pipeline is slow. What do you do?”

Expected answer:

  • Measure memory
  • Check dtypes
  • Remove loops
  • Vectorize
  • Reduce joins
  • Scale tool if needed

🏁 Golden Rules

  • Measure before optimizing
  • Avoid Python loops
  • Control memory
  • Validate joins
  • Think about scale early

🚀 Final Thought

Pandas is not slow. Bad usage is slow.

Engineers who understand performance write code that survives real systems.


🔜 Next Chapter

👉 DE101-PD06 — Time Series, Rolling Windows & Feature Engineering

This is where ML & analytics meet.
