DE101-PD02 — Pandas Series, DataFrame, and Indexing (Real-World Mindset)
~45–60 minutes
🎯 Why Pandas Matters in the Real World
At companies like Google, Meta, and OpenAI, Pandas is often used for:
- Data exploration (EDA)
- Debugging datasets before training models
- Analyzing experiments / A-B tests
- Cleaning logs and user behavior data
- Prototyping ideas before moving to Spark / BigQuery
Pandas is not about speed.
It is about thinking clearly with data.
🧱 Core Pandas Data Structures
Pandas has two fundamental objects:
- Series → 1D labeled array
- DataFrame → 2D table (rows + columns)
📌 Pandas Series
Think of a Series as:
- A column in a table
- A vector with labels
import pandas as pd
goals = pd.Series([91, 85, 70], index=["Messi", "Ronaldo", "Mbappe"])
print(goals)
Output
Messi      91
Ronaldo    85
Mbappe     70
dtype: int64
Key Ideas
- Series = data + index
- Index gives meaning, not just position
- Used heavily in statistics and ML features
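To make "index gives meaning" concrete, here is a minimal sketch of label-based access and label alignment. The `bonus` Series is a made-up example, not part of the lesson's data.

```python
import pandas as pd

goals = pd.Series([91, 85, 70], index=["Messi", "Ronaldo", "Mbappe"])

# Label-based access: the index label identifies the value.
messi_goals = goals["Messi"]

# Position-based access is still available through .iloc.
first_value = goals.iloc[0]

# Hypothetical second Series, deliberately in a different order.
bonus = pd.Series([1, 2, 3], index=["Mbappe", "Messi", "Ronaldo"])

# Arithmetic aligns on labels, not on position.
total = goals + bonus

print(messi_goals)      # 91
print(total["Messi"])   # 93
print(total["Mbappe"])  # 71
```

Because the addition matches values by label, the different ordering of `bonus` does not affect the result.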
📊 Pandas DataFrame
A DataFrame is like an Excel table or SQL table.
Let’s simulate a FIFA World Cup dataset:
data = {
    "player": ["Messi", "Mbappe", "Neymar", "Ronaldo"],
    "country": ["Argentina", "France", "Brazil", "Portugal"],
    "goals": [7, 8, 6, 1],
    "age": [35, 23, 30, 37]
}
df = pd.DataFrame(data)
print(df)
Output
    player    country  goals  age
0    Messi  Argentina      7   35
1   Mbappe     France      8   23
2   Neymar     Brazil      6   30
3  Ronaldo   Portugal      1   37
🧠 How Engineers Think About DataFrames
At big tech companies, engineers think:
- Rows → observations (players, users, events)
- Columns → features (age, goals, clicks)
- Index → identity or order
Bad indexing = confusing analysis.
🔍 Column Selection (Most Common Operation)
df["player"]
df["goals"]
Each of these expressions returns a single column as a Series.
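A short sketch of how the bracket style changes the return type. It rebuilds the same World Cup table so it runs on its own.

```python
import pandas as pd

df = pd.DataFrame({
    "player": ["Messi", "Mbappe", "Neymar", "Ronaldo"],
    "country": ["Argentina", "France", "Brazil", "Portugal"],
    "goals": [7, 8, 6, 1],
    "age": [35, 23, 30, 37],
})

goals_series = df["goals"]        # single name -> Series
goals_frame = df[["goals"]]       # list with one name -> DataFrame
subset = df[["player", "goals"]]  # list of names -> DataFrame

print(type(goals_series).__name__)  # Series
print(type(goals_frame).__name__)   # DataFrame
print(subset.shape)                 # (4, 2)
```

Single brackets with one name give a Series; a list of names, even a list of one, gives a DataFrame.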
📍 .loc — Label-Based Indexing (Human Thinking)
Use .loc when:
- You care about meaning
- You think in terms of labels
# All rows, specific columns
df.loc[:, ["player", "goals"]]
# Filter rows by condition
df.loc[df["country"] == "Argentina"]
Example: Find Messi’s record
df.loc[df["player"] == "Messi"]
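.loc also accepts a row condition and a column selection in one call. The sketch below, using the same table, narrows Messi's record down to selected columns and then to a single value.

```python
import pandas as pd

df = pd.DataFrame({
    "player": ["Messi", "Mbappe", "Neymar", "Ronaldo"],
    "country": ["Argentina", "France", "Brazil", "Portugal"],
    "goals": [7, 8, 6, 1],
    "age": [35, 23, 30, 37],
})

is_messi = df["player"] == "Messi"

# Rows matching the condition, restricted to two columns (a DataFrame).
messi_record = df.loc[is_messi, ["player", "goals"]]

# Rows matching the condition, one column (a Series); take its first value.
messi_goals = df.loc[is_messi, "goals"].iloc[0]

print(messi_record)
print(messi_goals)  # 7
```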
📐 .iloc — Position-Based Indexing (Machine Thinking)
Use .iloc when:
- You care about position
- You don’t trust index labels
# First row
df.iloc[0]
# First two rows, first two columns
df.iloc[0:2, 0:2]
Think of .iloc like Python list slicing: positions start at 0 and the end of a slice is excluded.
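One detail worth seeing in code: an .iloc slice excludes its end position, while a .loc slice includes its end label. A minimal sketch on the same table:

```python
import pandas as pd

df = pd.DataFrame({
    "player": ["Messi", "Mbappe", "Neymar", "Ronaldo"],
    "country": ["Argentina", "France", "Brazil", "Portugal"],
    "goals": [7, 8, 6, 1],
    "age": [35, 23, 30, 37],
})

by_position = df.iloc[0:2]  # positions 0 and 1 (end excluded)
by_label = df.loc[0:2]      # labels 0, 1, and 2 (end included)
last_row = df.iloc[-1]      # negative positions count from the end

print(len(by_position))    # 2
print(len(by_label))       # 3
print(last_row["player"])  # Ronaldo
```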
⚠️ .loc vs .iloc (Interview Favorite)
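| Method | Based on | Example |
|---|---|---|
| .loc | labels | df.loc[0] → the row whose index label is 0 |
| .iloc | positions | df.iloc[0] → the first row, whatever its label |
With the default 0, 1, 2, ... index, both return the same row, which hides the difference. Once the index changes (for example after sorting or set_index), df.loc[0] may return a different row or raise a KeyError, while df.iloc[0] always returns the first row.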
Interviewers love asking this.
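A sketch of the kind of case an interviewer might probe: once rows are reordered or the index is replaced, label 0 and position 0 stop meaning the same thing.

```python
import pandas as pd

df = pd.DataFrame({
    "player": ["Messi", "Mbappe", "Neymar", "Ronaldo"],
    "country": ["Argentina", "France", "Brazil", "Portugal"],
    "goals": [7, 8, 6, 1],
    "age": [35, 23, 30, 37],
})

# Sorting reorders the rows but keeps their original index labels.
by_goals = df.sort_values("goals", ascending=False)

label_zero = by_goals.loc[0, "player"]      # row whose label is 0
position_zero = by_goals.iloc[0]["player"]  # row in the first position

print(label_zero)     # Messi
print(position_zero)  # Mbappe

# With player names as the index, the label 0 no longer exists.
indexed = df.set_index("player")
try:
    indexed.loc[0]
    loc_zero_found = True
except KeyError:
    loc_zero_found = False

print(loc_zero_found)  # False
```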
🔎 Boolean Filtering (SQL WHERE Equivalent)
# Players older than 30
df[df["age"] > 30]
Real-world usage:
- Filter active users
- Find failed experiments
- Analyze edge cases
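Under the hood, the filter is a two-step process: the comparison builds a boolean Series (a mask), and indexing with that mask keeps the rows where it is True. A minimal sketch:

```python
import pandas as pd

df = pd.DataFrame({
    "player": ["Messi", "Mbappe", "Neymar", "Ronaldo"],
    "country": ["Argentina", "France", "Brazil", "Portugal"],
    "goals": [7, 8, 6, 1],
    "age": [35, 23, 30, 37],
})

mask = df["age"] > 30  # one True/False per row
older = df[mask]       # keep rows where the mask is True

print(mask.tolist())             # [True, False, False, True]
print(older["player"].tolist())  # ['Messi', 'Ronaldo']
print(int(mask.sum()))           # 2 (True counts as 1)
```

Naming the mask makes it easy to inspect or count matches before filtering.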
🧮 .query() — Clean & Readable Filtering
df.query("goals >= 7 and age < 30")
Why engineers love .query():
- Looks like SQL
- Cleaner than long boolean chains
- Easier to debug
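.query() can also reference Python variables with an @ prefix. The sketch below uses a hypothetical threshold variable, `min_goals`, and checks that the result matches the equivalent boolean filter.

```python
import pandas as pd

df = pd.DataFrame({
    "player": ["Messi", "Mbappe", "Neymar", "Ronaldo"],
    "country": ["Argentina", "France", "Brazil", "Portugal"],
    "goals": [7, 8, 6, 1],
    "age": [35, 23, 30, 37],
})

min_goals = 7  # hypothetical threshold, for illustration only

# @name refers to a Python variable in the surrounding scope.
young_scorers = df.query("goals >= @min_goals and age < 30")

# The same filter written as a boolean chain.
same_filter = df[(df["goals"] >= min_goals) & (df["age"] < 30)]

print(young_scorers["player"].tolist())   # ['Mbappe']
print(young_scorers.equals(same_filter))  # True
```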
🧩 Combining Conditions
df[(df["goals"] > 5) & (df["country"] != "Brazil")]
⚠️ Use & (and), | (or), and ~ (not) instead of Python's and / or / not, and wrap each condition in parentheses.
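A sketch of the three element-wise operators, plus what happens when Python's `and` is used by mistake:

```python
import pandas as pd

df = pd.DataFrame({
    "player": ["Messi", "Mbappe", "Neymar", "Ronaldo"],
    "country": ["Argentina", "France", "Brazil", "Portugal"],
    "goals": [7, 8, 6, 1],
    "age": [35, 23, 30, 37],
})

both = df[(df["goals"] > 5) & (df["country"] != "Brazil")]  # AND
either = df[(df["goals"] > 7) | (df["age"] > 36)]           # OR
not_brazil = df[~(df["country"] == "Brazil")]               # NOT

print(both["player"].tolist())    # ['Messi', 'Mbappe']
print(either["player"].tolist())  # ['Mbappe', 'Ronaldo']
print(len(not_brazil))            # 3

# Python's `and` asks for a single True/False from a whole Series,
# which pandas refuses to guess.
try:
    df[(df["goals"] > 5) and (df["country"] != "Brazil")]
    raised = False
except ValueError:
    raised = True

print(raised)  # True
```

The parentheses matter because & and | bind more tightly than comparison operators such as > and !=.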
🧠 Index Design (Very Important)
You can set a meaningful index:
df = df.set_index("player")
Now:
df.loc["Messi"]
This is:
- Cleaner
- Faster lookup (a unique index is looked up by label instead of scanning a boolean mask)
- Less error-prone
Why does index design matter in production?
Good index = clear logic + faster access + fewer bugs
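A minimal sketch of working with a meaningful index. It keeps the original table in `df` and stores the indexed version under a separate name, `indexed`, which is a naming choice made here for clarity.

```python
import pandas as pd

df = pd.DataFrame({
    "player": ["Messi", "Mbappe", "Neymar", "Ronaldo"],
    "country": ["Argentina", "France", "Brazil", "Portugal"],
    "goals": [7, 8, 6, 1],
    "age": [35, 23, 30, 37],
})

# verify_integrity=True raises an error if player names are duplicated.
indexed = df.set_index("player", verify_integrity=True)

messi = indexed.loc["Messi"]                # whole row as a Series
messi_goals = indexed.loc["Messi", "goals"] # single value

print(messi["country"])         # Argentina
print(messi_goals)              # 7
print(indexed.index.is_unique)  # True

# reset_index() turns the index back into an ordinary column.
restored = indexed.reset_index()
print(list(restored.columns))   # ['player', 'country', 'goals', 'age']
```

Checking uniqueness up front is one way to catch duplicate keys before they cause a lookup to return several rows unexpectedly.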
🧪 Real-World Use Case (Meta / OpenAI Style)
Imagine this DataFrame:
- Rows = model experiments
- Columns = loss, accuracy, timestamp
Engineers use Pandas to:
- Filter failed runs
- Compare metrics
- Debug data leakage
- Sanity-check training data
Before scaling → Pandas first.
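As a rough illustration of that workflow, here is a sketch on an invented experiments table. The run IDs, metric values, dates, and the 0.5 accuracy cutoff are all made up for the example.

```python
import pandas as pd

# Hypothetical experiment log; every value here is illustrative.
runs = pd.DataFrame({
    "run_id": ["run_a", "run_b", "run_c", "run_d"],
    "loss": [0.42, float("nan"), 0.35, 0.91],
    "accuracy": [0.88, 0.10, 0.91, 0.52],
    "timestamp": pd.to_datetime(
        ["2024-01-01", "2024-01-02", "2024-01-03", "2024-01-04"]
    ),
}).set_index("run_id")

# Filter failed runs: missing loss or accuracy below an assumed cutoff.
failed = runs[runs["loss"].isna() | (runs["accuracy"] < 0.5)]

# Compare metrics: label of the best run, then a ranking of healthy runs.
best_run = runs["accuracy"].idxmax()
ranked = runs.dropna(subset=["loss"]).sort_values("accuracy", ascending=False)

print(failed.index.tolist())  # ['run_b']
print(best_run)               # run_c
print(ranked.index.tolist())  # ['run_c', 'run_a', 'run_d']
```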
📦 Common Pandas Patterns You MUST Know
df.head()
df.tail()
df.shape
df.columns
df.info()
df.describe()
These are used daily by professionals.
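A quick sketch of what each of these returns on the World Cup table:

```python
import pandas as pd

df = pd.DataFrame({
    "player": ["Messi", "Mbappe", "Neymar", "Ronaldo"],
    "country": ["Argentina", "France", "Brazil", "Portugal"],
    "goals": [7, 8, 6, 1],
    "age": [35, 23, 30, 37],
})

first_two = df.head(2)  # first 2 rows (default is 5)
last_one = df.tail(1)   # last row
n_rows, n_cols = df.shape  # attribute, not a method call
column_names = list(df.columns)

# describe() summarizes numeric columns by default (goals and age here).
summary = df.describe()
mean_goals = summary.loc["mean", "goals"]

df.info()  # prints dtypes and non-null counts

print(n_rows, n_cols)  # 4 4
print(mean_goals)      # 5.5
```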
🧠 Interview Insight
Interviewers expect you to:
- Know .loc vs .iloc
- Filter rows correctly
- Avoid chained indexing
- Write readable Pandas code
Clean Pandas code = clean thinking.
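"Avoid chained indexing" refers to assigning through two indexing steps in a row. A sketch of the pattern and the single .loc call that replaces it; the `veteran` column is a hypothetical flag added for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "player": ["Messi", "Mbappe", "Neymar", "Ronaldo"],
    "country": ["Argentina", "France", "Brazil", "Portugal"],
    "goals": [7, 8, 6, 1],
    "age": [35, 23, 30, 37],
})

df["veteran"] = False  # hypothetical flag column

# Chained indexing (avoid): the first [] may produce a temporary copy,
# so the assignment can silently fail to update df.
# df[df["age"] > 30]["veteran"] = True

# One .loc call selects rows and column together, then assigns.
df.loc[df["age"] > 30, "veteran"] = True

print(df["veteran"].tolist())  # [True, False, False, True]
```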
🏁 Summary
- Series = labeled vector
- DataFrame = table
- .loc = label-based
- .iloc = position-based
- .query() = readable filtering
- Indexing = design decision
Master these, and you already think like a data engineer.
🚀 Next Chapter
👉 DE101-PD03 — GroupBy, Aggregation, and Analytics Thinking
This is where Pandas becomes powerful.