DE101-PD02 — Pandas Series, DataFrame, and Indexing (Real-World Mindset)

~45–60 minutes

🎯 Why Pandas Matters in the Real World

At companies like Google, Meta, and OpenAI, Pandas is often used for:

Data exploration (EDA)
Debugging datasets before training models
Analyzing experiments / A-B tests
Cleaning logs and user behavior data
Prototyping ideas before moving to Spark / BigQuery

Pandas is not about speed.
It is about thinking clearly with data.

🧱 Core Pandas Data Structures

Pandas has two fundamental objects:

Series → 1D labeled array
DataFrame → 2D table (rows + columns)

📌 Pandas Series

Think of a Series as:

A column in a table
A vector with labels

import pandas as pd

goals = pd.Series([91, 85, 70], index=["Messi", "Ronaldo", "Mbappe"])
print(goals)

Output

Messi      91
Ronaldo    85
Mbappe     70
dtype: int64

Key Ideas

Series = data + index
Index gives meaning, not just position
Used heavily for statistics and as ML feature columns
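To make "index gives meaning, not just position" concrete, here is a minimal sketch. The `assists` Series is hypothetical data invented for illustration; it lists the players in a different order and leaves one out, so you can see that arithmetic lines values up by label.

```python
import pandas as pd

goals = pd.Series([91, 85, 70], index=["Messi", "Ronaldo", "Mbappe"])

# Look up a value by label (meaning) or by position (order).
messi_by_label = goals.loc["Messi"]
first_by_position = goals.iloc[0]

# Hypothetical second Series: different order, Ronaldo missing.
assists = pd.Series([10, 40], index=["Mbappe", "Messi"])

# Addition matches values by index label, not by position.
# Labels present on only one side produce NaN.
total = goals + assists
print(total)
```

Messi's total comes from his own row in both Series, even though he is first in one and second in the other. Ronaldo has no assists entry, so his total is NaN.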

📊 Pandas DataFrame

A DataFrame is like an Excel table or SQL table.

Let’s simulate a FIFA World Cup dataset:

data = {
    "player": ["Messi", "Mbappe", "Neymar", "Ronaldo"],
    "country": ["Argentina", "France", "Brazil", "Portugal"],
    "goals": [7, 8, 6, 1],
    "age": [35, 23, 30, 37]
}

df = pd.DataFrame(data)
print(df)

Output

    player    country  goals  age
0    Messi  Argentina      7   35
1   Mbappe     France      8   23
2   Neymar     Brazil      6   30
3  Ronaldo   Portugal      1   37

🧠 How Engineers Think About DataFrames

At big tech companies, engineers think:

Rows → observations (players, users, events)
Columns → features (age, goals, clicks)
Index → identity or order

Bad indexing = confusing analysis.

🔍 Column Selection (Most Common Operation)

df["player"]
df["goals"]

This returns a Series.
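The bracket style decides what you get back. A quick sketch, rebuilding the same World Cup table so it runs on its own: a single column name gives a Series, while a list of names gives a DataFrame, even when the list has one item.

```python
import pandas as pd

df = pd.DataFrame({
    "player": ["Messi", "Mbappe", "Neymar", "Ronaldo"],
    "country": ["Argentina", "France", "Brazil", "Portugal"],
    "goals": [7, 8, 6, 1],
    "age": [35, 23, 30, 37],
})

one_col = df["goals"]                 # single name -> Series
two_cols = df[["player", "goals"]]    # list of names -> DataFrame
one_col_frame = df[["goals"]]         # one-item list -> still a DataFrame

print(type(one_col).__name__)
print(type(two_cols).__name__)
print(type(one_col_frame).__name__)
```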

📍 `.loc` — Label-Based Indexing (Human Thinking)

Use .loc when:

You care about meaning
You think in terms of labels

# All rows, specific columns
df.loc[:, ["player", "goals"]]

# Filter rows by condition
df.loc[df["country"] == "Argentina"]

Example: Find Messi’s record

df.loc[df["player"] == "Messi"]
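The row filter and the column selection can go into the same .loc call. A small sketch on the same table (rebuilt here so it runs on its own):

```python
import pandas as pd

df = pd.DataFrame({
    "player": ["Messi", "Mbappe", "Neymar", "Ronaldo"],
    "country": ["Argentina", "France", "Brazil", "Portugal"],
    "goals": [7, 8, 6, 1],
    "age": [35, 23, 30, 37],
})

# Rows where player is Messi, only the player and goals columns.
messi_record = df.loc[df["player"] == "Messi", ["player", "goals"]]
print(messi_record)

# A single column name returns a Series; .iloc[0] takes its first value.
messi_goals = df.loc[df["player"] == "Messi", "goals"].iloc[0]
print(messi_goals)
```

Doing both selections in one .loc call keeps the intent in one place and avoids stacking a second pair of brackets on the result.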

📐 `.iloc` — Position-Based Indexing (Machine Thinking)

Use .iloc when:

You care about position
You don’t trust index labels

# First row
df.iloc[0]

# First two rows, first two columns
df.iloc[0:2, 0:2]

Think of .iloc like Python list slicing: positions start at 0 and the end of a slice is excluded.
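A few more .iloc patterns, sketched on the same table. They follow the usual Python slicing rules, including negative positions and steps.

```python
import pandas as pd

df = pd.DataFrame({
    "player": ["Messi", "Mbappe", "Neymar", "Ronaldo"],
    "country": ["Argentina", "France", "Brazil", "Portugal"],
    "goals": [7, 8, 6, 1],
    "age": [35, 23, 30, 37],
})

first_two = df.iloc[0:2]       # positions 0 and 1; position 2 is excluded
last_row = df.iloc[-1]         # negative positions count from the end
every_other = df.iloc[::2]     # positions 0 and 2
corner = df.iloc[0:2, 0:2]     # first two rows, first two columns

print(first_two["player"].tolist())
print(last_row["player"])
print(every_other["player"].tolist())
print(corner.shape)
```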

⚠️ `.loc` vs `.iloc` (Interview Favorite)

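Method	Based on	Example
`.loc`	labels	df.loc[0] → the row whose index label is 0
`.iloc`	positions	df.iloc[0] → the first row, whatever its label

With the default 0, 1, 2, ... index, both return the same row. Once the index is sorted, filtered, or replaced, df.loc[0] may return a different row or raise a KeyError.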

Interviewers love asking this.
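The difference only shows up once labels and positions stop matching. This sketch sorts the same table by goals, which reorders the rows but keeps their original labels, and then compares slice behavior.

```python
import pandas as pd

df = pd.DataFrame({
    "player": ["Messi", "Mbappe", "Neymar", "Ronaldo"],
    "country": ["Argentina", "France", "Brazil", "Portugal"],
    "goals": [7, 8, 6, 1],
    "age": [35, 23, 30, 37],
})

# After sorting, the index labels are in the order 1, 0, 2, 3.
by_goals = df.sort_values("goals", ascending=False)

label_zero = by_goals.loc[0, "player"]        # row labeled 0
position_zero = by_goals.iloc[0]["player"]    # first row after sorting
print(label_zero, position_zero)

# Slices differ too: .loc includes the end label, .iloc excludes the end position.
loc_rows = len(df.loc[0:2])
iloc_rows = len(df.iloc[0:2])
print(loc_rows, iloc_rows)
```

The row labeled 0 is still Messi, but the first row is now the top scorer. On the original table, the .loc slice returns one more row than the .iloc slice.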

🔎 Boolean Filtering (SQL WHERE Equivalent)

# Players older than 30
df[df["age"] > 30]

Real-world usage:

Filter active users
Find failed experiments
Analyze edge cases
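Under the hood, the filter is a two-step process: the comparison builds a boolean Series (a mask), and indexing with that mask keeps the rows marked True. A sketch that makes the intermediate step visible:

```python
import pandas as pd

df = pd.DataFrame({
    "player": ["Messi", "Mbappe", "Neymar", "Ronaldo"],
    "country": ["Argentina", "France", "Brazil", "Portugal"],
    "goals": [7, 8, 6, 1],
    "age": [35, 23, 30, 37],
})

# Step 1: the comparison produces one True/False value per row.
mask = df["age"] > 30
print(mask.tolist())

# Step 2: indexing with the mask keeps only the True rows.
older = df[mask]
print(older["player"].tolist())
```

Naming the mask is handy when debugging: you can inspect it or count matches with mask.sum() before filtering.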

🧮 `.query()` — Clean & Readable Filtering

df.query("goals >= 7 and age < 30")

Why engineers love .query():

Looks like SQL
Cleaner than long boolean chains
Easier to debug
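Inside a .query() string, a name prefixed with @ refers to a Python variable, so thresholds do not have to be hard-coded. A minimal sketch; `min_goals` and `max_age` are illustrative names, and the boolean-mask version is included to show both forms select the same rows.

```python
import pandas as pd

df = pd.DataFrame({
    "player": ["Messi", "Mbappe", "Neymar", "Ronaldo"],
    "country": ["Argentina", "France", "Brazil", "Portugal"],
    "goals": [7, 8, 6, 1],
    "age": [35, 23, 30, 37],
})

min_goals = 7   # hypothetical threshold
max_age = 30    # hypothetical threshold

# @name pulls in the Python variable with that name.
young_scorers = df.query("goals >= @min_goals and age < @max_age")

# Equivalent boolean-mask version.
same_rows = df[(df["goals"] >= min_goals) & (df["age"] < max_age)]

print(young_scorers["player"].tolist())
```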

🧩 Combining Conditions

df[(df["goals"] > 5) & (df["country"] != "Brazil")]

⚠️ Use & (and), | (or), and ~ (not) instead of Python's and, or, not, and wrap each condition in parentheses.
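Here is a sketch of what goes wrong with the Python keyword and which operators to use instead. The specific thresholds are arbitrary, chosen only to give small results on this table.

```python
import pandas as pd

df = pd.DataFrame({
    "player": ["Messi", "Mbappe", "Neymar", "Ronaldo"],
    "country": ["Argentina", "France", "Brazil", "Portugal"],
    "goals": [7, 8, 6, 1],
    "age": [35, 23, 30, 37],
})

# Python's `and` asks for one True/False for the whole Series, which is ambiguous.
error_message = None
try:
    df[(df["goals"] > 5) and (df["country"] != "Brazil")]
except ValueError as err:
    error_message = str(err)
print(error_message)

# Element-wise operators work row by row.
either = df[(df["goals"] > 7) | (df["age"] > 36)]             # OR
not_brazil = df[~(df["country"] == "Brazil")]                 # NOT
in_set = df[df["country"].isin(["France", "Portugal"])]       # membership

print(either["player"].tolist())
print(not_brazil["player"].tolist())
print(in_set["player"].tolist())
```

The parentheses matter because & and | bind more tightly than comparisons such as > and ==.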

🧠 Index Design (Very Important)

You can set a meaningful index:

df = df.set_index("player")

Now:

df.loc["Messi"]

When the index values are unique, this is:

Cleaner
Faster lookup
Less error-prone

Why does index design matter in production?

Good index = clear logic + faster access + fewer bugs
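A meaningful index helps most when its values are unique. The sketch below shows a lookup on a unique index, then what changes when a label is duplicated. The duplicated Messi row is artificial, added only to demonstrate the behavior.

```python
import pandas as pd

df = pd.DataFrame({
    "player": ["Messi", "Mbappe", "Neymar", "Ronaldo"],
    "country": ["Argentina", "France", "Brazil", "Portugal"],
    "goals": [7, 8, 6, 1],
    "age": [35, 23, 30, 37],
})

indexed = df.set_index("player")
print(indexed.index.is_unique)

# Row label and column label in one lookup.
messi_goals = indexed.loc["Messi", "goals"]

# Artificial duplicate: append a copy of the first row, then index by player.
dup = pd.concat([df, df.iloc[[0]]]).set_index("player")
dup_lookup = dup.loc["Messi"]   # now a DataFrame with two rows, not one Series

# verify_integrity=True makes set_index refuse duplicate labels.
duplicate_rejected = False
try:
    pd.concat([df, df.iloc[[0]]]).set_index("player", verify_integrity=True)
except ValueError:
    duplicate_rejected = True

# reset_index() turns the index back into an ordinary column.
restored = indexed.reset_index()
```

Checking index.is_unique right after set_index is a cheap guard: a duplicated label silently changes what .loc returns.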

🧪 Real-World Use Case (Meta / OpenAI Style)

Imagine this DataFrame:

Rows = model experiments
Columns = loss, accuracy, timestamp

Engineers use Pandas to:

Filter failed runs
Compare metrics
Debug data leakage
Sanity-check training data

Before scaling → Pandas first.
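As a rough sketch of that workflow, here is a small experiments table. Everything in it is hypothetical: the run IDs, the metric values, the timestamps, and the rule that a run "failed" if its loss is missing or its accuracy is below 0.5.

```python
import pandas as pd

# Hypothetical experiment log; all values are made up for illustration.
runs = pd.DataFrame({
    "run_id": ["a", "b", "c", "d"],
    "loss": [0.42, float("nan"), 0.38, 2.10],
    "accuracy": [0.81, float("nan"), 0.84, 0.35],
    "timestamp": pd.to_datetime([
        "2025-01-01 09:00", "2025-01-01 10:00",
        "2025-01-02 09:00", "2025-01-02 10:00",
    ]),
}).set_index("run_id")

# Assumed failure rule: missing loss, or accuracy below 0.5.
failed = runs[runs["loss"].isna() | (runs["accuracy"] < 0.5)]
print(failed.index.tolist())

# Compare metrics: label of the run with the highest accuracy (NaN is skipped).
best_run = runs["accuracy"].idxmax()
print(best_run)

# Sanity check: how many values are missing in each column?
print(runs.isna().sum())
```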

📦 Common Pandas Patterns You MUST Know

df.head()
df.tail()
df.shape
df.columns
df.info()
df.describe()

These are used daily by professionals.
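A short sketch of what each of these reports on the World Cup table (rebuilt here so it runs on its own):

```python
import pandas as pd

df = pd.DataFrame({
    "player": ["Messi", "Mbappe", "Neymar", "Ronaldo"],
    "country": ["Argentina", "France", "Brazil", "Portugal"],
    "goals": [7, 8, 6, 1],
    "age": [35, 23, 30, 37],
})

print(df.head(2))          # first 2 rows (default is 5)
print(df.tail(2))          # last 2 rows
print(df.shape)            # (number of rows, number of columns)
print(list(df.columns))    # column names
df.info()                  # dtypes and non-null counts; prints directly

# describe() summarizes numeric columns by default (here: goals and age).
summary = df.describe()
print(summary)
mean_goals = summary.loc["mean", "goals"]
```

Note that shape and columns are attributes, so they take no parentheses, while the others are method calls.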

🧠 Interview Insight

Interviewers expect you to:

Know .loc vs .iloc
Filter rows correctly
Avoid chained indexing
Write readable Pandas code

Clean Pandas code = clean thinking.
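"Avoid chained indexing" deserves an example, since it is the item most often missed. Chained indexing means two indexing steps in a row; when used for assignment, the second step may write to a temporary copy instead of the original table. The sketch below leaves the chained form as a comment and runs the single .loc call instead. The new goal value is arbitrary.

```python
import pandas as pd

df = pd.DataFrame({
    "player": ["Messi", "Mbappe", "Neymar", "Ronaldo"],
    "country": ["Argentina", "France", "Brazil", "Portugal"],
    "goals": [7, 8, 6, 1],
    "age": [35, 23, 30, 37],
})

# Chained indexing (avoid): filter first, then index the result and assign.
# The assignment may land on a temporary copy and leave df unchanged.
# df[df["player"] == "Ronaldo"]["goals"] = 2

# Single .loc call: row condition and column in one step, applied to df itself.
df.loc[df["player"] == "Ronaldo", "goals"] = 2

ronaldo_goals = df.loc[df["player"] == "Ronaldo", "goals"].iloc[0]
print(ronaldo_goals)
```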

🏁 Summary

Series = labeled vector
DataFrame = table
.loc = label-based
.iloc = position-based
.query() = readable filtering
Indexing = design decision

Master these, and you already think like a data engineer.

🚀 Next Chapter

👉 DE101-PD03 — GroupBy, Aggregation, and Analytics Thinking

This is where Pandas becomes powerful.

Last updated on 2025
