DE101-PD02 — Pandas Series, DataFrame, and Indexing (Real-World Mindset)

~45–60 minutes

🎯 Why Pandas Matters in the Real World

At companies like Google, Meta, and OpenAI, Pandas is often used for:

  • Data exploration (EDA)
  • Debugging datasets before training models
  • Analyzing experiments / A-B tests
  • Cleaning logs and user behavior data
  • Prototyping ideas before moving to Spark / BigQuery

Pandas is not about speed.
It is about thinking clearly with data.


🧱 Core Pandas Data Structures

Pandas has two fundamental objects:

  1. Series → 1D labeled array
  2. DataFrame → 2D table (rows + columns)

📌 Pandas Series

Think of a Series as:

  • A column in a table
  • A vector with labels

import pandas as pd

goals = pd.Series([91, 85, 70], index=["Messi", "Ronaldo", "Mbappe"])
print(goals)

Output

Messi      91
Ronaldo    85
Mbappe     70
dtype: int64

Key Ideas

  • Series = data + index
  • Index gives meaning, not just position
  • Used heavily in statistics and ML features
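
To make "data + index" concrete, here is a minimal sketch that reads the same value once by label and once by position. It reuses the goals Series from above; the variable names are only illustrative.

```python
import pandas as pd

goals = pd.Series([91, 85, 70], index=["Messi", "Ronaldo", "Mbappe"])

# Label-based access: ask for the value by what it means.
messi_goals = goals.loc["Messi"]

# Position-based access: ask for the value by where it sits.
first_value = goals.iloc[0]

print(messi_goals, first_value)  # 91 91

# Filtering keeps the labels attached to the surviving values.
above_80 = goals[goals > 80]
print(above_80.index.tolist())  # ['Messi', 'Ronaldo']
```

The labels travel with the data, so a filtered result still tells you who each number belongs to.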

📊 Pandas DataFrame

A DataFrame is like an Excel table or SQL table.

Let’s simulate a FIFA World Cup dataset:

data = {
    "player": ["Messi", "Mbappe", "Neymar", "Ronaldo"],
    "country": ["Argentina", "France", "Brazil", "Portugal"],
    "goals": [7, 8, 6, 1],
    "age": [35, 23, 30, 37]
}

df = pd.DataFrame(data)
print(df)

Output

    player    country  goals  age
0    Messi  Argentina      7   35
1   Mbappe     France      8   23
2   Neymar     Brazil      6   30
3  Ronaldo   Portugal      1   37

🧠 How Engineers Think About DataFrames

At big tech companies, engineers think:

  • Rows → observations (players, users, events)
  • Columns → features (age, goals, clicks)
  • Index → identity or order

Bad indexing = confusing analysis.


🔍 Column Selection (Most Common Operation)

df["player"]
df["goals"]

Each of these returns a Series: one column, carrying the DataFrame's index.
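
A detail that often trips people up: single brackets give a Series, while a list of column names gives a DataFrame. The sketch below rebuilds the World Cup table so it runs on its own.

```python
import pandas as pd

df = pd.DataFrame({
    "player": ["Messi", "Mbappe", "Neymar", "Ronaldo"],
    "country": ["Argentina", "France", "Brazil", "Portugal"],
    "goals": [7, 8, 6, 1],
    "age": [35, 23, 30, 37],
})

goals_series = df["goals"]         # single brackets -> Series
goals_frame = df[["goals"]]        # list with one name -> DataFrame
subset = df[["player", "goals"]]   # list with several names -> DataFrame

print(type(goals_series).__name__)  # Series
print(type(goals_frame).__name__)   # DataFrame
print(subset.shape)                 # (4, 2)
```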


📍 .loc — Label-Based Indexing (Human Thinking)

Use .loc when:

  • You care about meaning
  • You think in terms of labels

# All rows, specific columns
df.loc[:, ["player", "goals"]]

# Filter rows by condition
df.loc[df["country"] == "Argentina"]

Example: Find Messi’s record

df.loc[df["player"] == "Messi"]
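
.loc also accepts a row selector and a column selector in the same call, which is usually what you want in practice. A small sketch, assuming the same World Cup table as above:

```python
import pandas as pd

df = pd.DataFrame({
    "player": ["Messi", "Mbappe", "Neymar", "Ronaldo"],
    "country": ["Argentina", "France", "Brazil", "Portugal"],
    "goals": [7, 8, 6, 1],
    "age": [35, 23, 30, 37],
})

# Rows where player is Messi, but only the player and goals columns.
messi = df.loc[df["player"] == "Messi", ["player", "goals"]]
print(messi)

# A single column name instead of a list returns a Series.
messi_goals = df.loc[df["player"] == "Messi", "goals"]
print(messi_goals.tolist())  # [7]
```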

📐 .iloc — Position-Based Indexing (Machine Thinking)

Use .iloc when:

  • You care about position
  • You don’t trust index labels

# First row
df.iloc[0]

# First two rows, first two columns
df.iloc[0:2, 0:2]

Think of .iloc like Python slicing: positions start at 0 and the end of a slice is excluded.
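
One difference worth seeing once: .iloc slices stop before the end position, while .loc slices include the end label. A minimal sketch on the same table:

```python
import pandas as pd

df = pd.DataFrame({
    "player": ["Messi", "Mbappe", "Neymar", "Ronaldo"],
    "country": ["Argentina", "France", "Brazil", "Portugal"],
    "goals": [7, 8, 6, 1],
    "age": [35, 23, 30, 37],
})

by_position = df.iloc[0:2]  # positions 0 and 1 (end excluded)
by_label = df.loc[0:2]      # labels 0, 1 and 2 (end included)
last_row = df.iloc[-1]      # negative positions work like Python lists

print(len(by_position), len(by_label))  # 2 3
print(last_row["player"])               # Ronaldo
```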


⚠️ .loc vs .iloc (Interview Favorite)

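Method   Based on    Example
.loc     labels      df.loc[0]    → the row whose index label is 0 (KeyError if no such label exists)
.iloc    positions   df.iloc[0]   → the first row, whatever the index labels are

With the default 0, 1, 2, … index both return the same row. They diverge once the index is reordered or replaced.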

Interviewers love asking this.
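
The difference only shows up once labels and positions stop matching. The sketch below sorts the table so the index is no longer 0, 1, 2, 3 in order, then sets a string index to show the failure case.

```python
import pandas as pd

df = pd.DataFrame({
    "player": ["Messi", "Mbappe", "Neymar", "Ronaldo"],
    "country": ["Argentina", "France", "Brazil", "Portugal"],
    "goals": [7, 8, 6, 1],
    "age": [35, 23, 30, 37],
})

# Sorting reorders the rows but keeps the original index labels.
top_scorers = df.sort_values("goals", ascending=False)

label_zero = top_scorers.loc[0]["player"]      # row labeled 0
position_zero = top_scorers.iloc[0]["player"]  # row in first position
print(label_zero, position_zero)  # Messi Mbappe

# With a string index there is no label 0, so .loc[0] fails.
by_player = df.set_index("player")
try:
    by_player.loc[0]
    loc_failed = False
except KeyError:
    loc_failed = True
print(loc_failed)  # True

# .iloc[0] still works because it ignores labels.
print(by_player.iloc[0]["country"])  # Argentina
```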


🔎 Boolean Filtering (SQL WHERE Equivalent)

# Players older than 30
df[df["age"] > 30]

Real-world usage:

  • Filter active users
  • Find failed experiments
  • Analyze edge cases
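
It helps to look at the filter in two steps: the comparison builds a boolean Series (the mask), and indexing with that mask keeps the rows marked True. A minimal sketch:

```python
import pandas as pd

df = pd.DataFrame({
    "player": ["Messi", "Mbappe", "Neymar", "Ronaldo"],
    "country": ["Argentina", "France", "Brazil", "Portugal"],
    "goals": [7, 8, 6, 1],
    "age": [35, 23, 30, 37],
})

# Step 1: the condition is evaluated for every row.
mask = df["age"] > 30
print(mask.tolist())  # [True, False, False, True]

# Step 2: the mask selects the matching rows.
older = df[mask]
print(older["player"].tolist())  # ['Messi', 'Ronaldo']

# True counts as 1, so summing a mask counts the matches.
print(int(mask.sum()))  # 2
```

Naming the mask makes long filters easier to inspect when a result looks wrong.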

🧮 .query() — Clean & Readable Filtering

df.query("goals >= 7 and age < 30")

Why engineers love .query():

  • Looks like SQL
  • Cleaner than long boolean chains
  • Easier to debug
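
For comparison, here is the same filter written with .query() and with a boolean mask. The threshold variable min_goals is an illustrative name; inside the query string, @ refers to a Python variable.

```python
import pandas as pd

df = pd.DataFrame({
    "player": ["Messi", "Mbappe", "Neymar", "Ronaldo"],
    "country": ["Argentina", "France", "Brazil", "Portugal"],
    "goals": [7, 8, 6, 1],
    "age": [35, 23, 30, 37],
})

min_goals = 7  # hypothetical threshold, referenced as @min_goals below

via_query = df.query("goals >= @min_goals and age < 30")
via_mask = df[(df["goals"] >= min_goals) & (df["age"] < 30)]

print(via_query["player"].tolist())  # ['Mbappe']
print(via_query.equals(via_mask))    # True
```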

🧩 Combining Conditions

df[(df["goals"] > 5) & (df["country"] != "Brazil")]

⚠️ Use & (and), | (or), and ~ (not) instead of Python's and / or / not, and wrap each condition in parentheses.
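
A short sketch of what goes wrong with and, plus the | and ~ operators on the same table:

```python
import pandas as pd

df = pd.DataFrame({
    "player": ["Messi", "Mbappe", "Neymar", "Ronaldo"],
    "country": ["Argentina", "France", "Brazil", "Portugal"],
    "goals": [7, 8, 6, 1],
    "age": [35, 23, 30, 37],
})

# Python's `and` asks for one True/False per Series, which is ambiguous.
try:
    df[(df["goals"] > 5) and (df["country"] != "Brazil")]
    error_message = ""
except ValueError as err:
    error_message = str(err)
print(error_message)

# OR: more than 7 goals, or older than 35.
either = df[(df["goals"] > 7) | (df["age"] > 35)]
print(either["player"].tolist())  # ['Mbappe', 'Ronaldo']

# NOT: everyone except Brazil.
not_brazil = df[~(df["country"] == "Brazil")]
print(len(not_brazil))  # 3
```

The parentheses matter because & and | bind more tightly than comparisons such as > and ==.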


🧠 Index Design (Very Important)

You can set a meaningful index:

df = df.set_index("player")

Now:

df.loc["Messi"]

This is:

  • Cleaner
  • Faster lookup
  • Less error-prone

Why does index design matter in production?

Good index = clear logic + faster access + fewer bugs
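
A minimal sketch of working with a meaningful index. Checking uniqueness first is a suggested habit here, not something pandas requires: with duplicate labels, .loc returns several rows instead of one.

```python
import pandas as pd

df = pd.DataFrame({
    "player": ["Messi", "Mbappe", "Neymar", "Ronaldo"],
    "country": ["Argentina", "France", "Brazil", "Portugal"],
    "goals": [7, 8, 6, 1],
    "age": [35, 23, 30, 37],
})

indexed = df.set_index("player")

# A lookup key should identify exactly one row.
print(indexed.index.is_unique)  # True

# One label -> one row (returned as a Series).
messi = indexed.loc["Messi"]

# Label plus column name -> a single value.
messi_goals = indexed.loc["Messi", "goals"]
print(messi_goals)  # 7

# reset_index() turns the index back into a regular column.
restored = indexed.reset_index()
print(restored.columns.tolist())  # ['player', 'country', 'goals', 'age']
```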


🧪 Real-World Use Case (Meta / OpenAI Style)

Imagine this DataFrame:

  • Rows = model experiments
  • Columns = loss, accuracy, timestamp

Engineers use Pandas to:

  • Filter failed runs
  • Compare metrics
  • Debug data leakage
  • Sanity-check training data

Before scaling → Pandas first.
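
As a rough illustration, the sketch below builds a tiny made-up experiments table. The run IDs and metric values are invented, and treating a NaN loss as a "failed run" is an assumption for this example only.

```python
import pandas as pd

# Hypothetical experiment log; every value here is invented.
runs = pd.DataFrame({
    "run_id": ["exp-a", "exp-b", "exp-c"],
    "loss": [0.42, float("nan"), 0.35],
    "accuracy": [0.81, 0.10, 0.86],
    "timestamp": pd.to_datetime(["2024-01-01", "2024-01-02", "2024-01-03"]),
}).set_index("run_id")

# Filter failed runs (assumption: a NaN loss means the run diverged).
failed = runs[runs["loss"].isna()]
print(failed.index.tolist())  # ['exp-b']

# Compare metrics: which run has the highest accuracy?
best_run = runs["accuracy"].idxmax()
print(best_run)  # exp-c

# Sanity check: keep only healthy runs, newest first.
healthy = runs.dropna(subset=["loss"]).sort_values("timestamp", ascending=False)
print(healthy.index.tolist())  # ['exp-c', 'exp-a']
```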


📦 Common Pandas Patterns You MUST Know

df.head()
df.tail()
df.shape
df.columns
df.info()
df.describe()

These are used daily by professionals.
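
Here is what a few of these return on the World Cup table. This is a quick sketch; the printed output of info() and describe() is left out for brevity.

```python
import pandas as pd

df = pd.DataFrame({
    "player": ["Messi", "Mbappe", "Neymar", "Ronaldo"],
    "country": ["Argentina", "France", "Brazil", "Portugal"],
    "goals": [7, 8, 6, 1],
    "age": [35, 23, 30, 37],
})

print(df.shape)             # (4, 4) -> (rows, columns)
print(df.columns.tolist())  # ['player', 'country', 'goals', 'age']

first_two = df.head(2)      # first 2 rows (default is 5)
last_two = df.tail(2)       # last 2 rows

df.info()                   # prints dtypes and non-null counts

# describe() summarizes numeric columns only by default.
summary = df.describe()
print(summary.loc["mean", "goals"])  # 5.5
```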


🧠 Interview Insight

Interviewers expect you to:

  • Know .loc vs .iloc
  • Filter rows correctly
  • Avoid chained indexing
  • Write readable Pandas code

Clean Pandas code = clean thinking.
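
"Avoid chained indexing" deserves one concrete example. Chained indexing means two indexing steps back to back; when used for assignment, the write may land on a temporary copy instead of the original DataFrame. The risky form is left as a comment, and the single .loc call is the usual fix.

```python
import pandas as pd

df = pd.DataFrame({
    "player": ["Messi", "Mbappe", "Neymar", "Ronaldo"],
    "country": ["Argentina", "France", "Brazil", "Portugal"],
    "goals": [7, 8, 6, 1],
    "age": [35, 23, 30, 37],
})

# Chained indexing (avoid): filter first, then assign to a column.
# Depending on the pandas version this warns, silently does nothing,
# or modifies a copy.
# df[df["country"] == "France"]["goals"] = 9

# Single .loc call: rows and column selected in one step.
df.loc[df["country"] == "France", "goals"] = 9

print(df.loc[1, "goals"])  # 9
```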


🏁 Summary

  • Series = labeled vector
  • DataFrame = table
  • .loc = label-based
  • .iloc = position-based
  • .query() = readable filtering
  • Indexing = design decision

Master these, and you already think like a data engineer.


🚀 Next Chapter

👉 DE101-PD03 — GroupBy, Aggregation, and Analytics Thinking

This is where Pandas becomes powerful.
