DE101-PD07 — Pandas with Python
🌍 Why This Document Exists
Data Engineer interviews do not test pandas syntax.
They test:
- Can you reason about data?
- Can you aggregate correctly?
- Do you understand edge cases?
- Can you explain trade-offs?
- Do you fall back to logic when tools disappear?
This chapter trains real data thinking.
🏆 Pandas Logic — Data Engineer Edition (20 Problems)
📦 Mock Dataset (Used in Many Problems)
import pandas as pd
data = {
"user_id": [1,1,2,2,3,3,3],
"event": ["login","pay","login","logout","login","pay","logout"],
"amount": [0,100,0,0,0,200,0],
"day": ["Mon","Mon","Mon","Tue","Tue","Tue","Wed"]
}
df = pd.DataFrame(data)
🟢 Problem 1 — Count Total Rows
Task: How many rows are in the dataset?
✅ Pandas
df.shape[0]
✅ Pure Logic
count = 0
for _ in df.itertuples():
count += 1
print(count)
Time: O(n)
🟢 Problem 2 — Unique Users
✅ Pandas
df["user_id"].nunique()
✅ Pure Logic
seen = set()
for u in df["user_id"]:
seen.add(u)
print(len(seen))
🟢 Problem 3 — Filter Login Events
✅ Pandas
df[df["event"] == "login"]
✅ Pure Logic
rows = []
for r in df.to_dict("records"):
if r["event"] == "login":
rows.append(r)
print(rows)
🟡 Problem 4 — Total Revenue
Sum of all amount
✅ Pandas
df["amount"].sum()
✅ Pure Logic
total = 0
for a in df["amount"]:
total += a
print(total)
🟡 Problem 5 — Revenue per User
✅ Pandas
df.groupby("user_id")["amount"].sum()
✅ Pure Logic
rev = {}
for u, a in zip(df["user_id"], df["amount"]):
rev[u] = rev.get(u, 0) + a
print(rev)
🟡 Problem 6 — Count Events per Day
✅ Pandas
df.groupby("day").size()
✅ Pure Logic
count = {}
for d in df["day"]:
count[d] = count.get(d, 0) + 1
print(count)
🟡 Problem 7 — Users Who Ever Paid
✅ Pandas
df[df["amount"] > 0]["user_id"].unique()
✅ Pure Logic
paid = set()
for u, a in zip(df["user_id"], df["amount"]):
if a > 0:
paid.add(u)
print(list(paid))
🟡 Problem 8 — Max Transaction Amount
✅ Pandas
df["amount"].max()
✅ Pure Logic
mx = 0
for a in df["amount"]:
if a > mx:
mx = a
print(mx)
🔵 Problem 9 — User with Highest Revenue
✅ Pandas
df.groupby("user_id")["amount"].sum().idxmax()
✅ Pure Logic
rev = {}
for u, a in zip(df["user_id"], df["amount"]):
rev[u] = rev.get(u, 0) + a
best = max(rev, key=rev.get)
print(best)
🔵 Problem 10 — Days with No Payments
✅ Pandas
df.groupby("day")["amount"].sum()[lambda x: x == 0]
✅ Pure Logic
daily = {}
for d, a in zip(df["day"], df["amount"]):
daily[d] = daily.get(d, 0) + a
print([d for d,v in daily.items() if v == 0])
🔵 Problem 11 — Add Is_Payment Column
✅ Pandas
df["is_payment"] = df["amount"] > 0
✅ Pure Logic
flags = []
for a in df["amount"]:
flags.append(a > 0)
print(flags)
🔵 Problem 12 — Session Count per User
(Login = session start)
✅ Pandas
df[df["event"] == "login"].groupby("user_id").size()
✅ Pure Logic
sessions = {}
for u, e in zip(df["user_id"], df["event"]):
if e == "login":
sessions[u] = sessions.get(u, 0) + 1
print(sessions)
🔴 Problem 13 — Conversion Rate per User
(pay / login)
✅ Pandas
logins = df[df["event"]=="login"].groupby("user_id").size()
pays = df[df["event"]=="pay"].groupby("user_id").size()
(pays / logins).fillna(0)
✅ Pure Logic
logins, pays = {}, {}
for u, e in zip(df["user_id"], df["event"]):
if e == "login":
logins[u] = logins.get(u, 0) + 1
if e == "pay":
pays[u] = pays.get(u, 0) + 1
rate = {u: pays.get(u,0)/logins[u] for u in logins}
print(rate)
🔴 Problem 14 — Detect Duplicate Rows
✅ Pandas
df[df.duplicated()]
✅ Pure Logic
seen = set()
dups = []
for r in df.to_dict("records"):
t = tuple(r.items())
if t in seen:
dups.append(r)
else:
seen.add(t)
print(dups)
🔴 Problem 15 — Sort by Revenue Desc
✅ Pandas
df.sort_values("amount", ascending=False)
✅ Pure Logic
rows = df.to_dict("records")
rows.sort(key=lambda x: x["amount"], reverse=True)
print(rows)
🔴 Problem 16 — Top Day by Revenue
✅ Pandas
df.groupby("day")["amount"].sum().idxmax()
✅ Pure Logic
daily = {}
for d, a in zip(df["day"], df["amount"]):
daily[d] = daily.get(d, 0) + a
print(max(daily, key=daily.get))
🔴 Problem 17 — Pivot Table (User × Event)
✅ Pandas
pd.pivot_table(df, index="user_id", columns="event", values="amount", aggfunc="count", fill_value=0)
✅ Pure Logic
pivot = {}
for u, e in zip(df["user_id"], df["event"]):
pivot.setdefault(u, {})
pivot[u][e] = pivot[u].get(e, 0) + 1
print(pivot)
🔴 Problem 18 — Find Users with No Logout
✅ Pandas
users = set(df["user_id"])
logout = set(df[df["event"]=="logout"]["user_id"])
users - logout
✅ Pure Logic
users, logout = set(), set()
for u, e in zip(df["user_id"], df["event"]):
users.add(u)
if e == "logout":
logout.add(u)
print(users - logout)
🔴 Problem 19 — Running Total Revenue
✅ Pandas
df["running_total"] = df["amount"].cumsum()
✅ Pure Logic
run = 0
res = []
for a in df["amount"]:
run += a
res.append(run)
print(res)
🔴 Problem 20 — Data Engineer Sanity Check
Question (Interview): When does pandas fail?
✅ Expected Answer
- Very large datasets (memory-bound)
- Skewed groupby keys
- Hidden NaNs
- Timezone bugs
- When logic is unclear
👉 Strong DEs fall back to logic, not APIs
🎯 Final Advice (Very Important)
If you can:
- Explain pandas and
- Replace pandas with logic
- Talk about trade-offs
Then you are Data Engineer ready.
Tools change. Thinking does not.