DE101-PD07 — Pandas with Python

🌍 Why This Document Exists

Data Engineer interviews do not test pandas syntax.

They test:

  • Can you reason about data?
  • Can you aggregate correctly?
  • Do you understand edge cases?
  • Can you explain trade-offs?
  • Do you fall back to logic when tools disappear?

This chapter trains real data thinking.


🏆 Pandas Logic — Data Engineer Edition (20 Problems)


📦 Mock Dataset (Used in Many Problems)

import pandas as pd

data = {
    "user_id": [1,1,2,2,3,3,3],
    "event": ["login","pay","login","logout","login","pay","logout"],
    "amount": [0,100,0,0,0,200,0],
    "day": ["Mon","Mon","Mon","Tue","Tue","Tue","Wed"]
}

df = pd.DataFrame(data)

🟢 Problem 1 — Count Total Rows

Task: How many rows are in the dataset?

✅ Pandas
df.shape[0]
✅ Pure Logic
count = 0
for _ in df.itertuples():
    count += 1
print(count)

Time: O(n)


🟢 Problem 2 — Unique Users

✅ Pandas
df["user_id"].nunique()
✅ Pure Logic
seen = set()
for u in df["user_id"]:
    seen.add(u)
print(len(seen))

🟢 Problem 3 — Filter Login Events

✅ Pandas
df[df["event"] == "login"]
✅ Pure Logic
rows = []
for r in df.to_dict("records"):
    if r["event"] == "login":
        rows.append(r)
print(rows)

🟡 Problem 4 — Total Revenue

Sum of all amount

✅ Pandas
df["amount"].sum()
✅ Pure Logic
total = 0
for a in df["amount"]:
    total += a
print(total)

🟡 Problem 5 — Revenue per User

✅ Pandas
df.groupby("user_id")["amount"].sum()
✅ Pure Logic
rev = {}
for u, a in zip(df["user_id"], df["amount"]):
    rev[u] = rev.get(u, 0) + a
print(rev)

🟡 Problem 6 — Count Events per Day

✅ Pandas
df.groupby("day").size()
✅ Pure Logic
count = {}
for d in df["day"]:
    count[d] = count.get(d, 0) + 1
print(count)

🟡 Problem 7 — Users Who Ever Paid

✅ Pandas
df[df["amount"] > 0]["user_id"].unique()
✅ Pure Logic
paid = set()
for u, a in zip(df["user_id"], df["amount"]):
    if a > 0:
        paid.add(u)
print(list(paid))

🟡 Problem 8 — Max Transaction Amount

✅ Pandas
df["amount"].max()
✅ Pure Logic
mx = 0
for a in df["amount"]:
    if a > mx:
        mx = a
print(mx)

🔵 Problem 9 — User with Highest Revenue

✅ Pandas
df.groupby("user_id")["amount"].sum().idxmax()
✅ Pure Logic
rev = {}
for u, a in zip(df["user_id"], df["amount"]):
    rev[u] = rev.get(u, 0) + a

best = max(rev, key=rev.get)
print(best)

🔵 Problem 10 — Days with No Payments

✅ Pandas
df.groupby("day")["amount"].sum()[lambda x: x == 0]
✅ Pure Logic
daily = {}
for d, a in zip(df["day"], df["amount"]):
    daily[d] = daily.get(d, 0) + a

print([d for d,v in daily.items() if v == 0])

🔵 Problem 11 — Add Is_Payment Column

✅ Pandas
df["is_payment"] = df["amount"] > 0
✅ Pure Logic
flags = []
for a in df["amount"]:
    flags.append(a > 0)
print(flags)

🔵 Problem 12 — Session Count per User

(Login = session start)

✅ Pandas
df[df["event"] == "login"].groupby("user_id").size()
✅ Pure Logic
sessions = {}
for u, e in zip(df["user_id"], df["event"]):
    if e == "login":
        sessions[u] = sessions.get(u, 0) + 1
print(sessions)

🔴 Problem 13 — Conversion Rate per User

(pay / login)

✅ Pandas
logins = df[df["event"]=="login"].groupby("user_id").size()
pays = df[df["event"]=="pay"].groupby("user_id").size()
(pays / logins).fillna(0)
✅ Pure Logic
logins, pays = {}, {}

for u, e in zip(df["user_id"], df["event"]):
    if e == "login":
        logins[u] = logins.get(u, 0) + 1
    if e == "pay":
        pays[u] = pays.get(u, 0) + 1

rate = {u: pays.get(u,0)/logins[u] for u in logins}
print(rate)

🔴 Problem 14 — Detect Duplicate Rows

✅ Pandas
df[df.duplicated()]
✅ Pure Logic
seen = set()
dups = []

for r in df.to_dict("records"):
    t = tuple(r.items())
    if t in seen:
        dups.append(r)
    else:
        seen.add(t)

print(dups)

🔴 Problem 15 — Sort by Revenue Desc

✅ Pandas
df.sort_values("amount", ascending=False)
✅ Pure Logic
rows = df.to_dict("records")
rows.sort(key=lambda x: x["amount"], reverse=True)
print(rows)

🔴 Problem 16 — Top Day by Revenue

✅ Pandas
df.groupby("day")["amount"].sum().idxmax()
✅ Pure Logic
daily = {}
for d, a in zip(df["day"], df["amount"]):
    daily[d] = daily.get(d, 0) + a

print(max(daily, key=daily.get))

🔴 Problem 17 — Pivot Table (User × Event)

✅ Pandas
pd.pivot_table(df, index="user_id", columns="event", values="amount", aggfunc="count", fill_value=0)
✅ Pure Logic
pivot = {}

for u, e in zip(df["user_id"], df["event"]):
    pivot.setdefault(u, {})
    pivot[u][e] = pivot[u].get(e, 0) + 1

print(pivot)

🔴 Problem 18 — Find Users with No Logout

✅ Pandas
users = set(df["user_id"])
logout = set(df[df["event"]=="logout"]["user_id"])
users - logout
✅ Pure Logic
users, logout = set(), set()
for u, e in zip(df["user_id"], df["event"]):
    users.add(u)
    if e == "logout":
        logout.add(u)
print(users - logout)

🔴 Problem 19 — Running Total Revenue

✅ Pandas
df["running_total"] = df["amount"].cumsum()
✅ Pure Logic
run = 0
res = []
for a in df["amount"]:
    run += a
    res.append(run)
print(res)

🔴 Problem 20 — Data Engineer Sanity Check

Question (Interview): When does pandas fail?

✅ Expected Answer
  • Very large datasets (memory-bound)
  • Skewed groupby keys
  • Hidden NaNs
  • Timezone bugs
  • When logic is unclear

👉 Strong DEs fall back to logic, not APIs


🎯 Final Advice (Very Important)

If you can:

  • Explain pandas and
  • Replace pandas with logic
  • Talk about trade-offs

Then you are Data Engineer ready.

Tools change. Thinking does not.


Previous