DK-003 — Probability
🎯 Why This Note Exists
Probability is not about gambling.
It is about reasoning under uncertainty.
Every time you:
- design an AI model
- debug noisy data
- read metrics (accuracy, precision, recall)
- make decisions with incomplete information
You are doing probability thinking.
This note is a full recap of the probability & statistics you must know, from zero intuition → conditional probability → Naive Bayes → the probability machinery inside modern AI (Sigmoid, Softmax, Cross-Entropy, Backpropagation).
No heavy math. Only ideas you can remember forever.
🎲 What Is Probability? (Human Version)
Probability = how likely something is to happen
It is always a number between:
$$ 0 \le P(\text{event}) \le 1 $$
- 0 → impossible
- 1 → guaranteed
⚽ Messi Example — Simple Probability
Imagine:
- Messi takes 10 penalties
- He scores 8 goals
Probability Messi scores:
$$ P(\text{goal}) = \frac{8}{10} = 0.8 $$
Human meaning:
If Messi takes a penalty,
80% chance it goes in.
🧙 Harry Potter Example — House Sorting
Suppose Hogwarts has 100 students:
| House | Students |
|---|---|
| Gryffindor | 40 |
| Slytherin | 25 |
| Ravenclaw | 20 |
| Hufflepuff | 15 |
Probability a random student is Gryffindor:
$$ P(\text{Gryffindor}) = \frac{40}{100} = 0.4 $$
📦 Sample Space & Events
Sample Space (Ω)
All possible outcomes.
Example: rolling a die
$$ \Omega = \{1, 2, 3, 4, 5, 6\} $$
Event
A subset of outcomes.
Example:
- Event A = “even number” = {2, 4, 6}
➕ Addition Rule (OR)
Probability that A or B happens:
$$ P(A \cup B) = P(A) + P(B) - P(A \cap B) $$
🎴 Card Example
- A = draw a Heart
- B = draw a Face card
We subtract overlap to avoid double counting.
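With a standard 52-card deck this works out as follows (13 hearts, 12 face cards, 3 of which are heart face cards):

$$ P(\text{Heart} \cup \text{Face}) = \frac{13}{52} + \frac{12}{52} - \frac{3}{52} = \frac{22}{52} \approx 0.42 $$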
✖️ Multiplication Rule (AND)
For independent events:
$$ P(A \cap B) = P(A) \times P(B) $$
⚽ Messi Example
- Probability Messi scores = 0.8
- Probability goalkeeper guesses wrong = 0.7
Both happen:
$$ 0.8 \times 0.7 = 0.56 $$
🔗 Conditional Probability — The CORE IDEA
📐 Formula
$$ P(A \mid B) = \frac{P(A \cap B)}{P(B)} $$
Read as:
Probability of A,
given that B already happened
🧠 Intuition (Very Important)
Conditional probability changes the universe.
You are no longer asking:
“What is the chance overall?”
You are asking:
“What is the chance inside a filtered world?”
🧙 Harry Potter Example — Conditional Probability
Suppose:
- 40% of students are Gryffindor
- Among Gryffindor, 70% are brave
Question:
What is the probability a student is brave given Gryffindor?
$$ P(\text{Brave} \mid \text{Gryffindor}) = 0.7 $$
The universe is now only Gryffindor students.
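Flipping the definition around also gives the joint probability:

$$ P(\text{Brave} \cap \text{Gryffindor}) = P(\text{Brave} \mid \text{Gryffindor}) \times P(\text{Gryffindor}) = 0.7 \times 0.4 = 0.28 $$

So 28% of all Hogwarts students are brave Gryffindors.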
🔄 Bayes’ Theorem (The Famous One)
📐 Formula
$$ P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)} $$
This is just conditional probability rearranged.
🧠 Why Bayes Is Powerful
Bayes lets you:
Reverse the direction of thinking
From:
- “If I know the cause, what happens?”
To:
- “If I see the result, what is the cause?”
This is the heart of:
- diagnosis
- spam detection
- machine learning
- Naive Bayes
⚽ Messi Injury Example — Bayes Thinking
Suppose:
- 1% of players are injured
- If injured, Messi plays badly 90% of the time
- If not injured, Messi plays badly 5% of the time
You see Messi plays badly.
Question:
What is the probability he is injured?
This is Bayes.
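Working it through (B = plays badly, I = injured):

$$ P(I \mid B) = \frac{P(B \mid I)\,P(I)}{P(B \mid I)\,P(I) + P(B \mid \neg I)\,P(\neg I)} = \frac{0.9 \times 0.01}{0.9 \times 0.01 + 0.05 \times 0.99} \approx 0.15 $$

Even after a bad game, the chance he is injured is only about 15%, because injuries are rare in the first place.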
📊 Statistics You Must Know (Light & Practical)
Mean (Average)
$$ \mu = \frac{1}{n} \sum_{i=1}^{n} x_i $$
Messi goals per match → overall performance.
Variance & Standard Deviation
How spread out the data is.
- Low variance → consistent
- High variance → unpredictable
Harry Potter exams:
- Hermione → low variance
- Seamus → 🔥💥 high variance
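As a quick sketch, here is mean and variance in pure Python. The exam scores are made-up numbers purely for illustration:

```python
# Made-up exam scores (illustrative only)
hermione = [92, 94, 93, 95, 91]   # consistent -> low variance
seamus   = [40, 98, 55, 90, 62]   # erratic    -> high variance

def mean(xs):
    return sum(xs) / len(xs)

def variance(xs):
    m = mean(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)   # population variance

print(round(variance(hermione), 1))   # 2.0
print(round(variance(seamus), 1))     # 473.6
```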
🤖 From Probability to Machine Learning
Most ML models answer:
“Given features X, what is the probability of class Y?”
This is exactly:
$$ P(Y \mid X) $$
🧠 Naive Bayes — Final Boss (But Simple)
Core Assumption
Features are conditionally independent
That’s why it’s called naive.
📐 Formula
$$ P(C \mid x_1,x_2,\dots,x_n) \propto P(C)\prod_{i=1}^{n} P(x_i \mid C) $$
Meaning:
- Start with the prior belief P(C)
- Multiply the likelihoods of the features P(x_i | C)
- Choose the class with the highest score
📧 Spam Email Example (Classic)
Features:
- contains “free”
- contains “win”
- contains “urgent”
Class:
- Spam / Not Spam
Naive Bayes asks:
If an email has these words,
which class is more probable?
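Below is a minimal sketch of that comparison in pure Python. The priors and word likelihoods are made-up illustrative numbers, not real spam statistics:

```python
# Minimal Naive Bayes comparison for one email.
# All probabilities below are hypothetical, chosen only to illustrate the mechanics.

priors = {"spam": 0.4, "not_spam": 0.6}

# hypothetical P(word present | class)
likelihoods = {
    "spam":     {"free": 0.30, "win": 0.20, "urgent": 0.25},
    "not_spam": {"free": 0.02, "win": 0.01, "urgent": 0.03},
}

email_words = ["free", "win", "urgent"]

scores = {}
for c in priors:
    score = priors[c]                      # start with the prior P(C)
    for w in email_words:
        score *= likelihoods[c][w]         # multiply likelihoods (naive independence)
    scores[c] = score

print(max(scores, key=scores.get))         # spam
```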
🧙 Harry Potter Sorting Hat — Naive Bayes Style
Features:
- brave = yes
- ambitious = no
- loves books = yes
Compute probability for each house:
- Gryffindor
- Ravenclaw
- Slytherin
- Hufflepuff
Pick max probability.
🎩✨ That’s Naive Bayes.
🧠 Final Mental Model (Remember This)
Probability is about:
- Counting possibilities
- Filtering worlds
- Updating belief with evidence
If you understand:
- P(A)
- P(A ∣ B)
- Bayes’ rule
You already think like:
- a data scientist
- an AI engineer
- a rational decision-maker
🔢 From Scores to Probabilities (The AI Bridge)
In real AI systems, models do not output probabilities directly.
They output scores (also called logits).
Example:
- Messi form score = 2.3
- Harry bravery score = -1.2
These scores:
- can be negative
- can be larger than 1
- are not probabilities
So we need a function that converts:
any real number → valid probability
That’s where Sigmoid and Softmax come in.
🔁 Sigmoid Function — Probability for Binary Decisions
🎯 When to Use Sigmoid
Use Sigmoid when:
- only 2 outcomes
- yes / no
- spam / not spam
- injured / not injured
This is the heart of Logistic Regression.
📐 Sigmoid Formula
$$ \sigma(x) = \frac{1}{1 + e^{-x}} $$
Properties:
- Output is always between 0 and 1
- Large positive x → probability close to 1
- Large negative x → probability close to 0
🧠 Intuition (Human Version)
Sigmoid answers:
“Given this score, how confident should I be?”
It squashes any number into a probability.
⚽ Messi Example — Sigmoid Intuition
Suppose an AI model computes:
Messi injury score = 2.0
This means:
- evidence supports injury
Let’s compute probability.
Step-by-step calculation
$$ \sigma(2) = \frac{1}{1 + e^{-2}} $$
We know:
- $$ e^{-2} \approx 0.135 $$
So:
$$ \sigma(2) \approx \frac{1}{1 + 0.135} \approx 0.88 $$
🎯 Interpretation:
88% probability Messi is injured
🧑💻 Pure Python — Sigmoid (NO Libraries)
```python
def exp(x):
    # simple exponential approximation
    e = 2.718281828
    return e ** x

def sigmoid(x):
    return 1 / (1 + exp(-x))

print(sigmoid(2))    # ~0.88
print(sigmoid(0))    # 0.5
print(sigmoid(-2))   # ~0.12
```
📌 Key Sigmoid Landmarks (Very Important)
| x | sigmoid(x) | Meaning |
|---|---|---|
| -∞ | 0 | impossible |
| -2 | ~0.12 | unlikely |
| 0 | 0.5 | unsure |
| +2 | ~0.88 | likely |
| +∞ | 1 | guaranteed |
This explains decision boundaries in AI.
🧠 Logistic Regression = Probability Model
Logistic Regression does:
- Compute a score $$ z = w_1x_1 + w_2x_2 + b $$
- Convert score → probability using Sigmoid
- Decide using a threshold (e.g. 0.5)
So it is pure probability thinking, not magic.
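Here is a tiny numeric walk-through of those three steps. The weights and features are made-up values for illustration, not learned parameters:

```python
import math

w1, w2, b = 1.5, -0.5, 0.2         # hypothetical "learned" parameters
x1, x2 = 1.0, 2.0                   # hypothetical features

z = w1 * x1 + w2 * x2 + b           # step 1: linear score
p = 1 / (1 + math.exp(-z))          # step 2: sigmoid -> probability
label = 1 if p >= 0.5 else 0        # step 3: threshold decision

print(round(z, 2), round(p, 3), label)   # 0.7 0.668 1
```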
🎯 Why Sigmoid Works for Probability
Because it satisfies:
- outputs in [0, 1]
- smooth & differentiable
- interpretable as confidence
That’s why:
- Logistic Regression
- Binary classifiers
- Neural networks (binary output)
all use Sigmoid.
🌈 Softmax — Probability for Multiple Classes
🎯 When to Use Softmax
Use Softmax when:
- more than 2 classes
- image classification
- language models
- Hogwarts house sorting 🧙
📐 Softmax Formula
For scores $$ z_1, z_2, \dots, z_n $$
$$ \text{softmax}(z_i) = \frac{e^{z_i}}{\sum_{j=1}^{n} e^{z_j}} $$
🧠 Intuition (Human Version)
Softmax:
- turns scores into probabilities
- ensures:
  - all probabilities ≥ 0
  - sum = 1
It answers:
“Which class is most likely, relative to the others?”
🧙 Harry Potter Example — Sorting Hat Scores
Suppose the Sorting Hat gives scores:
| House | Score |
|---|---|
| Gryffindor | 2.0 |
| Ravenclaw | 1.0 |
| Slytherin | 0.5 |
| Hufflepuff | 0.0 |
These are not probabilities yet.
Step-by-step Softmax Calculation
Compute exponentials:
- $$ e^{2.0} \approx 7.39 $$
- $$ e^{1.0} \approx 2.72 $$
- $$ e^{0.5} \approx 1.65 $$
- $$ e^{0.0} = 1.0 $$
Sum:
$$ 7.39 + 2.72 + 1.65 + 1.0 = 12.76 $$
Final probabilities:
| House | Probability |
|---|---|
| Gryffindor | 7.39 / 12.76 ≈ 0.58 |
| Ravenclaw | 2.72 / 12.76 ≈ 0.21 |
| Slytherin | 1.65 / 12.76 ≈ 0.13 |
| Hufflepuff | 1.0 / 12.76 ≈ 0.08 |
🎩 Result: Gryffindor wins
🧑💻 Pure Python — Softmax (NO Libraries)
```python
def exp(x):
    e = 2.718281828
    return e ** x

def softmax(scores):
    exp_scores = [exp(s) for s in scores]
    total = sum(exp_scores)
    return [s / total for s in exp_scores]

scores = [2.0, 1.0, 0.5, 0.0]
probs = softmax(scores)

for house, p in zip(
    ["Gryffindor", "Ravenclaw", "Slytherin", "Hufflepuff"],
    probs
):
    print(house, round(p, 3))
```
🔑 Sigmoid vs Softmax (Must Remember)
| Aspect | Sigmoid | Softmax |
|---|---|---|
| Output classes | 2 | ≥ 2 |
| Output sum | not required | always = 1 |
| Typical use | binary classification | multiclass classification |
| Example | spam / not spam | digit 0–9 |
🧠 Final Unifying Picture (Very Important)
Everything connects:
Raw score (logit)
↓
Sigmoid / Softmax
↓
Probability
↓
Decision
This means:
- Logistic Regression
- Naive Bayes
- Neural Networks
- Deep Learning
👉 are probabilistic models at heart
🏁 Final Thought for Students
AI does not predict labels. AI predicts probabilities.
Once students understand:
- probability
- Bayes
- Sigmoid
- Softmax
They are AI-ready, not just ML-ready.
🏁 Closing Thought
Probability is not math to memorize.
It is logic for uncertain worlds.
Once this clicks,
AI stops being magic
and becomes engineering.
📉 Why AI Needs a “Loss Function”
A model outputs a probability. But training needs a number to minimize.
Loss answers one question:
“How wrong is this probability?”
- If the prediction is perfect → loss = 0
- If the prediction is confident but wrong → loss = huge
🔥 Cross-Entropy Loss — Probability Punishment
🎯 What Cross-Entropy Measures
Cross-Entropy measures:
Distance between true probability and predicted probability
It strongly punishes:
- confident wrong predictions
- weak confidence in correct answers
This is why it dominates modern AI.
🎯 Binary Cross-Entropy (Most Important First)
Used with Sigmoid / Logistic Regression
📐 Formula
For one data point:
$$ L(y, \hat{y}) = - \big[ y \log(\hat{y}) + (1-y)\log(1-\hat{y}) \big] $$
Where:
- y = true label (0 or 1)
- ŷ = predicted probability
🧠 Intuition (Human Version)
| Situation | Loss |
|---|---|
| Correct & confident | very small |
| Correct but unsure | medium |
| Wrong & confident | huge |
AI learns by avoiding embarrassment 😅
⚽ Messi Injury Example — Cross-Entropy
True label:
y = 1 (Messi is injured)
Case 1 — Good prediction
ŷ = 0.9
Loss:
$$ L = -\log(0.9) ≈ 0.105 $$
✅ small punishment
Case 2 — Terrible prediction
ŷ = 0.01
Loss:
$$ L = -\log(0.01) ≈ 4.6 $$
🔥 massive punishment
📉 Why LOG is Used (Very Important)
Students always ask:
“Why not just (y − ŷ)² ?”
Here is the real reason.
📉 Reason 1: Log Turns Multiplication into Addition
Probabilities multiply:
$$ P = P_1 \times P_2 \times P_3 $$
Taking log:
$$ \log P = \log P_1 + \log P_2 + \log P_3 $$
✅ numerically stable
✅ easier to optimize
✅ no underflow
📉 Reason 2: Log Explodes Confident Mistakes
If:
ŷ → 0 but y = 1
Then:
log(ŷ) → -∞
🔥 Loss → ∞
This forces the model to learn fast.
📉 Reason 3: Log-Likelihood = Probability Maximization
Training AI is actually:
Maximize probability of observed data
But optimizers minimize, so we use:
$$ \text{Loss} = - \log(\text{Likelihood}) $$
This gives:
- Cross-Entropy
- Log-Loss
- Negative Log-Likelihood (NLL)
👉 all the same family.
🧠 Cross-Entropy = Negative Log-Likelihood
This identity is core AI knowledge:
Training a classifier = maximizing likelihood
Loss just flips the sign.
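Concretely, for n independent training points:

$$ -\log \prod_{i=1}^{n} P(y_i \mid x_i) = -\sum_{i=1}^{n} \log P(y_i \mid x_i) = \sum_{i=1}^{n} L(y_i, \hat{y}_i) $$

Minimizing the summed cross-entropy is exactly maximizing the likelihood of the data.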
🤖 Logistic Regression — From Scratch (Pure Python)
Now everything connects.
🧠 Logistic Regression Model
Step 1 — Linear Score
$$ z = wx + b $$
Step 2 — Sigmoid
$$ \hat{y} = \sigma(z) $$
Step 3 — Cross-Entropy Loss
$$ L = -[y\log(\hat{y}) + (1-y)\log(1-\hat{y})] $$
🧑💻 Pure Python Implementation (NO LIBS)
🔢 Math Helpers
```python
def exp(x):
    e = 2.718281828
    return e ** x

def log(x):
    # natural log approximation
    n = 1000
    return n * ((x ** (1 / n)) - 1)

def sigmoid(x):
    return 1 / (1 + exp(-x))
```
📉 Loss Function
```python
def binary_cross_entropy(y, y_hat):
    eps = 1e-9  # avoid log(0)
    return - (y * log(y_hat + eps) +
              (1 - y) * log(1 - y_hat + eps))
```
🔁 Training Loop (1 Feature)
```python
# training data
X = [1, 2, 3, 4]   # feature (e.g. injury indicators)
Y = [0, 0, 1, 1]   # labels

# parameters
w = 0.0
b = 0.0
lr = 0.1

for epoch in range(1000):
    dw = 0
    db = 0
    loss = 0

    for x, y in zip(X, Y):
        z = w * x + b
        y_hat = sigmoid(z)
        loss += binary_cross_entropy(y, y_hat)

        # gradients
        dw += (y_hat - y) * x
        db += (y_hat - y)

    # update
    w -= lr * dw / len(X)
    b -= lr * db / len(X)

    if epoch % 200 == 0:
        print("epoch", epoch, "loss", round(loss, 4))
```
🧠 What Students Should Realize
This model:
- uses probability
- uses log
- uses cross-entropy
- uses gradient descent
👉 This is real AI, not toy math.
🔗 Everything Connects (Final Mental Map)
Linear score
↓
Sigmoid
↓
Probability
↓
Log
↓
Cross-Entropy
↓
Gradient Descent
↓
Learning
🏁 Final Truth (Put This on a Slide)
AI is not guessing labels. AI is optimizing probabilities using logs.
Once students understand this, they can:
- read ML papers
- debug models
- move to deep learning smoothly
🧠 Backpropagation — Derived by Hand (No Magic)
🎯 What Backpropagation Really Is
Backpropagation is not magic.
It is simply:
The chain rule applied repeatedly, backwards
🔗 Chain Rule (Foundation)
$$ y = f(g(x)) $$
$$ \frac{dy}{dx} = \frac{dy}{dg} \cdot \frac{dg}{dx} $$
Backpropagation applies this rule from the loss back to the parameters.
🧠 Minimal Neural Network (One Neuron)
Pipeline:
$$ x \rightarrow z \rightarrow \hat{y} \rightarrow L $$
Definitions
Linear transformation:
$$ z = wx + b $$
Sigmoid activation:
$$ \hat{y} = \sigma(z) $$
$$ \sigma(z) = \frac{1}{1 + e^{-z}} $$
Binary cross-entropy loss:
$$ L = -[y\log(\hat{y}) + (1-y)\log(1-\hat{y})] $$
🔥 Goal of Backpropagation
We want to compute:
$$ \frac{\partial L}{\partial w} $$
$$ \frac{\partial L}{\partial b} $$
These gradients tell us how to update the model.
✍️ Step-by-Step Gradient Derivation
Step 1 — Loss → Prediction
$$ \frac{\partial L}{\partial \hat{y}} = -\frac{y}{\hat{y}} + \frac{1-y}{1-\hat{y}} $$
Step 2 — Prediction → Linear Score
The sigmoid derivative is:
$$ \frac{\partial \hat{y}}{\partial z} = \hat{y}(1-\hat{y}) $$
Multiplying the two pieces (chain rule), almost everything cancels:
$$ \frac{\partial L}{\partial z} = \hat{y} - y $$
📌 This equation is the heart of modern AI
Step 3 — Linear Score → Parameters
Partial derivatives:
$$ \frac{\partial z}{\partial w} = x $$
$$ \frac{\partial z}{\partial b} = 1 $$
Final gradients:
$$ \frac{\partial L}{\partial w} = (\hat{y} - y)x $$
$$ \frac{\partial L}{\partial b} = (\hat{y} - y) $$
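As a sanity check, the closed-form gradient can be compared against a finite-difference estimate. A small sketch using Python's standard math module and arbitrary example values:

```python
import math

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

def loss(w, b, x, y):
    y_hat = sigmoid(w * x + b)
    return -(y * math.log(y_hat) + (1 - y) * math.log(1 - y_hat))

w, b, x, y = 0.5, -0.2, 2.0, 1      # arbitrary example values
y_hat = sigmoid(w * x + b)

analytic = (y_hat - y) * x                                        # formula from the derivation
eps = 1e-6
numeric = (loss(w + eps, b, x, y) - loss(w - eps, b, x, y)) / (2 * eps)

print(analytic, numeric)   # both ≈ -0.62, agreeing to several decimal places
```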
🧠 Why This Is Powerful
- No complicated calculus during training
- The same gradient structure works for:
  - Logistic Regression
  - Neural Networks
  - Deep Learning
Backpropagation is reusable probability calculus.
🔥 CNN From Scratch — Mathematical View
🎯 Why Convolution Exists
Fully connected networks:
- ignore spatial structure
- require too many parameters
Convolutions exploit:
Local connectivity and weight sharing
🧠 Definition of Convolution
A convolution is a sliding dot product.
🧊 1D Convolution Example
Input signal:
$$ x = [1, 2, 3, 4, 5] $$
Kernel:
$$ k = [1, 0, -1] $$
✍️ Manual Computation
$$ [1,2,3] \cdot [1,0,-1] = -2 $$
$$ [2,3,4] \cdot [1,0,-1] = -2 $$
$$ [3,4,5] \cdot [1,0,-1] = -2 $$
Output:
$$ [-2, -2, -2] $$
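The same sliding dot product in a few lines of pure Python, a sketch of the operation above (not a full CNN layer):

```python
# 1D convolution as a sliding dot product (no padding, stride 1)
def conv1d(x, k):
    out = []
    for i in range(len(x) - len(k) + 1):
        window = x[i:i + len(k)]
        out.append(sum(w * kv for w, kv in zip(window, k)))
    return out

print(conv1d([1, 2, 3, 4, 5], [1, 0, -1]))   # [-2, -2, -2]
```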
🧠 CNN Processing Pipeline
$$ \text{Image} \rightarrow \text{Convolution} \rightarrow \text{ReLU} \rightarrow \text{Pooling} \rightarrow \text{Fully Connected} \rightarrow \text{Softmax} $$
CNNs still end with Softmax + Cross-Entropy, making them probabilistic classifiers.
🧪 Numerical Stability — Log-Sum-Exp Trick
❌ The Numerical Problem
Softmax definition:
$$ \text{softmax}(z_i) = \frac{e^{z_i}}{\sum_j e^{z_j}} $$
If logits are large:
$$ z = [1000, 1001, 1002] $$
Then:
$$ e^{1002} \rightarrow \infty $$
💥 overflow breaks training.
✅ Log-Sum-Exp Identity
Let:
$$ m = \max(z) $$
Then:
$$ \log \sum_i e^{z_i} = m + \log \sum_i e^{z_i - m} $$
and equivalently:
$$ \text{softmax}(z_i) = \frac{e^{z_i - m}}{\sum_j e^{z_j - m}} $$
🧠 Why This Works
- Keeps exponentials numerically small
- Preserves exact probabilities
- Used in all major deep-learning frameworks
📉 Stable Cross-Entropy (Direct Form)
Instead of computing:
$$ \text{Softmax} \rightarrow \log \rightarrow \text{Loss} $$
We compute:
$$ L = -z_y + \log \sum_i e^{z_i} $$
(using log-sum-exp)
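A minimal sketch of the trick in Python (using the standard math module), applied to the logits from the overflow example; without subtracting the max, `e**1002` would overflow:

```python
import math

def stable_softmax(z):
    m = max(z)                                   # shift by the largest logit
    exps = [math.exp(v - m) for v in z]          # exponents are now <= 0
    total = sum(exps)
    return [e / total for e in exps]

print([round(p, 3) for p in stable_softmax([1000, 1001, 1002])])
# [0.09, 0.245, 0.665]
```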
📝 Quiz — Test Yourself
1️⃣ What does probability measure?
Show answer
How likely an event is to happen, represented by a number between 0 and 1.
2️⃣ If Messi scores 8 goals out of 10 penalties, what is the probability he scores?
A) 0.2 B) 0.5 C) 0.8 D) 1.0
Show answer
C) 0.8
3️⃣ What is the sample space when rolling a standard die?
Show answer
{1, 2, 3, 4, 5, 6}
4️⃣ If Gryffindor has 40 students out of 100, what is P(Gryffindor)?
Show answer
40 / 100 = 0.4
5️⃣ What does an event mean in probability?
Show answer
A subset of the sample space (a group of outcomes).
6️⃣ What is the probability range of any event?
A) −1 to 1 B) 0 to 10 C) 0 to 1 D) Any real number
Show answer
C) 0 to 1
7️⃣ Which formula represents conditional probability?
A) P(A ∪ B) B) P(A ∩ B) C) P(A | B) = P(A ∩ B) / P(B) D) P(A) + P(B)
Show answer
C) P(A | B) = P(A ∩ B) / P(B)
8️⃣ What does P(A | B) mean in plain English?
Show answer
The probability of A happening, given that B has already happened.
9️⃣ In Hogwarts terms, conditional probability means:
Show answer
You only look at students inside a specific house, not the whole school.
🔟 Which rule is used for independent events?
Show answer
Multiplication rule: P(A ∩ B) = P(A) × P(B)
1️⃣1️⃣ If Messi scores with probability 0.8 and the goalkeeper guesses wrong with probability 0.7, what is the probability both happen?
Show answer
0.8 × 0.7 = 0.56
1️⃣2️⃣ What problem does the addition rule solve?
Show answer
It avoids double-counting overlapping events.
1️⃣3️⃣ What is Bayes’ Theorem mainly used for?
Show answer
Reversing probability: updating beliefs after seeing evidence.
1️⃣4️⃣ Which formula is Bayes’ Theorem?
A) P(A ∩ B) B) P(A | B) = P(B | A)P(A) / P(B) C) P(A) + P(B) D) P(A) − P(B)
Show answer
B) P(A | B) = P(B | A)P(A) / P(B)
1️⃣5️⃣ In Messi injury analysis, Bayes helps answer:
Show answer
Given Messi played badly, how likely is it that he is injured?
1️⃣6️⃣ What does mean (average) measure?
Show answer
The central value of the data.
1️⃣7️⃣ High variance means what?
Show answer
Data is spread out and unpredictable.
1️⃣8️⃣ Who has lower variance in exam scores?
A) Hermione B) Seamus
Show answer
A) Hermione
1️⃣9️⃣ What is the key assumption of Naive Bayes?
Show answer
Features are conditionally independent given the class.
2️⃣0️⃣ Why is Naive Bayes powerful despite being “naive”?
Show answer
Because it is simple, fast, and works surprisingly well in real-world problems like spam detection and text classification.