DK-003 — Probability
🎯 Why This Note Exists
Probability is not about gambling.
It is about reasoning under uncertainty.
Every time you:
- design an AI model
- debug noisy data
- read metrics (accuracy, precision, recall)
- make decisions with incomplete information
You are doing probability thinking.
This note is a full recap of the probability & statistics you must know, from zero intuition → conditional probability → Naive Bayes → the probability machinery inside modern AI (Sigmoid, Softmax, Cross-Entropy, Backpropagation).
No heavy math. Only ideas you can remember forever.
🎲 What Is Probability? (Human Version)
Probability = how likely something is to happen
It is always a number between:
$$ 0 \le P(\text{event}) \le 1 $$
- 0 → impossible
- 1 → guaranteed
⚽ Messi Example — Simple Probability
Imagine:
- Messi takes 10 penalties
- He scores 8 goals
Probability Messi scores:
$$ P(\text{goal}) = \frac{8}{10} = 0.8 $$
Human meaning:
If Messi takes a penalty,
80% chance it goes in.
🧙 Harry Potter Example — House Sorting
Suppose Hogwarts has 100 students:
| House | Students |
|---|---|
| Gryffindor | 40 |
| Slytherin | 25 |
| Ravenclaw | 20 |
| Hufflepuff | 15 |
Probability a random student is Gryffindor:
$$ P(\text{Gryffindor}) = \frac{40}{100} = 0.4 $$
📦 Sample Space & Events
Sample Space (Ω)
All possible outcomes.
Example: rolling a die
$$ \Omega = \{1, 2, 3, 4, 5, 6\} $$
Event
A subset of outcomes.
Example:
- Event A = “even number” = {2, 4, 6}
➕ Addition Rule (OR)
Probability that A or B happens:
$$ P(A \cup B) = P(A) + P(B) - P(A \cap B) $$
🎴 Card Example
- A = draw a Heart
- B = draw a Face card
We subtract overlap to avoid double counting.
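With a standard 52-card deck this works out as follows (13 hearts, 12 face cards, 3 of which are heart face cards):

$$ P(\text{Heart} \cup \text{Face}) = \frac{13}{52} + \frac{12}{52} - \frac{3}{52} = \frac{22}{52} \approx 0.42 $$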
✖️ Multiplication Rule (AND)
For independent events:
$$ P(A \cap B) = P(A) \times P(B) $$
⚽ Messi Example
- Probability Messi scores = 0.8
- Probability goalkeeper guesses wrong = 0.7
Both happen:
$$ 0.8 \times 0.7 = 0.56 $$
🔗 Conditional Probability — The CORE IDEA
📐 Formula
$$ P(A \mid B) = \frac{P(A \cap B)}{P(B)} $$
Read as:
Probability of A,
given that B already happened
🧠 Intuition (Very Important)
Conditional probability changes the universe.
You are no longer asking:
“What is the chance overall?”
You are asking:
“What is the chance inside a filtered world?”
🧙 Harry Potter Example — Conditional Probability
Suppose:
- 40% of students are Gryffindor
- Among Gryffindor, 70% are brave
Question:
What is the probability a student is brave given Gryffindor?
$$ P(\text{Brave} \mid \text{Gryffindor}) = 0.7 $$
The universe is now only Gryffindor students.
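Flipping the definition around also gives the joint probability:

$$ P(\text{Brave} \cap \text{Gryffindor}) = P(\text{Brave} \mid \text{Gryffindor}) \times P(\text{Gryffindor}) = 0.7 \times 0.4 = 0.28 $$

So 28% of all Hogwarts students are brave Gryffindors.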
🔄 Bayes’ Theorem (The Famous One)
📐 Formula
$$ P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)} $$
This is just conditional probability rearranged.
🧠 Why Bayes Is Powerful
Bayes lets you:
Reverse the direction of thinking
From:
- “If I know the cause, what happens?”
To:
- “If I see the result, what is the cause?”
This is the heart of:
- diagnosis
- spam detection
- machine learning
- Naive Bayes
⚽ Messi Injury Example — Bayes Thinking
Suppose:
- 1% of players are injured
- If injured, Messi plays badly 90% of the time
- If not injured, Messi plays badly 5% of the time
You see Messi plays badly.
Question:
What is the probability he is injured?
This is Bayes.
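Working it through (B = plays badly, I = injured):

$$ P(I \mid B) = \frac{P(B \mid I)\,P(I)}{P(B \mid I)\,P(I) + P(B \mid \neg I)\,P(\neg I)} = \frac{0.9 \times 0.01}{0.9 \times 0.01 + 0.05 \times 0.99} \approx 0.15 $$

Even after a bad game, the chance he is injured is only about 15%, because injuries are rare in the first place.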
📊 Statistics You Must Know (Light & Practical)
Mean (Average)
$$ \mu = \frac{1}{n} \sum_{i=1}^{n} x_i $$
Messi goals per match → overall performance.
Variance & Standard Deviation
How spread out the data is.
- Low variance → consistent
- High variance → unpredictable
Harry Potter exams:
- Hermione → low variance
- Seamus → 🔥💥 high variance
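As a quick sketch, here is mean and variance in pure Python. The exam scores are made-up numbers purely for illustration:

```python
# Made-up exam scores (illustrative only)
hermione = [92, 94, 93, 95, 91]   # consistent -> low variance
seamus   = [40, 98, 55, 90, 62]   # erratic    -> high variance

def mean(xs):
    return sum(xs) / len(xs)

def variance(xs):
    m = mean(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)   # population variance

print(round(variance(hermione), 1))   # 2.0
print(round(variance(seamus), 1))     # 473.6
```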
🤖 From Probability to Machine Learning
Most ML models answer:
“Given features X, what is the probability of class Y?”
This is exactly:
$$ P(Y \mid X) $$
🧠 Naive Bayes — Final Boss (But Simple)
Core Assumption
Features are conditionally independent
That’s why it’s called naive.
📐 Formula
$$ P(C \mid x_1,x_2,\dots,x_n) \propto P(C)\prod_{i=1}^{n} P(x_i \mid C) $$
Meaning:
- Start with the prior belief P(C)
- Multiply the likelihoods of the features P(x_i | C)
- Choose the class with the highest score
📧 Spam Email Example (Classic)
Features:
- contains “free”
- contains “win”
- contains “urgent”
Class:
- Spam / Not Spam
Naive Bayes asks:
If an email has these words,
which class is more probable?
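Below is a minimal sketch of that comparison in pure Python. The priors and word likelihoods are made-up illustrative numbers, not real spam statistics:

```python
# Minimal Naive Bayes comparison for one email.
# All probabilities below are hypothetical, chosen only to illustrate the mechanics.

priors = {"spam": 0.4, "not_spam": 0.6}

# hypothetical P(word present | class)
likelihoods = {
    "spam":     {"free": 0.30, "win": 0.20, "urgent": 0.25},
    "not_spam": {"free": 0.02, "win": 0.01, "urgent": 0.03},
}

email_words = ["free", "win", "urgent"]

scores = {}
for c in priors:
    score = priors[c]                      # start with the prior P(C)
    for w in email_words:
        score *= likelihoods[c][w]         # multiply likelihoods (naive independence)
    scores[c] = score

print(max(scores, key=scores.get))         # spam
```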
🧙 Harry Potter Sorting Hat — Naive Bayes Style
Features:
- brave = yes
- ambitious = no
- loves books = yes
Compute probability for each house:
- Gryffindor
- Ravenclaw
- Slytherin
- Hufflepuff
Pick max probability.
🎩✨ That’s Naive Bayes.
🧠 Final Mental Model (Remember This)
Probability is about:
- Counting possibilities
- Filtering worlds
- Updating belief with evidence
If you understand:
- P(A)
- P(A ∣ B)
- Bayes’ rule
You already think like:
- a data scientist
- an AI engineer
- a rational decision-maker
🔢 From Scores to Probabilities (The AI Bridge)
In real AI systems, models do not output probabilities directly.
They output scores (also called logits).
Example:
- Messi form score = 2.3
- Harry bravery score = -1.2
These scores:
- can be negative
- can be larger than 1
- are not probabilities
So we need a function that converts:
any real number → valid probability
That’s where Sigmoid and Softmax come in.
🔁 Sigmoid Function — Probability for Binary Decisions
🎯 When to Use Sigmoid
Use Sigmoid when:
- only 2 outcomes
- yes / no
- spam / not spam
- injured / not injured
This is the heart of Logistic Regression.
📐 Sigmoid Formula
$$ \sigma(x) = \frac{1}{1 + e^{-x}} $$
Properties:
- Output is always between 0 and 1
- Large positive x → probability close to 1
- Large negative x → probability close to 0
🧠 Intuition (Human Version)
Sigmoid answers:
“Given this score, how confident should I be?”
It squashes any number into a probability.
⚽ Messi Example — Sigmoid Intuition
Suppose an AI model computes:
Messi injury score = 2.0
This means:
- evidence supports injury
Let’s compute probability.
Step-by-step calculation
$$ \sigma(2) = \frac{1}{1 + e^{-2}} $$
We know:
- $$ e^{-2} \approx 0.135 $$
So:
$$ \sigma(2) \approx \frac{1}{1 + 0.135} \approx 0.88 $$
🎯 Interpretation:
88% probability Messi is injured
🧑💻 Pure Python — Sigmoid (NO Libraries)
```python
def exp(x):
    # simple exponential approximation
    e = 2.718281828
    return e ** x

def sigmoid(x):
    return 1 / (1 + exp(-x))

print(sigmoid(2))    # ~0.88
print(sigmoid(0))    # 0.5
print(sigmoid(-2))   # ~0.12
```
📌 Key Sigmoid Landmarks (Very Important)
| x | sigmoid(x) | Meaning |
|---|---|---|
| -∞ | 0 | impossible |
| -2 | ~0.12 | unlikely |
| 0 | 0.5 | unsure |
| +2 | ~0.88 | likely |
| +∞ | 1 | guaranteed |
This explains decision boundaries in AI.
🧠 Logistic Regression = Probability Model
Logistic Regression does:
- Compute a score $$ z = w_1x_1 + w_2x_2 + b $$
- Convert score → probability using Sigmoid
- Decide using a threshold (e.g. 0.5)
So it is pure probability thinking, not magic.
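Here is a tiny numeric walk-through of those three steps. The weights and features are made-up values for illustration, not learned parameters:

```python
import math

w1, w2, b = 1.5, -0.5, 0.2         # hypothetical "learned" parameters
x1, x2 = 1.0, 2.0                   # hypothetical features

z = w1 * x1 + w2 * x2 + b           # step 1: linear score
p = 1 / (1 + math.exp(-z))          # step 2: sigmoid -> probability
label = 1 if p >= 0.5 else 0        # step 3: threshold decision

print(round(z, 2), round(p, 3), label)   # 0.7 0.668 1
```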
🎯 Why Sigmoid Works for Probability
Because it satisfies:
- outputs in [0, 1]
- smooth & differentiable
- interpretable as confidence
That’s why:
- Logistic Regression
- Binary classifiers
- Neural networks (binary output)
all use Sigmoid.
🌈 Softmax — Probability for Multiple Classes
🎯 When to Use Softmax
Use Softmax when:
- more than 2 classes
- image classification
- language models
- Hogwarts house sorting 🧙
📐 Softmax Formula
For scores $$ z_1, z_2, \dots, z_n $$
$$ \text{softmax}(z_i) = \frac{e^{z_i}}{\sum_{j=1}^{n} e^{z_j}} $$
🧠 Intuition (Human Version)
Softmax:
- turns scores into probabilities
- ensures:
  - all probabilities ≥ 0
  - sum = 1
It answers:
“Which class is most likely, relative to the others?”
🧙 Harry Potter Example — Sorting Hat Scores
Suppose the Sorting Hat gives scores:
| House | Score |
|---|---|
| Gryffindor | 2.0 |
| Ravenclaw | 1.0 |
| Slytherin | 0.5 |
| Hufflepuff | 0.0 |
These are not probabilities yet.
Step-by-step Softmax Calculation
Compute exponentials:
- $$ e^{2.0} \approx 7.39 $$
- $$ e^{1.0} \approx 2.72 $$
- $$ e^{0.5} \approx 1.65 $$
- $$ e^{0.0} = 1.0 $$
Sum:
$$ 7.39 + 2.72 + 1.65 + 1.0 = 12.76 $$
Final probabilities:
| House | Probability |
|---|---|
| Gryffindor | 7.39 / 12.76 ≈ 0.58 |
| Ravenclaw | 2.72 / 12.76 ≈ 0.21 |
| Slytherin | 1.65 / 12.76 ≈ 0.13 |
| Hufflepuff | 1.0 / 12.76 ≈ 0.08 |
🎩 Result: Gryffindor wins
🧑💻 Pure Python — Softmax (NO Libraries)
```python
def exp(x):
    e = 2.718281828
    return e ** x

def softmax(scores):
    exp_scores = [exp(s) for s in scores]
    total = sum(exp_scores)
    return [s / total for s in exp_scores]

scores = [2.0, 1.0, 0.5, 0.0]
probs = softmax(scores)

for house, p in zip(
    ["Gryffindor", "Ravenclaw", "Slytherin", "Hufflepuff"],
    probs
):
    print(house, round(p, 3))
```
🔑 Sigmoid vs Softmax (Must Remember)
| Aspect | Sigmoid | Softmax |
|---|---|---|
| Output classes | 2 | ≥ 2 |
| Output sum | not required | always = 1 |
| Typical use | binary classification | multiclass classification |
| Example | spam / not spam | digit 0–9 |
🧠 Final Unifying Picture (Very Important)
Everything connects:
Raw score (logit)
↓
Sigmoid / Softmax
↓
Probability
↓
Decision
This means:
- Logistic Regression
- Naive Bayes
- Neural Networks
- Deep Learning
👉 are probabilistic models at heart
🏁 Final Thought for Students
AI does not predict labels. AI predicts probabilities.
Once students understand:
- probability
- Bayes
- Sigmoid
- Softmax
They are AI-ready, not just ML-ready.
🏁 Closing Thought
Probability is not math to memorize.
It is logic for uncertain worlds.
Once this clicks,
AI stops being magic
and becomes engineering.
📉 Why AI Needs a “Loss Function”
A model outputs a probability. But training needs a number to minimize.
Loss answers one question:
“How wrong is this probability?”
- If the prediction is perfect → loss = 0
- If the prediction is confident but wrong → loss = huge
🔥 Cross-Entropy Loss — Probability Punishment
🎯 What Cross-Entropy Measures
Cross-Entropy measures:
Distance between true probability and predicted probability
It strongly punishes:
- confident wrong predictions
- weak confidence in correct answers
This is why it dominates modern AI.
🎯 Binary Cross-Entropy (Most Important First)
Used with Sigmoid / Logistic Regression
📐 Formula
For one data point:
$$ L(y, \hat{y}) = - \big[ y \log(\hat{y}) + (1-y)\log(1-\hat{y}) \big] $$
Where:
- y = true label (0 or 1)
- ŷ = predicted probability
🧠 Intuition (Human Version)
| Situation | Loss |
|---|---|
| Correct & confident | very small |
| Correct but unsure | medium |
| Wrong & confident | huge |
AI learns by avoiding embarrassment 😅
⚽ Messi Injury Example — Cross-Entropy
True label:
y = 1 (Messi is injured)
Case 1 — Good prediction
ŷ = 0.9
Loss:
$$ L = -\log(0.9) ≈ 0.105 $$
✅ small punishment
Case 2 — Terrible prediction
ŷ = 0.01
Loss:
$$ L = -\log(0.01) ≈ 4.6 $$
🔥 massive punishment
📉 Why LOG is Used (Very Important)
Students always ask:
“Why not just (y − ŷ)² ?”
Here is the real reason.
📉 Reason 1: Log Turns Multiplication into Addition
Probabilities multiply:
$$ P = P_1 \times P_2 \times P_3 $$
Taking log:
$$ \log P = \log P_1 + \log P_2 + \log P_3 $$
✅ numerically stable
✅ easier to optimize
✅ no underflow
📉 Reason 2: Log Explodes Confident Mistakes
If:
ŷ → 0 but y = 1
Then:
log(ŷ) → -∞
🔥 Loss → ∞
This forces the model to learn fast.
📉 Reason 3: Log-Likelihood = Probability Maximization
Training AI is actually:
Maximize probability of observed data
But optimizers minimize, so we use:
$$ \text{Loss} = - \log(\text{Likelihood}) $$
This gives:
- Cross-Entropy
- Log-Loss
- Negative Log-Likelihood (NLL)
👉 all the same family.
🧠 Cross-Entropy = Negative Log-Likelihood
This identity is core AI knowledge:
Training a classifier = maximizing likelihood
Loss just flips the sign.
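Concretely, for n independent training points:

$$ -\log \prod_{i=1}^{n} P(y_i \mid x_i) = -\sum_{i=1}^{n} \log P(y_i \mid x_i) = \sum_{i=1}^{n} L(y_i, \hat{y}_i) $$

Minimizing the summed cross-entropy is exactly maximizing the likelihood of the data.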
🤖 Logistic Regression — From Scratch (Pure Python)
Now everything connects.
🧠 Logistic Regression Model
Step 1 — Linear Score
$$ z = wx + b $$
Step 2 — Sigmoid
$$ \hat{y} = \sigma(z) $$
Step 3 — Cross-Entropy Loss
$$ L = -[y\log(\hat{y}) + (1-y)\log(1-\hat{y})] $$
🧑💻 Pure Python Implementation (NO LIBS)
🔢 Math Helpers
```python
def exp(x):
    e = 2.718281828
    return e ** x

def log(x):
    # natural log approximation
    n = 1000
    return n * ((x ** (1 / n)) - 1)

def sigmoid(x):
    return 1 / (1 + exp(-x))
```
📉 Loss Function
```python
def binary_cross_entropy(y, y_hat):
    eps = 1e-9  # avoid log(0)
    return - (y * log(y_hat + eps) +
              (1 - y) * log(1 - y_hat + eps))
```
🔁 Training Loop (1 Feature)
```python
# training data
X = [1, 2, 3, 4]   # feature (e.g. injury indicators)
Y = [0, 0, 1, 1]   # labels

# parameters
w = 0.0
b = 0.0
lr = 0.1

for epoch in range(1000):
    dw = 0
    db = 0
    loss = 0

    for x, y in zip(X, Y):
        z = w * x + b
        y_hat = sigmoid(z)
        loss += binary_cross_entropy(y, y_hat)

        # gradients
        dw += (y_hat - y) * x
        db += (y_hat - y)

    # update
    w -= lr * dw / len(X)
    b -= lr * db / len(X)

    if epoch % 200 == 0:
        print("epoch", epoch, "loss", round(loss, 4))
```
🧠 What Students Should Realize
This model:
- uses probability
- uses log
- uses cross-entropy
- uses gradient descent
👉 This is real AI, not toy math.
🔗 Everything Connects (Final Mental Map)
Linear score
↓
Sigmoid
↓
Probability
↓
Log
↓
Cross-Entropy
↓
Gradient Descent
↓
Learning
🏁 Final Truth (Put This on a Slide)
AI is not guessing labels. AI is optimizing probabilities using logs.
Once students understand this, they can:
- read ML papers
- debug models
- move to deep learning smoothly
🧠 Backpropagation — Derived by Hand (No Magic)
🎯 What Backpropagation Really Is
Backpropagation is not magic.
It is simply:
The chain rule applied repeatedly, backwards
🔗 Chain Rule (Foundation)
$$ y = f(g(x)) $$
$$ \frac{dy}{dx} = \frac{dy}{dg} \cdot \frac{dg}{dx} $$
Backpropagation applies this rule from the loss back to the parameters.
🧠 Minimal Neural Network (One Neuron)
Pipeline:
$$ x \rightarrow z \rightarrow \hat{y} \rightarrow L $$
Definitions
Linear transformation:
$$ z = wx + b $$
Sigmoid activation:
$$ \hat{y} = \sigma(z) $$
$$ \sigma(z) = \frac{1}{1 + e^{-z}} $$
Binary cross-entropy loss:
$$ L = -[y\log(\hat{y}) + (1-y)\log(1-\hat{y})] $$
🔥 Goal of Backpropagation
We want to compute:
$$ \frac{\partial L}{\partial w} $$
$$ \frac{\partial L}{\partial b} $$
These gradients tell us how to update the model.
✍️ Step-by-Step Gradient Derivation
Step 1 — Loss → Prediction
$$ \frac{\partial L}{\partial \hat{y}} = -\frac{y}{\hat{y}} + \frac{1-y}{1-\hat{y}} $$
Step 2 — Prediction → Linear Score
The sigmoid derivative is:
$$ \frac{\partial \hat{y}}{\partial z} = \hat{y}(1-\hat{y}) $$
Multiplying the two pieces (chain rule), almost everything cancels:
$$ \frac{\partial L}{\partial z} = \hat{y} - y $$
📌 This equation is the heart of modern AI
Step 3 — Linear Score → Parameters
Partial derivatives:
$$ \frac{\partial z}{\partial w} = x $$
$$ \frac{\partial z}{\partial b} = 1 $$
Final gradients:
$$ \frac{\partial L}{\partial w} = (\hat{y} - y)x $$
$$ \frac{\partial L}{\partial b} = (\hat{y} - y) $$
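As a sanity check, the closed-form gradient can be compared against a finite-difference estimate. A small sketch using Python's standard math module and arbitrary example values:

```python
import math

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

def loss(w, b, x, y):
    y_hat = sigmoid(w * x + b)
    return -(y * math.log(y_hat) + (1 - y) * math.log(1 - y_hat))

w, b, x, y = 0.5, -0.2, 2.0, 1      # arbitrary example values
y_hat = sigmoid(w * x + b)

analytic = (y_hat - y) * x                                        # formula from the derivation
eps = 1e-6
numeric = (loss(w + eps, b, x, y) - loss(w - eps, b, x, y)) / (2 * eps)

print(analytic, numeric)   # both ≈ -0.62, agreeing to several decimal places
```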
🧠 Why This Is Powerful
- No complicated calculus during training
- The same gradient structure works for:
  - Logistic Regression
  - Neural Networks
  - Deep Learning
Backpropagation is reusable probability calculus.
🔥 CNN From Scratch — Mathematical View
🎯 Why Convolution Exists
Fully connected networks:
- ignore spatial structure
- require too many parameters
Convolutions exploit:
Local connectivity and weight sharing
🧠 Definition of Convolution
A convolution is a sliding dot product.
🧊 1D Convolution Example
Input signal:
$$ x = [1, 2, 3, 4, 5] $$
Kernel:
$$ k = [1, 0, -1] $$
✍️ Manual Computation
$$ [1,2,3] \cdot [1,0,-1] = -2 $$
$$ [2,3,4] \cdot [1,0,-1] = -2 $$
$$ [3,4,5] \cdot [1,0,-1] = -2 $$
Output:
$$ [-2, -2, -2] $$
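The same sliding dot product in a few lines of pure Python, a sketch of the operation above (not a full CNN layer):

```python
# 1D convolution as a sliding dot product (no padding, stride 1)
def conv1d(x, k):
    out = []
    for i in range(len(x) - len(k) + 1):
        window = x[i:i + len(k)]
        out.append(sum(w * kv for w, kv in zip(window, k)))
    return out

print(conv1d([1, 2, 3, 4, 5], [1, 0, -1]))   # [-2, -2, -2]
```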
🧠 CNN Processing Pipeline
$$ \text{Image} \rightarrow \text{Convolution} \rightarrow \text{ReLU} \rightarrow \text{Pooling} \rightarrow \text{Fully Connected} \rightarrow \text{Softmax} $$
CNNs still end with Softmax + Cross-Entropy, making them probabilistic classifiers.
🧪 Numerical Stability — Log-Sum-Exp Trick
❌ The Numerical Problem
Softmax definition:
$$ \text{softmax}(z_i) = \frac{e^{z_i}}{\sum_j e^{z_j}} $$
If logits are large:
$$ z = [1000, 1001, 1002] $$
Then:
$$ e^{1002} \rightarrow \infty $$
💥 overflow breaks training.
✅ Log-Sum-Exp Identity
Let:
$$ m = \max(z) $$
Then:
$$ \log \sum_i e^{z_i} = m + \log \sum_i e^{z_i - m} $$
and equivalently:
$$ \text{softmax}(z_i) = \frac{e^{z_i - m}}{\sum_j e^{z_j - m}} $$
🧠 Why This Works
- Keeps exponentials numerically small
- Preserves exact probabilities
- Used in all major deep-learning frameworks
📉 Stable Cross-Entropy (Direct Form)
Instead of computing:
$$ \text{Softmax} \rightarrow \log \rightarrow \text{Loss} $$
We compute:
$$ L = -z_y + \log \sum_i e^{z_i} $$
(using log-sum-exp)
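A minimal sketch of the trick in Python (using the standard math module), applied to the logits from the overflow example; without subtracting the max, `e**1002` would overflow:

```python
import math

def stable_softmax(z):
    m = max(z)                                   # shift by the largest logit
    exps = [math.exp(v - m) for v in z]          # exponents are now <= 0
    total = sum(exps)
    return [e / total for e in exps]

print([round(p, 3) for p in stable_softmax([1000, 1001, 1002])])
# [0.09, 0.245, 0.665]
```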
📝 Quiz — Test Yourself
1️⃣ What does probability measure?
Show answer
How likely an event is to happen, represented by a number between 0 and 1.
2️⃣ If Messi scores 8 goals out of 10 penalties, what is the probability he scores?
A) 0.2 B) 0.5 C) 0.8 D) 1.0
Show answer
C) 0.8
3️⃣ What is the sample space when rolling a standard die?
Show answer
{1, 2, 3, 4, 5, 6}
4️⃣ If Gryffindor has 40 students out of 100, what is P(Gryffindor)?
Show answer
40 / 100 = 0.4
5️⃣ What does an event mean in probability?
Show answer
A subset of the sample space (a group of outcomes).
6️⃣ What is the probability range of any event?
A) −1 to 1 B) 0 to 10 C) 0 to 1 D) Any real number
Show answer
C) 0 to 1
7️⃣ Which formula represents conditional probability?
A) P(A ∪ B) B) P(A ∩ B) C) P(A | B) = P(A ∩ B) / P(B) D) P(A) + P(B)
Show answer
C) P(A | B) = P(A ∩ B) / P(B)
8️⃣ What does P(A | B) mean in plain English?
Show answer
The probability of A happening, given that B has already happened.
9️⃣ In Hogwarts terms, conditional probability means:
Show answer
You only look at students inside a specific house, not the whole school.
🔟 Which rule is used for independent events?
Show answer
Multiplication rule: P(A ∩ B) = P(A) × P(B)
1️⃣1️⃣ If Messi scores with probability 0.8 and the goalkeeper guesses wrong with probability 0.7, what is the probability both happen?
Show answer
0.8 × 0.7 = 0.56
1️⃣2️⃣ What problem does the addition rule solve?
Show answer
It avoids double-counting overlapping events.
1️⃣3️⃣ What is Bayes’ Theorem mainly used for?
Show answer
Reversing probability: updating beliefs after seeing evidence.
1️⃣4️⃣ Which formula is Bayes’ Theorem?
A) P(A ∩ B) B) P(A | B) = P(B | A)P(A) / P(B) C) P(A) + P(B) D) P(A) − P(B)
Show answer
B) P(A | B) = P(B | A)P(A) / P(B)
1️⃣5️⃣ In Messi injury analysis, Bayes helps answer:
Show answer
Given Messi played badly, how likely is it that he is injured?
1️⃣6️⃣ What does mean (average) measure?
Show answer
The central value of the data.
1️⃣7️⃣ High variance means what?
Show answer
Data is spread out and unpredictable.
1️⃣8️⃣ Who has lower variance in exam scores?
A) Hermione B) Seamus
Show answer
A) Hermione
1️⃣9️⃣ What is the key assumption of Naive Bayes?
Show answer
Features are conditionally independent given the class.
2️⃣0️⃣ Why is Naive Bayes powerful despite being “naive”?
Show answer
Because it is simple, fast, and works surprisingly well in real-world problems like spam detection and text classification.