LC-TEST-18 – Basic NLP (V2)

📘 What Is TF-IDF? (Intuition + Math)

TF-IDF stands for:

Term Frequency – Inverse Document Frequency

It answers one fundamental NLP question:

โ Which words are important in this document,
compared to the entire collection? โž


🧮 The Math (Step by Step)

1️⃣ Term Frequency (TF)

How often a word appears inside one document.

$$ \text{TF}(w, d) = \frac{\text{count of } w \text{ in document } d} {\text{total words in document } d} $$

👉 Measures local importance
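As a minimal sketch in plain Python (the sentence below is just a toy example, not from any dataset):

# TF: how often each word occurs, divided by the document length
doc = "messi scores goals goals"
words = doc.split()
tf = {w: words.count(w) / len(words) for w in set(words)}
print(tf)   # goals -> 0.5, messi -> 0.25, scores -> 0.25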


2๏ธโƒฃ Document Frequency (DF)

How many documents contain the word.

$$ \text{DF}(w) = \text{number of documents containing } w $$

👉 Measures how common a word is across documents
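A matching sketch, assuming the same three-document toy corpus used in the worked example below:

# DF: in how many documents does each word appear at least once?
docs = ["messi scores goals", "goals win matches", "messi inspires fans"]
df = {}
for d in docs:
    for w in set(d.split()):      # set() so repeats inside one document count once
        df[w] = df.get(w, 0) + 1
print(df)   # messi -> 2, goals -> 2, every other word -> 1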


3๏ธโƒฃ Inverse Document Frequency (IDF)

Penalizes words appearing in many documents.

$$ \text{IDF}(w) = \log\left(\frac{N}{\text{DF}(w)}\right) $$

Where:

  • $N$ = total number of documents

👉 Measures global rarity
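Continuing the sketch above (docs and df as already computed):

import math

# IDF: words that appear in fewer documents get larger weights
N = len(docs)
idf = {w: math.log(N / df[w]) for w in df}
print(idf)   # messi, goals -> log(3/2) ≈ 0.405; all other words -> log(3) ≈ 1.099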


4๏ธโƒฃ TF-IDF (Final Score)

$$ \text{TF-IDF}(w, d) = \text{TF}(w, d) \times \text{IDF}(w) $$

This is the importance score used in:

  • search engines
  • document ranking
  • classical NLP systems
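Putting the two pieces together for one document (a minimal sketch, still reusing the docs and idf from the sketches above):

# TF-IDF: local frequency times global rarity
doc = docs[0]                       # "messi scores goals"
words = doc.split()
tf = {w: words.count(w) / len(words) for w in set(words)}
tfidf = {w: tf[w] * idf[w] for w in tf}
print(tfidf)   # scores ≈ 0.366, messi ≈ 0.135, goals ≈ 0.135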

🧪 Worked Example (Very Important)

Corpus (3 documents)

D1: "messi scores goals"
D2: "goals win matches"
D3: "messi inspires fans"

Step 1: Vocabulary

{messi, scores, goals, win, matches, inspires, fans}

Step 2: Document Frequency (DF)

Word | DF
messi | 2
goals | 2
scores | 1
win | 1
matches | 1
inspires | 1
fans | 1

Step 3: IDF (using the natural log)

$$ \text{IDF}(\text{messi}) = \log\left(\frac{3}{2}\right) $$

$$ \text{IDF}(\text{scores}) = \log\left(\frac{3}{1}\right) $$

👉 "scores" appears in fewer documents, so it has a higher IDF


Step 4: TF-IDF in Document 1

Document 1 contains 3 words:

messi scores goals
Word | TF | IDF | TF-IDF
messi | 1/3 | log(3/2) ≈ 0.405 | ≈ 0.135 (low)
scores | 1/3 | log(3/1) ≈ 1.099 | ≈ 0.366 (HIGH)
goals | 1/3 | log(3/2) ≈ 0.405 | ≈ 0.135 (low)

✅ “scores” becomes the most important word
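To double-check the table above in code (a small sketch reusing the idf dict from the earlier sketches; the loop over all documents is just for verification):

# Verify Step 4 and report the highest-scoring word in each document
for d in docs:
    words = d.split()
    scores = {w: (words.count(w) / len(words)) * idf[w] for w in set(words)}
    top = max(scores, key=scores.get)
    print(d, "->", top, round(scores[top], 3))
# D1 -> scores (0.366); in D2 and D3 the rarer words win in the same way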


🎯 What Interviewers Want You To Say

If asked:

“Explain TF-IDF”

A perfect answer is:

“TF-IDF measures how important a word is to a document relative to the whole corpus.

It boosts words that are frequent in one document but rare across documents.”


🔗 Why TF-IDF Still Matters Today

Even in the age of GPT:

  • Search engines still use it
  • Keyword extraction uses it
  • It builds the intuition behind attention: weighting words by importance
  • It is the conceptual ancestor of embeddings

If you understand TF-IDF deeply, transformers feel natural, not magical.


๐ŸŒ Why This Document Exists

Large Language Models do not begin with transformers.

They begin with a simple question:

“Which words matter?”

TF-IDF is the bridge between:

  • raw text
  • statistics
  • semantic meaning

Every NLP engineer must master this.


⚽ Messi & Meaning – A Short Story

Messi touches the ball less than others.

Yet every touch matters more.

Common words are like defenders:

  • they are everywhere
  • they mean little

Rare words are like Messi:

  • fewer appearances
  • massive impact

That is TF-IDF.


๐Ÿ† Python NLP โ€” TF-IDF Edition (20 Problems)

Difficulty increases gradually.
Do not skip intuition.


🟢 TF-IDF 1 – Tokenization

doc = "messi scores beautiful goals"

Task: Convert sentence into tokens.

✅ Solution
tokens = doc.split()
print(tokens)

🟢 TF-IDF 2 – Term Frequency (TF)

TF(word) = count(word) / total words

doc = "messi scores goals goals"
✅ Solution
words = doc.split()
tf = {}

for w in words:
    tf[w] = tf.get(w, 0) + 1

for w in tf:
    tf[w] /= len(words)

print(tf)

🟢 TF-IDF 3 – Vocabulary from Corpus

docs = [
  "messi scores goals",
  "goals win matches",
  "messi inspires fans"
]

Task: Extract unique vocabulary.

✅ Solution
vocab = set()
for d in docs:
    vocab |= set(d.split())
print(vocab)

🟡 TF-IDF 4 – Document Frequency (DF)

DF(word) = number of documents containing word

✅ Solution
df = {}
for word in vocab:
    df[word] = sum(word in d.split() for d in docs)
print(df)

🟡 TF-IDF 5 – Inverse Document Frequency (IDF)

IDF(word) = log(N / DF)

✅ Solution
import math

N = len(docs)
idf = {w: math.log(N / df[w]) for w in df}
print(idf)

🟡 TF-IDF 6 – TF-IDF for One Document

✅ Solution
doc = docs[0]
words = doc.split()

tf = {}
for w in words:
    tf[w] = tf.get(w, 0) + 1
for w in tf:
    tf[w] /= len(words)

tfidf = {w: tf[w] * idf[w] for w in tf}
print(tfidf)

🔵 TF-IDF 7 – Full TF-IDF Matrix

✅ Solution
matrix = []

for d in docs:
    words = d.split()
    tf = {}
    for w in words:
        tf[w] = tf.get(w, 0) + 1
    for w in tf:
        tf[w] /= len(words)

    vec = {w: tf.get(w, 0) * idf[w] for w in vocab}
    matrix.append(vec)

print(matrix)

🔵 TF-IDF 8 – Important Words in Document

Task: Find top-1 TF-IDF word.

✅ Solution
top = max(tfidf, key=tfidf.get)
print(top)

🔵 TF-IDF 9 – Remove Stopwords

stopwords = {"the","and","is","of"}
✅ Solution
filtered_docs = []
for d in docs:
    filtered_docs.append(
        " ".join(w for w in d.split() if w not in stopwords)
    )
print(filtered_docs)

🔵 TF-IDF 10 – Normalize Vectors (Cosine Prep)

✅ Solution
import math

def norm(vec):
    return math.sqrt(sum(v*v for v in vec.values()))

print(norm(tfidf))

🔴 TF-IDF 11 – Cosine Similarity

a = "messi scores goals"
b = "messi inspires fans"
✅ Solution
def cosine(v1, v2):
    common = set(v1) & set(v2)
    num = sum(v1[w] * v2[w] for w in common)
    den = norm(v1) * norm(v2)
    return num / den if den else 0

# a and b are docs[0] and docs[2], so compare their TF-IDF vectors from the matrix
print(cosine(matrix[0], matrix[2]))

🔴 TF-IDF 12 – Find Most Similar Document

✅ Solution
query = matrix[0]
sims = [cosine(query, m) for m in matrix]
print(sims)

🔴 TF-IDF 13 – Keyword Extraction

Task: Return words with TF-IDF > threshold.

✅ Solution
keywords = [w for w,v in tfidf.items() if v > 0.2]
print(keywords)

🔴 TF-IDF 14 – Streaming TF Update

✅ Solution
stream = ["messi scores", "scores again"]

tf = {}
total = 0

for s in stream:
    for w in s.split():
        tf[w] = tf.get(w, 0) + 1
        total += 1

for w in tf:
    tf[w] /= total

print(tf)

🔴 TF-IDF 15 – Big Data MapReduce TF

✅ Solution
# Map step: emit a (word, 1) pair for every token in the corpus
mapped = []
for d in docs:
    for w in d.split():
        mapped.append((w, 1))

# Reduce step: sum the pairs per word (the raw counts behind TF)
reduced = {}
for w, v in mapped:
    reduced[w] = reduced.get(w, 0) + v

print(reduced)

🔴 TF-IDF 16 – Rare Word Detection

Task: Words with highest IDF.

✅ Solution
print(sorted(idf, key=idf.get, reverse=True))

🔴 TF-IDF 17 – Explainability Test

Question: Why does "messi" sometimes get low TF-IDF?

✅ Answer
Because it appears in many documents,
its IDF is low, reducing its importance.

🔴 TF-IDF 18 – sklearn Comparison (Industry Standard)

✅ Solution
from sklearn.feature_extraction.text import TfidfVectorizer

# Note: sklearn's defaults differ from the manual version above
# (smoothed IDF and L2-normalized rows), so the numbers will not
# match exactly, but the relative ranking of words is similar.
vec = TfidfVectorizer()
X = vec.fit_transform(docs)

print(vec.get_feature_names_out())
print(X.toarray())

🔴 TF-IDF 19 – Interview Question

Why does TF-IDF fail to capture semantics?

✅ Answer
It ignores word order, context, and meaning.
Synonyms are treated as unrelated.

🔴 TF-IDF 20 – Bridge to Transformers

Question: What replaced TF-IDF?

✅ Answer
Word embeddings → contextual embeddings → transformers

🎯 Final Message

TF-IDF teaches:

  • importance
  • rarity
  • signal vs noise

If you understand this page:

  • you understand why attention exists
  • you understand why embeddings work

You are ready for real NLP.

