LC-TEST-18 – Basic NLP (V2)
What Is TF-IDF? (Intuition + Math)
TF-IDF stands for:
Term Frequency – Inverse Document Frequency
It answers one fundamental NLP question:
“Which words are important in this document,
compared to the entire collection?”
🧮 The Math (Step by Step)
1️⃣ Term Frequency (TF)
How often a word appears inside one document.
$$ \text{TF}(w, d) = \frac{\text{count of } w \text{ in document } d} {\text{total words in document } d} $$
→ Measures local importance
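For example, in the three-word document "messi scores goals", every word has $\text{TF} = 1/3$; in "goals win goals", $\text{TF}(\text{goals}) = 2/3$.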
2️⃣ Document Frequency (DF)
How many documents contain the word.
$$ \text{DF}(w) = \text{number of documents containing } w $$
→ Measures how common a word is across documents
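For example, in the three-document corpus used below, "messi" appears in two documents, so $\text{DF}(\text{messi}) = 2$.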
3️⃣ Inverse Document Frequency (IDF)
Penalizes words appearing in many documents.
$$ \text{IDF}(w) = \log\left(\frac{N}{\text{DF}(w)}\right) $$
Where:
- $N$ = total number of documents
→ Measures global rarity
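For example, with $N = 3$ and the natural log, $\text{IDF}(\text{messi}) = \ln(3/2) \approx 0.405$, while a word found in all three documents gets $\ln(3/3) = 0$: no discriminative value at all.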
4️⃣ TF-IDF (Final Score)
$$ \text{TF-IDF}(w, d) = \text{TF}(w, d) \times \text{IDF}(w) $$
This is the importance score used in:
- search engines
- document ranking
- classical NLP systems
🧪 Worked Example (Very Important)
Corpus (3 documents)
D1: "messi scores goals"
D2: "goals win matches"
D3: "messi inspires fans"
Step 1: Vocabulary
{messi, scores, goals, win, matches, inspires, fans}
Step 2: Document Frequency (DF)
| Word | DF |
|---|---|
| messi | 2 |
| goals | 2 |
| scores | 1 |
| win | 1 |
| matches | 1 |
| inspires | 1 |
| fans | 1 |
Step 3: IDF (Assume natural log)
$$ \text{IDF}(\text{messi}) = \log\left(\frac{3}{2}\right) $$
$$ \text{IDF}(\text{scores}) = \log\left(\frac{3}{1}\right) $$
๐ "scores" appears in fewer documents
๐ therefore it has higher IDF
Step 4: TF-IDF in Document 1
Document 1 contains 3 words:
messi scores goals
| Word | TF | IDF | TF-IDF |
|---|---|---|---|
| messi | 1/3 | ln(3/2) ≈ 0.405 | ≈ 0.135 |
| scores | 1/3 | ln(3/1) ≈ 1.099 | ≈ 0.366 |
| goals | 1/3 | ln(3/2) ≈ 0.405 | ≈ 0.135 |
→ “scores” becomes the most important word
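You can verify the table with a few lines of plain Python (a minimal sketch of the worked example above):

```python
import math

# Corpus from the worked example
docs = ["messi scores goals", "goals win matches", "messi inspires fans"]
N = len(docs)

# Document frequency for the words of D1
d1 = docs[0].split()
df = {w: sum(w in d.split() for d in docs) for w in d1}

# TF-IDF for D1: each word appears once among 3 words
tfidf = {w: (1 / len(d1)) * math.log(N / df[w]) for w in d1}
print(tfidf)  # {'messi': 0.135..., 'scores': 0.366..., 'goals': 0.135...}
```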
🎯 What Interviewers Want You To Say
If asked:
“Explain TF-IDF”
A perfect answer is:
“TF-IDF measures how important a word is to a document relative to the whole corpus.
It boosts words that are frequent in one document but rare across documents.”
Why TF-IDF Still Matters Today
Even in the age of GPT:
- Search engines still rank with it (BM25 is a direct descendant)
- Keyword extraction still uses it
- It previews the core idea of attention: weighting words by importance
- It is a conceptual ancestor of embeddings
If you understand TF-IDF deeply, transformers feel natural, not magical.
Why This Document Exists
Large Language Models do not begin with transformers.
They begin with a simple question:
“Which words matter?”
TF-IDF is the bridge between:
- raw text
- statistics
- semantic meaning
Every NLP engineer must master this.
⚽ Messi & Meaning – A Short Story
Messi touches the ball less than others.
Yet every touch matters more.
Common words are like defenders:
- they are everywhere
- they mean little
Rare words are like Messi:
- fewer appearances
- massive impact
That is TF-IDF.
Python NLP – TF-IDF Edition (20 Problems)
Difficulty increases gradually.
Do not skip intuition.
🟢 TF-IDF 1 – Tokenization

```python
doc = "messi scores beautiful goals"
```

Task: Convert the sentence into tokens.

✅ Solution

```python
tokens = doc.split()
print(tokens)  # ['messi', 'scores', 'beautiful', 'goals']
```
🟢 TF-IDF 2 – Term Frequency (TF)

TF(word) = count(word) / total words

```python
doc = "messi scores goals goals"
```

✅ Solution

```python
words = doc.split()

# Count raw occurrences
tf = {}
for w in words:
    tf[w] = tf.get(w, 0) + 1

# Convert counts to relative frequencies
for w in tf:
    tf[w] /= len(words)

print(tf)  # {'messi': 0.25, 'scores': 0.25, 'goals': 0.5}
```
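The same computation is shorter and more idiomatic with `collections.Counter` from the standard library:

```python
from collections import Counter

words = doc.split()
counts = Counter(words)  # Counter({'goals': 2, 'messi': 1, 'scores': 1})
tf = {w: c / len(words) for w, c in counts.items()}
print(tf)  # {'messi': 0.25, 'scores': 0.25, 'goals': 0.5}
```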
🟢 TF-IDF 3 – Vocabulary from Corpus

```python
docs = [
    "messi scores goals",
    "goals win matches",
    "messi inspires fans",
]
```

Task: Extract the unique vocabulary.

✅ Solution

```python
vocab = set()
for d in docs:
    vocab |= set(d.split())  # union with this document's tokens
print(vocab)
```
🟡 TF-IDF 4 – Document Frequency (DF)

DF(word) = number of documents containing word

✅ Solution

```python
df = {}
for word in vocab:
    # sum() over booleans counts the documents containing the word
    df[word] = sum(word in d.split() for d in docs)
print(df)
```
🟡 TF-IDF 5 – Inverse Document Frequency (IDF)

IDF(word) = log(N / DF)

✅ Solution

```python
import math

N = len(docs)
idf = {w: math.log(N / df[w]) for w in df}
print(idf)
```
🟡 TF-IDF 6 – TF-IDF for One Document

✅ Solution

```python
doc = docs[0]
words = doc.split()

tf = {}
for w in words:
    tf[w] = tf.get(w, 0) + 1
for w in tf:
    tf[w] /= len(words)

# Local frequency times global rarity
tfidf = {w: tf[w] * idf[w] for w in tf}
print(tfidf)
```
🔵 TF-IDF 7 – Full TF-IDF Matrix

✅ Solution

```python
matrix = []
for d in docs:
    words = d.split()
    tf = {}
    for w in words:
        tf[w] = tf.get(w, 0) + 1
    for w in tf:
        tf[w] /= len(words)
    # One row per document, 0.0 for words the document lacks
    vec = {w: tf.get(w, 0) * idf[w] for w in vocab}
    matrix.append(vec)
print(matrix)
```
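Problems 2, 6, and 7 repeat the same TF loop. In real code you would factor it out; a sketch with a hypothetical `compute_tf` helper:

```python
def compute_tf(doc):
    """Relative term frequency for one whitespace-tokenized document."""
    words = doc.split()
    counts = {}
    for w in words:
        counts[w] = counts.get(w, 0) + 1
    return {w: c / len(words) for w, c in counts.items()}

matrix = []
for d in docs:
    tf = compute_tf(d)
    matrix.append({w: tf.get(w, 0) * idf[w] for w in vocab})
```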
🔵 TF-IDF 8 – Important Words in Document

Task: Find the top-1 TF-IDF word.

✅ Solution

```python
# max() with key= returns the word whose TF-IDF value is largest
top = max(tfidf, key=tfidf.get)
print(top)  # 'scores'
```
🔵 TF-IDF 9 – Remove Stopwords

```python
stopwords = {"the", "and", "is", "of"}
```

✅ Solution

```python
filtered_docs = []
for d in docs:
    filtered_docs.append(
        " ".join(w for w in d.split() if w not in stopwords)
    )
# This toy corpus contains no stopwords, so the output equals docs
print(filtered_docs)
```
🔵 TF-IDF 10 – Normalize Vectors (Cosine Prep)

✅ Solution

```python
import math

def norm(vec):
    """Euclidean (L2) length of a sparse dict vector."""
    return math.sqrt(sum(v * v for v in vec.values()))

# Dividing each component by this length would L2-normalize the vector
print(norm(tfidf))
```
🔴 TF-IDF 11 – Cosine Similarity

```python
a = "messi scores goals"
b = "messi inspires fans"
```

✅ Solution

```python
def cosine(v1, v2):
    # Dot product over shared words only; absent words contribute 0
    common = set(v1) & set(v2)
    num = sum(v1[w] * v2[w] for w in common)
    den = norm(v1) * norm(v2)
    return num / den if den else 0

# a and b are D1 and D3 of the corpus, so reuse their matrix rows
print(cosine(matrix[0], matrix[2]))
```
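For reference, the quantity computed above is

$$ \text{cosine}(v_1, v_2) = \frac{v_1 \cdot v_2}{\lVert v_1 \rVert \, \lVert v_2 \rVert} $$

It is 1 for documents with identical word distributions and 0 for documents that share no words.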
🔴 TF-IDF 12 – Find Most Similar Document

✅ Solution

```python
query = matrix[0]
sims = [cosine(query, m) for m in matrix]
print(sims)  # sims[0] is 1.0: a document is maximally similar to itself
```
🔴 TF-IDF 13 – Keyword Extraction

Task: Return words with TF-IDF > threshold.

✅ Solution

```python
# 0.2 is an arbitrary threshold; tune it per corpus
keywords = [w for w, v in tfidf.items() if v > 0.2]
print(keywords)  # ['scores'] for the Document 1 vector
```
🔴 TF-IDF 14 – Streaming TF Update

✅ Solution

```python
stream = ["messi scores", "scores again"]

tf = {}
total = 0
for s in stream:
    for w in s.split():
        tf[w] = tf.get(w, 0) + 1
        total += 1

# Normalize once the stream has been consumed
for w in tf:
    tf[w] /= total
print(tf)
```
🔴 TF-IDF 15 – Big Data MapReduce TF

✅ Solution

```python
# Map step: emit (word, 1) for every token
mapped = []
for d in docs:
    for w in d.split():
        mapped.append((w, 1))

# Reduce step: sum the counts per word
reduced = {}
for w, v in mapped:
    reduced[w] = reduced.get(w, 0) + v
print(reduced)
```
🔴 TF-IDF 16 – Rare Word Detection

Task: Find the words with the highest IDF.

✅ Solution

```python
# Sort the vocabulary by IDF, rarest words first
print(sorted(idf, key=idf.get, reverse=True))
```
🔴 TF-IDF 17 – Explainability Test

Question:
Why does "messi" sometimes get low TF-IDF?

✅ Answer

Because it appears in many documents (2 of the 3 here), its IDF, $\ln(3/2) \approx 0.405$, is low, which shrinks the TF × IDF product.
🔴 TF-IDF 18 – sklearn Comparison (Industry Standard)

✅ Solution

```python
from sklearn.feature_extraction.text import TfidfVectorizer

vec = TfidfVectorizer()
X = vec.fit_transform(docs)          # sparse document-term matrix
print(vec.get_feature_names_out())   # learned vocabulary
print(X.toarray())
```
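Expect sklearn's numbers to differ from the hand-rolled ones above: by default `TfidfVectorizer` uses a smoothed IDF, $\ln\frac{1 + N}{1 + \text{DF}} + 1$ (`smooth_idf=True`), and L2-normalizes every row (`norm="l2"`).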
🔴 TF-IDF 19 – Interview Question

Why does TF-IDF fail at semantics?

✅ Answer

It ignores word order, context, and meaning.
Synonyms are treated as unrelated words.
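A tiny demonstration of the synonym blind spot, reusing the `cosine` function from TF-IDF 11 (the vectors are hand-built toy values):

```python
# "big car" vs "large automobile": no shared tokens,
# so the dot product, and hence the similarity, is 0
v1 = {"big": 0.5, "car": 0.5}
v2 = {"large": 0.5, "automobile": 0.5}
print(cosine(v1, v2))  # 0.0
```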
🔴 TF-IDF 20 – Bridge to Transformers

Question: What replaced TF-IDF?

✅ Answer

Word embeddings → contextual embeddings → transformers
🎯 Final Message
TF-IDF teaches:
- importance
- rarity
- signal vs noise
If you understand this page:
- you understand why attention exists
- you understand why embeddings work
You are ready for real NLP.