LC-TEST-19 — Basic NLP (V3)
🌍 Why This Document Exists
This is not a memorization guide.
This is how real NLP engineers think:
- before transformers
- before embeddings
- before GPUs
Every great NLP system starts with text, counts, structure, and meaning.
⚽ A Real-World NLP Story — Messi Beyond Football
Lionel Messi is often described as a footballer,
but to millions of people, he represents discipline, creativity, and persistence.
From Rosario to Barcelona, from criticism to championships,
Messi’s story is not just about goals.
It is about patterns:
- repeated effort
- rare moments of brilliance
- quiet consistency
In language, meaning works the same way.
Common words appear everywhere.
Rare words define intent.
This document treats Messi’s story as data —
and trains you to extract meaning like an NLP engineer.
📄 The Corpus (Input Text)
Lionel Messi is one of the greatest football players in history.
Messi began his career in Argentina before joining Barcelona.
At Barcelona, Messi scored goals, created chances, and inspired fans.
Many people believe Messi changed the way football is played.
Despite fame, Messi remained disciplined and focused on the game.
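The solutions below all read from a variable named `text`. A minimal setup sketch, assuming the five corpus sentences are joined with single spaces into one string (the joining is an assumption; only the variable name comes from the solutions):

```python
# The corpus as one string; adjacent string literals are concatenated by Python.
text = (
    "Lionel Messi is one of the greatest football players in history. "
    "Messi began his career in Argentina before joining Barcelona. "
    "At Barcelona, Messi scored goals, created chances, and inspired fans. "
    "Many people believe Messi changed the way football is played. "
    "Despite fame, Messi remained disciplined and focused on the game."
)

print(len(text.split()))  # 50 whitespace-separated tokens
```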
🏆 NLP Interview — 20 Problems
Difficulty increases gradually. Try to think before coding.
🟢 NLP 1 — Sentence Count
Task: How many sentences are in the corpus?
✅ Solution
sentences = [s for s in text.split(".") if s.strip()]
print(len(sentences))
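Splitting on "." works for this corpus because every sentence ends with a full stop. A slightly more general sketch, assuming sentences may also end in "!" or "?" and are separated by whitespace, could use a regular expression. It still breaks on abbreviations such as "Dr.", so treat it as a heuristic rather than a real sentence segmenter.

```python
import re

text = (
    "Lionel Messi is one of the greatest football players in history. "
    "Messi began his career in Argentina before joining Barcelona. "
    "At Barcelona, Messi scored goals, created chances, and inspired fans. "
    "Many people believe Messi changed the way football is played. "
    "Despite fame, Messi remained disciplined and focused on the game."
)

# Split on whitespace that follows ., ! or ?; the punctuation stays with its sentence.
sentences = [s for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
print(len(sentences))  # 5
```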
🟢 NLP 2 — Word Tokenization
Task: Convert the corpus into a list of words (lowercased).
✅ Solution
words = text.lower().replace(".", "").replace(",", "").split()
print(words)
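Chained `replace` calls need one call per punctuation mark. A common alternative, sketched here under the assumption that a "word" is any run of ASCII letters, is to extract tokens with a regular expression so that all punctuation is dropped at once:

```python
import re

text = (
    "Lionel Messi is one of the greatest football players in history. "
    "Messi began his career in Argentina before joining Barcelona. "
    "At Barcelona, Messi scored goals, created chances, and inspired fans. "
    "Many people believe Messi changed the way football is played. "
    "Despite fame, Messi remained disciplined and focused on the game."
)

# Runs of letters only, so "barcelona," and "barcelona." both become "barcelona".
words = re.findall(r"[a-z]+", text.lower())
print(len(words))  # 50
print(words[:3])   # ['lionel', 'messi', 'is']
```

The pattern would split contractions such as "don't" into two tokens; the corpus has none, so that case is ignored here.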
🟢 NLP 3 — Vocabulary Size
Task: How many unique words exist?
✅ Solution
print(len(set(words)))
🟡 NLP 4 — Word Frequency Count
Task: Count how many times each word appears.
✅ Solution
freq = {}
for w in words:
    freq[w] = freq.get(w, 0) + 1
print(freq)
🟡 NLP 5 — Most Frequent Word
Task: Find the most common word.
✅ Solution
print(max(freq, key=freq.get))
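The manual dictionary loop and the `max` call above can also be written with `collections.Counter` from the standard library. This is an equivalent sketch, not a replacement for understanding the loop; the regex tokenizer is an assumption used to keep the block self-contained.

```python
import re
from collections import Counter

text = (
    "Lionel Messi is one of the greatest football players in history. "
    "Messi began his career in Argentina before joining Barcelona. "
    "At Barcelona, Messi scored goals, created chances, and inspired fans. "
    "Many people believe Messi changed the way football is played. "
    "Despite fame, Messi remained disciplined and focused on the game."
)

words = re.findall(r"[a-z]+", text.lower())

# Counter is a dict subclass that counts hashable items.
freq = Counter(words)

# most_common(n) returns the n highest counts as (word, count) pairs.
print(freq.most_common(2))  # [('messi', 5), ('the', 3)]
```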
🟡 NLP 6 — Stopword Removal
Task: Remove these stopwords: {"the", "is", "and", "of", "to", "in"}
✅ Solution
stop = {"the","is","and","of","to","in"}
filtered = [w for w in words if w not in stop]
print(filtered)
🟡 NLP 7 — Capitalized Word Detection
Task: Detect words that originally started with capital letters.
✅ Solution
caps = [w.strip(".,") for w in text.split() if w[0].isupper()]
print(caps)
🔵 NLP 8 — Average Sentence Length
Task: Compute average words per sentence.
✅ Solution
lengths = [len(s.split()) for s in sentences]
print(sum(lengths) / len(lengths))
🔵 NLP 9 — Bigram Extraction
Task: Extract all word bigrams.
✅ Solution
bigrams = [(words[i], words[i+1]) for i in range(len(words)-1)]
print(bigrams)
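Bigrams built from the flat word list also pair the last word of one sentence with the first word of the next, for example ("history", "messi"). A sketch of a per-sentence variant, assuming bigrams should not cross sentence boundaries:

```python
import re

text = (
    "Lionel Messi is one of the greatest football players in history. "
    "Messi began his career in Argentina before joining Barcelona. "
    "At Barcelona, Messi scored goals, created chances, and inspired fans. "
    "Many people believe Messi changed the way football is played. "
    "Despite fame, Messi remained disciplined and focused on the game."
)

bigrams = []
for sentence in text.split("."):
    tokens = re.findall(r"[a-z]+", sentence.lower())
    # zip pairs each token with its successor; the last token has none.
    bigrams.extend(zip(tokens, tokens[1:]))

print(len(bigrams))  # 45
print(bigrams[:2])   # [('lionel', 'messi'), ('messi', 'is')]
```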
🔵 NLP 10 — Keyword Search
Task: Check if the word "discipline" exists.
✅ Solution
print("discipline" in words)
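Exact membership is strict: the corpus contains "disciplined", not "discipline", so the check above prints False. A crude sketch of a looser search, assuming a shared prefix is a good enough stand-in for stemming (it is not a real stemmer, and the prefix "disciplin" is chosen by hand):

```python
import re

text = (
    "Lionel Messi is one of the greatest football players in history. "
    "Messi began his career in Argentina before joining Barcelona. "
    "At Barcelona, Messi scored goals, created chances, and inspired fans. "
    "Many people believe Messi changed the way football is played. "
    "Despite fame, Messi remained disciplined and focused on the game."
)

words = re.findall(r"[a-z]+", text.lower())

# Exact token match: only the identical string counts.
exact = "discipline" in words

# Prefix match: catches inflected forms that share the hand-picked prefix.
prefix_hits = [w for w in words if w.startswith("disciplin")]

print(exact)        # False
print(prefix_hits)  # ['disciplined']
```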
🔵 NLP 11 — Named Entity Heuristic
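Rule: Treat every word that starts with a capital letter as an entity. Sentence-initial words such as "At" and "Many" are false positives of this heuristic.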
✅ Solution
entities = list(set(caps))
print(entities)
🔴 NLP 12 — Document Frequency (DF)
Task: How many sentences contain the word "messi"?
✅ Solution
print(sum("messi" in s.lower() for s in sentences))
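The substring test above would also match longer words that merely contain "messi". A sketch of a token-level variant that computes document frequency for every word at once, assuming each sentence counts as one document:

```python
import re
from collections import Counter

text = (
    "Lionel Messi is one of the greatest football players in history. "
    "Messi began his career in Argentina before joining Barcelona. "
    "At Barcelona, Messi scored goals, created chances, and inspired fans. "
    "Many people believe Messi changed the way football is played. "
    "Despite fame, Messi remained disciplined and focused on the game."
)

sentences = [s for s in text.split(".") if s.strip()]

df = Counter()
for s in sentences:
    # set() so a word counts once per sentence, however often it repeats there.
    df.update(set(re.findall(r"[a-z]+", s.lower())))

print(df["messi"], df["barcelona"], df["disciplined"])  # 5 2 1
```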
🔴 NLP 13 — TF Calculation
Task: Compute TF for "messi" in the whole corpus.
✅ Solution
tf_messi = freq["messi"] / len(words)
print(tf_messi)
🔴 NLP 14 — Rare Word Detection
Task: Find words appearing only once.
✅ Solution
rare = [w for w,c in freq.items() if c == 1]
print(rare)
🔴 NLP 15 — Sentence Similarity (Jaccard)
Task: Compute the Jaccard similarity of sentences 1 and 3.
✅ Solution
a = set(sentences[0].lower().replace(",", "").split())
b = set(sentences[2].lower().replace(",", "").split())
print(len(a & b) / len(a | b))
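Jaccard similarity is the size of the intersection divided by the size of the union of the two token sets. A reusable sketch, assuming letter-only tokens and defining the similarity of two empty sentences as 0.0 (a convention chosen here, not taken from the text); the helper names `tokens` and `jaccard` are illustrative:

```python
import re

def tokens(sentence):
    """Lowercased set of letter-only tokens."""
    return set(re.findall(r"[a-z]+", sentence.lower()))

def jaccard(a, b):
    """|A & B| / |A | B|, with 0.0 when both sets are empty."""
    ta, tb = tokens(a), tokens(b)
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0

s1 = "Lionel Messi is one of the greatest football players in history"
s3 = "At Barcelona, Messi scored goals, created chances, and inspired fans"
s4 = "Many people believe Messi changed the way football is played"

print(jaccard(s1, s3))  # 0.05  (1 shared word out of 20 distinct)
print(jaccard(s1, s4))  # 4 shared words out of 17 distinct
```

Sentences 1 and 3 share only "messi", while sentences 1 and 4 also share "is", "the", and "football", which is why the second score is higher.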
🔴 NLP 16 — Duplicate Sentence Detection
Task: Check if any sentence is duplicated.
✅ Solution
seen = set()
dup = False
for s in sentences:
    # Normalize so leading spaces and case do not hide a duplicate.
    key = s.strip().lower()
    if key in seen:
        dup = True
    seen.add(key)
print(dup)
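Whether two sentences count as duplicates depends on how they are normalized first. A sketch that wraps the check in a function and compares sentences by their letter-only tokens, assuming case, spacing, and punctuation should be ignored (the function name `has_duplicate` is illustrative):

```python
import re

def has_duplicate(sentences):
    """Return True as soon as a normalized sentence repeats."""
    seen = set()
    for s in sentences:
        # Canonical key: lowercased tokens joined by single spaces.
        key = " ".join(re.findall(r"[a-z]+", s.lower()))
        if key in seen:
            return True
        seen.add(key)
    return False

print(has_duplicate(["Messi scored goals", "messi scored  goals."]))   # True
print(has_duplicate(["Messi scored goals", "Messi created chances"]))  # False
```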
🔴 NLP 17 — Streaming Word Count (Big Data)
Task: Messages arrive line by line. Count words incrementally, one line at a time.
✅ Solution
counts = {}
for s in sentences:
    for w in s.lower().replace(",", "").split():
        counts[w] = counts.get(w, 0) + 1
print(counts)
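In a real stream the lines are not available as a list up front. A sketch that accepts any iterable of lines (a file handle, a socket wrapper, a generator) and only keeps the running counts in memory; `io.StringIO` stands in for the message source here, and the two sample messages are made up for illustration:

```python
import io
import re
from collections import Counter

def count_stream(lines):
    """Consume lines one at a time and return running word counts."""
    counts = Counter()
    for line in lines:
        # Only the current line is tokenized; earlier lines are already folded in.
        counts.update(re.findall(r"[a-z]+", line.lower()))
    return counts

# Hypothetical message source: a file-like object yields one line per iteration.
stream = io.StringIO("Messi scored goals\nMessi created chances\n")
counts = count_stream(stream)
print(counts["messi"])  # 2
```

Memory grows with the vocabulary size, not with the number of messages, which is the property that matters for large inputs.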
🔴 NLP 18 — MapReduce Mindset
Map: (word, 1)
Reduce: sum
✅ Solution
mapped = [(w, 1) for w in words]
reduced = {}
for w, v in mapped:
    reduced[w] = reduced.get(w, 0) + v
print(reduced)
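MapReduce frameworks usually place a shuffle step between map and reduce that groups values by key. A single-process sketch of the three phases, assuming whitespace tokenization and two made-up input lines; the function names are illustrative and do not belong to any framework API:

```python
from collections import defaultdict

def map_phase(line):
    """Emit a (word, 1) pair for every word in one input line."""
    return [(w, 1) for w in line.lower().split()]

def shuffle(pairs):
    """Group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Sum the grouped values for each key."""
    return {key: sum(values) for key, values in groups.items()}

lines = ["messi scored goals", "messi created chances"]
pairs = [pair for line in lines for pair in map_phase(line)]
result = reduce_phase(shuffle(pairs))
print(result["messi"])  # 2
```

Because `map_phase` looks at one line and `reduce_phase` at one key, each phase could run on separate machines without sharing state.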
🔴 NLP 19 — Explainability Question
Question: Why is "messi" less informative than "disciplined"?
✅ Answer
Because "messi" appears in all five sentences, so it cannot tell one sentence from another,
while "disciplined" appears in only one sentence and therefore carries more information.
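Inverse document frequency puts a number on this intuition. A sketch, assuming each sentence is one document and using the unsmoothed form idf(w) = log(N / df(w)); libraries often use smoothed variants, so their values will differ:

```python
import math
import re

text = (
    "Lionel Messi is one of the greatest football players in history. "
    "Messi began his career in Argentina before joining Barcelona. "
    "At Barcelona, Messi scored goals, created chances, and inspired fans. "
    "Many people believe Messi changed the way football is played. "
    "Despite fame, Messi remained disciplined and focused on the game."
)

sentences = [s for s in text.split(".") if s.strip()]
docs = [set(re.findall(r"[a-z]+", s.lower())) for s in sentences]

def idf(word):
    """Unsmoothed IDF; assumes the word occurs in at least one document."""
    df = sum(word in d for d in docs)
    return math.log(len(docs) / df)

print(idf("messi"))                  # 0.0, since log(5 / 5) = 0
print(round(idf("disciplined"), 3))  # 1.609, since log(5 / 1) = log 5
```

A word that appears in every document gets weight zero, so any TF-IDF score for "messi" in this corpus is also zero.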
🔴 NLP 20 — Bridge to Modern NLP
Question: Why does TF-IDF fail for deep semantics?
✅ Answer
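TF-IDF treats a document as a bag of independent words, so it ignores word order and context.
Each word is its own dimension, so synonyms look unrelated and meaning beyond exact word overlap is not captured.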
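Both limitations can be seen with plain bag-of-words counts, which are what TF-IDF reweights. A small sketch using made-up sentences (they are illustrations, not part of the corpus):

```python
import re
from collections import Counter

def bag(sentence):
    """Bag of words: token counts with order discarded."""
    return Counter(re.findall(r"[a-z]+", sentence.lower()))

# Word order: opposite meanings, identical bags.
a = bag("Messi inspired fans")
b = bag("Fans inspired Messi")
print(a == b)  # True

# Synonyms: "great" and "excellent" share no token, so they add no overlap.
c = bag("Messi is a great footballer")
d = bag("Messi is an excellent player")
print(sorted(set(c) & set(d)))  # ['is', 'messi']
```

Reweighting these counts with IDF changes how much each word matters, but it cannot restore the order that was discarded or connect two different strings that mean the same thing.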
🎯 Final Message
If you can:
- explain these problems
- code them calmly
- justify your choices
You are interview-ready.
Before LLMs, there was understanding. Before transformers, there was thinking.
This is how real NLP engineers are made.