LC-TEST-17 — Basic NLP (V1)
🌍 Why This Document Exists
This is not a pure NLP library tutorial.
This is algorithmic thinking for NLP engineers.
If you want to work at:
- Google (Search, Gemini)
- OpenAI (ChatGPT, alignment)
- AWS (Comprehend, Bedrock)
- Microsoft (Copilot, Bing)
You must master strings, text, and scale.
⚽ A Short Story — Messi & Language
Messi does not speak with words on the field.
He speaks with:
- movement
- timing
- patterns
Language is the same.
Before transformers… Before LLMs…
There were:
- characters
- tokens
- frequencies
- distributions
This document trains your text intuition, not just syntax.
🏆 Python NLP — 20 String Problems
Difficulty increases gradually.
Try each problem yourself before reading its solution.
🟢 NLP 1 — Word Count (Warm-Up)
Messi speaks:
text = "messi plays football and messi inspires the world"
Task: Count how many times each word appears.
✅ Solution
counts = {}
for w in text.split():
    counts[w] = counts.get(w, 0) + 1
print(counts)
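The same tally can also be sketched with `collections.Counter` from the standard library, which wraps the dictionary-and-loop pattern. The name `word_counts` is ours, chosen for illustration.

```python
from collections import Counter

text = "messi plays football and messi inspires the world"

# Counter consumes any iterable of hashable items and tallies them.
word_counts = Counter(text.split())

print(word_counts["messi"])  # 2
print(word_counts["goal"])   # 0: a missing word counts as zero instead of raising KeyError
```

A plain dict makes the mechanics visible; Counter is usually the more idiomatic choice once the mechanics are clear.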
🟢 NLP 2 — Unique Vocabulary Size
Task: Count how many unique words are in the text.
✅ Solution
vocab = set(text.split())
print(len(vocab))
🟢 NLP 3 — Capitalization Detection
sentence = "Messi Is The Greatest"
Task: Return all words that start with a capital letter.
✅ Solution
caps = [w for w in sentence.split() if w[0].isupper()]
print(caps)
🟢 NLP 4 — Lowercase Normalization
Task: Convert text to lowercase and remove extra spaces.
text = " Messi Plays Football "
✅ Solution
clean = " ".join(text.lower().split())
print(clean)
🟡 NLP 5 — Character Frequency
Task: Count how often each character appears (ignore spaces).
text = "messi"
✅ Solution
freq = {}
for c in text:
    if c != " ":
        freq[c] = freq.get(c, 0) + 1
print(freq)
🟡 NLP 6 — Find Keywords in Text
keywords = ["goal", "win", "champion"]
text = "messi scores a goal to win the match"
Task: Return keywords that appear in text.
✅ Solution
words = set(text.split())  # compare whole words so "win" does not match "winner"
found = []
for k in keywords:
    if k in words:
        found.append(k)
print(found)
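Whether a keyword is tested against the raw string or against the set of tokens can change the answer. The sketch below uses a hypothetical sentence (not from the problem) picked so the two tests disagree.

```python
keywords = ["goal", "win", "champion"]
text = "the winner kept the goalposts safe"  # illustrative sentence, not from the problem

# Substring test: also matches inside longer words.
substring_hits = [k for k in keywords if k in text]

# Token test: matches whole words only.
tokens = set(text.split())
token_hits = [k for k in keywords if k in tokens]

print(substring_hits)  # ['goal', 'win']
print(token_hits)      # []
```

Neither test is always right: substring matching catches "goals" for "goal", while token matching avoids false hits such as "winner".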
🟡 NLP 7 — Sentence Length Analyzer
sentences = [
    "Messi plays football",
    "Messi inspires millions of people around the world"
]
Task: Return sentence lengths (in words).
✅ Solution
lengths = [len(s.split()) for s in sentences]
print(lengths)
🟡 NLP 8 — Stopword Removal
stopwords = {"the", "and", "is"}
text = "messi is the best and the greatest"
Task: Remove every stopword from the text.
✅ Solution
filtered = [w for w in text.split() if w not in stopwords]
print(" ".join(filtered))
🔵 NLP 9 — Bigram Generation
text = "messi wins world cup"
Task: Generate word bigrams.
✅ Solution
words = text.split()
bigrams = [(words[i], words[i + 1]) for i in range(len(words) - 1)]
print(bigrams)
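Bigrams are the n = 2 case of n-grams. A minimal sketch of the general version follows; the helper name `ngrams` is our own, not a library function.

```python
def ngrams(words, n):
    """Return every run of n consecutive words as a tuple."""
    # zip stops at the shortest slice, so no index arithmetic is needed.
    return list(zip(*(words[i:] for i in range(n))))

words = "messi wins world cup".split()
print(ngrams(words, 2))  # [('messi', 'wins'), ('wins', 'world'), ('world', 'cup')]
print(ngrams(words, 3))  # [('messi', 'wins', 'world'), ('wins', 'world', 'cup')]
```

If n is larger than the number of words, the result is simply an empty list.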
🔵 NLP 10 — Most Frequent Word
text = "messi messi goal goal goal win"
Task: Return the word that appears most often.
✅ Solution
freq = {}
for w in text.split():
    freq[w] = freq.get(w, 0) + 1
print(max(freq, key=freq.get))
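When more than one top word is needed, `Counter.most_common` from the standard library returns the k highest counts in order. A short sketch, assuming k = 2:

```python
from collections import Counter

text = "messi messi goal goal goal win"
freq = Counter(text.split())

# most_common(k) returns the k highest counts as (word, count) pairs.
top_two = freq.most_common(2)
print(top_two)  # [('goal', 3), ('messi', 2)]
```

Words with equal counts come back in the order they were first seen.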
🔵 NLP 11 — Prefix Matching (Search Autocomplete)
words = ["messi", "message", "meta", "goal"]
prefix = "me"
Task: Return all words that start with the prefix.
✅ Solution
print([w for w in words if w.startswith(prefix)])
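Scanning every word is fine for four entries but gets slow for a large vocabulary. One common alternative, sketched here under the assumption that the vocabulary is sorted once up front, is a binary search with the standard `bisect` module. The helper name `prefix_matches` is hypothetical.

```python
from bisect import bisect_left

def prefix_matches(sorted_words, prefix):
    """Return words starting with prefix from an already sorted list."""
    # All words sharing a prefix sit next to each other in sorted order,
    # and bisect_left finds where that run would begin.
    matches = []
    for i in range(bisect_left(sorted_words, prefix), len(sorted_words)):
        if not sorted_words[i].startswith(prefix):
            break  # past the end of the run
        matches.append(sorted_words[i])
    return matches

vocab = sorted(["messi", "message", "meta", "goal"])
print(prefix_matches(vocab, "me"))   # ['message', 'messi', 'meta']
print(prefix_matches(vocab, "mes"))  # ['message', 'messi']
```

The lookup costs a binary search plus the number of matches, instead of a pass over the whole list. A trie is another frequent choice for autocomplete.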
🔵 NLP 12 — Suffix Detection
Task: Find words ending with "ing".
words = ["playing","played","scoring","score"]
✅ Solution
print([w for w in words if w.endswith("ing")])
🔴 NLP 13 — Streaming Word Count (Big Data Mindset)
Messages arrive one by one.
stream = ["messi scores", "messi wins", "scores again"]
✅ Solution
counts = {}
for msg in stream:
    for w in msg.split():
        counts[w] = counts.get(w, 0) + 1
print(counts)
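To make the streaming aspect more explicit, the sketch below replaces the list with a generator and reads the running totals after every message. `message_source` is a hypothetical stand-in for a live feed, not a real API.

```python
from collections import Counter

def message_source():
    """Hypothetical stand-in for a live feed; yields one message at a time."""
    yield "messi scores"
    yield "messi wins"
    yield "scores again"

counts = Counter()
snapshots = []
for msg in message_source():
    counts.update(msg.split())         # fold the new message into the running totals
    snapshots.append(counts["messi"])  # the state can be queried after every message

print(snapshots)     # [1, 2, 2]
print(dict(counts))  # {'messi': 2, 'scores': 2, 'wins': 1, 'again': 1}
```

Only the counts are kept in memory, never the full stream, which is the point of the big data mindset.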
🔴 NLP 14 — MapReduce (Word Count)
Map: emit (word, 1)
Reduce: sum values
✅ Solution
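# Reuses stream from NLP 13.
# Map: emit one (word, 1) pair per word
mapped = []
for s in stream:
    for w in s.split():
        mapped.append((w, 1))
# Reduce: sum the values for each word
reduced = {}
for w, v in mapped:
    reduced[w] = reduced.get(w, 0) + v
print(reduced)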
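Real MapReduce systems place a shuffle step between map and reduce that groups values by key, so each reducer sees one word at a time. A single-process sketch of the three phases follows; the function names are ours, and no actual framework is involved.

```python
from collections import defaultdict

stream = ["messi scores", "messi wins", "scores again"]

def map_phase(record):
    """Map: emit (word, 1) for every word in one record."""
    return [(w, 1) for w in record.split()]

def shuffle_phase(pairs):
    """Shuffle: group all values that share a key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce: sum the values for one key."""
    return key, sum(values)

pairs = [p for record in stream for p in map_phase(record)]
groups = shuffle_phase(pairs)
result = dict(reduce_phase(k, v) for k, v in groups.items())
print(result)  # {'messi': 2, 'scores': 2, 'wins': 1, 'again': 1}
```

Because `map_phase` looks at one record and `reduce_phase` at one key, both could run in parallel across machines.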
🔴 NLP 15 — Longest Word in Corpus
text = "messi demonstrates extraordinary football intelligence"
✅ Solution
words = text.split()
print(max(words, key=len))
🔴 NLP 16 — Named Entity Heuristic
Rule: Treat any word that starts with a capital letter as an entity.
text = "Messi plays for Argentina"
✅ Solution
entities = [w for w in text.split() if w[0].isupper()]
print(entities)
🔴 NLP 17 — Sentence Tokenization
text = "Messi scored. Fans celebrated. History written."
✅ Solution
sentences = [s.strip() for s in text.split(".") if s.strip()]  # skip blank pieces
print(sentences)
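Splitting on "." alone drops the punctuation and ignores "!" and "?". A slightly more general sketch uses the standard `re` module. The example text is ours, and abbreviations such as "Dr." would still be split incorrectly.

```python
import re

text = "Messi scored! Did fans celebrate? History written."

# Split on whitespace that follows ., ! or ?; the lookbehind keeps the punctuation.
sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
print(sentences)  # ['Messi scored!', 'Did fans celebrate?', 'History written.']
```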
🔴 NLP 18 — Text Similarity (Bag of Words)
a = "messi scores goals"
b = "messi scores"
✅ Solution
sa, sb = set(a.split()), set(b.split())
# Jaccard similarity: shared words / all distinct words
print(len(sa & sb) / len(sa | sb))
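The set-based score treats a word that appears five times the same as one that appears once. A count-aware variant, sketched here as cosine similarity over word counts, is one common alternative. The function name `cosine_similarity` is our own.

```python
from collections import Counter
from math import sqrt

def cosine_similarity(a, b):
    """Cosine of the angle between the word-count vectors of a and b."""
    ca, cb = Counter(a.split()), Counter(b.split())
    dot = sum(ca[w] * cb[w] for w in ca)  # words missing from cb contribute 0
    norm_a = sqrt(sum(v * v for v in ca.values()))
    norm_b = sqrt(sum(v * v for v in cb.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # assumption: an empty text is similar to nothing
    return dot / (norm_a * norm_b)

score = cosine_similarity("messi scores goals", "messi scores")
print(round(score, 3))  # 0.816
```

For this pair the set-based ratio is 2/3 and the cosine is 2/√6, so the two measures rank texts similarly but on different scales.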
🔴 NLP 19 — Detect Shouting (Uppercase Ratio)
text = "GOAL GOAL Messi"
✅ Solution
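letters = [c for c in text if c.isalpha()]
caps = sum(1 for c in letters if c.isupper())
# Divide by letters only, so spaces and punctuation do not dilute the ratio
print(bool(letters) and caps / len(letters) > 0.5)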
🔴 NLP 20 — Real NLP Interview Question
Task: Detect duplicate sentences.
sentences = [
    "messi scores",
    "messi wins",
    "messi scores"
]
✅ Solution
seen = set()
dups = []
for s in sentences:
    if s in seen:
        dups.append(s)
    seen.add(s)
print(dups)
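Exact string comparison misses duplicates that differ only in case or spacing. A minimal sketch that normalizes each sentence before the lookup follows; `normalize` is a hypothetical helper, and the sample sentences are ours.

```python
def normalize(sentence):
    """Hypothetical normalizer: lowercase and collapse whitespace."""
    return " ".join(sentence.lower().split())

sentences = ["Messi scores", "messi wins", "  messi   SCORES "]

seen = set()
dups = []
for s in sentences:
    key = normalize(s)
    if key in seen:
        dups.append(s)  # report the original text, not the normalized key
    else:
        seen.add(key)

print(dups)  # ['  messi   SCORES ']
```

In an interview, it is worth asking what counts as a duplicate before choosing the normalization.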
🎯 Final Message
Before:
- BERT
- GPT
- Transformers
There were:
- strings
- loops
- counters
- patterns
If you master this page, you have the string-handling foundations that the rest of NLP builds on.