LC-TEST-17 — Basic NLP (V1)

🌍 Why This Document Exists

This is not a pure NLP library tutorial.

This is algorithmic thinking for NLP engineers.

If you want to work at:

  • Google (Search, Gemini)
  • OpenAI (ChatGPT, alignment)
  • AWS (Comprehend, Bedrock)
  • Microsoft (Copilot, Bing)

You must master strings, text, and scale.


⚽ A Short Story — Messi & Language

Messi does not speak with words on the field.

He speaks with:

  • movement
  • timing
  • patterns

Language is the same.

Before transformers… Before LLMs…

There were:

  • characters
  • tokens
  • frequencies
  • distributions

This book trains your text intuition — not just syntax.


🏆 Python NLP — 20 String Problems

Difficulty increases gradually.
Try before opening solutions.


🟢 NLP 1 — Word Count (Warm-Up)

Messi speaks:

text = "messi plays football and messi inspires the world"

Task: Count how many times each word appears.

✅ Solution
counts = {}
for w in text.split():
    counts[w] = counts.get(w, 0) + 1
print(counts)
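
The same count can be sketched with the standard library's collections.Counter. This is an alternative to the loop above, not a replacement for understanding it; the helper name word_counts is hypothetical.

```python
from collections import Counter

def word_counts(text):
    """Count whitespace-separated words in text."""
    return Counter(text.split())

counts = word_counts("messi plays football and messi inspires the world")
print(counts["messi"])        # how often one word appears
print(counts.most_common(1))  # the single most frequent word with its count
```

Counter is a dict subclass, so counts.get(w, 0) and the other dictionary operations still work on the result.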

🟢 NLP 2 — Unique Vocabulary Size

Task: Count how many unique words are in the text.

✅ Solution
vocab = set(text.split())
print(len(vocab))

🟢 NLP 3 — Capitalization Detection

sentence = "Messi Is The Greatest"

Task: Return all words that start with a capital letter.

✅ Solution
caps = [w for w in sentence.split() if w[0].isupper()]
print(caps)

🟢 NLP 4 — Lowercase Normalization

Task: Convert text to lowercase and remove extra spaces.

text = "  Messi   Plays   Football "
✅ Solution
clean = " ".join(text.lower().split())
print(clean)

🟡 NLP 5 — Character Frequency

Task: Count how often each character appears (ignore spaces).

text = "messi"
✅ Solution
freq = {}
for c in text:
    if c != " ":
        freq[c] = freq.get(c, 0) + 1
print(freq)

🟡 NLP 6 — Find Keywords in Text

keywords = ["goal", "win", "champion"]
text = "messi scores a goal to win the match"

Task: Return keywords that appear in text.

✅ Solution
words = set(text.split())  # compare whole words, so "win" does not match inside "winner"
found = [k for k in keywords if k in words]
print(found)

🟡 NLP 7 — Sentence Length Analyzer

sentences = [
  "Messi plays football",
  "Messi inspires millions of people around the world"
]

Task: Return sentence lengths (in words).

✅ Solution
lengths = [len(s.split()) for s in sentences]
print(lengths)

🟡 NLP 8 — Stopword Removal

stopwords = {"the","and","is"}
text = "messi is the best and the greatest"

Task: Remove the stopwords from the text.

✅ Solution
filtered = [w for w in text.split() if w not in stopwords]
print(" ".join(filtered))

🔵 NLP 9 — Bigram Generation

text = "messi wins world cup"

Task: Generate word bigrams.

✅ Solution
words = text.split()
bigrams = [(words[i], words[i+1]) for i in range(len(words)-1)]
print(bigrams)
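
Bigrams are the n = 2 case of n-grams. A minimal sketch of the general version, assuming the input is already split into words; the function name ngrams is hypothetical.

```python
def ngrams(words, n):
    """Return every run of n consecutive words as a tuple."""
    # zip stops at the shortest slice, so runs never go past the end of the list.
    return list(zip(*(words[i:] for i in range(n))))

words = "messi wins world cup".split()
print(ngrams(words, 2))
print(ngrams(words, 3))
```

If n is larger than the number of words, the result is an empty list.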

🔵 NLP 10 — Most Frequent Word

text = "messi messi goal goal goal win"

Task: Find the word that appears most often.

✅ Solution
freq = {}
for w in text.split():
    freq[w] = freq.get(w, 0) + 1
print(max(freq, key=freq.get))
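
When two words share the top count, max returns only the first one it meets. A small sketch that returns every tied word instead; the function name most_frequent is hypothetical.

```python
def most_frequent(text):
    """Return all words that share the highest count, in first-seen order."""
    freq = {}
    for w in text.split():
        freq[w] = freq.get(w, 0) + 1
    if not freq:
        return []  # empty text has no most frequent word
    top = max(freq.values())
    return [w for w, c in freq.items() if c == top]

print(most_frequent("messi messi goal goal win"))
```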

🔵 NLP 11 — Prefix Matching (Search Autocomplete)

words = ["messi","message","meta","goal"]
prefix = "me"

Task: Return the words that start with the prefix.

✅ Solution
print([w for w in words if w.startswith(prefix)])
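
The list comprehension checks every word. For a large vocabulary that is kept sorted, a binary search can jump to the first candidate, because all words sharing a prefix sit next to each other in sorted order. A sketch under that assumption; the function name prefix_matches is hypothetical.

```python
from bisect import bisect_left

def prefix_matches(sorted_words, prefix):
    """Return the words in an already sorted list that start with prefix."""
    i = bisect_left(sorted_words, prefix)  # index of the first word >= prefix
    matches = []
    while i < len(sorted_words) and sorted_words[i].startswith(prefix):
        matches.append(sorted_words[i])
        i += 1
    return matches

vocab = sorted(["messi", "message", "meta", "goal"])
print(prefix_matches(vocab, "me"))
```

The results come back in alphabetical order rather than in the original list order.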

🔵 NLP 12 — Suffix Detection

Task: Find words ending with "ing".

words = ["playing","played","scoring","score"]
✅ Solution
print([w for w in words if w.endswith("ing")])

🔴 NLP 13 — Streaming Word Count (Big Data Mindset)

Messages arrive one by one.

stream = ["messi scores", "messi wins", "scores again"]

Task: Keep a running word count across all messages.

✅ Solution
counts = {}
for msg in stream:
    for w in msg.split():
        counts[w] = counts.get(w, 0) + 1
print(counts)
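
In a real stream the counts are usually needed while messages are still arriving, not only at the end. A minimal sketch using a generator that yields a snapshot after each message; the function name running_counts is hypothetical.

```python
from collections import Counter

def running_counts(messages):
    """Yield a copy of the word counts after each message is processed."""
    counts = Counter()
    for msg in messages:
        counts.update(msg.split())  # add this message's words to the totals
        yield dict(counts)          # copy, so earlier snapshots stay unchanged

for snapshot in running_counts(["messi scores", "messi wins", "scores again"]):
    print(snapshot)
```

Because messages can be any iterable, the same function works on a list, a file object, or another generator.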

🔴 NLP 14 — MapReduce (Word Count)

Task: Count the words in stream (from NLP 13) in two phases.

Map: emit (word, 1) for every word.
Reduce: sum the values for each word.

✅ Solution
mapped = []
for s in stream:
    for w in s.split():
        mapped.append((w,1))

reduced = {}
for w,v in mapped:
    reduced[w] = reduced.get(w,0) + v

print(reduced)
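
MapReduce frameworks typically group the mapped pairs by key between the two phases, a step often called the shuffle. A single-machine sketch that makes this step explicit; the function names map_phase, shuffle, and reduce_phase are hypothetical and do not belong to any framework.

```python
from collections import defaultdict

def map_phase(line):
    """Emit a (word, 1) pair for every word in one line."""
    return [(w, 1) for w in line.split()]

def shuffle(pairs):
    """Group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Sum the values collected for one key."""
    return key, sum(values)

lines = ["messi scores", "messi wins", "scores again"]
pairs = [pair for line in lines for pair in map_phase(line)]
result = dict(reduce_phase(k, vs) for k, vs in shuffle(pairs).items())
print(result)
```

Each call to map_phase or reduce_phase depends only on its own input, which is what lets a framework run them on different machines.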

🔴 NLP 15 — Longest Word in Corpus

text = "messi demonstrates extraordinary football intelligence"

Task: Find the longest word in the text.

✅ Solution
words = text.split()
print(max(words, key=len))

🔴 NLP 16 — Named Entity Heuristic

Rule: Words starting with capital = entity.

text = "Messi plays for Argentina"
✅ Solution
entities = [w for w in text.split() if w[0].isupper()]
print(entities)

🔴 NLP 17 — Sentence Tokenization

text = "Messi scored. Fans celebrated. History written."

Task: Split the text into sentences at each period.

✅ Solution
sentences = [s.strip() for s in text.split(".") if s.strip()]
print(sentences)
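
Splitting on "." drops the punctuation and ignores "!" and "?". A sketch of a regex-based splitter that keeps the end mark on each sentence; the function name split_sentences is hypothetical, and abbreviations such as "Dr." would still be split incorrectly.

```python
import re

def split_sentences(text):
    """Split after '.', '!' or '?' when followed by whitespace."""
    # The lookbehind keeps the punctuation attached to the sentence it ends.
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]

print(split_sentences("Messi scored! Did fans celebrate? History written."))
```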

🔴 NLP 18 — Text Similarity (Bag of Words)

a = "messi scores goals"
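b = "messi scores"

Task: Score the overlap of the two word sets as shared words divided by total distinct words (Jaccard similarity).

✅ Solution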
sa, sb = set(a.split()), set(b.split())
print(len(sa & sb) / len(sa | sb))
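
The set-based score ignores how often each word appears. A common count-based alternative is cosine similarity over word-count vectors. This sketch is a different measure from the solution above, not a correction of it; the function name cosine_similarity is hypothetical.

```python
import math
from collections import Counter

def cosine_similarity(a, b):
    """Cosine of the angle between the word-count vectors of two texts."""
    ca, cb = Counter(a.split()), Counter(b.split())
    dot = sum(ca[w] * cb[w] for w in ca.keys() & cb.keys())
    norm_a = math.sqrt(sum(v * v for v in ca.values()))
    norm_b = math.sqrt(sum(v * v for v in cb.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty text shares nothing with any other text
    return dot / (norm_a * norm_b)

print(round(cosine_similarity("messi scores goals", "messi scores"), 3))
```

Identical texts score 1.0 and texts with no shared words score 0.0.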

🔴 NLP 19 — Detect Shouting (Uppercase Ratio)

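text = "GOAL GOAL Messi"

Task: Decide whether more than half of the letters are uppercase.

✅ Solution
letters = [c for c in text if c.isalpha()]  # ignore spaces, digits, and punctuation
caps = sum(1 for c in letters if c.isupper())
print(bool(letters) and caps / len(letters) > 0.5)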

🔴 NLP 20 — Real NLP Interview Question

Task: Detect duplicate sentences.

sentences = [
  "messi scores",
  "messi wins",
  "messi scores"
]
✅ Solution
seen = set()
dups = []

for s in sentences:
    if s in seen:
        dups.append(s)
    seen.add(s)

print(dups)
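
In an interview the usual follow-ups are that a sentence seen three times should be reported only once, and that "Messi  scores" should match "messi scores". A sketch that handles both, assuming lowercase plus collapsed whitespace is an acceptable definition of "same sentence"; the function name find_duplicates is hypothetical.

```python
def find_duplicates(sentences):
    """Return each repeated sentence once, ignoring case and extra spaces."""
    seen, reported, dups = set(), set(), []
    for s in sentences:
        key = " ".join(s.lower().split())  # normalized form used for comparison
        if key in seen and key not in reported:
            dups.append(s)                 # keep the text of the first repeat
            reported.add(key)
        seen.add(key)
    return dups

print(find_duplicates(["messi scores", "Messi  scores", "messi wins", "messi scores"]))
```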

🎯 Final Message

Before:

  • BERT
  • GPT
  • Transformers

There were:

  • strings
  • loops
  • counters
  • patterns

If you master this page, you have the string-handling foundations that the rest of NLP builds on.

