LC-TEST-19 — Basic NLP (V3)

🌍 Why This Document Exists

This is not a memorization guide.

This is how real NLP engineers think:

before transformers
before embeddings
before GPUs

Every great NLP system starts with text, counts, structure, and meaning.

⚽ A Real-World NLP Story — Messi Beyond Football

Lionel Messi is often described as a footballer,
but to millions of people, he represents discipline, creativity, and persistence.

From Rosario to Barcelona, from criticism to championships,
Messi’s story is not just about goals.

It is about patterns:

repeated effort
rare moments of brilliance
quiet consistency

In language, meaning works the same way.

Common words appear everywhere.
Rare words define intent.

This document treats Messi’s story as data —
and trains you to extract meaning like an NLP engineer.

📄 The Corpus (Input Text)

Lionel Messi is one of the greatest football players in history.
Messi began his career in Argentina before joining Barcelona.
At Barcelona, Messi scored goals, created chances, and inspired fans.
Many people believe Messi changed the way football is played.
Despite fame, Messi remained disciplined and focused on the game.
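
The solutions below refer to the corpus through a variable named `text`, which the document never defines. A minimal setup, assuming the five sentences above are stored as one string (the variable name is taken from the solutions; the layout as a single string is an assumption):

```python
# Assumption: the whole corpus lives in one string called `text`,
# the name used by the solutions in this document.
text = (
    "Lionel Messi is one of the greatest football players in history. "
    "Messi began his career in Argentina before joining Barcelona. "
    "At Barcelona, Messi scored goals, created chances, and inspired fans. "
    "Many people believe Messi changed the way football is played. "
    "Despite fame, Messi remained disciplined and focused on the game."
)

# Quick sanity check: number of whitespace-separated tokens.
print(len(text.split()))
```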

🏆 NLP Interview — 20 Problems

Difficulty increases gradually. Try to think before coding.

🟢 NLP 1 — Sentence Count

Task: How many sentences are in the corpus?

✅ Solution

sentences = [s for s in text.split(".") if s.strip()]
print(len(sentences))
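
Splitting on "." is enough for this corpus, but it miscounts as soon as a sentence ends in "?" or "!". The sketch below is one possible extension, not the only way to do it: `split_sentences` and `sample` are hypothetical names, and abbreviations such as "Dr." would still be split incorrectly.

```python
import re

def split_sentences(text):
    # Split after ., ! or ? when the mark is followed by whitespace.
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]

sample = "Messi scored. Did the fans cheer? They did!"
sentences = split_sentences(sample)
print(len(sentences))  # 3
```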

🟢 NLP 2 — Word Tokenization

Task: Convert the corpus into a list of words (lowercased).

✅ Solution

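# Remove commas as well as periods, otherwise "barcelona," and "barcelona" count as different words
words = text.lower().replace(".", "").replace(",", "").split()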
print(words)

🟢 NLP 3 — Vocabulary Size

Task: How many unique words exist?

✅ Solution

print(len(set(words)))

🟡 NLP 4 — Word Frequency Count

Task: Count how many times each word appears.

✅ Solution

freq = {}
for w in words:
    freq[w] = freq.get(w, 0) + 1
print(freq)

🟡 NLP 5 — Most Frequent Word

Task: Find the most common word.

✅ Solution

print(max(freq, key=freq.get))
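
The manual dictionary in NLP 4 and the `max` call above can also be expressed with `collections.Counter` from the standard library. A small sketch on a hypothetical word list (the `words` sample here is illustrative, not the full corpus):

```python
from collections import Counter

# Hypothetical sample; in the exercises this would be the tokenized corpus.
words = "messi scored goals and messi inspired fans".split()

freq = Counter(words)                             # word -> count
top_word, top_count = freq.most_common(1)[0]      # most frequent entry
print(top_word, top_count)  # messi 2
```

`most_common(n)` also returns the top n words in one call, which is handy when the interviewer follows up with "and the top three?".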

🟡 NLP 6 — Stopword Removal

Task: Remove the stopwords {"the", "is", "and", "of", "to", "in"} from the word list.

✅ Solution

stop = {"the","is","and","of","to","in"}
filtered = [w for w in words if w not in stop]
print(filtered)

🟡 NLP 7 — Capitalized Word Detection

Task: Detect words that originally started with capital letters.

✅ Solution

# Strip trailing punctuation so "Barcelona." and "Barcelona," both become "Barcelona"
caps = [w.strip(".,") for w in text.split() if w[0].isupper()]
print(caps)

🔵 NLP 8 — Average Sentence Length

Task: Compute average words per sentence.

✅ Solution

lengths = [len(s.split()) for s in sentences]
print(sum(lengths) / len(lengths))

🔵 NLP 9 — Bigram Extraction

Task: Extract all word bigrams.

✅ Solution

bigrams = [(words[i], words[i+1]) for i in range(len(words)-1)]
print(bigrams)
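
The solution above builds bigrams over the flat word list, so it also pairs the last word of one sentence with the first word of the next. If that is not wanted, one option is to build bigrams sentence by sentence. A sketch under that assumption (`sentence_bigrams` and `sample` are hypothetical names):

```python
import re

def sentence_bigrams(text):
    bigrams = []
    for sentence in text.split("."):
        # Keep alphabetic tokens only, lowercased.
        tokens = re.findall(r"[a-z]+", sentence.lower())
        # zip pairs each token with its right-hand neighbour.
        bigrams.extend(zip(tokens, tokens[1:]))
    return bigrams

sample = "Messi scored goals. Fans cheered."
print(sentence_bigrams(sample))
# [('messi', 'scored'), ('scored', 'goals'), ('fans', 'cheered')]
```

Note that the pair ('goals', 'fans') does not appear, because it would cross a sentence boundary.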

🔵 NLP 10 — Keyword Search

Task: Check if the word "discipline" exists.

✅ Solution

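# Prints False: this is an exact token match, and the corpus contains "disciplined", not "discipline"
print("discipline" in words)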

🔵 NLP 11 — Named Entity Heuristic

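Rule: Treat every word that starts with a capital letter as an entity. This also catches sentence-initial words such as "At" and "Many", which are false positives.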

✅ Solution

entities = list(set(caps))
print(entities)
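
One way to reduce the false positives of the capital-letter rule is to ignore the first word of each sentence, since it is capitalized by grammar rather than by meaning. This is a rough sketch, not a real NER method: `mid_sentence_capitals` and `sample` are hypothetical names, and the trade-off is that an entity appearing only at the start of a sentence is missed.

```python
import re

def mid_sentence_capitals(text):
    found = set()
    for sentence in text.split("."):
        tokens = re.findall(r"[A-Za-z]+", sentence)
        # Skip the first token of every sentence.
        found.update(t for t in tokens[1:] if t[0].isupper())
    return found

sample = "Many people admire Messi. Despite fame, Messi stayed in Barcelona."
print(sorted(mid_sentence_capitals(sample)))  # ['Barcelona', 'Messi']
```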

🔴 NLP 12 — Document Frequency (DF)

Task: Treating each sentence as a document, how many sentences contain the word "messi"?

✅ Solution

# Match whole tokens, so a longer word such as "messiah" would not be counted
print(sum("messi" in s.lower().split() for s in sentences))

🔴 NLP 13 — TF Calculation

Task: Compute TF for "messi" in the whole corpus.

✅ Solution

tf_messi = freq["messi"] / len(words)
print(tf_messi)

🔴 NLP 14 — Rare Word Detection

Task: Find words appearing only once.

✅ Solution

rare = [w for w,c in freq.items() if c == 1]
print(rare)

🔴 NLP 15 — Sentence Similarity (Jaccard)

Task: Compute the Jaccard similarity (shared words divided by total distinct words) between sentences 1 and 3.

✅ Solution

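# Drop commas so that "barcelona," and "goals," become clean tokens
a = set(sentences[0].lower().replace(",", "").split())
b = set(sentences[2].lower().replace(",", "").split())
# Only "messi" is shared: 1 common word out of 20 distinct words
print(len(a & b) / len(a | b))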

🔴 NLP 16 — Duplicate Sentence Detection

Task: Check if any sentence is duplicated.

✅ Solution

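seen = set()
dup = False
for s in sentences:
    # Normalize whitespace and case so trivial differences do not hide a duplicate
    key = s.strip().lower()
    if key in seen:
        dup = True
        break
    seen.add(key)
print(dup)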

🔴 NLP 17 — Streaming Word Count (Big Data)

Task: Messages arrive line by line. Keep a running word count without holding the whole corpus in memory.

✅ Solution

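counts = {}
# Each sentence stands in for one incoming message
for s in sentences:
    # Drop commas so "barcelona," is counted as "barcelona"
    for w in s.lower().replace(",", "").split():
        counts[w] = counts.get(w, 0) + 1
print(counts)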
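
To make the streaming aspect explicit, the input can be modeled as an iterator that yields one message at a time. The sketch below assumes a generator as the message source; `count_stream` and `messages` are hypothetical names, and in practice the source could be a file handle or a queue.

```python
import re
from collections import Counter

def count_stream(lines):
    counts = Counter()
    for line in lines:
        # Only the current line is tokenized; earlier lines are not kept.
        counts.update(re.findall(r"[a-z]+", line.lower()))
    return counts

def messages():
    # Hypothetical message source that yields one line at a time.
    yield "Messi scored goals."
    yield "Fans cheered, and Messi smiled."

counts = count_stream(messages())
print(counts["messi"])  # 2
```

Memory use grows with the vocabulary size, not with the number of messages, which is the point of the exercise.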

🔴 NLP 18 — MapReduce Mindset

Task: Map each word to the pair (word, 1), then reduce by summing the values for each word.

✅ Solution

mapped = [(w,1) for w in words]
reduced = {}
for w,v in mapped:
    reduced[w] = reduced.get(w,0) + v
print(reduced)
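
In a MapReduce framework there is a shuffle step between map and reduce that brings all pairs with the same key together. A small single-machine sketch of the three phases, using sorting to imitate the shuffle (the `words` list is a hypothetical sample, and this only illustrates the idea, not any specific framework's API):

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical sample input.
words = ["messi", "football", "messi", "barcelona", "football", "messi"]

# Map: emit (word, 1) for every word.
mapped = [(w, 1) for w in words]

# Shuffle: sort so that equal keys sit next to each other.
shuffled = sorted(mapped, key=itemgetter(0))

# Reduce: sum the values of each key group.
reduced = {
    key: sum(value for _, value in group)
    for key, group in groupby(shuffled, key=itemgetter(0))
}
print(reduced)  # {'barcelona': 1, 'football': 2, 'messi': 3}
```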

🔴 NLP 19 — Explainability Question

Question: Why is "messi" less informative than "disciplined"?

✅ Answer

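Because "messi" appears in every sentence, its presence does not help tell one sentence from another; its inverse document frequency is log(5/5) = 0.
"disciplined" appears in only one sentence, so seeing it narrows down the sentence and carries more information.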
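
The argument can be checked numerically with inverse document frequency, treating each sentence as a document. This is a sketch using the plain definition idf = log(N / df), with no smoothing; the helper name `idf` is hypothetical, and it assumes the term occurs in at least one sentence.

```python
import math
import re

sentences = [
    "Lionel Messi is one of the greatest football players in history.",
    "Messi began his career in Argentina before joining Barcelona.",
    "At Barcelona, Messi scored goals, created chances, and inspired fans.",
    "Many people believe Messi changed the way football is played.",
    "Despite fame, Messi remained disciplined and focused on the game.",
]

# One set of lowercase tokens per sentence.
docs = [set(re.findall(r"[a-z]+", s.lower())) for s in sentences]

def idf(term):
    # df = number of sentences containing the term (assumed to be > 0).
    df = sum(term in d for d in docs)
    return math.log(len(docs) / df)

print(idf("messi"))        # 0.0
print(idf("disciplined"))  # log(5), roughly 1.609
```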

🔴 NLP 20 — Bridge to Modern NLP

Question: Why does TF-IDF fail for deep semantics?

✅ Answer

TF-IDF treats a document as a bag of words, so it ignores word order and context.
Synonyms are treated as unrelated tokens, so two texts with the same meaning but different wording look dissimilar.
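
Both weaknesses are easy to demonstrate with plain word counts, which is the representation TF-IDF builds on. A minimal sketch with made-up example sentences (`bag_of_words` and the four sample strings are hypothetical):

```python
from collections import Counter

def bag_of_words(sentence):
    # Count lowercase tokens; order is discarded.
    return Counter(sentence.lower().split())

# Word order: opposite meanings, identical representation.
a = "Messi inspired fans"
b = "Fans inspired Messi"
print(bag_of_words(a) == bag_of_words(b))  # True

# Wording: similar meanings, almost no overlap.
c = "Messi remained disciplined"
d = "Messi stayed focused"
shared = set(bag_of_words(c)) & set(bag_of_words(d))
print(shared)  # {'messi'}
```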

🎯 Final Message

If you can:

explain these problems
code them calmly
justify your choices

You are interview-ready.

Before LLMs, there was understanding. Before transformers, there was thinking.

This is how real NLP engineers are made.

Last updated on 2026