Lemmatization
Reduce words to their valid dictionary base form using linguistic rules.
Overview
Lemmatization reduces words to their base or dictionary form (lemma) using vocabulary and morphological analysis. Unlike stemming, which can leave truncated non-words, lemmatization returns a valid dictionary word: 'better' → 'good', 'ran' → 'run' (given the correct part of speech). It is typically slower than stemming but more accurate.
Why It Matters
Lemmatization is the preferred normalization technique when semantic accuracy matters. For tasks like sentiment analysis or question answering, returning linguistically valid lemmas helps the model retain the correct meaning.
Key Concepts
Lemma
The canonical, dictionary form of a word (e.g., 'run' is the lemma of 'running', 'ran', 'runs').
WordNet
A large lexical database of English, used by NLTK's WordNetLemmatizer to look up base forms.
POS-aware Lemmatization
Lemmatization is more accurate when the part of speech is provided: 'meeting' as a noun keeps the lemma 'meeting', while 'meeting' as a verb reduces to 'meet'.
Stemming vs Lemmatization
Stemming is faster and strips affixes with heuristic rules, so it can produce non-words (e.g., 'studies' → 'studi'); lemmatization is slower but returns linguistically valid forms.
Library Note
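This module uses NLTK's `WordNetLemmatizer`, which requires the `wordnet` data package. POS-aware lemmatization additionally uses NLTK's part-of-speech tagger, which requires the `averaged_perceptron_tagger` resource (recent NLTK releases may name it `averaged_perceptron_tagger_eng`).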
What You'll Learn
- Difference between stemming and lemmatization
- How WordNet is used for linguistically valid lemmas
- When to use lemmatization over stemming