Stemming
Reduce words to their root form by chopping off suffixes.
Overview
Stemming is a rule-based text normalization technique that strips suffixes from words to reduce them to a common root (stem). For example, 'running' and 'runs' both reduce to 'run'. It is fast and simple, but the resulting stems are often not dictionary words.
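To make the idea of rule-based suffix stripping concrete, here is a minimal sketch in plain Python. It is a toy and not the Porter algorithm: the function name `toy_stem`, the suffix list, and the minimum stem length are all assumptions chosen for illustration.

```python
# Toy suffix stripper for illustration only; this is NOT the Porter algorithm.
# The suffix list and the minimum stem length are illustrative assumptions.
SUFFIXES = ("ing", "es", "ed", "s")  # tried in this order
MIN_STEM_LEN = 3


def toy_stem(word: str) -> str:
    """Strip the first matching suffix, keeping at least MIN_STEM_LEN characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix):
            stem = word[: -len(suffix)]
            if len(stem) < MIN_STEM_LEN:
                continue  # stem would be too short; try the next rule
            # Undo a doubled final consonant, e.g. "runn" -> "run".
            if stem[-1] == stem[-2] and stem[-1] not in "aeioulsz":
                stem = stem[:-1]
            return stem
    return word  # no rule applied


words = ["running", "runs", "jumped", "studies"]
print([toy_stem(w) for w in words])  # ['run', 'run', 'jump', 'studi']
```

Even this tiny rule set reproduces the non-dictionary stem 'studi'. Real stemmers apply many more rules, with conditions on the shape of the remaining stem.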
Why It Matters
Stemming reduces vocabulary size, which helps models generalize better on sparse data. However, it is aggressive: 'studies' may stem to 'studi', which is not a real word. For precision-critical tasks, lemmatization is preferred.
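The effect on vocabulary size can be checked with a short sketch. It assumes NLTK is installed; the token list is made up for illustration.

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

# Illustrative tokens: several inflected forms of two base words.
tokens = ["connect", "connected", "connecting", "connection",
          "connections", "study", "studies"]

stems = [stemmer.stem(t) for t in tokens]

vocab_before = set(tokens)
vocab_after = set(stems)

print(len(vocab_before), "->", len(vocab_after))  # 7 -> 2
print(sorted(vocab_after))                        # ['connect', 'studi']
```

Seven surface forms collapse to two stems, and one of them ('studi') is not a dictionary word.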
Stemming Algorithms
Porter Stemmer
The most widely used English stemmer, applying a series of suffix-stripping rules in phases.
Snowball Stemmer
A revised version of the Porter algorithm (its English stemmer is sometimes called Porter2) that also supports languages other than English.
Lancaster Stemmer
More aggressive than Porter: faster, but produces more heavily truncated stems.
Over-stemming
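When two words with different meanings are incorrectly reduced to the same stem, causing a loss of meaning. For example, the Porter stemmer maps both 'universe' and 'university' to 'univers'.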
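The three stemmers can be compared side by side, and over-stemming checked for, with a sketch like the following. It assumes NLTK is installed; the helper `compare` and the word choices are illustrative. Exact Lancaster outputs can differ between rule sets, so none are asserted here.

```python
from nltk.stem import LancasterStemmer, PorterStemmer, SnowballStemmer

stemmers = {
    "porter": PorterStemmer(),
    "snowball": SnowballStemmer("english"),
    "lancaster": LancasterStemmer(),
}


def compare(word: str) -> dict:
    """Return the stem each algorithm produces for one word."""
    return {name: s.stem(word) for name, s in stemmers.items()}


# Side-by-side comparison on a few words.
for word in ["running", "generously", "maximum"]:
    print(word, compare(word))

# Over-stemming check: do two unrelated words collapse to the same stem?
pair = ("universe", "university")
for name, s in stemmers.items():
    a, b = (s.stem(w) for w in pair)
    print(name, a, b, "collide" if a == b else "distinct")
```

On 'generously', Porter yields 'gener' while Snowball keeps 'generous', which is one of the places where Snowball's revised rules are more conservative.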
Library Note
This module uses NLTK's `PorterStemmer` and `SnowballStemmer` classes.
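As a rough usage sketch, both classes expose a `stem` method that takes one word at a time, so text has to be tokenized first. The whitespace tokenization and the sample sentence below are simplifying assumptions; `SnowballStemmer.languages` lists the languages NLTK's Snowball implementation accepts.

```python
from nltk.stem import PorterStemmer, SnowballStemmer

porter = PorterStemmer()
snowball_en = SnowballStemmer("english")  # the language name is required

# Naive whitespace tokenization; a real pipeline would use a proper tokenizer.
sentence = "The runner studies while running"
tokens = sentence.lower().split()

print([porter.stem(t) for t in tokens])
print([snowball_en.stem(t) for t in tokens])

# Snowball supports more than English.
print(SnowballStemmer.languages)
german = SnowballStemmer("german")
print(german.stem("laufen"))
```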
What You'll Learn
- How stemming reduces vocabulary size
- Porter vs Snowball stemmer differences
- When stemming hurts more than it helps