NLTK · PorterStemmer · SnowballStemmer · 10 min

Stemming

Reduce words to their root form by chopping off suffixes.

Overview

Stemming is a rule-based text normalization technique that strips suffixes from words to reduce them to a common root (stem). For example, the Porter stemmer reduces both 'running' and 'runs' to 'run', although it leaves 'runner' unchanged. Stemming is fast and simple, but it often produces stems that are not dictionary words.
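As a minimal sketch of the idea, the snippet below runs NLTK's `PorterStemmer` over a few words. It assumes NLTK is installed; the word list is chosen only for illustration.

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

# Illustrative sample words (not taken from any particular corpus).
words = ["running", "runs", "studies"]

# Map each word to its Porter stem.
stems = {word: stemmer.stem(word) for word in words}

for word, stem in stems.items():
    print(f"{word} -> {stem}")
# running -> run
# runs -> run
# studies -> studi
```

Note that 'studi' is not an English word: the stemmer applies suffix rules and never consults a dictionary.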

💡 Why It Matters

Stemming reduces vocabulary size, which helps models generalize better on sparse data. However, it is aggressive: 'studies' stems to 'studi', which is not a real word. For precision-critical tasks, lemmatization, which maps each word to its dictionary form, is preferred.
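The effect on vocabulary size can be sketched by counting distinct tokens before and after stemming. The token list here is a hypothetical, already tokenized and lowercased sample; a real pipeline would tokenize raw text first.

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

# Hypothetical pre-tokenized sample, used only to illustrate the effect.
tokens = [
    "connect", "connected", "connecting", "connection", "connections",
    "run", "runs", "running",
]

vocab = set(tokens)                                  # distinct surface forms
stemmed_vocab = {stemmer.stem(t) for t in tokens}    # distinct stems

print(f"{len(vocab)} word forms -> {len(stemmed_vocab)} stems")
print(sorted(stemmed_vocab))
# 8 word forms -> 2 stems
# ['connect', 'run']
```

On real text the reduction is far less dramatic than in this hand-picked sample, but the mechanism is the same: inflected variants collapse into one feature.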

Stemming Algorithms

Porter Stemmer

The most widely used English stemmer, applying a series of suffix-stripping rules in phases.

Snowball Stemmer

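A refinement of the Porter algorithm (its English variant is often called Porter2), built on the Snowball framework, which also provides stemmers for many other languages.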
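A quick way to see the difference is to stem the same word with both classes. The sketch below uses 'generously', a word on which the two algorithms disagree, and lists the languages that NLTK's `SnowballStemmer` exposes through its `languages` attribute.

```python
from nltk.stem import PorterStemmer, SnowballStemmer

porter = PorterStemmer()
snowball = SnowballStemmer("english")  # language name is required

word = "generously"
porter_stem = porter.stem(word)
snowball_stem = snowball.stem(word)

print(f"Porter:   {word} -> {porter_stem}")    # gener
print(f"Snowball: {word} -> {snowball_stem}")  # generous

# Languages available to SnowballStemmer in the installed NLTK version.
print(", ".join(SnowballStemmer.languages))
```

Here the Snowball stem happens to be a readable word while the Porter stem is not, but that is not guaranteed in general.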

Lancaster Stemmer

More aggressive than Porter: faster, but produces more truncated stems.
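To get a feel for how much harder Lancaster cuts, the sketch below prints Porter and Lancaster stems side by side. The words are illustrative picks; only the Lancaster results are annotated, and the Porter column is left for you to inspect when you run it.

```python
from nltk.stem import LancasterStemmer, PorterStemmer

porter = PorterStemmer()
lancaster = LancasterStemmer()

# Illustrative words on which Lancaster truncates noticeably.
words = ["maximum", "presumably", "provision"]

# word -> (Porter stem, Lancaster stem)
comparison = {w: (porter.stem(w), lancaster.stem(w)) for w in words}

print(f"{'word':<12}{'Porter':<12}{'Lancaster':<12}")
for w, (p_stem, l_stem) in comparison.items():
    print(f"{w:<12}{p_stem:<12}{l_stem:<12}")
# Lancaster column: maxim, presum, provid
```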

Over-stemming

Occurs when words with different meanings are reduced to the same stem, so the distinction between them is lost.
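A commonly cited case is 'universe' and 'university', which are unrelated in meaning but collide under the Porter stemmer. The check below is a small sketch of how such a collision can be detected; the word pair is chosen for illustration.

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

# Two words with different meanings.
pair = ("universe", "university")
stems = [stemmer.stem(w) for w in pair]

# Over-stemming: distinct words end up with an identical stem.
collide = stems[0] == stems[1]

print(dict(zip(pair, stems)))
print("Same stem:", collide)
# {'universe': 'univers', 'university': 'univers'}
# Same stem: True
```

After stemming, a search index or bag-of-words model can no longer tell these two words apart.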

🛠 Library Note

This module uses NLTK's `PorterStemmer` and `SnowballStemmer` classes.

What You'll Learn

  • How stemming reduces vocabulary size
  • Porter vs Snowball stemmer differences
  • When stemming hurts more than it helps