NLTK | PorterStemmer | SnowballStemmer | 10 min

Stemming

Reduce words to their root form by chopping off suffixes.

Overview

Stemming is a rule-based text normalization technique that strips suffixes from words to reduce them to a common root (stem). For example, 'running' and 'runs' both reduce to 'run'. Results depend on the algorithm: NLTK's Porter stemmer, for instance, leaves 'runner' unchanged. Stemming is fast and simple but often produces non-dictionary words.
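To make the idea of rule-based suffix stripping concrete, here is a minimal toy sketch. It is not the Porter algorithm: the rule list, the minimum stem length, and the name `toy_stem` are all assumptions made for illustration.

```python
# Toy suffix-stripping stemmer: illustrative only, NOT the Porter algorithm.
# SUFFIX_RULES and MIN_STEM_LEN are assumptions chosen for this sketch.
SUFFIX_RULES = [
    ("ing", ""),   # running -> runn (then undoubled to run)
    ("ies", "i"),  # studies -> studi
    ("ed", ""),    # jumped  -> jump
    ("s", ""),     # runs    -> run
]
MIN_STEM_LEN = 3   # refuse to strip if the remaining stem would be too short


def toy_stem(word: str) -> str:
    """Apply the first matching suffix rule and return the resulting stem."""
    word = word.lower()
    for suffix, replacement in SUFFIX_RULES:
        if word.endswith(suffix):
            stem = word[: len(word) - len(suffix)] + replacement
            if len(stem) < MIN_STEM_LEN:
                continue  # rule would over-shorten the word; try the next one
            # Collapse a doubled final consonant ("runn" -> "run"),
            # leaving vowels and l/s/z alone ("pass" stays "pass").
            if stem[-1] == stem[-2] and stem[-1] not in "aeioulsz":
                stem = stem[:-1]
            return stem
    return word  # no rule applied


for w in ["running", "runs", "studies", "jumped", "is"]:
    print(w, "->", toy_stem(w))
```

Real stemmers apply many more rules in ordered phases, with conditions on the shape of the remaining stem, but the control flow is the same: match a suffix, check a condition, rewrite.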

💡 Why It Matters

Stemming reduces vocabulary size, which helps models generalize better on sparse data. However, it is aggressive: 'studies' may stem to 'studi', which is not a real word. For precision-critical tasks, lemmatization is preferred.
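The effect on vocabulary size can be checked directly. The sketch below assumes NLTK is installed and uses a small hand-made token list (an assumption for illustration, not data from this module) to compare the number of distinct tokens before and after stemming.

```python
from nltk.stem import PorterStemmer

# Hypothetical sample tokens, chosen so several inflections share a root.
tokens = [
    "study", "studies", "studying", "studied",
    "run", "runs", "running",
    "connect", "connected", "connection",
]

stemmer = PorterStemmer()

raw_vocab = set(tokens)                              # distinct surface forms
stemmed_vocab = {stemmer.stem(t) for t in tokens}    # distinct stems

print(f"raw vocabulary:     {len(raw_vocab)} types")
print(f"stemmed vocabulary: {len(stemmed_vocab)} types")
print(sorted(stemmed_vocab))
```

Note that the stems include non-words such as 'studi'. That is harmless when stems are used only as feature keys, but a problem if the output is shown to users.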

Stemming Algorithms

Porter Stemmer

The most widely used English stemmer, applying a series of suffix-stripping rules in phases.

Snowball Stemmer

An improved version of Porter (its English variant is also known as Porter2) that also supports multiple languages.

Lancaster Stemmer

More aggressive than Porter: faster, but it produces shorter, more heavily truncated stems.

Over-stemming

When two words with different meanings are incorrectly reduced to the same stem, causing loss of meaning. For example, Porter reduces both 'universe' and 'university' to 'univers'.
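One way to spot over-stemming is to group words by their stem and look at groups that contain unrelated words. The sketch below assumes NLTK is installed. The helper name `stem_collisions` and the word list are hypothetical, and it accepts any object with a `.stem()` method, so the same check can be run against the Porter and Lancaster stemmers described above.

```python
from collections import defaultdict

from nltk.stem import LancasterStemmer, PorterStemmer


def stem_collisions(words, stemmer):
    """Return {stem: [words]} for every stem shared by more than one word."""
    groups = defaultdict(list)
    for word in words:
        groups[stemmer.stem(word)].append(word)
    return {stem: ws for stem, ws in groups.items() if len(ws) > 1}


# Hypothetical sample: related and unrelated words mixed together.
words = ["universe", "university", "universal", "organ", "organization"]

for name, stemmer in [("Porter", PorterStemmer()), ("Lancaster", LancasterStemmer())]:
    print(name, stem_collisions(words, stemmer))
```

A collision is not automatically an error: merging 'runs' and 'running' is the whole point. The groups still need a human or task-specific judgment about whether the merged words really mean the same thing.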

🛠 Library Note

This module uses NLTK's `PorterStemmer` and `SnowballStemmer` classes.
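A minimal usage sketch of both classes, assuming NLTK is installed (neither stemmer needs extra corpus downloads in this configuration). The sample words are arbitrary; run the snippet to see where the two stemmers agree and where they differ.

```python
from nltk.stem import PorterStemmer, SnowballStemmer

porter = PorterStemmer()
snowball = SnowballStemmer("english")  # the language name is required

# Arbitrary sample words for a side-by-side comparison.
for word in ["running", "studies", "fairly", "generously"]:
    print(f"{word:12} porter={porter.stem(word):10} snowball={snowball.stem(word)}")

# Snowball ships stemmers for several languages; Porter is English-only.
print(SnowballStemmer.languages)
```

Both classes expose the same `.stem(word)` method, so swapping one for the other in a preprocessing pipeline is usually a one-line change.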

External Documentation

Text Preprocessing with NLTK
Stemming vs Lemmatization Guide

What You'll Learn

How stemming reduces vocabulary size
Porter vs Snowball stemmer differences
When stemming hurts more than it helps
