Tokenization
Break raw text into meaningful units: words, sentences, or subwords.
Overview
Tokenization is the process of segmenting text into individual units called tokens. A token can be a word, punctuation mark, or sentence. It is the first transformation applied after preprocessing and is required before any further NLP step.
💡 Why It Matters
Every NLP model operates on sequences of tokens, not raw strings. The quality of tokenization directly impacts all downstream tasks: poor tokenization means the model receives malformed input. For example, naive whitespace splitting fails on contractions like "don't" or hyphenated words like "state-of-the-art".
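A quick illustration of the failure mode (the example sentence is invented for demonstration): plain `str.split()` leaves punctuation attached and cannot separate contractions.

```python
text = "I don't think state-of-the-art models are overrated, honestly."
tokens = text.split()
print(tokens)
# "don't" survives as a single token, and "honestly." keeps its trailing
# period, so a downstream step would treat "honestly." and "honestly"
# as two different words.
```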
Types of Tokenization
Word Tokenization
Splits text into individual words and punctuation marks.
Sentence Tokenization
Splits a paragraph into individual sentences using punctuation heuristics.
NLTK word_tokenize
Handles contractions, punctuation, and other edge cases better than Python's built-in split() method.
Subword Tokenization
Advanced methods (BPE, WordPiece) used in transformer models; not covered in this repo, but worth knowing.
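To give a flavor of the idea, here is a toy WordPiece-style splitter using greedy longest-match-first lookup. The tiny hand-written vocabulary is purely illustrative; real BPE/WordPiece vocabularies are learned from a corpus.

```python
# Toy vocabulary; "##" marks a piece that continues a word (WordPiece convention).
VOCAB = {"token", "tok", "##en", "##ization", "##iz", "##ation", "un", "##related"}

def wordpiece(word, vocab=VOCAB):
    """Greedily match the longest known piece at each position."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        while end > start:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece  # continuation pieces get the "##" prefix
            if piece in vocab:
                pieces.append(piece)
                start = end
                break
            end -= 1
        else:
            return ["[UNK]"]  # no piece matched: the whole word is unknown
    return pieces

print(wordpiece("tokenization"))  # ['token', '##ization']
print(wordpiece("unrelated"))     # ['un', '##related']
```

Because rare words decompose into known pieces, subword tokenizers avoid the out-of-vocabulary problem that word-level tokenizers have.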
📚 Library Note
This module uses NLTK's `word_tokenize` and `sent_tokenize` functions, which require the `punkt` corpus to be downloaded.
What You'll Learn
- Difference between word and sentence tokenization
- How whitespace-based splitting fails on edge cases
- NLTK's word_tokenize vs simple split()