Feature Extraction
Convert text into numerical representations using BoW and TF-IDF.
Overview
Feature extraction transforms text into numerical vectors that machine learning models can process. The two classical approaches are Bag of Words (BoW), which counts word frequencies, and TF-IDF, which weights words by how unique they are across documents.
💡 Why It Matters
ML models cannot work with raw strings — feature extraction is the bridge between text and math. Understanding sparse vector representations (BoW/TF-IDF) builds the intuition needed to appreciate why dense embeddings like Word2Vec are more powerful.
Vectorization Methods
Bag of Words (BoW)
Represents text as a vector of word counts, ignoring order and grammar.
TF-IDF
Term Frequency–Inverse Document Frequency — downweights common words and upweights rare, informative ones.
Document-Term Matrix
A matrix whose rows are documents and whose columns are vocabulary words; each cell holds that word's weight in that document (a raw count for BoW, or a TF-IDF score).
Sparsity
BoW/TF-IDF vectors are mostly zeros since no single document contains all vocabulary words.
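Sparsity is easy to measure directly, since scikit-learn returns a SciPy sparse matrix whose `nnz` attribute counts the nonzero cells (the three sentences below are illustrative):

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "machine learning models need numbers",
    "text must become vectors",
    "sparse matrices save memory",
]
X = CountVectorizer().fit_transform(docs)

total_cells = X.shape[0] * X.shape[1]
sparsity = 1 - X.nnz / total_cells  # fraction of cells that are zero
print(f"{sparsity:.0%} of cells are zero")
```

With no shared words across the three documents, each row is nonzero only in its own columns, so most of the matrix is zeros; on a realistic corpus with tens of thousands of vocabulary words, sparsity typically exceeds 99%.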
🛠 Library Note
This module uses scikit-learn's `CountVectorizer` and `TfidfVectorizer`.
What You'll Learn
- Bag of Words (BoW) model and its limitations
- How TF-IDF addresses BoW's frequency bias
- Building a document-term matrix with scikit-learn