
Feature Extraction

Convert text into numerical representations using BoW and TF-IDF.


Overview

Feature extraction transforms text into numerical vectors that machine learning models can process. The two classical approaches are Bag of Words (BoW), which counts word frequencies, and TF-IDF, which weights word counts by how rare, and therefore how informative, each word is across the document collection.

💡 Why It Matters

ML models cannot work with raw strings — feature extraction is the bridge between text and math. Understanding sparse vector representations (BoW/TF-IDF) builds the intuition needed to appreciate why dense embeddings like Word2Vec are more powerful.

Vectorization Methods

Bag of Words (BoW)

Represents text as a vector of word counts, ignoring order and grammar.

TF-IDF

Term Frequency–Inverse Document Frequency — downweights common words and upweights rare, informative ones.
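The same corpus through `TfidfVectorizer` shows the reweighting in action (a sketch; scikit-learn's default computes a smoothed idf, ln((1 + n) / (1 + df)) + 1, then L2-normalizes each row):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    "the cat sat on the mat",
    "the dog sat on the log",
]

tfidf = TfidfVectorizer()
X = tfidf.fit_transform(corpus)

# "cat" and "sat" both appear once in the first document, but "sat"
# occurs in every document, so its idf (and final weight) is lower.
row = X.toarray()[0]
vocab = tfidf.vocabulary_
print(row[vocab["cat"]] > row[vocab["sat"]])  # True
```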

Document-Term Matrix

A matrix where rows are documents and columns are vocabulary words; each cell holds that word's weight in that document (a raw count for BoW, a TF-IDF score for TF-IDF).

Sparsity

BoW/TF-IDF vectors are mostly zeros, since any single document contains only a small fraction of the full vocabulary.
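Sparsity can be measured directly from the matrix's stored-nonzero count (the tiny corpus here is illustrative):

```python
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    "machine learning models need numbers",
    "text becomes vectors",
    "vectors feed models",
]
X = CountVectorizer().fit_transform(corpus)

n_cells = X.shape[0] * X.shape[1]   # documents x vocabulary size
sparsity = 1 - X.nnz / n_cells      # nnz = stored non-zero entries
print(f"{X.shape} matrix, {sparsity:.0%} zeros")
# (3, 9) matrix, 59% zeros
```

Even this toy matrix is over half zeros; on a realistic corpus with tens of thousands of vocabulary words, sparsity typically exceeds 99%, which is why scikit-learn returns SciPy sparse matrices rather than dense arrays.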

🛠 Library Note

This module uses scikit-learn's `CountVectorizer` and `TfidfVectorizer`.

What You'll Learn

  • Bag of Words (BoW) model and its limitations
  • How TF-IDF addresses BoW's frequency bias
  • Building a document-term matrix with scikit-learn