Text Classification

Build an end-to-end pipeline to classify text using ML models on the IMDB dataset.

Overview

Text classification is the task of assigning predefined categories to text. In this module, you build an end-to-end pipeline on the IMDB dataset: preprocess the reviews, vectorize them with TF-IDF, train a classifier, and evaluate its performance. The pipeline ties together the techniques from the prior modules.

💡 Why It Matters

Classification is among the most common applied NLP tasks, covering spam detection, sentiment analysis, and topic labeling. This module shows how the earlier steps compound: better preprocessing → better vectors → better classifier performance.

Pipeline Components

Naive Bayes

A probabilistic classifier that assumes features are conditionally independent given the class. It is fast to train and a strong baseline for text.
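
As a rough illustration of the independence assumption, the from-scratch sketch below scores a document by adding one log-probability per word to the class log-prior. The toy reviews, the function names `fit_naive_bayes` and `predict`, and the smoothing value `alpha=1.0` are assumptions made for this example; the module itself uses scikit-learn's `MultinomialNB`.

```python
import math
from collections import Counter, defaultdict

# Toy training data, invented for illustration (not taken from IMDB).
train = [
    ("great fun great cast", "pos"),
    ("loved the great story", "pos"),
    ("boring plot bad acting", "neg"),
    ("bad boring waste", "neg"),
]

def fit_naive_bayes(examples):
    """Count documents per class and word occurrences per class."""
    class_docs = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, label in examples:
        class_docs[label] += 1
        for word in text.split():
            word_counts[label][word] += 1
            vocab.add(word)
    return class_docs, word_counts, vocab

def predict(text, class_docs, word_counts, vocab, alpha=1.0):
    """Return the class with the highest log posterior (up to a constant)."""
    total_docs = sum(class_docs.values())
    scores = {}
    for label in class_docs:
        score = math.log(class_docs[label] / total_docs)  # log prior
        denom = sum(word_counts[label].values()) + alpha * len(vocab)
        for word in text.split():
            if word not in vocab:
                continue  # skip words never seen in training
            # Independence assumption: each word contributes its own
            # smoothed log-likelihood, regardless of the other words.
            score += math.log((word_counts[label][word] + alpha) / denom)
        scores[label] = score
    return max(scores, key=scores.get)

model = fit_naive_bayes(train)
print(predict("great story", *model))
print(predict("boring and bad", *model))
```

Working in log space avoids numerical underflow when many small probabilities are multiplied, and the smoothing term keeps a single unseen word from zeroing out a class.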

Logistic Regression

A linear model that often outperforms Naive Bayes on TF-IDF features with enough data.

scikit-learn Pipeline

Chains the vectorizer and classifier into one object, so training and inference each take a single call on raw text.
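
The sketch below shows one way such a pipeline might be assembled. The step names `"tfidf"` and `"clf"` and the four toy reviews are assumptions for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Step names ("tfidf", "clf") are arbitrary labels chosen for this example.
pipe = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", MultinomialNB()),
])

# Toy data, invented for illustration.
train_texts = [
    "excellent movie with a touching story",
    "touching and excellent performances",
    "awful movie with a boring story",
    "boring and awful performances",
]
train_labels = ["pos", "pos", "neg", "neg"]

# fit() runs the vectorizer's fit_transform, then fits the classifier.
pipe.fit(train_texts, train_labels)

# predict() takes raw strings; the fitted vectorizer is applied automatically.
preds = pipe.predict(["An excellent, touching movie", "Awful and boring"])
print(list(preds))

# Individual steps stay accessible by name.
print(len(pipe.named_steps["tfidf"].vocabulary_))
```

Keeping the vectorizer inside the pipeline means its vocabulary is learned only from the data passed to `fit`, which helps avoid leaking test-set information into training.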

Evaluation Metrics

Accuracy, precision, recall, and F1-score. Precision, recall, and F1 reveal how a classifier behaves on each class, which raw accuracy alone can hide.
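
To make the definitions concrete, here is a small pure-Python sketch that computes the four metrics from confusion counts. The helper name `binary_metrics`, the toy labels, and the convention of returning 0.0 when a denominator is zero are assumptions of this example. The two prediction lists have the same accuracy but very different F1.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Compute accuracy, precision, recall, and F1 for one positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)

    accuracy = (tp + tn) / len(pairs)
    # Convention chosen here: an undefined ratio is reported as 0.0.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

# Imbalanced toy labels: 8 negatives followed by 2 positives.
y_true = [0] * 8 + [1] * 2

always_negative = [0] * 10             # never predicts the positive class
mixed = [0] * 7 + [1] + [1, 0]         # one false positive, one hit, one miss

print(binary_metrics(y_true, always_negative))
print(binary_metrics(y_true, mixed))
```

Both classifiers score 0.8 accuracy, yet the first never finds a positive example (F1 of 0.0) while the second reaches an F1 of 0.5.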

🛠 Library Note

This module uses scikit-learn's `Pipeline`, `TfidfVectorizer`, `MultinomialNB`, and `LogisticRegression` on the `IMDB Dataset.csv` file from this repo.
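
A hedged sketch of loading the CSV into parallel lists of texts and labels, using only the standard library. The column names `review` and `sentiment`, the helper name `load_reviews`, and the file path and encoding in the commented lines are assumptions; check them against the actual file header.

```python
import csv
import io

def load_reviews(fileobj, text_col="review", label_col="sentiment"):
    """Read a CSV with a header row into (texts, labels) lists.

    The default column names are assumptions about the file layout.
    """
    texts, labels = [], []
    for row in csv.DictReader(fileobj):
        texts.append(row[text_col])
        labels.append(row[label_col])
    return texts, labels

# In-memory stand-in for the real file, invented for illustration.
sample = io.StringIO(
    "review,sentiment\n"
    '"Loved it, start to finish.",positive\n'
    '"Dull and far too long.",negative\n'
)
texts, labels = load_reviews(sample)
print(texts)
print(labels)

# With the real file (path and encoding are assumptions):
# with open("IMDB Dataset.csv", newline="", encoding="utf-8") as f:
#     texts, labels = load_reviews(f)
```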

External Documentation

scikit-learn Text Classification Guide
NLTK Classification Docs

What You'll Learn

Building a full preprocessing → vectorization → classification pipeline
Training Naive Bayes and Logistic Regression on IMDB reviews
Evaluating with accuracy, precision, recall, and F1
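
The three goals above can be sketched together as one comparison routine. Everything here is an assumption for illustration: the helper names `clean_review` and `evaluate_models`, the generated toy reviews, the 25% test split, macro averaging, and the idea that raw reviews may contain HTML line breaks worth stripping.

```python
import re

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_recall_fscore_support
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

def clean_review(text):
    """Minimal preprocessing: drop HTML tags such as <br />, then lowercase."""
    return re.sub(r"<[^>]+>", " ", text).lower()

def evaluate_models(texts, labels, test_size=0.25, seed=0):
    """Train Naive Bayes and Logistic Regression pipelines; return metrics."""
    X_train, X_test, y_train, y_test = train_test_split(
        texts, labels, test_size=test_size, random_state=seed, stratify=labels
    )
    models = {
        "naive_bayes": MultinomialNB(),
        "logistic_regression": LogisticRegression(max_iter=1000),
    }
    results = {}
    for name, clf in models.items():
        # Preprocessing -> vectorization -> classification in one object.
        pipe = make_pipeline(TfidfVectorizer(preprocessor=clean_review), clf)
        pipe.fit(X_train, y_train)
        y_pred = pipe.predict(X_test)
        precision, recall, f1, _ = precision_recall_fscore_support(
            y_test, y_pred, average="macro", zero_division=0
        )
        results[name] = {
            "accuracy": accuracy_score(y_test, y_pred),
            "precision": precision,
            "recall": recall,
            "f1": f1,
        }
    return results

# Toy stand-in for the IMDB reviews, generated for illustration only.
pos_words = ["great", "wonderful", "moving", "brilliant"]
neg_words = ["dull", "terrible", "boring", "awful"]
texts = [f"A {w} film<br />with a {w} cast" for w in pos_words + neg_words] * 2
labels = (["positive"] * 4 + ["negative"] * 4) * 2

results = evaluate_models(texts, labels)
for name, scores in results.items():
    print(name, {k: round(float(v), 3) for k, v in scores.items()})
```

Scores on a corpus this small are not meaningful; the point is the shape of the workflow, where the same split and the same metrics are applied to both models so the comparison is fair.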
