Word Vectorization (Word2Vec)
Learn dense word embeddings that capture semantic meaning and relationships.
Overview
Word2Vec is a neural network-based model that learns dense vector representations (embeddings) for words. Unlike bag-of-words (BoW) vectors, these 100–300 dimensional vectors capture semantic similarity: 'king' and 'queen' are close in vector space because they appear in similar contexts.
💡 Why It Matters
Word2Vec was a landmark breakthrough in NLP. Dense embeddings allow models to generalize across semantically similar words, dramatically improving performance on classification, clustering, and similarity tasks compared to sparse BoW vectors.
Word2Vec Architecture
CBOW (Continuous Bag of Words)
Predicts a target word from its surrounding context words.
Skip-gram
Predicts surrounding context words given a target word. It works better for rare words because each occurrence produces multiple training pairs.
Embedding Dimension
The size of each word's vector (typically 100–300). Higher = more expressive but slower.
Cosine Similarity
Used to measure how semantically close two word vectors are in the embedding space.
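Cosine similarity is just the angle between two vectors, computed from dot products. A self-contained sketch in plain NumPy:

```python
import numpy as np

def cosine_similarity(u, v):
    # cos(theta) = (u . v) / (||u|| * ||v||)
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 4.0, 6.0])  # parallel to a, so similarity is ~1.0
print(cosine_similarity(a, b))
```

Because it ignores vector length, cosine similarity compares direction only, which is why it suits embeddings whose magnitudes vary with word frequency.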
📚 Library Note
This module uses gensim's `Word2Vec` class. The Friends transcript dataset from this repo is used as the training corpus.
What You'll Learn
- Why Word2Vec outperforms sparse BoW/TF-IDF vectors
- CBOW vs Skip-gram architectures
- Training Word2Vec on Friends transcript data
- Word similarity and analogy tasks with gensim