Word Vectorization (Word2Vec)
Learn dense word embeddings that capture semantic meaning and relationships.
Overview
Word2Vec is a neural network-based model that learns dense vector representations (embeddings) for words. Unlike bag-of-words (BoW) vectors, these 100–300 dimensional vectors capture semantic similarity: 'king' and 'queen' are close in vector space because they appear in similar contexts.
💡 Why It Matters
Word2Vec was a landmark breakthrough in NLP. Dense embeddings allow models to generalize across semantically similar words, dramatically improving performance on classification, clustering, and similarity tasks compared to sparse BoW vectors.
Word2Vec Architecture
CBOW (Continuous Bag of Words)
Predicts a target word from its surrounding context words.
Skip-gram
Predicts surrounding context words given a target word. It works better for rare words because each occurrence produces multiple training pairs.
Embedding Dimension
The size of each word's vector (typically 100–300). Higher = more expressive but slower.
Cosine Similarity
Used to measure how semantically close two word vectors are in the embedding space.
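Cosine similarity is just the angle between two vectors, computed from dot products. A self-contained sketch in plain NumPy:

```python
import numpy as np

def cosine_similarity(u, v):
    # cos(theta) = (u . v) / (||u|| * ||v||)
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 4.0, 6.0])  # parallel to a, so similarity is ~1.0
print(cosine_similarity(a, b))
```

Because it ignores vector length, cosine similarity compares direction only, which is why it suits embeddings whose magnitudes vary with word frequency.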
📚 Library Note
This module uses gensim's `Word2Vec` class. The Friends transcript dataset from this repo is used as the training corpus.
What You'll Learn
- Why Word2Vec outperforms sparse BoW/TF-IDF vectors
- CBOW vs Skip-gram architectures
- Training Word2Vec on Friends transcript data
- Word similarity and analogy tasks with gensim