
Word Vectorization (Word2Vec)

Learn dense word embeddings that capture semantic meaning and relationships.

Run in Google Colab ↗

Overview

Word2Vec is a neural network-based model that learns dense vector representations (embeddings) for words. Unlike sparse bag-of-words (BoW) vectors, these 100–300 dimensional vectors capture semantic similarity: 'king' and 'queen' end up close in vector space because they appear in similar contexts.

💡 Why It Matters

Word2Vec was a landmark breakthrough in NLP. Dense embeddings allow models to generalize across semantically similar words, dramatically improving performance on classification, clustering, and similarity tasks compared to sparse BoW vectors.

Word2Vec Architecture

CBOW (Continuous Bag of Words)

Predicts a target word from its surrounding context words.

Skip-gram

Predicts the surrounding context words given a target word; tends to work better for rare words.

Embedding Dimension

The size of each word's vector (typically 100–300). Higher dimensions are more expressive but slower to train.

Cosine Similarity

Used to measure how semantically close two word vectors are in the embedding space.
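Cosine similarity is just the cosine of the angle between two vectors, which can be computed directly with NumPy; a minimal sketch:

```python
import numpy as np

def cosine_similarity(u: np.ndarray, v: np.ndarray) -> float:
    """Cosine of the angle between u and v: 1.0 = same direction, 0.0 = orthogonal."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 4.0, 6.0])   # same direction as a
c = np.array([3.0, 0.0, -1.0])  # orthogonal to a (dot product is 0)

print(cosine_similarity(a, b))  # ~1.0
print(cosine_similarity(a, c))  # ~0.0
```

Because it normalizes by vector length, cosine similarity compares direction rather than magnitude, which is why it is the standard choice for comparing word embeddings.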

🛠 Library Note

This module uses gensim's `Word2Vec` class. The Friends transcript dataset from this repo is used as the training corpus.

What You'll Learn

  • Why Word2Vec outperforms sparse BoW/TF-IDF vectors
  • CBOW vs Skip-gram architectures
  • Training Word2Vec on Friends transcript data
  • Word similarity and analogy tasks with gensim