Datasets

Real-world datasets used across the NLP learning path — each chosen for a specific purpose.

IMDB Dataset

IMDB_Dataset.xlsx (XLSX)

A 64,661 KB (~64 MB) dataset of movie reviews labeled for sentiment analysis, used in the Feature Extraction and Text Classification modules.

Size: ~64 MB
Records: 50,000 rows
Source: Stanford AI Lab
Columns / Structure: review (text), sentiment (positive/negative)
Used in: Feature Extraction, Text Classification
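As a rough illustration of how the review/sentiment columns feed the Feature Extraction module, here is a minimal standard-library bag-of-words sketch. The two toy rows are invented stand-ins for the real spreadsheet (loading the actual file would typically use something like pandas' `read_excel`); only the column names `review` and `sentiment` come from the dataset.

```python
from collections import Counter

def bag_of_words(text):
    """Lowercase, strip surrounding punctuation, and count word occurrences."""
    tokens = [w.strip(".,!?\"'()") for w in text.lower().split()]
    return Counter(t for t in tokens if t)

# Two toy rows standing in for the real (review, sentiment) columns.
rows = [
    ("A wonderful, wonderful film!", "positive"),
    ("Dull plot and weak acting.", "negative"),
]

features = [(bag_of_words(review), sentiment) for review, sentiment in rows]
print(features[0][0]["wonderful"])  # -> 2
```

A real pipeline would map these counts into a shared vocabulary-indexed vector before handing them to a classifier.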

Friends Transcript

Friends_Transcript.txt (TXT)

Raw episode transcripts from the TV show Friends (~4.7 MB). Used to train Word2Vec embeddings on conversational, informal English.

Size: ~4.7 MB
Records: Multiple episodes
Source: Custom collected
Columns / Structure: Raw dialogue text
Used in: Word Vectorization (Word2Vec)
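Word2Vec trainers (e.g. gensim) expect an iterable of token lists, so the raw transcript has to be tokenized first. A minimal preprocessing sketch, assuming lines follow a "Speaker: dialogue" format (the sample lines below are invented, not from the file):

```python
import re

def transcript_to_sentences(lines):
    """Convert raw transcript lines into the list-of-token-lists
    format that Word2Vec implementations expect."""
    sentences = []
    for line in lines:
        # Drop a leading "Speaker:" prefix if present (format assumption).
        text = re.sub(r"^[A-Za-z ]+:\s*", "", line)
        tokens = re.findall(r"[a-z']+", text.lower())
        if tokens:
            sentences.append(tokens)
    return sentences

sample = [
    "Joey: How you doin'?",
    "Chandler: Could I BE any more sarcastic?",
]
print(transcript_to_sentences(sample))
```

The resulting token lists can be passed straight to a Word2Vec constructor as its training corpus.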

Game of Thrones Scripts

data_GOT.zip (ZIP)

Character dialogue and scene descriptions from Game of Thrones. Used for NLP experimentation on fantasy/formal prose. Download as ZIP.

Size: Variable
Records: Multiple episodes
Source: Custom collected
Columns / Structure: Character, Dialogue, Season, Episode
Used in: Word Vectorization (Word2Vec)

Quora Analysis

quora_analysis.zip (ZIP)

Quora question dataset used for text analysis and NLP experiments including duplicate question detection and classification. Download as ZIP.

Size: Variable
Records: Variable
Source: Custom collected
Columns / Structure: Question text, labels
Used in: Feature Extraction, Text Classification
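For duplicate question detection, one common baseline (a sketch, not necessarily the approach used in the modules) is Jaccard similarity over word sets. The question pair below is invented for illustration:

```python
import re

def jaccard(q1, q2):
    """Word-overlap similarity |A ∩ B| / |A ∪ B| between two questions."""
    a = set(re.findall(r"[a-z0-9]+", q1.lower()))
    b = set(re.findall(r"[a-z0-9]+", q2.lower()))
    return len(a & b) / len(a | b) if a | b else 0.0

score = jaccard("How do I learn Python?", "How do I learn Python fast?")
print(f"{score:.2f}")  # 5 shared words out of 6 distinct -> 0.83
```

A pair would then be flagged as a likely duplicate when the score crosses a tuned threshold; stronger methods compare learned embeddings instead of raw word sets.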

Slang Reference List

slang.txt (TXT)

Custom slang normalization dictionary used in the Text Preprocessing module to map informal terms to their standard equivalents.

Size: ~2 KB
Records: Custom word mappings
Source: Custom built
Columns / Structure: slang → standard form
Used in: Text Preprocessing
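Slang normalization is a straightforward token-level dictionary lookup. A minimal sketch, assuming a "slang=standard" line format (the two entries below are invented; the real file's delimiter and contents may differ):

```python
# Hypothetical entries standing in for slang.txt.
raw = "u=you\nbrb=be right back"
slang_map = dict(line.split("=", 1) for line in raw.splitlines())

def normalize(text):
    """Replace each known slang token with its standard form."""
    return " ".join(slang_map.get(tok, tok) for tok in text.split())

print(normalize("brb u there"))  # -> "be right back you there"
```

In a preprocessing pipeline this step usually runs after lowercasing and before tokenization for downstream models.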

📦 Note on Large Datasets

The IMDB dataset is ~64 MB, so the download may take a moment. Folder-based datasets (GOT, Quora) are packaged as ZIP files. Everything is also available from the original GitHub repo.