Datasets
Real-world datasets used across the NLP learning path — each chosen for a specific purpose.
IMDB Dataset
IMDB_Dataset.xlsx (64,661 KB): movie reviews labeled for sentiment analysis. Used in the Feature Extraction and Text Classification modules.
Size
~64 MB
Records
50,000 rows
Source
Stanford AI Lab
Columns / Structure
review (text), sentiment (positive/negative)
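A minimal sketch of the review/sentiment structure described above, using a tiny inline sample rather than the real file (with a local download you would load it via `pd.read_excel("IMDB_Dataset.xlsx")`, which requires the `openpyxl` engine). The two-row sample and the 0/1 label encoding are illustrative assumptions, not part of the dataset itself.

```python
import pandas as pd

# Inline stand-in mirroring the dataset's two columns:
# review (text) and sentiment (positive/negative).
df = pd.DataFrame({
    "review": ["A wonderful, moving film.", "Dull plot and wooden acting."],
    "sentiment": ["positive", "negative"],
})

# Encode the string labels as 0/1 so classifiers can consume them.
df["label"] = (df["sentiment"] == "positive").astype(int)
print(df[["sentiment", "label"]])
```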
Friends Transcript
Friends_Transcript.txt (~4.7 MB): raw episode transcripts from the TV show Friends. Used to train Word2Vec embeddings on conversational, informal English.
Size
~4.7 MB
Records
Multiple episodes
Source
Custom collected
Columns / Structure
Raw dialogue text
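Word2Vec trainers expect a list of tokenized sentences rather than raw text, so the transcript needs a preprocessing pass first. A minimal sketch of that step, assuming one dialogue line per transcript line; the sample lines and the `[a-z']+` tokenization rule are illustrative choices, not the module's actual pipeline.

```python
import re

def to_sentences(raw_text):
    """Split raw transcript text into lowercase token lists, the input
    shape expected by word2vec-style trainers (e.g. gensim's Word2Vec)."""
    sentences = []
    for line in raw_text.splitlines():
        # Keep letters and apostrophes so contractions like "doin'" survive.
        tokens = re.findall(r"[a-z']+", line.lower())
        if tokens:
            sentences.append(tokens)
    return sentences

raw = "Joey: How you doin'?\nChandler: Could I BE any more sarcastic?"
print(to_sentences(raw))
```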
Game of Thrones Scripts
data_GOT.zip: character dialogue and scene descriptions from Game of Thrones. Used for NLP experimentation on fantasy/formal prose. Download as a ZIP.
Size
Variable
Records
Multiple episodes
Source
Custom collected
Columns / Structure
Character, Dialogue, Season, Episode
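Since this dataset ships as a ZIP archive, the first step is listing and reading its members with the standard-library `zipfile` module. A minimal sketch, using a tiny in-memory stand-in so it runs without the download; with the real file you would pass `"data_GOT.zip"` to `zipfile.ZipFile`. The member filename here is hypothetical.

```python
import io
import zipfile

# Build a tiny in-memory archive standing in for data_GOT.zip;
# the member name and contents are illustrative only.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("got_dialogue.csv", "Character,Dialogue,Season,Episode\n")

# List the archive's members and read the header line of the first one.
with zipfile.ZipFile(buf) as zf:
    names = zf.namelist()
    header = zf.read(names[0]).decode().strip()

print(names)
print(header)  # the column structure listed above
```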
Quora Analysis
quora_analysis.zip: Quora question dataset used for text analysis and NLP experiments, including duplicate-question detection and classification. Download as a ZIP.
Size
Variable
Records
Variable
Source
Custom collected
Columns / Structure
Question text, labels
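One simple baseline for the duplicate-question task is word-overlap (Jaccard) similarity between two questions. A minimal sketch of that idea; this is a generic baseline, not the method used in the modules, and the example questions are made up.

```python
def jaccard(q1, q2):
    """Word-overlap similarity between two questions, in [0.0, 1.0].
    A crude baseline for flagging likely duplicate questions."""
    a, b = set(q1.lower().split()), set(q2.lower().split())
    return len(a & b) / len(a | b) if a | b else 0.0

score = jaccard("how do i learn python", "how can i learn python")
print(score)  # high overlap suggests a possible duplicate
```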
Slang Reference List
slang.txt: custom slang normalization dictionary used in the Text Preprocessing module to map informal terms to their standard equivalents.
Size
~2 KB
Records
Custom word mappings
Source
Custom built
Columns / Structure
slang → standard form
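A minimal sketch of applying such a dictionary during preprocessing. The exact on-disk format of slang.txt isn't specified here, so this assumes one `slang=standard` pair per line (adjust the separator to match the actual file); the sample entries are illustrative.

```python
# Inline stand-in for slang.txt, assuming "slang=standard" per line.
SAMPLE = "brb=be right back\nidk=i do not know\nu=you"

slang_map = dict(line.split("=", 1) for line in SAMPLE.splitlines())

def normalize(text):
    """Replace each token that appears in the slang map with its
    standard form; leave all other tokens unchanged."""
    return " ".join(slang_map.get(tok, tok) for tok in text.lower().split())

print(normalize("u there"))
```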
📦 Note on Large Datasets
The IMDB dataset is ~64 MB, so the download may take a moment. Folder-based datasets (GOT, Quora) are packaged as ZIP files. Everything is also available from the original GitHub repo.