20 Newsgroup Dataset Analysis

Introduction

The 20 Newsgroups data set is a collection of approximately 20,000 newsgroup documents, partitioned (nearly) evenly across 20 different newsgroups. The collection has become a popular data set for experiments in text applications of machine learning, such as text classification and text clustering.

For more information, see http://qwone.com/~jason/20Newsgroups/

Data

The data source: http://qwone.com/~jason/20Newsgroups/
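
As a minimal sketch of reading the data, assuming the archive from the page above has already been downloaded and extracted into one sub-directory per newsgroup with one file per post, the helper below collects the raw texts and their newsgroup labels. The function name `load_newsgroups` and the `latin-1` decoding are illustrative choices, not taken from the project notebooks.

```python
from pathlib import Path


def load_newsgroups(root):
    """Read every file under root/<newsgroup>/ and return (texts, labels)."""
    texts, labels = [], []
    for group_dir in sorted(Path(root).iterdir()):
        if not group_dir.is_dir():
            continue
        for doc_path in sorted(group_dir.iterdir()):
            if not doc_path.is_file():
                continue
            # latin-1 can decode any byte sequence, so old posts never raise
            texts.append(doc_path.read_text(encoding="latin-1"))
            labels.append(group_dir.name)
    return texts, labels
```

Calling `load_newsgroups("path/to/extracted/archive")` returns two parallel lists; the label of each document is simply the name of the directory it was found in.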

Tools

BoW, TF-IDF, LDA
- Gensim; sklearn
Doc2Vec
- Gensim
Visualization
- t-SNE [1] or PCA
- matplotlib; seaborn; visdom; tensorboard
- Matlab is also powerful
Document clustering
- sklearn.cluster.KMeans
- sklearn.metrics

Project Flow

1. Preprocess the dataset

Clean the data and build the vocabulary
Visualize the statistics of the dataset
Baseline document features
- Bag-of-words; TF-IDF model
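
The preprocessing step can be sketched roughly as follows, using sklearn from the Tools list. The three toy documents, the `clean` helper, and the choice of English stop-word removal are assumptions for illustration; the notebooks may clean the text differently.

```python
import re

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer


def clean(text):
    """Lowercase the text and keep only alphabetic tokens of 2+ letters."""
    return " ".join(re.findall(r"[a-z]{2,}", text.lower()))


# Toy stand-ins for newsgroup posts (hypothetical data)
docs = [
    "The shuttle reached orbit after launch.",
    "The engine of the car needs new oil.",
    "NASA plans another shuttle launch.",
]
cleaned = [clean(d) for d in docs]

# Bag-of-words: raw term counts per document
bow = CountVectorizer(stop_words="english")
X_bow = bow.fit_transform(cleaned)
vocab = sorted(bow.vocabulary_)  # the learned vocabulary, alphabetically

# TF-IDF: counts reweighted so terms shared by many documents weigh less
tfidf = TfidfVectorizer(stop_words="english")
X_tfidf = tfidf.fit_transform(cleaned)

print(len(vocab), X_bow.shape, X_tfidf.shape)
```

Both matrices are sparse, with one row per document and one column per vocabulary term, so they can serve directly as baseline features for the later steps.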

2. Topic Modeling

Train an LDA model with a given number of topics
Visualize different topics
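
A minimal sketch of this step, assuming sklearn's `LatentDirichletAllocation` on bag-of-words counts (Gensim's LDA would be an equally valid choice from the Tools list). The toy documents and the value `n_topics = 2` are placeholders, not settings from the project.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Toy stand-ins for newsgroup posts (hypothetical data)
docs = [
    "shuttle launch orbit nasa space",
    "nasa space station orbit mission",
    "car engine oil brakes dealer",
    "dealer price car engine tires",
]

vectorizer = CountVectorizer(stop_words="english")
X = vectorizer.fit_transform(docs)

n_topics = 2  # the given topic number; chosen arbitrarily for this sketch
lda = LatentDirichletAllocation(n_components=n_topics, random_state=0)
doc_topics = lda.fit_transform(X)  # shape (n_docs, n_topics), rows sum to 1

# Vocabulary terms ordered by their column index in X
terms = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)

# List the highest-weighted terms of each topic as a simple text "visualization"
for k, weights in enumerate(lda.components_):
    top_terms = [terms[i] for i in weights.argsort()[::-1][:3]]
    print(f"topic {k}: {top_terms}")
```

The rows of `doc_topics` are per-document topic distributions, which can also be used as a low-dimensional document representation.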

3. Vector representation of documents

Train a Doc2Vec model
Visualize the word embeddings and document embeddings

4. Comparison between different document representations

Document clustering
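
One way to compare representations is to cluster each feature matrix with K-means and score the clusters against the true newsgroup labels. The sketch below assumes adjusted Rand index and normalized mutual information from `sklearn.metrics` as the scores; the source does not say which metrics were used, and the toy documents and the helper `cluster_scores` are hypothetical.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.metrics import adjusted_rand_score, normalized_mutual_info_score

# Toy stand-ins for newsgroup posts and their group labels (hypothetical data)
docs = [
    "shuttle launch orbit nasa",
    "nasa space station orbit",
    "space shuttle mission launch",
    "car engine oil dealer",
    "dealer price car tires",
    "engine brakes tires car",
]
labels = [0, 0, 0, 1, 1, 1]


def cluster_scores(X, y, n_clusters, seed=0):
    """Cluster X with K-means and compare the clusters to the labels y."""
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed)
    pred = km.fit_predict(X)
    return {
        "ARI": adjusted_rand_score(y, pred),
        "NMI": normalized_mutual_info_score(y, pred),
    }


# Any document representation (LDA topics, Doc2Vec vectors) can be added here
features = {
    "bow": CountVectorizer(stop_words="english").fit_transform(docs),
    "tfidf": TfidfVectorizer(stop_words="english").fit_transform(docs),
}

results = {name: cluster_scores(X, labels, 2) for name, X in features.items()}
for name, scores in results.items():
    print(name, scores)
```

Both scores are invariant to how the cluster ids are numbered, which matters because K-means assigns ids arbitrarily.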

Important Files

FinalProject_codes1.ipynb - Vector Representations
FinalProject_codes2.ipynb - Topic Modeling
Text Classification methods in NLP using Deep Learning.ipynb - Convolutional Neural Network

Topic_modeling_and_clustering_Report.pdf - PDF report that describes the approaches used for document representation and classification in Python and compares them, with t-SNE visualizations of each representation
