Evaluation of the accuracy of vectorization and text classification methods

andreytsimbalov/News_Classification_and_Vectorization

News_Classification_and_Vectorization

This project implements a pipeline: news collection -> preprocessing -> vectorization -> classification

Contents:

Data collection

Parser_news_LENTA creates 12 datasets (one for each month of 2020), named data/data_on_months/news_lenta_XX_2020

The RIA and RBC news are collected the same way.

data_news_corrector_2020 combines all collected data and normalizes it to a common schema. Only the tags, dt, main_text, and website columns are kept. Final dataset: data/news_main_2020
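The combining step can be sketched as follows. This is a minimal, dependency-free illustration and does not claim to mirror the repository's code: the function names `normalize` and `combine`, the sample records, and the extra `url` field are all hypothetical. Only the four column names come from the description above.

```python
# Hypothetical sketch: merge per-site records and keep a common set of columns.
KEEP_COLUMNS = ("tags", "dt", "main_text", "website")


def normalize(records, website):
    """Keep only the shared columns and set `website` from the source name."""
    rows = []
    for rec in records:
        # Missing columns are filled with an empty string (an assumption).
        row = {col: rec.get(col, "") for col in KEEP_COLUMNS}
        row["website"] = website
        rows.append(row)
    return rows


def combine(sources):
    """sources: dict mapping a website name to a list of raw record dicts."""
    combined = []
    for website, records in sources.items():
        combined.extend(normalize(records, website))
    return combined


# Made-up sample records, for illustration only.
sources = {
    "lenta": [{"tags": "sport", "dt": "2020-01-05", "main_text": "text a", "url": "x"}],
    "ria": [{"tags": "economy", "dt": "2020-02-10", "main_text": "text b"}],
}
combined = combine(sources)
```

Every row in `combined` has exactly the four shared columns, regardless of which extra fields a given site provided.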

Data preprocessing

data_preprocessing preprocesses the data: stemming or lemmatization, stop-word removal, and replacement of numbers with a unified placeholder.

It also derives a common set of class labels from the websites' own tags for later classification:
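A rough sketch of the stop-word and number steps is shown below. The stop-word set, the `<num>` placeholder, and the function name `preprocess` are assumptions made for illustration; stemming and lemmatization are left out because they need an external library.

```python
import re

# Tiny illustrative stop-word set; a real list would be much longer.
STOP_WORDS = {"the", "a", "of", "in", "and"}
# Hypothetical placeholder that replaces every number.
NUM_TOKEN = "<num>"

# Match a number (optionally with a decimal part) or a run of word characters.
TOKEN_RE = re.compile(r"\d+(?:[.,]\d+)?|\w+")


def preprocess(text):
    """Lowercase, tokenize, drop stop words, and unify numbers."""
    result = []
    for token in TOKEN_RE.findall(text.lower()):
        if token[0].isdigit():
            result.append(NUM_TOKEN)
        elif token not in STOP_WORDS:
            result.append(token)
    return result


tokens = preprocess("The price of oil rose 3.5 percent in 2020")
# tokens == ["price", "oil", "rose", "<num>", "percent", "<num>"]
```

Replacing every number with one token keeps the vocabulary small, since individual numeric values rarely help to tell topics apart.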

  • economy
  • entertainment
  • traditions
  • science
  • society
  • sports
  • technology
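One simple way to derive these labels is a lookup table from site-specific tags to the common label set. In the sketch below only the seven label names come from the list above; the raw tags in `RAW_TO_LABEL` and the function name `unify_tag` are hypothetical.

```python
# The seven target labels listed above.
LABELS = {
    "economy", "entertainment", "traditions", "science",
    "society", "sports", "technology",
}

# Hypothetical mapping from raw site tags to the common labels.
RAW_TO_LABEL = {
    "finance": "economy",
    "business": "economy",
    "football": "sports",
    "culture": "entertainment",
    "gadgets": "technology",
}


def unify_tag(raw_tag):
    """Return the common label for a raw site tag, or None if it is unmapped."""
    return RAW_TO_LABEL.get(raw_tag.strip().lower())
```

Returning `None` for unmapped tags lets the caller decide whether to drop such articles or review them by hand.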

Final dataset: data/news_main_prepr_2020, as well as data/data_stem and data/data_lemm

Vectorization

vector_model_creator performs vectorization:

  • tfidf_lemm_500k - TF-IDF
  • d2v_300 - Doc2Vec
  • ft_lemm_300 - FastText
  • w2v_tfidf_vector_data - Word2Vec
  • glove_tfidf_vector_data - GloVe
  • use_vector_data - Universal Sentence Encoder
  • bert_vector_data - BERT

All models are stored in the models/ folder.
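As an illustration of the simplest method in the list, here is a minimal pure-Python TF-IDF sketch. It uses the plain variant tf = count / document length and idf = log(N / df), which differs from the smoothed formulas used by common libraries. The function name `tfidf` and the sample documents are assumptions, not the repository's code.

```python
import math
from collections import Counter


def tfidf(docs):
    """docs: list of token lists. Returns one {term: weight} dict per document."""
    n_docs = len(docs)
    # Document frequency: in how many documents each term occurs.
    df = Counter()
    for doc in docs:
        df.update(set(doc))
    idf = {term: math.log(n_docs / count) for term, count in df.items()}

    vectors = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc)
        vectors.append({t: (c / total) * idf[t] for t, c in counts.items()})
    return vectors


# Made-up tokenized documents, for illustration only.
docs = [["oil", "price", "rose"], ["oil", "match", "won"]]
vectors = tfidf(docs)
```

A term that occurs in every document, such as `oil` here, gets weight 0, while terms specific to one document get a positive weight.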

Classification

Classifier_news performs classification with the following models:

  • LogisticRegression
  • SVM
  • Single-layer perceptron
  • BERT
  • GPT-2

For the LogisticRegression, SVM, and single-layer perceptron models, the texts are first vectorized using one of the methods described above.
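The "vectorize first, then classify" flow can be sketched with a small multiclass perceptron over sparse feature dicts. This uses the classic perceptron update rule, which may well differ from the single-layer perceptron in the repository; the names `predict` and `train_perceptron` and the bag-of-words samples are hypothetical.

```python
def predict(weights, x):
    """Return the class whose weight vector scores highest on features x."""
    def score(cls):
        return sum(weights[cls].get(f, 0.0) * v for f, v in x.items())
    # Sorting makes tie-breaking deterministic (first class alphabetically).
    return max(sorted(weights), key=score)


def train_perceptron(samples, labels, epochs=10):
    """samples: list of {feature: value} dicts; labels: list of class names."""
    weights = {cls: {} for cls in set(labels)}
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            guess = predict(weights, x)
            if guess != y:
                # Move the true class towards x and the wrong guess away from it.
                for f, v in x.items():
                    weights[y][f] = weights[y].get(f, 0.0) + v
                    weights[guess][f] = weights[guess].get(f, 0.0) - v
    return weights


# Made-up bag-of-words vectors, for illustration only.
samples = [
    {"match": 1, "goal": 1},
    {"oil": 1, "price": 1},
    {"team": 1, "goal": 1},
    {"bank": 1, "price": 1},
]
labels = ["sports", "economy", "sports", "economy"]
weights = train_perceptron(samples, labels)
predictions = [predict(weights, x) for x in samples]
```

Any of the vector representations listed earlier could take the place of the bag-of-words dicts, as long as each text becomes a mapping from feature to value.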
