Word2Vec: Custom implementation in PyTorch

A custom implementation of Word2Vec, following the original paper Efficient Estimation of Word Representations in Vector Space.

It uses a minimum of third-party packages; most of the functionality is implemented using basic PyTorch features.


Overview

  • There are two model architectures implemented in this project (a minimal sketch follows this list):
    • Continuous Bag-of-Words Model (CBOW), which predicts a word from its context
    • Continuous Skip-gram Model (Skip-Gram), which predicts the context for a given word
  • Models are trained on the text8 corpus, derived from the first 10⁹ bytes of the English Wikipedia dump of Mar. 3, 2006
  • For both models, the context is the 5 words before and 5 words after the center word
  • The AdamW optimizer is used
  • Models are trained for 5 epochs
  • The vocabulary size is limited to 5,000 words
  • Results can be compared with the reference Gensim Word2Vec module
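
In essence, both architectures reduce to an embedding layer followed by a linear projection back onto the vocabulary. The sketch below is illustrative only, assuming plain PyTorch embedding and linear layers; the actual classes live in src/custom_word2vec.py and may differ in naming and detail.

import torch
import torch.nn as nn

class CBOW(nn.Module):
    """Illustrative CBOW: predicts the center word from averaged context embeddings."""
    def __init__(self, vocab_size: int, embedding_size: int):
        super().__init__()
        self.embeddings = nn.Embedding(vocab_size, embedding_size)
        self.linear = nn.Linear(embedding_size, vocab_size)

    def forward(self, context: torch.Tensor) -> torch.Tensor:
        # context: (batch, 2 * window) word indices -> (batch, vocab_size) logits
        return self.linear(self.embeddings(context).mean(dim=1))

class SkipGram(nn.Module):
    """Illustrative Skip-Gram: predicts context words from the center word embedding."""
    def __init__(self, vocab_size: int, embedding_size: int):
        super().__init__()
        self.embeddings = nn.Embedding(vocab_size, embedding_size)
        self.linear = nn.Linear(embedding_size, vocab_size)

    def forward(self, center: torch.Tensor) -> torch.Tensor:
        # center: (batch,) word indices -> (batch, vocab_size) logits
        return self.linear(self.embeddings(center))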

Repository mirrors

Repository structure

.
├── dataset
│   └── text8.txt
├── imgs
│   ├── cbow.png
│   └── gensim.png
├── notebooks
│   ├── evaluation.ipynb
│   └── training.ipynb
├── results
│   ├── cbow
│   └── skipgram
├── src
│   ├── custom_word2vec.py
│   ├── dataloader.py
│   ├── gensim_word2vec.py
│   ├── metric_monitor.py
│   ├── trainer.py
│   └── vocab.py
├── main.py
├── README.md
└── requirements.txt
  • dataset/text8.txt - text8 corpus file
  • imgs/ - images for documentation
  • notebooks/training.ipynb - demo of the training procedure
  • notebooks/evaluation.ipynb - demo for visually evaluating models
  • results/ - folder for storing results
  • src/custom_word2vec.py - custom Word2Vec model
  • src/dataloader.py - dataloader-related classes and functions (see the pair-generation sketch after this list)
  • src/gensim_word2vec.py - Gensim Word2Vec model
  • src/metric_monitor.py - metric monitor class
  • src/trainer.py - trainer class
  • src/vocab.py - vocabulary class
  • main.py - main script for training
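
As an illustration of how the ±5-word window turns the corpus into training samples, here is a minimal sketch of CBOW pair generation over an already-encoded token sequence. The function name and layout are hypothetical; the real logic in src/dataloader.py and src/vocab.py may be organized differently.

from typing import Iterator, List, Tuple

WINDOW = 5  # 5 words before and 5 words after the center word

def cbow_pairs(token_ids: List[int]) -> Iterator[Tuple[List[int], int]]:
    """Yield (context_ids, center_id) pairs from an encoded token sequence."""
    for i in range(WINDOW, len(token_ids) - WINDOW):
        context = token_ids[i - WINDOW:i] + token_ids[i + 1:i + WINDOW + 1]
        yield context, token_ids[i]

# Toy example with 20 fake token ids
context, center = next(cbow_pairs(list(range(20))))
print(context, center)  # [0, 1, 2, 3, 4, 6, 7, 8, 9, 10] 5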

Usage

Training in local environment

python main.py

Before running the command, the following parameters can be changed in the main.py file:

  • MAX_VOCAB_SIZE - Max vocabulary size
  • EPOCHS - Number of epochs
  • MODEL_TYPE - Model type to be used: "cbow" or "skipgram"
  • EMBEDDING_SIZE - Embedding size
  • SAVE_PATH - Path for saving results

By default, the parameters are similar to the ones used in Gensim.
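
For example, the configuration constants at the top of main.py might look like the following; the values shown here are illustrative, so check the file itself for the actual defaults.

# Illustrative values only; the actual defaults are defined in main.py
MAX_VOCAB_SIZE = 5000        # max vocabulary size (the overview uses 5,000 words)
EPOCHS = 5                   # number of training epochs
MODEL_TYPE = "cbow"          # "cbow" or "skipgram"
EMBEDDING_SIZE = 100         # embedding size (100 is the Gensim default)
SAVE_PATH = "results/cbow"   # path for saving results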

Using notebooks

  • The notebooks/training.ipynb notebook can be used to run the training process in Colab or Kaggle environments
  • The notebooks/evaluation.ipynb notebook can be used to evaluate different models, e.g. to display scatter plots or find similar words
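
For instance, the "find similar words" check can be done with a simple cosine-similarity lookup over the trained embedding matrix. The snippet below is a hypothetical sketch: embeddings, word_to_idx, and idx_to_word stand for the embedding weight matrix and vocabulary mappings produced during training, and the notebook may implement this differently.

import torch
import torch.nn.functional as F

def most_similar(word, embeddings, word_to_idx, idx_to_word, topn=5):
    """Return the topn words closest to `word` by cosine similarity."""
    normed = F.normalize(embeddings, dim=1)   # unit-length embedding rows
    scores = normed @ normed[word_to_idx[word]]
    scores[word_to_idx[word]] = -1.0          # exclude the query word itself
    best = torch.topk(scores, topn).indices.tolist()
    return [(idx_to_word[i], round(scores[i].item(), 3)) for i in best]

With the reference Gensim model, the equivalent check is model.wv.most_similar(word, topn=5).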

Here are two examples of word groupings:

Gensim model: imgs/gensim.png

CBOW custom model: imgs/cbow.png

License

This project is licensed under the terms of the MIT license.