Golang "native" implementation of word2vec algorithm (word2vec++ port)
The library brings the word2vec algorithm to Golang using the native runtime (no servers, no Python, etc.). This Golang module implements a CGO bridge to Max Fomichev's word2vec C++ library.
Use a C++11 compatible compiler and cmake 3.1 to build the library. This is an essential step before going further.
mkdir _build && cd _build
brew install cmake
cmake -DCMAKE_BUILD_TYPE=Release ../libw2v
make
cp ../libw2v/lib/libw2v.dylib /usr/local/lib/libw2v.dylib
Note: the project does not distribute library binaries yet; this is an upcoming feature. You have to build the binaries yourself for your target runtime, or raise an issue if you need help.
A trained model is required before moving on. Either use Max Fomichev's original word2vec C++ utility or the Golang front-end supplied by this project:
go install github.com/fogfish/word2vec/w2v@latest
In the following examples, "War and Peace" by Leo Tolstoy is used for training. Stop words were also used to increase accuracy.
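To illustrate what stop-word filtering means for a training corpus, here is a minimal sketch in plain Go. It is not how this project applies its stop-word list (that mechanism is not shown here); the function name `removeStopWords` and the tiny stop-word set are hypothetical and chosen only for illustration.

```go
package main

import (
	"fmt"
	"strings"
)

// removeStopWords lowercases the text, splits it on whitespace, trims
// surrounding punctuation, and drops every token found in the stop set.
// Hypothetical helper; not part of the word2vec library.
func removeStopWords(text string, stop map[string]struct{}) []string {
	var out []string
	for _, w := range strings.Fields(strings.ToLower(text)) {
		w = strings.Trim(w, ".,;:!?\"'()")
		if w == "" {
			continue
		}
		if _, skip := stop[w]; skip {
			continue
		}
		out = append(out, w)
	}
	return out
}

func main() {
	// A made-up stop-word set, far smaller than a real list.
	stopSet := map[string]struct{}{"was": {}, "the": {}, "of": {}}
	tokens := removeStopWords("Braunau was the headquarters of the commander-in-chief.", stopSet)
	fmt.Println(strings.Join(tokens, " "))
}
```

Removing very frequent function words leaves the training window filled with words that carry more meaning, which is the usual motivation for this step.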
Start by generating a config file, then run the training:
w2v train config > wap-en.yaml
w2v train -C wap-en.yaml \
-o wap-v300w5e5s1h005-en.bin \
-f ../doc/leo-tolstoy-war-and-peace-en.txt
Name the output model after the parameters used for training: v for vector size, w for the nearby-words window, e for training epochs, the architecture (s1 for skip-gram or s0 for CBoW), and the algorithm (h1 for hierarchical softmax or h0 for negative sampling).
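Because the vector size is encoded in the file name, it can be recovered when the model is loaded later. The sketch below parses names such as wap-v300w5e5s1h005-en.bin with a regular expression. `ModelParams` and `ParseModelName` are hypothetical helpers, not part of the library, and the digits after h are kept verbatim since their exact meaning is not spelled out here.

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
)

// ModelParams holds the training parameters encoded in a model file name.
// Hypothetical type; not part of the word2vec library.
type ModelParams struct {
	VectorSize int    // v: vector size
	Window     int    // w: nearby-words window
	Epochs     int    // e: training epochs
	SkipGram   bool   // s1 = skip-gram, s0 = CBoW
	Algorithm  string // digits following h, kept as written
}

var modelNameRe = regexp.MustCompile(`v(\d+)w(\d+)e(\d+)s([01])h(\d+)`)

// ParseModelName extracts the parameters from a file name that follows
// the naming convention described above.
func ParseModelName(name string) (ModelParams, error) {
	m := modelNameRe.FindStringSubmatch(name)
	if m == nil {
		return ModelParams{}, fmt.Errorf("no training parameters found in %q", name)
	}
	var p ModelParams
	var err error
	if p.VectorSize, err = strconv.Atoi(m[1]); err != nil {
		return ModelParams{}, err
	}
	if p.Window, err = strconv.Atoi(m[2]); err != nil {
		return ModelParams{}, err
	}
	if p.Epochs, err = strconv.Atoi(m[3]); err != nil {
		return ModelParams{}, err
	}
	p.SkipGram = m[4] == "1"
	p.Algorithm = m[5]
	return p, nil
}

func main() {
	params, err := ParseModelName("wap-v300w5e5s1h005-en.bin")
	if err != nil {
		panic(err)
	}
	fmt.Printf("%+v\n", params)
}
```

The parsed vector size could then be passed as the second argument when loading the model, instead of hard-coding 300.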
The default arguments give sufficient results; see the article Word2Vec: Optimal hyperparameters and their impact on natural language processing downstream tasks for a discussion of the training options.
The latest version of the library is available on its main branch. All development, including new features and bug fixes, takes place on the main branch using forking and pull requests, as described in the contribution guidelines. The stable version is available via Golang modules.
Use go get to retrieve the library and add it as a dependency to your application.
go get -u github.com/fogfish/word2vec
The example below shows the library's usage pattern: load a model, then look up the words nearest to a query.
import "github.com/fogfish/word2vec"
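// 1. Load the model; 300 matches the vector size in the file name (v300)
w2v, err := word2vec.Load("wap-v300w5e10s1h010-en.bin", 300)
if err != nil {
  panic(err)
}
// 2. Look up the 30 words nearest to "alexander"
seq := make([]word2vec.Nearest, 30)
w2v.Lookup("alexander", seq)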
See the example or try it out via the command line:
w2v lookup \
-m wap-v300w5e5s1h005-en.bin \
-k 30 \
alexander
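For intuition about what a nearest-word lookup computes, the sketch below ranks a toy vocabulary by cosine similarity to a query word. It is a plain-Go illustration, not the library's implementation: the `Neighbor` type, the `nearest` function, and the three-dimensional vectors are all made up for this example and are unrelated to word2vec.Nearest.

```go
package main

import (
	"fmt"
	"math"
	"sort"
)

// Neighbor pairs a word with its cosine similarity to the query.
// Hypothetical type, unrelated to word2vec.Nearest.
type Neighbor struct {
	Word       string
	Similarity float64
}

// toyVocab contains invented vectors, not values from a trained model.
var toyVocab = map[string][]float32{
	"alexander": {0.9, 0.1, 0.0},
	"emperor":   {0.8, 0.2, 0.1},
	"napoleon":  {0.7, 0.3, 0.0},
	"horse":     {0.0, 0.1, 0.9},
}

// cosine returns the cosine similarity of a and b, or 0 when the
// lengths differ or either vector has zero norm.
func cosine(a, b []float32) float64 {
	if len(a) != len(b) {
		return 0
	}
	var dot, na, nb float64
	for i := range a {
		dot += float64(a[i]) * float64(b[i])
		na += float64(a[i]) * float64(a[i])
		nb += float64(b[i]) * float64(b[i])
	}
	if na == 0 || nb == 0 {
		return 0
	}
	return dot / (math.Sqrt(na) * math.Sqrt(nb))
}

// nearest returns up to k words most similar to query, best first.
// It returns nil when the query is not in the vocabulary.
func nearest(vocab map[string][]float32, query string, k int) []Neighbor {
	q, ok := vocab[query]
	if !ok {
		return nil
	}
	if k < 0 {
		k = 0
	}
	out := make([]Neighbor, 0, len(vocab))
	for w, v := range vocab {
		if w == query {
			continue
		}
		out = append(out, Neighbor{Word: w, Similarity: cosine(q, v)})
	}
	sort.Slice(out, func(i, j int) bool { return out[i].Similarity > out[j].Similarity })
	if len(out) > k {
		out = out[:k]
	}
	return out
}

func main() {
	for _, n := range nearest(toyVocab, "alexander", 2) {
		fmt.Printf("%s %.3f\n", n.Word, n.Similarity)
	}
}
```

A real model holds tens of thousands of 300-dimensional vectors, so a full scan like this one is only suitable for illustration.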
Calculate the embedding for a document
import "github.com/fogfish/word2vec"
// 1. Load model
w2v, err := word2vec.Load("wap-v300w5e10s1h010-en.bin", 300)
// 2. Allocate memory for the vector
vec := make([]float32, 300)
// 3. Calculate embeddings for the document
doc := "braunau was the headquarters of the commander-in-chief"
err = w2v.Embedding(doc, vec)
See the example or try it out via the command line:
w2v embedding \
-m wap-v300w5e5s1h005-en.bin \
../doc/leo-tolstoy-war-and-peace-en.txt
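One common way to turn word vectors into a document vector is to average the vectors of the known words. Whether this library uses exactly that scheme is an assumption; the sketch below only illustrates the idea in plain Go, with a hypothetical `embed` function and invented two-dimensional vectors.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// embed writes the mean of the word vectors of doc into vec.
// Words missing from vocab, or with a mismatched dimension, are skipped.
// Hypothetical helper; it does not reproduce the library's Embedding method.
func embed(vocab map[string][]float32, doc string, vec []float32) error {
	for i := range vec {
		vec[i] = 0
	}
	n := 0
	for _, w := range strings.Fields(strings.ToLower(doc)) {
		v, ok := vocab[w]
		if !ok || len(v) != len(vec) {
			continue
		}
		for i := range vec {
			vec[i] += v[i]
		}
		n++
	}
	if n == 0 {
		return errors.New("document contains no known words")
	}
	for i := range vec {
		vec[i] /= float32(n)
	}
	return nil
}

func main() {
	// Invented vectors, not values from a trained model.
	vocab := map[string][]float32{
		"braunau":      {1, 0},
		"headquarters": {0, 1},
	}
	vec := make([]float32, 2)
	if err := embed(vocab, "braunau was the headquarters", vec); err != nil {
		panic(err)
	}
	fmt.Println(vec)
}
```

As in the library call above, the caller allocates the output slice, so its length must match the model's vector size.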
The library is MIT licensed and accepts contributions via GitHub pull requests:
- Fork it
- Create your feature branch (git checkout -b my-new-feature)
- Commit your changes (git commit -am 'Added some feature')
- Push to the branch (git push origin my-new-feature)
- Create new Pull Request
The build and testing process requires Go version 1.21 or later.
The commit message helps us write good release notes and speeds up the review process. The message should address two questions: what changed and why. The project follows the template defined in the chapter Contributing to a Project of the Git book.
If you experience any issues with the library, please let us know via GitHub issues. We appreciate detailed and accurate reports that help us identify and replicate the issue.