Indonesian Text Classifier

A simple Indonesian text classifier built with a scikit-learn Pipeline. Two models are currently available:

TF-IDF vectorization with SGD Classifier (linear SVM) (default) [3]: 61% average positive-class F1-score in cross-validation, with a small model size (< 800 KB)
Word2Vec x TF-IDF vectorization with RBF SVM [4]: 77% average positive-class F1-score in cross-validation, but with a very large model size (> 1 GB) because the model embeds the FastText pretrained word vectors (> 750 MB)
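As a rough illustration of what "Word2Vec x TF-IDF vectorization" can mean, the sketch below turns a sentence into one vector by averaging its word vectors, each scaled by an IDF weight, in the spirit of the approach described in [4]. This is an assumption about the general technique, not a copy of this repository's code: the helper names `idf_weights` and `embed`, the toy 2-dimensional vectors, and the smoothed IDF formula are all chosen here for illustration.

```python
import math
from collections import Counter


def idf_weights(tokenized_docs):
    """Smoothed inverse document frequency: log((1 + N) / (1 + df)) + 1."""
    n_docs = len(tokenized_docs)
    # Count each word once per document it appears in.
    doc_freq = Counter(word for doc in tokenized_docs for word in set(doc))
    return {w: math.log((1 + n_docs) / (1 + df)) + 1 for w, df in doc_freq.items()}


def embed(tokens, vectors, idf, dim):
    """Average the word vectors of `tokens`, each scaled by its IDF weight.

    Words missing from `vectors` or `idf` are skipped; a sentence with no
    known words maps to a zero vector.
    """
    known = [t for t in tokens if t in vectors and t in idf]
    if not known:
        return [0.0] * dim
    total = [0.0] * dim
    for t in known:
        for i, x in enumerate(vectors[t]):
            total[i] += idf[t] * x
    return [x / len(known) for x in total]


# Toy vectors standing in for the pretrained embeddings (hypothetical values).
vectors = {"harga": [1.0, 0.0], "gabah": [0.0, 1.0], "film": [1.0, 1.0]}
docs = [["harga", "gabah", "jatuh"], ["film", "serial"], ["harga", "film"]]

idf = idf_weights(docs)
sentence_vector = embed(["harga", "gabah"], vectors, idf, dim=2)
print(sentence_vector)
```

In this toy corpus "gabah" appears in fewer documents than "harga", so it receives the larger weight and pulls the sentence vector toward its own direction. The fixed-length vector is what a downstream classifier such as an RBF SVM would consume.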

Dependencies

Python 3.x
Sqlite3

Usage

Training

Install dependencies: pip install -r requirements.txt
Put the labeled dataset in dataset_labeled.csv
Run python train.py. Example:

$ python train.py

Cross-validating..
Classifier: SGDClassifier

No      test_pos_f1     test_pos_precision      test_pos_recall test_neg_f1     test_neg_precision      test_neg_recall
0       0.625   1.0     0.4545  0.9     0.8182  1.0
1       0.7778  1.0     0.6364  0.931   0.871   1.0
2       0.6667  0.8571  0.5455  0.8966  0.8387  0.963
3       0.5333  1.0     0.3636  0.8852  0.7941  1.0
4       0.625   1.0     0.4545  0.9     0.8182  1.0
5       0.8182  0.8182  0.8182  0.9259  0.9259  0.9259
6       0.7619  0.8889  0.6667  0.9091  0.8621  0.9615
7       0.1333  0.25    0.0909  0.7797  0.697   0.8846

Avg     0.6176  0.8518  0.5038  0.8909  0.8281  0.9669
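To make the table easier to read, the sketch below recomputes the positive-class F1 column from the precision and recall columns of the SGDClassifier run above, then averages it across the eight folds. The helper name `f1_score` is introduced here for illustration; the numbers are copied from the output above.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (test_pos_precision, test_pos_recall) per fold, copied from the table above.
folds = [
    (1.0, 0.4545),
    (1.0, 0.6364),
    (0.8571, 0.5455),
    (1.0, 0.3636),
    (1.0, 0.4545),
    (0.8182, 0.8182),
    (0.8889, 0.6667),
    (0.25, 0.0909),
]

f1_per_fold = [f1_score(p, r) for p, r in folds]
avg_f1 = sum(f1_per_fold) / len(f1_per_fold)

for number, f1 in enumerate(f1_per_fold):
    print(number, round(f1, 4))
print("Avg", round(avg_f1, 4))
```

The Avg row is the mean of the per-fold F1 values, not the F1 of the averaged precision and recall; applying the formula to the Avg precision (0.8518) and Avg recall (0.5038) would give about 0.633 rather than 0.6176.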

Training using Word2Vec

Download the pretrained Indonesian Wikipedia word vectors (wiki.id.vec) from Facebook Research and put the file in the same directory as train.py
Put the labeled dataset in dataset_labeled.csv
Run python train.py -m w2v-rbfsvm
Alternatively, run python train.py -m w2v-rbfsvm -e dataset_labeled.vec to use a custom embedding file trained with word2vec.py

Cross-validating..
Classifier: SVC

No      test_pos_f1     test_pos_precision      test_pos_recall test_neg_f1     test_neg_precision      test_neg_recall
0       0.8421  1.0     0.7273  0.9474  0.9     1.0
1       0.8421  1.0     0.7273  0.9474  0.9     1.0
2       0.8     0.8889  0.7273  0.9286  0.8966  0.963
3       0.625   1.0     0.4545  0.9     0.8182  1.0
4       0.9     1.0     0.8182  0.9643  0.931   1.0
5       0.8696  0.8333  0.9091  0.9434  0.9615  0.9259
6       0.8696  0.9091  0.8333  0.9434  0.9259  0.9615
7       0.4286  1.0     0.2727  0.8667  0.7647  1.0

Avg     0.7721  0.9539  0.6837  0.9301  0.8872  0.9813

You can train your own word2vec model by running python word2vec.py after putting the training data (sentences) in the first column of dataset_labeled.csv
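Embedding files such as wiki.id.vec use the word2vec text format: a header line with the word count and vector dimension, followed by one word and its values per line. The sketch below is a minimal pure-Python reader for that format, assuming a custom file like dataset_labeled.vec follows the same layout. The function name `load_vec` and its `limit` parameter are hypothetical and are not part of this repository.

```python
import io


def load_vec(handle, limit=None):
    """Read word2vec text format from an open text file.

    The first line is '<word count> <dimension>'; every following line is
    '<word> <v1> ... <v_dimension>'. At most `limit` words are kept.
    """
    header = handle.readline().split()
    dim = int(header[1])
    vectors = {}
    for line in handle:
        parts = line.rstrip().split(" ")
        if len(parts) != dim + 1:
            continue  # skip malformed or empty lines
        vectors[parts[0]] = [float(x) for x in parts[1:]]
        if limit is not None and len(vectors) >= limit:
            break
    return dim, vectors


# A tiny in-memory sample in place of a real embedding file.
sample = io.StringIO("3 2\nharga 0.1 0.2\ngabah 0.3 0.4\nfilm 0.5 0.6\n")
dim, vectors = load_vec(sample, limit=2)
print(dim, sorted(vectors))
```

With a real file, the handle would come from something like open("wiki.id.vec", encoding="utf-8"). Reading only the first `limit` words is one simple way to reduce memory use while experimenting, at the cost of dropping the words that appear later in the file.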

Testing

Train a model or obtain model.pkl, and put it in the same directory as test.py
Install dependencies (if you have not already): pip install -r requirements.txt
Run python test.py "sentence_1" "sentence_2" "sentence_n". Example:

$ python test.py "Harga Gabah Jatuh karena Hujan Berkepanjangan" "Donatella Klaim Film Serial Pembunuhan Gianni Versace Fiktif"

Preprocessing..
100% (2 of 2) |########################################################################| Elapsed Time: 0:00:00 Time: 0:00:00
Prediction(s):
Harga Gabah Jatuh karena Hujan Berkepanjangan (1)
Donatella Klaim Film Serial Pembunuhan Gianni Versace Fiktif (0)

Crawling titles from website

To automatically crawl titles (and links) from a website, classify them, and store the positive results, use crawler.py:

Copy .config.example to .config
Run python crawler.py

Post crawled title links to Facebook

To automatically post the crawled links to Facebook, use fb.py:

Copy .config.example to .config and replace all Facebook settings with valid values
Run python fb.py

To get a token without an expiry time, follow this suggestion from the Facebook documentation:

"To get a longer-lived page access token, exchange the User access token for a long-lived one, as above, and then request the Page access token. The resulting page access token will not have any expiry time."

$ python crawl.py
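The token exchange described in the quoted documentation passage above can be sketched as two Graph API requests. The code below only builds the request URLs and does not call the network. It assumes the `oauth/access_token` endpoint with the `fb_exchange_token` grant type and the `me/accounts` edge; the function names and placeholder credentials are hypothetical, and the endpoints should be checked against the current Graph API documentation before use.

```python
from urllib.parse import urlencode

GRAPH_URL = "https://graph.facebook.com"


def long_lived_user_token_url(app_id, app_secret, short_lived_token):
    """URL that exchanges a short-lived user token for a long-lived one."""
    params = {
        "grant_type": "fb_exchange_token",
        "client_id": app_id,
        "client_secret": app_secret,
        "fb_exchange_token": short_lived_token,
    }
    return f"{GRAPH_URL}/oauth/access_token?{urlencode(params)}"


def page_tokens_url(long_lived_user_token):
    """URL that lists the pages the user manages, with their page tokens."""
    params = {"access_token": long_lived_user_token}
    return f"{GRAPH_URL}/me/accounts?{urlencode(params)}"


# Placeholder values; real ones come from the Facebook app settings.
exchange_url = long_lived_user_token_url("APP_ID", "APP_SECRET", "SHORT_TOKEN")
pages_url = page_tokens_url("LONG_LIVED_TOKEN")
print(exchange_url)
print(pages_url)
```

The page access token returned by the second request is the value that would go into .config for fb.py.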

To Do List

Store the word2vec model in a database to save RAM
Replace stemming with lemmatization
Add more crawling sources
Try reinforcement learning instead of supervised learning

References

[1] Stemmer: https://github.com/har07/PySastrawi
[2] Stopwords list: https://github.com/stopwords-iso/stopwords-id
[3] Text classification with Sklearn Pipeline: https://bbengfort.github.io/tutorials/2016/05/19/text-classification-nltk-sckit-learn.html
[4] Text classification with Sklearn and Gensim Word2Vec: http://nadbordrozd.github.io/blog/2016/05/20/text-classification-with-word2vec/
[5] Facebook FastText pretrained word vectors: https://github.com/facebookresearch/fastText/blob/master/pretrained-vectors.md

yohanesgultom/id-text-classifier
