semantic-sh is a SimHash implementation that detects and groups similar texts using word vectors and transformer-based language models such as BERT.





Requirements

  • fasttext
  • transformers
  • pytorch
  • numpy
  • flask

Installing via pip

$ pip install semantic-sh


from semantic_sh import SemanticSimHash

Use with BERT:

sh = SemanticSimHash(model_type='bert-base-multilingual-cased', dim=768)

Use with fasttext:

sh = SemanticSimHash(model_type='fasttext', dim=300, model_path='/path/to/cc.en.300.bin')

Use with GloVe:

sh = SemanticSimHash(model_type='glove', dim=50, model_path='/path/to/glove.6B.50d.txt')

Use with word2vec:

sh = SemanticSimHash(model_type='word2vec', dim=300, model_path='/path/to/en.w2v.txt')

Additional parameters

Customize the threshold (default: 0) and the hash length (default: 256 bits), and add a stop word list.

sh = SemanticSimHash(model_type='fasttext', key_size=128, dim=300, model_path='path_to_fasttext_vectors.bin', thresh=0.8, stop_words=['the', 'i', 'you', 'he', 'she', 'it', 'we', 'they'])

Note: BERT-based models do not require a stop word list.

Hash your text

sh.get_hash(['<your_text_0>', '<your_text_1>'])

Add document

Add your document to the proper group

sh.add_document(['<your_text_0>', '<your_text_1>'])

Find similar

Get all documents in the same group as the given text


Get Hamming Distance between 2 texts

sh.get_distance('<first_text>', '<second_text>')
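For readers unfamiliar with the metric: the Hamming distance between two hashes is the number of bit positions in which they differ. The helper below is a standalone illustration on plain integers, not the library's internal code; `hamming_distance` is a hypothetical name.

```python
def hamming_distance(hash_a: int, hash_b: int) -> int:
    """Count the bit positions where two integer hashes differ."""
    # XOR leaves a 1 exactly where the two hashes disagree.
    return bin(hash_a ^ hash_b).count("1")


# Two 8-bit example hashes that differ in two positions.
print(hamming_distance(0b10110100, 0b10010110))
```

A small distance means the two texts landed on nearly the same side of every random vector, which is what makes them candidates for the same group.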

Go through all document groups

Get all similar document groups which have more than 1 document

for docs in sh.get_similar_groups():
    print(docs)
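Conceptually, a group is a set of documents whose hashes collide. The sketch below illustrates that idea with a plain dictionary keyed by hash value. It is an assumption about the general approach, not the library's actual data structure; `group_by_hash` and the toy hash values are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List


def group_by_hash(hashes: Dict[str, int]) -> List[List[str]]:
    """Return groups of document names that share the same hash (size > 1)."""
    buckets = defaultdict(list)
    for name, value in hashes.items():
        buckets[value].append(name)
    # Keep only buckets that contain more than one document.
    return [names for names in buckets.values() if len(names) > 1]


# Toy hashes standing in for real SimHash output.
toy_hashes = {"doc_a": 0x5D, "doc_b": 0x7F, "doc_c": 0x5D}
print(group_by_hash(toy_hashes))
```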

Save data

Save added documents, hash function, model and parameters

sh.save('model.dat')

Load from saved file

Load all parameters, documents, hash function and model from saved file

sh = SemanticSimHash.load('model.dat')

API Server

Easily deploy a simple text similarity engine on the web.


$ git clone

Standalone Usage [-h] [--host HOST] [--port PORT] [--model-type MODEL_TYPE]
                 [--model-path MODEL_PATH] [--key-size KEY_SIZE] [--dim DIM]
                 [--stop-words [STOP_WORDS [STOP_WORDS ...]]]
                 [--load-from LOAD_FROM]

optional arguments:
  -h, --help            show this help message and exit

  --host HOST
  --port PORT

  --model-type MODEL_TYPE
                        Type of model to run: fasttext or any pretrained model
                        name from huggingface/transformers
  --model-path MODEL_PATH
                        Path to vector files of fasttext models
  --key-size KEY_SIZE   Hash length in bits
  --dim DIM             Dimension of text representations according to chosen
                        model type
  --stop-words [STOP_WORDS [STOP_WORDS ...]]
                        List of stop words to exclude

  --load-from LOAD_FROM
                        Load previously saved state

Using with WSGI Container

from gevent.pywsgi import WSGIServer
from server import init_app

# params: the same parameters used to initialize a SemanticSimHash object
app = init_app(params)

http_server = WSGIServer(('', 5000), app)
http_server.serve_forever()

NOTE: The sample code uses gevent, but any WSGI container that works with a Flask app object can be used instead.

API Reference

POST /api/hash

Return hashes of given documents

Request Body

{
    "documents": [
        "Here is the first document",
        "and second document"
    ]
}

Response Body

    "hashes": [

POST /api/add

Add given documents and return hash and custom IDs of the documents

Request Body

{
    "documents": [
        "Here is the first document",
        "and second document"
    ]
}

Response Body

{
    "documents": [
        {
            "id": 1,
            "hash": "0x5d134944428a4"
        },
        {
            "id": 2,
            "hash": "0x7f636944d8c8"
        }
    ]
}

POST /api/find-similar

Return similar documents to given text

Request Body

{
    "text": "Here is the text"
}

Response Body

{
    "similar_texts": [
        "Here is the text",
        "First text here",
        "Here is text"
    ]
}

POST /api/distance

Return Hamming distance between source and target texts

Request Body

{
    "src": "Here is the source text",
    "tgt": "Target text for measuring distance"
}

Response Body

{
    "distance": 21
}

GET /api/similarity-groups

Return buckets having more than one document ID

GET /api/text/<int:id>

Return the document according to its ID
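A minimal client sketch for the POST endpoints above, using only the standard library. The base URL `http://localhost:5000` is an assumption (use whatever host and port the server was started with), and `build_request` is a hypothetical helper. The request is only constructed here; sending it requires a running server.

```python
import json
import urllib.request

BASE_URL = "http://localhost:5000"  # assumed address; adjust to your deployment


def build_request(endpoint: str, payload: dict) -> urllib.request.Request:
    """Build a JSON POST request for one of the API endpoints."""
    return urllib.request.Request(
        url=BASE_URL + endpoint,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = build_request(
    "/api/distance",
    {"src": "Here is the source text", "tgt": "Target text for measuring distance"},
)

# To send it against a running server:
# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp)["distance"])
```

The same helper should work for `/api/hash`, `/api/add` and `/api/find-similar` by swapping the endpoint and payload for the request bodies shown above.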

With docker

Run the API server on port 4000

docker run -ti -p 4000:4000 -v `pwd`/data:/opt/data  semantic-sh:latest --port=4000 --model-type=bert-base-multilingual-cased --model-path=/opt/data

With docker-compose

Run the API server on port 4000

docker-compose up -d semantic-sh

Some Implementation Details

This is a simplified SimHash implementation: it generates random vectors, takes the dot product of each one with the representation of the text, and assigns a 1 or 0 to the corresponding hash bit according to the result.
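The idea can be sketched in a few lines of NumPy. This is an illustration under stated assumptions, not the library's code: `simhash_bits` is a hypothetical name, the random vectors are drawn from a standard normal distribution, a bit is set when the dot product is strictly greater than the threshold, and random vectors stand in for real text embeddings.

```python
import numpy as np


def simhash_bits(embedding: np.ndarray, planes: np.ndarray, thresh: float = 0.0) -> np.ndarray:
    """Return one bit per random vector: 1 if its dot product exceeds thresh."""
    return (planes @ embedding > thresh).astype(np.uint8)


rng = np.random.default_rng(seed=0)
dim, key_size = 300, 256
planes = rng.standard_normal((key_size, dim))  # one random vector per hash bit

text_vec = rng.standard_normal(dim)  # stand-in for a real text embedding
near_vec = text_vec + 0.01 * rng.standard_normal(dim)  # slightly perturbed copy
far_vec = rng.standard_normal(dim)  # unrelated vector

h_text, h_near, h_far = (simhash_bits(v, planes) for v in (text_vec, near_vec, far_vec))
print("near:", int(np.sum(h_text != h_near)), "far:", int(np.sum(h_text != h_far)))
```

Vectors that point in almost the same direction fall on the same side of most random vectors, so their hashes differ in few bits, while unrelated vectors disagree on roughly half of them.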
