Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Flashtext seem inferior to hash table - am I missing something? #106

Open
Make42 opened this issue Apr 20, 2020 · 1 comment
Open

Flashtext seem inferior to hash table - am I missing something? #106

Make42 opened this issue Apr 20, 2020 · 1 comment

Comments

@Make42
Copy link

Make42 commented Apr 20, 2020

Maybe I missed something, but I do not understand what the advantage of using the tie dictionary is over using a hashing dictionary is. Can you tell me what I am missing?

Let K denote the number of keywords in the corpus and Q the number of query words for which you request the check for.

I am guessing that https://www.freecodecamp.org/news/regex-was-taking-5-days-flashtext-does-it-in-15-minutes-55f04411025f/ reflects the time per query. So we have query time complexity of O(Q * K). I actually think this is weird - instinctually I would have expected O(Q * log(K)). Anyhow, just building a hash table would take O(Q).

Regarding build the respective data structures, flashtext should probably take something like Q(K * log(K). The first K is because you need to insert K words for which you need to query the tries data structure. However, since the trie is small at the beginning I put the second K into log. In opposite, the hash table only requires O(1), so we get O(K) in total for inserting the K words.

Assuming I am right: A way to make flashtext useful could be if you build two tries: One for the query words and one for the dictionary. Building the tries would take O(Q * log(Q) + K * log(K)), which would be slower than O(K) for the hash table. But querying would take about O(log(K) * log(Q)), which would be better the O(Q) if Q is large enough and and K is small enough.

Another application for the principles might be for finding longest common substrings.

These two ideas for improvement would of course need a bit more implementation.

@thakur-nandan
Copy link
Collaborator

Hi @Make42

Regarding the advantages of using the trie dictionary is over a hashing dictionary check this StackOverflow answer- https://stackoverflow.com/questions/245878/how-do-i-choose-between-a-hash-table-and-a-trie-prefix-tree.

Kind Regards,
Nandan Thakur

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants