Support for Devanagri and Indian Languages #123

Open
srnthsrdhrn opened this issue May 6, 2021 · 0 comments

srnthsrdhrn commented May 6, 2021

Hi.
First of all, I would like to thank you for creating such a wonderful library. It really helps me a lot.

I am trying to use it with Devanagari (the script used for Hindi), and that is where I am running into problems.

When I extract keywords from a string, words that merely contain a keyword as a substring are matched as well.

Example:

If I search for "Pam", I also get "Pamella".

From my rough understanding of the underlying algorithm, these cases ideally shouldn't occur.

So I am assuming this is something to do with the script of the text. Do we have a solution for this?

I came across this issue with Chinese: #43

There you mentioned that the cause was the absence of proper tokenization for the language. If that is also the case here, I should be able to help in that regard.

For anyone who lands on this issue looking for a solution, I am temporarily using a workaround.

I use flashtext to extract the keywords, then use the regex library to search the text for only those extracted keywords. regex supports Unicode scripts, so expressions with word boundaries work for me.
flashtext reduces the search space, and regex then gives good turnaround times on that smaller set of keywords.
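
A rough sketch of the second step, in case it helps. This is not my exact code: to keep it dependency-free it uses only the standard library instead of the regex package, and the `candidates` list is a stand-in for whatever flashtext returned. The helper names (`is_word_char`, `whole_word_spans`, `confirm_keywords`) are made up for illustration. The assumption is that letters, combining marks (which covers Devanagari vowel signs), digits, and underscore all count as part of a word.

```python
import unicodedata


def is_word_char(ch: str) -> bool:
    """Treat letters (L*), combining marks (M*), digits (N*), and '_' as word characters."""
    return ch == "_" or unicodedata.category(ch)[0] in ("L", "M", "N")


def whole_word_spans(text: str, keyword: str):
    """Yield (start, end) for each occurrence of keyword that is not part of a longer word."""
    start = text.find(keyword)
    while start != -1:
        end = start + len(keyword)
        before_ok = start == 0 or not is_word_char(text[start - 1])
        after_ok = end == len(text) or not is_word_char(text[end])
        if before_ok and after_ok:
            yield start, end
        start = text.find(keyword, start + 1)


def confirm_keywords(text: str, candidates):
    """Keep only the candidate keywords that occur as whole words in text."""
    return [kw for kw in candidates if next(whole_word_spans(text, kw), None) is not None]


# Hypothetical candidates, standing in for the first-pass flashtext output.
print(confirm_keywords("Pamella called", ["Pam"]))
print(confirm_keywords("राम घर गया", ["राम"]))
print(confirm_keywords("रामायण पढ़ो", ["राम"]))
```

Only the keywords that survive this check are kept, so "Pam" inside "Pamella" and "राम" inside "रामायण" are dropped, while a standalone "राम" is kept.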
