Exact Match #109

Open
AlpUygur opened this issue Apr 27, 2020 · 12 comments

@AlpUygur

Hi,

I am using flashtext to search some documents for 694 bad words, so that I can tag whether each document contains bad language. But I need exact matches, because some harmless words contain a bad word as a substring. How can I restrict the search to exact matches?

@thakur-nandan
Collaborator

Hi @AlpUygur,

Just add the bad words to the KeywordProcessor using the add_keyword method, and make sure case_sensitive=True is set. I hope this solves your issue.

>>> from flashtext import KeywordProcessor
>>> keyword_processor = KeywordProcessor(case_sensitive=True)
>>> keyword_processor.add_keyword('Bad word 1')
>>> keyword_processor.add_keyword('Bad word 2')
>>> keywords_found = keyword_processor.extract_keywords('I have Bad word 1 and Bad word 2.')
>>> keywords_found
>>> # ['Bad word 1', 'Bad word 2']

Kind Regards,
Nandan Thakur

@AlpUygur
Author

AlpUygur commented May 1, 2020

Hello, thanks for your answer, but it didn't work in my case.

When I try to add the words in a for loop, it says:

"keyword_processor.add_keyword(content[i])
TypeError: list indices must be integers or slices, not str"

and I did not want to add all 694 of them by hand.
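
One common cause of this TypeError is writing `for i in content` and then indexing with `content[i]`: the loop variable is already the word itself, not a position. The thread does not show the actual loop, so the following is only a sketch under that assumption. The list `content` and the helper `clean_keywords` are hypothetical names used for illustration.

```python
# Hypothetical word list standing in for the 694 bad words,
# e.g. as returned by file.readlines().
content = ["bad1\n", "bad2\n", "\n", "bad3"]


def clean_keywords(lines):
    """Strip surrounding whitespace and drop empty lines."""
    return [line.strip() for line in lines if line.strip()]


# Buggy pattern: `i` is each element (a str), not an integer index,
# so content[i] raises the TypeError quoted above.
try:
    for i in content:
        content[i]
    loop_error = None
except TypeError as exc:
    loop_error = str(exc)

# Fixed pattern: iterate over the items directly.
keywords = clean_keywords(content)

# With flashtext, each cleaned item could then be added in a loop:
#   for word in keywords:
#       keyword_processor.add_keyword(word)
```

Iterating over the items directly avoids indexing altogether, so no word has to be added by hand.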

@vi3k6i5
Owner

vi3k6i5 commented May 2, 2020

can you share some sample which fails ?

@AlpUygur
Author

For example, I am looking for "am" in the text. It finds "am" when the text contains "cam".

@vi3k6i5
Owner

vi3k6i5 commented May 10, 2020 via email

@iwpnd

iwpnd commented May 11, 2020

@AlpUygur this only happens when the "c" in "cam" is, for whatever reason, not part of the non_word_boundaries. Depending on the script of your input text, this can happen.

import string
non_word_boundaries = set(string.digits + string.ascii_letters + '_')

Then check if that "c" is in non_word_boundaries.

If it is not, you have to manually add the missing characters to the non_word_boundaries of your KeywordProcessor instance via add_non_word_boundary().
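
The check described above can be sketched with the standard library alone. The set below is the one given in this comment. The helper name `chars_outside_boundaries` is hypothetical, and the sample word is taken from later in this thread.

```python
import string

# The set quoted in the comment above.
non_word_boundaries = set(string.digits + string.ascii_letters + "_")


def chars_outside_boundaries(text):
    """Return the characters in `text` that are not in non_word_boundaries."""
    return sorted({ch for ch in text if ch not in non_word_boundaries})


# "c" is covered, so "cam" alone should not split into "c" + "am".
c_is_covered = "c" in non_word_boundaries

# Non-ASCII letters are not covered and would act as word boundaries.
missing = chars_outside_boundaries("akşamüzeri")
print(c_is_covered, missing)
```

Any character reported as missing would be a candidate for add_non_word_boundary().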

@AlpUygur
Author

AlpUygur commented May 16, 2020

This should never happen, can you pick that line and make a working example and share that.

-- Vikash

from flashtext import KeywordProcessor

keyword_processor = KeywordProcessor()
keyword_processor.add_keyword_from_file(r"badwords.txt", encoding="utf-8")

def isBad(text, keyword_processor):
    keywords_found = keyword_processor.extract_keywords(text)
    if len(keywords_found) > 0:
        print(keywords_found)
        return True
    return False

with open(r"texts.txt",'r') as f:
    text = f.read()


print(isBad(text,keyword_processor))

Output:
['am']
True

texts.txt
badwords.txt

The text file and the bad-word list are attached here. I looked through the text file and there is no standalone word "am" in it, but it is still found. "am" only occurs inside other words.

@iwpnd

@iwpnd

iwpnd commented May 16, 2020

Can’t be bothered. Re-read my last comment and read up on how flashtext treats word boundaries.

@AlpUygur
Author

It did not change anything when I added the non-word boundary.

@iwpnd

@iwpnd

iwpnd commented May 17, 2020

@AlpUygur Just my luck then I guess. :P

keyword_processor = KeywordProcessor()
keyword_processor.add_keyword_from_file(r"badwords.txt", encoding="utf-8")
text = "akşamüzeri"
keyword_processor.extract_keywords(text)
>> ["am"]

Changing the non-word boundaries:

keyword_processor = KeywordProcessor()
keyword_processor.add_keyword_from_file(r"badwords.txt", encoding="utf-8")
keyword_processor.non_word_boundaries.update(["ş", "ü"])

text = "akşamüzeri"
keyword_processor.extract_keywords(text)
>> []
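
To make the effect of the boundary set easier to see, here is a deliberately simplified model of whole-word matching in plain Python. It is not flashtext's actual implementation (which uses a trie). The function `find_whole_words` and the Turkish letter set are assumptions for illustration only.

```python
import string

DEFAULT_BOUNDARIES = set(string.digits + string.ascii_letters + "_")


def find_whole_words(text, keywords, non_word_boundaries):
    """Report a keyword only if the characters on both sides of it
    are outside `non_word_boundaries` (or it touches the text edge)."""
    found = []
    for kw in keywords:
        start = text.find(kw)
        while start != -1:
            end = start + len(kw)
            before_ok = start == 0 or text[start - 1] not in non_word_boundaries
            after_ok = end == len(text) or text[end] not in non_word_boundaries
            if before_ok and after_ok:
                found.append(kw)
            start = text.find(kw, start + 1)
    return found


text = "akşamüzeri"

# "ş" and "ü" are not in the ASCII-only set, so "am" looks like a word.
default_hits = find_whole_words(text, ["am"], DEFAULT_BOUNDARIES)

# Assumed extension: treat Turkish letters as word characters too.
turkish_boundaries = DEFAULT_BOUNDARIES | set("çğıöşüÇĞİÖŞÜ")
fixed_hits = find_whole_words(text, ["am"], turkish_boundaries)

print(default_hits, fixed_hits)
```

In this model, the match disappears once the neighbouring letters count as part of a word, which mirrors the two results shown above.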

@thakur-nandan
Collaborator

@AlpUygur maybe try uninstalling flashtext and installing it again?
I can't find a reason why changing the non-word boundaries would make no difference.
Follow the steps mentioned by @iwpnd.

Thanks @iwpnd for the clear implementation :)

Kind Regards,
Nandan Thakur

@AlpUygur
Author

Thanks for the implementation. It is very clear.
On the other hand, I solved my problem by changing the encoding used to read the text file to UTF-8. Because the bad words are stored as UTF-8, the text file must be read as UTF-8 as well. I did not notice this difference for a long time, sorry :)
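
The encoding fix can be illustrated with a small, self-contained sketch. The thread does not say which encoding was used before the change, so `latin-1` below is only an assumed stand-in for a wrong default, and the temporary file replaces the real texts.txt.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "texts.txt")

    # Write a sample word as UTF-8 bytes.
    with open(path, "w", encoding="utf-8") as f:
        f.write("akşamüzeri")

    # Reading with the matching encoding returns the original text.
    with open(path, "r", encoding="utf-8") as f:
        good = f.read()

    # Reading with a mismatched encoding (assumed here: latin-1) turns
    # each non-ASCII letter into two unrelated characters.
    with open(path, "r", encoding="latin-1") as f:
        garbled = f.read()

print(repr(good))
print(repr(garbled))
```

Passing `encoding="utf-8"` explicitly to `open()` avoids depending on the platform default, which differs between systems.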
