Exact Match #109

AlpUygur · 2020-04-27T15:45:39Z

Hi,

I am using flashtext to search for 694 bad words in some documents, so that I can tag each document as containing bad language or not. But I need exact matches, because some harmless words contain a bad word as a substring. How can I make the search return exact matches only?

thakur-nandan · 2020-04-28T20:18:41Z

Hi @AlpUygur,

Just add the bad words to the KeywordProcessor using the add_keyword method, and make sure case_sensitive=True is set. I hope this solves your issue.

>>> from flashtext import KeywordProcessor
>>> keyword_processor = KeywordProcessor(case_sensitive=True)
>>> keyword_processor.add_keyword('Bad word 1')
>>> keyword_processor.add_keyword('Bad word 2')
>>> keywords_found = keyword_processor.extract_keywords('I have Bad word 1 and Bad word 2.')
>>> keywords_found
>>> # ['Bad word 1', 'Bad word 2']

Kind Regards,
Nandan Thakur

AlpUygur · 2020-05-01T18:48:19Z

Hello, thanks for your answer, but it didn't work in my case.

When I try to add the words in a for loop, it says:

"keyword_processor.add_keyword(content[i])
TypeError: list indices must be integers or slices, not str"

and I did not want to add all 694 of them by hand.
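
The quoted TypeError is what Python raises when a list is indexed with a string, which usually means the loop variable is already the element and not its position. A minimal sketch of that situation follows. The list `content` and its entries are hypothetical stand-ins for the real word list, and the loop that triggers the error is an assumption about the original code, which was not posted.

```python
# Hypothetical stand-in for the 694 entries read from the bad-word file.
content = ["bad", "worse", "worst"]

# "for i in content" yields the strings themselves, so content[i] indexes the
# list with a string and raises the TypeError quoted above.
try:
    for i in content:
        content[i]
except TypeError as exc:
    error_message = str(exc)

# Fix 1: use the element directly.
collected = [word for word in content]

# Fix 2: index with integers.
collected_by_index = [content[i] for i in range(len(content))]
```

With flashtext, each collected word could then be passed to add_keyword, or the whole list could be handed to add_keywords_from_list in one call.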

vi3k6i5 · 2020-05-02T16:13:06Z

Can you share a sample that fails?

AlpUygur · 2020-05-10T18:49:56Z

For example, I am looking for "am" in the text. It finds "am" when there is "cam" in the text.

vi3k6i5 · 2020-05-10T20:01:14Z

This should never happen. Can you pick that line, make a working example, and share it?

-- Vikash

iwpnd · 2020-05-11T08:00:42Z

@AlpUygur this only happens when the "c" in "cam" is, for whatever reason, not part of the non_word_boundaries. Depending on the character script of your input text, this can happen.

import string
non_word_boundaries = set(string.digits + string.ascii_letters + '_')

Then check whether that "c" is in non_word_boundaries.

If it is not, you have to add the missing characters to the non_word_boundaries of your KeywordProcessor instance manually via add_non_word_boundary().
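
The check described above can be sketched in plain Python. The helper name `chars_outside_defaults` and the sample words are assumptions made for illustration; the default set is built exactly as in the snippet above.

```python
import string

# Same construction as the default set shown above.
non_word_boundaries = set(string.digits + string.ascii_letters + "_")


def chars_outside_defaults(text):
    """Return the non-space characters of text that are not in the default set.

    Any character returned here would be treated as a word boundary.
    """
    return {ch for ch in text if not ch.isspace() and ch not in non_word_boundaries}


# A plain ASCII word has no such characters ...
ascii_result = chars_outside_defaults("cam")

# ... while a word with Turkish letters does (hypothetical sample input).
turkish_result = chars_outside_defaults("akşamüzeri")
```

Every character reported by such a check is a candidate for add_non_word_boundary().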

AlpUygur · 2020-05-16T14:42:29Z

from flashtext import KeywordProcessor

keyword_processor = KeywordProcessor()
keyword_processor.add_keyword_from_file(r"badwords.txt", encoding="utf-8")

def isBad(text, keyword_processor):
    # Return True if any keyword from the bad-word list is found in the text.
    keywords_found = keyword_processor.extract_keywords(text)
    if len(keywords_found) > 0:
        print(keywords_found)
        return True
    return False

# Note: no encoding is passed here, so the platform default is used.
with open(r"texts.txt", 'r') as f:
    text = f.read()

print(isBad(text, keyword_processor))

Output:
['am']
True

texts.txt
badwords.txt

The text file and the bad-word list are attached above. I looked through the text file and there is no standalone word "am" in it, but it is still found. "am" only occurs inside other words.

@iwpnd

iwpnd · 2020-05-16T15:38:11Z

Can’t be bothered. Re-read my last comment and read up on how flashtext treats word boundaries.

AlpUygur · 2020-05-16T15:46:15Z

It did not change anything when I added the non-word boundaries.

@iwpnd

iwpnd · 2020-05-17T08:30:48Z

@AlpUygur Just my luck then I guess. :P

keyword_processor = KeywordProcessor()
keyword_processor.add_keyword_from_file(r"badwords.txt", encoding="utf-8")
text = "akşamüzeri"
keyword_processor.extract_keywords(text)
>> ["am"]

Changing the non-word boundaries:

keyword_processor = KeywordProcessor()
keyword_processor.add_keyword_from_file(r"badwords.txt", encoding="utf-8")
keyword_processor.non_word_boundaries.update(["ş", "ü"])

text = "akşamüzeri"
keyword_processor.extract_keywords(text)
>> []
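
To illustrate why the two runs above differ, here is a simplified model of boundary-aware matching. It is not flashtext's actual implementation, which uses a trie; the function `find_whole_words` is a hypothetical helper that only mimics the rule that a match counts when the characters on both sides are outside the non-word-boundary set.

```python
import string


def find_whole_words(text, keyword, non_word_boundaries):
    """Return start offsets where keyword occurs with a boundary on both sides."""
    hits = []
    start = text.find(keyword)
    while start != -1:
        end = start + len(keyword)
        # A neighbour inside non_word_boundaries means the match is mid-word.
        before_ok = start == 0 or text[start - 1] not in non_word_boundaries
        after_ok = end == len(text) or text[end] not in non_word_boundaries
        if before_ok and after_ok:
            hits.append(start)
        start = text.find(keyword, start + 1)
    return hits


default_set = set(string.digits + string.ascii_letters + "_")
extended_set = default_set | {"ş", "ü"}

# With the ASCII-only defaults, "ş" and "ü" act as boundaries around "am".
hits_default = find_whole_words("akşamüzeri", "am", default_set)

# After extending the set, "am" is recognised as part of a longer word.
hits_extended = find_whole_words("akşamüzeri", "am", extended_set)

# In a purely ASCII word the defaults already prevent the match.
hits_ascii = find_whole_words("cam", "am", default_set)
```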

thakur-nandan · 2020-05-23T08:57:10Z

@AlpUygur maybe uninstall flashtext and install it again?
I can't find a reason why changing the non-word boundaries would make no difference.
Follow the steps mentioned by @iwpnd.

Thanks @iwpnd for the clear implementation :)

Kind Regards,
Nandan Thakur

AlpUygur · 2020-05-31T12:27:25Z

Thanks for the implementation. It is very clear.
On the other hand, I solved my problem by changing the encoding used to read the text file to UTF-8. Because the bad words are in UTF-8, the text file must also be read as UTF-8. I did not notice this difference for a long time, sorry :)
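
One plausible reading of this fix, sketched below under stated assumptions: if a UTF-8 file is decoded with a different codec, "ş" and "ü" turn into other characters, so adding "ş" and "ü" to the non-word boundaries cannot have any effect. The codec `latin-1` stands in for whatever the platform default was (that default is not stated in the thread), and the temporary file is a hypothetical substitute for texts.txt.

```python
import os
import tempfile

word = "akşamüzeri"

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "texts.txt")

    # Write the sample as UTF-8, as the real file is assumed to be.
    with open(path, "w", encoding="utf-8") as f:
        f.write(word)

    # Decoding with the wrong codec silently produces different characters.
    with open(path, "r", encoding="latin-1") as f:
        misread = f.read()

    # Decoding as UTF-8 restores the original text.
    with open(path, "r", encoding="utf-8") as f:
        correct = f.read()

added_boundaries = {"ş", "ü"}

# None of the added boundary characters occur in the misdecoded text.
present_in_misread = added_boundaries & set(misread)
present_in_correct = added_boundaries & set(correct)
```

Passing encoding="utf-8" to open() explicitly avoids depending on the platform default.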

thakur-nandan closed this as completed Apr 28, 2020

thakur-nandan reopened this Apr 28, 2020

thakur-nandan mentioned this issue May 23, 2020

Wrong matching result for word with accent marks #94

Open
