
internationalize word boundary checks #49

Open
aseifert wants to merge 1 commit into vi3k6i5:master from aseifert:master

Conversation

aseifert

Hi there,

I think the only safe way to deal with issue #48 would be to test against the \W class [1]. Judging from the benchmarks linked on https://github.com/vi3k6i5/flashtext#why-not-regex this seems to run slower by a factor of 1-2 though.

Best,
Alex

[1] Quoting the Python docs:

\b is defined as the boundary between a \w and a \W character (or vice versa), or between \w and the beginning/end of the string. This means that r'\bfoo\b' matches 'foo', 'foo.', '(foo)', 'bar foo baz' but not 'foobar' or 'foo3'.
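
For illustration, a minimal sketch of the regex-based check (the helper name here is just for this example; in the commit the compiled pattern is called NON_WORD_CHAR_REGEX):

import re

# \W is the complement of \w; with a str pattern on Python 3 it is Unicode-aware,
# so Cyrillic, accented and other non-ASCII letters count as word characters.
NON_WORD_CHAR_REGEX = re.compile(r'\W')

def is_word_boundary(char):
    return bool(NON_WORD_CHAR_REGEX.match(char))

print(is_word_boundary(' '))  # True  -- whitespace ends a keyword
print(is_word_boundary('р'))  # False -- a Cyrillic letter is part of a word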


coveralls commented Mar 19, 2018


Coverage increased (+0.7%) to 100.0% when pulling 9b6b187 on aseifert:master into 5591859 on vi3k6i5:master.

@ioistired

Another way, based on https://stackoverflow.com/a/2998550:

import unicodedata

def is_word_char(c, _categories=frozenset({'Ll', 'Lu', 'Lt', 'Lo', 'Lm', 'Nd', 'Pc'})):
    return unicodedata.category(c) in _categories

@@ -482,7 +457,7 @@ def extract_keywords(self, sentence, span_info=False):
while idx < sentence_len:
char = sentence[idx]
# when we reach a character that might denote word end
if char not in self.non_word_boundaries:
if KeywordProcessor.NON_WORD_CHAR_REGEX.match(char):


Why the ugly direct reference to the class? Just use self.
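
That is, something along these lines (same behaviour, assuming NON_WORD_CHAR_REGEX remains a class attribute of KeywordProcessor):

-            if KeywordProcessor.NON_WORD_CHAR_REGEX.match(char):
+            if self.NON_WORD_CHAR_REGEX.match(char):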


senpos commented Feb 21, 2020

Another way to do it:

from functools import lru_cache

from flashtext import KeywordProcessor


class NonWordBoundaries:
    """Set-like wrapper: a character counts as part of a word if any predicate accepts it."""

    def __init__(self, *predicates):
        self.predicates = predicates

    @lru_cache(maxsize=128)  # memoize the result per character
    def __contains__(self, ch):
        for predicate in self.predicates:
            if predicate(ch):
                return True
        return False


def main():
    words_to_search = ["рок"]

    keyword_processor = KeywordProcessor()
    # any alphabetic or numeric character is part of a word, regardless of script
    keyword_processor.set_non_word_boundaries(NonWordBoundaries(str.isalpha, str.isdigit))
    keyword_processor.add_keywords_from_list(words_to_search)
    keywords_found = keyword_processor.extract_keywords('рок порок роковой')
    print(keywords_found)  # ['рок'] -- no matches inside 'порок' or 'роковой'


if __name__ == '__main__':
    main()

Not sure about performance though. But at least it is easy to modify the behaviour.

@alexpeaceca

The benchmarks vs. regex are for the English-only character set. Does widening the word boundaries like this affect flashtext performance in any significant way?
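
For a rough feel, the two per-character checks could be timed in isolation; this is only an illustrative sketch (the set mirrors flashtext's default boundaries, the regex mirrors this PR), not a benchmark of extract_keywords itself:

import re
import string
import timeit

non_word_boundaries = set(string.digits + string.ascii_letters + '_')  # default boundaries
NON_WORD_CHAR_REGEX = re.compile(r'\W')                                 # check proposed here

text = 'foo bar рок порок роковой ' * 1000

def set_check():
    return sum(1 for ch in text if ch not in non_word_boundaries)

def regex_check():
    return sum(1 for ch in text if NON_WORD_CHAR_REGEX.match(ch))

print('set lookup :', timeit.timeit(set_check, number=100))
print('regex match:', timeit.timeit(regex_check, number=100))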
