## How to reproduce the behaviour

When tokenizing Korean text, the tokenizer loses information about complex whitespace by replacing each run of whitespace with a single space. Consequently, `text` is not equal to `nlp(text).text`. This discrepancy causes a problem: entities described in the original text with `(start, end)` indices no longer align with those in the tokenized document. Similarly, the offsets of predicted entities do not match those in the original text.
For example, the described issue invalidates code like the following:
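(The snippet below is an illustrative sketch rather than the original reproduction; it assumes a blank `ko` pipeline, which uses the default mecab-ko-backed tokenizer, and a text containing a double space.)

```python
import spacy

# Requires the mecab-ko backend, like the stock Korean tokenizer.
nlp = spacy.blank('ko')

text = '안녕하세요  세계'  # note the double space
doc = nlp(text)

# The double space is collapsed to a single one, so the round trip fails:
assert doc.text == text  # AssertionError

# Character offsets computed on `text` therefore point at the wrong
# characters in `doc.text` (or at no token boundary at all):
start = text.find('세계')
span = doc.char_span(start, start + len('세계'))  # may be None or misaligned
```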
## Suggested solution

I presume it might not be the most correct/idiomatic/optimized solution, but at least it works for my case (named entity recognition).

This is how I solved the issue for my project:
```python
from typing import Generator

from spacy.lang.ko import POS, TAG_MAP, KoreanTokenizer, X, check_spaces
from spacy.tokens import Doc


class _CustomKoreanTokenizer(KoreanTokenizer):
    """Custom Korean tokenizer that preserves whitespaces.

    Required to make `text` equal to `doc.text` so that recognized entity
    offsets are correctly mapped to the original text. See parent class for
    more details.
    """

    @staticmethod
    def _add_whitespace_tokens(
        text: str, tokens: list[str]
    ) -> Generator[tuple[str, bool], None, None]:
        """Insert whitespace tokens into the list of `mecab-ko` tokens."""
        prev_end = 0
        for token in tokens:
            start = text.find(token, prev_end)
            if start == -1:
                raise ValueError(f'Token "{token}" not found in "{text}"')
            if prev_end < start:
                ws_token = text[prev_end:start]
                # Create whitespace tokens only if there is something more than
                # just a single whitespace or if it's the first token.
                if ws_token != ' ' or prev_end == 0:
                    if ws_token.startswith(' ') and prev_end > 0:
                        # Leading space goes to the `prev_token.whitespace_`
                        # if it's not the first token.
                        ws_token = ws_token[1:]
                    yield ws_token, True
            yield token, False
            prev_end = start + len(token)
        if prev_end < len(text):
            # Yield what is left as a whitespace token. We yield even just a
            # single whitespace as a token since otherwise `check_spaces` won't
            # catch it at the end of the text.
            yield text[prev_end:], True

    def __call__(self, text: str) -> Doc:
        """Tokenize `text` and create a spaCy Doc object.

        This is a slightly modified version of
        `spacy.lang.ko.KoreanTokenizer.__call__`. It preserves whitespaces,
        which allows entity offsets to be maintained.
        """
        dtokens = list(self.detailed_tokens(text))
        tokens = []
        is_spaces = []
        for token, is_space in self._add_whitespace_tokens(
            text, [dt['surface'] for dt in dtokens]
        ):
            tokens.append(token)
            is_spaces.append(is_space)
        doc = Doc(
            self.vocab, words=tokens, spaces=list(check_spaces(text, tokens))
        )
        for token, dtoken in zip(
            (
                token
                for token, is_space in zip(doc, is_spaces, strict=True)
                if not is_space
            ),
            dtokens,
            strict=True,
        ):
            first_tag, sep, eomi_tags = dtoken['tag'].partition('+')
            token.tag_ = first_tag  # stem(어간) or pre-final(선어말 어미)
            if token.tag_ in TAG_MAP:
                token.pos = TAG_MAP[token.tag_][POS]
            else:
                token.pos = X
            token.lemma_ = dtoken['lemma']
        doc.user_data['full_tags'] = [dt['tag'] for dt in dtokens]
        return doc
```
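To plug it in, the custom class can be swapped for the default tokenizer on the pipeline. A minimal sketch, assuming the constructor signature matches `KoreanTokenizer` in recent spaCy releases (a single shared-vocab argument):

```python
import spacy

nlp = spacy.blank('ko')
# Assumption: the tokenizer is constructed from the shared vocab,
# mirroring how spaCy's own factory builds KoreanTokenizer.
nlp.tokenizer = _CustomKoreanTokenizer(nlp.vocab)

text = '복잡한   공백이  있는  문장'
doc = nlp(text)
assert doc.text == text  # whitespace round-trips, so offsets stay aligned
```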
Tests for the solution:
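A minimal sketch of the kind of round-trip checks such tests can contain (illustrative inputs in pytest style, not the original test suite; the import path for `_CustomKoreanTokenizer` is hypothetical):

```python
import pytest
import spacy

from my_project.tokenizer import _CustomKoreanTokenizer  # hypothetical path


@pytest.fixture
def nlp():
    pipeline = spacy.blank('ko')
    pipeline.tokenizer = _CustomKoreanTokenizer(pipeline.vocab)
    return pipeline


@pytest.mark.parametrize(
    'text',
    [
        '두  칸',          # double space
        ' 앞 공백',        # leading space
        '뒤 공백 ',        # trailing space
        '탭\t과\n줄바꿈',  # tab and newline
    ],
)
def test_text_round_trips(nlp, text):
    # The whole point of the change: doc.text == text, so (start, end)
    # offsets computed on the raw text remain valid on the Doc.
    assert nlp(text).text == text
```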
## Your Environment