Lowercase lemmatization in pipe, when tok2vec disabled #13511

Open
BMarcin opened this issue May 29, 2024 · 0 comments

BMarcin commented May 29, 2024

Introduction

Hi! I am using the spaCy lemmatizer. When I process texts with nlp.pipe to speed things up, I get different lemmas depending on whether tok2vec is disabled. With tok2vec disabled, "Marcin", "I" and "HomeLab" come back lowercased, and "is" and "running" are not reduced to "be" and "run". Preserving case is critical for me. Is the behavior below expected?

How to reproduce the behaviour

Case 1: ner and tok2vec disabled

import spacy

nlp = spacy.load("en_core_web_sm")

texts = ["Hello! My name is Marcin.", "I have a SFTP server running in my HomeLab"]

# Disable both ner and tok2vec for this run.
for doc in nlp.pipe(texts, batch_size=100, n_process=1, disable=["ner", "tok2vec"]):
    for token in doc:
        print(token.text, token.lemma_)
    print("")

Output:

Hello hello
! !
My my
name name
is is
Marcin marcin
. .

I i
have have
a a
SFTP sftp
server server
running running
in in
my my
HomeLab homelab

Case 2: only ner disabled

import spacy

nlp = spacy.load("en_core_web_sm")

texts = ["Hello! My name is Marcin.", "I have a SFTP server running in my HomeLab"]

# Only ner is disabled here; tok2vec stays enabled.
for doc in nlp.pipe(texts, batch_size=100, n_process=1, disable=["ner"]):
    for token in doc:
        print(token.text, token.lemma_)
    print("")

Output:

Hello hello
! !
My my
name name
is be
Marcin Marcin
. .

I I
have have
a a
SFTP sftp
server server
running run
in in
my my
HomeLab HomeLab
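
To make the difference between the two runs easier to see, the two outputs above can be compared token by token. The snippet below is only a small sketch: the token/lemma pairs are copied from the outputs above, and `parse` and `diff_lemmas` are hypothetical helper names, not part of spaCy.

```python
# Token/lemma pairs copied verbatim from the two outputs above.
TOK2VEC_DISABLED = """Hello hello
! !
My my
name name
is is
Marcin marcin
. .
I i
have have
a a
SFTP sftp
server server
running running
in in
my my
HomeLab homelab"""

TOK2VEC_ENABLED = """Hello hello
! !
My my
name name
is be
Marcin Marcin
. .
I I
have have
a a
SFTP sftp
server server
running run
in in
my my
HomeLab HomeLab"""


def parse(block):
    """Turn 'token lemma' lines into a list of (token, lemma) tuples."""
    return [tuple(line.split()) for line in block.splitlines()]


def diff_lemmas(disabled, enabled):
    """Return (token, lemma_without_tok2vec, lemma_with_tok2vec) where the lemmas differ."""
    changed = []
    for (tok_a, lem_a), (tok_b, lem_b) in zip(parse(disabled), parse(enabled)):
        if tok_a != tok_b:
            raise ValueError(f"token mismatch: {tok_a!r} vs {tok_b!r}")
        if lem_a != lem_b:
            changed.append((tok_a, lem_a, lem_b))
    return changed


for token, without, with_ in diff_lemmas(TOK2VEC_DISABLED, TOK2VEC_ENABLED):
    print(f"{token}: {without} (tok2vec disabled) vs {with_} (tok2vec enabled)")
```

Five tokens differ: "is", "Marcin", "I", "running" and "HomeLab". "Hello" and "SFTP" are lowercased in both runs.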

Info about spaCy

  • spaCy version: 3.7.2
  • Platform: Windows-10-10.0.19045-SP0
  • Python version: 3.10.13
  • Pipelines: en_core_web_sm (3.7.1), en_core_web_trf (3.7.3), es_core_news_lg (3.7.0), es_core_news_sm (3.7.0), pl_core_news_lg (3.7.0), pl_core_news_sm (3.7.0)
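
A possible way to narrow this down is to print the part-of-speech tags next to the lemmas for both configurations. This rests on an assumption that is not confirmed in this issue: that the lemmatizer in en_core_web_sm uses part-of-speech information, and that the tagger producing it depends on the tok2vec output. If that holds, the tags should differ between the two runs wherever the lemmas differ. The helper names `lemma_rows` and `compare` are hypothetical; only `nlp.pipe(..., disable=...)` and the token attributes `text`, `pos_`, `tag_` and `lemma_` come from spaCy.

```python
def lemma_rows(nlp, texts, disable):
    """Return (text, pos, tag, lemma) for every token, with the given components disabled."""
    rows = []
    for doc in nlp.pipe(texts, disable=disable):
        for token in doc:
            rows.append((token.text, token.pos_, token.tag_, token.lemma_))
    return rows


def compare(nlp, texts):
    """Pair the rows produced without and with tok2vec, keeping only rows that differ."""
    without = lemma_rows(nlp, texts, disable=["ner", "tok2vec"])
    with_ = lemma_rows(nlp, texts, disable=["ner"])
    return [(a, b) for a, b in zip(without, with_) if a != b]


# Usage (requires spaCy and the en_core_web_sm pipeline to be installed):
#   import spacy
#   nlp = spacy.load("en_core_web_sm")
#   for without, with_ in compare(nlp, ["Hello! My name is Marcin."]):
#       print(without, with_)
```

The pipeline object is passed in as a parameter so the same check can be repeated with the other installed pipelines listed above.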