Upcoming ICU 72 may break tests #337

Open

MichaelChirico opened this issue Nov 29, 2022 · 1 comment

Comments

@MichaelChirico
Contributor

Heads up that we observed some tests breaking on our (old, 0.5.1) version of text2vec after updating to ICU 72.

I haven't had time to dig into it extensively (and especially whether the HEAD version of text2vec is still affected), but the root cause will be similar to what we reported to tokenizers:

ropensci/tokenizers#82

I am reporting despite lacking a pure reproducible example because I saw this test still hard-coding the exact number of tokenized words (even though it's now marked dont-test):

expect_equal(length(vocab$term), 19297)

So I suspect the dev version is still susceptible. Note that because stringi bundles versioned copies of ICU (e.g. icu69), it's also unlikely this will affect the CRAN version of text2vec until stringi updates its bundled version, but it could still affect users like us who link to a newer ICU version than the one bundled with the CRAN sources.
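
As an aside, it's easy to confirm which ICU build stringi is actually using on a given machine; a minimal sketch (the stri_info() fields named below exist in recent stringi releases, but check your version):

# Shows the ICU and Unicode versions stringi is linked against, and whether
# it is using the system ICU rather than its bundled copy.
stringi::stri_info()[c("ICU.version", "Unicode.version", "ICU.system")]

On a system-ICU build this reports the system's ICU 72.x rather than the bundled icu69, which is the situation that triggers the failures below.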

Here's the output of testthat on 0.5.1 for reference:

══ Failed tests ════════════════════════════════════════════════════════════════
── Failure ('test-hash-corpus.R:19'): Unigram Hash Corpus construction ─────────
length(m@x) (`actual`) not equal to 140615L (`expected`).

  `actual`: 140614
`expected`: 140615
── Failure ('test-hash-corpus.R:37'): trigram hash-corpus construction ─────────
length(m@x) (`actual`) not equal to 591549L (`expected`).

  `actual`: 591559
`expected`: 591549
── Failure ('test-iterators.R:25'): ifiles ─────────────────────────────────────
nrow(v) (`actual`) not equal to 7448 (`expected`).

  `actual`: 7446
`expected`: 7448
── Failure ('test-iterators.R:38'): ifiles_parallel ────────────────────────────
nrow(v) (`actual`) not equal to 7448 (`expected`).

  `actual`: 7446
`expected`: 7448
── Failure ('test-vocab-corpus.R:16'): Vocabulary pruning ──────────────────────
length(vocab$term) (`actual`) not equal to 19297 (`expected`).

  `actual`: 19289
`expected`: 19297
── Failure ('test-vocab-corpus.R:19'): Vocabulary pruning ──────────────────────
max(vocab$term_count) (`actual`) not equal to 13224 (`expected`).

  `actual`: 13227
`expected`: 13224
── Failure ('test-vocab-corpus.R:86'): Unigran Vocabulary Corpus construction ──
length(m@x) (`actual`) not equal to 141714L (`expected`).

  `actual`: 141713
`expected`: 141714
── Failure ('test-vocab-corpus.R:104'): bi-gram Vocabulary Corpus construction ──
sum(grepl("_", vocab$term, fixed = T)) (`actual`) not equal to 121333L (`expected`).

  `actual`: 121331
`expected`: 121333
── Failure ('test-vocab-corpus.R:105'): bi-gram Vocabulary Corpus construction ──
length(vocab$term) (`actual`) not equal to 121333L (`expected`).

  `actual`: 121331
`expected`: 121333
── Failure ('test-vocab-corpus.R:113'): bi-gram Vocabulary Corpus construction ──
length(m@x) (`actual`) not equal to 220104L (`expected`).

  `actual`: 220109
`expected`: 220104
── Failure ('test-vocab-corpus.R:121'): Unigram + Bigram Vocabulary Corpus construction ──
length(vocab$term) (`actual`) not equal to 140630L (`expected`).

  `actual`: 140620
`expected`: 140630
── Failure ('test-vocab-corpus.R:128'): Unigram + Bigram Vocabulary Corpus construction ──
length(m@x) (`actual`) not equal to 361818L (`expected`).

  `actual`: 361822
`expected`: 361818

[ FAIL 12 | WARN 20 | SKIP 1 | PASS 132 ]

With such high numbers of vocab elements it's a bit tougher to confirm specifically what changed, but I do see things like e-mail addresses and words with colons, which were mentioned in the tokenizers issue as causing slight changes in stringi::stri_split_boundaries behavior, so I am reasonably confident that's the full extent of what's going on.
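
For context, the kind of input involved can be segmented like so (these strings are made-up illustrations, not drawn from the movie_review data):

library(stringi)
# Word-boundary segmentation of strings containing e-mail addresses or colons;
# ICU 72's word-break rules handle these slightly differently from earlier
# releases, which is enough to shift the token counts by a handful.
stri_split_boundaries("write to user@example.com re: the results",
                      type = "word", skip_word_none = TRUE)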

We could simply update the hard-coded token counts to match the new output, but that would technically be a breaking change, and it may be hard to anticipate how it would affect downstream model outputs, so I'll leave it to the maintainers to decide the right path forward.
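
If updating the counts is the route taken, one possibility (just a sketch, not something we've run against the dev suite) is to branch the expectation on the ICU major version so the tests pass on both old and new builds; 19289 is the value we observed under ICU 72:

icu_major <- as.integer(sub("\\..*", "", stringi::stri_info()$ICU.version))
if (icu_major >= 72) {
  expect_equal(length(vocab$term), 19289)  # count observed under ICU 72
} else {
  expect_equal(length(vocab$term), 19297)  # historical count under older ICU
}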

@dselivanov
Owner

dselivanov commented Nov 30, 2022

Hey! Thanks for this ticket. We already had failing tests in 0.6.1 and I wasn't able to find the root cause; apparently it was the ICU version change.
