Upcoming ICU 72 may break tests #337

Open

MichaelChirico opened this issue Nov 29, 2022 · 1 comment

Comments

@MichaelChirico
Contributor

Heads up that we observed some tests breaking on our (old, 0.5.1) version of text2vec after updating to ICU 72.

I haven't had time to dig into it extensively (and especially whether the HEAD version of text2vec is still affected), but the root cause will be similar to what we reported to tokenizers:

ropensci/tokenizers#82

I am reporting despite lacking a pure reproducible example because I saw this test still hard-coding the exact number of tokenized words (even though it's now marked dont-test):

expect_equal(length(vocab$term), 19297)

So I suspect the dev version is still susceptible. Note that because stringi bundles versioned copies of ICU (e.g. icu69), it's also unlikely this will affect the CRAN version of text2vec until stringi updates its bundled version, but it could still affect users like us who link to a newer ICU version than the one bundled with the CRAN sources.
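
As an aside, it's easy to confirm which ICU build stringi is actually using on a given machine; a minimal sketch (the stri_info() fields named below exist in recent stringi releases, but check your version):

# Shows the ICU and Unicode versions stringi is linked against, and whether
# it is using the system ICU rather than its bundled copy.
stringi::stri_info()[c("ICU.version", "Unicode.version", "ICU.system")]

On a system-ICU build this reports the system's ICU 72.x rather than the bundled icu69, which is the situation that triggers the failures below.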

Here's the output of testthat on 0.5.1 for reference:

══ Failed tests ════════════════════════════════════════════════════════════════
── Failure ('test-hash-corpus.R:19'): Unigram Hash Corpus construction ─────────
length(m@x) (`actual`) not equal to 140615L (`expected`).

  `actual`: 140614
`expected`: 140615
── Failure ('test-hash-corpus.R:37'): trigram hash-corpus construction ─────────
length(m@x) (`actual`) not equal to 591549L (`expected`).

  `actual`: 591559
`expected`: 591549
── Failure ('test-iterators.R:25'): ifiles ─────────────────────────────────────
nrow(v) (`actual`) not equal to 7448 (`expected`).

  `actual`: 7446
`expected`: 7448
── Failure ('test-iterators.R:38'): ifiles_parallel ────────────────────────────
nrow(v) (`actual`) not equal to 7448 (`expected`).

  `actual`: 7446
`expected`: 7448
── Failure ('test-vocab-corpus.R:16'): Vocabulary pruning ──────────────────────
length(vocab$term) (`actual`) not equal to 19297 (`expected`).

  `actual`: 19289
`expected`: 19297
── Failure ('test-vocab-corpus.R:19'): Vocabulary pruning ──────────────────────
max(vocab$term_count) (`actual`) not equal to 13224 (`expected`).

  `actual`: 13227
`expected`: 13224
── Failure ('test-vocab-corpus.R:86'): Unigran Vocabulary Corpus construction ──
length(m@x) (`actual`) not equal to 141714L (`expected`).

  `actual`: 141713
`expected`: 141714
── Failure ('test-vocab-corpus.R:104'): bi-gram Vocabulary Corpus construction ──
sum(grepl("_", vocab$term, fixed = T)) (`actual`) not equal to 121333L (`expected`).

  `actual`: 121331
`expected`: 121333
── Failure ('test-vocab-corpus.R:105'): bi-gram Vocabulary Corpus construction ──
length(vocab$term) (`actual`) not equal to 121333L (`expected`).

  `actual`: 121331
`expected`: 121333
── Failure ('test-vocab-corpus.R:113'): bi-gram Vocabulary Corpus construction ──
length(m@x) (`actual`) not equal to 220104L (`expected`).

  `actual`: 220109
`expected`: 220104
── Failure ('test-vocab-corpus.R:121'): Unigram + Bigram Vocabulary Corpus construction ──
length(vocab$term) (`actual`) not equal to 140630L (`expected`).

  `actual`: 140620
`expected`: 140630
── Failure ('test-vocab-corpus.R:128'): Unigram + Bigram Vocabulary Corpus construction ──
length(m@x) (`actual`) not equal to 361818L (`expected`).

  `actual`: 361822
`expected`: 361818

[ FAIL 12 | WARN 20 | SKIP 1 | PASS 132 ]

With such high numbers of vocab elements it's a bit tougher to confirm specifically what changed, but I do see things like e-mail addresses and words with colons, which were mentioned in the tokenizers issue as causing slight changes in stringi::stri_split_boundaries behavior, so I am reasonably confident that's the full extent of what's going on.
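
For context, the kind of input involved can be segmented like so (these strings are made-up illustrations, not drawn from the movie_review data):

library(stringi)
# Word-boundary segmentation of strings containing e-mail addresses or colons;
# ICU 72's word-break rules handle these slightly differently from earlier
# releases, which is enough to shift the token counts by a handful.
stri_split_boundaries("write to user@example.com re: the results",
                      type = "word", skip_word_none = TRUE)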

We could simply update the hard-coded token counts to match the new output, but that would technically be a breaking change, and it may be hard to anticipate how it would affect downstream model outputs, so I'll leave it to the maintainers to decide the right path forward.
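
If updating the counts is the route taken, one possibility (just a sketch, not something we've run against the dev suite) is to branch the expectation on the ICU major version so the tests pass on both old and new builds; 19289 is the value we observed under ICU 72:

icu_major <- as.integer(sub("\\..*", "", stringi::stri_info()$ICU.version))
if (icu_major >= 72) {
  expect_equal(length(vocab$term), 19289)  # count observed under ICU 72
} else {
  expect_equal(length(vocab$term), 19297)  # historical count under older ICU
}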

@dselivanov
Owner

dselivanov commented Nov 30, 2022

Hey! Thanks for this ticket. We already had failing tests in 0.6.1 and I wasn't able to find the root cause; apparently it was the ICU version change.
