
How to use TokenizerBuilder? #1549

Open
polarathene opened this issue Jun 7, 2024 · 4 comments

@polarathene

I expected TokenizerBuilder to produce a Tokenizer from its build() result, but build() actually returns a TokenizerImpl, which Tokenizer wraps.

No problem, I see that there is an impl From&lt;TokenizerImpl&gt; for Tokenizer, but it's attempting to do quite a bit more for some reason. Meanwhile, I cannot use Tokenizer(unwrapped_build_result_here) since the struct's field is private 🤔 (and the Tokenizer::new() method won't take it in either):


```rust
let mut tokenizer = Tokenizer::from(TokenizerBuilder::new()
    .with_model(unigram)
    .with_decoder(Some(decoder))
    .with_normalizer(Some(normalizer))
    .build()
    .map_err(anyhow::Error::msg)?
);
```
```
error[E0283]: type annotations needed
   --> mistralrs-core/src/pipeline/gguf_tokenizer.rs:139:41
    |
139 |     let mut tokenizer = Tokenizer::from(TokenizerBuilder::new()
    |                                         ^^^^^^^^^^^^^^^^^^^^^ cannot infer type of the type parameter `PT` declared on the struct `TokenizerBuilder`
    |
    = note: cannot satisfy `_: tokenizers::PreTokenizer`
    = help: the following types implement trait `tokenizers::PreTokenizer`:
              tokenizers::pre_tokenizers::bert::BertPreTokenizer
              tokenizers::decoders::byte_level::ByteLevel
              tokenizers::pre_tokenizers::delimiter::CharDelimiterSplit
              tokenizers::pre_tokenizers::digits::Digits
              tokenizers::decoders::metaspace::Metaspace
              tokenizers::pre_tokenizers::punctuation::Punctuation
              tokenizers::pre_tokenizers::sequence::Sequence
              tokenizers::pre_tokenizers::split::Split
            and 4 others
note: required by a bound in `tokenizers::TokenizerBuilder::<M, N, PT, PP, D>::new`
   --> /root/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokenizers-0.19.1/src/tokenizer/mod.rs:314:9
    |
314 |     PT: PreTokenizer,
    |         ^^^^^^^^^^^^ required by this bound in `TokenizerBuilder::<M, N, PT, PP, D>::new`
...
319 |     pub fn new() -> Self {
    |            --- required by a bound in this associated function
help: consider specifying the generic arguments
    |
139 |     let mut tokenizer = Tokenizer::from(TokenizerBuilder::<tokenizers::models::unigram::Unigram, tokenizers::NormalizerWrapper, PT, PP, tokenizers::DecoderWrapper>::new()
    |                                                         +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
```

Why is this an issue? Isn't the point of the builder that you don't have to specify the optional types you haven't explicitly set?

cannot infer type of the type parameter `PT` declared on the struct `TokenizerBuilder`

I had a glance over the source on GitHub but didn't see an example or test for using this API, and the docs don't really cover it either.
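For what it's worth, here is my own minimal sketch of why the inference fails (not code from the tokenizers crate; all names are illustrative): `new()` is generic over every parameter, so `PT` must be resolved even though nothing ever constrains it. A crate-side fix in the same spirit as `Vec::new()` (which pins the allocator parameter to `Global`) would be to pin the unused parameters to concrete placeholder types in the constructor:

```rust
// Stand-in builder with one "extra" generic parameter, `P`, mirroring
// how `TokenizerBuilder`'s `PT` behaves.
struct Builder<M, P> {
    model: Option<M>,
    pre: Option<P>,
}

// If `new()` were defined under `impl<M, P> Builder<M, P>`, calling
// `Builder::new()` would fail with "cannot infer type of the type
// parameter `P`", exactly like the issue. Pinning `P` to a concrete
// placeholder type in the constructor restores inference:
impl<M> Builder<M, ()> {
    fn new() -> Self {
        Builder { model: None, pre: None }
    }
}

impl<M, P> Builder<M, P> {
    fn with_model(mut self, m: M) -> Self {
        self.model = Some(m);
        self
    }

    fn build(self) -> Option<M> {
        self.model
    }
}

fn main() {
    // No turbofish needed: `P` is fixed to `()` and `M` is inferred.
    let built = Builder::new().with_model("unigram").build();
    assert_eq!(built, Some("unigram"));
}
```

Until something along those lines exists, the caller has to spell out all five wrapper types with a turbofish, as the compiler's help message suggests.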


Meanwhile with Tokenizer instead of TokenizerBuilder this works:

```rust
let mut tokenizer = Tokenizer::new(tokenizers::ModelWrapper::Unigram(unigram));
tokenizer.with_decoder(decoder);
tokenizer.with_normalizer(normalizer);
```
@polarathene
Author

In the meantime, I've implemented this alternative workaround (using buildstructor):

```rust
struct TokenizerX;
#[buildstructor::buildstructor]
impl TokenizerX {
    #[builder]
    fn try_new<'a>(
        with_model: ModelWrapper,
        with_decoder: Option<Decoder<'a>>,
        with_normalizer: Option<Normalizer<'a>>,
    ) -> Result<Tokenizer> {
        let mut tokenizer = Tokenizer::new(with_model);

        // Handle local enum to remote enum type:
        if let Some(decoder) = with_decoder {
            let d = DecoderWrapper::try_from(decoder)?;
            tokenizer.with_decoder(d);
        }
        if let Some(normalizer) = with_normalizer {
            let n = NormalizerWrapper::try_from(normalizer)?;
            tokenizer.with_normalizer(n);
        }

        Ok(tokenizer)
    }
}
```

Usage:

```rust
let mut tokenizer: Tokenizer = TokenizerX::try_builder()
    .with_model(model)
    .with_decoder(decoder)
    .with_normalizer(normalizer)
    .build()?;
```

The local-to-remote enum logic above is for the related DecoderWrapper + NormalizerWrapper enums, which were also a bit noisy to use / grok, so I have a similar workaround for those:

```rust
let decoder = Decoder::Sequence(vec![
    Decoder::Replace("_", " "),
    Decoder::ByteFallback,
    Decoder::Fuse,
    Decoder::Strip(' ', 1, 0),
]);

let normalizer = Normalizer::Sequence(vec![
    Normalizer::Prepend("▁"),
    Normalizer::Replace(" ", "▁"),
]);
```

More details at mistral.rs.
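To spell out the local-to-remote pattern in a self-contained way (the enums below are illustrative stand-ins, not the actual mistral.rs or tokenizers types): a terse local enum for the call site, plus a `TryFrom` impl that expands it into the crate's more verbose wrapper enum:

```rust
// Illustrative "remote" enum standing in for a crate type like DecoderWrapper.
#[derive(Debug, PartialEq)]
enum RemoteDecoder {
    Replace { pattern: String, content: String },
    ByteFallback,
}

// Terse "local" enum that the call site actually writes.
enum LocalDecoder<'a> {
    Replace(&'a str, &'a str),
    ByteFallback,
}

impl<'a> TryFrom<LocalDecoder<'a>> for RemoteDecoder {
    type Error = String;

    // Expand the terse local variants into the remote representation.
    fn try_from(d: LocalDecoder<'a>) -> Result<Self, Self::Error> {
        Ok(match d {
            LocalDecoder::Replace(pattern, content) => RemoteDecoder::Replace {
                pattern: pattern.to_string(),
                content: content.to_string(),
            },
            LocalDecoder::ByteFallback => RemoteDecoder::ByteFallback,
        })
    }
}

fn main() {
    let d = RemoteDecoder::try_from(LocalDecoder::Replace("_", " ")).unwrap();
    assert_eq!(
        d,
        RemoteDecoder::Replace { pattern: "_".into(), content: " ".into() }
    );
}
```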

@ArthurZucker
Collaborator

The builder is, I believe, mostly used for training.

@polarathene
Author

@ArthurZucker perhaps you could better document that? By naming convention and the current doc comment, it implies it is the builder pattern for the Tokenizer struct:

> Builder for Tokenizer structs.

It provides an API that matches what you'd expect of a builder API, and its build() method returns a type that is used to construct a Tokenizer struct (which also has a From impl for this type):

```rust
impl<M, N, PT, PP, D> From<TokenizerImpl<M, N, PT, PP, D>> for Tokenizer
where
    M: Into<ModelWrapper>,
    N: Into<NormalizerWrapper>,
    PT: Into<PreTokenizerWrapper>,
    PP: Into<PostProcessorWrapper>,
    D: Into<DecoderWrapper>,
{
    fn from(t: TokenizerImpl<M, N, PT, PP, D>) -> Self {
        Self(TokenizerImpl {
            model: t.model.into(),
            normalizer: t.normalizer.map(Into::into),
            pre_tokenizer: t.pre_tokenizer.map(Into::into),
            post_processor: t.post_processor.map(Into::into),
            decoder: t.decoder.map(Into::into),
            added_vocabulary: t.added_vocabulary,
            padding: t.padding,
            truncation: t.truncation,
        })
    }
}

impl Tokenizer {
    /// Construct a new Tokenizer based on the model.
    pub fn new(model: impl Into<ModelWrapper>) -> Self {
        Self(TokenizerImpl::new(model.into()))
    }

    /// Unwrap the TokenizerImpl.
    pub fn into_inner(
        self,
    ) -> TokenizerImpl<
        ModelWrapper,
        NormalizerWrapper,
        PreTokenizerWrapper,
        PostProcessorWrapper,
        DecoderWrapper,
    > {
        self.0
    }
}

#[derive(Serialize, Deserialize, Debug, Clone)]
pub struct Tokenizer(
    TokenizerImpl<
        ModelWrapper,
        NormalizerWrapper,
        PreTokenizerWrapper,
        PostProcessorWrapper,
        DecoderWrapper,
    >,
);
```


As the issue reports though, that doesn't work very well; the builder API is awkward to use. You could probably adapt it to use buildstructor, similar to my TokenizerX workaround type shown above (which also applies a comparable workaround for the Decoder / Normalizer inputs to provide a better DX, though that part is not required).

Presently, due to the issue reported here, the builder offers little value over creating the tokenizer without a fluent builder API.


This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.

@github-actions github-actions bot added the Stale label Jul 14, 2024