
How to use TokenizerBuilder? #1549

Open
polarathene opened this issue Jun 7, 2024 · 4 comments

@polarathene

I expected TokenizerBuilder to produce a Tokenizer from its build() result, but build() actually returns a TokenizerImpl, which Tokenizer wraps.

No problem, I see that there is an impl From&lt;TokenizerImpl&gt; for Tokenizer, but it's attempting to do quite a bit more for some reason. Meanwhile, I cannot use Tokenizer(unwrapped_build_result_here) since the struct's field is private 🤔 (and the Tokenizer::new() method won't take it in either):


```rust
let mut tokenizer = Tokenizer::from(TokenizerBuilder::new()
    .with_model(unigram)
    .with_decoder(Some(decoder))
    .with_normalizer(Some(normalizer))
    .build()
    .map_err(anyhow::Error::msg)?
);
```
```
error[E0283]: type annotations needed
   --> mistralrs-core/src/pipeline/gguf_tokenizer.rs:139:41
    |
139 |     let mut tokenizer = Tokenizer::from(TokenizerBuilder::new()
    |                                         ^^^^^^^^^^^^^^^^^^^^^ cannot infer type of the type parameter `PT` declared on the struct `TokenizerBuilder`
    |
    = note: cannot satisfy `_: tokenizers::PreTokenizer`
    = help: the following types implement trait `tokenizers::PreTokenizer`:
              tokenizers::pre_tokenizers::bert::BertPreTokenizer
              tokenizers::decoders::byte_level::ByteLevel
              tokenizers::pre_tokenizers::delimiter::CharDelimiterSplit
              tokenizers::pre_tokenizers::digits::Digits
              tokenizers::decoders::metaspace::Metaspace
              tokenizers::pre_tokenizers::punctuation::Punctuation
              tokenizers::pre_tokenizers::sequence::Sequence
              tokenizers::pre_tokenizers::split::Split
            and 4 others
note: required by a bound in `tokenizers::TokenizerBuilder::<M, N, PT, PP, D>::new`
   --> /root/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokenizers-0.19.1/src/tokenizer/mod.rs:314:9
    |
314 |     PT: PreTokenizer,
    |         ^^^^^^^^^^^^ required by this bound in `TokenizerBuilder::<M, N, PT, PP, D>::new`
...
319 |     pub fn new() -> Self {
    |            --- required by a bound in this associated function
help: consider specifying the generic arguments
    |
139 |     let mut tokenizer = Tokenizer::from(TokenizerBuilder::<tokenizers::models::unigram::Unigram, tokenizers::NormalizerWrapper, PT, PP, tokenizers::DecoderWrapper>::new()
    |                                                         +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
```

Why is this an issue? Isn't the point of the builder that you don't have to specify the optional types you haven't explicitly set?

cannot infer type of the type parameter `PT` declared on the struct `TokenizerBuilder`

I had a glance over the source on GitHub but didn't see an example or test for using this API, and the docs don't really cover it either.
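For what it's worth, here is my own minimal sketch of why the inference fails (not code from the tokenizers crate; all names are illustrative): `new()` is generic over every parameter, so `PT` must be resolved even though nothing ever constrains it. A crate-side fix in the same spirit as `Vec::new()` (which pins the allocator parameter to `Global`) would be to pin the unused parameters to concrete placeholder types in the constructor:

```rust
// Stand-in builder with one "extra" generic parameter, `P`, mirroring
// how `TokenizerBuilder`'s `PT` behaves.
struct Builder<M, P> {
    model: Option<M>,
    pre: Option<P>,
}

// If `new()` were defined under `impl<M, P> Builder<M, P>`, calling
// `Builder::new()` would fail with "cannot infer type of the type
// parameter `P`", exactly like the issue. Pinning `P` to a concrete
// placeholder type in the constructor restores inference:
impl<M> Builder<M, ()> {
    fn new() -> Self {
        Builder { model: None, pre: None }
    }
}

impl<M, P> Builder<M, P> {
    fn with_model(mut self, m: M) -> Self {
        self.model = Some(m);
        self
    }

    fn build(self) -> Option<M> {
        self.model
    }
}

fn main() {
    // No turbofish needed: `P` is fixed to `()` and `M` is inferred.
    let built = Builder::new().with_model("unigram").build();
    assert_eq!(built, Some("unigram"));
}
```

Until something along those lines exists, the caller has to spell out all five wrapper types with a turbofish, as the compiler's help message suggests.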


Meanwhile with Tokenizer instead of TokenizerBuilder this works:

```rust
let mut tokenizer = Tokenizer::new(tokenizers::ModelWrapper::Unigram(unigram));
tokenizer.with_decoder(decoder);
tokenizer.with_normalizer(normalizer);
```
@polarathene
Author

In the meantime, I've implemented this alternative workaround (using buildstructor):

```rust
struct TokenizerX;
#[buildstructor::buildstructor]
impl TokenizerX {
    #[builder]
    fn try_new<'a>(
        with_model: ModelWrapper,
        with_decoder: Option<Decoder<'a>>,
        with_normalizer: Option<Normalizer<'a>>,
    ) -> Result<Tokenizer> {
        let mut tokenizer = Tokenizer::new(with_model);

        // Handle local enum to remote enum type:
        if let Some(decoder) = with_decoder {
            let d = DecoderWrapper::try_from(decoder)?;
            tokenizer.with_decoder(d);
        }
        if let Some(normalizer) = with_normalizer {
            let n = NormalizerWrapper::try_from(normalizer)?;
            tokenizer.with_normalizer(n);
        }

        Ok(tokenizer)
    }
}
```

Usage:

```rust
let mut tokenizer: Tokenizer = TokenizerX::try_builder()
    .with_model(model)
    .with_decoder(decoder)
    .with_normalizer(normalizer)
    .build()?;
```

The local-to-remote enum logic above is for the related DecoderWrapper + NormalizerWrapper enums, which were also a bit noisy to use / grok, so I have a similar workaround for those:

```rust
let decoder = Decoder::Sequence(vec![
    Decoder::Replace("_", " "),
    Decoder::ByteFallback,
    Decoder::Fuse,
    Decoder::Strip(' ', 1, 0),
]);

let normalizer = Normalizer::Sequence(vec![
    Normalizer::Prepend("▁"),
    Normalizer::Replace(" ", "▁"),
]);
```

More details at mistral.rs.
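To spell out the local-to-remote pattern in a self-contained way (the enums below are illustrative stand-ins, not the actual mistral.rs or tokenizers types): a terse local enum for the call site, plus a `TryFrom` impl that expands it into the crate's more verbose wrapper enum:

```rust
// Illustrative "remote" enum standing in for a crate type like DecoderWrapper.
#[derive(Debug, PartialEq)]
enum RemoteDecoder {
    Replace { pattern: String, content: String },
    ByteFallback,
}

// Terse "local" enum that the call site actually writes.
enum LocalDecoder<'a> {
    Replace(&'a str, &'a str),
    ByteFallback,
}

impl<'a> TryFrom<LocalDecoder<'a>> for RemoteDecoder {
    type Error = String;

    // Expand the terse local variants into the remote representation.
    fn try_from(d: LocalDecoder<'a>) -> Result<Self, Self::Error> {
        Ok(match d {
            LocalDecoder::Replace(pattern, content) => RemoteDecoder::Replace {
                pattern: pattern.to_string(),
                content: content.to_string(),
            },
            LocalDecoder::ByteFallback => RemoteDecoder::ByteFallback,
        })
    }
}

fn main() {
    let d = RemoteDecoder::try_from(LocalDecoder::Replace("_", " ")).unwrap();
    assert_eq!(
        d,
        RemoteDecoder::Replace { pattern: "_".into(), content: " ".into() }
    );
}
```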

@ArthurZucker
Collaborator

The builder is, I believe, mostly used for training.

@polarathene
Author

@ArthurZucker perhaps you could better document that? By naming convention and the current doc comment, it implies it is the builder pattern for the Tokenizer struct:

> Builder for Tokenizer structs.

It provides an API that matches what you'd expect of a builder API, and its build() method returns a type that is used to construct a Tokenizer struct (which also has a From impl for this type):

```rust
impl<M, N, PT, PP, D> From<TokenizerImpl<M, N, PT, PP, D>> for Tokenizer
where
    M: Into<ModelWrapper>,
    N: Into<NormalizerWrapper>,
    PT: Into<PreTokenizerWrapper>,
    PP: Into<PostProcessorWrapper>,
    D: Into<DecoderWrapper>,
{
    fn from(t: TokenizerImpl<M, N, PT, PP, D>) -> Self {
        Self(TokenizerImpl {
            model: t.model.into(),
            normalizer: t.normalizer.map(Into::into),
            pre_tokenizer: t.pre_tokenizer.map(Into::into),
            post_processor: t.post_processor.map(Into::into),
            decoder: t.decoder.map(Into::into),
            added_vocabulary: t.added_vocabulary,
            padding: t.padding,
            truncation: t.truncation,
        })
    }
}

impl Tokenizer {
    /// Construct a new Tokenizer based on the model.
    pub fn new(model: impl Into<ModelWrapper>) -> Self {
        Self(TokenizerImpl::new(model.into()))
    }

    /// Unwrap the TokenizerImpl.
    pub fn into_inner(
        self,
    ) -> TokenizerImpl<
        ModelWrapper,
        NormalizerWrapper,
        PreTokenizerWrapper,
        PostProcessorWrapper,
        DecoderWrapper,
    > {
        self.0
    }
}

#[derive(Serialize, Deserialize, Debug, Clone)]
pub struct Tokenizer(
    TokenizerImpl<
        ModelWrapper,
        NormalizerWrapper,
        PreTokenizerWrapper,
        PostProcessorWrapper,
        DecoderWrapper,
    >,
);
```


As the issue reports though, that doesn't work very well; the builder API is awkward to use. You could probably adapt it to use buildstructor, similar to my TokenizerX workaround type shown above (which also applies a comparable workaround for the Decoder / Normalizer inputs to provide a better DX, though that part is not required).

Presently, due to the issue reported here, the builder offers little value over creating the tokenizer without a fluent builder API.


This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.

@github-actions github-actions bot added the Stale label Jul 14, 2024