Converting gguf fp16 & bf16 to hf is not supported. #31762
I found that the PPL issue is related to Llama3 or llama.cpp. It doesn't happen with TinyLlama. I'll create another issue to discuss if needed.
It's easy to support GGUF FP16. Since BF16 is not supported by NumPy, my current workaround is to convert BF16 to FP32 using PyTorch, but it's not ideal to rely on PyTorch at this step. Reference: main...PenutChen:transformers:main

```python
def load_dequant_gguf_tensor(shape, ggml_type, data):
    if ggml_type == GGML_TYPES["F32"]:
        values = data
    elif ggml_type == GGML_TYPES["F16"]:
        values = data
    elif ggml_type == GGML_TYPES["BF16"]:
        import torch

        # Reinterpret the raw bytes as bfloat16, then upcast to float32.
        data_uint8 = data.view(np.uint8)
        tensor_uint8 = torch.from_numpy(data_uint8)
        values = tensor_uint8.view(torch.bfloat16).float().numpy()
```

Note that BF16 support requires modifying some code in gguf-py. Since the latest version of gguf-py from the llama.cpp repo doesn't work with the current HF integration (#31725), I modified the version from PyPI as follows:

```python
class GGMLQuantizationType(IntEnum):
    F32 = 0
    F16 = 1
    BF16 = 30
    # ...

GGML_QUANT_SIZES = {
    GGMLQuantizationType.F32: (1, 4),
    GGMLQuantizationType.F16: (1, 2),
    GGMLQuantizationType.BF16: (1, 2),
    # ...
}
```
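As a NumPy-only alternative to the PyTorch workaround above (a sketch, not part of the linked branch; the helper name `bf16_to_fp32` is hypothetical): BF16 is exactly the upper 16 bits of an IEEE-754 float32, so the raw payload can be widened with a bit shift and reinterpreted, avoiding the PyTorch dependency at this step.

```python
import numpy as np


def bf16_to_fp32(data: np.ndarray) -> np.ndarray:
    """Convert raw BF16 payload bytes to FP32 without PyTorch.

    BF16 keeps the sign, exponent, and top 7 mantissa bits of float32,
    so placing each 16-bit value in the high half of a uint32 and
    reinterpreting the buffer as float32 recovers the value exactly.
    """
    u16 = data.view(np.uint16)
    u32 = u16.astype(np.uint32) << 16  # pad the low mantissa bits with zeros
    return u32.view(np.float32)
```

This is lossless, since every BF16 value is representable in FP32; only the reverse direction (FP32 to BF16) would round.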
Hey @SunMarc, would you have some bandwidth to take a look at this? :)
Hey @PenutChen, thanks for your research! I think we should just support FP16 first, since supporting BF16 would require a new gguf release, and the transformers gguf integration is not compatible yet. LMK what you think! If you have some time, would you like to open a PR? Otherwise, I will do it!
System Info
Who can help?
@SunMarc
Information
Tasks
examples folder (such as GLUE/SQuAD, ...)
Reproduction
Expected behavior
Besides quantization, only F32 is implemented. FP16 and BF16 are not yet supported.
fp16 error log:
bf16 error log:
I tried to add F16 to `GGML_TYPES`. I'm not sure if this is correct, but after converting to hf, the PPL is over 1000.
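For reference, a minimal sketch of what the F16 path needs (hedged; `f16_to_fp32` is a hypothetical helper, not transformers code): GGUF F16 payloads are plain IEEE-754 half-precision values, so the bytes only need to be reinterpreted with the right dtype, with no dequantization math. A PPL blow-up after conversion could indicate the buffer being read with the wrong dtype or an unnecessary dequant step being applied.

```python
import numpy as np


# Hypothetical helper: GGUF F16 tensors store standard IEEE-754
# half-precision values, so view() reinterprets the buffer in place
# and astype() then upcasts without changing any value.
def f16_to_fp32(data: np.ndarray) -> np.ndarray:
    return data.view(np.float16).astype(np.float32)
```

Every float16 value is exactly representable as float32, so a round-trip through this helper should reproduce the original tensor bit-for-bit in value terms.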