Wrong logit tensor dimensions when training CLIP #31825

Open · 4 tasks done

npyoung opened this issue Jul 7, 2024 · 1 comment
Comments

npyoung (Contributor) commented Jul 7, 2024

System Info

  • transformers version: 4.42.3
  • Platform: Linux-6.5.0-26-generic-x86_64-with-glibc2.17
  • Python version: 3.8.19
  • Huggingface_hub version: 0.23.4
  • Safetensors version: 0.4.3
  • Accelerate version: 0.32.0
  • Accelerate config: not found
  • PyTorch version (GPU?): 2.3.1 (True)
  • Tensorflow version (GPU?): not installed (NA)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using distributed or parallel set-up in script?: no
  • Using GPU in script?: yes
  • GPU type: NVIDIA GeForce RTX 4090

Who can help?

@sgugger @muellerzr

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

I'm following the example script to fine-tune CLIP on some custom data for zero-shot image classification. I am starting from the popular openai/clip-vit-base-patch32 weights. I can get the script to run on my dataset and I can see my training loss decreasing, but...

During eval I want to add a custom accuracy metric, so I defined a compute_metrics(eval_preds) function that I pass to the Trainer. For this model, the eval_preds passed to my metrics function should be a tuple whose first element is logits_per_image: an (image_batch_size, text_batch_size) tensor, as described in the docs.
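For reference, here is roughly what my setup looks like (the model, training-args, dataset, and collator names are placeholders standing in for my modified copy of the example script):

```python
import numpy as np
from transformers import Trainer

def compute_metrics(eval_preds):
    # Per the docs, predictions[0] should be logits_per_image with shape
    # (image_batch_size, text_batch_size), and predictions[1] should be
    # its transpose, logits_per_text.
    logits_per_image, logits_per_text = eval_preds.predictions[:2]
    print(logits_per_image.shape, logits_per_text.shape)
    print(eval_preds.label_ids)  # observed: an empty list
    return {}

# model, training_args, datasets, and collate_fn set up as in run_clip.py
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    data_collator=collate_fn,
    compute_metrics=compute_metrics,
)
```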

Expected behavior

Instead, logits_per_image appears to have shape (num_validation_examples, batch_size).

Similarly, the second element of eval_preds should be logits_per_text, with the transposed dimensions, but it is also (num_validation_examples, batch_size). So I really can't tell what's going on here; it's certainly not the behavior the docs describe.

Furthermore, label_ids is an empty list. I'd expect something like arange(num_classes) or the label indices for the current batch.

In short, I'd like to compute top-1 and top-5 accuracy, but I can't tell what the extra rows in the logit tensors are for.
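For concreteness, this is the kind of metric I want to compute, assuming a square (batch_size, batch_size) logit matrix where the matching text for image i sits at index i (the usual CLIP contrastive setup):

```python
import numpy as np

def topk_accuracy(logits_per_image: np.ndarray, k: int = 5) -> float:
    # In the standard contrastive setup the label for image i is i,
    # i.e. the correct logits lie on the diagonal of the (batch, batch) matrix.
    n = logits_per_image.shape[0]
    labels = np.arange(n)
    # Indices of the k highest text logits for each image.
    topk = np.argsort(logits_per_image, axis=1)[:, -k:]
    return float(np.mean([labels[i] in topk[i] for i in range(n)]))
```

That only makes sense if the logits are square, which is why the extra rows are confusing.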

amyeroberts (Collaborator) commented
Hi @npyoung, thanks for opening an issue! Could you share a minimal code reproducer?

I find it surprising that logits_per_image doesn't have the transposed dimensions of logits_per_text in the model output, as this is how they are defined in the model and there are no further transformations applied to the object.
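For reference, this is (simplified) how the two tensors are produced in modeling_clip.py, so one should always be the exact transpose of the other:

```python
# Simplified from CLIPModel.forward in
# transformers/models/clip/modeling_clip.py
logit_scale = self.logit_scale.exp()
logits_per_text = torch.matmul(text_embeds, image_embeds.t()) * logit_scale
logits_per_image = logits_per_text.t()
```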
