Wrong logit tensor dimensions when training CLIP #31825

Open · 4 tasks done

npyoung opened this issue Jul 7, 2024 · 1 comment
Comments

npyoung (Contributor) commented Jul 7, 2024

System Info

  • transformers version: 4.42.3
  • Platform: Linux-6.5.0-26-generic-x86_64-with-glibc2.17
  • Python version: 3.8.19
  • Huggingface_hub version: 0.23.4
  • Safetensors version: 0.4.3
  • Accelerate version: 0.32.0
  • Accelerate config: not found
  • PyTorch version (GPU?): 2.3.1 (True)
  • Tensorflow version (GPU?): not installed (NA)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using distributed or parallel set-up in script?: no
  • Using GPU in script?: yes
  • GPU type: NVIDIA GeForce RTX 4090

Who can help?

@sgugger @muellerzr

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

I'm following the example script to fine-tune CLIP on some custom data for zero-shot image classification. I am starting from the popular openai/clip-vit-base-patch32 weights. I can get the script to run on my dataset and I can see my training loss decreasing, but...

During eval I want to add a custom accuracy metric, so I defined a compute_metrics(eval_preds) function that I pass to the Trainer. For this model, the eval_preds passed to my metrics function should be a tuple whose first element is logits_per_image: an (image_batch_size, text_batch_size) tensor, as described in the docs.
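For reference, here is roughly what my setup looks like (the model, training-args, dataset, and collator names are placeholders standing in for my modified copy of the example script):

```python
import numpy as np
from transformers import Trainer

def compute_metrics(eval_preds):
    # Per the docs, predictions[0] should be logits_per_image with shape
    # (image_batch_size, text_batch_size), and predictions[1] should be
    # its transpose, logits_per_text.
    logits_per_image, logits_per_text = eval_preds.predictions[:2]
    print(logits_per_image.shape, logits_per_text.shape)
    print(eval_preds.label_ids)  # observed: an empty list
    return {}

# model, training_args, datasets, and collate_fn set up as in run_clip.py
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    data_collator=collate_fn,
    compute_metrics=compute_metrics,
)
```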

Expected behavior

Instead, logits_per_image appears to have shape (num_validation_examples, batch_size).

Similarly, the second element of eval_preds should be logits_per_text, with the transposed dimensions, but it is also (num_validation_examples, batch_size). So I really can't tell what's going on here; it's certainly not the behavior the docs describe.

Furthermore, label_ids is an empty list. I'd expect something like arange(num_classes) or the label indices for the current batch.

In short, I'd like to compute top-1 and top-5 accuracy, but I can't tell what the extra rows in the logit tensors are for.
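For concreteness, this is the kind of metric I want to compute, assuming a square (batch_size, batch_size) logit matrix where the matching text for image i sits at index i (the usual CLIP contrastive setup):

```python
import numpy as np

def topk_accuracy(logits_per_image: np.ndarray, k: int = 5) -> float:
    # In the standard contrastive setup the label for image i is i,
    # i.e. the correct logits lie on the diagonal of the (batch, batch) matrix.
    n = logits_per_image.shape[0]
    labels = np.arange(n)
    # Indices of the k highest text logits for each image.
    topk = np.argsort(logits_per_image, axis=1)[:, -k:]
    return float(np.mean([labels[i] in topk[i] for i in range(n)]))
```

That only makes sense if the logits are square, which is why the extra rows are confusing.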

amyeroberts (Collaborator) commented
Hi @npyoung, thanks for opening an issue! Could you share a minimal code reproducer?

I find it surprising that logits_per_image doesn't have the transposed dimensions of logits_per_text in the model output, as this is how they are defined in the model and there are no further transformations applied to the object.
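For reference, this is (simplified) how the two tensors are produced in modeling_clip.py, so one should always be the exact transpose of the other:

```python
# Simplified from CLIPModel.forward in
# transformers/models/clip/modeling_clip.py
logit_scale = self.logit_scale.exp()
logits_per_text = torch.matmul(text_embeds, image_embeds.t()) * logit_scale
logits_per_image = logits_per_text.t()
```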
