
Trainer.train -> Expected a 'cuda' device type for generator but found 'cpu' #31833

Open

diego-coba opened this issue Jul 7, 2024 · 3 comments

@diego-coba
diego-coba commented Jul 7, 2024

System Info

Transformers 4.41.2
PyTorch 2.3.1+cu121
Python 3.12.3
Ubuntu 24.04

GPU: NVIDIA GeForce GTX 1650

Who can help?

No response

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

%pip install --quiet torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
%pip install --quiet -U datasets
%pip install --quiet torchdata
%pip install --quiet setuptools
%pip install --quiet transformers
%pip install --quiet evaluate
%pip install --quiet rouge_score
%pip install --quiet loralib
%pip install --quiet peft
%pip install --quiet ipywidgets

import torch

if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)} is available and will be used.")
else:
    print("CUDA is not available. CPU will be used.")

dash_line = '-'.join('' for x in range(100))
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
from datasets import load_dataset
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer, GenerationConfig, TrainingArguments, Trainer
from peft import LoraConfig, get_peft_model, TaskType
import torch
import time
import evaluate
import pandas as pd
import numpy as np
with torch.device(device):
    huggingface_dataset_name = "knkarthick/dialogsum"
    dataset = load_dataset(huggingface_dataset_name)
    model_name='google/flan-t5-base'
    original_model = AutoModelForSeq2SeqLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)
    tokenizer = AutoTokenizer.from_pretrained(model_name)

    def tokenize_function(example):
        start_prompt = 'Summarize the following conversation.\n\n'
        end_prompt = '\n\nSummary: '
        prompt = [start_prompt + dialogue + end_prompt for dialogue in example["dialogue"]]
        example['input_ids'] = tokenizer(prompt, padding="max_length", truncation=True, return_tensors="pt").input_ids
        example['labels'] = tokenizer(example["summary"], padding="max_length", truncation=True, return_tensors="pt").input_ids
    
        return example

    tokenized_datasets = dataset.map(tokenize_function, batched=True)
    tokenized_datasets = tokenized_datasets.remove_columns(['id', 'topic', 'dialogue', 'summary',])

    lora_config = LoraConfig(
        r=16, # Rank
        lora_alpha=32,
        target_modules=["q", "v"],
        lora_dropout=0.05,
        bias="none",
        task_type=TaskType.SEQ_2_SEQ_LM # FLAN-T5
    )

    peft_model = get_peft_model(original_model, lora_config)

    output_dir = f'./peft-dialogue-summary-training-{str(int(time.time()))}'

    peft_training_args = TrainingArguments(
        output_dir=output_dir,
        auto_find_batch_size=True,
        learning_rate=1e-3, # Higher learning rate than full fine-tuning.
        num_train_epochs=10,
        logging_steps=1,
        max_steps=1    
    )

    peft_trainer = Trainer(
        model=peft_model,
        args=peft_training_args,
        train_dataset=tokenized_datasets["train"],
    )

    trainer_args = {
        "resume_from_checkpoint":None,
        "trial":None,
        "ignore_keys_for_eval":None
    }

    peft_trainer.train(
        **trainer_args
    )

    peft_model_path="./peft-dialogue-summary-checkpoint-local"

    peft_trainer.model.save_pretrained(peft_model_path)
    tokenizer.save_pretrained(peft_model_path)

The code shown above throws: RuntimeError: Expected a 'cuda' device type for generator but found 'cpu'

Stack trace:
RuntimeError                              Traceback (most recent call last)
Cell In[5], line 54
     42 peft_trainer = Trainer(
     43     model=peft_model,
     44     args=peft_training_args,
     45     train_dataset=tokenized_datasets["train"],
     46 )
     48 trainer_args = {
     49     "resume_from_checkpoint":None,
     50     "trial":None,
     51     "ignore_keys_for_eval":None
     52 }
---> 54 peft_trainer.train(
     55     **trainer_args
     56 )
     58 peft_model_path="./peft-dialogue-summary-checkpoint-local"
     60 peft_trainer.model.save_pretrained(peft_model_path)

File ~/Documentos/Python/.venv/lib/python3.12/site-packages/transformers/trainer.py:1885, in Trainer.train(self, resume_from_checkpoint, trial, ignore_keys_for_eval, **kwargs)
   1883         hf_hub_utils.enable_progress_bars()
   1884 else:
-> 1885     return inner_training_loop(
   1886         args=args,
   1887         resume_from_checkpoint=resume_from_checkpoint,
...

File ~/Documentos/Python/.venv/lib/python3.12/site-packages/torch/utils/_device.py:78
     76 if func in _device_constructors() and kwargs.get('device') is None:
     77     kwargs['device'] = self.device
---> 78 return func(*args, **kwargs)

RuntimeError: Expected a 'cuda' device type for generator but found 'cpu'

Expected behavior

It should not throw the error, since the entire script runs under "with torch.device(device):" with device set to 'cuda'.
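
For context, here is a minimal sketch of what I believe is the same mismatch, based on the torch/utils/_device.py frame in the trace (this is my assumption about the mechanism, not a confirmed Trainer-level diagnosis). Inside the device context, factory functions such as torch.randperm get device='cuda' filled in automatically, but a torch.Generator created with no device argument, as the sampler presumably does, still lives on the CPU:

import torch

# Requires a CUDA-capable PyTorch build.
# Assumption: this mirrors what happens inside Trainer's data sampler.
g = torch.Generator()  # created with no device argument -> a CPU generator
with torch.device("cuda"):
    # torch/utils/_device.py rewrites device=None to device='cuda' here,
    # while the generator passed in is still on the CPU.
    torch.randperm(10, generator=g)
    # -> RuntimeError: Expected a 'cuda' device type for generator but found 'cpu'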

@amyeroberts
Collaborator

cc @muellerzr @SunMarc

@muellerzr
Contributor

muellerzr commented Jul 8, 2024

Why are we doing everything under with device()? Does it work if you remove this?

@diego-coba
Author

Thanks for looking at my issue.

Q: Why?
A: When running prediction with the large variant of the model, PyTorch wasn't using the GPU, so I had to move the model manually with .to('cuda'). To avoid moving everything (tokenizer, dataset, model) by hand, I started using the with torch.device(...) syntax. Now I'm trying to train with PEFT LoRA, and since my GPU has only 4 GB of VRAM I switched to the base variant, keeping the explicit device specification, and that's when the error occurs.

Q: Does it work if I remove it?
A: It does. Even when I set the device variable to CPU, PyTorch somehow ignores it and, with the base variant of the model, uses the GPU automatically: nvidia-smi showed about 3.8 GB of VRAM in use while the script ran.

So I don't know why PyTorch sometimes uses the GPU automatically and sometimes doesn't, but for some reason, when I try to force it onto the GPU with PEFT LoRA, the error occurs.

For now I'm just relying on the automatic device detection, but I still think something isn't working properly somewhere.
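
For illustration, this is the pattern I've fallen back to (a sketch under the assumption that moving only the model is enough, since Trainer places each input batch on the model's device itself):

import torch
from transformers import AutoModelForSeq2SeqLM

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Move only the model explicitly; no `with torch.device(...)` context anywhere.
# Trainer then moves the training batches to the model's device on its own.
model_name = 'google/flan-t5-base'
original_model = AutoModelForSeq2SeqLM.from_pretrained(
    model_name, torch_dtype=torch.bfloat16
).to(device)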

Thanks again @muellerzr
