
Trainer.train -> Expected a 'cuda' device type for generator but found 'cpu' #31833

Open

diego-coba opened this issue Jul 7, 2024 · 3 comments

@diego-coba
diego-coba commented Jul 7, 2024

System Info

Transformers 4.41.2
PyTorch 2.3.1+cu121
Python 3.12.3
Ubuntu 24.04

GPU: NVIDIA GeForce GTX 1650

Who can help?

No response

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

%pip install --quiet torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
%pip install --quiet -U datasets
%pip install --quiet torchdata
%pip install --quiet setuptools
%pip install --quiet transformers
%pip install --quiet evaluate
%pip install --quiet rouge_score
%pip install --quiet loralib
%pip install --quiet peft
%pip install --quiet ipywidgets

import torch

if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)} is available and will be used.")
else:
    print("CUDA is not available. CPU will be used.")

dash_line = '-'.join('' for x in range(100))
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
from datasets import load_dataset
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer, GenerationConfig, TrainingArguments, Trainer
from peft import LoraConfig, get_peft_model, TaskType
import torch
import time
import evaluate
import pandas as pd
import numpy as np
with torch.device(device):
    huggingface_dataset_name = "knkarthick/dialogsum"
    dataset = load_dataset(huggingface_dataset_name)
    model_name='google/flan-t5-base'
    original_model = AutoModelForSeq2SeqLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)
    tokenizer = AutoTokenizer.from_pretrained(model_name)

    def tokenize_function(example):
        start_prompt = 'Summarize the following conversation.\n\n'
        end_prompt = '\n\nSummary: '
        prompt = [start_prompt + dialogue + end_prompt for dialogue in example["dialogue"]]
        example['input_ids'] = tokenizer(prompt, padding="max_length", truncation=True, return_tensors="pt").input_ids
        example['labels'] = tokenizer(example["summary"], padding="max_length", truncation=True, return_tensors="pt").input_ids
    
        return example

    tokenized_datasets = dataset.map(tokenize_function, batched=True)
    tokenized_datasets = tokenized_datasets.remove_columns(['id', 'topic', 'dialogue', 'summary',])

    lora_config = LoraConfig(
        r=16, # Rank
        lora_alpha=32,
        target_modules=["q", "v"],
        lora_dropout=0.05,
        bias="none",
        task_type=TaskType.SEQ_2_SEQ_LM # FLAN-T5
    )

    peft_model = get_peft_model(original_model, lora_config)

    output_dir = f'./peft-dialogue-summary-training-{str(int(time.time()))}'

    peft_training_args = TrainingArguments(
        output_dir=output_dir,
        auto_find_batch_size=True,
        learning_rate=1e-3, # Higher learning rate than full fine-tuning.
        num_train_epochs=10,
        logging_steps=1,
        max_steps=1    
    )

    peft_trainer = Trainer(
        model=peft_model,
        args=peft_training_args,
        train_dataset=tokenized_datasets["train"],
    )

    trainer_args = {
        "resume_from_checkpoint":None,
        "trial":None,
        "ignore_keys_for_eval":None
    }

    peft_trainer.train(
        **trainer_args
    )

    peft_model_path="./peft-dialogue-summary-checkpoint-local"

    peft_trainer.model.save_pretrained(peft_model_path)
    tokenizer.save_pretrained(peft_model_path)

The code shown above throws: RuntimeError: Expected a 'cuda' device type for generator but found 'cpu'

Stack trace:
RuntimeError                              Traceback (most recent call last)
Cell In[5], line 54
     42 peft_trainer = Trainer(
     43     model=peft_model,
     44     args=peft_training_args,
     45     train_dataset=tokenized_datasets["train"],
     46 )
     48 trainer_args = {
     49     "resume_from_checkpoint":None,
     50     "trial":None,
     51     "ignore_keys_for_eval":None
     52 }
---> 54 peft_trainer.train(
     55     **trainer_args
     56 )
     58 peft_model_path="./peft-dialogue-summary-checkpoint-local"
     60 peft_trainer.model.save_pretrained(peft_model_path)

File ~/Documentos/Python/.venv/lib/python3.12/site-packages/transformers/trainer.py:1885, in Trainer.train(self, resume_from_checkpoint, trial, ignore_keys_for_eval, **kwargs)
   1883         hf_hub_utils.enable_progress_bars()
   1884 else:
-> 1885     return inner_training_loop(
   1886         args=args,
   1887         resume_from_checkpoint=resume_from_checkpoint,
...

File ~/Documentos/Python/.venv/lib/python3.12/site-packages/torch/utils/_device.py:78
     76 if func in _device_constructors() and kwargs.get('device') is None:
     77     kwargs['device'] = self.device
---> 78 return func(*args, **kwargs)

RuntimeError: Expected a 'cuda' device type for generator but found 'cpu'

Expected behavior

It should not throw the error, since the entire script runs under "with torch.device(device):" with device set to 'cuda'.
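
For context, here is a minimal sketch of what I believe is the same mismatch, based on the torch/utils/_device.py frame in the trace (this is my assumption about the mechanism, not a confirmed Trainer-level diagnosis). Inside the device context, factory functions such as torch.randperm get device='cuda' filled in automatically, but a torch.Generator created with no device argument, as the sampler presumably does, still lives on the CPU:

import torch

# Requires a CUDA-capable PyTorch build.
# Assumption: this mirrors what happens inside Trainer's data sampler.
g = torch.Generator()  # created with no device argument -> a CPU generator
with torch.device("cuda"):
    # torch/utils/_device.py rewrites device=None to device='cuda' here,
    # while the generator passed in is still on the CPU.
    torch.randperm(10, generator=g)
    # -> RuntimeError: Expected a 'cuda' device type for generator but found 'cpu'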

@amyeroberts
Collaborator

cc @muellerzr @SunMarc

@muellerzr
Contributor

muellerzr commented Jul 8, 2024

Why are we doing everything under with device()? Does it work if you remove this?

@diego-coba
Author

Thanks for looking at my issue.

Q: Why?
A: When running prediction with the large variant of the model, PyTorch wasn't using the GPU, so I had to move the model manually with .to('cuda'). To avoid moving everything (tokenizer, dataset, model) by hand, I started using the with torch.device(...) syntax. Now I'm trying to train with PEFT LoRA, and since my GPU has only 4 GB of VRAM I switched to the base variant, keeping the explicit device specification, and that's when the error occurs.

Q: Does it work if I remove it?
A: It does. Even when I set the device variable to CPU, PyTorch somehow ignores it and, with the base variant of the model, uses the GPU automatically: nvidia-smi showed about 3.8 GB of VRAM in use while the script ran.

So I don't know why PyTorch sometimes uses the GPU automatically and sometimes doesn't, but for some reason, when I try to force it onto the GPU with PEFT LoRA, the error occurs.

For now I'm just relying on the automatic device detection, but I still think something isn't working properly somewhere.
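
For illustration, this is the pattern I've fallen back to (a sketch under the assumption that moving only the model is enough, since Trainer places each input batch on the model's device itself):

import torch
from transformers import AutoModelForSeq2SeqLM

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Move only the model explicitly; no `with torch.device(...)` context anywhere.
# Trainer then moves the training batches to the model's device on its own.
model_name = 'google/flan-t5-base'
original_model = AutoModelForSeq2SeqLM.from_pretrained(
    model_name, torch_dtype=torch.bfloat16
).to(device)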

Thanks again @muellerzr
