[Bug]: Merging the fine-tuned model on AI Studio gets killed #8694

Open
1 task done
Yang-Changhui opened this issue Jul 2, 2024 · 9 comments
Labels
bug Something isn't working

Comments


Yang-Changhui commented Jul 2, 2024

Software environment

- paddlepaddle-gpu: 0.0.0.post118
- paddlenlp: 2.8.0.post0

Duplicate check

  • I have searched the existing issues

Error description

Hardware: AI Studio V100 32G
Fine-tuned model: THUDM/chatglm2-6b

While merging the model, CPU memory usage keeps climbing until the machine runs out of memory and the process is killed, yet GPU memory utilization stays very low. What causes this, and how can I reduce CPU memory usage during the merge? Thanks.

Steps to reproduce & code

Model fine-tuning

python finetune_generation.py chatglm2/lora_argument.json

Config file changes

{
    "dataset_name_or_path": "/home/aistudio/dataset",
    "per_device_train_batch_size": 1,
    "zero_padding": true,
    "use_flash_attention": true,
    "weight_quantize_algo": "nf4"
}

Model merging

python merge_lora_params.py \
    --lora_path ./checkpoints/chatglm2_lora_ckpts/checkpoint-204 \
    --merge_lora_model_path ./checkpoints/chatglm2_lora_merge \
    --device "gpu" \
    --low_gpu_mem True

The merge fails with the error shown in the attached screenshot.

Yang-Changhui added the bug label on Jul 2, 2024
wawltor (Collaborator) commented Jul 2, 2024

During parameter merging, the weights have to be merged in host (CPU) memory. You can enable unified checkpoint to avoid the separate parameter-merging step.
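
For reference, a minimal sketch of how this could look in the LoRA training config, assuming this PaddleNLP version exposes unified checkpoint as a training argument named unified_checkpoint (the key name follows the --unified_checkpoint flag mentioned later in this thread; treat it as an assumption, not a confirmed 2.8 option):

{
    "dataset_name_or_path": "/home/aistudio/dataset",
    "per_device_train_batch_size": 1,
    "zero_padding": true,
    "use_flash_attention": true,
    "weight_quantize_algo": "nf4",
    "unified_checkpoint": true
}

If the trainer accepts this key, the checkpoint is saved in the unified format and, per the suggestion above, the separate CPU-heavy merge step should not be needed.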

Yang-Changhui (Author) commented:

OK. One more question: with the latest paddlenlp 3.0 and the same configuration, fine-tuning errors out after training for a while, whereas paddlenlp 2.8 does not:
OSError: (External) OSError: (External) CUBLAS error(14).
[Hint: 'CUBLAS_STATUS_INTERNAL_ERROR'. An internal cuBLAS operation failed. This error is usually caused by a cudaMemcpyAsync() failure. To correct: check that the hardware, an appropriate version of the driver, and the cuBLAS library are correctly installed. Also, check that the memory passed as a parameter to the routine is not being deallocated prior to the routine’s completion. ] (at /paddle/paddle/phi/kernels/funcs/blas/blas_impl.cu.h:1753)

wawltor (Collaborator) commented Jul 2, 2024

Did GPU memory run out (OOM) during training?

Yang-Changhui (Author) commented:

According to AI Studio's built-in monitor, no; GPU memory usage never even reached half.

Yang-Changhui (Author) commented Jul 2, 2024

> During parameter merging, the weights have to be merged in host (CPU) memory. You can enable unified checkpoint to avoid the separate parameter-merging step.

It seems this can only be used during training. If unified checkpoint is enabled, does that mean the merge is no longer needed? Also, the config file in paddlenlp 2.8.0 does not have this parameter.

wawltor (Collaborator) commented Jul 2, 2024

> During parameter merging, the weights have to be merged in host (CPU) memory. You can enable unified checkpoint to avoid the separate parameter-merging step.

> It seems this can only be used during training. If unified checkpoint is enabled, does that mean the merge is no longer needed? Also, the config file in paddlenlp 2.8.0 does not have this parameter.

Version 2.8 supports unified checkpoint.

wawltor (Collaborator) commented Jul 2, 2024

> According to AI Studio's built-in monitor, no; GPU memory usage never even reached half.

Does the CUDA version of your installed Paddle meet the requirements?
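
For reference, a minimal way to check which CUDA version the installed wheel was built against and whether the GPU setup works, using standard PaddlePaddle utilities:

import paddle

# CUDA version the installed paddlepaddle-gpu wheel was compiled with
print(paddle.version.cuda())

# Quick end-to-end installation check, including whether the GPU is usable
paddle.utils.run_check()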

Yang-Changhui (Author) commented:

> According to AI Studio's built-in monitor, no; GPU memory usage never even reached half.

> Does the CUDA version of your installed Paddle meet the requirements?

Yes, I'm using paddlepaddle-gpu==0.0.0.post118.

Yang-Changhui (Author) commented:

> During parameter merging, the weights have to be merged in host (CPU) memory. You can enable unified checkpoint to avoid the separate parameter-merging step.

> It seems this can only be used during training. If unified checkpoint is enabled, does that mean the merge is no longer needed? Also, the config file in paddlenlp 2.8.0 does not have this parameter.

> Version 2.8 supports unified checkpoint.

In 2.8 the LoRA fine-tuning command is python finetune_generation.py ./chatglm2/lora_argument.json, while using unified checkpoint requires python run_pretrain.py ./chatglm2/lora_argument.json --unified_checkpoint 1. These two training entry points seem to be different.
