[Bug]: Merging the fine-tuned model on AI Studio gets killed #8694

Open
1 task done
Yang-Changhui opened this issue Jul 2, 2024 · 9 comments
Labels
bug Something isn't working

Comments


Yang-Changhui commented Jul 2, 2024

Software environment

- paddlepaddle-gpu: 0.0.0.post118
- paddlenlp: 2.8.0.post0

Duplicate check

  • I have searched the existing issues

Error description

Hardware: AI Studio V100 32G
Fine-tuned model: THUDM/chatglm2-6b

While merging the model, CPU memory usage keeps climbing until the machine runs out of memory and the process is killed, yet GPU memory utilization stays very low. What causes this, and how can I reduce CPU memory usage during the merge? Thanks.

Steps to reproduce & code

Model fine-tuning

python finetune_generation.py chatglm2/lora_argument.json

Config file changes

{
    "dataset_name_or_path": "/home/aistudio/dataset",
    "per_device_train_batch_size": 1,
    "zero_padding": true,
    "use_flash_attention": true,
    "weight_quantize_algo": "nf4"
}

Model merging

python merge_lora_params.py \
    --lora_path ./checkpoints/chatglm2_lora_ckpts/checkpoint-204 \
    --merge_lora_model_path ./checkpoints/chatglm2_lora_merge \
    --device "gpu" \
    --low_gpu_mem True

The merge fails with the error shown in the attached screenshot.

Yang-Changhui added the bug label on Jul 2, 2024
wawltor (Collaborator) commented Jul 2, 2024

During parameter merging, the weights have to be merged in host (CPU) memory. You can enable unified checkpoint to avoid the separate parameter-merging step.
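
For reference, a minimal sketch of how this could look in the LoRA training config, assuming this PaddleNLP version exposes unified checkpoint as a training argument named unified_checkpoint (the key name follows the --unified_checkpoint flag mentioned later in this thread; treat it as an assumption, not a confirmed 2.8 option):

{
    "dataset_name_or_path": "/home/aistudio/dataset",
    "per_device_train_batch_size": 1,
    "zero_padding": true,
    "use_flash_attention": true,
    "weight_quantize_algo": "nf4",
    "unified_checkpoint": true
}

If the trainer accepts this key, the checkpoint is saved in the unified format and, per the suggestion above, the separate CPU-heavy merge step should not be needed.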

Yang-Changhui (Author) commented:

OK. One more question: with the latest paddlenlp 3.0 and the same configuration, fine-tuning errors out after training for a while, whereas paddlenlp 2.8 does not:
OSError: (External) OSError: (External) CUBLAS error(14).
[Hint: 'CUBLAS_STATUS_INTERNAL_ERROR'. An internal cuBLAS operation failed. This error is usually caused by a cudaMemcpyAsync() failure. To correct: check that the hardware, an appropriate version of the driver, and the cuBLAS library are correctly installed. Also, check that the memory passed as a parameter to the routine is not being deallocated prior to the routine’s completion. ] (at /paddle/paddle/phi/kernels/funcs/blas/blas_impl.cu.h:1753)

wawltor (Collaborator) commented Jul 2, 2024

Did GPU memory run out (OOM) during training?

Yang-Changhui (Author) commented:

According to AI Studio's built-in monitor, no; GPU memory usage never even reached half.

Yang-Changhui (Author) commented Jul 2, 2024

> During parameter merging, the weights have to be merged in host (CPU) memory. You can enable unified checkpoint to avoid the separate parameter-merging step.

It seems this can only be used during training. If unified checkpoint is enabled, does that mean the merge is no longer needed? Also, the config file in paddlenlp 2.8.0 does not have this parameter.

wawltor (Collaborator) commented Jul 2, 2024

> During parameter merging, the weights have to be merged in host (CPU) memory. You can enable unified checkpoint to avoid the separate parameter-merging step.

> It seems this can only be used during training. If unified checkpoint is enabled, does that mean the merge is no longer needed? Also, the config file in paddlenlp 2.8.0 does not have this parameter.

Version 2.8 supports unified checkpoint.

wawltor (Collaborator) commented Jul 2, 2024

> According to AI Studio's built-in monitor, no; GPU memory usage never even reached half.

Does the CUDA version of your installed Paddle meet the requirements?
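
For reference, a minimal way to check which CUDA version the installed wheel was built against and whether the GPU setup works, using standard PaddlePaddle utilities:

import paddle

# CUDA version the installed paddlepaddle-gpu wheel was compiled with
print(paddle.version.cuda())

# Quick end-to-end installation check, including whether the GPU is usable
paddle.utils.run_check()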

Yang-Changhui (Author) commented:

> According to AI Studio's built-in monitor, no; GPU memory usage never even reached half.

> Does the CUDA version of your installed Paddle meet the requirements?

Yes, I'm using paddlepaddle-gpu==0.0.0.post118.

Yang-Changhui (Author) commented:

> During parameter merging, the weights have to be merged in host (CPU) memory. You can enable unified checkpoint to avoid the separate parameter-merging step.

> It seems this can only be used during training. If unified checkpoint is enabled, does that mean the merge is no longer needed? Also, the config file in paddlenlp 2.8.0 does not have this parameter.

> Version 2.8 supports unified checkpoint.

In 2.8 the LoRA fine-tuning command is python finetune_generation.py ./chatglm2/lora_argument.json, while using unified checkpoint requires python run_pretrain.py ./chatglm2/lora_argument.json --unified_checkpoint 1. These two training entry points seem to be different.
