[Bug]: merging a fine-tuned model on AI Studio gets killed #8694
Comments
During parameter merging, the parameters have to be merged in host memory; you can enable unified checkpoint to avoid the merge step.
OK. One more question: with the latest PaddleNLP 3.0 and the same configuration, fine-tuning errors out after training for a while, whereas PaddleNLP 2.8 does not:
Did GPU memory run out (OOM) during training?
According to AI Studio's built-in monitor, no; not even half of the GPU memory was in use.
It seems unified checkpoint can only be used during training. With it enabled, is the merge step no longer needed? Also, the config file in PaddleNLP 2.8.0 does not have this parameter.
Version 2.8 also supports unified checkpoint.
Does the CUDA version of the installed Paddle meet the requirements?
Software Environment
Duplicate Issue
Error Description
Reproduction Steps & Code
Model fine-tuning
python finetune_generation.py chatglm2/lora_argument.json
Config file changes
{
"dataset_name_or_path": "/home/aistudio/dataset",
"per_device_train_batch_size": 1,
"zero_padding": true,
"use_flash_attention": true,
"weight_quantize_algo": "nf4"
}
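Following the maintainer's suggestion, unified checkpoint could presumably be enabled by adding a flag to this same argument file. A sketch, assuming the training-argument name is `unified_checkpoint` (check the TrainingArguments of your PaddleNLP version; 2.8.0 may not expose it):

```json
{
  "dataset_name_or_path": "/home/aistudio/dataset",
  "per_device_train_batch_size": 1,
  "zero_padding": true,
  "use_flash_attention": true,
  "weight_quantize_algo": "nf4",
  "unified_checkpoint": true
}
```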
Model merging
python merge_lora_params.py \
    --lora_path ./checkpoints/chatglm2_lora_ckpts/checkpoint-204 \
    --merge_lora_model_path ./checkpoints/chatglm2_lora_merge \
    --device "gpu" \
    --low_gpu_mem True
The merge fails with an error:
(Screenshot showing the merge process being killed; the original image link has expired.)
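For context, merging a LoRA adapter into the base model applies the update W' = W + (alpha / r) * B A to every adapted weight, and materializing these merged matrices for all layers at once is what drives the memory spike. A minimal numpy sketch of the arithmetic (illustrative only; the names `lora_a`/`lora_b` and the `alpha / r` scaling convention are assumptions, not PaddleNLP's actual `merge_lora_params.py`):

```python
# Minimal sketch of the LoRA merge arithmetic, using numpy.
# Hypothetical names; not PaddleNLP's actual implementation.
import numpy as np

def merge_lora(base_w, lora_a, lora_b, alpha, r):
    """Return base_w + (alpha / r) * (lora_b @ lora_a)."""
    return base_w + (alpha / r) * (lora_b @ lora_a)

rng = np.random.default_rng(0)
d, r = 8, 2
base = rng.standard_normal((d, d))   # base weight: d x d
a = rng.standard_normal((r, d))      # lora_A: r x d
b = rng.standard_normal((d, r))      # lora_B: d x r
merged = merge_lora(base, a, b, alpha=16, r=r)
assert merged.shape == base.shape
```

Because the merged matrix is the same shape as the base weight, the merge needs roughly one extra full copy of each layer in memory, which is consistent with the process being killed by the host OOM killer rather than a GPU OOM.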