[Bug]: Processes do not exit on their own after llama7B training finishes #8707

Open

hanhaowen-mt opened this issue Jul 3, 2024 · 1 comment

Labels: bug Something isn't working
hanhaowen-mt commented Jul 3, 2024

Software environment

- paddlepaddle:
- paddlepaddle-gpu: our own build adapted for MUSA, based on commit 131999233ef997fc8d3f24b27830925b78cf17aa (newer than 2.6.1)
- paddlenlp: https://github.com/ZHUI/PaddleNLP/tree/sci/benchmark, commit 20fe363530c0e3868414f65ec394124ffac6b9b2 (the commit required by the 《pretrain-4机-20240620.pdf》 document)

Duplicate issues

  • I have searched the existing issues

Error description

After the A100 pretrain run finishes, some processes stay alive. Their workerlog output also differs: there are the two kinds of logs shown below, and the processes producing the second kind never exit and keep occupying GPU memory. Is this a bug, or is it expected behavior? (The log screenshots are under "Steps to reproduce & code".)
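
For reference, the leftover workers can be inspected with standard Linux/NVIDIA tooling. The commands below are an illustrative sketch, not taken from this report, and assume the workers were launched via run_pretrain.py:

# List training worker processes that are still alive (the name pattern is an assumption).
ps -ef | grep run_pretrain.py | grep -v grep

# Show which processes are still holding GPU memory on the A100 nodes.
nvidia-smi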

Steps to reproduce & code

[Screenshots: the two kinds of workerlog output observed after training ends]
For debugging, we changed the JSON file used in llm/run_dist.sh to llama/pretrain-llama2_7b-tp2sd4_stage2.json, with the following contents:

{
    "model_name_or_path": "meta-llama/Llama-2-7b",
    "tokenizer_name_or_path": "meta-llama/Llama-2-7b",
    "input_dir": "/home/baidu_test/zhonghui03/data",
    "output_dir": "/home/baidu_test/checkpoints/llama_benchmark_ckpts",
    "per_device_train_batch_size": 1,
    "gradient_accumulation_steps": 1,
    "per_device_eval_batch_size": 1,
    "tensor_parallel_degree": 2,
    "pipeline_parallel_degree": 2,
    "sharding": "stage1",
    "virtual_pp_degree": 1,
    "sequence_parallel": 0,
    "use_flash_attention": true,
    "use_fused_rms_norm": true,
    "use_fused_rope": true,
    "max_seq_length": 4096,
    "learning_rate": 3e-05,
    "min_learning_rate": 3e-06,
    "warmup_steps": 30,
    "logging_steps": 1,
    "max_steps": 10,
    "save_steps": 9,
    "eval_steps": 7,
    "weight_decay": 0.01,
    "fp16": true,
    "fp16_opt_level": "O2",
    "warmup_ratio": 0.01,
    "max_grad_norm": 1.0,
    "dataloader_num_workers": 1,
    "continue_training": 1,
    "do_train": true,
    "do_eval": true,
    "do_predict": true,
    "disable_tqdm": true,
    "recompute": false,
    "distributed_dataloader": 1,
    "recompute_granularity": "full",
    "save_total_limit": 2
  }

The contents of run_dist.sh are as follows:

# Copyright (c) 2024 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

unset PADDLE_ELASTIC_JOB_ID
unset PADDLE_TRAINER_ENDPOINTS
unset DISTRIBUTED_TRAINER_ENDPOINTS
unset FLAGS_START_PORT
unset PADDLE_ELASTIC_TIMEOUT
unset CUDA_VISIBLE_DEVICES
# LD_LIBRARY_PATH=/opt/software/openmpi-4.0.5/lib

# 10.3.5.1    g3021
# 10.3.5.2    g3022
# 10.3.6.1    g3023
# 10.3.7.1    g3024
export SAVE_INIT_MODEL=1
python=python

# cd ../model_zoo/gpt-3/external_ops/ &&  ${python} setup.py install && cd -

PYTHONPATH=../ ${python} -m paddle.distributed.launch \
        --master "127.0.0.1:9632" \
        --nnodes 1 \
        --log_dir log_$(hostname) \
        --gpus 0,1,2,3,4,5,6,7 \
        run_pretrain.py \
    "llama/pretrain-llama2_7b-tp2sd4_stage2.json"

# llama/pretrain-llama_13b-tp2sd4_stage2.json
# llama/pretrain-llama_13b-pp4tp2sd2_stage1.json
# llama/pretrain-llama_13b-tp2sd4_stage2.json

@hanhaowen-mt hanhaowen-mt added the bug Something isn't working label Jul 3, 2024
ZHUI (Collaborator) commented Jul 4, 2024

Hello, this should be normal behavior. Could you check whether there is another way on your side to kill the leftover processes cleanly yourself?
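
For example, a minimal cleanup sketch (assuming the leftover workers are run_pretrain.py processes started by paddle.distributed.launch; the process-name pattern and the timing are illustrative assumptions, not part of this thread) could be run after the launcher returns:

# Illustrative cleanup: terminate any run_pretrain.py workers that survive the run.
# The pattern is an assumption; adjust it to match your actual worker command line.
pkill -f run_pretrain.py || true

# Give the workers a moment to shut down, then force-kill anything still alive.
sleep 10
pkill -9 -f run_pretrain.py || true

Appending something along these lines to the end of run_dist.sh would release the GPU memory even when some workers hang on exit.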
