[Bug]: Processes do not exit on their own after llama7B training finishes #8707

Open

hanhaowen-mt opened this issue Jul 3, 2024 · 1 comment

Labels: bug Something isn't working
hanhaowen-mt commented Jul 3, 2024

Software environment

- paddlepaddle:
- paddlepaddle-gpu: our own build adapted for MUSA, based on commit 131999233ef997fc8d3f24b27830925b78cf17aa (newer than 2.6.1)
- paddlenlp: https://github.com/ZHUI/PaddleNLP/tree/sci/benchmark, commit 20fe363530c0e3868414f65ec394124ffac6b9b2 (the commit required by the 《pretrain-4机-20240620.pdf》 document)

Duplicate issues

  • I have searched the existing issues

Error description

After the A100 pretrain run finishes, some processes stay alive. Their workerlog output also differs: there are the two kinds of logs shown below, and the processes producing the second kind never exit and keep occupying GPU memory. Is this a bug, or is it expected behavior? (The log screenshots are under "Steps to reproduce & code".)
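
For reference, the leftover workers can be inspected with standard Linux/NVIDIA tooling. The commands below are an illustrative sketch, not taken from this report, and assume the workers were launched via run_pretrain.py:

# List training worker processes that are still alive (the name pattern is an assumption).
ps -ef | grep run_pretrain.py | grep -v grep

# Show which processes are still holding GPU memory on the A100 nodes.
nvidia-smi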

Steps to reproduce & code

[Screenshots: the two kinds of workerlog output observed after training ends]
For debugging, we changed the JSON file used in llm/run_dist.sh to llama/pretrain-llama2_7b-tp2sd4_stage2.json, with the following contents:

{
    "model_name_or_path": "meta-llama/Llama-2-7b",
    "tokenizer_name_or_path": "meta-llama/Llama-2-7b",
    "input_dir": "/home/baidu_test/zhonghui03/data",
    "output_dir": "/home/baidu_test/checkpoints/llama_benchmark_ckpts",
    "per_device_train_batch_size": 1,
    "gradient_accumulation_steps": 1,
    "per_device_eval_batch_size": 1,
    "tensor_parallel_degree": 2,
    "pipeline_parallel_degree": 2,
    "sharding": "stage1",
    "virtual_pp_degree": 1,
    "sequence_parallel": 0,
    "use_flash_attention": true,
    "use_fused_rms_norm": true,
    "use_fused_rope": true,
    "max_seq_length": 4096,
    "learning_rate": 3e-05,
    "min_learning_rate": 3e-06,
    "warmup_steps": 30,
    "logging_steps": 1,
    "max_steps": 10,
    "save_steps": 9,
    "eval_steps": 7,
    "weight_decay": 0.01,
    "fp16": true,
    "fp16_opt_level": "O2",
    "warmup_ratio": 0.01,
    "max_grad_norm": 1.0,
    "dataloader_num_workers": 1,
    "continue_training": 1,
    "do_train": true,
    "do_eval": true,
    "do_predict": true,
    "disable_tqdm": true,
    "recompute": false,
    "distributed_dataloader": 1,
    "recompute_granularity": "full",
    "save_total_limit": 2
  }

The contents of run_dist.sh are as follows:

# Copyright (c) 2024 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

unset PADDLE_ELASTIC_JOB_ID
unset PADDLE_TRAINER_ENDPOINTS
unset DISTRIBUTED_TRAINER_ENDPOINTS
unset FLAGS_START_PORT
unset PADDLE_ELASTIC_TIMEOUT
unset CUDA_VISIBLE_DEVICES
# LD_LIBRARY_PATH=/opt/software/openmpi-4.0.5/lib

# 10.3.5.1    g3021
# 10.3.5.2    g3022
# 10.3.6.1    g3023
# 10.3.7.1    g3024
export SAVE_INIT_MODEL=1
python=python

# cd ../model_zoo/gpt-3/external_ops/ &&  ${python} setup.py install && cd -

PYTHONPATH=../ ${python} -m paddle.distributed.launch \
        --master "127.0.0.1:9632" \
        --nnodes 1 \
        --log_dir log_$(hostname) \
        --gpus 0,1,2,3,4,5,6,7 \
        run_pretrain.py \
    "llama/pretrain-llama2_7b-tp2sd4_stage2.json"

# llama/pretrain-llama_13b-tp2sd4_stage2.json
# llama/pretrain-llama_13b-pp4tp2sd2_stage1.json
# llama/pretrain-llama_13b-tp2sd4_stage2.json

@hanhaowen-mt hanhaowen-mt added the bug Something isn't working label Jul 3, 2024
ZHUI (Collaborator) commented Jul 4, 2024

Hello, this should be normal behavior. Could you check whether there is another way on your side to kill the leftover processes cleanly yourself?
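
For example, a minimal cleanup sketch (assuming the leftover workers are run_pretrain.py processes started by paddle.distributed.launch; the process-name pattern and the timing are illustrative assumptions, not part of this thread) could be run after the launcher returns:

# Illustrative cleanup: terminate any run_pretrain.py workers that survive the run.
# The pattern is an assumption; adjust it to match your actual worker command line.
pkill -f run_pretrain.py || true

# Give the workers a moment to shut down, then force-kill anything still alive.
sleep 10
pkill -9 -f run_pretrain.py || true

Appending something along these lines to the end of run_dist.sh would release the GPU memory even when some workers hang on exit.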
