training issue #5890

Open
MaleekaA opened this issue Jul 5, 2024 · 1 comment

MaleekaA commented Jul 5, 2024

If I run any of the examples or any training setup from applications in ColossalAI, I get the error below.
Can you help me solve this issue?

RuntimeError: Stop_waiting response is expected
Error: failed to run torchrun --nproc_per_node=4 --nnodes=1 --node_rank=0 --master_addr= --master_port= vit_train_demo.py --model_name_or_path google/vit-base-patch16-224 --output_path ./home/jovyan/malika/ColossalAI/examples/images/vit --plugin hybrid_parallel --batch_size 8 --tp_size 4 --pp_size 1 --num_epoch 3 --learning_rate 2e-4 --weight_decay 0.05 --warmup_ratio 0.3 on 127.0.0.1, is localhost: True, exception: Encountered a bad command exit code!

Command: 'cd /home/jovyan/malika/ColossalAI/examples/images/vit && export SHELL="/bin/bash" COLORTERM="truecolor" TERM_PROGRAM_VERSION="1.90.2" LC_ADDRESS="ko_KR.UTF-8" LC_NAME="ko_KR.UTF-8" LC_MONETARY="ko_KR.UTF-8" PWD="/home/jovyan/malika/ColossalAI/examples/images/vit" LOGNAME="jovyan" NCCL_DEBUG="INFO" VSCODE_GIT_ASKPASS_NODE="/home/jovyan/.vscode-server/cli/servers/Stable-5437499feb04f7a586f677b155b039bc2b3669eb/server/node" MOTD_SHOWN="pam" HOME="/home/jovyan" LANG="ko_KR.UTF-8" LC_PAPER="ko_KR.UTF-8" LS_COLORS="rs=0:di=01;34:ln=01;36:mh=00:pi=40;33:so=01;35:do=01;35:bd=40;33;01:cd=40;33;01:or=40;31;01:mi=00:su=37;41:sg=30;43:ca=30;41:tw=30;42:ow=34;42:st=37;44:ex=01;32:.tar=01;31:.tgz=01;31:.arc=01;31:.arj=01;31:.taz=01;31:.lha=01;31:.lz4=01;31:.lzh=01;31:.lzma=01;31:.tlz=01;31:.txz=01;31:.tzo=01;31:.t7z=01;31:.zip=01;31:.z=01;31:.dz=01;31:.gz=01;31:.lrz=01;31:.lz=01;31:.lzo=01;31:.xz=01;31:.zst=01;31:.tzst=01;31:.bz2=01;31:.bz=01;31:.tbz=01;31:.tbz2=01;31:.tz=01;31:.deb=01;31:.rpm=01;31:.jar=01;31:.war=01;31:.ear=01;31:.sar=01;31:.rar=01;31:.alz=01;31:.ace=01;31:.zoo=01;31:.cpio=01;31:.7z=01;31:.rz=01;31:.cab=01;31:.wim=01;31:.swm=01;31:.dwm=01;31:.esd=01;31:.jpg=01;35:.jpeg=01;35:.mjpg=01;35:.mjpeg=01;35:.gif=01;35:.bmp=01;35:.pbm=01;35:.pgm=01;35:.ppm=01;35:.tga=01;35:.xbm=01;35:.xpm=01;35:.tif=01;35:.tiff=01;35:.png=01;35:.svg=01;35:.svgz=01;35:.mng=01;35:.pcx=01;35:.mov=01;35:.mpg=01;35:.mpeg=01;35:.m2v=01;35:.mkv=01;35:.webm=01;35:.ogm=01;35:.mp4=01;35:.m4v=01;35:.mp4v=01;35:.vob=01;35:.qt=01;35:.nuv=01;35:.wmv=01;35:.asf=01;35:.rm=01;35:.rmvb=01;35:.flc=01;35:.avi=01;35:.fli=01;35:.flv=01;35:.gl=01;35:.dl=01;35:.xcf=01;35:.xwd=01;35:.yuv=01;35:.cgm=01;35:.emf=01;35:.ogv=01;35:.ogx=01;35:.aac=00;36:.au=00;36:.flac=00;36:.m4a=00;36:.mid=00;36:.midi=00;36:.mka=00;36:.mp3=00;36:.mpc=00;36:.ogg=00;36:.ra=00;36:.wav=00;36:.oga=00;36:.opus=00;36:.spx=00;36:*.xspf=00;36:" VIRTUAL_ENV="/home/jovyan/.venv/torch2.3.0-py3.10-cuda11.8" 
SSL_CERT_DIR="/usr/lib/ssl/certs" GIT_ASKPASS="/home/jovyan/.vscode-server/cli/servers/Stable-5437499feb04f7a586f677b155b039bc2b3669eb/server/extensions/git/dist/askpass.sh" SSH_CONNECTION="10.0.0.137 60450 10.0.31.75 22" CUDA_VISIBLE_DEVICES="0,1,2,3" LC_IDENTIFICATION="ko_KR.UTF-8" TERM="xterm-256color" USER="jovyan" VISIBLE="now" VSCODE_GIT_IPC_HANDLE="/tmp/vscode-git-044e287697.sock" SHLVL="2" LC_TELEPHONE="ko_KR.UTF-8" LC_MESSAGES="ko_KR.UTF-8" LC_MEASUREMENT="ko_KR.UTF-8" VIRTUAL_ENV_PROMPT="(torch2.3.0-py3.10-cuda11.8) " LD_LIBRARY_PATH="/usr/lib/nvidia:/usr/lib/x86_64-linux-gnu:/usr/lib/x86_64-linux-gnu:/usr/local/cuda/lib64" LC_CTYPE="ko_KR.UTF-8" SSL_CERT_FILE="/usr/lib/ssl/certs/ca-certificates.crt" SSH_CLIENT="10.0.0.137 60450 22" LC_TIME="ko_KR.UTF-8" OMP_NUM_THREADS="1" VSCODE_GIT_ASKPASS_MAIN="/home/jovyan/.vscode-server/cli/servers/Stable-5437499feb04f7a586f677b155b039bc2b3669eb/server/extensions/git/dist/askpass-main.js" CUDA_HOME="/usr/local/cuda" LC_COLLATE="ko_KR.UTF-8" GCC_COLORS="error=01;31:warning=01;35:note=01;36:caret=01;32:locus=01:quote=01" BROWSER="/home/jovyan/.vscode-server/cli/servers/Stable-5437499feb04f7a586f677b155b039bc2b3669eb/server/bin/helpers/browser.sh" PATH="/usr/local/cuda/bin:/home/jovyan/.venv/torch2.3.0-py3.10-cuda11.8/bin:/home/jovyan/.vscode-server/cli/servers/Stable-5437499feb04f7a586f677b155b039bc2b3669eb/server/bin/remote-cli:/usr/local/cuda/bin:/home/jovyan/.venv/torch2.3.0-py3.10-cuda11.8/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin" LC_NUMERIC="ko_KR.UTF-8" OLDPWD="/home/jovyan/malika/ColossalAI/examples/images" TERM_PROGRAM="vscode" VSCODE_IPC_HOOK_CLI="/tmp/vscode-ipc-3dbaed2b-9e4b-4150-855d-699020003867.sock" _="/home/jovyan/.venv/torch2.3.0-py3.10-cuda11.8/bin/colossalai" && torchrun --nproc_per_node=4 --nnodes=1 --node_rank=0 --master_addr=210.125.69.5 --master_port=32309 vit_train_demo.py --model_name_or_path google/vit-base-patch16-224 
--output_path ./home/jovyan/malika/ColossalAI/examples/images/vit --plugin hybrid_parallel --batch_size 8 --tp_size 4 --pp_size 1 --num_epoch 3 --learning_rate 2e-4 --weight_decay 0.05 --warmup_ratio 0.3'

Exit code: 1

Stdout: already printed

Stderr: already printed

====== Training on All Nodes =====
127.0.0.1: failure

====== Stopping All Nodes =====
127.0.0.1: finish

Edenzzzz (Contributor) commented Jul 9, 2024

The error "RuntimeError: Stop_waiting response is expected" indicates that the problem is on torch's end. Please ensure your environment is properly set up (PyTorch version, CUDA) and re-run.
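
One possible way to gather the environment details mentioned above is a short script like the following. This is only a sketch, not an official ColossalAI or PyTorch diagnostic; the function name `collect_env` is hypothetical. The virtualenv name in the log suggests torch 2.3.0 with CUDA 11.8, but the script reports what is actually installed, which may differ.

```python
import importlib
import platform


def collect_env():
    """Collect version details that are useful when reporting a torchrun failure.

    Hypothetical helper: returns a plain dict so it can be printed or pasted
    into an issue. If PyTorch cannot be imported, "torch" is set to None.
    """
    info = {"python": platform.python_version()}
    try:
        torch = importlib.import_module("torch")
        dist = importlib.import_module("torch.distributed")
    except ImportError:
        info["torch"] = None
        return info

    info["torch"] = torch.__version__
    # CUDA version the installed PyTorch build was compiled against (None for CPU-only builds).
    info["torch_cuda_build"] = torch.version.cuda
    info["cuda_available"] = torch.cuda.is_available()
    # Number of GPUs visible to this process (affected by CUDA_VISIBLE_DEVICES).
    info["gpu_count"] = torch.cuda.device_count()
    # NCCL is the backend normally used for multi-GPU training.
    info["nccl_available"] = dist.is_available() and dist.is_nccl_available()
    return info


if __name__ == "__main__":
    for key, value in collect_env().items():
        print(f"{key}: {value}")
```

Running it inside the same virtualenv that launches the training is the point: if `gpu_count` is lower than the `--nproc_per_node` value, or `torch_cuda_build` does not match the CUDA toolkit on the machine, that mismatch is worth resolving before re-running.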
