Trainer initialization freezes when mpi4py is installed #20049

Open · Robinysh opened this issue Jul 4, 2024 · 1 comment
Robinysh commented Jul 4, 2024

Bug description

Trainer freezes on initialization when mpi4py is installed.

I suspect the following issues are encountering the same problem: #18836 #19768

What version are you seeing the problem on?

master

How to reproduce the bug

This freezes

pip install lightning mpi4py
python -c "import lightning; lightning.Trainer(accelerator='cpu')"

This does not freeze

pip install lightning
python -c "import lightning; lightning.Trainer(accelerator='cpu')"
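
To narrow down where the hang occurs, a quick diagnostic (my suggestion, not part of the original report) is to trigger MPI initialization directly. If the command below also hangs, the freeze is in mpi4py's MPI_Init itself, and Lightning is merely the caller:

python -c "from mpi4py import MPI; print(MPI.COMM_WORLD.Get_size())"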

Error messages and logs

No response

Environment

Current environment
  • CUDA:
    • GPU:
      • NVIDIA GeForce RTX 3090
    • available: True
    • version: 12.1
  • Lightning:
    • lightning: 2.3.2
    • lightning-utilities: 0.11.3.post0
    • pytorch-lightning: 2.3.2
    • torch: 2.3.1
    • torchmetrics: 1.4.0.post0
  • Packages:
    • aiohttp: 3.9.5
    • aiosignal: 1.3.1
    • attrs: 23.2.0
    • filelock: 3.15.4
    • frozenlist: 1.4.1
    • fsspec: 2024.6.1
    • gitdb: 4.0.11
    • gitpython: 3.1.40
    • globus-cli: 3.23.0
    • globus-sdk: 3.34.0
    • idna: 3.7
    • jinja2: 3.1.4
    • jupyter-server-mathjax: 0.2.6
    • lightning: 2.3.2
    • lightning-utilities: 0.11.3.post0
    • markupsafe: 2.1.5
    • mpi4py: 3.1.6
    • mpmath: 1.3.0
    • multidict: 6.0.5
    • nbdime: 4.0.1
    • networkx: 3.3
    • numpy: 2.0.0
    • nvidia-cublas-cu12: 12.1.3.1
    • nvidia-cuda-cupti-cu12: 12.1.105
    • nvidia-cuda-nvrtc-cu12: 12.1.105
    • nvidia-cuda-runtime-cu12: 12.1.105
    • nvidia-cudnn-cu12: 8.9.2.26
    • nvidia-cufft-cu12: 11.0.2.54
    • nvidia-curand-cu12: 10.3.2.106
    • nvidia-cusolver-cu12: 11.4.5.107
    • nvidia-cusparse-cu12: 12.1.0.106
    • nvidia-nccl-cu12: 2.20.5
    • nvidia-nvjitlink-cu12: 12.5.82
    • nvidia-nvtx-cu12: 12.1.105
    • packaging: 24.1
    • pip: 24.0
    • pyopenssl: 23.2.0
    • pytorch-lightning: 2.3.2
    • pyyaml: 6.0.1
    • setuptools: 70.1.1
    • smmap: 5.0.1
    • sympy: 1.12.1
    • torch: 2.3.1
    • torchmetrics: 1.4.0.post0
    • tqdm: 4.66.4
    • triton: 2.3.1
    • types-python-dateutil: 2.8.19.20240106
    • typing-extensions: 4.12.2
    • wheel: 0.43.0
    • yarl: 1.9.4
  • System:
    • OS: Linux
    • architecture:
      • 64bit
      • ELF
    • processor:
    • python: 3.11.9
    • release: 6.8.9-arch1-2
    • version: #1 SMP PREEMPT_DYNAMIC Tue, 07 May 2024 21:35:54 +0000

Conda environment that freezes:

name: debuglightning
channels:
  - conda-forge
  - defaults
dependencies:
  - _libgcc_mutex=0.1=conda_forge
  - _openmp_mutex=4.5=2_gnu
  - bzip2=1.0.8=hd590300_5
  - ca-certificates=2024.7.4=hbcca054_0
  - ld_impl_linux-64=2.40=hf3520f5_7
  - libexpat=2.6.2=h59595ed_0
  - libffi=3.4.2=h7f98852_5
  - libgcc-ng=14.1.0=h77fa898_0
  - libgomp=14.1.0=h77fa898_0
  - libnsl=2.0.1=hd590300_0
  - libsqlite=3.46.0=hde9e2c9_0
  - libuuid=2.38.1=h0b41bf4_0
  - libxcrypt=4.4.36=hd590300_1
  - libzlib=1.3.1=h4ab18f5_1
  - ncurses=6.5=h59595ed_0
  - openssl=3.3.1=h4ab18f5_1
  - pip=24.0=pyhd8ed1ab_0
  - python=3.11.9=hb806964_0_cpython
  - readline=8.2=h8228510_1
  - setuptools=70.1.1=pyhd8ed1ab_0
  - tk=8.6.13=noxft_h4845f30_101
  - tzdata=2024a=h0c530f3_0
  - wheel=0.43.0=pyhd8ed1ab_1
  - xz=5.2.6=h166bdaf_0
  - pip:
      - aiohttp==3.9.5
      - aiosignal==1.3.1
      - attrs==23.2.0
      - filelock==3.15.4
      - frozenlist==1.4.1
      - fsspec==2024.6.1
      - idna==3.7
      - jinja2==3.1.4
      - lightning==2.3.2
      - lightning-utilities==0.11.3.post0
      - markupsafe==2.1.5
      - mpi4py==3.1.6
      - mpmath==1.3.0
      - multidict==6.0.5
      - networkx==3.3
      - numpy==2.0.0
      - nvidia-cublas-cu12==12.1.3.1
      - nvidia-cuda-cupti-cu12==12.1.105
      - nvidia-cuda-nvrtc-cu12==12.1.105
      - nvidia-cuda-runtime-cu12==12.1.105
      - nvidia-cudnn-cu12==8.9.2.26
      - nvidia-cufft-cu12==11.0.2.54
      - nvidia-curand-cu12==10.3.2.106
      - nvidia-cusolver-cu12==11.4.5.107
      - nvidia-cusparse-cu12==12.1.0.106
      - nvidia-nccl-cu12==2.20.5
      - nvidia-nvjitlink-cu12==12.5.82
      - nvidia-nvtx-cu12==12.1.105
      - packaging==24.1
      - pytorch-lightning==2.3.2
      - pyyaml==6.0.1
      - sympy==1.12.1
      - torch==2.3.1
      - torchmetrics==1.4.0.post0
      - tqdm==4.66.4
      - triton==2.3.1
      - typing-extensions==4.12.2
      - yarl==1.9.4
prefix: /home/robinysh/.conda/envs/debuglightning

cc @awaelchli

Robinysh added the "bug" and "needs triage" labels on Jul 4, 2024
awaelchli (Member) commented

Hi @Robinysh

Based on this code:

def detect() -> bool:
    """Returns ``True`` if the `mpi4py` package is installed and MPI returns a world size greater than 1."""
    if not _MPI4PY_AVAILABLE:
        return False

    from mpi4py import MPI

    return MPI.COMM_WORLD.Get_size() > 1

My reading of this is: when mpi4py is installed and the MPI world size is greater than 1, the Trainer will detect that it is running on an MPI cluster. If you're not actually launching on an MPI cluster, then I guess this detection will not work and the hang is understandable. Can I ask why you have mpi4py installed, and whether or not you intended to run on an MPI cluster?
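
As a possible workaround (a sketch on my part, not something verified in this thread), you could try bypassing cluster-environment auto-detection by passing an explicit environment to the Trainer, so that MPIEnvironment.detect(), and with it the hanging "from mpi4py import MPI" import, should never be reached:

# Workaround sketch (untested here): supply an explicit cluster environment
# so Lightning should skip probing MPIEnvironment.detect(), which is where
# `from mpi4py import MPI` (and thus MPI_Init) can hang outside an MPI launch.
import lightning
from lightning.pytorch.plugins.environments import LightningEnvironment

trainer = lightning.Trainer(
    accelerator="cpu",
    plugins=[LightningEnvironment()],
)

Alternatively, uninstalling mpi4py in environments that never launch via an MPI launcher avoids the detection path entirely.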

awaelchli added the "environment: mpi" label and removed the "needs triage" label on Jul 5, 2024