Trainer initialization freezes when mpi4py is installed #20049

Open · Robinysh opened this issue Jul 4, 2024 · 1 comment
Robinysh commented Jul 4, 2024

Bug description

Trainer freezes on initialization when mpi4py is installed.

I suspect the following issues are encountering the same problem: #18836 #19768

What version are you seeing the problem on?

master

How to reproduce the bug

This freezes

pip install lightning mpi4py
python -c "import lightning; lightning.Trainer(accelerator='cpu')"

This does not freeze

pip install lightning
python -c "import lightning; lightning.Trainer(accelerator='cpu')"
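
To narrow down where the hang occurs, a quick diagnostic (my suggestion, not part of the original report) is to trigger MPI initialization directly. If the command below also hangs, the freeze is in mpi4py's MPI_Init itself, and Lightning is merely the caller:

python -c "from mpi4py import MPI; print(MPI.COMM_WORLD.Get_size())"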

Error messages and logs

No response

Environment

Current environment
  • CUDA:
    • GPU:
      • NVIDIA GeForce RTX 3090
    • available: True
    • version: 12.1
  • Lightning:
    • lightning: 2.3.2
    • lightning-utilities: 0.11.3.post0
    • pytorch-lightning: 2.3.2
    • torch: 2.3.1
    • torchmetrics: 1.4.0.post0
  • Packages:
    • aiohttp: 3.9.5
    • aiosignal: 1.3.1
    • attrs: 23.2.0
    • filelock: 3.15.4
    • frozenlist: 1.4.1
    • fsspec: 2024.6.1
    • gitdb: 4.0.11
    • gitpython: 3.1.40
    • globus-cli: 3.23.0
    • globus-sdk: 3.34.0
    • idna: 3.7
    • jinja2: 3.1.4
    • jupyter-server-mathjax: 0.2.6
    • lightning: 2.3.2
    • lightning-utilities: 0.11.3.post0
    • markupsafe: 2.1.5
    • mpi4py: 3.1.6
    • mpmath: 1.3.0
    • multidict: 6.0.5
    • nbdime: 4.0.1
    • networkx: 3.3
    • numpy: 2.0.0
    • nvidia-cublas-cu12: 12.1.3.1
    • nvidia-cuda-cupti-cu12: 12.1.105
    • nvidia-cuda-nvrtc-cu12: 12.1.105
    • nvidia-cuda-runtime-cu12: 12.1.105
    • nvidia-cudnn-cu12: 8.9.2.26
    • nvidia-cufft-cu12: 11.0.2.54
    • nvidia-curand-cu12: 10.3.2.106
    • nvidia-cusolver-cu12: 11.4.5.107
    • nvidia-cusparse-cu12: 12.1.0.106
    • nvidia-nccl-cu12: 2.20.5
    • nvidia-nvjitlink-cu12: 12.5.82
    • nvidia-nvtx-cu12: 12.1.105
    • packaging: 24.1
    • pip: 24.0
    • pyopenssl: 23.2.0
    • pytorch-lightning: 2.3.2
    • pyyaml: 6.0.1
    • setuptools: 70.1.1
    • smmap: 5.0.1
    • sympy: 1.12.1
    • torch: 2.3.1
    • torchmetrics: 1.4.0.post0
    • tqdm: 4.66.4
    • triton: 2.3.1
    • types-python-dateutil: 2.8.19.20240106
    • typing-extensions: 4.12.2
    • wheel: 0.43.0
    • yarl: 1.9.4
  • System:
    • OS: Linux
    • architecture:
      • 64bit
      • ELF
    • processor:
    • python: 3.11.9
    • release: 6.8.9-arch1-2
    • version: #1 SMP PREEMPT_DYNAMIC Tue, 07 May 2024 21:35:54 +0000

Conda environment that freezes:

name: debuglightning
channels:
  - conda-forge
  - defaults
dependencies:
  - _libgcc_mutex=0.1=conda_forge
  - _openmp_mutex=4.5=2_gnu
  - bzip2=1.0.8=hd590300_5
  - ca-certificates=2024.7.4=hbcca054_0
  - ld_impl_linux-64=2.40=hf3520f5_7
  - libexpat=2.6.2=h59595ed_0
  - libffi=3.4.2=h7f98852_5
  - libgcc-ng=14.1.0=h77fa898_0
  - libgomp=14.1.0=h77fa898_0
  - libnsl=2.0.1=hd590300_0
  - libsqlite=3.46.0=hde9e2c9_0
  - libuuid=2.38.1=h0b41bf4_0
  - libxcrypt=4.4.36=hd590300_1
  - libzlib=1.3.1=h4ab18f5_1
  - ncurses=6.5=h59595ed_0
  - openssl=3.3.1=h4ab18f5_1
  - pip=24.0=pyhd8ed1ab_0
  - python=3.11.9=hb806964_0_cpython
  - readline=8.2=h8228510_1
  - setuptools=70.1.1=pyhd8ed1ab_0
  - tk=8.6.13=noxft_h4845f30_101
  - tzdata=2024a=h0c530f3_0
  - wheel=0.43.0=pyhd8ed1ab_1
  - xz=5.2.6=h166bdaf_0
  - pip:
      - aiohttp==3.9.5
      - aiosignal==1.3.1
      - attrs==23.2.0
      - filelock==3.15.4
      - frozenlist==1.4.1
      - fsspec==2024.6.1
      - idna==3.7
      - jinja2==3.1.4
      - lightning==2.3.2
      - lightning-utilities==0.11.3.post0
      - markupsafe==2.1.5
      - mpi4py==3.1.6
      - mpmath==1.3.0
      - multidict==6.0.5
      - networkx==3.3
      - numpy==2.0.0
      - nvidia-cublas-cu12==12.1.3.1
      - nvidia-cuda-cupti-cu12==12.1.105
      - nvidia-cuda-nvrtc-cu12==12.1.105
      - nvidia-cuda-runtime-cu12==12.1.105
      - nvidia-cudnn-cu12==8.9.2.26
      - nvidia-cufft-cu12==11.0.2.54
      - nvidia-curand-cu12==10.3.2.106
      - nvidia-cusolver-cu12==11.4.5.107
      - nvidia-cusparse-cu12==12.1.0.106
      - nvidia-nccl-cu12==2.20.5
      - nvidia-nvjitlink-cu12==12.5.82
      - nvidia-nvtx-cu12==12.1.105
      - packaging==24.1
      - pytorch-lightning==2.3.2
      - pyyaml==6.0.1
      - sympy==1.12.1
      - torch==2.3.1
      - torchmetrics==1.4.0.post0
      - tqdm==4.66.4
      - triton==2.3.1
      - typing-extensions==4.12.2
      - yarl==1.9.4
prefix: /home/robinysh/.conda/envs/debuglightning

cc @awaelchli

Robinysh added the "bug" and "needs triage" labels on Jul 4, 2024
awaelchli (Member) commented

Hi @Robinysh

Based on this code:

def detect() -> bool:
    """Returns ``True`` if the `mpi4py` package is installed and MPI returns a world size greater than 1."""
    if not _MPI4PY_AVAILABLE:
        return False

    from mpi4py import MPI

    return MPI.COMM_WORLD.Get_size() > 1

My reading of this is: when mpi4py is installed and the MPI world size is greater than 1, the Trainer will detect that it is running on an MPI cluster. If you're not actually launching on an MPI cluster, then I guess this detection will not work and the hang is understandable. Can I ask why you have mpi4py installed, and whether or not you intended to run on an MPI cluster?
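
As a possible workaround (a sketch on my part, not something verified in this thread), you could try bypassing cluster-environment auto-detection by passing an explicit environment to the Trainer, so that MPIEnvironment.detect(), and with it the hanging "from mpi4py import MPI" import, should never be reached:

# Workaround sketch (untested here): supply an explicit cluster environment
# so Lightning should skip probing MPIEnvironment.detect(), which is where
# `from mpi4py import MPI` (and thus MPI_Init) can hang outside an MPI launch.
import lightning
from lightning.pytorch.plugins.environments import LightningEnvironment

trainer = lightning.Trainer(
    accelerator="cpu",
    plugins=[LightningEnvironment()],
)

Alternatively, uninstalling mpi4py in environments that never launch via an MPI launcher avoids the detection path entirely.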

awaelchli added the "environment: mpi" label and removed the "needs triage" label on Jul 5, 2024