
Add special logic for 'step' in _optimizer_to_device #20019

Open · wants to merge 1 commit into master
Conversation

@corwinjoy (Contributor) commented on Jun 27, 2024

Fix the performance degradation seen when restoring an optimizer from a checkpoint.
This addresses the issue discussed in #19955.

Fixes #19955

The root cause is a related issue in PyTorch (pytorch/pytorch#74424): the Adam family of optimizers keeps its scalar 'step' state on the CPU, and moving that tensor to the GPU forces host/device synchronizations during training.
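For reference, a minimal sketch of the idea (an illustrative simplification, not the literal diff in this commit): move the per-parameter optimizer state to the target device, but leave any 'step' entry where the optimizer created it. The helper name matches `_optimizer_to_device` in `lightning.fabric.utilities.optimizer`; the body below is an assumption for illustration.

```python
import torch
from torch import Tensor
from torch.optim import Optimizer


def _optimizer_to_device(optimizer: Optimizer, device: torch.device) -> None:
    # Move per-parameter optimizer state to `device`, but skip the scalar
    # 'step' tensor: PyTorch's Adam family keeps 'step' on the CPU, and
    # moving it to the GPU forces a host/device sync on every step()
    # (see pytorch/pytorch#74424).
    for state in optimizer.state.values():
        for key, value in state.items():
            if key == "step":
                continue  # leave 'step' on the device the optimizer chose
            if isinstance(value, Tensor):
                state[key] = value.to(device)
```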

This change could also use a regression test guarding against the performance degradation, but I'm not sure how best to write one: on a dedicated GPU the transfer time is negligible, and the problem only really shows up when the GPU is shared or has a transfer bottleneck. One option, sketched below, is to assert device placement of 'step' rather than timing anything.
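A rough sketch of such a placement check (assumes a CUDA device is available and that the helper lives at `lightning.fabric.utilities.optimizer`; names are otherwise hypothetical):

```python
import torch
from lightning.fabric.utilities.optimizer import _optimizer_to_device


def test_step_stays_on_cpu():
    # Take one real step so Adam materializes its per-parameter state,
    # including the scalar 'step' tensor.
    model = torch.nn.Linear(4, 4)
    optimizer = torch.optim.Adam(model.parameters())
    model(torch.randn(2, 4)).sum().backward()
    optimizer.step()

    _optimizer_to_device(optimizer, torch.device("cuda"))

    for state in optimizer.state.values():
        # exp_avg / exp_avg_sq should have moved; 'step' should not.
        assert state["exp_avg"].device.type == "cuda"
        assert state["step"].device.type == "cpu"
```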


📚 Documentation preview 📚: https://pytorch-lightning--20019.org.readthedocs.build/en/20019/

@github-actions github-actions bot added the fabric lightning.fabric.Fabric label Jun 27, 2024
@corwinjoy (Contributor, Author) commented on Jun 27, 2024

Here is the profiling data from the test code in issue #19955, continuing from a checkpoint. The old code forces many memory synchronizations; with the update that keeps 'step' as-is, they disappear:

nsys profile --stats=true /home/cjoy/src/adam_gpu/.venv/bin/python /home/cjoy/src/adam_gpu/src/test.py


Original _optimizer_to_device function:
[7/8] Executing 'cuda_gpu_mem_time_sum' stats report

 Time (%)  Total Time (ns)  Count   Avg (ns)    Med (ns)  Min (ns)   Max (ns)   StdDev (ns)            Operation          
 --------  ---------------  -----  -----------  --------  --------  ----------  ------------  ----------------------------
     60.4      129,388,373  4,094     31,604.4   1,344.0     1,024  16,394,576     672,557.9  [CUDA memcpy Device-to-Host]
     38.7       82,982,124     44  1,885,957.4     608.0       415  67,438,426  10,172,712.8  [CUDA memcpy Host-to-Device]
      0.9        1,971,518  2,000        985.8     992.0       416       2,368         166.8  [CUDA memset]            
      

With special handling for 'step', as in this PR (note the [CUDA memcpy Device-to-Host] count drops from 4,094 to 74):
[7/8] Executing 'cuda_gpu_mem_time_sum' stats report

 Time (%)  Total Time (ns)  Count   Avg (ns)    Med (ns)  Min (ns)   Max (ns)   StdDev (ns)            Operation          
 --------  ---------------  -----  -----------  --------  --------  ----------  ------------  ----------------------------
     59.3      122,887,554     74  1,660,642.6   1,424.0     1,024  16,134,055   4,710,020.5  [CUDA memcpy Device-to-Host]
     39.8       82,420,918     34  2,424,144.6     799.5       415  67,068,637  11,489,504.0  [CUDA memcpy Host-to-Device]
      0.9        1,940,579  2,000        970.3     991.0       415       5,727         197.9  [CUDA memset] 
       

Labels
fabric lightning.fabric.Fabric
Development

Successfully merging this pull request may close these issues.

Adam optimizer is slower after loading model from checkpoint