
Add special logic for 'step' in _optimizer_to_device #20019

Open · wants to merge 1 commit into master
Conversation

@corwinjoy (Contributor) commented on Jun 27, 2024

Fix the performance degradation seen when restoring an optimizer from a checkpoint.
This addresses the issue discussed in #19955.

Fixes #19955

The root cause is a related issue in PyTorch (pytorch/pytorch#74424): the Adam family of optimizers keeps its scalar 'step' state on the CPU, and moving that tensor to the GPU forces host/device synchronizations during training.
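For reference, a minimal sketch of the idea (an illustrative simplification, not the literal diff in this commit): move the per-parameter optimizer state to the target device, but leave any 'step' entry where the optimizer created it. The helper name matches `_optimizer_to_device` in `lightning.fabric.utilities.optimizer`; the body below is an assumption for illustration.

```python
import torch
from torch import Tensor
from torch.optim import Optimizer


def _optimizer_to_device(optimizer: Optimizer, device: torch.device) -> None:
    # Move per-parameter optimizer state to `device`, but skip the scalar
    # 'step' tensor: PyTorch's Adam family keeps 'step' on the CPU, and
    # moving it to the GPU forces a host/device sync on every step()
    # (see pytorch/pytorch#74424).
    for state in optimizer.state.values():
        for key, value in state.items():
            if key == "step":
                continue  # leave 'step' on the device the optimizer chose
            if isinstance(value, Tensor):
                state[key] = value.to(device)
```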

This change could also use a regression test guarding against the performance degradation, but I'm not sure how best to write one: on a dedicated GPU the transfer time is negligible, and the problem only really shows up when the GPU is shared or has a transfer bottleneck. One option, sketched below, is to assert device placement of 'step' rather than timing anything.
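A rough sketch of such a placement check (assumes a CUDA device is available and that the helper lives at `lightning.fabric.utilities.optimizer`; names are otherwise hypothetical):

```python
import torch
from lightning.fabric.utilities.optimizer import _optimizer_to_device


def test_step_stays_on_cpu():
    # Take one real step so Adam materializes its per-parameter state,
    # including the scalar 'step' tensor.
    model = torch.nn.Linear(4, 4)
    optimizer = torch.optim.Adam(model.parameters())
    model(torch.randn(2, 4)).sum().backward()
    optimizer.step()

    _optimizer_to_device(optimizer, torch.device("cuda"))

    for state in optimizer.state.values():
        # exp_avg / exp_avg_sq should have moved; 'step' should not.
        assert state["exp_avg"].device.type == "cuda"
        assert state["step"].device.type == "cpu"
```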


📚 Documentation preview 📚: https://pytorch-lightning--20019.org.readthedocs.build/en/20019/

@github-actions github-actions bot added the fabric lightning.fabric.Fabric label Jun 27, 2024
@corwinjoy (Contributor, Author) commented on Jun 27, 2024

Here is the profiling data from the test code in issue #19955, continuing from a checkpoint. The old code forces many memory synchronizations; with the update that keeps 'step' as-is, they disappear:

nsys profile --stats=true /home/cjoy/src/adam_gpu/.venv/bin/python /home/cjoy/src/adam_gpu/src/test.py


Original _optimizer_to_device function:
[7/8] Executing 'cuda_gpu_mem_time_sum' stats report

 Time (%)  Total Time (ns)  Count   Avg (ns)    Med (ns)  Min (ns)   Max (ns)   StdDev (ns)            Operation          
 --------  ---------------  -----  -----------  --------  --------  ----------  ------------  ----------------------------
     60.4      129,388,373  4,094     31,604.4   1,344.0     1,024  16,394,576     672,557.9  [CUDA memcpy Device-to-Host]
     38.7       82,982,124     44  1,885,957.4     608.0       415  67,438,426  10,172,712.8  [CUDA memcpy Host-to-Device]
      0.9        1,971,518  2,000        985.8     992.0       416       2,368         166.8  [CUDA memset]            
      

With special handling for 'step', as in this PR (note the [CUDA memcpy Device-to-Host] count drops from 4,094 to 74):
[7/8] Executing 'cuda_gpu_mem_time_sum' stats report

 Time (%)  Total Time (ns)  Count   Avg (ns)    Med (ns)  Min (ns)   Max (ns)   StdDev (ns)            Operation          
 --------  ---------------  -----  -----------  --------  --------  ----------  ------------  ----------------------------
     59.3      122,887,554     74  1,660,642.6   1,424.0     1,024  16,134,055   4,710,020.5  [CUDA memcpy Device-to-Host]
     39.8       82,420,918     34  2,424,144.6     799.5       415  67,068,637  11,489,504.0  [CUDA memcpy Host-to-Device]
      0.9        1,940,579  2,000        970.3     991.0       415       5,727         197.9  [CUDA memset] 
       

Labels
fabric lightning.fabric.Fabric
Development

Successfully merging this pull request may close these issues.

Adam optimizer is slower after loading model from checkpoint