incorrect global_step with multiple optimizers and automatic_optimization=False #17958
Comments
I corrected a mistake in the replication script.
It should also be noted that the
The source of this behavior starts with the fact that the trainer's `global_step` is a property rather than a direct count of dataloader iterations. In turn, that property derives its result from the optimizer step progress, which is incremented by the `_on_before_step` / `_on_after_step` hooks that the manual optimization loop injects into every optimizer. Ultimately, every call to `optimizer.step()` advances `global_step`, so with two optimizers it advances twice per batch. One possible fix would be to inject only one of the optimizers with these hooks.
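To make the over-counting concrete, here is a tiny self-contained sketch (plain Python, not Lightning's actual implementation, with made-up class names) of the counting behavior described above:

```python
# Illustrative only: a fake trainer whose global_step, like Lightning's, reads a
# counter that every optimizer step increments.
class _FakeProgress:
    def __init__(self) -> None:
        self.completed = 0


class _FakeTrainer:
    def __init__(self) -> None:
        self.optim_step_progress = _FakeProgress()

    @property
    def global_step(self) -> int:
        # derived from the optimizer step counter, not from the batch index
        return self.optim_step_progress.completed

    def optimizer_step(self) -> None:
        self.optim_step_progress.completed += 1


trainer = _FakeTrainer()
for batch_idx in range(3):      # 3 dataloader iterations
    trainer.optimizer_step()    # e.g. generator optimizer
    trainer.optimizer_step()    # e.g. discriminator optimizer

print(trainer.global_step)      # 6, although only 3 batches were processed
```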
Why this matters: step-based trainer arguments such as `log_every_n_steps` and `val_check_interval` compare against `global_step`, so with multiple optimizers logging and validation run at the wrong frequency.
+1, I ran into the same thing.
Thought I'd provide a little more detail on my use case since other people have encountered this. I'm training a GAN with multiple discriminator steps per generator step. My training step looks like this:

```python
def training_step(self, batch, batch_idx):
    if batch_idx % self.n_critic == 0:
        self.update_generator_and_discriminator(batch)
    else:
        self.update_discriminator_only(batch)
```

This is more efficient than only updating one of the networks each iteration, because it allows one to re-use the generator outputs for the discriminator update. But it also means that the number of `optimizer.step()` calls per `training_step` varies, so `global_step` drifts away from the actual iteration count.

As a workaround to this bug, I subclassed the trainer, like this:

```python
class MyTrainer(pl.Trainer):
    def __init__(self, *, n_critic: int, **kwargs):
        super().__init__(**kwargs)
        self.n_critic = n_critic

    @property
    def global_step(self) -> int:
        return convert_global_step_to_current_iter(super().global_step, self.n_critic)
```

And I also implemented the following method:

```python
def convert_global_step_to_current_iter(step: int, nc: int) -> int:
    return int(step * nc / (nc + 1))
```

This lets my callbacks run at the correct frequency, but is not a general solution. It only applies to the case where every block of `n_critic` iterations contains the same, known number of extra optimizer steps.
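A quick sanity check of the conversion above (illustrative numbers only, assuming `n_critic = 5` and that the combined update makes two `optimizer.step()` calls): every 5 iterations produce 6 optimizer steps, so the raw `global_step` is scaled by 5/6.

```python
def convert_global_step_to_current_iter(step: int, nc: int) -> int:
    return int(step * nc / (nc + 1))


# With n_critic = 5: 4 discriminator-only updates (1 step each) plus 1 combined
# update (2 steps) give 6 optimizer steps per 5 dataloader iterations.
for raw_step in (6, 12, 18):
    print(raw_step, "->", convert_global_step_to_current_iter(raw_step, 5))
# 6 -> 5, 12 -> 10, 18 -> 15
```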
This is a common issue for GANs and we should take a look.
Thanks for the response! I'm happy to help with a PR, if you or anyone else has guidance for a way forward.
Hello, any triage or advice for this?
When I set self.automatic_optimization = False, I got the same error. My observation is as follows: when using 2 optimizers, global_step becomes 2 times larger than the actual step count. So, in this case, we need to figure out how to handle the global_step increase when optimizer.step() is called, for proper training.
Maybe consider changing the definition of global_step to the number of times "training_step" is called. But this would be a breaking change... adding a flag to opt into it would be better.
I've encountered the same problem and solved it as below. However, I'm not sure this method doesn't introduce another problem.

[Background]

In manual optimization, Lightning's internal loop injects step-counting hooks into every optimizer:

```python
class _ManualOptimization(_Loop):
    """A special loop implementing what is known in Lightning as Manual Optimization where the optimization happens
    entirely in the :meth:`~lightning.pytorch.core.module.LightningModule.training_step` and therefore the user is
    responsible for back-propagating gradients and making calls to the optimizers.

    This loop is a trivial case because it performs only a single iteration (calling directly into the module's
    :meth:`~lightning.pytorch.core.module.LightningModule.training_step`) and passing through the output(s).
    """

    output_result_cls = ManualResult

    def __init__(self, trainer: "pl.Trainer") -> None:
        super().__init__(trainer)
        # since manual optimization does not track lr scheduler or optimizer frequencies, we use a simpler
        # progress than `_OptimizationProgress`
        self.optim_step_progress = _Progress.from_defaults(_ReadyCompletedTracker)

        self._output: _OUTPUTS_TYPE = {}

    def run(self, kwargs: OrderedDict) -> _OUTPUTS_TYPE:
        self.on_run_start()
        with suppress(StopIteration):  # no loop to break at this level
            self.advance(kwargs)
        self._restarting = False
        return self.on_run_end()

    def on_run_start(self) -> None:
        # inject logic around the optimizer step
        for lightning_optimizer in self.trainer.strategy._lightning_optimizers:
            lightning_optimizer._on_before_step = self._on_before_step
            lightning_optimizer._on_after_step = self._on_after_step

    def advance(self, kwargs: OrderedDict) -> None:
        """Performs the training step for manual optimization.

        Args:
            kwargs: The kwargs passed down to the hooks.
        """
        trainer = self.trainer

        # manually capture logged metrics
        training_step_output = call._call_strategy_hook(trainer, "training_step", *kwargs.values())
        del kwargs  # release the batch from memory
        self.trainer.strategy.post_training_step()

        result = self.output_result_cls.from_training_step_output(training_step_output)

        self._output = result.asdict()

    def on_run_end(self) -> _OUTPUTS_TYPE:
        """Returns the result of this loop, i.e., the post-processed outputs from the training step."""
        output, self._output = self._output, {}  # free memory
        # reset logic around the optimizer step
        for lightning_optimizer in self.trainer.strategy._lightning_optimizers:
            lightning_optimizer._on_before_step = do_nothing_closure
            lightning_optimizer._on_after_step = do_nothing_closure
        return output

    def _on_before_step(self) -> None:
        self.optim_step_progress.increment_ready()
        self.trainer.profiler.start("optimizer_step")

    def _on_after_step(self) -> None:
        self.trainer.profiler.stop("optimizer_step")
        self.optim_step_progress.increment_completed()
```

[Solution]

Override the injected hooks on the second optimizer in `training_step`, so that only the first optimizer advances `global_step`:

```python
...
def training_step(self, batch, batch_idx):
    gamma_opt, beta_opt = self.optimizers()
    beta_opt._on_before_step = lambda: self.trainer.profiler.start("optimizer_step")
    beta_opt._on_after_step = lambda: self.trainer.profiler.stop("optimizer_step")
    ...
```

[Suggestion for PyTorch Lightning]

```python
...
def configure_optimizers(self):
    opt1 = Adam(...)
    opt2 = Adam(...)
    return (
        {"optimizer": opt1},
        {"optimizer": opt2, "do_not_count_global_step": True},
    )
```
Thanks for your solution, it works!
@yzslab Great! Also, thanks for your comment. Typo fixed :)
Thanks @Fitree for the neat fix! Do we need to update beta_opt._on_before_step and beta_opt._on_after_step at each step, or only the first step? Thanks.
Just ran into this as well, thanks @yzslab for the quick fix. @askerlee yes, the hooks are re-injected by the loop around every `training_step` (see `on_run_start` / `on_run_end` above), so they need to be overridden at each step.

Separately, in the meantime, if anybody needs a quick fix for any number of optimizers, update to this:
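The snippet originally posted here was not captured; a minimal sketch in the same spirit (assuming the hook-override workaround above, generalized so that only the first optimizer advances `global_step`) could look like this:

```python
def training_step(self, batch, batch_idx):
    opts = self.optimizers()
    if not isinstance(opts, (list, tuple)):
        opts = [opts]
    # Keep the default hooks (and hence global_step counting) only on the first
    # optimizer; the remaining ones just retain the profiler bookkeeping.
    for opt in opts[1:]:
        opt._on_before_step = lambda: self.trainer.profiler.start("optimizer_step")
        opt._on_after_step = lambda: self.trainer.profiler.stop("optimizer_step")
    # ... compute losses, self.manual_backward(...), opt.step(), opt.zero_grad() ...
```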
(@repo owners, feel free to assign the issue to me)
Thank you for your PR @Anner-deJong! Hope it is merged soon.
Bug description
Hello,

I encountered a bug when training with `automatic_optimization = False` and two optimizers. In summary: the `global_step` attribute of the trainer and the lightning module is tracking the total number of calls to `optimizer.step()` (in my case, two per `training_step`), rather than the total number of iterations of the dataloader.

This conflicts with the notion of step in arguments like `log_every_n_steps` and `val_check_interval` in the trainer. Case in point: if we call `self.log(...)` inside `training_step`, with `CSVLogger`, `log_every_n_steps=10`, and two `optimizer.step()`s per `training_step`, the CSV logs show `global_step` conflicting with `step`, and in fact being twice the expected value, since we have two optimizers.

I have attached a complete code example that replicates the issue.
What version are you seeing the problem on?
v2.0
How to reproduce the bug
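The attached reproduction script was not captured in this page; a minimal sketch that exhibits the behavior (hypothetical model and data, two optimizers, `automatic_optimization = False`) might look like this:

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset
import lightning.pytorch as pl
from lightning.pytorch.loggers import CSVLogger


class TwoOptimizerModule(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.automatic_optimization = False
        self.net_a = nn.Linear(4, 1)
        self.net_b = nn.Linear(4, 1)

    def training_step(self, batch, batch_idx):
        x, y = batch
        opt_a, opt_b = self.optimizers()

        loss_a = nn.functional.mse_loss(self.net_a(x), y)
        opt_a.zero_grad()
        self.manual_backward(loss_a)
        opt_a.step()

        loss_b = nn.functional.mse_loss(self.net_b(x), y)
        opt_b.zero_grad()
        self.manual_backward(loss_b)
        opt_b.step()

        # two optimizer.step() calls per batch -> global_step advances by 2
        self.log("loss_a", loss_a)

    def configure_optimizers(self):
        return (
            torch.optim.Adam(self.net_a.parameters()),
            torch.optim.Adam(self.net_b.parameters()),
        )


if __name__ == "__main__":
    data = TensorDataset(torch.randn(64, 4), torch.randn(64, 1))
    trainer = pl.Trainer(
        max_epochs=1,
        log_every_n_steps=10,
        logger=CSVLogger("logs"),
    )
    trainer.fit(TwoOptimizerModule(), DataLoader(data, batch_size=8))
```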
Error messages and logs
Environment
More info
If this is the intended behavior, it should be reconciled with the trainer's notion of step. Arguments like `log_every_n_steps` and `val_check_interval` use a different definition of step.