Bug description
I have a model with several `ModelCheckpoint` callbacks. When resuming it from a checkpoint using `trainer.fit(model, datamodule=dm, ckpt_path=training_ckpt_path)`, I get the following error:
lightning_fabric.utilities.exceptions.MisconfigurationException: `ModelCheckpoint(monitor='v_nll_unsupervised')` could not find the monitored key in the returned metrics:
['v_nll_supervised_encoder', 'v_nll_supervised_decoder', 'v_nll_supervised', 'v_nll', 'v_nll_supervised_encoder_clip', 'v_nll_supervised_decoder_clip', 'v_nll_supervised_clip', 'v_nll_clip', 'v_mse_supervised_encoder', 'v_mse_supervised_decoder', 'v_mse_encoder', 'v_mse_decoder', 'v_mse', 'v_mse_supervised_encoder_clip', 'v_mse_supervised_decoder_clip', 'v_mse_encoder_clip', 'v_mse_decoder_clip', 'v_mse_clip', 'v_baseline_l_mse_supervised', 'v_baseline_l_mse', 'v_baseline_prior_mse_supervised', 'v_baseline_prior_mse', 'v_mu_supervised_encoder', 'v_mu_supervised_decoder', 'v_mu_encoder', 'v_mu_decoder', 'v_sigma_supervised_encoder', 'v_sigma_supervised_decoder', 'v_sigma_encoder', 'v_sigma_decoder', 'hp_metric', 'epoch', 'step']. HINT: Did you call `log('v_nll_unsupervised', value)` in the `LightningModule`?
The issue seems to be that the `v_nll_unsupervised` metric was not logged with the `log(...)` method, so the `ModelCheckpoint` callback can't find it.
However, although I don't log this metric at every validation step, it is logged at least once per validation epoch. Since I log metrics with `on_step=False, on_epoch=True`, I would expect the whole validation epoch to finish before the `ModelCheckpoint` callback tries to access this metric, in which case the metric would exist and no error would be raised.
Nonetheless, the metric seems to be accessed right after the first validation iteration.
I thought this might be caused by the sanity-checking pass that runs when training starts, but setting `num_sanity_val_steps=0` or `num_sanity_val_steps=-1` in the `Trainer` did not solve the problem.
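For reference, here is a minimal sketch of the kind of setup described above. It uses a toy model, random data, and a hypothetical condition on `batch_idx` to mimic a metric that is only logged on some validation batches; it is not the actual model and has not been verified to reproduce the error.

```python
# Minimal sketch of the reported setup (toy model, random data, hypothetical
# condition on batch_idx) -- not the actual model from this report.
import torch
import torch.nn.functional as F
import pytorch_lightning as pl
from pytorch_lightning.callbacks import ModelCheckpoint
from torch.utils.data import DataLoader, TensorDataset


class ToyModule(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(8, 1)

    def training_step(self, batch, batch_idx):
        x, y = batch
        loss = F.mse_loss(self.layer(x), y)
        self.log("t_mse", loss)
        return loss

    def validation_step(self, batch, batch_idx):
        x, y = batch
        loss = F.mse_loss(self.layer(x), y)
        # Metric logged on every validation batch, aggregated per epoch.
        self.log("v_mse", loss, on_step=False, on_epoch=True)
        # Metric logged only on some validation batches, but at least once
        # per validation epoch, also aggregated per epoch.
        if batch_idx % 2 == 0:
            self.log("v_nll_unsupervised", loss, on_step=False, on_epoch=True)

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=0.1)


if __name__ == "__main__":
    data = TensorDataset(torch.randn(64, 8), torch.randn(64, 1))
    loader = DataLoader(data, batch_size=8)

    checkpoint_cb = ModelCheckpoint(monitor="v_nll_unsupervised")
    trainer = pl.Trainer(
        max_epochs=2,
        callbacks=[checkpoint_cb],
        num_sanity_val_steps=0,  # disabling sanity checking did not help
    )
    # First run: trains, validates, and saves a checkpoint.
    trainer.fit(ToyModule(), train_dataloaders=loader, val_dataloaders=loader)

    # Resuming from the saved checkpoint is where the MisconfigurationException
    # reportedly appears.
    trainer_resume = pl.Trainer(
        max_epochs=4,
        callbacks=[ModelCheckpoint(monitor="v_nll_unsupervised")],
    )
    trainer_resume.fit(
        ToyModule(),
        train_dataloaders=loader,
        val_dataloaders=loader,
        ckpt_path=checkpoint_cb.best_model_path,
    )
```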
What version are you seeing the problem on?
v2.1
How to reproduce the bug
No response
Error messages and logs
lightning_fabric.utilities.exceptions.MisconfigurationException: `ModelCheckpoint(monitor='v_nll_unsupervised')` could not find the monitored key in the returned metrics:
['v_nll_supervised_encoder', 'v_nll_supervised_decoder', 'v_nll_supervised', 'v_nll', 'v_nll_supervised_encoder_clip', 'v_nll_supervised_decoder_clip', 'v_nll_supervised_clip', 'v_nll_clip', 'v_mse_supervised_encoder', 'v_mse_supervised_decoder', 'v_mse_encoder', 'v_mse_decoder', 'v_mse', 'v_mse_supervised_encoder_clip', 'v_mse_supervised_decoder_clip', 'v_mse_encoder_clip', 'v_mse_decoder_clip', 'v_mse_clip', 'v_baseline_l_mse_supervised', 'v_baseline_l_mse', 'v_baseline_prior_mse_supervised', 'v_baseline_prior_mse', 'v_mu_supervised_encoder', 'v_mu_supervised_decoder', 'v_mu_encoder', 'v_mu_decoder', 'v_sigma_supervised_encoder', 'v_sigma_supervised_decoder', 'v_sigma_encoder', 'v_sigma_decoder', 'hp_metric', 'epoch', 'step']. HINT: Did you call `log('v_nll_unsupervised', value)` in the `LightningModule`?
Environment
Current environment
CUDA:
- GPU:
  - Tesla V100-PCIE-16GB
  - Tesla V100-PCIE-16GB
- available: True
- version: 11.7
Lightning:
- lightning-cloud: 0.5.37
- lightning-utilities: 0.8.0
- pytorch-lightning: 2.1.0
- pytorch-ranger: 0.1.1
- torch: 2.0.1
- torch-optimizer: 0.3.0
- torch-scatter: 2.1.1
- torchmetrics: 0.11.4
Packages:
- absl-py: 1.4.0
- aiohttp: 3.8.4
- aiosignal: 1.3.1
- ansicolors: 1.1.8
- antlr4-python3-runtime: 4.7.2
- anyio: 3.7.1
- arrow: 1.2.3
- async-timeout: 4.0.2
- attrs: 23.1.0
- backoff: 2.2.1
- beautifulsoup4: 4.12.2
- blessed: 1.20.0
- boto: 2.49.0
- cachetools: 5.3.1
- certifi: 2023.5.7
- charset-normalizer: 3.1.0
- click: 8.1.3
- cmake: 3.26.4
- contourpy: 1.1.0
- croniter: 1.4.1
- cycler: 0.11.0
- dateutils: 0.6.12
- deepdiff: 6.3.1
- exceptiongroup: 1.1.2
- fastapi: 0.100.0
- filelock: 3.12.2
- fonttools: 4.40.0
- frozenlist: 1.3.3
- fsspec: 2023.6.0
- google-auth: 2.20.0
- google-auth-oauthlib: 1.0.0
- gprof2dot: 2022.7.29
- graphviz: 0.20.1
- grpcio: 1.51.3
- h11: 0.14.0
- idna: 3.4
- importlib-metadata: 6.7.0
- importlib-resources: 5.12.0
- inquirer: 3.1.3
- itsdangerous: 2.1.2
- jinja2: 3.1.2
- joblib: 1.2.0
- jsonschema: 4.17.3
- kiwisolver: 1.4.4
- lifted-pddl: 1.2.2
- lightning-cloud: 0.5.37
- lightning-utilities: 0.8.0
- lit: 16.0.6
- markdown: 3.4.3
- markdown-it-py: 3.0.0
- markupsafe: 2.1.3
- matplotlib: 3.7.1
- mdurl: 0.1.2
- mpmath: 1.3.0
- msgpack: 1.0.5
- multidict: 6.0.4
- multipledispatch: 0.6.0
- mypy: 1.3.0
- mypy-extensions: 1.0.0
- networkx: 3.1
- numpy: 1.25.0
- nvidia-cublas-cu11: 11.10.3.66
- nvidia-cuda-cupti-cu11: 11.7.101
- nvidia-cuda-nvrtc-cu11: 11.7.99
- nvidia-cuda-runtime-cu11: 11.7.99
- nvidia-cudnn-cu11: 8.5.0.96
- nvidia-cufft-cu11: 10.9.0.58
- nvidia-curand-cu11: 10.2.10.91
- nvidia-cusolver-cu11: 11.4.0.1
- nvidia-cusparse-cu11: 11.7.4.91
- nvidia-nccl-cu11: 2.14.3
- nvidia-nvtx-cu11: 11.7.91
- oauthlib: 3.2.2
- ordered-set: 4.1.0
- packaging: 23.1
- pandas: 2.0.2
- pddl-generators: 1.0
- pillow: 9.5.0
- pip: 23.1.2
- protobuf: 4.23.3
- psutil: 5.9.5
- pyarrow: 12.0.1
- pyasn1: 0.5.0
- pyasn1-modules: 0.3.0
- pydantic: 1.10.11
- pygments: 2.15.1
- pyjwt: 2.7.0
- pynvml: 11.5.0
- pyparsing: 3.1.0
- pyperplan: 2.1
- pyrsistent: 0.19.3
- python-dateutil: 2.8.2
- python-editor: 1.0.4
- python-multipart: 0.0.6
- pytorch-lightning: 2.1.0
- pytorch-ranger: 0.1.1
- pytz: 2023.3
- pyyaml: 6.0
- ray: 2.5.0
- readchar: 4.0.5
- requests: 2.31.0
- requests-oauthlib: 1.3.1
- rich: 13.4.2
- rsa: 4.9
- scikit-learn: 1.2.2
- scipy: 1.10.1
- seaborn: 0.12.2
- setuptools: 67.7.2
- six: 1.16.0
- snakeviz: 2.2.0
- sniffio: 1.3.0
- soupsieve: 2.4.1
- stable-trunc-gaussian: 1.3.9
- starlette: 0.27.0
- starsessions: 1.3.0
- strips-hgn: 1.0
- sympy: 1.12
- tarski: 0.8.2
- tensorboard: 2.16.2
- tensorboard-data-server: 0.7.1
- tensorboardx: 2.6.1
- threadpoolctl: 3.1.0
- tomli: 2.0.1
- torch: 2.0.1
- torch-optimizer: 0.3.0
- torch-scatter: 2.1.1
- torchmetrics: 0.11.4
- tornado: 6.3.3
- tqdm: 4.65.0
- traitlets: 5.9.0
- triton: 2.0.0
- typing-extensions: 4.6.3
- tzdata: 2023.3
- urllib3: 1.26.16
- uvicorn: 0.23.0
- wcwidth: 0.2.6
- websocket-client: 1.6.1
- websockets: 11.0.3
- werkzeug: 2.3.6
- wheel: 0.40.0
- yarl: 1.9.2
- z3: 0.2.0
- zipp: 3.15.0
System:
- OS: Linux
- architecture:
  - 64bit
  - ELF
- processor: x86_64
- python: 3.9.16
- release: 5.4.0-174-generic
- version: #193-Ubuntu SMP Thu Mar 7 14:29:28 UTC 2024
More info
No response
cc @carmocca @awaelchli