ModelCheckpoint could not find key in returned metrics #20046

Open
TheAeryan opened this issue Jul 4, 2024 · 0 comments
Labels
bug, callback: model checkpoint, help wanted, ver: 2.1.x

TheAeryan commented Jul 4, 2024

Bug description

I have a model with several ModelCheckpoint callbacks. When loading it from a checkpoint using trainer.fit(model, datamodule=dm, ckpt_path=training_ckpt_path), I get the following error:

lightning_fabric.utilities.exceptions.MisconfigurationException: `ModelCheckpoint(monitor='v_nll_unsupervised')` could not find the monitored key in the returned metrics:
['v_nll_supervised_encoder', 'v_nll_supervised_decoder', 'v_nll_supervised', 'v_nll',
 'v_nll_supervised_encoder_clip', 'v_nll_supervised_decoder_clip', 'v_nll_supervised_clip', 'v_nll_clip',
 'v_mse_supervised_encoder', 'v_mse_supervised_decoder', 'v_mse_encoder', 'v_mse_decoder', 'v_mse',
 'v_mse_supervised_encoder_clip', 'v_mse_supervised_decoder_clip', 'v_mse_encoder_clip', 'v_mse_decoder_clip', 'v_mse_clip',
 'v_baseline_l_mse_supervised', 'v_baseline_l_mse', 'v_baseline_prior_mse_supervised', 'v_baseline_prior_mse',
 'v_mu_supervised_encoder', 'v_mu_supervised_decoder', 'v_mu_encoder', 'v_mu_decoder',
 'v_sigma_supervised_encoder', 'v_sigma_supervised_decoder', 'v_sigma_encoder', 'v_sigma_decoder',
 'hp_metric', 'epoch', 'step'].
HINT: Did you call `log('v_nll_unsupervised', value)` in the `LightningModule`?

The error suggests that the v_nll_unsupervised metric was never logged with the log(...) method, so the ModelCheckpoint callback cannot find it.
However, although I do not log this metric at every validation step, it is logged at least once in every validation epoch.
Since I log metrics with on_step=False, on_epoch=True, I would expect the whole validation epoch to finish before the ModelCheckpoint callback accesses this metric, in which case the key would exist and no error would be raised.
Instead, the metric appears to be accessed right after the first validation iteration.

I thought this might be caused by the sanity check that runs when training starts. However, setting num_sanity_val_steps=0 or num_sanity_val_steps=-1 in the Trainer did not fix the error.
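
For anyone hitting the same error: the report does not confirm the root cause, but if the problem is that the monitored key is simply absent whenever the callback runs, one possible workaround is to accumulate the sparse metric manually and report every expected key once per validation epoch. The sketch below is plain Python and does not call Lightning; the class name `EpochMeans` and its methods are hypothetical helpers, not part of any library.

```python
import math
from typing import Dict, Iterable, Mapping


class EpochMeans:
    """Hypothetical helper: weighted running means for a fixed set of metric keys."""

    def __init__(self, expected_keys: Iterable[str]) -> None:
        # Register every key up front so it is always present in the result.
        self._sums: Dict[str, float] = {key: 0.0 for key in expected_keys}
        self._counts: Dict[str, int] = {key: 0 for key in self._sums}

    def update(self, values: Mapping[str, float], batch_size: int = 1) -> None:
        # Only the keys produced by this batch are updated; others are untouched.
        for key, value in values.items():
            self._sums[key] = self._sums.get(key, 0.0) + float(value) * batch_size
            self._counts[key] = self._counts.get(key, 0) + batch_size

    def compute(self) -> Dict[str, float]:
        # Keys that never received a value are reported as NaN instead of missing.
        return {
            key: (self._sums[key] / self._counts[key]) if self._counts[key] else math.nan
            for key in self._sums
        }

    def reset(self) -> None:
        for key in self._sums:
            self._sums[key] = 0.0
            self._counts[key] = 0


acc = EpochMeans(["v_nll_unsupervised", "v_nll_supervised"])
acc.update({"v_nll_supervised": 2.0}, batch_size=4)   # batch without unsupervised data
acc.update({"v_nll_supervised": 4.0}, batch_size=4)
epoch_metrics = acc.compute()
```

In a LightningModule, `update(...)` could be called from `validation_step`, and the result of `compute()` passed to `self.log_dict(...)` in `on_validation_epoch_end`, followed by `reset()`. Whether a NaN value is acceptable for a monitored key is a separate question that this sketch does not answer.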

What version are you seeing the problem on?

v2.1

How to reproduce the bug

No response

Error messages and logs

lightning_fabric.utilities.exceptions.MisconfigurationException: `ModelCheckpoint(monitor='v_nll_unsupervised')` could not find the monitored key in the returned metrics:
['v_nll_supervised_encoder', 'v_nll_supervised_decoder', 'v_nll_supervised', 'v_nll',
 'v_nll_supervised_encoder_clip', 'v_nll_supervised_decoder_clip', 'v_nll_supervised_clip', 'v_nll_clip',
 'v_mse_supervised_encoder', 'v_mse_supervised_decoder', 'v_mse_encoder', 'v_mse_decoder', 'v_mse',
 'v_mse_supervised_encoder_clip', 'v_mse_supervised_decoder_clip', 'v_mse_encoder_clip', 'v_mse_decoder_clip', 'v_mse_clip',
 'v_baseline_l_mse_supervised', 'v_baseline_l_mse', 'v_baseline_prior_mse_supervised', 'v_baseline_prior_mse',
 'v_mu_supervised_encoder', 'v_mu_supervised_decoder', 'v_mu_encoder', 'v_mu_decoder',
 'v_sigma_supervised_encoder', 'v_sigma_supervised_decoder', 'v_sigma_encoder', 'v_sigma_decoder',
 'hp_metric', 'epoch', 'step'].
HINT: Did you call `log('v_nll_unsupervised', value)` in the `LightningModule`?
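
When the key list in the message is this long, it can help to check mechanically which logged keys are closest to the monitored one. The following is a minimal diagnostic sketch using only the standard library; the function names `parse_missing_monitor` and `closest_logged_keys` are hypothetical, and the sample message is a shortened version of the one above.

```python
import ast
import difflib
import re
from typing import List, Tuple


def parse_missing_monitor(message: str) -> Tuple[str, List[str]]:
    """Extract the monitored key and the list of logged keys from the exception text."""
    # Remove hard line wraps, which may have split identifiers in a pasted log.
    flat = message.replace("\n", "")
    monitor = re.search(r"monitor='([^']+)'", flat).group(1)
    # The logged keys are printed as a Python list literal.
    keys = ast.literal_eval(re.search(r"\[.*?\]", flat).group(0))
    return monitor, keys


def closest_logged_keys(monitor: str, keys: List[str], n: int = 3) -> List[str]:
    """Return up to n logged keys that are most similar to the monitored key."""
    return difflib.get_close_matches(monitor, keys, n=n, cutoff=0.6)


sample = (
    "`ModelCheckpoint(monitor='v_nll_unsupervised')` could not find the monitored key "
    "in the returned metrics: ['v_nll_supervised', 'v_nll', 'v_mse', 'epoch', 'step']. "
    "HINT: Did you call `log('v_nll_unsupervised', value)` in the `LightningModule`?"
)
monitor, keys = parse_missing_monitor(sample)
closest = closest_logged_keys(monitor, keys)
```

This only confirms that the monitored key is missing and shows near matches such as the supervised variant; it does not explain why the key was absent at the time the callback ran.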

Environment

Current environment
  • CUDA:
    - GPU:
      - Tesla V100-PCIE-16GB
      - Tesla V100-PCIE-16GB
    - available: True
    - version: 11.7
  • Lightning:
    - lightning-cloud: 0.5.37
    - lightning-utilities: 0.8.0
    - pytorch-lightning: 2.1.0
    - pytorch-ranger: 0.1.1
    - torch: 2.0.1
    - torch-optimizer: 0.3.0
    - torch-scatter: 2.1.1
    - torchmetrics: 0.11.4
  • Packages:
    - absl-py: 1.4.0
    - aiohttp: 3.8.4
    - aiosignal: 1.3.1
    - ansicolors: 1.1.8
    - antlr4-python3-runtime: 4.7.2
    - anyio: 3.7.1
    - arrow: 1.2.3
    - async-timeout: 4.0.2
    - attrs: 23.1.0
    - backoff: 2.2.1
    - beautifulsoup4: 4.12.2
    - blessed: 1.20.0
    - boto: 2.49.0
    - cachetools: 5.3.1
    - certifi: 2023.5.7
    - charset-normalizer: 3.1.0
    - click: 8.1.3
    - cmake: 3.26.4
    - contourpy: 1.1.0
    - croniter: 1.4.1
    - cycler: 0.11.0
    - dateutils: 0.6.12
    - deepdiff: 6.3.1
    - exceptiongroup: 1.1.2
    - fastapi: 0.100.0
    - filelock: 3.12.2
    - fonttools: 4.40.0
    - frozenlist: 1.3.3
    - fsspec: 2023.6.0
    - google-auth: 2.20.0
    - google-auth-oauthlib: 1.0.0
    - gprof2dot: 2022.7.29
    - graphviz: 0.20.1
    - grpcio: 1.51.3
    - h11: 0.14.0
    - idna: 3.4
    - importlib-metadata: 6.7.0
    - importlib-resources: 5.12.0
    - inquirer: 3.1.3
    - itsdangerous: 2.1.2
    - jinja2: 3.1.2
    - joblib: 1.2.0
    - jsonschema: 4.17.3
    - kiwisolver: 1.4.4
    - lifted-pddl: 1.2.2
    - lightning-cloud: 0.5.37
    - lightning-utilities: 0.8.0
    - lit: 16.0.6
    - markdown: 3.4.3
    - markdown-it-py: 3.0.0
    - markupsafe: 2.1.3
    - matplotlib: 3.7.1
    - mdurl: 0.1.2
    - mpmath: 1.3.0
    - msgpack: 1.0.5
    - multidict: 6.0.4
    - multipledispatch: 0.6.0
    - mypy: 1.3.0
    - mypy-extensions: 1.0.0
    - networkx: 3.1
    - numpy: 1.25.0
    - nvidia-cublas-cu11: 11.10.3.66
    - nvidia-cuda-cupti-cu11: 11.7.101
    - nvidia-cuda-nvrtc-cu11: 11.7.99
    - nvidia-cuda-runtime-cu11: 11.7.99
    - nvidia-cudnn-cu11: 8.5.0.96
    - nvidia-cufft-cu11: 10.9.0.58
    - nvidia-curand-cu11: 10.2.10.91
    - nvidia-cusolver-cu11: 11.4.0.1
    - nvidia-cusparse-cu11: 11.7.4.91
    - nvidia-nccl-cu11: 2.14.3
    - nvidia-nvtx-cu11: 11.7.91
    - oauthlib: 3.2.2
    - ordered-set: 4.1.0
    - packaging: 23.1
    - pandas: 2.0.2
    - pddl-generators: 1.0
    - pillow: 9.5.0
    - pip: 23.1.2
    - protobuf: 4.23.3
    - psutil: 5.9.5
    - pyarrow: 12.0.1
    - pyasn1: 0.5.0
    - pyasn1-modules: 0.3.0
    - pydantic: 1.10.11
    - pygments: 2.15.1
    - pyjwt: 2.7.0
    - pynvml: 11.5.0
    - pyparsing: 3.1.0
    - pyperplan: 2.1
    - pyrsistent: 0.19.3
    - python-dateutil: 2.8.2
    - python-editor: 1.0.4
    - python-multipart: 0.0.6
    - pytorch-lightning: 2.1.0
    - pytorch-ranger: 0.1.1
    - pytz: 2023.3
    - pyyaml: 6.0
    - ray: 2.5.0
    - readchar: 4.0.5
    - requests: 2.31.0
    - requests-oauthlib: 1.3.1
    - rich: 13.4.2
    - rsa: 4.9
    - scikit-learn: 1.2.2
    - scipy: 1.10.1
    - seaborn: 0.12.2
    - setuptools: 67.7.2
    - six: 1.16.0
    - snakeviz: 2.2.0
    - sniffio: 1.3.0
    - soupsieve: 2.4.1
    - stable-trunc-gaussian: 1.3.9
    - starlette: 0.27.0
    - starsessions: 1.3.0
    - strips-hgn: 1.0
    - sympy: 1.12
    - tarski: 0.8.2
    - tensorboard: 2.16.2
    - tensorboard-data-server: 0.7.1
    - tensorboardx: 2.6.1
    - threadpoolctl: 3.1.0
    - tomli: 2.0.1
    - torch: 2.0.1
    - torch-optimizer: 0.3.0
    - torch-scatter: 2.1.1
    - torchmetrics: 0.11.4
    - tornado: 6.3.3
    - tqdm: 4.65.0
    - traitlets: 5.9.0
    - triton: 2.0.0
    - typing-extensions: 4.6.3
    - tzdata: 2023.3
    - urllib3: 1.26.16
    - uvicorn: 0.23.0
    - wcwidth: 0.2.6
    - websocket-client: 1.6.1
    - websockets: 11.0.3
    - werkzeug: 2.3.6
    - wheel: 0.40.0
    - yarl: 1.9.2
    - z3: 0.2.0
    - zipp: 3.15.0
  • System:
    - OS: Linux
    - architecture:
      - 64bit
      - ELF
    - processor: x86_64
    - python: 3.9.16
    - release: 5.4.0-174-generic
    - version: #193-Ubuntu SMP Thu Mar 7 14:29:28 UTC 2024

More info

No response

cc @carmocca @awaelchli

TheAeryan added the "bug" and "needs triage" labels on Jul 4, 2024
awaelchli added the "help wanted" and "callback: model checkpoint" labels and removed the "needs triage" label on Jul 5, 2024

2 participants