
Fail to distribute training on GCP TPU Pod+Object of type device is not JSON serializable Error when saving the model #31821

Closed
ayukh opened this issue Jul 6, 2024 · 3 comments

@ayukh

ayukh commented Jul 6, 2024

Hi,

I am trying to finetune an LLM on a v4-64 GCP TPU Pod. However, when I launch the training script, it runs 8 times, once on each host (worker) separately, and the job is not distributed across all workers. According to the torch XLA documentation, this process should be handled by the package automatically, but it does not work in my case. The training script saves a model.bin file as a result, and I am not sure whether that is the same model duplicated across workers or not. I have also tried using accelerate, but it fails to even find an XLA device.
For reference, I use the following command to run the script:
gcloud compute tpus tpu-vm ssh tpu-llama3-test --zone=us-central2-b --worker=all --command='export global PJRT_DEVICE=TPU XLA_USE_SPMD=1 XLA_USE_BF16=1 XLA_TENSOR_ALLOCATOR_MAX_SIZE=1000000; python training_script.py'
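For completeness, a minimal check along these lines can be run on each worker to confirm the XLA runtime is visible under PJRT (diagnostic only, not part of the training script; names follow the torch_xla 2.3 runtime API):

```python
# Diagnostic only: report the TPU device and the process layout PJRT assigns.
import torch_xla.core.xla_model as xm
import torch_xla.runtime as xr

print("xla device:    ", xm.xla_device())
print("world size:    ", xr.world_size())          # total processes across all hosts
print("global ordinal:", xr.global_ordinal())      # rank of this process
print("local devices: ", xr.local_device_count())  # devices visible on this host
```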

Package versions installed:

  • torch~=2.3.1
  • torch_xla[tpu]~=2.3.0
  • datasets==2.20.0
  • transformers==4.42.3

Edit: I have tried changing the TPU Pod size and the number of simultaneous runs changed accordingly: after switching to v4-32 I get 4 simultaneous runs, matching the 4 hosts in total. I assume this is how it should work (number of runs = number of hosts), am I correct? Unfortunately, the issue with accelerate not recognizing the XLA device still remains.
Also, the training script finishes training the model but throws an error at the end:

Traceback (most recent call last):
  File "/home/user_name/train_llama.py", line 254, in <module>
    main()
  File "/home/user_name/train_llama.py", line 246, in main
    trainer.train()
  File "/home/user_name/.local/lib/python3.10/site-packages/trl/trainer/sft_trainer.py", line 440, in train
    output = super().train(*args, **kwargs)
  File "/home/user_name/.local/lib/python3.10/site-packages/transformers/trainer.py", line 1932, in train
    return inner_training_loop(
  File "/home/user_name/.local/lib/python3.10/site-packages/transformers/trainer.py", line 2345, in _inner_training_loop
    self._maybe_log_save_evaluate(tr_loss, grad_norm, model, trial, epoch, ignore_keys_for_eval)
  File "/home/user_name/.local/lib/python3.10/site-packages/transformers/trainer.py", line 2796, in _maybe_log_save_evaluate
    self._save_checkpoint(model, trial, metrics=metrics)
  File "/home/user_name/.local/lib/python3.10/site-packages/transformers/trainer.py", line 2875, in _save_checkpoint
    self.save_model(output_dir, _internal_call=True)
  File "/home/user_name/.local/lib/python3.10/site-packages/transformers/trainer.py", line 3395, in save_model
    self._save_tpu(output_dir)
  File "/home/user_name/.local/lib/python3.10/site-packages/transformers/trainer.py", line 3472, in _save_tpu
    self.tokenizer.save_pretrained(output_dir)
  File "/home/user_name/.local/lib/python3.10/site-packages/transformers/tokenization_utils_base.py", line 2572, in save_pretrained
    out_str = json.dumps(tokenizer_config, indent=2, sort_keys=True, ensure_ascii=False) + "\n"
  File "/usr/lib/python3.10/json/__init__.py", line 238, in dumps
    **kw).encode(obj)
  File "/usr/lib/python3.10/json/encoder.py", line 201, in encode
    chunks = list(chunks)
  File "/usr/lib/python3.10/json/encoder.py", line 431, in _iterencode
    yield from _iterencode_dict(o, _current_indent_level)
  File "/usr/lib/python3.10/json/encoder.py", line 405, in _iterencode_dict
    yield from chunks
  File "/usr/lib/python3.10/json/encoder.py", line 438, in _iterencode
    o = _default(o)
  File "/usr/lib/python3.10/json/encoder.py", line 179, in default
    raise TypeError(f'Object of type {o.__class__.__name__} '
TypeError: Object of type device is not JSON serializable
wandb: \ 0.044 MB of 0.044 MB uploaded

Thanks!

@amyeroberts
Collaborator

cc @muellerzr @SunMarc

amyeroberts added the TPU label on Jul 6, 2024
ayukh changed the title from "Fail to distribute training on GCP TPU Pod" to "Fail to distribute training on GCP TPU Pod+Object of type device is not JSON serializable Error when saving the model" on Jul 7, 2024
@ayukh
Author

ayukh commented Jul 10, 2024

Update: I checked the worker names in the W&B runs and they correspond to all the workers I have available. I am not sure whether all the cores on each worker are used. I checked the Trainer source and can see that distributed training is implemented. However, I don't see xmp.spawn() anywhere, although you provide a separate example script for running the Trainer on multiple TPU cores here. Using that script, I now see 32 runs spawned in the logs (4 per worker, equal to the total number of chips, 32). I can also see that CPU utilization has risen to 70% with it, whereas without it it was only 5%.
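For context, my rough understanding of the spawn-based launch in that script boils down to something like the sketch below (my own minimal reconstruction, not code copied from the example script):

```python
# Minimal sketch of a per-host launcher built on xmp.spawn, which starts one
# process per local TPU device on the host it runs on.
import torch_xla.core.xla_model as xm
import torch_xla.distributed.xla_multiprocessing as xmp

def _mp_fn(index):
    # index is the ordinal torch_xla assigns to this spawned process; in the real
    # script this is where the Trainer-based training function would be called.
    print(f"process {index} running on device {xm.xla_device()}")

if __name__ == "__main__":
    xmp.spawn(_mp_fn, args=())
```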

  1. Am I supposed to use the script with spawning to get maximum compute utilization?
  2. I am not sure why multiple runs are spawned for every worker; is this common for TPU?
  3. The TypeError: Object of type device is not JSON serializable error still occurs when saving the tokenizer after every save_steps steps, which I assume is caused purely by running on TPU. With this error I can't continue training, since the code crashes after the first model save. I don't see why a device should end up in the tokenizer config at all, so is it possible to modify the source code to handle that? (A sketch of the kind of workaround I mean is below.)
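For reference, the kind of pre-save cleanup I have in mind is sketched below. It assumes the offending torch.device value ends up in the tokenizer's init_kwargs (which is what save_pretrained serializes into tokenizer_config.json); I have not verified that this is actually where the device comes from.

```python
# Hypothetical workaround, not an official fix: stringify any torch.device values
# in the tokenizer's init kwargs so that json.dumps inside save_pretrained no
# longer hits a non-serializable object.
import torch

def sanitize_tokenizer_config(tokenizer):
    for key, value in list(tokenizer.init_kwargs.items()):
        if isinstance(value, torch.device):
            tokenizer.init_kwargs[key] = str(value)  # e.g. "xla:0"
    return tokenizer
```

Calling this once on the tokenizer before handing it to the Trainer should be enough, if init_kwargs is indeed where the device object comes from.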

@ayukh
Author

ayukh commented Jul 15, 2024

Closing the issue due to inactivity and because the training distribution is resolved; I will open a better-documented issue about the TypeError bug when saving the model on TPU.

ayukh closed this as completed on Jul 15, 2024