I am trying to finetune an LLM on a v4-64 GCP TPU Pod. However, when I launch the training script, it runs 8 times on each host (worker) separately and the job is not distributed across all workers. According to the torch XLA documentation, this should be handled by the package automatically, but it does not work in my case. The training script saves a model.bin file as a result, and I am not sure whether that is the same model duplicated across workers or not. I have also tried using accelerate, but it fails to even find an XLA device.
I use the following command to run the script for reference: gcloud compute tpus tpu-vm ssh tpu-llama3-test --zone=us-central2-b --worker=all --command='export global PJRT_DEVICE=TPU XLA_USE_SPMD=1 XLA_USE_BF16=1 XLA_TENSOR_ALLOCATOR_MAX_SIZE=1000000; python training_script.py'
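For completeness, inside the training script I also run a small sanity check to confirm that the environment variables are picked up and an XLA device is actually visible on each host (a minimal sketch, not the full script):

import os
import torch_xla.core.xla_model as xm

# The PJRT runtime must be selected before torch_xla initializes.
print("PJRT_DEVICE =", os.environ.get("PJRT_DEVICE"))

# Should return an XLA device such as xla:0 on every host;
# if this raises, the runtime is not configured correctly.
device = xm.xla_device()
print("XLA device:", device)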
Package versions installed:
torch~=2.3.1
torch_xla[tpu]~=2.3.0
datasets==2.20.0
transformers==4.42.3
Edit: I have tried changing the TPU Pod size, and the number of simultaneous runs changed too - I switched to v4-32 and now have 4 simultaneous runs, given there are 4 hosts in total. I assume this is how it should be working (number of runs = number of hosts), am I correct? The issue with accelerate not recognizing the XLA device still remains, unfortunately.
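To see how the processes are actually laid out, I print the process indices from the torch_xla runtime at startup (a rough sketch; comparing the output across hosts should show whether there is one process per host or one per chip):

import torch_xla.core.xla_model as xm
import torch_xla.runtime as xr

# Printed once per process; the ordinals reveal how many processes
# run per host and how large the global world actually is.
print(
    f"global ordinal {xr.global_ordinal()} / world size {xr.world_size()}, "
    f"local ordinal {xr.local_ordinal()}, device {xm.xla_device()}"
)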
Also, the training script finishes training the model but throws an error at the end:
Traceback (most recent call last):
File "/home/user_name/train_llama.py", line 254, in <module>
main()
File "/home/user_name/train_llama.py", line 246, in main
trainer.train()
File "/home/user_name/.local/lib/python3.10/site-packages/trl/trainer/sft_trainer.py", line 440, in train
output = super().train(*args, **kwargs)
File "/home/user_name/.local/lib/python3.10/site-packages/transformers/trainer.py", line 1932, in train
return inner_training_loop(
File "/home/user_name/.local/lib/python3.10/site-packages/transformers/trainer.py", line 2345, in _inner_training_loop
self._maybe_log_save_evaluate(tr_loss, grad_norm, model, trial, epoch, ignore_keys_for_eval)
File "/home/user_name/.local/lib/python3.10/site-packages/transformers/trainer.py", line 2796, in _maybe_log_save_evaluate
self._save_checkpoint(model, trial, metrics=metrics)
File "/home/user_name/.local/lib/python3.10/site-packages/transformers/trainer.py", line 2875, in _save_checkpoint
self.save_model(output_dir, _internal_call=True)
File "/home/user_name/.local/lib/python3.10/site-packages/transformers/trainer.py", line 3395, in save_model
self._save_tpu(output_dir)
File "/home/user_name/.local/lib/python3.10/site-packages/transformers/trainer.py", line 3472, in _save_tpu
self.tokenizer.save_pretrained(output_dir)
File "/home/user_name/.local/lib/python3.10/site-packages/transformers/tokenization_utils_base.py", line 2572, in save_pretrained
out_str = json.dumps(tokenizer_config, indent=2, sort_keys=True, ensure_ascii=False) + "\n"
File "/usr/lib/python3.10/json/__init__.py", line 238, in dumps
**kw).encode(obj)
File "/usr/lib/python3.10/json/encoder.py", line 201, in encode
chunks = list(chunks)
File "/usr/lib/python3.10/json/encoder.py", line 431, in _iterencode
yield from _iterencode_dict(o, _current_indent_level)
File "/usr/lib/python3.10/json/encoder.py", line 405, in _iterencode_dict
yield from chunks
File "/usr/lib/python3.10/json/encoder.py", line 438, in _iterencode
o = _default(o)
File "/usr/lib/python3.10/json/encoder.py", line 179, in default
raise TypeError(f'Object of type {o.__class__.__name__} '
TypeError: Object of type device is not JSON serializable
wandb: \ 0.044 MB of 0.044 MB uploaded
Thanks!
Update: I checked the worker names in the W&B runs - they correspond to all the workers I have available. I am not sure whether all the cores per worker are used. I checked the Trainer source and can see that distributed training is implemented. However, I don't see xmp.spawn() anywhere, although you provide a separate example script for running Trainer on multiple TPU cores here. I tried using this script and now see 32 runs spawned in the logging (4 per worker, which equals the total number of chips = 32). I can also see that CPU utilization has risen to 70% with it, whereas without it it was only 5%.
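For reference, the spawning pattern I took from that example looks roughly like this (a minimal sketch; _mp_fn and main are placeholders for my own entry points):

import torch_xla.distributed.xla_multiprocessing as xmp

def _mp_fn(index):
    # index is the local process index on this host; every spawned
    # process drives one device and runs the same Trainer code.
    main()

if __name__ == "__main__":
    # With nprocs left unset, the number of processes is inferred
    # from the devices available to this host.
    xmp.spawn(_mp_fn)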
Am I supposed to use the script with spawning for maximum compute utilization?
I am not sure why multiple runs are spawned for every worker - is this common for TPU?
The TypeError: Object of type device is not JSON serializable error is still present when saving the tokenizer at every save_steps interval, which I assume is caused purely by TPU usage. With this error I can't continue training, as the code crashes after the first model save. I don't see why a device should end up in the tokenizer config at all, so is it possible to modify the source code to handle that?
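As a temporary workaround on my side (not a fix in transformers), I strip any torch.device values out of the tokenizer's init kwargs before building the trainer, since save_pretrained copies those kwargs into tokenizer_config.json - a rough sketch, assuming the offending entry really comes from there ("model_name_here" is a placeholder):

import torch
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("model_name_here")

# save_pretrained serializes init_kwargs into tokenizer_config.json,
# which is where the TypeError is raised, so drop any torch.device values.
tokenizer.init_kwargs = {
    k: v for k, v in tokenizer.init_kwargs.items()
    if not isinstance(v, torch.device)
}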
Closing the issue due to inactivity and the resolved training distribution - I will reopen a better documented issue about the TypeError bug when saving the model on TPU.