I am trying to finetune an LLM on a v4-64 GCP TPU Pod. However, when I launch the training script, it runs 8 times on each host (worker) separately and the job is not distributed across all workers. According to the torch XLA documentation, this should be handled by the package automatically, but it does not work in my case. The training script saves a model.bin file as a result, and I am not sure whether that is the same model duplicated across workers or not. I have also tried using accelerate, but it fails to even find an XLA device.
I use the following command to run the script for reference: gcloud compute tpus tpu-vm ssh tpu-llama3-test --zone=us-central2-b --worker=all --command='export global PJRT_DEVICE=TPU XLA_USE_SPMD=1 XLA_USE_BF16=1 XLA_TENSOR_ALLOCATOR_MAX_SIZE=1000000; python training_script.py'
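For completeness, inside the training script I also run a small sanity check to confirm that the environment variables are picked up and an XLA device is actually visible on each host (a minimal sketch, not the full script):

import os
import torch_xla.core.xla_model as xm

# The PJRT runtime must be selected before torch_xla initializes.
print("PJRT_DEVICE =", os.environ.get("PJRT_DEVICE"))

# Should return an XLA device such as xla:0 on every host;
# if this raises, the runtime is not configured correctly.
device = xm.xla_device()
print("XLA device:", device)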
Package versions installed:
torch~=2.3.1
torch_xla[tpu]~=2.3.0
datasets==2.20.0
transformers==4.42.3
Edit: I have tried changing the TPU Pod size, and the number of simultaneous runs changed too - I switched to v4-32 and now have 4 simultaneous runs, given there are 4 hosts in total. I assume this is how it should be working (number of runs = number of hosts), am I correct? The issue with accelerate not recognizing the XLA device still remains, unfortunately.
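To see how the processes are actually laid out, I print the process indices from the torch_xla runtime at startup (a rough sketch; comparing the output across hosts should show whether there is one process per host or one per chip):

import torch_xla.core.xla_model as xm
import torch_xla.runtime as xr

# Printed once per process; the ordinals reveal how many processes
# run per host and how large the global world actually is.
print(
    f"global ordinal {xr.global_ordinal()} / world size {xr.world_size()}, "
    f"local ordinal {xr.local_ordinal()}, device {xm.xla_device()}"
)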
Also, the training script finishes training the model but throws an error at the end:
Traceback (most recent call last):
File "/home/user_name/train_llama.py", line 254, in <module>
main()
File "/home/user_name/train_llama.py", line 246, in main
trainer.train()
File "/home/user_name/.local/lib/python3.10/site-packages/trl/trainer/sft_trainer.py", line 440, in train
output = super().train(*args, **kwargs)
File "/home/user_name/.local/lib/python3.10/site-packages/transformers/trainer.py", line 1932, in train
return inner_training_loop(
File "/home/user_name/.local/lib/python3.10/site-packages/transformers/trainer.py", line 2345, in _inner_training_loop
self._maybe_log_save_evaluate(tr_loss, grad_norm, model, trial, epoch, ignore_keys_for_eval)
File "/home/user_name/.local/lib/python3.10/site-packages/transformers/trainer.py", line 2796, in _maybe_log_save_evaluate
self._save_checkpoint(model, trial, metrics=metrics)
File "/home/user_name/.local/lib/python3.10/site-packages/transformers/trainer.py", line 2875, in _save_checkpoint
self.save_model(output_dir, _internal_call=True)
File "/home/user_name/.local/lib/python3.10/site-packages/transformers/trainer.py", line 3395, in save_model
self._save_tpu(output_dir)
File "/home/user_name/.local/lib/python3.10/site-packages/transformers/trainer.py", line 3472, in _save_tpu
self.tokenizer.save_pretrained(output_dir)
File "/home/user_name/.local/lib/python3.10/site-packages/transformers/tokenization_utils_base.py", line 2572, in save_pretrained
out_str = json.dumps(tokenizer_config, indent=2, sort_keys=True, ensure_ascii=False) + "\n"
File "/usr/lib/python3.10/json/__init__.py", line 238, in dumps
**kw).encode(obj)
File "/usr/lib/python3.10/json/encoder.py", line 201, in encode
chunks = list(chunks)
File "/usr/lib/python3.10/json/encoder.py", line 431, in _iterencode
yield from _iterencode_dict(o, _current_indent_level)
File "/usr/lib/python3.10/json/encoder.py", line 405, in _iterencode_dict
yield from chunks
File "/usr/lib/python3.10/json/encoder.py", line 438, in _iterencode
o = _default(o)
File "/usr/lib/python3.10/json/encoder.py", line 179, in default
raise TypeError(f'Object of type {o.__class__.__name__} '
TypeError: Object of type device is not JSON serializable
wandb: \ 0.044 MB of 0.044 MB uploaded
Thanks!
Update: I checked the worker names in the W&B runs - they correspond to all the workers I have available. I am not sure whether all the cores per worker are used. I checked the Trainer source and can see that distributed training is implemented. However, I don't see xmp.spawn() anywhere, although you provide a separate example script for running Trainer on multiple TPU cores here. I tried using this script and now see 32 runs spawned in the logging (4 per worker, which equals the total number of chips = 32). I can also see that CPU utilization has risen to 70% with it, whereas without it it was only 5%.
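For reference, the spawning pattern I took from that example looks roughly like this (a minimal sketch; _mp_fn and main are placeholders for my own entry points):

import torch_xla.distributed.xla_multiprocessing as xmp

def _mp_fn(index):
    # index is the local process index on this host; every spawned
    # process drives one device and runs the same Trainer code.
    main()

if __name__ == "__main__":
    # With nprocs left unset, the number of processes is inferred
    # from the devices available to this host.
    xmp.spawn(_mp_fn)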
Am I supposed to use the script with spawning for maximum compute utilization?
I am not sure why multiple runs are spawned for every worker - is this common for TPU?
The TypeError: Object of type device is not JSON serializable error is still present when saving the tokenizer at every save_steps interval, which I assume is caused purely by TPU usage. With this error I can't continue training, as the code crashes after the first model save. I don't see why a device should end up in the tokenizer config at all, so is it possible to modify the source code to handle that?
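As a temporary workaround on my side (not a fix in transformers), I strip any torch.device values out of the tokenizer's init kwargs before building the trainer, since save_pretrained copies those kwargs into tokenizer_config.json - a rough sketch, assuming the offending entry really comes from there ("model_name_here" is a placeholder):

import torch
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("model_name_here")

# save_pretrained serializes init_kwargs into tokenizer_config.json,
# which is where the TypeError is raised, so drop any torch.device values.
tokenizer.init_kwargs = {
    k: v for k, v in tokenizer.init_kwargs.items()
    if not isinstance(v, torch.device)
}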
Closing the issue due to inactivity and the resolved training distribution - I will reopen a better documented issue about the TypeError bug when saving the model on TPU.