
[BUG]: ColossalChat train sft is skipped with opt-1.3b model #5865

Open
1 task done
smash1999 opened this issue Jun 27, 2024 · 5 comments
Labels
bug Something isn't working

Comments

@smash1999

Is there an existing issue for this bug?

  • I have searched the existing issues

🐛 Describe the bug

I am using ColossalChat to train an opt-1.3b model. I modified train_sft.sh and ran SFT training. The run finishes with a "success" result, but the progress bar is abnormal: every epoch shows 0 iterations and a "skip evaluation" message, so training appears to be skipped entirely.
My command and log are below:

colossalai run --nproc_per_node 2 train_sft.py \
    --pretrain $PRETRAINED_MODEL_PATH \
    --tokenizer_dir $PRETRAINED_TOKENIZER_PATH \
    --save_interval 4000 \
    --dataset ${dataset[@]} \
    --save_path $SAVE_DIR \
    --config_file $CONFIG_FILE \
    --plugin zero2 \
    --batch_size 1 \
    --max_epochs 10 \
    --accumulation_steps 1 \
    --lr 2e-5 \
    --max_len 512 \
    --grad_checkpoint 
GPU Memory Usage:
     0	272 MiB
     1	11 MiB
Now CUDA_VISIBLE_DEVICES is set to:
CUDA_VISIBLE_DEVICES=1,0
W0627 16:55:30.218000 123273221547840 torch/distributed/run.py:757] 
W0627 16:55:30.218000 123273221547840 torch/distributed/run.py:757] *****************************************
W0627 16:55:30.218000 123273221547840 torch/distributed/run.py:757] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
W0627 16:55:30.218000 123273221547840 torch/distributed/run.py:757] *****************************************
/home/test/anaconda3/envs/colo01/lib/python3.10/site-packages/transformers/utils/generic.py:311: UserWarning: torch.utils._pytree._register_pytree_node is deprecated. Please use torch.utils._pytree.register_pytree_node instead.
  torch.utils._pytree._register_pytree_node(
/home/test/anaconda3/envs/colo01/lib/python3.10/site-packages/transformers/utils/generic.py:311: UserWarning: torch.utils._pytree._register_pytree_node is deprecated. Please use torch.utils._pytree.register_pytree_node instead.
  torch.utils._pytree._register_pytree_node(
/home/test/anaconda3/envs/colo01/lib/python3.10/site-packages/transformers/utils/generic.py:311: UserWarning: torch.utils._pytree._register_pytree_node is deprecated. Please use torch.utils._pytree.register_pytree_node instead.
  torch.utils._pytree._register_pytree_node(
/home/test/anaconda3/envs/colo01/lib/python3.10/site-packages/transformers/utils/generic.py:311: UserWarning: torch.utils._pytree._register_pytree_node is deprecated. Please use torch.utils._pytree.register_pytree_node instead.
  torch.utils._pytree._register_pytree_node(
/home/test/anaconda3/envs/colo01/lib/python3.10/site-packages/colossalai/pipeline/schedule/_utils.py:19: UserWarning: torch.utils._pytree._register_pytree_node is deprecated. Please use torch.utils._pytree.register_pytree_node instead.
  _register_pytree_node(OrderedDict, _odict_flatten, _odict_unflatten)
/home/test/anaconda3/envs/colo01/lib/python3.10/site-packages/colossalai/pipeline/schedule/_utils.py:19: UserWarning: torch.utils._pytree._register_pytree_node is deprecated. Please use torch.utils._pytree.register_pytree_node instead.
  _register_pytree_node(OrderedDict, _odict_flatten, _odict_unflatten)
/home/test/anaconda3/envs/colo01/lib/python3.10/site-packages/torch/utils/_pytree.py:300: UserWarning: <class 'collections.OrderedDict'> is already registered as pytree node. Overwriting the previous registration.
  warnings.warn(
/home/test/anaconda3/envs/colo01/lib/python3.10/site-packages/torch/utils/_pytree.py:300: UserWarning: <class 'collections.OrderedDict'> is already registered as pytree node. Overwriting the previous registration.
  warnings.warn(
/home/test/anaconda3/envs/colo01/lib/python3.10/site-packages/colossalai/shardformer/layer/normalization.py:45: UserWarning: Please install apex from source (https://github.com/NVIDIA/apex) to use the fused layernorm kernel
  warnings.warn("Please install apex from source (https://github.com/NVIDIA/apex) to use the fused layernorm kernel")
/home/test/anaconda3/envs/colo01/lib/python3.10/site-packages/colossalai/shardformer/layer/normalization.py:45: UserWarning: Please install apex from source (https://github.com/NVIDIA/apex) to use the fused layernorm kernel
  warnings.warn("Please install apex from source (https://github.com/NVIDIA/apex) to use the fused layernorm kernel")
/home/test/anaconda3/envs/colo01/lib/python3.10/site-packages/colossalai/initialize.py:48: UserWarning: `config` is deprecated and will be removed soon.
  warnings.warn("`config` is deprecated and will be removed soon.")
[06/27/24 16:55:31] INFO     colossalai - colossalai - INFO:                    
                             /home/test/anaconda3/envs/colo01/lib/python3.10/sit
                             e-packages/colossalai/initialize.py:67 launch      
                    INFO     colossalai - colossalai - INFO: Distributed        
                             environment is initialized, world size: 2          
[06/27/24 16:55:31] INFO     colossalai - colossalai - INFO:                    
                             /home/test/anaconda3/envs/colo01/lib/python3.10/sit
                             e-packages/colossalai/initialize.py:67 launch      
                    INFO     colossalai - colossalai - INFO: Distributed        
                             environment is initialized, world size: 2          
Gradient checkpointing enabled successfully
Configuration file will be saved at: output/-sft-2024-06-27-16-55-29.json
Model checkpoint will be saved at: output/
[extension] Compiling the JIT cpu_adam_x86 kernel during runtime now
[extension] Compiling the JIT cpu_adam_x86 kernel during runtime now
[extension] Time taken to compile cpu_adam_x86 op: 0.030973196029663086 seconds
[extension] Compiling the JIT fused_optim_cuda kernel during runtime now
/home/test/anaconda3/envs/colo01/lib/python3.10/site-packages/torch/utils/cpp_extension.py:1967: UserWarning: TORCH_CUDA_ARCH_LIST is not set, all archs for visible cards are included for compilation. 
If this is not desired, please set os.environ['TORCH_CUDA_ARCH_LIST'].
  warnings.warn(
[extension] Time taken to compile fused_optim_cuda op: 0.040076494216918945 seconds
/home/test/anaconda3/envs/colo01/lib/python3.10/site-packages/colossalai/nn/optimizer/hybrid_adam.py:90: UserWarning: The torch.cuda.*DtypeTensor constructors are no longer recommended. It's best to use methods such as torch.tensor(data, dtype=*, device='cuda') to create tensors. (Triggered internally at ../torch/csrc/tensor/python_tensor.cpp:78.)
  self._dummy_overflow_buf = torch.cuda.IntTensor([0])
[extension] Time taken to compile cpu_adam_x86 op: 0.1013331413269043 seconds
[extension] Compiling the JIT fused_optim_cuda kernel during runtime now
/home/test/anaconda3/envs/colo01/lib/python3.10/site-packages/torch/utils/cpp_extension.py:1967: UserWarning: TORCH_CUDA_ARCH_LIST is not set, all archs for visible cards are included for compilation. 
If this is not desired, please set os.environ['TORCH_CUDA_ARCH_LIST'].
  warnings.warn(
[extension] Time taken to compile fused_optim_cuda op: 0.03631329536437988 seconds
/home/test/anaconda3/envs/colo01/lib/python3.10/site-packages/colossalai/nn/optimizer/hybrid_adam.py:90: UserWarning: The torch.cuda.*DtypeTensor constructors are no longer recommended. It's best to use methods such as torch.tensor(data, dtype=*, device='cuda') to create tensors. (Triggered internally at ../torch/csrc/tensor/python_tensor.cpp:78.)
  self._dummy_overflow_buf = torch.cuda.IntTensor([0])
Max CUDA memory before data loader: 0.00 MB
Max CUDA memory after data loader: 0.00 MB
Warmup steps is set to 0
Booster init max CUDA memory: 5019.22 MB
Booster init max CPU memory: 8468.58 MB
Epochs:   0%|          | 0/10 No eval dataloader is provided, skip evaluation
Epoch 1/10: 0it [00:00, ?it/s]
                              No eval dataloader is provided, skip evaluation
Epoch 2/10: 0it [00:00, ?it/s]
                              No eval dataloader is provided, skip evaluation
Epoch 3/10: 0it [00:00, ?it/s]
Epoch 4/10: 0it [00:00, ?it/s]
No eval dataloader is provided, skip evaluation
Epoch 5/10: 0it [00:00, ?it/s]
No eval dataloader is provided, skip evaluation
Epoch 6/10: 0it [00:00, ?it/s]
No eval dataloader is provided, skip evaluation
Epoch 7/10: 0it [00:00, ?it/s]
No eval dataloader is provided, skip evaluation
Epoch 8/10: 0it [00:00, ?it/s]
No eval dataloader is provided, skip evaluation
Epoch 9/10: 0it [00:00, ?it/s]
No eval dataloader is provided, skip evaluation
Epoch 10/10: 0it [00:00, ?it/s]
No eval dataloader is provided, skip evaluation
Epochs: 100%|██████████| 10/10 [00:00<00:00, 2525.17it/s]Start saving final model checkpoint

Saved final model checkpoint at epoch 10 at folder output/
Max CUDA memory usage: 5019.22 MB

====== Training on All Nodes =====
127.0.0.1: success

====== Stopping All Nodes =====
127.0.0.1: finish

Environment

  1. CPU: Intel platform with Z790 + i9-14900K
  2. GPU: NVIDIA RTX 4090 ×2
  3. Ubuntu: 22.04
  4. Python: 3.10.14
  5. Colossal-AI: 0.3.6
  6. PyTorch: 2.3.0
  7. CUDA: 12.1
smash1999 added the bug label on Jun 27, 2024
@TongLi3701
Member

Could you take a look at your training data loader? Try printing its length to see whether there is actually any data inside.

Here, we will iterate through the train data loader:

for i, batch in enumerate(self.train_dataloader):
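
(As a debugging sketch, not the trainer's actual code: a length check placed just before this loop will print even when the loader is empty, whereas a print inside the loop body never executes for an empty loader.)

# Debugging sketch: print once *before* iterating, so the message
# appears even if the dataloader yields nothing.
print(f"len(train_dataloader) = {len(self.train_dataloader)}")
for i, batch in enumerate(self.train_dataloader):
    ...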

@smash1999
Copy link
Author

How can I get the length of the data loader?
I added code after for i, batch in enumerate(self.train_dataloader): but nothing was printed.
Below is the code I added:
coordinator.print_on_master(f"Length of DataLoader: {len(self.train_dataloader)}")

@TongLi3701
Member

def _before_fit(
    self,
    train_dataloader: DataLoader,
    eval_dataloader: Optional[DataLoader] = None,
    log_dir: Optional[str] = None,
    use_wandb: bool = False,
):
    """
    Args:
        train_dataloader: the dataloader to use for training
        eval_dataloader: the dataloader to use for evaluation
        log_dir: the directory to save logs
        use_wandb: whether to use wandb for logging
    """
    self.train_dataloader = train_dataloader
    self.eval_dataloader = eval_dataloader

You can print inside this function using self.coordinator.print_on_master; then you should be able to see the length of the dataloader.

I have tested this myself and it worked fine on my side.
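
(For example, a minimal sketch of that print. The function body is from the snippet above with the docstring omitted; the print_on_master call is the only addition, assuming self.coordinator is ColossalAI's DistCoordinator.)

def _before_fit(
    self,
    train_dataloader: DataLoader,
    eval_dataloader: Optional[DataLoader] = None,
    log_dir: Optional[str] = None,
    use_wandb: bool = False,
):
    self.train_dataloader = train_dataloader
    self.eval_dataloader = eval_dataloader
    # Added for debugging: runs once before the training loop starts,
    # so it prints even if the dataloader turns out to be empty.
    self.coordinator.print_on_master(
        f"Length of train_dataloader: {len(self.train_dataloader)}"
    )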

@smash1999
Author

How do I use this function to output the length of the dataloader?
How should I modify it so it prints the information I want? Thanks!

def _before_fit(
    self,
    train_dataloader: DataLoader,
    eval_dataloader: Optional[DataLoader] = None,
    log_dir: Optional[str] = None,
    use_wandb: bool = False,
):
    """
    Args:
        train_dataloader: the dataloader to use for training
        eval_dataloader: the dataloader to use for evaluation
        log_dir: the directory to save logs
        use_wandb: whether to use wandb for logging
    """
    self.train_dataloader = train_dataloader
    self.eval_dataloader = eval_dataloader

    self.coordinator.print_on_master = length_data
    print(length_data)

@YeAnbang
Contributor

Hi, you can check whether the dataset is empty by inserting print(len(dataset)) under the underlined line. You can also open the jsonl files generated by the data preparation script to check whether the dataset was tokenized correctly. Based on your script, I suspect your max_length is too small: an SFT data point is ignored if the length of its first round already exceeds max_length after tokenization.
[screenshot of the referenced line]
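
(A rough sanity check along those lines. The file path and the input_ids field name are assumptions; adjust them to whatever your data preparation script actually writes.)

import json

max_len = 512  # the --max_len passed to train_sft.py
total, kept = 0, 0
# Hypothetical path: point this at a jsonl file produced by data preparation.
with open("path/to/prepared_sft_data.jsonl") as f:
    for line in f:
        sample = json.loads(line)
        total += 1
        # "input_ids" is a guess at the tokenized field name; verify it
        # against the actual jsonl schema.
        if len(sample.get("input_ids", [])) <= max_len:
            kept += 1
print(f"{kept}/{total} samples fit within max_len={max_len}")

If kept is 0, every sample is being dropped, which would explain the 0-iteration epochs in the log above.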
