
[BUG]: ColossalChat train sft is skipped with opt-1.3b model #5865

Open
1 task done
smash1999 opened this issue Jun 27, 2024 · 5 comments
Labels
bug Something isn't working

Comments

@smash1999

Is there an existing issue for this bug?

  • I have searched the existing issues

🐛 Describe the bug

I am using ColossalChat to train an opt-1.3b model. I modified train_sft.sh and ran SFT training. The run finishes with a "success" result, but the progress bar is abnormal: every epoch shows 0 iterations and a "skip evaluation" message, so training appears to be skipped entirely.
My command and log are below:

colossalai run --nproc_per_node 2 train_sft.py \
    --pretrain $PRETRAINED_MODEL_PATH \
    --tokenizer_dir $PRETRAINED_TOKENIZER_PATH \
    --save_interval 4000 \
    --dataset ${dataset[@]} \
    --save_path $SAVE_DIR \
    --config_file $CONFIG_FILE \
    --plugin zero2 \
    --batch_size 1 \
    --max_epochs 10 \
    --accumulation_steps 1 \
    --lr 2e-5 \
    --max_len 512 \
    --grad_checkpoint 
GPU Memory Usage:
     0	272 MiB
     1	11 MiB
Now CUDA_VISIBLE_DEVICES is set to:
CUDA_VISIBLE_DEVICES=1,0
W0627 16:55:30.218000 123273221547840 torch/distributed/run.py:757] 
W0627 16:55:30.218000 123273221547840 torch/distributed/run.py:757] *****************************************
W0627 16:55:30.218000 123273221547840 torch/distributed/run.py:757] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
W0627 16:55:30.218000 123273221547840 torch/distributed/run.py:757] *****************************************
/home/test/anaconda3/envs/colo01/lib/python3.10/site-packages/transformers/utils/generic.py:311: UserWarning: torch.utils._pytree._register_pytree_node is deprecated. Please use torch.utils._pytree.register_pytree_node instead.
  torch.utils._pytree._register_pytree_node(
/home/test/anaconda3/envs/colo01/lib/python3.10/site-packages/transformers/utils/generic.py:311: UserWarning: torch.utils._pytree._register_pytree_node is deprecated. Please use torch.utils._pytree.register_pytree_node instead.
  torch.utils._pytree._register_pytree_node(
/home/test/anaconda3/envs/colo01/lib/python3.10/site-packages/transformers/utils/generic.py:311: UserWarning: torch.utils._pytree._register_pytree_node is deprecated. Please use torch.utils._pytree.register_pytree_node instead.
  torch.utils._pytree._register_pytree_node(
/home/test/anaconda3/envs/colo01/lib/python3.10/site-packages/transformers/utils/generic.py:311: UserWarning: torch.utils._pytree._register_pytree_node is deprecated. Please use torch.utils._pytree.register_pytree_node instead.
  torch.utils._pytree._register_pytree_node(
/home/test/anaconda3/envs/colo01/lib/python3.10/site-packages/colossalai/pipeline/schedule/_utils.py:19: UserWarning: torch.utils._pytree._register_pytree_node is deprecated. Please use torch.utils._pytree.register_pytree_node instead.
  _register_pytree_node(OrderedDict, _odict_flatten, _odict_unflatten)
/home/test/anaconda3/envs/colo01/lib/python3.10/site-packages/colossalai/pipeline/schedule/_utils.py:19: UserWarning: torch.utils._pytree._register_pytree_node is deprecated. Please use torch.utils._pytree.register_pytree_node instead.
  _register_pytree_node(OrderedDict, _odict_flatten, _odict_unflatten)
/home/test/anaconda3/envs/colo01/lib/python3.10/site-packages/torch/utils/_pytree.py:300: UserWarning: <class 'collections.OrderedDict'> is already registered as pytree node. Overwriting the previous registration.
  warnings.warn(
/home/test/anaconda3/envs/colo01/lib/python3.10/site-packages/torch/utils/_pytree.py:300: UserWarning: <class 'collections.OrderedDict'> is already registered as pytree node. Overwriting the previous registration.
  warnings.warn(
/home/test/anaconda3/envs/colo01/lib/python3.10/site-packages/colossalai/shardformer/layer/normalization.py:45: UserWarning: Please install apex from source (https://github.com/NVIDIA/apex) to use the fused layernorm kernel
  warnings.warn("Please install apex from source (https://github.com/NVIDIA/apex) to use the fused layernorm kernel")
/home/test/anaconda3/envs/colo01/lib/python3.10/site-packages/colossalai/shardformer/layer/normalization.py:45: UserWarning: Please install apex from source (https://github.com/NVIDIA/apex) to use the fused layernorm kernel
  warnings.warn("Please install apex from source (https://github.com/NVIDIA/apex) to use the fused layernorm kernel")
/home/test/anaconda3/envs/colo01/lib/python3.10/site-packages/colossalai/initialize.py:48: UserWarning: `config` is deprecated and will be removed soon.
  warnings.warn("`config` is deprecated and will be removed soon.")
[06/27/24 16:55:31] INFO     colossalai - colossalai - INFO:                    
                             /home/test/anaconda3/envs/colo01/lib/python3.10/sit
                             e-packages/colossalai/initialize.py:67 launch      
                    INFO     colossalai - colossalai - INFO: Distributed        
                             environment is initialized, world size: 2          
[06/27/24 16:55:31] INFO     colossalai - colossalai - INFO:                    
                             /home/test/anaconda3/envs/colo01/lib/python3.10/sit
                             e-packages/colossalai/initialize.py:67 launch      
                    INFO     colossalai - colossalai - INFO: Distributed        
                             environment is initialized, world size: 2          
Gradient checkpointing enabled successfully
Configuration file will be saved at: output/-sft-2024-06-27-16-55-29.json
Model checkpoint will be saved at: output/
[extension] Compiling the JIT cpu_adam_x86 kernel during runtime now
[extension] Compiling the JIT cpu_adam_x86 kernel during runtime now
[extension] Time taken to compile cpu_adam_x86 op: 0.030973196029663086 seconds
[extension] Compiling the JIT fused_optim_cuda kernel during runtime now
/home/test/anaconda3/envs/colo01/lib/python3.10/site-packages/torch/utils/cpp_extension.py:1967: UserWarning: TORCH_CUDA_ARCH_LIST is not set, all archs for visible cards are included for compilation. 
If this is not desired, please set os.environ['TORCH_CUDA_ARCH_LIST'].
  warnings.warn(
[extension] Time taken to compile fused_optim_cuda op: 0.040076494216918945 seconds
/home/test/anaconda3/envs/colo01/lib/python3.10/site-packages/colossalai/nn/optimizer/hybrid_adam.py:90: UserWarning: The torch.cuda.*DtypeTensor constructors are no longer recommended. It's best to use methods such as torch.tensor(data, dtype=*, device='cuda') to create tensors. (Triggered internally at ../torch/csrc/tensor/python_tensor.cpp:78.)
  self._dummy_overflow_buf = torch.cuda.IntTensor([0])
[extension] Time taken to compile cpu_adam_x86 op: 0.1013331413269043 seconds
[extension] Compiling the JIT fused_optim_cuda kernel during runtime now
/home/test/anaconda3/envs/colo01/lib/python3.10/site-packages/torch/utils/cpp_extension.py:1967: UserWarning: TORCH_CUDA_ARCH_LIST is not set, all archs for visible cards are included for compilation. 
If this is not desired, please set os.environ['TORCH_CUDA_ARCH_LIST'].
  warnings.warn(
[extension] Time taken to compile fused_optim_cuda op: 0.03631329536437988 seconds
/home/test/anaconda3/envs/colo01/lib/python3.10/site-packages/colossalai/nn/optimizer/hybrid_adam.py:90: UserWarning: The torch.cuda.*DtypeTensor constructors are no longer recommended. It's best to use methods such as torch.tensor(data, dtype=*, device='cuda') to create tensors. (Triggered internally at ../torch/csrc/tensor/python_tensor.cpp:78.)
  self._dummy_overflow_buf = torch.cuda.IntTensor([0])
Max CUDA memory before data loader: 0.00 MB
Max CUDA memory after data loader: 0.00 MB
Warmup steps is set to 0
Booster init max CUDA memory: 5019.22 MB
Booster init max CPU memory: 8468.58 MB
Epochs:   0%|          | 0/10 No eval dataloader is provided, skip evaluation
Epoch 1/10: 0it [00:00, ?it/s]
                              No eval dataloader is provided, skip evaluation
Epoch 2/10: 0it [00:00, ?it/s]
                              No eval dataloader is provided, skip evaluation
Epoch 3/10: 0it [00:00, ?it/s]
Epoch 4/10: 0it [00:00, ?it/s]
No eval dataloader is provided, skip evaluation
Epoch 5/10: 0it [00:00, ?it/s]
No eval dataloader is provided, skip evaluation
Epoch 6/10: 0it [00:00, ?it/s]
No eval dataloader is provided, skip evaluation
Epoch 7/10: 0it [00:00, ?it/s]
No eval dataloader is provided, skip evaluation
Epoch 8/10: 0it [00:00, ?it/s]
No eval dataloader is provided, skip evaluation
Epoch 9/10: 0it [00:00, ?it/s]
No eval dataloader is provided, skip evaluation
Epoch 10/10: 0it [00:00, ?it/s]
No eval dataloader is provided, skip evaluation
Epochs: 100%|██████████| 10/10 [00:00<00:00, 2525.17it/s]Start saving final model checkpoint

Saved final model checkpoint at epoch 10 at folder output/
Max CUDA memory usage: 5019.22 MB

====== Training on All Nodes =====
127.0.0.1: success

====== Stopping All Nodes =====
127.0.0.1: finish

Environment

  1. CPU: Intel platform with Z790 + i9-14900K
  2. GPU: NVIDIA RTX 4090 ×2
  3. Ubuntu: 22.04
  4. Python: 3.10.14
  5. Colossal-AI: 0.3.6
  6. PyTorch: 2.3.0
  7. CUDA: 12.1
smash1999 added the bug label on Jun 27, 2024
@TongLi3701
Member

Could you take a look at your training data loader? Try printing its length to see whether there is actually any data inside.

Here, we will iterate through the train data loader:

for i, batch in enumerate(self.train_dataloader):
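
(As a debugging sketch, not the trainer's actual code: a length check placed just before this loop will print even when the loader is empty, whereas a print inside the loop body never executes for an empty loader.)

# Debugging sketch: print once *before* iterating, so the message
# appears even if the dataloader yields nothing.
print(f"len(train_dataloader) = {len(self.train_dataloader)}")
for i, batch in enumerate(self.train_dataloader):
    ...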

@smash1999
Copy link
Author

How can I get the length of the data loader?
I added code after for i, batch in enumerate(self.train_dataloader): but nothing was printed.
Below is the code I added:
coordinator.print_on_master(f"Length of DataLoader: {len(self.train_dataloader)}")

@TongLi3701
Member

def _before_fit(
    self,
    train_dataloader: DataLoader,
    eval_dataloader: Optional[DataLoader] = None,
    log_dir: Optional[str] = None,
    use_wandb: bool = False,
):
    """
    Args:
        train_dataloader: the dataloader to use for training
        eval_dataloader: the dataloader to use for evaluation
        log_dir: the directory to save logs
        use_wandb: whether to use wandb for logging
    """
    self.train_dataloader = train_dataloader
    self.eval_dataloader = eval_dataloader

You can print inside this function using self.coordinator.print_on_master; then you should be able to see the length of the dataloader.

I have tested this myself and it worked fine on my side.
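
(For example, a minimal sketch of that print. The function body is from the snippet above with the docstring omitted; the print_on_master call is the only addition, assuming self.coordinator is ColossalAI's DistCoordinator.)

def _before_fit(
    self,
    train_dataloader: DataLoader,
    eval_dataloader: Optional[DataLoader] = None,
    log_dir: Optional[str] = None,
    use_wandb: bool = False,
):
    self.train_dataloader = train_dataloader
    self.eval_dataloader = eval_dataloader
    # Added for debugging: runs once before the training loop starts,
    # so it prints even if the dataloader turns out to be empty.
    self.coordinator.print_on_master(
        f"Length of train_dataloader: {len(self.train_dataloader)}"
    )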

@smash1999
Author

How do I use this function to output the length of the dataloader?
How should I modify it so it prints the information I want? Thanks!

def _before_fit(
    self,
    train_dataloader: DataLoader,
    eval_dataloader: Optional[DataLoader] = None,
    log_dir: Optional[str] = None,
    use_wandb: bool = False,
):
    """
    Args:
        train_dataloader: the dataloader to use for training
        eval_dataloader: the dataloader to use for evaluation
        log_dir: the directory to save logs
        use_wandb: whether to use wandb for logging
    """
    self.train_dataloader = train_dataloader
    self.eval_dataloader = eval_dataloader

    self.coordinator.print_on_master = length_data
    print(length_data)

@YeAnbang
Contributor

Hi, you can check whether the dataset is empty by inserting print(len(dataset)) under the underlined line. You can also open the jsonl files generated by the data preparation script to check whether the dataset was tokenized correctly. Based on your script, I suspect your max_length is too small: an SFT data point is ignored if the length of its first round already exceeds max_length after tokenization.
[screenshot of the referenced line]
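
(A rough sanity check along those lines. The file path and the input_ids field name are assumptions; adjust them to whatever your data preparation script actually writes.)

import json

max_len = 512  # the --max_len passed to train_sft.py
total, kept = 0, 0
# Hypothetical path: point this at a jsonl file produced by data preparation.
with open("path/to/prepared_sft_data.jsonl") as f:
    for line in f:
        sample = json.loads(line)
        total += 1
        # "input_ids" is a guess at the tokenized field name; verify it
        # against the actual jsonl schema.
        if len(sample.get("input_ids", [])) <= max_len:
            kept += 1
print(f"{kept}/{total} samples fit within max_len={max_len}")

If kept is 0, every sample is being dropped, which would explain the 0-iteration epochs in the log above.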
