[BUG]: ColossalMoE Train: AssertionError: Parameters are expected to have the same dtype torch.bfloat16, but got torch.float32
#5664
Labels
bug
Something isn't working
🐛 Describe the bug
At the stage of booster initialization, some parameters end up with the wrong dtype of torch.float32 even though the precision is set to "bf16", so the optimizer initialization inside the booster cannot pass the sanity check on parameter dtypes.
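For context, here is a minimal sketch of the failing setup. This is hedged: the `MoeHybridParallelPlugin` import path and arguments follow the ColossalMoE example as I understand it, and the checkpoint name is only illustrative.

```python
import torch
import colossalai
from colossalai.booster import Booster
# Import path as used by the ColossalMoE example; treat as an assumption.
from colossalai.booster.plugin.moe_hybrid_parallel_plugin import MoeHybridParallelPlugin
from transformers import MixtralForCausalLM

colossalai.launch_from_torch(config={})

# precision="bf16" should cast every parameter to torch.bfloat16.
plugin = MoeHybridParallelPlugin(
    pp_size=1,
    ep_size=1,
    zero_stage=1,
    precision="bf16",
)
booster = Booster(plugin=plugin)

model = MixtralForCausalLM.from_pretrained("mistralai/Mixtral-8x7B-v0.1")
optimizer = torch.optim.AdamW(model.parameters())

# The AssertionError fires inside boost(), while the ZeRO optimizer
# sanity-checks that all working parameters share one dtype.
model, optimizer, *_ = booster.boost(model, optimizer)
```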
Here is the detailed error info: boosting fails with `AssertionError: Parameters are expected to have the same dtype torch.bfloat16, but got torch.float32`. The bug can be traced as: self.plugin.configure -> HybridParallelZeroOptimizer -> LowLevelZeroOptimizer -> _sanity_checks.
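To pinpoint which parameters keep torch.float32 before the sanity check fires, a quick scan over the model helps. A minimal sketch, assuming `model` is the module that is about to be handed to the optimizer:

```python
import torch

def find_mismatched_params(model: torch.nn.Module,
                           expected: torch.dtype = torch.bfloat16) -> None:
    """Print every parameter whose dtype differs from the expected precision."""
    for name, param in model.named_parameters():
        if param.dtype != expected:
            print(f"{name}: {param.dtype}")

# Run this right before booster.boost() raises, e.g. on the wrapped module.
find_mismatched_params(model)
```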
There may be a bug in HybridParallelModule or MixtralModelPolicy.
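As a temporary workaround (an untested sketch; it may only mask a casting bug in the module wrapper or the policy rather than fix it), the model can be cast manually before boosting:

```python
import torch

# Hypothetical workaround: force all parameters and buffers to bf16 up front,
# instead of relying on the plugin/policy to perform the cast.
model = model.to(dtype=torch.bfloat16)
```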
My test shell:
Environment
CUDA 12.1
torch 2.1.0
Python 3.10.14
colossalai 0.3.6 (main)
colossal-moe 1.0.0
transformers 4.36.2