[BUG]: ColossalMoE Train: AssertionError: Parameters are expected to have the same dtype torch.bfloat16, but got torch.float32
#5664
Labels
bug
Something isn't working
🐛 Describe the bug
At the stage of booster initialization, some parameters end up with the wrong dtype of torch.float32 even though the precision is set to "bf16", so the optimizer initialization inside the booster cannot pass the sanity check on parameter dtypes.
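For context, here is a minimal sketch of the failing setup. This is hedged: the `MoeHybridParallelPlugin` import path and arguments follow the ColossalMoE example as I understand it, and the checkpoint name is only illustrative.

```python
import torch
import colossalai
from colossalai.booster import Booster
# Import path as used by the ColossalMoE example; treat as an assumption.
from colossalai.booster.plugin.moe_hybrid_parallel_plugin import MoeHybridParallelPlugin
from transformers import MixtralForCausalLM

colossalai.launch_from_torch(config={})

# precision="bf16" should cast every parameter to torch.bfloat16.
plugin = MoeHybridParallelPlugin(
    pp_size=1,
    ep_size=1,
    zero_stage=1,
    precision="bf16",
)
booster = Booster(plugin=plugin)

model = MixtralForCausalLM.from_pretrained("mistralai/Mixtral-8x7B-v0.1")
optimizer = torch.optim.AdamW(model.parameters())

# The AssertionError fires inside boost(), while the ZeRO optimizer
# sanity-checks that all working parameters share one dtype.
model, optimizer, *_ = booster.boost(model, optimizer)
```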
Here is the detailed error info: boosting fails with `AssertionError: Parameters are expected to have the same dtype torch.bfloat16, but got torch.float32`. The bug can be traced as: self.plugin.configure -> HybridParallelZeroOptimizer -> LowLevelZeroOptimizer -> _sanity_checks.
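To pinpoint which parameters keep torch.float32 before the sanity check fires, a quick scan over the model helps. A minimal sketch, assuming `model` is the module that is about to be handed to the optimizer:

```python
import torch

def find_mismatched_params(model: torch.nn.Module,
                           expected: torch.dtype = torch.bfloat16) -> None:
    """Print every parameter whose dtype differs from the expected precision."""
    for name, param in model.named_parameters():
        if param.dtype != expected:
            print(f"{name}: {param.dtype}")

# Run this right before booster.boost() raises, e.g. on the wrapped module.
find_mismatched_params(model)
```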
There may be a bug in HybridParallelModule or MixtralModelPolicy.
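As a temporary workaround (an untested sketch; it may only mask a casting bug in the module wrapper or the policy rather than fix it), the model can be cast manually before boosting:

```python
import torch

# Hypothetical workaround: force all parameters and buffers to bf16 up front,
# instead of relying on the plugin/policy to perform the cast.
model = model.to(dtype=torch.bfloat16)
```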
My test shell:
Environment
CUDA 12.1
torch 2.1.0
Python 3.10.14
colossalai 0.3.6 (main)
colossal-moe 1.0.0
transformers 4.36.2