E File "test_shard_blip2.py", line 28, in check_forward_backward
E assert_hf_output_close(org_output, shard_output, ignore_keys=["past_key_values"])
E File "colossalai/testing/comparison.py", line 125, in assert_hf_output_close
E assert_hf_output_close(
E File "colossalai/testing/comparison.py", line 149, in assert_hf_output_close
E assert_close(
E File "/opt/conda/lib/python3.10/site-packages/torch/testing/_comparison.py", line 1520, in assert_close
E raise error_metas[0].to_error(msg)
E AssertionError: Tensor-likes are not close!
E
E Mismatched elements: 5947392 / 5947392 (100.0%)
E Greatest absolute difference: nan at index (0, 0) (up to 1e-06 allowed)
E Greatest relative difference: nan at index (0, 0) (up to 1e-05 allowed)
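For context on why the fp16 run produces NaN everywhere: fp16 saturates at 65504, so intermediate reductions in a layer norm (e.g. squaring activations for the variance) can overflow to inf, and inf arithmetic then yields nan. A minimal sketch of that failure mode (the tensor values here are illustrative, not taken from the test):

```python
import torch

# fp16's maximum value is 65504, so squaring moderately large
# activations during a variance computation already overflows ...
x = torch.full((4,), 300.0, dtype=torch.float16)
sq = x * x                      # 300**2 = 90000 > 65504 -> inf

# ... and inf - inf in the E[x^2] - E[x]^2 form of the variance
# is nan, which then propagates through the whole output tensor.
var = sq.mean() - x.mean() ** 2

print(sq)   # tensor([inf, inf, inf, inf], dtype=torch.float16)
print(var)  # tensor(nan, dtype=torch.float16)
```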
With dtype=torch.bfloat16 and without enable_fused_normalization it passes, but if I enable enable_fused_normalization, it fails again:
E File "test_shard_blip2.py", line 28, in check_forward_backward
E assert_hf_output_close(org_output, shard_output, ignore_keys=["past_key_values"])
E File "/colossalai/testing/comparison.py", line 125, in assert_hf_output_close
E assert_hf_output_close(
E File "/colossalai/testing/comparison.py", line 125, in assert_hf_output_close
E assert_hf_output_close(
E File "/colossalai/testing/comparison.py", line 149, in assert_hf_output_close
E assert_close(
E File "/opt/conda/lib/python3.10/site-packages/torch/testing/_comparison.py", line 1520, in assert_close
E raise error_metas[0].to_error(msg)
E AssertionError: Tensor-likes are not close!
E
E Mismatched elements: 24271 / 2161696 (1.1%)
E Greatest absolute difference: 0.0078125 at index (0, 3, 47) (up to 1e-05 allowed)
E Greatest relative difference: 169.0 at index (0, 3, 47325) (up to 1e-05 allowed)
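Worth noting about this second failure: the greatest absolute difference of 0.0078125 is exactly 2**-7, one machine epsilon of bfloat16, i.e. a single-ulp rounding difference between the fused and unfused kernels for values near 1.0, while the 1e-05 bound in the traceback is far below what bf16 can resolve. A quick check (the tolerance values below are illustrative, not what the test passes):

```python
import torch

# bf16 keeps only 7 mantissa bits, so its machine epsilon is 2**-7,
# exactly the 0.0078125 reported as the greatest absolute difference.
eps = torch.finfo(torch.bfloat16).eps
print(eps)  # 0.0078125

# two results one ulp apart are as close as bf16 math can get;
# they fail an atol=1e-5 check but pass a bf16-scaled one.
a = torch.ones(8, dtype=torch.bfloat16)
b = a + eps
torch.testing.assert_close(a, b, rtol=0.0, atol=eps)  # passes
```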
Environment
torch 2.2.1 / CUDA 12.1
colossalai 0.3.6
transformers 4.36.0
The text was updated successfully, but these errors were encountered:
I am not sure if this is a bug or an unavoidable error due to lower precision, and whether the test was only ever intended to run in fp32. I would appreciate any insights about it. Thanks.
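On the precision question: torch.testing.assert_close already scales its default tolerances by dtype (per the PyTorch docs, rtol defaults to 1.6e-2 for bfloat16 vs 1.3e-6 for float32), so one option would be for the comparison to rely on those defaults rather than fixed fp32-level bounds. A sketch of the difference (the pinned tolerances below are illustrative, not the test's actual arguments):

```python
import torch

a = torch.ones(4, dtype=torch.bfloat16)
b = a + torch.finfo(torch.bfloat16).eps  # one ulp apart

# dtype-aware defaults (rtol=1.6e-2 for bf16) accept a one-ulp gap:
torch.testing.assert_close(a, b)

# pinned fp32-level tolerances reject the same pair:
try:
    torch.testing.assert_close(a, b, rtol=1.3e-6, atol=1e-5)
    raised = False
except AssertionError:
    raised = True
print(raised)  # True
```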
🐛 Describe the bug
colossalai.shardformer.layer.FusedLayerNorm doesn't seem to work correctly.
https://github.com/hpcaitech/ColossalAI/blob/main/tests/test_shardformer/test_model/test_shard_blip2.py
This test file passes as it is. But if I change dtype to torch.float16 (ColossalAI/tests/test_shardformer/test_model/test_shard_blip2.py, line 92 in 89049b0), it fails with the first traceback above.