Question about fine-tuning the ChatGLM model #19

Open
alexhmyang opened this issue Apr 14, 2023 · 33 comments
Labels
bug (Something isn't working), enhancement (New feature or request), wontfix (This will not be worked on)

Comments

@alexhmyang


Running python predict_demo.py under /ChatGLM-6B/textgen/examples/chatglm throws an error. The glm-6B base model is the original one, and the LoRA fine-tuned model was fetched with git clone https://huggingface.co/shibing624/chatglm-6b-csc-zh-lora

The error:

(pt) ubuntu@youran-gpu21:~/ChatGLM-6B/textgen/examples/chatglm$ python predict_demo2.py
2023-04-14 11:47:33.176 | DEBUG | textgen.chatglm.chatglm_model:__init__:98 - Device: cuda
Explicitly passing a revision is encouraged when loading a configuration with custom code to ensure no malicious code has been contributed in a newer revision.
Explicitly passing a revision is encouraged when loading a model with custom code to ensure no malicious code has been contributed in a newer revision.
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 8/8 [00:12<00:00, 1.58s/it]
Explicitly passing a revision is encouraged when loading a model with custom code to ensure no malicious code has been contributed in a newer revision.
2023-04-14 11:48:08.995 | INFO | textgen.chatglm.chatglm_model:load_lora:342 - Loaded lora model from /home/ubuntu/ChatGLM-6B/textgen/chatglm-6b-csc-zh-lora
Traceback (most recent call last):
File "/home/ubuntu/ChatGLM-6B/textgen/examples/chatglm/predict_demo2.py", line 12, in
r = model.predict(["对下面中文拼写纠错:\n少先队员因该为老人让坐。\n答:"])
File "/home/ubuntu/anaconda3/envs/pt/lib/python3.9/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
return func(*args, **kwargs)
File "/home/ubuntu/anaconda3/envs/pt/lib/python3.9/site-packages/textgen-0.1.9-py3.9.egg/textgen/chatglm/chatglm_model.py", line 385, in predict
self.model.eval()
File "/home/ubuntu/anaconda3/envs/pt/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1930, in eval
return self.train(False)
File "/home/ubuntu/anaconda3/envs/pt/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1911, in train
module.train(mode)
File "/home/ubuntu/anaconda3/envs/pt/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1911, in train
module.train(mode)
File "/home/ubuntu/anaconda3/envs/pt/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1911, in train
module.train(mode)
[Previous line repeated 4 more times]
File "/home/ubuntu/anaconda3/envs/pt/lib/python3.9/site-packages/peft-0.2.0-py3.9.egg/peft/tuners/lora.py", line 417, in train
delta_w = F.conv1d(
RuntimeError: Expected 4-dimensional input for 4-dimensional weight [8192, 8, 1, 1], but got 3-dimensional input of size [1, 16, 4096] instead

@alexhmyang added the bug label on Apr 14, 2023
@shibing624
Owner

peft==0.3.0.dev0

pip install git+https://github.com/huggingface/peft
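
(A minimal check, added here for reference: to confirm which peft build is actually active in the environment that runs the demo, print the versions. The conv1d shape error above is raised inside peft 0.2.0's LoRA code, so after the upgrade peft should report 0.3.0.dev0 or newer.)

import peft
import torch
import transformers

# versions seen by the Python environment that runs predict_demo2.py;
# the upgrade command above should bump peft to 0.3.0.dev0 or later
print("peft:", peft.__version__)
print("torch:", torch.__version__)
print("transformers:", transformers.__version__)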

@alexhmyang
Author

alexhmyang commented Apr 14, 2023

peft==0.3.0.dev0

pip install git+https://github.com/huggingface/peft

fatal: unable to access 'https://github.com/huggingface/peft/': Could not resolve host: github.com
error: subprocess-exited-with-error

× git clone --filter=blob:none --quiet https://github.com/huggingface/peft 'C:\Users\86187\AppData\Local\Temp\pip-req-build-n5rknw77' did not run successfully.
│ exit code: 128
╰─> See above for output.

note: This error originates from a subprocess, and is likely not a problem with pip.
error: subprocess-exited-with-error

× git clone --filter=blob:none --quiet https://github.com/huggingface/peft 'C:\Users\86187\AppData\Local\Temp\pip-req-build-n5rknw77' did not run successfully.
│ exit code: 128
╰─> See above for output.

Could you just give me a command that works?

@shibing624
Owner

pip install git+https://github.com/huggingface/peft
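
(If github.com cannot be resolved from the target machine, as in the log above, one possible workaround, sketched here with illustrative paths, is to fetch the repository on a machine that does have access and install from the local copy:)

# on a machine that can reach GitHub
git clone https://github.com/huggingface/peft
# copy the peft/ directory to the offline machine, then install it from the local path
pip install ./peft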

@playinlife

playinlife commented Apr 19, 2023

After installing 0.3.0.dev0, running the chatglm-6b-belle-zh-lora example reports the following error:
RuntimeError: Error(s) in loading state_dict for PeftModelForCausalLM:
size mismatch for base_model.model.transformer.layers.0.attention.query_key_value.lora_A.default.weight: copying a param with shape torch.Size([16, 4096]) from checkpoint, the shape in current model is torch.Size([8, 4096]).
size mismatch for base_model.model.transformer.layers.0.attention.query_key_value.lora_B.default.weight: copying a param with shape torch.Size([8192, 8, 1]) from checkpoint, the shape in current model is torch.Size([12288, 8]).
... (the same lora_A.default.weight / lora_B.default.weight size mismatch is reported for every layer from 1 through 27)

@shibing624
Owner

I haven't updated the chatglm-6b-belle-zh-lora weights; you can train your own adapter, or use shibing624/chatglm-6b-csc-zh-lora instead.
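
(The size mismatches above are what load_state_dict reports when the saved adapter was trained with different LoRA hyperparameters, e.g. a different rank or target-module layout, than the ones the loading code now builds. A hedged way to compare is to inspect the adapter's adapter_config.json; the path below is illustrative.)

import json

# every PEFT LoRA checkpoint ships an adapter_config.json next to adapter_model.bin
with open("chatglm-6b-belle-zh-lora/adapter_config.json") as f:
    cfg = json.load(f)

# r, lora_alpha and target_modules must match what the loading code constructs,
# otherwise loading produces size-mismatch errors like the ones quoted above
print(cfg.get("r"), cfg.get("lora_alpha"), cfg.get("target_modules"))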

@bash99

bash99 commented Apr 19, 2023

I haven't updated the chatglm-6b-belle-zh-lora weights; you can train your own adapter, or use shibing624/chatglm-6b-csc-zh-lora instead.

Training with the latest code also fails:

 /DaTa/.local/home/hai.li/mambaforge/lib/python3.10/site-packages/torch/_dynamo/variables/builder │
│ .py:812 in wrap_fx_proxy_cls                                                                     │
│                                                                                                  │
│   809 │   │   │   │   "ignore_subclass": ignore_subclass,                                        │
│   810 │   │   │   │   "is_tensor": target_cls is TensorVariable,                                 │
│   811 │   │   │   }                                                                              │
│ ❱ 812 │   │   │   assert "source" in options and options["source"] is not None                   │
│   813 │   │   │   kwargs["source"] = options["source"]                                           │
│   814 │   │   │   example_value = wrap_to_fake_tensor_and_record(                                │
│   815 │   │   │   │   example_value, tx=tx, **kwargs                                             │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
AssertionError:

from user code:
   File "/DaTa/.local/home/hai.li/mambaforge/lib/python3.10/site-packages/torch/random.py", line 23, in get_rng_state
    return default_generator.get_state()

Set torch._dynamo.config.verbose=True for more information


You can suppress this exception and fall back to eager by setting:
    torch._dynamo.config.suppress_errors = True
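
As the message itself suggests, a temporary fallback (it only silences the compile failure, it does not fix the underlying trace problem in gradient checkpointing) is to let dynamo fall back to eager mode before training starts; another option is simply not wrapping the model with torch.compile (chatglm_model.py line 300 in the trace below):

import torch._dynamo

# fall back to eager execution whenever torch.compile/dynamo fails to trace a frame
torch._dynamo.config.suppress_errors = True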

The full log:

$ python training_chatglm_demo.py --do_train
2023-04-19 19:41:46.611 | INFO     | __main__:main:43 - Namespace(train_file='../data/zh_csc_train.tsv', test_file='../data/zh_csc_test.tsv', model_type='chatglm', model_name='THUDM/chatglm-6b', do_train=True, do_predict=False, output_dir='./outputs/', max_seq_length=128, max_length=128, num_epochs=0.2, batch_size=2)
2023-04-19 19:41:46.611 | INFO     | __main__:main:47 - Loading data...
2023-04-19 19:41:46.612 | DEBUG    | textgen.chatglm.chatglm_model:__init__:91 - Device: cuda
Explicitly passing a `revision` is encouraged when loading a configuration with custom code to ensure no malicious code has been contributed in a newer revision.
Explicitly passing a `revision` is encouraged when loading a model with custom code to ensure no malicious code has been contributed in a newer revision.
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████████████████████████| 8/8 [00:10<00:00,  1.29s/it]
Explicitly passing a `revision` is encouraged when loading a model with custom code to ensure no malicious code has been contributed in a newer revision.
2023-04-19 19:42:09.099 | DEBUG    | __main__:main:62 - train_data: [['对下面中文拼写纠错:', '对台湾的大学制度和社会血管而言,学生要工作的话很难,要辨读大学边工作的话,这会逼迫学生工作和学习上分心,让学生陷于力不从心精神分散的恶境。', '对台湾的大学制度和社会血管而言,学生要工作的话很难,要边读大学边工作的话,这会逼迫学生工作和学习上分心,让学生陷于力不从心精神分散的恶境。'], ['对下面中文拼写纠错:', '而大众对于其好坏方面的比例又不同的判断,所以对其的态度也完全不一至。', '而大众对于其好坏方面的比例有不同的判断,所以对其的态度也完全不一致。'], ['对下面中文拼写纠错:', '怎么办!我的房子里大学很远!时间不够了!', '怎么办!我的房子离大学很远!时间不够了!'], ['对下面中文拼写纠错:', '所以老师们应该这导最好交孩子的方法就是让他们玩儿而发展。', '所以老师们应该知道最好教孩子的方法就是让他们玩儿而发展。'], ['对下面中文拼写纠错:', '搭进第二十一世纪,顺著社会、科学的进步,网路科学也不断地发展同时电脑领域也速步地更新。', '踏进第二十一世纪,顺著社会、科学的进步,网路科学也不断地发展同时电脑领域也速步地更新。'], ['对下面中文拼写纠错:', '因为现在,我们再得这一时代,就是不能相信别人家的很冷淡的时代嘛!', '因为现在,我们在的这一时代,就是不能相信别人家的很冷淡的时代嘛!'], ['对下面中文拼写纠错:', '好可惜我下个礼拜要回国,我已经买过飞机票所以没办法那天跟你们一起庆祝你们的寰麟。', '好可惜我下个礼拜要回国,我已经买过飞机票所以没办法那天跟你们一起庆祝你们的婚礼。'], ['对下面中文拼写纠错:', '请你先不要放弃!你可以利用在家理的时间想一想你未来最想要做的是什么?', '请你先不要放弃!你可以利用在家里的时间想一想你未来最想要做的是什么?'], ['对下面中文拼写纠错:', '「宠物出租」我看在都市区会受欢迎。老实说我想各各人应该要考虑之后养动物才对。', '「宠物出租」我看在都市区会受欢迎。老实说我想各个人应该要考虑之后养动物才对。'], ['对下面中文拼写纠错:', '到了学校一后,他跟他的同学一起上数学科。', '到了学校以后,他跟他的同学一起上数学课。']]
2023-04-19 19:42:30.857 | WARNING  | textgen.chatglm.chatglm_model:train_model:241 - Checkpoint ./outputs/adapter_model.bin not found
trainable params: 3670016 || all params: 6176956416 || trainable%: 0.05941463324063059
2023-04-19 19:42:30.860 | INFO     | textgen.chatglm.chatglm_utils:__init__:93 -  Creating features from dataset file at cache_dir/
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████| 2338/2338 [00:01<00:00, 1777.50it/s]
2023-04-19 19:42:32.181 | INFO     | textgen.chatglm.chatglm_utils:__init__:121 -  Saving features into cached file cache_dir/THUDM_chatglm-6b_cached_1282338
2023-04-19 19:42:32.184 | DEBUG    | textgen.chatglm.chatglm_model:train_model:251 - train_dataset len: 2338, train_dataset[0]: [5, 64286, 12, 63836, 65845, 68088, 66642, 64339, 89435, 12, 4, 63836, 91601, 64236, 64802, 72925, 66817, 65049, 6, 64050, 63858, 63889, 64112, 65539, 6, 63858, 71808, 105293, 64436, 63889, 64112, 6, 86045, 83875, 64050, 123629, 63839, 109352, 6, 70230, 109951, 107027, 64428, 69353, 63825, 65561, 66612, 63823, 4, 67342, 12, 130001, 130004, 5, 63836, 91601, 64236, 64802, 72925, 66817, 65049, 6, 64050, 63858, 63889, 64112, 65539, 6, 63858, 64436, 105293, 64436, 63889, 64112, 6, 86045, 83875, 64050, 123629, 63839, 109352, 6, 70230, 109951, 107027, 64428, 69353, 63825, 65561, 66612, 63823, 130005]
2023-04-19 19:42:32.185 | WARNING  | textgen.chatglm.chatglm_model:train_model:284 - Process rank: -1, device: cuda:0, n_gpu: 4, distributed training: False, 16-bits training: True
2023-04-19 19:42:32.186 | INFO     | textgen.chatglm.chatglm_model:train_model:288 - Training/evaluation parameters TrainingArguments(
_n_gpu=4,
adafactor=False,
adam_beta1=0.9,
adam_beta2=0.999,
adam_epsilon=1e-08,
auto_find_batch_size=False,
bf16=False,
bf16_full_eval=False,
data_seed=None,
dataloader_drop_last=False,
dataloader_num_workers=0,
dataloader_pin_memory=True,
ddp_bucket_cap_mb=None,
ddp_find_unused_parameters=None,
ddp_timeout=1800,
debug=[],
deepspeed=None,
disable_tqdm=False,
do_eval=False,
do_predict=False,
do_train=False,
eval_accumulation_steps=None,
eval_delay=0,
eval_steps=None,
evaluation_strategy=no,
fp16=True,
fp16_backend=auto,
fp16_full_eval=False,
fp16_opt_level=O1,
fsdp=[],
fsdp_config={'fsdp_min_num_params': 0, 'xla': False, 'xla_fsdp_grad_ckpt': False},
fsdp_min_num_params=0,
fsdp_transformer_layer_cls_to_wrap=None,
full_determinism=False,
gradient_accumulation_steps=1,
gradient_checkpointing=False,
greater_is_better=None,
group_by_length=False,
half_precision_backend=auto,
hub_model_id=None,
hub_private_repo=False,
hub_strategy=every_save,
hub_token=<HUB_TOKEN>,
ignore_data_skip=False,
include_inputs_for_metrics=False,
jit_mode_eval=False,
label_names=None,
label_smoothing_factor=0.0,
learning_rate=0.0002,
length_column_name=length,
load_best_model_at_end=False,
local_rank=-1,
log_level=passive,
log_level_replica=warning,
log_on_each_node=True,
logging_dir=./outputs//logs,
logging_first_step=False,
logging_nan_inf_filter=True,
logging_steps=50,
logging_strategy=steps,
lr_scheduler_type=linear,
max_grad_norm=1.0,
max_steps=-1,
metric_for_best_model=None,
mp_parameters=,
no_cuda=False,
num_train_epochs=0.2,
optim=adamw_torch,
optim_args=None,
output_dir=./outputs/,
overwrite_output_dir=True,
past_index=-1,
per_device_eval_batch_size=2,
per_device_train_batch_size=2,
prediction_loss_only=False,
push_to_hub=False,
push_to_hub_model_id=None,
push_to_hub_organization=None,
push_to_hub_token=<PUSH_TO_HUB_TOKEN>,
ray_scope=last,
remove_unused_columns=False,
report_to=['tensorboard'],
resume_from_checkpoint=None,
run_name=./outputs/,
save_on_each_node=False,
save_safetensors=False,
save_steps=400,
save_strategy=steps,
save_total_limit=3,
seed=42,
sharded_ddp=[],
skip_memory_metrics=True,
tf32=None,
torch_compile=False,
torch_compile_backend=None,
torch_compile_mode=None,
torchdynamo=None,
tpu_metrics_debug=False,
tpu_num_cores=None,
use_ipex=False,
use_legacy_prediction_loop=False,
use_mps_device=False,
warmup_ratio=0.0,
warmup_steps=0,
weight_decay=0.0,
xpu_backend=None,
)
2023-04-19 19:42:32.191 | INFO     | textgen.chatglm.chatglm_model:train_model:302 - *** Train ***
  0%|                                                                                                                   | 0/234 [00:00<?, ?it/s]/home/myuser/mambaforge/lib/python3.10/site-packages/torch/utils/checkpoint.py:31: UserWarning: None of the inputs have requires_grad=True. Gradients will be None
  warnings.warn("None of the inputs have requires_grad=True. Gradients will be None")
╭─────────────────────────────── Traceback (most recent call last) ────────────────────────────────╮
│ /home/myuser/dl/textgen_lora_train/textgen/examples/chatglm/training_chatglm_demo.py │
│ :100 in <module>                                                                                 │
│                                                                                                  │
│    97                                                                                            │
│    98                                                                                            │
│    99 if __name__ == '__main__':                                                                 │
│ ❱ 100 │   main()                                                                                 │
│   101                                                                                            │
│                                                                                                  │
│ /home/myuser/dl/textgen_lora_train/textgen/examples/chatglm/training_chatglm_demo.py │
│ :64 in main                                                                                      │
│                                                                                                  │
│    61 │   │   train_data = load_data(args.train_file)                                            │
│    62 │   │   logger.debug('train_data: {}'.format(train_data[:10]))                             │
│    63 │   │   train_df = pd.DataFrame(train_data, columns=["instruction", "input", "output"])    │
│ ❱  64 │   │   model.train_model(train_df)                                                        │
│    65 │   if args.do_predict:                                                                    │
│    66 │   │   if model is None:                                                                  │
│    67 │   │   │   model = ChatGlmModel(                                                          │
│                                                                                                  │
│ /home/myuser/dl/textgen_lora_train/textgen/examples/chatglm/../../textgen/chatglm/ch │
│ atglm_model.py:303 in train_model                                                                │
│                                                                                                  │
│   300 │   │   │   self.model = torch.compile(self.model)                                         │
│   301 │   │                                                                                      │
│   302 │   │   logger.info("*** Train ***")                                                       │
│ ❱ 303 │   │   (global_step, training_loss, metrics) = trainer.train(resume_from_checkpoint=res   │
│   304 │   │   self.handle_metrics("train", metrics, self.args.output_dir)                        │
│   305 │   │   self.results.update(metrics)                                                       │
│   306 │   │   self.save_model(model=self.model)                                                  │
│                                                                                                  │
│ /home/myuser/mambaforge/lib/python3.10/site-packages/transformers/trainer.py:1662 in │
│ train                                                                                            │
│                                                                                                  │
│   1659 │   │   inner_training_loop = find_executable_batch_size(                                 │
│   1660 │   │   │   self._inner_training_loop, self._train_batch_size, args.auto_find_batch_size  │
│   1661 │   │   )                                                                                 │
│ ❱ 1662 │   │   return inner_training_loop(                                                       │
│   1663 │   │   │   args=args,                                                                    │
│   1664 │   │   │   resume_from_checkpoint=resume_from_checkpoint,                                │
│   1665 │   │   │   trial=trial,                                                                  │
│                                                                                                  │
│ /home/myuser/mambaforge/lib/python3.10/site-packages/transformers/trainer.py:1929 in │
│ _inner_training_loop                                                                             │
│                                                                                                  │
│   1926 │   │   │   │   │   with model.no_sync():                                                 │
│   1927 │   │   │   │   │   │   tr_loss_step = self.training_step(model, inputs)                  │
│   1928 │   │   │   │   else:                                                                     │
│ ❱ 1929 │   │   │   │   │   tr_loss_step = self.training_step(model, inputs)                      │
│   1930 │   │   │   │                                                                             │
│   1931 │   │   │   │   if (                                                                      │
│   1932 │   │   │   │   │   args.logging_nan_inf_filter                                           │
│                                                                                                  │
│ /home/myuser/mambaforge/lib/python3.10/site-packages/transformers/trainer.py:2699 in │
│ training_step                                                                                    │
│                                                                                                  │
│   2696 │   │   │   return loss_mb.reduce_mean().detach().to(self.args.device)                    │
│   2697 │   │                                                                                     │
│   2698 │   │   with self.compute_loss_context_manager():                                         │
│ ❱ 2699 │   │   │   loss = self.compute_loss(model, inputs)                                       │
│   2700 │   │                                                                                     │
│   2701 │   │   if self.args.n_gpu > 1:                                                           │
│   2702 │   │   │   loss = loss.mean()  # mean() to average on multi-gpu parallel training        │
│                                                                                                  │
│ /home/myuser/dl/textgen_lora_train/textgen/examples/chatglm/../../textgen/chatglm/ch │
│ atglm_model.py:506 in compute_loss                                                               │
│                                                                                                  │
│   503                                                                                            │
│   504 class FinetuneTrainer(Trainer):                                                            │
│   505 │   def compute_loss(self, model, inputs, return_outputs=False):                           │
│ ❱ 506 │   │   return model(                                                                      │
│   507 │   │   │   input_ids=inputs["input_ids"],                                                 │
│   508 │   │   │   labels=inputs["labels"],                                                       │
│   509 │   │   ).loss                                                                             │
│                                                                                                  │
│ /home/myuser/mambaforge/lib/python3.10/site-packages/torch/nn/modules/module.py:1501 │
│ in _call_impl                                                                                    │
│                                                                                                  │
│   1498 │   │   if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks   │
│   1499 │   │   │   │   or _global_backward_pre_hooks or _global_backward_hooks                   │
│   1500 │   │   │   │   or _global_forward_hooks or _global_forward_pre_hooks):                   │
│ ❱ 1501 │   │   │   return forward_call(*args, **kwargs)                                          │
│   1502 │   │   # Do not call functions when jit is used                                          │
│   1503 │   │   full_backward_hooks, non_full_backward_hooks = [], []                             │
│   1504 │   │   backward_pre_hooks = []                                                           │
│                                                                                                  │
│ /home/myuser/mambaforge/lib/python3.10/site-packages/peft/peft_model.py:663 in       │
│ forward                                                                                          │
│                                                                                                  │
│    660 │   ):                                                                                    │
│    661 │   │   peft_config = self.active_peft_config                                             │
│    662 │   │   if not isinstance(peft_config, PromptLearningConfig):                             │
│ ❱  663 │   │   │   return self.base_model(                                                       │
│    664 │   │   │   │   input_ids=input_ids,                                                      │
│    665 │   │   │   │   attention_mask=attention_mask,                                            │
│    666 │   │   │   │   inputs_embeds=inputs_embeds,                                              │
│                                                                                                  │
│ /home/myuser/mambaforge/lib/python3.10/site-packages/torch/nn/modules/module.py:1501 │
│ in _call_impl                                                                                    │
│                                                                                                  │
│   1498 │   │   if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks   │
│   1499 │   │   │   │   or _global_backward_pre_hooks or _global_backward_hooks                   │
│   1500 │   │   │   │   or _global_forward_hooks or _global_forward_pre_hooks):                   │
│ ❱ 1501 │   │   │   return forward_call(*args, **kwargs)                                          │
│   1502 │   │   # Do not call functions when jit is used                                          │
│   1503 │   │   full_backward_hooks, non_full_backward_hooks = [], []                             │
│   1504 │   │   backward_pre_hooks = []                                                           │
│                                                                                                  │
│ /home/myuser/mambaforge/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:82  │
│ in forward                                                                                       │
│                                                                                                  │
│    79 │   │   return getattr(self._orig_mod, name)                                               │
│    80 │                                                                                          │
│    81 │   def forward(self, *args, **kwargs):                                                    │
│ ❱  82 │   │   return self.dynamo_ctx(self._orig_mod.forward)(*args, **kwargs)                    │
│    83                                                                                            │
│    84                                                                                            │
│    85 def remove_from_cache(f):                                                                  │
│                                                                                                  │
│ /home/myuser/mambaforge/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:209 │
│ in _fn                                                                                           │
│                                                                                                  │
│   206 │   │   │   dynamic_ctx = enable_dynamic(self.dynamic)                                     │
│   207 │   │   │   dynamic_ctx.__enter__()                                                        │
│   208 │   │   │   try:                                                                           │
│ ❱ 209 │   │   │   │   return fn(*args, **kwargs)                                                 │
│   210 │   │   │   finally:                                                                       │
│   211 │   │   │   │   set_eval_frame(prior)                                                      │
│   212 │   │   │   │   dynamic_ctx.__exit__(None, None, None)                                     │
│                                                                                                  │
│ /home/myuser/.cache/huggingface/modules/transformers_modules/THUDM/chatglm-6b/35ca52 │
│ 301fbedee885b0838da5d15b7b47faa37c/modeling_chatglm.py:1190 in forward                           │
│                                                                                                  │
│   1187 │   │   use_cache = use_cache if use_cache is not None else self.config.use_cache         │
│   1188 │   │   return_dict = return_dict if return_dict is not None else self.config.use_return  │
│   1189 │   │                                                                                     │
│ ❱ 1190 │   │   transformer_outputs = self.transformer(                                           │
│   1191 │   │   │   input_ids=input_ids,                                                          │
│   1192 │   │   │   position_ids=position_ids,                                                    │
│   1193 │   │   │   attention_mask=attention_mask,                                                │
│                                                                                                  │
│ /home/myuser/mambaforge/lib/python3.10/site-packages/torch/nn/modules/module.py:1501 │
│ in _call_impl                                                                                    │
│                                                                                                  │
│   1498 │   │   if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks   │
│   1499 │   │   │   │   or _global_backward_pre_hooks or _global_backward_hooks                   │
│   1500 │   │   │   │   or _global_forward_hooks or _global_forward_pre_hooks):                   │
│ ❱ 1501 │   │   │   return forward_call(*args, **kwargs)                                          │
│   1502 │   │   # Do not call functions when jit is used                                          │
│   1503 │   │   full_backward_hooks, non_full_backward_hooks = [], []                             │
│   1504 │   │   backward_pre_hooks = []                                                           │
│                                                                                                  │
│ /home/myuser/.cache/huggingface/modules/transformers_modules/THUDM/chatglm-6b/35ca52 │
│ 301fbedee885b0838da5d15b7b47faa37c/modeling_chatglm.py:936 in forward                            │
│                                                                                                  │
│    933 │   │   │   │   past_key_values = tuple([None] * len(self.layers))                        │
│    934 │   │   │                                                                                 │
│    935 │   │   │   if attention_mask is None:                                                    │
│ ❱  936 │   │   │   │   attention_mask = self.get_masks(                                          │
│    937 │   │   │   │   │   input_ids,                                                            │
│    938 │   │   │   │   │   device=input_ids.device                                               │
│    939 │   │   │   │   )                                                                         │
│                                                                                                  │
│ /home/myuser/.cache/huggingface/modules/transformers_modules/THUDM/chatglm-6b/35ca52 │
│ 301fbedee885b0838da5d15b7b47faa37c/modeling_chatglm.py:944 in <graph break in forward>           │
│                                                                                                  │
│    941 │   │   │                                                                                 │
│    942 │   │   │   if position_ids is None:                                                      │
│    943 │   │   │   │   MASK, gMASK = self.config.mask_token_id, self.config.gmask_token_id       │
│ ❱  944 │   │   │   │   seqs = input_ids.tolist()                                                 │
│    945 │   │   │   │                                                                             │
│    946 │   │   │   │   mask_positions, use_gmasks = [], []                                       │
│    947 │   │   │   │   for seq in seqs:                                                          │
│                                                                                                  │
│ /home/myuser/.cache/huggingface/modules/transformers_modules/THUDM/chatglm-6b/35ca52 │
│ 301fbedee885b0838da5d15b7b47faa37c/modeling_chatglm.py:985 in <graph break in forward>           │
│                                                                                                  │
│    982 │   │   │   layer_past = past_key_values[i]                                               │
│    983 │   │   │                                                                                 │
│    984 │   │   │   if self.gradient_checkpointing and self.training:                             │
│ ❱  985 │   │   │   │   layer_ret = torch.utils.checkpoint.checkpoint(                            │
│    986 │   │   │   │   │   layer,                                                                │
│    987 │   │   │   │   │   hidden_states,                                                        │
│    988 │   │   │   │   │   position_ids,                                                         │
│                                                                                                  │
│ /home/myuser/mambaforge/lib/python3.10/site-packages/torch/utils/checkpoint.py:249   │
│ in checkpoint                                                                                    │
│                                                                                                  │
│   246 │   │   raise ValueError("Unexpected keyword arguments: " + ",".join(arg for arg in kwar   │
│   247 │                                                                                          │
│   248 │   if use_reentrant:                                                                      │
│ ❱ 249 │   │   return CheckpointFunction.apply(function, preserve, *args)                         │
│   250 │   else:                                                                                  │
│   251 │   │   return _checkpoint_without_reentrant(                                              │
│   252 │   │   │   function,                                                                      │
│                                                                                                  │
│ /home/myuser/mambaforge/lib/python3.10/site-packages/torch/autograd/function.py:506  │
│ in apply                                                                                         │
│                                                                                                  │
│   503 │   │   if not torch._C._are_functorch_transforms_active():                                │
│   504 │   │   │   # See NOTE: [functorch vjp and autograd interaction]                           │
│   505 │   │   │   args = _functorch.utils.unwrap_dead_wrappers(args)                             │
│ ❱ 506 │   │   │   return super().apply(*args, **kwargs)  # type: ignore[misc]                    │
│   507 │   │                                                                                      │
│   508 │   │   if cls.setup_context == _SingleLevelFunction.setup_context:                        │
│   509 │   │   │   raise RuntimeError(                                                            │
│                                                                                                  │
│ /home/myuser/mambaforge/lib/python3.10/site-packages/torch/utils/checkpoint.py:81 in │
│ forward                                                                                          │
│                                                                                                  │
│    78 │   │   # Accommodates the (remote) possibility that autocast is enabled for cpu AND gpu   │
│    79 │   │   ctx.gpu_autocast_kwargs, ctx.cpu_autocast_kwargs = _get_autocast_kwargs()          │
│    80 │   │   if preserve_rng_state:                                                             │
│ ❱  81 │   │   │   ctx.fwd_cpu_state = torch.get_rng_state()                                      │
│    82 │   │   │   # Don't eagerly initialize the cuda context by accident.                       │
│    83 │   │   │   # (If the user intends that the context is initialized later, within their     │
│    84 │   │   │   # run_function, we SHOULD actually stash the cuda state here.  Unfortunately   │
│                                                                                                  │
│ /home/myuser/mambaforge/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:337 │
│ in catch_errors                                                                                  │
│                                                                                                  │
│   334 │   │   │   │   │   return hijacked_callback(frame, cache_size, hooks)                     │
│   335 │   │                                                                                      │
│   336 │   │   with compile_lock:                                                                 │
│ ❱ 337 │   │   │   return callback(frame, cache_size, hooks)                                      │
│   338 │                                                                                          │
│   339 │   catch_errors._torchdynamo_orig_callable = callback  # type: ignore[attr-defined]       │
│   340 │   return catch_errors                                                                    │
│                                                                                                  │
│ /home/myuser/mambaforge/lib/python3.10/site-packages/torch/_dynamo/convert_frame.py: │
│ 404 in _convert_frame                                                                            │
│                                                                                                  │
│   401 │   def _convert_frame(frame: types.FrameType, cache_size: int, hooks: Hooks):             │
│   402 │   │   counters["frames"]["total"] += 1                                                   │
│   403 │   │   try:                                                                               │
│ ❱ 404 │   │   │   result = inner_convert(frame, cache_size, hooks)                               │
│   405 │   │   │   counters["frames"]["ok"] += 1                                                  │
│   406 │   │   │   return result                                                                  │
│   407 │   │   except (NotImplementedError, Unsupported):                                         │
│                                                                                                  │
│ /home/myuser/mambaforge/lib/python3.10/site-packages/torch/_dynamo/convert_frame.py: │
│ 104 in _fn                                                                                       │
│                                                                                                  │
│   101 │   │   prior_fwd_from_src = torch.fx.graph_module._forward_from_src                       │
│   102 │   │   torch.fx.graph_module._forward_from_src = fx_forward_from_src_skip_result          │
│   103 │   │   try:                                                                               │
│ ❱ 104 │   │   │   return fn(*args, **kwargs)                                                     │
│   105 │   │   finally:                                                                           │
│   106 │   │   │   torch._C._set_grad_enabled(prior_grad_mode)                                    │
│   107 │   │   │   torch.random.set_rng_state(rng_state)                                          │
│                                                                                                  │
│ /home/myuser/mambaforge/lib/python3.10/site-packages/torch/_dynamo/convert_frame.py: │
│ 262 in _convert_frame_assert                                                                     │
│                                                                                                  │
│   259 │   │   global initial_grad_state                                                          │
│   260 │   │   initial_grad_state = torch.is_grad_enabled()                                       │
│   261 │   │                                                                                      │
│ ❱ 262 │   │   return _compile(                                                                   │
│   263 │   │   │   frame.f_code,                                                                  │
│   264 │   │   │   frame.f_globals,                                                               │
│   265 │   │   │   frame.f_locals,                                                                │
│                                                                                                  │
│ /home/myuser/mambaforge/lib/python3.10/site-packages/torch/_dynamo/utils.py:163 in   │
│ time_wrapper                                                                                     │
│                                                                                                  │
│    160 │   │   │   if key not in compilation_metrics:                                            │
│    161 │   │   │   │   compilation_metrics[key] = []                                             │
│    162 │   │   │   t0 = time.time()                                                              │
│ ❱  163 │   │   │   r = func(*args, **kwargs)                                                     │
│    164 │   │   │   time_spent = time.time() - t0                                                 │
│    165 │   │   │   # print(f"Dynamo timer: key={key}, latency={latency:.2f} sec")                │
│    166 │   │   │   compilation_metrics[key].append(time_spent)                                   │
│                                                                                                  │
│ /home/myuser/mambaforge/lib/python3.10/site-packages/torch/_dynamo/convert_frame.py: │
│ 324 in _compile                                                                                  │
│                                                                                                  │
│   321 │   try:                                                                                   │
│   322 │   │   for attempt in itertools.count():                                                  │
│   323 │   │   │   try:                                                                           │
│ ❱ 324 │   │   │   │   out_code = transform_code_object(code, transform)                          │
│   325 │   │   │   │   orig_code_map[out_code] = code                                             │
│   326 │   │   │   │   break                                                                      │
│   327 │   │   │   except exc.RestartAnalysis:                                                    │
│                                                                                                  │
│ /home/myuser/mambaforge/lib/python3.10/site-packages/torch/_dynamo/bytecode_transfor │
│ mation.py:445 in transform_code_object                                                           │
│                                                                                                  │
│   442 │   instructions = cleaned_instructions(code, safe)                                        │
│   443 │   propagate_line_nums(instructions)                                                      │
│   444 │                                                                                          │
│ ❱ 445 │   transformations(instructions, code_options)                                            │
│   446 │   return clean_and_assemble_instructions(instructions, keys, code_options)[1]            │
│   447                                                                                            │
│   448                                                                                            │
│                                                                                                  │
│ /home/myuser/mambaforge/lib/python3.10/site-packages/torch/_dynamo/convert_frame.py: │
│ 311 in transform                                                                                 │
│                                                                                                  │
│   308 │   │   │   export,                                                                        │
│   309 │   │   │   mutated_closure_cell_contents,                                                 │
│   310 │   │   )                                                                                  │
│ ❱ 311 │   │   tracer.run()                                                                       │
│   312 │   │   output = tracer.output                                                             │
│   313 │   │   assert output is not None                                                          │
│   314 │   │   assert output.output_instructions                                                  │
│                                                                                                  │
│ /home/myuser/mambaforge/lib/python3.10/site-packages/torch/_dynamo/symbolic_convert. │
│ py:1726 in run                                                                                   │
│                                                                                                  │
│   1723 │                                                                                         │
│   1724 │   def run(self):                                                                        │
│   1725 │   │   _step_logger()(logging.INFO, f"torchdynamo start tracing {self.f_code.co_name}")  │
│ ❱ 1726 │   │   super().run()                                                                     │
│   1727 │                                                                                         │
│   1728 │   def match_nested_cell(self, name, cell):                                              │
│   1729 │   │   """Match a cell in this method to one in a function we are inlining"""            │
│                                                                                                  │
│ /home/myuser/mambaforge/lib/python3.10/site-packages/torch/_dynamo/symbolic_convert. │
│ py:576 in run                                                                                    │
│                                                                                                  │
│    573 │   │   │   while (                                                                       │
│    574 │   │   │   │   self.instruction_pointer is not None                                      │
│    575 │   │   │   │   and not self.output.should_exit                                           │
│ ❱  576 │   │   │   │   and self.step()                                                           │
│    577 │   │   │   ):                                                                            │
│    578 │   │   │   │   pass                                                                      │
│    579 │   │   except BackendCompilerFailed:                                                     │
│                                                                                                  │
│ /home/myuser/mambaforge/lib/python3.10/site-packages/torch/_dynamo/symbolic_convert. │
│ py:540 in step                                                                                   │
│                                                                                                  │
│    537 │   │   try:                                                                              │
│    538 │   │   │   if not hasattr(self, inst.opname):                                            │
│    539 │   │   │   │   unimplemented(f"missing: {inst.opname}")                                  │
│ ❱  540 │   │   │   getattr(self, inst.opname)(inst)                                              │
│    541 │   │   │                                                                                 │
│    542 │   │   │   return inst.opname != "RETURN_VALUE"                                          │
│    543 │   │   except BackendCompilerFailed:                                                     │
│                                                                                                  │
│ /home/myuser/mambaforge/lib/python3.10/site-packages/torch/_dynamo/symbolic_convert. │
│ py:342 in wrapper                                                                                │
│                                                                                                  │
│    339 │   │   │   state = self.copy_graphstate()                                                │
│    340 │   │   │   reason = None                                                                 │
│    341 │   │   │   try:                                                                          │
│ ❱  342 │   │   │   │   return inner_fn(self, inst)                                               │
│    343 │   │   │   except Unsupported as excp:                                                   │
│    344 │   │   │   │   if self.has_backedge() and self.should_compile_partial_graph():           │
│    345 │   │   │   │   │   msg = "Skipping frame because there is a graph break in a for/while   │
│                                                                                                  │
│ /home/myuser/mambaforge/lib/python3.10/site-packages/torch/_dynamo/symbolic_convert. │
│ py:965 in CALL_FUNCTION                                                                          │
│                                                                                                  │
│    962 │   def CALL_FUNCTION(self, inst):                                                        │
│    963 │   │   args = self.popn(inst.argval)                                                     │
│    964 │   │   fn = self.pop()                                                                   │
│ ❱  965 │   │   self.call_function(fn, args, {})                                                  │
│    966 │                                                                                         │
│    967 │   @break_graph_if_unsupported(push=1)                                                   │
│    968 │   def CALL_FUNCTION_EX(self, inst):                                                     │
│                                                                                                  │
│ /home/myuser/mambaforge/lib/python3.10/site-packages/torch/_dynamo/symbolic_convert. │
│ py:474 in call_function                                                                          │
│                                                                                                  │
│    471 │   │   │   isinstance(x, VariableTracker)                                                │
│    472 │   │   │   for x in itertools.chain(args, kwargs.values())                               │
│    473 │   │   )                                                                                 │
│ ❱  474 │   │   self.push(fn.call_function(self, args, kwargs))                                   │
│    475 │                                                                                         │
│    476 │   def update_locals_and_stack(self, oldvar: VariableTracker, newvar: VariableTracker):  │
│    477 │   │   def repl(v: VariableTracker):                                                     │
│                                                                                                  │
│ /home/myuser/mambaforge/lib/python3.10/site-packages/torch/_dynamo/variables/torch.p │
│ y:368 in call_function                                                                           │
│                                                                                                  │
│   365 │   │   │   def get_state_from_generator():                                                │
│   366 │   │   │   │   return self.value()                                                        │
│   367 │   │   │                                                                                  │
│ ❱ 368 │   │   │   return wrap_fx_proxy(                                                          │
│   369 │   │   │   │   tx=tx,                                                                     │
│   370 │   │   │   │   proxy=tx.output.create_proxy(                                              │
│   371 │   │   │   │   │   "call_function",                                                       │
│                                                                                                  │
│ /home/myuser/mambaforge/lib/python3.10/site-packages/torch/_dynamo/variables/builder │
│ .py:754 in wrap_fx_proxy                                                                         │
│                                                                                                  │
│   751                                                                                            │
│   752                                                                                            │
│   753 def wrap_fx_proxy(tx, proxy, example_value=None, **options):                               │
│ ❱ 754 │   return wrap_fx_proxy_cls(                                                              │
│   755 │   │   target_cls=TensorVariable,                                                         │
│   756 │   │   tx=tx,                                                                             │
│   757 │   │   proxy=proxy,                                                                       │
│                                                                                                  │
│ /home/myuser/mambaforge/lib/python3.10/site-packages/torch/_dynamo/variables/builder │
│ .py:812 in wrap_fx_proxy_cls                                                                     │
│                                                                                                  │
│   809 │   │   │   │   "ignore_subclass": ignore_subclass,                                        │
│   810 │   │   │   │   "is_tensor": target_cls is TensorVariable,                                 │
│   811 │   │   │   }                                                                              │
│ ❱ 812 │   │   │   assert "source" in options and options["source"] is not None                   │
│   813 │   │   │   kwargs["source"] = options["source"]                                           │
│   814 │   │   │   example_value = wrap_to_fake_tensor_and_record(                                │
│   815 │   │   │   │   example_value, tx=tx, **kwargs                                             │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
AssertionError:

from user code:
   File "/home/myuser/mambaforge/lib/python3.10/site-packages/torch/random.py", line 23, in get_rng_state
    return default_generator.get_state()

Set torch._dynamo.config.verbose=True for more information


You can suppress this exception and fall back to eager by setting:
    torch._dynamo.config.suppress_errors = True

  0%|                                                                                                                   | 0/234 [00:06<?, ?it/s]

@shibing624
Copy link
Owner

Update the chatglm-6b files.

@bash99
Copy link

bash99 commented Apr 20, 2023

Update the chatglm-6b files.

Is your torch version 2.0 or 1.13.1? I'm on torch 2.0. The earlier error is probably related to pytorch/pytorch#97077, but after barely working around it I hit the new error above.

@shibing624
Copy link
Owner

Update the code.

@bash99
Copy link

bash99 commented Apr 21, 2023

Update the code.

It finally works, but continuing training still throws an error. It looks like peft fails while loading the previously trained model; does the current setup not support continuing training?

textgen/examples/chatglm$ python training_chatglm_demo.py --do_train

2023-04-21 11:15:08.224 | INFO     | textgen.chatglm.chatglm_model:train_model:235 - Restarting from ./outputs/adapter_model.bin
...
│ /DaTa/dl/textgen_lora_train/textgen/examples/chatglm/../../textgen/chatglm/chatglm_model.py:241 in train_model                                                                
				if os.path.exists(checkpoint_name):
					logger.info(f"Restarting from {checkpoint_name}")
					adapters_weights = torch.load(checkpoint_name)
					self.model = set_peft_model_state_dict(self.model, adapters_weights)
│   238 │   │   │   │   else:                                                                      │
│   239 │   │   │   │   │   logger.warning(f"Checkpoint {checkpoint_name} not found")              │
│   240 │   │   │                                                                                  │
│ ❱ 241 │   │   │   self.model.print_trainable_parameters()  # Be more transparent about the % o   │
│   242 │   │   else:                                                                              │
│   243 │   │   │   logger.warning("Now full model params fine-tune, which is slow, set `use_lor   │
│   244 │   │   os.makedirs(output_dir, exist_ok=True)                                             │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
AttributeError: 'NoneType' object has no attribute 'print_trainable_parameters'

@shibing624
Copy link
Owner

Continuing training from a LoRA checkpoint hasn't been tested for the current model. I'll add it as a todo.

@shibing624
Copy link
Owner

Update the code.

It finally works, but continuing training still throws an error. It looks like peft fails while loading the previously trained model; does the current setup not support continuing training?

textgen/examples/chatglm$ python training_chatglm_demo.py --do_train

(same traceback as above, ending in AttributeError: 'NoneType' object has no attribute 'print_trainable_parameters')

fixed. 633e376

@bash99
Copy link

bash99 commented Apr 23, 2023

I've finished training on my side, and the output for "少先队员因该为老人让坐" is now correct. But looking at the final loss and train_result I'm still a bit puzzled: the loss seems to stop improving quite early. Could the checkpoint from some intermediate step actually work better?
train_result:

epoch = 1.0
train_loss = 0.14049111964047134
train_runtime = 35059.424
train_samples_per_second = 7.183
train_steps_per_second = 3.592

Intermediate outputs and some of the final ones:

{'loss': 0.0864, 'learning_rate': 5.5363014025000404e-05, 'epoch': 0.72}
...
{'loss': 0.078, 'learning_rate': 4.440667736145746e-05, 'epoch': 0.78}
...
{'loss': 0.1016, 'learning_rate': 3.511809272701282e-05, 'epoch': 0.82}
{'loss': 0.0743, 'learning_rate': 3.503867596372242e-05, 'epoch': 0.83}
{'loss': 0.0851, 'learning_rate': 3.495925920043203e-05, 'epoch': 0.83}
{'loss': 0.1033, 'learning_rate': 3.4879842437141635e-05, 'epoch': 0.83}
...
{'loss': 0.1196, 'learning_rate': 1.4237837322702078e-05, 'epoch': 0.93}
{'loss': 0.0837, 'learning_rate': 1.4158420559411681e-05, 'epoch': 0.93}
{'loss': 0.0762, 'learning_rate': 1.4079003796121287e-05, 'epoch': 0.93}
{'loss': 0.1166, 'learning_rate': 1.399958703283089e-05, 'epoch': 0.93}
...
{'loss': 0.0731, 'learning_rate': 1.1618672469384839e-05, 'epoch': 0.94}
...
{'loss': 0.0708, 'learning_rate': 2.4095045982305945e-06, 'epoch': 0.99}
...
{'loss': 0.1309, 'learning_rate': 3.4784542321193156e-07, 'epoch': 1.0}
{'loss': 0.0888, 'learning_rate': 2.6842865992153625e-07, 'epoch': 1.0}
{'loss': 0.0981, 'learning_rate': 1.8901189663114092e-07, 'epoch': 1.0}
{'loss': 0.0906, 'learning_rate': 1.0959513334074557e-07, 'epoch': 1.0}
{'train_runtime': 35059.424, 'train_samples_per_second': 7.183, 'train_steps_per_second': 3.592, 'train_loss': 0.14049111964047134, 'epoch': 1.0}
100%| | 125918/125918 [9:44:19<00:00,  3.59it/s]
2023-04-21 21:12:20.531 | INFO     | textgen.chatglm.chatglm_model:handle_metrics:327 - ***** train metrics *****
2023-04-21 21:12:20.532 | INFO     | textgen.chatglm.chatglm_model:handle_metrics:329 -   epoch = 1.0
2023-04-21 21:12:20.532 | INFO     | textgen.chatglm.chatglm_model:handle_metrics:329 -   train_loss = 0.14049111964047134
2023-04-21 21:12:20.532 | INFO     | textgen.chatglm.chatglm_model:handle_metrics:329 -   train_runtime = 35059.424
2023-04-21 21:12:20.532 | INFO     | textgen.chatglm.chatglm_model:handle_metrics:329 -   train_samples_per_second = 7.183
2023-04-21 21:12:20.532 | INFO     | textgen.chatglm.chatglm_model:handle_metrics:329 -   train_steps_per_second = 3.592
2023-04-21 21:12:20.563 | DEBUG    | textgen.chatglm.chatglm_model:train_model:308 - metrics: {'train_runtime': 35059.424, 'train_samples_per_second': 7.183, 'train_steps_per_second': 3.592, 'train_loss': 0.14049111964047134, 'epoch': 1.0}
2023-04-21 21:12:20.563 | INFO     | textgen.chatglm.chatglm_model:train_model:309 -  Training of /DaTa/textgen_lora_train/chatglm-6b model complete. Saved to ./outputs-csc/.


@shibing624
Copy link
Owner

Which checkpoint works best depends on evaluation for the specific task; the lowest train loss does not necessarily mean the best model. 1) Spot-check cases for each checkpoint; 2) compute ROUGE/BLEU for each checkpoint; 3) for the CSC task, look at the F1 score on the test set.
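
For item 3, a minimal sketch of sentence-level correction precision/recall/F1 (a simplified stand-in, not pycorrector's eval.py; it only assumes a list of (input, truth, predict) triples like the ones quoted in the next comment):

# Sentence-level CSC metrics: a correction counts as a true positive only if the
# model changed the input and the changed sentence exactly matches the truth.
def csc_sentence_f1(triples):
    tp = fp = fn = 0
    for src, truth, pred in triples:
        changed_gold = truth != src       # the gold answer requires a correction
        changed_pred = pred != src        # the model made a correction
        if changed_pred and pred == truth:
            tp += 1                       # corrected, and corrected to the right sentence
        elif changed_pred:
            fp += 1                       # corrected, but wrongly (includes over-correction)
        elif changed_gold:
            fn += 1                       # a needed correction was missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

triples = [
    ("少先队员因该为老人让坐。", "少先队员应该为老人让座。", "少先队员应该为老人让座。"),
]
print(csc_sentence_f1(triples))  # (1.0, 1.0, 1.0) for this single correct case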

@bash99
Copy link

bash99 commented Apr 23, 2023

Which checkpoint works best depends on evaluation for the specific task; the lowest train loss does not necessarily mean the best model. 1) Spot-check cases for each checkpoint; 2) compute ROUGE/BLEU for each checkpoint; 3) for the CSC task, look at the F1 score on the test set.

The CSC task's test set doesn't seem to be entirely correct either? For example the cases below (output from a trial run with a modified pycorrector/utils/eval.py):

input  : 后来客人非常地生气,然后叫我过来。
truth  : 后来客人非常地生气,然后叫我过来。
predict: 后来客人非常地生气,然后叫我过去。 错误字:来
wrong

input  : 总而言之,正规教育是需要的,但是必要的是学者学习的过程与现在,如何减化不愉快的课程、如何解放学习的压力,这不是学该单方摸索,而是需要适当的辅导老师。
truth  : 总而言之,正规教育是需要的,但是必要的是学者学习的过程与现在,如何减化不愉快的课程、如何解放学习的压力,这不是学该单方摸索,而是需要适当的辅导老师。
predict: 总而言之,正规教育是需要的,但是必要的是学者学习的过程与现在,如何减化不愉快的课程、如何解放学习的压力,这不是学生单方摸索,而是需要适当的辅导老师。 错误字:该
wrong

There are also some that look like differences between Taiwan and mainland usage:

input  : 可是从妈妈给我十岁生日的礼物一个口琴那时候我就发现我是艺术者。
truth  : 可是从妈妈给我十岁生日的礼物一个口琴那时候我就发现我是艺术者。
predict: 可是从妈妈给我十岁生日的礼物一个口琴那时候我就发现我是艺术家。 错误字:者
wrong

@shibing624
Copy link
Owner

Yes, the SIGHAN dataset isn't high quality.

@bash99
Copy link

bash99 commented Apr 23, 2023

Yes, the SIGHAN dataset isn't high quality.

I'll go revise it by hand once; it's only 1,000 entries, so it should be manageable.
Also, what values of temperature or top_p would be more appropriate?

Running this sentence with the default settings, 6 out of 10 attempts were correct.

['这个人很利害。', '错误字:']
['这个人很利害。', '错误字:']
['这个人很厉害。', '错误字:利']
['这个人很利害。', '错误字:']
['这个人很危险。', '错误字:利']
['这个人很厉害。', '错误字:利']
['这个人很厉害。', '错误字:利']
['这个人很厉害。', '错误字:利']
['这个人很厉害。', '错误字:利']
['这个人很利害。', '错误字:']

@shibing624
Copy link
Owner

I'd suggest tuning top_p, num_beams, repetition_penalty, temperature, and do_sample=True;

if the generated text repeats itself, raise repetition_penalty;

a task like CSC is not complex and the training samples are few, so lower temperature.

These are empirical settings; the exact values depend on the task and are not fixed.
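
As a rough illustration of where these arguments go at predict time, here is a hedged sketch using the plain transformers/peft API directly (the exact textgen predict() signature may differ, and the parameter values are only illustrative):

from transformers import AutoModel, AutoTokenizer
from peft import PeftModel

base = "THUDM/chatglm-6b"
tokenizer = AutoTokenizer.from_pretrained(base, trust_remote_code=True)
model = AutoModel.from_pretrained(base, trust_remote_code=True).half().cuda()
# attach the CSC LoRA adapter discussed in this thread
model = PeftModel.from_pretrained(model, "shibing624/chatglm-6b-csc-zh-lora")
model.eval()

prompt = "对下面中文拼写纠错:\n少先队员因该为老人让坐。\n答:"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(
    **inputs,
    max_new_tokens=128,
    do_sample=True,          # sampling on; set False for deterministic output
    temperature=0.1,         # low temperature for a simple, low-variance task like CSC
    top_p=0.7,
    num_beams=1,
    repetition_penalty=1.1,  # raise this if generations start repeating
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))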

@shibing624 shibing624 changed the title RuntimeError: Expected 4-dimensional input for 4-dimensional weight [8192, 8, 1, 1], but got 3-dimensional input of size [1, 16, 4096] instead ChatGLM模型微调问题咨询 Apr 23, 2023
@shibing624 shibing624 added the enhancement New feature or request label Apr 23, 2023
@bash99
Copy link

bash99 commented Apr 23, 2023

If the generated text repeats itself, raise repetition_penalty;

That's not repetition within one generation; I called chat ten times and found the results unstable.

Do you mean adjusting these parameters during training?

@shibing624
Copy link
Owner

  1. I know; what I meant was to lower temperature;
  2. These are prediction-time parameters.

@km1994
Copy link

km1994 commented Apr 24, 2023

Running

chatglm$ python predict_demo.py

throws the error below:

RuntimeError: self and mat2 must have the same dtype

@bash99
Copy link

bash99 commented May 9, 2023

fixed. 633e376

I've run into a rather odd phenomenon: continuing training with training_chatglm_csc_demo.py seems to have no effect, and it even forgets what the earlier training had learned.
That happens even though I added the line "resume_from_checkpoint": args.output_dir, to the file:

diff --git a/examples/chatglm/training_chatglm_csc_demo.py b/examples/chatglm/training_chatglm_csc_demo.py
index 84e066b..d291130 100644
--- a/examples/chatglm/training_chatglm_csc_demo.py
+++ b/examples/chatglm/training_chatglm_csc_demo.py
@@ -81,6 +81,7 @@ def main():
             "per_device_train_batch_size": args.batch_size,
             "num_train_epochs": args.num_epochs,
             "output_dir": args.output_dir,
+            "resume_from_checkpoint": args.output_dir,
         }
         model = ChatGlmModel(args.model_type, args.model_name, args=model_args)

I tried an outputs-csc that had been trained for 1 epoch and worked fine; continuing training for even 0.01 epoch immediately breaks it.
Because the trained answers follow a fixed format such as "错误字: 因", the regression is very obvious. The log, however, shows that the previous LoRA was loaded:

2023-05-09 13:28:16.469 | INFO     | textgen.chatglm.chatglm_model:load_lora:354 - Loaded lora model from outputs-csc/adapter_model.bin
2023-05-09 13:28:16.557 | INFO     | textgen.chatglm.chatglm_model:train_model:237 - Restarting from outputs-csc/adapter_model.bin

I also trained other data with training_chatglm_hfdataset_demo.py and continued training 7 times (1 epoch each); those results seem normal.

@shibing624
Copy link
Owner

Update the code.

@bash99
Copy link

bash99 commented May 12, 2023

Update the code.

After updating the code, only 'training_chatglm_adgen_demo.py' and 'training_chatglm_demo.py' still have the "resume_from_checkpoint" parameter?
Is this parameter no longer needed?

But whether or not I add this parameter, continuing training for 0.1 or even 0.01 epoch on a model that already outputs the standard format "错误字: 因" correctly makes it lose the ability it had learned.
The only difference the parameter seems to make is whether the third line below appears:

2023-05-12 01:59:49.638 | INFO     | textgen.chatglm.chatglm_model:load_peft_model:439 - Loaded peft model from output-csc/adapter_model.bin
2023-05-12 01:59:49.640 | INFO     | textgen.chatglm.chatglm_model:train_model:229 - Using PEFT type: LORA
2023-05-12 01:59:49.690 | INFO     | textgen.chatglm.chatglm_model:train_model:310 - Restarting from output-csc/adapter_model.bin

@shibing624
Copy link
Owner

  1. "resume_from_checkpoint" 参数是一个可选参数,代表从之前的中间结果恢复继续训练,所有模型都可以加;
  2. 基于已经训练好的lora继续训练的推荐做法,是把原来的lora权重merge到base model,再基于新base model训练;
  3. 训练时的peft_name已经取消了,原因是第二点,我也给了merge脚本;peft_name 只推荐预测来用;
  4. 基于纠错数据集来纠错微调,本来就是为了专项纠错领域处理,不用做其他任务的,数据集少,如100条也能训练,推荐epochs=20以上才能拟合任务;参考 chatglm用lora训练完predict出的结果和重新加载模型和lora后输出的结果差异很大 #27
  5. 如果加了csc纠错数据集微调,还要保留原来的对话等通用能力,建议在微调数据集中补充alpaca-zh等通用数据集,再训就可以,训练epoch推荐多一些(大于10),本质是相比大模型的预训练,微调数据集任务单一,且数据量相对太小,拟合好比较难。
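
For point 2, a minimal merge sketch (assuming a peft version that provides merge_and_unload(); the directory names are placeholders, and this is not the repo's actual merge script):

from transformers import AutoModel, AutoTokenizer
from peft import PeftModel

base_dir = "THUDM/chatglm-6b"            # original base model
lora_dir = "./outputs-csc"               # trained LoRA adapter directory (placeholder)
merged_dir = "./chatglm-6b-csc-merged"   # where the merged base model will be saved

base = AutoModel.from_pretrained(base_dir, trust_remote_code=True).half()
model = PeftModel.from_pretrained(base, lora_dir)
model = model.merge_and_unload()         # fold the LoRA deltas into the base weights

model.save_pretrained(merged_dir)
AutoTokenizer.from_pretrained(base_dir, trust_remote_code=True).save_pretrained(merged_dir)
# The custom *.py files (modeling_chatglm.py, configuration_chatglm.py, ...) from the
# original chatglm-6b directory still need to be copied into merged_dir by hand.

Continued training can then point --model_name at merged_dir, as the follow-up comments below describe.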

@leiqing110
Copy link

Regarding point 2, "the recommended way to keep training on top of an already trained LoRA is to merge the old LoRA weights into the base model and then train on that new base model": does that mean that after the merge, you point the --model_name parameter at the merged model to continue training?

@shibing624
Copy link
Owner

shibing624 commented May 30, 2023

  1. Yes, specify the merged model dir via the --model_name parameter to continue training.
  2. The latest code also supports the resume logic: set "resume_from_checkpoint": old_model_dir and training continues from there (a hedged sketch follows this comment).
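
A hedged sketch of that setup, modeled on the model_args dict shown in the diff earlier in this thread (the directory names and the "chatglm" model_type string are placeholders; check the demo scripts for the exact arguments and data format):

from textgen import ChatGlmModel  # assuming textgen exposes ChatGlmModel at the top level

model_args = {
    "use_lora": True,
    "per_device_train_batch_size": 8,
    "num_train_epochs": 1,
    "output_dir": "./outputs-csc-v2",
    # resume from the adapter saved by the previous run
    "resume_from_checkpoint": "./outputs-csc",
}

# model_name can point at the merged base model from the sketch above,
# or at the original chatglm-6b directory
model = ChatGlmModel("chatglm", "./chatglm-6b-csc-merged", args=model_args)

train_data = "./data/train.json"  # placeholder; use the same data format as the demo scripts
model.train_model(train_data)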

@feng-1985
Copy link

General datasets like alpaca-zh? Is that a corpus chatglm was pretrained on? How can we know which corpora the model was trained on?

@shibing624
Copy link
Owner

  1. chatglm was not trained on alpaca-zh; the alpaca data was released after ChatGLM-6B.
  2. How can we know which corpora the model was trained on? I don't know; ask the Tsinghua team. They haven't released a paper and haven't described the pretraining data.

@feng-1985
Copy link

I had assumed that keeping the original chat and other general abilities required adding corpora that chatglm had been trained on, hence my question. So in fact a general dataset like alpaca-zh can also avoid the forgetting problem.

@wujiankun123
Copy link

Regarding point 2, "the recommended way to keep training on top of an already trained LoRA is to merge the old LoRA weights into the base model and then train on that new base model": does that mean that after the merge, you point the --model_name parameter at the merged model to continue training?

Did you get this resolved? When I point the --model_name parameter at the merged directory and continue training, I get this error: OSError: /opt/models/THUDM_chatglm-6b-lora/ does not appear to have a file named configuration_chatglm.py.

@shibing624
Copy link
Owner

Manually copy configuration_chatglm.py and the other .py files from the original official chatglm directory into that model directory.
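
For example, a small hedged snippet (paths are placeholders) that copies the custom model code into the merged/LoRA model directory:

import glob
import shutil

src_dir = "/path/to/original/chatglm-6b"       # official chatglm-6b files (placeholder)
dst_dir = "/opt/models/THUDM_chatglm-6b-lora"  # the directory from the OSError above

# configuration_chatglm.py, modeling_chatglm.py, tokenization_chatglm.py, quantization.py, ...
for py_file in glob.glob(f"{src_dir}/*.py"):
    shutil.copy(py_file, dst_dir)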

Copy link

stale bot commented Dec 27, 2023

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions. (Due to prolonged inactivity, the bot automatically closes this issue; feel free to ask again if needed.)

@stale stale bot added the wontfix This will not be worked on label Dec 27, 2023