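[THUDM/ChatGLM-6B][BUG/Help] How can p-tuning be run on multiple GPUs for fine-tuning?

The README only covers multi-GPU deployment. How do I fine-tune on multiple GPUs? Single-GPU fine-tuning is currently very slow: just 50 samples take more than 4 hours. Would multi-GPU fine-tuning be faster?

Single-GPU fine-tuning succeeds, but it takes a long time.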

Environment

- OS:EulerOS
- Python:3.10
- Transformers:4.26.1
- PyTorch:1.12
- CUDA Support (`python -c "import torch; print(torch.cuda.is_available())"`) :True

tykuyh

With Transformers 4.27.1, multi-GPU deployment also fails. Following the README steps directly, I get this error:

RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:1 and cuda:0! (when checking argument for argument weight in method wrapper_CUDA__native_layer_norm)

tykuyh

Change CUDA_VISIBLE_DEVICES in the script to the GPU IDs you want to use (for example 0,1,2,3).

duzx16
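
For readers unfamiliar with the variable mentioned above: CUDA_VISIBLE_DEVICES limits which physical GPUs a process can see, and the visible ones are renumbered from 0 in the order listed. The sketch below only illustrates that renumbering in plain Python. `visible_device_map` is a hypothetical helper, not part of PyTorch or this repository, and the value "1,3" is just an example.

```python
def visible_device_map(env_value):
    """Map in-process (logical) GPU index -> physical GPU id.

    Hypothetical helper for illustration only: it handles a simple
    comma-separated list of integer ids such as "0,1,2,3".
    """
    ids = [int(part) for part in env_value.split(",") if part.strip()]
    return {logical: physical for logical, physical in enumerate(ids)}


# With CUDA_VISIBLE_DEVICES=1,3 the process sees two GPUs:
# cuda:0 is physical GPU 1 and cuda:1 is physical GPU 3.
print(visible_device_map("1,3"))
```

So a device index that appears in an error message refers to the renumbered index inside the process, not necessarily to the physical card with that number.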

> With Transformers 4.27.1, multi-GPU deployment also fails. Following the README steps directly, I get this error:
>
> RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:1 and cuda:0! (when checking argument for argument weight in method wrapper_CUDA__native_layer_norm)

Which model are you running?

duzx16

> With Transformers 4.27.1, multi-GPU deployment also fails. Following the README steps directly, I get this error: RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:1 and cuda:0! (when checking argument for argument weight in method wrapper_CUDA__native_layer_norm)
>
> Which model are you running?

I'm running the non-quantized model (about 14 GB) with the api.py script. Without any changes to the scripts or code it runs fine. But after adding `from utils import load_model_on_gpus` to the script and changing `model = AutoModel.from_pretrained("THUDM/chatglm-6b", trust_remote_code=True).half().cuda()` to `model = load_model_on_gpus("THUDM/chatglm-6b", num_gpus=2).half().cuda()`, running it produces this error.

tykuyh

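> Change CUDA_VISIBLE_DEVICES in the script to the GPU IDs you want to use (for example 0,1,2,3).

That works, thanks! One small puzzle, though: it seems I cannot specify GPU 4, and I can specify at most four GPUs. Either case triggers ValueError: 130004 is not in list. In the end I tried 1,2,3,5, and that ran normally.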

Full error:

ValueError: Caught ValueError in replica 4 on device 4.
Original Traceback (most recent call last):
  File "/root/anaconda3/envs/chatGLM/lib/python3.10/site-packages/torch/nn/parallel/parallel_apply.py", line 64, in _worker
    output = module(*input, **kwargs)
  File "/root/anaconda3/envs/chatGLM/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/root/.cache/huggingface/modules/transformers_modules/chatglm-6b/modeling_chatglm.py", line 1190, in forward
    transformer_outputs = self.transformer(
  File "/root/anaconda3/envs/chatGLM/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/root/.cache/huggingface/modules/transformers_modules/chatglm-6b/modeling_chatglm.py", line 936, in forward
    attention_mask = self.get_masks(
  File "/root/.cache/huggingface/modules/transformers_modules/chatglm-6b/modeling_chatglm.py", line 682, in get_masks
    context_lengths = [seq.tolist().index(self.config.bos_token_id) for seq in input_ids]
  File "/root/.cache/huggingface/modules/transformers_modules/chatglm-6b/modeling_chatglm.py", line 682, in <listcomp>
    context_lengths = [seq.tolist().index(self.config.bos_token_id) for seq in input_ids]
ValueError: 130004 is not in list

tykuyh
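
The final frame of the traceback above is a plain `list.index` call, which raises ValueError whenever the value it looks for is absent. Here is a minimal sketch of that failure mode, using ordinary Python lists in place of tensors. The token IDs other than 130004 are made up, and `context_lengths` is a stand-in that mirrors only the one line shown in the traceback, not the real `get_masks` method.

```python
BOS_TOKEN_ID = 130004  # value taken from the error message in the traceback


def context_lengths(input_ids, bos_token_id=BOS_TOKEN_ID):
    # Same pattern as the line in the traceback, applied to plain lists:
    # find the position of the BOS token in every sequence of the batch.
    return [seq.index(bos_token_id) for seq in input_ids]


# A sequence that contains the BOS token works as expected.
print(context_lengths([[11, 12, 130004, 13]]))  # [2]

# A sequence without the BOS token raises ValueError, as in the report.
try:
    context_lengths([[11, 12, 13]])
except ValueError as err:
    print("ValueError:", err)
```

This shows only how the exception arises once a sequence lacks token 130004. Why one replica received such a sequence is not established in this thread.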

I changed ptuning/train.sh to use GPUs 0,1,2 as shown below, and it keeps failing with OOM, although a single GPU runs fine.

PRE_SEQ_LEN=128
LR=2e-2

CUDA_VISIBLE_DEVICES=0,1,2 python3 main.py \
    --do_train \
    --train_file data/PanguData/train.json \
    --validation_file data/PanguData/dev.json \
    --prompt_column content \
    --response_column summary \
    --overwrite_cache \
    --model_name_or_path /home/llm_files/chatglm-6b-v1_1 \
    --output_dir output/pangu-chatglm-6b-pt-$PRE_SEQ_LEN-$LR \
    --overwrite_output_dir \
    --max_source_length 128 \
    --max_target_length 128 \
    --per_device_train_batch_size 1 \
    --per_device_eval_batch_size 1 \
    --gradient_accumulation_steps 16 \
    --predict_with_generate \
    --max_steps 200 \
    --logging_steps 10 \
    --save_steps 10 \
    --learning_rate $LR \
    --pre_seq_len $PRE_SEQ_LEN \

HelixPark
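
One thing worth checking with a script like the one above is how much data each optimizer step consumes. Assuming main.py relies on the Hugging Face Trainer (an assumption here, not confirmed in this thread), the total batch per step is per_device_train_batch_size × number of visible GPUs × gradient_accumulation_steps, while --max_steps fixes the number of steps. A rough sketch of that arithmetic follows, with `samples_consumed` as a hypothetical helper.

```python
def samples_consumed(per_device_batch, num_gpus, grad_accum_steps, max_steps):
    """Return (samples per optimizer step, samples over the whole run).

    Hypothetical helper: it only reproduces the batch-size arithmetic
    described above and does not call any training library.
    """
    per_step = per_device_batch * num_gpus * grad_accum_steps
    return per_step, per_step * max_steps


# Values from the script: batch size 1, accumulation 16, max_steps 200.
single_gpu = samples_consumed(1, 1, 16, 200)
three_gpus = samples_consumed(1, 3, 16, 200)

print(single_gpu)  # (16, 3200)
print(three_gpus)  # (48, 9600)
```

Under that assumption, a 3-GPU run with the same --max_steps processes three times as many samples as the single-GPU run, so the wall-clock times of the two runs are not directly comparable.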

@duzx16 Why is it that, after setting up multi-GPU fine-tuning this way, resource usage and run time are far higher than with single-GPU fine-tuning?

summer-silence

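> @wutingjun Setting CUDA_VISIBLE_DEVICES=0,1 for multi-GPU training seems problematic. When I set it this way, it takes much longer than a single GPU. Have you solved this?

I ran into the same problem. Single-GPU training takes only an hour and a half, but the simplest demo on 8 GPUs ran for 2 days, and the fine-tuning result was extremely poor, worse than with a single GPU.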

VictoryBlue

@VictoryBlue Has this problem been solved?

summer-silence

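> @wutingjun Setting CUDA_VISIBLE_DEVICES=0,1 for multi-GPU training seems problematic. When I set it this way, it takes much longer than a single GPU. Have you solved this?
>
> I ran into the same problem. Single-GPU training takes only an hour and a half, but the simplest demo on 8 GPUs ran for 2 days, and the fine-tuning result was extremely poor, worse than with a single GPU.

I have also found multi-GPU training to be very slow, much slower than a single GPU.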

coolge
