[THUDM/ChatGLM-6B] P-Tuning v2: bug when switching from single-GPU to multi-GPU training [BUG/Help]

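While fine-tuning with P-Tuning v2, I ran the demo and found that chatglm-6b can only be trained on a single GPU; multi-GPU training fails with the error below. If I switch the model to chatglm-6b-int8, it runs. Has anyone else run into this?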

CUDA_VISIBLE_DEVICES=0,1 python3 main.py \
    --do_train \
    --train_file AdvertiseGen/train.json \
    --validation_file AdvertiseGen/dev.json \
    --prompt_column content \
    --response_column summary \
    --overwrite_cache \
    --model_name_or_path /home/ChatGLM-6B/model/chatglm-6b \
    --output_dir output/adgen-chatglm-6b-pt-$PRE_SEQ_LEN-$LR \
    --overwrite_output_dir \
    --max_source_length 64 \
    --max_target_length 64 \
    --per_device_train_batch_size 8 \
    --per_device_eval_batch_size 1 \
    --gradient_accumulation_steps 16 \
    --predict_with_generate \
    --max_steps 3000 \
    --logging_steps 10 \
    --save_steps 1000 \
    --learning_rate $LR \
    --pre_seq_len $PRE_SEQ_LEN

Traceback (most recent call last):
  File "/home/ChatGLM-6B/ptuning/main.py", line 433, in <module>
    main()
  File "/home/ChatGLM-6B/ptuning/main.py", line 372, in main
    train_result = trainer.train(resume_from_checkpoint=checkpoint)
  File "/home/ChatGLM-6B/ptuning/trainer.py", line 1635, in train
    return inner_training_loop(
  File "/home/ChatGLM-6B/ptuning/trainer.py", line 1904, in _inner_training_loop
    tr_loss_step = self.training_step(model, inputs)
  File "/home/ChatGLM-6B/ptuning/trainer.py", line 2665, in training_step
    loss.backward()
  File "/home/anaconda3/envs/vicuna/lib/python3.10/site-packages/torch/_tensor.py", line 487, in backward
    torch.autograd.backward(
  File "/home/anaconda3/envs/vicuna/lib/python3.10/site-packages/torch/autograd/__init__.py", line 200, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 96.00 MiB (GPU 0; 23.68 GiB total capacity; 22.49 GiB already allocated; 29.31 MiB free; 22.63 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

Environment

- OS:
- Python:
- Transformers:
- PyTorch:
- CUDA Support (`python -c "import torch; print(torch.cuda.is_available())"`) :

wutingjun
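
The out-of-memory message in the traceback above already contains enough numbers to tell whether fragmentation is a likely cause. The following is a minimal sketch, not part of the ChatGLM-6B repository: `parse_oom` is a hypothetical helper that pulls the sizes out of the message text and computes the gap between reserved and allocated memory. It assumes the message layout shown in the traceback; other PyTorch versions may word it differently.

```python
import re

# Message text copied from the traceback in the original post.
OOM_MESSAGE = (
    "CUDA out of memory. Tried to allocate 96.00 MiB "
    "(GPU 0; 23.68 GiB total capacity; 22.49 GiB already allocated; "
    "29.31 MiB free; 22.63 GiB reserved in total by PyTorch)"
)

_UNIT_TO_MIB = {"MiB": 1.0, "GiB": 1024.0}


def _to_mib(value: str, unit: str) -> float:
    """Convert a size such as ('22.49', 'GiB') to MiB."""
    return float(value) * _UNIT_TO_MIB[unit]


def parse_oom(message: str) -> dict:
    """Extract the sizes (in MiB) reported by a PyTorch CUDA OOM message.

    Hypothetical helper for illustration; it only understands the
    message layout shown above.
    """
    patterns = {
        "requested": r"Tried to allocate ([\d.]+) (MiB|GiB)",
        "total": r"([\d.]+) (MiB|GiB) total capacity",
        "allocated": r"([\d.]+) (MiB|GiB) already allocated",
        "free": r"([\d.]+) (MiB|GiB) free",
        "reserved": r"([\d.]+) (MiB|GiB) reserved",
    }
    sizes = {}
    for name, pattern in patterns.items():
        match = re.search(pattern, message)
        if match:
            sizes[name] = _to_mib(*match.groups())
    gpu = re.search(r"GPU (\d+);", message)
    if gpu:
        sizes["gpu_index"] = int(gpu.group(1))
    return sizes


report = parse_oom(OOM_MESSAGE)
# Memory that PyTorch has reserved but not handed out to tensors.
# Only this part could be affected by tuning max_split_size_mb.
report["reserved_unallocated"] = report["reserved"] - report["allocated"]
print(report)
```

Here the reserved-but-unallocated gap is small compared with the 22.49 GiB already allocated, which suggests that GPU 0 is genuinely full rather than fragmented, so `max_split_size_mb` alone seems unlikely to help in this case.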

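@wutingjun I suspect that enabling multi-GPU training with CUDA_VISIBLE_DEVICES=0,1 is itself problematic: when I set it up this way, training takes much longer than on a single GPU. Have you solved this problem?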

summer-silence

I haven't run into that problem; for me, multi-GPU is faster than single-GPU. During training, nvidia-smi shows that both GPUs are indeed in use. What I am currently debugging is LoRA, where single-GPU training runs but multi-GPU training fails.

wutingjun
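
The nvidia-smi check mentioned above can be scripted. The following is a minimal sketch: `parse_gpu_csv`, `busy_gpus`, and `query_gpus` are hypothetical helper names, the 10% utilization threshold is an arbitrary assumption, and the sample text is made up for illustration rather than taken from this thread. It relies on nvidia-smi's `--query-gpu` and `--format=csv,noheader,nounits` options.

```python
import subprocess

# Ask nvidia-smi for one CSV row per GPU: index, utilization, memory used/total.
QUERY = [
    "nvidia-smi",
    "--query-gpu=index,utilization.gpu,memory.used,memory.total",
    "--format=csv,noheader,nounits",
]


def parse_gpu_csv(text):
    """Parse rows like '0, 98, 22900, 24576' into dictionaries."""
    rows = []
    for line in text.strip().splitlines():
        index, util, used, total = (field.strip() for field in line.split(","))
        rows.append(
            {
                "index": int(index),
                "util_pct": int(util),
                "used_mib": int(used),
                "total_mib": int(total),
            }
        )
    return rows


def busy_gpus(rows, min_util_pct=10):
    """Return the indices of GPUs whose utilization meets the threshold."""
    return [row["index"] for row in rows if row["util_pct"] >= min_util_pct]


def query_gpus():
    """Run nvidia-smi and parse its output (needs an NVIDIA driver installed)."""
    result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return parse_gpu_csv(result.stdout)


# Hypothetical sample output for a two-GPU machine, used instead of a live query.
SAMPLE = "0, 98, 22900, 24576\n1, 95, 21800, 24576\n"
sample_rows = parse_gpu_csv(SAMPLE)
print(busy_gpus(sample_rows))
```

On a machine with GPUs, calling `busy_gpus(query_gpus())` while training is running would list the devices that are actually doing work. Values that nvidia-smi reports as unavailable are not handled in this sketch.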

@wutingjun OK, thanks a lot. I'll keep looking into what the problem is.

summer-silence

A question: when you run multi-GPU fine-tuning, have you seen the learning rate drop very quickly? In my case it reaches 0 after only 1.5 epochs.

FanWan
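
One possible explanation for the learning-rate question above, offered as an assumption rather than a diagnosis: if the run uses a linear decay schedule (the default in the Transformers Trainer), the learning rate reaches zero at the last training step, so the point where it hits 0 is set by `--max_steps`, not by the number of epochs. The sketch below works through that arithmetic. The batch size, accumulation steps, and `max_steps` come from the command in the original post; the dataset size and base learning rate are hypothetical placeholders, and all three function names are made up for illustration.

```python
def effective_batch_size(per_device_batch, num_gpus, grad_accum_steps):
    """Samples consumed per optimizer step across all GPUs."""
    return per_device_batch * num_gpus * grad_accum_steps


def epochs_covered(max_steps, batch_size, dataset_size):
    """How many passes over the data max_steps optimizer steps amount to."""
    return max_steps * batch_size / dataset_size


def linear_lr(step, base_lr, total_steps, warmup_steps=0):
    """Linear warmup followed by linear decay to zero at total_steps."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


# From the command in the original post.
PER_DEVICE_BATCH = 8
GRAD_ACCUM_STEPS = 16
MAX_STEPS = 3000

# Hypothetical placeholders: substitute your own dataset size and $LR.
DATASET_SIZE = 100_000
BASE_LR = 2e-2

for num_gpus in (1, 2):
    batch = effective_batch_size(PER_DEVICE_BATCH, num_gpus, GRAD_ACCUM_STEPS)
    epochs = epochs_covered(MAX_STEPS, batch, DATASET_SIZE)
    print(f"{num_gpus} GPU(s): batch={batch}, epochs at max_steps={epochs:.2f}")

# The schedule ends at zero on the final step, whatever the epoch count is.
print(linear_lr(MAX_STEPS, BASE_LR, MAX_STEPS))
```

Under these assumptions, the epoch at which the learning rate reaches 0 changes with the number of GPUs, the accumulation steps, and the dataset size, while the step at which it happens stays fixed at `max_steps`.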

@wutingjun Hello, I've run into this problem too. Have you solved it?

DuBaiSheng

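Hello, I've hit the same problem: multi-GPU training takes nearly 5 times as long as single-GPU training. I'm using 4 GPUs. Have you solved this problem? Thanks.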

yang1111-gif

@yang1111-gif @DuBaiSheng I never solved this problem, but I later stopped using the train.sh script and switched to ds_train.sh instead. Training is indeed much faster under DeepSpeed.

summer-silence
