[THUDM/ChatGLM-6B][BUG/Help] ptuning/train.sh runs out of memory (OOM) with multi-GPU fine-tuning, but works on a single GPU

train.sh is as follows:

PRE_SEQ_LEN=128
LR=2e-2

CUDA_VISIBLE_DEVICES=2,3 python3 main.py \
    --do_train \
    --train_file data/PanguData/train.json \
    --validation_file data/PanguData/dev.json \
    --prompt_column content \
    --response_column summary \
    --overwrite_cache \
    --model_name_or_path /home/llm_files/chatglm-6b-v1_1 \
    --output_dir output/pangu-chatglm-6b-pt-$PRE_SEQ_LEN-$LR \
    --overwrite_output_dir \
    --max_source_length 128 \
    --max_target_length 128 \
    --per_device_train_batch_size 1 \
    --per_device_eval_batch_size 1 \
    --gradient_accumulation_steps 16 \
    --predict_with_generate \
    --max_steps 200 \
    --logging_steps 10 \
    --save_steps 10 \
    --learning_rate $LR \
    --pre_seq_len $PRE_SEQ_LEN \

With CUDA_VISIBLE_DEVICES=2,3 the run goes OOM; with CUDA_VISIBLE_DEVICES=2 it runs fine. Any help would be greatly appreciated!

The error is as follows:

RuntimeError: HIP out of memory. Tried to allocate 128.00 MiB (GPU 0; 31.98 GiB total capacity; 31.60 GiB already allocated; 0 bytes free; 31.64 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_HIP_ALLOC_CONF
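The numbers in this message are worth reading closely. The following is a minimal Python sketch (the helper name `parse_oom` is made up for illustration) that extracts the sizes from the error text quoted above and compares reserved against allocated memory. The message itself suggests max_split_size_mb only when reserved memory is much larger than allocated memory, so a small gap hints that the device is genuinely full rather than fragmented.

```python
import re

# The error text exactly as reported in this issue.
OOM_MESSAGE = (
    "RuntimeError: HIP out of memory. Tried to allocate 128.00 MiB "
    "(GPU 0; 31.98 GiB total capacity; 31.60 GiB already allocated; "
    "0 bytes free; 31.64 GiB reserved in total by PyTorch)"
)


def parse_oom(message):
    """Extract the sizes (in GiB) reported by a PyTorch out-of-memory error."""
    def grab(pattern):
        match = re.search(pattern, message)
        return float(match.group(1)) if match else None

    return {
        "total": grab(r"([\d.]+) GiB total capacity"),
        "allocated": grab(r"([\d.]+) GiB already allocated"),
        "reserved": grab(r"([\d.]+) GiB reserved"),
    }


stats = parse_oom(OOM_MESSAGE)
# Memory held by the caching allocator but not used by live tensors.
slack_gib = stats["reserved"] - stats["allocated"]
print(stats)
print(f"reserved minus allocated: {slack_gib:.2f} GiB")
```

Here the gap is only about 0.04 GiB out of 31.98 GiB, so tuning the allocator is unlikely to help much on its own; the first visible GPU really is holding about 31.6 GiB of tensors.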

Environment

- OS:
- Python:3.8
- Transformers:
- PyTorch:
- CUDA Support (`python -c "import torch; print(torch.cuda.is_available())"`) :
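Most of the fields above were left empty, which makes the report harder to triage. As a rough sketch, the script below fills in the same fields automatically; the function name `collect_environment` is hypothetical, and it assumes only that `torch` and `transformers` expose a `__version__` attribute when installed.

```python
import importlib
import platform
import sys


def collect_environment():
    """Gather the fields requested by the issue template."""
    info = {"OS": platform.platform(), "Python": sys.version.split()[0]}
    for label, module_name in (("Transformers", "transformers"), ("PyTorch", "torch")):
        try:
            module = importlib.import_module(module_name)
            info[label] = getattr(module, "__version__", "unknown")
        except Exception:  # missing or broken installation
            info[label] = "not available"
    try:
        import torch

        info["CUDA Support"] = torch.cuda.is_available()
        info["Visible GPUs"] = torch.cuda.device_count()
    except Exception:
        info["CUDA Support"] = "unknown (PyTorch not available)"
    return info


if __name__ == "__main__":
    for key, value in collect_environment().items():
        print(f"- {key}: {value}")
```

Running it under the same CUDA_VISIBLE_DEVICES setting as train.sh also shows how many GPUs the training process would actually see.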

HelixPark

Reduce the batch size.

cywjava

Reduce the batch size.

In this script I have already set every batch_size to 1.

HelixPark

Which other parameters need to change? My batch_size values are all 1 and I have also reduced the lengths, but it still runs out of memory.

root@332be6867dd0:/install/ChatGLM-6B/ptuning# cat train.sh
PRE_SEQ_LEN=128
LR=2e-2

CUDA_VISIBLE_DEVICES=0,1 python3 main.py \
    --do_train \
    --train_file /install/train.json \
    --validation_file /install/dev.json \
    --prompt_column prompt \
    --response_column response \
    --overwrite_cache \
    --model_name_or_path /install/models/chatglm-6b \
    --output_dir output/adgen-chatglm-6b-pt-$PRE_SEQ_LEN-$LR \
    --overwrite_output_dir \
    --max_source_length 16 \
    --max_target_length 128 \
    --per_device_train_batch_size 1 \
    --per_device_eval_batch_size 1 \
    --gradient_accumulation_steps 16 \
    --predict_with_generate \
    --max_steps 3000 \
    --logging_steps 10 \
    --save_steps 1000 \
    --learning_rate $LR \
    --pre_seq_len $PRE_SEQ_LEN \
    --quantization_bit 4

panjing1111

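@panjing1111 Is it valid to enable multi-GPU training this way, with CUDA_VISIBLE_DEVICES=0,1? I watched my resource usage, and with multiple GPUs the run actually takes longer than on a single GPU. Why is that? Is something wrong with this setup? Have you solved your problem?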

summer-silence

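@panjing1111 Is it valid to enable multi-GPU training this way, with CUDA_VISIBLE_DEVICES=0,1? I watched my resource usage, and with multiple GPUs the run actually takes longer than on a single GPU. Why is that? Is something wrong with this setup? Have you solved your problem?

@wutingjun Enabling multi-GPU with CUDA_VISIBLE_DEVICES=0,1 does seem to be problematic: when I set it up this way, the run takes much longer than on a single GPU. Have you solved this?

I ran into the same problem. Single-GPU training takes only an hour and a half, but running the simplest demo on 8 GPUs took 2 days, and the fine-tuning result was extremely poor, even worse than on a single GPU.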

VictoryBlue

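@VictoryBlue Have you found the cause? I don't really understand why this happens. Also, resource consumption with multiple GPUs is noticeably higher than with a single GPU.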

summer-silence

@summer-silence Hello, I haven't worked on this for the past two days. How did you set up multi-GPU? I did it with CUDA_VISIBLE_DEVICES=0,1,2,3,5,6,7

VictoryBlue

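@panjing1111 Is it valid to enable multi-GPU training this way, with CUDA_VISIBLE_DEVICES=0,1? I watched my resource usage, and with multiple GPUs the run actually takes longer than on a single GPU. Why is that? Is something wrong with this setup? Have you solved your problem?

@wutingjun Enabling multi-GPU with CUDA_VISIBLE_DEVICES=0,1 does seem to be problematic: when I set it up this way, the run takes much longer than on a single GPU. Have you solved this?

I ran into the same problem. Single-GPU training takes only an hour and a half, but running the simplest demo on 8 GPUs took 2 days, and the fine-tuning result was extremely poor, even worse than on a single GPU.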

Which card are you using for the single-GPU run? On my 3080 Ti 12G, train.sh runs out of memory.
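For a sense of scale on a 12 GB card, here is a back-of-the-envelope sketch of the memory taken by the model weights alone at different precisions. It assumes roughly 6.2 billion parameters for ChatGLM-6B and ignores activations, the prefix encoder, optimizer state, and any layers that stay unquantized, so real usage will be higher than these figures.

```python
GIB = 1024 ** 3

# Assumption: approximate parameter count of ChatGLM-6B.
ASSUMED_PARAMS = 6.2e9


def weight_memory_gib(num_params, bits_per_param):
    """Memory needed to store the weights only, in GiB."""
    return num_params * bits_per_param / 8 / GIB


for label, bits in (("fp16", 16), ("int8", 8), ("int4", 4)):
    print(f"{label}: {weight_memory_gib(ASSUMED_PARAMS, bits):.1f} GiB")
```

Under this assumption the fp16 weights alone come to about 11.5 GiB, which leaves almost no headroom on a 12 GB card; that is one plausible reason a run without --quantization_bit would fail there. Whether it explains this particular report depends on the exact train.sh used.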

awake1t

@VictoryBlue Two ways:

CUDA_VISIBLE_DEVICES=0,1,2,3 torchrun --nnodes 1 --nproc_per_node=4 main.py \
CUDA_VISIBLE_DEVICES=1,2,3 python3 main.py \

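These are the ways I set up multi-GPU. However, I found that with torchrun the multi-GPU run takes the same time as a single-GPU run. Why is there no speedup? Is it the communication overhead between the GPUs? And when launched with plain python, the multi-GPU run takes far longer than a single-GPU run. I don't understand the internal mechanisms behind the python and torchrun launch methods. What is the difference between them?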
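As a hedged illustration of the difference: torchrun starts one process per GPU and sets environment variables such as LOCAL_RANK and WORLD_SIZE for each of them, which lets a Hugging Face Trainer-based script use DistributedDataParallel. A single `python3 main.py` process that sees several GPUs has no such variables, and the Trainer then typically falls back to torch.nn.DataParallel, which re-replicates the model every step and gathers outputs on the first GPU. The sketch below only classifies the launch mode from those signals; `describe_launch` is a made-up helper, not part of main.py.

```python
import os


def describe_launch(environ, visible_gpu_count):
    """Guess how a Trainer-based script will use the GPUs.

    environ: mapping of environment variables (for example os.environ).
    visible_gpu_count: number of GPUs the process can see.
    Returns (mode, number_of_processes).
    """
    world_size = int(environ.get("WORLD_SIZE", "1"))
    if "LOCAL_RANK" in environ and world_size > 1:
        # Launched by torchrun: one process per GPU.
        return ("DistributedDataParallel", world_size)
    if visible_gpu_count > 1:
        # One process driving several GPUs.
        return ("DataParallel", 1)
    return ("single GPU", 1)


# torchrun --nproc_per_node=4 (values as one worker would see them)
print(describe_launch({"LOCAL_RANK": "0", "WORLD_SIZE": "4"}, 4))
# CUDA_VISIBLE_DEVICES=1,2,3 python3 main.py
print(describe_launch({}, 3))
# The current process, assuming a single visible GPU
print(describe_launch(os.environ, 1))
```

If the plain python launch does end up in DataParallel mode, the extra load on the first GPU would be consistent with both the slowdown and the OOM on GPU 0 reported above, though this thread does not confirm it.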

summer-silence

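Because max_steps is the same. Each step processes one batch per GPU, so with the same max_steps a multi-GPU run handles more data in total and will certainly not finish sooner. To speed things up, reduce max_steps, or set the number of epochs instead. You can run python3 main.py -h to see the relevant parameters.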
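The arithmetic behind this can be sketched as follows, using the values from the train.sh posted earlier in the thread (max_steps 3000, per-device batch size 1, gradient accumulation 16) and the 4-GPU torchrun command. The function names are hypothetical, and the sketch assumes each optimizer step consumes per-device batch size × accumulation steps × number of GPUs samples.

```python
def samples_seen(max_steps, per_device_batch_size, grad_accum_steps, num_gpus):
    """Total training samples consumed after max_steps optimizer steps."""
    return max_steps * per_device_batch_size * grad_accum_steps * num_gpus


def equivalent_max_steps(single_gpu_steps, num_gpus):
    """max_steps that covers the same number of samples on num_gpus GPUs."""
    return -(-single_gpu_steps // num_gpus)  # ceiling division


single = samples_seen(3000, 1, 16, num_gpus=1)
four = samples_seen(3000, 1, 16, num_gpus=4)
print(single, four)
print(equivalent_max_steps(3000, 4))
```

With these values a single GPU consumes 48000 samples and four GPUs consume 192000 for the same max_steps, so a like-for-like comparison would use about 750 steps on four GPUs.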

jiarenyf
