[THUDM/ChatGLM-6B] How do I run P-tuning v2 training on multiple GPUs? Could you provide a detailed tutorial? [BUG/Help]

I set up the training script as shown below. The main change is CUDA_VISIBLE_DEVICES=0,1,2,3.

PRE_SEQ_LEN=128
LR=2e-2

CUDA_VISIBLE_DEVICES=0,1,2,3 python3 main.py \
    --do_train \
    --train_file junshi/full_train.json \
    --validation_file junshi/dev.json \
    --prompt_column content \
    --response_column summary \
    --overwrite_cache \
    --model_name_or_path ../../chatglm_finetuning/data/chatglm-6b \
    --output_dir output/testtest-$PRE_SEQ_LEN-$LR \
    --overwrite_output_dir \
    --max_source_length 128 \
    --max_target_length 256 \
    --per_device_train_batch_size 1 \
    --per_device_eval_batch_size 1 \
    --gradient_accumulation_steps 1 \
    --predict_with_generate \
    --num_train_epochs 1 \
    --logging_steps 10 \
    --save_steps 1000 \
    --learning_rate $LR \
    --pre_seq_len $PRE_SEQ_LEN

This runs out of GPU memory (CUDA out of memory), even though single-GPU training does not. Each card is an RTX 3090 with 24 GB.

Then I saw someone in the issues using the following and saying that it works for multi-GPU training:

PRE_SEQ_LEN=128
LR=2e-2
MASTER_PORT=$(shuf -n 1 -i 10000-65535)

deepspeed --include localhost:4,5,6,7
--master_port $MASTER_PORT main.py
--deepspeed deepspeed.json
--do_train \
    --train_file junshi/full_train.json \
    --validation_file junshi/dev.json \
    --prompt_column content \
    --response_column summary \
    --overwrite_cache \
    --model_name_or_path ../../chatglm_finetuning/data/chatglm-6b \
    --output_dir output/testtest-$PRE_SEQ_LEN-$LR \
    --overwrite_output_dir \
    --max_source_length 128 \
    --max_target_length 256 \
    --per_device_train_batch_size 1 \
    --per_device_eval_batch_size 1 \
    --gradient_accumulation_steps 1 \
    --predict_with_generate \
    --num_train_epochs 1 \
    --logging_steps 10 \
    --save_steps 1000 \
    --learning_rate $LR \
    --pre_seq_len $PRE_SEQ_LEN

But when I run it, I get an error:

usage: deepspeed [-h] [-H HOSTFILE] [-i INCLUDE] [-e EXCLUDE] [--num_nodes NUM_NODES]
                 [--min_elastic_nodes MIN_ELASTIC_NODES] [--max_elastic_nodes MAX_ELASTIC_NODES]
                 [--num_gpus NUM_GPUS] [--master_port MASTER_PORT] [--master_addr MASTER_ADDR]
                 [--launcher LAUNCHER] [--launcher_args LAUNCHER_ARGS] [--module] [--no_python]
                 [--no_local_rank] [--no_ssh_check] [--force_multi] [--save_pid]
                 [--enable_each_rank_log ENABLE_EACH_RANK_LOG] [--autotuning {tune,run}]
                 [--elastic_training] [--bind_cores_to_rank] [--bind_core_list BIND_CORE_LIST]
                 user_script ...
deepspeed: error: the following arguments are required: user_script, user_args
train3.sh: line 6: --master_port: command not found
train3.sh: line 7: --deepspeed: command not found
train3.sh: line 9: --do_train: command not found

Could you provide a detailed multi-GPU training tutorial for P-tuning v2?

Environment

- OS: Ubuntu 18.04
- Python: 3.10.9
- Transformers: 4.29.2
- PyTorch: 1.12.1
- CUDA Support (`python -c "import torch; print(torch.cuda.is_available())"`): True (CUDA 11.3)

MathamPollard

PRE_SEQ_LEN=128
LR=2e-2
MASTER_PORT=$(shuf -n 1 -i 10000-65535)

deepspeed --include localhost:4,5,6,7 \
    --master_port $MASTER_PORT main.py \
    --deepspeed deepspeed.json \
    --do_train \
    --train_file junshi/full_train.json \
    --validation_file junshi/dev.json \
    --prompt_column content \
    --response_column summary \
    --overwrite_cache \
    --model_name_or_path ../../chatglm_finetuning/data/chatglm-6b \
    --output_dir output/testtest-$PRE_SEQ_LEN-$LR \
    --overwrite_output_dir \
    --max_source_length 128 \
    --max_target_length 256 \
    --per_device_train_batch_size 1 \
    --per_device_eval_batch_size 1 \
    --gradient_accumulation_steps 1 \
    --predict_with_generate \
    --num_train_epochs 1 \
    --logging_steps 10 \
    --save_steps 1000 \
    --learning_rate $LR \
    --pre_seq_len $PRE_SEQ_LEN

You forgot the line continuation " \". Every line of the command except the last has to end with a space followed by a backslash so that the shell continues onto the next line.

zxddvp
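
The "command not found" messages in the error output are what a shell prints when an option line is not joined to the command above it. As a rough illustration, the sketch below scans a script for lines that start with `--` and whose preceding line does not end in a backslash. The helper name `find_missing_continuations` and the shortened sample script are made up for this example. It is only a quick lint under the assumption that option lines start with `--`, not a real shell parser.

```python
def find_missing_continuations(script_text):
    """Return 1-based numbers of option lines the shell would run as separate commands."""
    problems = []
    lines = script_text.splitlines()
    for index, line in enumerate(lines):
        # Only lines that begin with an option such as "--do_train" are checked.
        if not line.strip().startswith("--"):
            continue
        previous = lines[index - 1] if index > 0 else ""
        # The backslash must be the very last character of the previous line;
        # a trailing space after it also breaks the continuation.
        if not previous.endswith("\\"):
            problems.append(index + 1)
    return problems


# Shortened, hypothetical version of the broken script from the question.
broken_script = "\n".join([
    "deepspeed --include localhost:4,5,6,7",
    "--master_port $MASTER_PORT main.py",
    "--deepspeed deepspeed.json",
    "--do_train \\",
    "--train_file junshi/full_train.json",
])

print(find_missing_continuations(broken_script))  # [2, 3, 4]
```

Lines 2 to 4 are reported because the line above each of them lacks the trailing backslash. Line 5 is fine because line 4 ends with one.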

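@zxddvp That approach uses DeepSpeed for multi-GPU training. Can the official train.sh run multi-GPU training directly? I've seen many issues mention the form CUDA_VISIBLE_DEVICES=1,2,3. Does that approach work?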

summer-silence

Method 1:
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 torchrun --nnodes 1 --nproc_per_node=8 main.py .....

Method 2:
accelerate launch main.py \
    ...

qxde01
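
A note on how these launch styles differ from a plain `python3 main.py`: `torchrun` starts one process per GPU and sets environment variables such as `WORLD_SIZE` and `LOCAL_RANK` in each of them, whereas a plain `python3` launch is a single process that sees every GPU listed in `CUDA_VISIBLE_DEVICES`. As far as I understand, the Hugging Face `Trainer` wraps the model in `torch.nn.DataParallel` in the single-process case, which may contribute to the out-of-memory error described in the question, though nobody in this thread confirms that. The sketch below only classifies the launch mode. `describe_launch` is a hypothetical helper, and the GPU count is passed in by hand here (in a real script it would come from `torch.cuda.device_count()`).

```python
def describe_launch(env, visible_gpu_count):
    """Classify how a training script was started, based on its environment."""
    # torchrun exports WORLD_SIZE for every worker process it starts.
    world_size = int(env.get("WORLD_SIZE", "1"))
    if world_size > 1:
        return f"distributed: {world_size} processes, one per GPU"
    if visible_gpu_count > 1:
        return f"single process seeing {visible_gpu_count} GPUs"
    return "single process, single GPU"


# Plain `CUDA_VISIBLE_DEVICES=0,1,2,3 python3 main.py ...`
print(describe_launch({"CUDA_VISIBLE_DEVICES": "0,1,2,3"}, 4))

# One worker started by `torchrun --nproc_per_node=8 main.py ...`
print(describe_launch({"WORLD_SIZE": "8", "LOCAL_RANK": "0"}, 8))
```

Inside `main.py` the same check could be run against `os.environ` to confirm which mode a given launch command actually produced.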

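@qxde01 Hello, Method 1 does run on multiple GPUs. But why does the multi-GPU run consume far more resources than the single-GPU run while taking about the same amount of time? Could you please advise? Resource consumption with multiple GPUs looks like this: (screenshot). Resource consumption with a single GPU looks like this: (screenshot).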

summer-silence

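@summer-silence
1. Elapsed time depends on the batch size, where batch size = nproc_per_node * per_device_train_batch_size * gradient_accumulation_steps. If multi-GPU and single-GPU runs use the same parameter settings, the elapsed time should be about the same, and multi-GPU also adds communication cost.
2. As for the resource issue, I haven't run into it and I don't know.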

qxde01
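
To make the batch-size formula concrete, here is a small sketch that computes the effective batch size and an approximate number of optimizer steps per epoch. The function names and the dataset size of 8000 samples are assumptions for illustration only, and the step count is a simplification of what the trainer actually computes.

```python
import math


def effective_batch_size(nproc_per_node, per_device_train_batch_size,
                         gradient_accumulation_steps):
    """Samples consumed per optimizer step across all processes."""
    return (nproc_per_node
            * per_device_train_batch_size
            * gradient_accumulation_steps)


def optimizer_steps_per_epoch(num_samples, nproc_per_node,
                              per_device_train_batch_size,
                              gradient_accumulation_steps):
    """Approximate number of optimizer steps needed to see every sample once."""
    batch = effective_batch_size(nproc_per_node,
                                 per_device_train_batch_size,
                                 gradient_accumulation_steps)
    return math.ceil(num_samples / batch)


num_samples = 8000  # hypothetical dataset size

# Settings from the thread: batch size 1 per device, no gradient accumulation.
print(effective_batch_size(1, 1, 1), optimizer_steps_per_epoch(num_samples, 1, 1, 1))
print(effective_batch_size(8, 1, 1), optimizer_steps_per_epoch(num_samples, 8, 1, 1))
```

With identical per-device settings, each step costs roughly the same on one GPU or eight, but eight processes cover eight times as many samples per step. If a run is capped by a fixed number of steps (for example via `--max_steps`, which is an assumption here since the commands in this thread use `--num_train_epochs 1`), the wall-clock time stays similar while more data is processed.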

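@qxde01 Thanks for the guidance. Yes, multi-GPU does add communication cost, but what I'm curious about is this: if the elapsed time is the same, what is the advantage of multi-GPU training for the ChatGLM model? I have also run another open-source GitHub project, https://github.com/Facico/Chinese-Vicuna, which uses LLaMA as its base model, and there multi-GPU training really is faster than single-GPU. Does this mean the difference between multi-GPU and single-GPU depends strongly on the model? I'm a beginner and not very familiar with training methods, so I'd be very grateful for any further guidance.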

summer-silence

From what I see on my side, if you only change visible_device, multi-GPU is actually worse than single-GPU.

Pig255
