The environment is 5x RTX 3090 GPUs with 300+ GB of host RAM, using DeepSpeed ZeRO-3 with both the optimizer states and the model parameters offloaded to CPU memory.
While running full-parameter fine-tuning with DeepSpeed, the checkpoint at step 1000 failed to write to disk; investigation showed the host had run out of memory.
I'd like to ask the author: how much RAM did the server you used for DeepSpeed have?
Expected behavior: training runs to completion.
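For context, a rough estimate of the host-RAM footprint (assuming ChatGLM-6B at ~6.2B parameters): mixed-precision training with Adam needs roughly 16 bytes per parameter (fp16 weights and gradients plus fp32 master weights and the two Adam moments), i.e. on the order of 100 GB, and with ZeRO-3 CPU offload most of that lives in host memory, before counting pinned buffers and the extra allocations made while serializing a checkpoint. A simple way to confirm the failure mode is to log host memory around the save step; a minimal sketch (the log path and interval are arbitrary):

```bash
#!/usr/bin/env bash
# Append a timestamped snapshot of host memory to mem_usage.log every 5 s,
# so a spike at the checkpoint step (step 1000 here) shows up in the log.
while true; do
    echo "=== $(date '+%F %T') ===" >> mem_usage.log
    free -h >> mem_usage.log
    sleep 5
done
```

Running this in a second shell before launching training makes it easy to tell whether available memory collapses only at the save step (checkpoint-time spike) or degrades steadily throughout training.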
deepspeed.json:

```json
{
  "fp16": {
    "enabled": "auto",
    "loss_scale": 0,
    "loss_scale_window": 1000,
    "initial_scale_power": 16,
    "hysteresis": 2,
    "min_loss_scale": 1
  },
  "bf16": {
    "enabled": "auto"
  },
  "optimizer": {
    "type": "AdamW",
    "params": {
      "lr": "auto",
      "betas": "auto",
      "eps": "auto",
      "weight_decay": "auto"
    }
  },
  "zero_optimization": {
    "stage": 3,
    "offload_optimizer": {
      "device": "cpu",
      "pin_memory": true
    },
    "offload_param": {
      "device": "cpu",
      "pin_memory": true
    },
    "overlap_comm": true,
    "contiguous_gradients": true,
    "sub_group_size": 1e9,
    "reduce_bucket_size": "auto",
    "stage3_prefetch_bucket_size": "auto",
    "stage3_param_persistence_threshold": "auto",
    "stage3_max_live_parameters": 1e9,
    "stage3_max_reuse_distance": 1e9,
    "stage3_gather_16bit_weights_on_model_save": true
  },
  "gradient_accumulation_steps": "auto",
  "gradient_clipping": "auto",
  "steps_per_print": 2000,
  "train_batch_size": "auto",
  "train_micro_batch_size_per_gpu": "auto",
  "wall_clock_breakdown": false
}
```
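One setting above is directly relevant to checkpoint-time memory: `"stage3_gather_16bit_weights_on_model_save": true` makes DeepSpeed consolidate the full 16-bit weights onto one rank at every save, which adds a sizeable spike on top of the offloaded optimizer states. A possible workaround (a sketch, not verified on this exact setup) is to set that flag to `false` and reconstruct the consolidated weights offline from the ZeRO partitions, using the `zero_to_fp32.py` helper that DeepSpeed writes into each checkpoint directory:

```bash
# Reconstruct full fp32 weights offline from the ZeRO-3 shards.
# The checkpoint path below is what --output_dir and --save_steps in the
# training script imply; adjust it to your actual save location.
cd ./output/deepspeed/checkpoint-1000
python zero_to_fp32.py . pytorch_model.bin
```

This trades the in-training gather for an offline conversion step, so the save itself only writes the per-rank partitions.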
ds_train_finetune.sh:

```bash
LR=1e-4
MASTER_PORT=$(shuf -n 1 -i 10000-65535)

deepspeed --num_gpus=4 --master_port $MASTER_PORT main.py \
    --deepspeed deepspeed.json \
    --do_train \
    --train_file /home/featurize/work/xxx/git/ChatGLM-6B/data/train.json \
    --test_file /home/featurize/work/xxx/git/ChatGLM-6B/data/train.json \
    --prompt_column context \
    --response_column summary \
    --overwrite_cache \
    --model_name_or_path /home/featurize/work/xxx/git/Med-ChatGLM/model \
    --output_dir ./output/deepspeed \
    --overwrite_output_dir \
    --max_source_length 256 \
    --max_target_length 760 \
    --per_device_train_batch_size 1 \
    --per_device_eval_batch_size 1 \
    --gradient_accumulation_steps 1 \
    --predict_with_generate \
    --max_steps 5000 \
    --logging_steps 10 \
    --save_steps 1000 \
    --learning_rate $LR \
    --fp16
```
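If host RAM is exhausted only during the save, adding swap may let the checkpoint complete at the cost of a slower write (a pragmatic stopgap, assuming root access; the 64 GB size is illustrative):

```bash
# Create and enable a 64 GB swap file as headroom for the save-time spike.
sudo fallocate -l 64G /swapfile
sudo chmod 600 /swapfile
sudo mkswap /swapfile
sudo swapon /swapfile
free -h   # verify the swap is active
```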
Environment
- OS: Ubuntu
- Python: 3.9
- Transformers:
- PyTorch: 2.0.0
- CUDA Support (`python -c "import torch; print(torch.cuda.is_available())"`):