In my tests, loading the model with load_model_on_gpus from utils.py in the project repo:
from utils import load_model_on_gpus
model = load_model_on_gpus("THUDM/chatglm-6b", num_gpus=4)
compared with the following snippet from https://huggingface.co/docs/transformers/perf_infer_gpu_one:
from transformers import AutoModel

max_memory_mapping = {0: "30GB", 1: "30GB", 2: "30GB", 3: "30GB"}
model = AutoModel.from_pretrained(
    "THUDM/chatglm-6b",
    trust_remote_code=True,
    device_map='auto',
    max_memory=max_memory_mapping
).half()
the inference speed is almost exactly the same.
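As a small aside, the hard-coded max_memory_mapping above can be built programmatically when the GPU count varies (plain Python, nothing ChatGLM-specific):

```python
# Build a per-GPU memory cap mapping like {0: "30GB", 1: "30GB", ...}
# for passing as max_memory= to from_pretrained with device_map='auto'.
num_gpus = 4
max_memory_mapping = {gpu_id: "30GB" for gpu_id in range(num_gpus)}
print(max_memory_mapping)
```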