In my tests, loading the model with load_model_on_gpus from utils.py in the project repo:
from utils import load_model_on_gpus
model = load_model_on_gpus("THUDM/chatglm-6b", num_gpus=4)
compared with the following snippet from https://huggingface.co/docs/transformers/perf_infer_gpu_one:
from transformers import AutoModel

max_memory_mapping = {0: "30GB", 1: "30GB", 2: "30GB", 3: "30GB"}
model = AutoModel.from_pretrained(
    "THUDM/chatglm-6b",
    trust_remote_code=True,
    device_map='auto',
    max_memory=max_memory_mapping
).half()
the inference speed is almost exactly the same.
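As a small aside, the hard-coded max_memory_mapping above can be built programmatically when the GPU count varies (plain Python, nothing ChatGLM-specific):

```python
# Build a per-GPU memory cap mapping like {0: "30GB", 1: "30GB", ...}
# for passing as max_memory= to from_pretrained with device_map='auto'.
num_gpus = 4
max_memory_mapping = {gpu_id: "30GB" for gpu_id in range(num_gpus)}
print(max_memory_mapping)
```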