I'm running inference on a T4, which supports both int8 and int4.
With an input length of 1000, int4 takes 32 s while fp16 takes only 12 s.
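For reference, a minimal sketch of how the two configurations can be timed, assuming the `quantize()` API from the ChatGLM-6B README; the repeated prompt is only a stand-in for the ~1000-character input, and the models are loaded one at a time so each fits in the T4's 16 GB:

```python
import time
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("THUDM/chatglm-6b", trust_remote_code=True)

def time_chat(model, prompt):
    # Synchronize so the GPU is idle before and after the timed call
    torch.cuda.synchronize()
    start = time.perf_counter()
    response, _ = model.chat(tokenizer, prompt, history=[])
    torch.cuda.synchronize()
    return time.perf_counter() - start

prompt = "你好" * 500  # stand-in for a ~1000-character input

# fp16 baseline
model = AutoModel.from_pretrained("THUDM/chatglm-6b", trust_remote_code=True).half().cuda().eval()
print("fp16:", time_chat(model, prompt))
del model
torch.cuda.empty_cache()  # free the fp16 weights before loading the quantized model

# int4 variant, loaded with quantize(4) per the ChatGLM-6B README
model = AutoModel.from_pretrained("THUDM/chatglm-6b", trust_remote_code=True).quantize(4).half().cuda().eval()
print("int4:", time_chat(model, prompt))
```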
```python
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("THUDM/chatglm-6b", trust_remote_code=True)
# Load with int8 quantization
model = AutoModel.from_pretrained("THUDM/chatglm-6b", trust_remote_code=True).quantize(8).half().cuda()
model = model.eval()

response, history = model.chat(tokenizer, "你好", history=[])  # "Hello"
print(response)
response, history = model.chat(tokenizer, "晚上睡不着应该怎么办", history=history)  # "What should I do if I can't sleep at night?"
print(response)
```

Environment
OS: Ubuntu 20.04
Python: 3.8
Transformers: 4.26.1
PyTorch: 1.12
CUDA Support: True
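To confirm the runtime matches the environment above, a quick sanity check with standard PyTorch calls (the T4 should report compute capability (7, 5)):

```python
import torch

print(torch.__version__)                    # expect 1.12
print(torch.cuda.is_available())            # expect True
print(torch.cuda.get_device_name(0))        # expect "Tesla T4"
print(torch.cuda.get_device_capability(0))  # T4 is sm_75, i.e. (7, 5)
```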