The very first conversation turn in cli_demo.py runs out of GPU memory. Details below:
Traceback (most recent call last):
File "C:\gits\ChatGLM-6B\cli_demo.py", line 44, in <module>
main()
File "C:\gits\ChatGLM-6B\cli_demo.py", line 34, in main
for response, history in model.stream_chat(tokenizer, query, history=history):
File "C:\Users\ZeroDegress\miniconda3\lib\site-packages\torch\utils\_contextlib.py", line 35, in generator_context
response = gen.send(None)
File "C:\Users\ZeroDegress/.cache\huggingface\modules\transformers_modules\THUDM\chatglm-6b-int4-qe\437fc94474689e27adfcb29ef768bfaef9be5c45\modeling_chatglm.py", line 1163, in stream_chat
for outputs in self.stream_generate(**input_ids, **gen_kwargs):
File "C:\Users\ZeroDegress\miniconda3\lib\site-packages\torch\utils\_contextlib.py", line 35, in generator_context
response = gen.send(None)
File "C:\Users\ZeroDegress/.cache\huggingface\modules\transformers_modules\THUDM\chatglm-6b-int4-qe\437fc94474689e27adfcb29ef768bfaef9be5c45\modeling_chatglm.py", line 1240, in stream_generate
outputs = self(
File "C:\Users\ZeroDegress\miniconda3\lib\site-packages\torch\nn\modules\module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "C:\Users\ZeroDegress/.cache\huggingface\modules\transformers_modules\THUDM\chatglm-6b-int4-qe\437fc94474689e27adfcb29ef768bfaef9be5c45\modeling_chatglm.py", line 1056, in forward
lm_logits = self.lm_head(hidden_states).permute(1, 0, 2).contiguous()
File "C:\Users\ZeroDegress\miniconda3\lib\site-packages\torch\nn\modules\module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "C:\Users\ZeroDegress/.cache\huggingface\modules\transformers_modules\THUDM\chatglm-6b-int4-qe\437fc94474689e27adfcb29ef768bfaef9be5c45\quantization.py", line 336, in forward
output = W8A16Linear.apply(input, self.weight, self.weight_scale, self.weight_bit_width)
File "C:\Users\ZeroDegress\miniconda3\lib\site-packages\torch\autograd\function.py", line 506, in apply
return super().apply(*args, **kwargs) # type: ignore[misc]
File "C:\Users\ZeroDegress/.cache\huggingface\modules\transformers_modules\THUDM\chatglm-6b-int4-qe\437fc94474689e27adfcb29ef768bfaef9be5c45\quantization.py", line 51, in forward
weight = extract_weight_to_half(quant_w, scale_w, weight_bit_width)
File "C:\Users\ZeroDegress/.cache\huggingface\modules\transformers_modules\THUDM\chatglm-6b-int4-qe\437fc94474689e27adfcb29ef768bfaef9be5c45\quantization.py", line 229, in extract_weight_to_half
out = torch.empty(n, m * (8 // source_bit_width), dtype=torch.half, device="cuda")
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 1.15 GiB (GPU 0; 6.00 GiB total capacity; 3.30 GiB already allocated; 0 bytes free; 4.50 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
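The error message itself suggests tuning max_split_size_mb to work around allocator fragmentation. I haven't confirmed this helps in my case, but as I understand it the setting would look roughly like the sketch below, placed before torch makes its first CUDA allocation (the value 128 is an untested guess, not a tuned number):

```python
# Untested workaround sketch: configure the CUDA caching allocator before
# the first CUDA allocation, e.g. at the very top of cli_demo.py.
import os
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"  # 128 is a guess, not a tuned value

import torch  # import after the env var is set
```

Equivalently, on Windows one can run `set PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:128` in the shell before launching the demo.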
According to README.md, loading the model with
AutoModel.from_pretrained("THUDM/chatglm-6b-int4-qe", trust_remote_code=True).half().cuda()
should run normally and occupy a little over 4 GB of VRAM.
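For completeness, the loading sequence I'm using follows the README pattern; roughly this (reproduced from memory, so details may differ slightly from the actual README):

```python
from transformers import AutoModel, AutoTokenizer

# int4-quantized checkpoint per the README; .half().cuda() as quoted above
tokenizer = AutoTokenizer.from_pretrained("THUDM/chatglm-6b-int4-qe", trust_remote_code=True)
model = AutoModel.from_pretrained("THUDM/chatglm-6b-int4-qe", trust_remote_code=True).half().cuda()
model = model.eval()
```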
Steps to reproduce:
1. conda activate env_name (activate the environment)
2. python cli_demo.py (run the CLI demo)
3. Wait for the input prompt to appear, then enter "你好"
Environment
- OS: Windows 11
- Python: 3.10.9
- Transformers: 4.26.1
- PyTorch: 2.0.0
- CUDA Support: true
Additional machine info:
- CPU: i7-12700H
- RAM: 16 GB
- GPU: NVIDIA GeForce RTX 3060 Laptop GPU
- VRAM: 6 GB

6 GB of VRAM is admittedly tight, but I'd still like to know whether this can be resolved.
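If it helps with diagnosis, here is a small snippet I could run right after loading the model to see how much of the 6 GB is actually left (these are standard PyTorch memory queries; I don't have the numbers yet):

```python
import torch

# Compare the driver's view of free memory with PyTorch's own bookkeeping;
# a large reserved-vs-allocated gap would point at fragmentation.
free, total = torch.cuda.mem_get_info()  # bytes free / total on the current device
print(f"free: {free / 2**30:.2f} GiB of {total / 2**30:.2f} GiB")
print(f"allocated by PyTorch: {torch.cuda.memory_allocated() / 2**30:.2f} GiB")
print(f"reserved by PyTorch:  {torch.cuda.memory_reserved() / 2**30:.2f} GiB")
```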