[THUDM/ChatGLM-6B][BUG/Help] Out of GPU memory running the int4-qe quantized model on an RTX 3060 Laptop GPU

2024-07-12 652 views
2

Running cli_demo.py, I run out of GPU memory on the very first conversation turn. Details below:

Traceback (most recent call last):
  File "C:\gits\ChatGLM-6B\cli_demo.py", line 44, in <module>
    main()
  File "C:\gits\ChatGLM-6B\cli_demo.py", line 34, in main
    for response, history in model.stream_chat(tokenizer, query, history=history):
  File "C:\Users\ZeroDegress\miniconda3\lib\site-packages\torch\utils\_contextlib.py", line 35, in generator_context
    response = gen.send(None)
  File "C:\Users\ZeroDegress/.cache\huggingface\modules\transformers_modules\THUDM\chatglm-6b-int4-qe\437fc94474689e27adfcb29ef768bfaef9be5c45\modeling_chatglm.py", line 1163, in stream_chat
    for outputs in self.stream_generate(**input_ids, **gen_kwargs):
  File "C:\Users\ZeroDegress\miniconda3\lib\site-packages\torch\utils\_contextlib.py", line 35, in generator_context
    response = gen.send(None)
  File "C:\Users\ZeroDegress/.cache\huggingface\modules\transformers_modules\THUDM\chatglm-6b-int4-qe\437fc94474689e27adfcb29ef768bfaef9be5c45\modeling_chatglm.py", line 1240, in stream_generate
    outputs = self(
  File "C:\Users\ZeroDegress\miniconda3\lib\site-packages\torch\nn\modules\module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "C:\Users\ZeroDegress/.cache\huggingface\modules\transformers_modules\THUDM\chatglm-6b-int4-qe\437fc94474689e27adfcb29ef768bfaef9be5c45\modeling_chatglm.py", line 1056, in forward
    lm_logits = self.lm_head(hidden_states).permute(1, 0, 2).contiguous()
  File "C:\Users\ZeroDegress\miniconda3\lib\site-packages\torch\nn\modules\module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "C:\Users\ZeroDegress/.cache\huggingface\modules\transformers_modules\THUDM\chatglm-6b-int4-qe\437fc94474689e27adfcb29ef768bfaef9be5c45\quantization.py", line 336, in forward
    output = W8A16Linear.apply(input, self.weight, self.weight_scale, self.weight_bit_width)
  File "C:\Users\ZeroDegress\miniconda3\lib\site-packages\torch\autograd\function.py", line 506, in apply
    return super().apply(*args, **kwargs)  # type: ignore[misc]
  File "C:\Users\ZeroDegress/.cache\huggingface\modules\transformers_modules\THUDM\chatglm-6b-int4-qe\437fc94474689e27adfcb29ef768bfaef9be5c45\quantization.py", line 51, in forward
    weight = extract_weight_to_half(quant_w, scale_w, weight_bit_width)
  File "C:\Users\ZeroDegress/.cache\huggingface\modules\transformers_modules\THUDM\chatglm-6b-int4-qe\437fc94474689e27adfcb29ef768bfaef9be5c45\quantization.py", line 229, in extract_weight_to_half
    out = torch.empty(n, m * (8 // source_bit_width), dtype=torch.half, device="cuda")
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 1.15 GiB (GPU 0; 6.00 GiB total capacity; 3.30 GiB already allocated; 0 bytes free; 4.50 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
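
For context on the 1.15 GiB figure: the traceback shows the int4 lm_head weight being dequantized to fp16 (extract_weight_to_half) during the forward pass. A back-of-the-envelope check, assuming the original ChatGLM-6B shapes of vocab_size=150528 and hidden_size=4096 (my assumption, not stated in this thread), reproduces it:

# Assumed shapes -- not taken from the model config in this thread.
vocab_size, hidden_size, bytes_per_half = 150528, 4096, 2
# Size of the fp16 lm_head weight that extract_weight_to_half allocates:
print(f"{vocab_size * hidden_size * bytes_per_half / 1024**3:.2f} GiB")  # -> 1.15 GiB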

Using

AutoModel.from_pretrained("THUDM/chatglm-6b-int4-qe", trust_remote_code=True).half().cuda()

According to README.md this should run normally and use a bit over 4 GB of VRAM.
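
As a quick diagnostic (a sketch of my own using standard transformers/torch calls, not part of the original report), you can print VRAM usage right after loading to see how much of the 6 GiB is already gone before the first chat turn:

import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("THUDM/chatglm-6b-int4-qe", trust_remote_code=True)
model = AutoModel.from_pretrained("THUDM/chatglm-6b-int4-qe", trust_remote_code=True).half().cuda()
model = model.eval()

# VRAM actually held by tensors vs. reserved by the caching allocator
print(f"allocated: {torch.cuda.memory_allocated() / 1024**3:.2f} GiB")
print(f"reserved:  {torch.cuda.memory_reserved() / 1024**3:.2f} GiB")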

Steps to reproduce:
1. conda activate env_name // activate the environment
2. python cli_demo.py // run the CLI demo
3. After the input prompt appears, enter "你好" ("Hello")

Environment
- OS: Windows 11
- Python: 3.10.9
- Transformers: 4.26.1
- PyTorch: 2.0.0
- CUDA Support: true

Additional machine info:
- CPU: i7-12700H
- RAM: 16 GB
- GPU: NVIDIA GeForce RTX 3060 Laptop GPU
- VRAM: 6 GB

6 GB of VRAM is marginal, I know, but I'd still like to know whether this can be solved.

Answers

5

Same here with a 3060 laptop GPU. Never mind int4-qe; even plain int4 runs fine for me. Turn off the dGPU direct-display (MUX) mode to reduce VRAM usage, and don't play games while using ChatGLM.

8

At the time only GLM was running; nothing else was open. In fact, all the other GPU-using software on my machine combined (SD excluded) uses less VRAM than GLM does.

5

I've recently been writing a WSL deployment guide for Windows; take this for reference only.

For torch.cuda.OutOfMemoryError: CUDA out of memory, add a system environment variable in Windows:
Variable name: PYTORCH_CUDA_ALLOC_CONF
Variable value: max_split_size_mb:32
The guide was written on a 3090 with 24 GB of VRAM; adjust 32 up or down for other configurations (if the variable is unset, the default is 128). If the setting does not take effect, rebooting fixes it.
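
If you'd rather not edit system settings, the same allocator option can also be set from Python (a minimal sketch; 32 is the value suggested above and may need tuning for your card):

import os
# Must be set before torch initializes its CUDA allocator --
# safest is before importing torch at all.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:32"

import torch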

4

Same here, qe won't run for me either. Try int4-slim; I also have 6 GB of VRAM and it runs fine.

1

int4-qe really does have a problem. My 1660 Ti runs int4 stably for many turns with only a slight rise in VRAM, but with int4-qe VRAM shoots up to 5.8 GB on the first message and runs out on the second.
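
One workaround worth trying for the turn-over-turn growth (my suggestion, not something confirmed in this thread): cap the history fed back into stream_chat so the prompt cannot grow without bound. MAX_TURNS here is a made-up knob to tune for your VRAM:

MAX_TURNS = 3  # hypothetical cap; tune for your VRAM

history = []
while True:
    query = input("\n用户: ")
    # Keep only the most recent turns so the prompt stays bounded.
    for response, history in model.stream_chat(tokenizer, query, history=history[-MAX_TURNS:]):
        pass
    print(f"ChatGLM-6B: {response}")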