[THUDM/ChatGLM-6B][Help] chatglm-6b-int4-qe on Windows with GPU fails: AttributeError: 'NoneType' object has no attribute 'int4WeightExtractionHalf'

2024-07-12

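Windows, GPU with 6 GB of VRAM. Running on the CPU works normally. After switching to CUDA and launching web_demo, an error is raised as soon as I ask a question.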

Model loading code: `model = AutoModel.from_pretrained("model", trust_remote_code=True).half().cuda()`

Error message:

    Traceback (most recent call last):
      File "C:\Python39\lib\site-packages\gradio\routes.py", line 394, in run_predict
        output = await app.get_blocks().process_api(
      File "C:\Python39\lib\site-packages\gradio\blocks.py", line 1075, in process_api
        result = await self.call_function(
      File "C:\Python39\lib\site-packages\gradio\blocks.py", line 898, in call_function
        prediction = await anyio.to_thread.run_sync(
      File "C:\Python39\lib\site-packages\anyio\to_thread.py", line 31, in run_sync
        return await get_asynclib().run_sync_in_worker_thread(
      File "C:\Python39\lib\site-packages\anyio\_backends\_asyncio.py", line 937, in run_sync_in_worker_thread
        return await future
      File "C:\Python39\lib\site-packages\anyio\_backends\_asyncio.py", line 867, in run
        result = context.run(func, *args)
      File "C:\Python39\lib\site-packages\gradio\utils.py", line 549, in async_iteration
        return next(iterator)
      File "D:\chatGLM\ChatGLM-6B\web_demo.py", line 16, in predict
        for response, history in model.stream_chat(tokenizer, input, history, max_length=max_length, top_p=top_p,
      File "C:\Python39\lib\site-packages\torch\utils\_contextlib.py", line 35, in generator_context
        response = gen.send(None)
      File "C:\Users\Administrator/.cache\huggingface\modules\transformers_modules\local\modeling_chatglm.py", line 1163, in stream_chat
        for outputs in self.stream_generate(**input_ids, **gen_kwargs):
      File "C:\Python39\lib\site-packages\torch\utils\_contextlib.py", line 35, in generator_context
        response = gen.send(None)
      File "C:\Users\Administrator/.cache\huggingface\modules\transformers_modules\local\modeling_chatglm.py", line 1240, in stream_generate
        outputs = self(
      File "C:\Python39\lib\site-packages\torch\nn\modules\module.py", line 1501, in _call_impl
        return forward_call(*args, **kwargs)
      File "C:\Users\Administrator/.cache\huggingface\modules\transformers_modules\local\modeling_chatglm.py", line 1042, in forward
        transformer_outputs = self.transformer(
      File "C:\Python39\lib\site-packages\torch\nn\modules\module.py", line 1501, in _call_impl
        return forward_call(*args, **kwargs)
      File "C:\Users\Administrator/.cache\huggingface\modules\transformers_modules\local\modeling_chatglm.py", line 855, in forward
        inputs_embeds = self.word_embeddings(input_ids)
      File "C:\Python39\lib\site-packages\torch\nn\modules\module.py", line 1501, in _call_impl
        return forward_call(*args, **kwargs)
      File "C:\Users\Administrator/.cache\huggingface\modules\transformers_modules\local\quantization.py", line 380, in forward
        original_weight = extract_weight_to_half(weight=self.weight, scale_list=self.weight_scale, source_bit_width=self.weight_bit_width)
      File "C:\Users\Administrator/.cache\huggingface\modules\transformers_modules\local\quantization.py", line 223, in extract_weight_to_half
        func = kernels.int4WeightExtractionHalf
    AttributeError: 'NoneType' object has no attribute 'int4WeightExtractionHalf'

On Windows, with the chatglm-6b-int4-qe model loaded and running on the GPU, the error occurs when I ask a question.
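The traceback ends with `kernels` being `None` at the point where `quantization.py` looks up `int4WeightExtractionHalf`. That is the typical shape of an optional dependency whose import failed and was replaced by `None`. The sketch below reproduces that failure mode in isolation. It assumes, without checking the model's own `quantization.py`, that the kernel package is imported this way; `load_optional_module` is a hypothetical helper and the module name it loads is a made-up stand-in for `cpm_kernels`.

```python
import importlib
import logging

logging.basicConfig(level=logging.WARNING)
logger = logging.getLogger("kernel_demo")


def load_optional_module(name):
    """Import a module, or log a warning and return None if the import fails."""
    try:
        return importlib.import_module(name)
    except Exception as exc:  # the failure is swallowed, so startup continues
        logger.warning("Failed to load %s: %s", name, exc)
        return None


# Hypothetical package name that is not installed, standing in for cpm_kernels.
kernels = load_optional_module("cpm_kernels_missing_example")

captured_error = None
try:
    # Same attribute lookup as quantization.py line 223 in the traceback.
    func = kernels.int4WeightExtractionHalf
except AttributeError as err:
    captured_error = str(err)

print(captured_error)
```

Because the import failure only produces a warning at load time, the model still starts, and the real error surfaces on the first request. That is consistent with what the posters in this thread describe.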

Environment
- OS: Windows 10
- Python: 3.9
- Transformers: 4.26.1
- PyTorch: 1.10
- CUDA Support (`python -c "import torch; print(torch.cuda.is_available())"`):
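The CUDA Support field above was left blank. A small script like the following could fill it in, and also report whether `cpm_kernels` is installed, since that is the package one of the answers points to. It is only a sketch: `environment_report` is a hypothetical helper, it reports only what `platform`, `importlib` and `torch` can see, and it does not load the model.

```python
import importlib.util
import platform


def environment_report():
    """Collect the issue-template fields without failing if torch is absent."""
    report = {
        "os": platform.platform(),
        "python": platform.python_version(),
        "torch": None,
        "torch_cuda_build": None,
        "cuda_available": False,
        "cpm_kernels_installed": importlib.util.find_spec("cpm_kernels") is not None,
    }
    if importlib.util.find_spec("torch") is not None:
        import torch

        report["torch"] = torch.__version__
        report["torch_cuda_build"] = torch.version.cuda  # None for CPU-only builds
        report["cuda_available"] = torch.cuda.is_available()
        if report["cuda_available"]:
            report["gpu"] = torch.cuda.get_device_name(0)
    return report


if __name__ == "__main__":
    for key, value in environment_report().items():
        print(f"{key}: {value}")
```

Checking `torch.version.cuda` alongside `torch.cuda.is_available()` helps tell a CPU-only PyTorch build apart from a CUDA build that cannot see the GPU.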

Answers


The int4-qe version is not very stable, and for saving VRAM it is no better than the int4-slim version.


I switched to a WSL environment.


@zx2021 What are the differences in performance and functionality between the int4-qe and int4 versions?


I get the same error. With either cli_demo.py or web_demo.py, it is raised as soon as I ask a question. Can anyone solve this?


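I ran into the same problem and have solved it: a module had not been installed. Check whether running web_demo.py (or another demo script) prints a `No module named ...` error first. Even when such an error occurs, the Chinese prompt asking for your question still appears, so it is easy to overlook the earlier error (I missed it for exactly that reason). Install the missing module with `pip install` and run the demo again.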

I hope your problem is the same as mine, and that this helps everyone.
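As the answer above notes, the missing-module error is easy to miss because the demo still starts and waits for input. One way to catch it is to copy the console output into a string or file and scan it for `No module named` messages. The function below is a hypothetical helper, not part of ChatGLM-6B, and the sample log is made up for illustration.

```python
import re

# ModuleNotFoundError messages have the form: No module named 'package.submodule'
MISSING_MODULE_RE = re.compile(r"No module named '([^']+)'")


def missing_packages(log_text):
    """Return top-level module names from 'No module named' messages, first-seen order, no duplicates."""
    found = []
    for match in MISSING_MODULE_RE.finditer(log_text):
        top_level = match.group(1).split(".")[0]
        if top_level not in found:
            found.append(top_level)
    return found


# Hypothetical startup output: the error scrolls past, then the demo keeps running.
sample_log = """\
No module named 'cpm_kernels'
Loading model...
User:
"""

for name in missing_packages(sample_log):
    # Import name and pip package name usually match, but not always (e.g. PIL vs Pillow).
    print(f"pip install {name}")
```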

The `int4WeightExtractionHalf` error can be fixed by installing the `cpm_kernels` library. After installing it, the problem went away:

`pip install cpm_kernels`
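Since the root cause is a package whose absence goes unnoticed at load time, a guard that checks for it before calling `.cuda()` moves the error to startup, where it is easier to read. This is a hypothetical helper, not part of ChatGLM-6B or transformers. It uses `importlib.util.find_spec`, which only confirms that the package can be located and cannot prove the GPU kernels will actually load on a given machine.

```python
import importlib.util


def ensure_module_installed(module_name="cpm_kernels"):
    """Fail at startup with an actionable message if a required top-level package is missing.

    find_spec only confirms the package can be located; it does not prove the
    GPU kernels will compile or load on this machine.
    """
    if importlib.util.find_spec(module_name) is None:
        raise ImportError(
            f"{module_name!r} is not installed, so the int4 GPU kernels cannot be loaded. "
            f"Try: pip install {module_name}"
        )


# Usage sketch (needs transformers and a CUDA GPU, so left as comments):
# from transformers import AutoModel
# ensure_module_installed("cpm_kernels")
# model = AutoModel.from_pretrained("model", trust_remote_code=True).half().cuda()
```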