Windows, GPU with 6 GB of VRAM. The model works fine when started on CPU. After switching web_demo to CUDA, asking a question raises an error.

The model is loaded with:

```python
model = AutoModel.from_pretrained("model", trust_remote_code=True).half().cuda()
```
Error message (note: the stripped `*`/`_` characters eaten by markdown have been restored, e.g. `*args, **kwargs` and `anyio\_backends\_asyncio.py`):

```
Traceback (most recent call last):
  File "C:\Python39\lib\site-packages\gradio\routes.py", line 394, in run_predict
    output = await app.get_blocks().process_api(
  File "C:\Python39\lib\site-packages\gradio\blocks.py", line 1075, in process_api
    result = await self.call_function(
  File "C:\Python39\lib\site-packages\gradio\blocks.py", line 898, in call_function
    prediction = await anyio.to_thread.run_sync(
  File "C:\Python39\lib\site-packages\anyio\to_thread.py", line 31, in run_sync
    return await get_asynclib().run_sync_in_worker_thread(
  File "C:\Python39\lib\site-packages\anyio\_backends\_asyncio.py", line 937, in run_sync_in_worker_thread
    return await future
  File "C:\Python39\lib\site-packages\anyio\_backends\_asyncio.py", line 867, in run
    result = context.run(func, *args)
  File "C:\Python39\lib\site-packages\gradio\utils.py", line 549, in async_iteration
    return next(iterator)
  File "D:\chatGLM\ChatGLM-6B\web_demo.py", line 16, in predict
    for response, history in model.stream_chat(tokenizer, input, history, max_length=max_length, top_p=top_p,
  File "C:\Python39\lib\site-packages\torch\utils\_contextlib.py", line 35, in generator_context
    response = gen.send(None)
  File "C:\Users\Administrator/.cache\huggingface\modules\transformers_modules\local\modeling_chatglm.py", line 1163, in stream_chat
    for outputs in self.stream_generate(input_ids, **gen_kwargs):
  File "C:\Python39\lib\site-packages\torch\utils\_contextlib.py", line 35, in generator_context
    response = gen.send(None)
  File "C:\Users\Administrator/.cache\huggingface\modules\transformers_modules\local\modeling_chatglm.py", line 1240, in stream_generate
    outputs = self(
  File "C:\Python39\lib\site-packages\torch\nn\modules\module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "C:\Users\Administrator/.cache\huggingface\modules\transformers_modules\local\modeling_chatglm.py", line 1042, in forward
    transformer_outputs = self.transformer(
  File "C:\Python39\lib\site-packages\torch\nn\modules\module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "C:\Users\Administrator/.cache\huggingface\modules\transformers_modules\local\modeling_chatglm.py", line 855, in forward
    inputs_embeds = self.word_embeddings(input_ids)
  File "C:\Python39\lib\site-packages\torch\nn\modules\module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "C:\Users\Administrator/.cache\huggingface\modules\transformers_modules\local\quantization.py", line 380, in forward
    original_weight = extract_weight_to_half(weight=self.weight, scale_list=self.weight_scale, source_bit_width=self.weight_bit_width)
  File "C:\Users\Administrator/.cache\huggingface\modules\transformers_modules\local\quantization.py", line 223, in extract_weight_to_half
    func = kernels.int4WeightExtractionHalf
AttributeError: 'NoneType' object has no attribute 'int4WeightExtractionHalf'
```
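For context on what the final line means: in `quantization.py` the module-level `kernels` object is `None` when the attribute lookup happens, which suggests the CUDA int4 kernels failed to compile or load at import time (an assumption on my part; the traceback alone does not show why they are `None`). A minimal sketch of the failure shape, with `kernels` as a stand-in for the variable in `quantization.py`:

```python
# Stand-in for the `kernels` object in quantization.py, which is presumably
# left as None when the CUDA quantization kernels fail to load.
kernels = None

try:
    # Mirrors quantization.py line 223: attribute access on a None object.
    func = kernels.int4WeightExtractionHalf
except AttributeError as e:
    # Same message as in the report above.
    print(e)  # 'NoneType' object has no attribute 'int4WeightExtractionHalf'
```

So the question is likely less about gradio/web_demo and more about why the kernel loader left `kernels` unset on this Windows + GPU setup.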
In short: on Windows, loading the chatglm-6b-int4-qe model and starting on GPU fails as soon as a question is asked.
Environment
- OS: Windows 10
- Python: 3.9
- Transformers: 4.26.1
- PyTorch: 1.10
- CUDA Support (`python -c "import torch; print(torch.cuda.is_available())"`) :