[THUDM/ChatGLM-6B][BUG/Help] No response after deploying web_demo: RuntimeError: "LayerNormKernelImpl" not implemented for 'Half'

2024-05-21 950 views
8

With the new checkpoint, the web page opens but no result is returned. The deployment code is as follows:

origin_checkpoint = '/devops/chatglm-6b'
ptuning_checkpoint = '/devops/20230412/checkpoint-50000'
config = AutoConfig.from_pretrained(origin_checkpoint, trust_remote_code=True, pre_seq_len=128)
tokenizer = AutoTokenizer.from_pretrained(origin_checkpoint, trust_remote_code=True)
model = AutoModel.from_pretrained(origin_checkpoint, config=config, trust_remote_code=True)
prefix_state_dict = torch.load(os.path.join(ptuning_checkpoint, "pytorch_model.bin"))
new_prefix_state_dict = {}
for k, v in prefix_state_dict.items():
    if k.startswith("transformer.prefix_encoder."):
        new_prefix_state_dict[k[len("transformer.prefix_encoder."):]] = v
model.transformer.prefix_encoder.load_state_dict(new_prefix_state_dict)

The error in nohup.out is as follows:

This share link expires in 72 hours. For free permanent hosting and GPU upgrades (NEW!), check out Spaces: https://huggingface.co/spaces
Traceback (most recent call last):
  File "/opt/conda/lib/python3.10/site-packages/gradio/routes.py", line 395, in run_predict
    output = await app.get_blocks().process_api(
  File "/opt/conda/lib/python3.10/site-packages/gradio/blocks.py", line 1193, in process_api
    result = await self.call_function(
  File "/opt/conda/lib/python3.10/site-packages/gradio/blocks.py", line 930, in call_function
    prediction = await anyio.to_thread.run_sync(
  File "/opt/conda/lib/python3.10/site-packages/anyio/to_thread.py", line 31, in run_sync
    return await get_asynclib().run_sync_in_worker_thread(
  File "/opt/conda/lib/python3.10/site-packages/anyio/_backends/_asyncio.py", line 937, in run_sync_in_worker_thread
    return await future
  File "/opt/conda/lib/python3.10/site-packages/anyio/_backends/_asyncio.py", line 867, in run
    result = context.run(func, *args)
  File "/opt/conda/lib/python3.10/site-packages/gradio/utils.py", line 491, in async_iteration
    return next(iterator)
  File "/workspace/ChatGLM-6B/web_demo_20230412.py", line 71, in predict
    for response, history in model.stream_chat(tokenizer, input, history, max_length=max_length, top_p=top_p,
  File "/opt/conda/lib/python3.10/site-packages/torch/autograd/grad_mode.py", line 43, in generator_context
    response = gen.send(None)
  File "/root/.cache/huggingface/modules/transformers_modules/chatglm-6b/modeling_chatglm.py", line 1312, in stream_chat
    for outputs in self.stream_generate(**inputs, **gen_kwargs):
  File "/opt/conda/lib/python3.10/site-packages/torch/autograd/grad_mode.py", line 43, in generator_context
    response = gen.send(None)
  File "/root/.cache/huggingface/modules/transformers_modules/chatglm-6b/modeling_chatglm.py", line 1389, in stream_generate
    outputs = self(
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/root/.cache/huggingface/modules/transformers_modules/chatglm-6b/modeling_chatglm.py", line 1191, in forward
    transformer_outputs = self.transformer(
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/root/.cache/huggingface/modules/transformers_modules/chatglm-6b/modeling_chatglm.py", line 997, in forward
    layer_ret = layer(
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/root/.cache/huggingface/modules/transformers_modules/chatglm-6b/modeling_chatglm.py", line 624, in forward
    attention_input = self.input_layernorm(hidden_states)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/normalization.py", line 190, in forward
    return F.layer_norm(
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/functional.py", line 2515, in layer_norm
    return torch.layer_norm(input, normalized_shape, weight, bias, eps, torch.backends.cudnn.enabled)
RuntimeError: "LayerNormKernelImpl" not implemented for 'Half'

Environment
- OS:
- Python:3.10
- Transformers:4.27.1
- PyTorch:1.13.1
- CUDA Support (`python -c "import torch; print(torch.cuda.is_available())"`) :11.6

Answers

0
model = model.quantize(4)
model = model.half().cuda()
model.transformer.prefix_encoder.float()
model = model.eval()

This is needed to run inference on the GPU.
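For reference, a minimal sketch of where these lines would slot into the loading code from the question (everything above the comment is unchanged from the original post; quantize(4) is optional and only reduces GPU memory use):

# ... loading code from the question, ending with:
model.transformer.prefix_encoder.load_state_dict(new_prefix_state_dict)

# The checkpoint weights load in fp16, but LayerNorm has no half-precision CPU
# kernel, which is what raises the RuntimeError. Moving the model to the GPU
# avoids it (alternatively, model.float() keeps everything in fp32 for CPU-only
# inference).
model = model.quantize(4)                  # optional INT4 quantization to save GPU memory
model = model.half().cuda()                # run the main model in fp16 on the GPU
model.transformer.prefix_encoder.float()   # keep the P-tuning prefix encoder in fp32
model = model.eval()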

5
model = model.quantize(4)
model = model.half().cuda()
model.transformer.prefix_encoder.float()
model = model.eval()

This is needed to run inference on the GPU.

It works now, thank you very much!

9
model = model.quantize(4)
model = model.half().cuda()
model.transformer.prefix_encoder.float()
model = model.eval()

This is needed to run inference on the GPU.

Why are the prediction results from web_demo.py so different from running predict directly with main.py in ptuning? The web results are clearly worse than predict's.

9

@Data2Me Have you found out why the web results are worse than the predict results?

1

@Data2Me Have you found out why the web results are worse than the predict results?

The first reply on the web matches predict, but the later ones don't, so now when testing on the web I ask one question at a time and then clear the history.

5

Oh, that's probably because the fine-tuned model has no multi-turn dialogue ability, so the results get worse once the history is passed in.

7

You can modify the web_demo code so that it does not pass in the history.
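For example, a minimal sketch of that change in web_demo.py's predict function, based on the stream_chat call shown in the traceback above (the temperature keyword argument is assumed to match the original script):

# In predict(): pass an empty history to stream_chat so each question is
# answered independently of previous turns; the Gradio chatbot can still
# keep its own display history.
for response, _ in model.stream_chat(tokenizer, input, history=[],
                                     max_length=max_length, top_p=top_p,
                                     temperature=temperature):
    ...  # update the chatbot display with `response` as in the original script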