[THUDM/ChatGLM-6B][BUG/Help] chatglm-6b-int4 model fails to load on GPU under Linux

2024-06-18

Loading the chatglm-6b-int4 model on GPU fails to compile the quantization kernel:

>>> from transformers import AutoTokenizer, AutoModel
>>> model = AutoModel.from_pretrained("./chatglm-6b-int4", trust_remote_code=True).half().cuda()
Explicitly passing a `revision` is encouraged when loading a configuration with custom code to ensure no malicious code has been contributed in a newer revision.
Explicitly passing a `revision` is encouraged when loading a model with custom code to ensure no malicious code has been contributed in a newer revision.
No compiled kernel found.
Compiling kernels : /home/pollymars/.cache/huggingface/modules/transformers_modules/local/quantization_kernels_parallel.c
Compiling gcc -O3 -pthread -fopenmp -std=c99 /home/pollymars/.cache/huggingface/modules/transformers_modules/local/quantization_kernels_parallel.c -shared -o /home/pollymars/.cache/huggingface/modules/transformers_modules/local/quantization_kernels_parallel.so
/usr/bin/ld: /tmp/ccjNdJf6.o: relocation R_X86_64_32 against `.text' can not be used when making a shared object; recompile with -fPIC
/tmp/ccjNdJf6.o: error adding symbols: Bad value
collect2: error: ld returned 1 exit status
Compile failed, using default cpu kernel code.
Compiling gcc -O3 -fPIC -std=c99 /home/pollymars/.cache/huggingface/modules/transformers_modules/local/quantization_kernels.c -shared -o /home/pollymars/.cache/huggingface/modules/transformers_modules/local/quantization_kernels.so
Kernels compiled : /home/pollymars/.cache/huggingface/modules/transformers_modules/local/quantization_kernels.so
Cannot load cpu kernel, don't use quantized model on cpu.
Using quantization cache
Applying quantization to glm layers

Calling the model then returns:

RuntimeError: CUDA Error: no kernel image is available for execution on the device
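This error usually means the CUDA code being launched was not built for the GPU's compute architecture. A diagnostic sketch using only standard PyTorch calls: compare the architectures the installed wheel supports with the local GPU's compute capability:

>>> import torch
>>> torch.cuda.get_arch_list()           # architectures the wheel was built for, e.g. ['sm_60', ..., 'sm_80']
>>> torch.cuda.get_device_capability(0)  # local GPU, e.g. (8, 0) for an A100 (sm_80)
>>> torch.version.cuda                   # CUDA version the wheel was built against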

I tried compiling the kernel manually and specifying it explicitly, but that did not solve it either: the same compilation flow above runs again, followed by the same CUDA Error:

gcc -fPIC -pthread -fopenmp -std=c99 quantization_kernels_parallel.c -shared -o quantization_kernels_parallel.so

>>> model = AutoModel.from_pretrained("./chatglm-6b-int4", trust_remote_code=True).half().cuda()
>>> model = model.quantize(bits=4, kernel_file="/home/pollymars/ChatGLM-6B-main/chatglm-6b-int4/quantization_kernels_parallel.so")
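Before handing a manually compiled kernel to model.quantize(), it is worth checking that the shared object itself can be loaded; a minimal sketch, reusing the path from the call above:

>>> import ctypes
>>> ctypes.CDLL("/home/pollymars/ChatGLM-6B-main/chatglm-6b-int4/quantization_kernels_parallel.so")  # raises OSError if the .so was not built correctly (e.g. the -fPIC problem above)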

Loading the chatglm-6b-int4 model on CPU, with the kernel compiled manually and specified explicitly, runs the model successfully, but inference is slow.
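For reference, the working CPU path looks roughly like this (a sketch, assuming the CPU kernel that was compiled into the cache directory above):

>>> from transformers import AutoModel
>>> model = AutoModel.from_pretrained("./chatglm-6b-int4", trust_remote_code=True).float()  # CPU: float32 instead of .half().cuda()
>>> model = model.quantize(bits=4, kernel_file="/home/pollymars/.cache/huggingface/modules/transformers_modules/local/quantization_kernels.so")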

In short: on Linux, loading the chatglm-6b-int4 model fails to compile the GPU kernel, and manually compiling and specifying the kernel does not resolve it.

Environment
- OS: Ubuntu 16.04 (5.4.0-6ubuntu1~16.04.9)
- Python: 3.8.5
- Transformers: 4.26.1
- PyTorch: 1.13.1
- CUDA Support (`python -c "import torch; print(torch.cuda.is_available())"`) : True
- gcc: 5.4.0 20160609
- openmp: 201307

Answers


Is the code in "./chatglm-6b-int4" up to date? The CUDA Error is unrelated to the CPU kernel; it may be a problem with cpm_kernels. In the latest code, if loading cpm_kernels fails, there is output like:

Failed to load cpm_kernels:

It could also be a CUDA problem. A quick test:

>>>test=torch.Tensor([1,2,3,4])
>>>test=test.cuda()
>>>test

If that fails, what GPU are you using?
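The GPU model can be printed with a standard PyTorch call:

>>> torch.cuda.get_device_name(0)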


"./chatglm-6b-int4"中就是huggingface上面对应的代码和模型文件。 没有看到关于cpm_kernels的报错。 显卡是A100,我测试下。 有没有可能是gcc和openmp的版本问题?谢谢!


I don't think it's a gcc or OpenMP problem.

RuntimeError: CUDA Error: no kernel image is available for execution on the device

This error suggests that CUDA cannot actually be used (even though torch.cuda.is_available() == True, a mismatch between the torch and CUDA versions could still prevent kernels from running correctly).

>>>test=torch.Tensor([1,2,3,4])
>>>test=test.cuda()
>>>test

Does this snippet run successfully? It tests whether CUDA is usable.
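Note that copying a tensor to the GPU can succeed even when kernel launches fail, since a plain copy does not launch a compute kernel; to actually launch one, the test can be extended with an arithmetic op (a sketch reusing the same variable):

>>> test = test * 2  # elementwise multiply launches a CUDA kernel
>>> test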


Yes, it runs successfully. I updated the code in "./chatglm-6b-int4", and the kernel now compiles, but model inference still raises "RuntimeError: CUDA Error: no kernel image is available for execution on the device". The traceback points to line 235 of quantization.py, i.e. the call to kernels.int4WeightExtractionHalf; inside cpm_kernels/library/cuda.py, the error is raised at the line checkCUStatus(cuda.cuModuleLoadData(ctypes.byref(module), data)).

This may be the same problem as issue #119.


Duplicate of #119