[THUDM/ChatGLM-6B][BUG/Help] chatglm-6b-int4 model fails to load on GPU under Linux

2024-06-18

Loading the chatglm-6b-int4 model on GPU fails to compile the quantization kernel:

>>> from transformers import AutoTokenizer, AutoModel
>>> model = AutoModel.from_pretrained("./chatglm-6b-int4", trust_remote_code=True).half().cuda()
Explicitly passing a `revision` is encouraged when loading a configuration with custom code to ensure no malicious code has been contributed in a newer revision.
Explicitly passing a `revision` is encouraged when loading a model with custom code to ensure no malicious code has been contributed in a newer revision.
No compiled kernel found.
Compiling kernels : /home/pollymars/.cache/huggingface/modules/transformers_modules/local/quantization_kernels_parallel.c
Compiling gcc -O3 -pthread -fopenmp -std=c99 /home/pollymars/.cache/huggingface/modules/transformers_modules/local/quantization_kernels_parallel.c -shared -o /home/pollymars/.cache/huggingface/modules/transformers_modules/local/quantization_kernels_parallel.so
/usr/bin/ld: /tmp/ccjNdJf6.o: relocation R_X86_64_32 against `.text' can not be used when making a shared object; recompile with -fPIC
/tmp/ccjNdJf6.o: error adding symbols: Bad value
collect2: error: ld returned 1 exit status
Compile failed, using default cpu kernel code.
Compiling gcc -O3 -fPIC -std=c99 /home/pollymars/.cache/huggingface/modules/transformers_modules/local/quantization_kernels.c -shared -o /home/pollymars/.cache/huggingface/modules/transformers_modules/local/quantization_kernels.so
Kernels compiled : /home/pollymars/.cache/huggingface/modules/transformers_modules/local/quantization_kernels.so
Cannot load cpu kernel, don't use quantized model on cpu.
Using quantization cache
Applying quantization to glm layers

Calling the model then returns:

RuntimeError: CUDA Error: no kernel image is available for execution on the device
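This error usually means the CUDA code being launched was not built for the GPU's compute architecture. A diagnostic sketch using only standard PyTorch calls: compare the architectures the installed wheel supports with the local GPU's compute capability:

>>> import torch
>>> torch.cuda.get_arch_list()           # architectures the wheel was built for, e.g. ['sm_60', ..., 'sm_80']
>>> torch.cuda.get_device_capability(0)  # local GPU, e.g. (8, 0) for an A100 (sm_80)
>>> torch.version.cuda                   # CUDA version the wheel was built against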

I tried compiling the kernel manually and specifying it explicitly, but that did not solve it either: the same compilation flow above runs again, followed by the same CUDA Error:

gcc -fPIC -pthread -fopenmp -std=c99 quantization_kernels_parallel.c -shared -o quantization_kernels_parallel.so

>>> model = AutoModel.from_pretrained("./chatglm-6b-int4", trust_remote_code=True).half().cuda()
>>> model = model.quantize(bits=4, kernel_file="/home/pollymars/ChatGLM-6B-main/chatglm-6b-int4/quantization_kernels_parallel.so")
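Before handing a manually compiled kernel to model.quantize(), it is worth checking that the shared object itself can be loaded; a minimal sketch, reusing the path from the call above:

>>> import ctypes
>>> ctypes.CDLL("/home/pollymars/ChatGLM-6B-main/chatglm-6b-int4/quantization_kernels_parallel.so")  # raises OSError if the .so was not built correctly (e.g. the -fPIC problem above)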

Loading the chatglm-6b-int4 model on CPU, with the kernel compiled manually and specified explicitly, runs the model successfully, but inference is slow.
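For reference, the working CPU path looks roughly like this (a sketch, assuming the CPU kernel that was compiled into the cache directory above):

>>> from transformers import AutoModel
>>> model = AutoModel.from_pretrained("./chatglm-6b-int4", trust_remote_code=True).float()  # CPU: float32 instead of .half().cuda()
>>> model = model.quantize(bits=4, kernel_file="/home/pollymars/.cache/huggingface/modules/transformers_modules/local/quantization_kernels.so")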

In short: on Linux, loading the chatglm-6b-int4 model fails to compile the GPU kernel, and manually compiling and specifying the kernel does not resolve it.

Environment
- OS: Ubuntu 16.04 (5.4.0-6ubuntu1~16.04.9)
- Python: 3.8.5
- Transformers: 4.26.1
- PyTorch: 1.13.1
- CUDA Support (`python -c "import torch; print(torch.cuda.is_available())"`) : True
- gcc: 5.4.0 20160609
- openmp: 201307

Answers


Is the code in "./chatglm-6b-int4" up to date? The CUDA Error is unrelated to the CPU kernel; it may be a problem with cpm_kernels. In the latest code, if loading cpm_kernels fails, there is output like:

Failed to load cpm_kernels:

It could also be a CUDA problem. A quick test:

>>>test=torch.Tensor([1,2,3,4])
>>>test=test.cuda()
>>>test

If that fails, what GPU are you using?
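The GPU model can be printed with a standard PyTorch call:

>>> torch.cuda.get_device_name(0)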


"./chatglm-6b-int4"中就是huggingface上面对应的代码和模型文件。 没有看到关于cpm_kernels的报错。 显卡是A100,我测试下。 有没有可能是gcc和openmp的版本问题?谢谢!


I don't think it's a gcc or OpenMP problem.

RuntimeError: CUDA Error: no kernel image is available for execution on the device

This error suggests that CUDA cannot actually be used (even though torch.cuda.is_available() == True, a mismatch between the torch and CUDA versions could still prevent kernels from running correctly).

>>>test=torch.Tensor([1,2,3,4])
>>>test=test.cuda()
>>>test

Does this snippet run successfully? It tests whether CUDA is usable.
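Note that copying a tensor to the GPU can succeed even when kernel launches fail, since a plain copy does not launch a compute kernel; to actually launch one, the test can be extended with an arithmetic op (a sketch reusing the same variable):

>>> test = test * 2  # elementwise multiply launches a CUDA kernel
>>> test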


Yes, it runs successfully. I updated the code in "./chatglm-6b-int4", and the kernel now compiles, but model inference still raises "RuntimeError: CUDA Error: no kernel image is available for execution on the device". The traceback points to line 235 of quantization.py, i.e. the call to kernels.int4WeightExtractionHalf; inside cpm_kernels/library/cuda.py, the error is raised at the line checkCUStatus(cuda.cuModuleLoadData(ctypes.byref(module), data)).

This may be the same problem as issue #119.


Duplicate of #119