Loading the chatglm-6b-int4 model on the GPU, the kernel compilation fails:
>>> from transformers import AutoTokenizer, AutoModel
>>> model = AutoModel.from_pretrained("./chatglm-6b-int4", trust_remote_code=True).half().cuda()
Explicitly passing a `revision` is encouraged when loading a configuration with custom code to ensure no malicious code has been contributed in a newer revision.
Explicitly passing a `revision` is encouraged when loading a model with custom code to ensure no malicious code has been contributed in a newer revision.
No compiled kernel found.
Compiling kernels : /home/pollymars/.cache/huggingface/modules/transformers_modules/local/quantization_kernels_parallel.c
Compiling gcc -O3 -pthread -fopenmp -std=c99 /home/pollymars/.cache/huggingface/modules/transformers_modules/local/quantization_kernels_parallel.c -shared -o /home/pollymars/.cache/huggingface/modules/transformers_modules/local/quantization_kernels_parallel.so
/usr/bin/ld: /tmp/ccjNdJf6.o: relocation R_X86_64_32 against `.text' can not be used when making a shared object; recompile with -fPIC
/tmp/ccjNdJf6.o: error adding symbols: Bad value
collect2: error: ld returned 1 exit status
Compile failed, using default cpu kernel code.
Compiling gcc -O3 -fPIC -std=c99 /home/pollymars/.cache/huggingface/modules/transformers_modules/local/quantization_kernels.c -shared -o /home/pollymars/.cache/huggingface/modules/transformers_modules/local/quantization_kernels.so
Kernels compiled : /home/pollymars/.cache/huggingface/modules/transformers_modules/local/quantization_kernels.so
Cannot load cpu kernel, don't use quantized model on cpu.
Using quantization cache
Applying quantization to glm layers
Calling the model then raises: RuntimeError: CUDA Error: no kernel image is available for execution on the device
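For reference, the error is raised by any inference call, e.g. ChatGLM-6B's standard `model.chat` API (a minimal sketch; the prompt is illustrative, not the exact call used):

```python
from transformers import AutoTokenizer

# Hypothetical repro of "calling the model"; model is the .half().cuda() instance above.
tokenizer = AutoTokenizer.from_pretrained("./chatglm-6b-int4", trust_remote_code=True)
response, history = model.chat(tokenizer, "你好", history=[])
# -> RuntimeError: CUDA Error: no kernel image is available for execution on the device
```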
I tried compiling the kernel manually and specifying it explicitly, but that did not fix it: loading still goes through the same compilation flow shown above and then raises the same CUDA Error.

gcc -fPIC -pthread -fopenmp -std=c99 quantization_kernels_parallel.c -shared -o quantization_kernels_parallel.so

model = AutoModel.from_pretrained("./chatglm-6b-int4", trust_remote_code=True).half().cuda()
model = model.quantize(bits=4, kernel_file="/home/pollymars/ChatGLM-6B-main/chatglm-6b-int4/quantization_kernels_parallel.so")
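Possibly relevant context: "no kernel image is available for execution on the device" usually means the installed PyTorch build does not include compiled kernels for this GPU's compute capability. A small diagnostic sketch (these `torch.cuda` helpers all exist in PyTorch 1.13):

```python
import torch

print(torch.version.cuda)                   # CUDA version the wheel was built against
print(torch.cuda.get_device_name(0))        # GPU model
print(torch.cuda.get_device_capability(0))  # e.g. (6, 1) for sm_61
print(torch.cuda.get_arch_list())           # SM architectures compiled into this build
```

If the device capability reported here is missing from the arch list, the error comes from the PyTorch build itself rather than from the quantization kernels.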
Loading the chatglm-6b-int4 model on the CPU with a manually compiled, explicitly specified kernel does run the model successfully, but inference is slow.
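For reference, the working CPU path looks like this (a sketch following the ChatGLM-6B README's CPU usage; the kernel path mirrors the one passed in the GPU attempt above and is an assumption):

```python
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("./chatglm-6b-int4", trust_remote_code=True)

# Load in float32 on the CPU instead of .half().cuda(), then point quantize()
# at the manually compiled kernel (path assumed, as in the GPU attempt).
model = AutoModel.from_pretrained("./chatglm-6b-int4", trust_remote_code=True).float()
model = model.quantize(
    bits=4,
    kernel_file="/home/pollymars/ChatGLM-6B-main/chatglm-6b-int4/quantization_kernels_parallel.so",
).eval()
```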
In short: on Linux, loading the chatglm-6b-int4 model fails to compile the GPU kernel, and manually compiling and specifying the kernel does not resolve it either.
Environment
- OS: Ubuntu 5.4.0-6ubuntu1~16.04.9
- Python: 3.8.5
- Transformers: 4.26.1
- PyTorch: 1.13.1
- CUDA Support (`python -c "import torch; print(torch.cuda.is_available())"`): True
- gcc: 5.4.0 20160609
- openmp: 201307