[THUDM/ChatGLM-6B][Help] "No compiled kernel found." appears on every run

2024-06-12 685 views
1

Every time I run model = AutoModel.from_pretrained('./int4', trust_remote_code=True).half().cuda(), I get the following message:

No compiled kernel found.
Compiling kernels : /home/devops/.cache/huggingface/modules/transformers_modules/int4/quantization_kernels_parallel.c
Compiling gcc -O3 -fPIC -pthread -fopenmp -std=c99 /home/devops/.cache/huggingface/modules/transformers_modules/int4/quantization_kernels_parallel.c -shared -o /home/devops/.cache/huggingface/modules/transformers_modules/int4/quantization_kernels_parallel.so
Kernels compiled : /home/devops/.cache/huggingface/modules/transformers_modules/int4/quantization_kernels_parallel.so
Load kernel : /home/devops/.cache/huggingface/modules/transformers_modules/int4/quantization_kernels_parallel.so
Setting CPU quantization kernel threads to 16
Using quantization cache
Applying quantization to glm layers

The program still continues and runs normally afterwards, though.

It looks like the kernels are recompiled on every run. Is there a way to compile them once and reuse the result?

from transformers import AutoTokenizer, AutoModel
# d is the path to the local model directory ('./int4' in the call above)
tokenizer = AutoTokenizer.from_pretrained(d, trust_remote_code=True)
model = AutoModel.from_pretrained(d, trust_remote_code=True).half().cuda()

Environment
- OS: Ubuntu 22.04
- Python: 3.9.16
- Transformers: 4.27.1
- PyTorch: 2.0.0+cu117
- CUDA Support (`python -c "import torch; print(torch.cuda.is_available())"`) : True

Also, how can I improve response speed? Is it enough to just add more GPU memory?

Answers

8

This is the kernel being compiled. It is not an error.

1

No compiled kernel found.
Compiling kernels : /home/xujm/.cache/huggingface/modules/transformers_modules/chatglm-6b-int4/quantization_kernels_parallel.c
Compiling gcc -O3 -fPIC -pthread -fopenmp -std=c99 /home/xujm/.cache/huggingface/modules/transformers_modules/chatglm-6b-int4/quantization_kernels_parallel.c -shared -o /home/xujm/.cache/huggingface/modules/transformers_modules/chatglm-6b-int4/quantization_kernels_parallel.so
/usr/bin/ld: cannot find -lgcc_s
collect2: ld returned 1 exit status
Compile default cpu kernel failed, using default cpu kernel code.
Compiling gcc -O3 -fPIC -std=c99 /home/xujm/.cache/huggingface/modules/transformers_modules/chatglm-6b-int4/quantization_kernels.c -shared -o /home/xujm/.cache/huggingface/modules/transformers_modules/chatglm-6b-int4/quantization_kernels.so
/usr/bin/ld: cannot find -lgcc_s
collect2: ld returned 1 exit status
Compile default cpu kernel failed.
Failed to load kernel.
Cannot load cpu kernel, don't use quantized model on cpu.
Using quantization cache
Applying quantization to glm layers
INFO: Started server process [75233]
INFO: Waiting for application startup.
INFO: Application startup complete.
INFO: Uvicorn running on http://0.0.0.0:7000 (Press CTRL+C to quit)

This is what I get. Inference does not work, and it fails with: RuntimeError: Library cudart is not initialized

1

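Yes, it can be pinned. First, pass cache_dir to from_pretrained() when you load the model. Taking your int4 quantized model as an example: copy the int4 folder from .cache into a directory of your choice, then put the pretrained model weight files into that same directory. The path of that directory is your cache_dir. Next, edit modeling_chatglm.py in that folder: add a kernel_file="" parameter to the initializer of the ChatGLMForConditionalGeneration class, and add kernel_file=kernel_file to the self.quantize() call at the end of that function. After that, pass kernel_file to from_pretrained() as well. You can copy the .so file produced by the first compilation into cache_dir, in which case the value of kernel_file is cache_dir joined with the .so filename.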
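A minimal sketch of the last step above (copying the compiled .so into cache_dir and building the kernel_file path). The helper name stage_kernel and the /path/to/cache_dir location are hypothetical; the source .so path is the one from the first log in this thread. The commented-out from_pretrained() call follows the answer's description and only accepts kernel_file after modeling_chatglm.py has been edited as described, so treat it as an assumption rather than the library's stock API.

```python
import shutil
from pathlib import Path


def stage_kernel(compiled_so: str, cache_dir: str) -> str:
    """Copy an already-compiled kernel .so into cache_dir and return its new path.

    stage_kernel is a hypothetical helper, not part of ChatGLM-6B or transformers.
    """
    src = Path(compiled_so)
    dst_dir = Path(cache_dir)
    dst_dir.mkdir(parents=True, exist_ok=True)
    dst = dst_dir / src.name  # kernel_file = cache_dir joined with the .so filename
    if src.resolve() != dst.resolve():
        shutil.copy2(src, dst)
    return str(dst)


# Usage sketch (assumes modeling_chatglm.py in cache_dir was edited to accept kernel_file):
#
# from transformers import AutoModel
# cache_dir = "/path/to/cache_dir"  # hypothetical location
# kernel_file = stage_kernel(
#     "/home/devops/.cache/huggingface/modules/transformers_modules/int4/"
#     "quantization_kernels_parallel.so",
#     cache_dir,
# )
# model = AutoModel.from_pretrained(
#     cache_dir, trust_remote_code=True, kernel_file=kernel_file
# ).half().cuda()
```

The copy is skipped when the file is already in place, so calling the helper on every start is harmless.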

6

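Sorry, my previous reply was addressed to the earlier task, which is already closed. In your case the compilation itself fails because the libgcc_s.so library is missing. Your GCC installation is probably broken, so I suggest reinstalling GCC.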
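One way to check whether GCC can locate libgcc_s.so before reinstalling is to ask the compiler directly. The sketch below relies on gcc's -print-file-name option, which prints an absolute path when the library is found and echoes the bare name back when it is not. The function names are hypothetical, and this is a diagnostic aid under those assumptions, not a fix.

```python
import os
import shutil
import subprocess
from typing import Optional


def is_resolved(gcc_output: str) -> bool:
    """gcc echoes the bare file name back when it cannot find the library."""
    return os.path.isabs(gcc_output.strip())


def find_libgcc_s() -> Optional[str]:
    """Return the path gcc would link for -lgcc_s, or None if gcc or the library is missing."""
    if shutil.which("gcc") is None:
        return None
    result = subprocess.run(
        ["gcc", "-print-file-name=libgcc_s.so"],
        capture_output=True,
        text=True,
        check=False,
    )
    path = result.stdout.strip()
    return path if is_resolved(path) else None


if __name__ == "__main__":
    print(find_libgcc_s() or "gcc could not locate libgcc_s.so")
```

If this reports that the library cannot be located, that is consistent with the "cannot find -lgcc_s" linker error in the log above.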

8

image

What is the problem in this case, and how do I fix it? Connecting to localhost:7860 times out.

0

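You are running this on a remote server. Either forward the port to your local machine through an SSH tunnel, or bind the server to 0.0.0.0 and access it through the server's external IP (the latter requires opening the port in the firewall).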
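For the SSH tunnel option, a small sketch of the command involved. The host user@remote-server is a placeholder, and port 7860 is taken from the question above. The -L option forwards a local port to a port on the remote side, and -N tells ssh not to run a remote command.

```python
from typing import List


def ssh_tunnel_command(user_host: str, port: int = 7860) -> List[str]:
    """Build an ssh command that forwards local `port` to `port` on the remote host.

    user_host is a placeholder such as "user@remote-server".
    """
    return ["ssh", "-N", "-L", f"{port}:localhost:{port}", user_host]


if __name__ == "__main__":
    # Prints: ssh -N -L 7860:localhost:7860 user@remote-server
    print(" ".join(ssh_tunnel_command("user@remote-server")))
```

While that ssh process is running on the local machine, opening localhost:7860 in a local browser reaches the service on the server.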

0

So does this mean the model started successfully, and I just failed to reach it?

7

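There is an Error in your log, so it probably did not start successfully. Also, you should open a new issue for this question, or look for a similar existing issue.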