Every time I run `model = AutoModel.from_pretrained('./int4', trust_remote_code=True).half().cuda()`, the following output appears:
```
No compiled kernel found.
Compiling kernels : /home/devops/.cache/huggingface/modules/transformers_modules/int4/quantization_kernels_parallel.c
Compiling gcc -O3 -fPIC -pthread -fopenmp -std=c99 /home/devops/.cache/huggingface/modules/transformers_modules/int4/quantization_kernels_parallel.c -shared -o /home/devops/.cache/huggingface/modules/transformers_modules/int4/quantization_kernels_parallel.so
Kernels compiled : /home/devops/.cache/huggingface/modules/transformers_modules/int4/quantization_kernels_parallel.so
Load kernel : /home/devops/.cache/huggingface/modules/transformers_modules/int4/quantization_kernels_parallel.so
Setting CPU quantization kernel threads to 16
Using quantization cache
Applying quantization to glm layers
```
However, the model still loads and runs normally afterward.
It looks like the kernels are recompiled on every run. Is there a way to compile them once and reuse the compiled kernel afterward? (A quick check I have in mind is sketched after the code below.)
```python
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained(d, trust_remote_code=True)
model = AutoModel.from_pretrained(d, trust_remote_code=True).half().cuda()
```
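A minimal sketch of the check I mean, assuming the cache path shown in the log above stays the same between runs (the path is taken from my log, not from any documented API); it only verifies whether the compiled `.so` actually persists:

```python
import os

# Path copied from the log output above (assumption: it is stable across runs).
so_path = ("/home/devops/.cache/huggingface/modules/"
           "transformers_modules/int4/quantization_kernels_parallel.so")

if os.path.exists(so_path):
    # If this file exists but the next run still prints "No compiled kernel found",
    # then the loader is apparently not picking up the cached kernel.
    print("cached kernel exists, mtime:", os.path.getmtime(so_path))
else:
    print("no cached kernel found at", so_path)
```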
Environment
- OS: Ubuntu 22.04
- Python: 3.9.16
- Transformers: 4.27.1
- PyTorch: 2.0.0+cu117
- CUDA Support (`python -c "import torch; print(torch.cuda.is_available())"`) : True
How can I improve the response speed? Is adding more GPU memory enough on its own?
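For reference, this is roughly how I would measure it (a sketch, assuming a plain `generate()` call is representative; the prompt and generation length are arbitrary choices of mine):

```python
import time
import torch

# Hypothetical prompt just for timing; any representative input would do.
inputs = tokenizer("你好", return_tensors="pt").to(model.device)

torch.cuda.synchronize()
start = time.time()
with torch.no_grad():
    output = model.generate(**inputs, max_new_tokens=128)
torch.cuda.synchronize()
elapsed = time.time() - start

new_tokens = output.shape[-1] - inputs["input_ids"].shape[-1]
print(f"{new_tokens} tokens in {elapsed:.2f}s ({new_tokens / elapsed:.1f} tokens/s)")
print("peak GPU memory:", torch.cuda.max_memory_allocated() / 2**20, "MiB")
```

My guess is that if tokens/s does not change when more GPU memory is free, then adding memory alone will not speed up generation.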