[THUDM/ChatGLM-6B][BUG/Help] Cannot run the int4 model on CPU

2024-06-12

chatglm-6b-int4 fails to run on CPU, while chatglm-6b runs fine. The main error output is as follows:


Traceback (most recent call last):
  File "C:\Users\Azure/.cache\huggingface\modules\transformers_modules\chatglm_6b_int_4\quantization.py", line 18, in <module>
    from cpm_kernels.kernels.base import LazyKernelCModule, KernelFunction, round_up
  File "C:\Users\Azure\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.11_qbz5n2kfra8p0\LocalCache\local-packages\Python311\site-packages\cpm_kernels\__init__.py", line 1, in <module>
    from . import library
  File "C:\Users\Azure\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.11_qbz5n2kfra8p0\LocalCache\local-packages\Python311\site-packages\cpm_kernels\library\__init__.py", line 2, in <module>
    from . import cuda
  File "C:\Users\Azure\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.11_qbz5n2kfra8p0\LocalCache\local-packages\Python311\site-packages\cpm_kernels\library\cuda.py", line 7, in <module>
    cuda = Lib.from_lib("cuda", ctypes.WinDLL("nvcuda.dll"))
                                ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Program Files\WindowsApps\PythonSoftwareFoundation.Python.3.11_3.11.1008.0_x64__qbz5n2kfra8p0\Lib\ctypes\__init__.py", line 376, in __init__
    self._handle = _dlopen(self._name, mode)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^
FileNotFoundError: Could not find module 'nvcuda.dll' (or one of its dependencies). Try using the full path with constructor syntax.

Message: 'Failed to load cpm_kernels:'
Arguments: (FileNotFoundError("Could not find module 'nvcuda.dll' (or one of its dependencies). Try using the full path with constructor syntax."),)
No compiled kernel found.
Compiling kernels : C:\Users\Azure\.cache\huggingface\modules\transformers_modules\chatglm_6b_int_4\quantization_kernels_parallel.c
Compiling gcc -O3 -fPIC -pthread -fopenmp -std=c99 C:\Users\Azure\.cache\huggingface\modules\transformers_modules\chatglm_6b_int_4\quantization_kernels_parallel.c -shared -o C:\Users\Azure\.cache\huggingface\modules\transformers_modules\chatglm_6b_int_4\quantization_kernels_parallel.so
Kernels compiled : C:\Users\Azure\.cache\huggingface\modules\transformers_modules\chatglm_6b_int_4\quantization_kernels_parallel.so
Cannot load cpu kernel, don't use quantized model on cpu.
Cannot load cuda kernel, quantization failed.

My file layout is as follows:

[image: screenshot of the file layout]

Expected Behavior

The int4 model should run on CPU; instead it fails with `No compiled kernel found.`, `Cannot load cpu kernel`, and `Cannot load cuda kernel`.

Steps To Reproduce

Download https://github.com/THUDM/ChatGLM-6B into ChatGLM-6B, and download https://huggingface.co/THUDM/chatglm-6b-int4 into ChatGLM-6B\chatglm_6b_int_4.

Modify cli_demo.py:

...
tokenizer = AutoTokenizer.from_pretrained("chatglm_6b_int_4", trust_remote_code=True)
model = AutoModel.from_pretrained("chatglm_6b_int_4", trust_remote_code=True).float()
...
Environment
- OS: Windows 11 Insider Preview 25336
- Python: 3.11
- Transformers:
- PyTorch:
- CUDA Support (`python -c "import torch; print(torch.cuda.is_available())"`) : False
Anything else?

https://www.datalearner.com/blog/1051680925189690

Answers


Why can the full chatglm-6b run directly while chatglm-6b-int4 cannot?


If you run into the error Could not find module 'nvcuda.dll', or RuntimeError: Unknown platform: darwin (macOS), please load the model from a local path.

This is covered in the readme.md. Give it a try.
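
For reference, a minimal local-loading sketch along the lines of the readme (the relative path "chatglm_6b_int_4" matches the layout above; adjust it to wherever the weights actually live):

from transformers import AutoModel, AutoTokenizer

# Load everything from a local directory instead of pulling from the Hub.
# "chatglm_6b_int_4" is assumed to contain the cloned chatglm-6b-int4 weights.
local_path = "chatglm_6b_int_4"
tokenizer = AutoTokenizer.from_pretrained(local_path, trust_remote_code=True)
model = AutoModel.from_pretrained(local_path, trust_remote_code=True).float()  # .float() for CPU
model = model.eval()

response, history = model.chat(tokenizer, "你好", history=[])
print(response)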

Same here, 32 GB of RAM on CPU: the full model works, but chatglm_6b_int_4 does not. It complains about missing kernels, which looks like some Linux-related component is missing...


I also tried compiling manually (tdm-gcc-10.3.0, with OpenMP selected during installation), but still got this error. I also tried modifying quantization.py:

...
# from cpm_kernels.kernels.base import LazyKernelCModule, KernelFunction, round_up
...
kernels = CPUKernel 
...

But that raises name 'CPUKernel' is not defined, and the error above still occurs.
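
One way to narrow this down (a diagnostic sketch, assuming quantization.py ultimately loads the compiled kernel with ctypes.cdll.LoadLibrary, which is what its CPU kernel path does): try loading the freshly compiled .so directly and see which error comes back.

import ctypes

# Path copied from the "Kernels compiled" line in the log above.
so_path = r"C:\Users\Azure\.cache\huggingface\modules\transformers_modules\chatglm_6b_int_4\quantization_kernels_parallel.so"

# If this raises OSError, the compiled kernel itself cannot be loaded on this
# machine (e.g. a wrong-architecture gcc, or an OpenMP runtime DLL missing
# from PATH); if it succeeds, the failure is in how quantization.py picks
# up the kernel instead.
lib = ctypes.cdll.LoadLibrary(so_path)
print(lib)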


Running into the same problem.


Following #529 still doesn't work for me. @ffi7 did you get it working? The error output is as follows:

/Users/binsun1/.cache/huggingface/hub/models--THUDM--chatglm-6b-int4/snapshots/02a065cf2797029c036a02cac30f1da1a9bc49a3
Explicitly passing a revision is encouraged when loading a model with custom code to ensure no malicious code has been contributed in a newer revision.
tokenizer loaded
Explicitly passing a revision is encouraged when loading a configuration with custom code to ensure no malicious code has been contributed in a newer revision.
Explicitly passing a revision is encouraged when loading a model with custom code to ensure no malicious code has been contributed in a newer revision.
--- Logging error ---
Traceback (most recent call last):
  File "/Users/binsun1/.cache/huggingface/modules/transformers_modules/02a065cf2797029c036a02cac30f1da1a9bc49a3/quantization.py", line 19, in <module>
    from cpm_kernels.kernels.base import LazyKernelCModule, KernelFunction, round_up
  File "/usr/local/lib/python3.10/site-packages/cpm_kernels/__init__.py", line 1, in <module>
    from . import library
  File "/usr/local/lib/python3.10/site-packages/cpm_kernels/library/__init__.py", line 1, in <module>
    from . import nvrtc
  File "/usr/local/lib/python3.10/site-packages/cpm_kernels/library/nvrtc.py", line 5, in <module>
    nvrtc = Lib("nvrtc")
  File "/usr/local/lib/python3.10/site-packages/cpm_kernels/library/base.py", line 59, in __init__
    raise RuntimeError("Unknown platform: %s" % sys.platform)
RuntimeError: Unknown platform: darwin

During handling of the above exception, another exception occurred:

Traceback (most recent call last):

File "/usr/local/Cellar/python@3.10/3.10.11/Frameworks/Python.framework/Versions/3.10/lib/python3.10/logging/init.py", line 1100, in emit msg = self.format(record) File "/usr/local/Cellar/python@3.10/3.10.11/Frameworks/Python.framework/Versions/3.10/lib/python3.10/logging/init.py", line 943, in format return fmt.format(record) File "/usr/local/Cellar/python@3.10/3.10.11/Frameworks/Python.framework/Versions/3.10/lib/python3.10/logging/init.py", line 678, in format record.message = record.getMessage() File "/usr/local/Cellar/python@3.10/3.10.11/Frameworks/Python.framework/Versions/3.10/lib/python3.10/logging/init.py", line 368, in getMessage msg = msg % self.args TypeError: not all arguments converted during string formatting Call stack: File "/Users/binsun1/AI/ChatGLM-6B/web_demo.py", line 20, in model = AutoModel.from_pretrained(mmod,trust_remote_code=True).float() File "/usr/local/lib/python3.10/site-packages/transformers/models/auto/auto_factory.py", line 466, in from_pretrained return model_class.from_pretrained( File "/usr/local/lib/python3.10/site-packages/transformers/modeling_utils.py", line 2498, in from_pretrained model = cls(config, *model_args, **model_kwargs) File "/Users/binsun1/.cache/huggingface/modules/transformers_modules/02a065cf2797029c036a02cac30f1da1a9bc49a3/modeling_chatglm.py", line 1061, in init self.quantize(self.config.quantization_bit, self.config.quantization_embeddings, use_quantization_cache=True, empty_init=True) File "/Users/binsun1/.cache/huggingface/modules/transformers_modules/02a065cf2797029c036a02cac30f1da1a9bc49a3/modeling_chatglm.py", line 1424, in quantize from .quantization import quantize, QuantizedEmbedding, QuantizedLinear, load_cpu_kernel File "", line 1027, in _find_and_load File "", line 1006, in _find_and_load_unlocked File "", line 688, in _load_unlocked File "", line 883, in exec_module File "", line 241, in _call_with_frames_removed File "/Users/binsun1/.cache/huggingface/modules/transformers_modules/02a065cf2797029c036a02cac30f1da1a9bc49a3/quantization.py", line 46, in logger.warning("Failed to load cpm_kernels:", exception) Message: 'Failed to load cpm_kernels:' Arguments: (RuntimeError('Unknown platform: darwin'),) No compiled kernel found. Compiling kernels : /Users/binsun1/.cache/huggingface/modules/transformers_modules/02a065cf2797029c036a02cac30f1da1a9bc49a3/quantization_kernels_parallel.c Compiling clang -O3 -fPIC -pthread -Xclang -fopenmp -lomp -std=c99 /Users/binsun1/.cache/huggingface/modules/transformers_modules/02a065cf2797029c036a02cac30f1da1a9bc49a3/quantization_kernels_parallel.c -shared -o /Users/binsun1/.cache/huggingface/modules/transformers_modules/02a065cf2797029c036a02cac30f1da1a9bc49a3/quantization_kernels_parallel.so Load kernel : /Users/binsun1/.cache/huggingface/modules/transformers_modules/02a065cf2797029c036a02cac30f1da1a9bc49a3/quantization_kernels_parallel.so Setting CPU quantization kernel threads to 3 Parallel kernel is not recommended when parallel num < 4. OMP: Error #15: Initializing libomp.dylib, but found libiomp5.dylib already initialized. OMP: Hint This means that multiple copies of the OpenMP runtime have been linked into the program. That is dangerous, since it can degrade performance or cause incorrect results. The best thing to do is to ensure that only a single OpenMP runtime is linked into the process, e.g. by avoiding static linking of the OpenMP runtime in any library. 
As an unsafe, unsupported, undocumented workaround you can set the environment variable KMP_DUPLICATE_LIB_OK=TRUE to allow the program to continue to execute, but that may cause crashes or silently produce incorrect results. For more information, please see http://openmp.llvm.org/ [1] 40433 abort /usr/local/Cellar/python@3.10/3.10.11/bin/python3.10 web_demo.py
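
The OMP error message itself names an escape hatch. A sketch of that workaround (the message explicitly calls it unsafe and unsupported, so it may crash or silently give wrong results):

import os

# Must be set before torch/MKL or anything else linking an OpenMP runtime
# is imported; otherwise the duplicate runtime is already initialized.
os.environ["KMP_DUPLICATE_LIB_OK"] = "TRUE"

import torch  # import only after the variable is set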


@tianxiawoyougood No luck, I gave up QAQ


@ffi7 His setup was Windows 10; following his method on a Mac keeps raising Unknown platform: darwin. So, I'm about to give up too.


ChatGLM2-6B % python3 web_demo.py
Intel MKL WARNING: Support of Intel(R) Streaming SIMD Extensions 4.2 (Intel(R) SSE4.2) enabled only processors has been deprecated. Intel oneAPI Math Kernel Library 2025.0 will require Intel(R) Advanced Vector Extensions (Intel(R) AVX) instructions.
Intel MKL WARNING: Support of Intel(R) Streaming SIMD Extensions 4.2 (Intel(R) SSE4.2) enabled only processors has been deprecated. Intel oneAPI Math Kernel Library 2025.0 will require Intel(R) Advanced Vector Extensions (Intel(R) AVX) instructions.
Failed to load cpm_kernels:Unknown platform: darwin
OMP: Error #15: Initializing libomp.dylib, but found libomp.dylib already initialized.
OMP: Hint This means that multiple copies of the OpenMP runtime have been linked into the program. That is dangerous, since it can degrade performance or cause incorrect results. The best thing to do is to ensure that only a single OpenMP runtime is linked into the process, e.g. by avoiding static linking of the OpenMP runtime in any library. As an unsafe, unsupported, undocumented workaround you can set the environment variable KMP_DUPLICATE_LIB_OK=TRUE to allow the program to continue to execute, but that may cause crashes or silently produce incorrect results. For more information, please see http://openmp.llvm.org/
zsh: abort      python3 web_demo.py

Same error here, but chatglm.cpp runs fine for me.
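
For anyone taking the chatglm.cpp route instead, a minimal sketch using its Python bindings (the package name chatglm_cpp, the Pipeline API, and the chatglm-ggml.bin path are all assumptions here; check the chatglm.cpp README for the conversion step and the exact API of your version):

import chatglm_cpp  # assumption: installed via pip install chatglm-cpp

# Assumption: the model was already converted to GGML format with
# chatglm.cpp's convert script, producing chatglm-ggml.bin.
pipeline = chatglm_cpp.Pipeline("./chatglm-ggml.bin")
print(pipeline.chat(["你好"]))  # older bindings take the chat history as a list of strings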