Every time I run `model = AutoModel.from_pretrained('./int4', trust_remote_code=True).half().cuda()`, the following output appears:
```
No compiled kernel found.
Compiling kernels : /home/devops/.cache/huggingface/modules/transformers_modules/int4/quantization_kernels_parallel.c
Compiling gcc -O3 -fPIC -pthread -fopenmp -std=c99 /home/devops/.cache/huggingface/modules/transformers_modules/int4/quantization_kernels_parallel.c -shared -o /home/devops/.cache/huggingface/modules/transformers_modules/int4/quantization_kernels_parallel.so
Kernels compiled : /home/devops/.cache/huggingface/modules/transformers_modules/int4/quantization_kernels_parallel.so
Load kernel : /home/devops/.cache/huggingface/modules/transformers_modules/int4/quantization_kernels_parallel.so
Setting CPU quantization kernel threads to 16
Using quantization cache
Applying quantization to glm layers
```
However, the model still loads and runs normally afterward.
It looks like the kernels are recompiled on every run. Is there a way to compile them once and reuse the compiled kernel afterward? (A quick check I have in mind is sketched after the code below.)
```python
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained(d, trust_remote_code=True)
model = AutoModel.from_pretrained(d, trust_remote_code=True).half().cuda()
```
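A minimal sketch of the check I mean, assuming the cache path shown in the log above stays the same between runs (the path is taken from my log, not from any documented API); it only verifies whether the compiled `.so` actually persists:

```python
import os

# Path copied from the log output above (assumption: it is stable across runs).
so_path = ("/home/devops/.cache/huggingface/modules/"
           "transformers_modules/int4/quantization_kernels_parallel.so")

if os.path.exists(so_path):
    # If this file exists but the next run still prints "No compiled kernel found",
    # then the loader is apparently not picking up the cached kernel.
    print("cached kernel exists, mtime:", os.path.getmtime(so_path))
else:
    print("no cached kernel found at", so_path)
```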
Environment
- OS: Ubuntu 22.04
- Python: 3.9.16
- Transformers: 4.27.1
- PyTorch: 2.0.0+cu117
- CUDA Support (`python -c "import torch; print(torch.cuda.is_available())"`) : True
How can I improve the response speed? Is adding more GPU memory enough on its own?
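For reference, this is roughly how I would measure it (a sketch, assuming a plain `generate()` call is representative; the prompt and generation length are arbitrary choices of mine):

```python
import time
import torch

# Hypothetical prompt just for timing; any representative input would do.
inputs = tokenizer("你好", return_tensors="pt").to(model.device)

torch.cuda.synchronize()
start = time.time()
with torch.no_grad():
    output = model.generate(**inputs, max_new_tokens=128)
torch.cuda.synchronize()
elapsed = time.time() - start

new_tokens = output.shape[-1] - inputs["input_ids"].shape[-1]
print(f"{new_tokens} tokens in {elapsed:.2f}s ({new_tokens / elapsed:.1f} tokens/s)")
print("peak GPU memory:", torch.cuda.max_memory_allocated() / 2**20, "MiB")
```

My guess is that if tokens/s does not change when more GPU memory is free, then adding memory alone will not speed up generation.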