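[THUDM/ChatGLM-6B][BUG/Help] Is there a good way to improve QPS?

I load the int4-quantized model. It uses only about 1/3 of a single GPU's memory, but in practice QPS tops out at around 1. Is there a good way to increase the QPS of calls?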

Environment

- OS:
- Python:
- Transformers:
- PyTorch:
- CUDA Support (`python -c "import torch; print(torch.cuda.is_available())"`) :

hiber-niu

Same question.

lfcxlfcx

Use multiple processes. I deploy with Django + gunicorn; with 24 GB of GPU memory you can set gunicorn's `workers` to 3, which gives you 3 concurrent requests.

ray-008

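Are you using multiple GPUs? Yesterday I tried starting 2 workers with gunicorn and the service kept restarting on its own. I'm not sure whether that is because the load exceeded what the service can handle.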

hiber-niu

No, not multiple GPUs. I just have a single 3090.

ray-008

So you deploy with Django + gunicorn + api.py (FastAPI) and 3 workers, and you have not run into the service restarting?

hiber-niu

I just double-checked. It is Django + gunicorn with 3 workers. I am not using FastAPI, and it has never restarted on its own. 3 concurrent requests work fine.

ray-008

May I ask how you implemented it? Could you share the relevant code? Thanks a lot.

Lukangkang123

1. `gunicorn.conf.py` configuration file:

```
import os

chdir = os.path.dirname(os.path.abspath(__file__))
reload = False
bind = "0.0.0.0:10000"
daemon = False
worker_class = 'gevent'
threads = 1
backlog = 512
timeout = 60
# Number of worker processes; this is the key parameter
workers = 3
```

2. `ai_chat` is the name of your Django project. In `views.py`, write the model-loading code as usual:

```
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("THUDM/chatglm-6b", trust_remote_code=True)
model = AutoModel.from_pretrained("THUDM/chatglm-6b", trust_remote_code=True).half().cuda()
model = model.eval()
response, history = model.chat(tokenizer, "你好", history=[])
```

3. Start command:

```
gunicorn -c gunicorn.conf.py ai_chat.wsgi:application
```

This raises QPS by running multiple processes on a single machine with a single GPU. The official repo has also added a [multi-GPU deployment](https://github.com/THUDM/ChatGLM-6B/tree/main#%E5%A4%9A%E5%8D%A1%E9%83%A8%E7%BD%B2) guide; you can follow that tutorial to switch to a single machine with multiple GPUs.

ray-008

Thank you very much, I got it running! One more question: could I make it multi-threaded by setting `threads` to a larger value?

Lukangkang123

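I tested multithreading before and it had no effect: concurrency stayed at 1. My guess is that large-model inference is exclusive and cannot be shared across threads, so running multiple processes is the only option.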

ray-008

Got it, thank you very much!

Lukangkang123

It just occurred to me that evaluation can run with a batch size. Could inference be done the same way as evaluation? https://github.com/THUDM/ChatGLM-6B/blob/main/ptuning/evaluate.sh
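
The batching idea above could be approached in a serving process by collecting requests that arrive close together and handing them to the model in one call. The sketch below shows only the request-collecting part in plain Python. `MicroBatcher` and `fake_batch_generate` are hypothetical names, and `fake_batch_generate` is a stand-in for a real batched model call; whether ChatGLM-6B's `model.chat` can be batched this way is not confirmed in this thread.

```python
import queue
import threading
import time
from concurrent.futures import Future


class MicroBatcher:
    """Group requests that arrive close together and process them in one call."""

    def __init__(self, batch_fn, max_batch_size=4, max_wait_s=0.05):
        self.batch_fn = batch_fn  # takes a list of prompts, returns a list of replies
        self.max_batch_size = max_batch_size
        self.max_wait_s = max_wait_s
        self._requests = queue.Queue()
        threading.Thread(target=self._loop, daemon=True).start()

    def submit(self, prompt):
        """Enqueue one prompt; the returned Future resolves to its reply."""
        future = Future()
        self._requests.put((prompt, future))
        return future

    def _loop(self):
        while True:
            batch = [self._requests.get()]  # block until the first request arrives
            deadline = time.monotonic() + self.max_wait_s
            while len(batch) < self.max_batch_size:
                remaining = deadline - time.monotonic()
                if remaining <= 0:
                    break
                try:
                    batch.append(self._requests.get(timeout=remaining))
                except queue.Empty:
                    break
            prompts = [prompt for prompt, _ in batch]
            try:
                replies = self.batch_fn(prompts)
                for (_, future), reply in zip(batch, replies):
                    future.set_result(reply)
            except Exception as exc:  # pass the failure on to every waiting caller
                for _, future in batch:
                    future.set_exception(exc)


def fake_batch_generate(prompts):
    # Hypothetical placeholder: a real version would tokenize the prompts with
    # padding and run a single batched generation call on the GPU.
    return [f"reply to: {p}" for p in prompts]


batcher = MicroBatcher(fake_batch_generate, max_batch_size=3)
futures = [batcher.submit(p) for p in ["a", "b", "c"]]
results = [f.result(timeout=5) for f in futures]
```

Each web request would call `submit` and wait on its own `Future`, so the handler code stays per-request while the GPU sees batches. The wait window trades a little latency for throughput.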

Lukangkang123

How did you get the int4 model to run on the GPU? Mine always ends up running on the CPU, and it also reports this error: `No compiled kernel found.`

rxy1212

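I can start 3 workers, but suppose each worker takes 5 s to reply. If I send 3 concurrent requests, in theory the results should come back in 5 s, yet in practice they take 15 s. Is it the same for you?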

TTyb

15 s is perfectly normal. GPU utilization is already at 100%, so it makes no difference how many workers you start.
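
The numbers in this exchange can be checked with a little arithmetic. The sketch below assumes the GPU effectively handles one request at a time and that every request needs the same 5 s of GPU work; both are simplifications, and the function names are made up for illustration.

```python
def serialized_latency_s(num_requests, seconds_per_request):
    """Time until the last reply when the GPU work cannot overlap."""
    return num_requests * seconds_per_request


def qps(num_requests, total_seconds):
    """Completed requests per second over the whole run."""
    return num_requests / total_seconds


total = serialized_latency_s(3, 5.0)  # 3 concurrent requests, 5 s of GPU work each
throughput = qps(3, total)
print(total, throughput)
```

Under these assumptions the three requests need 15 s in total, which is the same throughput as a single worker answering one request every 5 s.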

rxy1212

It is actually not fully used. One worker uses about 6 GB, so 3 workers use only about 18 GB, and I have 40 GB of GPU memory.

TTyb

GPU memory is one thing; GPU utilization is another.
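
To see the two quantities side by side, `nvidia-smi` can report compute utilization and memory use in one query. This is a minimal sketch, assuming `nvidia-smi` is on the PATH; the helper names are made up for illustration, and the sample line is invented rather than measured.

```python
import subprocess

QUERY = [
    "nvidia-smi",
    "--query-gpu=utilization.gpu,memory.used,memory.total",
    "--format=csv,noheader,nounits",
]


def parse_gpu_line(line):
    """Parse one CSV line such as '100, 18432, 40960' into a dict."""
    util, used, total = (int(field.strip()) for field in line.split(","))
    return {
        "compute_util_pct": util,
        "memory_used_mib": used,
        "memory_used_pct": round(100 * used / total, 1),
    }


def read_gpus():
    """Query every visible GPU; requires an NVIDIA driver to be installed."""
    out = subprocess.run(QUERY, capture_output=True, text=True, check=True).stdout
    return [parse_gpu_line(line) for line in out.splitlines() if line.strip()]


# Invented sample: compute is saturated while memory is less than half full.
sample = parse_gpu_line("100, 18432, 40960")
```

On a machine with a GPU, calling `read_gpus()` while requests are in flight would show whether compute or memory is the limiting resource.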

rxy1212

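Understood, thanks! One more question: how can this be configured in PyTorch? A single GPU has a lot of memory, and if utilization sits at 100% it blocks other processes. Is there a way to keep utilization within some range, or to use only as much as is needed instead of going straight to 100%?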

TTyb
