[PaddlePaddle/Paddle] Error when training or exporting a model in subprocesses started with multiprocessing (CUDA error(3), initialization error)

2024-02-22 535 views
2

Environment:

paddle.utils.run_check()
Running verify PaddlePaddle program ...
W0213 15:33:14.557688 1583214 gpu_resources.cc:61] Please NOTE: device: 0, GPU Compute Capability: 8.6, Driver API Version: 12.0, Runtime API Version: 11.2
W0213 15:33:14.559726 1583214 gpu_resources.cc:91] device: 0, cuDNN Version: 8.4.
PaddlePaddle works well on 1 GPU.
PaddlePaddle works well on 1 GPUs.
PaddlePaddle is installed successfully! Let's start deep learning with PaddlePaddle now.

The error log when exporting the model is as follows:

Process Process-9:
Traceback (most recent call last):
  File "/home/zksc/anaconda3/envs/PaddleX/lib/python3.7/multiprocessing/process.py", line 297, in _bootstrap
    self.run()
  File "/home/zksc/anaconda3/envs/PaddleX/lib/python3.7/multiprocessing/process.py", line 99, in run
    self._target(*self._args, **self._kwargs)
  File "/home/zksc/duanzhiqiang/code/easy_deploy/paddlex_restful/restful/project/operate.py", line 371, in _call_paddlex_export_infer
    p_export(paddleclas_yaml_path, pretrained_model, save_inference_dir)
  File "/home/zksc/duanzhiqiang/code/easy_deploy/PaddleClas/tools/export_model.py", line 46, in p_export
    engine = Engine(p_config, mode="export")
  File "/home/zksc/duanzhiqiang/code/easy_deploy/PaddleClas/ppcls/engine/engine.py", line 192, in __init__
    self.model = build_model(self.config, self.mode)
  File "/home/zksc/duanzhiqiang/code/easy_deploy/PaddleClas/ppcls/arch/__init__.py", line 40, in build_model
    arch = getattr(mod, model_type)(arch_config)
  File "/home/zksc/duanzhiqiang/code/easy_deploy/PaddleClas/ppcls/arch/backbone/legendary_models/mobilenet_v1.py", line 257, in MobileNetV1
    **kwargs)
  File "/home/zksc/duanzhiqiang/code/easy_deploy/PaddleClas/ppcls/arch/backbone/legendary_models/mobilenet_v1.py", line 125, in __init__
    padding=1)
  File "/home/zksc/duanzhiqiang/code/easy_deploy/PaddleClas/ppcls/arch/backbone/legendary_models/mobilenet_v1.py", line 64, in __init__
    bias_attr=False)
  File "/home/zksc/anaconda3/envs/PaddleX/lib/python3.7/site-packages/paddle/nn/layer/conv.py", line 700, in __init__
    data_format=data_format,
  File "/home/zksc/anaconda3/envs/PaddleX/lib/python3.7/site-packages/paddle/nn/layer/conv.py", line 160, in __init__
    default_initializer=_get_default_param_initializer(),
  File "/home/zksc/anaconda3/envs/PaddleX/lib/python3.7/site-packages/paddle/fluid/dygraph/layers.py", line 447, in create_parameter
    default_initializer)
  File "/home/zksc/anaconda3/envs/PaddleX/lib/python3.7/site-packages/paddle/fluid/layer_helper_base.py", line 379, in create_parameter
    **attr._to_kwargs(with_initializer=True))
  File "/home/zksc/anaconda3/envs/PaddleX/lib/python3.7/site-packages/paddle/fluid/framework.py", line 3965, in create_parameter
    initializer(param, self)
  File "/home/zksc/anaconda3/envs/PaddleX/lib/python3.7/site-packages/paddle/fluid/initializer.py", line 56, in __call__
    return self.forward(param, block)
  File "/home/zksc/anaconda3/envs/PaddleX/lib/python3.7/site-packages/paddle/fluid/initializer.py", line 803, in forward
    place)
OSError: (External) CUDA error(3), initialization error.
  [Hint: Please search for the error code(3) on website (https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__TYPES.html#group__CUDART__TYPES_1g3f51e3575c2178246db0a94a430e0038) to get Nvidia's official solution and advice about CUDA Error.] (at /paddle/paddle/phi/backends/gpu/cuda/cuda_info.cc:243)

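Judging from the log, the error is raised when a subprocess calls the fluid.initializer initialization function. It is intermittent: it sometimes occurs when running clas, seg, or det tasks. I have read many issues and none of them offers a solution. How should this problem be resolved?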

Answers

2

Hi! We've received your issue; please be patient while waiting for a response. We will arrange for technicians to answer your question as soon as possible. Please double-check that you have provided a clear problem description, reproduction code, environment and version details, and the error message. You may also consult the official API documentation, the FAQ, past GitHub Issues, and the AI community for an answer. Have a nice day!

3

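This error also occurs occasionally during training. Clearing the GPU memory and restarting the project makes it go away, but that only treats the symptom, not the root cause.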

0

Could you share your PaddlePaddle and CUDA/cuDNN versions, as well as the command you ran, so that we can reproduce the problem?

1

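"Could you share your PaddlePaddle and CUDA/cuDNN versions, as well as the command you ran, so that we can reproduce the problem?"

paddlepaddle-gpu 2.4.1.post112, CUDA 11.2, cuDNN 8.4. There is no standalone command: everything is run through the PaddleX RESTful API.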

7

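I have already tried that and it does not work. Could it be that calling modules such as paddleclas, paddleseg, and paddledet from a child process is itself a problem? Also, exporting a model does not use the GPU, so why is a GPU error reported at all?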

5

Exporting a model does require the GPU.

4

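If the child processes are isolated from one another, calling paddleclas, paddleseg, and paddledet should be fine. If they reference a globally visible paddle (one already imported outside the child), the problem described above is to be expected.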

7

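At the moment the main process does not import paddle; it only uses functions from within the paddlex project. The child processes call clas, det, and the other modules internally, so I do not understand where the problem comes from. Starting the child processes with multiprocessing.set_start_method("spawn") does not help either.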

The main process imports are as follows:

import os.path as osp
import os
import numpy as np
from PIL import Image
import sys
import cv2
import psutil
import shutil
import pickle
import base64
import multiprocessing as mp
from ..utils import (pkill, set_folder_status, get_folder_status, TaskStatus, PredictStatus, PruneStatus)
from .evaluate.draw_pred_result import visualize_classified_result, visualize_detected_result, visualize_segmented_result
from .visualize import plot_det_label, plot_insseg_label, get_color_map_list
from paddlex_restful.restful.dataset.utils import get_encoding
from pathlib import Path
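The setup described in this reply (a parent that never imports paddle, children started with "spawn") might be sketched as follows. This is a minimal illustration under stated assumptions, not the PaddleX code: export_worker, run_isolated, and paddle_already_loaded are hypothetical names, and the actual export call is left out because the thread does not show it. The sketch uses multiprocessing.get_context("spawn") rather than set_start_method, since the latter changes global state and raises RuntimeError if the start method has already been fixed elsewhere.

```python
import multiprocessing as mp
import sys


def paddle_already_loaded():
    """Return True if paddle or any paddle.* submodule is loaded here."""
    return any(n == "paddle" or n.startswith("paddle.") for n in sys.modules)


def export_worker(config_path):
    # Hypothetical stand-in for the thread's _call_paddlex_export_infer.
    # Under "spawn" the child is a fresh interpreter, so paddle should not be
    # loaded at this point; if it is, a top-level import in the main script
    # (or something it imports) pulled it in.
    preloaded = paddle_already_loaded()
    print(f"[child] paddle loaded before worker import: {preloaded}")
    # Assumption: the framework import and the export call would go here,
    # inside the child only, for example:
    #   import paddle
    return preloaded


def run_isolated(target, *args):
    """Run target(*args) in a child created from an explicit spawn context."""
    ctx = mp.get_context("spawn")  # local context, no global start method change
    proc = ctx.Process(target=target, args=args)
    proc.start()
    proc.join()
    return proc.exitcode


if __name__ == "__main__":
    # The guard is required with spawn: the child re-imports the main module.
    print("child exit code:", run_isolated(export_worker, "config.yaml"))
```

If the child reports that paddle was already loaded before the worker's own import, the parent side is not as clean as assumed, which would be consistent with the "globally visible paddle" explanation given earlier in the thread.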

4

It may be worth checking more carefully. For example, a module like utils may itself already import paddle.
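One way to carry out this kind of check without reading every file is to compare sys.modules before and after an import. The helper names below are hypothetical, and the module path mentioned in the comment ("paddlex_restful.restful.utils") is an assumption inferred from the relative import in the traceback's operate.py. The check is only meaningful in a fresh interpreter, because a package that was already loaded earlier will not show up as newly loaded.

```python
import importlib
import sys


def newly_loaded_by_import(module_name):
    """Import module_name and return the modules loaded as a side effect."""
    before = set(sys.modules)
    importlib.import_module(module_name)
    return sorted(set(sys.modules) - before)


def pulls_in(module_name, package="paddle"):
    """True if importing module_name loads `package` or one of its submodules."""
    loaded = newly_loaded_by_import(module_name)
    return any(n == package or n.startswith(package + ".") for n in loaded)


if __name__ == "__main__":
    # A stdlib module is used so the sketch runs anywhere. In the project from
    # this thread one would pass a name such as "paddlex_restful.restful.utils"
    # (assumed path) and run the script in a fresh interpreter.
    print(pulls_in("json"))
```

Matching on "paddle" and "paddle." deliberately excludes packages such as paddlex, whose names merely start with the same letters.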

0

I have checked. None of the other files or functions in paddlex import paddle.