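When fine-tuning on a public dataset, training runs fine on my local CPU machine but fails on a Linux server with a GPU. It gets through a few batches and then stops abruptly. GPU memory and CPU usage look normal, and the Python code raises no error.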
```
python tools/train.py
[2023/02/24 01:25:00] ppocr INFO: epoch: [1/120], global_step: 2, lr: 0.001000, loss: -0.579372, loss_shrink_maps: 0.312662, loss_threshold_maps: 20.220646, loss_binary_maps: 0.000001, avg_reader_cost: 0.00045 s, avg_batch_cost: 0.68065 s, avg_samples: 1.0, ips: 1.46919 samples/s, eta: 4 days, 0:36:35
[2023/02/24 01:25:00] ppocr INFO: epoch: [1/120], global_step: 3, lr: 0.001000, loss: 0.000000, loss_shrink_maps: 0.000000, loss_threshold_maps: 0.000000, loss_binary_maps: 0.000000, avg_reader_cost: 0.00060 s, avg_batch_cost: 0.61854 s, avg_samples: 1.0, ips: 1.61670 samples/s, eta: 2 days, 23:16:43
[2023/02/24 01:25:01] ppocr INFO: epoch: [1/120], global_step: 4, lr: 0.001000, loss: 0.000000, loss_shrink_maps: 0.000000, loss_threshold_maps: 0.000000, loss_binary_maps: 0.000000, avg_reader_cost: 0.00061 s, avg_batch_cost: 0.64394 s, avg_samples: 1.0, ips: 1.55293 samples/s, eta: 2 days, 10:49:28
C++ Traceback (most recent call last):
0 paddle::pybind::ThrowExceptionToPython(std::__exception_ptr::exception_ptr)
Error Message Summary:
FatalError: Process abort signal is detected by the operating system.
[TimeInfo: Aborted at 1677201901 (unix time) try "date -d @1677201901" if you are using GNU date ]
[SignalInfo: SIGABRT (@0x1035) received by PID 4149 (TID 0x7f1393bd7080) from PID 4149 ]
```
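The error summary suggests running `date -d @1677201901` to decode the abort time. For anyone without GNU date, a rough Python equivalent is sketched below. The helper name `abort_time_utc` is made up for this illustration, and the conversion assumes UTC, which may not match the server's clock setting.

```python
from datetime import datetime, timezone

# Unix timestamp copied from the TimeInfo field of the error summary.
ABORT_UNIX_TIME = 1677201901


def abort_time_utc(unix_time: int) -> str:
    """Format a Unix timestamp in UTC, using the same layout as the ppocr log prefix."""
    moment = datetime.fromtimestamp(unix_time, tz=timezone.utc)
    return moment.strftime("%Y/%m/%d %H:%M:%S")


if __name__ == "__main__":
    # Prints 2023/02/24 01:25:01
    print(abort_time_utc(ABORT_UNIX_TIME))
```

If the server logs in UTC, this matches the timestamp of the `global_step: 4` line, which would mean the abort happened within the same second as the last logged step.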
Additional Supplementary Information: the GPU is an M40.