@mcarilli Thanks for the pointers!
The "implement backprop myself" idea was only for a proof of concept. If it comes to that, I really don't want to handle anything more than a simple dense network or a single-convolution model. Definitely not option A, haha.
Here's the stack trace with `TORCH_SHOW_CPP_STACKTRACES=1` exported:
```
    local_loss.backward()
  File "/home/ubuntu/anaconda3/lib/python3.8/site-packages/torch/tensor.py", line 245, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
  File "/home/ubuntu/anaconda3/lib/python3.8/site-packages/torch/autograd/__init__.py", line 145, in backward
    Variable._execution_engine.run_backward(
RuntimeError: CUDA error: operation would make the legacy stream depend on a capturing blocking stream
Exception raised from block at ../c10/cuda/impl/CUDAGuardImpl.h:150 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x6c (0x7f25d368cddc in /home/ubuntu/anaconda3/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 0xfa (0x7f25d365a1f4 in /home/ubuntu/anaconda3/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #2: <unknown function> + 0xb7ce (0x7f25d36bd7ce in /home/ubuntu/anaconda3/lib/python3.8/site-packages/torch/lib/libc10_cuda.so)
frame #3: <unknown function> + 0x362c437 (0x7f261d5bc437 in /home/ubuntu/anaconda3/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so)
frame #4: <unknown function> + 0x362ec6f (0x7f261d5bec6f in /home/ubuntu/anaconda3/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so)
frame #5: torch::autograd::Engine::thread_main(std::shared_ptr<torch::autograd::GraphTask> const&) + 0x23a (0x7f261d5c47ea in /home/ubuntu/anaconda3/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so)
frame #6: torch::autograd::Engine::thread_init(int, std::shared_ptr<torch::autograd::ReadyQueue> const&, bool) + 0x9f (0x7f261d5bccdf in /home/ubuntu/anaconda3/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so)
frame #7: torch::autograd::python::PythonEngine::thread_init(int, std::shared_ptr<torch::autograd::ReadyQueue> const&, bool) + 0x67 (0x7f262376baf7 in /home/ubuntu/anaconda3/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #8: <unknown function> + 0xd6d84 (0x7f262fe17d84 in /lib/x86_64-linux-gnu/libstdc++.so.6)
frame #9: <unknown function> + 0x9609 (0x7f26500cc609 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #10: clone + 0x43 (0x7f264fff3293 in /lib/x86_64-linux-gnu/libc.so.6)
```
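For anyone else who hits this: if I understand the error correctly, it means some work (here, the kernels the autograd engine launches for `backward()`) was issued on the legacy default stream while another stream was mid-capture, which CUDA forbids. Below is a minimal sketch of the warmup-then-capture pattern I'm converging on; it assumes a PyTorch build newer than the one in my trace, one that exposes `torch.cuda.CUDAGraph` and the `torch.cuda.graph` context manager, and the model and input are placeholders.

```python
import torch

model = torch.nn.Linear(64, 64).cuda()            # placeholder model
static_input = torch.randn(8, 64, device="cuda")  # placeholder input

# Warmup on a side stream: lets cuBLAS/cuDNN, the caching allocator, and the
# autograd engine do their lazy setup eagerly, outside of capture.
s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s):
    for _ in range(3):
        model.zero_grad(set_to_none=True)
        model(static_input).sum().backward()
torch.cuda.current_stream().wait_stream(s)

# Capture: torch.cuda.graph runs on its own capture stream and inserts the
# waits against the current stream, so backward() is issued on the capturing
# stream rather than on the legacy default stream.
g = torch.cuda.CUDAGraph()
model.zero_grad(set_to_none=True)
with torch.cuda.graph(g):
    static_loss = model(static_input).sum()
    static_loss.backward()

# Each replay re-runs the captured forward + backward on whatever data is
# currently in static_input; gradients land in the same .grad tensors.
static_input.copy_(torch.randn(8, 64, device="cuda"))
g.replay()
```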
I hope this helps. I should get to your suggestion about commenting out that line and trying to synchronize the streams manually sometime next week; I'll let you know how it goes. For now I'm not doing any manual synchronization before the capture, only a `torch.cuda.synchronize()` afterwards to sync the whole device. It sounds like that may be overkill, but it does get the job done. Thanks!
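For when I do try the manual sync: instead of the device-wide `torch.cuda.synchronize()`, the same ordering can be expressed against just the side stream. A sketch, assuming `s` is the stream the capture/replay work runs on:

```python
import torch

s = torch.cuda.Stream()  # side stream the capture/replay work was queued on

# ... enqueue capture or replay work on s ...

# Option 1: make only the default stream wait on s; the host doesn't block.
torch.cuda.current_stream().wait_stream(s)

# Option 2 (equivalent, spelled out with an event):
done = torch.cuda.Event()
done.record(s)
torch.cuda.current_stream().wait_event(done)

# Option 3: block the host on just this stream instead of the whole device.
s.synchronize()
```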
Edit: just noticed your comment about warmup runs. I am doing some; at the very least I need one, because if I don't, I get a `CUDNN_STATUS_ALLOC_FAILED` error.
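My working theory on the `CUDNN_STATUS_ALLOC_FAILED` (just a guess): cuDNN sets up its workspace lazily on first use, and fresh allocations aren't allowed while a stream is capturing, so the very first conv call has to happen eagerly before capture. That would also mean `torch.backends.cudnn.benchmark`, if used, should be enabled before the warmup run, since the benchmark search on the first call does its own allocations. A sketch with a hypothetical conv model:

```python
import torch

# If benchmark mode is wanted, enable it before warmup: the per-shape
# algorithm search on the first call allocates memory and must not run
# under capture.
torch.backends.cudnn.benchmark = True

conv = torch.nn.Conv2d(3, 16, kernel_size=3, padding=1).cuda()  # hypothetical
x = torch.randn(8, 3, 32, 32, device="cuda")

s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s):
    # One eager iteration: cuDNN picks an algorithm and allocates its
    # workspace here, so no fresh allocation is needed once capture starts.
    conv(x).sum().backward()
torch.cuda.current_stream().wait_stream(s)
```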