[PaddlePaddle/Paddle] Running ResNet50 with AMP reports that the fused_batch_norm_act operator cannot be found

While trying to run the AMP-enabled training from the DLE samples (https://github.com/NVIDIA/DeepLearningExamples/tree/master/PaddlePaddle/Classification/RN50v1.5#inference-process), I get the following error. What could be causing it?

UnavailableError: There are no kernels which are registered in the fused_batch_norm_act operator. [Hint: Expected kernels_iter != all_op_kernels.end(), but received kernels_iter == all_op_kernels.end().] (at /babyblue/paddle_workspace/Paddle/paddle/fluid/framework/operator.cc:1667) [operator < fused_batch_norm_act > error]

bluebabyxp

Which version of Paddle are you using?
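
If it helps, the version and build type can be read from the same Python environment that runs the training script. The snippet below is a minimal sketch: `report_paddle_build` is a hypothetical helper name, and it assumes `python3` is the interpreter in use. It prints `paddle.__version__` and whether the build was compiled with CUDA, since a CPU-only build would be one possible (unconfirmed in this thread) reason for a GPU operator to have no registered kernels.

```shell
# Hypothetical helper: report the installed Paddle version and build type.
# Assumes the training script runs under `python3`.
report_paddle_build() {
    if command -v python3 >/dev/null 2>&1 && python3 -c "import paddle" >/dev/null 2>&1; then
        # paddle.__version__ and paddle.is_compiled_with_cuda() are public Paddle 2.x APIs.
        python3 -c "import paddle; print('paddle version:', paddle.__version__); print('compiled with CUDA:', paddle.is_compiled_with_cuda())"
    else
        echo "paddle is not importable from python3 in this environment"
    fi
}

report_paddle_build
```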

yeliang2258

v2.3.2

bluebabyxp

@bluebabyxp, is your Paddle from NVIDIA NGC? https://catalog.ngc.nvidia.com/orgs/nvidia/containers/paddlepaddle

For example:

$ docker pull nvcr.io/nvidia/paddlepaddle:22.12-py3

P.S. The latest NVIDIA NGC Paddle is v2.4.

jeng1220

Yes, it was downloaded from the NGC website. Does 2.3.2 have problems with AMP?

bluebabyxp

It should run normally. The recent RN50 updates did not touch fused_batch_norm_act, and our internal monthly test records show it passing. What model is your GPU?

jeng1220

Which NGC Paddle version do you have (22.?-py3)? Running `docker images` should show it.

jeng1220

nvcr.io/nvidia/paddlepaddle 22.12-py3

bluebabyxp

GPU is A100

bluebabyxp

OK, I'll try it this week.

jeng1220

Hi @bluebabyxp, we tested AMP training on our side and found no problems. The test script is as follows:

git clone the repository and change the directory

git clone https://github.com/NVIDIA/DeepLearningExamples.git
cd DeepLearningExamples/PaddlePaddle/Classification/RN50v1.5

modify the Dockerfile

Change the first line of the Dockerfile to `ARG FROM_IMAGE_NAME=nvcr.io/nvidia/paddlepaddle:22.12-py3`

build docker image
```
docker build . -t nvidia_resnet50
```

launch docker container

docker run --gpus all --rm -it -v <path to imagenet>:/imagenet --ipc=host nvidia_resnet50
# Make sure `ls /imagenet` shows `train` and `val` folder
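
The layout check mentioned in the comment above could be scripted as follows. This is only a sketch: `check_imagenet_layout` is a hypothetical helper, and it checks nothing beyond the presence of the `train` and `val` directories.

```shell
# Hypothetical helper: verify that a dataset root contains `train` and `val`.
check_imagenet_layout() {
    local root="${1:-/imagenet}" split
    for split in train val; do
        if [ ! -d "$root/$split" ]; then
            echo "missing directory: $root/$split" >&2
            return 1
        fi
    done
    echo "found train and val under $root"
}

# Inside the container (the mount point used above is /imagenet):
# check_imagenet_layout /imagenet
```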

run AMP training

bash scripts/training/train_resnet50_AMP_90E_DGXA100.sh # defaults to 8 GPUs; edit the script to use a different number
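
The Dockerfile edit from the steps above can also be scripted. A minimal sketch, assuming the base image is declared on the first line of the Dockerfile as described; `set_base_image` is a hypothetical helper, not something shipped with the repository.

```shell
# Hypothetical helper: point a Dockerfile at a specific base image
# by rewriting its first line to `ARG FROM_IMAGE_NAME=<image>`.
set_base_image() {
    local dockerfile="$1" image="$2" tmp
    tmp="$(mktemp)"
    # Replace only line 1; every other line is copied unchanged.
    sed "1s|.*|ARG FROM_IMAGE_NAME=${image}|" "$dockerfile" > "$tmp" && mv "$tmp" "$dockerfile"
}

# Usage inside DeepLearningExamples/PaddlePaddle/Classification/RN50v1.5:
# set_base_image Dockerfile nvcr.io/nvidia/paddlepaddle:22.12-py3
# docker build . -t nvidia_resnet50
```

Writing to a temporary file and moving it into place avoids relying on `sed -i`, whose syntax differs between GNU and BSD sed.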

leo0519
