[PaddlePaddle/Paddle] Running ResNet50 with AMP reports that the fused_batch_norm_act operator cannot be found

2024-03-22 112 views

While trying to run the AMP-enabled training from the DLE samples (https://github.com/NVIDIA/DeepLearningExamples/tree/master/PaddlePaddle/Classification/RN50v1.5#inference-process), I get the following error. What could be the cause?

UnavailableError: There are no kernels which are registered in the fused_batch_norm_act operator. [Hint: Expected kernels_iter != all_op_kernels.end(), but received kernels_iter == all_op_kernels.end().] (at /babyblue/paddle_workspace/Paddle/paddle/fluid/framework/operator.cc:1667) [operator < fused_batch_norm_act > error]

Answers

Which version of Paddle are you using?

v2.3.2

I downloaded it from the NGC website. Is there a known problem with using AMP on 2.3.2?

It should run correctly. The recent RN50 updates did not touch fused_batch_norm_act, and our internal monthly runs are passing. What model is your GPU?

Which NGC Paddle version are you on (22.?-py3)? You should be able to see it with $ docker images
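For readers following along, here is a minimal sketch of how the image tag asked about above might be looked up. The function names `list_paddle_images` and `ngc_release_from_image` are hypothetical helpers, not part of Docker or Paddle, and the parsing assumes the tag has the `<YY.MM>-py3` shape seen in this thread.

```shell
# Hypothetical helper: list locally available PaddlePaddle NGC images
# as "repository:tag" lines (requires the docker CLI on the host).
list_paddle_images() {
    docker images nvcr.io/nvidia/paddlepaddle --format '{{.Repository}}:{{.Tag}}'
}

# Hypothetical helper: extract the NGC release (e.g. "22.12") from an image
# reference. Assumes the tag looks like "<YY.MM>-py3".
ngc_release_from_image() {
    image="$1"
    tag="${image##*:}"            # text after the last colon, e.g. "22.12-py3"
    printf '%s\n' "${tag%-py3}"   # drop the "-py3" suffix
}
```

On a host with Docker installed, something like `list_paddle_images | while read -r img; do ngc_release_from_image "$img"; done` would print one release string per local image.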

nvcr.io/nvidia/paddlepaddle 22.12-py3

GPU is A100

OK, I'll try it this week.

Hi @bluebabyxp, we tested AMP training on our side and did not find any problem. The test steps are as follows:

  1. git clone the repository and change the directory

    git clone https://github.com/NVIDIA/DeepLearningExamples.git
    cd DeepLearningExamples/PaddlePaddle/Classification/RN50v1.5
  2. modify the Dockerfile: change its first line to ARG FROM_IMAGE_NAME=nvcr.io/nvidia/paddlepaddle:22.12-py3

  3. build docker image

    docker build . -t nvidia_resnet50
  4. launch docker container

    docker run --gpus all --rm -it -v <path to imagenet>:/imagenet --ipc=host nvidia_resnet50
    # Make sure `ls /imagenet` shows `train` and `val` folder
  5. run AMP training

    bash scripts/training/train_resnet50_AMP_90E_DGXA100.sh # defaults to 8 GPUs; edit the script to use a different number
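
Steps 2 and 4 above involve a manual file edit and a manual directory check. The following is a rough sketch of how they could be scripted; `set_base_image` and `check_imagenet_layout` are hypothetical helper names introduced here for illustration, not part of the DeepLearningExamples repository.

```shell
# Hypothetical helper: rewrite line 1 of a Dockerfile so that it pins the
# given base image. Writes through a temporary file rather than using
# `sed -i`, whose syntax differs between GNU and BSD sed.
set_base_image() {
    dockerfile="$1"
    image="$2"
    sed "1s|.*|ARG FROM_IMAGE_NAME=${image}|" "$dockerfile" > "${dockerfile}.tmp" &&
        mv "${dockerfile}.tmp" "$dockerfile"
}

# Hypothetical helper: verify that a dataset root contains the `train` and
# `val` folders mentioned in step 4. Returns non-zero if either is missing.
check_imagenet_layout() {
    root="$1"
    for split in train val; do
        if [ ! -d "$root/$split" ]; then
            echo "missing directory: $root/$split" >&2
            return 1
        fi
    done
    return 0
}
```

For example, `set_base_image Dockerfile nvcr.io/nvidia/paddlepaddle:22.12-py3` would be run before `docker build`, and `check_imagenet_layout /imagenet` inside the container before starting training.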