[PaddlePaddle/Paddle] scatter.cu.h error during inference after converting a Paddle dynamic graph to a static graph

Please ask your question

Error output:

Error: /paddle/paddle/phi/kernels/funcs/scatter.cu.h:101 Assertion index_value >= 0 && index_value < output_dims[j] failed. The index is out of bounds, please check whether the dimensions of index and input meet the requirements. It should be less than [1] and greater or equal to 0, but received [0]
Error: /paddle/paddle/phi/kernels/funcs/scatter.cu.h:101 Assertion index_value >= 0 && index_value < output_dims[j] failed. The index is out of bounds, please check whether the dimensions of index and input meet the requirements. It should be less than [21] and greater or equal to 0, but received [0]

Code that triggers the error:

paddle.scatter_nd(
        index=sparse_tensor_indices,
        updates=point_indices + 1,
        shape=output_shape) - 1

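The error message says "It should be less than [1] and greater or equal to 0, but received [0]". The received value is 0, so why does it fail the check? If I replace the failing code with a manually constructed Tensor that has the same shape as the result of paddle.scatter_nd, the error does not occur.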
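
For reference, the bounds check quoted in the assertion can be sketched in plain Python. This is a hypothetical reference model, not Paddle's kernel: `scatter_nd_reference` and the sample data are illustrative names, and the sketch assumes each index row addresses a single scalar element. Under that condition an index of 0 is in range for a dimension of size 1, which is why the reported message looks contradictory. The sketch also shows the `+ 1` / `- 1` pattern from the snippet above, which leaves -1 in every cell that received no update.

```python
def scatter_nd_reference(index, updates, shape):
    """Accumulate scalar updates into a zero-filled, row-major flat output."""
    size = 1
    for dim in shape:
        size *= dim
    out = [0] * size
    for row, update in zip(index, updates):
        flat = 0
        for j, index_value in enumerate(row):
            # Same condition as the assertion text in the error message.
            if not (0 <= index_value < shape[j]):
                raise IndexError(
                    f"index {index_value} out of bounds for dimension {j} of size {shape[j]}"
                )
            flat = flat * shape[j] + index_value
        out[flat] += update  # duplicate indices are summed
    return out


# Hypothetical data: two points scattered into a 1 x 3 grid.
sparse_tensor_indices = [[0, 0], [0, 2]]
point_indices = [0, 1]
dense = [
    value - 1
    for value in scatter_nd_reference(
        sparse_tensor_indices, [p + 1 for p in point_indices], [1, 3]
    )
]
print(dense)  # [0, -1, 1]
```

Comparing the real `sparse_tensor_indices` and `output_shape` against a check like this, in dynamic-graph mode, is one way to confirm whether the indices themselves are valid before suspecting the export step.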

Environment: paddlepaddle-gpu=2.4.1.post112, cuda=11.1, python3.8, cudnn=8.0.5

wayyeah

Could you provide your Paddle version, or try the latest Paddle dev version?

LokeZhou

paddlepaddle-gpu=2.4.1.post112

wayyeah

@LokeZhou, with the Paddle dev version I get a new error. The output is as follows:

2023-05-09 16:49:37,945 -  WARNING - No custom op iou3d_nms_cuda found, try JIT build
Compiling user custom op, it will cost a few seconds.....
2023-05-09 16:49:39,816 - INFO - using custom operator only
2023-05-09 16:49:39,821 -     INFO - iou3d_nms_cuda builded success!
2023-05-09 16:49:40,324 -  WARNING - No custom op voxelize found, try JIT build
Compiling user custom op, it will cost a few seconds.....
W0509 16:49:42.137914 11248 custom_operator.cc:1210] Operator (nms_normal_gpu) has been registered.
W0509 16:49:42.137989 11248 custom_operator.cc:1210] Operator (nms_gpu) has been registered.
W0509 16:49:42.138005 11248 custom_operator.cc:1210] Operator (boxes_overlap_bev_gpu) has been registered.
W0509 16:49:42.138546 11248 custom_operator.cc:1210] Operator (boxes_iou_bev_gpu) has been registered.
W0509 16:49:42.138568 11248 custom_operator.cc:1210] Operator (boxes_iou_bev_cpu) has been registered.
2023-05-09 16:49:42,162 - INFO - using custom operator only
2023-05-09 16:49:42,165 -     INFO - voxelize builded success!
2023-05-09 16:49:42,166 -  WARNING - No custom op pointnet2_ops found, try JIT build
Compiling user custom op, it will cost a few seconds.....
W0509 16:49:44.024470 11248 custom_operator.cc:1210] Operator (nms_normal_gpu) has been registered.
W0509 16:49:44.024618 11248 custom_operator.cc:1210] Operator (nms_gpu) has been registered.
W0509 16:49:44.024639 11248 custom_operator.cc:1210] Operator (boxes_overlap_bev_gpu) has been registered.
W0509 16:49:44.024652 11248 custom_operator.cc:1210] Operator (hard_voxelize) has been registered.
W0509 16:49:44.024664 11248 custom_operator.cc:1210] Operator (boxes_iou_bev_gpu) has been registered.
W0509 16:49:44.024675 11248 custom_operator.cc:1210] Operator (boxes_iou_bev_cpu) has been registered.
2023-05-09 16:49:44,047 - INFO - using custom operator only
2023-05-09 16:49:44,052 -     INFO - pointnet2_ops builded success!
I0509 16:49:44.054957 11248 helper.cc:56] The operator `farthest_point_sample` has been registered. Therefore, we will not repeat the registration here.
I0509 16:49:44.054980 11248 helper.cc:56] The operator `grouping_operation_stack` has been registered. Therefore, we will not repeat the registration here.
I0509 16:49:44.054986 11248 helper.cc:56] The operator `voxel_query_wrapper` has been registered. Therefore, we will not repeat the registration here.
I0509 16:49:44.054992 11248 helper.cc:56] The operator `grouping_operation_batch` has been registered. Therefore, we will not repeat the registration here.
I0509 16:49:44.054998 11248 helper.cc:56] The operator `ball_query_batch` has been registered. Therefore, we will not repeat the registration here.
I0509 16:49:44.055004 11248 helper.cc:56] The operator `gather_operation` has been registered. Therefore, we will not repeat the registration here.
I0509 16:49:44.055009 11248 helper.cc:56] The operator `nms_normal_gpu` has been registered. Therefore, we will not repeat the registration here.
I0509 16:49:44.055016 11248 helper.cc:56] The operator `ball_query_stack` has been registered. Therefore, we will not repeat the registration here.
I0509 16:49:44.055022 11248 helper.cc:56] The operator `nms_gpu` has been registered. Therefore, we will not repeat the registration here.
I0509 16:49:44.055027 11248 helper.cc:56] The operator `boxes_overlap_bev_gpu` has been registered. Therefore, we will not repeat the registration here.
I0509 16:49:44.055033 11248 helper.cc:56] The operator `hard_voxelize` has been registered. Therefore, we will not repeat the registration here.
I0509 16:49:44.055039 11248 helper.cc:56] The operator `boxes_iou_bev_gpu` has been registered. Therefore, we will not repeat the registration here.
I0509 16:49:44.055045 11248 helper.cc:56] The operator `boxes_iou_bev_cpu` has been registered. Therefore, we will not repeat the registration here.
--- Running analysis [ir_graph_build_pass]
I0509 16:49:48.664840 11248 executor.cc:186] Old Executor is Running.
--- Running analysis [ir_analysis_pass]
--- Running IR pass [map_op_to_another_pass]
--- Running IR pass [identity_scale_op_clean_pass]
I0509 16:49:48.869513 11248 fuse_pass_base.cc:59] ---  detected 11 subgraphs
--- Running IR pass [is_test_pass]
--- Running IR pass [simplify_with_basic_ops_pass]
--- Running IR pass [delete_quant_dequant_linear_op_pass]
--- Running IR pass [delete_weight_dequant_linear_op_pass]
--- Running IR pass [constant_folding_pass]
--- Running IR pass [silu_fuse_pass]
--- Running IR pass [conv_bn_fuse_pass]
I0509 16:49:50.090797 11248 fuse_pass_base.cc:59] ---  detected 14 subgraphs
--- Running IR pass [conv_eltwiseadd_bn_fuse_pass]
--- Running IR pass [embedding_eltwise_layernorm_fuse_pass]
--- Running IR pass [multihead_matmul_fuse_pass_v2]
--- Running IR pass [vit_attention_fuse_pass]
--- Running IR pass [fused_multi_transformer_encoder_pass]
--- Running IR pass [fused_multi_transformer_decoder_pass]
--- Running IR pass [fused_multi_transformer_encoder_fuse_qkv_pass]
--- Running IR pass [fused_multi_transformer_decoder_fuse_qkv_pass]
--- Running IR pass [multi_devices_fused_multi_transformer_encoder_pass]
--- Running IR pass [multi_devices_fused_multi_transformer_encoder_fuse_qkv_pass]
--- Running IR pass [multi_devices_fused_multi_transformer_decoder_fuse_qkv_pass]
--- Running IR pass [fuse_multi_transformer_layer_pass]
--- Running IR pass [gpu_cpu_squeeze2_matmul_fuse_pass]
--- Running IR pass [gpu_cpu_reshape2_matmul_fuse_pass]
--- Running IR pass [gpu_cpu_flatten2_matmul_fuse_pass]
--- Running IR pass [gpu_cpu_map_matmul_v2_to_mul_pass]
I0509 16:49:53.852699 11248 fuse_pass_base.cc:59] ---  detected 10 subgraphs
--- Running IR pass [gpu_cpu_map_matmul_v2_to_matmul_pass]
I0509 16:49:53.861145 11248 fuse_pass_base.cc:59] ---  detected 5 subgraphs
--- Running IR pass [matmul_scale_fuse_pass]
--- Running IR pass [multihead_matmul_fuse_pass_v3]
--- Running IR pass [gpu_cpu_map_matmul_to_mul_pass]
I0509 16:49:53.998447 11248 fuse_pass_base.cc:59] ---  detected 1 subgraphs
--- Running IR pass [fc_fuse_pass]
I0509 16:49:54.034281 11248 fuse_pass_base.cc:59] ---  detected 2 subgraphs
--- Running IR pass [fc_elementwise_layernorm_fuse_pass]
--- Running IR pass [conv_elementwise_add_act_fuse_pass]
--- Running IR pass [conv_elementwise_add2_act_fuse_pass]
--- Running IR pass [conv_elementwise_add_fuse_pass]
W0509 16:49:54.193147 11248 op_compat_sensible_pass.cc:232]  Check the Attr(axis) of Op(elementwise_add) in pass(conv_elementwise_add_fuse_pass) failed!
W0509 16:49:54.193181 11248 conv_elementwise_add_fuse_pass.cc:94] Pass in op compat failed.
W0509 16:49:54.193186 11248 op_compat_sensible_pass.cc:232]  Check the Attr(axis) of Op(elementwise_add) in pass(conv_elementwise_add_fuse_pass) failed!
W0509 16:49:54.193192 11248 conv_elementwise_add_fuse_pass.cc:94] Pass in op compat failed.
W0509 16:49:54.193197 11248 op_compat_sensible_pass.cc:232]  Check the Attr(axis) of Op(elementwise_add) in pass(conv_elementwise_add_fuse_pass) failed!
W0509 16:49:54.193202 11248 conv_elementwise_add_fuse_pass.cc:94] Pass in op compat failed.
--- Running IR pass [transpose_flatten_concat_fuse_pass]
--- Running IR pass [conv2d_fusion_layout_transfer_pass]
--- Running IR pass [auto_mixed_precision_pass]
--- Running IR pass [delete_cast_op_pass]
free(): double free detected in tcache 2

--------------------------------------
C++ Traceback (most recent call last):
--------------------------------------
0   paddle_infer::Predictor::Predictor(paddle::AnalysisConfig const&)
1   std::unique_ptr<paddle::PaddlePredictor, std::default_delete<paddle::PaddlePredictor> > paddle::CreatePaddlePredictor<paddle::AnalysisConfig, (paddle::PaddleEngineKind)2>(paddle::AnalysisConfig const&)
2   paddle::AnalysisPredictor::Init(std::shared_ptr<paddle::framework::Scope> const&, std::shared_ptr<paddle::framework::ProgramDesc> const&)
3   paddle::AnalysisPredictor::PrepareProgram(std::shared_ptr<paddle::framework::ProgramDesc> const&)
4   paddle::AnalysisPredictor::OptimizeInferenceProgram()
5   paddle::inference::analysis::Analyzer::RunAnalysis(paddle::inference::analysis::Argument*)
6   paddle::inference::analysis::IrAnalysisPass::RunImpl(paddle::inference::analysis::Argument*)
7   paddle::inference::analysis::IRPassManager::Apply(std::unique_ptr<paddle::framework::ir::Graph, std::default_delete<paddle::framework::ir::Graph> >)
8   paddle::framework::ir::Pass::Apply(paddle::framework::ir::Graph*) const
9   paddle::framework::ir::DeleteCastOpPass::ApplyImpl(paddle::framework::ir::Graph*) const
10  paddle::framework::ir::DeleteCastOpPass::ApplyCastPass(paddle::framework::ir::Graph*) const
11  paddle::framework::ir::GraphPatternDetector::operator()(paddle::framework::ir::Graph*, std::function<void (std::map<paddle::framework::ir::PDNode*, paddle::framework::ir::Node*, paddle::framework::ir::GraphPatternDetector::PDNodeCompare, std::allocator<std::pair<paddle::framework::ir::PDNode* const, paddle::framework::ir::Node*> > > const&, paddle::framework::ir::Graph*)>)
12  void std::vector<paddle::framework::ir::Node*, std::allocator<paddle::framework::ir::Node*> >::_M_realloc_insert<paddle::framework::ir::Node* const&>(__gnu_cxx::__normal_iterator<paddle::framework::ir::Node**, std::vector<paddle::framework::ir::Node*, std::allocator<paddle::framework::ir::Node*> > >, paddle::framework::ir::Node* const&)

----------------------
Error Message Summary:
----------------------
FatalError: `Process abort signal` is detected by the operating system.
  [TimeInfo: *** Aborted at 1683622194 (unix time) try "date -d @1683622194" if you are using GNU date ***]
  [SignalInfo: *** SIGABRT (@0x3e900002bf0) received by PID 11248 (TID 0x7f16fa20b080) from PID 11248 ***]

Aborted (core dumped)

wayyeah

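@wayyeah Hello, could you check the following:

Whether CPU inference shows the same problem, to rule out an issue in the CUDA kernel of the op involved.
If CPU inference also fails, try loading the model and running prediction with the paddle.static.load_inference_model API, to rule out an issue in the inference engine Predictor.

I checked the source code that the error stack points to, and index == 0 should be valid as well, so there is some strange behavior here. Please confirm the results of the two experiments above first.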
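
The two checks above could be sketched roughly as follows. This is a minimal sketch, not the script used in this thread: `build_feed` and `run_exported_model` are hypothetical helper names, and `path_prefix` stands for wherever the exported model files were saved. Passing `use_gpu=False` runs the same program on CPU, which separates a CUDA kernel problem from a Predictor problem.

```python
def build_feed(feed_names, arrays):
    """Pair the exported model's input names with input arrays, in order."""
    if len(feed_names) != len(arrays):
        raise ValueError(
            f"model expects {len(feed_names)} inputs, got {len(arrays)}"
        )
    return dict(zip(feed_names, arrays))


def run_exported_model(path_prefix, arrays, use_gpu=False):
    """Load an exported inference model and run it once on CPU or GPU."""
    # Imported here so build_feed stays usable without Paddle installed.
    import paddle

    paddle.enable_static()
    place = paddle.CUDAPlace(0) if use_gpu else paddle.CPUPlace()
    exe = paddle.static.Executor(place)
    program, feed_names, fetch_targets = paddle.static.load_inference_model(
        path_prefix, exe
    )
    return exe.run(
        program,
        feed=build_feed(feed_names, arrays),
        fetch_list=fetch_targets,
    )
```

Because this path goes through the executor rather than the Predictor, no inference IR passes are applied, so a failure here points at the exported program or an op kernel rather than at graph optimization.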

Aurelius84

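@Aurelius84 Hello, and thanks for your reply. The model is modified from the voxelRCNN model in Paddle3D and contains many GPU-only operations, so CPU inference is not possible. Loading it with the paddle.static.load_inference_model API and running prediction gives the following error: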

Traceback (most recent call last):
  File "/home/aistudio/work/Paddle3D/deploy/ted/python/load.py", line 139, in <module>
    main(args)
  File "/home/aistudio/work/Paddle3D/deploy/ted/python/load.py", line 132, in main
    results = exe.run(inference_program,
  File "/opt/conda/envs/python35-paddle120-env/lib/python3.9/site-packages/paddle/fluid/executor.py", line 1463, in run
    six.reraise(*sys.exc_info())
  File "/opt/conda/envs/python35-paddle120-env/lib/python3.9/site-packages/six.py", line 719, in reraise
    raise value
  File "/opt/conda/envs/python35-paddle120-env/lib/python3.9/site-packages/paddle/fluid/executor.py", line 1450, in run
    res = self._run_impl(program=program,
  File "/opt/conda/envs/python35-paddle120-env/lib/python3.9/site-packages/paddle/fluid/executor.py", line 1661, in _run_impl
    return new_exe.run(scope, list(feed.keys()), fetch_list,
  File "/opt/conda/envs/python35-paddle120-env/lib/python3.9/site-packages/paddle/fluid/executor.py", line 631, in run
    tensors = self._new_exe.run(scope, feed_names,
ValueError: (InvalidArgument) Axis should be less than 1, but received axis is 1.
  [Hint: Expected axis < max_dim, but received axis:1 >= max_dim:1.] (at /paddle/paddle/phi/kernels/funcs/common_shape.h:53)
  [operator < elementwise_add > error]

wayyeah

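@wayyeah Could you share your exported model files together with the Python script that loads them with paddle.static.load_inference_model and runs prediction, packaged as a single tar archive?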

Aurelius84

@Aurelius84 Link: https://pan.baidu.com/s/1sBMLWYqBlnC8oOE5myjzQw?pwd=1234 Extraction code: 1234. Run load.sh from the archive.

wayyeah

@wayyeah I can run your load.sh script without errors on my side, but the output is as follows. Is this what you expect?

Aurelius84

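@Aurelius84 Yes, that is what I expect. What could be causing the error on my side, then? I have tried paddlepaddle 2.4.1 and 2.4.2, both in the AI Studio online environment and locally, and every combination reports the error.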

wayyeah

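I loaded your model with a whl package built from the develop branch, and I am not sure whether this was a known issue that has since been fixed. Could you try the pre-release version with pip install paddlepaddle-gpu==2.5.0rc? Alternatively, install a nightly build whl by following the official documentation.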

Aurelius84

Thanks for your help. I will try it later.

wayyeah

@Aurelius84, with paddlepaddle-gpu==2.5.0rc, loading the model with the paddle.static.load_inference_model API and running prediction no longer reports an error. However, I tried several different inputs and the output is empty every time, which looks wrong. The inference engine Predictor also still fails with: Error: ../paddle/phi/kernels/funcs/scatter.cu.h:107 Assertion index_value >= 0 && index_value < output_dims[j] failed. The index is out of bounds, please check whether the dimensions of index and input meet the requirements. It should be less than [21] and greater or equal to 0, but received [0]. Enabling TensorRT acceleration produces the same error.

wayyeah

@wayyeah Here are some ideas for troubleshooting:

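After fixing randomness with paddle.seed(2023), prepare a fixed input and make sure the dynamic-graph model produces output in eval mode. Then run the same input through the paddle.static.load_inference_model API and check whether the output is identical.
If the outputs differ, the model export is probably at fault. You can then bisect in dynamic-graph mode to find which part of the code breaks during dynamic-to-static export, like this: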

class Net(Layer):
    def __init__(self, xxx):
        # .... (omitted)

    @to_static  # comment this out to get the dynamic-graph result; keep it to get the dynamic-to-static result
    def forward(self, ....):
        out1 = self.func1(xxx)
        # return out1  # 2nd bisection step: return early
        out2 = self.func2(xxx)
        # return out2  # 1st bisection step: return early
        out3 = self.func3(xxx)
        # return out3  # 3rd bisection step: return early
        out4 = self.func4(xxx)

        return out4

net = Net(xxx)
x = prepare_data()

out = net(x)
print(out)

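Inserting return xxx makes the dynamic-to-static version return early, so you can bisect down to the code segment where the results first differ. Analyzing the problem from there is much easier.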
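
The bisection described above can be sketched as a small search helper. This is an illustrative sketch under stated assumptions: `first_divergent_stage` and `outputs_match` are hypothetical names, `outputs_match(k)` stands for "run the model with an early return after stage k in both modes and compare", and the search assumes that once the results diverge, every later stage diverges too. The recorded outputs below are made-up data.

```python
def first_divergent_stage(num_stages, outputs_match):
    """Return the first stage index whose dynamic and static outputs differ,
    or None if every stage matches."""
    if num_stages <= 0:
        raise ValueError("num_stages must be positive")
    if outputs_match(num_stages - 1):
        return None
    lo, hi = 0, num_stages - 1  # invariant: stage hi is known to differ
    while lo < hi:
        mid = (lo + hi) // 2
        if outputs_match(mid):
            lo = mid + 1  # first difference is after mid
        else:
            hi = mid  # first difference is at mid or earlier
    return lo


# Made-up early-return outputs for four stages (func1 .. func4).
dynamic_outs = [[1.0, 2.0], [3.0], [4.0], [5.0]]
static_outs = [[1.0, 2.0], [3.0], [0.0], [0.0]]


def outputs_match(k, atol=1e-5):
    a, b = dynamic_outs[k], static_outs[k]
    return len(a) == len(b) and all(abs(x - y) <= atol for x, y in zip(a, b))


print(first_divergent_stage(4, outputs_match))  # 2
```

With four stages this needs at most three comparisons, and the returned index names the stage whose code should be inspected first.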

Aurelius84

Thanks for your help. I will investigate further on my own.

wayyeah

@Aurelius84 The problem is solved. Some NumPy operations in the code were causing static-graph inference to fail.
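
The thread does not show which NumPy calls were involved. As a general illustration of why host-side code can misbehave after export, here is a toy tracer in plain Python. It is not Paddle's implementation: `Traced`, `export`, and both `forward_*` functions are hypothetical, and plain `len` stands in for a NumPy call. A value computed outside the graph runs once at trace time and is frozen into the exported program, while the same value computed with graph ops is recomputed for each input.

```python
class Traced:
    """Toy stand-in for a static-graph variable: holds a recipe, not data."""

    def __init__(self, recipe):
        self.recipe = recipe  # callable: runtime input -> value

    def count(self):
        # A "graph op": the length is recomputed from the runtime input.
        return Traced(lambda x: len(self.recipe(x)))

    def scale(self, factor):
        if isinstance(factor, Traced):
            # The factor stays dynamic.
            return Traced(lambda x: [v * factor.recipe(x) for v in self.recipe(x)])
        # A plain Python number is frozen into the program as a constant.
        return Traced(lambda x: [v * factor for v in self.recipe(x)])


def export(forward, example):
    """Trace `forward` once on `example` and return the resulting program."""
    return forward(Traced(lambda x: x), example).recipe


def forward_host(sym, example):
    n = len(example)  # host-side code: runs once, at trace time
    return sym.scale(n)


def forward_graph(sym, example):
    return sym.scale(sym.count())  # stays inside the "graph"


program_host = export(forward_host, [1, 1])
program_graph = export(forward_graph, [1, 1])
print(program_host([1, 1, 1]))   # [2, 2, 2]: uses the stale trace-time length
print(program_graph([1, 1, 1]))  # [3, 3, 3]: length recomputed at run time
```

In a real model the usual remedy is to replace such host-side steps with the framework's own tensor operations, so that they become part of the exported program.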

wayyeah
