[PaddlePaddle/PaddleOCR] Dataset label issue when training the SLANet table recognition model

2024-05-15 951 views
2
  • System Environment: Ubuntu 20.04
  • Version: Paddle: PaddleOCR: Related components: ppstructure
  • Command Code: python3 tools/train.py -c configs/table/SLANet.yml (as given in the readme tutorial)
  • Complete Error Message:
    [2022/12/05 17:32:53] ppocr ERROR: When parsing line {"imgid": 975, "html": {"cells": [{"tokens": ["<b>", "m", "i", "R", "N", "A", "</b>"], "bbox": [1, 4, 26, 13]}, {"tokens": ["<b>", "H", "C", "*", "</b>"], "bbox": [57, 4, 72, 13]}, {"tokens": ["<b>", "R", "C", "C", "*", "</b>"], "bbox": [122, 4, 142, 13]}, {"tokens": ["<b>", "p", "-", "v", "a", "l", "u", "e", "</b>"], "bbox": [188, 4, 216, 13]}, {"tokens": ["<b>", "C", "a", "n", "c", "e", "r", " ", "a", "s", "s", "o", "c", "i", "a", "t", "i", "o", "n", "</b>"], "bbox": [244, 4, 311, 13]}, {"tokens": ["<b>", "E", "x", "p", "e", "r", "i", "m", "e", "n", "t", "a", "l", "y", " ", "v", "a", "l", "i", "d", "a", "t", "e", "d", " ", "t", "a", "r", "g", "e", "t", "</b>"], "bbox": [351, 4, 460, 13]}, {"tokens": ["m", "i", "R", "-", "3", "7", "8"], "bbox": [1, 17, 28, 27]}, {"tokens": ["0", ".", "0", "0", "4", "0", ".", "0", "0", "2", "-", "0", ".", "0", "0", "6"], "bbox": [57, 17, 94, 36]}, {"tokens": ["0", ".", "0", "0", "8", "0", ".", "0", "0", "4", "-", "0", ".", "0", "3", "7"], "bbox": [122, 17, 160, 36]}, {"tokens": ["0", ".", "0", "0", "0", "3"], "bbox": [188, 17, 211, 27]}, {"tokens": ["c", "o", "l", "o", "r", "e", "c", "t", "a", "l", "c", "a", "r", "c", "i", "n", "o", "m", "a", " ", "[", "1", "6", ",", "1", "7", "]", ",", " ", "o", "r", "a", "l", "s", "q", "u", "a", "m", "o", "u", "s", " ", "c", "e", "l", "l", "c", "a", "r", "c", "i", "n", "o", "m", "a", " ", "[", "1", "8", "]", ",", "l", "a", "r", "y", "n", "g", "e", "a", "l", "c", "a", "r", "c", "i", "n", "o", "m", "a", " ", "[", "1", "9", "]"], "bbox": [244, 17, 318, 71]}, {"tokens": ["S", "U", "F", "U", ",", " ", "T", "U", "S", "C", "2", ",", "T", "O", "B", "2", ",", " ", "C", "Y", "P", "2", "E", "1"], "bbox": [351, 17, 397, 36]}, {"tokens": ["m", "i", "R", "-", "4", "5", "1"], "bbox": [1, 74, 28, 83]}, {"tokens": ["2", ".", "0", "6", "7", "1", ".", "2", "5", "0", "-", "3", ".", "4", "8", "0"], "bbox": [57, 74, 94, 92]}, {"tokens": ["0", ".", "8", "0", "2", 
"0", ".", "0", "5", "5", "-", "1", ".", "0", "9", "1"], "bbox": [122, 74, 160, 92]}, {"tokens": ["0", ".", "0", "0", "0", "1"], "bbox": [188, 74, 211, 83]}, {"tokens": ["r", "e", "n", "a", "l", " ", "c", "e", "l", "l", " ", "c", "a", "r", "c", "i", "n", "o", "m", "a", " ", "[", "1", "]", ",", "c", "o", "l", "o", "r", "e", "c", "t", "a", "l", "c", "a", "r", "c", "i", "n", "o", "m", "a", " ", "[", "2", "0", "]", ",", "g", "a", "s", "t", "r", "i", "c", " ", "c", "a", "n", "c", "e", "r", " ", "[", "2", "0", "]"], "bbox": [244, 74, 322, 110]}, {"tokens": ["M", "M", "P", "2", ",", " ", "M", "M", "P", "9", ",", "B", "C", "L", "2"], "bbox": [351, 74, 398, 92]}, {"tokens": ["m", "i", "R", "-", "1", "5", "0"], "bbox": [1, 113, 28, 122]}, {"tokens": ["0", ".", "0", "1", "1", "0", ".", "0", "0", "9", "-", "0", ".", "0", "1", "6"], "bbox": [57, 113, 94, 131]}, {"tokens": ["0", ".", "0", "0", "8", "0", ".", "0", "0", "5", "-", "0", ".", "0", "2", "0"], "bbox": [122, 113, 160, 131]}, {"tokens": ["0", ".", "2", "2", "2", "2"], "bbox": [188, 113, 211, 122]}, {"tokens": ["g", "a", "s", "t", "r", "i", "c", " ", "c", "a", "n", "c", "e", "r", " ", "[", "2", "1", "]", ",", "c", "h", "r", "o", "n", "i", "c", " ", "m", "y", "e", "l", "o", "i", "d", "l", "e", "u", "k", "e", "m", "i", "a", " ", "[", "2", "2", "]", ",", "c", "o", "l", "o", "r", "e", "c", "t", "a", "l", "c", "a", "r", "c", "i", "n", "o", "m", "a", " ", "[", "2", "3", "]"], "bbox": [244, 113, 306, 158]}, {"tokens": ["H", "T", "T", ",", " ", "M", "Y", "B", ",", "E", "G", "F", "R", "2"], "bbox": [351, 113, 384, 131]}], "structure": {"tokens": ["<thead>", "<tr>", "<td>", "</td>", "<td>", "</td>", "<td>", "</td>", "<td>", "</td>", "<td>", "</td>", "<td>", "</td>", "</tr>", "</thead>", "<tbody>", "<tr>", "<td>", "</td>", "<td>", "</td>", "<td>", "</td>", "<td>", "</td>", "<td>", "</td>", "<td>", "</td>", "</tr>", "<tr>", "<td>", "</td>", "<td>", "</td>", "<td>", "</td>", "<td>", "</td>", "<td>", "</td>", "<td>", "</td>", "</tr>", 
"<tr>", "<td>", "</td>", "<td>", "</td>", "<td>", "</td>", "<td>", "</td>", "<td>", "</td>", "<td>", "</td>", "</tr>", "</tbody>"]}}, "split": "train", "filename": "PMC3340316_005_00.png"}, error happened with msg: Traceback (most recent call last):
    File "/root/autodl-tmp/lxc/projects/PaddleOCR/ppocr/data/pubtab_dataset.py", line 107, in __getitem__
    raise Exception("{} does not exist!".format(img_path))
    Exception: /home/XXX/dataset/pubtabnet/val/PMC3340316_005_00.png does not exist!

    A partial example is given above. At present, the dataset archive downloadable from the official site contains only a single label file (see attached screenshot).

This single label file contains the train, val, and test labels, so in configs/table/SLANet.yml the only label_file_list I can give for both train and val is

    label_file_list: [/home/XXX/dataset/pubtabnet/PubTabNet_2.0.0.jsonl]

This then produces a large number of errors about labels whose images cannot be found: a record is marked "split": "train" in the label file, yet during validation the loader still looks for it under val/.
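One quick way to confirm what the single label file actually contains is to count the records per split. A minimal sketch, assuming a PubTabNet-style jsonl at the path used above; the function name is my own:

```python
import json
from collections import Counter

def count_splits(jsonl_path):
    """Count how many label records belong to each split
    ('train' / 'val' / 'test') in a PubTabNet-style jsonl file."""
    counts = Counter()
    with open(jsonl_path, "r", encoding="utf-8") as f:
        for line in f:
            counts[json.loads(line)["split"]] += 1
    return counts

if __name__ == "__main__":
    # Hypothetical path; adjust to your local dataset layout.
    print(count_splits("/home/XXX/dataset/pubtabnet/PubTabNet_2.0.0.jsonl"))
```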

Answers

9

You need to prepare both the image data and the label files before training.

1

> You need to prepare both the image data and the label files before training.

I downloaded the entire PubTabNet dataset; it contains three folders and one label file, as shown in the official screenshot above.

2

> /home/XXX/dataset/pubtabnet/val/PMC3340316_005_00.png does not exist!

The error says the image does not exist; check whether the data path in your config is wrong.

1

> /home/XXX/dataset/pubtabnet/val/PMC3340316_005_00.png does not exist!
>
> The error says the image does not exist; check whether the data path in your config is wrong.

In fact, though, this image is in the train folder, and its 'split' in the label jsonl is also train; yet the loader looks for it under val/, and I don't understand why either.

7

Please post the data section of your config file.

0

> Please post the data section of your config file.

Train:
  dataset:
    name: PubTabDataSet
    data_dir: /home/xxx/dataset/pubtabnet/train/
    label_file_list: [/home/xxx/dataset/pubtabnet/PubTabNet_2.0.0.jsonl]
    transforms:
      - DecodeImage: # load image
          img_mode: BGR
          channel_first: False
      - TableLabelEncode:
          learn_empty_box: False
          merge_no_span_structure: *merge_no_span_structure
          replace_empty_cell_token: False
          loc_reg_num: *loc_reg_num
          max_text_length: *max_text_length
      - TableBoxEncode:
          in_box_format: *box_format
          out_box_format: *box_format
      - ResizeTableImage:
          max_len: 488
      - NormalizeImage:
          scale: 1./255.
          mean: [0.485, 0.456, 0.406]
          std: [0.229, 0.224, 0.225]
          order: 'hwc'
      - PaddingTableImage:
          size: [488, 488]
      - ToCHWImage:
      - KeepKeys:
          keep_keys: [ 'image', 'structure', 'bboxes', 'bbox_masks', 'shape' ]
  loader:
    shuffle: True
    batch_size_per_card: 48
    drop_last: True
    num_workers: 12

Eval:
  dataset:
    name: PubTabDataSet
    data_dir: /home/xxx/dataset/pubtabnet/val/
    label_file_list: [/home/xxx/dataset/pubtabnet/PubTabNet_2.0.0.jsonl]
    transforms:
      - DecodeImage: # load image
          img_mode: BGR
          channel_first: False
      - TableLabelEncode:
          learn_empty_box: False
          merge_no_span_structure: *merge_no_span_structure
          replace_empty_cell_token: False
          loc_reg_num: *loc_reg_num
          max_text_length: *max_text_length
      - TableBoxEncode:
          in_box_format: *box_format
          out_box_format: *box_format
      - ResizeTableImage:
          max_len: 488
      - NormalizeImage:
          scale: 1./255.
          mean: [0.485, 0.456, 0.406]
          std: [0.229, 0.224, 0.225]
          order: 'hwc'
      - PaddingTableImage:
          size: [488, 488]
      - ToCHWImage:
      - KeepKeys:
          keep_keys: [ 'image', 'structure', 'bboxes', 'bbox_masks', 'shape' ]
  loader:
    shuffle: False
    drop_last: False
    batch_size_per_card: 48
    num_workers: 12

In the code template I originally cloned, the two label files were separate, with suffixes _train and _val, but the official release seems to have merged the dataset labels into a single jsonl file.

5

You need to split the labels into separate train and val files.


7

Since this single jsonl label file is quite large (2-3 GB), it's hard to open in an ordinary editor, and with my limited skills I wasn't sure how to process it. Thanks anyway for your answer.

0

You can write a Python script to process it; an editor won't manage a file that size.

3

Hmm, I tried it in the end and it turned out to be easier than expected:

import jsonlines
import os

if __name__ == "__main__":
    # Folder that contains the combined label file
    label_root_path = '/home/XXX/dataset/pubtabnet'
    label_path = os.path.join(label_root_path, 'PubTabNet_2.0.0.jsonl')

    # Split the single jsonl into separate train / val label files,
    # routing each record by its 'split' field.
    with jsonlines.open(label_path, "r") as f, \
         jsonlines.open(os.path.join(label_root_path, "PubTabNet_2.0.0_train.jsonl"), "w") as train_f, \
         jsonlines.open(os.path.join(label_root_path, "PubTabNet_2.0.0_val.jsonl"), "w") as val_f:
        for data in f:
            if data['split'] == 'train':
                train_f.write(data)
            else:
                val_f.write(data)
4

cd C:\F\PaddleOCR-release-2.6
py -3 tools/train.py -c C:/F/SLANet_ch_border.yml -o Global.epoch_num=1 Global.pretrained_model="C:/Users/Administrator/Desktop/tableBorder/best_accuracy" Train.dataset.name='PubTabDataSet' Eval.dataset.name='PubTabDataSet' Train.dataset.data_dir='C:/F/wtw/pubtabnet/val/' Train.dataset.label_file_list=[C:/F/WTW/PubTabNet_2.0.0_val.jsonl] Eval.dataset.data_dir='C:/F/wtw/pubtabnet/val/' Eval.dataset.label_file_list=[C:/F/WTW/PubTabNet_2.0.0_val.jsonl] Train.loader.num_workers=0 Global.use_gpu=True Global.save_epoch_step=2000 Global.character_dict_path='C:/Users/Administrator/Desktop/tableBorder/table_structure_dict_ch_99span.txt' Global.eval_batch_step=[0,2000] Global.print_batch_step=100 Global.save_model_dir="C:/Users/Administrator/Desktop/tableBorder" Train.loader.batch_size_per_card=8 Train.loader.num_workers=0 Eval.loader.batch_size_per_card=8 Eval.loader.num_workers=0 Optimizer.lr.name=Const Optimizer.lr.learning_rate=0.0005

The cell coordinates in the jsonl don't match Paddle's, which causes the error above. How can this be resolved?

The jsonl bbox has 4 numbers, while the Paddle label format uses four points, i.e. 8 numbers.

{"imgid": 548625, "html": {"cells": [{"tokens": []}, {"tokens": ["", "W", "e", "a", "n", "i", "n", "g", ""], "bbox": [66, 4, 96, 13]}, {"tokens": ["", "W", "e", "e", "k", " ", "1", "5", ""], "bbox": [131, 4, 160, 13]}, {"tokens": ["", "O", "f", "f", "-", "t", "e", "s", "t", ""], "bbox": [201, 4, 226, 13]}, {"tokens": ["W", "e", "a", "n", "i", "n", "g"], "bbox": [1, 17, 31, 26]}, {"tokens": ["–"], "bbox": [66, 21, 72, 25]}, {"tokens": ["–"], "bbox": [131, 21, 137, 25]}, {"tokens": ["–"], "bbox": [201, 21, 207, 25]}, {"tokens": ["W", "e", "e", "k", " ", "1", "5"], "bbox": [1, 31, 30, 40]}, {"tokens": ["–"], "bbox": [66, 35, 72, 39]}, {"tokens": ["0", ".", "1", "7", " ", "±", " ", "0", ".", "0", "8"], "bbox": [131, 31, 166, 40]}, {"tokens": ["0", ".", "1", "6", " ", "±", " ", "0", ".", "0", "3"], "bbox": [201, 31, 236, 40]}, {"tokens": ["O", "f", "f", "-", "t", "e", "s", "t"], "bbox": [1, 45, 26, 54]}, {"tokens": ["–"], "bbox": [66, 49, 72, 53]}, {"tokens": ["0", ".", "8", "0", " ", "±", " ", "0", ".", "2", "4"], "bbox": [131, 45, 166, 54]}, {"tokens": ["0", ".", "1", "9", " ", "±", " ", "0", ".", "0", "9"], "bbox": [201, 45, 236, 54]}], "structure": {"tokens": ["", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", ""]}}, "split": "val", "filename": "PMC5755158_010_01.png"}
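The record above shows the mismatch: PubTabNet stores each cell bbox as two corners, [x1, y1, x2, y2], while a config that regresses four corner points per box expects 8 numbers. A minimal conversion sketch; the function name is my own, and whether you need this depends on the box format your config actually uses:

```python
def xyxy_to_four_points(bbox):
    """Expand a two-corner box [x1, y1, x2, y2] into four clockwise
    corner points [x1, y1, x2, y1, x2, y2, x1, y2] (8 numbers)."""
    x1, y1, x2, y2 = bbox
    return [x1, y1, x2, y1, x2, y2, x1, y2]

# A bbox taken from the record above:
print(xyxy_to_four_points([66, 4, 96, 13]))
# -> [66, 4, 96, 4, 96, 13, 66, 13]
```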