[PaddlePaddle/PaddleOCR] [OCR Issue No.9] Remove dependencies that clearly do not belong in ppocr's requirements

2024-05-13

https://github.com/PaddlePaddle/PaddleOCR/issues/11906 task9 https://github.com/PaddlePaddle/PaddleOCR/issues/11924

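To reduce paddleocr's dependencies, some packages are removed from requirements.txt and imported through paddle.utils.try_import instead. When a user reaches code that needs one of them, they are prompted to install it. The removed dependencies are: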

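  • pdf2docx: converts PDF files to Word documents
  • lxml: used in the TEDS::evaluate function in table_metric.py, which should only be needed during validation and evaluation
  • premailer: used when converting HTML to Excel
  • openpyxl: helps write the Excel output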
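
The lazy-import pattern described above can be sketched as follows. This is a minimal illustration using only the standard library: lazy_import is a hypothetical stand-in written for this example, not the implementation of paddle.utils.try_import, and write_rows_to_excel is a made-up caller.

```python
import importlib


def lazy_import(module_name, err_msg=None):
    """Hypothetical helper: import a module on first use, with an install hint."""
    try:
        return importlib.import_module(module_name)
    except ImportError:
        if err_msg is None:
            # Assumes the pip package name equals the module name.
            err_msg = (
                f"Failed importing {module_name}. Install it manually, "
                f"usually with `pip install {module_name}`."
            )
        raise ImportError(err_msg)


def write_rows_to_excel(rows, excel_path):
    """Made-up caller: openpyxl is only required once this function runs."""
    openpyxl = lazy_import("openpyxl")
    workbook = openpyxl.Workbook()
    sheet = workbook.active
    for row in rows:
        sheet.append(row)
    workbook.save(excel_path)
```

With this shape, importing the module that defines write_rows_to_excel never fails because openpyxl is absent; only calling the function does.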

Answers

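Shouldn't there be some mechanism that installs the dependencies automatically when running paddleocr --image_dir=ppstructure/recovery/UnrealText.pdf --type=structure --recovery=true --use_pdf2docx_api=true ?

Installing at runtime would work, but isn't installing packages at runtime a bit undesirable? I would rather recommend declaring them as extras (optional dependency groups) that the user installs explicitly.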

Something like pip install "paddleocr[structure]"
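
One way to express such an extra in setup.py is the extras_require argument of setuptools. The sketch below is an assumption about how the grouping could look, not the project's actual configuration: the extra name structure comes from the suggestion above, and the package list from the dependencies removed in this change.

```python
# Hypothetical optional-dependency group for setup.py.
# Putting all four packages under one "structure" extra is an assumption;
# lxml (used for TEDS evaluation) could equally live in its own extra.
EXTRAS_REQUIRE = {
    "structure": ["pdf2docx", "lxml", "premailer", "openpyxl"],
}

# In setup.py the dict would be handed to setuptools, for example:
#     setup(name="paddleocr", extras_require=EXTRAS_REQUIRE)
```

Users who need the structure features would then opt in with pip install "paddleocr[structure]", while a plain pip install paddleocr stays lightweight.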

@jzhang533

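Testing confirms that this works as expected: after uninstalling a package, the import fails with a hint to install it.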

pip uninstall lxml
Traceback (most recent call last):
  File "/home/greatx/repos/PaddleOCR/venv/lib/python3.10/site-packages/paddle/utils/lazy_import.py", line 32, in try_import
    mod = importlib.import_module(module_name)
  File "/usr/lib/python3.10/importlib/__init__.py", line 126, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "<frozen importlib._bootstrap>", line 1050, in _gcd_import
  File "<frozen importlib._bootstrap>", line 1027, in _find_and_load
  File "<frozen importlib._bootstrap>", line 1006, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 688, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 883, in exec_module
  File "<frozen importlib._bootstrap>", line 241, in _call_with_frames_removed
  File "/home/greatx/repos/PaddleOCR/venv/lib/python3.10/site-packages/premailer/__init__.py", line 1, in <module>
    from .premailer import Premailer, transform  # noqa
  File "/home/greatx/repos/PaddleOCR/venv/lib/python3.10/site-packages/premailer/premailer.py", line 12, in <module>
    from lxml import etree
ModuleNotFoundError: No module named 'lxml'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/greatx/repos/PaddleOCR/tools/test.py", line 11, in <module>
    save_structure_res(result, save_folder,os.path.basename(img_path).split('.')[0])
  File "/home/greatx/repos/PaddleOCR/venv/lib/python3.10/site-packages/paddleocr/ppstructure/predict_system.py", line 279, in save_structure_res
    to_excel(region["res"]["html"], excel_path)
  File "/home/greatx/repos/PaddleOCR/venv/lib/python3.10/site-packages/paddleocr/ppstructure/table/predict_table.py", line 153, in to_excel
    tablepyxl.document_to_xl(html_table, excel_path)
  File "/home/greatx/repos/PaddleOCR/venv/lib/python3.10/site-packages/paddleocr/ppstructure/table/tablepyxl/tablepyxl.py", line 118, in document_to_xl
    wb = document_to_workbook(doc, base_url=base_url)
  File "/home/greatx/repos/PaddleOCR/venv/lib/python3.10/site-packages/paddleocr/ppstructure/table/tablepyxl/tablepyxl.py", line 93, in document_to_workbook
    try_import("premailer")
  File "/home/greatx/repos/PaddleOCR/venv/lib/python3.10/site-packages/paddle/utils/lazy_import.py", line 41, in try_import
    raise ImportError(err_msg)
ImportError: Failed importing premailer. This likely means that some paddle modules require additional dependencies that have to be manually installed (usually with `pip install premailer`).

pip uninstall premailer
Traceback (most recent call last):
  File "/home/greatx/repos/PaddleOCR/venv/lib/python3.10/site-packages/paddle/utils/lazy_import.py", line 32, in try_import
    mod = importlib.import_module(module_name)
  File "/usr/lib/python3.10/importlib/__init__.py", line 126, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "<frozen importlib._bootstrap>", line 1050, in _gcd_import
  File "<frozen importlib._bootstrap>", line 1027, in _find_and_load
  File "<frozen importlib._bootstrap>", line 1004, in _find_and_load_unlocked
ModuleNotFoundError: No module named 'premailer'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/greatx/repos/PaddleOCR/tools/test.py", line 11, in <module>
    save_structure_res(result, save_folder,os.path.basename(img_path).split('.')[0])
  File "/home/greatx/repos/PaddleOCR/venv/lib/python3.10/site-packages/paddleocr/ppstructure/predict_system.py", line 279, in save_structure_res
    to_excel(region["res"]["html"], excel_path)
  File "/home/greatx/repos/PaddleOCR/venv/lib/python3.10/site-packages/paddleocr/ppstructure/table/predict_table.py", line 153, in to_excel
    tablepyxl.document_to_xl(html_table, excel_path)
  File "/home/greatx/repos/PaddleOCR/venv/lib/python3.10/site-packages/paddleocr/ppstructure/table/tablepyxl/tablepyxl.py", line 118, in document_to_xl
    wb = document_to_workbook(doc, base_url=base_url)
  File "/home/greatx/repos/PaddleOCR/venv/lib/python3.10/site-packages/paddleocr/ppstructure/table/tablepyxl/tablepyxl.py", line 93, in document_to_workbook
    try_import("premailer")
  File "/home/greatx/repos/PaddleOCR/venv/lib/python3.10/site-packages/paddle/utils/lazy_import.py", line 41, in try_import
    raise ImportError(err_msg)
ImportError: Failed importing premailer. This likely means that some paddle modules require additional dependencies that have to be manually installed (usually with `pip install premailer`).

Should we create a requirements.txt under the ppstructure directory so that users can install these dependencies manually in one step?

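Since everything is already packaged as ppocr, users probably can no longer do a one-step install from a requirements.txt, unless we add a function dedicated to installing the dependencies.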
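
A dedicated helper of the kind mentioned above might, as a sketch, report which optional packages are missing and print the pip command rather than install anything itself. The names OPTIONAL_PACKAGES, missing_packages and install_hint are hypothetical, and the sketch assumes each package's import name matches its pip name.

```python
import importlib.util

# Packages moved behind lazy imports in this change.
OPTIONAL_PACKAGES = ["pdf2docx", "lxml", "premailer", "openpyxl"]


def missing_packages(names):
    """Return the top-level packages from `names` that cannot be found."""
    return [name for name in names if importlib.util.find_spec(name) is None]


def install_hint(names=OPTIONAL_PACKAGES):
    """Build a pip command for the missing packages, or "" if none are missing."""
    missing = missing_packages(names)
    if not missing:
        return ""
    return "pip install " + " ".join(missing)


if __name__ == "__main__":
    hint = install_hint()
    print(hint or "All optional dependencies are installed.")
```

Leaving the actual installation to the user avoids modifying the environment at runtime, which was the concern raised earlier in the thread.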

I mean the case where the user has cloned the repository.

That means ppstructure/recovery/requirements.txt needs to be kept, with all of the try_import packages added to it.

For now, remove ppstructure/recovery/requirements.txt from setup.py; we can revisit this later when migrating to pyproject.toml.

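Regarding this removal, I suggest we first decide whether ppstructure should stay in this project at all. If it should not, the removal can be done after the migration is complete.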