Algorithm design for extracting information from implicit table cells in invoices

Affiliation:

1. School of Electronic Information and Communications, Huazhong University of Science and Technology; 2. Meizhou Tobacco Monopoly Bureau (Company)

CLC number:

TP391.1

Fund project:

Supported by the Science and Technology Project of Meizhou Tobacco Monopoly Bureau (Company) (2023441400240048)

    Abstract:

    VAT invoices, electronic invoices, and other financial documents contain a large number of implicit table cells, and existing table recognition models extract itemized details from such invoices with low accuracy. A method is therefore needed that accurately recognizes and extracts the implicit table structure in such invoices. The proposed method uses morphological operations to detect the table lines in a document, finds the minimum bounding rectangle of the table, uses that rectangle to correct the table's skew, and extracts the coordinates of the table's horizontal and vertical lines. Morphological operations are then used to obtain connected text regions, from which the coordinates of the implicit horizontal and vertical cell lines are extracted to build the complete table structure. DBNet detects text lines within each cell, and CRNN recognizes the characters in each text line. For low-quality text images, a blind super-resolution model reconstructs the image before recognition; the model is trained on a dataset of paired high- and low-resolution images built by combining random degradation with learned degradation operators. Tested on datasets of two types of invoices containing implicit table cells, the method achieves 100% recognition accuracy for implicit table structures, and the character recognition error rate on low-quality text images drops by 14% after blind super-resolution reconstruction. Compared with the latest commercial Baidu Cloud table text recognition v2 and the deep-learning-based LGPMA table recognition model, the proposed method has advantages in both table structure construction accuracy and running speed. The experiments show that the method accurately recognizes implicit table structures in invoices and efficiently completes text recognition and extraction for this type of invoice.
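
To make the idea of an implicit cell line concrete, the following is a minimal sketch and not the authors' implementation. It assumes the text inside one explicit cell has already been binarized into a 0/1 mask, and it replaces the connected-region analysis described in the abstract with a simpler row projection: every sufficiently tall blank band between two bands of text yields one implicit horizontal separator at its midpoint. The function name `implicit_row_lines` and the `min_gap` threshold are hypothetical, introduced only for illustration.

```python
def implicit_row_lines(mask, min_gap=2):
    """Return y-coordinates of implicit horizontal separators.

    mask: list of equal-length rows of 0/1, where 1 marks a text pixel
    inside one explicit table cell region (assumed input format).
    min_gap: hypothetical threshold; blank runs shorter than this are
    treated as spacing inside one item and produce no separator.
    """
    # Row projection: does this row contain any text pixel?
    has_text = [any(row) for row in mask]

    lines = []
    gap_start = None    # first blank row of the current blank run
    seen_text = False   # skip the blank margin above the first text band
    for y, filled in enumerate(has_text):
        if filled:
            # A blank run just ended at row y - 1; place a separator at
            # its midpoint if it lies between two text bands and is tall enough.
            if seen_text and gap_start is not None and y - gap_start >= min_gap:
                lines.append((gap_start + y - 1) // 2)
            gap_start = None
            seen_text = True
        elif gap_start is None:
            gap_start = y
    return lines


# Toy cell: text on rows 1-2, 6-7 and 9; every other row is blank.
row_flags = [0, 1, 1, 0, 0, 0, 1, 1, 0, 1, 0]
mask = [[flag] * 4 for flag in row_flags]

print(implicit_row_lines(mask))             # [4]
print(implicit_row_lines(mask, min_gap=1))  # [4, 8]
```

With the default threshold, only the three-row gap (rows 3 to 5) produces a separator; lowering `min_gap` to 1 also splits at the single blank row 8. Applying the same projection to columns would give implicit vertical lines, again only as an approximation of the method summarized above.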

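Cite this article

贺锋, 张威, 杨玉燕, et al. Algorithm design for extracting information from implicit table cells in invoices[J]. Science Technology and Engineering, , ():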

History
  • Received: 2024-04-03
  • Last revised: 2024-06-17
  • Accepted: 2024-07-09