[Pycon] [new paper] "Peter Engerer" - AN OPEN-SOURCE PROCESSING PIPELINE FOR TEXTUAL INFORMATION EXTRACTION FROM RECEIPTS

Fri 22 Dec 2017 12:11:52 CET

Title: AN OPEN-SOURCE PROCESSING PIPELINE FOR TEXTUAL INFORMATION EXTRACTION FROM RECEIPTS
Duration: 45 minutes (includes Q&A)
Q&A Session: 15 minutes
Language: en
Type: Talk

Abstract: We present a processing pipeline for extracting textual information from images of receipts acquired with mobile devices. The proposed open-source pipeline is implemented in Python 3 and comprises three main parts: (1) Finding text in the image, either by detecting the contour of the receipt or by applying the Stroke Width Transform. In this part, we will also discuss the effects of various image pre-processing techniques available in OpenCV (such as noise filtering, contrast enhancement or morphological operations). (2) Text orientation detection, page segmentation and Optical Character Recognition (OCR) using the recurrent neural network (RNN) of Tesseract in order to obtain a hierarchical output (hOCR). (3) Extracting relevant information from the hOCR output by matching it against a pre-defined lexicon of keywords and visualizing the results in a concise and aesthetically pleasing way. Using a Jupyter notebook, we will interactively demonstrate this pipeline on pharmaceutical bills.
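As an illustration of part (3) only, the following is a minimal sketch of how words in an hOCR document might be matched against a keyword lexicon. It is not the pipeline presented in the talk. The lexicon contents, the names LEXICON, HocrWordParser, match_keywords and SAMPLE_HOCR, and the punctuation-stripping rule are all hypothetical and chosen for illustration. The only assumption taken from the hOCR format itself is that each recognized word is a span with class ocrx_word whose title attribute carries a "bbox x0 y0 x1 y1" entry.

```python
import re
from html.parser import HTMLParser

# Hypothetical lexicon: field name -> keywords that may label that field.
LEXICON = {
    "total": {"total", "sum"},
    "date": {"date"},
}


class HocrWordParser(HTMLParser):
    """Collect the text and bounding box of every 'ocrx_word' span."""

    def __init__(self):
        super().__init__()
        self.words = []
        self._current = None  # word span currently being read, if any

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if tag == "span" and "ocrx_word" in classes:
            match = re.search(r"bbox (\d+) (\d+) (\d+) (\d+)",
                              attrs.get("title") or "")
            bbox = tuple(int(v) for v in match.groups()) if match else None
            self._current = {"text": "", "bbox": bbox}

    def handle_data(self, data):
        if self._current is not None:
            self._current["text"] += data

    def handle_endtag(self, tag):
        if tag == "span" and self._current is not None:
            self._current["text"] = self._current["text"].strip()
            self.words.append(self._current)
            self._current = None


def match_keywords(hocr, lexicon=LEXICON):
    """Return (field, word text, bbox) for each word found in the lexicon."""
    parser = HocrWordParser()
    parser.feed(hocr)
    hits = []
    for word in parser.words:
        # Assumption: compare case-insensitively, ignoring edge punctuation.
        token = word["text"].lower().strip(":.,")
        for field, keywords in lexicon.items():
            if token in keywords:
                hits.append((field, word["text"], word["bbox"]))
    return hits


# Hand-written hOCR fragment for demonstration (not real Tesseract output).
SAMPLE_HOCR = """
<span class='ocr_line' title='bbox 10 10 200 30'>
  <span class='ocrx_word' title='bbox 10 10 80 30; x_wconf 93'>Total:</span>
  <span class='ocrx_word' title='bbox 90 10 200 30; x_wconf 91'>12.50</span>
</span>
"""

hits = match_keywords(SAMPLE_HOCR)
```

Here hits contains one entry for the word "Total:" together with its bounding box, which a later step could use to look for the amount printed to its right on the same line.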

Tags: image-processing, tesseract, recurrent-neural-network, scikit-, OCR, python3, Text-Mining, opencv