[Pycon] [new paper] "Vinayak Mehta" - Extracting tabular data from PDFs with Camelot & Excalibur

Sat 5 Jan 2019 21:07:09 CET

Title: Extracting tabular data from PDFs with Camelot & Excalibur
Duration: 45 minutes (includes Q&A)
Q&A Session: 15 minutes
Language: it
Type: Talk

Abstract: Extracting tables from PDFs is hard. The Portable Document Format was not designed for tabular data. Sadly, a lot of open data is shared as PDFs and getting tables out for analysis is a pain. A simple copy-and-paste from a PDF into a text file or spreadsheet program doesn't work.
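
To illustrate why copy-and-paste tends to fail: a PDF page places text at coordinates and records no rows or columns. The sketch below is a simplified illustration in plain Python, not Camelot's actual algorithm. The (x, y, text) word tuples, the function name group_words_into_rows and the y_tolerance parameter are all assumptions made for this example. It rebuilds rows by grouping words with similar vertical positions.

```python
def group_words_into_rows(words, y_tolerance=2.0):
    """Group (x, y, text) words into table rows.

    Assumes PDF-style coordinates (origin at the bottom-left of the page),
    so a larger y means the word sits higher on the page.
    """
    rows = []
    # Visit words from the top of the page to the bottom.
    for x, y, text in sorted(words, key=lambda w: (-w[1], w[0])):
        if rows and abs(rows[-1]["y"] - y) <= y_tolerance:
            # Close enough vertically: same row as the previous word.
            rows[-1]["cells"].append((x, text))
        else:
            rows.append({"y": y, "cells": [(x, text)]})
    # Within each row, order the cells from left to right.
    return [[text for _, text in sorted(row["cells"])] for row in rows]


# Hypothetical words in the arbitrary order a page might store them.
sample_words = [
    (300, 700, "Price"), (72, 700, "Item"),
    (72, 685.5, "Apple"), (300, 686, "1.20"),
    (300, 670, "0.50"), (72, 670, "Pear"),
]
rows = group_words_into_rows(sample_words)
for row in rows:
    print(row)
```

Real documents add merged cells, multi-line cells and columns that do not line up cleanly, which is why a fixed tolerance like this one is rarely enough on its own.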

This talk will briefly touch upon the history of the Portable Document Format, discuss some problems that arise when extracting tabular data from PDFs with the current ecosystem of libraries and tools, and demonstrate how Camelot and Excalibur handle these problems better and in a scalable manner. These easy-to-use packages automatically detect and extract tables from PDFs and give you access to the extracted tables as pandas DataFrames. You can also download them as CSV or Excel files.
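
As a rough idea of what this workflow can look like with Camelot, here is a minimal sketch. It assumes the camelot-py package is installed (with its Excel dependencies for the .xlsx export). The wrapper name extract_tables and the table_N file naming are hypothetical choices for this example; camelot.read_pdf, the df attribute, to_csv and to_excel are Camelot calls.

```python
from pathlib import Path


def extract_tables(pdf_path, output_dir, pages="1", flavor="lattice"):
    """Extract tables from a PDF; return DataFrames and save CSV/Excel copies.

    flavor="lattice" targets tables drawn with ruling lines, while
    flavor="stream" targets tables laid out with whitespace only.
    """
    # Imported lazily so this module loads even where Camelot is missing.
    import camelot  # third-party package: camelot-py

    tables = camelot.read_pdf(str(pdf_path), pages=pages, flavor=flavor)

    output_dir = Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)

    frames = []
    for index in range(len(tables)):
        table = tables[index]
        frames.append(table.df)  # the extracted table as a pandas DataFrame
        # Hypothetical naming scheme: table_1.csv, table_1.xlsx, ...
        table.to_csv(str(output_dir / f"table_{index + 1}.csv"))
        table.to_excel(str(output_dir / f"table_{index + 1}.xlsx"))
    return frames
```

A call such as extract_tables("report.pdf", "out", pages="1-3", flavor="stream") would then return one DataFrame per detected table, where report.pdf is a placeholder file name.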

After watching this talk, the audience will have a high-level understanding of how the Portable Document Format works. They will also learn how to easily extract tabular data from any type of PDF (the table structures can be bizarre!) using Camelot (the Python library) or Excalibur (the web interface), access the extracted tables as pandas DataFrames, and save them as CSV or Excel files. The talk can be particularly helpful for data analysts, scientists and journalists, since they work with a lot of open data, much of which is shared as PDFs, and have a recurring need to extract tables from PDFs for analysis and record-keeping.

Tags: pdf, opencv, scraping, computer-vision