e x t ra ct i n g ta b les f rom p d f s - europython · camelot & excalibur: overview, main...

13
Extracting Tables from PDFs Extracting Tables from PDFs Using and to automate PDF table extraction and export Dimiter Naydenov @dimitern Camelot Excalibur 1.1

Upload: others

Post on 02-Jun-2020

0 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: E x t ra ct i n g Ta b les f rom P D F s - EuroPython · Camelot & Excalibur: overview, main features, installation Demo: quick tour of Camelot, visual debugging, and ... a uni versal

Extracting Tables from PDFsExtracting Tables from PDFsUsing and to automate PDF table extraction and export

Dimiter Naydenov

@dimitern

Camelot Excalibur

1 . 1

Page 2: E x t ra ct i n g Ta b les f rom P D F s - EuroPython · Camelot & Excalibur: overview, main features, installation Demo: quick tour of Camelot, visual debugging, and ... a uni versal

OverviewOverviewPDF: brief history, structure, representing tablesCamelot & Excalibur: overview, main features, installationDemo: quick tour of Camelot, visual debugging, andplottingFuture improvements, Q&A

2 . 1

Page 3: E x t ra ct i n g Ta b les f rom P D F s - EuroPython · Camelot & Excalibur: overview, main features, installation Demo: quick tour of Camelot, visual debugging, and ... a uni versal

Portable Document FormatPortable Document Formatalmost 30 years ago…

Evolution of the Digital Document:Celebrating Adobe Acrobat’s 25th Anniversary

This document describes the base technology and ideas behind the projectnamed "Camelot".[…] a universal way to communicate documents across a wide variety ofmachine con�gurations, operating systems and communication networks.[…] viewable on any display […] printable on any modern printers.

—The Camelot Project, John Warnock

source:

3 . 1

Page 4: E x t ra ct i n g Ta b les f rom P D F s - EuroPython · Camelot & Excalibur: overview, main features, installation Demo: quick tour of Camelot, visual debugging, and ... a uni versal

PDF: At a GlancePDF: At a GlanceCreated in the early 1990s by Adobe SystemsPredates the World Wide Web and HTMLProprietary format initially, released as open standard as of v1.7Based on a subset of Adobe PostScriptSelf-contained: embedded fonts, attachments, annotations, rich media,etc.13 versions released; an ISO standard since 2008 (PDF 1.7).Structured as a hierarchy of objects (words, paragraphs, fonts, etc.)

3 . 2

Page 5: E x t ra ct i n g Ta b les f rom P D F s - EuroPython · Camelot & Excalibur: overview, main features, installation Demo: quick tour of Camelot, visual debugging, and ... a uni versal

PDF: StructurePDF: Structure

source: Introduction to PDF syntax: by Guillaume Endignoux

3 . 3

Page 6: E x t ra ct i n g Ta b les f rom P D F s - EuroPython · Camelot & Excalibur: overview, main features, installation Demo: quick tour of Camelot, visual debugging, and ... a uni versal

Text Selection & PDF "Tables"Text Selection & PDF "Tables"Looks familiar?

Often you need to: select one cell at a time, copy & paste, repeat.

3 . 4

Page 7: E x t ra ct i n g Ta b les f rom P D F s - EuroPython · Camelot & Excalibur: overview, main features, installation Demo: quick tour of Camelot, visual debugging, and ... a uni versal

PDF Table Extraction ToolsPDF Table Extraction Tools - Java-based, open-source.

- Python, open-source. - Python, proprietary, paid.

- Python, open-source, no longermaintained.

- Proprietary, free and paid online service.

Tabulapdfplumberpdftablespdf-table-extract

OCR.space

3 . 5

Page 8: E x t ra ct i n g Ta b les f rom P D F s - EuroPython · Camelot & Excalibur: overview, main features, installation Demo: quick tour of Camelot, visual debugging, and ... a uni versal

Camelot & ExcaliburCamelot & ExcaliburCamelot

Excalibur

Started in 2016 by Vinayak Mehta @vortex_ape

at SocialCops in Bangalore, India.

https://github.com/camelot-dev/camelot

https://github.com/camelot-dev/excaliburhttps://tryexcalibur.com

4 . 1

Page 9: E x t ra ct i n g Ta b les f rom P D F s - EuroPython · Camelot & Excalibur: overview, main features, installation Demo: quick tour of Camelot, visual debugging, and ... a uni versal

Camelot: FeaturesCamelot: FeaturesExcellent Python-based, MIT licensedTwo extraction algorithms: Lattice and StreamWorks well out-of-the-box, but very con�gurableExports to CSV, TSV, Excel, JSON, HTML, or PandasDataFrames!Visual debugging and plotting with matplotlibActively maintained, contributors welcome!

documentation

4 . 2

Page 10: E x t ra ct i n g Ta b les f rom P D F s - EuroPython · Camelot & Excalibur: overview, main features, installation Demo: quick tour of Camelot, visual debugging, and ... a uni versal

Camelot & Excalibur: InstallationCamelot & Excalibur: InstallationCamelot

Using (easiest way)

Using pip, after installing : tk and ghostscript

Excalibur

Using pip, after installing tk and ghostscript

Conda

conda install -c conda-forge camelot-py

prerequisites

pip install --upgrade pip camelot-py[cv]

prerequisites

pip install --upgrade pip excalibur-py

4 . 3

Page 11: E x t ra ct i n g Ta b les f rom P D F s - EuroPython · Camelot & Excalibur: overview, main features, installation Demo: quick tour of Camelot, visual debugging, and ... a uni versal

Demo Time!Demo Time!

5 . 1

Page 12: E x t ra ct i n g Ta b les f rom P D F s - EuroPython · Camelot & Excalibur: overview, main features, installation Demo: quick tour of Camelot, visual debugging, and ... a uni versal

Future Improvements / Q&AFuture Improvements / Q&A

Performance improvementsReplacing Ghostscript withalternativesMore testsBetter memory footprint with large PDFs<your-favourite-feature?>

6 . 1

Page 13: E x t ra ct i n g Ta b les f rom P D F s - EuroPython · Camelot & Excalibur: overview, main features, installation Demo: quick tour of Camelot, visual debugging, and ... a uni versal

Questions ?

@dimitern

6 . 2