e x t ra ct i n g ta b les f rom p d f s - europython · camelot & excalibur: overview, main...
TRANSCRIPT
Extracting Tables from PDFsExtracting Tables from PDFsUsing and to automate PDF table extraction and export
Dimiter Naydenov
@dimitern
Camelot Excalibur
1 . 1
OverviewOverviewPDF: brief history, structure, representing tablesCamelot & Excalibur: overview, main features, installationDemo: quick tour of Camelot, visual debugging, andplottingFuture improvements, Q&A
2 . 1
Portable Document FormatPortable Document Formatalmost 30 years ago…
Evolution of the Digital Document:Celebrating Adobe Acrobat’s 25th Anniversary
This document describes the base technology and ideas behind the projectnamed "Camelot".[…] a universal way to communicate documents across a wide variety ofmachine con�gurations, operating systems and communication networks.[…] viewable on any display […] printable on any modern printers.
—The Camelot Project, John Warnock
source:
3 . 1
PDF: At a GlancePDF: At a GlanceCreated in the early 1990s by Adobe SystemsPredates the World Wide Web and HTMLProprietary format initially, released as open standard as of v1.7Based on a subset of Adobe PostScriptSelf-contained: embedded fonts, attachments, annotations, rich media,etc.13 versions released; an ISO standard since 2008 (PDF 1.7).Structured as a hierarchy of objects (words, paragraphs, fonts, etc.)
3 . 2
PDF: StructurePDF: Structure
source: Introduction to PDF syntax: by Guillaume Endignoux
3 . 3
Text Selection & PDF "Tables"Text Selection & PDF "Tables"Looks familiar?
Often you need to: select one cell at a time, copy & paste, repeat.
3 . 4
PDF Table Extraction ToolsPDF Table Extraction Tools - Java-based, open-source.
- Python, open-source. - Python, proprietary, paid.
- Python, open-source, no longermaintained.
- Proprietary, free and paid online service.
Tabulapdfplumberpdftablespdf-table-extract
OCR.space
3 . 5
Camelot & ExcaliburCamelot & ExcaliburCamelot
Excalibur
Started in 2016 by Vinayak Mehta @vortex_ape
at SocialCops in Bangalore, India.
https://github.com/camelot-dev/camelot
https://github.com/camelot-dev/excaliburhttps://tryexcalibur.com
4 . 1
Camelot: FeaturesCamelot: FeaturesExcellent Python-based, MIT licensedTwo extraction algorithms: Lattice and StreamWorks well out-of-the-box, but very con�gurableExports to CSV, TSV, Excel, JSON, HTML, or PandasDataFrames!Visual debugging and plotting with matplotlibActively maintained, contributors welcome!
documentation
4 . 2
Camelot & Excalibur: InstallationCamelot & Excalibur: InstallationCamelot
Using (easiest way)
Using pip, after installing : tk and ghostscript
Excalibur
Using pip, after installing tk and ghostscript
Conda
conda install -c conda-forge camelot-py
prerequisites
pip install --upgrade pip camelot-py[cv]
prerequisites
pip install --upgrade pip excalibur-py
4 . 3
Demo Time!Demo Time!
5 . 1
Future Improvements / Q&AFuture Improvements / Q&A
Performance improvementsReplacing Ghostscript withalternativesMore testsBetter memory footprint with large PDFs<your-favourite-feature?>
6 . 1
Questions ?
@dimitern
6 . 2