your data. your value. · document classification – categorize documents information extraction...

25
Your Data. Your Value. 23384: How AI Revolutionizes Data & Document Management Florian Kuhlmann (CTO & Co-founder)

Upload: others

Post on 16-Oct-2020

3 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Your Data. Your Value. · Document Classification – categorize documents Information Extraction – extract relevant data ... • Rule engineers need domain knowledge • Rule engineers

Your Data. Your Value.

23384: How AI Revolutionizes Data & Document Management

Florian Kuhlmann (CTO & Co-founder)

Page 2: Your Data. Your Value. · Document Classification – categorize documents Information Extraction – extract relevant data ... • Rule engineers need domain knowledge • Rule engineers

Data TransparencyGlobal Corporate Clients

100+ 20+ 100%

Years using Deep Learning Technology

75+Employees in offices in

Berlin, London and New YorkData Security – ISO 27001 / 9001

Highest

Languages supported

4+

LEVERTON - Company overview

Page 3: Your Data. Your Value. · Document Classification – categorize documents Information Extraction – extract relevant data ... • Rule engineers need domain knowledge • Rule engineers

LEVERTON - Company vision

„Weenablesmarterdecisionsbystructuringtheworld’sdata“

Page 4: Your Data. Your Value. · Document Classification – categorize documents Information Extraction – extract relevant data ... • Rule engineers need domain knowledge • Rule engineers

DocumentManagement

From documents to dataWhich problem does LEVERTON solve?

DataManagement

Manual data input

Inefficient Non-transparent Error prone

UseAItoclassifydocumentsandextractdata

Controlusingefficient4-eyesprincipal

Keeplinktopositionindocument

Page 5: Your Data. Your Value. · Document Classification – categorize documents Information Extraction – extract relevant data ... • Rule engineers need domain knowledge • Rule engineers

DeepLearning

Data input with LEVERTONThree steps from document to data

Upload scanned PDFs

Automated extraction

Review and correct

Document Classification – categorize documents

Information Extraction – extract relevant data

OCR – convert images into machine readable text

DATA CORE

Train

Page 6: Your Data. Your Value. · Document Classification – categorize documents Information Extraction – extract relevant data ... • Rule engineers need domain knowledge • Rule engineers

Information Extraction

1. Define the type of information (e.g. as part of a data model)

2. Find location (start + end, coordinates) of information in document

3. Extract the information (parse numbers, dates, …)

Automaticallyextractstructuredinformationfrom un- orsemi-structured machine-readable documents

(Wikipedia)

Page 7: Your Data. Your Value. · Document Classification – categorize documents Information Extraction – extract relevant data ... • Rule engineers need domain knowledge • Rule engineers

Information Extraction: Define TypesData model as abstraction of real world entities

Amount DECIMAL LIST(Currencies)

Period LIST(Periods)

Startdate DATE

Rentcharge

- Exampledatamodelforrentchargesofalease-

Page 8: Your Data. Your Value. · Document Classification – categorize documents Information Extraction – extract relevant data ... • Rule engineers need domain knowledge • Rule engineers

Information ExtractionFind, classify and extract information

Amount

Period

Startdate

Rentcharge

From 1 October 2006 to 31 December 2010: The rent (inclusive of management fees and air-conditioning charges during normal office hours) for the period from the Commencement Date to the Expiration Date of the said term shall be USD ONE THOUSAND EIGHT HUNDRED AND THIRTY FOUR (USD 1,834.00) per calendar month.

2006-10-01

1.834,00 USD

monthly

Page 9: Your Data. Your Value. · Document Classification – categorize documents Information Extraction – extract relevant data ... • Rule engineers need domain knowledge • Rule engineers

Rule based

Extraction based on hand written rules

1. Apple is bad, if it has brown spots

2. Apple is bad, if it has wrinkles

3. Apple is bad, if it is brown

4. …

Machine & Deep Learning

Learn from examples

Bad apple

Bad apple

Good apple

Information Extraction

Page 10: Your Data. Your Value. · Document Classification – categorize documents Information Extraction – extract relevant data ... • Rule engineers need domain knowledge • Rule engineers

Rule based Information Extraction

Rule: Phone1

(({Token.string=="Telefon“})|({Token.string=="Tel“}({Token.string=="."})?) |({Token.string=="Fax“}))({Token.kind==punctuation})?({Token.string=="("})?({Token.numtype==ordinal})[1,3]({Token.string==")"}|{Token.string=="/"})?({Token.numtype==ordinal})[1,6]({Token.string=="-"}{Token.numtype==ordinal})?

:contact

-->

:contact.Phone = {rule = "Phone"}

Example: Extract phone or fax number from text

Page 11: Your Data. Your Value. · Document Classification – categorize documents Information Extraction – extract relevant data ... • Rule engineers need domain knowledge • Rule engineers

Information Extraction with Deep LearningFeed annotated examples and wait

yesno

… 0.0230.1213.4210.223…

Page 12: Your Data. Your Value. · Document Classification – categorize documents Information Extraction – extract relevant data ... • Rule engineers need domain knowledge • Rule engineers

Information ExtractionRule based vs. Deep Learning

Rule based Machine & Deep Learning

PROS• Samealgorithmsforalllanguages/domains• Noneedforengineerstounderstandthelanguage• Noneedforengineerstounderstandthedomain• Robust,generalizestodifferentverbalizations• Fast

CONS• Labeleddataneeded• Cannotbedebugged• Hardtounderstandandtotune• Trainingmighttakelongtime…

PROS• Startwithoutlabeleddata• Canbedebuggedandadjustedeasily

CONS• Ruleengineersneeddomainknowledge• Ruleengineersneedlanguageknowledge• Rulesmustbeadoptedtodifferentlanguages• Rulescanbecomequitecomplexandhardto

understand• Rulesnevercoverallpossiblecases,esp.incomplex

languages

...Sowhatisthebetterchoice?

Page 13: Your Data. Your Value. · Document Classification – categorize documents Information Extraction – extract relevant data ... • Rule engineers need domain knowledge • Rule engineers

Rules vs. Deep Learning @ LEVERTON

• Approx. one person year into writing rules led to performance (F1 score) of

à 20%-65% on 20 different data points in 1 language

• Current Deep Learning led to performance (F1 score) of

à 70% to near 100% on 700+ different data points in 4 languages

à This sounds good, but how can we get better?

What to use?

Page 14: Your Data. Your Value. · Document Classification – categorize documents Information Extraction – extract relevant data ... • Rule engineers need domain knowledge • Rule engineers

Challenges: LayoutDifferent layouts might need different extraction strategies

Structured Semi structured Unstructured

Page 15: Your Data. Your Value. · Document Classification – categorize documents Information Extraction – extract relevant data ... • Rule engineers need domain knowledge • Rule engineers

Challenge in Layout:Determine the correct reading order

• Visualfeaturesaloneoftennotsufficient!• Impossibletodeterminethereadingorder

withoutreadingthetext.Weneedtoguess.

Page 16: Your Data. Your Value. · Document Classification – categorize documents Information Extraction – extract relevant data ... • Rule engineers need domain knowledge • Rule engineers

LayoutDetermine the correct reading order

• Combinationofvisualfeaturesandinterpretationofextractedtextleadstocorrectreadingorder

• Humanbrainsareabletocombinemultiplesteps(Visualseparation,recognizingofcharactersandwords,interpretingwordstoformsentences,whichleadstocorrectseparation/layoutrecognition)

• DeepLearningisfocusedononeproblematatime

• Jointapproachesnecessarytosolvesuchproblems

1

2

Page 17: Your Data. Your Value. · Document Classification – categorize documents Information Extraction – extract relevant data ... • Rule engineers need domain knowledge • Rule engineers

Challenges: Data modelAbstraction comes with a price

Amount

Period

Startdate

Rentcharge

DATE???

yearly

1 Peppercorn

Page 18: Your Data. Your Value. · Document Classification – categorize documents Information Extraction – extract relevant data ... • Rule engineers need domain knowledge • Rule engineers

More challenges ahead…

• Co-Referencing (which information belongs to what)

• Scan quality

• Scaling out: Training already takes 6 weeks of one core

• …

How can we improve in the future

Page 19: Your Data. Your Value. · Document Classification – categorize documents Information Extraction – extract relevant data ... • Rule engineers need domain knowledge • Rule engineers

... But we are on a good track

• Simple documents: >95% automation for >50 different data points

• Complex documents: >70% automation for >200 different data points

• Available in >20 different languages

• Build own OCR engine in < 1 year, competitive to all known OCR engines, more robust on layout

Achievements of LEVERTON AI

Page 20: Your Data. Your Value. · Document Classification – categorize documents Information Extraction – extract relevant data ... • Rule engineers need domain knowledge • Rule engineers

Example Use Case: IFRS16Changes in balancing standards forces corporates to revisit leasing data

Global consolidation Extract data System Integration

IFRS 16 DATA

• Options

• Valuation

• Indexation

• Payment types

LEVERTON can be integrated into ERP systems at customers

��

� ����

Real estate leases

Machinery leases

Car leases& more

��

Page 21: Your Data. Your Value. · Document Classification – categorize documents Information Extraction – extract relevant data ... • Rule engineers need domain knowledge • Rule engineers

LEVERTON COREExample complex document >70% automation

Page 22: Your Data. Your Value. · Document Classification – categorize documents Information Extraction – extract relevant data ... • Rule engineers need domain knowledge • Rule engineers

LEVERTON COREExample simple document, >95% automation

Page 23: Your Data. Your Value. · Document Classification – categorize documents Information Extraction – extract relevant data ... • Rule engineers need domain knowledge • Rule engineers

LEVERTON COREEnables decisions using data based on legally binding documents

Page 24: Your Data. Your Value. · Document Classification – categorize documents Information Extraction – extract relevant data ... • Rule engineers need domain knowledge • Rule engineers

Recap

How AIRevolutionizes Data&Document Management

• Breakthrough with switch to Deep Learning

• More challenges for 100% automated information extraction on complex docs

• To achieve this, we probably need to combine multiple steps into one large network

• We’ll need lots of computational resources to do so

• If you have ideas about any of these: We are hiring.

Page 25: Your Data. Your Value. · Document Classification – categorize documents Information Extraction – extract relevant data ... • Rule engineers need domain knowledge • Rule engineers

Florian KuhlmannCTO & [email protected]