Automating the Extraction of Genealogical Information from Historical Documents Aaron P. Stewart David W. Embley March 20, 2011

Page 1: Automating the Extraction of Genealogical Information from Historical Documents

Automating the Extraction of Genealogical Information from Historical Documents

Aaron P. Stewart
David W. Embley
March 20, 2011

Page 2: Automating the Extraction of Genealogical Information from Historical Documents

Part I: Vision

Current projects at the BYU Data Extraction Group

Page 3: Automating the Extraction of Genealogical Information from Historical Documents

Goal: Search books for names

[Diagram labels: "History of the Jones Family" (book), scanner, "George Jones" (name)]

Page 4: Automating the Extraction of Genealogical Information from Historical Documents

Original Document

Page 5: Automating the Extraction of Genealogical Information from Historical Documents

Original Document

Page 6: Automating the Extraction of Genealogical Information from Historical Documents

Original Document

Page 7: Automating the Extraction of Genealogical Information from Historical Documents

Extracted Facts

Names
• William Gerard Lathrop
• Mary Ely
• Gerard Lathrop
• Charlotte Brackett Jennings
• Nathan Tilestone Jennings
• Maria Miller
• Maria Jennings [Lathrop]
• Donald McKenzie [Lathrop]
• Anna Margaretta [Lathrop]
• Anna Catherine [Lathrop]

Relationships
• William Gerard Lathrop : son of : Mary Ely
• William Gerard Lathrop : son of : Gerard Lathrop
• William Gerard Lathrop : m. : Charlotte Brackett Jennings
• Charlotte Brackett Jennings : dau. of : Nathan Tilestone Jennings
• Charlotte Brackett Jennings : dau. of : Maria Miller

Relationships (continued)
• Maria Jennings : child of : William Gerard Lathrop
• Maria Jennings : child of : Charlotte Brackett
• William Gerard : child of : William Gerard Lathrop
• William Gerard : child of : Charlotte Brackett
• Donald McKenzie : child of : William Gerard Lathrop
• Donald McKenzie : child of : Charlotte Brackett
• Anna Margaretta : child of : William Gerard Lathrop
• Anna Margaretta : child of : Charlotte Brackett
• Anna Catherine : child of : William Gerard Lathrop
• Anna Catherine : child of : Charlotte Brackett
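
The relationship lines on this slide follow a "subject : relation : object" pattern. As a minimal sketch of how such lines could be read into structured facts, the Python below splits each line into a triple and keeps the child-to-parent links. The `Fact` type and the helpers `parse_fact` and `child_parent_pairs` are hypothetical names for illustration; the presentation does not show how its tools store extracted facts.

```python
from typing import NamedTuple


class Fact(NamedTuple):
    subject: str
    relation: str
    obj: str


def parse_fact(line: str) -> Fact:
    """Split 'A : relation : B' into a Fact (hypothetical helper)."""
    parts = [part.strip() for part in line.split(" : ")]
    if len(parts) != 3:
        raise ValueError(f"expected 'subject : relation : object', got {line!r}")
    return Fact(*parts)


# Relation labels from the slide that all express a child-to-parent link.
CHILD_RELATIONS = {"son of", "dau. of", "child of"}


def child_parent_pairs(facts):
    """Keep only child-to-parent facts, as (child, parent) pairs."""
    return [(f.subject, f.obj) for f in facts if f.relation in CHILD_RELATIONS]


lines = [
    "William Gerard Lathrop : son of : Mary Ely",
    "William Gerard Lathrop : m. : Charlotte Brackett Jennings",
    "Charlotte Brackett Jennings : dau. of : Nathan Tilestone Jennings",
]
facts = [parse_fact(line) for line in lines]
pairs = child_parent_pairs(facts)
```

Treating "son of", "dau. of", and "child of" as one child-to-parent relation is a simplification made here; it drops the gender information that the first two labels carry.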

Page 8: Automating the Extraction of Genealogical Information from Historical Documents

Inferred Facts

Inferred Relationships
• Maria Jennings : grandchild of : Mary Ely
• Maria Jennings : grandchild of : Gerard Lathrop
• Maria Jennings : grandchild of : Nathan Tilestone Jennings
• Maria Jennings : grandchild of : Maria Miller
• William Gerard : grandchild of : Mary Ely
• William Gerard : grandchild of : Gerard Lathrop
• William Gerard : grandchild of : Nathan Tilestone Jennings
• William Gerard : grandchild of : Maria Miller
• …
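
Grandchild relationships like these can be derived by joining the child-to-parent facts with themselves. The sketch below is one possible way to do that in Python, not the group's implementation. One detail it has to assume: the extracted facts give the mother as "Charlotte Brackett" in the child facts but as "Charlotte Brackett Jennings" elsewhere, so the `ALIASES` table stands in for a name-resolution step that the slides do not describe.

```python
# Child-to-parent facts copied from the "Extracted Facts" slide (a subset).
CHILD_PARENT = [
    ("William Gerard Lathrop", "Mary Ely"),
    ("William Gerard Lathrop", "Gerard Lathrop"),
    ("Charlotte Brackett Jennings", "Nathan Tilestone Jennings"),
    ("Charlotte Brackett Jennings", "Maria Miller"),
    ("Maria Jennings", "William Gerard Lathrop"),
    ("Maria Jennings", "Charlotte Brackett"),
]

# Assumption: some name-resolution step maps the short form to the full name.
ALIASES = {"Charlotte Brackett": "Charlotte Brackett Jennings"}


def infer_grandchildren(child_parent, aliases):
    """Join child-parent facts with themselves to get (grandchild, grandparent) pairs."""
    resolved = [(aliases.get(c, c), aliases.get(p, p)) for c, p in child_parent]
    parents_of = {}
    for child, parent in resolved:
        parents_of.setdefault(child, []).append(parent)
    inferred = []
    for child, parent in resolved:
        for grandparent in parents_of.get(parent, []):
            inferred.append((child, grandparent))
    return inferred


for grandchild, grandparent in infer_grandchildren(CHILD_PARENT, ALIASES):
    print(f"{grandchild} : grandchild of : {grandparent}")
```

Without the alias entry, only the two paternal grandparents are found for Maria Jennings, which suggests that some form of name matching is needed before inference.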

Page 9: Automating the Extraction of Genealogical Information from Historical Documents

Keywords

Chief Justice

Page 10: Automating the Extraction of Genealogical Information from Historical Documents

Queries

• Is there a chief justice related to Mary Ely?
• Who are the sons of Gerard Lathrop?
• Who are the grandchildren of Mary Ely?

Page 11: Automating the Extraction of Genealogical Information from Historical Documents

Part II: Implementation

Page 12: Automating the Extraction of Genealogical Information from Historical Documents

Ontology Editor

Page 13: Automating the Extraction of Genealogical Information from Historical Documents

Data Frame Editor

Page 14: Automating the Extraction of Genealogical Information from Historical Documents

Rule Editor

Page 15: Automating the Extraction of Genealogical Information from Historical Documents

Name Query

Page 16: Automating the Extraction of Genealogical Information from Historical Documents

Name Query

Page 17: Automating the Extraction of Genealogical Information from Historical Documents

Name Query

Page 18: Automating the Extraction of Genealogical Information from Historical Documents

Name Query

Page 19: Automating the Extraction of Genealogical Information from Historical Documents

HyKSS Indexing

Page 20: Automating the Extraction of Genealogical Information from Historical Documents

HyKSS Indexing

Page 21: Automating the Extraction of Genealogical Information from Historical Documents

Keyword Search

Page 22: Automating the Extraction of Genealogical Information from Historical Documents

Keyword Search

Page 23: Automating the Extraction of Genealogical Information from Historical Documents

Keyword Search

Page 24: Automating the Extraction of Genealogical Information from Historical Documents

Relationship Search

Page 25: Automating the Extraction of Genealogical Information from Historical Documents

Relationship Search

Page 26: Automating the Extraction of Genealogical Information from Historical Documents

Inferred Relationship Search

Page 27: Automating the Extraction of Genealogical Information from Historical Documents

Inferred Relationship Search

Maria Jennings is a grandchild of Mary Ely

GrandchildOf(Maria Jennings, Mary Ely) :- Child-Parent(Maria Jennings, William Gerard Lathrop), Child-Parent(William Gerard Lathrop, Mary Ely)
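
The slide shows a result together with the rule instance that justifies it. Read with variables, the rule would presumably be GrandchildOf(x, z) :- Child-Parent(x, y), Child-Parent(y, z); that general form is an interpretation, since only the ground instance appears here. A small Python sketch of producing such a justification, using a hypothetical `explain_grandchild` helper:

```python
def explain_grandchild(grandchild, grandparent, child_parent):
    """Return the rule instance(s) that justify GrandchildOf(grandchild, grandparent).

    Each result names the intermediate parent that links the two people.
    """
    facts = set(child_parent)
    middles = sorted({p for c, p in facts if c == grandchild})
    return [
        f"GrandchildOf({grandchild}, {grandparent}) :- "
        f"Child-Parent({grandchild}, {middle}), "
        f"Child-Parent({middle}, {grandparent})"
        for middle in middles
        if (middle, grandparent) in facts
    ]


child_parent = [
    ("Maria Jennings", "William Gerard Lathrop"),
    ("William Gerard Lathrop", "Mary Ely"),
    ("William Gerard Lathrop", "Gerard Lathrop"),
]
explanations = explain_grandchild("Maria Jennings", "Mary Ely", child_parent)
```

An empty list means the facts do not support the relationship, so the same helper doubles as a check before a result is displayed.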

Page 28: Automating the Extraction of Genealogical Information from Historical Documents

Part III: Improvements

Page 29: Automating the Extraction of Genealogical Information from Historical Documents

Extraction Tools

Page 30: Automating the Extraction of Genealogical Information from Historical Documents

Need Better Extraction Results

From Packer et al., http://deg.byu.edu/papers/Ancestry_NAACL_HLT_Paper.pdf

------- Lists -------

Page 31: Automating the Extraction of Genealogical Information from Historical Documents

Example of a Better Extractor (Margin Finder)

B\ liee (OCR error)

Buekman (OCR error)

Jobsph (OCR error)

Baseline errors

Baseline errors

Uuckkman (OCR error)

Charles. (OCR error)

Page 32: Automating the Extraction of Genealogical Information from Historical Documents

Example of a Better Extractor (Margin Finder)

[Figure labels: lines of the page marked LEVEL 1 or LEVEL 2]

Page 33: Automating the Extraction of Genealogical Information from Historical Documents

Need Annotation Tools

Page 34: Automating the Extraction of Genealogical Information from Historical Documents

Credits

• Ontology Editor – Numerous past students
• Data Frame Editor – Numerous past students
• Rule Editor – Nathan Tate
• Hybrid Keyword and Semantic Search (HyKSS) – Andrew Zitzelberger

• This presentation contains both actual screenshots and mock-ups of projected results