Automatic Extraction of Individual and Family Information from Primary Genealogical Records

By Charla Woodbury, October 17, 2006


Page 1: Automatic Extraction of Individual and Family Information from Primary Genealogical Records

By Charla Woodbury

October 17, 2006

Page 2: Digital Images – Human Index

• Large number of competing family history websites

• Digital images

• Human indexes – double entry

• Researchers hunting through records and indexes to put families together

Page 3: Problem

• Large amounts of primary genealogical data

• Big projects to index and extract records

• Two independent indexers and adjudication

• Millions of human hours used to index or match records for names and families

Page 4: Automated Extraction Solution

• Create a specialized extraction ontology to interpret and label genealogical data

• Develop expert logic and rules that
  • Match and merge individuals
  • Group them into families

Page 5: Methods

• Prepare for the records extraction

• Run a 1st PASS to extract the information

• Run a 2nd PASS to match individuals and link families

• Evaluate and optimize the results

Page 6: Prepare for Records Extraction

• Build an Ontology
  • BYU ontology software Ontos to interpret and correctly label genealogical data using
    • Dataframes
    • Regular expressions
    • Lexicons
    • Conversion functions
  • “encapsulates knowledge about the appearance, behavior, and context of a collection of data elements” (Dr. David Embley)

• Collect machine-readable records

Page 7: Ontology – Entity Level

Page 8: Danish GIVEN NAME LEXICON

MALE
Anders – And.
Andreas
Christen – Kristen
Christian – Kristian
Erik – Eric
Gregers
Hans
Ib – Jep – Jeppe
Jacob
Jens
Johan – Johannes – Joh.
Jorgen – Jørgen
Knud
Lars – Laurs – Laurids – Lauritz
Mads – Mats

FEMALE
Ane – Anna – Anne
Birthe – Birte
Bodil
Caroline
Dorthe – Dorte
Ellen – Helene – Elene
Elisabeth – Elsbeth – Lisbeth
Else – Ilse
Ingeborg
Inger
Karen
Kirsten – Christen – Kirstine – Christine – Chirstine
Malene
Maren
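A lexicon like this can be treated as a lookup table from spelling variants to one canonical name. The Python sketch below is only an illustration: the dictionary layout and the names MALE_NAMES, build_lookup, and normalize_given_name are assumptions, not the Ontos lexicon format, and only a few entries from the slide are included.

```python
# Hypothetical in-memory lexicon: canonical given name -> variants from the slide.
# The layout is an assumption made for illustration, not the Ontos file format.
MALE_NAMES = {
    "Anders": ["And."],
    "Christen": ["Kristen"],
    "Erik": ["Eric"],
    "Ib": ["Jep", "Jeppe"],
    "Johan": ["Johannes", "Joh."],
    "Lars": ["Laurs", "Laurids", "Lauritz"],
}


def build_lookup(lexicon):
    """Map every lower-cased variant (and the canonical form) to the canonical name."""
    lookup = {}
    for canonical, variants in lexicon.items():
        for form in [canonical, *variants]:
            lookup[form.lower()] = canonical
    return lookup


LOOKUP = build_lookup(MALE_NAMES)


def normalize_given_name(token):
    """Return the canonical name for a token, or None if it is not in the lexicon."""
    return LOOKUP.get(token.strip().lower())
```

Christen appears in both the male and the female list on the slide, so a fuller lexicon would need surrounding context to decide the sex of the person.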

Page 9: DATE Lexicon Adds Thesaurus of Synonyms

MONTHS
January – Jan – Januar – 11br
February – Feb – Februar – 12br
March – Mar – Marts
April – Apr – Apl
May – Mai
June – Jun – Juni
July – Jul – Juli – 5br
August – Aug – Augst – 6br
September – Sep – Sept – 7br – Septembre
October – Oct – 8br – Octobre
November – Nov – 9br – Novembre
December – Dec – 10br

TIME
Year – yr – aar – år
Month – mo – maaned – m.
Week – uge – ug.
Day – dag – d.
Hour – h. – hr.

FEAST DATES
Easter – Paaske – Påske – Paasche – Påsche – P.
Pentecost – Pent – Pinse – Pin
Trinity – Tr – Trin – Trinitatis

DAYS OF WEEK
Sunday – Sun – Dominico – Dom.
Monday – Mon – Mondag – Mond.
Tuesday – Tue – Tirsdag – Tirsd.
Wednesday – Wed – Onsdag – Onsd.
Thursday – Thur – Tørsdag – Tørsd.
Friday – Fri – Fredag – Fred.
Saturday – Sat – Lørsdag – Lørs
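The month synonyms can be folded into a single lookup so that any listed spelling resolves to a month number. This is a minimal sketch under assumptions: the table layout and the function name month_number are hypothetical, and the numeric forms such as 7br are mapped exactly as the slide lists them.

```python
# Month synonyms copied from the slide; the table layout is an assumption.
MONTH_SYNONYMS = {
    1: ["January", "Jan", "Januar", "11br"],
    2: ["February", "Feb", "Februar", "12br"],
    3: ["March", "Mar", "Marts"],
    4: ["April", "Apr", "Apl"],
    5: ["May", "Mai"],
    6: ["June", "Jun", "Juni"],
    7: ["July", "Jul", "Juli", "5br"],
    8: ["August", "Aug", "Augst", "6br"],
    9: ["September", "Sep", "Sept", "7br", "Septembre"],
    10: ["October", "Oct", "8br", "Octobre"],
    11: ["November", "Nov", "9br", "Novembre"],
    12: ["December", "Dec", "10br"],
}

# Invert the table once: lower-cased synonym -> month number.
_MONTH_LOOKUP = {
    synonym.lower(): number
    for number, synonyms in MONTH_SYNONYMS.items()
    for synonym in synonyms
}


def month_number(token):
    """Return 1-12 for a recognized month synonym, else None."""
    return _MONTH_LOOKUP.get(token.strip().rstrip(".").lower())
```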

Page 10: CONVERSION FUNCTIONS inside the ontology

• Compute birth date from age at death
  Death date – 22 Mar 1743
  Age – 23 yr 2 m
  -> BIRTH Jan 1720

• Compute dates from feast dates
  Sunday 23rd after Trinity 1751
  -> 14 Nov 1751
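Both conversions on this slide reduce to simple date arithmetic. The sketch below reproduces the two worked examples in Python. It assumes the Gregorian Easter computus and Trinity Sunday falling 56 days after Easter; the slides do not say how the ontology computes feast dates, and the function names are hypothetical.

```python
from datetime import date, timedelta


def birth_from_age_at_death(death_year, death_month, age_years, age_months):
    """Subtract an age in years and months from a death year and month."""
    months = death_year * 12 + (death_month - 1) - (age_years * 12 + age_months)
    return months // 12, months % 12 + 1  # (year, month)


def gregorian_easter(year):
    """Easter Sunday by the anonymous Gregorian algorithm (an assumed method)."""
    a = year % 19
    b, c = divmod(year, 100)
    d, e = divmod(b, 4)
    f = (b + 8) // 25
    g = (b - f + 1) // 3
    h = (19 * a + b - d - g + 15) % 30
    i, k = divmod(c, 4)
    l = (32 + 2 * e + 2 * i - h - k) % 7
    m = (a + 11 * h + 22 * l) // 451
    month, day = divmod(h + l - 7 * m + 114, 31)
    return date(year, month, day + 1)


def sunday_after_trinity(year, n):
    """Trinity Sunday is 56 days after Easter; the n-th Sunday after adds n weeks."""
    return gregorian_easter(year) + timedelta(days=56 + 7 * n)
```

With these definitions, birth_from_age_at_death(1743, 3, 23, 2) gives (1720, 1), that is Jan 1720, and sunday_after_trinity(1751, 23) gives 14 Nov 1751, matching the slide.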

Page 11: Collect Machine-Readable Records

Page 12: English Parish – Wirksworth, Derby

1608-1813

Page 13: Danish Parish – Maglebye

1646-1813

Page 14: Sample Danish marriages

Page 15: New England – Beverly, Mass.

1668-1849

Page 16: 2. Run a 1st pass to extract the information

• Annotate the genealogical record with the ontology

• Populate RDF data file

Page 17: Annotated Town Record

SOURCE – Beverly town records

[PAGE HEADER] Births page 391

[BODY] WOODBURY, Benjamin, s. Nickolas and Anne, bp. 26 : 2 m : 1668.

NAME <NAME>
DATE <DATE>
PLACE <PLACE>
RELATIONSHIP <RELATION>
OCCUPATION <OCCUPATION>
RECORD_TYPE <RTYPE>
SOURCE <SOURCE>
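To make the annotation step concrete, the sketch below labels the Beverly entry with a single regular expression. It is a simplified stand-in for the ontology, not the Ontos implementation: the pattern covers only this one entry layout, the names ENTRY_RE and annotate are hypothetical, and the readings of "s." as son and "bp." as baptism are assumed from common usage in published town records.

```python
import re

ENTRY = "WOODBURY, Benjamin, s. Nickolas and Anne, bp. 26 : 2 m : 1668."

# Assumed expansions of the abbreviations used in the entry.
RELATIONS = {"s": "son", "d": "daughter"}
RECORD_TYPES = {"b": "birth", "bp": "baptism"}

# One hypothetical pattern for the layout "SURNAME, Given, s. Father and Mother, bp. date."
ENTRY_RE = re.compile(
    r"(?P<surname>[A-Z]+), (?P<given>[A-Z][a-z]+), "
    r"(?P<relation>s|d)\. (?P<father>[A-Z][a-z]+) and (?P<mother>[A-Z][a-z]+), "
    r"(?P<rtype>bp|b)\. (?P<day>\d{1,2}) : (?P<month>\d{1,2}) m : (?P<year>\d{4})\."
)


def annotate(entry):
    """Return (label, value) pairs using the label names shown on the slide."""
    m = ENTRY_RE.match(entry)
    if m is None:
        return []  # layout not recognized by this single pattern
    return [
        ("NAME", f"{m['given']} {m['surname']}"),
        ("RELATION", RELATIONS[m["relation"]]),
        ("NAME", m["father"]),
        ("NAME", m["mother"]),
        ("RTYPE", RECORD_TYPES[m["rtype"]]),
        ("DATE", f"{m['day']} : {m['month']} m : {m['year']}"),
    ]
```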

Page 18: Annotated Danish Parish

SOURCE – Tvilum Parish Register

[PAGE HEADER] Fødde 1751 page 3

[BODY] Truust Dom. 23 p: Trinit: laest over Niels Baches SØREN fadd. Johannes Michelsens og Niels Mollers hustruer af Søebyevad, Peder Rasmussen af Søebyevad, Jens Bachis søn Peder og Niels Thylkes s. Peder af Truust

Page 19: Populate RDF-data file

Hilton Campbell’s design

• PERSON

• EVENT

• LINKS – PERSON(S) to EVENT

Page 20

• EVENT – birth of Rachel

• PERSONS – Sarah and Rachel
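The PERSON, EVENT, and LINK structure can be pictured as subject-predicate-object triples. The sketch below uses plain Python tuples in place of an RDF library. The identifiers and predicate names are hypothetical and are not Hilton Campbell's schema, and Sarah's role is left generic because the slide does not state it.

```python
# A tiny in-memory triple store; tuples stand in for RDF statements.
triples = set()


def add_person(pid, name):
    triples.add((pid, "type", "PERSON"))
    triples.add((pid, "name", name))


def add_event(eid, kind):
    triples.add((eid, "type", "EVENT"))
    triples.add((eid, "kind", kind))


def link(pid, eid, role):
    """LINK: record that a person takes part in an event in some role."""
    triples.add((pid, f"role:{role}", eid))


def people_in(eid):
    """Names of all persons linked to an event, sorted alphabetically."""
    names = {s: o for s, p, o in triples if p == "name"}
    return sorted(
        names[s] for s, p, o in triples if p.startswith("role:") and o == eid
    )


# The example from the slide: one birth event linked to two persons.
add_person("person1", "Sarah")
add_person("person2", "Rachel")
add_event("event1", "birth")
link("person2", "event1", "child")
link("person1", "event1", "participant")  # role not given on the slide
```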

Page 21: 3. Run a SECOND PASS to match individuals and to link families

• FORMULATE RULES in Rule Engine language for RDF-data file
  • Match individuals
  • Check family data
  • Link families up

• APPLY RULES through the Java Rules API
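The slides formulate these rules in a rule-engine language and apply them through the Java Rules API. The Python below is only a stand-in that shows the shape of a "match individuals" rule set: two records match when every rule agrees. The Person fields, the variant table, the two-year tolerance, and all function names are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Person:
    given: str
    surname: str
    birth_year: Optional[int] = None


# Hypothetical spelling variants; a real run would draw on the name lexicon.
VARIANTS = {"nickolas": "nicholas", "elisabeth": "elizabeth"}


def canon(name):
    """Lower-case a name and map known variants to one spelling."""
    key = name.lower()
    return VARIANTS.get(key, key)


def same_name(a, b):
    return canon(a.given) == canon(b.given) and canon(a.surname) == canon(b.surname)


def compatible_birth(a, b, tolerance=2):
    """Unknown birth years never block a match; known ones must be close."""
    if a.birth_year is None or b.birth_year is None:
        return True
    return abs(a.birth_year - b.birth_year) <= tolerance


RULES = [same_name, compatible_birth]


def is_match(a, b):
    """Two records refer to the same individual only if every rule passes."""
    return all(rule(a, b) for rule in RULES)
```

Keeping the rules in a list makes it easy to add, remove, or reorder them between evaluation runs.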

Page 22: 4. Evaluate and Optimize Results

• Evaluate the preliminary results

• Optimize the rules

• Improve the whole process

Page 23: VALIDATION I – Classification by Record Type

RECALL = 240 entries CORRECTLY LABELED ‘BIRTH’ / 312 entries ACTUAL BIRTHS = .769

PRECISION = 240 entries CORRECTLY LABELED ‘BIRTH’ / 246 entries TOTAL LABELED ‘BIRTH’ = .976

The higher the number, the better

Page 24: VALIDATION II – Correctness of the Extraction

RECALL = 950 entries CORRECTLY LABELED ‘NAME’ / 1000 entries ACTUAL NAMES = .95

PRECISION = 950 entries CORRECTLY LABELED ‘NAME’ / 980 entries TOTAL LABELED ‘NAME’ = .969

The higher the number, the better
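The recall and precision figures on the two validation slides follow directly from the counts they quote. A short Python check, with hypothetical function names, reproduces them:

```python
def recall(correct, actual):
    """Share of the true items that were labeled correctly."""
    return correct / actual


def precision(correct, labeled):
    """Share of the labeled items that were labeled correctly."""
    return correct / labeled


# Counts quoted on the slides.
birth_recall = round(recall(240, 312), 3)       # record type 'BIRTH'
birth_precision = round(precision(240, 246), 3)
name_recall = round(recall(950, 1000), 3)       # label 'NAME'
name_precision = round(precision(950, 980), 3)
```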

Page 25: Isaac WOODBURY

Children

1. Robert 4 Jul 1672
2. Mary 6 Oct 1674
3. Christian 3 Mar 1677/8
4. Isaac 6 Apr 1680
5. Deliverance 1 Feb 1682/3
6. Joshua 1 Jan 1684/5
7. Elizabeth 17 Jan 1688
8. Nickolas 12 Aug 1688
9. Ann 29 Jun 1689
10. Lidia 1 Feb 1691/2
11. Elisabeth about 1694
12. Isaac 20 Jul 1697
13. Benjamin 20 Aug 1699

Page 26

Isaac WOODBURY SON of HUMPHREY
Mary WILKES
MARRIAGE 9 Oct 1671

1. Robert 4 Jul 1672
2. Mary 6 Oct 1674
3. Christian 3 Mar 1677/8
4. Isaac 6 Apr 1680
5. Deliverance 1 Feb 1682/3
6. Joshua 1 Jan 1684/5
7. Elizabeth 17 Jan 1688

Isaac WOODBURY SON of NICHOLAS
Elizabeth
MARRIAGE ________

1. Nickolas 12 Aug 1688
2. Ann 29 Jun 1689
3. Lidia 1 Feb 1691/2
4. Elisabeth about 1694
5. Isaac 20 Jul 1697
6. Benjamin 20 Aug 1699
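One way a "check family data" rule could detect that the thirteen children belong to two families is to test the spacing between consecutive births. The sketch below is a hypothetical rule, not one taken from the slides: the 270-day threshold is an assumption, and the dual date 1684/5 is read as 1685.

```python
from datetime import date

# A few of the children listed under Isaac WOODBURY, with birth dates.
CHILDREN = [
    ("Joshua", date(1685, 1, 1)),     # slide gives 1 Jan 1684/5; read as 1685
    ("Elizabeth", date(1688, 1, 17)),
    ("Nickolas", date(1688, 8, 12)),
    ("Ann", date(1689, 6, 29)),
]

MIN_GAP_DAYS = 270  # assumed minimum spacing for two births to one mother


def spacing_conflicts(children, min_gap_days=MIN_GAP_DAYS):
    """Return pairs of consecutive children born too close to share a mother."""
    ordered = sorted(children, key=lambda child: child[1])
    return [
        (earlier[0], later[0])
        for earlier, later in zip(ordered, ordered[1:])
        # A gap of zero days is allowed so that twins are not flagged.
        if 0 < (later[1] - earlier[1]).days < min_gap_days
    ]


conflicts = spacing_conflicts(CHILDREN)
```

Here the only flagged pair is Elizabeth and Nickolas, born about seven months apart, which is exactly where the two families are split.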

Page 27: VALIDATION III – Grouping by FAMILY

(total # merges + splits to correct families after 2nd PASS) / (total # merges + splits to correct families after 1st PASS)

The lower the number, the better

Page 28: Optimize the Rules

• Add

• Remove

• Fine-tune

• Change the order

Improve the whole process until the metrics no longer improve
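The "until the metrics no longer improve" loop can be sketched as a greedy search over candidate edits to the rule list. Everything in the example is a toy assumption: the rules are just letters, the only edit shown is changing the order, and the score is an inversion count standing in for a real metric where lower is better.

```python
def optimize(rules, candidate_edits, score):
    """Keep applying edits that lower the score; stop when none helps."""
    best = score(rules)
    improved = True
    while improved:
        improved = False
        for edit in candidate_edits:
            trial = edit(rules)
            trial_score = score(trial)
            if trial_score < best:
                rules, best, improved = trial, trial_score, True
    return rules, best


def swap(i):
    """Build a 'change the order' edit that swaps rules i and i + 1."""
    def edit(rules):
        out = list(rules)
        out[i], out[i + 1] = out[i + 1], out[i]
        return out
    return edit


def inversions(rules):
    """Toy metric: number of out-of-order pairs (lower is better)."""
    return sum(
        1
        for i in range(len(rules))
        for j in range(i + 1, len(rules))
        if rules[i] > rules[j]
    )


final_rules, final_score = optimize(["c", "a", "b"], [swap(0), swap(1)], inversions)
```

In a real run the score would come from the validation measures, and the candidate edits would also add, remove, or fine-tune rules.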

Page 29: AUTOMATIC EXTRACTION

Unstructured genealogical data

Searchable annotated genealogical data

Families in RDF-data file

Page 30: Questions?