The Archaeotools project, faceted classification and natural language processing in an archaeological context.
University of York, April 2008
AHRC-EPSRC-JISC eScience research grants scheme:
AIM: To allow archaeologists to discover, share and analyse datasets and legacy publications which have hitherto been very difficult to integrate into existing digital frameworks
BUILDS UPON: Common Information Environment Enhanced Geospatial browser
PARTNERS: Natural Language Processing Research Group, Department of Computer Science, University of Sheffield
Joint Information Systems Committee
Three distinct Workpackages:
• Workpackage 1 – Advanced Faceted Classification / Geo-spatial browser – 1m+ records; 4 primary facets (What, Where, When and Media).
• Workpackage 2 – Natural language processing / Data-mining of Grey Literature; plus tagging
• Workpackage 3 – Data-mining of Historic Literature; plus geoXwalk
Work package 1
• Datasets include:
  – National Monuments Records (Scotland, Wales, England)
  – Excavation Index (EH)
  – Archive Holdings
  – Local Authority Historic Environment Records
• Thesauri include:
  – Thesaurus of Monument Types (TMT)
  – Thesaurus of Object Types
  – MIDAS Period list
  – UK Government list of administrative areas, County, District, Parish (CDP) – Not MIDAS
[Diagram. Labelled components: Oracle RDBMS; MIDAS XML Record; Information Extraction; RDF Resource; Knowledge triple store; XML Docs of Thesaurus; Query; User Interface; When, Where, What ontologies as entries to faceted index; Input (labelled twice).]
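The idea of What, Where and When terms acting as entries to a faceted index over a triple store can be illustrated with a minimal sketch. It is not the project's implementation: the triples, record ids, facet names and the functions `build_facet_index` and `query` are all hypothetical, and plain Python tuples stand in for RDF.

```python
from collections import defaultdict

# Hypothetical sample triples: (record id, facet, term). Not project data.
TRIPLES = [
    ("rec1", "what", "HILLFORT"),
    ("rec1", "where", "North Yorkshire"),
    ("rec1", "when", "IRON AGE"),
    ("rec2", "what", "HILLFORT"),
    ("rec2", "where", "Dorset"),
    ("rec2", "when", "IRON AGE"),
    ("rec3", "what", "VILLA"),
    ("rec3", "where", "Dorset"),
    ("rec3", "when", "ROMAN"),
]


def build_facet_index(triples):
    """Map each (facet, term) pair to the set of record ids carrying it."""
    index = defaultdict(set)
    for record_id, facet, term in triples:
        index[(facet, term)].add(record_id)
    return index


def query(index, **selected):
    """Return the sorted ids of records matching every selected facet term."""
    matches = None
    for facet, term in selected.items():
        ids = index.get((facet, term), set())
        # Intersect so that each extra facet narrows the result set.
        matches = ids if matches is None else matches & ids
    return sorted(matches or [])


index = build_facet_index(TRIPLES)
print(query(index, what="HILLFORT"))                  # ['rec1', 'rec2']
print(query(index, what="HILLFORT", where="Dorset"))  # ['rec2']
```

Each additional facet selection intersects the current result set, which is the narrowing behaviour a faceted browser relies on.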
“WHAT”
• Records that have no subject information
• Records that use terms not found in TMT, so these records cannot be indexed (6,442 unique terms)
Of 1,001,407 records: 19,269 records (2%); 101,507 records (10.1%)
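The two problem cases for the “WHAT” facet (no subject term at all, or a term that is absent from the thesaurus) can be sketched as a simple classification pass. This is an illustrative sketch only: the thesaurus contents and the names `classify_subject` and `summarise` are hypothetical, and it assumes a record counts as indexable when at least one of its terms matches after trimming and upper-casing.

```python
from collections import Counter

# Hypothetical stand-in for the thesaurus; the real TMT is far larger.
SAMPLE_THESAURUS = {"HILLFORT", "VILLA", "BARROW"}


def classify_subject(record_terms, thesaurus):
    """Return 'no_subject', 'unmatched' or 'indexable' for one record."""
    terms = [t.strip().upper() for t in record_terms if t and t.strip()]
    if not terms:
        return "no_subject"
    # Assumption: one matching term is enough to index the record.
    if any(t in thesaurus for t in terms):
        return "indexable"
    return "unmatched"


def summarise(records, thesaurus):
    """Return {category: (count, percentage of all records)}."""
    counts = Counter(classify_subject(terms, thesaurus) for terms in records)
    total = len(records)
    return {k: (n, round(100 * n / total, 1)) for k, n in counts.items()}


records = [["hillfort"], [], ["hill fort?"], ["Villa", "bath house"]]
print(summarise(records, SAMPLE_THESAURUS))
```

A pass of this kind reports the share of records in each category and also isolates the unmatched terms that would need mapping to the thesaurus by hand.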
“WHEN”
• Records that have no temporal information
• Records that use period terms not found in MIDAS, so these records cannot be indexed (457 types of irresolvable dates)
Of 1,001,407 records: 292,793 records (29.2%); 114,505 records (11.4%)
1066, 1001-1100, 11th Centuary, C11, 11C, Eleventh Century
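The date variants listed above suggest why free-text dates resist indexing against a period list. A minimal sketch of one possible normalisation step follows, assuming each string can be reduced to a (start year, end year) range and that century N spans years (N-1)*100+1 to N*100. The function names, the patterns and the word table are illustrative assumptions, not the project's method.

```python
import re

# Illustrative table of spelled-out ordinals; extend as needed.
ORDINAL_WORDS = {
    "first": 1, "second": 2, "third": 3, "fourth": 4, "fifth": 5,
    "sixth": 6, "seventh": 7, "eighth": 8, "ninth": 9, "tenth": 10,
    "eleventh": 11, "twelfth": 12, "thirteenth": 13, "fourteenth": 14,
    "fifteenth": 15, "sixteenth": 16, "seventeenth": 17,
    "eighteenth": 18, "nineteenth": 19, "twentieth": 20,
}


def century_range(n):
    """Years covered by the n-th century AD, e.g. 11 -> (1001, 1100)."""
    return ((n - 1) * 100 + 1, n * 100)


def parse_date(text):
    """Return (start_year, end_year), or None if the string is unresolved."""
    s = text.strip().lower()
    m = re.fullmatch(r"(\d{1,4})\s*-\s*(\d{1,4})", s)   # "1001-1100"
    if m:
        return (int(m.group(1)), int(m.group(2)))
    if re.fullmatch(r"\d{3,4}", s):                     # "1066"
        return (int(s), int(s))
    m = re.fullmatch(r"c(\d{1,2})|(\d{1,2})c", s)       # "C11" or "11C"
    if m:
        return century_range(int(m.group(1) or m.group(2)))
    # "cent\w*" tolerates misspellings such as "Centuary".
    m = re.fullmatch(r"(\d{1,2})(?:st|nd|rd|th)\s+cent\w*", s)
    if m:
        return century_range(int(m.group(1)))
    m = re.fullmatch(r"([a-z]+)\s+cent\w*", s)          # "Eleventh Century"
    if m and m.group(1) in ORDINAL_WORDS:
        return century_range(ORDINAL_WORDS[m.group(1)])
    return None


for raw in ["1066", "1001-1100", "11th Centuary", "C11", "11C", "Eleventh Century"]:
    print(raw, "->", parse_date(raw))
```

Under these assumptions, five of the six variants resolve to (1001, 1100) and "1066" resolves to (1066, 1066), which falls inside that range. Strings that match no pattern return None and would be counted as irresolvable.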
“WHERE”
• Records that have no spatial information
• Records that use terms not found in CDP, so these records cannot be indexed.
Of 1,001,407 records: 11,126 records (1.1%); 245,601 records (24.5%)
linear
XML tagging of semantic content
CIDOC CRM
University Researchers
Local authority curators
http://ads.ahds.ac.uk/project/archaeotools/