BiographyNet: Project review, year 1, 18 September 2013, eScience Center

TRANSCRIPT

  • Slide 1
  • BiographyNet: Project review, year 1. eScience Center, 18 September 2013. BiographyNet Review Meeting, eScience centre, September 18th, 2013
  • Slide 2
  • Agenda: Project objectives and first-year results (Piek); Methodology and historian perspective (Serge); Model, conversions and interface (Niels); NLP tools and research (Antske); Discussion.
  • Slide 3
  • Starting point (http://www.biografischportaal.nl). Academic discipline of writing histories: computational tools are only marginally used; there is a long scholarly tradition of study by reading and of single-authored historical narratives, while more and more historical sources are digitally available. Project challenges: Computational thinking in history: narrative historians are not used to framing research problems in computational terms, while computer-science researchers understand little of the subtleties of historical analysis. Strong multidisciplinary cooperation between front runners in both fields, plus demonstrator development, to achieve a common understanding. Methodological and tool support.
  • Slide 4
  • Contribution to historical research: New research on Dutch nation building and a revaluation of biographical information. Bridging the gap between life histories, qualitative historical research and quantitative historical research. Opening up research on less static objects and relations such as events: the most important pieces of information, capturing the changes and processes that matter. Capturing the historiographic perspective: this requires a model that takes different framings of the same event into account. Adding to the who-knows-who: when, where and how did the lives of people cross; how did they affect each other's lives and the world they lived in? How do and did we conceive historic events, and how are different narratives created around the same history?
  • Slide 5
  • Expected outcome: Demonstrator on top of the Biography Portal, built through cyclic development. Links within the Biography Portal among the various (textual and visual) datasets. Open-source release of the e-science platform for analyzing biographical texts about people. Adherence to all relevant Web standards and APIs, maximizing reusability. Proposal for a methodology for the extraction of a network of relations between people and (historic) events.
  • Slide 6
  • Short-term goals: 1. Building a richer data repository by connecting different distributed sources of data through formalized links and metadata. 2. Detecting (co-referenced) named entities (persons, places and dates) and events. 3. Harmonizing the texts, which vary from 19th-century Dutch to contemporary Dutch and whose OCR-ed versions also contain errors. 4. Developing visualization and analytic tools, as well as computational historiographical methods, on the structured data generated in goals 1 through 3.
  • Slide 7
  • Results first year. Methodology: use cases and the anticipation of data- and process-driven biases; formal modeling of provenance; sustainability, replication, reproducibility. Software: design of interfaces and analytic tools; text mining and evaluation; Linked Data conversion scripts. Data: Linked Data version of the Portal; linking to Agora; discussions with Wikimedia/Wikipedia/DBpedia and Bibliotheek.nl; Verrijkt Koninkrijk; Huygens ING; exploitation to extend the Portal with the enriched data produced. 6 accepted papers.
  • Slide 8
  • BiographyNet and historical approaches to big and heterogeneous data. eScience Center, 18 September 2013
  • Slide 9
  • The historian's role: 1. Methodology: work on a methodology to extract information, relationships and events from short biographical texts. 2. Question the data: develop use cases. 3. Contribute to the design of a user interface that challenges historians to dig deeper into the data. 4. Sensitize target user groups (historians) to both the possibilities and the limitations of computational methods in historical research.
  • Slide 10
  • 1: Methodology. Year 1, historian's focus: how reliable and representative are the texts from this particular dataset? Which questions can and cannot be answered? How well do the tools perform compared to a real historian? See also the publications (below). Year 1, interdisciplinary focus: what is the provenance of the information, how is it manipulated in order to arrive at the answer to a query, and who is responsible for the tools that manipulate those data?
  • Slide 11
  • 2: Use cases. 12 cases developed, ranging from simple to highly complex. Simple: group analysis of the Governors-General of the Dutch Indies. More complex: when did Dutch elites get involved with the New World? Complex: what can we say about nationalism in biographical dictionaries from the nineteenth and twentieth centuries?
  • Slide 12
  • Governors-General of the Dutch Indies: the highest official in the Dutch Indies, 1610-1949; 71 men. What can we say about these men as a group? Who was appointed, and what qualities did he have to have? Etc.
  • Slide 13
  • 3: User-friendly interface. Mainly work in progress. Discussion about the impact of a design metaphor (such as time line, house of, building blocks for, family tree) on the type of questions raised by the user (see the presentation by Niels).
  • Slide 14
  • The House of History
  • Slide 15
  • Time line
  • Slide 16
  • Family Tree
  • Slide 17
  • 4: Sensitize target user groups. Publication in Tijdschrift voor Biografie (reaching the nearest target user group of the demonstrator): Serge ter Braake, Het individu en zijn tijdgenoten. Wat een biograaf kan doen met prosopografie en biografische woordenboeken, Tijdschrift voor Biografie 2 (summer 2013) vol. 2, 52-61. Biography and Computational Methods, joint paper in preparation by Ter Braake, Ockeloen and Fokkens (to be submitted before the end of the month to Journal for Historical Biography). Research on nationalism and national biographies, to be published in 2014.
  • Slide 18
  • 4: Sensitize target user groups. Presentation at Huygens ING, 10 October 2013 (for circa 50 professional historians). Presentation on provenance at the KNAW Digital Humanities Workshop, 14-15 November 2013. Introduction to e-Humanities in the current curriculum of BA1 students at the Vrije Universiteit (what is e-Humanities, how does one use a source like the Oxford Dictionary of National Biography?). Design and development of a series of electives and a minor on e-history and e-humanities (BA 2-3; starting 2014/2015). The BiographyNet dataset will be used in a lab for history bachelor students.
  • Slide 19
  • BiographyNet: Towards the demonstrator. eScience Center, 18 September 2013
  • Slide 20
  • Overview. Main components of the demonstrator: schema to structure the data; conversion of the Biography Portal (BP) to Linked Data; NLP system setup; interface.
  • Slide 21
  • A crash course on Linked Data: online machine-readable data with links. Simple facts are called RDF triples, e.g. Thorbecke > hasBirthPlace > Zwolle. Some technology concepts: schemas, to structure Linked Data (LD); RDF stores, to store LD; SPARQL, to access LD. Huge growth in the past years: more than 300 data sources and more than 30 billion triples.
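To make the triple idea concrete, here is a minimal sketch in plain Python (no RDF library). Only the Thorbecke > hasBirthPlace > Zwolle fact comes from the slide; the other predicates (`hasBirthYear`, `locatedIn`) and the `match` helper are illustrative assumptions. A `None` in the pattern behaves roughly like a variable in a SPARQL triple pattern.

```python
# Triples are (subject, predicate, object) tuples.
# Only the first triple is taken from the slide; the rest are illustrative.
TRIPLES = [
    ("Thorbecke", "hasBirthPlace", "Zwolle"),
    ("Thorbecke", "hasBirthYear", "1798"),    # hypothetical predicate name
    ("Zwolle", "locatedIn", "Netherlands"),   # hypothetical triple
]


def match(triples, s=None, p=None, o=None):
    """Return all triples matching the pattern; None acts as a wildcard."""
    return [
        t for t in triples
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    ]


# Roughly: SELECT ?place WHERE { Thorbecke hasBirthPlace ?place }
birth_places = [o for _, _, o in match(TRIPLES, s="Thorbecke", p="hasBirthPlace")]
print(birth_places)
```

A real deployment would use an RDF store and SPARQL, as the slide says; the list-and-filter version only shows the shape of the data and of a query.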
  • Slide 22
  • The conversion process: purely syntactic conversion. Preserve the original structure of the data. Prevent loss of information. Allow for reinterpretation of the original data in the future.
  • Slide 23
  • The conversion process. Conversion steps: retrieval of the XML dump of the Biography Portal; initial conversion to crude RDF, using ClioPatria and the XMLRDF tool for ClioPatria; RDF restructuring; linking to other sources (an essential step in the Linked Data philosophy).
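The "initial conversion to crude RDF" step can be sketched as a purely syntactic walk over the XML tree. This is not the XMLRDF tool for ClioPatria, just an illustration of the idea in Python; the XML layout, the `value` predicate and the blank-node naming are assumptions, since the slides do not show the Portal's actual format.

```python
import xml.etree.ElementTree as ET

# Hypothetical record; the real Biography Portal XML layout is not shown in the slides.
XML = """<biography id="bio1">
  <person>
    <name>Johan Rudolph Thorbecke</name>
    <birth when="1798-01-14" place="Zwolle"/>
  </person>
</biography>"""


def crude_triples(elem, subject, counter=None):
    """Emit one triple per attribute, text node and child element, keeping the XML structure."""
    counter = counter if counter is not None else [0]
    triples = []
    for key, value in elem.attrib.items():
        triples.append((subject, key, value))
    text = (elem.text or "").strip()
    if text:
        triples.append((subject, "value", text))
    for child in elem:
        counter[0] += 1
        node = f"_:b{counter[0]}"  # blank node standing in for the child element
        triples.append((subject, child.tag, node))
        triples.extend(crude_triples(child, node, counter))
    return triples


root = ET.fromstring(XML)
triples = crude_triples(root, "bio1")
for t in triples:
    print(t)
```

Because nothing is interpreted at this stage, the original structure survives and can be reinterpreted later; restructuring and linking would be separate steps.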
  • Slide 24
  • The conversion process. Data schema: based on the structure of the original XML files. It needs to facilitate the coupling of different biographies of the same person without compromising the original data. It needs to facilitate the incorporation of several enrichments, following from NLP, entity reconciliation, etc. It is compatible with existing schemas such as the Europeana Data Model, PROV, P-PLAN, DC terms, etc.
  • Slide 25
  • BiographyNet: Schema illustration http://www.biographynet.nl/schema
  • Slide 26
  • Provenance: what is it? Provenance information is information on how entities come into existence. What are entities? Documents, articles, pictures, etc.; basically anything that can be produced by something or someone. What kind of information? Who did what? Using which entities? In which processes?
  • Slide 27
  • Provenance in BiographyNet. For the demonstrator, provenance needs to be modeled from several perspectives (information involved, processes involved, people involved) and at multiple levels (an aggregated level, i.e. per enrichment; a detailed level, i.e. all individual processes).
  • Slide 28
  • Why is provenance info important for BiographyNet? It is needed to ensure the credibility of the demonstrator, to evaluate its performance and to improve the academic status of the tool. Historians need to be able to validate results. Replication: retrieving the same results later using the demonstrator. Reproducibility: manually, by the historian. The aggregated level is targeted at the historian: which original sources were involved? Who should be contacted if results are called into question? The detailed level is targeted at the computer scientist: detailed information on each individual step, which allows for debugging the internal processing pipeline.
  • Slide 29
  • BiographyNet enrichment example (diagram). Source text: "Johan Rudolph Thorbecke werd in 1798 geboren op 14 januari in Zwolle en komt uit een half-Duitse ..." (Johan Rudolph Thorbecke was born in 1798, on 14 January, in Zwolle, and comes from a half-German ...). Diagram labels: Thorbecke; Biographical Description; Provenance Meta Data (NNBW); Person Meta Data; Biography Parts; Event: Birth, 1798; Enrichment (NLP Tool); Person Meta Data; Event: Birth, Zwolle, 1798-01-14.
  • Slide 30
  • More than just provenance. P-PLAN is used to model not only what actually happened, but also what was supposed to happen. Plans describe the original idea behind an activity: what should happen in that activity. Each Plan corresponds with an Activity. Variables describe the input/output of an activity (structure, format, quantity, etc.). Each Variable corresponds with an input/output Entity of an Activity. Plans have their own provenance info, e.g. who was responsible for the creation of a plan?
  • Slide 31
  • Why model plans besides provenance? The benefits of modeling plans: it forces the recording of what an activity and its input/output should look like. It provides information on the original idea behind an activity and, as such, can provide info on possible assumptions and biases. It allows for comparing the actual activity and its input/output with the original plan and its variables: do they differ from each other, and to what extent? It makes finding errors much easier, as more information is available about what the input/output should look like.
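The comparison between a plan and what actually happened can be sketched as follows. This is a simplified stand-in in Python, not the P-PLAN vocabulary itself: the `Variable` and `Entity` classes, their fields and the example names are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Variable:
    """What the plan says an input or output should look like (hypothetical fields)."""
    name: str
    expected_format: str


@dataclass(frozen=True)
class Entity:
    """What the activity actually consumed or produced (hypothetical fields)."""
    name: str
    actual_format: str


def deviations(plan_variables, entities):
    """List the differences between planned variables and the actual entities."""
    actual = {e.name: e for e in entities}
    problems = []
    for var in plan_variables:
        ent = actual.get(var.name)
        if ent is None:
            problems.append(f"{var.name}: missing")
        elif ent.actual_format != var.expected_format:
            problems.append(
                f"{var.name}: expected {var.expected_format}, got {ent.actual_format}"
            )
    return problems


plan = [Variable("biography_text", "plain text"), Variable("birth_event", "RDF")]
run = [Entity("biography_text", "plain text"), Entity("birth_event", "XML")]
report = deviations(plan, run)
print(report)
```

Recording the plan separately is what makes this check possible: without it there is nothing to compare the actual input/output against.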
  • Slide 32
  • BiographyNet: Schema illustration
  • Slide 33
  • Activity Plan Entity Variable Agent Association Activity Plan Person NLP Tool
  • Slide 34
  • Recap / current status. Main components of the demonstrator: Initial schema available (publication LISC @ISWC 2013): the schema models enrichments and aggregations alongside original sources and allows for storing various levels of provenance information; the model will be adapted while progressing with building the demonstrator. Initial conversion to Linked Data available: structured according to the schema presented; the next step is linking to external sources. NLP system setup available (Antske). Interface: presentation of general outline and ideas.
  • Slide 35
  • Interface: focus. The interface should be easy to use. The demonstrator should inspire historians to undertake new research and give direction, rather than being the end point of their research. The interface should allow users to fine-tune the results returned upon an initial action.
  • Slide 36
  • Interface: options. Query composition; faceted browsing; a combination.
  • Slide 37
  • Interface: query composition. Drop-down boxes to select verbs, data elements and relations.
  • Slide 38
  • Interface: faceted browsing. No explicit querying, but convergence of the data through browsing and selecting. Provides better feedback to the user. Allows for more direct and easier adjustment of the selected data.
  • Slide 39
  • Slide 40
  • Interface: a combination. Query composition combined with faceted browsing: create new facets by defining a query. The result of the query is available as a subset of the data by selecting the defined facet and, as such, is combinable with other facets. This is a method to integrate open querying of the data into a general interface and visualization.
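One way to read "create new facets by defining a query" is that a facet is simply a predicate over records, so a composed query and an ordinary value facet can be combined by intersection. The sketch below illustrates that reading in Python; the records, field names and helper functions are all hypothetical and not taken from the demonstrator.

```python
# Hypothetical records; the field names are illustrative only.
RECORDS = [
    {"id": "p1", "birth_year": 1798, "birth_place": "Zwolle", "occupation": "politician"},
    {"id": "p2", "birth_year": 1650, "birth_place": "Amsterdam", "occupation": "governor-general"},
    {"id": "p3", "birth_year": 1801, "birth_place": "Zwolle", "occupation": "governor-general"},
]


def value_facet(field, value):
    """An ordinary browsing facet: keep records whose field equals the value."""
    return lambda rec: rec.get(field) == value


def query_facet(predicate):
    """A facet defined by a composed query; here the query is just a predicate."""
    return predicate


def select(records, *facets):
    """Selecting several facets narrows the data to their intersection."""
    return [rec for rec in records if all(f(rec) for f in facets)]


born_in_19th_century = query_facet(lambda rec: 1801 <= rec["birth_year"] <= 1900)
subset = select(RECORDS, born_in_19th_century, value_facet("birth_place", "Zwolle"))
print([rec["id"] for rec in subset])
```

Treating both kinds of facet uniformly is what lets open querying sit inside the same browsing interface.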
  • Slide 41
  • Question Analysis Selection Process Results Data Facets
  • Slide 42
  • Interface: demonstrator. Time and place are primary elements. Results?
  • Slide 43
  • Slide 44
  • BiographyNet: Text Mining. eScience Center, 18 September 2013
  • Slide 45
  • First-year goals for text mining. Methodology: requirements; approach. Basic system for data enrichment in text: identify metadata in text; a setup that can easily be improved and extended: (co-referenced) named entities, events; deal with alternative spelling.
  • Slide 46
  • Methodology requirements. Reproducing results in Natural Language Processing is non-trivial. Details in implementations or experimental setup can influence results up to a point where they tell a different story.
  • Slide 47
  • Reproducing results. Example: performance of WordNet similarity scores compared to human ranking.
  • Slide 48
  • Reproducing results. Clear registration of all steps involved and storage of (intermediate) system output can improve reproducibility. Systematic testing can help to gain insight into the variation in the outcome of our systems and hence lead to more insight into their performance. Antske Fokkens, Marieke van Erp, Marten Postma, Ted Pedersen, Piek Vossen and Nuno Freire (2013) Offspring from Reproduction Problems: What Replication Failure Teaches Us. In: Proceedings of ACL 2013, Sofia, Bulgaria, August 2013.
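"Clear registration of all steps" and "storage of (intermediate) system output" could look roughly like the sketch below: each step is logged with its name, a version string, its output and a hash of that output, so two runs can be compared step by step. The step names, version strings and the `run_pipeline` helper are assumptions for illustration, not the project's actual pipeline.

```python
import hashlib
import json


def run_pipeline(text, steps):
    """Run steps in order, logging each step's name, version, output and output hash."""
    log = []
    data = text
    for name, version, func in steps:
        data = func(data)
        payload = json.dumps(data, ensure_ascii=False, sort_keys=True)
        log.append({
            "step": name,
            "version": version,
            "output": data,  # stored intermediate output
            "sha256": hashlib.sha256(payload.encode("utf-8")).hexdigest(),
        })
    return data, log


# Hypothetical two-step pipeline.
STEPS = [
    ("tokenize", "0.1", str.split),
    ("lowercase", "0.1", lambda tokens: [t.lower() for t in tokens]),
]

result, log = run_pipeline("Thorbecke werd in 1798 geboren", STEPS)
rerun_result, rerun_log = run_pipeline("Thorbecke werd in 1798 geboren", STEPS)
print(result)
print([entry["sha256"] == again["sha256"] for entry, again in zip(log, rerun_log)])
```

If a later run produces a different hash at some step, the stored intermediate output shows where the two runs started to diverge.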
  • Slide 49
  • Methodology requirements. The method used to extract information may introduce a bias that has an unintended influence on the outcome of the historian's questions. For example, location identification with GeoNames. Heuristic: when there are multiple locations with the same name, take the one in or closest to the Netherlands. High precision, but consider 'America' or 'Willemstad': what if the historian investigates trips to the Netherlands by officials overseas?
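The heuristic, and the bias it introduces, can be illustrated with a small Python sketch. It does not call GeoNames; the candidate list, the rounded coordinates and the reference point chosen for "the Netherlands" are assumptions made for the example.

```python
import math

NL_REFERENCE = (52.1, 5.3)  # assumed reference point for "the Netherlands" (lat, lon)

# Hypothetical candidates for the name "Willemstad", with approximate coordinates.
CANDIDATES = [
    {"label": "Willemstad, Noord-Brabant", "coords": (51.69, 4.44)},
    {"label": "Willemstad, Curacao", "coords": (12.11, -68.93)},
]


def haversine_km(a, b):
    """Great-circle distance in kilometres between two (lat, lon) points."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*a, *b))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(h))


def closest_to_nl(candidates):
    """The heuristic from the slide: among same-named places, pick the one nearest the Netherlands."""
    return min(candidates, key=lambda c: haversine_km(NL_REFERENCE, c["coords"]))


choice = closest_to_nl(CANDIDATES)
print(choice["label"])
```

The rule always resolves the name to the Dutch town, which is usually right for this corpus but systematically wrong for a biography of an official stationed overseas; that is the bias the slide warns about.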
  • Slide 50
  • Methodology requirements. Maximize reuse of existing tools for BiographyNet. Maximize reuse, by other researchers, of tools developed within BiographyNet. How can we create a setup that facilitates this?
  • Slide 51
  • Methodology approach. Provenance modeling can help to improve the reproducibility of research, can support systematic testing and can model the exact steps taken. Flexible formats that support this: the NLP Annotation Format (NAF), to manage the output and input of NLP tools; the Grounded Annotation Framework (GAF), for the final output of the NLP pipeline.
  • Slide 52
  • NLP Annotation Format. Sustainable, because it is close to existing linguistic formats (e.g. LAF, GRAF, NIF). Joint work across projects and with other institutes (notably University of the Basque Country, Fondazione Bruno Kessler). Flexible, because the output of individual tools is added in separate layers.
  • Slide 53
  • Grounded Annotation Framework. An RDF-compliant framework. Introduces the denotedBy relation, which links mentions in text to formal representations of their instances. Provenance is marked using Named Graphs. This allows us to accumulate information from different sources and represent alternative perspectives.
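A rough way to picture named graphs is to add a fourth element to each triple that says which graph the statement belongs to, and to attach provenance to the graph rather than to each statement. The Python sketch below uses plain tuples instead of an RDF library; the identifiers, the predicate names other than denotedBy, the character-offset mention and the direction in which denotedBy is written (instance to mention) are assumptions for illustration. The two birth values mirror the enrichment example earlier in the deck (metadata: 1798; NLP tool: Zwolle, 1798-01-14).

```python
from collections import defaultdict

# (subject, predicate, object, graph): the fourth element names the graph.
QUADS = [
    ("ev:birth_Thorbecke", "hasDate", "1798", "g:metadata"),
    ("ev:birth_Thorbecke", "hasDate", "1798-01-14", "g:nlp"),
    ("ev:birth_Thorbecke", "hasPlace", "Zwolle", "g:nlp"),
    # Hypothetical mention identifier pointing into the source text.
    ("ev:birth_Thorbecke", "denotedBy", "doc1#char=31,38", "g:nlp"),
]

# Provenance is attached once per graph.
GRAPH_SOURCE = {"g:metadata": "Person metadata", "g:nlp": "NLP tool"}


def claims(quads, subject, predicate):
    """Group the values claimed for a subject/predicate by the source of their graph."""
    by_source = defaultdict(list)
    for s, p, o, g in quads:
        if s == subject and p == predicate:
            by_source[GRAPH_SOURCE[g]].append(o)
    return dict(by_source)


birth_dates = claims(QUADS, "ev:birth_Thorbecke", "hasDate")
print(birth_dates)
```

Both values are kept side by side with their source, which is the point of accumulating information from different sources rather than overwriting one with the other.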
  • Slide 54
  • Slide 55
  • Provenance modeling. It must be clear where information comes from (original source, opinion holder, automatically retrieved or from metadata). For NLP research: model each step of the process; resources used (preprocessing + version); system output. For historical research: what may introduce biases? How can the process be represented in an understandable manner?
  • Slide 56
  • Basic system. Identifying metadata in text: linguistically naive supervised machine learning. Linguistic processing: named entity recognition (time and location); concept identification.
  • Slide 57
  • First evaluation. Use case: Governors-General of the Dutch Indies. 129 biographies describing 71 individuals. Serge ter Braake extracted the information manually.
  • Slide 58
  • Metadata versus text mining
  • Slide 59
  • Preliminary outcome of text mining. Table columns: Category | Correct | Incorrect | Both | Correct text | Incorrect text. Rows: Education 202; Father 0092; Mother 0125; Occupation 1462214; Birthdate 212359. Recall problems (for birthdate): 1. Sentence not found (35): typical for wikipedia, bwn, vdaa. 2. Value not found (7). 3. Wrong sentence (1), wrong date (1): date of marriage, date of death.
  • Slide 60
  • Observations. Recall problems (for birthdate): sentence identification. Easy ways to improve: parents: named entity recognition; occupation, education: concept-tagged corpus; source-specific training. More difficult problems: relations and functions of other people; negations or factuality (e.g. refused positions for occupations).
  • Slide 61
  • NLP outlook. Evaluation: text-based annotations. Metadata extraction: supervised with linguistically rich features; rule-based approaches. Beyond metadata: time lines of people's lives (2nd year); networks between people (2nd year); complex event modeling (3rd year).
  • Slide 62
  • Questions? http://www.biographynet.nl/ eScience Center, 18 September 2013