BiographyNet Project Review, Year 1, September 18th, 2013

Biography Portal of the Netherlands. The Data

BiographyNet

Project review, year 1
eScience Center, 18 September 2013

BiographyNet Review Meeting, eScience centre, September 18th, 2013

Agenda
- Project objectives and first year results (Piek)
- Methodology and historian perspective (Serge)
- Model, conversions and interface (Niels)
- NLP tools and research (Antske)
- Discussion

Starting point
http://www.biografischportaal.nl
Academic discipline of writing histories:
- computational tools marginally used
- long scholarly tradition of study by reading
- single-authored historical narratives
- while more and more historical sources are digitally available
Project challenges:
- Computational thinking in history: narrative historians are not used to framing research problems in computational terms, while computer-science researchers understand little of the subtleties of historical analysis
- Strong multi-disciplinary cooperation of front runners in both fields and demonstrator development to achieve common understanding
- Methodological and tool support

Contribution to historical research
- New research on Dutch nation building and a revaluation of biographical information.
- Bridging a gap between life histories, qualitative historical research, and quantitative historical research.
- Open research on less static objects and relations such as events: the most important pieces of information, capturing changes and processes that matter.
- Capture historiographic perspective: requires a model that takes different framings of the same event into account.
- Adds to the who-knows-who: when, where and how did the lives of people cross; how did they affect each other's lives and the world they lived in?
- How do and did we conceive historic events, and how are different narratives created around the same history?

Expected outcome
- Demonstrator on top of the Biography Portal. Cyclic development.
- Links within the Biography Portal among the various (textual and visual) datasets.
- Open-source release of the e-science platform for analyzing biographical texts about people.
- Adherence to all relevant Web standards and APIs, maximizing reusability.
- Proposal for a methodology for extraction of a network of relations between people and (historic) events.

Short term goals
1. Building a richer data repository by connecting different distributed sources of data through formalized links and metadata.
2. Detection of (co-referenced) named entities (persons, places and dates) and events.
3. Harmonize the texts, which vary from 19th century Dutch to contemporary Dutch and where the OCR-ed texts also contain errors.
4. Development of visualization, analytic tools, as well as computational historiographical methods on the structured data that is generated for 1. through 3.

Results first year
Methodology:
- Use cases and the anticipation of data- and process-driven biases
- Formal modeling of provenance
- Sustainability, replication, reproducibility
Software:
- Design of interfaces and analytic tools
- Text mining and evaluation
- Linked Data conversion scripts
Data:
- Linked Data version of the Portal
- Linking to Agora
- Discussions with Wikimedia/Wikipedia/DBpedia & Bibliotheek.nl
- Verrijkt Koninkrijk
- HuygensING exploitation to extend the Portal with enriched data produced
6 accepted papers

BiographyNet and historical approaches to big and heterogeneous data

The historian's role
- Methodology: work on a methodology to extract information, relationships and events from short biographical texts
- Question the data: develop use cases
- Contribute to the design of a user interface that challenges historians to dig deeper into the data
- Sensitize target user groups (historians) to both the possibilities and the limitations of computational methods in historical research

1: Methodology
Year 1 - Historian's focus: how reliable and representative are the texts from this particular dataset? Which questions can and cannot be answered? How well do tools perform, as compared to the performance of a real historian? See also publications (below).

Year 1 - Interdisciplinary focus: what is the provenance of the information, how is it manipulated in order to arrive at the answer to a query, and who is responsible for the tools that manipulate those data?

2: Use Cases
12 cases developed, ranging from simple to highly complex
- Simple: group analysis of Governors-General of the Dutch Indies
- More complex: when did Dutch elites get involved with the New World?
- Complex: what can we say about nationalism in biographical dictionaries from the nineteenth and twentieth century?

Governors-General of the Dutch Indies
- Highest official in the Dutch Indies, 1610-1949
- 71 men
- What can we say about these men as a group?
- Who was appointed and what qualities did he have to have? Etc.

3: User friendly interface
Mainly work in progress.

Discussion about the impact of a design metaphor (like time line, house of, building blocks for, family tree) on the type of questions raised by the user: see presentation Niels.

Design metaphors shown: The House of History, Time line, Family Tree

4: Sensitize target user groups
Publication in Tijdschrift voor Biografie (reaching the nearest target user group of the demonstrator):

Serge ter Braake, Het individu en zijn tijdgenoten. Wat een biograaf kan doen met prosopografie en biografische woordenboeken, Tijdschrift voor Biografie 2 (summer 2013) vol. 2, 52-61.

Biography and Computational Methods, joint paper in preparation, to be submitted before the end of the month to Journal for Historical Biography (Ter Braake, Ockeloen and Fokkens).

Research on nationalism and national biographies, to be published in 2014.

Further outreach:
- Presentation at Huygens ING, 10 October 2013 (for circa 50 professional historians)
- Presentation on provenance at KNAW Digital Humanities Workshop, 14-15 November 2013
- Introduction in e-Humanities in the current curriculum of BA1 students at the Vrije Universiteit (what is e-Humanities, how does one use a source like the Oxford Dictionary of National Biography?)
- Design and development of a series of electives and a minor on e-history and e-humanities (BA 2-3; starting 2014/2015). The BiographyNet dataset will be used in a lab for history bachelor students.

BiographyNet: Towards the demonstrator

Main components of the demonstrator
- Schema to structure the data
- Conversion of the Biography Portal (BP) to Linked Data
- NLP system setup
- Interface

A crash course on Linked Data

Overview
- Online machine readable data with links
- Simple facts called RDF Triples: Thorbecke > hasBirthPlace > Zwolle

Some technology concepts:
- Schemas: to structure Linked Data (LD)
- RDF Stores: to store LD
- SPARQL: to access LD

Huge growth in the past years:
- More than 300 data sources
- More than 30 billion triples
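To make the triple idea concrete, here is a minimal sketch in plain Python rather than a real RDF store. Only the triple Thorbecke > hasBirthPlace > Zwolle comes from the slides; the `hasBirthYear` predicate and the `match` helper are assumptions made for illustration, and the wildcard lookup only imitates what a SPARQL query would do.

```python
# A toy "RDF store": a set of (subject, predicate, object) triples.
# hasBirthPlace is taken from the slide; hasBirthYear and match() are
# hypothetical names introduced for this sketch.
triples = {
    ("Thorbecke", "hasBirthPlace", "Zwolle"),
    ("Thorbecke", "hasBirthYear", "1798"),
}


def match(store, subject=None, predicate=None, obj=None):
    """Return all triples that fit the pattern; None acts as a wildcard,
    playing the role of a variable in a SPARQL query."""
    return sorted(
        t for t in store
        if (subject is None or t[0] == subject)
        and (predicate is None or t[1] == predicate)
        and (obj is None or t[2] == obj)
    )


# Roughly: SELECT ?place WHERE { :Thorbecke :hasBirthPlace ?place }
birth_places = [o for _, _, o in match(triples, "Thorbecke", "hasBirthPlace")]
print(birth_places)
```

A real setup would use an RDF store and SPARQL endpoint; the point here is only that every fact is a three-part statement that can be queried by pattern.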

The conversion process: Data Preservation
- Purely syntactic conversion
- Preserve the original structure of the data
- Prevent loss of information
- Allow for reinterpretation of the original data in the future

The conversion process: Conversion steps
- Retrieval of XML dump of the Biography Portal
- Initial conversion to crude RDF, using ClioPatria and the XMLRDF tool for ClioPatria
- RDF restructuring
- Linking to other sources: essential step in the Linked Data philosophy
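The idea of a purely syntactic first conversion can be sketched as follows. This is not the ClioPatria XMLRDF tool; it is a small Python illustration, and the XML fragment, the element names and the `crude_triples` function are all hypothetical. The real Biography Portal dump may be structured differently.

```python
import xml.etree.ElementTree as ET

# Hypothetical XML fragment, invented for this sketch.
XML_DUMP = """
<biography id="bio1">
  <person>
    <name>Thorbecke</name>
    <birth place="Zwolle" when="1798-01-14"/>
  </person>
</biography>
"""


def crude_triples(element, node_id, out):
    """Walk the XML tree and emit one triple per attribute, text value and
    child element, mirroring the XML structure without interpreting it."""
    for key, value in element.attrib.items():
        out.append((node_id, key, value))
    text = (element.text or "").strip()
    if text:
        out.append((node_id, "value", text))
    for index, child in enumerate(element):
        child_id = f"{node_id}/{child.tag}{index}"
        out.append((node_id, child.tag, child_id))
        crude_triples(child, child_id, out)
    return out


root = ET.fromstring(XML_DUMP)
triples = crude_triples(root, root.tag, [])
for triple in triples:
    print(triple)
```

Because nothing is interpreted or dropped at this stage, the original structure stays recoverable, and restructuring or linking can be redone later with a different reading of the data.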

The conversion process: Data schema
- Based on the structure of the original XML files
- Needs to facilitate the coupling of different biographies of the same person, without compromising the original data
- Needs to facilitate the incorporation of several enrichments, following from NLP, Entity Reconciliation, etc.
- Compatible with existing schemas such as the Europeana Data Model, PROV, P-PLAN, DC terms, etc.

BiographyNet: Schema illustration
http://www.biographynet.nl/schema

Provenance: What is it?
Provenance information is information on how Entities come into existence.
What are entities?
- Documents, Articles, Pictures, etc.
- Basically anything that can be produced by something or someone
What kind of information?
- Who did what?
- Using which entities?
- In which processes?

Provenance in BiographyNet
For the demonstrator, provenance needs to be modeled:
From several perspectives:
- Information involved
- Processes involved
- People involved
At multiple levels:
- An aggregated level, i.e. per enrichment
- Detailed level, i.e. all individual processes

Why is provenance info important for BiographyNet?
- Needed to ensure credibility of the demonstrator, to evaluate its performance and to improve the academic status of the tool
- Historians need to be able to validate results
  - Replication: retrieving the same results later using the demonstrator
  - Reproducibility: manually by the historian
- The aggregated level: targeted at the historian
  - Which original sources were involved?
  - Who to contact in case results are called into question?
- The detailed level: targeted at the computer scientist
  - Detailed information on each individual step
  - Allows for debugging the internal processing pipeline
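The two provenance levels described above can be illustrated with a small Python sketch. The record types below are loosely inspired by the PROV notions of Entity, Activity and Agent, but the class names, fields and example steps are assumptions for illustration and not the BiographyNet schema itself.

```python
from dataclasses import dataclass, field


@dataclass
class Activity:
    """One individual processing step (detailed level)."""
    name: str
    agent: str        # who, or which tool, carried out the step
    used: list        # names of input entities
    generated: list   # names of output entities


@dataclass
class Enrichment:
    """A group of steps that together produce one enrichment."""
    name: str
    steps: list = field(default_factory=list)

    def aggregated(self):
        """Aggregated view for the historian: original sources, final
        results and agents involved, hiding intermediate entities."""
        produced = {e for step in self.steps for e in step.generated}
        used = {e for step in self.steps for e in step.used}
        return {
            "sources": sorted(used - produced),
            "results": sorted(produced - used),
            "agents": sorted({step.agent for step in self.steps}),
        }


# Hypothetical two-step pipeline.
enrichment = Enrichment("birth-extraction", [
    Activity("tokenize", "tokenizer", ["NNBW biography text"], ["tokens"]),
    Activity("extract-birth", "NLP tool", ["tokens"], ["birth event"]),
])
summary = enrichment.aggregated()
print(summary)
```

The detailed level stays available in `enrichment.steps` for debugging, while the aggregated view answers the historian's questions about which sources were involved and who to contact.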

BiographyNet: Enrichment example
[Diagram: enrichment example for Thorbecke. Labels: Biographical Description, Provenance Meta Data, NNBW, Person Meta Data, Biography Parts, Event (Birth, 1798); Enrichment, NLP Tool, Person Meta Data, Event (Birth, Zwolle, 1798-01-14). Source text shown: "Johan Rudolph Thorbecke werd in 1798 geboren op 14 januari in Zwolle en komt uit een half-Duitse" (Johan Rudolph Thorbecke was born in 1798 on 14 January in Zwolle and comes from a half-German)]

More than just Provenance
P-PLAN is not only used to model what actually happened, but also what was supposed to happen.
- Plans describe the original idea behind an activity
  - Describe what should happen in a certain activity
  - Each Plan corresponds with an Activity
- Variables describe the input/output of an activity
  - Structure, format, quantity, etc.
  - Each Variable corresponds with an input/output Entity of an Activity
- Plans have their own provenance info
  - E.g. who was responsible for the creation of a plan?

Why model plans besides provenance?
The benefits of modeling plans:
- Forces the recording of what an activity and its input/output should look like
- Provides information on the original idea behind an activity
  - As such, can provide info on possible assumptions and biases
- Allows for comparing between the actual activity and its input/output and the original plan and its variables
  - Do they differ from each other and to what extent?
- Makes finding errors much easier, as more information is available about what the input/output should look like
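Comparing what actually happened with what was planned can be sketched in a few lines of Python. The `Variable` and `Entity` classes below are simplified, hypothetical stand-ins for the P-PLAN concepts named on the slides, and the formats and file names are invented for the example.

```python
from dataclasses import dataclass


@dataclass
class Variable:
    """Planned input or output of an activity."""
    name: str
    expected_format: str


@dataclass
class Entity:
    """Actual input or output of an executed activity."""
    name: str
    actual_format: str
    corresponds_to: str   # name of the plan Variable it fills


def deviations(variables, entities):
    """List where the executed activity differs from its plan."""
    by_variable = {e.corresponds_to: e for e in entities}
    problems = []
    for var in variables:
        entity = by_variable.get(var.name)
        if entity is None:
            problems.append(f"{var.name}: no entity produced or used")
        elif entity.actual_format != var.expected_format:
            problems.append(
                f"{var.name}: expected {var.expected_format}, "
                f"got {entity.actual_format}"
            )
    return problems


# Hypothetical plan and run for a birth-date extraction step.
plan = [Variable("input_text", "plain text"),
        Variable("birth_date", "YYYY-MM-DD")]
run = [Entity("biography.txt", "plain text", "input_text"),
       Entity("extracted date", "YYYY", "birth_date")]
problems = deviations(plan, run)
print(problems)
```

Here the run yields only a year where the plan expected a full date, which is the kind of difference between plan and execution that becomes easy to spot once both are recorded.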

BiographyNet: Schema illustration
[Diagram labels: Activity, Plan, Entity, Variable, Agent, Association, Person, NLP Tool]

Recap / Current Status
Main components of the demonstrator:
- Initial schema available (publication LISC @ISWC 2013)
  - Schema models enrichments and aggregations alongside original sources
  - Allows for storing various levels of provenance information
  - Model will be adapted while progressing with building the demonstrator
- Initial conversion to Linked Data available
  - Structure according to schema presented
  - Next step is linking to external sources
- NLP system setup available (Antske)
- Interface
  - Presentation of general outline and ideas

Interface: Focus
- The interface should be easy to use
- The demonstrator should inspire historians to undertake new research and give direction, rather than being the closing factor in their research
- The interface should allow users to fine-tune results returned upon an initial action

Interface: Options
- Query composition
- Faceted browsing
- A combination

Interface: Query composition
- Drop down boxes to select verbs, data elements and relations

Interface: Faceted browsing
- No explicit querying, but convergence of the data through browsing and selecting
- Provides better feedback to the user
- Allows for more direct and easier adjustment of the selected data

Interface: A combination
- Query composition combined with faceted browsing
- Create new facets by defining a query
- The result of the query is available as a subset of the data by selecting the defined facet
- As such, combinable with other facets
- Method to integrate open querying of the data into a general interface and visualization

Interface: A combination
[Diagram labels: Question Analysis, Selection, Process, Results, Data]

Interface: Demonstrator
- Facets
- Time and place are primary elements
- Results?
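The combination of query composition and faceted browsing can be sketched as facets that are simply named filters, where a user-defined query becomes one more facet. Everything in this Python example is hypothetical: the records are toy data, and `define_facet` and `select` are names invented for the sketch, not part of the demonstrator.

```python
# Toy records; names and values are invented for illustration.
people = [
    {"name": "Person A", "birth_year": 1798, "birth_place": "Zwolle"},
    {"name": "Person B", "birth_year": 1820, "birth_place": "Batavia"},
    {"name": "Person C", "birth_year": 1850, "birth_place": "Zwolle"},
]

# A facet is a named predicate over a record. Place is predefined here.
facets = {
    "born in Zwolle": lambda p: p["birth_place"] == "Zwolle",
}


def define_facet(name, query):
    """Register the result of a composed query as a new, selectable facet."""
    facets[name] = query


def select(records, *facet_names):
    """Faceted browsing: keep the records that satisfy every selected facet."""
    return [r for r in records if all(facets[f](r) for f in facet_names)]


# The user composes a query on time and saves it as a facet ...
define_facet("born before 1830", lambda p: p["birth_year"] < 1830)

# ... which can then be combined with existing facets.
selected = [p["name"] for p in select(people, "born in Zwolle", "born before 1830")]
print(selected)
```

Treating query results as facets keeps the browsing model uniform: the user narrows the data by selecting facets, whether they were predefined or composed on the spot.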

BiographyNet

Text Mining

First year goals for Text Mining
- Methodology
  - Requirements
  - Approach
- Basic system for data enrichment in text
  - Identify metadata in text
  - Setup that can easily be improved and extended
- (Co-referenced) named entities, events
- Deal with alternative spelling

Methodology requirements
- Reproducing results in Natural Language Processing is non-trivial
- Details in implementations or experimental setup can influence results up to a point where they tell a different story

Reproducing results
Example: performance of WordNet similarity scores compared to human ranking [figure]

Reproducing results
- Clear registration of all steps involved and storage of (intermediate) system output can improve reproducibility
- Systematic testing can help to gain insight into the variation of the outcome of our systems and hence lead to more insight in their performance

Antske Fokkens, Marieke van Erp, Marten Postma, Ted Pedersen, Piek Vossen and Nuno Freire (2013) Offspring from Reproduction Problems: What Replication Failure Teaches Us. In: Proceedings of ACL 2013, Sofia, Bulgaria, August 2013.

Methodology requirements
- The method used to extract information may introduce a bias that has unintended influence on the outcome of the historian's questions
- For example: location identification with GeoNames
  - Heuristic: when there are multiple locations with the same name, take the one in or closest to the Netherlands
  - High precision, but 'America', 'Willemstad': what if the historian investigates trips to the Netherlands by officials overseas?
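The location heuristic and the bias it introduces can be shown with a short Python sketch. It does not call GeoNames; the candidate list and the reference point are hard-coded with approximate coordinates, and the function names are assumptions made for this example.

```python
from math import asin, cos, radians, sin, sqrt

# Approximate reference point in the Netherlands (an assumption of this sketch).
NETHERLANDS = (52.2, 5.3)

# Hard-coded candidates with approximate coordinates; a real system would
# look these up in GeoNames.
CANDIDATES = {
    "Willemstad": [
        ("Willemstad, Noord-Brabant", 51.69, 4.44),
        ("Willemstad, Curaçao", 12.11, -68.93),
    ],
}


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two points."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = (sin((lat2 - lat1) / 2) ** 2
         + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * asin(sqrt(a))


def resolve(name):
    """Heuristic: of all places with this name, take the one closest to
    the Netherlands."""
    closest = min(
        CANDIDATES[name],
        key=lambda c: haversine_km(*NETHERLANDS, c[1], c[2]),
    )
    return closest[0]


chosen = resolve("Willemstad")
print(chosen)
```

The heuristic always picks the Dutch Willemstad, so every mention of Willemstad on Curaçao is silently resolved to the wrong place. For a historian studying officials overseas, that systematic error is exactly the kind of method-induced bias that needs to be visible.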

Methodology requirements
- Maximize reuse of existing tools for BiographyNet
- Maximize reuse of tools developed within BiographyNet by other researchers
- How can we create a setup that facilitates this?

Methodology approach
Provenance modeling:
- Can help to improve reproducibility of research
- Can support systematic testing
- Can model the exact steps taken
Flexible formats that support this:
- NLP Annotation Format (NAF) to manage output and input of NLP tools
- Grounded Annotation Framework (GAF) for the final output of the NLP pipeline

NLP Annotation Format
- Sustainable, because close to existing linguistic formats (e.g. LAF, GRAF, NIF)
- Joint work across projects and with other institutes (notably University of the Basque Country, Fondazione Bruno Kessler)
- Flexible, because the output of individual tools is added in separate layers

Grounded Annotation Framework
- RDF compliant framework
- Introduces the denotedBy relation that links mentions in text to formal representations of their instances
- Provenance is marked using Named Graphs
- This allows us to accumulate information from different sources and represent alternative perspectives
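How named graphs let information from different sources accumulate can be sketched with quads, i.e. triples that carry a fourth element naming the graph they belong to. This is a plain Python illustration and not the GAF vocabulary: only the denotedBy relation is named on the slides, while the identifiers, the other predicates and the graph names are hypothetical.

```python
# Each statement is a quad: (subject, predicate, object, named graph).
# denotedBy comes from the slide; all other names are invented here.
quads = [
    ("event:birth_thorbecke", "denotedBy", "mention:nnbw_sentence_1", "graph:nlp_run_1"),
    ("event:birth_thorbecke", "hasPlace", "Zwolle", "graph:nlp_run_1"),
    ("event:birth_thorbecke", "hasPlace", "Zwolle", "graph:portal_metadata"),
    ("event:birth_thorbecke", "hasDate", "1798-01-14", "graph:portal_metadata"),
]


def sources_for(subject, predicate, obj):
    """Which named graphs (sources) assert this statement?"""
    return sorted(g for s, p, o, g in quads if (s, p, o) == (subject, predicate, obj))


def perspective(graph):
    """All statements made within one named graph, i.e. by one source."""
    return sorted((s, p, o) for s, p, o, g in quads if g == graph)


print(sources_for("event:birth_thorbecke", "hasPlace", "Zwolle"))
print(perspective("graph:portal_metadata"))
```

Because each statement keeps its graph, the same event can be described by the metadata and by a text mining run side by side, and agreement or disagreement between sources stays traceable.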

Provenance Modeling
It must be clear where information comes from (original source, opinion holder, automatically retrieved or from metadata).
For NLP research:
- Model each step of the process
- Resources used (preprocessing + version), system output
For historic research:
- What may introduce biases?
- How can the process be represented in an understandable manner?

Basic System
Identifying metadata in text:
- Linguistically naive supervised machine learning
Linguistic processing:
- Named entity recognition (time and location)
- Concept identification

First Evaluation
- Use case: Governors-General of the Dutch Indies
- 129 biographies describing 71 individuals
- Serge ter Braake extracted information manually

Metadata versus text mining

Preliminary outcome of text mining
Columns: Category, Correct, Incorrect, Both, Correct text, Incorrect text
Values per row (cell boundaries not preserved):
Education 202
Father 0092
Mother 0125
Occupation 1462214
Birthdate 212359

Recall problems (for birthdate):
- Sentence not found (35): typical for wikipedia, bwn, vdaa
- Value not found (7)
- Wrong sentence (1), wrong date (1): date of marriage, date of death

Observations
Recall problems (for birthdate):
- Sentence identification
Easy ways to improve:
- Parents: named entity recognition
- Occupation, Education: concept tagged corpus
- Source specific training
More difficult problems:
- Relations, functions of other people
- Negations or factuality (e.g. refused positions for occupations)

NLP outlook
Evaluation:
- Text based annotations
Metadata extraction:
- Supervised with linguistically rich features
- Rule-based approaches
Beyond metadata:
- Time lines of people's lives (2nd year)
- Networks between people (2nd year)
- Complex event modeling (3rd year)

Questions?
http://www.biographynet.nl/