HeidelTime at TempEval-3: Tuning English and Developing Spanish Resources
Jannik Strötgen, Julian Zell, Michael Gertz
Database Systems Research Group, Heidelberg University, Im Neuenheimer Feld 348, 69120 Heidelberg, Germany
Motivation

Temporal Tagging
• extraction & normalization of temporal expressions

Main Challenge
• normalizing relative and underspecified expressions

News (1998-04-18): ... for the United States, he said today. ... On May 22, 1995, Farkas was made a brigadier general, and the following year ... However, cited by police in December for driving under the influence of alcohol ...

Different Domains [1]
• pose different challenges
• require different strategies

Existing Approaches
• focus on English
• focus on news documents

Narrative (2009-12-19): 1979: Soviet invasion ... land in Kabul on December 25 ... they were complying with the 1978 Treaty of Friendship ... entered Afghanistan from the north on December 27. In the morning, the 103rd ...

HeidelTime: a multilingual, cross-domain temporal tagger
Key Features [2]
• rule-based system
• required: sentence, token, and POS information
• extraction: regular expressions & NLP features
• normalization: knowledge resources & linguistic clues

Source Code (language-independent)
• resource interpreter
• domain-dependent normalization strategies
  → reference time
  → relation to reference time

Resources (language-dependent)
• pattern files: month=(...|April|May|...)
• normalization files: normMonth(April)=04
• rule files
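
The domain-dependent normalization strategy can be illustrated with a minimal sketch. It assumes, following the news and narrative examples in the Motivation section, that a news document resolves an underspecified month against the document creation time, while a narrative document resolves it against the most recently mentioned date. The function name, its parameters, and the simple past/future heuristic are hypothetical and are not taken from HeidelTime's source code; in particular, the relation to the reference time is passed in directly here, whereas a real system would have to derive it from linguistic clues.

```python
from datetime import date


def resolve_month(month, domain, dct, last_mentioned_year=None, relation="past"):
    """Return 'YYYY-MM' for an underspecified month mention (hypothetical helper).

    domain 'narrative': the reference time is the previously mentioned date.
    domain 'news':      the reference time is the document creation time (dct).
    """
    if domain == "narrative" and last_mentioned_year is not None:
        return f"{last_mentioned_year:04d}-{month:02d}"
    year = dct.year
    if relation == "past" and month > dct.month:
        year -= 1  # the month has not occurred yet this year, so it was last year
    elif relation == "future" and month < dct.month:
        year += 1  # the month has already passed this year, so it is next year
    return f"{year:04d}-{month:02d}"


# "cited by police in December" in a news document created on 1998-04-18
print(resolve_month(12, "news", date(1998, 4, 18)))  # 1997-12
# "on December 25" in a narrative document after the mention of "1979"
print(resolve_month(12, "narrative", date(2009, 12, 19), last_mentioned_year=1979))  # 1979-12
```

The same surface expression ("December") therefore receives different values depending on which reference time the domain strategy selects.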
TempEval-3

Temporal Tagging (Task A)
• English and Spanish news documents
• annotation according to TimeML

Evaluation
• strict and relaxed extraction
• type and value normalization
• ranking attribute: value F1 (relaxed)

From HeidelTime 1.2 to 1.3
• improved weekday normalization
• annotations closer to TimeML

English Adaptations
• X_REF values
• improved negative rules for ambiguous expressions (e.g., may, fall, march)

Developing Spanish Resources
Four Steps to Add a New Language:

(1) Preprocessing:
• sentence, token, PoS information
• HeidelTime uses TreeTagger
• Spanish TreeTagger module available

(2) Translation of Pattern Files:
Spanish:
// reMonthLong
[Ee]nero
[Ff]ebrero
[Mm]arzo
...

English:
// reMonthLong
January
February
March
...
(3) Translation of Normalization Files:

English:
// "normMonth"
"January","01"
"Jan\.?","01"
"0?1","01"
"February","02"
...

Spanish:
// "normMonth"
"[Ee]nero","01"
"[Ee]ne\.?","01"
"0?1","01"
"[Ff]ebrero","02"
...
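
Because the keys in the normalization excerpts are themselves regular expressions (for example `[Ee]ne\.?`), a lookup cannot be a plain dictionary access. The following sketch shows one way such a table could be queried; the table holds only the four Spanish entries quoted in this step, and the `normalize` helper is a hypothetical illustration, not HeidelTime's resource interpreter.

```python
import re

# The four entries of the Spanish "normMonth" excerpt; the real file continues.
NORM_MONTH_ES = [
    (r"[Ee]nero", "01"),
    (r"[Ee]ne\.?", "01"),
    (r"0?1", "01"),
    (r"[Ff]ebrero", "02"),
]


def normalize(table, text):
    """Return the value of the first entry whose pattern matches all of text."""
    for pattern, value in table:
        if re.fullmatch(pattern, text):
            return value
    return None  # no entry covers this expression


print(normalize(NORM_MONTH_ES, "enero"))    # 01
print(normalize(NORM_MONTH_ES, "Ene."))     # 01
print(normalize(NORM_MONTH_ES, "Febrero"))  # 02
print(normalize(NORM_MONTH_ES, "marzo"))    # None (not part of the excerpt)
```

Translating a normalization file then amounts to replacing the patterns on the left while keeping the language-independent values on the right.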
(4) Iterative Rule Development
• starting with (simple) English rules
• checking training corpus for errors (FP, FN, partial matches, incorrect values)
• adapting patterns, normalizations, and rules to improve results on training data
Spanish:
// example: "el 20 de enero de 2012" (2012-01-20)
Name="date_r1"
Extract="[Ee]l %reDayNum de %reMonthLong de %reYear4Digit"
Value="group(3)-%normMonth(group(2))-%normDay(group(1))"

English:
// example: "January 20th, 2012" (2012-01-20)
Name="date_r1"
Extract="%reMonthLong %reDayNum, %reYear4Digit"
Value="group(3)-%normMonth(group(1))-%normDay(group(2))"
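
To make the interplay of pattern files, normalization files, and rules concrete, here is a minimal sketch that interprets the Spanish `date_r1` rule. It assumes that each `%re...` placeholder expands to one capturing group and that `%norm...(group(n))` applies a normalization to the n-th group. The miniature `PATTERNS` and `NORMALIZERS` tables and all function names are hypothetical stand-ins for the real resources.

```python
import re

# Hypothetical miniature pattern resources (the real files are much larger).
PATTERNS = {
    "reDayNum": r"\d{1,2}",
    "reMonthLong": r"[Ee]nero|[Ff]ebrero|[Mm]arzo",
    "reYear4Digit": r"\d{4}",
}
MONTHS = [(r"[Ee]nero", "01"), (r"[Ff]ebrero", "02"), (r"[Mm]arzo", "03")]


def norm_month(text):
    return next(v for p, v in MONTHS if re.fullmatch(p, text))


def norm_day(text):
    return f"{int(text):02d}"


NORMALIZERS = {"normMonth": norm_month, "normDay": norm_day}

RULE = {
    "name": "date_r1",
    "extract": "[Ee]l %reDayNum de %reMonthLong de %reYear4Digit",
    "value": "group(3)-%normMonth(group(2))-%normDay(group(1))",
}


def compile_extract(extract):
    # Expand every %reName into a capturing group holding that pattern.
    expanded = re.sub(r"%(re\w+)", lambda m: "(" + PATTERNS[m.group(1)] + ")", extract)
    return re.compile(expanded)


def apply_value(value, match):
    # Resolve %normX(group(n)) first, then the remaining bare group(n) references.
    value = re.sub(
        r"%(norm\w+)\(group\((\d+)\)\)",
        lambda m: NORMALIZERS[m.group(1)](match.group(int(m.group(2)))),
        value,
    )
    return re.sub(r"group\((\d+)\)", lambda m: match.group(int(m.group(1))), value)


def tag(text):
    regex = compile_extract(RULE["extract"])
    return [(m.group(0), apply_value(RULE["value"], m)) for m in regex.finditer(text)]


print(tag("Llegó el 20 de enero de 2012."))
# [('el 20 de enero de 2012', '2012-01-20')]
```

Under these assumptions, porting the rule from English to Spanish only changes the `extract` string and the group indices in `value`; the interpreter code stays untouched.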
TempEval-3 Evaluation Results
English          strict F1   relaxed F1   value F1
HeidelTime 1.3   81.34       90.30        77.61
HeidelTime 1.2   78.07       86.99        72.12
NavyTime         79.57       90.32        70.97   (next best system by value F1)

Spanish          strict F1   relaxed F1   value F1
HeidelTime 1.3   85.33       90.13        85.33
TipSemB          82.57       87.40        71.85
jrc              49.53       65.20        50.78
Summary
• English: 8 teams, 21 submissions
• Spanish: 3 teams, 3 submissions
• HeidelTime best system for English & Spanish for extraction + normalization
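
The three columns of the result tables can be read with the help of a small sketch. It assumes that a strict match requires identical spans, that a relaxed match only requires overlapping spans, and that the value score counts relaxed matches whose normalized value is also correct. This is a simplified illustration with invented toy data, not the official TempEval-3 scorer.

```python
def f1(matched_sys, n_sys, matched_gold, n_gold):
    precision = matched_sys / n_sys if n_sys else 0.0
    recall = matched_gold / n_gold if n_gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def overlaps(a, b):
    # Spans are (start, end, value) with an exclusive end offset.
    return a[0] < b[1] and b[0] < a[1]


def same_span(a, b):
    return a[:2] == b[:2]


def same_value(a, b):
    return overlaps(a, b) and a[2] == b[2]


def score(gold, system, match):
    sys_hits = sum(1 for s in system if any(match(s, g) for g in gold))
    gold_hits = sum(1 for g in gold if any(match(s, g) for s in system))
    return round(f1(sys_hits, len(system), gold_hits, len(gold)), 2)


# Toy annotations (invented for illustration).
gold = [(0, 5, "2012-01-20"), (10, 18, "1997-12"), (30, 35, "1998-04-18")]
system = [(0, 5, "2012-01-20"), (12, 18, "1998-12"), (40, 44, "2000")]

print(score(gold, system, same_span))   # strict:  0.33
print(score(gold, system, overlaps))    # relaxed: 0.67
print(score(gold, system, same_value))  # value:   0.33
```

In the toy data, the second system annotation is a partial match with an incorrect value: it counts for relaxed extraction but not for strict extraction or value normalization.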
Error Analysis & Conclusions
False Negatives
• expressions that cannot be normalized with high probability (some time)

False Negatives / False Positives
• trade-off due to X_REF expressions; annotated inconsistently (currently)

Incorrect Value Normalization
• due to partial matches
• incorrect relation to reference time

Spanish Resources
• benefit from high quality English resources as starting point
• contain many patterns and rules not in Spanish training data
Availability
HeidelTime’s Current Version
• as UIMA component
• as standalone version (Java)
• online demo
• @ Google code

Languages
• English, German, Dutch, Spanish, Italian, Arabic, Vietnamese

Ongoing Work
• further languages to come
Contact Information: Jannik Strötgen, [email protected], http://dbs.ifi.uni-heidelberg.de/
References
[1] J. Strötgen and M. Gertz: Temporal Tagging on Different Domains: Challenges, Strategies, and Gold Standards. LREC, 2012.
[2] J. Strötgen and M. Gertz: Multilingual and Cross-domain Temporal Tagging. Language Resources and Evaluation, 47(2), 269–298, 2013.
This work was presented at SemEval 2013, the 7th International Workshop on Semantic Evaluation, June 14-15, 2013, Atlanta, Georgia, USA.