CKL
Center for Computational Linguistics
Project MŠMT LC536 (LC05)
Univerzita Karlova v Praze, ÚFAL MFF
Západočeská univerzita Plzeň, KKY FAV
Masarykova Univerzita Brno, FI
Ústav pro jazyk český AV ČR Praha
http://www.centrumkomputacnilingvistiky.cz
Jan 31, 2011, ÚFAL MFF UK
Center for Computational Linguistics (LC536)
Center’s Advisory Board Meeting, 31.1.2011
MFF UK, Malostranské nám. 25, Room S1, 4th floor

10:00 Introduction to the Center, history, results (Jan Hajic)
10:25 Charles University research and results (Jan Hajic)
10:40 Break
11:00 Institute for Czech Language research and results (Karel Oliva)
11:15 Masaryk University research and results (Karel Pala)
11:30 University of West Bohemia research and results (Pavel Ircing)
The Center

Goals:
– Research in all areas of computational linguistics and speech
– Close cooperation in speech and language
– Create annotated data
– Algorithms and SW tools for NL analysis and generation
– Create and integrate lexical resources
History of the Center

Former Center for Computational Linguistics (program MŠMT LN)
– 2000-2004
– UK, ÚJČ, ZČU: fundamental research type (B)
Now: Center for Computational Linguistics
– (again) fundamental research, MŠMT LC
– Masaryk University in Brno added, now 4 sites
The Center: some figures

Budget and timeframe
– 2.9 mil. €, 2005-2009[-2011] (6 yrs + 9 mos)
Personnel (2010):
– 1 PI (professor)
– 7 Co-PIs and key persons (full/assoc. prof.)
– 11 Postdocs (Ph.D.)
    9 of them graduated with CKL support
– 24 graduate students
Reduced to about 2/3 for 2011
The sites (1)

UK Praha (ÚFAL MFF / Charles University)
– Formal language theory and algorithms
– SW tools for NLU / NLG
– Raw, annotated data (incl. parallel)
ZČU Plzeň, KKY FAV (University of West Bohemia in Pilsen)
– Speech recognition and TTS
– Data collection and annotation
The sites (2)

MU Brno, FI, NLP lab (Masaryk University)
– Lexical issues
    Lexical databases, incl. SW
ÚJČ AV ČR (Institute of the Czech Language, Academy of Sciences of the CR)
– Digitization of historical data
– Lexical databases
2005

Start of work, after some “gap”
– Apr. 1, 2005 – three months vacuum
– [Got back the name…]
– Reduced budget for 2005 (300k €)
    Durable equipment / future computing cluster
– Cooperation:
    EU grant proposals
    Continuing work on Malach (U.S.)
    Start of the PIRE NSF project (JHU, Brown Univ.)
2006

First full year
– Prague Dependency Treebank v2.0 finished (published at LDC)
– Speech reconstruction project (UK, specification with PIRE/JHU)
– Lexical issues (UK, MU, ÚJČ)
– Speech (ASR, TTS - ZČU)
– IR – CLEF test collection, CLEF shared task, 1st part
– Digitization of historical material (ÚJČ)
– Start of EU Integrated project “Companions”: UK, ZČU
– More international cooperation: EU, USA (JHU, Brown, Univ. of Pennsylvania)
– Organization of Treebanks and Linguistic Theories, Dec. 2006 (UK)
– 40 “results” in the government database (“RIV”)
2007
Mid-project

– Lexical resources, new Czech language lexical database (MU+ÚJČ)
– Added more students for English work, translation
    English annotation specification, annotation (ZČU, UK)
– Integration of ASR and TTS with NLU/NLG (UK, ZČU)
    In the “Companions” project
– SW tools for analysis and generation
    Speech, language (UK, MU, ZČU)
– International collaboration
    EU (3 projects in the 6th FP: UK, UK+ZČU), USA (UK, UK+ZČU)
– Local organisation of ACL 2007 and EMNLP 2007
    Still (2011) holds record in attendance (~1100 participants)
– 66 results in “RIV” (16 journals, 39 in-proc., 5 SW/data etc.)
2008
Slightly modified goals (stress on MT)

– Lexical resources (MU, UK, ÚJČ)
    SW tools
– Semantics
    Detection of plagiarism (MU)
    NLU (UK, MU), NLG (UK)
– New algorithms for ASR
    Prosody, language modeling, speech reconstruction
– Data acquisition, annotation, corpus tools
– Research (incl. data annotation) for machine translation
    The TectoMT SW and data platform
– Theoretical formal linguistics, language usage
Results (RIV): 64: 13 journal art., 32 in-proc., 5 books, 5 SW tools/data resources etc.
2009
Should have been the last year of CKL…

– Application for extension for 2010-11
    Granted for 2010
– Research: English data, MT, ASR, Dialog
    Work on the parallel Czech-English treebank (PTB)
    Companions project: integration work
      – Tight cooperation between UK and ZČU
    PIRE project – workshops, students from US at UK
    Euromatrix EU project on MT extended (-2012)
– Organization of the CoNLL 2009 shared task
– Organization of session at FET 2009 (EU conference)
– Results: 62, journals: 8, in-proc.: 42, 3 books etc.
2010
Last fully-funded year: ext. to 2011 granted in Nov.

– Continuation of research along the same lines
    Wrap-up in data annotation: PCEDT, PDTSx
    Departures of people due to uncertainty
– International cooperation:
    Companions project finished (Nov. 2010)
    PIRE continuing towards 2011, EuromatrixPlus renewed (UK)
    New projects in 2010:
      – Univ. of Pennsylvania – discourse representation, annotation (UK)
      – Khresmoi (EU IP) – medical IR and IE, UK
      – Faust (STREP, machine translation, UK)
      – META-NET network of excellence in MT / data sharing
    Chairing the ACL 2010 conference (Uppsala, Sweden)
– Results (prelim.): ~60 (12 journal articles, ~40 in-proc.)
Quantitative Summary of Results

RIV 2005-2009 (2010 pending)
– 274 records (+ ~60 in 2010)
Mostly papers in proceedings of conferences and workshops
– ACL, EACL, NAACL, Coling, CoNLL; workshops
– > 95% international, > 85% abroad
Some journal articles
– LNCS, IEEE Transactions, LRE, Czech ling. journals (PBML, SaS – now in WoS)
Software and data
– Mostly “open source”; training, shared task (evaluation)
Most valued publications

Papers
– Semi-supervised POS tagging (EACL 2009)
    Best results in POS tagging so far, incl. English
    Now taggers available in 5 languages
– Extension of HVS Semantic Parser by Allowing Left-Right Branching (ICASSP 2008)
    New result, drawing from S. Young’s work
– Large-scale Semantic Networks: Annotation and Evaluation
    NAACL 2009; in cooperation with Google Research (Zurich, K. Hall)
– CoNLL 2009 Shared Task, CoNLL 2009
    Overall task and system description
Book
– Valenční slovník českých sloves (Valency Lexicon of Czech Verbs, Karolinum Press)
    Electronic version available
Most valued data

Corpora (language databases, publicly available)
– Prague Dependency Treebank 2.0, Linguistic Data Consortium 2006
– Prague Czech-English Dependency Treebank, to appear in 2011
    Penn Treebank & translation to Czech, with semantic annotation ~PDT style
– Czech Wordnet 1.0 (ELRA, 2008)
– Sign Language, Audiovisual (ELRA, 2008)
Test / shared task collections
– CLEF 2006, 2007
    Multilingual cross-language search competitions
– Machine Translation Open Competition – EuroMatrix/Plus 2006-10
    Czech-English, German, French, Italian, Hungarian, Spanish
– CoNLL Shared Task 2007, 2009
    Dep. parsing, semantic role labeling (unified for 7 languages)
Most valued SW tools

Software
– Corpus manager (client/server) Bonito/Manatee
    Worldwide use: ČNK, SNK; Hu, Hr, GB
– Word Sketch Engine
    Commercial use (Lexical Computing)
– ComPOST
    State-of-the-art POS tagger (Cz, En, Dutch, Swedish, Icelandic)
– Syntactic dependency parser “MST” (Czech)
    With Univ. of Pennsylvania
– Improved Czech ASR and Emotional TTS
    Used in the Companions project
– NLG and Dialogue Manager w/knowledge base
    Also for the Companions project
– The TectoMT SW and data handling platform
    MT, dialogue systems (now any NLU/NLG processing -> “Treex”)
The Center provided…

Material benefits
– 3/4 of budget: personnel (mainly graduate students)
– Generous travel money
– Small equipment
– Durable equipment – clusters (30-200 CPUs)
    Only in 2005/6 – need for renewal
– Small indirect costs (< 12%, contribution of inst.)
“Intangible” benefits
– (Sub)teams, even across institutions, flexible assignment of people to projects
– Dissertations, one assoc. professor promotion
The Center had to work under certain “restrictions”

Employment of graduate students, postdocs, supervision of graduate students
– Now at all four sites (2009: 10/4/9/1)
    Requirement: at least one site… → Check
Requirement: Participation of students (Bc./Mgr./Ph.D.)
– Total: 41 students → Check
– 7 nationalities
Students - after graduation - went to (e.g.)…
– Petr Němec (UK): TextKernel, Hol.; Kiril Ribarov (UK): ČEZ
– Jan Romportl, Aleš Pražák: SpeechTech (spinoff, ZČU)
– Vladimír Kadlec (MU Brno): Acision (GB)
– Petr Pajas (UK): Google (Zurich)
– Václav Novák (UK): Ministry of Interior, then a small startup
– Former CKL (LN, 00-04): M. Čmejrek, J. Cuřín (UK): IBM Research (Yorktown, Prague)
“Restrictions” (cont’d)

Requirement: integration to EU “research space”
9 EU projects, 6th and 7th FP
– All types: IP, STREP, NoE; SSA, Dig. Libraries
    Companions (IP) - ZČU, UK; Khresmoi (IP) - UK
    EuroMatrix, EuroMatrixPlus, Faust (STREP) - UK
    Flarenet, META-NET (NoE) - UK
    Clarin (SSA) - UK, MU, ÚJČ; KYOTO (Dig. Libraries) - MU
USA
– Malach (till 2007; UK, ZČU): USC, JHU, IBM, UMD
– PIRE: speech recognition and machine translation (UK, indirectly ZČU): JHU, Brown Univ.
– Discourse: Univ. of Pennsylvania
– Treebanking: Univ. of Colorado
→ Check
EU Project “Companions”

Goal
– Intelligent conversational companion
    Over photographs (Cz), “how was your day” (En)
Technologies
– ASR, emotional TTS
– Natural language understanding, NL generation
– Naturalness of dialogue: “user studies” / “evaluation”
CKL
– UK/ZČU: ASR, TTS, NLU, NLG, Dialogue management
The Companions project
Companions: System Diagram

Other project demos
Semantic annotation (UK)
Některé kontury problému se však po oživení Havlovým projevem zdají být jasnější.
(“However, after being revived by Havel’s speech, some contours of the problem seem clearer.”)
PDT 2.0: Annotation layers

„Byl by šel do lesa“ (“he’d go to the forest”)
Linked layers of annotation
Stand-off annotation
Scheme (Relax NG) z-layer
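The idea of linked, stand-off annotation layers can be illustrated with a minimal Python sketch. It is not the PDT 2.0 data format: the class names, the ids (w1, a1, t1), the labels, and the grouping of words into higher-layer units are all hypothetical and chosen only for illustration. What it shows is the principle that each layer stores references to ids on the layer below rather than copying the text.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple, Union


@dataclass(frozen=True)
class Token:
    """Lowest layer: a raw word form with a stable id."""
    id: str
    form: str


@dataclass(frozen=True)
class Annotation:
    """Higher-layer unit; it links to lower-layer ids instead of copying text."""
    id: str
    refs: Tuple[str, ...]  # ids of the units this annotation points to
    label: str             # hypothetical label, for illustration only


Layer = Dict[str, Union[Token, Annotation]]


def resolve(unit_id: str, layers: List[Layer]) -> List[str]:
    """Follow the links downwards and return the word forms a unit covers."""
    for layer in layers:
        if unit_id in layer:
            unit = layer[unit_id]
            if isinstance(unit, Token):
                return [unit.form]
            forms: List[str] = []
            for ref in unit.refs:
                forms.extend(resolve(ref, layers))
            return forms
    raise KeyError(unit_id)


# Word layer for the example sentence from the slide.
words: Layer = {
    f"w{i}": Token(f"w{i}", form)
    for i, form in enumerate(["Byl", "by", "šel", "do", "lesa"], start=1)
}

# Hypothetical middle layer: one node per word, linked by id.
surface: Layer = {
    f"a{i}": Annotation(f"a{i}", (f"w{i}",), "node") for i in range(1, 6)
}

# Hypothetical deeper layer: one unit may link to several lower units.
deep: Layer = {
    "t1": Annotation("t1", ("a1", "a2", "a3"), "verb complex"),
    "t2": Annotation("t2", ("a4", "a5"), "prepositional group"),
}

layers = [words, surface, deep]
print(resolve("t1", layers))  # ['Byl', 'by', 'šel']
```

Because the text lives only on the lowest layer, each higher layer can be stored, validated, and revised separately without touching the layers beneath it.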
Speech reconstruction (UK, ZČU)

● Goal: “Translation”
SEM NEMOH SEM TO JIM DÁT TEN VOBRAZ
‘m couldn’t ‘m that them give the paintin’
Ten obraz jsem jim nemohl dát.
I could not give them the painting.
?
Generation
● Annotation
Speech Reconstruction Annotation

Edited transcript
– All changes allowed
– Manual annotation
– Large data
    Malach data
    Companions proj. dialogues (> 100h)