introduction to...
TRANSCRIPT
Introduction to TectoMT
1/21
Zdeněk Žabokrtský, Martin Popel
Institute of Formal and Applied LinguisticsCharles University in Prague
CLARA Course on Treebank Annotation, December 2010, Prague
Outline
n PART 1n What is TectoMT?n TectoMT’s architecturen Overview of TectoMT’s tools and applications
2/21
n Overview of TectoMT’s tools and applications
n PART 2 - demo
What is TectoMT?
n multi-purpose NLP software frameworkn created at UFAL since 2005
n main linguistic featuresn layered language representationn linguistic data structures adopted from the Prague Dependency
3/21
n linguistic data structures adopted from the Prague Dependency Treebank
n main technical featuresn highly modular, open-sourcen numerous NLP tools already integrated (both existing and new)n all tools communicating via a uniform OO infrastructuren Linux + Perln reuse of PDT technology (tree editor TrEd, XML…)
Why “TectoMT” ?
n Tecto..n refers the (Praguian) tectogrammarn deep-syntactic dependency-oriented sentence representationn developed by Petr Sgall and his colleagues since 1960sn large scale application in the Prague Dependency Treebank
4/21
n .....MTn the main application of TectoMT is Machine Translation
n however, not only “tecto” and not only “MT” !!!
n re-branding planned for 2011: TectoMT → Treex
What is not TectoMT?
n TectoMT (as a whole) is not an end-user applicationn it is rather an experimental lab for NLP researchers
5/21
n however, releasing of single-purpose stand-alone applications is possible
Motivation for creating TectoMT
n First, technical reasons:n Want to make use of more than two NLP tools in your
experiment? Be ready for endless data conversions, need for other people's source code tweaking, incompatibility of source code and model versions…
6/21
code and model versions…
n ⇒ unified software infrastructure might help us in many aspects.
n Second, our long-term MT plan:n We believe that tectogrammar (deep syntax) as implemented in
Prague Dependency Treebank might help to (1) reduce data sparseness, and (2) find and employ structural similarities revealed by tectogrammar even between typologically different languages.
Main Design Decisions
n Linuxn Perl as the core language
n set of well-defined, linguistically relevant layers of language representation
7/21
n neutral w.r.t. chosen methodology ("rules vs. statistics")
n emphasis on modularityn each task implemented by a sequence of blocksn each block corresponds to a well-defined NLP subtaskn reusability and substitutability of blocks
n support for distributed processing
Data Flow Diagramin a typical application in TectoMT
INPUTDATAFILES
OUTPUTDATAFILES
MEMORY REPRESENTATION
OF SENTENCESTRUCTURES
input formatconverter
output formatconverter
8/21
FILES FILESSTRUCTURES
block 1 block 2 block n
non-Perltool X
non-Perltool Y
block 3 …
scenario:
Hierarchy of data-structure units
n documentn the smallest independently storable unit (~ xml file) n represents a text as a sequence of bundles, each representing one
sentence (or sentence tuples in the case of parallel documents) n bundle
n set of tree representationsof a given sentence
9/21
of a given sentence
n treen representation of a sentence on a given layer of linguistic
descriptionn noden attribute
n document's, node's, or bundle's name-value pairs
Tree types adopted from PDT
n tectogrammatical layern deep-syntactic dependency tree
n analytical layern surface-syntactic dependency tree
10/21
n 1 word (or punct.) ~ 1 node
n morphological layern sequence of tokens with their lemmas
and morphological tags
Trees in a bundle
n in each bundle, there can be at most one tree for each "layer"
n set of possible layers = {S,T} x {English,Czech,...} x {M,A,T,P, N}
n S - source, T-target (analysis vs. synthesis, MT perspective)
11/21
n M - morphological analysisn P - phrase-structure treen A - analytical treen T - tectogrammatical treen N - instances of named entities
n Example: SEnglishA - tectogrammatical analysis of an English sentence on the source-language side
Hierarchy of processing units
n blockn the smallest individually executable unitn with well-defined input and outputn block parametrization possible (e.g. model size choice)
n scenario
12/21
n scenarion sequence of blocks, applied one after another on given
documents
n applicationn typically 3 steps:n 1. conversion from the input formatn 2. applying the scenario on the datan 3. conversion into the output format source
languagetargetlanguage
MT triangle:interlingua
tectogram.
surf.synt.
morpho.
raw text.
Blocks
n technically, Perl classes derived from ������������neither method ������������ (if sentences are processed
independently) or method ����������� �� must be defined
n several hundreds blocks in TectoMT now, for various purposes:
nblocks for analysis/transfer/synthesis, e.g.
13/21
nblocks for analysis/transfer/synthesis, e.g.SEnglishW_to_SEnglishM::Lemmatize_mtree
SEnglishP_to_SEnglishA::Mark_heads
TCzechT_to_TCzechA::Vocalize_prepositions
nblocks for alignment, evaluation, feature extraction, etc.
n some of them only implement simple rules, some of them call complex probabilistic tools
nEnglish-Czech tecto-based translation currently composes of roughly 140 blocks
Tools available as TectoMT blocks
nto integrate a stand-alone NLP tool into TectoMT means to provide it with the standardized block interface
nalready integrated tools:ntaggers
nHajič's tagger, Raab&Spoustová Morče tagger, Rathnaparkhi MXPOST tagger, Brants's TnT tager, Schmid's Tree tagger, Coburn's
14/21
MXPOST tagger, Brants's TnT tager, Schmid's Tree tagger, Coburn's Lingua::EN::Tagger
nparsersnCollins' phrase structure parser, McDonalds dependency parser,
Malt parser, ZŽ's dependency parsernnamed-entity recognizer
nStanford Named Entity Recognizer, Kravalová's SVM-based NE recognizer
n miscel.nKlimeš's semantic role labeller, ZŽ's C5-based afun labeller, Ptáček's
C5-based Czech preposition vocalizer, ...
Other TectoMT componentsn "core" - Perl libraries forming the core of TectoMT infrastructure,
esp. for memory representation of (and interface to) to the data
structures
nnumerous file-format converters (e.g. from PDT, Penn treebank,
Czeng corpus, WMT shared task data etc. to our xml format)
n
15/21
nTectoMT-customized Pajas' tree editor TrEd
n tools for parallelized processing (Bojar)
ndata, esp. trained models for the individual tools, morphological
dictionaries, probabilistic translation dictionaries...
n tools for testing (regular daily tests), documentation...
Languages in TectoMT
n full-fledged sentence PDT-style analysis/transfer/synthesis for English and Czechn using state-of-the-art tools
16/21
n prototype implementations of PDT-style analyses for a number of other languagesnmostly created by students
n Polish, French, German, Tamil, Spanish, Esperanto…
English-Czech translation in TectoMT
deep syntax:tectogramatical layer
t-layer
ANALYSIS TRANSFER SYNTHESIS
17/21
source language (English) target language (Czech)
morphological layer
shallow syntax:analytical layer
a-layer
m-layer
w-layer
tectogramatical layer
ANALYSIS TRANSFER SYNTHESIS
t-layer
fill formems grammatemes usequeryfill morphological categories
English-Czech translation in TectoMT
rule based statistical & blocks
18/21
source language (English) target language (Czech)
morphological layer
analytical layer a-layer
m-layer
w-layertokenizationlemmatizationtagger (Morce)
parser (McDonald's MST)analytical functions
mark edges to contractbuild t-tree
fill formems grammatemes useHMTM
querydictionary
fill morphological categories
impose agreement
add functional words
generatewordforms
concatenate
segmentation
Real Translation ScenarioSEnglishW_to_SEnglishM::TokenizationNormalize_formsFix_tokenizationTagMorceFix_mtagsLemmatize_mtreeSEnglishM_to_SEnglishN::Stanford_named_entitiesDistinguish_personal_namesSEnglishM_to_SEnglishA::McD_parserFill_is_member_from_deprelFix_tags_after_parse
Mark_clause_headsMark_passivesAssign_functorsMark_infinMark_relclause_headsMark_relclause_corefMark_dsp_rootMark_parenthesesRecompute_deepordAssign_nodetypeAssign_grammatemesDetect_formemeRehang_shared_attrDetect_voice
Cut_variantsRehang_to_eff_parentsTranslate_LF_tree_ViterbiRehang_to_orig_parentsFix_transfer_choicesTranslate_L_female_surnamesAdd_noun_genderAdd_relpron_below_rcChange_Cor_to_PersPronAdd_PersPron_below_vfinAdd_verb_aspectFix_date_timeFix_grammatemes_after_transferFix_negation
Impose_pron_z_agrImpose_rel_pron_agrImpose_subjpred_agrImpose_attr_agrImpose_compl_agrDrop_subj_pers_pronsAdd_prepositionsAdd_subconjsAdd_reflex_particlesAdd_auxverb_compound_passiveAdd_auxverb_modalAdd_auxverb_compound_futureAdd_auxverb_conditionalAdd_auxverb_compound_past
19/21
Fix_tags_after_parseMcD_parser REPARSE=1 Fill_is_member_from_deprelFix_McD_topologyFix_nominal_groupsFix_is_memberFix_atreeFix_multiword_prep_and_conjFix_dicendi_verbsFill_afun_AuxCP_CoordFill_afunSEnglishA_to_SEnglishT::Mark_edges_to_collapseMark_edges_to_collapse_negBuild_ttreeFill_is_memberMove_aux_from_coord- _to_membersFix_tlemmasAssign_coap_functorsFix_either_orFix_is_member
Detect_voiceFix_imperativesFill_is_name_of_personFill_gender_of_personAdd_cor_actFind_text_corefSEnglishT_to_TCzechT::Clone_ttreeTranslate_LF_phrasesTranslate_LF_joint_staticDelete_superfluous_tnodesTranslate_F_try_rulesTranslate_F_add_variantsTranslate_F_rerankTranslate_L_try_rulesTranslate_L_add_variantsTranslate_LF_numerals_by_rulesTranslate_L_filter_aspectTransform_passive_constructionsPrune_personal_name_variantsRemove_unpassivizable_variantsTranslate_LF_compounds
Fix_negationMove_adjectives_before_nounsMove_genitives_to_postpositMove_relclause_to_postpositMove_dicendi_closer_to_dspMove_PersPron_next_to_verbMove_enough_before_adjFix_moneyRecompute_deepordFind_gram_coref_for_refl_pronNeut_PersPron_gender_from_antecOverride_pp_with_phrase_translationValency_related_rulesFill_clause_numberTurn_text_coref_to_gram_corefTCzechT_to_TCzechA::Clone_atreeDistinguish_homonymous_mlemmasReverse_number_noun_dependencyInit_morphcatFix_possessive_adjectivesMark_subject
Add_auxverb_compound_pastAdd_clausal_expletive_pronounsResolve_verbsProject_clause_numberAdd_parenthesesAdd_sent_final_punctAdd_subord_clause_punctAdd_coord_punctAdd_apposition_punctChoose_mlemma_for_PersPronGenerate_wordformsMove_clitics_to_wackernagelRecompute_orderingDelete_superfluous_preposDelete_empty_nounsVocalize_prepositionsCapitalize_sent_startCapitalize_named_entitiesTCzechA_to_TCzechW::Concatenate_tokensAscii_quotesRemove_repeated_tokens
Parallel analysis
n data needed for training the transfer phase models
n Czech-English parallel corpus CzEngn 8 mil. pairs of sentences with
automatic PDT-style analyses and
20/21�����
���
�� ��
automatic PDT-style analyses and alignment
Summary of Part I
n TectoMT (àTreex)n environment for NLP experiments
nmultipurpose, multilingual
n PDT-style linguistic structures
21/21
n PDT-style linguistic structures
n Linux+Perl, open-source
nmodular architecture (several hundreds of modules)
n capable of processing massive data
nwill be released at CPAN