introduction to...

Introduction to TectoMT

1/21

Zdeněk Žabokrtský, Martin Popel

Institute of Formal and Applied LinguisticsCharles University in Prague

CLARA Course on Treebank Annotation, December 2010, Prague

Outline

n PART 1n What is TectoMT?n TectoMT’s architecturen Overview of TectoMT’s tools and applications

2/21

n Overview of TectoMT’s tools and applications

n PART 2 - demo

What is TectoMT?

n multi-purpose NLP software frameworkn created at UFAL since 2005

n main linguistic featuresn layered language representationn linguistic data structures adopted from the Prague Dependency

3/21

n linguistic data structures adopted from the Prague Dependency Treebank

n main technical featuresn highly modular, open-sourcen numerous NLP tools already integrated (both existing and new)n all tools communicating via a uniform OO infrastructuren Linux + Perln reuse of PDT technology (tree editor TrEd, XML…)

Why “TectoMT” ?

n Tecto..n refers the (Praguian) tectogrammarn deep-syntactic dependency-oriented sentence representationn developed by Petr Sgall and his colleagues since 1960sn large scale application in the Prague Dependency Treebank

4/21

n .....MTn the main application of TectoMT is Machine Translation

n however, not only “tecto” and not only “MT” !!!

n re-branding planned for 2011: TectoMT → Treex

What is not TectoMT?

n TectoMT (as a whole) is not an end-user applicationn it is rather an experimental lab for NLP researchers

5/21

n however, releasing of single-purpose stand-alone applications is possible

Motivation for creating TectoMT

n First, technical reasons:n Want to make use of more than two NLP tools in your

experiment? Be ready for endless data conversions, need for other people's source code tweaking, incompatibility of source code and model versions…

6/21

code and model versions…

n ⇒ unified software infrastructure might help us in many aspects.

n Second, our long-term MT plan:n We believe that tectogrammar (deep syntax) as implemented in

Prague Dependency Treebank might help to (1) reduce data sparseness, and (2) find and employ structural similarities revealed by tectogrammar even between typologically different languages.

Main Design Decisions

n Linuxn Perl as the core language

n set of well-defined, linguistically relevant layers of language representation

7/21

n neutral w.r.t. chosen methodology ("rules vs. statistics")

n emphasis on modularityn each task implemented by a sequence of blocksn each block corresponds to a well-defined NLP subtaskn reusability and substitutability of blocks

n support for distributed processing

Data Flow Diagramin a typical application in TectoMT

INPUTDATAFILES

OUTPUTDATAFILES

MEMORY REPRESENTATION

OF SENTENCESTRUCTURES

input formatconverter

output formatconverter

8/21

FILES FILESSTRUCTURES

block 1 block 2 block n

non-Perltool X

non-Perltool Y

block 3 …

scenario:

Hierarchy of data-structure units

n documentn the smallest independently storable unit (~ xml file) n represents a text as a sequence of bundles, each representing one

sentence (or sentence tuples in the case of parallel documents) n bundle

n set of tree representationsof a given sentence

9/21

of a given sentence

n treen representation of a sentence on a given layer of linguistic

descriptionn noden attribute

n document's, node's, or bundle's name-value pairs

Tree types adopted from PDT

n tectogrammatical layern deep-syntactic dependency tree

n analytical layern surface-syntactic dependency tree

10/21

n 1 word (or punct.) ~ 1 node

n morphological layern sequence of tokens with their lemmas

and morphological tags

Trees in a bundle

n in each bundle, there can be at most one tree for each "layer"

n set of possible layers = {S,T} x {English,Czech,...} x {M,A,T,P, N}

n S - source, T-target (analysis vs. synthesis, MT perspective)

11/21

n M - morphological analysisn P - phrase-structure treen A - analytical treen T - tectogrammatical treen N - instances of named entities

n Example: SEnglishA - tectogrammatical analysis of an English sentence on the source-language side

Hierarchy of processing units

n blockn the smallest individually executable unitn with well-defined input and outputn block parametrization possible (e.g. model size choice)

n scenario

12/21

n scenarion sequence of blocks, applied one after another on given

documents

n applicationn typically 3 steps:n 1. conversion from the input formatn 2. applying the scenario on the datan 3. conversion into the output format source

languagetargetlanguage

MT triangle:interlingua

tectogram.

surf.synt.

morpho.

raw text.

Blocks

n technically, Perl classes derived from ��neither method �� (if sentences are processed

independently) or method �� must be defined

n several hundreds blocks in TectoMT now, for various purposes:

nblocks for analysis/transfer/synthesis, e.g.

13/21

nblocks for analysis/transfer/synthesis, e.g.SEnglishW_to_SEnglishM::Lemmatize_mtree

SEnglishP_to_SEnglishA::Mark_heads

TCzechT_to_TCzechA::Vocalize_prepositions

nblocks for alignment, evaluation, feature extraction, etc.

n some of them only implement simple rules, some of them call complex probabilistic tools

nEnglish-Czech tecto-based translation currently composes of roughly 140 blocks

Tools available as TectoMT blocks

nto integrate a stand-alone NLP tool into TectoMT means to provide it with the standardized block interface

nalready integrated tools:ntaggers

nHajič's tagger, Raab&Spoustová Morče tagger, Rathnaparkhi MXPOST tagger, Brants's TnT tager, Schmid's Tree tagger, Coburn's

14/21

MXPOST tagger, Brants's TnT tager, Schmid's Tree tagger, Coburn's Lingua::EN::Tagger

nparsersnCollins' phrase structure parser, McDonalds dependency parser,

Malt parser, ZŽ's dependency parsernnamed-entity recognizer

nStanford Named Entity Recognizer, Kravalová's SVM-based NE recognizer

n miscel.nKlimeš's semantic role labeller, ZŽ's C5-based afun labeller, Ptáček's

C5-based Czech preposition vocalizer, ...

Other TectoMT componentsn "core" - Perl libraries forming the core of TectoMT infrastructure,

esp. for memory representation of (and interface to) to the data

structures

nnumerous file-format converters (e.g. from PDT, Penn treebank,

Czeng corpus, WMT shared task data etc. to our xml format)

n

15/21

nTectoMT-customized Pajas' tree editor TrEd

n tools for parallelized processing (Bojar)

ndata, esp. trained models for the individual tools, morphological

dictionaries, probabilistic translation dictionaries...

n tools for testing (regular daily tests), documentation...

Languages in TectoMT

n full-fledged sentence PDT-style analysis/transfer/synthesis for English and Czechn using state-of-the-art tools

16/21

n prototype implementations of PDT-style analyses for a number of other languagesnmostly created by students

n Polish, French, German, Tamil, Spanish, Esperanto…

English-Czech translation in TectoMT

deep syntax:tectogramatical layer

t-layer

ANALYSIS TRANSFER SYNTHESIS

17/21

source language (English) target language (Czech)

morphological layer

shallow syntax:analytical layer

a-layer

m-layer

w-layer

tectogramatical layer

ANALYSIS TRANSFER SYNTHESIS

t-layer

fill formems grammatemes usequeryfill morphological categories

English-Czech translation in TectoMT

rule based statistical & blocks

18/21

source language (English) target language (Czech)

morphological layer

analytical layer a-layer

m-layer

w-layertokenizationlemmatizationtagger (Morce)

parser (McDonald's MST)analytical functions

mark edges to contractbuild t-tree

fill formems grammatemes useHMTM

querydictionary

fill morphological categories

impose agreement

add functional words

generatewordforms

concatenate

segmentation

Real Translation ScenarioSEnglishW_to_SEnglishM::TokenizationNormalize_formsFix_tokenizationTagMorceFix_mtagsLemmatize_mtreeSEnglishM_to_SEnglishN::Stanford_named_entitiesDistinguish_personal_namesSEnglishM_to_SEnglishA::McD_parserFill_is_member_from_deprelFix_tags_after_parse

Mark_clause_headsMark_passivesAssign_functorsMark_infinMark_relclause_headsMark_relclause_corefMark_dsp_rootMark_parenthesesRecompute_deepordAssign_nodetypeAssign_grammatemesDetect_formemeRehang_shared_attrDetect_voice

Cut_variantsRehang_to_eff_parentsTranslate_LF_tree_ViterbiRehang_to_orig_parentsFix_transfer_choicesTranslate_L_female_surnamesAdd_noun_genderAdd_relpron_below_rcChange_Cor_to_PersPronAdd_PersPron_below_vfinAdd_verb_aspectFix_date_timeFix_grammatemes_after_transferFix_negation

Impose_pron_z_agrImpose_rel_pron_agrImpose_subjpred_agrImpose_attr_agrImpose_compl_agrDrop_subj_pers_pronsAdd_prepositionsAdd_subconjsAdd_reflex_particlesAdd_auxverb_compound_passiveAdd_auxverb_modalAdd_auxverb_compound_futureAdd_auxverb_conditionalAdd_auxverb_compound_past

19/21

Fix_tags_after_parseMcD_parser REPARSE=1 Fill_is_member_from_deprelFix_McD_topologyFix_nominal_groupsFix_is_memberFix_atreeFix_multiword_prep_and_conjFix_dicendi_verbsFill_afun_AuxCP_CoordFill_afunSEnglishA_to_SEnglishT::Mark_edges_to_collapseMark_edges_to_collapse_negBuild_ttreeFill_is_memberMove_aux_from_coord- _to_membersFix_tlemmasAssign_coap_functorsFix_either_orFix_is_member

Detect_voiceFix_imperativesFill_is_name_of_personFill_gender_of_personAdd_cor_actFind_text_corefSEnglishT_to_TCzechT::Clone_ttreeTranslate_LF_phrasesTranslate_LF_joint_staticDelete_superfluous_tnodesTranslate_F_try_rulesTranslate_F_add_variantsTranslate_F_rerankTranslate_L_try_rulesTranslate_L_add_variantsTranslate_LF_numerals_by_rulesTranslate_L_filter_aspectTransform_passive_constructionsPrune_personal_name_variantsRemove_unpassivizable_variantsTranslate_LF_compounds

Fix_negationMove_adjectives_before_nounsMove_genitives_to_postpositMove_relclause_to_postpositMove_dicendi_closer_to_dspMove_PersPron_next_to_verbMove_enough_before_adjFix_moneyRecompute_deepordFind_gram_coref_for_refl_pronNeut_PersPron_gender_from_antecOverride_pp_with_phrase_translationValency_related_rulesFill_clause_numberTurn_text_coref_to_gram_corefTCzechT_to_TCzechA::Clone_atreeDistinguish_homonymous_mlemmasReverse_number_noun_dependencyInit_morphcatFix_possessive_adjectivesMark_subject

Add_auxverb_compound_pastAdd_clausal_expletive_pronounsResolve_verbsProject_clause_numberAdd_parenthesesAdd_sent_final_punctAdd_subord_clause_punctAdd_coord_punctAdd_apposition_punctChoose_mlemma_for_PersPronGenerate_wordformsMove_clitics_to_wackernagelRecompute_orderingDelete_superfluous_preposDelete_empty_nounsVocalize_prepositionsCapitalize_sent_startCapitalize_named_entitiesTCzechA_to_TCzechW::Concatenate_tokensAscii_quotesRemove_repeated_tokens

Parallel analysis

n data needed for training the transfer phase models

n Czech-English parallel corpus CzEngn 8 mil. pairs of sentences with

automatic PDT-style analyses and

20/21��

��

��

automatic PDT-style analyses and alignment

Summary of Part I

n TectoMT (àTreex)n environment for NLP experiments

nmultipurpose, multilingual

n PDT-style linguistic structures

21/21

n PDT-style linguistic structures

n Linux+Perl, open-source

nmodular architecture (several hundreds of modules)

n capable of processing massive data

nwill be released at CPAN

introduction to...

Documents