DATA-DRIVEN APPROACHES TO
NATURAL LANGUAGE PROCESSING
Guy De Pauw
Data-Driven Approach to Natural Language Processing - CURRENT TRENDS IN AI, VUB, 20/4/2012
WHO AM I?
http://aflat.org/guy
• Born in Antwerp, 1975
• Studied Germanic Languages & Literature (Dutch – English)
• PhD (2002): An Agent-Based Evolutionary Computing Approach to Memory-Based Syntactic Parsing of Natural Language
• 2002-2006: FLaVoR: flexible large vocabulary recognition (Dutch morphology and parsing)
• 2006-2012: African Language Technology
  - FWO postdoc at University of Antwerp
  - Machine learning approaches to language technology for African languages
  - AfLaT.org
OUTLINE
• What is Natural Language Processing?
• Introduction to CLiPS research
  - African language technology
• Data-driven Approaches
  - Paradigm Shift
  - Machine Learning Recap
  - Memory-Based Language Processing
• Data-Driven African Language Technology
  - Language Independence
  - Development Speed
  - Adaptability
  - Applicability
  - Empiricism
NATURAL LANGUAGE PROCESSING
LANGUAGE & SPEECH TECHNOLOGY
• Language is an important medium for
  - Communication
  - Storing knowledge
• Language & Speech Technology allows people to
  - Communicate with computers
  - Work with computers in natural language
  - Extract knowledge from speech and text
  - …
• Ultimate goal: natural language understanding
TASKS & APPLICATIONS
• Tasks:
  - Tokenization
  - Grapheme-to-phoneme conversion
  - Morphological analysis (segmentation, generation, lemmatization, stemming)
  - Part-of-speech tagging
  - Named-entity recognition
  - Syntactic parsing
  - Word sense disambiguation
  - Semantic-role labeling
  - Co-reference resolution
  - Discourse analysis
• Applications:
  - Optical character recognition
  - Spell-checking
  - Text-to-speech
  - Predictive text (T9)
  - Automatic summarization
  - Question answering
  - Sentiment analysis
  - Information retrieval/extraction
  - Terminology extraction
  - Speech recognition (AI-complete)
  - Machine translation (AI-complete)
SOCIO-ECONOMIC IMPORTANCE
• Information explosion (internet)
  - Technostress
  - 2002: 5 exabytes of newly stored information
    • 1 exabyte = 1 million terabytes
  - Doubles every 2-3 years
• Translation explosion
  - EU (2005)
    • 20+ official languages
    • > 1 billion euro per year
    • 2500 translators
    • 40% of administrative budget
  - Not just in Europe: South Africa (11 official languages)
• Helpdesks, call centers, gaming, ...
SOCIO-ECONOMIC IMPORTANCE
• Text production
  - Spelling and grammar checking
  - “Clear language”: governments, pharmaceutical companies
• Language teaching
  - Language tests, exercises, …
• “Business Intelligence”
  - Collect information on the competition
  - Opinion mining
SCIENTIFIC IMPORTANCE
• Artificial Intelligence
• Language capacity = intelligence
Nim Chimpsky: “me eat drink more”, “banana eat me Nim”
http://www.youtube.com/watch?v=oUj9AzSE_9c
WHY NO HAL9000 IN 2001
• Natural language processing is done at several levels
  - Phonetics: speech sounds
    I can participate: aj kæm patɪsəpe
  - Phonology: how do phonemes combine into larger units
    I can participate: aj kæn partɪsəpet
  - Morphology: smallest meaning-carrying unit in language
    Elle est joli+e
  - Syntax: grammar, how do combinations of words express meaning
    Elle est jolie vs *Il est jolie
  - Semantics: how is meaning expressed
    He beat me vs I was beaten by him
  - Pragmatics: contextual knowledge
    Can you pass me the salt?
• Each level introduces errors, contains ambiguity, …
AMBIGUITY
• orthographic/ phonological (cfr. “Eye halve a spell...)
Eye halve a spelling chequer
It came with my pea sea
It plainly marques four my revue
Miss steaks eye kin knot sea.
Eye strike a key and type a word
And weight four it two say
Weather eye am wrong oar write
It shows me strait a weigh.
As soon as a mist ache is maid
It nose bee fore two long
And eye can put the error rite
Its rarely ever wrong.
Eye have run this poem threw it
I am shore your pleased two no
Its letter perfect in it's weigh
My chequer tolled me sew.
• Lexical - morphological
  - The can will rust
• Syntactic
  - The prime minister reported his marriage to the king
• Semantic
  - My cat is on the television
  - All students know two languages
• Text level
  - Tom didn’t have a job. He grabbed the newspaper.
  - Tom thought the fly was annoying. He grabbed the newspaper.
  Why did Tom take the newspaper?
• World knowledge!
  - The mayors prohibited the students from demonstrating because they preached the revolution
  - The mayors prohibited the students from demonstrating because they feared violence
• Ellipsis
  - Alcohol is more damaging to women than men
MACHINE TRANSLATION
http://www.news.com.au/breaking-news/world/predictive-text-error-leads-uk-man-to-fatally-stab-friend/story-e6frfkui-1226004032018
WHAT CAN WE DO?
Phonetics/Phonology
• Speech Synthesis: “Pas de déclaration choc lors des débats dominicaux : sur les plateaux TV, les partis qui ont conclu l’accord sur BHV ont défendu énergiquement le résultat de leurs travaux.”
  http://www.acapela-group.com/text-to-speech-interactive-demo.html
• Speech Recognition
  http://www.youtube.com/watch?v=-0kDcUEDfmY
WHAT CAN WE DO?
Morphology
• Morphological analysis: segmentation of compounds, derivations, …
  watermarking = ((water[N]+mark[N])[N]+ing[V|N.])[V]
• Lemmatization, stemming: get the base forms, roots of word forms
  watermarking = watermark
• E.g. Google search: bobcats
http://ilk.uvt.nl/mbma/
NOT ALWAYS EASY
uygarlastiramayabileceklerimizdenmissinizcesine
uygar/civilized las/BECOME tir/CAUS ama/NEG yabil/POT ecek/FUT ler/3PL imiz/POSS-1PL den/ABL mis/NARR siniz/2PL cesine/AS-IF
• Adverb meaning “(behaving) as if you were one of those whom we might not be able to civilize”
• [Turkish, from Oflazer & Guzey 1994]
WHAT CAN WE DO?
Morpho-syntax
• Part-of-Speech Tagging
  The can will rust
  determiner noun modal verb
http://www.clips.ua.ac.be/cgi-bin/webdemo/MBSP-instant-webdemo.cgi
WHAT CAN WE DO?
Syntax
• Sentence analysis
http://www.link.cs.cmu.edu/link/submit-sentence-4.html
http://www.connexor.com/nlplib/?q=demo/syntax
APPLICATIONS
• Information Retrieval (data-mining)
• INFORMATION = power
• Document classification:
  - Spam filter
  - Intercept terrorist messages
• Document retrieval:
  - Web search
• Question-answering systems
  - How far is Brussels from Antwerp?
  - http://www.wolframalpha.com
• Text-mining: get facts from texts
• Ontology learning
[Figure: ontology fragment learned from text around “hepatitis”. Nodes: hepatitis, disease, infection, HBV, cirrhosis, liver, immunization, antibody, vaccination, culture, antisera. Relations: related_to, sim, produced by, prevented by]
APPLICATIONS
- Machine translation: http://translate.google.com
- Automatic subtitling: e.g. YouTube captions
- Computational Stylometry: http://www.clips.ua.ac.be/cgi-bin/kim/TACTiCSdemo.cgi
- Automatic Summarization: http://www.clips.ua.ac.be/~iris/sumdemo.html
- Spell-Checking (MS Word)
- Grammar Checking (MS Word)
- Spoken Dialogue Systems (banking, airline, gaming, …)
- T9, autocomplete
- Google AdSense
- …
CLIPS COMPUTATIONAL LINGUISTICS GROUP
RESEARCH
CLIPS
www.clips.ua.ac.be
• Psycholinguistics Group
  - Language Acquisition (Steven Gillis)
  - Language Processing (Dominiek Sandra)
• Computational Linguistics Group (Walter Daelemans)
  - Text mining
  - Natural language understanding
COMPUTATIONAL STYLOMETRY
• Authorship Attribution, personality prediction from text
• Exploring feature sets, corpora, different types of tasks (few vs many authors, …)
• Stylene (EWI): Stylometry and Readability Environment for Dutch
• Stylometry experiments with Middle Dutch sermons
• Investigate Hugo Claus’ work to find evidence of Alzheimer’s disease
AUTHOR AND COPYIST ATTRIBUTION IN MEDIEVAL DUTCH TEXTS
• Use computational stylometry techniques to assign an author or copyist to anonymous texts
• Adapt and develop language technology tools for Medieval Dutch
People: Walter Daelemans, Mike Kestemont
Also: Dating (of texts)
BIOGRAPH
• Adaptation of text analysis tools to biomedical language
• handling of negation, modality, and quantification in medical language
• Extract accurate relations from text
People: Walter Daelemans, Roser Morante, Vincent Van Asch
DAPHNE: DETECTING “GROOMING” BY PEDOPHILES IN SOCIAL NETWORKS
• Use text analysis tools to distinguish between children and adults posing as children on chatrooms
People: Walter Daelemans, Claudia Peersman
ARTIFICIAL CREATIVITY IN GRAPHICAL DESIGN
• Develop a software algorithm that summarizes, interprets and processes textual content (or data sets) in the context of graphical design
People: Walter Daelemans, Tom De Smedt
TREND MINING
DELEARYOUS
• Develop serious 3d game to help with training of interpersonal communicative skills
• Use text analysis tools to associate human interaction with quadrants of Leary’s Rose and plan next interaction
People: Walter Daelemans, Frederik Vaassen
ALADIN
ALADIN: Adaptation and Learning for Assistive Domestic Vocal INterfaces
Goal: develop a robust, self-learning domestic vocal interface that adapts to the user instead of the other way around:
  - learn the user’s vocabulary & grammar constructs
  - learn the user’s voice & pronunciation characteristics
How? Unsupervised learning on the basis of training examples: vocal commands + associated controls (actions)
• People: Janneke van de Loo, Guy De Pauw, Walter Daelemans
AFRICAN LANGUAGE TECHNOLOGY (FWO)
• Explore machine learning techniques for building language technology applications and modules for (resource-scarce) African languages
• Data collection, annotation and deployment
• Unsupervised learning of morphology
People: Guy De Pauw, Naomi Maajabu
OTHER PROJECTS
• NEON: automatic subtitling of television programs
• AMiCA: Automatic Monitoring for Cyberspace Applications
• STARLING: Statistical Relational Learning of Natural Language
DATA-DRIVEN APPROACHES
NATURAL LANGUAGE PROCESSING
• Early NLP: deductive, rule-based methods
  - Limited computational power
  - AI: expert systems
• Use linguistic experts to build rule-based NLP applications
  - Advantages:
    • linguistically relevant
    • precise, fine-tuned for specific domain
  - Disadvantages:
    • expensive development
    • not robust
    • not domain/language independent
    • KNOWLEDGE ACQUISITION BOTTLENECK
SHIFT TO INDUCTIVE PARADIGM
• Late 80s, early 90s: shift from deductive paradigm to inductive paradigm, i.e. from rule-based to corpus-based approaches
  - Exploit large, annotated language corpora with statistical and machine learning methods to automatically induce NLP tools
  - Intelligent NLP systems learning from examples
PARADIGM SHIFT
© Walter Daelemans
Deductive Methods
  - Hard-coded solutions
  - (Linguistic) expert systems
  - Rule-based methods
Inductive Methods
  - Induced from (annotated) corpora
  - Statistical, machine learning techniques
  - Data-driven methods
WHY THIS SHIFT?
• NLP was coming out of the toy domain
• Disadvantages of rule-based methods (expense, lack of robustness, domain dependence) were becoming too obstructive for effective NLP research
Fred Jelinek (1988) on working on a speech recognizer: “Every time I fire a linguist the performance of the recognizer goes up”
HUGO BRANDT CORSTIUS
Three laws of computational linguistics
1. Whatever one does, semantics will always interfere.
2. Any linguistic description, no matter how precise, will turn out to contain an error when one attempts to implement it.
3. Law of diminishing returns.
DATA-DRIVEN NLP
• Statistical techniques
  - Number crunching
  - Exploit statistics for classification
Probabilistic POS Tagging
• Requires annotated corpus
can/md the/dt tag/nn be/vb better/jjr
• Unigram: P(word | tag)
  frequency of the tag for this word in corpus
• Bigram: P(word_i | tag_i) · P(tag_i | tag_{i-1})
  frequency of the tag for this word in corpus, given previous tag
• Trigram: P(word_i | tag_i) · P(tag_i | tag_{i-1}, tag_{i-2})
  frequency of the tag for this word in corpus, given previous two tags
• Good results, but data sparseness problems
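The bigram model above can be sketched in a few lines of Python. This is only an illustration, not the tagger used in the talk: the two training sentences are the toy examples from these slides, the `<s>` start marker, add-one smoothing (one simple way to soften the sparseness problem) and Viterbi decoding are assumptions made for the sketch.

```python
from collections import Counter, defaultdict

START = "<s>"  # hypothetical start-of-sentence marker (not from the slides)

def train(tagged_sentences):
    """Count word emissions per tag and tag-to-tag transitions."""
    emit = defaultdict(Counter)   # emit[tag][word]
    trans = defaultdict(Counter)  # trans[previous_tag][tag]
    for sentence in tagged_sentences:
        prev = START
        for word, tag in sentence:
            emit[tag][word.lower()] += 1
            trans[prev][tag] += 1
            prev = tag
    return emit, trans

def smoothed(counter, key, n_outcomes):
    """Add-one smoothed relative frequency (assumption: Laplace smoothing)."""
    return (counter[key] + 1) / (sum(counter.values()) + n_outcomes)

def tag_sentence(words, emit, trans):
    """Viterbi search for the tag sequence maximising
    the product of P(word_i | tag_i) * P(tag_i | tag_{i-1})."""
    if not words:
        return []
    words = [w.lower() for w in words]
    tags = sorted(emit)
    n_words = len({w for c in emit.values() for w in c}) + 1  # +1 for unseen words
    n_tags = len(tags)
    best = []  # best[i][tag] = (score, previous tag)
    for i, word in enumerate(words):
        column = {}
        for tag in tags:
            e = smoothed(emit[tag], word, n_words)
            if i == 0:
                column[tag] = (smoothed(trans[START], tag, n_tags) * e, None)
            else:
                column[tag] = max(
                    (best[i - 1][p][0] * smoothed(trans[p], tag, n_tags) * e, p)
                    for p in tags
                )
        best.append(column)
    # Follow the back-pointers from the best final tag.
    tag = max(tags, key=lambda t: best[-1][t][0])
    path = [tag]
    for i in range(len(words) - 1, 0, -1):
        tag = best[i][tag][1]
        path.append(tag)
    return path[::-1]

corpus = [
    [("the", "dt"), ("can", "nn"), ("will", "md"), ("rust", "vb")],
    [("can", "md"), ("the", "dt"), ("tag", "nn"), ("be", "vb"), ("better", "jjr")],
]
emit, trans = train(corpus)
print(tag_sentence("the can will rust".split(), emit, trans))
```

With only two training sentences the counts are extremely sparse, which is exactly the problem the last bullet points at; a realistic tagger would need a much larger annotated corpus.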
DATA-DRIVEN NLP
• Statistical techniques
  - Number crunching
  - Exploit statistics for classification
• Machine-learning techniques:
  - Symbolic approaches
  - Use annotated corpora as example of a particular classification task
  - The machine learning algorithm learns from examples
  - Simple approach: Memory-based learning
Memory-Based Learning
• Describe problem
• Gather data
Temp  coughing  headache  nose    class
37    YES       YES       RUNNY   COLD
39    NO        YES       OK      FLU
40    YES       NO        STUFFY  BRONCHITIS
…     …         …         …       …
Memory-Based Learning
Classification in Memory-Based Learning:
New instance: 39,YES,YES,OK,?????
Temp  coughing  headache  nose    class       overlap with new instance
37    YES       YES       RUNNY   COLD        2
39    NO        YES       OK      FLU         3
40    YES       NO        STUFFY  BRONCHITIS  1
…     …         …         …       …
Classified instance: 39,YES,YES,OK,FLU
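The overlap computation on this slide can be reproduced with a very small sketch. It assumes the plain overlap metric (count of matching feature values) and a single nearest neighbour with majority voting among ties; the names `MEMORY`, `overlap` and `classify` are made up for the illustration, and real memory-based learners typically add feature weighting on top of this.

```python
from collections import Counter

# Instance memory copied from the slide: (temp, coughing, headache, nose) -> class
MEMORY = [
    ((37, "YES", "YES", "RUNNY"), "COLD"),
    ((39, "NO", "YES", "OK"), "FLU"),
    ((40, "YES", "NO", "STUFFY"), "BRONCHITIS"),
]

def overlap(a, b):
    """Number of feature positions with identical values."""
    return sum(x == y for x, y in zip(a, b))

def classify(instance, memory=MEMORY):
    """Return the class of the most similar stored instance
    (majority vote if several instances share the best overlap)."""
    scored = [(overlap(instance, features), label) for features, label in memory]
    best = max(score for score, _ in scored)
    votes = Counter(label for score, label in scored if score == best)
    return votes.most_common(1)[0][0]

new_instance = (39, "YES", "YES", "OK")
print([overlap(new_instance, features) for features, _ in MEMORY])  # [2, 3, 1]
print(classify(new_instance))  # FLU
```

Nothing is abstracted at training time: "learning" is just storing `MEMORY`, and all the work happens at classification time.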
Memory-Based Language Processing
• Memory-Based Language Processing
• Lazy Learning Algorithm: no abstraction made during learning (↔ C5, Brill, Neural Networks,...)
• Data is stored in memory
• Nearest Neighbor search:
New data is classified by comparing the new instances to the instances in memory and extrapolating the class of the most similar instance
• Psycholinguistic Relevance
NLP as classification
• Most tasks in NLP: mapping between representations
the can will rust ► dt nn md vb
• This mapping in NLP: very complex because of different levels in language
• cf. rule-based methods: only approximate
• machine-learning methods: also approximate, but less effort
NLP as classification
Transform the description of your problem into a fixed feature vector
e.g. Past tense of English verbs
  work-worked  sing-sang  sting-stung  ...
• describe verb in 3 features: onset-nucleus-coda
  w,o,rk  s,i,ng  st,i,ng  ...
• describe output in finite set of classes
  -ed  i-a  i-u  ...
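As a rough illustration of the feature extraction step, a monosyllabic verb can be split into onset, nucleus and coda with a regular expression. This is a simplification that works on spelling rather than on phonemes, treats only a, e, i, o, u as vowels, and rejects anything that is not a single consonants-vowels-consonants syllable; `split_verb` is a hypothetical helper name.

```python
import re

# Assumption: orthographic approximation of a single syllable,
# i.e. optional consonants + one vowel group + optional consonants.
SYLLABLE = re.compile(r"^([^aeiou]*)([aeiou]+)([^aeiou]*)$")

def split_verb(verb):
    """Return the (onset, nucleus, coda) feature vector for a monosyllabic verb."""
    match = SYLLABLE.match(verb.lower())
    if match is None:
        raise ValueError(f"not a single written syllable: {verb!r}")
    return match.groups()

for verb in ("work", "sing", "sting", "shoot", "cramp"):
    print(verb, split_verb(verb))
```

The point of the fixed three-slot vector is that every verb, whatever its length, ends up with the same number of features, which is what the classifier needs.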
NLP as classification
Create instance base from set of examples and feed it to the machine learner, e.g. TiMBL
onset  nucleus  coda  class
w      o        rk    -ed
s      i        ng    i→a
st     i        ng    i→u
sh     oo       t     oo→o
cr     a        mp    -ed
r      i        ng    i→u
New instances to classify: cork, fling, loot?
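To see what such an instance base predicts for the three new verbs, the sketch below classifies them by overlap and then applies the predicted class to produce a past-tense form. It does not use TiMBL itself: the overlap metric, the tie-breaking by majority vote and the helper names are assumptions for the illustration, and the instance base is copied from the slide as-is.

```python
from collections import Counter

# (onset, nucleus, coda) -> class, as listed on the slide
INSTANCE_BASE = [
    (("w", "o", "rk"), "-ed"),
    (("s", "i", "ng"), "i→a"),
    (("st", "i", "ng"), "i→u"),
    (("sh", "oo", "t"), "oo→o"),
    (("cr", "a", "mp"), "-ed"),
    (("r", "i", "ng"), "i→u"),
]

def predict_class(features):
    """Class of the nearest stored instances (majority vote among ties)."""
    scored = [
        (sum(a == b for a, b in zip(features, stored)), label)
        for stored, label in INSTANCE_BASE
    ]
    best = max(score for score, _ in scored)
    votes = Counter(label for score, label in scored if score == best)
    return votes.most_common(1)[0][0]

def apply_class(features, label):
    """Turn a predicted class into a past-tense form."""
    onset, nucleus, coda = features
    if label == "-ed":
        return onset + nucleus + coda + "ed"
    old, new = label.split("→")  # vowel-change classes such as i→u
    return onset + (new if nucleus == old else nucleus) + coda

for features in (("c", "o", "rk"), ("fl", "i", "ng"), ("l", "oo", "t")):
    label = predict_class(features)
    print("".join(features), label, apply_class(features, label))
```

With this tiny memory, "cork" follows "work" and "fling" follows the majority of the "-ing" verbs, while "loot" is pulled towards "shoot" and comes out as "lot": the classifier is only as good as the examples it has stored.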
Classification
Apply this method to NLP tasks: e.g. POS-tagging
T-2  T-1  F          T+1        TAG
#    #    MD_NN      DT_RB      MD
#    MD   DT_RB      NN_VB      DT
MD   DT   NN_VB      VB         NN
DT   NN   VB         JJR_VB_RB  VB
NN   VB   JJR_VB_RB  #          JJR
e.g. Can/MD this/DT tag/NN be/VB better/JJR
DATA-DRIVEN AFRICAN LANGUAGE TECHNOLOGY
DATA ACQUISITION
• Many advantages to data-driven NLP
• BUT: Methods need annotated data
  - Generally less expensive to develop
  - Same annotated data can be used by different researchers, using different methods (publications)
• But: what about lesser-used languages, resource-scarce languages?
AFRICAN LANGUAGE TECHNOLOGY
• 2000+ languages
• Very limited work on BLARKs (Basic Language Resource Kits) in Africa
• Bridging the digital divide: need for Language Technology:
  - Localization
  - Machine translation
But: computational linguistics for African languages = resource-scarce language engineering
RESOURCE-SCARCE LANGUAGE ENGINEERING
• Develop re-usable tools and methodology
  - Corpus-based methods (machine learning)
  - Develop annotated corpora
  - Develop automated methods that minimize the amount of manual effort (and linguistic expertise) involved
Research Question: are data-driven methods applicable to African languages?
How to overcome Indo-European bias?
ADVANTAGES OF MACHINE LEARNING
• Language independence
• Development speed
• Adaptability
• Applicability
• Empiricism
LANGUAGE INDEPENDENCE
CORPUS COLLECTION
• Hoogeveen & De Pauw (2011) A Web Corpus Mining Tool for Resource-Scarce Languages
  - Increasing amount of vernacular data for many sub-Saharan African languages - web-mining
  - Language identification of over 500 languages (96% accurate)
• Encoding Issues
  - Diacritics not or inconsistently used
  - Normalization
MEMORY-BASED DIACRITIC CORRECTION
• mbũri written as mburi
• No digital lexicons available: simple look-up approach not applicable
• Alternative approach to normalization by defining the problem on the character level.
L L L L L F R R R R R C
- - - - - m b u r i - m
- - - - m b u r i - - b
- - - m b u r i - - - ũ
- - m b u r i - - - - r
- m b u r i - - - - - i
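The character-level instances in the table can be generated with a sliding window. The sketch below assumes a window of five characters on each side, "-" as padding, and that the plain and the diacritically marked form have the same number of characters (precomposed characters such as ũ, U+0169); `window_instances` is a hypothetical helper name.

```python
def window_instances(plain, accented, width=5, pad="-"):
    """One training instance per character: `width` characters of left
    context, the focus character, `width` characters of right context,
    and as class the correctly marked character."""
    if len(plain) != len(accented):
        raise ValueError("forms must be character-aligned")
    padded = pad * width + plain + pad * width
    instances = []
    for i, target in enumerate(accented):
        left = padded[i:i + width]
        focus = padded[i + width]
        right = padded[i + width + 1:i + 2 * width + 1]
        instances.append((list(left) + [focus] + list(right), target))
    return instances

# "mb\u0169ri" is mbũri written with a precomposed u-tilde.
for features, label in window_instances("mburi", "mb\u0169ri"):
    print(" ".join(features), label)
```

At correction time the same windows are built from the unmarked text and the classifier predicts the class column, so no lexicon is needed.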
EVALUATION
[Figure: the data divided into a TRAINING SET and a TEST SET]
• 10-fold cross validation
• Compare output of automatic system to reference translation
• Calculate accuracy scores for unseen data
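A minimal sketch of 10-fold cross validation, assuming the data is a list of (input, reference) pairs and that accuracy is the fraction of exact matches. The `train_fn` and `predict_fn` callbacks are placeholders for whatever learner is being evaluated; they are not part of any particular toolkit.

```python
def k_fold_splits(items, k=10):
    """Yield (train, test) pairs; every item is used for testing exactly once."""
    folds = [items[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, test

def cross_validate(items, train_fn, predict_fn, k=10):
    """Average accuracy on held-out data over k folds."""
    scores = []
    for train, test in k_fold_splits(items, k):
        model = train_fn(train)
        correct = sum(predict_fn(model, x) == y for x, y in test)
        scores.append(correct / len(test))
    return sum(scores) / len(scores)

# Toy usage: a "learner" that always predicts the most frequent label.
data = [(i, "a" if i % 4 else "b") for i in range(40)]
majority = lambda train: max({y for _, y in train}, key=[y for _, y in train].count)
print(cross_validate(data, majority, lambda model, x: model))
```

Because the test fold is never seen during training, the reported score is an estimate of accuracy on unseen data, which is the point of the last bullet.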
MEMORY-BASED DIACRITIC CORRECTION
Region  Language        Types   LLU   MBT   LLU+MBT
Africa  Cilubà          20.0k   77.0  85.3  79.6
Africa  Gĩkũyũ          9.1k    77.3  92.4  91.5
Africa  Kĩkamba         9.7k    79.4  91.6  90.4
Africa  Northern Sotho  157.8k  97.6  99.2  99.4
Africa  Tshivenda       9.6k    97.7  99.4  99.2
Africa  Yoruba          4.2k    67.8  76.8  68.5
Europe  Czech           105.8k  61.8  89.2  90.1
Europe  Romanian        146.9k  94.0  96.5  96.6
Europe  French          258.6k  89.1  88.3  89.3
Europe  Dutch           301.9k  99.9  99.8  99.9
Europe  German          365.6k  96.2  95.3  96.8
Asia    Vietnamese      50.9k   74.5  73.5  75.5
Asia    Chinese Pinyin  12.0k   78.5  83.9  80.3
• Single set of scripts for each language
• Limited linguistic expertise
DEVELOPMENT SPEED
CORPUS ANNOTATION
• Swahili Part-of-Speech Tagging (De Pauw, de Schryver & Wagacha, 2006): uses annotated Helsinki Corpus of Swahili as training material for data-driven tagger (>98% accurate)
• Northern Sotho: start from scratch
  - Minimize human linguistic expertise
  - No extended tagging protocol development phase
  - Maximize re-usability of methodology
CORPUS ANNOTATION
• Starting point: digital lexicon (word + possible tags)
• Annotator environment: Spreadsheet
TAGGED CORPUS
• 10,000 words annotated over the course of a couple of weeks
• Use data as training material for a maximum entropy-based tagger (advanced statistical modeling of data)
• Classification on the basis of
  - Contextual features: surrounding words and tags
  - Orthographic features: capitalization, prefix/suffix letters
TAGGED CORPUS
Instance, followed by its tag:
['W-1=#', 'T-1=#', 'FW=Ke', 'FT=SC_COPp', 'W+1=a', 'T+1=SC_PRES_PC_DEM_OC_HRTp', 'P1=K', 'S1=e', 'P2=Ke', 'S2=Ke', 'CAP']
Tag: SC
['W-1=Ke', 'T-1=SC', 'FW=a', 'FT=SC_PRES_PC_DEM_OC_HRTp', 'W+1=eletša', 'T+1=V', 'P1=a', 'S1=a']
Tag: PRES
['W-1=a', 'T-1=PRES', 'FW=eletša', 'FT=V', 'W+1=.', 'T+1=Punc', 'P1=e', 'S1=a', 'P2=el', 'S2=ša', 'P3=ele', 'S3=tša']
Tag: V
['W-1=eletša', 'T-1=V', 'FW=.', 'FT=Punc', 'W+1=#', 'T+1=#', 'P1=.', 'S1=.']
Tag: Punc
Ke/SC a/PRES eletša/V ./Punc
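The instances above appear to combine contextual features (previous word and tag, the focus word with its possible tags from the lexicon, the next word with its possible tags) with prefix/suffix and capitalisation features. The function below is a reconstruction from those four example instances, not the original code: the affix length limit of three, the `UNKNOWN` fallback for words missing from the lexicon, and the function name are assumptions.

```python
def extract_features(words, lexicon, previous_tags, i, max_affix=3):
    """Feature list for the word at position i.

    lexicon maps a word to its ambiguity class (all possible tags joined
    by '_'); previous_tags holds the tags already assigned to words[0:i].
    """
    def word_at(j):
        return words[j] if 0 <= j < len(words) else "#"

    def ambiguity_class(word):
        # 'UNKNOWN' is a hypothetical fallback, not taken from the slides.
        return "#" if word == "#" else lexicon.get(word, "UNKNOWN")

    focus = words[i]
    features = [
        f"W-1={word_at(i - 1)}",
        f"T-1={previous_tags[i - 1] if i > 0 else '#'}",
        f"FW={focus}",
        f"FT={ambiguity_class(focus)}",
        f"W+1={word_at(i + 1)}",
        f"T+1={ambiguity_class(word_at(i + 1))}",
    ]
    # Orthographic features: prefixes and suffixes up to max_affix letters.
    for n in range(1, min(max_affix, len(focus)) + 1):
        features.append(f"P{n}={focus[:n]}")
        features.append(f"S{n}={focus[-n:]}")
    if focus[0].isupper():
        features.append("CAP")
    return features

words = ["Ke", "a", "elet\u0161a", "."]  # \u0161 is š
lexicon = {
    "Ke": "SC_COPp",
    "a": "SC_PRES_PC_DEM_OC_HRTp",
    "elet\u0161a": "V",
    ".": "Punc",
}
tags = ["SC", "PRES", "V", "Punc"]
for i in range(len(words)):
    print(extract_features(words, lexicon, tags, i), tags[i])
```

During training the gold tags supply `T-1`; at tagging time the tagger's own previous decision would be used instead.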
RESULTS
          Known Words  Unknown Words (8%)  Total
Baseline  75.8         35.1                73.5
MaxTag    95.1         78.9                93.5
• Minimal development time
• Tagging protocol on-the-fly (grounded in performance)
• Good tagging accuracy
LEARNING CURVE
[Figure: learning curve. x-axis: number of words in training set (1k to 10k); y-axis: accuracy on test set (70 to 100); curves: Known, Unknown, Total]
ADAPTABILITY
SWAHILI LEMMATIZATION
• Extract information from Helsinki Corpus of Swahili
• Typical HCS annotation:
  ulikanusha  kanusha  V  deny, disprove, refute, negate
  ulikoanzia  anza     V  begin, establish
• Perform pattern-matching of lemma onto word form to create two-level morphological segmentation:
  ulikanusha  kanusha  Surface: uli[P] + kanusha[R]        Lexical: uli[P] + kanusha[R]
  ulikoanzia  anza     Surface: uliko[P] + anz[R] + ia[S]  Lexical: uliko[P] + anza[R] + ia[S]
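One simple way to perform this kind of pattern matching is to look for the longest initial part of the lemma that occurs inside the word form and treat what precedes it as prefix and what follows as suffix. The sketch below does just that; it is an assumed heuristic that reproduces the two examples on the slide, not the procedure actually used on the Helsinki Corpus, and `segment` and `min_stem` are made-up names.

```python
def segment(word_form, lemma, min_stem=2):
    """Return (surface, lexical) prefix-root-suffix segmentations,
    or None if no part of the lemma of at least min_stem letters matches."""
    word = word_form.lower()
    for end in range(len(lemma), min_stem - 1, -1):
        stem = lemma[:end]            # try the full lemma first, then shorter stems
        start = word.find(stem)
        if start == -1:
            continue
        prefix, suffix = word[:start], word[start + end:]

        def render(root):
            parts = [(prefix, "P"), (root, "R"), (suffix, "S")]
            return " + ".join(f"{morph}[{tag}]" for morph, tag in parts if morph)

        # Surface level keeps the matched stem, lexical level restores the lemma.
        return render(stem), render(lemma)
    return None

print(segment("ulikanusha", "kanusha"))
print(segment("ulikoanzia", "anza"))
```

A heuristic like this will also produce wrong segmentations when the stem happens to occur elsewhere in the word, which is one source of the noise discussed on the next slide.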
HCS PROBLEMS
• 97k word forms extracted from 9M word HCS
• Noise in data:
  - Remove hapaxes, English words
  - But still many inconsistencies and mistakes in automatic annotation
• For proper evaluation: manually develop clean gold-standard evaluation set:
  - Take 10% of original word form list (9.7k word forms)
  - Manually annotate it with prefix-root-suffix protocol
GOLD STANDARD EVALUATION
• Annotation using Spreadsheet
MEMORY-BASED MORPHOLOGICAL ANALYSIS
• Extract instances from morphologically annotated word form list (e.g. for uliko[P] + anz[R] + ia[S])
    L5 L4 L3 L2 L1 F  R1 R2 R3 R4 R5  CLASS
 1  -  -  -  -  -  u  l  i  k  o  a   0
 2  -  -  -  -  u  l  i  k  o  a  n   0
 3  -  -  -  u  l  i  k  o  a  n  z   0
 4  -  -  u  l  i  k  o  a  n  z  i   0
 5  -  u  l  i  k  o  a  n  z  i  a   P
 6  u  l  i  k  o  a  n  z  i  a  -   0
 7  l  i  k  o  a  n  z  i  a  -  -   0
 8  i  k  o  a  n  z  i  a  -  -  -   R+a
 9  k  o  a  n  z  i  a  -  -  -  -   0
10  o  a  n  z  i  a  -  -  -  -  -   S
NLP-EVALUATION
           Segmentation of surface representation (WER)  Further lemmatization (WER)
Morfessor  70.7 %                                        73.6 %
SALAMAx    11.7 %                                        12.0 %
MBSMA-c    13.3 %                                        13.6 %
MBSMA-s    11.6 %                                        11.7 %
NLP-EVALUATION
• Minimal development time
• Robust for unknown words (vs. SALAMA, kamusiproject.org, …)
• How can a data-driven approach outperform the system that was used to build its training material? The generalization properties of the machine-learning approach filter out noise in the HCS-induced training set.
APPLICABILITY
PARALLEL CORPUS ENGLISH - SWAHILI
• SAWA corpus: 2.5 million word corpus of translated texts
• Limited availability of parallel texts English – Swahili:
  - Smaller documents: investment reports, political texts, e.g. Universal Declaration of Human Rights
  “There is no data like more data”
  - Bible, Quran, secular literature
  - New translations
• Experiment with statistical machine translation techniques
AVAILABLE DATA IN SAWA CORPUS (2010)
                    English Sentences  Swahili Sentences  English Words  Swahili Words
Bible               52.4k              51.2k              813.3k         653.7k
Quran               14.3k              14.5k              165.5k         124.3k
Declaration of HR   0.2k               0.2k               1.8k           1.8k
Kamusi.org          5.6k               5.6k               35.5k          26.7k
Movie Subtitles     9.0k               9.0k               72.2k          58.4k
Investment Reports  3.2k               3.1k               52.9k          54.9k
Local Translator    1.5k               1.6k               25.0k          25.7k
Draft Constitution  4.0k               3.8k               56.5k          51.1k
Total               90.2k              89k                1.2M           996.6k
WORD ALIGNMENT
Most difficult task: relate words between languages
[Figure: word alignment between an English utterance (tokens: No, she, ’s, uh, up, north, plus commas) and its Swahili counterpart (tokens: La, yuko, aa, juu, kaskazini, plus commas)]
WORD ALIGNMENT
You caught me skiving , I ‘m afraid .
Samahani , umenidaka nikihepa .
• Can be done automatically using established tools (GIZA++)
• Provide manual reference to evaluate automatic word alignment tools (5000 words, annotated with UMIACS alignment interface)
ALIGNMENT PROBLEMS
nimemkatalia
I have turned him down
MORPHOLOGICAL DECOMPOSITION
I have turned him down
ni+ me+ m+ katalia
SMT EXPERIMENTS
• Proof-of-principle experiments with MOSES
• GIZA++ word alignment
• SRILM language models
System  Direction          BLEU  NIST
Google  English → Swahili  0.15  4.56
SAWA    English → Swahili  0.14  4.23
Google  Swahili → English  0.18  4.54
SAWA    Swahili → English  0.23  4.74
LUO
• Truly resource-scarce language
• Nilotic language, 3M speakers
• International Bible Society (2005) Luo New Testament. Available at http://www.biblica.com/bibles/luo
• Use English and Swahili New Testament data of SAWA corpus to construct small trilingual parallel corpus
• Preprocessing:
  - PDF-to-text conversion
  - Tokenization
  - Sentence alignment
SMT EXPERIMENTS
Direction          OOV    BLEU  NIST
Luo → English      4.4%   5.39  0.23
Luo → English [F]  4.4%   6.52  0.29
English → Luo      11.4%  4.12  0.18
English → Luo [F]  11.4%  5.31  0.22
Luo → Swahili      6.1%   2.91  0.11
Luo → Swahili [F]  6.1%   3.17  0.15
Swahili → Luo      11.4%  2.96  0.10
Swahili → Luo [F]  11.4%  3.36  0.15
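The OOV column presumably reports the out-of-vocabulary rate: the share of words in the test data that never occur in the training data and therefore cannot be translated. A minimal sketch, assuming whitespace-tokenized sentences and a rate computed over running tokens (the slides do not say whether tokens or types were counted); the sentences below are placeholders, not corpus data.

```python
def oov_rate(train_sentences, test_sentences):
    """Fraction of test tokens that do not occur in the training data."""
    vocabulary = {token for sentence in train_sentences for token in sentence.split()}
    test_tokens = [token for sentence in test_sentences for token in sentence.split()]
    if not test_tokens:
        return 0.0
    unseen = sum(token not in vocabulary for token in test_tokens)
    return unseen / len(test_tokens)

# Placeholder "sentences" made of dummy tokens.
train = ["a b c", "b d"]
test = ["a e", "d f b"]
print(f"{oov_rate(train, test):.1%}")  # 40.0%
```

For a morphologically rich source language, splitting words into morphemes before training is one way to bring this rate down, which connects to the morphological decomposition slide earlier.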
EXAMPLES
Source: en ng’a moloyo piny ? mana jalo moyie ni yesu en wuod nyasaye
Translation: who is more than the earth ? only he who believes that he is the son of god
Reference: who is it that overcomes the world ? only he who believes that jesus is the son of god
Source: atimo erokamano kuom thuoloni
Translation: do thanks about this time
Reference: I am thankful for your leadership
FUTURE WORK
• More data!
  - Optimize techniques for smaller data sets
• Unsupervised machine learning
  - Implicit linguistics
  - Linguistic classification from scratch
  - Typically uses huge data sets
  - Spell checkers + morphological analysis project
CONCLUSION
• Our research on African Language Technology shows the typical advantages of the data-driven paradigm:
  - Language independence
  - Development speed
  - Adaptability
  - Applicability
  - Empiricism
• “Every time I fire a linguist, the recognizer’s accuracy goes up”
  - Fred Jelinek (2005): “Some of my best friends are linguists”
  - Anno 2012: linguists are not programmers per se, but domain experts
HTTP://AFLAT.ORG
SPECIAL ISSUE ON AFRICAN LANGUAGE TECHNOLOGY OF LANGUAGE RESOURCES AND EVALUATION 45(3). SEPTEMBER 2011
AFLAT 2012 (ISTANBUL, TURKEY)