basic natural language processing for swiss german texts
TRANSCRIPT
CorpusLab, URPP “Language and Space”
Basic natural languageprocessing for Swiss GermantextsTanja Samardzic
09/06/2017 Page 1
CorpusLab, URPP “Language and Space”
Long-term contributionNoemi AepliFatima StadlerYves ScherrerElvira Glaser
Funding
Hasler Foundation grant No 16038
UZH URPP ’Language and Space’
Agreement with Spitch
Specific tasks
Henning BeywlChristof BlessAlexandra BunzliMatthias FriedliAnne GohringNoemi GrafAnja Hasse
Gordon HeathAgnes KolmerMike LinggPatrick MachlerEva PetersUliana Petrunina
Janine Richner-SteinerHana RuchBeni RuefPhillip StrobelSimone UeberwasserAlexandra Zoller
09/06/2017 Basic natural language processing for Swiss German texts Page 2
CorpusLab, URPP “Language and Space”
Data
CorpusLab, URPP “Language and Space”
Oral history project ArchiMob
09/06/2017 Basic natural language processing for Swiss German texts Page 4
CorpusLab, URPP “Language and Space”
The ArchiMob corpus sample
09/06/2017 Basic natural language processing for Swiss German texts Page 5
CorpusLab, URPP “Language and Space”
Some numbers
44 documents selected by Janine Richner-Steiner and Matthias Friedli,supervised by Elvira Glaser
Release 1.0 (2016):
– 34 documents, around 500 000 word tokens
– 23/44 documents transcribed in the period 2004–2014
– 11/44 documents transcribed in 2015, in collaboration with Spitch
Next release (2017):
– 43 documents, around 650 000 word tokens
– 6/44 documents transcribed in 2016
– 3/44 in progress
09/06/2017 Basic natural language processing for Swiss German texts Page 6
CorpusLab, URPP “Language and Space”
Format
CorpusLab, URPP “Language and Space”
Current format
09/06/2017 Basic natural language processing for Swiss German texts Page 8
CorpusLab, URPP “Language and Space”
Content
je ja ITJde dann ADVhet hat VAFINme man PISno noch ADVgluegt gelugt VVPPtankt gedacht VVPPdasch das ist PDS+ez jetzt ADVde der ARTgenneraal general NN
jaa ja ITJdas das PDSischsch ist VAFINen en PPERez jetzt ADV
09/06/2017 Basic natural language processing for Swiss German texts Page 9
CorpusLab, URPP “Language and Space”
Transcription
CorpusLab, URPP “Language and Space”
Transcription
je ja ITJde dann ADVhet hat VAFINme man PISno noch ADVgluegt gelugt VVPPtankt gedacht VVPPdasch das ist PDS+ez jetzt ADVde der ARTgenneraal general NN
jaa ja ITJdas das PDSischsch ist VAFINen en PPERez jetzt ADV
09/06/2017 Basic natural language processing for Swiss German texts Page 11
CorpusLab, URPP “Language and Space”
Manual transcription
1. 16 documents - Nisus Writer– No segmentation (only turns)– No text to speech alignment– Converted into XML, added segmentation and alignment
2. 7 documents - FOLKER (Schmidt, 2012)– Segmented into chunks of 4-10 seconds– XML and alignment output
3. 11 documents - EXMARaLDA (Schmidt, 2012)– same as FOLKER, just more convenient
09/06/2017 Basic natural language processing for Swiss German texts Page 12
CorpusLab, URPP “Language and Space”
Some details
– Based on Dieth guidelines, but gradually simplified
– Utterance as the basic unit
– Turns not explicitly annotated
– Inconsistence in writing (pronouns and clitics)
– Pauses, repetitions
– Incomprehensible speech
09/06/2017 Basic natural language processing for Swiss German texts Page 13
CorpusLab, URPP “Language and Space”
Normalisation
je ja ITJde dann ADVhet hat VAFINme man PISno noch ADVgluegt gelugt VVPPtankt gedacht VVPPdasch das ist PDS+ez jetzt ADVde der ARTgenneraal general NN
jaa ja ITJdas das PDSischsch ist VAFINen en PPERez jetzt ADV
09/06/2017 Basic natural language processing for Swiss German texts Page 14
CorpusLab, URPP “Language and Space”
Approach
– Manual normalisation of 6 documents, VARD2 and IGT
– Automatic normalisation– Character-level machine translation (CSMT) with MOSES– Training on the 6 manually normalised documents
09/06/2017 Basic natural language processing for Swiss German texts Page 15
CorpusLab, URPP “Language and Space”
CSMT
Translation model: p(normalised |transcribed)
i s c h s c h i s t
Language model: p(normalisedi |normalisedi−1)
i s t
09/06/2017 Basic natural language processing for Swiss German texts Page 16
CorpusLab, URPP “Language and Space”
Current state of the art
Yves Scherrer and Nikola Ljubesic (KONVENS 2016)
– Larger translation units (utterances instead of words)
– Language model augmented with German spoken data
– Improved tuning
– Result: 90.46 % accuracy
09/06/2017 Basic natural language processing for Swiss German texts Page 17
CorpusLab, URPP “Language and Space”
Part-of-speech tagging
CorpusLab, URPP “Language and Space”
Part-of-speech
je ja ITJde dann ADVhet hat VAFINme man PISno noch ADVgluegt gelugt VVPPtankt gedacht VVPPdasch das ist PDS+ez jetzt ADVde der ARTgenneraal general NN
jaa ja ITJdas das PDSischsch ist VAFINen en PPERez jetzt ADV
09/06/2017 Basic natural language processing for Swiss German texts Page 19
CorpusLab, URPP “Language and Space”
Tagger development
STTS+ tag set
Train Test % Acc. % OOV
TuBa-D/S Normalised 70.31 24.21Starting NOAH Original 60.56 30.72
Removed TuBa-D/S Normalised 70.68 24.21punctuation NOAH Original 73.09 30.72
NOAH +Adapted ArchiMob Original 90.09 –
09/06/2017 Basic natural language processing for Swiss German texts Page 20
CorpusLab, URPP “Language and Space”
Current activities
Tagger adaptation:
– Active learning: gradually add ArchiMob data in the train set
– CRF tagger
09/06/2017 Basic natural language processing for Swiss German texts Page 21
CorpusLab, URPP “Language and Space”
Speech-to-text
CorpusLab, URPP “Language and Space”
Speech-to-text
Acoustic model: p(transcribed |sound)
/.../ /.../ /.../ /.../ /.../ /.../ /.../ das ischsch en ez
Language model: p(transcribedi |transcribedi−1)
das ischsch en ez
09/06/2017 Basic natural language processing for Swiss German texts Page 23
CorpusLab, URPP “Language and Space”
Approach
– Improving Spitch prototype with new language models
– Our own speech-to-text development with Kaldi
– Manual transcription
09/06/2017 Basic natural language processing for Swiss German texts Page 24
CorpusLab, URPP “Language and Space”
Next steps
CorpusLab, URPP “Language and Space”
Next steps
– Continue transcription, PoS tagging, normalisation
– Neural transducers (deep learning) for normalisation
– Subword language models for speech-to-text
– New data
09/06/2017 Basic natural language processing for Swiss German texts Page 26
CorpusLab, URPP “Language and Space”
Your feedback!