CorpusLab, URPP “Language and Space”
Basic natural language processing for Swiss German texts
Tanja Samardzic
09/06/2017
Long-term contribution
Noemi Aepli, Fatima Stadler, Yves Scherrer, Elvira Glaser
Funding
Hasler Foundation grant No 16038
UZH URPP “Language and Space”
Agreement with Spitch
Specific tasks
Henning Beywl, Christof Bless, Alexandra Bunzli, Matthias Friedli, Anne Gohring, Noemi Graf, Anja Hasse,
Gordon Heath, Agnes Kolmer, Mike Lingg, Patrick Machler, Eva Peters, Uliana Petrunina,
Janine Richner-Steiner, Hana Ruch, Beni Ruef, Phillip Strobel, Simone Ueberwasser, Alexandra Zoller
Data
Oral history project ArchiMob
The ArchiMob corpus sample
Some numbers
44 documents selected by Janine Richner-Steiner and Matthias Friedli, supervised by Elvira Glaser
Release 1.0 (2016):
– 34 documents, around 500 000 word tokens
– 23/44 documents transcribed in the period 2004–2014
– 11/44 documents transcribed in 2015, in collaboration with Spitch
Next release (2017):
– 43 documents, around 650 000 word tokens
– 6/44 documents transcribed in 2016
– 3/44 in progress
Format
Current format
Content
Transcription   Normalisation   PoS
je              ja              ITJ
de              dann            ADV
het             hat             VAFIN
me              man             PIS
no              noch            ADV
gluegt          gelugt          VVPP
tankt           gedacht         VVPP
dasch           das ist         PDS+
ez              jetzt           ADV
de              der             ART
genneraal       general         NN

jaa             ja              ITJ
das             das             PDS
ischsch         ist             VAFIN
en              en              PPER
ez              jetzt           ADV
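Each token in the example carries three layers: the transcribed form, its normalisation, and a part-of-speech tag. A minimal sketch of how such triples could be held in memory follows. It assumes a whitespace-separated, one-token-per-line layout chosen for illustration only; this is not the corpus's actual XML format, and `Token` and `parse_token` are hypothetical names.

```python
from typing import NamedTuple


class Token(NamedTuple):
    transcribed: str  # dialect form as written by the transcriber
    normalised: str   # normalised form (may span several words, e.g. "das ist")
    pos: str          # part-of-speech tag


def parse_token(line: str) -> Token:
    """Split 'form normalisation ... TAG' into a Token.

    Assumed layout: first field is the transcription, last field is the tag,
    everything in between is the normalisation.
    """
    fields = line.split()
    if len(fields) < 3:
        raise ValueError(f"expected at least 3 fields, got: {line!r}")
    return Token(fields[0], " ".join(fields[1:-1]), fields[-1])


utterance = [parse_token(line) for line in [
    "jaa ja ITJ",
    "das das PDS",
    "ischsch ist VAFIN",
    "en en PPER",
    "ez jetzt ADV",
]]

# Read off one layer at a time.
print(" ".join(t.transcribed for t in utterance))  # jaa das ischsch en ez
print(" ".join(t.normalised for t in utterance))   # ja das ist en jetzt
```

Keeping the normalisation as a single string lets one-to-many cases such as `dasch` / `das ist` stay attached to one transcribed token.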
Transcription
Manual transcription
1. 16 documents - Nisus Writer
   – No segmentation (only turns)
   – No text to speech alignment
   – Converted into XML, added segmentation and alignment
2. 7 documents - FOLKER (Schmidt, 2012)
   – Segmented into chunks of 4-10 seconds
   – XML and alignment output
3. 11 documents - EXMARaLDA (Schmidt, 2012)
   – Same as FOLKER, just more convenient
Some details
– Based on Dieth guidelines, but gradually simplified
– Utterance as the basic unit
– Turns not explicitly annotated
– Inconsistency in writing (pronouns and clitics)
– Pauses, repetitions
– Incomprehensible speech
Normalisation
Approach
– Manual normalisation of 6 documents, with VARD2 and IGT
– Automatic normalisation
  – Character-level statistical machine translation (CSMT) with MOSES
  – Training on the 6 manually normalised documents
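Character-level translation treats each character as a "word", so the training pairs have to be rewritten as space-separated character sequences before a phrase-based system can align them. The sketch below shows one way to do that. The function names are hypothetical, and the `_` placeholder for word-internal spaces in multi-word normalisations is an assumption for illustration, not necessarily the convention used in the project.

```python
def to_char_sequence(text: str, space_symbol: str = "_") -> str:
    """Rewrite text as space-separated characters.

    'ischsch' -> 'i s c h s c h'
    Real spaces (as in 'das ist') are replaced by space_symbol first, so
    they stay distinguishable from the separators between characters.
    """
    return " ".join(text.replace(" ", space_symbol))


def make_parallel_lines(pairs):
    """Turn (transcribed, normalised) pairs into source and target lines."""
    source = [to_char_sequence(transcribed) for transcribed, _ in pairs]
    target = [to_char_sequence(normalised) for _, normalised in pairs]
    return source, target


pairs = [("ischsch", "ist"), ("ez", "jetzt"), ("dasch", "das ist")]
source_lines, target_lines = make_parallel_lines(pairs)

for src, tgt in zip(source_lines, target_lines):
    print(f"{src}  |||  {tgt}")
```

The two lists would then be written to a pair of line-aligned files, one line per training item, which is the usual input shape for parallel-corpus training.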
CSMT
Translation model: $p(\text{normalised} \mid \text{transcribed})$
i s c h s c h → i s t
Language model: $p(\text{normalised}_i \mid \text{normalised}_{i-1})$
i s t
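The language model term scores how plausible a character sequence is on the normalised side, conditioning each character on the one before it. A minimal bigram sketch with add-one smoothing is shown below. The tiny training list and the smoothing choice are assumptions for illustration; a production setup would train the language model with a dedicated toolkit on far more data.

```python
import math
from collections import Counter

BOS, EOS = "<s>", "</s>"  # word-boundary symbols


def train_bigram_lm(words):
    """Count character bigrams and their left contexts over the training words."""
    context_counts, bigram_counts = Counter(), Counter()
    vocab = {EOS}
    for word in words:
        chars = [BOS] + list(word) + [EOS]
        vocab.update(word)
        for prev, cur in zip(chars, chars[1:]):
            context_counts[prev] += 1
            bigram_counts[(prev, cur)] += 1
    return context_counts, bigram_counts, vocab


def log_prob(word, model):
    """Sum of log p(c_i | c_{i-1}) over the word, with add-one smoothing."""
    context_counts, bigram_counts, vocab = model
    chars = [BOS] + list(word) + [EOS]
    total = 0.0
    for prev, cur in zip(chars, chars[1:]):
        numerator = bigram_counts[(prev, cur)] + 1
        denominator = context_counts[prev] + len(vocab)
        total += math.log(numerator / denominator)
    return total


model = train_bigram_lm(["ist", "ja", "jetzt", "das", "dann", "hat"])

# The normalised form should look more plausible than the dialect spelling.
print(log_prob("ist", model) > log_prob("ischsch", model))  # True
```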
Current state of the art
Yves Scherrer and Nikola Ljubesic (KONVENS 2016)
– Larger translation units (utterances instead of words)
– Language model augmented with German spoken data
– Improved tuning
– Result: 90.46 % accuracy
Part-of-speech tagging
Tagger development
STTS+ tag set
Setting               Train             Test         % Acc.   % OOV
Starting              TuBa-D/S          Normalised   70.31    24.21
Starting              NOAH              Original     60.56    30.72
Removed punctuation   TuBa-D/S          Normalised   70.68    24.21
Removed punctuation   NOAH              Original     73.09    30.72
Adapted               NOAH + ArchiMob   Original     90.09    –
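The two measures in the table can be computed as in the sketch below. It assumes both are token-level percentages: accuracy as the share of test tokens with the correct tag, and OOV as the share of test tokens never seen in the training data. The slides do not spell out the definitions, so this is one plausible reading, and the toy data is made up for illustration.

```python
def accuracy(gold_tags, predicted_tags):
    """Percentage of tokens whose predicted tag equals the gold tag."""
    if len(gold_tags) != len(predicted_tags):
        raise ValueError("gold and predicted sequences differ in length")
    if not gold_tags:
        raise ValueError("cannot compute accuracy on an empty sequence")
    correct = sum(g == p for g, p in zip(gold_tags, predicted_tags))
    return 100.0 * correct / len(gold_tags)


def oov_rate(train_tokens, test_tokens):
    """Percentage of test tokens that do not occur in the training tokens."""
    if not test_tokens:
        raise ValueError("cannot compute OOV rate on an empty test set")
    known = set(train_tokens)
    unknown = sum(token not in known for token in test_tokens)
    return 100.0 * unknown / len(test_tokens)


gold = ["ITJ", "PDS", "VAFIN", "PPER", "ADV"]
predicted = ["ITJ", "PDS", "VAFIN", "ART", "ADV"]
print(accuracy(gold, predicted))  # 80.0

# Training on normalised forms, testing on original dialect forms:
train = ["ja", "das", "ist", "jetzt"]
test = ["jaa", "das", "ischsch", "en", "ez"]
print(oov_rate(train, test))  # 80.0
```

The second call illustrates why a tagger trained on standard-like text meets many unknown words on original transcriptions: only `das` is shared between the two toy vocabularies.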
Current activities
Tagger adaptation:
– Active learning: gradually add ArchiMob data in the train set
– CRF tagger
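The active-learning idea of gradually adding in-domain data to the training set can be sketched as a loop: train, pick the unlabelled sentences the tagger is least sure about, have them annotated, and retrain. The sketch below is a toy version under stated assumptions. It uses a most-frequent-tag lookup as a stand-in for the CRF tagger, measures uncertainty as the share of unseen words, and takes a hypothetical `annotate` callback in place of a human annotator.

```python
from collections import Counter, defaultdict


def train_unigram_tagger(sentences):
    """Map each word to its most frequent tag (stand-in for a real tagger)."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        for word, tag in sentence:
            counts[word][tag] += 1
    return {word: tags.most_common(1)[0][0] for word, tags in counts.items()}


def uncertainty(words, tagger):
    """Share of words the tagger has never seen (a crude uncertainty proxy)."""
    return sum(word not in tagger for word in words) / len(words)


def active_learning(seed, pool, annotate, rounds, batch_size):
    """Repeatedly move the most uncertain pool sentences into the training set."""
    train, pool = list(seed), list(pool)
    for _ in range(rounds):
        if not pool:
            break
        tagger = train_unigram_tagger(train)
        pool.sort(key=lambda words: uncertainty(words, tagger), reverse=True)
        batch, pool = pool[:batch_size], pool[batch_size:]
        train.extend(annotate(words) for words in batch)
    return train_unigram_tagger(train), train


# Toy data: gold tags play the role of the human annotator.
gold = {"jaa": "ITJ", "das": "PDS", "ischsch": "VAFIN", "en": "PPER",
        "ez": "ADV", "de": "ADV", "het": "VAFIN"}
seed = [[("das", "PDS"), ("ez", "ADV")]]
pool = [["das", "ez"], ["jaa", "das", "ischsch"], ["de", "het"]]

tagger, train = active_learning(
    seed, pool, annotate=lambda words: [(w, gold[w]) for w in words],
    rounds=1, batch_size=1,
)
print(sorted(tagger))  # ['das', 'de', 'ez', 'het']
```

With one round and a batch of one, the loop picks the sentence `de het`, because both of its words are unknown to the seed tagger. A real setup would replace the unseen-word proxy with the tagger's own confidence scores.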
Speech-to-text
Acoustic model: $p(\text{transcribed} \mid \text{sound})$
/.../ /.../ /.../ /.../ /.../ /.../ /.../ → das ischsch en ez
Language model: $p(\text{transcribed}_i \mid \text{transcribed}_{i-1})$
das ischsch en ez
Approach
– Improving Spitch prototype with new language models
– Our own speech-to-text development with Kaldi
– Manual transcription
Next steps
– Continue transcription, PoS tagging, normalisation
– Neural transducers (deep learning) for normalisation
– Subword language models for speech-to-text
– New data
Your feedback!