Improving orthographic transcriptions using sentence similarities
© NL Oosthuizen, MJ Puttkammer, M Schlemmer
NL Oosthuizen, MJ Puttkammer & M Schlemmer
Centre for Text Technology (CTexT®)
Research Unit: Languages and Literature in the South African Context
North-West University, Potchefstroom Campus (PUK)
South Africa
E-mail: {nico.oosthuizen, martin.puttkammer, martin.schlemmer}@nwu.ac.za
18 May 2010; AfLaT 2010; Valletta, Malta
Oosthuizen, Puttkammer & Schlemmer
Introduction (Background, Problem Statement)
Identified Differences
Methodology
Results
Conclusion
Introduction: Background
• Lwazi (knowledge) project
– 200 mother-tongue speakers per language
– 30 phrases: 14 open-ended and 16 phoneme-rich sentences
– 350 phoneme-rich sentences from various corpora, each recorded 6-10 times, totalling 3200 phoneme-rich sentences
– The relatively small ASR corpus meant that extremely accurate transcriptions were required
Introduction: Problem Statement
• Lwazi project – Issues
– 2-year running time
– 4-6 transcribers were employed per language
– Different quality-control phases were unsuccessful
– Another solution was needed to improve quality
Confusables, Splits, Insertions, Deletions, Non-words
Identified Differences: Confusables
• English examples:
– has it been tried on <too> small a scale
– has it been tried on <to> small a scale
• isiXhosa examples:
– andingomntu <othanda> kufunda
• (I’m not a person <who loves> to read)
– andingomntu <uthanda> kufunda
• (I’m not a person <you love> to read)
• Setswana examples:
– bosa bo <jang> ko engelane ka nako e
– bosa bo <yang> ko engelane ka nako e
• (How is the weather in England at this time?)
• <yang> in the second example is slang for “how”
Identified Differences: Splits
• English examples:
– there’s <nowhere> else for it to go
– there’s <no_where> else for it to go
• isiXhosa examples:
– alwela phi na <loo_madabi>
– alwelwa phi na <loomadabi>
• (Where is it taking place, these challenges)
• Setswana examples:
– le fa e le <gone> re ratanang tota
– le fa e le <go_ne> re ratanang tota
• (Even though we have started dating)
Identified Differences: Insertions
• English examples:
– so we took our way toward the palace
– so we <we_>took our way toward the palace
• isiXhosa examples:
– ibiyini ukuba unga mbambi wakumbona
• (Why didn’t <you catch> him or her when you saw him or her)
– ibiyini <na_>ukuba unga mbambi wakumbona
• (Why didn’t <you caught> him or her when you saw him or her)
• Setswana examples:
– ba mmatlela mapai ba mo alela a robala
– ba mmatlela mapai <li-> ba mo alela a robala
• (They have looked for blankets and made a bed for themselves to sleep)
Identified Differences: Deletions
• English examples:
– as <to_>the first the answer is simple
– as the first the answer is simple
• isiXhosa examples:
– yagaleleka impi <ke_>xa kuthi qheke ukusa
– yagaleleka impi xa kuthi qheke ukusa
• (It started the battle at the beginning of the morning)
• Setswana examples:
– ke eng gape se seng<we> se o se lemogang
– ke eng gape se seng se o se lemogang
• (What else have you noticed?)
Identified Differences: Non-words
• English examples:
– there is no <arbitrator> except a legislature fifteen thousand miles off
– there is no <abritator> except a legislature fifteen thousand miles off
• isiXhosa examples:
– yile <venkile> yayikhethwe ngabathembu le
– yile <venkeli> yayikhethwe ngabathembu le
• (It is this shop that was selected by the Bathembu)
• <venkeli> in the second example is a spelling mistake.
• Setswana examples:
– lefapha la dimenerale le <eneji>
– lefapha la dimenerale le <energy>
• (Department of minerals and energy)
• <energy> in the second example is a spelling mistake.
Flowchart, Cleanup, Transcription Mapping, Sentences and Mark-up, Manual Verification
Methodology: Flowchart
• Input: original sentences & transcriptions
– An average of 350 sentences, each recorded 6-10 times, transcribed by 4-6 people per language
• Cleanup
– For transcriptions: remove punctuation, noise markers & partials; convert to lower case (LC)
– For original sentences: remove punctuation; convert to lower case (LC)
• Map transcription to original
– Compute Levenshtein distance and map each transcription to the closest original sentence
– Original: slighter faults of substance are numerous
– Transcription: slighter fault substance are numerous (90.20%)
– Transcription: slighter faults of substances are numerous (97.60%)
• Compare mapped sentences
– String Similarity (Brad Wood): a “look ahead” window finds differences
Methodology: Flowchart
• Markup
– HTML markup to illustrate differences with colours
– slighter faults of substance are numerous vs slighter fault substance are numerous
– slighter faults of substance are numerous vs slighter faults of substances are numerous
• Manual verification
– Verify the differences in context by listening to recordings
– I told him to make the charge at once vs I told him to make the change at once
• Correct errors
– Replace errors with correct string
• Improved transcriptions
Methodology: Cleanup
• Remove possible differences from the sentences to improve matches:
– Punctuation
• Any commas, full stops, extra spaces, etc.
– Noise markers
• External noises [n] and speaker noises [s]
– Partials
• Any incomplete words (indicated by a leading or trailing hyphen)
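The cleanup rules above could be sketched in Python roughly as follows. This is a minimal illustration, not the project's actual code: the function name `clean_sentence`, the regular expressions, and the choice to strip every ASCII punctuation character (including apostrophes and word-internal hyphens) are assumptions.

```python
import re
import string

# Assumed marker syntax, taken from the slide: [n] external noise, [s] speaker noise.
NOISE_MARKER = re.compile(r"\[(?:n|s)\]")
# A "partial" is assumed to be a whole token that starts or ends with a hyphen.
PARTIAL_WORD = re.compile(r"(?<!\S)(?:-\S+|\S+-)(?!\S)")
# Assumption: all ASCII punctuation is dropped, apostrophes included.
PUNCTUATION = str.maketrans("", "", string.punctuation)


def clean_sentence(text: str) -> str:
    """Normalise a sentence before it is compared with another one."""
    text = NOISE_MARKER.sub(" ", text)    # drop noise markers
    text = PARTIAL_WORD.sub(" ", text)    # drop incomplete words
    text = text.translate(PUNCTUATION)    # drop punctuation
    return " ".join(text.lower().split())  # lower-case, collapse extra spaces


print(clean_sentence("So we [n] took our wa- way, toward the palace."))
```

Partials are removed before punctuation so that the hyphen that identifies them is still present when they are matched.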
Methodology: Transcription Mapping
• Levenshtein mapping:
– Link each transcribed sentence (T) to an original sentence (O) using Levenshtein distance
– If no difference is found (DIFF(O, T) = 0)
• Do nothing
– If a difference is found (DIFF(O, T) = 1)
• Continue to next step
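A minimal Python sketch of this mapping step, assuming the sentences have already been cleaned. The helper names `levenshtein` and `map_to_original` are hypothetical, and the DIFF flag is assumed to be 1 whenever the edit distance to the closest original is non-zero.

```python
def levenshtein(a: str, b: str) -> int:
    """Character-level edit distance (insertions, deletions, substitutions)."""
    if len(a) < len(b):
        a, b = b, a                      # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # delete ca
                current[j - 1] + 1,            # insert cb
                previous[j - 1] + (ca != cb),  # substitute, or match at no cost
            ))
        previous = current
    return previous[-1]


def map_to_original(transcription: str, originals: list[str]) -> tuple[str, int]:
    """Return the closest original sentence and the DIFF(O, T) flag (0 or 1)."""
    closest = min(originals, key=lambda o: levenshtein(o, transcription))
    return closest, int(levenshtein(closest, transcription) > 0)


originals = [
    "slighter faults of substance are numerous",
    "i told him to make the charge at once",
]
print(map_to_original("slighter fault substance are numerous", originals))
```

Only transcriptions whose flag is 1 would move on to the comparison and mark-up step.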
Methodology: Transcription Mapping
• Levenshtein example:
– Original sentence (O):
• slighter faults of substance are numerous
– Transcriptions (T):
• slighter fault substance are numerous
– 90.20%
• slighter faults of substances are numerous
– 97.60%
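The slides do not state how the percentages are derived. One plausible reading, shown here purely as an assumption, is the edit distance normalised by the length of the longer string; to one decimal place this gives 90.2 and 97.6 for the two transcriptions, in line with the figures above. The function names are hypothetical.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed one row at a time."""
    row = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        diagonal, row[0] = row[0], i
        for j, cb in enumerate(b, start=1):
            diagonal, row[j] = row[j], min(
                row[j] + 1,             # deletion
                row[j - 1] + 1,         # insertion
                diagonal + (ca != cb),  # substitution or match
            )
    return row[-1]


def similarity(original: str, transcription: str) -> float:
    """Assumed normalisation: 100 * (1 - distance / length of the longer string)."""
    longest = max(len(original), len(transcription))
    if longest == 0:
        return 100.0
    return 100.0 * (1 - edit_distance(original, transcription) / longest)


O = "slighter faults of substance are numerous"
for T in ("slighter fault substance are numerous",
          "slighter faults of substances are numerous"):
    print(f"{similarity(O, T):.1f}%  {T}")
```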
Methodology: Sentences and Mark-up
• String comparison algorithm developed by Brad Wood (2008):
– Based on finding the Longest Common String (LCS)
– Windowing compares the strings at character level over a maximum search distance
– Differences found are annotated with HTML
• Repeat after swapping string 1 with string 2
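The idea of annotating character-level differences with HTML can be illustrated in Python. The sketch below does not reproduce Brad Wood's windowed algorithm; it substitutes the standard library's `difflib.SequenceMatcher`, and the function name `mark_up` and the `diff` CSS class are assumptions.

```python
from difflib import SequenceMatcher
from html import escape


def mark_up(text: str, other: str, css_class: str = "diff") -> str:
    """Wrap the characters of `text` that have no counterpart in `other`."""
    matcher = SequenceMatcher(None, text, other, autojunk=False)
    parts = []
    for tag, i1, i2, _j1, _j2 in matcher.get_opcodes():
        chunk = escape(text[i1:i2])
        if tag == "equal":
            parts.append(chunk)
        elif chunk:  # 'replace' or 'delete'; 'insert' contributes nothing from `text`
            parts.append(f'<span class="{css_class}">{chunk}</span>')
    return "".join(parts)


original = "i told him to make the charge at once"
transcription = "i told him to make the change at once"

# Run once in each direction, mirroring the "swap string 1 with string 2" step.
print(mark_up(transcription, original))
print(mark_up(original, transcription))
```

Running the comparison in both directions highlights what is extra in each string, so an insertion in one sentence shows up as a marked span rather than as nothing.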
Methodology: Manual Verification
• If DIFF(O, T) = 1
– The spoken utterance (U) is compared to the original sentence (O)
– If DIFF(O, U) = 1 AND DIFF(T, U) = 0, then
• U = T (no change is needed)
– If DIFF(O, U) = 1 AND DIFF(T, U) = 1, then
• The transcription is incorrect and needs to be checked manually
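The decision rule can be written out as a small Python function. In practice U is only known to a person listening to the recording, so the `utterance` argument stands for what the verifier heard, typed in as text; that input, the function names, and the returned labels are all hypothetical. The case DIFF(O, U) = 0 is not spelled out in the rule above and is treated here, as an assumption, as an incorrect transcription, since T then necessarily differs from U.

```python
def diff(a: str, b: str) -> int:
    """DIFF from the slides: 0 if the cleaned strings are identical, else 1."""
    return int(a != b)


def verification_outcome(original: str, transcription: str, utterance: str) -> str:
    """Classify a transcription once the verifier has listened to the recording."""
    if diff(original, transcription) == 0:
        return "not flagged"            # nothing to verify in the current system
    if diff(transcription, utterance) == 0:
        return "transcription correct"  # the speaker deviated and T captured it
    return "check manually"             # T does not match what was said


print(verification_outcome(
    "i told him to make the charge at once",   # O
    "i told him to make the change at once",   # T
    "i told him to make the change at once",   # U, as heard by the verifier
))
```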
Methodology: Manual Verification
• Transcribed correctly
– Original sentence (O):
• “I told him to make the charge at once.”
– Spoken utterance (U):
• “I told him to make the change at once.”
– Transcriptions (T):
• “I told him to make the cha<n>ge at once.”
Methodology: Manual Verification
• Transcribed incorrectly
– Original sentence (O):
• “a heavy word intervened, between...”
– Spoken utterance (U):
• “a heavy word intervened, between...”
– Transcriptions (T):
• “a heavy wo<o>d intervened, between...”
Results
Language      Differences found   Actual errors
Afrikaans     776                 152
English       1143                337
isiNdebele    958                 291
isiXhosa      1484                1081
isiZulu       1854                1228
Sepedi        1596                736
Sesotho       739                 261
Setswana      1479                828
Siswati       1558                351
Tshivenda     814                 191
Xitsonga      1586                456
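The share of flagged differences that turned out to be actual errors follows directly from the table. The short Python calculation below uses only the figures above; the variable names and the idea of reporting a percentage per language are illustrative and not taken from the slides.

```python
# (differences found, actual errors) per language, copied from the results table.
results = {
    "Afrikaans": (776, 152),
    "English": (1143, 337),
    "isiNdebele": (958, 291),
    "isiXhosa": (1484, 1081),
    "isiZulu": (1854, 1228),
    "Sepedi": (1596, 736),
    "Sesotho": (739, 261),
    "Setswana": (1479, 828),
    "Siswati": (1558, 351),
    "Tshivenda": (814, 191),
    "Xitsonga": (1586, 456),
}

for language, (found, errors) in results.items():
    print(f"{language:<11} {100 * errors / found:5.1f}% of differences were errors")

total_found = sum(found for found, _ in results.values())
total_errors = sum(errors for _, errors in results.values())
print(f"Overall     {100 * total_errors / total_found:5.1f}% "
      f"({total_errors} of {total_found})")
```

Across all eleven languages this works out to 5912 actual errors among 13987 flagged differences, roughly 42%.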
Summary
Future Work
Questions?
Conclusion: Summary
• We introduced a method for identifying differences in ASR data
• The overall quality of the transcriptions was improved
• The Lwazi project had an average transcription accuracy of 98%.
Conclusion: Summary
• Even with inexperienced transcribers, high accuracy is still possible
• Provides employment opportunities to people who have few linguistic skills but a basic knowledge of their language
• Empowers people to learn skills that may be invaluable in future projects
Conclusion: Future Work
• If DIFF(O, T) = 0 AND DIFF(O, U) = 1
– This indicates that DIFF(T, U) = 1
– The current system considered only DIFF(T, U) = 0, as the specifications required it
– Checking this case means that one can also check the reader’s performance
– Future work will include this statement
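The proposed extra check can be expressed as a single condition. The Python sketch below is only an illustration: the function names are hypothetical, and `utterance` again stands for what a verifier heard in the recording.

```python
def diff(a: str, b: str) -> int:
    """DIFF from the slides: 0 if the cleaned strings are identical, else 1."""
    return int(a != b)


def reader_deviated_unnoticed(original: str, transcription: str, utterance: str) -> bool:
    """True when T equals O although the speaker said something else."""
    return diff(original, transcription) == 0 and diff(original, utterance) == 1


O = "i told him to make the charge at once"
T = "i told him to make the charge at once"   # transcriber copied the prompt
U = "i told him to make the change at once"   # what was actually said

if reader_deviated_unnoticed(O, T, U):
    # Since T is identical to O and O differs from U, T must differ from U.
    assert diff(T, U) == 1
    print("misreading missed by the transcriber")
```

Because these transcriptions are identical to the original sentence, the current DIFF(O, T) filter would never flag them, which is why finding them requires listening to recordings that were not marked as different.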
Conclusion: Questions?