Automatic English Text Correction
Tatiana Al-Chueyr Martins
@tati_alchueyr
Bratislava, 12 March 2016
PyCon SK 2016
tati.__doc__
● Brazilian
● Lives in London (United Kingdom)
● Pythonista and Open Source activist
● Computer Engineer from Unicamp (Brazil)
● Has been developing software since 2002
● Works at EF (Education First)
  ○ Backend & DevOps leader of CTX Team
help(EF)
● EF: Education First
● International education company
  ○ Language training
  ○ Educational travel
  ○ Academic degree level
● Founded in 1955 in Sweden by Bertil Hult
● ~ 40,000 staff
● ~ 500 offices and schools in more than 50 countries (including Slovakia ;))
● Privately held by the Hult family
help(EF.CTX)
● Classroom Technology Experience
● Teaching and learning applications (Web & Mobile)
● Authoring platform
CTX.__team__
● CTX Team
● Team travel
● Malta, November 2015
CTX.backend
● Rafael Cunha de Almeida
● and I
● trying to master Italian cooking
● London, February 2016
● Although I’m presenting this project alone, Rafa has contributed to it as much as I have :)
objective
● Present a challenge
● Introduce a useful dataset
● Introduce a bunch of Python scripts
● Collect ideas
● Collaboratively build good-quality open source tools that can help deal with this challenge
the challenge
The challenge
Assessing (evaluating) students’ activities & exercises can be:
● Laborious
● Repetitive
● Slow
● In other words... painful!
https://classteaching.files.wordpress.com/2013/10/marking-pile.gif
English Text Correction
hi ,
my name is crystal.im nine years old. im form china,im live in jiang xi xing yu .
there are tow people in my family: my mother, my father.
my mother is thirty-six years old, my father is thirty-seven years old
EFCamDat - C219811
capitalization
spelling
verb tense
There are “only” 89 writings left to assess this week...
The challenge
Implement algorithms and tools that can help (teachers) assess essays written in English
Example of a feature available in several applications (including LibreOffice, Google Apps, MS Word):
● Highlight (potential) mistakes while the user types in a text area
The challenge
● Input:
  ○ English text
● Output:
  ○ List of items containing:
    ■ Position in text
    ■ Kind of potential mistake (eg. preposition, punctuation, article, spelling, etc)
    ■ Proposal of correction
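One possible shape for an output item is sketched below. The class name `Mistake`, its field names and the helper `flagged_text` are hypothetical and chosen for illustration only; the slides do not specify the actual structure used by the scripts.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Mistake:
    """One potential mistake found in a text (hypothetical structure)."""
    start: int                 # index of the first character of the flagged span
    end: int                   # index one past the last character of the span
    kind: str                  # eg. "spelling", "article", "punctuation"
    suggestion: Optional[str]  # proposed correction, or None if there is none


def flagged_text(text: str, mistake: Mistake) -> str:
    """Return the slice of the text that the mistake points at."""
    return text[mistake.start:mistake.end]


# Illustrative input and output, based on a fragment of the learner essay.
text = "im form china"
mistakes: List[Mistake] = [
    Mistake(start=0, end=2, kind="spelling", suggestion="I'm"),
    Mistake(start=3, end=7, kind="spelling", suggestion="from"),
]

for m in mistakes:
    print(m.kind, repr(flagged_text(text, m)), "->", m.suggestion)
```

Storing character offsets rather than word indexes keeps the items usable by a text area that highlights spans, at the cost of having to recompute them when the text changes.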
the dataset
The dataset
EFCamDAT
● 551,036 written essays
  ○ 2,897,788 sentences
  ○ 32,980,407 word tokens
● by 85,864 learners
● 16 levels of proficiency
● 172 nationalities
● annotated with corrections by English teachers
The dataset
Examples of essay topics
● Introducing yourself by email
● Writing an online profile
● Describing your favourite day
● Telling someone what you’re doing
● Replying to a new penpal
● Writing about what you do
● Writing a resume
● Giving instructions to play a game
● Reviewing a song for a website
● Writing an apology email
● Writing a movie review
● Turning down an invitation
● Giving advice about budgeting
● Covering a news story
● Researching a legendary creature
The dataset
Examples of learners’ nationalities
● 36.9% Brazilians
● 18.7% Chinese
● 8.5% Russians
● 7.9% Mexicans
● 5.6% Germans
● 4.3% French
● ...
The dataset
EFCamDAT
● EF-Cambridge Open Language Database
● Partnership between:
  ○ University of Cambridge (Department of Theoretical and Applied Linguistics)
    ■ EF-Research Unit
  ○ EF Education First
● Data collected from Englishtown
  ○ EF learning environment (online English school)
The dataset
Types of mistakes annotated
● x >> y: change from x to y
● AG: agreement
● AR: article
● CO: combine sentences
● C: capitalization
● D: delete
● EX: expression or idiom
● HL: highlight
● I(x): insert x
● MW: missing word
● NS: new sentence
● NSW: no such word
● PH: phraseology
● PL: plural
● PO: possessive
● PR: preposition
● PS: part of speech
● PU: punctuation
● SI: singular
● SP: spelling
● VT: verb tense
● WC: word choice
● WO: word order
The dataset
● 10 most common mistakes
The dataset
How to get it?
● https://corpus.mml.cam.ac.uk/efcamdat1/access.php
Licence:
● Non-commercial research use
● Commercial use subject to agreement
● https://corpus.mml.cam.ac.uk/efcamdat1/EFCamDAT-USERAGREEMENT.pdf
The dataset
Once you’ve registered
● It is possible to filter the dataset
● It is possible to export the dataset to an XML file
a bunch of Python scripts
A bunch of (Python) scripts
Disclaimer
The code was developed using the Extreme Go Horse Methodology during hack days
The scripts are a proof of concept (POC) and lack:
- Proper automated tests
- Proper code design & API
- Documentation
https://gist.github.com/banaslee/4147370
A bunch of (Python) scripts
What do they do?
1. Fix the XML files
2. Convert the XML files into good-looking JSON files
3. Implement heuristics to identify some common English mistakes
  ○ For now: spelling, capitalization and articles
4. Analyse how efficient the algorithm was
A bunch of (Python) scripts
How to download them?
● https://github.com/ef-ctx/righter
Licence
● Apache version 2.0
Hands on
Mistakes identification
We wrote functions that apply heuristics and rules to detect mistakes related to:
1. Spelling
2. Capitalization
3. Articles
Efficiency
To check their efficiency, we:
● Wrote a few unit tests
● Evaluated, before committing any change, how close we got to the teacher’s annotations, using:
  ■ Precision
  ■ Recall
  ■ F-score
● Printed a side-by-side comparison of what the teacher annotated and what the algorithm identified
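A minimal sketch of such an evaluation for a single essay follows. The function name `precision_recall_f1`, the representation of annotations as `(position, kind)` tuples, and the choice to return 0.0 when a set is empty are all assumptions made for illustration; the actual scripts may compute these differently.

```python
def precision_recall_f1(expected, predicted):
    """Compare teacher annotations with algorithm output.

    Both arguments are iterables of hashable items, for example
    (position, kind) tuples. Returns (precision, recall, f1).
    """
    expected, predicted = set(expected), set(predicted)
    true_positives = len(expected & predicted)
    # Assumption: an empty set yields 0.0 instead of raising ZeroDivisionError.
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(expected) if expected else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Illustrative data: the teacher marked three mistakes, the algorithm
# found two of them plus one false positive.
teacher = {(0, "spelling"), (3, "spelling"), (14, "capitalization")}
algorithm = {(0, "spelling"), (3, "spelling"), (20, "spelling")}

precision, recall, f1 = precision_recall_f1(teacher, algorithm)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

Averaging these per-essay values over the corpus would give summary figures of the kind reported as mean and standard deviation in the results slides.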
Efficiency
https://en.wikipedia.org/wiki/Precision_and_recall
Efficiency
F-Score
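For reference, with TP, FP and FN counting true positives, false positives and false negatives, the standard definitions are given below. Reading "F-score" as the balanced F1 variant is an assumption; the text of the slides does not say which variant was used.

```latex
\[
P = \frac{TP}{TP + FP}, \qquad
R = \frac{TP}{TP + FN}, \qquad
F_1 = \frac{2\,P\,R}{P + R}
\]
```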
Spelling
Spelling: heuristics
1. Remove unicode symbols (eg. —)
2. Transform diacritics (eg. é -> e)
  ○ This is particularly important for names
3. Remove punctuation (eg. !, ?, .)
4. Check if word:
  ○ Is inside dictionary (case insensitive)
  ○ Has digits
  ○ Is inside names file (created with domain specific names; eg. Englishtown)
5. If none of these is true, then the word is probably misspelled
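The five steps above can be sketched roughly as follows. This is not the code from the repository: the tiny `DICTIONARY` and `NAMES` sets stand in for the real word list and names file, the function names are hypothetical, and punctuation is only stripped from the edges of each token, which is a simplification.

```python
import string
import unicodedata

# Tiny stand-ins for the real dictionary and names file (assumptions).
DICTIONARY = {"my", "name", "is", "from", "china", "there", "are", "two", "people"}
NAMES = {"englishtown", "crystal"}


def normalize(word):
    """Steps 1-3: drop unicode symbols, fold diacritics, strip punctuation."""
    # Decompose accented characters into base letter + combining mark ...
    decomposed = unicodedata.normalize("NFKD", word)
    # ... then drop everything that is not plain ASCII (marks, dashes, symbols).
    ascii_only = decomposed.encode("ascii", "ignore").decode("ascii")
    # Simplification: only leading and trailing punctuation is removed.
    return ascii_only.strip(string.punctuation)


def is_probably_misspelled(word):
    """Steps 4-5: a word is suspicious if no check accepts it."""
    cleaned = normalize(word)
    if not cleaned:
        return False  # nothing left to check, eg. a lone symbol
    if any(ch.isdigit() for ch in cleaned):
        return False  # words with digits are not treated as misspellings
    lowered = cleaned.lower()
    return lowered not in DICTIONARY and lowered not in NAMES


def find_misspellings(text):
    """Return the whitespace-separated tokens that look misspelled."""
    return [token for token in text.split() if is_probably_misspelled(token)]


print(find_misspellings("there are tow people"))
```

A dictionary lookup like this only catches non-words: "tow" is flagged here solely because the toy dictionary omits it, and with a full English word list it would pass unnoticed, as would "form" written for "from".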
Spelling: results
Summary:
● total essays: 85,629
● mean precision: 0.7128 (std: 0.3580)
● mean recall: 0.6535 (std: 0.4212)
Spelling: precision and recall per learner level
Spelling: F-score per nationality
Capitalization
Capitalization: heuristics
1. Check if word starts a sentence
  ○ Split on punctuation (!, ., ?, etc)
2. Check if word is a known capital word
  ○ First person (I)
  ○ Day of the week
  ○ Month
  ○ Language (eg. English, Spanish, French, etc)
  ○ Country
  ○ Names (selected from corpus to match context-specific names)
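A rough sketch of these two checks is shown below. The `ALWAYS_CAPITALIZED` set is a small illustrative sample rather than the lists used by the scripts, and the function name and the `(position, word)` return format are assumptions.

```python
import re

# Small illustrative sample; real lists would be much longer (assumption).
ALWAYS_CAPITALIZED = {
    "i", "monday", "sunday", "january", "march",
    "english", "spanish", "french", "china", "brazil", "slovakia",
}


def find_capitalization_mistakes(text):
    """Return (position, word) pairs for words that should start with a capital."""
    mistakes = []
    sentence_start = True
    # Tokens are either words or sentence-ending punctuation marks.
    for match in re.finditer(r"[A-Za-z][A-Za-z']*|[.!?]", text):
        token = match.group()
        if token in ".!?":
            sentence_start = True  # the next word opens a new sentence
            continue
        needs_capital = sentence_start or token.lower() in ALWAYS_CAPITALIZED
        if needs_capital and not token[0].isupper():
            mistakes.append((match.start(), token))
        sentence_start = False
    return mistakes


print(find_capitalization_mistakes("hi. my name is Ana and i am from china."))
# prints [(0, 'hi'), (4, 'my'), (23, 'i'), (33, 'china')]
```

Because the second check is a plain list lookup, any proper noun missing from the list goes undetected, which is one reason the next steps mention POS-tagging for names of people, organizations and places.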
Capitalization: results
Summary:
● total essays: 76,980
● mean precision: 0.5714 (std: 0.4005)
● mean recall: 0.5550 (std: 0.4472)
Capitalization: precision and recall per learner level
Capitalization: F-score per nationality
Articles
Articles: heuristics
1. Check for "a" used before words starting with a vowel
2. Check for "an" used before words starting with a consonant
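The two rules can be sketched with a single regular expression, as below. The function name and the returned tuple format are hypothetical, and the test is purely letter-based, so it is a sketch of the heuristic as stated rather than of correct English usage.

```python
import re

VOWELS = "aeiou"


def find_article_mistakes(text):
    """Return (position, article, next_word) for suspicious a/an choices."""
    mistakes = []
    pattern = r"\b(a|an)\s+([A-Za-z]+)"
    for match in re.finditer(pattern, text, flags=re.IGNORECASE):
        article, next_word = match.group(1), match.group(2)
        starts_with_vowel = next_word[0].lower() in VOWELS
        expected = "an" if starts_with_vowel else "a"
        if article.lower() != expected:
            mistakes.append((match.start(1), article, next_word))
    return mistakes


print(find_article_mistakes("I ate a apple and an banana in a hour"))
# prints [(6, 'a', 'apple'), (18, 'an', 'banana')]
```

Note that "a hour" is not flagged: the choice between "a" and "an" depends on the sound that follows, not the letter, so words like "hour" or "university" need an exception list or pronunciation data. Missing or superfluous articles are not covered at all by these two rules.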
Articles: results
Summary:
● total items: 47,054
● average precision: 0.9724 (std: 0.1602)
● average recall: 0.0718 (std: 0.2463)
Articles: results
Summary:
● total essays: 76,980
● mean precision: 0.5714 (std: 0.4005)
● mean recall: 0.5550 (std: 0.4472)
Articles: precision and recall per learner level
Articles: F-score per nationality
Overview
Mistakes identification per learner level
● Efficiency of current heuristics
ideas
Next steps
● Clean up code
● Spelling
  ○ Use probabilistic models
    ■ http://norvig.com/spell-correct.html
● Capitalization
  ○ POS-tagging to identify names of people, organizations, places
● Articles
  ○ POS-tagging
  ○ Deal with plurals
  ○ Define heuristics for dealing with definite articles (the)
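To give an idea of the probabilistic approach to spelling, here is a simplified sketch in the spirit of the article linked above. It only looks at candidates one edit away and uses made-up word counts in `WORD_COUNTS`; a real model would count words in a large corpus and consider more distant candidates.

```python
from collections import Counter

# Toy word frequencies (made up for illustration).
WORD_COUNTS = Counter({"from": 50, "form": 10, "two": 30, "the": 100})
LETTERS = "abcdefghijklmnopqrstuvwxyz"


def edits1(word):
    """All strings one delete, transpose, replace or insert away from word."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [left + right[1:] for left, right in splits if right]
    transposes = [left + right[1] + right[0] + right[2:]
                  for left, right in splits if len(right) > 1]
    replaces = [left + c + right[1:]
                for left, right in splits if right for c in LETTERS]
    inserts = [left + c + right for left, right in splits for c in LETTERS]
    return set(deletes + transposes + replaces + inserts)


def correction(word):
    """Pick the most frequent known word among the one-edit neighbours."""
    if word in WORD_COUNTS:
        return word  # known words are left alone
    candidates = [w for w in edits1(word) if w in WORD_COUNTS] or [word]
    return max(candidates, key=lambda w: WORD_COUNTS[w])


print(correction("frm"), correction("teh"))
```

Here "frm" has two known neighbours, "from" and "form", and the higher count decides in favour of "from". Unlike the dictionary heuristic, this also yields a proposed correction, which is one of the outputs the challenge asks for.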
Next steps
● Add to the user interface of EF Class
● Collect feedback from end users (teachers)
● Algorithm for proposing the correct forms
● Deal with the other kinds of mistakes
● Implement a classifier using NLP (natural language processing), so that end users can tell us whether the suggestions are good or not, and we can learn from them
Ideas
●
PyCon SK is not over...
Join the Coding Dojo tomorrow! (13/03)
● Sunday (13/03)
● 9:00 - 12:00
● Organizer:
  ○ Rodolfo Carvalho
http://codingdojo.org/cgi-bin/index.pl?WhatIsCodingDojo
https://www.youtube.com/watch?v=vqnwQ3oVM1M