Automatic English Text Correction

Tatiana Al-Chueyr Martins (@tati_alchueyr)

Bratislava, 12 March 2016

PyCon SK 2016

tati.__doc__

● Brazilian
● Lives in London (United Kingdom)
● Pythonista and open source activist
● Computer engineer, graduated from Unicamp (Brazil)
● Has been developing software since 2002
● Works at EF (Education First)
  ○ Backend & DevOps leader of the CTX Team

help(EF)

● EF: Education First
● International education company
  ○ Language training
  ○ Educational travel
  ○ Academic degree programmes
● Founded in 1965 in Sweden by Bertil Hult
● ~ 40,000 staff
● ~ 500 offices and schools in more than 50 countries (including Slovakia ;))
● Privately held by the Hult family

help(EF.CTX)

● Classroom Technology Experience
● Teaching and learning applications (Web & Mobile)
● Authoring platform

CTX.__team__

● CTX Team
● Team travel
● Malta, November 2015

CTX.backend

● Rafael Cunha de Almeida and I, trying to master Italian cuisine
● London, February 2016
● Although I’m presenting this project alone, Rafa has contributed to it as much as I have :)

objective

● Present a challenge
● Introduce a useful dataset
● Introduce a bunch of Python scripts
● Collect ideas
● Collaboratively build good-quality open source tools that can help deal with this challenge

The challenge

Assessing (evaluating) students’ activities & exercises can be:

● Laborious
● Repetitive
● Slow
● In other words... painful!

https://classteaching.files.wordpress.com/2013/10/marking-pile.gif

English Text Correction

hi ,

my name is crystal.im nine years old. im form china,im live in jiang xi xing yu .

there are tow people in my family: my mother, my father.

my mother is thirty-six years old, my father is thirty-seven years old

(EFCamDAT - C219811)

Mistakes highlighted:

● capitalization
● spelling
● verb tense

There are “only” 89 writings left to assess this week...

The challenge

Implement algorithms and tools that can help teachers assess essays written in English

Example of a feature available in several applications (including LibreOffice, Google Apps, MS Word):

● Highlight (potential) mistakes while the user types in a text area

The challenge

● Input:
  ○ English text
● Output:
  ○ List of items containing:
    ■ Position in text
    ■ Kind of potential mistake (e.g. preposition, punctuation, article, spelling)
    ■ Proposed correction
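
As a rough sketch of the output described above, each item could be modelled as a small record. The class name `Mistake`, its field names and the toy `find_mistakes` detector are illustrative assumptions, not the API of the scripts presented later in the talk.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Mistake:
    """One potential mistake found in a text (hypothetical structure)."""
    start: int                 # index of the first character of the span
    end: int                   # index one past the last character of the span
    kind: str                  # e.g. "spelling", "article", "punctuation"
    suggestion: Optional[str]  # proposed correction, if any


def find_mistakes(text: str) -> List[Mistake]:
    """Toy detector: flags only the standalone lowercase pronoun 'i'."""
    mistakes = []
    for index, char in enumerate(text):
        before = text[index - 1] if index > 0 else " "
        after = text[index + 1] if index + 1 < len(text) else " "
        # A lone "i" has no letter or digit on either side.
        if char == "i" and not before.isalnum() and not after.isalnum():
            mistakes.append(Mistake(index, index + 1, "capitalization", "I"))
    return mistakes


print(find_mistakes("my mother and i live here"))
```

A real detector would return one such item per kind of mistake it knows about; the point here is only the shape of the result (position, kind, proposal).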

The dataset

EFCamDAT

● 551,036 written essays
  ○ 2,897,788 sentences
  ○ 32,980,407 word tokens
● by 85,864 learners
● 16 levels of proficiency
● 172 nationalities
● annotated with corrections by English teachers

The dataset

Examples of essay topics

● Introducing yourself by email
● Writing an online profile
● Describing your favourite day
● Telling someone what you’re doing
● Replying to a new penpal
● Writing about what you do
● Writing a resume
● Giving instructions to play a game
● Reviewing a song for a website
● Writing an apology email
● Writing a movie review
● Turning down an invitation
● Giving advice about budgeting
● Covering a news story
● Researching a legendary creature

The dataset

Examples of learners’ nationalities

● 36.9% Brazilians
● 18.7% Chinese
● 8.5% Russians
● 7.9% Mexicans
● 5.6% Germans
● 4.3% French
● ...

The dataset

EFCamDAT

● EF-Cambridge Open Language Database
● Partnership between:
  ○ University of Cambridge (Department of Theoretical and Applied Linguistics)
    ■ EF-Research Unit
  ○ EF Education First
● Data collected from Englishtown
  ○ EF learning environment (online English school)

The dataset

Types of mistakes annotated

● x >> y: change from x to y
● AG: agreement
● AR: article
● CO: combine sentences
● C: capitalization
● D: delete
● EX: expression of idiom
● HL: highlight
● I(x): insert x
● MW: missing word
● NS: new sentence
● NSW: no such word
● PH: phraseology
● PL: plural
● PO: possessive
● PR: preposition
● PS: part of speech
● PU: punctuation
● SI: singular
● SP: spelling
● VT: verb tense
● WC: word choice
● WO: word order

The dataset

● 10 most common mistakes

The dataset

How to get it?

● https://corpus.mml.cam.ac.uk/efcamdat1/access.php

Licence:

● Non-commercial research use
● Commercial use only under a separate agreement
● https://corpus.mml.cam.ac.uk/efcamdat1/EFCamDAT-USERAGREEMENT.pdf

The dataset

Once you’ve registered, it is possible to:

● Filter the dataset
● Export the dataset to an XML file

A bunch of (Python) scripts

Disclaimer

Code developed using the Extreme Go Horse Methodology during Hackday moments

They are a POC and lack:

- Proper automated tests
- Proper code design & API
- Documentation

https://gist.github.com/banaslee/4147370

A bunch of (Python) scripts

What do they do?

1. Fix the XML files
2. Convert the XML files into good-looking JSON files
3. Implement heuristics to identify some common English mistakes
  ○ For now: spelling, capitalization and articles
4. Analyse how effective the heuristics are
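
To give a feel for step 2 (XML to JSON), here is a minimal sketch using only the standard library. The sample document and every element and attribute name in it (`writings`, `writing`, `learner`, `nationality`, `text`) are assumptions made up for illustration; the real export format and the actual conversion scripts may look quite different.

```python
import json
import xml.etree.ElementTree as ET

# Hypothetical sample: the element and attribute names are assumptions,
# not the documented export schema.
SAMPLE_XML = """
<writings>
  <writing id="1" level="1">
    <learner nationality="cn"/>
    <text>my name is crystal.</text>
  </writing>
</writings>
"""


def writings_to_json(xml_text: str) -> str:
    """Turn every <writing> element into a flat JSON object."""
    root = ET.fromstring(xml_text)
    essays = []
    for writing in root.iter("writing"):
        learner = writing.find("learner")
        essays.append({
            "id": writing.get("id"),
            "level": writing.get("level"),
            "nationality": learner.get("nationality") if learner is not None else None,
            "text": (writing.findtext("text") or "").strip(),
        })
    # Indentation and sorted keys keep the output easy to read and diff.
    return json.dumps(essays, indent=2, sort_keys=True)


print(writings_to_json(SAMPLE_XML))
```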

A bunch of (Python) scripts

How to download them?

● https://github.com/ef-ctx/righter

Licence

● Apache version 2.0

Hands on

Mistakes identification

We wrote functions that apply heuristics and rules to detect mistakes related to:

1. Spelling
2. Capitalization
3. Articles

Efficiency

To check how well they work, we:

● Wrote a few unit tests
● Evaluated, before committing any change, how close we got to the teachers’ annotations, using:
  ○ Precision
  ○ Recall
  ○ F-score
● Printed a side-by-side comparison of what the teacher annotated and what the algorithm identified
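
A minimal sketch of such an evaluation for a single essay, assuming both the teacher's annotations and the algorithm's findings are represented as sets of `(position, kind)` pairs. That representation, the function name and the choice to return 0.0 when a set is empty are assumptions for illustration, not necessarily what the project's scripts do.

```python
from typing import Set, Tuple

Annotation = Tuple[int, str]  # (position in text, kind of mistake)


def precision_recall_fscore(
    expected: Set[Annotation], found: Set[Annotation]
) -> Tuple[float, float, float]:
    """Compare the algorithm's findings against the teacher's annotations."""
    true_positives = len(expected & found)
    # Precision: how many of the flagged items did the teacher also mark?
    precision = true_positives / len(found) if found else 0.0
    # Recall: how many of the teacher's marks did the algorithm find?
    recall = true_positives / len(expected) if expected else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    fscore = 2 * precision * recall / (precision + recall)
    return precision, recall, fscore


teacher = {(0, "capitalization"), (21, "spelling"), (40, "verb tense")}
algorithm = {(0, "capitalization"), (21, "spelling"), (55, "spelling")}
print(precision_recall_fscore(teacher, algorithm))
```

Here two of the three flagged items match the teacher, and two of the teacher's three marks were found, so precision, recall and F-score all come out at two thirds.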

Efficiency

https://en.wikipedia.org/wiki/Precision_and_recall

Efficiency

F-Score
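
For reference, the standard definitions, with TP, FP and FN standing for true positives, false positives and false negatives. Reading them here as mistakes marked by both teacher and algorithm, by the algorithm only, and by the teacher only is an interpretation of the evaluation described earlier, and the balanced F1 variant shown is assumed to be the one meant on this slide.

```latex
\text{precision} = \frac{TP}{TP + FP}, \qquad
\text{recall} = \frac{TP}{TP + FN}, \qquad
F_1 = 2 \cdot \frac{\text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}
```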

Spelling

Spelling: heuristics

1. Remove Unicode symbols (e.g. —)
2. Transform diacritics (e.g. é -> e)
  ○ This is particularly important for names
3. Remove punctuation (e.g. !, ?, .)
4. Check if the word:
  ○ Is in the dictionary (case-insensitive)
  ○ Has digits
  ○ Is in the names file (created with domain-specific names, e.g. Englishtown)
5. If none of these is true, the word is probably misspelled
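
The steps above can be sketched roughly as follows. The tiny `DICTIONARY` and `NAMES` sets, the function names and the whitespace tokenisation are placeholders chosen for illustration; the real scripts presumably load a full word list and a names file. Diacritics are folded before other non-ASCII symbols are dropped, which is an ordering chosen here so that "é" becomes "e" instead of disappearing.

```python
import string
import unicodedata
from typing import List

# Toy stand-ins for a real dictionary and a domain-specific names file.
DICTIONARY = {"i", "am", "nine", "years", "old", "cafe"}
NAMES = {"englishtown"}


def normalize(token: str) -> str:
    # Step 2: fold diacritics by decomposing and dropping combining marks.
    decomposed = unicodedata.normalize("NFKD", token)
    folded = "".join(c for c in decomposed if not unicodedata.combining(c))
    # Step 1: drop any remaining non-ASCII symbols.
    ascii_only = folded.encode("ascii", "ignore").decode("ascii")
    # Step 3: remove punctuation from both ends of the token.
    return ascii_only.strip(string.punctuation)


def is_probably_misspelled(token: str) -> bool:
    word = normalize(token)
    if not word:
        return False  # nothing left to check (token was only symbols)
    if any(c.isdigit() for c in word):
        return False  # step 4: words with digits are not checked
    lowered = word.lower()
    # Steps 4 and 5: unknown to both the dictionary and the names file.
    return lowered not in DICTIONARY and lowered not in NAMES


def misspelled_words(text: str) -> List[str]:
    return [t for t in text.split() if is_probably_misspelled(t)]


print(misspelled_words("im nine years old"))
```

One limitation worth keeping in mind: a dictionary lookup cannot catch real-word errors such as "tow" for "two" or "form" for "from" in the sample essay, since both are valid English words.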

Spelling: results

Summary:

● total essays: 85,629

● mean precision: 0.7128 (std: 0.3580)

● mean recall: 0.6535 (std: 0.4212)

Spelling: precision and recall per learner level

Spelling: F-score per nationality

Capitalization

Capitalization: heuristics

1. Check if the word starts a sentence
  ○ Split on punctuation (!, ., ?, etc.)
2. Check if the word is a known capitalized word
  ○ First person (I)
  ○ Day of the week
  ○ Month
  ○ Language (e.g. English, Spanish, French)
  ○ Country
  ○ Names (selected from the corpus to match context-specific names)
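
A minimal sketch of these two checks, assuming a simple regular-expression tokeniser and a very small `ALWAYS_CAPITALIZED` set. Both are placeholders; the lists used by the real scripts (countries, names taken from the corpus, and so on) would be far larger.

```python
import re
from typing import List, Tuple

# Toy list of words that should always start with a capital letter.
ALWAYS_CAPITALIZED = {
    "i", "monday", "sunday", "january", "march",
    "english", "spanish", "french", "china", "brazil", "slovakia",
}

SENTENCE_END = {".", "!", "?"}


def capitalization_mistakes(text: str) -> List[Tuple[int, str]]:
    """Return (position, word) for words that should start with a capital."""
    mistakes = []
    sentence_start = True
    # Tokens are either words or sentence-ending punctuation marks.
    for match in re.finditer(r"[A-Za-z][A-Za-z']*|[.!?]", text):
        token = match.group()
        if token in SENTENCE_END:
            sentence_start = True
            continue
        needs_capital = sentence_start or token.lower() in ALWAYS_CAPITALIZED
        if needs_capital and not token[0].isupper():
            mistakes.append((match.start(), token))
        sentence_start = False
    return mistakes


print(capitalization_mistakes("hi, my name is crystal. im from china."))
```

On the sample sentence this flags "hi" and "im" (sentence starts) and "china" (known country), but not "crystal", which is why a names list matters for recall.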

Capitalization: results

Summary:

● total essays: 76,980

● mean precision: 0.5714 (std: 0.4005)

● mean recall: 0.5550 (std: 0.4472)

Capitalization: precision and recall per learner level

Capitalization: F-score per nationality

Articles

Articles: heuristics

1. Check for “a” used before a word starting with a vowel

2. Check for “an” used before a word starting with a consonant
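
The two rules can be sketched with a single regular expression. The function name and the `(position, found, expected)` result shape are illustrative assumptions, and the check looks only at the first letter of the following word, as the rules above suggest.

```python
import re
from typing import List, Tuple

VOWELS = "aeiou"


def article_mistakes(text: str) -> List[Tuple[int, str, str]]:
    """Return (position, article found, article expected) for a/an mismatches."""
    mistakes = []
    # An indefinite article followed by whitespace and the next word.
    for match in re.finditer(r"\b(a|an)\s+([A-Za-z]+)", text, flags=re.IGNORECASE):
        article, word = match.group(1), match.group(2)
        expected = "an" if word[0].lower() in VOWELS else "a"
        if article.lower() != expected:
            mistakes.append((match.start(1), article, expected))
    return mistakes


print(article_mistakes("I ate a apple and an banana"))
```

Because the rule is based on letters, not sounds, a sketch like this would wrongly flag phrases such as "an hour" or "a university". It also only covers a/an confusion, so missing or superfluous articles go undetected, which may be part of the reason for the low recall reported for articles.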

Articles: results

Summary:

● total items: 47,054

● average precision: 0.9724 (std: 0.1602)

● average recall: 0.0718 (std: 0.2463)

Articles: precision and recall per learner level

Articles: F-score per nationality

Overview

Mistake identification per learner level

● Effectiveness of the current heuristics

ideas

Next steps

● Clean up code
● Spelling
  ○ Use probabilistic models
    ■ http://norvig.com/spell-correct.html
● Capitalization
  ○ POS-tagging to identify names of people, organizations, places
● Articles
  ○ POS-tagging
  ○ Deal with plurals
  ○ Define heuristics for dealing with definite articles (the)

Next steps

● Add to the user interface of EF Class
● Collect feedback from end-users (teachers)
● Algorithm for proposing the correct forms
● Deal with the other kinds of mistakes
● Implement a classifier using NLP (natural language processing), so that end-users can tell us whether the suggestions are good or not, and we can learn from their feedback

Ideas

PyCon SK is not over...

Join the Coding Dojo tomorrow! (13/03)

● Sunday (13/03)
● 9:00 - 12:00
● Organizer:
  ○ Rodolfo Carvalho

http://codingdojo.org/cgi-bin/index.pl?WhatIsCodingDojo
https://www.youtube.com/watch?v=vqnwQ3oVM1M

Questions?

Thanks :)

@tati_alchueyr