overview for the second shared task on language...

24
Overview for the Second Shared Task on Language Identification in Code- Switched Data GIOVANNI MOLINA 1 SECOND WORKSHOP ON COMPUTATIONAL APPROACHES TO CODE SWITCHING

Upload: others

Post on 10-Jul-2020

0 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Overview for the Second Shared Task on Language ...ritual.uh.edu/wp-content/uploads/2015/06/molina-EtAl.pdf · SECOND WORKSHOP ON COMPUTATIONAL APPROACHES TO CODE SWITCHING 22. Conclusion

Overview for the Second Shared Task on Language Identification in Code-Switched Data

GIOVANNI MOLINA

1SECOND WORKSHOP ON COMPUTATIONAL APPROACHES TO CODE SWITCHING

Page 2: Overview for the Second Shared Task on Language ...ritual.uh.edu/wp-content/uploads/2015/06/molina-EtAl.pdf · SECOND WORKSHOP ON COMPUTATIONAL APPROACHES TO CODE SWITCHING 22. Conclusion

ContentTask Description

Data Sets

Survey of Shared Task Systems

Results

Conclusion

Future Work

2SECOND WORKSHOP ON COMPUTATIONAL APPROACHES TO CODE SWITCHING

Page 3: Overview for the Second Shared Task on Language ...ritual.uh.edu/wp-content/uploads/2015/06/molina-EtAl.pdf · SECOND WORKSHOP ON COMPUTATIONAL APPROACHES TO CODE SWITCHING 22. Conclusion

Task DescriptionThe task consists of labeling each token/word in the input test data with one out of 8 labels:

3SECOND WORKSHOP ON COMPUTATIONAL APPROACHES TO CODE SWITCHING

Page 4: Overview for the Second Shared Task on Language ...ritual.uh.edu/wp-content/uploads/2015/06/molina-EtAl.pdf · SECOND WORKSHOP ON COMPUTATIONAL APPROACHES TO CODE SWITCHING 22. Conclusion

Data SetsData collected from Twitter.

There were two language pairs:◦ SPA-EN

◦ MSA-DA

Tweet Statistics

4SECOND WORKSHOP ON COMPUTATIONAL APPROACHES TO CODE SWITCHING

Page 5: Overview for the Second Shared Task on Language ...ritual.uh.edu/wp-content/uploads/2015/06/molina-EtAl.pdf · SECOND WORKSHOP ON COMPUTATIONAL APPROACHES TO CODE SWITCHING 22. Conclusion

SPA-ENTrain and Development from EMNLP 2014

Quality Improvements

Why?

56.6 million Hispanics in the US, 17% of the population.

38.4 million Spanish speakers. 58% of these fully bilingual.

5SECOND WORKSHOP ON COMPUTATIONAL APPROACHES TO CODE SWITCHING

Page 6: Overview for the Second Shared Task on Language ...ritual.uh.edu/wp-content/uploads/2015/06/molina-EtAl.pdf · SECOND WORKSHOP ON COMPUTATIONAL APPROACHES TO CODE SWITCHING 22. Conclusion

SPA-EN Data Collection

6

SPA-EN

Spanish

21 Users62k Tweets

Collect

CS Users

Collect

SECOND WORKSHOP ON COMPUTATIONAL APPROACHES TO CODE SWITCHING

Page 7: Overview for the Second Shared Task on Language ...ritual.uh.edu/wp-content/uploads/2015/06/molina-EtAl.pdf · SECOND WORKSHOP ON COMPUTATIONAL APPROACHES TO CODE SWITCHING 22. Conclusion

SPA-EN Data Annotation

7

21 Users62k Tweets

In-Lab Annotators

SECOND WORKSHOP ON COMPUTATIONAL APPROACHES TO CODE SWITCHING

Page 8: Overview for the Second Shared Task on Language ...ritual.uh.edu/wp-content/uploads/2015/06/molina-EtAl.pdf · SECOND WORKSHOP ON COMPUTATIONAL APPROACHES TO CODE SWITCHING 22. Conclusion

SPA-EN Test Data Statistics

8

(43.2%) (56.8%)

SECOND WORKSHOP ON COMPUTATIONAL APPROACHES TO CODE SWITCHING

Page 9: Overview for the Second Shared Task on Language ...ritual.uh.edu/wp-content/uploads/2015/06/molina-EtAl.pdf · SECOND WORKSHOP ON COMPUTATIONAL APPROACHES TO CODE SWITCHING 22. Conclusion

MSA-DATrain and Development from EMNLP 2014

Quality Improvements

Why?

Closely related.

52 million native Egyptian speakers.

24 million L2 Egyptian speakers.

9SECOND WORKSHOP ON COMPUTATIONAL APPROACHES TO CODE SWITCHING

Page 10: Overview for the Second Shared Task on Language ...ritual.uh.edu/wp-content/uploads/2015/06/molina-EtAl.pdf · SECOND WORKSHOP ON COMPUTATIONAL APPROACHES TO CODE SWITCHING 22. Conclusion

MSA-DA Data Collection

10

Collect

Egyptian Public Figure

26

Select 2014 Tweets Only

Remove URLs and re-tweets

AIDA2

SECOND WORKSHOP ON COMPUTATIONAL APPROACHES TO CODE SWITCHING

Page 11: Overview for the Second Shared Task on Language ...ritual.uh.edu/wp-content/uploads/2015/06/molina-EtAl.pdf · SECOND WORKSHOP ON COMPUTATIONAL APPROACHES TO CODE SWITCHING 22. Conclusion

MSA-DA Data Annotation

11

In-Lab Annotators

In-Lab Annotators

IAA < 90% ?10%

Overlapped Data

YES

NO

SECOND WORKSHOP ON COMPUTATIONAL APPROACHES TO CODE SWITCHING

Page 12: Overview for the Second Shared Task on Language ...ritual.uh.edu/wp-content/uploads/2015/06/molina-EtAl.pdf · SECOND WORKSHOP ON COMPUTATIONAL APPROACHES TO CODE SWITCHING 22. Conclusion

MSA-DA Test Data Statistics

12

(83%) (17%)

SECOND WORKSHOP ON COMPUTATIONAL APPROACHES TO CODE SWITCHING

Page 13: Overview for the Second Shared Task on Language ...ritual.uh.edu/wp-content/uploads/2015/06/molina-EtAl.pdf · SECOND WORKSHOP ON COMPUTATIONAL APPROACHES TO CODE SWITCHING 22. Conclusion

Shared Task Submissions9 participating teams.

9 system submissions from 9 teams for SPA-EN

5 system submissions from 4 teams for MSA-DA

13SECOND WORKSHOP ON COMPUTATIONAL APPROACHES TO CODE SWITCHING

Page 14: Overview for the Second Shared Task on Language ...ritual.uh.edu/wp-content/uploads/2015/06/molina-EtAl.pdf · SECOND WORKSHOP ON COMPUTATIONAL APPROACHES TO CODE SWITCHING 22. Conclusion

Survey of Shared Task Systems

14SECOND WORKSHOP ON COMPUTATIONAL APPROACHES TO CODE SWITCHING

Page 15: Overview for the Second Shared Task on Language ...ritual.uh.edu/wp-content/uploads/2015/06/molina-EtAl.pdf · SECOND WORKSHOP ON COMPUTATIONAL APPROACHES TO CODE SWITCHING 22. Conclusion

Tweet Level Results (Top 3)Language Pair System Weighted F-1

SPA-EN

Baseline 0.607

Jaech et al. 0.898

Samih et al. 0.90

Shirvani et al. 0.913

MSA-DA

Baseline 0.44

Jaech et al. 0.73

Al-Badrishiny and Diab 0.75

Samih et al. 0.83

16SECOND WORKSHOP ON COMPUTATIONAL APPROACHES TO CODE SWITCHING

Page 16: Overview for the Second Shared Task on Language ...ritual.uh.edu/wp-content/uploads/2015/06/molina-EtAl.pdf · SECOND WORKSHOP ON COMPUTATIONAL APPROACHES TO CODE SWITCHING 22. Conclusion

Token Level Results (Top 3)

18

Language Pair System Average F-1

SPA-EN

Baseline 0.828

Samih et al. 0.968

IIIT Hyderabad 0.969

Shirvani et al. 0.973

MSA-DA

Baseline 0.463

Al-Badrishiny and Diab-2 0.828

Al-Badrishiny and Diab-1 0.851

Samih et al. 0.876

SECOND WORKSHOP ON COMPUTATIONAL APPROACHES TO CODE SWITCHING

Page 17: Overview for the Second Shared Task on Language ...ritual.uh.edu/wp-content/uploads/2015/06/molina-EtAl.pdf · SECOND WORKSHOP ON COMPUTATIONAL APPROACHES TO CODE SWITCHING 22. Conclusion

SPA-EN Average Errors

19SECOND WORKSHOP ON COMPUTATIONAL APPROACHES TO CODE SWITCHING

Page 18: Overview for the Second Shared Task on Language ...ritual.uh.edu/wp-content/uploads/2015/06/molina-EtAl.pdf · SECOND WORKSHOP ON COMPUTATIONAL APPROACHES TO CODE SWITCHING 22. Conclusion

MSA-DA Average Errors

20SECOND WORKSHOP ON COMPUTATIONAL APPROACHES TO CODE SWITCHING

Page 19: Overview for the Second Shared Task on Language ...ritual.uh.edu/wp-content/uploads/2015/06/molina-EtAl.pdf · SECOND WORKSHOP ON COMPUTATIONAL APPROACHES TO CODE SWITCHING 22. Conclusion

EMNLP 2014 vs EMNLP 2016

21

Token Level Results

Language Pair EMNLP 2014 F-1 Range EMNLP 2016 F-1 Range

SPA-EN 0.704 – 0.940 0.603 – 0.973

MSA-DA 0.385 – 0.799 0.594 – 0.876

Tweet Level Results

Language Pair EMNLP 2014 F-1 Range EMNLP 2016 F-1 Range

SPA-EN 0.634 – 0.822 0.77 – 0.913

MSA-DA 0.196 – 0.417 0.66 – 0.83

SECOND WORKSHOP ON COMPUTATIONAL APPROACHES TO CODE SWITCHING

Page 20: Overview for the Second Shared Task on Language ...ritual.uh.edu/wp-content/uploads/2015/06/molina-EtAl.pdf · SECOND WORKSHOP ON COMPUTATIONAL APPROACHES TO CODE SWITCHING 22. Conclusion

Outstanding ChallengesSPA-EN:

Haciendo unos looks Muy cool para el shoot del Nuevo album de @username con #disquared#alexandermcqueen #johnvarvatos #nycshowrooms

Translated:

Making some very cool looks for the shoot of the new album by @username with #disquared#alexandermcqueen #johnvarvatos #nycshowrooms

MSA-DA:@username الزم اخذ رأيهرأى المفتى ورأيه استشارى لكنواخذباجماع هيئة المحكمةضرورى يكوناى حكم باالعدام

Translated:

@username Any death sentence must be confirmed unanimously by the members of the court and after consulting the mufti (i.e, an Islamic scholar who is an interpreter of Islamic law, entitled to issue fatwas). The mufti’s opinion should be considered as a consultation but it is important to consult him.

22SECOND WORKSHOP ON COMPUTATIONAL APPROACHES TO CODE SWITCHING

Page 21: Overview for the Second Shared Task on Language ...ritual.uh.edu/wp-content/uploads/2015/06/molina-EtAl.pdf · SECOND WORKSHOP ON COMPUTATIONAL APPROACHES TO CODE SWITCHING 22. Conclusion

ConclusionWe had a very successful shared task!

Overall performance was higher than previous shared task.

Influence on system design from previous shared task.

New deep learning techniques.

There is great interest in the topic!

23SECOND WORKSHOP ON COMPUTATIONAL APPROACHES TO CODE SWITCHING

Page 22: Overview for the Second Shared Task on Language ...ritual.uh.edu/wp-content/uploads/2015/06/molina-EtAl.pdf · SECOND WORKSHOP ON COMPUTATIONAL APPROACHES TO CODE SWITCHING 22. Conclusion

Future WorkAnother Shared Task!

POS Tagging

Additional Language Pairs

We want to hear your suggestions at the panel discussion!

24SECOND WORKSHOP ON COMPUTATIONAL APPROACHES TO CODE SWITCHING

Page 23: Overview for the Second Shared Task on Language ...ritual.uh.edu/wp-content/uploads/2015/06/molina-EtAl.pdf · SECOND WORKSHOP ON COMPUTATIONAL APPROACHES TO CODE SWITCHING 22. Conclusion

Thank You for participating!

Fahad AlGhamdi

Mona Diab

Salim El Awad

Pascale Fung

Mahmoud Ghoneim

Julia Hirschberg

Nicolas Rey-Villamizar

Raymund Riegl

Thamar Solorio

Victor Soto

Simon Tice

In-lab annotators

CrowdFlower contributors

25

Special thanks:

SECOND WORKSHOP ON COMPUTATIONAL APPROACHES TO CODE SWITCHING

Page 24: Overview for the Second Shared Task on Language ...ritual.uh.edu/wp-content/uploads/2015/06/molina-EtAl.pdf · SECOND WORKSHOP ON COMPUTATIONAL APPROACHES TO CODE SWITCHING 22. Conclusion

Questions?

26

Contact Information:

Giovanni [email protected]

SECOND WORKSHOP ON COMPUTATIONAL APPROACHES TO CODE SWITCHING