ACORNS: Acquisition of COmmunication and RecogNition Skills
The CareGiver corpus
Toomas Altosaar, L. ten Bosch, G. Aimetti, C. Koniaris, K. Demuynck, H. van den Heuvel

LREC 2010, 19 May 2010

Overview
  Background of the ACORNS project
  A speech corpus
    Rationale
    Design
    A few details
  Public availability

Background of the ACORNS project
  Acquisition of COmmunication and RecogNition Skills
  FP6 FET Project, 2006-2009
  www.acorns-project.org
  Aim: to investigate language acquisition by young infants
    By simulating this learning process through the design and testing of a computational model
    Focus on word discovery
    Improve ASR
  To that end, a speech corpus was created

The ACORNS corpus - rationale
  The ACORNS model takes part in a caregiver-learner interaction loop
  A corpus is required for testing various computational approaches to language learning
  Utterances in the corpus ‘simulate’ the caregiver
  The corpus balances complexity between
    Real-life recordings of caretaker utterances in noisy child-caretaker interactions (CHILDES)
    Lab-fabricated speech-like stimuli (NEWPORT)

ACORNS corpus - design (1)
  Four languages (FIN, SWE, UK, NL)
    In total 10 speakers for FIN, UK, NL
    4 speakers for SWE
  Speech from primary and secondary caregivers
  Speakers read sentences aloud
    Simple grammatical structure
    Limited number of keywords
  Two speaking styles
    Infant-directed style (IDS) and adult-directed style (ADS)

Design (2)
  Utterances across languages are highly comparable with respect to utterance length, syntactic structure and choice of keywords
    This allows a cross-linguistic comparison of computational approaches to word discovery
  Keyword selection was informed by communicative development inventories (CDIs)
    E.g. the MacArthur-Bates CDI, http://www.sci.sdsu.edu/cdi/

Examples of Y1 utterances (UK)
  Where is Miriam now?
  Do you see the shoe?
  Show me the book!
  That is the bottle
  The telephone is here
  Look, Daddy
  Here is the diaper
  That is a telephone
  Show me a shoe

Examples of Y2 utterances (UK)
  I see a green turtle
  Can you hear the red square and the airplane?
  50 keywords
  Up to 4 keywords per sentence
  Semantically free
    But inconsistencies were avoided, e.g. *Look at the big small car, *red green ball
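The consistency constraint on Y2 utterances can be illustrated with a small check. This is only a sketch under stated assumptions: the attribute table `ATTRIBUTE_CLASS` and the function `is_consistent` are hypothetical names, the four attribute words come from the starred examples, and the rule used here (no two attributes of the same class within one run of modifiers) is a guess at what "inconsistency" means, not the procedure the authors used.

```python
# Hypothetical attribute classes, taken only from the starred examples on the slide.
# The real corpus keyword list is not given here.
ATTRIBUTE_CLASS = {
    "big": "size",
    "small": "size",
    "red": "colour",
    "green": "colour",
}


def is_consistent(utterance: str) -> bool:
    """Return False if one run of modifiers holds two attributes of the same class."""
    seen = set()
    for token in utterance.lower().replace(",", " ").split():
        attr_class = ATTRIBUTE_CLASS.get(token)
        if attr_class is None:
            # Any non-attribute word ends the current run of modifiers.
            seen.clear()
        elif attr_class in seen:
            return False
        else:
            seen.add(attr_class)
    return True


for sentence in ["I see a green turtle", "Look at the big small car", "red green ball"]:
    print(sentence, "->", is_consistent(sentence))
```

Resetting the set at every non-attribute word means that two colours in separate noun phrases ("the red square and the green ball") are still accepted, which matches the idea that utterances are otherwise semantically free.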

Number of utterances

Language   Y1 (1 keyword/utt)   Y2 (multiple keywords/utt)
SWE        8000                 --
FIN        8000                 11600 (+1588)
UK         4000 (IDS only)      11600 (+1588)
NL         8000                 11600
Total      28000                34800

Totals count cross-linguistically comparable utterances.
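The per-language counts above can be checked against the stated totals with a few lines of Python. The sketch assumes that "--" means no Y2 recordings for Swedish and leaves out the (+1588) figures, whose status the slide does not explain; the names `counts` and `total` are illustrative only.

```python
# Utterance counts per language as listed on the slide.
# None stands for the "--" entry (assumed: no Y2 set for Swedish).
counts = {
    "SWE": {"Y1": 8000, "Y2": None},
    "FIN": {"Y1": 8000, "Y2": 11600},
    "UK": {"Y1": 4000, "Y2": 11600},  # Y1 is IDS only for UK
    "NL": {"Y1": 8000, "Y2": 11600},
}


def total(part: str) -> int:
    """Sum the counts for one corpus part, skipping missing entries."""
    return sum(c[part] for c in counts.values() if c[part] is not None)


y1_total = total("Y1")
y2_total = total("Y2")
print("Y1:", y1_total, "Y2:", y2_total)
```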

Format
  Each utterance is available as a single WAV file
    44.1 kHz, mono
  ... and is accompanied by an XML file, with
    Speaker information (gender)
    Speech style (IDS, ADS)
    Orthographic annotation (checked)
    Keyword(s)
    Duration
    And for FIN some more information about syntax (see paper)
  Total 12 GB
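A minimal sketch of how one utterance and its metadata might be read with the Python standard library. The XML layout below is hypothetical: the slide lists which fields exist but not the schema, so every element and attribute name (`utterance`, `speaker`, `style`, `orthography`, `keyword`, `duration`) is an assumption, as is the 16-bit sample width. A silent WAV is generated in memory so the example runs without the corpus.

```python
import io
import wave
import xml.etree.ElementTree as ET

# Hypothetical metadata file; element names are assumptions, not the corpus schema.
EXAMPLE_XML = """<utterance>
  <speaker gender="female"/>
  <style>IDS</style>
  <orthography>Do you see the shoe ?</orthography>
  <keywords><keyword>shoe</keyword></keywords>
  <duration>1.5</duration>
</utterance>"""


def parse_metadata(xml_text: str) -> dict:
    """Extract the fields named on the slide from the assumed XML layout."""
    root = ET.fromstring(xml_text)
    return {
        "gender": root.find("speaker").get("gender"),
        "style": root.findtext("style"),
        "text": root.findtext("orthography"),
        "keywords": [k.text for k in root.iter("keyword")],
        "duration_s": float(root.findtext("duration")),
    }


def make_silent_wav(seconds: float, rate: int = 44100) -> io.BytesIO:
    """Build a mono, 16-bit (assumed) WAV in memory as a stand-in for a corpus file."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as w:
        w.setnchannels(1)
        w.setsampwidth(2)
        w.setframerate(rate)
        w.writeframes(b"\x00\x00" * int(seconds * rate))
    buf.seek(0)
    return buf


def wav_info(fileobj) -> dict:
    """Read sample rate, channel count and duration from a WAV file object."""
    with wave.open(fileobj, "rb") as w:
        return {
            "rate": w.getframerate(),
            "channels": w.getnchannels(),
            "duration_s": w.getnframes() / w.getframerate(),
        }


meta = parse_metadata(EXAMPLE_XML)
info = wav_info(make_silent_wav(meta["duration_s"]))
print(meta["style"], meta["keywords"], info["rate"], info["channels"])
```

Comparing the duration in the metadata with the duration computed from the audio is a cheap sanity check when loading a corpus of many small files.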

Research purposes
  Simulation of word detection/word spotting
  Acquisition of word-like units
  Acquisition of (simple) syntax
  Across morphologically and syntactically different European languages

Public availability
  Corpus made available via ELRA
  Interested parties must contact ELRA

Conclusion
  Corpus available with cross-linguistically comparable utterances
  Speech based, IDS & ADS modes
  Utterances have lexical and syntactic structure inspired by infant-directed speech
  Primary & secondary caregivers
  Ideal for testing models of language acquisition and word detection
  Made available through ELRA
  More information at www.acorns-project.org
  Software is also available: see website