
CABank Database Guide

This guide provides documentation regarding the CABank corpora in the TalkBank database. TalkBank is an international system for the exchange of data on spoken language interactions. The majority of the corpora in TalkBank have either audio or video media linked to transcripts. All transcripts are formatted in the CHAT system and can be automatically converted to XML using the CHAT2XML convertor.

Corpora documented in this guide:

CallFriend
CallHome
CMU
DISCLAB
GulfWar
Jefferson
    WaterGate
    NB
JOC
MOVIN
Sakura
SamtaleBank
SBCSAE
SCoSE


CallFriend

Malcah Yaeger-Dror
Cognitive Science
University of [email protected]

These corpora were contributed to TalkBank by the Linguistic Data Consortium. Thanks to Mark Liberman, Steven Bird, and Chris Cieri for sharing these audio data. Transcriptions in CA-CHAT were produced by Malcah Yaeger-Dror, working with four students: Alan Beaudrie, Sarah Beaudrie, Tania Granadillo,

File     Sex  age  ed  state    calling to      calling
en_4504  M    14   10  NY       Peekskill       914292upb
en_4708  M    28   22  Toronto  Toronto         416635gbe
en_4745  M    37   16  FL       Key West        305966cmp
en_4823  M    39   18  FL       Key West        305583apj
en_4874  M    49   20  IL       Chicago         312539eka
en_4919  F    43   15  NY       Peekskill       914356xmt
en_5051  M    59   15  USA      Aspen, CO       970498yoo
en_5615  F    31   17  NC       Charolotte      704948xpc
en_5984  M    18   13  PA       Erie,PA         814862xcc
en_6015  F    19   13  MI       Detroit         313764qfu
en_6058  M    21   15  CO       CO Springs      719564mns
en_6062  F    22   14  NJ       Atlantic City   609883ogd
en_6084  M    30   17  NYC      NYC             212420mnh
en_6092  M    18   13  NY       Ithaca          607436qie
en_6093  MxF  18   12  MI       Detroit         313764eji
en_6094  M    19   14  MS       Kansas City     816543udj
en_6102  F    43   19  FL       Orlando         407451ucv
en_6110  F    43   19  FL       Orlando         407451ucv
en_6126  MxF  34   17  AZ       Phoenix         602395rbo
en_6157  MxF  24   16  PA       Harrisburgh     717228nfw
en_6172  F    19   13  MA       Boston          617352olf
en_6193  M    18   12  IL       Urbana, IL      217355mmf
en_6200  MxF  43   16  NY       UpstateNY       716625hgu
en_6202  MxF  18   12  CA       Sta Barbara     805872sld
en_6205  F    34   16  GA       Atlanta         404378shv
en_6255  FxM  27   16  MN       Rochester,mn    507534xdy
en_6372  MxF  37   17  Toronto  Toronto         416638hjf
en_6379  FxM  25   20  VA       Beltway, va     703790mhk
en_6384  MxF  22   16  PA       Allentown       610328yhr
en_6401  MxF  23   21  MA       Worchester, ma  508475nlj
en_6402  FxM  17   12  NJ       Elizabeth, nj   908238tcj
en_6428  FxM  18   14  NY       Peekskill?      914737tei
en_6451  MxF  51   15  MD       beltway, MD     301253oas
en_6476  M    21   15  MI       Detroit         313662sgw
en_6503  FxM  18   12  CA       San José        408471ulc
en_6507  MxF  21   16  CA       Freemont,CA     510623gon
en_6508  F    27   21  NC?      Durham, NC      919387tos
en_6511  mix  45   12  Toronto  Toronto         416789uea
en_6557  M    53   20  Toronto  Toronto         416651dob
en_6649  MxF  34   16  PA       Pittsburgh      412661xkc
en_6865  MxF  24   0   LA       New Orleans     504861spm

File     sex  age    ed  #1Dialect     Comments
ja_0617  F    ?      ?   kansai        baby cry
ja_0921  FxM  ?      ?   Tokyo?/st     F sometimes uses English at the beginning
ja_1367  F    ?      16  USAst         F2 has children; married to an American.
ja_1605  F    ?      ?   standard      F2 has kansai accent
ja_1612  FxM  ?      ?   standard      F2 has accent
ja_1684  F    22     16  standard      New York; university. They like dancing.
ja_1722  F    19     16  Yamanashi     F2 is 22 years old.
ja_1758  F    ?      16  standard      F1; TX. F2 lived in Canada 3yrs, now US; 26yrs old
ja_1733  FxM  19     14  standard      F has kansai accent
ja_1841  FxM  ?      16  standard      in US; M grad of 'Phila. University.
ja_2167  F    ?      16  USA           Living in US for business; F2 has two children.
ja_4044  FxM  20     15  Tokyo         F lived in New York and Seattle before.
ja_4164  M    30     16  Saitama       M1 leaves for Japan soon. M2 is married.
ja_4222  M    28     16  USA/Tohoku    M1 is working.
ja_4261  M    23     16  Tokyo/Kansai  Both of them work.
ja_4549  M    20     12  SuwaCity      M2 studying for finals.
ja_4573  M    31     18  Hiroshima     M2 is M1's cousin. M1 in Boston; M2 in San Diego.
ja_4608  M    25     19  Tokyo         M2 is a graduate student in USA.
ja_4725  M    23     15  Tokyo         spraying cockroaches in background.
ja_4905  MxF  21     14  Numazu
ja_6149  FxM  23     18  Tokyo         F1 is a student in UAA.
ja_6166  M    21     14  Yamanashi     They seem to live in Okurahama in USA.
ja_6167  M    22     14  Tokyo
ja_6186  MxF  21     14  Tokyo         F is washing dishes so there is water sound.
ja_6221  M    30     19  Kyoto
ja_6228  M    29     16  Oita
ja_6264  FxM  26     16  Kyoto
ja_6277  M    18     11  Ena
ja_6281  M
ja_6354  MxF  26     23  Tokyo
ja_6414  F    32     16  Osaka
ja_6416  MxF  27     16  Kyoto
ja_6422  F    54     `6  USA/Miyazaki  F1 lives in Idaho. F2 lives in Illinois.
ja_6434  M    36     14  Yokohama
ja_6463  MxF  17     10  Shizuoka
ja_6465  M    34     18  IbarakiPref
ja_6484  MxF  17     10  Shizuoka
ja_6490  MxF  30     16  Osaka
ja_6525  M    28     18  Tokyo
ja_6587  FxM  44     16  Yamagata
ja_6616  FxM  23     15  Tokyo
ja_6630  FxM  33     17  Osaka
ja_6632  M    20     15  Tokyo
ja_6666  F    26     16  USA/Osaka     They are friends at diff US universities.
ja_6688  F    38     18  Sapporo       fr. from work. F2 has strong dialect.
ja_6698  F    27     14  USA/Miyazaki  F2 has heavy dialect. Both work in California.
ja_6700  F    42     16  Sapporo       F2 has children.
ja_6707  F    57     12  CA/Hokkeido   Both (F1 & F2) have Hokkaido accents.
ja_6716  FxM  22     15  Tokyo
ja_6717  F    35/34  18  NY/CA/Gifu    Met in SF. F1 in New York. F2 is a teacher.
ja_6738  F    34     17  Nagasaki
ja_6739  F    53     16  Chiba         F1 h/w and F2 works at a lingual center.
ja_6742  F    31     16  Ichinomiya
ja_6759  M    53     12  Tokyo

File     Sex    age  ed  country
sp_4019  F      24   18  Peru
sp_4053  F      29   16  Colombia
sp_4057  F      32   18  Venezuela
sp_4089  F      23   19  Spain
sp_4095
sp_4096  F      24   17  Venezuela
sp_4100  M x F  23   16  Colombia
sp_4106  M
sp_4116  F      23   16  Dominican_
sp_4148  M      23       Ecuador
sp_4352  M      56   18  Peru
sp_4358  F X N  24   17  Argentina
sp_4400  F      26   12  Colombia
sp_4414  M      28   22  Colombia
sp_4422  M      41   12  Ecuador
sp_4427  F      24   17  Venezuela
sp_4435  F      51   16  Colombia
sp_4450  M x F  27   12  Colombia
sp_4462  F      17   12  Venezuela
sp_4463  M      24   20  Colombia
sp_4466  M x F  25   17  Peru
sp_4468  M      42   12  Nicaragua
sp_4492  M x F  24   19  Puerto_Rico
sp_4500  M      33   19  Colombia
sp_4524  F      26   17  Colombia
sp_5034  F      26   10  Colombia
sp_5052  M x F  23   17  Colombia
sp_5055  F x M  21   16  Peru
sp_5070  F      25   13  Dominican_
sp_5084  F      38   12  El_Salvador
sp_5112  M x F  34   20  Nicaragua
sp_5175  F      27   16  Colombia
sp_5258  F x M  24   11  Colombia
sp_5316  F x M  27   22  Colombia
sp_5340  F      20   12  Colombia
sp_5354  F      34   14  El_Salvador
sp_5361  F      23   16  Venezuela
sp_5367  M      34   6   Peru
sp_5418  F      22   16  Canada
sp_5502  F      18   11  Honduras
sp_5558  F      57   12  Colombia
sp_5582  M      23   9   Mexico
sp_5589  M      45   18  Peru
sp_5607  M      30   20  Colombia
sp_5638  M      19   14  Chile
sp_5641  M x F  27   16  Peru
sp_5650  F      32   18  Mexico
sp_5685  F x M  26   15  Colombia
sp_5704  M x F  24   17  Canada

Glossary of Spanish terms

ñángara: leftist, or someone who does not care for his appearance.
berraquera: Colombian slang for 'good'.
bicho: stuff, like vaina.
bolazo: an informal word that means boredom.
boludeces: a swearword that means unimportant things.
bravo: to be angry.
burda: Venezuelan slang for 'a lot'.
cabarullí:
cachar: from the English word catch; lo cacharon = they caught him.
cagando: person is very, very cold.
carota blanca: white bean.
catire: blonde.
catzo:
cauchos: tires.
cazaba: understood.
chabado: without luck.
chabienda: a group of friends.
chama: girl.
chamo: guy.
che: interjection used in informal conversations to call the other speaker's attention.
chevere: neat, cool, wonderful.
chiches: bells and whistles.
chimbo: Venezuelan slang for 'bad'.
chismear: to gossip, usually among girls/women.
chismes: gossip.
choto: "un choto" is a slang expression meaning "nothing".
chusma (Mex?): derogatory way to refer to lower class people.
coño: shit, cunt.
cochambroso: to think bad of someone without knowing.
comodin: from comodo, to be comfortable.
culecos: nervous indecisiveness.
culo: Venezuelan male slang for girl.
cutre:
dar paja: Venezuelan slang for 'feel bad'.
de bolas: interjection, 'of course'.
de pinga: Venezuelan slang for 'good'.
de vaina: almost didn't make it, by very little.
duracos:
embustes: lies.
empatados: to be dating.
güiro: Venezuelan slang for 'head'.
guacho: lucky person.
hinchar las pelotas: bother somebody.
jangiador: leader of a group.
jeva: Venezuelan slang for girl.
kilombo: slang word for "mess".
koala: fanny pack.
mamadera: a lot of drinking.
mamar gallo: to joke.
mameico: very easy.
manita, mana (Mex): shortened from the word "hermana".
marangos: stupid, or monkeys.
mate: traditional drink in Argentina made with "yerba".
me cago en la hostia: I don't really care.
me cago en la visagra: I don't really care.
mijita: pal, mi hijita.
mojón: lie.
morochas: twins.
murieseando: adaptation of "muriéndose".
pacotilla: worthless.
pajecito: the little boy that carries a symbolic dowry or the wedding rings.
pana: Venezuelan slang, term of address for a good friend.
pancho: comfortable, lazy.
para la gravedad de la vaina: something very grave.
parar bolas: to pay attention.
parla: the gift of good talking.
pegastes el chicle: captivate the interest of someone of the opposite sex.
pelada: girl or teenager.
puntaje: point average. The correct word is 'puntuación'.
que show?: what's new? what is going on?
que vaina: (depending on intonation) sorry,
rasca: to be drunk.
ser del otro lado: to be homosexual.
tocazo: a lot.
tomar el pelo: to joke.
tuquís: interjection used jokingly.
vacilar: to pull someone's leg, to joke.
vaina: in Venezuela it can mean "stuff".
verga: interjection, meaning depends on intonation.
verguero: Venezuelan slang for 'lots of stuff'.
vido: past tense of to see. The correct conjugation is 'visto'.
viste: a very common informal expression equivalent to "you know".
zampó (zampear): to eat.


CallHome

1. Summary abstract

The CallHome English corpus of telephone speech was collected and transcribed by the Linguistic Data Consortium primarily in support of the project on Large Vocabulary Conversational Speech Recognition (LVCSR), sponsored by the U.S. Department of Defense.

This release of the CallHome English corpus consists of 120 unscripted telephone conversations between native speakers of English. The CD-ROM distribution contains the speech data only, along with essential documentation files and software for handling the compressed speech data. The transcripts and other text data and documentation are distributed separately (typically via electronic transmission from the LDC's ftp/web server), and will be subject to periodic updates. The transcripts cover a contiguous 5 or 10 minute segment (see section 2 below) taken from a recorded conversation lasting up to 30 minutes. All speakers were aware that they were being recorded. They were given no guidelines concerning what they should talk about. Once a caller was recruited to participate, he/she was given a free choice of whom to call. Most participants called family members or close friends overseas. All calls originated in North America; 90 of the 120 calls were placed to various locations overseas, while the remaining 30 were placed within North America. The distribution of call destinations can be found in the file "spkrinfo.tbl". The transcripts are timestamped by speaker turn for alignment with the speech signal, and are provided in standard orthography.

-----------------------------------------------------------------------
2. Data acquisition

Speakers were solicited by the LDC to participate in this telephone speech collection effort via the internet, publications (advertisements), and personal contacts. A total of 200 call originators were found, each of whom placed a telephone call via a toll-free robot operator maintained by the LDC. Access to the robot operator was possible via a unique Personal Identification Number (PIN) issued by the recruiting staff at the LDC when the caller enrolled in the project. The participants were made aware that their telephone call would be recorded, as were the call recipients. The call was allowed only if both parties agreed to being recorded. Each caller was allowed to talk up to 30 minutes. Upon successful completion of the call, the caller was paid $20 (in addition to making a free long-distance telephone call). Each caller was allowed to place only one telephone call.

Although the goal of the call collection effort was to have unique speakers in all calls, a handful of repeat speakers are included in the corpus. Specific information on this can be found in the file "spkrinfo.doc".


In all, 200 calls were transcribed. Of these, 80 have been designated as training calls, 20 as development test calls, and 100 as evaluation test calls. For each of the training and development test calls, a contiguous 10-minute region was selected for transcription; for the evaluation test calls, a 5-minute region was transcribed. For the present publication, only 20 of the evaluation test calls are being released; the remaining 80 test calls are being held in reserve for future LVCSR benchmark tests.

-----------------------------------------------------------------------
3. Data verification

After a successful call was completed, a human audit of each telephone call was conducted to verify that the proper language was spoken, to check the quality of the recording, and to select and describe the region to be transcribed. The description of the transcribed region provides information about channel quality, number of speakers, their gender, and other attributes. The information from this audit may be found in the file "callinfo.tbl", and its contents are described in greater detail in "callinfo.doc".

-----------------------------------------------------------------------
4. Speaker demographics

Information on speaker demographics can be found in the file spkrinfo.tbl, whose contents are described in the file spkrinfo.doc.

-----------------------------------------------------------------------
5. Data transcription - General

All CallHome telephone conversations were transcribed using the general conventions described below. The finite set of "non-lexemes" (hesitation sounds) used in the transcripts is provided in section 6 below.

The transcription was carried out on Sun 4 workstations. The transcription was done using the emacs text editor which was linked to the visual and auditory soundwave from the telephone recording in an xwaves window. A program written at the LDC linked the xwaves signal to the emacs buffer so that a highlighted region of the soundwave could be brought into the emacs buffer as a timestamp via a simple keystroke. Similarly, the transcribers could listen to any timemarked turn in the transcript, and view the aligned soundwave as well. Thus, the transcribers had a visual as well as auditory signal that they were transcribing. Both the visual and auditory signal were broken into two separate channels that could be reviewed separately or together.

The transcribers were given the transcription conventions provided below as a guideline for transcribing the telephone conversations.

CALLHOME TRANSCRIPTION CONVENTIONS - General


What to transcribe: 10 contiguous minutes (600 seconds) from the recorded telephone conversations. This should not include the beginning of the conversation where the speakers are getting permission for being recorded.

Definition of turns: Separate turns are defined by the following criteria:

(1) speaker change, e.g.

A: Well I was thinking about that

B: I know I talked to &Jan about it yesterday

(2) within one speaker's stretch of talk, a long turn should be broken up in terms of what makes grammatical/semantic sense, e.g.

A: And I told her %um I didn't I wasn't setting you up to be a spiritual director or anything {laugh} but I did say to her that if she were to talk if she felt that she wanted to talk about her prayer experience in Spanish

A: that you would probably be able to certainly to understand her but to empathize a little bit with what she was experiencing

(3) If there is an extra-long pause within a single speaker's turn, break the turn up into two turns, e.g.

B: When we were fishing out on &Lake &Travis last August I thought I saw, %uh [[long pause]]

B: %uh, &George &Martin, but I wasn't sure it was him.

Timestamps: Each speaker turn is marked with a unique timestamp (in seconds). The timestamps mark the beginning and end time of each turn relative to the beginning of the recording. Each timestamp is precise to the 100th of a second, and is in the format: beginning time [space] ending time, followed by the turn. Some samples:

27.98 28.72 A: You know so

137.49 139.47 A: yeah {breath} (( )) [distortion]

284.54 286.79 B: %ah &Lydia &Van &Damme.
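The timestamp format described above can be read mechanically. The following is a minimal Python sketch, not the LDC's own tooling; the names `Turn`, `TURN_RE`, and `parse_turn` are hypothetical, and the pattern assumes a speaker label made of word characters followed by a colon.

```python
import re
from typing import NamedTuple, Optional


class Turn(NamedTuple):
    start: float   # seconds from the beginning of the recording
    end: float
    speaker: str
    text: str


# "begin end SPEAKER: text", with both times given to the hundredth of a second.
# The speaker-label shape (\w+) is an assumption; the samples only show A and B.
TURN_RE = re.compile(r"^(\d+\.\d{2}) (\d+\.\d{2}) (\w+): ?(.*)$")


def parse_turn(line: str) -> Optional[Turn]:
    """Return a Turn for a timestamped line, or None if the line does not match."""
    m = TURN_RE.match(line.strip())
    if m is None:
        return None
    start, end, speaker, text = m.groups()
    return Turn(float(start), float(end), speaker, text)


samples = [
    "27.98 28.72 A: You know so",
    "137.49 139.47 A: yeah {breath} (( )) [distortion]",
    "284.54 286.79 B: %ah &Lydia &Van &Damme.",
]
turns = [parse_turn(s) for s in samples]
```

Lines that lack a timestamp come back as `None`, so a caller can decide whether to skip them or report them.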

Special Conventions:

Acronyms: Acronyms pronounced like a word are written in all caps with no spaces, e.g.

AIDS    NARAL

Acronyms pronounced like the individual letters are written in all caps with spaces between the letters:

C I A    H I V    C E O

Numbers: Write all numbers out; do not use digits:

twenty-two    nineteen-ninety-five

Interjections: Use the most standard spelling (as given on the lexicon list, if it's there); don't try to represent lengthening by repeating letters (like 'ooooh').

uh-huh    mhm    uh-oh    okay    jeez

Punctuation: Transcribers are free to add any punctuation that they feel is helpful to someone reading the transcript.

Special symbols:

Noises, conversational phenomena, foreign words, etc. are marked with special symbols. In the table below, "text" represents any word or descriptive phrase.

{text} sound made by the talker

{laugh} {cough} {sneeze} {breath}

[text] sound not made by the talker (background or channel)

[distortion] [background noise] [buzz]

[/text] end of continuous or intermittent sound not made by the talker (beginning marked with previous [text])

[[text]] comment; most often used to describe unusual characteristics of immediately preceding or following speech (as opposed to separate noise event)

[[previous word lengthened]] [[speaker is singing]]

((text)) unintelligible; text is best guess at transcription


((coffee klatch))

(( )) unintelligible; can't even guess text

(( ))

<language text> speech in another language

<English going to California>

<? (( ))> ? indicates unrecognized language; (( )) indicates untranscribable speech

<? ayo canoli> <? (( ))>

-text or text- partial word

-tion absolu-

#text# simultaneous speech on the same channel (simultaneous speech on different channels is not explicitly marked, but is identifiable as such by reference to time marks)

//text// aside (talker addressing someone in background)

//quit it, I'm talking to your sister!//

+text+ mispronounced word (spell it in usual orthography)

+probably+

**text** idiosyncratic word, not in common use

**poodle-ish**

%text This symbol flags non-lexemes, which are general hesitation sounds. See the section on non-lexemes below to see a complete list for each language.

%mm %uh

&text used to mark proper names and place names

&Mary &Jones &Arizona &Harper's &Fiat &Joe's &Grill


text -- marks the end of an interrupted turn

-- text marks the continuation of the same turn after the interruption, e.g.

A: I saw &Joe yesterday coming out of --

B: You saw &Joe?!

A: -- the music store on &Seventeenth and &Chestnut.
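Several of the special symbols above are regular enough to pull out with regular expressions. The sketch below is illustrative only and covers a subset of the conventions; the function name `summarize` and the dictionary keys are hypothetical, not part of any LDC or TalkBank tool.

```python
import re

PROPER_NAME = re.compile(r"&([\w'-]+)")            # &Mary, &Harper's
NON_LEXEME = re.compile(r"%(\w+)")                 # %uh, %mm
SPEAKER_NOISE = re.compile(r"\{([^{}]*)\}")        # {laugh}
COMMENT = re.compile(r"\[\[([^\[\]]*)\]\]")        # [[long pause]]
UNINTELLIGIBLE = re.compile(r"\(\(([^()]*)\)\)")   # ((best guess)) or (( ))
# Single-bracket background sounds only: skips [[comments]] and [/end] markers.
BACKGROUND = re.compile(r"(?<!\[)\[([^\[\]/][^\[\]]*)\](?!\])")


def summarize(text: str) -> dict:
    """Collect the annotated items found in one transcribed turn."""
    return {
        "proper_names": PROPER_NAME.findall(text),
        "non_lexemes": NON_LEXEME.findall(text),
        "speaker_noises": SPEAKER_NOISE.findall(text),
        "comments": COMMENT.findall(text),
        "unintelligible": [g.strip() for g in UNINTELLIGIBLE.findall(text)],
        "background": BACKGROUND.findall(text),
    }


info = summarize(
    "%ah &Lydia &Van &Damme {laugh} ((coffee klatch)) [[long pause]] [distortion]"
)
```

An empty string in the `unintelligible` list corresponds to `(( ))`, where the transcriber could not offer a guess.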

-----------------------------------------------------------------------
6. Data transcription - Non-lexemes

For LVCSR purposes, some of the speech sounds uttered by the conversational participants were deemed to be "non-lexemes" or periodic sound sequences that are not listed as words in the pronunciation dictionary. The "non-lexemes" are distinct from the set of interjections such as "okay" and "jeez" which are considered as words in the lexicon. The "non-lexemes" can loosely be considered as hesitation sounds that a speaker makes while speaking. While the spelling of these sounds is somewhat arbitrary, the transcribers were given a finite list from which to choose in order to maintain orthographic consistency.

Below is the histogram of the token and frequency of non-lexemes occurring in the 80 training and 20 devtest transcripts.

1530 %uh
1470 %um
 310 %eh
 309 %mm
 209 %hm
 194 %ah
 166 %huh
  15 %ha
   3 %er
   2 %oof
   2 %hee
   2 %ach
   1 %eee
   1 %ew
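A frequency list of this kind can be rebuilt from any set of transcripts by counting the tokens that carry the `%` prefix. This is a small sketch under that assumption; `count_non_lexemes` is a hypothetical helper, and the three input lines are the example turns quoted earlier in this section, not corpus data.

```python
import re
from collections import Counter

NON_LEXEME = re.compile(r"%(\w+)")


def count_non_lexemes(lines):
    """Count %-prefixed hesitation tokens across transcript lines."""
    counts = Counter()
    for line in lines:
        counts.update("%" + tok for tok in NON_LEXEME.findall(line))
    return counts


lines = [
    "A: And I told her %um I didn't I wasn't setting you up",
    "B: When we were fishing I thought I saw, %uh [[long pause]]",
    "B: %uh, &George &Martin, but I wasn't sure it was him.",
]
counts = count_non_lexemes(lines)

# Print in the same "frequency token" layout as the histogram.
for token, n in counts.most_common():
    print(n, token)
```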

-----------------------------------------------------------------------
7. Quality control (QC) procedures

The transcripts were created iteratively. The first step was to transcribe and timestamp the appropriate portion of each conversation. Once this was completed, proper formatting and spelling were checked and corrected. Once this was completed, a second pass over all of the transcripts was made, where both content and formatting were checked once more. Throughout this process, small improvements were constantly made and re-checked for accuracy. In most instances, a third (or even fourth) pass was made over the transcript to verify its accuracy.

Spelling:

As the telephone conversations were being transcribed, the words found in the transcripts were being compiled for inclusion in pronunciation dictionaries also being prepared by the LDC. As the lexicon workers compiled lists of words, they checked (among other things) for spelling errors. The lists of spelling/typo errors found in the transcripts were compiled, and a program was run over the transcripts to replace a misspelled word with its correct spelling. Thus, work on the pronunciation dictionaries of the respective languages helped to double-check the proper spelling of all words in the transcripts.
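The replacement step described above amounts to a whole-word substitution driven by a list of known errors. The LDC program itself is not documented here, so the following Python sketch is only one plausible shape for it; `apply_corrections` and the sample correction list are hypothetical.

```python
import re


def apply_corrections(text: str, corrections: dict) -> str:
    """Replace each misspelled whole word with its listed correction."""
    if not corrections:
        return text
    # Longest entries first so a shorter entry never shadows a longer one.
    words = sorted(corrections, key=len, reverse=True)
    pattern = re.compile(r"\b(" + "|".join(map(re.escape, words)) + r")\b")
    return pattern.sub(lambda m: corrections[m.group(1)], text)


# Hypothetical error list; real lists came from the lexicon workers.
corrections = {"recieve": "receive", "definately": "definitely"}
fixed = apply_corrections("A: I did not recieve it, definately not", corrections)
```

Matching on word boundaries keeps the substitution from touching longer words that merely contain a listed misspelling.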

Syntax:

To check the well-formedness of the bracketing, a program was written which goes over the transcripts and notes any apparent irregularities. This program was later adapted for on-line use by the transcribers to be used while creating the transcripts. A final syntax check was run over all transcripts before the final release.
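A bracketing check of this sort can be approximated with a stack for the paired delimiters and a parity count for the symmetric ones. The sketch below is an assumption about how such a check might look, not a description of the LDC program; `bracket_problems` is a hypothetical name, and the check treats doubled brackets such as `[[ ]]` and `(( ))` simply as nested pairs.

```python
# Closing delimiter -> matching opening delimiter.
PAIRS = {"}": "{", "]": "[", ")": "(", ">": "<"}
OPENERS = set(PAIRS.values())
# Markers that open and close with the same string: #text#, //text//, **text**, +text+.
SYMMETRIC = ("#", "//", "**", "+")


def bracket_problems(text: str) -> list:
    """Return a list of apparent bracketing irregularities in one turn."""
    problems = []
    stack = []  # (opening character, position) for every unclosed opener
    for pos, ch in enumerate(text):
        if ch in OPENERS:
            stack.append((ch, pos))
        elif ch in PAIRS:
            if stack and stack[-1][0] == PAIRS[ch]:
                stack.pop()
            else:
                problems.append(f"unexpected {ch!r} at {pos}")
    problems.extend(f"unclosed {ch!r} at {pos}" for ch, pos in stack)
    for mark in SYMMETRIC:
        if text.count(mark) % 2:
            problems.append(f"odd number of {mark!r} markers")
    return problems


ok = bracket_problems("yeah {breath} (( )) [distortion]")
bad = bracket_problems("yeah {breath (( ))")
```

The parity test for symmetric markers is deliberately crude: it flags a missing closer but cannot say where it belongs.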

Timestamps:

To check the well-formedness of timestamps, a program was developed that checked for (1) overlapping timestamps, (2) start times that are greater than end times, (3) turns that are missing timestamps, (4) the proper formatting of a blank line before each timestamp, (5) proper number of digits in each timestamp, and (6) the proper marking of the speaker id. This procedure was folded into the syntax checking procedure to be used on-line by the transcribers.
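Most of these checks are easy to express in a few lines. The sketch below covers checks (1), (2), (3), (5), and (6) and leaves out the blank-line check (4). It is an illustration, not the LDC program: `check_turns` is a hypothetical name, the speaker-id pattern is an assumption, and overlap is tested only against the same speaker's previous turn, since simultaneous speech on different channels legitimately overlaps in time.

```python
import re

TIME_RE = re.compile(r"^\d+\.\d{2}$")     # hundredths of a second
SPEAKER_RE = re.compile(r"^[A-Z]\d*:$")   # assumed speaker-id shape, e.g. "A:"


def check_turns(lines):
    """Return (line number, problem) pairs for a list of turn lines."""
    problems = []
    last_end = {}  # speaker -> latest end time seen for that speaker
    for n, line in enumerate(lines, start=1):
        fields = line.split(maxsplit=3)
        if len(fields) < 3 or not (TIME_RE.match(fields[0]) and TIME_RE.match(fields[1])):
            problems.append((n, "missing or malformed timestamp"))
            continue
        start, end, speaker = float(fields[0]), float(fields[1]), fields[2]
        if not SPEAKER_RE.match(speaker):
            problems.append((n, "bad speaker id"))
            continue
        if start > end:
            problems.append((n, "start time after end time"))
        if start < last_end.get(speaker, 0.0):
            problems.append((n, "overlaps previous turn of same speaker"))
        last_end[speaker] = max(end, last_end.get(speaker, 0.0))
    return problems


lines = [
    "27.98 28.72 A: You know so",
    "28.50 29.10 B: yeah",
    "28.00 30.00 A: and then",
    "31.5 32.00 B: okay",
    "35.00 34.00 B: right",
    "A: no timestamp here",
]
problems = check_turns(lines)
```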

Content:

To check that the properly spelled and formatted transcription actually matched the spoken signal, a second human pass was made over all of the transcripts. In many instances, three or more passes were made as well.

English  Sex  Age  Ed  Place
0638     F    20   15  PA
6067     F    18   12  NJ
4838     F    18   13  NY
6079     F    19   13  NY
6100     F    19   15  LA
4092     F    22   16  PA
5788     F    22   16  TX
6479     F    22   17  OH
5352     F    22   17  PA
6107     F    23   16  NY
5273     F    23   16  OH
4432     F    23   17  NY
4886     F    24   13  PA
4624     F    24   16  MI
5931     F    24   18  PA
4490     F    24   18  SC
4844     F    25   13  OH
5777     F    25   16  MI
4887     F    25   18  DC
4365     F    25   21  WI
5573     F    26   18  MA
4913     F    26   18  NY
4660     F    27   16  MO
4157     F    28   16  MO
4926     F    30   12  NY
4077     F    30   16  WA
4861     F    30   18  IN
4628     F    30   19  WY
4576     F    31   18  IL
6467     F    31   18  OH
6625     F    32   18  FL
4145     F    32   21  CA
4245     F    32   21  NE
4248     F    32   21  PA
6348     F    33   12  IL
5388     F    33   13  NE
4595     F    33   18  NY
4610     F    33   21  NY
4315     F    34   16  MA
4564     F    34   16  VA
4927     F    34   18  NY
6071     F    35   18  FL
5254     F    35   18  NY
6047     F    36   16  WI
4571     F    36   18  NE
4431     F    36   20  IL
4065     F    36   23  MD
4459     F    37   16  CA
4325     F    38   13  VT
4580     F    38   18  NJ
5907     F    38   20  ID
5046     F    39   22  CA
4104     F    40   16  CA
5700     F    40   16  IA
4234     F    42   17  NY
5866     F    43   12  AR
5888     F    44   16  ID
4544     F    45   18  MA
4822     F    46   18  CA
5278     F    46   18  NY
4310     F    47   18  NH
4665     F    47   18  PA
4666     F    48   16  PA
4335     F    48   18  IA
5551     F    49   16  WI
4623     F    52   13  NY
6161     F    53   18  TX
6456     F    54   20  OH
6033     F    54   20  WI
4941     F    56   16  NY
4112     F    56   19  CA
5736     F    57   16  CA
6274     F    57   16  WI
4705     F    57   22  OR
5495     F    61   20  WA
5242     F    63   12  IL
6314     F    63   14  IL
6252     F    65   17  WI
5712     F    65   20  MI
5648     F    66   18  CT
6045     F    66   18  WI
4556     F    67   14  NJ
5532     F    67   16  IL
6447     F    71   20  WI
5208     F    74   16  WI
6408     F    77   12  MN
4673     F    80   17  MI
4569     F    80   17  MI
4677     M    13   7   WV
4521     M    19   13  NY
6825     M    19   14  NY
6265     M    20   15  NY
6521     M    21   15  OH
4967     M    26   7   UT
4093     M    27   16  WA
4801     M    27   17  WA
4721     M    27   18  FL
6861     M    27   19  UT
5166     M    28   18  MD
5872     M    29   15  CA
4485     M    29   21  MA
4686     M    30   18  FL
4074     M    30   18  TX
4371     M    30   20  NE
5373     M    31   25  Canada
6313     M    32   18  NY
4415     M    33   17  NC
4612     M    37   16  VA
4829     M    38   18  CT
6179     M    43   16  NY
4247     M    43   17  varied
5713     M    43   18  CT
6785     M    46   16  AR
4792     M    48   17  NY
4807     M    54   17  NJ
4184     M    54   20  NY
4702     M    55   13  KA
6298     M    74   12  WI
4808     M    76   14  MA
4629     M    8    3   IL

German   Sex  Age  Ed  Place
4002     F    -    -   -
4024     M    31   24  Pfullingen
4028     M    25   18  Berlin
4073     M    26   18  Kassel
4076     F    37   17  Krefeld
4111     M    40   23  Bensberg
4123     M    27   19  Hagen
4287     M    32   24  Buende
4308     F    37   12  Hadmersleben
4384     F    32   14  Mainz
4458     M    30   20  Bad V slac
4552     M    31   21  Freiburg
4553     M    27   20  Gengenbach
4630     F    49   16  Voerde
4684     M    24   18  Bads-Alzunge
4711     F    51   22  Nuremburg
4755     F    57   12  Bremen
4764     F    32   21  Hamburg
4765     F    46   16  Bielefeld
4777     F    56   12  Berlin
4828     M    25   16  Cologne
4857     M    34   20  Augsburg
4866     M    25   17  Beckum
4868     M    54   20  Stutgart
4896     F    23   13  Leverkusen
4921     M    26   15  Hildesheim
4940     F    27   16  Hamburg
4951     M    23   15  Stuttgart
4957     M    23   16  UNK
4965     F    29   13  Bad-Neuheim
5016     M    34   21  Guendburg
5088     F    26   14  Ommernheim
5097     F    16   10  Munich
5123     M    19   13  Berlin
5143     F    29   16  Frankfurt
5159     F    76   16  Bernburg
5161     F    27   16  Goetting
5168     M    24   16  Kahl
5206     F    74   13  Breslau
5207     F    52   12  Neunkirchen
5223     F    68   12  Stuttgart
5224     F    62   14  Berlin
5248     F    60   16  Berlin
5298     F    40   16  Frankfurt
5351     F    60   18  Berlin
5421     F    69   16  Munich
5452     F    67   12  Leipzig
5493     F    47   13  Heidelberg
5518     M    66   12  Germany
5519     M    24   15  Friederchshsen
5566     F    54   17  Frankfurt
5569     F    27   17  Salzburg
5577     F    26   18  Munich
5596     F    47   16  Zweibruecken
5626     F    65   12  Hanover
5661     M    25   19  Berlin
5681     M    59   12  Berlin
5699     F    68   16  Berlin
5776     F    69   20  Cologne
5778     F    21   12  South_A VA
5832     F    68   12  Liga
5900     F    23   16  Osnabrueck
5909     F    69   16  Stuttgart
5944     F    40   20  Switzerland
5945     F    28   20  Frankenberg
6069     F    29   21  Cleveland
6140     F    31   20  Insbrook
6144     M    60   16  Berlin
6162     M    24   18  Waldshut
6197     M    31   22  Mainz
6199     F    31   20  Frankfurt
6219     F    19   13  Wiesental
6247     M    54   14  Buchel
6248     F    41   18  Giessen
6250     M    60   14  Chemnitz
6251     F    60   12  Palastinate
6297     M    26   19  Bibirbach
6311     F    69   18  Saaz
6312     M    25   17  Berlin
6333     F    54   12  Hamburg
6349     M    54   16  Braunschweig
6350     M    37   10  Allersberg
6352     F    24   16  Coesfeld
6373     M    17   12  Munich
6386     M    25   17  Tuebingen
6388     M    24   18  Berlin
6446     F    38   16  Gottingen
6477     F    56   16  Berlin
6506     M    75   15  Karlsruhe
6517     M    28   18  Oldenburg
6518     M    26   12  Dessau
6545     F    41   18  Hamm
6623     M    57   16  Stuttgart
6639     M    26   17  Cologne
6659     M    49   17  Heidenheim
6691     M    31   23  Munich
6692     F    65   14  Stuttgart
6719     F    25   17  Berlin
6838     M    26   18  Hannover
6888     M    23   15  Stuttgart

Japanese Sex AGE Age Place
ja_0856
ja_0924 38 16
ja_0930
ja_1012 31 16
ja_1032
ja_1041
ja_1048 41
ja_1057 41 18
ja_1099
ja_1109
ja_1123
ja_1201
ja_1237 37 21
ja_1263 37 21
ja_1277 33
ja_1288 43 20
ja_1290 16
ja_1328 29 14
ja_1369 25 12
ja_1370 26 12
ja_1418
ja_1425 16
ja_1428 30 16
ja_1461 34 14
ja_1509 30 16
ja_1538 33 14
ja_1541
ja_1542 35 18
ja_1557 16 10
ja_1593 26 14
ja_1604 28 19
ja_1607
ja_1608 45 12
ja_1615 13
ja_1628 40 14
ja_1642 12
ja_1667 44 16
ja_1710 31 16
ja_1713 24 17
ja_1725 12
ja_1731 63 12
ja_1738 30 16
ja_1741 27 15
ja_1749 18
ja_1889 16
ja_1899 22 12
ja_1925 19 14
ja_1928 13
ja_1999 31 16
ja_2004 58 12
ja_2041
ja_2085 40 17
ja_2096 36 17
ja_2111 19 12
ja_2134 18 13
ja_2157 13
ja_2180 36 16
ja_2188 18
ja_2199
ja_2204 25 12
ja_2206 22 15
ja_2207 31 12
ja_2208 33 16
ja_2209 82 8
ja_2210 28 14
ja_2212 61 10
ja_2215 45 12
ja_2217 65 15
ja_2218 47 16
ja_2219 28 18
ja_2220 43 22
ja_2222
ja_2224 29 20
ja_2225 54 12
ja_2231 31 20
ja_2234 26 16
ja_2235 50 12
ja_2237 29 16
ja_2239 29 14
ja_2243 18 12
ja_0743
ja_0922
ja_0988
ja_1003
ja_1069
ja_1622
ja_1629 54 14
ja_1670 28 21
ja_1688 21 12
ja_1690 19 11
ja_1967 16
ja_2035 30 14
ja_2214
ja_2238 29 23
ja_3002
ja_3004
ja_3005
ja_3008
ja_4061
ja_4275
ja_0696
ja_0862
ja_0986 32 16
ja_1005
ja_1072 34 16
ja_1586 35 18
ja_1674 54 16
ja_1832 19 13
ja_1867 30 16
ja_1966 27 16
ja_2053 14
ja_2074 46 16
ja_2196 28 19
ja_2216 25 18
ja_2223 36 16
ja_2236 13
ja_2242 48 14
ja_3001
ja_3006
ja_3007

Mandarin Sex Age Age
ma_0003 F 40 13
ma_0010 27 15
ma_0022 F 1
ma_0027 M 14
ma_0028 0 19
ma_0029 M 20
ma_0030 M 29 16
ma_0035 0 15
ma_0104 F 26 16
ma_0106 M 24 17
ma_0110 F 29 16
ma_0111 F 32 20
ma_0117 30 16
ma_0131 0 18
ma_0626 M
ma_0637 0 15
ma_0651 31 10
ma_0653 21 15
ma_0667 M
ma_0669 M 27 20
ma_0671 F 40 20
ma_0674 M 31 10
ma_0679 M 32 18
ma_0682 M 23 18
ma_0691 M 31 11
ma_0695 M 28 10
ma_0698 M 25 10
ma_0703 F 30 14
ma_0704 M
ma_0711 F 27 10
ma_0716 M 31 12
ma_0717 F 15
ma_0718 M 27 20
ma_0719 M 18
ma_0721 F 24 15
ma_0727 F 25 15
ma_0735 F 26 20
ma_0738 F 47 17
ma_0742 M 25 13
ma_0748 M 31 15
ma_0750 F 23 16
ma_0751 F 20 14
ma_0752 M 24 16
ma_0754 M 27 18
ma_0755 M 42
ma_0756 M 30 17
ma_0758 M 92
ma_0760 M 27 17
ma_0761 F 29 20
ma_0763 M 25 18
ma_0764 M 20
ma_0766 M 20
ma_0768 M 26 19
ma_0769 F 36 15
ma_0771 F 30 18
ma_0773 M 28 20
ma_0774 M 26 14
ma_0779 F 25 15
ma_0782 F 28 22
ma_0783 M
ma_0785 F 27 20
ma_0786 M 29 16
ma_0790 M 25 16
ma_0796 30 15
ma_0799 M 32 16
ma_0806 M 40 18
ma_0807 M 49 22
ma_0814 F 26 18
ma_0815
ma_0817 M 30 20
ma_0821 M 25 16
ma_0823 M 31 8
ma_0827 F 30 16
ma_0828 F 30 20
ma_0829 41 19
ma_0840 M 29 15
ma_0844 F 24 10
ma_0845 M
ma_0846 F 24 15
ma_0848 M 26 10
ma_0851 M 35 20
ma_0859 M 26 18
ma_0860 M 20
ma_0861 M 27 17
ma_0871 35 16
ma_0876 M
ma_0880 M 13
ma_0881 F 30 25
ma_0882 M
ma_0888 M 16
ma_0894 M 36 25
ma_0900 F 26 18
ma_0906
ma_0913 M
ma_0915
ma_0916 M 32 18
ma_0920 M
ma_0925 M 16
ma_0932
ma_0952 F 23 12
ma_0958 M 23 17
ma_0963 F 33 15
ma_0975 F 34 16
ma_0976 M 23 16
ma_0977 M
ma_1006 F 31 20
ma_1008 F
ma_1014 M 38 20
ma_1022 25 12
ma_1067 M 24 6
ma_1077 M 30 20
ma_1279 F
ma_1280 M
ma_1281 M
ma_1283 M 16
ma_1293 F 26 14
ma_1303 F 12
ma_1307 F 15
ma_1346 M 16
ma_1352 M
ma_1357 F
ma_1359 F
ma_1376 F
ma_1393 F
ma_1396 M
ma_1430 M
ma_1525 F
ma_1539 M
ma_1582 M 26 19
ma_1597 M 15
ma_1603 F
ma_1671 F
ma_1700 M
ma_1711 F
ma_1728 F
ma_1737 M

Spanish Sex Age Age
sp_0053 30 16
sp_0054 56 22
sp_0082 39 14
sp_0084 32 12
sp_0088 37 15
sp_0616
sp_0681 15
sp_0687
sp_0699 29 17
sp_0707
sp_0737
sp_0776
sp_0857 20 17
sp_0912 56 22
sp_0934
sp_0937 25 19
sp_0943
sp_0970
sp_1015 30 10
sp_1031
sp_1046
sp_1059
sp_1074 22
sp_1084 32 20
sp_1100
sp_1142 29 19
sp_1143 34 20
sp_1148 34 16
sp_1156 32 10
sp_1157 21
sp_1163 35 19
sp_1186 19
sp_1212 22 16
sp_1219 37 16
sp_1295 27 15
sp_1343 31 20
sp_1345 29 14
sp_1362
sp_1427
sp_1435 25 16
sp_1438 21 16
sp_1553
sp_1577
sp_1578
sp_1587
sp_1592
sp_1594
sp_1596 30 21
sp_1643 40 16
sp_1644
sp_1648
sp_1651
sp_1654 39 16
sp_1673
sp_1720 28 20
sp_1747 18
sp_1748
sp_1784 20 2
sp_1785
sp_1789
sp_1807 26 17
sp_1813 19 13
sp_1814 19 13
sp_1827
sp_1829 20 15
sp_1847 37 12
sp_1850 74 12
sp_1858 20 12
sp_1904
sp_1923 23 18
sp_1926 14
sp_1931
sp_1933 58 14
sp_1934
sp_1940
sp_1953 37 16
sp_1954 20 15
sp_1955 26 17
sp_1963 19 13
sp_2003 28 14
sp_2010
sp_2023 28 14
sp_2024 25 20
sp_2036 21 14
sp_2046 20 14
sp_2049
sp_2061
sp_2067
sp_2069 20 14
sp_2077
sp_2078
sp_2079 24 18
sp_2082 20 14
sp_2083 19 13
sp_2086
sp_2114 28 18
sp_2155 48 14
sp_2158
sp_2164
sp_2168
sp_2173 40 10
sp_2174
sp_2175
sp_2179 20 14

CMU

Brian MacWhinney
Department of Psychology
Carnegie Mellon University
Pittsburgh, PA [email protected]

These are short conversations recorded by the students in Brian MacWhinney's class in Language and Thought in the Fall term of 2001. This assignment was worth 25% of the grade. The goal was to learn to record and transcribe spontaneous interactions using CLAN.

The transcripts are combined in one zip file, called cmu.zip. The audio and documentation can be downloaded separately. The materials include:

Dimitrios discussing the experience of coming to America in Greek with his mother.
Elizabeth planning the evening's activities with her family.
An anonymous student discussing past events with his friends from Singapore.
Marina discussing travel with two Swiss friends.
Yuki discussing her friend's dog.
Discussions in Marianne's immigrant Russian family on translation, the word "flaky", and the idea of a 4-minute mile.

This is a parallel set from the class of the Spring of 2003.

Anna's recording of a discussion of dominance relations in a sorority: transcript and audio.

Amy discussing graduate schools with her friends, with some code-mixing.
Beverly's workgroup devising a ball net.
Courtney's discussion of her childhood with her mother.
David's recording of a session of a CMU comedy group.
Jai's art project group.
Jing's discussion in Chinese with her statistics tutor.
Kerry's recording of a discussion between a couple planning a move to Minneapolis after graduation.
Kirsten's recording of a conversation between two friends.
Matthew's recording of a computerized route description task.
Michael's transcripts from discussions of CMU life between friends.
Monica's recording of friends in a cafe.
Ryan's recording of his buddies discussing Spring Break and football season.
Vanessa's recording of a discussion of a friend's dating life.

DISCLAB

Susan Ervin-Tripp
Department of Psychology
University of California
Berkeley, CA

The DISCLAB transcripts were collected by Susan Ervin-Tripp and John Gumperz in the 1980s from a variety of conversations at Berkeley. Permissions to use these data vary from conversation to conversation. Conversations with possible restrictions are CON03 (ambiguous), DIN17 (for linguistic and ethnographic research), KIDS01/02 (ambiguous), QUAKE (okay if anonymous), RAZAS (okay if not degrading), and RPG01 (only by Psych department).

Lampert: In file names for this corpus, T and L refer to schools. T is Sta. Teresa, a middle-class Oakland parochial school, and L is Longfellow, an ethnically diverse working-class school in Alameda. T2 is second grade. M4 means the fourth male group. The larger L school had more classes and therefore more identifiers, so those have to be figured out. There, the identifiers combine grade, room number, and gender group, as in L2.12.F3.
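As a rough illustration, an L-school identifier such as L2.12.F3 could be split into its parts as sketched here. The layout (school letter plus grade, room number, gender letter plus group number) is an assumption inferred from that single example, and the function and field names are hypothetical; actual file names in the corpus may follow other patterns.

```python
import re

# Assumed layout, inferred only from the example "L2.12.F3":
#   school letter + grade . room number . gender letter + group number
IDENTIFIER = re.compile(
    r"^(?P<school>[A-Z])(?P<grade>\d+)\.(?P<room>\d+)\.(?P<gender>[MF])(?P<group>\d+)$"
)


def parse_identifier(identifier):
    """Return the parts of an identifier as a dict, or None if it does not fit the assumed layout."""
    match = IDENTIFIER.match(identifier)
    if match is None:
        return None
    parts = match.groupdict()
    return {
        "school": parts["school"],
        "grade": int(parts["grade"]),
        "room": int(parts["room"]),
        "gender": parts["gender"],
        "group": int(parts["group"]),
    }


print(parse_identifier("L2.12.F3"))
```

Shorter labels such as T2 or M4 do not fit this three-part layout, so the sketch returns None for them rather than guessing.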

Escalera and Sprott worked with young children.

GulfWar

The GulfWar corpus is a set of 16 transcripts of calls to radio station WQED in Berkeley, California, during the Gulf War of 1987, contributed by Johannes Wagner.

Jefferson

This segment of the Conversation Database is dedicated to the memory of Gail Jefferson.

WaterGate

GailGate is a collection of 22 transcripts of telephone conversations between President Richard Nixon and members of his top staff and their lawyers during April 1973, as the prosecution of the Watergate break-in and its subsequent cover-up was being conducted. All of the conversations are telephone calls, with the exception of 3-21ndh.cha, which is a conversation recorded in the Oval Office. The National Archives provided the audiotapes; Gail Jefferson produced the transcripts in MS-Word format. These originals are available in PDF format in the media folder, which has MP3, WAV, and the original WAV from the National Archives. Johannes Wagner, Lone Laursen, and Brian MacWhinney then reformatted the transcripts to CHAT heritage format and linked the files to the audio. In addition, Gail Jefferson had created four transcripts (3-21ndh, 4-19ekalm, 4-25nh, and 72-colhunt) using the typewriter. Johannes Wagner and Lone Laursen computerized these directly into CHAT heritage format, so no PDF files are available for these four.

Conversation      Tape/Segment   Participants                          Time

Before April 13:
72-colhunt.cha    -              Colson Hunt                           Nov 13, 1972
3-21ndh.cha       -              Oval Office: Nixon, Dean, Haldeman    March 21, 10-11 am
4-12nc.cha        -              Nixon Colson                          April 12, 7-8 pm

April 13-15, 1973: (253)
4-13nehig.cha     (1) 38-1       Nix Ehrl Higby                        April 13, 9-10 am
4-13nh.cha        (2) 38-9       Nix Hald                              April 13, 5pm
4-13ne1.cha       (3) 38-12      Nix Ehrl                              April 13, 6pm
4-13neh.cha       (4) 38-14      Nix Hald Ehrl                         April 13, 6pm
4-13ne2.cha       (5) 38-15      Nix Ehrl                              Apr 13, 7 p.m.
4-14eklein.cha    (6) 38-31      Ehrlichman Kleindienst                April 14, 5pm
4-15nz.cha        (7) 38-39      Nix Ziegler                           April 15, 1 a.m.

April 15-16: (254)
4-15psilb.cha     (1) 38-48      Petersen and Silbert                  April 15, 4pm (a)
4-15np1.cha       (2) 38-52      Nixon Petersen                        April 15, 8 pm (a)
4-15hhig.cha      (3) 38-53      Haldeman Higby                        April 15, 8pm
4-15np2.cha       (4) 38-55      Nixon Petersen                        April 15, 8 pm (b)
4-15np3.cha       (5) 38-58      Nixon Petersen                        April 15, 9pm
4-15egray1.cha    (6) 38-60      Ehrlichman Gray                       April 15, 10 pm (a)
4-15egray2.cha    (7) 38-62      Ehrlichman Gray                       April 15, 10 pm (b)
4-15np4.cha       (8) 38-63,64   Nixon Petersen                        April 15, 11pm
4-16np.cha        (9) 38-82      Nixon Petersen                        April 16, 8 pm

April 17-18: (255)
4-17nd.cha        (1) 38-84      Nixon Dean                            April 17, 9 am
4-17ne.cha        (2) 38-86      Nixon Ehrlichman                      April 17, 2 pm
4-17etim.cha      (3) 38-88      Ehrlichman Timmons                    April 17, 3-4 pm
4-17nz.cha        (4) 38-90      Nixon Ziegler                         April 17, 6 pm
4-17nkiss.cha     (5) 38-92      Nixon Kissinger                       April 17, 11pm-12
4-18nh.cha        (6) 38-95      Nixon Haldeman                        April 18, 12 am

After April 18:
4-19ekalm.cha     -              Ehrlichman Kalmbach                   April 19
4-25nh.cha        -              Nixon Haldeman                        April 25

NB

NB (aka Newport Beach) is a collection of phone calls collected in the early years of Conversation Analysis. These transcripts have been the focus of much of the seminal work done in CA. NB includes 25 files from more than 30 files in the original data-corpus. All files have been typewriter-transcribed by Gail Jefferson. Several files have been re-transcribed over the years. Gail Jefferson produced four transcriptions (2countryclub, 3assistance, 4matter and 5directions) electronically in autumn 2007.

In 2003 and 2004, Kresten Nyman and Johannes Wagner retyped all of the files into the computer from the typewritten originals. Johannes Wagner is responsible for any errors in the electronic transcripts. Before making the data available, Lone Laursen, Brian MacWhinney and Johannes Wagner anonymized names and addresses. In the sound files, names and other personal information were replaced by silences. In the transcripts, personal names, place names and addresses were replaced by pseudonyms, which are syllabically equivalent to the original. During this process, we retained the various pseudonyms that had already been chosen by Gail Jefferson, while adding many new ones. The data in the NB corpus are numbered to indicate the succession of the calls. However, we do not know how or when the original recordings were made.

Files in directory

Transcript                Earlier Name              Number
1golf.cha                 golf                      I:1:R
2countryclub              ---                       ---
3assistance.cha           ---                       ---
4matter.cha               ---                       ---
5directions.cha           ---                       ---
6fungus.cha               fishing/fungus picture    I:6:R
7assassination1.cha       assassination i           II:1:R
8assassination2.cha       assassination ii          II:2:R
9palmsprings.cha          palm springs              II:3:R
10blinddate.cha           blind date                II:4:R
11goldbridge.cha          gold bridge               II:5:R
12invitation.cha          invitation to the beach   III:1
13hightide.cha            high tide                 III:2:R
15fishing.cha             fishing                   III:4:R
16dreary.cha              dreary                    IV:1:R
17tacos.cha               tacos                     IV:2:R
18clothing.cha            clothing                  IV:3:R
19paper.cha               paper                     IV:5:R2
20marysinvitation.cha     m’s invitation            IV:9:R2
21swimnude.cha            swim in the nude          IV:10:R
22thanksgiving.cha        happy thanksgiving        IV:11:R2
23marines.cha             black marines             IV:12:R2
24meatless.cha            meatless                  IV:13:R
25powertools.cha          power tools               V:R

JOC

Curtis LeBaron
Organizational Leadership
Brigham Young University
Marriott School of Management
Provo UT [email protected]

These six transcripts linked to video provide the content for articles published in a special issue of the Journal of Communication.

MOVIN

Johannes Wagner
Language and Communication
Odense University
Campusvej 55
Odense, [email protected]

The MOVIN project involves collaboration among various researchers in the fields of Discourse Analysis and Conversation Analysis with a focus on political dialog. The database includes a small number of sample files from these languages:

American English: A video recording of a story told by an American professor to four Danish listeners. The story is about a doctor who fixes a shoulder dislocation during a waterskiing accident.

Australian English

Danish: Danish reality show clips.

Estonian

Finnish

French: A divorce conciliation proceeding from the CLAPI corpus.

German: A TV discussion of scandals in the building industry.

Italian: A TV interview with Bettino Craxi.

Norwegian

Swedish

Sakura

Miyata, Susanne [email protected], Kyoko
Konishi, Saya
Matsui, Ayumi
Matsumoto, Shiori
Oogi, Rie
Takahashi, Akane
Muraki, Kyoko
Aichi Shukutoku University
Nagoya, Japan

This corpus of 18 conversations is the product of six graduation theses on gender differences in students' group talk. Each conversation lasted between 12 and 35 minutes (avg. 25 minutes), resulting in an overall time of 7 hours and 30 minutes. Thirty-one students (19 female, 12 male) participated in the study (Table 1). The participants met in groups of four, either same-sex or mixed-sex (6 conversations with 4 female students, 6 with 4 male students, and 6 with 2 male and 2 female students), grouped according to age (first- and third-year students) and affiliation (two academic departments). In addition, the participants of each conversation came from the same small class and were well acquainted.

When recruited, the participants were informed that their conversations might be video-recorded and transcribed for use in possible publication. Permission was asked once more after the transcription in cases where private information had been disclosed, or where a misunderstanding about the nature and degree of publication became apparent during the conversation.

The recordings took place in a small conference room at the university between or after lectures. The participants were given a card with a conversation topic to start with, but were free to vary (topic 1 "What do you expect from an opposite sex friend?" [isee ni motomeru koto]; topic 2 "Are you a dog lover or a cat lover?" [inuha ka nekoha ka]; topic 3 "About part-time work" [arubaito ni tsuite]). The investigator was not present during the recording. The combination of participants, the topic, and the duration of the 18 conversations are given in Table 2.

The participants produced 15,449 utterances overall (female: 8,027 utterances, male: 7,422 utterances). All utterances were linked to video and transcribed in regular Japanese orthography and Latin script (Wakachi2002), and provided with morphological tags (JMOR04.1). Proper names were replaced by pseudonyms.

Table 1 List of participants, sex, age, and the number of their appearances

ID   Age  Sex     #    ID   Age  Sex     #
A1F  19   female  3    I3F  21   female  1
B1F  19   female  5    K3F  21   female  1
C1F  19   female  2    L3F  21   female  3
D1F  19   female  3    A1M  19   male    3
E1F  19   female  2    B1M  19   male    3
F1F  19   female  1    C1M  19   male    5
G1F  19   female  1    D1M  19   male    3
H1F  19   female  1    E1M  21   male    4
A3F  21   female  2    G3M  21   male    5
B3F  21   female  2    H3M  21   male    4
C3F  21   female  2    I3M  21   male    4
D3F  21   female  1    J3M  21   male    2
E3F  21   female  2    K3M  22   male    1
F3F  21   female  1    L3M  21   male    1
G3F  21   female  2    M3M  21   male    1
H3F  21   female  1

Table 2 Specifications of the 18 conversations

File      Participants     Sex  Topic  Duration
sakura01  G3M H3M K3F L3F  MF   1      26'00"
sakura02  A1F B1F C1M B1M  MF   1      35'30"
sakura03  H3F I3F J3M K3M  MF   2      11'45"
sakura04  H1F B1F C1M E1M  MF   2      26'25"
sakura05  I3M G3M L3F E3F  MF   3      27'20"
sakura06  G1F F1F E1M D1M  MF   3      26'00"
sakura07  D3F F3F L3F E3F  FF   1      25'00"
sakura08  E1F B1F C1F D1F  FF   1      28'25"
sakura09  E1F B1F A1F D1F  FF   2      27'00"
sakura10  A3F B3F C3F G3F  FF   2      25'15"
sakura11  A1F B1F C1F D1F  FF   3      25'25"
sakura12  A3F B3F C3F G3F  FF   3      23'55"
sakura13  G3M H3M I3M J3M  MM   1      21'20"
sakura14  B1M A1M E1M C1M  MM   1      30'00"
sakura15  I3M H3M G3M M3M  MM   2      25'50"
sakura16  E1M A1M C1M D1M  MM   2      23'30"
sakura17  L3M G3M I3M H3M  MM   3      26'45"
sakura18  A1M B1M C1M D1M  MM   3      16'50"
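The durations in Table 2 can be checked against the overall time and average quoted in the corpus description. The short sketch below simply sums the 18 durations as (minutes, seconds) pairs copied from the table; the variable names are illustrative only.

```python
# Durations from Table 2 as (minutes, seconds), in file order sakura01..sakura18.
DURATIONS = [
    (26, 0), (35, 30), (11, 45), (26, 25), (27, 20), (26, 0),
    (25, 0), (28, 25), (27, 0), (25, 15), (25, 25), (23, 55),
    (21, 20), (30, 0), (25, 50), (23, 30), (26, 45), (16, 50),
]

# Convert every duration to seconds and add them up.
total_seconds = sum(m * 60 + s for m, s in DURATIONS)
mean_seconds = total_seconds / len(DURATIONS)

# Break the total down into hours, minutes, and seconds for display.
hours, remainder = divmod(total_seconds, 3600)
minutes, seconds = divmod(remainder, 60)

print(f"total: {hours} h {minutes} min {seconds} s")
print(f"mean:  {mean_seconds / 60:.1f} min")
```

The sum comes to 7 h 32 min 15 s with a mean of about 25.1 minutes, which agrees with the rounded figures of 7 hours 30 minutes and 25 minutes given in the description.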

SamtaleBank

Johannes Wagner
Southern Denmark University

SamtaleBank is the Danish spoken language component of the DK/CLARIN project directed by Bente Maegaard. Participants in the spoken language component include Johannes Wagner, Lone Laursen, Patrizia Paggio, Frans Gregersen, and Peter Henrichsen. The current contents of the corpus include:

Password: Materials on the use of Business English by Danes and second language learners of Danish.

Radio: Danish call-in radio programs.

Sam2: Videotaped conversations between two participants.

Sam3: Videotaped conversations between three participants.

Telefon: Telephone conversations.

Filename   Length  Transcriber         Lines  Comments
Arun504A1  22:50   Kristian Mortensen  601    Talk in the lunch room: "little shit, workshop conversation". No prosody marking in the transcription.
Arun504A2  5:29    Kristian Mortensen  312    Talk in the stockroom: "workshop conversation". No prosody marking in the transcription.
Arun509A3  9:50    Sofie Emmertsen     745    Just after work(?). Meets a female friend (NNS) who has just been given a permanent job at Føtex. Quite possibly interesting.
Arun513A1  12      Susan Linke         321    Records while working in the shop and stockroom. Not much interaction.
Arun513A3  25:40   Sofie Emmertsen     1289   In the lunch room and shop. Talk with customers (including an acquaintance's parents) and colleagues.

Filename  Length  Lines  Situation
MUL534H1  0:53    41     Mulenga greets her husband at the door when he comes home from work.
MUL536H1  6:07    279    Emma, Mulenga, and her husband, Jørgen, eat breakfast (or is it dinner?). They sit at the coffee table because there are a lot of things on the dining table. Mulenga tells which books she has to read. Emma is 9, is in 3rd grade, and has just started learning English at school.
MUL536H3  16:12   667    Breakfast: Mulenga, Emma, and Jørgen. They talk about M going out to eat with some friends in the evening. Emma talks about the film Madagascar, which she saw the evening before.
MUL542H1  31:13   1426   Mulenga, her husband, and Emma eat dinner. They sit at the coffee table because there are a lot of things on the dining table. Emma talks about her birthday at her mother's. They sit at the coffee table and eat because the dining table is full of things that are to be painted. Faster Finn is a man Jørgen works for now and then.
MUL548H1  13:12   498    Mulenga talks with her husband about what to give his parents for Christmas.
MUL551H   9:12    394    Mulenga is with Emma in the living room of their home. Emma lives with Mulenga and her husband every other weekend and occasionally at other times. Mulenga asks how Emma's dancing is going.
MUL605H1  34:34   1075   Emma, Jørgen, and Mulenga at home.
MUL605H3  22:57   1067   Mulenga, Jørgen, and Emma eat dinner. Together with whom?
MUL605H4  16:34   606    Mulenga, Jørgen, and Emma. In the background is the sound of a television and splashing water.
MUL605H5  39:57   1516   Mulenga, Jørgen, and Emma. Mulenga is clipping Ronia's nails, Jørgen is nearby, and the television is on in the background. Talk about how much candy Emma may eat.
MUL605H6  3:54    115    Mulenga and Emma. Throughout the recording the sound is rather far away. The television is on in the background.
MUL613H1  8:26    275    Mulenga talks with Emma.
MUL613H2  11:52   313    Mulenga talks with Emma.
MUL613H3  3:42    65     Mulenga talks with Emma.
MUL613H4  9:58    300    Homework. Mulenga talks with Emma. It sounds as if Mulenga is reading aloud. She asks Emma about word meanings.
MUL613H5  5:54    70     Jørgen is also at home. He, Mulenga, and Emma talk. Only 1.65 minutes transcribed.
MUL534F1  3:35           Mulenga is at a party with her foreign friends: Charles (Ghana) and Nun (Thailand; in Denmark 9 months). There is music in the background and they have had a little to drink. They practice Danish together. They look at Nun's family pictures. Nun explains who is in the pictures.
MUL534H2  1:48           Mulenga and her husband practice using the recorder.
MUL536H2  14:00          Emma and Mulenga look at her old homework together and make faces. M reads aloud from Cinderella (Askepot). Emma uses sign and body language to explain what the words mean. They look at the pictures together.
MUL536H4  14:02          Breakfast: Mulenga, Emma, and Jørgen.
MUL537T   17:34          Module test 3.1. Mulenga is examined together with Eva. Eva starts by talking about "Palle Alene i Verden". Then Mulenga talks about "Et år i Paris".
MUL542H2  22:08          Mulenga reads aloud from her books to Emma.
MUL605H2  35:58          Mulenga, Emma, Jørgen, and ? in a car.
MUL620T   44:46          (15 May 2006. Kirsten is present.) Mulenga takes module test 3.4 with Fie from China. Their teacher Jacob examines them.
MUL622H   53:24          Mulenga discusses Emma and her homework with her husband.

SBCSAE

John DuBois
Linguistics
University of California, Santa [email protected]

Robert Englebretson
Linguistics
Rice University
Houston, [email protected]

The Santa Barbara Corpus of Spoken American English is based on hundreds of recordings of natural speech from all over the United States, representing a wide variety of people of different regional origins, ages, occupations, and ethnic and social backgrounds. It reflects many ways that people use language in their lives: conversation, gossip, arguments, on-the-job talk, card games, city council meetings, sales pitches, classroom lectures, political speeches, bedtime stories, sermons, weddings, and more. The corpus was collected by the University of California, Santa Barbara Center for the Study of Discourse, Director John W. Du Bois (UCSB), Associate Editors: Wallace L. Chafe (UCSB), Charles Meyer (UMass, Boston), and Sandra A. Thompson (UCSB).

Each speech file is accompanied by a transcript in which phrases are time stamped with respect to the audio recording. Personal names, place names, phone numbers, etc., in the transcripts have been altered to preserve the anonymity of the speakers and their acquaintances, and the audio files have been filtered to make these portions of the recordings unrecognizable. Pitch information is still recoverable from these filtered portions of the recordings, but the amplitude levels in these regions have been reduced relative to the original signal. A separate filter list file (*.flt) associated with each transcript/waveform file pair is provided to list the beginning and ending times of the filtered regions. Four of the .flt files are empty because there was no information that needed to be filtered out of the corresponding audio files. The filtering was done using a digital FIR low-pass filter, with the cut-off frequency set at 400 Hz. The effect of the filter was gradually faded in and out at the beginning and end of the regions over a 1,000 sample region, roughly 45 milliseconds, to avoid abrupt transitions in the resulting waveform. The audio data consists of 16 wave format speech files, recorded in two-channel PCM at 22,050 Hz.
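The masking procedure described above can be sketched roughly as follows. Only the 400 Hz cut-off, the 1,000-sample fade, and the 22,050 Hz sample rate come from the description; the Hamming-windowed sinc design, the 101-tap length, the placement of the fade inside the filtered region, and all function names are assumptions made for illustration, not the method actually used for the corpus.

```python
import math

SAMPLE_RATE = 22050   # Hz, as stated for the corpus audio
CUTOFF_HZ = 400       # low-pass cut-off, as stated
FADE_SAMPLES = 1000   # fade length in samples, as stated


def lowpass_kernel(cutoff_hz, sample_rate, num_taps=101):
    """Hamming-windowed sinc low-pass kernel with unit gain at 0 Hz.

    num_taps=101 is an illustrative choice, not a documented value.
    """
    fc = cutoff_hz / sample_rate          # cut-off in cycles per sample
    mid = (num_taps - 1) / 2
    taps = []
    for n in range(num_taps):
        x = n - mid
        ideal = 2 * fc if x == 0 else math.sin(2 * math.pi * fc * x) / (math.pi * x)
        window = 0.54 - 0.46 * math.cos(2 * math.pi * n / (num_taps - 1))
        taps.append(ideal * window)
    gain = sum(taps)
    return [t / gain for t in taps]


def filtered_sample(signal, kernel, i):
    """One output sample of the zero-padded, centred FIR convolution."""
    half = len(kernel) // 2
    acc = 0.0
    for k, h in enumerate(kernel):
        j = i + half - k
        if 0 <= j < len(signal):
            acc += h * signal[j]
    return acc


def mask_region(signal, start, end, kernel, fade=FADE_SAMPLES):
    """Low-pass signal[start:end], blending in and out over `fade` samples."""
    out = list(signal)
    for i in range(start, end):
        # Weight ramps 0 -> 1 at the start of the region and 1 -> 0 at its end.
        weight = min(1.0, (i - start) / fade, (end - 1 - i) / fade)
        out[i] = (1 - weight) * signal[i] + weight * filtered_sample(signal, kernel, i)
    return out


# Fade length in milliseconds at the stated sample rate.
fade_ms = 1000 * FADE_SAMPLES / SAMPLE_RATE
print(f"fade length: {fade_ms:.1f} ms")

# Demonstration on a synthetic 3 kHz tone: the masked region is strongly attenuated.
tone = [math.sin(2 * math.pi * 3000 * n / SAMPLE_RATE) for n in range(4000)]
masked = mask_region(tone, 1000, 3500, lowpass_kernel(CUTOFF_HZ, SAMPLE_RATE))
```

At 22,050 Hz a 1,000-sample fade lasts about 45.4 ms, which matches the "roughly 45 milliseconds" quoted above. Content below the cut-off, such as the fundamental of most voices, passes largely unchanged, which is why pitch remains recoverable in the filtered regions.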

The TalkBank version of the corpus was constructed by Nii Martey of the Linguistic Data Consortium with help from Jack DuBois for Part 1 and from Robert Englebretson, now at Rice University, for Parts 2, 3, and 4. In the case of a phone number, which was not adequately disguised by the filter, the signal was set to zero, except for the 45 millisecond boundary regions which fade into and out of zero.

01 Actual Blacksmithing
02 Lambada
03 Conceptual Pesticides
04 Raging Bureaucracy
05 A Book About Death
06 Cuz
07 A Tree's Life
08 Tell the Jury That
09 Zero Equals Zero
10 Letter of Concerns
11 This Retirement Bit
12 American Democracy is Dying
13 Appease the Monster
14 Bank Products

Age City Orig
0001 LENORE f 30 Los Angeles CA CA BA 16 student white
0002 DORIS f 50 Montana MT MT HS 12 horse ranc white
0003 LYNNE f 19 Montana MT HS 12 student/ho white
0004 HAROLD
0005 JAMIE f 30 Walnut Cre CA CA college 16 dancer/da white
0006 MILES m CA black
0007 PETE m 36 San Leandr CA CA 18 grad student white
0008 ROY m 34 CA designer white
0009 MARILYN f 33 CA writer white
0010 CAROLYN f 19 Santa Fe NM CO HS 12 student white
0011 KATHY f 31 Boston/Santa Fe A/NM CA grad student white
0012 SHARON f 24 New Mexico NM TX college teacher white
0013 SHANE m 23 Corp Christi TX TX grad med student chicano
0014 PAM f 43 Massachusetts MA NM housewife white
0015 WARREN m 34 Wenham MA IL DVM 23 veterinarian white
0016 DARRYL m 33 San Francisco CA CA BA 16 comm./comp white
0017 PAMELA f 38 Southern California CA CA BA 16 actress/fi white
0018 ALINA f 34 Los Angeles CA CA BA 16 housewife white
0019 ALICE f 28 Pryor MT MT 4 years 16 student Crow Indian
0020 MARY f 27 Pryor MT MT college 3 cook fire Crow Indian
0021 RICKIE San Francisco CA CA HS 12 clerk black
0022 JUNE f 21 Laguna Beach CA CA A MA 17 grad student white
0023 REBECCA f 31 Saratoga A CA A J 22 attorney white
0024 ARNOLD m Saginaw MI CA HS 12 S Army white
0025 KATHY f 17 Mobile AL AL HS 10 student white
0026 NATHAN m 19 Mobile AL AL HS 12 student white
0027 BRAD m 45 MA 18 director o white
0028 PHIL m 30 NM BA 16 designer hispanic
0029 DORIS f 83 Indianapolis IN AZ MA 18 teacher white
0030 ANGELA f 90 middle Wes MO AZ MS 18 teacher J white
0031 SAM f 72 Arcadia IN AZ Nursing 15 retired white
0032 BEV f 20 So California CA CA HS 15 student white
0033 MONTOYO m 51 CA PhD political latino/chicano
0034 MARIA f 26 Nicaragua CA HS 15 dispatcher hispanic
0035 GILBERT m 22 So California CA CA HS student hispanic
0036 CAROLYN f 18 So California CA CA HS 12 student white
0037 LAURA f 23 San Jose CA CA HS student japanese/
0038 FRANK m 24 So California CA CA BA 16 business o white
0039 RAMON m 19 MoreValley CA CA HS 12 student hispanic
0040 RUBEN m 27 So California CA CA 5 yrs 17 teacher hispanic
0042 KENDRA f 25 midwest IN IN BA 16 administrator white
0043 KEN m 51 midwest IN IN Phd M 23 director o white
0044 MARCI f 50 midwest IN IN MA 19 counselor white
0045 WENDY f 26 midwest IN IN BS 16 missionary white
0046 KEVIN m 26 midwest IN IN S Cr 16 missionary white
0047 JIM m 41 metro St.L. IL IL certified 16 banking white
0048 FRED m 47 Chrisman IL IL masters 18 loan officer white
0049 JOE m 45 Dupo IL IL 17 banking white
0050 KURT m 70 Millstad IL IL 12 retired-co white
0051 VIVIAN f 55 Shenandoah A IL HS 13 banking white


SCoSE – Saarbrücken Corpus of Spoken English

Neal Norrick
Linguistics
Saarland University
Saarbrücken, [email protected]

Lynne

Participants: Helen and her daughters: Annie, in her early thirties; Lynne, a grad student home from college; Jennifer, the undergraduate younger sister, who arrives later; and their niece/cousin Jean, also in her early thirties.

They are gathered before a late-afternoon Thanksgiving dinner in the living room of the house where Helen and Annie live. Both go into the adjacent kitchen from time to time.

Jason

Comments: Three undergraduates sharing an apartment. One voice is often louder than the others; there are frequent comments on the recorder and the recording process. Maybe cut A3 after the first six minutes and edit other files, as you see fit.

Steve

Grad student George has invited three undergraduates to talk about experiences they've had which could provide the basis for writing assignments.

Shelley

Pre-Thanksgiving dinner with grandparents, mother, father, and younger sister.


Yiddish

Zelda [email protected]

Every Thursday night at the Millenary synagogue in Manhattan, a group of young men and women meets, starting at 10 PM and continuing into the wee hours of the morning, to hear music, occasionally listen to a lecture, eat, drink, and mix socially. Because the hot dish known in Yiddish as the “(shabes) chulent” is served there, the group meeting is called “Chulent”. [i] While the gathering is open to all, and some non-Jews do wander in, the great majority of the attendees are men and women who were brought up in the Ultra-Orthodox (specifically, Hassidic) world. While not all of them were brought up speaking Yiddish, many were. The group composition changes from week to week. Some are regulars; others are not.

I was contacted by one of the regular attendees and asked to speak on a topic relating to Yiddish. I spoke about the research I did together with an Israeli colleague on an Old Yiddish poem. Then I informally met some of the attendees. Once I was familiar with the attendees, it was no problem getting them to agree to be informants. I made it clear that whatever they told me about their individual or personal issues was between us. In my research, I discuss their language only. In some cases, they told me which Hassidic group they belonged to; in other cases, they didn’t. Three of the nine informants were brought up in the Satmar community, and one is from the Tseylemer Hassidic community, a close relative of the Satmar.

The room in which the attendees meet is crowded and noisy, and recording there is virtually impossible. In one case only, I went to a back room with an informant, where we sat down and the informant spoke. For the most part, it was too noisy there to get a good recording. Fortunately, in the summer of 2009, Mayor Bloomberg’s administration allowed the placing of small tables and chairs outdoors, along Broadway between 34th Street and 38th Street. There we sat, my informants and I, between midnight and 2 AM, in the dark that was illuminated solely by street lamps.

The recordings include very little conversational interaction between the informant and me. I said to the young men: “Tell me a story, any story you want”, and they launched into a narrative. Some gave me a ready-made anecdote and spoke without hesitation; some retold family stories which had known content but no pre-determined form; some spoke of things that happened to them; still others simply made up a story as they went along.