Page 1: IRCS Workshop on Linguistic Databases, 11-13 December 2001 EXMARaLDA Thomas Schmidt SFB 538 „Mehrsprachigkeit“ University of Hamburg

IRCS Workshop on Linguistic Databases, 11-13 December 2001

EXMARaLDA

Thomas Schmidt

SFB 538 „Mehrsprachigkeit“

University of Hamburg

Page 2: Data Formats and Tools at the SFB

• 2200 transcriptions of spoken language (30 min recording each)
• Language acquisition data, interviews, expert discourse, classroom discourse, presentation discourse, interpreted discourse, ...
• 15 languages (German, English, Swedish, Norwegian, Danish, French, Spanish, Portuguese, Turkish, Italian, Basque, Japanese, Chinese, Russian, Luganda)
• 9 different data formats (dBase, syncWriter, HIAT-DOS, Verbmobil, ...)
• 3 different operating systems (Mac OS 9.x, Windows, Linux) + Mac OS X
• Research interests: phonetics, syntax, discourse, ...

Page 3: Data Formats and Tools at the SFB

syncWriter:
• editor for interlinear text
• Mac OS 9.x and earlier
• outputs binary data

Page 4: Data Formats and Tools at the SFB

HIAT-DOS:
• editor for HIAT transcription
• MS-DOS/Windows
• outputs text files

Page 5: Data Formats and Tools at the SFB

dBase/Access/4th Dimension:
• utterance databases

Page 6: Data Formats and Tools at the SFB

Verbmobil:
• 7-bit ASCII files

Page 7: Database „Multilingualism“

Goals:

1. To have one common tool for accessing (querying) the data
   • Data must come in one format (AG, annotation graph)
   • Multilingual issues must be taken care of (Unicode)
   • Data format should be software-independent (XML)
   • Software should work across different operating systems (Java)

2. To have different tools reflecting the habits and needs of the different projects
   • different input methods (score, column, vertical notation)
   • different output methods (ditto)

Page 8: Database „Multilingualism“

[Diagram with the components SyncWriter, HIAT-DOS, Verbmobil, SQL-Database, ACCESS / dBase, and an open "?"]

Page 9: Database „Multilingualism“

[Diagram with the components SyncWriter, HIAT-DOS, Verbmobil, SQL-Database, ACCESS / dBase, and EXMARaLDA (Segmented Transcription, List Transcription, Basic Transcription), plus Input / Editing Tools and Output / Visualization Tools]

Page 14: „Traditional“ layout principles

1. Score notation („Partitur“)

MAX [v]   You keep interrupting me, Tom.
MAX [nv]  ------ pointing at Tom -------------
TOM [v]   Oh, I'm sorry for that.
TOM [nv]  ----- smiling ---------------

Timeline: 0 1 2 3

Components: Tiers, Speakers, Categories, Timeline, Events
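As a rough illustration of how a score layout could be derived from tiers, a timeline, and events, here is a minimal Python sketch. The `TIERS` dictionary, the function names, and the width heuristic are hypothetical and not part of EXMARaLDA. The event boundaries for TOM are inferred from the overlap shown in the other notations, and the extent of "smiling" is an assumption.

```python
# Hypothetical in-memory model: tier label -> list of (start, end, text),
# where start and end are indices of timeline points 0..3.
TIERS = {
    "MAX [v]": [(0, 1, "You keep interrupting"), (1, 2, "me, Tom.")],
    "MAX [nv]": [(0, 2, "pointing at Tom")],
    "TOM [v]": [(1, 2, "Oh, I'm"), (2, 3, "sorry for that.")],
    "TOM [nv]": [(2, 3, "smiling")],  # assumed extent, not confirmed by the slides
}


def interval_widths(tiers, n_points):
    """Column width of each timeline interval, wide enough for every event."""
    widths = [1] * (n_points - 1)
    events = [e for evs in tiers.values() for e in evs]
    # Shortest spans first, so spanning events only widen a column if needed.
    for start, end, text in sorted(events, key=lambda e: e[1] - e[0]):
        need = len(text) + 1
        have = sum(widths[start:end])
        if have < need:
            widths[end - 1] += need - have
    return widths


def render_score(tiers, n_points):
    """Render all tiers on a shared horizontal timeline (score notation)."""
    widths = interval_widths(tiers, n_points)
    offsets = [0]
    for w in widths:
        offsets.append(offsets[-1] + w)
    label_w = max(len(label) for label in tiers) + 2

    # First line: the timeline points above their columns.
    ruler = [" "] * (offsets[-1] + 1)
    for i, off in enumerate(offsets):
        ruler[off] = str(i)
    lines = [" " * label_w + "".join(ruler).rstrip()]

    # One line per tier; each event starts at the offset of its start point.
    for label, evs in tiers.items():
        row = [" "] * offsets[-1]
        for start, _end, text in evs:
            row[offsets[start]:offsets[start] + len(text)] = text
        lines.append(label.ljust(label_w) + "".join(row).rstrip())
    return "\n".join(lines)


if __name__ == "__main__":
    print(render_score(TIERS, 4))
```

Simultaneous events start in the same column because every tier shares the same offsets, which is the defining property of the score layout.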

Page 15: „Traditional“ layout principles

1. Score notation („Partitur“)

Basic Transcription

Components: Tiers, Speakers, Categories, Timeline, Events

<transcription>
  <speakertable>
    <speaker id="SPK1" name="MAX"/>
    <speaker id="SPK2" name="TOM"/>
  </speakertable>
  <timeline>
    <timepoint id="T0"/>
    <timepoint id="T1"/>
    <timepoint id="T2"/>
    <timepoint id="T3"/>
  </timeline>
  <tier speaker="SPK1" category="v">
    <event start="T0" end="T1">You keep interrupting </event>
    <event start="T1" end="T2">me, Tom. </event>
  </tier>
  <tier speaker="SPK1" category="nv">
    <event start="T0" end="T2">pointing at Tom</event>
  </tier>
</transcription>
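A basic transcription in this shape can be read with any XML parser, provided the attribute values use plain ASCII double quotes. The following Python sketch uses the standard library's `xml.etree.ElementTree`. The element and attribute names are taken from the slide and may differ from the released EXMARaLDA file format; `load_events` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# The slide's example, with ASCII quotes so that it is well-formed XML.
BASIC_TRANSCRIPTION = """\
<transcription>
  <speakertable>
    <speaker id="SPK1" name="MAX"/>
    <speaker id="SPK2" name="TOM"/>
  </speakertable>
  <timeline>
    <timepoint id="T0"/>
    <timepoint id="T1"/>
    <timepoint id="T2"/>
    <timepoint id="T3"/>
  </timeline>
  <tier speaker="SPK1" category="v">
    <event start="T0" end="T1">You keep interrupting </event>
    <event start="T1" end="T2">me, Tom. </event>
  </tier>
  <tier speaker="SPK1" category="nv">
    <event start="T0" end="T2">pointing at Tom</event>
  </tier>
</transcription>
"""


def load_events(xml_text):
    """Return (speaker name, category, start index, end index, text) tuples."""
    root = ET.fromstring(xml_text)
    # Resolve speaker ids to names and timepoint ids to timeline positions.
    names = {s.get("id"): s.get("name") for s in root.iter("speaker")}
    order = {t.get("id"): i for i, t in enumerate(root.iter("timepoint"))}
    rows = []
    for tier in root.iter("tier"):
        for ev in tier.iter("event"):
            rows.append((
                names[tier.get("speaker")],
                tier.get("category"),
                order[ev.get("start")],
                order[ev.get("end")],
                (ev.text or "").strip(),
            ))
    return rows


if __name__ == "__main__":
    for row in load_events(BASIC_TRANSCRIPTION):
        print(row)
```

Events refer to timeline points by id rather than by position, so the timeline can be reordered or refined without touching the tiers.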

Page 17: „Traditional“ layout principles

2. Column notation

Basic Transcription

     MAX [v]                 MAX [nv]          TOM [v]           TOM [nv]
0    You keep interrupting   pointing at Tom
1    me, Tom.                                  Oh, I'm
2                                              sorry for that.   smiling
3

Components: Tiers, Speakers, Categories, Timeline, Events
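Column notation turns the score by ninety degrees: tiers become columns and timeline intervals become rows. The Python sketch below illustrates the idea under stated assumptions. The `TIERS` data and the `render_columns` name are hypothetical, TOM's event boundaries are inferred from the overlap with MAX, the extent of "smiling" is assumed, and an event that spans several intervals is printed only in the row where it starts.

```python
# Hypothetical in-memory model: tier label -> list of (start, end, text),
# where start and end are indices of timeline points 0..3.
TIERS = {
    "MAX [v]": [(0, 1, "You keep interrupting"), (1, 2, "me, Tom.")],
    "MAX [nv]": [(0, 2, "pointing at Tom")],
    "TOM [v]": [(1, 2, "Oh, I'm"), (2, 3, "sorry for that.")],
    "TOM [nv]": [(2, 3, "smiling")],  # assumed extent, not confirmed by the slides
}


def render_columns(tiers, n_points):
    """Render tiers as columns and timeline intervals as rows."""
    labels = list(tiers)
    # One row per interval between two neighbouring timeline points.
    grid = [[""] * len(labels) for _ in range(n_points - 1)]
    for col, label in enumerate(labels):
        for start, _end, text in tiers[label]:
            grid[start][col] = text  # simplification: first row of the span only

    widths = [
        max(len(label), *(len(row[col]) for row in grid))
        for col, label in enumerate(labels)
    ]

    def fmt(prefix, cells):
        body = " | ".join(c.ljust(w) for c, w in zip(cells, widths))
        return f"{prefix:<4}{body}".rstrip()

    lines = [fmt("", labels)]
    for i, row in enumerate(grid):
        lines.append(fmt(f"{i}-{i + 1}", row))
    return "\n".join(lines)


if __name__ == "__main__":
    print(render_columns(TIERS, 4))
```

The same tier, timeline, and event data drive this layout and the score layout, which is why a single basic transcription can serve both output methods.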

Page 20: „Traditional“ layout principles

3. Vertical notation

MAX: You keep interrupting [me, Tom.]
     (pointing at Tom)
TOM: [Oh, I'm] sorry for that.
     (smiling)

Components: Tiers, Speakers, Categories, Timeline, Events, Speaker-Turns
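Vertical notation needs one more unit than the other two layouts: speaker turns, with overlapping stretches marked by brackets. The Python sketch below shows one possible way to derive it from events. All names and the data are hypothetical, each speaker's events are treated as a single turn, TOM's event boundaries are inferred from the brackets on the slide, and the extent of "smiling" is assumed.

```python
# Hypothetical model: speaker -> list of (start, end, text) on timeline points 0..3.
VERBAL = {
    "MAX": [(0, 1, "You keep interrupting"), (1, 2, "me, Tom.")],
    "TOM": [(1, 2, "Oh, I'm"), (2, 3, "sorry for that.")],
}
NONVERBAL = {
    "MAX": [(0, 2, "pointing at Tom")],
    "TOM": [(2, 3, "smiling")],  # assumed extent, not confirmed by the slides
}


def overlaps(a, b):
    """True if the two events share part of the timeline."""
    return a[0] < b[1] and b[0] < a[1]


def vertical_notation(verbal, nonverbal):
    """One turn per speaker; overlapping speech in [], non-verbal events in ()."""
    # Speakers appear in the order in which they start talking.
    speakers = sorted(verbal, key=lambda s: verbal[s][0][0])
    lines = []
    for spk in speakers:
        others = [e for s, evs in verbal.items() if s != spk for e in evs]
        parts = []
        for ev in verbal[spk]:
            text = ev[2]
            if any(overlaps(ev, other) for other in others):
                text = f"[{text}]"
            parts.append(text)
        lines.append(f"{spk}: " + " ".join(parts))
        indent = " " * (len(spk) + 2)
        for ev in nonverbal.get(spk, []):
            lines.append(f"{indent}({ev[2]})")
    return lines


if __name__ == "__main__":
    print("\n".join(vertical_notation(VERBAL, NONVERBAL)))
```

In a longer conversation a turn would end whenever another speaker takes over, so the single-turn assumption only holds for an example this short.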

Page 23: Structure Of Annotated Data

Events (temporal structure):
You keep interrupting me, Tom.
Oh, I'm sorry for that

Utterances (linguistic structure):
Immer unterbrichst Du mich, Tom
Oh, das tut mir Leid.

Words (linguistic structure):
Pro V Vpart Pro PN.
Int Pro V Adj Prep Pro

...

Page 24: Structure Of Annotated Data

Nodes: 0 a b 1 c 2
W:   You | keep | interrupting | me | Tom
POS: pro | v | vpart | pro | pn
U:   You keep interrupting me, Tom.
GER: Immer unterbrichst Du mich, Tom.

Nodes: 1 d 2 e 3
W:   Oh | I | 'm
POS: int | pn | v
U:   Oh, I'm sorry for that.
GER: Oh, das tut mir Leid.
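The first utterance on this slide can be written down as a small annotation graph: nodes along the timeline and typed, labelled arcs between them. The Python sketch below is an illustration only. The `Arc` type, the helper name, and the exact arc endpoints (for example, "You" running from node 0 to node a) are assumptions read off the order of nodes and labels, and the nodes are treated as totally ordered, which a general annotation graph does not require.

```python
from collections import namedtuple

# An arc connects two nodes and carries a type (W, POS, U, GER) and a label.
Arc = namedtuple("Arc", "start end type label")

NODES = ["0", "a", "b", "1", "c", "2"]  # assumed to be in temporal order

WORDS = ["You", "keep", "interrupting", "me", "Tom"]
TAGS = ["pro", "v", "vpart", "pro", "pn"]

ARCS = []
# Word arcs and part-of-speech arcs share the same pair of nodes.
for i, (word, tag) in enumerate(zip(WORDS, TAGS)):
    ARCS.append(Arc(NODES[i], NODES[i + 1], "W", word))
    ARCS.append(Arc(NODES[i], NODES[i + 1], "POS", tag))
# The utterance and its German translation span the whole stretch.
ARCS.append(Arc("0", "2", "U", "You keep interrupting me, Tom."))
ARCS.append(Arc("0", "2", "GER", "Immer unterbrichst Du mich, Tom."))


def arcs_within(arcs, nodes, outer, arc_type):
    """Arcs of one type that lie inside the span of the outer arc, in order."""
    pos = {node: i for i, node in enumerate(nodes)}
    inside = [
        a for a in arcs
        if a.type == arc_type
        and pos[outer.start] <= pos[a.start]
        and pos[a.end] <= pos[outer.end]
    ]
    return sorted(inside, key=lambda a: pos[a.start])


if __name__ == "__main__":
    utterance = next(a for a in ARCS if a.type == "U")
    print([a.label for a in arcs_within(ARCS, NODES, utterance, "W")])
    print([a.label for a in arcs_within(ARCS, NODES, utterance, "POS")])
```

Because every layer is just a set of arcs over shared nodes, a query such as "all words of this utterance with their tags" reduces to comparing node positions.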