
November 15, 2003 CLIS Alumni Chapter

Talking to the Future: The MALACH Project

Douglas W. Oard, Joanne Archer, Ammie Feijoo, Xiaoli Huang

College of Information Studies

Telling Our Stories

Shoah Foundation’s Collection

• Enormous scale
– 116,000 hours; 52,000 interviews; 180 TB
• Grand challenges
– 32 languages, accents, elderly, emotional, …
• Accessible
– $100 million collection and digitization investment
• Annotated
– 10,000 hours (~200,000 segments) fully described
• Users
– A department working full time on dissemination
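The scale figures above imply a few averages that help put the collection in perspective. The following is a rough back-of-the-envelope sketch in Python, not something shown on the slides; it assumes decimal units (1 TB = 1000 GB) and treats "~200,000 segments" as exactly 200,000.

```python
# Figures quoted on the slide.
TOTAL_HOURS = 116_000
INTERVIEWS = 52_000
STORAGE_TB = 180
DESCRIBED_HOURS = 10_000
DESCRIBED_SEGMENTS = 200_000  # the slide says "~200,000", so this is approximate


def collection_averages():
    """Derive per-interview, per-hour and per-segment averages.

    Assumption: decimal storage units (1 TB = 1000 GB, 1 GB = 8000 megabits).
    """
    gb_per_hour = STORAGE_TB * 1000 / TOTAL_HOURS
    return {
        "hours_per_interview": TOTAL_HOURS / INTERVIEWS,
        "gb_per_hour": gb_per_hour,
        "mbit_per_second": gb_per_hour * 8000 / 3600,
        "minutes_per_segment": DESCRIBED_HOURS * 60 / DESCRIBED_SEGMENTS,
        "fraction_fully_described": DESCRIBED_HOURS / TOTAL_HOURS,
    }


for name, value in collection_averages().items():
    print(f"{name}: {value:.2f}")
```

Under these assumptions an interview averages a little over two hours, a fully described segment averages about three minutes, and fewer than one hour in ten has been fully described.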

Who Uses the Collection?

Discipline
• History
• Linguistics
• Journalism
• Material culture
• Education
• Psychology
• Political science
• Law enforcement

Products
• Book
• Documentary film
• Research paper
• CDROM
• Study guide
• Obituary
• Evidence
• Personal use

Based on analysis of 280 access requests

Question Types

• Content
– Person, organization
– Place, type of place (e.g., camp, ghetto)
– Time, time period
– Event, subject
• Mode of expression
– Language
– Displayed artifacts (photographs, objects, …)
– Affective reaction (e.g., vivid, moving, …)

• Age appropriateness

Full-Description Cataloguing

Location-Time   Subject                           Person
Berlin-1939     Employment                        Josef Stein
Berlin-1939     Family life                       Gretchen Stein, Anna Stein
Dresden-1939    Schooling                         Gunter Wendt, Maria
Dresden-1939    Relocation, Transportation-rail   inte
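One way to picture full-description cataloguing is as a list of segment records, each carrying its own location-time, subject, and person descriptors. The sketch below is illustrative only: the `Segment` class and `find_segments` function are hypothetical names, not part of any Shoah Foundation or MALACH system, and the entries that are cut off on the slide are left out.

```python
from dataclasses import dataclass, field


@dataclass
class Segment:
    """One fully described segment (hypothetical record layout)."""
    location_time: str
    subjects: list = field(default_factory=list)
    persons: list = field(default_factory=list)


# Example rows from the slide; entries truncated in the source are omitted.
SEGMENTS = [
    Segment("Berlin-1939", ["Employment"], ["Josef Stein"]),
    Segment("Berlin-1939", ["Family life"], ["Gretchen Stein", "Anna Stein"]),
    Segment("Dresden-1939", ["Schooling"], ["Gunter Wendt"]),
    Segment("Dresden-1939", ["Relocation", "Transportation-rail"]),
]


def find_segments(segments, location_time=None, subject=None, person=None):
    """Return indices of segments matching every descriptor that is given."""
    hits = []
    for index, seg in enumerate(segments):
        if location_time is not None and seg.location_time != location_time:
            continue
        if subject is not None and subject not in seg.subjects:
            continue
        if person is not None and person not in seg.persons:
            continue
        hits.append(index)
    return hits


print(find_segments(SEGMENTS, location_time="Berlin-1939"))
print(find_segments(SEGMENTS, subject="Relocation"))
```

Because every descriptor is attached to a specific segment, a query can combine place, subject, and person and land directly on the matching passage.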

“Real-Time” Cataloguing

Location-Time: Berlin-1939, Dresden-1939

Subject: Employment, Family Life, Schooling, Relocation, Transportation-rail

Person: Josef Stein, Gretchen Stein, Anna Stein, Gunter Wendt

Thesaurus-Based Search

The Goal

Dramatically improve access to large multilingual spoken word collections …

… by capitalizing on the unique characteristics of the Survivors of the Shoah Visual History Foundation's collection of videotaped oral history interviews.

Joanne Archer

Observational Studies

Workshop 1 (June)
• Four searchers
– History/Political Science
– Holocaust studies
– Holocaust studies
– Documentary filmmaker
• Sequential observation
• Rich data collection
– Intermediary interaction
– Semi-structured interviews
– Observational notes
– Think-aloud
– Screen capture

Workshop 2 (August)
• Four searchers
– Ethnography
– German Studies
– Sociology
– High school teacher
• Simultaneous observation
• Opportunistic data collection
– Intermediary interaction
– Semi-structured interviews
– Observational notes
– Focus group discussions

Observed Selection Criteria

• Topicality (57%)
– Judged based on: person, place, …
• Accessibility (23%)
– Judged based on: time to load video
• Comprehensibility (14%)
– Judged based on: language, speaking style

Functionality

Needed Function
• Boolean Search and Ranked Retrieval (13)
• Testimony summary (12)
• Pre-Interview Questionnaire search/viewer (9)
• Rapid access (7)
• Related/Alternative search terms (3)
• Adding multiple search terms at once (2)
• Keywords linked to segment number for easy access (1)
• Multi-tasking (1)
• Searching testimonies by places under ‘Experience Search’ (1)
• Extensive editing within ‘My Project’ (1)

Desired Function
• Temporary saving of selected testimonies (4)
• Remote access (3)
• Integrated user tools for note taking (3)
• Map presentation (2)
• Reference tool (1)
• More repositories (1)
• Introductory video of system tutorial (1)
• Help (1)

Xiaoli Huang

Supporting Information Access

[Diagram: the information access process. Labels: Source Selection, Query Formulation, Search (Search System), Ranked List, Selection, Recording, Examination, Recording, Delivery. Feedback loops: Query Reformulation and Relevance Feedback, Source Reselection.]

[Diagram labels: Speech Recognition, Boundary Detection, Content Tagging, Query Formulation, Automatic Search, Interactive Selection]

• ASR: spontaneous, accented, language switching
• NLP components: multi-scale segmentation, multilingual classification, entity normalization
• Prototype: evidence integration, multilingual search, spatial/temporal
• User needs: observational studies, formative evaluation, summative evaluation

Description Strategies

• Transcription
– Manual transcription (with optional post-editing)
• Annotation
– Manually assign descriptors to points in a recording
– Recommender systems (ratings, link analysis, …)
• Associated materials
– Interviewer’s notes, speech scripts, producer’s logs
• Automatic
– Create access points with automatic speech processing

English ASR Error Rate

Training: 65 hours (acoustic model)/200 hours (language model)

[Figure labels: system output, miss, false alarm]
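The slide does not say which error measure is plotted. The customary one for speech recognition is word error rate (WER): the number of word substitutions, deletions, and insertions needed to turn the system output into the reference transcript, divided by the number of reference words. A minimal sketch, assuming whitespace tokenization and using invented example sentences rather than project data:

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the reference length.

    Assumption: words are separated by whitespace and compared exactly.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1      # reference word missing from the output
            insertion = curr[j - 1] + 1  # extra word in the output
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# Invented example: the recognizer drops one of six reference words.
print(word_error_rate("we moved to dresden in 1939", "we moved dresden in 1939"))
```

Because insertions count as errors, WER can exceed 1.0 when the system output is much longer than the reference.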

Effect of ASR Errors

Building a Test Collection

• Overall relevance
– Assessment is informed by the assessments for the individual reasons for relevance (categories of relevance), but the relationship is not straightforward

• Provides direct evidence

• Provides indirect / circumstantial evidence

• Provides context (e.g., causes for the phenomenon of interest)

• Provides comparison (similarity or contrast, same phenomenon in different environment, similar phenomenon)

• Provides pointer to source of information

Ammie Feijoo

Some Statistics

• 2,000 U.S. radio stations Webcasting

• 250,000 hours of oral history in British Library

• 35,000,000 audio streams on the Web

Spoken Word Collections

• Broadcast programming
– News, interview, talk radio, sports, entertainment
• Scripted stories
– Books on tape, poetry reading, theater
• Spontaneous storytelling
– Oral history, folklore
• Incidental recording
– Speeches, oral arguments, meetings, phone calls

Building a Web of Spoken Words

• Affordable storage
– For $1, you can store 1.5 million spoken words
• Adequate network capacity
– Internet capacity: 30 million simultaneous programs
• Works with any modem
– You can even read email while playing audio
• Replay capabilities
– 38% of US users recently used streaming audio
• Effective search capabilities
– Not quite yet …

Looking Forward: 2006

• Working systems in five languages
– Real users searching real data
• Rich experience beyond broadcast news
– Frameworks, components, systems
• Affordable application-tuned systems
– Oral history, lectures, speeches, meetings, …

For More Information

• The MALACH project
– http://www.clsp.jhu.edu/research/malach/
• NSF/EU Spoken Word Access Group
– http://www.dcs.shef.ac.uk/spandh/projects/swag/
• Speech-based retrieval
– http://www.glue.umd.edu/~dlrg/speech/
