Language and Information
LIS 610
November 6, 2002
Nina Wacholder
Language and Information 11/06/02 Nina Wacholder 2
Agenda
Role of language in information science
Current research: Human Computer Interaction with Electronic Indexes and Index Terms
Textual information
Information conveyed by alphabets, digits and punctuation
Organized into meaningful units recognized by some group of people
Other techniques for conveying information
Spoken language
Gesture
Facial expression
Sound
Images (drawings, photographs …)
Language
Uniquely human
Learned
Conventional
Understanding language is hard
Expresses complex concepts
Ambiguity – words, phrases and sentences have more than one meaning
Synonymy – different words, phrases and sentences have the same meaning
Complex concepts
Pencil
Face
Directions to Alexander Library
Theory of relativity
U.S. election law
Synonymy
child, kid, adolescent, baby
flammable, inflammable
I was walking up the street that day. I was walking down the street that day.
Moxie wrote that report. That report was written by Moxie.
Ambiguity – semantic
Bat
Make a bed
Moxie ate potatoes with a fork. Moxie ate potatoes with fish.
Ambiguity – structural (syntactic)
Red airplane terminal
• [[red airplane] terminal]
• [red [airplane terminal]]
Moxie saw Toxie in the park with a telescope
• Moxie saw [Toxie in the park with a telescope]
• Moxie [saw] Toxie in the park [with a telescope]
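The two readings of "Red airplane terminal" are the two possible binary groupings of a three-word phrase. A minimal sketch that enumerates such groupings is given here; the function name `bracketings` is hypothetical, and the code only counts structural possibilities without deciding which reading is intended.

```python
def bracketings(words):
    """Return every binary bracketing of a list of words, as strings."""
    if len(words) == 1:
        return [words[0]]
    results = []
    # Try every split point and combine each left grouping with each right grouping.
    for i in range(1, len(words)):
        for left in bracketings(words[:i]):
            for right in bracketings(words[i:]):
                results.append(f"[{left} {right}]")
    return results


for reading in bracketings(["red", "airplane", "terminal"]):
    print(reading)
```

The number of groupings grows quickly with phrase length, which is one reason structural ambiguity is hard for automatic systems.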
Natural language processing (NLP)
Natural language
Computer language
The NLP controversy: rules vs. statistics
NLP by rule
Lexicon (vocabulary)
• Det: a
• ProperName: Moxie
• Noun: report
• Verb: wrote
Syntactic rules
• NounPhrase[a report] → Det[a] Noun[report]
• NounPhrase[Moxie] → ProperName[Moxie]
• VerbPhrase[wrote a report] → Verb[wrote] NounPhrase[a report]
• Sentence[Moxie wrote a report] → NounPhrase[Moxie] VerbPhrase[wrote a report]
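The lexicon and syntactic rules on this slide can be applied mechanically. The sketch below is one simple way to do so, using a small top-down parser; the names `LEXICON`, `RULES`, `parse` and `parse_sentence` are hypothetical, and the parser is an illustration rather than the system used in the lecture.

```python
# Lexicon and rules taken from the slide; the parser itself is an assumption.
LEXICON = {"a": "Det", "Moxie": "ProperName", "report": "Noun", "wrote": "Verb"}
RULES = {
    "Sentence": [["NounPhrase", "VerbPhrase"]],
    "NounPhrase": [["Det", "Noun"], ["ProperName"]],
    "VerbPhrase": [["Verb", "NounPhrase"]],
}


def parse(category, words, start):
    """Yield (tree, next_position) for each way category matches words from start."""
    if category in RULES:
        for right_hand_side in RULES[category]:
            for children, end in match_sequence(right_hand_side, words, start):
                yield f"{category}[{' '.join(children)}]", end
    elif start < len(words) and LEXICON.get(words[start]) == category:
        # Lexical category: match a single word from the lexicon.
        yield f"{category}[{words[start]}]", start + 1


def match_sequence(categories, words, start):
    """Yield (list_of_trees, next_position) for a sequence of categories."""
    if not categories:
        yield [], start
        return
    for first, middle in parse(categories[0], words, start):
        for rest, end in match_sequence(categories[1:], words, middle):
            yield [first] + rest, end


def parse_sentence(text):
    """Return all complete Sentence parses of a whitespace-separated string."""
    words = text.split()
    return [tree for tree, end in parse("Sentence", words, 0) if end == len(words)]


print(parse_sentence("Moxie wrote a report"))
```

Any word or word order outside the lexicon and rules yields no parse, which illustrates the coverage problem that rule-based NLP faces.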
NLP by statistics
Luhn (1958)
tf*idf (Salton and Buckley 1988)
Maximum entropy (Berger, Della Pietra and Della Pietra 1996)
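As an illustration of the statistical approach, here is a minimal sketch of tf*idf weighting. It assumes one common variant (raw term frequency multiplied by log(N/df)); the cited work discusses several weighting schemes, and the function name `tf_idf` and the toy documents are hypothetical.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {term: weight} dict per document, using raw tf and log(N/df)."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: the number of documents in which each term occurs.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        term_freq = Counter(tokens)
        weights.append({
            term: count * math.log(n_docs / doc_freq[term])
            for term, count in term_freq.items()
        })
    return weights


# Hypothetical toy collection, not data from any study mentioned here.
docs = [
    "asbestos workers lung cancer",
    "cigarette filter asbestos",
    "lung cancer deaths",
]
for weights in tf_idf(docs):
    print(sorted(weights.items(), key=lambda item: -item[1]))
```

Terms that occur in few documents get high weights, and a term that occurs in every document gets a weight of zero; no grammar or lexicon is involved.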
Information-access tasks with significant natural language component
Information retrieval
Information extraction
Automatic summarization
Question answering
Sparck Jones (2001)
Task core vs. task context
Information retrieval: 30-40% accuracy for systems in natural environment
Information extraction: 50% for core systems
Automatic summarization: no sound basis for core evaluation
Evaluation of Head Sorting Mechanism
Wacholder, Klavans and Evans (2000)
Task
• Compare domain-independent, corpus-independent methods for automatic identification of terms to represent a document or collection of documents
Methods for term identification
• Head-sorted NPs (HS) (Wacholder 1998)
• Keywords (KW)
• Technical Terms (TT) (Justeson and Katz 1995)
Examples of terms identified by indexing method
Keywords                Head-sorted NPs            Technical terms
asbestos/asbestosis     workers                    cancer deaths
worker/workers/worked   asbestos workers           lung cancer
cancer                  160 workers                kent cigarette
death                   cancer                     dr. talcott
make                    lung cancer                cigarette filter
lorillard               asbestos                   u.s.
fiber                   cancer causing asbestos
dr.                     lung cancer deaths
…                       ...
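The head-sorted examples group noun phrases that share a head: workers, asbestos workers and 160 workers all have the head "workers". The sketch below reproduces that grouping under a loud simplifying assumption: the head is taken to be the last word of each phrase. This is not necessarily how the cited method identifies noun phrases or heads, and the function name `head_sort` is hypothetical.

```python
from collections import defaultdict


def head_sort(noun_phrases):
    """Group noun phrases by head, ASSUMING the head is the last word."""
    groups = defaultdict(list)
    for phrase in noun_phrases:
        head = phrase.split()[-1]
        groups[head].append(phrase)
    # Sort heads alphabetically, and the phrases under each head.
    return {head: sorted(groups[head]) for head in sorted(groups)}


# Noun phrases from the head-sorted examples, in shuffled order.
phrases = [
    "asbestos workers", "lung cancer", "workers", "cancer causing asbestos",
    "160 workers", "cancer", "asbestos", "lung cancer deaths",
]
for head, group in head_sort(phrases).items():
    print(head, "->", group)
```

Sorting by head places related phrases next to each other, so a person browsing the index sees every phrase about "workers" or "cancer" in one place.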
Ranking of terms by cumulative percentage
[Chart: Cumulative Percentage (y-axis, 0 to 1) plotted against Rating (x-axis, 0.5 to 5) for three series: KWD, TT, SNP]
Ranking by cumulative number of terms
1 = best; 5 = worst
Number of terms ranked at or better than
Method     2     3     4     5
KW        27    75   124   166
HS        41    96   132   160
TT        15    21    21    21
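The cumulative counts can be turned into cumulative fractions by dividing each count by the method's total. Every term is ranked at or better than 5, so the count at rating 5 is the total. The sketch below does this arithmetic; the function name `cumulative_fractions` is hypothetical, and whether the cumulative-percentage chart was computed exactly this way is an assumption.

```python
# Counts copied from the slide: terms ranked at or better than ratings 2, 3, 4, 5.
counts = {
    "KW": [27, 75, 124, 166],
    "HS": [41, 96, 132, 160],
    "TT": [15, 21, 21, 21],
}
ratings = [2, 3, 4, 5]


def cumulative_fractions(cumulative_counts):
    """Divide each cumulative count by the total (the count at the worst rating)."""
    total = cumulative_counts[-1]
    return [count / total for count in cumulative_counts]


for method, row in counts.items():
    fractions = cumulative_fractions(row)
    print(method, [f"{rating}: {fraction:.2f}" for rating, fraction in zip(ratings, fractions)])
```

Reading counts and fractions together shows the trade-off: TT produces few terms but most are rated well, while KW and HS produce many more terms of more varied quality.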
Summary of results
Head-sorted terms
• mixed quality terms
• good document coverage
Technical terms
• high quality terms
• poor document coverage
Keywords
• low quality terms
• good document coverage
ISATC Pilot Project
Nina Wacholder, PI
PhD Students: Lu Liu, Mark Sharp, Peng Song, Xiaojun Yuan
Research question
Null hypothesis: Properties of index terms do not affect information seeker’s selection of terms
What properties of index terms affect the selection of terms?
What effects do these properties have?
Material
Text
• Rice, McCreadie and Chang (2001)
Index terms
• Head sorted terms (Wacholder 1998)
• Technical terms (Justeson and Katz)
• Human index terms
Experimental Searching and Browsing Interface (ESBI)
http://www.scils.rutgers.edu/cgi-bin/indexer.cg
Initial results
Future work
Further analysis of experimental data
• Compare subjects by type (e.g., undergraduate, MLIS)
• Effectiveness of searches (i.e., did they get the right answer?)
• Overlap of words in index terms with words in question
• …
Evaluation of ESBI interface
Comparison of additional techniques for identifying terms
Use of different texts