Information Retrieval and its Application in Biomedicine
Hong Yu (1,2), PhD
Susan McRoy (1), PhD
(1) Department of Computer Science
(2) Department of Health Sciences
University of Wisconsin-Milwaukee
Sept 4 Introduction
What is Information Retrieval?
The field concerned with the acquisition, organization, and searching of knowledge-based information. (Hersh, 2003)
Information
World Wide Web
Company documentation
Drug descriptions
Medical records
Books
Everything that is text, image, video, or sound and that can be digitized
Information in Biomedicine
Literature (over 17 million publications)
WWW
Electronic medical records
Genomics data
– DNA sequences, etc.
Knowledge representation
– Gene Ontology
Company databases
– Micromedex drug database
IR in Biomedicine
Index Medicus (Billings 1879)
MEDLARS (NLM 1966)
SAPHIRE (Hersh 1990)
PubMed (NLM 1996)
Arrowsmith (Smalheiser 1998)
BioText (Hearst 2003)
BioMedQA (Yu 2006)
Electronic and Open Publishing
The Internet and the Web have had a profound impact on the publishing of knowledge-based information
Most of the literature can be made available electronically
Open access
– The Bethesda Statement on Open Access Publishing (http://www.earlham.edu/~peters/fos/bethesda.htm) (April 11, 2003)
– The Berlin Declaration on Open Access to Knowledge in the Sciences and Humanities (http://www.zim.mpg.de/openaccess-berlin/berlindeclaration.html) (2003)
– PubMed Central (NLM 2004)
Quality of Information
A lack of quality control
– Anyone can publish online
– A wealth of studies have concluded that healthcare information on the Web is of poor quality
Readability
– Hard to read
Information Needs and Seeking
Unrecognized needs
– Clinicians unaware of information needs or knowledge deficit
Recognized needs
– Clinicians aware of needs but may or may not pursue them
Pursued needs
– Information seeking occurs but may or may not be successful
Satisfied needs
– Information seeking successful
What You Will Learn
IR algorithms
– Indexing
– Query and Retrieval
– Evaluation
– Text Classification
– XML retrieval
– Web retrieval
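As a rough preview of the first three topics (indexing, query and retrieval, evaluation), here is a minimal sketch in Python. It is an illustration only, not the course's method or any particular tool's API: the function names (`tokenize`, `build_index`, `search`, `precision_recall`), the whitespace tokenizer, the Boolean AND query model, and the three toy documents are all assumptions made up for this example.

```python
from collections import defaultdict


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately naive tokenizer)."""
    return text.lower().split()


def build_index(docs):
    """Indexing: map each term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index


def search(index, query):
    """Retrieval: Boolean AND, i.e. ids of documents containing every query term."""
    terms = tokenize(query)
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


def precision_recall(retrieved, relevant):
    """Evaluation: set-based precision and recall for a single query."""
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical toy collection, invented for illustration only.
docs = {
    1: "aspirin reduces fever",
    2: "aspirin may cause bleeding",
    3: "gene expression in fever",
}
index = build_index(docs)
retrieved = search(index, "aspirin fever")
```

With this toy collection, `search(index, "aspirin fever")` returns only document 1. If documents 1 and 3 were judged relevant (again, an assumption for the example), `precision_recall(retrieved, {1, 3})` would show that everything retrieved is relevant but only half of the relevant documents were found.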
What You Will Learn (Cont.)
Open-Source IR tools
– What open-source IR tools are available
    Indexing/retrieval
    Part-of-speech and syntactic parsing
    Semantic parsing
    Discourse relations
    Machine-learning classifiers
How to use the tools?
What You Will Learn (Cont.)
State of the art IR systems
– Baruch 1965 [BLIMP http://blimp.cs.queensu.ca/index.html]
– SAPHIRE (Hersh 1990)
    Retrieval
– MedLEE (Friedman 1994)
    Extraction
– PubMed (NLM 1997)
– ARROWSMITH Systems (Smalheiser 1998)
    Hidden Relation Discovery Tool
– GENIES (Friedman 2001)
    Extraction
BioText (Hearst 2003 http://biotext.berkeley.edu/)
– Retrieval+Categorization
GeneWays (Rzhetsky 2004 http://geneways.genomecenter.columbia.edu/)
– Extraction+Visualization
TextPresso (Muller 2004 http://www.textpresso.org/)
– Retrieval+Extraction
iHOP (Hoffman and Valencia 2005 http://www.ihop-net.org/UniPub/iHOP/)
– Retrieval
BioMedQA (Yu 2006 http://monkey.ims.uwm.edu/MedQA)
– Question Answering
BioNLP Systems
Beyond text: Image and Video
Image classification
– Finding concepts in captions and annotations
– Machine learning on textual & visual features
– Determining salient features in text and image separately and merging the results
Extracting text from image
– Understanding and correcting OCR (handwriting, equations)
– Finding text in images
Finding document text related to illustrations
Video retrieval
Resources
Annotated collections (GENIA, Medstract, Yapex …)
Ontologies, tools, knowledge bases …
Publications, Conferences, Evaluations …
Centres and web portals
What We Provide
Textbook
– Christopher D. Manning, Prabhakar Raghavan and Hinrich Schütze. Introduction to Information Retrieval. Cambridge University Press, 2007
  http://www-csli.stanford.edu/~schuetze/information-retrieval-book.html
Office hour:
– Tuesdays, 3-4 pm EMS 710 and by appointment
– Hong Yu, 414-229-3344
– Susan McRoy, 414-229-6695
What We Expect
Undergraduate:
– 30% Homework, 35% Midterm exam, 35% Final exam or project
Graduate:
– 20% Midterm exam, 40% Homework, 40% Project
– The project may be done individually or in a team of 2-3 people. The final project will include a software system, a 2-3 page written project report, and an oral presentation. The report should describe the problem, the approach, and the evaluation, and should cite related work where appropriate.