Course Overview: An Introduction to Information Retrieval and Applications
J. H. Wang
Feb. 26, 2015
IR, Spring 2015 NTUT CSIE 2
Instructor & TA
• Instructor
  – J. H. Wang (王正豪)
  – Associate Professor, CSIE, NTUT
  – Office: R1534, Technology Building
  – E-mail: [email protected]
  – Tel: ext. 4238
  – Office Hour: 9:00 am-12:00 noon, every Tuesday and Thursday
• TA
  – Mr. Tsai (@R1424, Technology Building)
Course Description
• Course Web Page: for the latest announcements and updates of schedule, slides, and homework
  – http://www.ntut.edu.tw/~jhwang/IR/
• Time: 9:10 am-12:00 noon, Mon.
• Classroom: R1322, Technology Building
• Textbook:
  – Christopher D. Manning, Prabhakar Raghavan and Hinrich Schuetze, Introduction to Information Retrieval, Cambridge University Press, 2008. (Available online)
    • International Student Edition, imported by Kai-Fa (開發) Publishing
• Prerequisites:
  – Basic knowledge of data structures and algorithms, linear algebra, and probability theory
  – Programming experience is *required* for homework & projects
Target Audience
• CSIE seniors and graduate students
• IGPEECS (International Graduate Program in Electrical Engineering and Computer Science)
Additional References
• References:
  – Ricardo Baeza-Yates and Berthier Ribeiro-Neto, Modern Information Retrieval: The Concepts and Technology behind Search, Addison-Wesley, 2011.
    • This is the second edition of their 1999 book Modern Information Retrieval. (華通)
  – Bruce Croft, Donald Metzler, and Trevor Strohman, Search Engines: Information Retrieval in Practice, Addison-Wesley, 2010. (全華)
  – Stefan Buettcher, Charles L.A. Clarke, and Gordon V. Cormack, Information Retrieval: Implementing and Evaluating Search Engines, MIT Press, 2010.
More Books on IR
• Gerard Salton, Automatic Information Organization and Retrieval, McGraw-Hill, 1968.
• Gerard Salton and M.J. McGill, Introduction to Modern Information Retrieval, McGraw-Hill, 1983.
  – Two classics, but out of print.
• C. J. van Rijsbergen, Information Retrieval, Butterworths, 1979.
  – The classic. More than 30 years old, but still worth reading.
• K. Sparck Jones, P. Willett, Readings in Information Retrieval, Morgan Kaufmann, 1997.
  – A collection of classical IR papers. (out of print)
• I.H. Witten, A. Moffat, T.C. Bell, Managing Gigabytes, 2nd edition, Morgan Kaufmann, 1999.
  – The authority on index construction and compression.
Grading Policy
• Homework assignments and programming exercises: ~40%
• Mid-term exam: ~25%
• Term project: ~35%
  – Including proposal, presentation, and final report
System Exercises and Term Project
• About 2-3 team-based system exercises
  – Maximum number of students per team:
    • 4 for undergraduates
    • 2 for graduate students
  – You can either write your own code or reuse existing open source code (to be detailed later)
• The term project
  – Either team-based system development (e.g. extension to exercises)
  – Or academic paper presentation
    • Only one person per team allowed
  – A proposal is *required* one week after midterm (May 4, 2015)
About the Term Project
• The score you’ll get depends on the functions, difficulty and quality of your project
  – For system development:
    • System functions and correctness
  – For academic paper presentation:
    • Quality and your presentation of the paper
    • Major methods/experimental results *must* be presented
    • Papers from top conferences are strongly suggested
      – E.g. SIGIR, WWW, CIKM, WSDM, JCDL, ICMR, …
• Proposals are *required* for each team, and will be counted in the score
Online Submission
• Submission instructions
  – Systems, programs, project proposals, and project reports in electronic files must be submitted to the TA online at:
    • Submissions website & instructions: (To be announced)
What this Course is NOT about
• This course will NOT tell you
  – The tips and tricks of using search engines, although power users might have better ideas on how to improve them
    • There are plenty of books and websites on that…
  – How to find books in libraries, although it’s somewhat related to the basic IR concepts
  – How to make money on the Web, although the currently largest search engine did it
What’s Information Retrieval?
• Things that you have been doing every day!
  – Searching for something interesting: Web, news, tweets, e-mails, images, videos, …
  – Asking for advice: shopping, restaurants, movies, …
  – …
• User interests are changing all the time…
  – 2011: New Zealand Earthquake
  – 2012: Jeremy Lin
  – 2013: Meteor Russia
  – 2014: Ukraine riots
  – 2015: ?
What’s Information Retrieval
In Google News
In Web Pages
In Wikipedia
In Google Images
More details
Related Keywords
• TransAsia Airways
• Flight 235 (GE235)
• ATR 72-500, turboprop
• Keelung River
• Songshan, Kinmen
• Mayday, engine flameout
• TransAsia Airways Flight 235
• …
Related Keywords in Chinese
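• 復興航空 (TransAsia Airways)
• 墜毀 (crash)
• 基隆河 (Keelung River)
• 台北松山, 金門 (Taipei Songshan, Kinmen)
• 引擎熄火 (engine flameout)
• 復興航空 235號班機空難 (TransAsia Airways Flight 235 crash)
• …
• And this can go on:
  – for other languages…
  – and other search engines…
  – and social websites…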
In Google Trends
Similar Events
And Social Search…
On Facebook
How do I Know What People Care about?
What Is Information Retrieval?
• “Information retrieval is a field concerned with the structure, analysis, organization, storage, searching, and retrieval of information.” (Salton, 1968)
• Information vs. data
Goal
• Information retrieval (IR): a research field that aims to search for information in text and multimedia documents effectively and efficiently
• In this course, we will introduce the basic text and query models in IR, retrieval evaluation, indexing and searching, and applications of IR
A Big Picture
[Diagram; only its labels survive]
• Components: User Interface, Text Operations, Query Expansion, Indexing, Retrieval, Ranking, Inverted Index, Document Collection
• Flow labels: text, query, user need, user feedback, ranked docs, retrieved docs, doc representation, logical view, inverted file
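The text operations, indexing, and retrieval components named in the big picture can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions, not the course's reference implementation: the three-document collection, the whitespace tokenizer, and the function names `tokenize`, `build_inverted_index`, and `retrieve_and` are all hypothetical, and ranking is left out.

```python
from collections import defaultdict

# Hypothetical toy collection, made up for illustration only.
DOCS = {
    1: "TransAsia Airways flight crashed into the Keelung River",
    2: "Keelung River flood warning issued",
    3: "TransAsia Airways announces new flight schedule",
}


def tokenize(text):
    """Text operations: lowercase and split on whitespace (no stemming, no stop words)."""
    return text.lower().split()


def build_inverted_index(docs):
    """Indexing: map each term to the sorted list of document IDs containing it."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            postings[term].add(doc_id)
    return {term: sorted(ids) for term, ids in postings.items()}


def retrieve_and(index, query):
    """Retrieval: return the documents that contain every query term (Boolean AND)."""
    terms = tokenize(query)
    if not terms:
        return []
    result = set(index.get(terms[0], []))
    for term in terms[1:]:
        result &= set(index.get(term, []))
    return sorted(result)


index = build_inverted_index(DOCS)
print(retrieve_and(index, "transasia flight"))  # [1, 3]
print(retrieve_and(index, "keelung river"))     # [1, 2]
```

A real system would add the remaining boxes of the diagram (query expansion, ranking, user feedback) on top of this index.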
Topics
• Text IR
  – Indexing and searching
  – Query languages and operations
• Retrieval evaluation
• Modeling
  – Boolean model
  – Vector space model
  – Probabilistic model
• Applications for IR
  – Multimedia IR
  – Web search
  – Digital libraries
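As a rough illustration of the vector space model listed under Modeling, the sketch below ranks documents by the cosine similarity of TF-IDF vectors. The weighting used here (logarithmic term frequency times `log(N/df)`) is one common variant chosen for the example, not necessarily the one used in class; the toy documents and the names `tfidf_vectors`, `cosine`, and `rank` are hypothetical.

```python
import math
from collections import Counter

# Hypothetical toy collection, made up for illustration only.
DOCS = {
    1: "keelung river crash",
    2: "keelung river flood",
    3: "airline schedule",
}


def tokenize(text):
    return text.lower().split()


def weight(tf_counts, idf):
    """Log-scaled term frequency times inverse document frequency."""
    return {t: (1 + math.log(c)) * idf[t] for t, c in tf_counts.items()}


def tfidf_vectors(docs):
    """Build one sparse TF-IDF vector (a dict) per document, plus the idf table."""
    tokenized = {d: tokenize(text) for d, text in docs.items()}
    df = Counter()
    for tokens in tokenized.values():
        df.update(set(tokens))  # count each term once per document
    n = len(docs)
    idf = {t: math.log(n / df_t) for t, df_t in df.items()}
    vectors = {d: weight(Counter(tokens), idf) for d, tokens in tokenized.items()}
    return vectors, idf


def cosine(u, v):
    """Cosine similarity of two sparse vectors; 0.0 if either has zero length."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank(query, vectors, idf):
    """Score every document against the query and sort by descending similarity."""
    q_counts = Counter(t for t in tokenize(query) if t in idf)  # drop unseen terms
    q_vec = weight(q_counts, idf)
    scores = {d: cosine(q_vec, vec) for d, vec in vectors.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


vectors, idf = tfidf_vectors(DOCS)
ranked = rank("river crash", vectors, idf)
print([doc_id for doc_id, _ in ranked])  # [1, 2, 3]
```

Unlike the Boolean model, document 2 is still returned for "river crash" because it shares one query term; it simply scores lower than document 1.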
Organization of the Textbook
• Basics in IR (focus)
  – Inverted indexes for Boolean queries (Ch. 1-5)
  – Term weighting and vector space model (Ch. 6-7)
  – Evaluation in IR (Ch. 8)
• Advanced Topics
  – Relevance feedback (Ch. 9)
  – XML retrieval (Ch. 10)
  – Probabilistic IR (Ch. 11)
  – Language models (Ch. 12)
• Machine learning in IR (useful)
  – Text classification (Ch. 13-15)
  – Document clustering (Ch. 16-18)
• Web Search
  – Web crawling and indexes (Ch. 19-20)
  – Link analysis (Ch. 21)
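For the evaluation topic in the outline above, the standard set-based measures are precision, recall, and their harmonic mean F1. The sketch below computes them for one query; the function name `precision_recall_f1` and the example document IDs are hypothetical and only meant to make the definitions concrete.

```python
def precision_recall_f1(retrieved, relevant):
    """Set-based evaluation of one query.

    precision = |retrieved AND relevant| / |retrieved|
    recall    = |retrieved AND relevant| / |relevant|
    F1        = harmonic mean of precision and recall
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical example: the system returned docs 1-4; docs 2, 4, 5 are relevant.
p, r, f1 = precision_recall_f1([1, 2, 3, 4], [2, 4, 5])
print(p)  # 0.5
```

Here two of the four retrieved documents are relevant (precision 2/4), and two of the three relevant documents were found (recall 2/3).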
Some Overlap with Other Fields
• Data mining, Text mining, Information Extraction
• Machine Learning
• Natural Language Processing
• Social Network Analysis
• …
Pointers to Other Topics
• Cross-language IR
• Image, video, and multimedia IR
• Speech retrieval
• Music retrieval
• User interfaces
• Parallel, distributed, and P2P IR
• Digital libraries
• Information science perspective
• Logic-based approaches to IR
• Natural language processing techniques
• …
Tentative Schedule
• Before midterm
  – Boolean retrieval (1 wk)
  – Indexing (2 wks)
  – Vector space model and evaluation (2 wks)
  – Relevance feedback (1 wk)
  – Probabilistic IR (2 wks)
• After midterm
  – Text classification (1-2 wks)
  – Document clustering (1-2 wks)
  – Web search (2 wks)
  – Advanced topics: CLIR, IE, … (2 wks)
  – Term project presentation (3 wks)
Generic Resources
• Wikipedia page on Information Retrieval: http://en.wikipedia.org/wiki/Information_retrieval
• Information Retrieval Resources: http://www-csli.stanford.edu/~hinrich/information-retrieval.html
Academic Resources
• Journals
  – ACM TOIS: Transactions on Information Systems
  – JASIST: Journal of the Association for Information Science and Technology
  – IP&M: Information Processing and Management
  – IEEE TKDE: Transactions on Knowledge and Data Engineering
• Conferences
  – ACM SIGIR: International Conference on Research and Development in Information Retrieval
  – WWW: World Wide Web Conference
  – ACM CIKM: Conference on Information and Knowledge Management
  – JCDL: ACM/IEEE Joint Conference on Digital Libraries
  – ACM WSDM: International Conference on Web Search and Data Mining
  – TREC: Text Retrieval Conference
Teaching in English…
• Slides and lectures will be offered mainly in English
• To help domestic students follow along, important concepts will be briefly summarized in Chinese
More on Term Projects
• Options for term projects
  – Option 1: team-based system project
    • i.e., extension to system exercises
  – Option 2: academic paper presentation
    • Only one person, NOT team-based
• Tentative schedule for all teams:
  – Proposal: *required* one week after midterm (May 4, 2015)
  – Presentations (including demos): *required* in the last three weeks (starting from Jun. 8 or 15, 2015)
  – Final report: before the end of the semester (Jun. 30, 2015)
    • Slides, source code, documentation
For System Development
• You can write your own code in any programming language
• Or you can reuse existing open-source information retrieval tools
• Any topic relevant to information retrieval
  – Retrieval, analysis, or extraction of entities, topics, or their relations from various sources such as documents, the Web, and social media
Some Open Source Tools
• Apache Lucene/Solr (in Java)
  – For indexing/search engine
• Apache Hadoop, Spark (in Java, Scala, Python)
  – For distributed computing and data analysis
• The Lemur Project, Indri, Galago, by CMU/UMass (Indri in C++, Galago in Java)
  – For search engine, text analysis
• Terrier, by U. Glasgow (in Java)
  – For search engine
• …
• You are encouraged to explore more!
Thanks for Your Attention!
• Any questions or comments? Please feel free to send e-mail to [email protected] or discuss them with me in my office