Machine Learning: Basics IR Systems: Basics CiteSeer Crawling
Introduction: Machine Learning in Digital Libraries
Compiled by Cornelia Caragea & Sujatha Das
Credits for slides: Hofmann, Mihalcea, Mobasher, Mooney, Schütze
August 18, 2014
Course Title: Case studies in applying Machine Learning for Document Analysis and Retrieval
Tasks in Scientific Digital Libraries
Quick Survey
1. Background
   - CS/non-CS
   - Industry vs. graduate student
   - Coding experience
   - Prior course on IR?
   - Prior course in ML?
2. Expectations from RuSSIR
3. Expectations from this course
Machine Learning: Basics
What is “Learning”?
We “learn” many things:
- Motor skills: walk, ride a bicycle, drive, play tennis or golf, play the piano.
- Visual concepts: man-made objects, faces, natural objects.
- Language: speech recognition, read and write natural languages.
- Spatial knowledge: navigate between spatial locations, physical layout of a room.
- Symbolic knowledge: algebra, arithmetic, calculus.
- Social rules: how to interact with people, animals, machines.
- ...
Abstract definition of “Learning”
Definition due to Herbert Simon (1980):
“Learning” denotes changes in a system that are adaptive in that they enable the system to perform the same task or similar tasks drawn from the same population better over time.
Well-posed learning problem
Definition due to Tom Mitchell (1998):
A computer program is said to learn from experience E with respect to some task T and some performance measure P, if its performance on T, as measured by P, improves with experience E.
Spam Filtering
Suppose your email program watches which emails you do or do not mark as spam, and based on that learns how to better filter spam. What is the task T in this setting?
1. Classifying emails as spam or not spam.
2. Watching you label emails as spam or not spam.
3. The number (or fraction) of emails correctly classified as spam/not spam.
4. None of the above - this is not a machine learning problem.
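The roles of T, P, and E in Mitchell's definition can be made concrete with a toy version of this setting. The sketch below is not a real spam filter: the word-count scoring rule, the function names (`train`, `classify`, `accuracy`), and the example messages are all assumptions made for illustration. Here `classify` plays the role of the task, `accuracy` the performance measure, and the labeled emails passed to `train` the experience.

```python
from collections import Counter

def train(labeled_emails):
    """Experience E: count how often each word appears in spam and in non-spam."""
    spam_words, ham_words = Counter(), Counter()
    for text, is_spam in labeled_emails:
        target = spam_words if is_spam else ham_words
        target.update(text.lower().split())
    return spam_words, ham_words

def classify(model, text):
    """Task T: label an email as spam (True) or not spam (False)."""
    spam_words, ham_words = model
    words = text.lower().split()
    spam_score = sum(spam_words[w] for w in words)
    ham_score = sum(ham_words[w] for w in words)
    return spam_score > ham_score

def accuracy(model, labeled_emails):
    """Performance P: fraction of emails classified correctly."""
    correct = sum(classify(model, text) == is_spam for text, is_spam in labeled_emails)
    return correct / len(labeled_emails)

# Hypothetical user feedback: (email text, marked as spam?).
feedback = [
    ("win a free prize now", True),
    ("meeting agenda for monday", False),
    ("free money click now", True),
    ("lecture notes and agenda", False),
]
held_out = [("claim your free prize", True), ("notes for the meeting", False)]

no_experience = train([])
with_experience = train(feedback)
print(accuracy(no_experience, held_out), accuracy(with_experience, held_out))
```

On this toy data, the model trained on the feedback scores higher on the held-out emails than the model with no experience, which is exactly the sense in which the program "learns".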
Fields of application
- Biology: Brain, Development, Evolution, Genetics, Neuroscience.
- Information Theory: Coding Theory, Entropy.
- Linguistics: Grammars, Language acquisition.
- Mathematics: Calculus, Linear Algebra, Optimization.
- Psychology: Analogy, Concept Learning, Curiosity, Discovery, Memory, Reinforcement.
- Philosophy: Causality, Induction, Theory Formation.
- Statistics: Probability Distributions, Estimation, Hypothesis Testing.
Some applications of ML in practice
“If you invent a breakthrough in artificial intelligence, so machines can learn, that is worth 10 Microsofts”, Bill Gates quoted in NY Times, Monday March 3, 2004.
Information extraction from the web: Google, Microsoft, Yahoo
Spam filtering
Speech/handwriting recognition
Object detection/recognition
Weather prediction
Stock market analysis
Search engines (e.g., Google)
Ad placement on websites
Adaptive website design
Credit-card fraud detection
Webpage clustering (e.g., Google News)
Social Network Analysis
Machine Translation (e.g., Google Translate)
Recommendation systems (e.g., Netflix, Amazon)
Predicting a protein’s functions
Automatic vehicle navigation
Performance tuning of computer systems
Predicting good compilation flags for programs
... and many more
Three fundamental problems in ML
- Classification: learning to predict discrete labels associated with given observations.
  - Binary classification: article related to politics or sports
  - Multiclass classification: digit recognition on postal addresses
- Regression: learning to predict continuous outputs associated with given observations.
  - Example: predict the sales for a particular coffee-mix product
- Unsupervised learning: learning to group objects into categories, without any training labels.
  - Example: clustering search results into topics
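Of the three problems, unsupervised learning is the one that receives no labels at all. As a minimal sketch of what "grouping objects into categories" can look like, the code below runs k-means (Lloyd's algorithm) on a handful of 2-D points. The choice of k-means, the function names, the points, and the initial centers are assumptions for illustration; the slides do not prescribe a particular clustering method.

```python
def mean_point(points):
    """Coordinate-wise mean of a non-empty list of 2-D points."""
    return (sum(x for x, _ in points) / len(points),
            sum(y for _, y in points) / len(points))

def kmeans(points, centers, iterations=10):
    """Lloyd's algorithm on 2-D points, starting from user-supplied centers."""
    for _ in range(iterations):
        # Assignment step: attach every point to its nearest center.
        clusters = [[] for _ in centers]
        for p in points:
            distances = [(p[0] - c[0]) ** 2 + (p[1] - c[1]) ** 2 for c in centers]
            clusters[distances.index(min(distances))].append(p)
        # Update step: move each center to the mean of its cluster
        # (an empty cluster keeps its previous center).
        centers = [mean_point(cl) if cl else c for cl, c in zip(clusters, centers)]
    return centers, clusters

# Hypothetical unlabeled observations forming two visible groups.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
centers, clusters = kmeans(points, centers=[(1, 1), (8, 8)])
print(clusters)
```

No label appears anywhere in the input; the grouping comes only from distances between observations.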
Supervised framework
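Training data (iid examples): $D_l = \{(x_i, y_i)\}_{i=1,\dots,n}$, with $x_i \in X^*$ and $y_i \in Y$
Hypothesis class: $H$, with each $h : X^* \to Y$
Learning algorithm: outputs $h \in H$ such that $h(x_i) \approx y_i$
Prediction on a new example $x_{test}$: $h(x_{test}) = y$
Learning = Search in Hypothesis Class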
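The idea that learning is a search over a hypothesis class can be shown with a deliberately tiny class. In the sketch below, H is assumed to be the set of threshold rules on a single number ("predict 1 when x is at least t"), and the learner simply tries one threshold per training input and keeps the one with the lowest training error. The hypothesis class, the function names, and the data are illustrative assumptions, not part of the original slides.

```python
def make_hypothesis(threshold):
    """One member of the hypothesis class H: predict 1 when x >= threshold."""
    return lambda x: 1 if x >= threshold else 0

def training_error(h, data):
    """Fraction of training pairs (x_i, y_i) with h(x_i) != y_i."""
    return sum(h(x) != y for x, y in data) / len(data)

def learn(data):
    """Search H (one candidate threshold per observed x) for the lowest training error."""
    candidates = sorted(x for x, _ in data)
    best = min(candidates, key=lambda t: training_error(make_hypothesis(t), data))
    return best, make_hypothesis(best)

# Hypothetical training set D_l: x is a 1-D feature, y is the label.
D_l = [(0.5, 0), (1.0, 0), (1.5, 0), (2.5, 1), (3.0, 1), (3.5, 1)]
threshold, h = learn(D_l)

# Apply the learned hypothesis to new examples x_test.
print(threshold, h(0.8), h(3.2))
```

Richer hypothesis classes (linear models, trees, neural networks) change what is searched and how, but the framing stays the same: pick the h in H that best agrees with the training pairs.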
Linearly-separable classifiers
Spam vs. not spam
Tumor (malignant, benign)
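When two classes are linearly separable, a single line (or hyperplane) puts all examples of one class on one side and all examples of the other class on the other side. One classic way to find such a separator is the perceptron update rule, sketched below. The perceptron is chosen here only as an illustration; the slides do not name a specific algorithm, and the 2-D points and labels are made up.

```python
def perceptron(data, epochs=1000):
    """Train a perceptron on (features, label) pairs with labels in {-1, +1}."""
    w = [0.0] * len(data[0][0])
    b = 0.0
    for _ in range(epochs):
        mistakes = 0
        for x, y in data:
            activation = sum(wi * xi for wi, xi in zip(w, x)) + b
            if y * activation <= 0:      # misclassified, or exactly on the boundary
                w = [wi + y * xi for wi, xi in zip(w, x)]
                b += y
                mistakes += 1
        if mistakes == 0:                # a full pass without errors: done
            break
    return w, b

def predict(w, b, x):
    """Return +1 or -1 depending on the side of the separator x falls on."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else -1

# Hypothetical, linearly separable data: label -1 near the origin, +1 further out.
data = [((1.0, 1.0), -1), ((2.0, 1.0), -1), ((1.0, 2.0), -1),
        ((4.0, 4.0), 1), ((5.0, 4.0), 1), ((4.0, 5.0), 1)]
w, b = perceptron(data)
print(all(predict(w, b, x) == y for x, y in data))
```

For separable data the perceptron is guaranteed to stop making mistakes after finitely many updates; on data that is not separable it would keep updating until the epoch limit.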
Learning from relevant, labeled examples
- Distinguish a picture of me from a picture of someone else?
  Provide example pictures of me and pictures of other people and let a classifier learn to distinguish the two.
- Determine whether a sentence is grammatical or not?
  Provide examples of grammatical and ungrammatical sentences and let a classifier learn to distinguish the two.
- Distinguish cancerous cells from normal cells?
  Provide examples of cancerous and normal cells and let a classifier learn to distinguish the two.
Labeled data (“play” prediction)
Regression
Regression in Medicine
[Efron et al., Least Angle Regression, Annals of Statistics, 2004]
Unsupervised learning
Supervised learning vs. unsupervised learning
Clustering “news” articles
References
- Pattern Recognition and Machine Learning, Christopher Bishop.
- Machine Learning, Tom Mitchell.
- The Elements of Statistical Learning: Data Mining, Inference and Prediction, Trevor Hastie, Robert Tibshirani, Jerome Friedman.
Information Retrieval Systems: Basics
What is Information Retrieval (IR)?
The processing, indexing and retrieval of textual documents.
1. Retrieving relevant documents to a query.
2. Retrieving from large sets of documents efficiently.
Key terms
- Query: a representation of what the user is looking for; can be a list of words or a phrase.
- Document: webpage/pdf/image... what the user wants to retrieve.
- Collection or corpus: a set of documents.
- Index: a set of data structures that make querying efficient.
- Term: word or concept that appears in a document or a query.
Typical IR system architecture
What is a Digital Library?
- An electronic library for a focused collection of digital objects
- Objects can include text, visual material, audio material... (electronic media formats)
- A type of information retrieval system.
- Examples: CiteSeerX, PubMed, ACM DL, LawNet ...
Web Search (Google) vs. Digital Libraries
- “All the Web” vs. domain-specific collections/special types of documents
- Everybody vs. users with “special” needs
- “Documents” vs. “Documents, Authors, Connections...”
- For a DL (or a typical IR system):
  - Must assemble a document corpus (spidering the Web or from trusted sources)
  - Document collections need to be constantly updated
  - Different types of search, ranking, and visualization tasks
ACM Digital Library
PubMed
LawNet
Components in an IR system
- Crawl/Acquire*: gathers the documents that need to be indexed in the system.
- Index/Search: retrieves documents that contain a given query token from the inverted index.
- Rank: scores all retrieved documents according to a relevance metric.
- Visualize: manages interaction with the user.
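The Index/Search component relies on an inverted index, which maps each term to the documents that contain it. A minimal sketch follows, assuming whitespace tokenization, lowercase matching, and Boolean AND semantics for multi-term queries; the function names and the three-document corpus are hypothetical.

```python
from collections import defaultdict

def build_inverted_index(corpus):
    """Map each term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in corpus.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

def search(index, query):
    """Return ids of documents containing every query term (Boolean AND)."""
    postings = [index.get(term, set()) for term in query.lower().split()]
    return set.intersection(*postings) if postings else set()

# Hypothetical corpus: document id -> text.
corpus = {
    "d1": "machine learning for digital libraries",
    "d2": "crawling the web for research articles",
    "d3": "digital libraries index research articles",
}
index = build_inverted_index(corpus)
print(sorted(search(index, "digital libraries")))   # ['d1', 'd3']
```

Lookup cost depends on the length of the posting sets for the query terms rather than on the size of the whole corpus, which is why this structure makes querying efficient.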
Typical IR Search
Given:
- A corpus
- A user query in the form of a textual string
Find:
- A ranked set of documents that are relevant to the query
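One common way to turn "corpus plus query string" into a ranked list is to score each document by the TF-IDF weight of the query terms it contains. The slides do not commit to a scoring function at this point, so the sketch below is only one possible instance; the `rank` function, the unsmoothed `log(N/df)` weighting, and the corpus are assumptions.

```python
import math
from collections import Counter

def rank(corpus, query):
    """Score documents by summed TF-IDF of the query terms; best match first."""
    docs = {doc_id: Counter(text.lower().split()) for doc_id, text in corpus.items()}
    n = len(docs)
    scores = {}
    for doc_id, counts in docs.items():
        score = 0.0
        for term in query.lower().split():
            df = sum(1 for c in docs.values() if term in c)   # document frequency
            if df:
                score += counts[term] * math.log(n / df)      # tf * idf
        if score > 0:
            scores[doc_id] = score
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical corpus: document id -> text.
corpus = {
    "d1": "citation indexing for research papers",
    "d2": "web crawling for research papers",
    "d3": "citation statistics and citation trends",
    "d4": "author disambiguation",
}
print(rank(corpus, "citation indexing"))   # ['d1', 'd3']
```

Document d1 wins because it matches both query terms, including the rarer one; d3 matches only "citation", even though it does so twice.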
Relevance and Ranking
Relevance is a subjective judgment and may include:
- Being on the proper subject.
- Being timely (recent information).
- Being authoritative (from a trusted source).
- Satisfying the goals of the user and his/her intended use of the information (information need).
Main relevance criterion: an IR system should fulfill a user’s information need.
Relevance is “hard to measure”:
- Measures such as Precision, Recall, Mean Reciprocal Rank, NDCG on benchmark collections (for example, from TREC)
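The measures named above have short standard definitions, sketched below for a single ranked result list. The function names, the document ids, and the graded relevance values are hypothetical; NDCG is computed with the common `gain / log2(rank + 1)` discount, and both inputs to `precision_recall` are assumed to be non-empty.

```python
import math

def precision_recall(retrieved, relevant):
    """Precision = hits / retrieved, recall = hits / relevant (both non-empty)."""
    hits = len(set(retrieved) & set(relevant))
    return hits / len(retrieved), hits / len(relevant)

def reciprocal_rank(ranking, relevant):
    """1 / position of the first relevant document, or 0 if none is retrieved."""
    for position, doc_id in enumerate(ranking, start=1):
        if doc_id in relevant:
            return 1.0 / position
    return 0.0

def mean_reciprocal_rank(rankings, relevant_sets):
    """Average reciprocal rank over several queries."""
    pairs = list(zip(rankings, relevant_sets))
    return sum(reciprocal_rank(r, rel) for r, rel in pairs) / len(pairs)

def ndcg(ranking, gains):
    """gains maps doc id -> graded relevance; ids missing from gains count as 0."""
    dcg = sum(gains.get(d, 0) / math.log2(i + 1) for i, d in enumerate(ranking, start=1))
    ideal = sorted(gains.values(), reverse=True)[:len(ranking)]
    idcg = sum(g / math.log2(i + 1) for i, g in enumerate(ideal, start=1))
    return dcg / idcg if idcg else 0.0

# Hypothetical system output and relevance judgments for one query.
ranking = ["d3", "d1", "d7", "d2"]
relevant = {"d1", "d2", "d5"}
gains = {"d1": 3, "d2": 1, "d5": 2}

print(precision_recall(ranking, relevant))   # precision 0.5, recall 2/3
print(reciprocal_rank(ranking, relevant))    # 0.5
print(ndcg(ranking, gains))
```

Precision and recall ignore ordering, reciprocal rank looks only at the first relevant hit, and NDCG rewards placing highly relevant documents near the top.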
Information Retrieval
The processing, indexing and retrieval of documents.
1. Retrieving relevant documents to a query.
2. Retrieving from large sets of documents efficiently.
Matching documents and queries:
- Handling vocabulary mismatch (“PRC” vs. “China”)
- Handling ambiguity (“bat”, “jaguar”)
Implementation and User Experience concerns
- Fast search (efficient data structures such as inverted indices)
- What queries are possible?
- How many results?
- Query suggestions?
- Show similar searches?
- Cluster results, other visualizations?
References
- “Introduction to Information Retrieval”, C.D. Manning, P. Raghavan, H. Schütze
- “Mining the Web: Discovering Knowledge from Hypertext”, Soumen Chakrabarti
- “Search Engines: Information Retrieval in Practice”, Bruce Croft, Donald Metzler and Trevor Strohman
ML in a practical IR system (CiteSeer)
What is CiteSeer/CiteSeerX?
- Scientific digital library for Computer and Information Science
- Indexes (free) PostScript and PDF research articles on the Web
- Automated techniques for acquiring and harvesting research articles
- Several functionalities: citation indexing, metadata extraction, author disambiguation, citation statistics and trends
CiteSeer: Document Search and Metadata
CiteSeer: Citations and Trends
Author Disambiguation
Reference Recommendation
Upcoming Lectures/Hands-on
How are we applying Machine Learning techniques for these tasks in CiteSeer?
- Day 1: Introduction + Heritrix (crawling) exercise
- Day 2: Classification + Weka exercise
- Day 3: PageRank/graph-based analysis + Gephi demo
- Day 4: Topic Modeling + Mallet (LDA) exercise
- Day 5: Information Extraction + OpenCalais demo
Requirements for exercises: Familiarity with Java and Linux environments
Crawling the Web
The Web (Corpus) by the Numbers
- 43 million web servers
- 167 Terabytes of data
  - About 20% text/html
- 100 Terabytes in “deep Web”
- 440 Terabytes in emails
[Lyman & Varian: How much Information? 2003] http://www.sims.berkeley.edu/research/projects/how-much-info-2003/
Spiders (Robots/Bots/Crawlers)
Spidering represents the main difference between traditional IR and IR these days.
- Start with a comprehensive set of root URLs from which to start the search.
- Follow all links on these pages recursively to find additional pages.
- Index/process all newly found pages in an inverted index as they are encountered.
- May allow users to directly submit pages to be indexed (and crawled from).
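The spidering loop described above can be sketched without any network access by letting a dictionary stand in for the Web. In the code below, `crawl`, `get_links`, `max_pages`, and the `toy_web` graph are all hypothetical names introduced for illustration; a real spider would fetch each page over HTTP, extract its links, and pass the page to the indexer at the marked line.

```python
from collections import deque

def crawl(seed_urls, get_links, max_pages=100):
    """Breadth-first spider: returns pages in the order they were visited."""
    frontier = deque(seed_urls)          # pages waiting to be processed
    seen = set(seed_urls)                # pages already queued or visited
    visited = []
    while frontier and len(visited) < max_pages:
        url = frontier.popleft()
        visited.append(url)              # a real spider would fetch and index here
        for link in get_links(url):
            if link not in seen:         # never queue the same page twice
                seen.add(link)
                frontier.append(link)
    return visited

# Hypothetical link graph standing in for the Web (note the cycle c -> a).
toy_web = {
    "a": ["b", "c"],
    "b": ["d"],
    "c": ["d", "a"],
    "d": [],
}
print(crawl(["a"], lambda url: toy_web.get(url, [])))   # ['a', 'b', 'c', 'd']
```

The `seen` set is what keeps the crawl from looping forever when pages link back to each other.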
Search Strategies
Breadth-first Search
Search Strategies
Depth-first Search
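Breadth-first and depth-first search differ only in which end of the frontier the next page is taken from: the oldest entry (a queue) or the newest (a stack). The sketch below makes that difference visible on a small tree of links; `traverse`, the `strategy` argument, and the tree itself are hypothetical, and a production crawler would use `collections.deque` instead of `list.pop(0)` for the queue.

```python
def traverse(graph, root, strategy):
    """Visit order for 'bfs' (FIFO frontier) or 'dfs' (LIFO frontier)."""
    frontier, seen, order = [root], {root}, []
    while frontier:
        node = frontier.pop(0) if strategy == "bfs" else frontier.pop()
        order.append(node)
        children = graph.get(node, [])
        # For DFS, push children in reverse so the leftmost child is explored first.
        for child in (children if strategy == "bfs" else reversed(children)):
            if child not in seen:
                seen.add(child)
                frontier.append(child)
    return order

# Hypothetical site structure: a root page linking to two sections.
tree = {
    "root": ["A", "B"],
    "A": ["C", "D"],
    "B": ["E"],
}
print(traverse(tree, "root", "bfs"))   # ['root', 'A', 'B', 'C', 'D', 'E']
print(traverse(tree, "root", "dfs"))   # ['root', 'A', 'C', 'D', 'B', 'E']
```

Breadth-first visits every page at one link distance from the root before going deeper, while depth-first follows one chain of links to its end before backing up.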
Some challenges/concerns
- Must detect when revisiting a page that has already been spidered (the Web is a graph, not a tree; link canonicalization).
- Must efficiently index visited pages to allow a rapid recognition test.
- Restricting the crawl (robots.txt, content/anchor-text decide-rules).
- How often should we crawl?
- Directed/Focused Spidering
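Two of these concerns, recognizing already-visited pages and respecting robots.txt, can be sketched with Python's standard library. The canonicalization rules below (lowercase scheme and host, drop default ports and fragments) are a simplified assumption rather than a complete scheme, and `canonicalize`, `should_fetch`, the `toy-spider` agent name, and the example.org URLs are hypothetical. The robots.txt rules are parsed from an in-memory list so the example needs no network access.

```python
from urllib.parse import urlsplit, urlunsplit
from urllib.robotparser import RobotFileParser

def canonicalize(url):
    """Normalize a URL so trivially different spellings compare equal."""
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()
    # Keep the port only when it is not the default for the scheme.
    default_port = {"http": 80, "https": 443}.get(scheme)
    if parts.port and parts.port != default_port:
        host = f"{host}:{parts.port}"
    path = parts.path or "/"
    return urlunsplit((scheme, host, path, parts.query, ""))   # drop the #fragment

def should_fetch(url, visited, robots, agent="toy-spider"):
    """True only for pages that are new and allowed by robots.txt."""
    key = canonicalize(url)
    if key in visited or not robots.can_fetch(agent, key):
        return False
    visited.add(key)                     # set membership gives a fast recognition test
    return True

# Hypothetical robots.txt content for the site being crawled.
robots = RobotFileParser()
robots.parse([
    "User-agent: *",
    "Disallow: /private/",
])

visited = set()
print(should_fetch("http://example.org/papers/a.pdf#page2", visited, robots))  # True
print(should_fetch("HTTP://EXAMPLE.org:80/papers/a.pdf", visited, robots))     # False
print(should_fetch("http://example.org/private/data.html", visited, robots))   # False
```

The second call is rejected because it canonicalizes to the same URL as the first, and the third because the robots.txt rules disallow the `/private/` prefix.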
Hands-on with Heritrix