![Page 1: Algorithms for Information Retrieval Prologue. References Managing gigabytes A. Moffat, T. Bell e I. Witten, Kaufmann Publisher 1999. A bunch of scientific](https://reader035.vdocument.in/reader035/viewer/2022062500/56649dcf5503460f94ac3085/html5/thumbnails/1.jpg)
Algorithms for Information Retrieval
Prologue
References
Managing Gigabytes, A. Moffat, T. Bell and I. Witten, Morgan Kaufmann Publishers, 1999.
A bunch of scientific papers available on the course site !!
Mining the Web: Discovering Knowledge from... S. Chakrabarti, Morgan Kaufmann Publishers, 2003.
Much interest...
More than 85% of users arrive at a site from a search engine (SE)
Web searches: 45% Google, 29% Yahoo, 13% MSN, 5% ASK,...
Toolbar searches: 49.6% Google, 46.1% Yahoo,...
SEs have an impact on: Web structure, knowledge and understanding, social behavior,...
...and on the market:
33% of users believe that “the results of a query are the best place to buy things” !!
Ads (4B$ in USA, 2B€ in Europe, 180M€ in Italy)
Paid search: 65% Google, 25% Yahoo, 8% MSN,...
Portal search: 15% Yahoo, 10% MSN, 7% AOL-Google,...
Goal of a Search Engine
Retrieve docs that are “relevant” for the user query
Doc: Word or PDF file, web page, email, blog, e-book,...
Query: “bag of words” paradigm
Relevant ?!?
...We face many difficulties, especially on the Web!!!
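The “bag of words” paradigm mentioned above treats a query (and a document) as an unordered multiset of terms. The sketch below is only an illustration under simple assumptions: the tokenizer (lowercasing, alphanumeric tokens) and the function names `bag_of_words` and `matches` are hypothetical choices, not something prescribed by the slides, and "every query term occurs in the doc" is just one naive notion of a match, far from true relevance.

```python
import re
from collections import Counter


def bag_of_words(text):
    """Lowercase the text, split it into alphanumeric tokens, and count them.

    Word order is discarded: only which terms occur, and how often, is kept.
    """
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def matches(query, doc):
    """True if every query term occurs somewhere in the doc, in any order."""
    doc_bag = bag_of_words(doc)
    return all(term in doc_bag for term in bag_of_words(query))


doc = "Cheap flights to Rome: compare flights and book online"
print(bag_of_words(doc)["flights"])   # 2
print(matches("rome flights", doc))   # True
print(matches("flights paris", doc))  # False
```

Note that "rome flights" and "flights rome" are the same query under this paradigm, which is part of why deciding what is “relevant” is hard.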
Web is huge and heterogeneous
Languages/Encodings
Hundreds of languages: 55 (Jul01)
Home pages:
In 1997: English 82%, the next 15 take 13%
In 2001: English 53%, the next 9 take 30%
Distributed authorship
Millions of people creating pages with their own style…
Not all have the purest motives in providing high-quality information: commercial motives drive “spamming”.
Extracting “significant data” is difficult !!
Web is highly dynamic [154 sites, 2004]
A “good” coverage of the indexed Web is difficult !!
[Figure axis label: normalized w.r.t. the first week]
Web structure
User Queries
Query composition:
Short: 2.54 terms on average (2001); 80% have fewer than 3 terms
Imprecise terms
78% of the queries are not modified
Query results:
85% of the users look at just one result page
User Needs
Informational – want to learn about something (~40%), e.g. “Asthma”
Navigational – want to go to a page (~25%), e.g. “Alitalia”
Transactional – want to do something (~35%)
Access a service, e.g. “NY weather”
Downloads, e.g. “Mars surface images”
Shop, e.g. “Nikon CoolPix”
Evolution of Search Engines
First generation (1995-1997: AltaVista, Excite, Lycos, etc.) -- use only on-page, web-text data
Word frequency and language
Second generation (1998: Google, now everyone) -- use off-page, web-graph data
Link (or connectivity) analysis
Anchor-text (how people refer to a page)
Third generation (no winner yet !!) -- answer “the need behind the query”
Focus on “user need”, rather than on query
Integrate multiple data sources
Click-through data
Query mining
Fourth generation -- Information Supply
[Andrei Broder, VP emerging search tech, Yahoo! Research]
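To make “link (or connectivity) analysis” on the web graph concrete, here is a minimal sketch of one well-known instance, PageRank computed by power iteration. The slides do not name a specific algorithm, so this choice, the damping factor 0.85, the iteration count and the toy graph are all assumptions made for illustration only.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power-iteration PageRank on a small graph.

    links maps each page to the list of pages it points to.
    Returns a dict page -> score; the scores sum to 1.
    """
    pages = set(links) | {q for targets in links.values() for q in targets}
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page receives the same small "random jump" share.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for p in pages:
            targets = links.get(p, [])
            if targets:
                # A page splits its current score among its out-links.
                share = damping * rank[p] / len(targets)
                for q in targets:
                    new_rank[q] += share
            else:
                # Dangling page (no out-links): spread its score uniformly.
                share = damping * rank[p] / n
                for q in pages:
                    new_rank[q] += share
        rank = new_rank
    return rank


# Hypothetical toy web graph: a and b both link to c, c links back to a.
toy_graph = {"a": ["c"], "b": ["c"], "c": ["a"]}
scores = pagerank(toy_graph)
print(sorted(scores, key=scores.get, reverse=True))  # ['c', 'a', 'b']
```

The score of a page depends only on who links to it (off-page data), not on its own text, which is the shift from first- to second-generation engines described above.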
What is a search engine, nowadays?
Size of search engines [2005]
Google vs Yahoo: 20-30% sharing of results
Ranking: Google vs Yahoo!
Ranking: Google vs Google.cn
Not only Web Searches...
Clustering engines: Vivisimo, Snaket,...
Suggestions
Products
Local searches
News, Blogs,...
Directories
Deep web: Invisible-web.net, Completeplanet, Resource Discovery Network
“Vertical” search engines
About this course
This course is a mix of:
Smart algorithms & data structures
Data compression
IR tools: Data Projection, Clustering,...
Massive Data
The Nature 2/06 issue highlights trends in the sciences: “2020 – Future of computing”
Exponential growth of scientific data
Due to e.g. large experiments, sensor networks, etc.
Nano-tech provides further opportunities
Paradigm shift: Science will be about mining data
Computer science paramount in all sciences
March 2006
Algorithm Inadequacy
Importance of scalability/efficiency
→ Algorithmics is a core computer science area
Traditional algorithmics: transform input to output using a simple machine model
Communities addressing inadequacies have emerged
You should be space/IO-aware programmers
I/O-conscious Algorithms
Disk access is 10^6 times slower than main memory access
Store/access data taking advantage of blocks
I/O-efficient algorithms:
Move as few disk blocks as possible to solve the given problem
Access nearby blocks to reduce the seek time
[Figure: disk with track, magnetic surface, read/write arm, read/write head]
“The difference in speed between modern CPU and disk technologies is analogous to the difference in speed in sharpening a pencil using a sharpener on one’s desk or by taking an airplane to the other side of the world and using a sharpener on someone else’s desk.” (D. Comer)
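The effect of “taking advantage of blocks” can be illustrated by counting block transfers rather than individual item accesses. The sketch below uses a deliberately simplified toy model, which is an assumption and not part of the slides: items are stored in blocks of `block_size` consecutive items, memory holds a single block at a time, and the cost is the number of block fetches. The class `BlockedDisk` and the function `count_block_reads` are hypothetical names.

```python
import random


class BlockedDisk:
    """Toy disk: consecutive items share a block; one block fits in memory."""

    def __init__(self, block_size):
        self.block_size = block_size
        self.cached_block = None  # index of the block currently in memory
        self.block_reads = 0      # number of blocks fetched so far

    def read(self, i):
        """Access item i, fetching its block if it is not the cached one."""
        block = i // self.block_size
        if block != self.cached_block:
            self.block_reads += 1
            self.cached_block = block


def count_block_reads(order, block_size):
    """Number of block fetches needed to visit the items in the given order."""
    disk = BlockedDisk(block_size)
    for i in order:
        disk.read(i)
    return disk.block_reads


n, b = 100_000, 1_000
sequential = list(range(n))
shuffled = sequential[:]
random.Random(0).shuffle(shuffled)  # fixed seed, only for repeatability

print(count_block_reads(sequential, b))  # 100, i.e. n / b block fetches
print(count_block_reads(shuffled, b))    # close to n: almost one fetch per item
```

Both loops touch the same 100,000 items; only the access order differs, yet the number of block transfers differs by roughly a factor equal to the block size.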
Streaming Algorithms
Data arrive continuously, or we wish to make FEW scans
Streaming algorithms:
Use few scans
Handle each element fast
Use small space
[Figure: disk with track, magnetic surface, read/write arm, read/write head]
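As one example that meets the three requirements above (a single scan, constant work per element, constant space), here is a sketch of the Boyer-Moore majority vote. It is chosen only for illustration and is not taken from the slides. It returns the element occurring in more than half of the stream if such an element exists; otherwise the returned candidate is meaningless and would need a second scan to verify.

```python
def majority_candidate(stream):
    """One pass, O(1) extra space: candidate for the majority element.

    Correct only when some element occurs in more than half of the stream;
    the function itself does not check that assumption.
    """
    candidate, count = None, 0
    for x in stream:
        if count == 0:
            # No surviving candidate: adopt the current element.
            candidate, count = x, 1
        elif x == candidate:
            count += 1
        else:
            # A different element cancels one occurrence of the candidate.
            count -= 1
    return candidate


# The stream is consumed as an iterator, so it is scanned exactly once.
stream = iter(["a", "b", "a", "c", "a", "a", "b"])
print(majority_candidate(stream))  # a
```

Only two variables are kept regardless of the stream length, so the same code works when the data arrive continuously and cannot be stored.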
Cache-Oblivious Algorithms
Unknown and/or changing devices
Block access is important on all levels of the memory hierarchy
But memory hierarchies are very diverse
Cache-oblivious algorithms:
Explicitly, algorithms do not assume any model parameters
Implicitly, algorithms use blocks efficiently on all memory levels
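A classic illustration of this idea is recursive matrix transposition: the code never mentions a block or cache size, yet the recursion eventually produces sub-problems small enough to fit in whatever block size each memory level has. The sketch below is an assumption-laden illustration, not material from the slides; the names `transpose` and `transposed` are hypothetical, and since Python lists of lists are not laid out contiguously in memory, it shows the access pattern rather than a measurable speed-up.

```python
def transpose(a, b, r0, r1, c0, c1):
    """Write the transpose of the sub-matrix a[r0:r1][c0:c1] into b.

    No block or cache size appears anywhere: the larger dimension is
    halved until a single element remains.
    """
    rows, cols = r1 - r0, c1 - c0
    if rows <= 0 or cols <= 0:
        return
    if rows == 1 and cols == 1:
        b[c0][r0] = a[r0][c0]
    elif rows >= cols:
        mid = r0 + rows // 2          # split the row range in two halves
        transpose(a, b, r0, mid, c0, c1)
        transpose(a, b, mid, r1, c0, c1)
    else:
        mid = c0 + cols // 2          # split the column range in two halves
        transpose(a, b, r0, r1, c0, mid)
        transpose(a, b, r0, r1, mid, c1)


def transposed(a):
    """Return a new matrix that is the transpose of the rectangular matrix a."""
    n_rows = len(a)
    n_cols = len(a[0]) if a else 0
    b = [[None] * n_rows for _ in range(n_cols)]
    transpose(a, b, 0, n_rows, 0, n_cols)
    return b


m = [[1, 2, 3],
     [4, 5, 6]]
print(transposed(m))  # [[1, 4], [2, 5], [3, 6]]
```

A practical implementation would stop the recursion at a small constant-size base case to limit call overhead; that constant is a tuning detail and not a model parameter.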