Internet Archive & Web Datamining Raymie Stata UC Santa Cruz & Internet Archive

Post on 04-Jan-2016

Page 1

Internet Archive & Web Datamining

Raymie Stata

UC Santa Cruz & Internet Archive

Page 2

Agenda

• State of the Archive
  – Collections
  – Infrastructure (freecache)

• Internet Analytics
  – Information carnivores

Page 3

Archive Overview

• Started in 1996

• Transitioned from “Archive of the Internet” to “Archive on the Internet”

• Transitioning to “Digital Library of the Future”

• Funding from private foundations, plus lots of volunteers

Page 4

Digital Library of the Future

• Information is accessible to anyone from anywhere

• The best and broadest information is available

• We imagine a small network of very large, regional, “mega” digital libraries

Universal Access to Human Knowledge

Page 5

Web collection

• Over 10B pages, 200TB, 50M sites
  – Broad crawls (20TB snapshot/2 months)
  – Narrow crawls (elections, 9/11)
  – “Heritage crawls”
  – Writing new crawler :-(

• Wayback machine
  – Success! 4M hits/day
  – Have search engine, but hidden!
  – Policy has been tested, remains same

Page 6

Moving images

• 2500 Movies

• “Open source movies”
  – Upload your movie to the Archive
  – Build a movie at the Archive!

Page 7

Texts

• Have > 20K books

• Actively involved in “1M Book” and ICDL

• Bookmobile
  – Protest of Eldred
  – Real interest turned out to be overseas
    • India (30!), Egypt, Uganda
  – Spun into separate non-profit

Page 8

Audio - eTree

• Around 5,000 concerts from 250 bands
  – Growing 30 concerts, 1 band/day

• Largest consumer of bandwidth
  – Consistent 85Mbps (downloads)

• Same policy as Wayback
  – We respect requests
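For a sense of scale, the sustained 85Mbps figure can be turned into a daily volume. This is a back-of-envelope sketch only; it assumes decimal units (1 Mbps = 1,000,000 bits/s, 1 GB = 10^9 bytes) and a rate held constant for a full day, neither of which the slide states.

```python
# Back-of-envelope: data served per day at a sustained 85 Mbps.
MBPS = 85                          # download rate quoted on the slide
SECONDS_PER_DAY = 24 * 60 * 60

# Assumption: decimal megabits and decimal gigabytes.
bits_per_day = MBPS * 1_000_000 * SECONDS_PER_DAY
gigabytes_per_day = bits_per_day / 8 / 1e9

print(f"{gigabytes_per_day:.0f} GB/day")
```

Under those assumptions the rate works out to a little under a terabyte of downloads per day.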

Page 9

Infrastructure

• Infinite bandwidth and storage
  – Core competency of the Archive
  – Vision, not reality
  – But striving for it makes us better

• Recent challenges
  – Moving from 250TB to 1PB
  – Supporting eTree bandwidth

Page 10

The Petabyte challenge

• Finally having problems predicted
  – Power, cooling, disk failures dominating
  – Need larger staff, real software engineering

• BUT:
  – Took much longer than anticipated
  – Sticking to our philosophies
    • Commodity hardware
    • Widely used software + simple scripts

Page 11

The Petabyte architecture

• New datacenter
  – To solve our power and cooling problems

• Better “procurement” process

• File-level mirroring
  – Use basic FS, simple scripts
  – Preparing for geoplexing (vs. file-level RAID)

• Elimination of inter-crawl copies
  – This is currently our “backup”
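To make "file-level mirroring" with a basic filesystem and simple scripts concrete, here is a minimal sketch. It is not the Archive's actual tooling: the function names, the use of SHA-256 for verification, and the two-directory layout are all assumptions made for illustration.

```python
import hashlib
import shutil
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the hex SHA-256 digest of a file, read in 1 MiB chunks."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()


def mirror_tree(primary: Path, mirror: Path) -> list[Path]:
    """Copy each file under `primary` to the same relative path under `mirror`.

    Files whose mirror copy already has a matching checksum are skipped.
    Returns the list of mirror paths that were (re)written.
    """
    copied = []
    for src in sorted(primary.rglob("*")):
        if not src.is_file():
            continue
        dst = mirror / src.relative_to(primary)
        if dst.exists() and sha256_of(dst) == sha256_of(src):
            continue  # mirror copy is already up to date
        dst.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(src, dst)
        if sha256_of(dst) != sha256_of(src):
            raise IOError(f"mirror verification failed for {src}")
        copied.append(dst)
    return copied
```

Because each file is mirrored independently, the mirror directory could in principle sit on another machine or site, which is one reading of why this approach prepares for geoplexing.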

Page 12

The (eTree) bandwidth challenge

• Can we do better than simply buying more bandwidth?

• Yes! Find other people willing to help

• Cooperative/open-source CDN

Page 13

Freecache.org

• It shouldn’t cost you to give away content

• To distribute using freecache, simply:
  – Replace: href=http://X/Y
  – With: href=http://freecache.org/http://X/Y

• To be a distribution node, simply install a 1K Perl script on your Apache server
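The link rewrite described above can be sketched in a few lines. The prefix comes from the slide; the helper names `to_freecache` and `rewrite_hrefs` are hypothetical, and the regular expression is a simplification that only handles absolute `http://` links in `href` attributes.

```python
import re

FREECACHE_PREFIX = "http://freecache.org/"   # prefix shown on the slide

# Simplified pattern: href=, an optional quote, then an absolute http:// URL.
HREF_RE = re.compile(r'(href=)(["\']?)(http://[^"\'\s>]+)')


def to_freecache(url: str) -> str:
    """Prefix a URL with the freecache host, leaving already-rewritten URLs alone."""
    if url.startswith(FREECACHE_PREFIX):
        return url
    return FREECACHE_PREFIX + url


def rewrite_hrefs(html: str) -> str:
    """Rewrite every absolute http:// href in `html` to go through freecache."""
    return HREF_RE.sub(
        lambda m: m.group(1) + m.group(2) + to_freecache(m.group(3)), html
    )
```

For example, `rewrite_hrefs('<a href="http://X/Y">')` yields `<a href="http://freecache.org/http://X/Y">`, and running it a second time leaves the link unchanged.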

Page 14

Freecache design

• Content routing done centrally

• Right now, routing is random
  – Working on “closeness-driven” routing

• LRU eviction policy

• Throttles “cheaters”

• Broken browsers have been a problem
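Two of the design points above, random routing and LRU eviction, are easy to illustrate. The sketch below is not freecache's code (the slides describe a Perl script); the class name, the byte-based capacity, and the `pick_node` helper are assumptions for illustration.

```python
import random
from collections import OrderedDict


class LRUCache:
    """Byte-bounded cache that evicts the least recently used entry first."""

    def __init__(self, capacity_bytes: int):
        self.capacity_bytes = capacity_bytes
        self.used_bytes = 0
        self._items = OrderedDict()   # url -> content, least recently used first

    def get(self, url):
        if url not in self._items:
            return None
        self._items.move_to_end(url)  # mark as most recently used
        return self._items[url]

    def put(self, url, content: bytes):
        if url in self._items:
            self.used_bytes -= len(self._items.pop(url))
        self._items[url] = content
        self.used_bytes += len(content)
        # Evict from the least recently used end until we fit again.
        while self.used_bytes > self.capacity_bytes and self._items:
            _, evicted = self._items.popitem(last=False)
            self.used_bytes -= len(evicted)


def pick_node(nodes, rng=random):
    """Random routing: the central router picks any distribution node."""
    return rng.choice(nodes)
```

A "closeness-driven" router would replace `pick_node` with a choice based on some distance estimate between client and node; the slides do not say how that distance would be measured.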

Page 15

“Web scale” datamining

• Apps & Access: use data
  – Wayback, Wayback search
  – Web characterization
  – Story lifecycle analyzer

• Feature Datamarts: access subsets of data fast
  – Full-text index, shingleprints
  – Connectivity, term vectors

• Warehouse: store and access pages
  – Page cache
  – Feature extractor

• Data collection: download web pages
  – Donations, crawling
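The slide lists "shingleprints" among the features without defining them. One plausible reading, assumed here, is a fingerprint built from word shingles in the style of Broder's resemblance work: hash every w-word window and keep the k smallest hashes. The window size, hash function, and function names below are illustrative choices, not the Archive's.

```python
import hashlib


def shingles(text: str, w: int = 4) -> set:
    """Return the set of w-word windows (shingles) in `text`."""
    words = text.lower().split()
    if len(words) < w:
        return {" ".join(words)} if words else set()
    return {" ".join(words[i:i + w]) for i in range(len(words) - w + 1)}


def shingleprint(text: str, w: int = 4, k: int = 8) -> tuple:
    """Compact fingerprint: the k smallest 64-bit hashes of the shingles."""
    hashes = sorted(
        int.from_bytes(hashlib.sha1(s.encode("utf-8")).digest()[:8], "big")
        for s in shingles(text, w)
    )
    return tuple(hashes[:k])


def resemblance(a: str, b: str, w: int = 4) -> float:
    """Exact Jaccard similarity of the two shingle sets."""
    sa, sb = shingles(a, w), shingles(b, w)
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)
```

Storing only the fingerprint per page is what makes such a feature cheap enough to keep in a datamart and compare across billions of pages.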

Page 16

Tools for Web mining

• Very similar to the Astronomy project
  – Need indexes, parallelism
  – Need to move computation to the data
  – Strategies to deal with different result-set sizes

• Current focus is on the “warehouse”

Page 17

Web datamining using Web Carnivores

Page 18

The Carnivore Analogy [Etzioni96]

Web pages

Page 19

The Carnivore Analogy

Web pages

Search engines

Page 20

The Carnivore Analogy

Web pages

Search engines

Carnivore apps

Page 21

Carnivores

• Search engines have what you want
  – Google has 3B pages: “It’s in there”
  – No need to crawl anymore

• However, their general-purpose interfaces do not always yield good results for specific information needs

Page 22

Googlisms: a fun carnivore

Googlism for: scott kirkpatrick

scott kirkpatrick is an associate for ross
scott kirkpatrick is an awesome drummer with many fine credits to his name
scott kirkpatrick is 17 but certified as an adult
scott kirkpatrick is listed as one of the executors in the will of george hankins dated 1 october 1838 in jackson county
scott kirkpatrick is the new chairperson
scott kirkpatrick is joining the flett chiropractic clinic

Googlism for: john kubiatowicz

john kubiatowicz is a professor in computer science at uc berkeley
john kubiatowicz is currently an assistant professor at the university of california at berkeley
john kubiatowicz is designing a
john kubiatowicz is working on oceanstore
john kubiatowicz is a researcher at berkeley exploring the space of introspective computing
john kubiatowicz is a doctoral candidate in the department of electrical engineering and computer science at mit
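Output of this shape can be produced by scanning search-result snippets for "NAME is ..." phrases. The sketch below is a guess at the idea, not the actual Googlism implementation; the function name is hypothetical, and `snippets` stands in for text that a real carnivore would fetch from a search engine.

```python
import re


def googlisms(name: str, snippets: list) -> list:
    """Collect distinct '<name> is ...' phrases found in the given snippets."""
    # Match the name followed by "is" and everything up to the sentence end.
    pattern = re.compile(rf"\b{re.escape(name)} is [^.!?\n]+", re.IGNORECASE)
    seen, results = set(), []
    for snippet in snippets:
        for match in pattern.findall(snippet):
            phrase = " ".join(match.lower().split())   # normalize case and spacing
            if phrase not in seen:
                seen.add(phrase)
                results.append(phrase)
    return results
```

The carnivore adds no crawling of its own: all of the raw text comes from the search engine, and the application only reshapes it for one narrow need.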

Page 23

A carnivore for genre search

• Genre classifies documents by their intent
  – Why was the document written?

• Search engines search by topic, not genre

• Idea: build a carnivore for genre search

Page 24

Genre search engine

Topic (from user)

Genre (static)

Term-vector generation

Google

Query Generation → Google → Filter → Results
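The pipeline in the diagram can be sketched as a small function. Everything here is an assumption layered on the slide's boxes: `search` is a hypothetical callable standing in for the Google step, the filter is plain cosine similarity against a genre term vector, and the threshold value is arbitrary.

```python
import math
from collections import Counter


def term_vector(text: str) -> Counter:
    """Bag-of-words term vector for a piece of text."""
    return Counter(text.lower().split())


def cosine(u: Counter, v: Counter) -> float:
    """Cosine similarity between two term vectors."""
    dot = sum(u[t] * v[t] for t in u if t in v)
    norm = math.sqrt(sum(x * x for x in u.values())) * math.sqrt(
        sum(x * x for x in v.values())
    )
    return dot / norm if norm else 0.0


def genre_search(topic, genre_terms, genre_vector, search, threshold=0.1):
    """Query generation -> search -> filter -> results.

    `search(query)` is a hypothetical callable returning (url, page_text) pairs.
    """
    query = " ".join([topic, *genre_terms])          # query generation
    hits = search(query)                             # search-engine step
    scored = [(cosine(term_vector(text), genre_vector), url) for url, text in hits]
    return [url for score, url in sorted(scored, reverse=True) if score >= threshold]
```

The topic changes per request while the genre vector is built once, which matches the "from user" and "static" labels in the diagram.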

Page 25

Making it work

• Query templates
  – Details of the query matter

• PMI-IR for genre terms

• Discrimination as well as genre vector
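PMI-IR (Turney, 2001) estimates pointwise mutual information from search-engine hit counts. A minimal sketch of that idea applied to choosing genre terms follows; how the talk's system actually combined the counts is not stated, so the function name, its parameters, and the use of a total page count are assumptions.

```python
import math


def pmi_from_hits(hits_both: int, hits_term: int, hits_genre: int,
                  total_pages: int) -> float:
    """Pointwise mutual information (base 2) estimated from hit counts.

    hits_both   -- pages matching both the candidate term and the genre phrase
    hits_term   -- pages matching the candidate term alone
    hits_genre  -- pages matching the genre phrase alone
    total_pages -- assumed size of the search engine's index
    """
    if min(hits_both, hits_term, hits_genre) <= 0:
        return float("-inf")   # no evidence of co-occurrence
    p_both = hits_both / total_pages
    p_term = hits_term / total_pages
    p_genre = hits_genre / total_pages
    return math.log2(p_both / (p_term * p_genre))
```

Terms with a high score co-occur with the genre phrase more often than chance would predict, which makes them candidates for the genre vector.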

Page 26

User study

• Genre: “Buying guides”
  – Education for product selection
  – Lots on the Web, but hard to find
  – (Agreement on what they are)

• Results
  – Topic by itself: 0% P@10 (i.e., none in top 10)
  – Topic + “buying guide”: 33%
  – Our carnivore: 51%
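P@10 is precision at rank 10: the fraction of the first ten results that are relevant. A minimal sketch of the metric, with a hypothetical function name and made-up example data in the usage note:

```python
def precision_at_k(results: list, relevant: set, k: int = 10) -> float:
    """Fraction of the top-k results that appear in the relevant set."""
    top = results[:k]
    if not top:
        return 0.0
    return sum(1 for r in top if r in relevant) / k
```

For instance, if five of the first ten returned pages are judged to be buying guides, `precision_at_k` returns 0.5 for that query. The slide does not say how per-query values were aggregated into the reported percentages.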