Class 39: ...and the World Wide Web
The World Wide Web, Dynamic Web Applications, Search Engines, MapReduce, Course Summary
![Page 1: Class 39: ...and the World Wide Web](https://reader035.vdocument.in/reader035/viewer/2022062514/559258af1a28ab6a418b45c2/html5/thumbnails/1.jpg)
Lecture 39: …and the World Wide Web
cs1120 Fall 2011
David Evans
http://www.cs.virginia.edu/evans
![Page 2: Class 39: ...and the World Wide Web](https://reader035.vdocument.in/reader035/viewer/2022062514/559258af1a28ab6a418b45c2/html5/thumbnails/2.jpg)
Announcements
Exam 2 due 60 seconds ago!
Friday: we will return graded Exam 2, along with guidance about the Final
Must be present (or email me in advance) to win!
If you want to present your PS8 in class Monday, remember to email me!
![Page 3: Class 39: ...and the World Wide Web](https://reader035.vdocument.in/reader035/viewer/2022062514/559258af1a28ab6a418b45c2/html5/thumbnails/3.jpg)
Plan
The World Wide Web
Building Web Applications
How Google Works
(or, going back to pre-PS5 to make things really fast again!)
cs1120 recap in one (heavily animated) slide!
![Page 4: Class 39: ...and the World Wide Web](https://reader035.vdocument.in/reader035/viewer/2022062514/559258af1a28ab6a418b45c2/html5/thumbnails/4.jpg)
The World Wide Web
![Page 5: Class 39: ...and the World Wide Web](https://reader035.vdocument.in/reader035/viewer/2022062514/559258af1a28ab6a418b45c2/html5/thumbnails/5.jpg)
The “Desk Wide Web”
Memex Machine
Vannevar Bush, As We May Think, LIFE, 1945
![Page 6: Class 39: ...and the World Wide Web](https://reader035.vdocument.in/reader035/viewer/2022062514/559258af1a28ab6a418b45c2/html5/thumbnails/6.jpg)
WorldWideWeb
First web server and client, 1990 (this picture, 1993)
Sir Tim Berners-Lee
CERN (Switzerland)
MIT
![Page 7: Class 39: ...and the World Wide Web](https://reader035.vdocument.in/reader035/viewer/2022062514/559258af1a28ab6a418b45c2/html5/thumbnails/7.jpg)
http://www.w3.org/History/1989/proposal-msw.html
Overview: Many of the discussions of the future at CERN and the LHC era end with the question – “Yes, but how will we ever keep track of such a large project?” This proposal provides an answer to such questions. Firstly, it discusses the problem of information access at CERN. Then, it introduces the idea of linked information systems, and compares them with less flexible ways of finding information.
![Page 8: Class 39: ...and the World Wide Web](https://reader035.vdocument.in/reader035/viewer/2022062514/559258af1a28ab6a418b45c2/html5/thumbnails/8.jpg)
A Practical Project
![Page 9: Class 39: ...and the World Wide Web](https://reader035.vdocument.in/reader035/viewer/2022062514/559258af1a28ab6a418b45c2/html5/thumbnails/9.jpg)
![Page 10: Class 39: ...and the World Wide Web](https://reader035.vdocument.in/reader035/viewer/2022062514/559258af1a28ab6a418b45c2/html5/thumbnails/10.jpg)
WorldWideWeb
Established a common language for sharing information on computers
Lots of previous attempts (Gopher, WAIS, Archie, Xanadu, etc.) failed
![Page 11: Class 39: ...and the World Wide Web](https://reader035.vdocument.in/reader035/viewer/2022062514/559258af1a28ab6a418b45c2/html5/thumbnails/11.jpg)
Why the World Wide Web?
– Didn’t attempt to maintain links, just a common way to name things
– Uniform Resource Locators (URL)
http://www.cs.virginia.edu/cs1120/index.html
Service: http (HyperText Transfer Protocol)
Hostname: www.cs.virginia.edu
File Path: /cs1120/index.html
World Wide Web succeeded because it was simple!
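The three URL parts named on this slide can be pulled apart with Python's standard library. This is only a small sketch; the function name `describe_url` and the dictionary keys are illustrative choices, not part of the lecture.

```python
from urllib.parse import urlsplit


def describe_url(url):
    """Split a URL into the three parts named on the slide."""
    parts = urlsplit(url)
    return {
        "service": parts.scheme,      # "http" selects the HyperText Transfer Protocol
        "hostname": parts.hostname,   # which machine to contact
        "file_path": parts.path,      # what to ask that machine for
    }


print(describe_url("http://www.cs.virginia.edu/cs1120/index.html"))
```

Nothing in the URL records who links to the page, which is the simplicity the slide credits for the Web's success.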
![Page 12: Class 39: ...and the World Wide Web](https://reader035.vdocument.in/reader035/viewer/2022062514/559258af1a28ab6a418b45c2/html5/thumbnails/12.jpg)
HyperText Transfer Protocol
Client (Browser) → Server: GET /cs1120/index.html HTTP/1.0
Server → Client (Browser): <html><head>… (contents of file)
HTML: HyperText Markup Language
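To make the exchange concrete, here is a rough sketch of the text a client sends and how it might read the reply. It does not open a network connection; `SAMPLE_RESPONSE` is a made-up reply used only for illustration, and the `Host` header is an optional extra that HTTP/1.0 does not require.

```python
def build_get_request(path, host):
    """Build the text of an HTTP/1.0 GET request: request line, headers, blank line."""
    return "GET {} HTTP/1.0\r\nHost: {}\r\n\r\n".format(path, host)


def parse_response(raw):
    """Split a raw HTTP response into (status code, headers, body)."""
    head, _, body = raw.partition("\r\n\r\n")
    lines = head.split("\r\n")
    _version, status, _reason = lines[0].split(" ", 2)
    headers = {}
    for line in lines[1:]:
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()
    return int(status), headers, body


# Hypothetical server reply, invented for this example.
SAMPLE_RESPONSE = (
    "HTTP/1.0 200 OK\r\n"
    "Content-Type: text/html\r\n"
    "\r\n"
    "<html><head></head><body></body></html>"
)

print(build_get_request("/cs1120/index.html", "www.cs.virginia.edu"))
print(parse_response(SAMPLE_RESPONSE))
```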
![Page 13: Class 39: ...and the World Wide Web](https://reader035.vdocument.in/reader035/viewer/2022062514/559258af1a28ab6a418b45c2/html5/thumbnails/13.jpg)
HTML: HyperText Markup Language
• Language for controlling display of web pages
• Uses formatting tags: between < and >
Document ::= <html> Header Body </html>
Header ::= <head> HeadElements </head>
HeadElements ::= HeadElement HeadElements
HeadElements ::= ε | <title> Element </title>
Body ::= <body> Elements </body>
Elements ::= ε | Element Elements
Element ::= <p> Element </p>
Element ::= <center> Element </center>
…
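The grammar requires every opening tag to be closed in the reverse order it was opened. A minimal sketch of checking that property with Python's built-in `html.parser` follows; it checks nesting only, not the full grammar, and it would reject tags that real HTML allows to stay unclosed (such as `<br>`). The names `NestingChecker` and `is_well_nested` are illustrative.

```python
from html.parser import HTMLParser


class NestingChecker(HTMLParser):
    """Track open tags on a stack; a close tag must match the most recent open tag."""

    def __init__(self):
        super().__init__()
        self.stack = []
        self.ok = True

    def handle_starttag(self, tag, attrs):
        self.stack.append(tag)

    def handle_endtag(self, tag):
        if not self.stack or self.stack.pop() != tag:
            self.ok = False


def is_well_nested(document):
    checker = NestingChecker()
    checker.feed(document)
    checker.close()
    # Well nested means no mismatches and nothing left open at the end.
    return checker.ok and not checker.stack


print(is_well_nested("<html><head><title>cs1120</title></head><body><p>Hi</p></body></html>"))
print(is_well_nested("<p><center>Hi</p></center>"))
```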
![Page 14: Class 39: ...and the World Wide Web](https://reader035.vdocument.in/reader035/viewer/2022062514/559258af1a28ab6a418b45c2/html5/thumbnails/14.jpg)
Popular Web Site: Strategy 1
Static, Authored Web Site
Content Producer
http://www.twinkiesproject.com/
Drawbacks:
• Have to do all the work yourself
• The world may already have enough Twinkie-experiment websites
![Page 15: Class 39: ...and the World Wide Web](https://reader035.vdocument.in/reader035/viewer/2022062514/559258af1a28ab6a418b45c2/html5/thumbnails/15.jpg)
Popular Web Site: Strategy 2
Dynamic Web Applications
Seed content and function
Web Programmer
eBay in 1997
http://web.archive.org/web/19970614001443/http://www.ebay.com/
Produce more content
Attracts users
![Page 16: Class 39: ...and the World Wide Web](https://reader035.vdocument.in/reader035/viewer/2022062514/559258af1a28ab6a418b45c2/html5/thumbnails/16.jpg)
Popular Web Site: Strategy 2
Dynamic Web Applications
Seed content and function
reddit.com in 2005
Produce more content
Attracts users
reddit.com today
Advantages:
• Users do most of the work
• If you’re lucky, they might even pay you for the privilege!
Disadvantages:
• Lose control over the content (you might get sued for things your users do)
• Have to know how to program a web application
![Page 17: Class 39: ...and the World Wide Web](https://reader035.vdocument.in/reader035/viewer/2022062514/559258af1a28ab6a418b45c2/html5/thumbnails/17.jpg)
Dynamic Web Sites
Programs that run on the web server
– Can be written in any language (often in Python or Java), just need a way to connect the web server to the program
– Program generates HTML (often JavaScript also now)
– Every useful web site does this
Programs that run on the client’s machine
– Java, JavaScript (aka, “Scheme for the Web”), Flash, etc.: language must be supported by the client’s browser
– Responsive interface: limited round-trips to server
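As one illustration of a server-side program that generates HTML, here is a small sketch using WSGI, a standard Python convention for connecting a web server to a program. The greeting page and the `name` query parameter are invented for the example, and `render` is a hypothetical helper that calls the application directly so the sketch runs without starting a server.

```python
from html import escape
from urllib.parse import parse_qs


def application(environ, start_response):
    """WSGI entry point: the web server calls this once per request."""
    query = parse_qs(environ.get("QUERY_STRING", ""))
    name = query.get("name", ["world"])[0]
    # Escape user input before placing it in the generated HTML.
    page = (
        "<html><head><title>Hello</title></head>"
        "<body><p>Hello, {}!</p></body></html>"
    ).format(escape(name))
    data = page.encode("utf-8")
    start_response("200 OK", [
        ("Content-Type", "text/html; charset=utf-8"),
        ("Content-Length", str(len(data))),
    ])
    return [data]


def render(query_string):
    """Call the application the way a server would, but in-process."""
    captured = {}

    def start_response(status, headers):
        captured["status"] = status

    chunks = application({"QUERY_STRING": query_string}, start_response)
    return captured["status"], b"".join(chunks).decode("utf-8")


print(render("name=cs1120"))
```

The same `application` function could be handed to a real server such as `wsgiref.simple_server.make_server` from the standard library.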
![Page 18: Class 39: ...and the World Wide Web](https://reader035.vdocument.in/reader035/viewer/2022062514/559258af1a28ab6a418b45c2/html5/thumbnails/18.jpg)
Searching the Web
![Page 19: Class 39: ...and the World Wide Web](https://reader035.vdocument.in/reader035/viewer/2022062514/559258af1a28ab6a418b45c2/html5/thumbnails/19.jpg)
![Page 20: Class 39: ...and the World Wide Web](https://reader035.vdocument.in/reader035/viewer/2022062514/559258af1a28ab6a418b45c2/html5/thumbnails/20.jpg)
Building a Web Search Engine
• Database of web pages
– Crawling the web collecting pages and links
– Indexing them efficiently
• Responding to Searches
– Spell checking – edit distance
– How to find documents that match a query
– How to rank the “best” documents
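The spell-checking item refers to edit distance. A minimal sketch, assuming the usual Levenshtein definition (single-character insertions, deletions and substitutions each cost 1), is below; the helper `suggest` and its tiny vocabulary are hypothetical and only show how the distance could pick a correction.

```python
def edit_distance(a, b):
    """Levenshtein distance, computed one row of the dynamic-programming table at a time."""
    previous = list(range(len(b) + 1))  # distance from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        current = [i]                   # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            substitution = previous[j - 1] + (0 if ca == cb else 1)
            current.append(min(previous[j] + 1,      # delete ca
                               current[j - 1] + 1,   # insert cb
                               substitution))
        previous = current
    return previous[-1]


def suggest(word, vocabulary):
    """Return the vocabulary word closest to the (possibly misspelled) input."""
    return min(vocabulary, key=lambda candidate: edit_distance(word, candidate))


print(edit_distance("kitten", "sitting"))
print(suggest("virgnia", ["virginia", "google", "yahoo"]))
```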
![Page 21: Class 39: ...and the World Wide Web](https://reader035.vdocument.in/reader035/viewer/2022062514/559258af1a28ab6a418b45c2/html5/thumbnails/21.jpg)
Crawling
Crawler
activeURLs = ["www.yahoo.com"]
while len(activeURLs) > 0:
    newURLs = []
    for URL in activeURLs:
        page = downloadPage(URL)
        newURLs += extractLinks(page)
    activeURLs = newURLs
Problems:
• Will keep revisiting the same pages
• Will take very long to get a good view of the web
• Will annoy web server admins
• downloadPage and extractLinks must be very robust
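One way to address the first problem (revisiting pages) is to remember which URLs have been seen. The sketch below is not the lecture's crawler: it crawls an in-memory `TOY_WEB` dictionary instead of the real web, and `get_links` stands in for downloading a page and extracting its links. The `max_pages` cap is an added assumption to keep a crawl bounded.

```python
from collections import deque

# Hypothetical miniature web: each URL maps to the URLs it links to.
TOY_WEB = {
    "a": ["b", "c"],
    "b": ["a", "c"],
    "c": ["a"],
}


def crawl(seed_urls, get_links, max_pages=1000):
    """Breadth-first crawl that visits each URL at most once."""
    visited = set()
    frontier = deque(seed_urls)
    order = []
    while frontier and len(order) < max_pages:
        url = frontier.popleft()
        if url in visited:
            continue  # already crawled: skip instead of downloading again
        visited.add(url)
        order.append(url)
        for link in get_links(url):
            if link not in visited:
                frontier.append(link)
    return order


print(crawl(["a"], lambda url: TOY_WEB.get(url, [])))
```

A real crawler would also need to rate-limit requests per host and tolerate malformed pages, which this sketch leaves out.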
![Page 22: Class 39: ...and the World Wide Web](https://reader035.vdocument.in/reader035/viewer/2022062514/559258af1a28ab6a418b45c2/html5/thumbnails/22.jpg)
Building a Web Search Engine
• Database of web pages
– Crawling the web collecting pages and links
– Indexing them efficiently
• Responding to Searches
– How to find documents that match a query
– How to rank the “best” documents
![Page 23: Class 39: ...and the World Wide Web](https://reader035.vdocument.in/reader035/viewer/2022062514/559258af1a28ab6a418b45c2/html5/thumbnails/23.jpg)
Building an Index
• What if we just stored all the pages?
Answering a query would take time linear in the size of the database
(need to look at all characters in the database)
Google: about 40 Billion pages (1 Trillion URLs, but number actually indexed is a closely kept corporate secret)
* 60 KB (average web page size) = ~2.4 Quadrillion bytes to search!
Linear is not nearly good enough when n is Quadrillions
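The arithmetic on this slide can be checked directly. The read rate below is purely an assumption for illustration (100 MB per second from a single disk); the slide gives no throughput figure.

```python
PAGES = 40 * 10**9                        # about 40 billion pages (from the slide)
AVERAGE_PAGE_BYTES = 60 * 10**3           # 60 KB average page (from the slide)
ASSUMED_BYTES_PER_SECOND = 100 * 10**6    # assumption: 100 MB/s sequential read

total_bytes = PAGES * AVERAGE_PAGE_BYTES
scan_seconds = total_bytes / ASSUMED_BYTES_PER_SECOND
scan_days = scan_seconds / 86400          # seconds per day

print(total_bytes)        # 2.4 quadrillion bytes
print(round(scan_days))   # days for one linear scan at the assumed rate
```

Under that assumed rate, a single linear scan would take most of a year per query, which is why an index is needed.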
![Page 24: Class 39: ...and the World Wide Web](https://reader035.vdocument.in/reader035/viewer/2022062514/559258af1a28ab6a418b45c2/html5/thumbnails/24.jpg)
Hash Table
def lookup(key, table):
    return searchEntries(table[H(key, len(table))], key)
Index Key-Value Pairs
0 { <“Colleen”, ? >, <“virginia”, ? >, … }
1 { <“Bob”, ? >, … }
2
3
…
[about a million bins?]
Finding a good H is difficult
You can download Google’s from http://code.google.com/p/google-sparsehash/
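A minimal sketch of the binned table pictured on this slide follows. The hash function `H` here is a deliberately simple toy (a polynomial string hash), not Google's; `make_table` and `insert` are hypothetical helpers added so the example is self-contained.

```python
def H(key, nbins):
    """Toy string hash: maps a key to a bin index in range(nbins)."""
    total = 0
    for ch in key:
        total = (total * 31 + ord(ch)) % nbins
    return total


def make_table(nbins):
    return [[] for _ in range(nbins)]


def insert(key, value, table):
    entries = table[H(key, len(table))]
    for i, (existing_key, _) in enumerate(entries):
        if existing_key == key:
            entries[i] = (key, value)  # replace the old value for this key
            return
    entries.append((key, value))


def lookup(key, table):
    """Search only the one bin the key hashes to, not the whole table."""
    for existing_key, value in table[H(key, len(table))]:
        if existing_key == key:
            return value
    return None


table = make_table(8)
insert("Colleen", 1, table)
insert("virginia", 2, table)
insert("Bob", 3, table)
print(lookup("virginia", table))
```

With enough bins and a hash that spreads keys evenly, each bin stays short, so lookup time barely grows with the number of entries.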
![Page 25: Class 39: ...and the World Wide Web](https://reader035.vdocument.in/reader035/viewer/2022062514/559258af1a28ab6a418b45c2/html5/thumbnails/25.jpg)
Google’s Lexicon
1998: 14 million words (billions today?)
Lookup word in H(word, nbins): maps to WordID
Key Words
0 [<“aardvark”, 1024235>, ... ]
1 [<“aaa”, 224155>, ..., <“zzz”, 29543> ]
... ...
nbins – 1 [<“abba”, 25583>, ..., <“zeit”, 50395> ]
![Page 26: Class 39: ...and the World Wide Web](https://reader035.vdocument.in/reader035/viewer/2022062514/559258af1a28ab6a418b45c2/html5/thumbnails/26.jpg)
Google’s Reverse Index
WordId ndocs pointer
00000000 3
00000001 15
...
16777215 105
(Based on 1998 paper…definitely changed some since then, but now they are secretive!)
Lexicon: 293 MB (1998); today: many GB?
“Inverted Barrels”: 41 GB (1998); today: many TB?
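The idea behind a reverse (inverted) index can be sketched in a few lines: instead of scanning pages for a word, store for each word the list of documents containing it. This is a toy version, not Google's layout; it uses words as keys rather than WordIds, and the sample documents and function names are invented.

```python
import re
from collections import defaultdict


def build_inverted_index(documents):
    """Map each word to the sorted list of document ids that contain it."""
    index = defaultdict(list)
    for doc_id, text in enumerate(documents):
        for word in sorted(set(re.findall(r"[a-z0-9]+", text.lower()))):
            index[word].append(doc_id)
    return index


def search(index, query):
    """Return ids of documents containing every word in the query."""
    words = re.findall(r"[a-z0-9]+", query.lower())
    if not words:
        return []
    result = set(index.get(words[0], []))
    for word in words[1:]:
        result &= set(index.get(word, []))
    return sorted(result)


documents = ["the world wide web", "web search engines", "the course summary"]
index = build_inverted_index(documents)
print(search(index, "the web"))
```

The length of each word's list plays the role of the ndocs column on the slide.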
![Page 27: Class 39: ...and the World Wide Web](https://reader035.vdocument.in/reader035/viewer/2022062514/559258af1a28ab6a418b45c2/html5/thumbnails/27.jpg)
Inverted Barrels
docid (27 bits), nhits (5 bits), hits (16 bits each)
7630486927 23...
plain hit:
capitalized: 1 bit
font size: 3 bits
position: 12 bits first 4095 chars, everything else
extra info for anchors, titles (fewer position bits)
Suggested experiment for winter break: is the position field still only 12 bits?
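The plain-hit fields add up to 1 + 3 + 12 = 16 bits, so a hit fits in one 16-bit integer. The sketch below shows one possible packing. The field order (capitalized in the top bit, then font size, then position) and the choice to clamp large positions to 4095 are assumptions made for illustration; the slide does not specify either.

```python
MAX_POSITION = 4095  # largest value that fits in 12 bits


def pack_hit(capitalized, font_size, position):
    """Pack a plain hit into 16 bits: 1 bit capitalized, 3 bits font size, 12 bits position."""
    if not 0 <= font_size < 8:
        raise ValueError("font_size must fit in 3 bits")
    if position < 0:
        raise ValueError("position must be non-negative")
    position = min(position, MAX_POSITION)  # assumed handling of positions past the limit
    return (int(capitalized) << 15) | (font_size << 12) | position


def unpack_hit(hit):
    """Recover (capitalized, font_size, position) from a packed hit."""
    return bool(hit >> 15), (hit >> 12) & 0b111, hit & 0xFFF


packed = pack_hit(True, 3, 100)
print(packed, unpack_hit(packed))
```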
![Page 28: Class 39: ...and the World Wide Web](https://reader035.vdocument.in/reader035/viewer/2022062514/559258af1a28ab6a418b45c2/html5/thumbnails/28.jpg)
Building a Web Search Engine
• Database of web pages
– Crawling the web collecting pages and links
– Indexing them efficiently
• Responding to Searches
– Spell checking – edit distance
– How to find documents that match a query
– How to rank the “best” documents
![Page 29: Class 39: ...and the World Wide Web](https://reader035.vdocument.in/reader035/viewer/2022062514/559258af1a28ab6a418b45c2/html5/thumbnails/29.jpg)
Finding the “Best” Documents
• Humans rate them
– “Jerry and David’s Guide to the World Wide Web” (became Yahoo!)
• Machines rate them
– Count number of occurrences of keyword
• Easy for sites to rig this
– Machine language understanding not good enough
• Business Model
– Whoever pays you the most is listed first
![Page 30: Class 39: ...and the World Wide Web](https://reader035.vdocument.in/reader035/viewer/2022062514/559258af1a28ab6a418b45c2/html5/thumbnails/30.jpg)
PageRank
If a site is important and interesting, other sites will link to it.
But…not all links are equal: if a lot of highly-ranked sites link to this site, this site should be highly-ranked.
Don’t ever take <a href=http://www.cs.virginia.edu/cs1120>cs1120</a>!
![Page 31: Class 39: ...and the World Wide Web](https://reader035.vdocument.in/reader035/viewer/2022062514/559258af1a28ab6a418b45c2/html5/thumbnails/31.jpg)
PageRank
def pageRank(u):
    rank = 0
    for b in linksToPage(u):
        rank = rank + pageRank(b) / Links(b)
    return rank
Would this work?
![Page 32: Class 39: ...and the World Wide Web](https://reader035.vdocument.in/reader035/viewer/2022062514/559258af1a28ab6a418b45c2/html5/thumbnails/32.jpg)
Converging PageRank
• Ranks of all pages depend on ranks of all other pages
• Keep recalculating ranks until they converge
def CalculatePageRanks(urls):
    initially, every rank is 1
    for as many times as necessary
        calculate a new rank for each page (using old ranks)
        replace the old ranks with the new ranks
How do initial ranks affect results?
How many iterations are necessary?
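The pseudocode above can be turned into a small runnable sketch. It follows the slide's simple formulation (each page splits its rank equally among the pages it links to) and omits refinements that production systems use. It assumes every page has at least one outgoing link and that every link target is a key of `links`; `TOY_LINKS` is an invented three-page web.

```python
# Hypothetical miniature web: page -> pages it links to.
TOY_LINKS = {
    "a": ["b", "c"],
    "b": ["c"],
    "c": ["a"],
}


def calculate_page_ranks(links, tolerance=1e-9, max_iterations=1000):
    """Repeat the rank update until no rank changes by more than tolerance."""
    ranks = {page: 1.0 for page in links}          # initially, every rank is 1
    for _ in range(max_iterations):
        new_ranks = {page: 0.0 for page in links}
        for source, targets in links.items():
            share = ranks[source] / len(targets)   # old rank split across out-links
            for target in targets:
                new_ranks[target] += share
        change = max(abs(new_ranks[p] - ranks[p]) for p in links)
        ranks = new_ranks                          # replace old ranks with new ranks
        if change < tolerance:
            break
    return ranks


print(calculate_page_ranks(TOY_LINKS))
```

On this toy graph, pages "a" and "c" end up with equal rank and "b" with half as much, since "b" only receives half of "a"'s rank.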
![Page 33: Class 39: ...and the World Wide Web](https://reader035.vdocument.in/reader035/viewer/2022062514/559258af1a28ab6a418b45c2/html5/thumbnails/33.jpg)
PageRank: 1998
• Crawlable web (1998):
• 150 million pages, 1.7 Billion links
• Database of 322 million links
– Converges in about 50 iterations
• Initialization matters
– All pages = 1: very democratic, models browser equally likely to start on random page
– www.yahoo.com = 1, ..., all others = 0
• More like what Google probably uses
![Page 34: Class 39: ...and the World Wide Web](https://reader035.vdocument.in/reader035/viewer/2022062514/559258af1a28ab6a418b45c2/html5/thumbnails/34.jpg)
Do we have a search engine?
Google’s First Server
Theoretician: Sure!
Ali G: No way! It’ll blow up.
![Page 35: Class 39: ...and the World Wide Web](https://reader035.vdocument.in/reader035/viewer/2022062514/559258af1a28ab6a418b45c2/html5/thumbnails/35.jpg)
How do we make our service fast enough to index the whole web and serve billions of requests?
![Page 36: Class 39: ...and the World Wide Web](https://reader035.vdocument.in/reader035/viewer/2022062514/559258af1a28ab6a418b45c2/html5/thumbnails/36.jpg)
Counting Word Occurrences
“When in the Course of human events, it becomes necessary for one people to dissolve the political bands which have connected them with another, …”
[ <“When”, 1>, <“in”, 1>, <“the”, 2> … ]
“We the People of the United States, in Order to form a more perfect Union, establish Justice, insure domestic Tranquility, provide for the …”
[ <“We”, 1>, <“in”, 1>, <“the”, 2> … ]
map(doc, countWords)
If we have enough machines, can we do this fast for the whole web?
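The map step can be sketched as follows. `count_words` here is a guess at what the slide's countWords does: it keeps capitalization (so "When" and "the" are distinct keys, as on the slide) and treats any run of letters as a word. The `DECLARATION` string is the first quotation above, cut off where the slide cuts it.

```python
import re
from collections import Counter

DECLARATION = (
    "When in the Course of human events, it becomes necessary for one people "
    "to dissolve the political bands which have connected them with another"
)


def count_words(doc):
    """Map step: turn one document into its own word-count table."""
    return Counter(re.findall(r"[A-Za-z]+", doc))


documents = [DECLARATION]
counts = list(map(count_words, documents))
print(counts[0]["When"], counts[0]["in"], counts[0]["the"])
```

Because `count_words` reads only its own argument and changes nothing else, each document could be counted on a different machine with no coordination.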
![Page 37: Class 39: ...and the World Wide Web](https://reader035.vdocument.in/reader035/viewer/2022062514/559258af1a28ab6a418b45c2/html5/thumbnails/37.jpg)
[ <“When”, 1>, <“in”, 1>, <“the”, 2> … ]
[ <“We”, 1>, <“in”, 1>, <“the”, 2> … ]
[ <“a”, 5>, <“in”, 3>, <“the”, 2> … ]
[ <“apple”, 1>, <“in”, 1>, <“the”, 7> … ]
reduce (first two lists): [ <“We”, 1>, <“in”, 2>, … ]
reduce (last two lists): [ <“a”, 5>, <“in”, 4>, … ]
reduce (both results): [ <“a”, 5>, <“in”, 6>, … ]
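The reduce step combines per-document counts pairwise, as in the tree on this slide. The sketch below uses the four partial counts shown above, keeping only the entries the slide lists; `merge_counts` and `tree_reduce` are illustrative names, and this is not Google's MapReduce implementation.

```python
from collections import Counter
from functools import reduce

# The four partial results from the slide (only the entries shown there).
PARTIAL_COUNTS = [
    Counter({"When": 1, "in": 1, "the": 2}),
    Counter({"We": 1, "in": 1, "the": 2}),
    Counter({"a": 5, "in": 3, "the": 2}),
    Counter({"apple": 1, "in": 1, "the": 7}),
]


def merge_counts(left, right):
    """Combine two count tables into a new one without modifying either input."""
    return left + right


def tree_reduce(counters):
    """Merge neighbors pairwise, level by level; merges within a level are independent."""
    level = list(counters)
    while len(level) > 1:
        merged = [merge_counts(level[i], level[i + 1])
                  for i in range(0, len(level) - 1, 2)]
        if len(level) % 2:
            merged.append(level[-1])  # odd one out moves up unchanged
        level = merged
    return level[0] if level else Counter()


total = tree_reduce(PARTIAL_COUNTS)
print(total["in"], total["a"], total["We"])
```

Since merging is associative and creates new tables rather than mutating old ones, the tree order gives the same answer as a left-to-right `functools.reduce`.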
![Page 38: Class 39: ...and the World Wide Web](https://reader035.vdocument.in/reader035/viewer/2022062514/559258af1a28ab6a418b45c2/html5/thumbnails/38.jpg)
MapReduce
![Page 39: Class 39: ...and the World Wide Web](https://reader035.vdocument.in/reader035/viewer/2022062514/559258af1a28ab6a418b45c2/html5/thumbnails/39.jpg)
Key to Massive Parallel Execution
Get rid of state and mutation!
![Page 40: Class 39: ...and the World Wide Web](https://reader035.vdocument.in/reader035/viewer/2022062514/559258af1a28ab6a418b45c2/html5/thumbnails/40.jpg)
Functional Programming (PS 1-4)
(define (count-matches p b)
  (list-sum (map (lambda (v) (if (eq? v b) 1 0)) p)))
Mechanical Logic
“Magic” Transistors
AND NOT
A B C R1 R0
0 0 0 0 0
0 0 1 0 1
… … … … …
Any Discrete Function
(or a b) (not (and (not a) (not b)))
Any Mechanical Computation
Interpreters
[Figure: Turing Machine with states 1, 2, 3 and tape # 1 0 1 1 0 1 1 ... 1 0 1 1 0 1 1 1 #]
def meval(expr, env):
    …
    return evalApplication(expr, env)
![Page 41: Class 39: ...and the World Wide Web](https://reader035.vdocument.in/reader035/viewer/2022062514/559258af1a28ab6a418b45c2/html5/thumbnails/41.jpg)
![Page 42: Class 39: ...and the World Wide Web](https://reader035.vdocument.in/reader035/viewer/2022062514/559258af1a28ab6a418b45c2/html5/thumbnails/42.jpg)
State and Mutation
Objects
1m1: 2 3
SimObject
PhysicalObject
Place
MobileObject
![Page 43: Class 39: ...and the World Wide Web](https://reader035.vdocument.in/reader035/viewer/2022062514/559258af1a28ab6a418b45c2/html5/thumbnails/43.jpg)
![Page 44: Class 39: ...and the World Wide Web](https://reader035.vdocument.in/reader035/viewer/2022062514/559258af1a28ab6a418b45c2/html5/thumbnails/44.jpg)
Functional Programming (PS 1-4)
Mechanical Logic
“Magic” Transistors
Any Discrete Function
Any Mechanical Computation
Interpreters
State and Mutation
Objects
Recursive Definitions
Universality
Abstraction
Charge
Now, you know almost everything you need to build the next reddit or Google!