Scalable Web Crawling and Basic Transactions
Zachary G. Ives
University of Pennsylvania
CIS 455 / 555 – Internet and Web Systems
October 6, 2015
Administrivia
Emailed list of project partners due Friday
For those without 4-person groups, I will try to assign / merge groups over the weekend
This might result in breaking up some 3-person groups
Mercator: Scalable Web Crawler
Expands a “URL frontier”
  Avoids re-crawling same URLs
Also considers whether a document has been seen before
  Every document has signature/checksum info computed as it’s crawled
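The "seen before" check can be sketched as a set of document fingerprints. This is a minimal illustration, not Mercator's actual data structure: the class name `ContentSeenTest` is hypothetical, and MD5 is used only as a checksum.

```python
import hashlib


class ContentSeenTest:
    """Remembers one fingerprint per document body seen so far (hypothetical helper)."""

    def __init__(self):
        self._fingerprints = set()

    def is_new(self, body: bytes) -> bool:
        # MD5 serves as a checksum of the content here, not as a security measure.
        fingerprint = hashlib.md5(body).digest()
        if fingerprint in self._fingerprints:
            return False
        self._fingerprints.add(fingerprint)
        return True


seen = ContentSeenTest()
first = seen.is_new(b"<html>hello</html>")
mirror = seen.is_new(b"<html>hello</html>")    # same bytes reached via another URL
other = seen.is_new(b"<html>goodbye</html>")
```

Because the fingerprint depends only on the bytes, the same page served under two different URLs is stored once.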
Mercator Web Crawler
1. Dequeue frontier URL
2. Fetch document
3. Record into RewindInputStream (RIS)
4. Check against fingerprints to verify it’s new
5. Extract hyperlinks
6. Filter unwanted links
7. Check if URL repeated (compare its hash)
8. Enqueue URL
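The eight steps can be sketched as a single-threaded loop. This is a simplified illustration under stated assumptions: `FAKE_WEB` is a hypothetical in-memory stand-in for the network, link extraction is a naive regular expression, and the "unwanted link" filter simply keeps `http://` URLs.

```python
import hashlib
import re
from collections import deque

# Hypothetical in-memory "web" so the sketch runs without network access.
PAGE_X = '<a href="http://a.example/">home</a> <a href="mailto:me@a.example">mail</a>'
FAKE_WEB = {
    "http://a.example/": '<a href="http://a.example/x">x</a> <a href="http://b.example/">b</a>',
    "http://a.example/x": PAGE_X,
    "http://b.example/": PAGE_X,  # a mirror: same content under a different URL
}
HREF = re.compile(r'href="([^"]+)"')


def digest(text):
    return hashlib.md5(text.encode("utf-8")).digest()


def crawl(seed, fetch):
    frontier = deque([seed])
    url_hashes = {digest(seed)}
    doc_fingerprints = set()
    stored = []
    while frontier:
        url = frontier.popleft()                  # 1. dequeue frontier URL
        body = fetch(url)                         # 2. fetch document
        if body is None:
            continue
        # 3. the body is kept in memory so later steps can re-read it
        fingerprint = digest(body)                # 4. check against fingerprints
        if fingerprint in doc_fingerprints:
            continue
        doc_fingerprints.add(fingerprint)
        stored.append(url)
        for link in HREF.findall(body):           # 5. extract hyperlinks
            if not link.startswith("http://"):    # 6. filter unwanted links
                continue
            link_hash = digest(link)              # 7. check if URL repeated
            if link_hash not in url_hashes:
                url_hashes.add(link_hash)
                frontier.append(link)             # 8. enqueue URL
    return stored


crawled = crawl("http://a.example/", FAKE_WEB.get)
```

In this run the mirror page at `http://b.example/` is fetched but not stored, because its fingerprint matches a document already seen.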
Mercator’s Polite Frontier Queues
Tries to go beyond breadth-first approach – want to have only one crawler thread per server
Distributed URL frontier queue:
  One subqueue per worker thread
  The worker thread is determined by hashing the hostname of the URL
  Thus, only one outstanding request per web server
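A minimal sketch of the hostname-to-subqueue mapping follows. The worker count `NUM_WORKERS` and the function names are assumptions for illustration. A stable hash from `hashlib` is used because Python's built-in `hash()` for strings varies between processes.

```python
import hashlib
from collections import deque
from urllib.parse import urlsplit

NUM_WORKERS = 4  # assumed number of worker threads


def worker_for(url, num_workers=NUM_WORKERS):
    # Hash only the hostname, so every URL on one server maps to the same worker.
    host = urlsplit(url).hostname or ""
    stable = hashlib.md5(host.encode("utf-8")).digest()
    return int.from_bytes(stable[:8], "big") % num_workers


subqueues = [deque() for _ in range(NUM_WORKERS)]


def enqueue(url):
    subqueues[worker_for(url)].append(url)


for u in ["http://example.com/a", "http://example.com/b?x=1", "http://example.org/"]:
    enqueue(u)
```

One worker may end up serving several hosts, but a given host is only ever contacted by one worker, which is what bounds the number of outstanding requests per server.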
Mercator’s HTTP Fetcher
First, needs to ensure robots.txt is followed
  Caches the contents of robots.txt for various web sites as it crawls them
Designed to be extensible to other protocols
Had to write their own HTTP requestor in Java, because their Java version didn’t have timeouts
  Today, can use setSoTimeout()
Can also use Java non-blocking I/O if you wish:
  http://www.owlmountain.com/tutorials/NonBlockingIo.htm
  But they use multiple threads and synchronous I/O
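The robots.txt caching idea can be sketched with Python's standard `urllib.robotparser` (the slide discusses Java; Python is used here only for brevity). The class `RobotsCache`, the user-agent string, and the injected `fetch_robots_txt` callable are all hypothetical; a real crawler would download `/robots.txt` over HTTP with a timeout.

```python
from urllib.parse import urlsplit
from urllib.robotparser import RobotFileParser


class RobotsCache:
    """Parses each host's robots.txt once and reuses the result (hypothetical helper)."""

    def __init__(self, fetch_robots_txt):
        self._fetch = fetch_robots_txt  # assumed callable: host -> robots.txt text
        self._parsers = {}

    def allowed(self, url, user_agent="cis455crawler"):
        host = urlsplit(url).netloc
        parser = self._parsers.get(host)
        if parser is None:
            parser = RobotFileParser()
            parser.parse(self._fetch(host).splitlines())
            self._parsers[host] = parser
        return parser.can_fetch(user_agent, url)


fetch_count = {"n": 0}


def fake_fetch(host):
    # Stand-in for an HTTP GET of http://<host>/robots.txt
    fetch_count["n"] += 1
    return "User-agent: *\nDisallow: /private/\n"


robots = RobotsCache(fake_fetch)
ok = robots.allowed("http://example.com/index.html")
blocked = robots.allowed("http://example.com/private/data.html")
```

Both lookups hit the same host, so the robots.txt text is fetched and parsed only once.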
Other Caveats
Infinitely long URL names (good way to get a buffer overflow!)
Aliased host names
Alternative paths to the same host
  Can catch most of these with signatures of document data (e.g., MD5)
Crawler traps (e.g., CGI scripts that link to themselves using a different name)
  May need to have a way for human to override certain URL paths – see Section 5 of paper
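Two of these caveats can be handled by a filter applied before a URL is enqueued. The sketch below is only one possible approach: the length cap `MAX_URL_LENGTH` and the operator-maintained `BLOCKED_PREFIXES` list are assumed values, not taken from the paper.

```python
MAX_URL_LENGTH = 2048  # assumed cap; rejects absurdly long URLs before they are stored
BLOCKED_PREFIXES = [
    "http://trap.example/cgi-bin/",  # hypothetical entry added by a human operator
]


def should_crawl(url):
    if len(url) > MAX_URL_LENGTH:
        return False
    # A human can override the crawler for known traps by listing URL prefixes.
    return not any(url.startswith(prefix) for prefix in BLOCKED_PREFIXES)


normal = should_crawl("http://example.com/")
too_long = should_crawl("http://example.com/" + "a/" * 2000)
trapped = should_crawl("http://trap.example/cgi-bin/loop?id=7")
```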
Mercator Document Statistics
    PAGE TYPE     PERCENT
    text/html     69.2%
    image/gif     17.9%
    image/jpeg     8.1%
    text/plain     1.5%
    pdf            0.9%
    audio          0.4%
    zip            0.4%
    postscript     0.3%
    other          1.4%
[Figure: histogram of document sizes (60M pages)]
Further Considerations
May want to prioritize certain pages as being most worth crawling
  Focused crawling tries to prioritize based on relevance
May need to refresh certain pages more often
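Prioritized crawling can be sketched by replacing the FIFO frontier with a priority queue. How the relevance score is computed is left open here; the scores below are made-up numbers and `PriorityFrontier` is a hypothetical name.

```python
import heapq
import itertools


class PriorityFrontier:
    """Frontier that hands out the highest-scoring URL first (hypothetical helper)."""

    def __init__(self):
        self._heap = []
        self._counter = itertools.count()  # tie-breaker: equal scores come out in FIFO order

    def push(self, url, score):
        # heapq is a min-heap, so the score is negated to pop the largest first.
        heapq.heappush(self._heap, (-score, next(self._counter), url))

    def pop(self):
        return heapq.heappop(self._heap)[2]


frontier = PriorityFrontier()
frontier.push("http://example.com/about", 0.1)
frontier.push("http://example.com/crawler-paper", 0.9)
frontier.push("http://example.com/news", 0.5)
order = [frontier.pop() for _ in range(3)]
```

The same structure could serve refresh scheduling by using "time until next refresh" as the priority instead of relevance.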
Web Search Summarized
Two important factors:
  Indexing and ranking scheme that allows most relevant documents to be prioritized highest
  Crawler that manages to (1) be well-mannered, (2) avoid traps, (3) scale
We’ll be using Pastry to distribute the work of crawling and to distribute the data (what Google calls “barrels”)
We Need More Than Synchronization
What needs to happen when you…
  Click on “purchase” on Amazon?
    Suppose you purchased by credit card?
  Use online bill-paying services from your bank?
  Place a bid in an eBay-like auction system?
  Order music from iTunes?
    What if your connection drops in the middle of downloading?
Is this more than a case of making a simple Web Service (-like) call?
Transactions Are a Means of Handling Failures
There are many applications (especially financial ones) where we want to create atomic operations that either commit or roll back
This is one of the most basic services provided by database management systems, but we want to do it in a broader sense
Part of “ACID” semantics…
ACID Semantics
Atomicity: operations are atomic, either committing or aborting as a single entity
Consistency: the state of the data is internally consistent
Isolation: all operations act as if they were run by themselves
Durability: all writes stay persistent!
A Problem Confronted by eBay
eBay wants to sell an item to:
  The highest bidder, once the auction is over, or
  The person who’s first to click “Buy It Now!”
But: What if the bidder doesn’t have the cash?
A solution:
  Record the item as sold
  Validate the PayPal or credit card info with a 3rd party
  If not valid, discard this bidder and resume in prior state
“No Payment” Isn’t the Only Source of Failure
Suppose we start to transfer the money, but a server goes down…
Purchase:
  sb = Seller.bal
  bb = Buyer.bal
  Write Buyer.bal = bb - $100
  Write Item.sellTo = Buyer
  Write Seller.bal = sb + $100
CRASH!
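The effect of wrapping the purchase in a transaction can be sketched with Python's built-in `sqlite3` module. This is an illustration under assumptions: the table layout and starting balances are invented, and the "crash" is simulated with an exception inside the process. A real server crash would instead be handled by the database's log-based recovery on restart.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE account (name TEXT PRIMARY KEY, bal INTEGER)")
conn.execute("CREATE TABLE item (id INTEGER PRIMARY KEY, sell_to TEXT)")
conn.executemany("INSERT INTO account VALUES (?, ?)", [("seller", 500), ("buyer", 300)])
conn.execute("INSERT INTO item VALUES (1, NULL)")
conn.commit()


def purchase(conn, price, crash_midway=False):
    try:
        with conn:  # commits if the block finishes, rolls back if an exception escapes
            conn.execute("UPDATE account SET bal = bal - ? WHERE name = 'buyer'", (price,))
            conn.execute("UPDATE item SET sell_to = 'buyer' WHERE id = 1")
            if crash_midway:
                raise RuntimeError("simulated failure before the seller is paid")
            conn.execute("UPDATE account SET bal = bal + ? WHERE name = 'seller'", (price,))
        return True
    except RuntimeError:
        return False


def balances(conn):
    return dict(conn.execute("SELECT name, bal FROM account"))


crashed = purchase(conn, 100, crash_midway=True)
after_crash = balances(conn)   # the buyer's debit was rolled back
completed = purchase(conn, 100)
after_ok = balances(conn)
```

Without the transaction, the failed run would leave the buyer $100 poorer and the seller unpaid.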
Providing Atomicity and Consistency
Database systems provide transactions, which can be aborted upon some failure condition
  Based on transaction logging: record all operations and undo them as necessary
Database systems also use the log to perform recovery from crashes
  Undo all of the steps of a partially complete transaction
  The transaction can then be redone in its entirety
  This is part of a recovery protocol called ARIES
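The undo side of transaction logging can be sketched in a few lines. This is a toy in-memory model, not ARIES: `LoggedStore` is a hypothetical class, the log is not written to stable storage, and there is no redo pass.

```python
class LoggedStore:
    """Key-value store that keeps an undo log for the current transaction (toy model)."""

    def __init__(self, data):
        self.data = dict(data)
        self.undo_log = []  # (key, old_value) entries, oldest first

    def write(self, key, value):
        # Record the old value before overwriting it (assumes the key already exists).
        self.undo_log.append((key, self.data[key]))
        self.data[key] = value

    def commit(self):
        self.undo_log.clear()

    def abort(self):
        # Undo in reverse order so the earliest old value is restored last.
        while self.undo_log:
            key, old_value = self.undo_log.pop()
            self.data[key] = old_value


store = LoggedStore({"Buyer.bal": 300, "Seller.bal": 500})
store.write("Buyer.bal", 200)
store.abort()                      # failure before the seller is credited
after_abort = dict(store.data)

store.write("Buyer.bal", 200)      # the transaction is run again in its entirety
store.write("Seller.bal", 600)
store.commit()
after_commit = dict(store.data)
```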
The Need for Isolation
Suppose eBay seller S has a bank account that we’re depositing money into, as people buy:
  S = Accounts.Get(1234)
  Write S.bal = S.bal + $50
What if two purchases occur simultaneously, from two different servers on different continents?
Concurrent Deposits
This update code is represented as a sequence of read and write operations on “data items” (which for now should be thought of as individual accounts):
    Deposit 1                 Deposit 2
    Read(S.bal)               Read(S.bal)
    S.bal := S.bal + $50      S.bal := S.bal + €10
    Write(S.bal)              Write(S.bal)
where S is the data item representing the seller’s account # 1234
A “Bad” Concurrent Execution
Only one action (e.g. a read or a write) can actually happen at a time for a given database, and we can interleave deposit operations in many ways:
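    Deposit 1                 Deposit 2
    Read(S.bal)               Read(S.bal)
    S.bal := S.bal + $50      S.bal := S.bal + €10
    Write(S.bal)              Write(S.bal)
    (time runs downward)
BAD! If both deposits read S.bal before either one writes it, the later write overwrites the earlier one and one deposit is lost.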
A “Good” Execution
The previous execution would have been fine if the accounts were different (i.e., one were S and one were T), that is, if the transactions were independent
The following execution is a serial execution, and executes one transaction after the other:
    Deposit 1                 Deposit 2
    Read(S.bal)
    S.bal := S.bal + $50
    Write(S.bal)
                              Read(S.bal)
                              S.bal := S.bal + €10
                              Write(S.bal)
    (time runs downward)
GOOD!
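The difference between an interleaved and a serial execution can be checked by simulating the read and write steps one at a time. This sketch makes some simplifying assumptions: the currency symbols are dropped, so the deposits are plain amounts of 50 and 10, the starting balance of 100 is invented, and the interleaved schedule is the one where both reads happen before either write.

```python
def run(schedule, start_balance):
    """Executes (transaction, operation) steps in order and returns the final S.bal."""
    db = {"S.bal": start_balance}
    local = {}                          # each transaction's private copy of S.bal
    amounts = {"D1": 50, "D2": 10}      # deposit amounts, currency ignored
    for txn, op in schedule:
        if op == "read":
            local[txn] = db["S.bal"]
        elif op == "add":
            local[txn] += amounts[txn]
        elif op == "write":
            db["S.bal"] = local[txn]
    return db["S.bal"]


interleaved = [("D1", "read"), ("D2", "read"),
               ("D1", "add"), ("D2", "add"),
               ("D1", "write"), ("D2", "write")]
serial = [("D1", "read"), ("D1", "add"), ("D1", "write"),
          ("D2", "read"), ("D2", "add"), ("D2", "write")]

bad = run(interleaved, 100)   # Deposit 2 overwrites Deposit 1's update
good = run(serial, 100)       # both deposits are reflected
```

Starting from 100, the serial schedule ends at 160, while the interleaved one ends at 110 because Deposit 1's write is lost.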
Good Executions
An execution is “good” if it is serial (transactions are executed atomically and consecutively) or serializable (i.e. equivalent to some serial execution)
    Deposit 1                 Deposit 3
    read(S.bal)               read(T.bal)
    S.bal := S.bal + $50      T.bal := T.bal + €10
    write(S.bal)              write(T.bal)
Equivalent to executing Deposit 1 then 3, or vice versa
Why would we want to do this instead?
Concurrency Control
A means of ensuring that transactions are serializable
There are many methods, of which we’ll see one
  Lock-based concurrency control (2-phase locking)
  Optimistic concurrency control (no locks – based on timestamps)
  Multiversion CC
  …
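Lock-based concurrency control can be sketched with one lock per data item. This is a minimal illustration using Python threads, with a hypothetical `Account` class and invented amounts. With only a single data item the two phases of 2-phase locking are trivial: the lock is acquired before any read or write, and no lock is acquired after it has been released.

```python
import threading


class Account:
    def __init__(self, bal):
        self.bal = bal
        self.lock = threading.Lock()  # one lock per data item


def deposit(account, amount):
    account.lock.acquire()            # growing phase: take every lock the transaction needs
    try:
        current = account.bal         # Read(S.bal)
        account.bal = current + amount  # Write(S.bal)
    finally:
        account.lock.release()        # shrinking phase: release, and acquire nothing afterwards


S = Account(100)
threads = [threading.Thread(target=deposit, args=(S, amount))
           for amount in [50, 10] * 20]
for t in threads:
    t.start()
for t in threads:
    t.join()
final_balance = S.bal
```

Because each deposit holds the lock across its read and its write, the 40 deposits behave as if run one after another, in some order, and none is lost.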