TRANSCRIPT
Presented by Mat Kelly
CS895 – Web-based Information Retrieval
Old Dominion University
September 27, 2011
The Deep Web: Surfacing Hidden Value
Michael K. Bergman
Web-Scale Extraction of Structured Data
Michael J. Cafarella, Jayant Madhavan & Alon Halevy
Papers’ Contributions
• Bergman attempts various methods of estimating the size of the Deep Web
• Cafarella proposes concrete methods of extracting Deep Web data, estimates its size more reliably, and offers a surprising caveat about the estimate
What is The Deep Web?
• Pages that are not indexed by search engines
• Created dynamically as the result of a search
• Much larger than the surface web (400-550x)
– 7,500 TB (deep) vs. 19 TB (surface) [in 2001]
• Information resides in databases
• 95% of the information is publicly accessible
Estimating the Size
• Analysis procedure for > 100 known deep web sites:
1. Webmasters queried for record count and storage size; 13% responded
2. Some sites explicitly stated their database size without the need for webmaster assistance
3. Site sizes compiled from lists provided at conferences
4. Utilizing a site's own search capability with a term known not to exist, e.g. "NOT ddfhrwxxct" (see the sketch after this list)
5. If still unknown, do not analyze
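A minimal sketch of step 4, assuming a hypothetical site whose results page prints a total record count; the query parameter name and the "of N records" phrasing are assumptions for illustration, not any specific site's interface:

```python
import re
import urllib.parse
import urllib.request

def probe_record_count(search_url, nonsense="NOT ddfhrwxxct"):
    """Query a site's own search with a term known not to exist; many
    engines still report the database's total record count on the page."""
    url = search_url + "?" + urllib.parse.urlencode({"q": nonsense})
    with urllib.request.urlopen(url) as resp:
        html = resp.read().decode("utf-8", errors="replace")
    # Hypothetical results-page phrasing: "... of 1,234,567 records"
    match = re.search(r"of\s+([\d,]+)\s+records", html)
    return int(match.group(1).replace(",", "")) if match else None
```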
Further Attempts at Size Estimation: Overlap Analysis
• Compare (pair-wise) random listings from two independent sources
• Repeat pair-wise with all previously collected sources known to contain deep web content
• From the commonality of the listings, the total size can be extrapolated
• Provides a lower bound on the size of the deep web, since the source list is incomplete
[Figure: Venn diagram of src 1 and src 2 listings within the total size]
• The share of src 2's listings that also appear in src 1 estimates src 1's coverage of the total (sketched below):
total size ≈ (src 1 listings × src 2 listings) / shared listings
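A minimal sketch of this capture-recapture style estimate, with made-up source listings standing in for the independent deep web directories:

```python
def overlap_estimate(src1, src2):
    """Estimate total population size from two independent random samples:
    N ~= |src1| * |src2| / |src1 intersect src2|."""
    shared = len(src1 & src2)
    if shared == 0:
        raise ValueError("no overlap; the estimate is unbounded")
    return len(src1) * len(src2) / shared

# Hypothetical listings of deep web sites from two directories
src1 = {"siteA", "siteB", "siteC", "siteD"}
src2 = {"siteC", "siteD", "siteE", "siteF", "siteG"}
print(overlap_estimate(src1, src2))  # 4 * 5 / 2 = 10.0 estimated sites
```

Since real directories are incomplete and tend to list the same popular sites, the overlap is inflated and the estimate deflated, which is why the slide reads it as a lower bound.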
Further Attempts at Size Estimation: Multiplier on Average Site's Size
• From a listing of 17,000 site candidates, 700 were randomly selected; 100 of these could be fully characterized
• Randomized queries were issued to these 100 sites, with results returned as HTML pages; the mean page size was calculated and used for the estimate (see the sketch after the figure)
[Figure: pipeline – 17k deep websites → 700 randomly chosen → 100 fully characterizable sites queried → results pages produced and analyzed]
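A minimal sketch of the multiplier idea; the inputs below are placeholders for illustration, not the paper's measured figures:

```python
def multiplier_estimate(mean_site_size_mb, estimated_site_count):
    """Extrapolate total deep web size from a characterized sample:
    total ~= mean size of sampled sites x estimated number of sites."""
    return mean_site_size_mb * estimated_site_count / 1_000_000  # MB -> TB

# Placeholder inputs for illustration only
print(multiplier_estimate(mean_site_size_mb=500.0,
                          estimated_site_count=100_000), "TB")  # 50.0 TB
```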
Other Methods Used For Estimation
• Pageviews (“What’s Related” on Alexa) and Link References
• Growth analysis obtained from Whois
– For 100 surface and 100 deep web sites, acquired the date each site was established
– Combined and plotted to add time as a factor in the estimation
Overall Findings From Various Analyses
• The mean deep website has a web-expressed database (HTML included) of 74.4 MB
• Actual record counts could be derived for one in seven deep websites
• On average, deep websites receive half again as much monthly traffic as surface websites
• The median deep website receives more than two times the traffic of a random surface website
The Followup Paper: Web-Scale Extraction of Structured Data
• Three systems used for extracting deep web data:
– TextRunner
– WebTables
– Deep Web Surfacing (relevant to Bergman)
• By using these methods, the data can be aggregated for use in other services, e.g.:
– Synonym finding
– Schema auto-complete
– Type prediction
TextRunner
• Parses natural-language text from crawls into n-ary tuples
– e.g. "Albert Einstein was born in 1879" becomes the tuple <Einstein, 1879> with the was_born_in relation (a toy sketch follows the demo link below)
• This has been done before, but TextRunner:
– Works in batch mode: consumes an entire crawl, produces a large amount of data
– Pre-computes good extractions before queries arrive and aggressively indexes them
– Discovers relations on-the-fly, where others are pre-programmed
– Other methods are query-driven and perform all of the work on-demand
[Figure: TextRunner search interface – Argument 1: "Einstein", Predicate: "born", Argument 2: "1879" – with search result: "Albert Einstein was born in 1879."]
Demo: http://bit.ly/textrunner
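As a toy illustration only (TextRunner itself learns extractions with a trained model over a full crawl, not hand-written patterns), a single-pattern sketch of tuple extraction:

```python
import re

# Toy open-IE extractor: one hand-written "<arg1> was born in <arg2>"
# pattern; TextRunner discovers such relations on-the-fly at scale.
PATTERN = re.compile(r"(\w[\w\s.]*?)\s+was\s+(born)\s+in\s+(\w+)")

def extract_tuples(sentence):
    m = PATTERN.search(sentence)
    if m:
        arg1, predicate, arg2 = m.groups()
        return [(arg1.strip(), f"was_{predicate}_in", arg2)]
    return []

print(extract_tuples("Albert Einstein was born in 1879"))
# [('Albert Einstein', 'was_born_in', '1879')]
```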
TextRunner’s Accuracy
Trial            Corpus Size (pages)   Tuples Extracted   Accuracy
Early Trial      9 Million             1 Million
Followup Trial   500 Million           900 Million        "Results not yet available"
http://turing.cs.washington.edu/papers/banko-thesis.pdf
Downsides of TextRunner
• Text-centric extractors rely on the binary relations of language (two nouns and a linking relation)
• Unable to extract data that conveys relations in table form (but WebTables [next] can)
• Because relations are analyzed on-the-fly, the output model is not relational
– e.g. we cannot know that Einstein fills a person attribute and 1879 a birth year
WebTables
• Designed to extract data from content within HTML's table tag
• Ignores calendars, single cells, and tables used as the basis for site design
• A general crawl of 14.1B tables contains 154M true relational databases (1.1%)
How Does WebTables work?
• Throw out tables with a single cell, calendars, and those used for layout
– Accomplished with hand-written detectors
• Label the rest as relational or non-relational using statistically trained classifiers
– Base the classification on the number of rows, columns, empty cells, number of columns with numeric-only data, etc. (see the sketch after the figure below)
[Figure: a calendar grid (thrown out as non-relational) contrasted with a table of relational data:]

          Trial 1   Trial 2   Trial 3
Group 1   9.5       5.2       6.9
Group 2   10        12        9.8
Group 3   9.6       7.3       8.7
Group 4   10        13        12

(Relational Data)
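A minimal sketch of such a feature-based filter, using hand-tuned thresholds as a stand-in for the statistically trained classifier described above (the features mirror the slide's list; the thresholds are assumptions):

```python
def looks_relational(table):
    """Crude stand-in for WebTables' statistically trained classifier:
    thresholds over row/column counts, empty cells, and numeric columns."""
    rows = len(table)
    cols = max(len(r) for r in table)
    cells = [str(c) for r in table for c in r]
    empty = sum(1 for c in cells if not c.strip())
    numeric_cols = 0
    for j in range(cols):
        vals = [str(r[j]) for r in table if j < len(r) and str(r[j]).strip()]
        if vals and all(v.replace(".", "", 1).isdigit() for v in vals):
            numeric_cols += 1
    # Hand-tuned thresholds (assumptions, not the paper's learned model)
    return rows >= 2 and cols >= 2 and empty <= rows * cols // 10 \
        and numeric_cols >= 1

data = [["Group 1", 9.5, 5.2], ["Group 2", 10, 12]]
print(looks_relational(data))  # True
```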
WebTables Accuracy
• The procedure retains 81% of the truly relational databases in the input corpus, though only 41% of its output is relational (superfluous data)
• The output contains 271M relations, including 125M of the raw input's 154M true relations (and 146M false ones)
Downsides of WebTables
• Does not recover multi-table databases
• Traditional database constraints (e.g. key constraints) cannot be expressed with the table tag
• Metadata is difficult to distinguish from table contents
– A second trained classifier can be run to determine whether metadata exists
– Human-marked filtering of true relations indicates 71% have metadata
– The secondary classifier performs well, with:
• Precision of 89%
• Recall of 85%
Obtaining Access to Deep-Web Databases
• Two Approaches
1. Create vertical search engines for specific domains (e.g. cars, books), with a semantic mapping and a mediator for each domain.
• Not scalable
• Difficult to identify domain-query mapping
2. Surfacing: pre-compute relevant form submissions then index the resulting HTML
• Leverages current search infrastructure
Surfacing Deep-Web Databases
1. Select values for each input in the form
– Trivial for select menus, challenging for text boxes
2. Perform enumeration of the inputs
– Simple enumeration is wasteful and unscalable
– A text input falls into one of two categories:
1. Generic inputs that accept most keywords
2. Typed inputs that only accept values from a particular domain
Enumerating Generic Inputs
• Examine the page for good candidate keywords to bootstrap an iterative probing process (sketched below)
• When keywords produce valid results, obtain more keywords from the results page
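A minimal sketch of the iterative probing loop, where submit_query and keywords_from are hypothetical helpers (a form fetcher and a results-page keyword extractor), not functions from the paper:

```python
def iterative_probe(seed_keywords, submit_query, keywords_from,
                    max_rounds=5):
    """Bootstrap form filling: probe with candidate keywords scraped from
    the page, then harvest new keywords from each valid results page."""
    known = set(seed_keywords)
    frontier = list(seed_keywords)
    for _ in range(max_rounds):
        next_frontier = []
        for kw in frontier:
            results_page = submit_query(kw)       # hypothetical form fetcher
            if results_page is None:              # keyword produced nothing
                continue
            for new_kw in keywords_from(results_page):  # hypothetical parser
                if new_kw not in known:
                    known.add(new_kw)
                    next_frontier.append(new_kw)
        if not next_frontier:
            break
        frontier = next_frontier
    return known
```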
Selecting Input Combination
• Crawling forms with multiple inputs is expensive and not scalable
• Introduces the notion of an input template:
– Given a set of binding inputs, a template is the set of all form submissions formed from the Cartesian product of the binding inputs' values (see the sketch after this list)
• Evaluating only the informative templates in a form results in only a few hundred form submissions per form
• The number of form submissions is proportional to the size of the database underlying the form, NOT to the number of inputs and their possible combinations
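A minimal sketch of enumerating one input template, assuming each binding input has a small set of candidate values (the form fields and values are hypothetical):

```python
from itertools import product

def enumerate_template(binding_inputs):
    """All form submissions for one input template: the Cartesian product
    of the candidate values of the chosen binding inputs."""
    names = sorted(binding_inputs)
    for combo in product(*(binding_inputs[n] for n in names)):
        yield dict(zip(names, combo))

# Hypothetical form: a template binding two inputs out of a larger form
template = {"make": ["honda", "toyota"], "state": ["VA", "MD", "DC"]}
for submission in enumerate_template(template):
    print(submission)  # 2 x 3 = 6 submissions for this template
```

Roughly, a template is informative when its submissions yield sufficiently distinct results pages; uninformative templates are pruned before their full Cartesian products are ever generated.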
Extraction Caveats
• Semantics are lost when only the results pages are used
• A future challenge for annotations is finding the right kind of annotation that can be used most effectively by an IR-style index
In Summary
• The Deep Web is large, much larger than the surface web
• Bergman gave various means of estimating the size of the deep web and some methods of accomplishing this
• Cafarella et al. provided a much more structured approach to surfacing the content, not just to estimate its magnitude but also to integrate its contents
• Cafarella suggests a better way to estimate deep web size, independent of the number of fields and their possible combinations
References
• Bergman, M. K. (2001). The Deep Web: Surfacing Hidden Value. Journal of Electronic Publishing 7, 1-17. Available at: http://www.press.umich.edu/jep/07-01/bergman.html.
• Cafarella, M. J., Madhavan, J., and Halevy, A. (2009). Web-scale extraction of structured data. ACM SIGMOD Record 37, 55. Available at: http://portal.acm.org/citation.cfm?doid=1519103.1519112.