The Web in the Year 2010: Challenges and Opportunities for Database Research
Gerhard Weikum
http://www-dbs.cs.uni-sb.de
2
Importance of Database Technology
3
What Have We Done to the Web?
Brave New World:
• Information at your fingertips
• Electronic commerce
• Interactive TV
• Digital libraries
• Terabyte servers
Back to Reality:
• Flooded by junk & ads
• Poor responsiveness and vulnerable to load surges
• Needles in haystacks
• Unreliable services
• Success stories require special care & investment
4
The Grand Challenge: Service Quality Guarantees
"Our ability to analyze and predict the performance of the enormously complex software systems ... are painfully inadequate" (PITAC Report)
• Continuous Service Availability
• Money-back Performance Guarantees
• Guaranteed Search Result Quality
Observation: Importance of quality guarantees is not limited to the Web
DFG graduate program at U Saarland
Prediction for 2010: Either we will have succeeded in achieving these qualities, or we will face world-wide information chaos!
5
Outline
Why I’m Speaking Here
• Money-back Performance Guarantees
• Observations and Predictions
• Continuous Service Availability
• Guaranteed Search Result Quality
• Summary of My Message
6
From Best Effort to Performance Guarantees
Example: "Internal Server Error. Our system administrator has been notified. Please try later again."
Observations:
• Web service performance is best-effort only
• Response time is unacceptable during peak load because of queueing delays
• Performance is mostly unpredictable!
Example: Check Availability (Look-Up Will Take 8-25 Seconds)
Users (and providers) need performance guarantees!
Unacceptably slow servers are like unavailable servers.
With a huge number of clients, guarantees may be stochastic.
7
Example: Video (& Audio) Server
Partitioning of continuous data objects with variable bit rate into fragments of constant time length T
Periodic scheduling in rounds of duration T (timeline: 0, T, 2T, 3T, 4T)
Server delivers fragment streams with deadlines for QoS to clients
Admission control to ensure QoS: "yes, go ahead" or "no way"
Stochastic model:
$T_{serv} = T_{seek} + \sum_{i=1}^{N} T_{rot,i} + \sum_{i=1}^{N} T_{trans,i}$
$f^*_{serv} = f^*_{seek} \cdot (f^*_{rot})^N \cdot (f^*_{trans})^N$
Chernoff bound: $P[T_{serv} \ge t] \le \inf \{ e^{-\theta t} f^*_{serv}(-\theta) \mid \theta \ge 0 \}$
...
Auto-configure server: admission control, #disks, etc.
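A minimal sketch of how the Chernoff bound on this slide could drive admission control. Everything concrete here is an assumption for illustration and not taken from the talk: the seek time per round is treated as a constant, rotational delays as uniform on [0, rot], transfer times as Gamma-distributed, f* is read as a Laplace-Stieltjes transform (so f*(-θ) is the moment-generating function), and the infimum is approximated by a grid search. The function names are hypothetical.

```python
import math

def log_mgf_uniform(theta, a):
    """log E[exp(theta*X)] for X ~ Uniform(0, a), theta > 0 (assumed rotational delay)."""
    x = theta * a
    return math.log(math.expm1(x) / x)

def log_mgf_gamma(theta, shape, scale):
    """log E[exp(theta*X)] for X ~ Gamma(shape, scale); valid for theta < 1/scale."""
    return -shape * math.log1p(-scale * theta)

def chernoff_late_bound(n, t, seek, rot, shape, scale, steps=2000):
    """Upper bound on P[T_serv >= t] for n requests in one round.

    T_serv = seek + sum of n rotational delays + sum of n transfer times.
    The infimum over theta is approximated on a grid in (0, 1/scale).
    """
    best = 0.0  # log(1): the trivial bound
    theta_max = 1.0 / scale
    for k in range(1, steps):
        theta = theta_max * k / steps
        log_b = (-theta * t + theta * seek
                 + n * log_mgf_uniform(theta, rot)
                 + n * log_mgf_gamma(theta, shape, scale))
        best = min(best, log_b)
    return math.exp(best)

def max_admissible_streams(t, p_late, **disk):
    """Largest n whose bounded probability of overrunning the round stays <= p_late."""
    n = 0
    while chernoff_late_bound(n + 1, t, **disk) <= p_late:
        n += 1
    return n

# Hypothetical disk parameters (seconds): 1 s rounds, 1% tolerated late rounds.
disk = dict(seek=0.1, rot=0.00834, shape=4, scale=0.005)
print(max_admissible_streams(1.0, 0.01, **disk))
```

Because the bound is conservative, the admitted number of streams stays below what the mean service time alone would suggest; that gap is the price of the stochastic guarantee.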
8
Observations and Predictions
Observations:
• resource dedication can simplify the problem
• stochastic modeling is a crucial asset, but realistic modeling is difficult and sometimes impossible
• "low-hanging fruit" engineering: 90% solution with 10% intellectual effort
Predictions for 2010:
• special-purpose, self-tuning servers with predictable performance
• "Web engineering" for end-to-end QoS will rediscover stochastic modeling or will fail
• stochastic guarantees for all data and services, e.g., of the form $P[\text{response time} \le 2\,s] \ge 0.95$
• money-back guarantees after trial phase
• asap alerting about necessary resource upgrading
9
Outline
Why I’m Speaking Here
Money-back Performance Guarantees
• Observations and Predictions
Continuous Service Availability
• Guaranteed Search Result Quality
• Summary of My Message
10
Vector Space Model for Content Relevance
Search engine
Query (set of weighted features): $q \in [0,1]^{|F|}$
Documents are feature vectors: $d_i \in [0,1]^{|F|}$
Similarity metric:
$sim(d_i, q) := \frac{\sum_{j=1}^{|F|} d_{ij} q_j}{\sqrt{\sum_{j=1}^{|F|} d_{ij}^2} \, \sqrt{\sum_{j=1}^{|F|} q_j^2}}$
Ranking by descending relevance
11
Vector Space Model for Content Relevance
e.g., using the tf*idf formula:
$d_{ij} := w_{ij} / \sqrt{\sum_k w_{ik}^2}$
$w_{ij} := \frac{freq(f_j, d_i)}{\max_k freq(f_k, d_i)} \cdot \log \frac{\#docs}{\#docs \text{ with } f_j}$
generalizes to multimedia search
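The vector space model on this slide can be sketched in a few lines: tf*idf weights normalized to unit length, then ranking by cosine similarity. The function names and the toy documents are assumptions for illustration; documents are assumed to be non-empty lists of terms.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Turn term lists into unit-length tf*idf vectors (dicts term -> d_ij)."""
    n_docs = len(docs)
    df = Counter(term for doc in docs for term in set(doc))  # #docs with f_j
    vectors = []
    for doc in docs:
        freq = Counter(doc)
        max_freq = max(freq.values())
        w = {t: (f / max_freq) * math.log(n_docs / df[t]) for t, f in freq.items()}
        norm = math.sqrt(sum(x * x for x in w.values()))
        vectors.append({t: x / norm for t, x in w.items()} if norm > 0 else w)
    return vectors

def cosine(d, q):
    """sim(d, q): dot product divided by the product of the vector lengths."""
    dot = sum(d.get(t, 0.0) * x for t, x in q.items())
    nd = math.sqrt(sum(x * x for x in d.values()))
    nq = math.sqrt(sum(x * x for x in q.values()))
    return dot / (nd * nq) if nd and nq else 0.0

def rank(docs, query):
    """Return (score, doc index) pairs by descending relevance."""
    vectors = tfidf_vectors(docs)
    return sorted(((cosine(v, query), i) for i, v in enumerate(vectors)), reverse=True)

# Toy corpus (hypothetical) and a query with equally weighted features.
docs = [
    ["chernoff", "bound", "tail"],
    ["fermat", "last", "theorem"],
    ["chernoff", "theorem", "large", "deviation"],
]
for score, i in rank(docs, {"chernoff": 1.0, "theorem": 1.0}):
    print(i, round(score, 3))
```

Terms that occur in every document get weight log(1) = 0, which is the intended effect of the idf factor.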
12
Link Analysis for Content Authority
Search engine
Query (set of weighted features): $q \in [0,1]^{|F|}$
Ranking by descending relevance & authority
+ Consider in-degree and out-degree of Web nodes: Authority Rank($d_i$) := stationary visit probability $P[d_i]$ in random walk on the Web
Reconciliation of relevance and authority based on ad hoc weighting
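A minimal sketch of computing a stationary visit probability by power iteration. The slide only names the random walk; the damping factor (random jumps with probability 1 - damping) and the handling of pages without out-links are assumptions added here so that the walk has a unique stationary distribution. The function name and the toy link graph are hypothetical.

```python
def authority_rank(links, damping=0.85, iters=100):
    """Approximate stationary visit probabilities of a random surfer.

    links maps each page to the list of pages it points to.
    """
    nodes = sorted(set(links) | {v for targets in links.values() for v in targets})
    n = len(nodes)
    rank = {u: 1.0 / n for u in nodes}
    for _ in range(iters):
        nxt = {u: (1.0 - damping) / n for u in nodes}  # random jump
        for u in nodes:
            out = links.get(u, [])
            if out:
                share = damping * rank[u] / len(out)
                for v in out:
                    nxt[v] += share
            else:
                # Dangling page: assume the surfer jumps to a uniformly random page.
                share = damping * rank[u] / n
                for v in nodes:
                    nxt[v] += share
        rank = nxt
    return rank

# Toy Web graph: three pages point to "c", which points back to "a".
ranks = authority_rank({"a": ["c"], "b": ["c"], "c": ["a"], "d": ["c"]})
print(max(ranks, key=ranks.get))
```

The resulting authority score would still have to be combined with the relevance score, which the slide describes as ad hoc weighting.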
13
Web Search Engines: State of the Art
q = "Chernoff theorem"
AltaVista: Fermat's last theorem. Previous topic. Next topic. ... URL: www-groups.dcs.st-and.ac.uk/~history/His...st_theorem.html
Google: ...strong convergence \cite{Chernoff}. \begin{theorem}\label{T1} Let... http://mpej.unige.ch/mp_arc/p/00-277
Yahoo: Moment-generating Functions; Chernoff's Theorem; The Kullback-... http://www.siam.org/catalog/mcc10/bahadur.htm
Lycos: SIAM Journal on Computing Volume 26, Number 2 Contents Fail-Stop Signatures ... http://epubs.siam.org/sam-bin/dbq/toc/SICOMP/26/2
Mathsearch: No matches found.
Northernlight: J. D. Biggins - Publications. Articles on the Branching Random Walk http://www.shef.ac.uk/~st1jdb/bibliog.html
Excite: The Official Web Site of Playboy Lingerie Model Mikki Chernoff http://www.mikkichernoff.com/
14
But There Is Hope
Observation:
Starting from an (intellectually maintained) directory or from promising but still unsatisfactory query results and exploring the neighborhood of these URLs would eventually lead to useful documents.
But intellectual time is expensive!
Research Avenues:
• Leverage advanced IR: automatic classification
• Organize information and leverage IT megatrends: XML
15
Ontologies and Statistical Learning for Content Classification
Figure (ontology tree): ... → Science → Mathematics → { Probability and Statistics → { Large Deviation, Hypotheses Testing, ... }, Algebra, ... }
Categories $c_k$ over the feature space $[0,1]^{|F|}$
Training sample
New docs
Feature space: term frequencies $f_j$ ...
Naive Bayes classifier:
$P[d \in c_k \mid f] = \frac{P[f \mid d \in c_k] \, P[d \in c_k]}{P[f]} = \binom{length(d)}{f_1 \ldots f_{|F|}} \, p_{1k}^{f_1} \cdots p_{|F|k}^{f_{|F|}} \, q_k \, / \, P[f]$
with multinomial prior and estimated $p_{1k}, \ldots, p_{|F|k}, q_k$
or other classifiers (e.g., SVM)
assign to highest-likelihood category
Good for query expansion and user relevance feedback
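The Naive Bayes classifier on this slide can be sketched as follows. The multinomial coefficient and P[f] are the same for every category, so they are dropped when comparing categories; log probabilities avoid underflow. The add-one smoothing (alpha), the function names, and the tiny training sample are assumptions for illustration, not part of the talk.

```python
import math
from collections import Counter, defaultdict

def train(samples, alpha=1.0):
    """Estimate log q_k (category prior) and log p_jk (term probabilities).

    samples is a list of (category, list of terms); alpha is an assumed
    smoothing constant so that unseen terms do not get probability zero.
    """
    vocab = {t for _, terms in samples for t in terms}
    doc_counts = Counter(c for c, _ in samples)
    term_counts = defaultdict(Counter)
    for c, terms in samples:
        term_counts[c].update(terms)
    model = {}
    for c in doc_counts:
        total = sum(term_counts[c].values()) + alpha * len(vocab)
        log_p = {t: math.log((term_counts[c][t] + alpha) / total) for t in vocab}
        log_q = math.log(doc_counts[c] / len(samples))
        model[c] = (log_q, log_p)
    return model

def classify(model, terms):
    """Assign a new document to the highest-likelihood category."""
    def score(c):
        log_q, log_p = model[c]
        # Terms outside the training vocabulary are ignored.
        return log_q + sum(log_p[t] for t in terms if t in log_p)
    return max(model, key=score)

# Hypothetical training sample for two leaf categories of the ontology.
samples = [
    ("Large Deviation", ["chernoff", "bound", "tail", "probability"]),
    ("Large Deviation", ["large", "deviation", "chernoff", "theorem"]),
    ("Algebra", ["group", "ring", "field", "theorem"]),
    ("Algebra", ["ring", "ideal", "polynomial"]),
]
model = train(samples)
print(classify(model, ["chernoff", "theorem"]))
```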
16
www.links2go.com: Chernoff theorem
For Better Search Quality: XML to the Rescue
<travelguide>
  <place> Zion National Park
    <location> Utah </>
    <activities> hiking, canyoneering </activities>
    <lodging>
      <lodge price = $ 80-140> Zion Lodge ... </>
      <motel price = $55> ... </>
    </lodging>
    <hikes>
      <hike type=backcountry, level=difficult> Kolob Creek ... class 5.2 ... </hike>
      ...
  </place>
  <place> Yosemite NP
    <location> California </>
    <activities> hiking, climbing ...
Figure (document as labeled graph): travelguide → place: Zion NP → { location: Utah; activities: hiking, canyoneering; lodging → motel price=$55; hikes → hike type=backcountry, level=difficult: Kolob Creek ... class 5.2 ... → trip report }
Semistructured data: elements, attributes, links organized as labeled graph
Querying XML
Regular expressions over path labels + logical conditions over element contents
Example query: difficult hikes in affordable national parks
Select P
From //travelguide.com
Where place Like "%Park%" As P
And P.#.lodging.#.(price|rate) < $70
And P.#.hike+.level? Like "%difficult%"
Figure (query evaluated on the document graph): travelguide → place: Zion National Park → { location: Utah; activities: hiking, canyoneering; lodging → motel price=$55; hikes → hike type=backcountry, level=difficult: Kolob Creek ... class 5.2 ... → trip report: ... many technical obstacles ... 15 feet dropoff ... need 100 feet rope ... }
Additional query condition: And ... #.tripreport Like ...
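A minimal sketch of the idea behind such path queries: regular expressions over path labels combined with a condition on element contents. This is not the query language of the talk; it only supports plain labels, alternation such as (price|rate), and "#" as a wildcard for an arbitrary (possibly empty) path. The tree encoding, the function names, and the sample data (attributes modeled as child nodes, only the motel price included) are assumptions for illustration.

```python
import re

def paths(node, prefix=()):
    """Yield (label path, value) for every node of a (label, value, children) tree."""
    label, value, children = node
    path = prefix + (label,)
    yield path, value
    for child in children:
        yield from paths(child, path)

def path_pattern(expr):
    """Compile a dotted path expression; '#' matches any sequence of labels."""
    parts = []
    for step in expr.split("."):
        if step == "#":
            parts.append(r"(?:[^.]+\.)*")
        else:
            parts.append("(?:" + step + r")\.")
    return re.compile("^" + "".join(parts) + "$")

def select(root, expr, pred=lambda value: True):
    """Return nodes whose label path matches expr and whose value satisfies pred."""
    pattern = path_pattern(expr)
    return [(p, v) for p, v in paths(root)
            if pattern.match("".join(label + "." for label in p)) and pred(v)]

# Hypothetical encoding of part of the travel guide document.
doc = ("travelguide", None, [
    ("place", "Zion National Park", [
        ("location", "Utah", []),
        ("lodging", None, [("motel", None, [("price", 55, [])])]),
        ("hikes", None, [("hike", "Kolob Creek", [("level", "difficult", [])])]),
    ]),
])

print(select(doc, "travelguide.place.#.lodging.#.(price|rate)", lambda v: v < 70))
print(select(doc, "#.hike.level", lambda v: "difficult" in v))
```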
XXL: Reconciling XML and IR Technologies
Result ranking of XML data based on semantic similarity
<outdoors>
  <region> Zion NP
    <state> Utah </>
    <things-to-do> hiking, canyoneering </things-to-do>
    <campground fee = $15> ... </>
    <backcountry trips>
      <trip> Kolob Creek ... challenging hike ... </trip>
      ...
  </region>
  ...
Example query: difficult hikes in affordable national parks
Select P
From //all-roots
Where ~place ~ "Park" As P
And P.#.~lodge.#.~price < $70
And P.#.~hike ~ "difficult"
And P.#.~activities ~ "climbing"
Figure (ontology excerpt): ... climbing, canyoneering
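A minimal sketch of how a similarity operator like "~" could produce a ranking instead of a yes/no answer. The source only states that results are ranked by semantic similarity; the similarity table, its scores, the multiplicative combination of per-label scores, and all names below are assumptions for illustration.

```python
# Hypothetical ontology-derived similarity scores in [0, 1].
SIMILAR = {
    ("place", "region"): 0.8,
    ("lodge", "campground"): 0.6,
    ("price", "fee"): 0.9,
    ("hike", "trip"): 0.7,
    ("climbing", "canyoneering"): 0.7,
}

def sim(a, b):
    """Symmetric similarity of two element names or terms; 1.0 for identical names."""
    if a == b:
        return 1.0
    return SIMILAR.get((a, b), SIMILAR.get((b, a), 0.0))

def score(query_labels, element_names):
    """Multiply, per query label, the similarity of its best-matching element."""
    result = 1.0
    for label in query_labels:
        result *= max(sim(label, name) for name in element_names)
    return result

def rank(query_labels, documents):
    """Return (score, document name) pairs by descending similarity."""
    scored = [(score(query_labels, names), doc) for doc, names in documents.items()]
    return sorted(scored, reverse=True)

# Element names of the two sample documents.
documents = {
    "travelguide": {"place", "location", "activities", "lodging", "lodge",
                    "motel", "price", "hikes", "hike"},
    "outdoors": {"region", "state", "things-to-do", "campground", "fee", "trip"},
}
for s, doc in rank(["place", "lodge", "price", "hike"], documents):
    print(doc, round(s, 4))
```

With exact matching only the first document would qualify; with similarity scores the second document is still returned, at a lower rank.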
20
Ontologies, Statistical Learning, and XML: The Big Synergy
Research Avenue:
• build domain-specific and personalized ontologies
• leverage XML momentum!
• automatically classify XML documents into ontology
• expand query by mapping query into ontology by adding category-specific path conditions, ex.: #.math?.#.(~large deviation)?.#.theorem.~Chernoff
• exploit user feedback
Research Issues:
• Which kind of ontology (tree, lattice, HOL, ...)?
• Which feature selection for classification? Which classifier?
• Information-theoretic foundation?
• Theory of "optimal" search results?
21
The Mega Challenge: Scalability
Observations:
• search engines cover only the "surface web": 1 billion docs, 20 TBytes
• most data is in the "deep web" behind gateways: 500 billion docs, 8 PBytes
Research Avenue:
• future search engines need a new paradigm in a new world with > 90% information in XML and a "deep web" with > 90% information behind gateways
22
Predictions for 2010
XXX (Cross-Web XML Explorer), a.k.a. Deep Search:
• future search engines will combine pre-computation (enhanced with a richer form of indexing) & additional path traversal starting from index seeds (topic-specific crawling with "semantic" pattern matching) & dynamic creation of subqueries at gateways
• should carry out large-scale experiments
• will be able to find, for every search, in one day and with < 1 min intellectual effort, the results that the best human experts can find with infinite time
• should have a theory of search result "optimality"
23
Outline
Why I’m Speaking Here
Money-back Performance Guarantees
Observations and Predictions
Continuous Service Availability
Guaranteed Search Result Quality
• Summary of My Message
Strategic Research Avenues
inspired by Jim Gray's Turing Award lecture: trouble-free systems, always up, world memex
Conceivable killer arguments:
• Infinite RAM & network bandwidth and zero latency for free
• Smarter people don't need a better Web
Challenges for 2010:
• Self-tuning servers with response time guarantees by reviving stochastic modeling and combining it with "low-hanging fruit" engineering
• Continuously available servers by new theory of recovery contracts for multi-tier federations in combination with better engineering
• Breakthrough on search quality (incl. optimality theory?) from synergy of ontologies, statistical learning, and XML