Post on 18-Dec-2015
Bjørn Olstad
CTO, FAST Search & Transfer
Adjunct Prof., The Norwegian University of Science & Technology
Email: [email protected]
Phone: +47 48011157
Why Search Engines are used increasingly to Offload Queries from Databases
”You are viewing 5 random jobs out of 2461 jobs in total....”
High input barrier
The RDBMS Experience
IYP: A Disruptive Change
ESP: Cleansing, Mining, Relevance and Discovery
Company name, Business Category, Telephone number, Address, 20 key terms
Company name, Business Category, Telephone number, Address, 20 key terms
Product & Services
Blogs++
Company web site
What is the phone number to Will’s Barber shop?
Taylor or Gibson guitar? Good local offers?
Compare offerings
Phone / Directions
BTW: I’m using my iPAQ
ISVs: A Disruptive Change
Siebel 2000 Siebel 2005
“my” CRM Application “my” CRM Application
Information Access Layer
3rd party content
Search is a strategic enabler
Search is a tactical afterthought
Search
Revisit the Assumptions …
[Figure: growth of information over time; axis label: GIGABYTES; annotation: 80% Unstructured]
Data points: 1999; 2000: 3B; 2001: 6B; 2002: 12B; 2003: 24B
Timeline: Cave paintings, Bone tools 40,000 BCE; Writing 3500 BCE; 0 C.E.; Paper 105; Printing 1450; Electricity, Telephone 1870; Transistor 1947; Computing 1950; Internet (DARPA) Late 1960s; The Web 1993
Database milestones: SQL-70, Oracle-79, SQL-89, SQL-92, SQL-99, SQL-03
Relational algebra
Large but “finite” data sets
Structured data
Search & Explore focused
“Infinite” data sets
Unstructured & Structured
• Feeding/streaming, transaction, retrieval or analytics centric?
• Content size: M, L, VL, VVVL or Vn∞ L?
• Schema centric, Semi-structured XML, Text, Agnostic?
• Fuzzy & Value vs. Binary & Completeness?
• Discovery primitives?
• User interaction part of design target?
Extreme Capabilities?
Query Latency: RDBMS vs ESP
[Figure: number of queries (# queries, 0 to 20) by latency ([sec.], 1/16 to 32); series: ESP, RDBMS]
Test Data:
• Structured data: 5 million records; 13 fields per record
• Structured queries: 22 SQL queries (representative in ERP)
The Result:
• #1: FAST ESP w/ disk: Mean = 99 [ms], St.dev. = 36 [ms]
• #2: Oracle w/ memory mapping: Mean = 4 057 [ms], St.dev. = 9 368 [ms]
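The two result lines can be compared directly. The following is a minimal sketch that uses only the means and standard deviations quoted on the slide; the choice of summary measures (ratio of means, coefficient of variation) is an illustration and not part of the original benchmark.

```python
# Figures quoted on the slide, in milliseconds.
esp = {"mean_ms": 99.0, "stdev_ms": 36.0}
oracle = {"mean_ms": 4057.0, "stdev_ms": 9368.0}


def coefficient_of_variation(stats):
    """Standard deviation relative to the mean (dimensionless)."""
    return stats["stdev_ms"] / stats["mean_ms"]


mean_ratio = oracle["mean_ms"] / esp["mean_ms"]
esp_cv = coefficient_of_variation(esp)
oracle_cv = coefficient_of_variation(oracle)

print(f"Mean latency ratio (Oracle / ESP): {mean_ratio:.1f}x")
print(f"Coefficient of variation, ESP:    {esp_cv:.2f}")
print(f"Coefficient of variation, Oracle: {oracle_cv:.2f}")
```

A coefficient of variation above 1 means the spread exceeds the mean, which is consistent with a few very slow queries dominating the RDBMS figures.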
Query Per Second: RDBMS vs ESP
[Figure: QPS (0 to 900) for FAST and ORA at 20 users, 50 users and 100 users]
Identical HW: single node, 2 CPU, 4 GB RAM, 3 SCSI disks
Identical data: auction data from eBay, 3.6 million docs
Identical queries: 200 queries defined by Oracle
Disruptive Change
Relational Model
Queries that fit The Model
Queries that don’t fit The Model
Alternative I
• Star, snowflake schemas ++
• Cubes / datamarts ++
Incremental fixes to painful shortcomings
Adds complexity
Alternative II
• Schema agnostic
• Scalable ad-hoc querying
• BLOBS Contextual Insight
• Real-time fusion of disparate data models
• Massive fault tolerant scalability
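To make "schema agnostic" and "ad-hoc querying" concrete, here is a toy inverted index in Python. It is only a sketch of the general search-engine technique, not a description of how FAST ESP is implemented; the function names and the sample records are hypothetical.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase and split on anything that is not a letter or digit."""
    return re.findall(r"[a-z0-9]+", str(text).lower())


def build_index(records):
    """Map each token to the ids of the records containing it, in any field."""
    index = defaultdict(set)
    for record_id, record in enumerate(records):
        for value in record.values():
            for token in tokenize(value):
                index[token].add(record_id)
    return index


def search(index, query):
    """Return ids of records that contain every query token (AND semantics)."""
    tokens = tokenize(query)
    if not tokens:
        return set()
    result = set(index.get(tokens[0], set()))
    for token in tokens[1:]:
        result &= index.get(token, set())
    return result


# Hypothetical records with different field sets; no table definition needed.
records = [
    {"make": "Volvo", "model": "V70", "year": 2003},
    {"title": "Volvo dealer in Oslo", "body": "Used V70 and S60 in stock"},
    {"name": "Will's Barber shop", "category": "Hairdresser"},
]

index = build_index(records)
print(sorted(search(index, "volvo v70")))
```

The same query matches both the structured car record and the free-text dealer page, because the index is keyed on tokens rather than on columns.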
[Figure: number of queries (# queries, 0 to 20) by latency ([sec.], 1/16 to 32)]
Extreme Capabilities: ESP Design Targets
Contextual Insight
Value/Noise SNR
User Interaction
Contextual Refinement
Game Changer driven by Extreme Retrieval and on-the-fly Analytics
Powering Search Derivative Applications (SDAs)
Database Query Offloading
Example: AutoTrader.com
Car Dealers - Product Supply
RDBMS:
• HW-cost: $320K (32 CPU on 4 Sun servers)
• 90% sub-second query response; average = 12 s for the rest ….
• Relevance = Sorting
• 5 FTE to maintain
ESP:
• HW-cost: $90K
• 100% sub-second query response
• Flexible relevance and discovery
• 0.5 FTE to maintain
Content Scalability: RDBMS vs ESP
Examples of ESP deployments
• Compliance case:
  – 50B documents @ 80k average: 4 PB (around 100 web indexes)
• Storage:
  – Intelligent content addressable storage
  – XML metadata and full content
  – EMC Centera: N * 256TB (N=1..400)
• Webmining – Webfountain:
  – 60,000 : 1 in query capacity (ESP : DB)
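The capacity figures can be checked with a few lines of arithmetic. This sketch assumes that "80k" means 80 kB per document and that decimal units are used (1 PB = 10^15 bytes); neither assumption is stated on the slide.

```python
KB = 10**3
TB = 10**12
PB = 10**15

documents = 50 * 10**9      # 50B documents, from the slide
avg_doc_bytes = 80 * KB     # "80k average", read here as 80 kB (assumption)

total_bytes = documents * avg_doc_bytes
print(f"Compliance case:  {total_bytes / PB:.1f} PB")

# "around 100 web indexes" then implies roughly this much per web index.
print(f"Per web index:    {total_bytes / 100 / TB:.0f} TB")

# EMC Centera: N * 256 TB for N = 1..400, so the upper end is:
centera_max_bytes = 400 * 256 * TB
print(f"Centera at N=400: {centera_max_bytes / PB:.1f} PB")
```

Under these assumptions the compliance case comes out at 4 PB, matching the slide, and a single web index corresponds to about 40 TB.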
Contextual Search
Contextual Relevance
• “Best of Web”: Recommender / Authority
• “Best of Enterprise”: Linguistic / Statistic
Contextual Navigation
• Contextual fact discovery
• On-the-fly meta-data analysis
From ACCESS To INSIGHT
Where is the email from Peter about ROI analysis?
Any new suspicious financial transaction patterns?
FIND EXPLORE
STRUCTURED
…
FAST ESP
Single Field Search
Querying
SQL LIB
WWW (HTML, XML, WML, JavaScript)
DB DB DB DB DB
Turning around the Pyramid
HBZ.de: Leading German Library Service Center
Researchers
Librarians
From:
To:
ESP @ SCOPUS
• >200M articles / 180M citations
• 180TB capacity / 14000 journals
David Goodman standing up and declaring in public, that Scopus is the best-designed database he's ever seen …
Search Reduces Clicks to Purchase and Browsing…
• Reduced # of clicks to buy content from > 4 to < 2
• 50% reduction in ringtone browsing page views per sale
[Figure: Browsing, Week 1 to Week 10; axes: 0.00 to 4.50 and -60% to 140%; marker: Launched search]
• 100% increase in search
• 20% increase in ringtone revenue
… and Drives Revenue
[Figure: Search and Revenue, Week 1 to Week 10; axis: -60% to 140%; axis label: Clicks to Purchase; marker: Launched search]
Relevance Drives Revenue
ØKOKRIM
Firewall
Real-time Registration
Transaction Log
Message Queue
Data Validation
Queries
Results
Example: Norwegian Customs Foreign Exchange Transaction Monitoring
SECURITY ACCESS MODULE
ACL Monitor
User Monitor
Database connector
Firewall
Alerts
Business Analytics: Processing of real-time streams
Business Intelligence: ESP vs. RDBMS Technology
OBSERVATION
The Enterprise Search Platform (ESP), a relatively new concept, integrating advanced technologies typically associated with search engines, database tools, and analytical systems, is fast becoming able to solve modern business intelligence problems (using both structured and unstructured data) in a way that is fundamentally different from, and ultimately superior to, that of other currently available analytical or database software.
PREDICTION
Enterprise Search Platform and search centric application technology represents a true paradigm shift in the way data will be stored, analyzed and reported on in the future. Resulting realignments in the marketplace may be both rapid and tumultuous.
- Chief strategist, leading BI vendor
<Company>Dynegy Inc</Company>
<Person>Roger Hamilton</Person>
<Company>John Hancock Advisers Inc. </Company>
<PersonPositionCompany> <OFFLEN OFFSET="3576" LENGTH="63" /> <Person>Roger Hamilton</Person> <Position>money manager</Position> <Company>John Hancock Advisers Inc.</Company> </PersonPositionCompany>
<Company>Enron Corp</Company>
<CreditRating> <OFFLEN OFFSET="3814" LENGTH="61" /> <Company_Source>Moody's Investors Service</Company_Source> <Company_Rated>Enron Corp</Company_Rated> <Trend>downgraded</Trend> <Rank_New>Baa2</Rank_New> <__Type>bonds</__Type> </CreditRating>
<Company>Moody's Investors Service</Company>
…….
``Dynegy has to act fast,'' said Roger Hamilton, a money manager with John Hancock Advisers Inc., which sold its Enron shares in recent weeks. ``If Enron can't get financing and its bonds go to junk, they lose counterparties and their marvelous business vanishes.''
Moody's Investors Service lowered its rating on Enron's bonds to ``Baa2'' and Standard & Poor's cut the debt to ``BBB.'' in the past two weeks.
……
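The annotation fragments above are plain XML, so a consuming application can read them with a standard parser. A minimal sketch, assuming the fragments are wrapped in a hypothetical `<Annotations>` root element (the slide shows no root) and using Python's standard `xml.etree.ElementTree`; the function name `extract_facts` is also an assumption.

```python
import xml.etree.ElementTree as ET

# Fragments copied from the slide, wrapped in a hypothetical <Annotations>
# root so that they form one well-formed XML document.
ANNOTATIONS = """
<Annotations>
  <PersonPositionCompany>
    <OFFLEN OFFSET="3576" LENGTH="63" />
    <Person>Roger Hamilton</Person>
    <Position>money manager</Position>
    <Company>John Hancock Advisers Inc.</Company>
  </PersonPositionCompany>
  <CreditRating>
    <OFFLEN OFFSET="3814" LENGTH="61" />
    <Company_Source>Moody's Investors Service</Company_Source>
    <Company_Rated>Enron Corp</Company_Rated>
    <Trend>downgraded</Trend>
    <Rank_New>Baa2</Rank_New>
    <__Type>bonds</__Type>
  </CreditRating>
</Annotations>
"""


def extract_facts(xml_text):
    """Turn each annotation element into a flat dict of its fields."""
    facts = []
    for element in ET.fromstring(xml_text):
        fact = {"type": element.tag}
        for child in element:
            if child.tag == "OFFLEN":
                # Character span of the fact in the source text.
                fact["offset"] = int(child.get("OFFSET"))
                fact["length"] = int(child.get("LENGTH"))
            else:
                fact[child.tag] = (child.text or "").strip()
        facts.append(fact)
    return facts


facts = extract_facts(ANNOTATIONS)
downgrades = [f for f in facts
              if f["type"] == "CreditRating" and f.get("Trend") == "downgraded"]
for fact in downgrades:
    print(fact["Company_Rated"], "->", fact["Rank_New"])
```

Once facts are flattened like this, a question such as "which companies were downgraded?" becomes a filter over structured fields, while the offset and length still point back into the original text.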
Fact
Fact
Event
Event
<Author>George Stein</Author>
BC-dynegy-enron-offer-update5
Dynegy May Offer at Least $8 Bln to Acquire Enron (Update5)
By George Stein
SOURCE
c.2001 Bloomberg News
BODY
<Category>FINANCIAL</Category>
Text Structure
The BI “hammer” Approach
Antibiotics, Peptidyl, Eubacteria, RNA, Mg, …
Document Vector
SVD Analysis
{ λ1, λ2, ..., λn, Structured attributes }
( λ1, λ2, ..., λn )
XML
Direct access to RDBMSs for info from some telcos
XML feed from other telcos
Flat files (CSV or fixed) from the ’laggards’
Master database for persistent storage
ESP lookup
Ordered hits (by quality)
Logic for cleansing
Cleansed data to ESP
Clean data
’Error’ database for manual inspection, correction, storage/learning
Ambiguous data (close hits or unidentified)
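The flow above (lookup, hits ordered by quality, cleansing logic, clean versus ambiguous data) can be sketched in a few lines of Python. This is an illustration under stated assumptions: `difflib.SequenceMatcher` stands in for the ESP lookup, the thresholds `accept` and `margin` are invented for the example, and the master list is hypothetical.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """String similarity in [0, 1], from difflib's ratio."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def lookup(master, name):
    """Stand-in for the 'ESP lookup': master entries ordered by match quality."""
    scored = [(similarity(name, candidate), candidate) for candidate in master]
    return sorted(scored, reverse=True)


def cleanse(master, name, accept=0.85, margin=0.05):
    """Route a record to ('clean', canonical name) or ('error', original name).

    accept and margin are illustrative thresholds, not values from the slide.
    """
    hits = lookup(master, name)
    if not hits or hits[0][0] < accept:
        return ("error", name)      # unidentified
    if len(hits) > 1 and hits[0][0] - hits[1][0] < margin:
        return ("error", name)      # close hits: ambiguous
    return ("clean", hits[0][1])


# Hypothetical master list and incoming records.
master = ["Will's Barber Shop", "Oslo Guitars East", "Oslo Guitars West"]
for incoming in ["Wills Barber Shop", "Oslo Guitars Est", "Bergen Fish Market"]:
    print(incoming, "->", cleanse(master, incoming))
```

Records tagged `clean` would go on to the master database and to ESP, while records tagged `error` would land in the 'Error' database for manual inspection.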
Contextual Refinement: ETL and Semantic understanding unite
Contextual Insight: Query-time fact analysis @ sub-document level
“…entry probe carried to [Saturn]’s moon Titan as part of the…”
Concepts
Intent
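One way to read "sub-document level" is that terms must co-occur inside the same sentence rather than anywhere in the document. The sketch below contrasts the two scopes; the naive sentence splitter, the function names and the sample text are assumptions made for illustration.

```python
import re


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def cooccur_in_document(text, terms):
    """True if every term appears somewhere in the document."""
    lowered = text.lower()
    return all(term.lower() in lowered for term in terms)


def cooccur_in_sentence(text, terms):
    """Sentences that contain every term: a sub-document match."""
    return [sentence for sentence in split_sentences(text)
            if all(term.lower() in sentence.lower() for term in terms)]


# Hypothetical two-sentence document.
doc = ("The entry probe was carried to Saturn's moon Titan. "
       "A later paragraph mentions Jupiter and its moon Europa.")

print(cooccur_in_document(doc, ["Saturn", "Europa"]))
print(cooccur_in_sentence(doc, ["Saturn", "Europa"]))
print(cooccur_in_sentence(doc, ["Saturn", "Titan"]))
```

Document-level matching accepts the pair Saturn and Europa even though the two are unrelated in the text; sentence-level matching rejects it and keeps only the sentence that links Saturn to Titan.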
Revisit the Assumptions …
[Figure: growth of information over time; axis label: GIGABYTES; annotation: 80% Unstructured]
Data points: 1999; 2000: 3B; 2001: 6B; 2002: 12B; 2003: 24B
Timeline: Cave paintings, Bone tools 40,000 BCE; Writing 3500 BCE; 0 C.E.; Paper 105; Printing 1450; Electricity, Telephone 1870; Transistor 1947; Computing 1950; Internet (DARPA) Late 1960s; The Web 1993
Database milestones: SQL-70, Oracle-79, SQL-89, SQL-92, SQL-99, SQL-03
Scalable Search