© Chair for Database Systems and Information Management 1
Database Systems and Information Management, Technische Universität Berlin
Situational Business Intelligence
Volker Markl, Technische Universität Berlin

Agenda
► Traditional Business Intelligence
► Next Generation Business Intelligence
► Building Blocks
  - Cloud Computing, MapReduce and Hadoop, Pig Latin, UIMA, Social Tagging
► The Long Tail of Situational Applications
► Situational Business Intelligence
► Challenges

Traditional Business Intelligence

How Did We Get Here?
[Figure: actual and forecasted BI tools software revenue as reported by IDC, 1990 to 2008; y-axis from 0 to 7000 (unit not given in the transcript). Chart labels: Batch Reporting; Client Server Business Intelligence; Query/Reporting, OLAP; Web enabled Business Intelligence; BI over Text. Sources: Gartner, IDC.]

2008 CIO Priorities
2008 CIO Technology Priorities
Survey question: To what extent will each of the following technologies be a Top 5 priority for you in 2008?

Technology                                          2008 Increase*  Rank 2008  Rank 2007  Rank 2006
Business Intelligence Applications                  11.20%          1          1          1
Enterprise Applications (ERP, SCM, and CRM)         8.02%           2          2          **
Server and Storage Technologies (Virtualization)    8.45%           3          5          9
Legacy Application Modernization                    5.79%           4          3          10
Security Technologies                               8.53%           5          6          2
Technical Infrastructure                            4.67%           6          8          12
Networking, Voice, and Data Communications (VoIP)   6.83%           7          4          8
Collaboration Technologies                          7.75%           8          10         4
Document Management                                 7.91%           9          9          **
Service-Oriented Technologies (SOA and SOBA)        6.71%           10         7          6

* Unweighted average budget change
** New question for 2007
Source: 2008 Gartner Executive Programs CIO Survey, January 10, 2008

What are CIOs missing?
Survey question: "Please give me an example of how your business intelligence solution could better meet your organization's main objective?"
Better/more information    22.9%
Faster/quick retrieval     14.3%
Accurate/updated data      11.4%
Consistent platform         8.6%
Better integration          8.6%
Standardization             8.6%
Other single mentions      40.0%
Source: Business Intelligence Survey, IDC

Next Generation Business Intelligence
The next generation of Business Intelligence (NGBI) correlates data warehouses with text and semi-structured data from web services of corporate intranets and the Internet.
Example queries:
► Who is leading in American Idol?
► Who are the biggest players in the Linux market?
► Which insurance policy customers are at risk of being hit by a current storm?
[Figure: Internet and intranet documents (text, XLS, XML) pass through Information Extraction and Semantic Integration, by load/refresh or ad hoc, into an analysis schema and entities next to the data warehouse and data marts.]

Answering an NGBI Query
Query: Who are the biggest players in the "Linux" market?
Data: Web 2.0 documents from 332 Wiki News docs (January to March 2007)

Data Source Identification
► Data Warehouse
► Master data
► Information Providers
► Information Marketplaces
► Crawling (Internet/Intranet)
Process steps: Data Source Identification -> Atomic Entity Extraction -> Schema Extraction -> Data Cleansing -> Data Fusion

Atomic Entity Extraction
Out-of-the-box data
► Web services for complex, atomic and named entities
Frameworks
► Infrastructures for extracting, managing and scalable storage of named entities
► Web services for extracting named entities
Basic Components
► Screen scraper
[Diagram axis: additional extraction and data cleansing effort]

Ad Hoc Analysis Process
Data Source Identification -> Atomic Entity Extraction -> Schema Extraction -> Data Cleansing -> Data Fusion

Schema Extraction
Company Technology -> Technology Company Technology -> Company

Data Cleansing
Duplicates

Data Fusion
Information integration: Data Source A, Data Source B -> Schema Mapping -> Duplicate Detection -> Data Fusion
Example (two matching records):
  Apple, iPhone 3 Gen, 299.95
  Apple, iPhone 3G, 199.99
Fused record: Apple, iPhone 3 Gen, 199.99 (resolution functions: max length, min)
e.g., Hummer (U Potsdam)
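
The fusion example can be sketched in a few lines of Python. This is only an illustration: the attribute names (`vendor`, `name`, `price`) and the `fuse` helper are assumptions, not part of any system named on the slide. It applies "max length" to text attributes and "min" to the price, as in the example.

```python
# Illustrative sketch only: attribute names and helper names are assumptions.

def longest(values):
    """Resolve a conflict by keeping the longest string ("max length")."""
    return max(values, key=len)

# One resolution function per attribute; the price conflict is resolved with min.
RESOLVERS = {"vendor": longest, "name": longest, "price": min}

def fuse(records, resolvers=RESOLVERS):
    """Fuse records that duplicate detection has already matched."""
    fused = {}
    for attr, resolve in resolvers.items():
        values = [r[attr] for r in records if r.get(attr) is not None]
        fused[attr] = resolve(values) if values else None
    return fused

record_1 = {"vendor": "Apple", "name": "iPhone 3 Gen", "price": 299.95}
record_2 = {"vendor": "Apple", "name": "iPhone 3G", "price": 199.99}
fused = fuse([record_1, record_2])
print(fused)
```

Keeping the resolution functions in a table makes the choice per attribute explicit, which is the part of data fusion that usually needs domain knowledge.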

Data Fusion
► Integration of complementary tuples: (a, b, c, -) and (a, b, -, d) -> (a, b, c, d)
► Elimination of identical tuples: (a, b, -, -) and (a, b, -, -) -> (a, b, -, -)
► Elimination of subsumed tuples: (a, b, c, -) and (a, b, -, -) -> (a, b, c, -)
► Conflict resolution: (a, b, c, -) and (a, e, -, d) -> (a, f(b,e), c, d)
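
A minimal Python sketch of the four fusion cases, assuming that a missing value ("-") is represented by `None` and that both tuples describe the same real-world object. The function names and the placeholder resolution function `f` are illustrative assumptions.

```python
# Sketch under assumptions: None stands for a missing value ("-").

def relation(t1, t2):
    """Classify how two tuples about the same object relate to each other."""
    if t1 == t2:
        return "identical"
    pairs = list(zip(t1, t2))
    if any(x is not None and y is not None and x != y for x, y in pairs):
        return "conflicting"
    if all(y is None or x == y for x, y in pairs):
        return "t1 subsumes t2"
    if all(x is None or x == y for x, y in pairs):
        return "t2 subsumes t1"
    return "complementary"

def fuse_tuples(t1, t2, f=lambda x, y: f"f({x},{y})"):
    """Merge two tuples; f is a placeholder conflict resolution function."""
    fused = []
    for x, y in zip(t1, t2):
        if x is None:
            fused.append(y)
        elif y is None or x == y:
            fused.append(x)
        else:
            fused.append(f(x, y))
    return tuple(fused)

print(relation(("a", "b", "c", None), ("a", "b", None, "d")))
print(fuse_tuples(("a", "b", "c", None), ("a", "e", None, "d")))
```

In practice `f` would be a concrete function such as min, max, or "prefer source A"; the string placeholder only mirrors the f(b,e) notation.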

Address Uncertainty: Query Refinement
► Extract -> SELECT -> PROJECT -> JOIN -> (COUNT, AVG, SUM, MEAN, ...)
► "Everything" about Dell?
► The market of "Linux" from 2007 to 2008?
► "What's the average analyst quote about the IBM stock price for the last month?"
► Drill down on region, time, organization, ...
[Diagram labels: DATA, QUERY, USER]

Building Blocks
► Cloud Computing
► MapReduce
► Pig
► UIMA
► Social Tagging

Cloud Computing
► What is Cloud Computing?
  - Computing platform architecture
  - Scales to any application
  - High fault tolerance
  - No generally accepted definition available
  - Separation from Utility or Grid Computing is not obvious

Cloud Computing
► How does Cloud Computing work?
  - Lots of loosely coupled computers
  - Use of commodity hardware
  - Flexible up- or downgrading of resources
  - APIs offer access to cloud computing systems
  - Software takes care of parallelization, hardware failures and error handling
  - Resources (e.g. storage, computing power) can be bought as services (paying for usage, e.g. Amazon)

MapReduce - Programming Model
► Program logic is split into 2 functions: Map(k,v) and Reduce(k,list(v))
► Functions receive and produce (key, value) pairs
► Map(k,v) computes for each (k,v) pair an intermediate (ki,vi) pair
► Reduce(k,list(v)) merges all values with the same key k and outputs the result
► MapReduce programs are easy to develop
  - Frameworks provide libraries
  - Frameworks take care of parallelization, distribution and error handling
  - Only application-specific source code is required (no parallelization and error handling code)

MapReduce - Group AVG Example
Input data: (NewYork, US, 10), (LosAngeles, US, 40), (London, GB, 20), (Berlin, DE, 60), (Glasgow, GB, 10), (Munich, DE, 30), ...
MAP(k,v) emits intermediate (K,V) pairs: (US,10) (US,40) (GB,20) (GB,10) (DE,60) (DE,30)
Pairs grouped by key: (US,10)(US,40) | (GB,20)(GB,10) | (DE,60)(DE,30)
REDUCE(k,list(v)) result: (DE,45) (GB,15) (US,25)
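
The Group AVG example can be simulated on a single machine. The sketch below is a plain-Python illustration of the programming model, not Hadoop code; `run_mapreduce` is a hypothetical stand-in for what a framework does between the two functions (grouping intermediate pairs by key).

```python
from collections import defaultdict

def map_fn(key, value):
    """Map: emit (country, amount) for each input record."""
    city, country, amount = value
    yield country, amount

def reduce_fn(key, values):
    """Reduce: average all amounts that share the same country key."""
    return key, sum(values) / len(values)

def run_mapreduce(records, map_fn, reduce_fn):
    """Single-process stand-in for the framework: map, group by key, reduce."""
    groups = defaultdict(list)
    for key, value in records:
        for out_key, out_value in map_fn(key, value):
            groups[out_key].append(out_value)
    return dict(reduce_fn(k, vs) for k, vs in sorted(groups.items()))

rows = [
    ("NewYork", "US", 10), ("LosAngeles", "US", 40), ("London", "GB", 20),
    ("Berlin", "DE", 60), ("Glasgow", "GB", 10), ("Munich", "DE", 30),
]
records = list(enumerate(rows))  # the input key is just the row number here
result = run_mapreduce(records, map_fn, reduce_fn)
print(result)
```

Only `map_fn` and `reduce_fn` are application code; in a real engine the grouping step is distributed and fault tolerant.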

MapReduce
► MapReduce Programming Model
  - For processing of huge amounts of data
  - Massive parallelization of computing tasks
  - Applicable to many real-world applications
  - MapReduce programs are easy to implement
► MapReduce Engine
  - Environment to run MapReduce programs
  - Distributes computing tasks
  - Errors are transparently handled
  - Very scalable architecture
  - Examples: Google MapReduce & Apache Hadoop

Hadoop
► What is Hadoop?
  - Free software framework for data-intensive applications
  - Enables distributed processing of vast amounts of data on cloud computing architectures
  - Supports clouds with 1000+ nodes
  - Two components:
    1) Hadoop Distributed File System (HDFS)
    2) MapReduce Engine
► Where can you get Hadoop?
  - Top-level Apache Project: http://hadoop.apache.org/core/

Hadoop - HDFS
► Inspired by Google File System
► Distributed storage for large files
► Files are split up in multiple parts (default size 64MB)
► Parts are spread over the HDFS nodes
► Each part replicated (default 3 times)
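
To make the splitting and replication arithmetic concrete, here is a small Python sketch. It is not HDFS code: the round-robin placement and the node names are assumptions chosen for simplicity, and the real placement policy is more elaborate.

```python
import math

BLOCK_SIZE = 64 * 1024 * 1024  # default part size from the slide: 64 MB
REPLICATION = 3                # default number of replicas from the slide

def place_blocks(file_size, nodes, block_size=BLOCK_SIZE, replication=REPLICATION):
    """Split a file into parts and assign each part to `replication` nodes.

    Round-robin placement is an illustrative assumption; it yields distinct
    nodes per part as long as replication <= len(nodes).
    """
    n_blocks = math.ceil(file_size / block_size)
    return {
        block: [nodes[(block + r) % len(nodes)] for r in range(replication)]
        for block in range(n_blocks)
    }

nodes = ["n1", "n2", "n3", "n4"]              # hypothetical node names
placement = place_blocks(200 * 1024 * 1024, nodes)  # a 200 MB file
for block, holders in placement.items():
    print(block, holders)
```

A 200 MB file needs four parts of at most 64 MB each, and every part ends up on three different nodes.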

Hadoop - MapReduce Engine
► Runs MapReduce programs
► Libraries for Java and C++
► Assigns Map and Reduce tasks to computing nodes
► Reduction of data transfer volume
  - Tasks are assigned to nodes holding the data
► Node failures are transparently handled
  - Tasks are restarted on a node holding a replica of the data
[Figure: a TaskManager assigns MAP( ) tasks to nodes; one MAP( ) task fails and is restarted elsewhere.]
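
The two scheduling rules above (run a task where its data is, restart it on a replica holder after a failure) can be sketched as follows. This is an illustration only; `assign_task`, the placement table and the node names are assumptions, not Hadoop APIs.

```python
def assign_task(block, placement, alive):
    """Return a live node that holds a replica of `block`, or None if none is left."""
    for node in placement[block]:
        if node in alive:
            return node
    return None

# Hypothetical replica placement: block id -> nodes holding a copy.
placement = {0: ["n1", "n2", "n3"], 1: ["n2", "n3", "n4"]}
alive = {"n1", "n2", "n3", "n4"}

first = assign_task(0, placement, alive)   # data-local assignment
alive.discard(first)                       # simulate: that node fails
retry = assign_task(0, placement, alive)   # restart on another replica holder
print(first, retry)
```

Because every part is replicated, a single node failure still leaves a node on which the task can run without moving the data.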

Hadoop
► Who uses Hadoop?
  - Amazon A9.com (Search Index Building, Analytics)
  - Facebook (Logfile Analysis)
  - Google & IBM (University Initiative to Address Internet-Scale Computing Challenges)
  - Yahoo! (Crawling, Indexing, Searching)
    Yahoo! Hadoop Cluster runs Terabyte Sort Benchmark in 209 seconds
  - And many others... (see http://wiki.apache.org/hadoop/PoweredBy)
► Hadoop resembles Google's MapReduce Framework
  - J. Dean, S. Ghemawat: "MapReduce: Simplified Data Processing on Large Clusters"

The Pig Project
► A platform for analyzing large data sets
► Pig consists of two parts:
  - PigLatin: A Data Processing Language
  - Pig Infrastructure (Grunt): An Evaluator for PigLatin programs
► Where can you get Pig?
  - Apache Incubator Project: http://incubator.apache.org/pig
► Alternatives:
  - HIVE (Facebook)
  - JAQL (IBM Research)

PigLatin Data Processing Language
► PigLatin is imperative (whereas SQL is declarative)
  - Step-by-step programming approach
  - PigLatin queries are easy to write and understand
► Fully nestable data model
  - Atomic values, tuples, bags, maps
► Operators of two flavors:
  - Relational style operators (filter, join, etc.)
  - Functional-programming style operators (map, reduce)
► Easy to extend by user functions
► Example: "Find the top 10 most visited pages in each category"

    visits = load '/data/visits' as (user, url, time);
    gVisits = group visits by url;
    visitCounts = foreach gVisits generate url, count(visits);
    urlInfo = load '/data/urlInfo' as (url, category, pRank);
    visitCounts = join visitCounts by url, urlInfo by url;
    gCategories = group visitCounts by category;
    topUrls = foreach gCategories generate top(visitCounts,10);
    store topUrls into '/data/topUrls';

Example taken from: "Pig Latin: A Not-So-Foreign Language For Data Processing", Talk, SIGMOD 2008
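
For readers unfamiliar with PigLatin, the same dataflow can be traced step by step in plain Python. This is a rough in-memory equivalent, not how Pig evaluates the program; the sample rows and the `top_urls` helper are made up for illustration.

```python
from collections import Counter, defaultdict

# Hypothetical sample data in the shape (user, url, time) and (url, category, pRank).
visits = [
    ("amy", "cnn.com", "8:00"), ("amy", "bbc.com", "10:00"),
    ("bob", "cnn.com", "9:00"), ("bob", "espn.com", "9:30"),
    ("eve", "cnn.com", "7:00"), ("eve", "bbc.com", "7:30"),
]
url_info = [
    ("cnn.com", "News", 0.9), ("bbc.com", "News", 0.8),
    ("espn.com", "Sports", 0.7), ("nyt.com", "News", 0.9),
]

def top_urls(visits, url_info, n=10):
    # group visits by url; generate url, count(visits)
    visit_counts = Counter(url for _user, url, _time in visits)
    # join visitCounts by url, urlInfo by url; group by category
    by_category = defaultdict(list)
    for url, category, _p_rank in url_info:
        if url in visit_counts:  # inner join: unvisited urls drop out
            by_category[category].append((url, visit_counts[url]))
    # foreach category generate top(visitCounts, n)
    return {
        category: sorted(rows, key=lambda row: (-row[1], row[0]))[:n]
        for category, rows in by_category.items()
    }

top = top_urls(visits, url_info, n=1)
print(top)
```

Each Python step corresponds to one PigLatin statement, which is the step-by-step style the slide refers to.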

Pig Infrastructure
► Currently two modes:
  - Local: PigLatin programs are evaluated locally (run in a single JVM)
  - MapReduce: PigLatin programs are compiled to sequences of MapReduce programs and executed (e.g. on Hadoop)
► Example:
  - Operators: LOAD visits; GROUP BY url; FOREACH url GENERATE count; LOAD url info; JOIN on url; GROUP BY category; FOREACH category GENERATE top10(urls)
  - Compiled into three MapReduce programs: Map1/Reduce1, Map2/Reduce2, Map3/Reduce3
Example taken from: "Pig Latin: A Not-So-Foreign Language For Data Processing", Talk, SIGMOD 2008

UIMA
[Figure: UIMA processing in three phases: Pre-Processing, Analysis Phase, Post-Processing]
► Annotators for part-of-speech detection, named-entity detection and relation detection

The Stratosphere Project
► Many BI queries exceed the capabilities of today's BI systems
  - "Who are the biggest players in the Linux market?"
  - "Which insurance policy customers are at risk of being hit by a current storm?"
► The Internet offers valuable information
  - Enterprise announcements and public business reports
  - User-generated content: blogs, wikis, reviews, comments, etc.
  - News websites and feeds
► Next Generation Business Intelligence (NGBI) requires joint analysis of Internet and enterprise data
  - Internet, intranet, data warehouse and local data must be processed
The goal of the Stratosphere Project is to build an NGBI system on a cloud computing platform

Stratosphere - Architecture
[Figure: A query entered in the UI goes through Query Translation into a query plan with the operators Crawl, Scan, Extract, Filter, Join and Group. The plan runs on a computing cloud (Hadoop, with a cache) in the stages Retrieve, Extract (UIMA), Process and Result. Data comes from the Internet; further data sources: data warehouse, intranet, office documents (spreadsheets).]

Stratosphere - Research Challenges
► Definition of an algebra for expressing NGBI queries
  - Includes: traditional database operators, data retrieval operators, information extraction operators, and information integration operators
► Implementation of NGBI query operators
  - Requirements: highly scalable, robust, self-tuning
  - Leveraging Hadoop and MapReduce frameworks
► Implementation of a cloud computing monitoring infrastructure
  - Enabling self-tuning NGBI operators
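
One way to picture such an algebra is as a set of composable operators over streams of rows. The Python sketch below is purely illustrative and is not the Stratosphere design: the operator names, the regular-expression "extractor", the sample documents and the warehouse table are all assumptions. It answers a toy version of "Who are the biggest players in the Linux market?".

```python
import re
from collections import Counter

def select(rows, predicate):
    """Traditional database operator: keep rows that satisfy the predicate."""
    return (row for row in rows if predicate(row))

def extract(docs, pattern):
    """Information extraction operator: emit (doc_id, entity) per regex match."""
    for doc_id, text in docs:
        for match in re.finditer(pattern, text):
            yield doc_id, match.group(0)

def join(rows, table, key):
    """Information integration operator: join extracted rows with warehouse data."""
    for row in rows:
        if key(row) in table:
            yield row + (table[key(row)],)

def group_count(rows, key):
    """Aggregation operator: count rows per group."""
    return Counter(key(row) for row in rows)

# Hypothetical retrieved documents and warehouse master data.
docs = [
    (1, "IBM and Red Hat ship Linux servers."),
    (2, "Red Hat extends its Linux lead."),
    (3, "Dell sells desktops."),
]
warehouse = {"IBM": "vendor", "Red Hat": "vendor"}

linux_docs = select(docs, lambda doc: "Linux" in doc[1])
mentions = extract(linux_docs, r"IBM|Red Hat|Dell")
known = join(mentions, warehouse, key=lambda row: row[1])
players = group_count(known, key=lambda row: row[1])
print(players.most_common())
```

Because every operator consumes and produces a stream of rows, a plan is just a composition of operators, which is what makes reordering and optimization of such plans possible in principle.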

Related Project: DBLife

Related Projects: Avatar Email Search
© Chair for Database Systems and Information Management 38
ima
http://water.usgs.gov/waterwatch/
(Zipcode)
edc.usgs.gov/
(Geocode = Latitude/Longitude) (Geocode = Latitude/Longitude)
http://www.dotd.florida.gov/
(HUC = Hydrological Unit Code)
Severe weather – Meet Pete, an insurance agent in Lousiana.
1. He sees a news report of a severe storm. What is the company’s risk?
2. Pete has an Excel spreadsheet with all policy holders he manages, which he filters to select only properties insured for more than $250,000.
3. Pete searches for a website that can predict flood levels for his area and finds www.floodlevels.com, a mashup which predicts the flood level for a geographic area based on USGS flood level forecasts, and GIS databases from
4. Pete connects his spreadsheet to www.floodlevels.com
5. He then forwards a risk summary to executives.
Which insurance policy customers are
at risk of being hit by a current storm?
Situational Business Intelligence Example
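
Pete's steps 2 to 5 amount to a filter, a join and an aggregate. The Python sketch below illustrates that with made-up data: the policy rows, the zip-code key, the flood levels and the 1.0 risk threshold are all assumptions, and the dictionary merely stands in for a lookup against the mashup.

```python
# Hypothetical rows from Pete's spreadsheet.
policies = [
    {"policy": "P-1", "zipcode": "70112", "insured_value": 300_000},
    {"policy": "P-2", "zipcode": "70112", "insured_value": 120_000},
    {"policy": "P-3", "zipcode": "70802", "insured_value": 450_000},
    {"policy": "P-4", "zipcode": "70501", "insured_value": 500_000},
]

# Stand-in for the mashup lookup: predicted flood level per zip code (made up).
predicted_flood_level = {"70112": 2.5, "70802": 0.2}

def at_risk(policies, flood_levels, min_value=250_000, flood_threshold=1.0):
    """Filter by insured value, join on zip code, keep areas above the threshold."""
    return [
        p for p in policies
        if p["insured_value"] > min_value
        and flood_levels.get(p["zipcode"], 0.0) >= flood_threshold
    ]

risky = at_risk(policies, predicted_flood_level)
summary = {
    "policies_at_risk": [p["policy"] for p in risky],
    "total_exposure": sum(p["insured_value"] for p in risky),
}
print(summary)
```

Policies in areas without a forecast are treated as not at risk here; a real application would more likely flag them for follow-up.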

Flood Risk Assessment Mashup
[Figure: mashup data flow. Sources: policy XLS, water.usgs.gov, edc.usgs.gov, dotd.louisiana.gov. Components: Mashup Search, Screen Scraping, Standardize, www.floodlevels.com, Report. Annotations: Lineage, Standardization.]

Situational BI Evolution
[Figure: quadrant chart. One axis ranges from "Mission Critical" to "Best Effort, Ad Hoc", the other from "Limited Time, Immediate" to "Lots of Time". Items shown: Mashups, SCA, Portals, New Initiatives, Proof of Concept, Data Mart, Data Warehouse. Owners shown: Line of Business, IT Dept.]

Select Literature
(Algebraic) Extraction
► Michele Banko, Michael J. Cafarella, Stephen Soderland, Matthew Broadhead, Oren Etzioni: Open Information Extraction from the Web. International Joint Conferences on Artificial Intelligence (IJCAI) 2007: 2670-2676
► Frederick Reiss, Shivakumar Vaithyanathan, Sriram Raghavan, Rajasekar Krishnamurthy, Huaiyu Zhu: An Algebraic Approach to Rule-Based Information Extraction. International Conference on Data Engineering (ICDE) 2008: 933-942
Schema generation from extracted uncertain data
► Xin Dong, Alon Y. Halevy: Malleable Schemas: A Preliminary Report. WebDB 2005: 139-144
► Marcos Antonio Vaz Salles, Jens-Peter Dittrich, Shant Kirakos Karakashian, Olivier René Girard, Lukas Blunschi: iTrails: Pay-as-you-go Information Integration in Dataspaces. International Conference on Very Large Databases (VLDB) 2007: 663-674
Optimization
► Alpa Jain, AnHai Doan, Luis Gravano: Optimizing SQL Queries over Text Databases. International Conference on Data Engineering (ICDE) 2008: 636-645
BI over text
► Alpa Jain, AnHai Doan, Luis Gravano: Optimizing SQL Queries over Text Databases. International Conference on Data Engineering (ICDE) 2008: 636-645
► Raghu Ramakrishnan, Andrew Tomkins: Towards a PeopleWeb. IEEE Computer 40(8): 63-72
► Alexander Löser, Gregor Hackenbroich, Hong-Hai Do, Henrike Berthold: Web 2.0 Business Analytics. Datenbank Spektrum 25/2008
► T. S. Jayram, Andrew McGregor, S. Muthukrishnan, Erik Vee: Estimating Statistical Aggregates on Probabilistic Data Streams. PODS 2007

Conclusion
► BI over text will tap into a huge set of additional information for BI
► The next generation of business intelligence applications will utilize technologies for scalable processing and service computing to integrate data sources from warehouses, intranet, and Internet
► Situational BI will create ad hoc applications to answer complex questions over integrated data sources
► Open research problems:
  - Which is the right extraction service?
  - "How much" schema can be generated?
  - "How much" optimization does the user have to add?
  - How to optimize UIMA-based extraction plans on a Hadoop cloud?
  - What is a suitable query language over Hadoop?
  - Data cleansing, completion, and duplicate detection of extracted data?
  - Data explanation: lineage, but also: why do I NOT see that data tuple?

Acknowledgements
► Discussions at IBM Research and IBM SWG
  - Anant Jhingran
  - Hamid Pirahesh
  - Kevin Beyer
  - David Simmen
  - Mehmet Altinel
  - et al.
► My team at TU Berlin
  - Alexander Löser
  - Fabian Hüske
  - Stephan Ewen
  - Helko Glathe

Thank You
Merci, Grazie, Gracias, Obrigado, Danke
[Slide shows "thank you" in: Japanese, English, French, Russian, German, Italian, Spanish, Brazilian Portuguese, Arabic, Traditional Chinese, Simplified Chinese, Hindi, Tamil, Thai, Korean]