Situational Business Intelligence
Volker Markl, Database Systems and Information Management, Technische Universität Berlin
© Chair for Database Systems and Information Management

Upload: cameron-lillian-henderson

Post on 12-Jan-2016


TRANSCRIPT

Page 1: Situational Business Intelligence
Volker Markl, Technische Universität Berlin
Database Systems and Information Management, Technische Universität Berlin

Page 2: Agenda
► Traditional Business Intelligence
► Next Generation Business Intelligence
► Building Blocks: Cloud Computing, MapReduce and Hadoop, Pig Latin, UIMA, Social Tagging
► The Long Tail of Situational Applications
► Situational Business Intelligence
► Challenges

Page 3: Traditional Business Intelligence

Page 4: How Did We Get Here?
[Chart: Actual and forecasted BI tools software revenue as reported by IDC, 1990 to 2008; vertical axis from 0 to 7000. Labels on the chart: Batch Reporting; Query/Reporting, OLAP; Client Server Business Intelligence; Web enabled Business Intelligence; BI over Text.]
Source: Gartner
Source: IDC

Page 5: 2008 CIO Technology Priorities
Survey question: To what extent will each of the following technologies be a Top 5 priority for you in 2008?

Technology                                           2008 Increase*   Rank 2008   Rank 2007   Rank 2006
Business Intelligence Applications                   11.20%           1           1           1
Enterprise Applications (ERP, SCM, and CRM)          8.02%            2           2           **
Server and Storage Technologies (Virtualization)     8.45%            3           5           9
Legacy Application Modernization                     5.79%            4           3           10
Security Technologies                                8.53%            5           6           2
Technical Infrastructure                             4.67%            6           8           12
Networking, Voice, and Data Communications (VoIP)    6.83%            7           4           8
Collaboration Technologies                           7.75%            8           10          4
Document Management                                  7.91%            9           9           **
Service-Oriented Technologies (SOA and SOBA)         6.71%            10          7           6

* Unweighted average budget change
** New question for 2007
Source: 2008 Gartner Executive Programs CIO Survey, January 10, 2008

Page 6: What are CIOs missing?
Survey question: Please give me an example of how your business intelligence solution could better meet your organization's main objective.
    Better/more information     22.9%
    Faster/quick retrieval      14.3%
    Accurate/updated data       11.4%
    Consistent platform          8.6%
    Better integration           8.6%
    Standardization              8.6%
    Other single mentions       40.0%
Source: Business Intelligence Survey, IDC

Page 7: Next Generation Business Intelligence
The next generation of Business Intelligence (NGBI) correlates data warehouses with text and semi-structured data from web services of corporate intranets and the internet.
Example queries:
► Who is leading in American Idol?
► Who are the biggest players in the Linux market?
► Which insurance policy customers are at risk of being hit by a current storm?
[Diagram labels: Internet; Intranet; Text, XLS, and XML documents; Information Extraction; Semantic Integration; Load/Refresh or ad hoc; Analysis Schema and Entities; Data Warehouse; Data Marts]

Page 8: Answering a NGBI Query
Query: Who are the biggest players in the "Linux" market?
Data: Web 2.0 documents from 332 Wiki News docs (January to March 2007)

Page 9: Data Source Identification
► Data Warehouse
► Master data
► Information Providers
► Information Marketplaces
► Crawling (Internet/Intranet)
Process steps: Data Source Identification, Atomic Entity Extraction, Schema Extraction, Data Cleansing, Data Fusion

Page 10: Atomic Entity Extraction
Out-of-the-box data
► Web services for complex, atomic, and named entities
Frameworks
► Infrastructures for extracting, managing, and scalably storing named entities
► Web services for extracting named entities
Basic components
► Screen scraper
[Side label: Additional extraction and data cleansing effort]
Process steps: Data Source Identification, Atomic Entity Extraction, Schema Extraction, Data Cleansing, Data Fusion

Page 11: Ad hoc analysis process
Process steps: Data Source Identification, Atomic Entity Extraction, Schema Extraction, Data Cleansing, Data Fusion

Page 12: Schema Extraction
[Diagram labels: Company; Technology -> Technology; Company; Technology -> Company]
Process steps: Pre-Process, Base Extraction, Schema Extraction, Data Cleansing, Data Fusion

Page 13: Data Cleansing
Duplicates
Process steps: Pre-Process, Base Extraction, Schema Extraction, Data Cleansing, Data Fusion

Page 14: Data Fusion
Information integration steps: Schema Mapping, Duplicate Detection, Data Fusion
Matching records from Data Source A and Data Source B:
    Apple, iPhone 3 Gen, 299.95
    Apple, iPhone 3G, 199.99
Fused record:
    Apple, iPhone 3 Gen, 199.99 (resolution functions shown: max length, min)
Process steps: Pre-Process, Base Extraction, Schema Extraction, Data Cleansing, Data Fusion
e.g., Hummer (U Potsdam)
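The fusion of the two matched records above can be sketched in Python. This is only an illustration, not the implementation of the system named on the slide: the field names (`vendor`, `name`, `price`) and the assignment of the resolution functions to fields (max length for the name, min for the price) are assumptions read off the example values.

```python
def longest(values):
    """Resolve a conflict by keeping the longest string ('max length')."""
    return max(values, key=len)


# Assumed mapping from attribute to conflict-resolution function.
RESOLVERS = {"vendor": longest, "name": longest, "price": min}


def fuse(records, resolvers=RESOLVERS):
    """Fuse records that were already matched as duplicates into one record."""
    fused = {}
    for field, resolve in resolvers.items():
        # Ignore missing values so they never win a conflict.
        values = [r[field] for r in records if r.get(field) is not None]
        fused[field] = resolve(values) if values else None
    return fused


record_1 = {"vendor": "Apple", "name": "iPhone 3 Gen", "price": 299.95}
record_2 = {"vendor": "Apple", "name": "iPhone 3G", "price": 199.99}
fused = fuse([record_1, record_2])
print(fused)
```

Keeping the resolution functions in a per-attribute table makes it easy to swap in other strategies (for example, most recent value or majority vote) without touching the fusion loop.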

Page 15: Data Fusion
("-" denotes a missing value)

Integration of complementary tuples
    (a, b, c, -) and (a, b, -, d)   ->   (a, b, c, d)
Elimination of identical tuples
    (a, b, -, -) and (a, b, -, -)   ->   (a, b, -, -)
Elimination of subsumed tuples
    (a, b, c, -) and (a, b, -, -)   ->   (a, b, c, -)
Conflict resolution
    (a, b, c, -) and (a, e, -, d)   ->   (a, f(b,e), c, d)
Process steps: Pre-Process, Base Extraction, Schema Extraction, Data Cleansing, Data Fusion
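The four fusion cases on this slide can be sketched as small Python functions. This is a hedged illustration: tuples are plain Python tuples, `None` stands for a missing value, and the function names (`subsumes`, `merge`, `drop_redundant`) as well as the string-building resolution function are hypothetical.

```python
def subsumes(t, u):
    """True if t differs from u and agrees with u on every non-missing value of u."""
    return t != u and all(b is None or a == b for a, b in zip(t, u))


def merge(t, u, resolve=None):
    """Combine two tuples attribute by attribute; resolve(a, b) handles conflicts."""
    out = []
    for a, b in zip(t, u):
        if a is None:
            out.append(b)
        elif b is None or a == b:
            out.append(a)
        elif resolve is not None:
            out.append(resolve(a, b))
        else:
            raise ValueError(f"conflicting values {a!r} and {b!r}")
    return tuple(out)


def drop_redundant(tuples):
    """Eliminate identical tuples first, then tuples subsumed by another tuple."""
    unique = list(dict.fromkeys(tuples))  # keeps first occurrence, preserves order
    return [t for t in unique if not any(subsumes(u, t) for u in unique)]


complementary = merge(("a", "b", "c", None), ("a", "b", None, "d"))
identical = drop_redundant([("a", "b", None, None), ("a", "b", None, None)])
subsumed = drop_redundant([("a", "b", "c", None), ("a", "b", None, None)])
conflict = merge(("a", "b", "c", None), ("a", "e", None, "d"),
                 resolve=lambda x, y: f"f({x},{y})")
print(complementary, identical, subsumed, conflict)
```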

Page 16: Address Uncertainty: Query Refinement
► Extract -> SELECT -> PROJECT -> JOIN -> (COUNT, AVG, SUM, MEAN, ...)
► "Everything" about Dell?
► The market of "Linux" from 2007-2008?
► "What's the average analyst quote about the IBM stock price for the last month?"
► Drill down on region, time, organization, ...
[Diagram labels: DATA, QUERY]

Page 17: Building Blocks
► Cloud Computing
► MapReduce
► Pig
► UIMA
► Social Tagging

Page 18: Cloud Computing
► What is Cloud Computing?
    - Computing platform architecture
    - Scales to any application
    - High fault tolerance
    - No generally accepted definition available
    - Separation from Utility or Grid Computing is not obvious

Page 19: Cloud Computing
► How does Cloud Computing work?
    - Lots of loosely coupled computers
    - Use of commodity hardware
    - Flexible up- or downgrading of resources
    - APIs offer access to cloud computing systems
    - Software takes care of parallelization, hardware failures, and error handling
    - Resources (e.g. storage, computing power) can be bought as services (paying for usage, e.g. Amazon)

Page 20: MapReduce – Programming Model
► Program logic is split into 2 functions: Map(k,v) and Reduce(k,list(v))
► Functions receive and produce (Key, Value)-pairs
► Map(k,v) computes for each (k,v)-pair an intermediate (ki,vi)-pair
► Reduce(k,list(v)) merges all values with the same key k and outputs the result
► MapReduce programs are easy to develop
    - Frameworks provide libraries
    - Frameworks take care of parallelization, distribution, and error handling
    - Only application-specific source code is required (no parallelization and error handling code)

Page 21: MapReduce – Group AVG Example
Input Data:
    NewYork, US, 10
    LosAngeles, US, 40
    London, GB, 20
    Berlin, DE, 60
    Glasgow, GB, 10
    Munich, DE, 30
    …
Intermediate (K,V)-Pairs produced by MAP(k,v):
    (US,10) (US,40) (GB,20) (GB,10) (DE,60) (DE,30)
Pairs grouped by key as input to REDUCE(k,list(v)):
    (US,10) (US,40)
    (GB,20) (GB,10)
    (DE,60) (DE,30)
Result:
    (DE,45) (GB,15) (US,25)
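The group-average example can be reproduced with a small single-process sketch in Python. It only imitates the programming model: `run_mapreduce` is a hypothetical in-memory driver, not a Hadoop API, and the input format (city, country, value separated by commas) is assumed from the sample rows.

```python
from collections import defaultdict


def map_fn(key, line):
    """Map: parse 'city, country, value' and emit one (country, value) pair."""
    _city, country, value = (part.strip() for part in line.split(","))
    yield country, float(value)


def reduce_fn(key, values):
    """Reduce: average all values that share the same key."""
    return key, sum(values) / len(values)


def run_mapreduce(records, map_fn, reduce_fn):
    """Hypothetical in-memory driver: map, group by key, then reduce."""
    groups = defaultdict(list)
    for key, value in records:
        for out_key, out_value in map_fn(key, value):
            groups[out_key].append(out_value)
    return [reduce_fn(k, vs) for k, vs in sorted(groups.items())]


lines = [
    "NewYork, US, 10", "LosAngeles, US, 40", "London, GB, 20",
    "Berlin, DE, 60", "Glasgow, GB, 10", "Munich, DE, 30",
]
# The input key is the line number; the map function ignores it.
result = run_mapreduce(enumerate(lines), map_fn, reduce_fn)
print(result)  # [('DE', 45.0), ('GB', 15.0), ('US', 25.0)]
```

In a real engine the grouping step is the distributed shuffle; only `map_fn` and `reduce_fn` would be application code.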

Page 22: MapReduce
► MapReduce Programming Model
    - For processing of huge amounts of data
    - Massive parallelization of computing tasks
    - Applicable to many real-world applications
    - MapReduce programs are easy to implement
► MapReduce Engine
    - Environment to run MapReduce programs
    - Distributes computing tasks
    - Errors are transparently handled
    - Very scalable architecture
    - Examples: Google MapReduce and Apache Hadoop

Page 23: Hadoop
► What is Hadoop?
    - Free software framework for data-intensive applications
    - Enables distributed processing of vast amounts of data on cloud computing architectures
    - Supports clouds with 1000+ nodes
    - Two components:
        1) Hadoop Distributed File System (HDFS)
        2) MapReduce Engine
► Where can you get Hadoop?
    - Top-level Apache Project: http://hadoop.apache.org/core/

Page 24: Hadoop – HDFS
► Inspired by Google File System
► Distributed storage for large files
► Files are split up into multiple parts (default size 64 MB)
► Parts are spread over the HDFS nodes
► Each part is replicated (default 3 times)
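The splitting and replication described above can be illustrated with a short Python sketch. The defaults (64 MB parts, 3 replicas) come from the slide; the round-robin placement and the names `place_blocks` and `node0`..`node3` are simplifying assumptions and do not reflect the actual HDFS placement policy.

```python
import math

BLOCK_SIZE = 64 * 1024 * 1024  # default part size from the slide, in bytes
REPLICATION = 3                # default number of replicas from the slide


def place_blocks(file_size, nodes, block_size=BLOCK_SIZE, replication=REPLICATION):
    """Return {block_index: [node, ...]} using simple round-robin placement."""
    if replication > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    n_blocks = math.ceil(file_size / block_size)
    return {
        b: [nodes[(b + r) % len(nodes)] for r in range(replication)]
        for b in range(n_blocks)
    }


nodes = ["node0", "node1", "node2", "node3"]
placement = place_blocks(200 * 1024 * 1024, nodes)  # a 200 MB file
for block, replicas in placement.items():
    print(block, replicas)
```

A 200 MB file yields four parts (three full 64 MB parts and one smaller remainder), each stored on three distinct nodes.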

Page 25: Hadoop – MapReduce Engine
► Runs MapReduce programs
► Libraries for Java and C++
► Assigns Map and Reduce tasks to computing nodes
► Reduction of data transfer volume
    - Tasks are assigned to nodes holding the data
► Node failures are transparently handled
    - Tasks are restarted on a node holding a replica of the data
[Diagram: a TaskManager assigns MAP( ) tasks to nodes; one MAP( ) task fails.]
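The two scheduling rules above (run a task where its data lives, restart it on a replica holder after a failure) can be sketched as follows. This is an assumption-laden toy, not Hadoop's scheduler: `assign_task`, the node names, and the load-based tie-break are all hypothetical.

```python
def assign_task(block_replicas, failed=frozenset(), load=None):
    """Pick a live node holding a replica of the block, preferring the least loaded."""
    load = load or {}
    candidates = [node for node in block_replicas if node not in failed]
    if not candidates:
        raise RuntimeError("no live node holds a replica of this block")
    # min() returns the first candidate on ties, so replica order is the tie-break.
    return min(candidates, key=lambda node: load.get(node, 0))


replicas = ["node1", "node2", "node3"]       # nodes storing the task's input part
first_try = assign_task(replicas)            # data-local assignment
retry = assign_task(replicas, failed={first_try})  # restart after a node failure
print(first_try, retry)
```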

Page 26: Hadoop
► Who uses Hadoop?
    - Amazon A9.com (Search Index Building, Analytics)
    - Facebook (Logfile Analysis)
    - Google & IBM (University Initiative to Address Internet-Scale Computing Challenges)
    - Yahoo! (Crawling, Indexing, Searching)
        Yahoo! Hadoop Cluster runs Terabyte Sort Benchmark in 209 seconds
    - And many others… (see http://wiki.apache.org/hadoop/PoweredBy)
► Hadoop resembles Google's MapReduce Framework
    - J. Dean, S. Ghemawat: "MapReduce: Simplified Data Processing on Large Clusters"

Page 27: The Pig Project
► A platform for analyzing large data sets
► Pig consists of two parts:
    - PigLatin: A Data Processing Language
    - Pig Infrastructure (Grunt): An Evaluator for PigLatin programs
► Where can you get Pig?
    - Apache Incubator Project: http://incubator.apache.org/pig
► Alternatives:
    - HIVE (Facebook)
    - JAQL (IBM Research)

Page 28: PigLatin Data Processing Language
► PigLatin is imperative (whereas SQL is declarative)
    - Step-by-step programming approach
    - PigLatin queries are easy to write and understand
► Fully nestable data model
    - Atomic values, tuples, bags, maps
► Operators of two flavors:
    - Relational-style operators (filter, join, etc.)
    - Functional-programming-style operators (map, reduce)
► Easy to extend by user functions
► Example: "Find the top 10 most visited pages in each category"

    visits = load '/data/visits' as (user, url, time);
    gVisits = group visits by url;
    visitCounts = foreach gVisits generate url, count(visits);
    urlInfo = load '/data/urlInfo' as (url, category, pRank);
    visitCounts = join visitCounts by url, urlInfo by url;
    gCategories = group visitCounts by category;
    topUrls = foreach gCategories generate top(visitCounts,10);
    store topUrls into '/data/topUrls';

Example taken from: "Pig Latin: A Not-So-Foreign Language For Data Processing" Talk, Sigmod 2008
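For readers unfamiliar with PigLatin, the same data flow can be followed step by step in plain Python. This is only an in-memory sketch of what the script computes, not how Pig evaluates it; the function name `top_urls_per_category` and the sample rows are hypothetical.

```python
from collections import Counter, defaultdict


def top_urls_per_category(visits, url_info, n=10):
    """Count visits per URL, join with URL metadata, keep the top n per category."""
    # group visits by url; generate url, count(visits)
    visit_counts = Counter(url for _user, url, _time in visits)
    # join visitCounts with urlInfo on url, then group by category
    by_category = defaultdict(list)
    for url, category, _p_rank in url_info:
        if url in visit_counts:  # inner join: URLs without visits are dropped
            by_category[category].append((url, visit_counts[url]))
    # foreach category generate top(visitCounts, n); ties broken by URL
    return {
        category: sorted(rows, key=lambda row: (-row[1], row[0]))[:n]
        for category, rows in by_category.items()
    }


# Hypothetical sample data with the schemas used in the script above.
visits = [
    ("amy", "cnn.com", 1), ("bob", "cnn.com", 2), ("amy", "bbc.co.uk", 3),
    ("bob", "espn.com", 4), ("cat", "espn.com", 5), ("dan", "espn.com", 6),
]
url_info = [
    ("cnn.com", "news", 0.9), ("bbc.co.uk", "news", 0.8), ("espn.com", "sports", 0.7),
]
top = top_urls_per_category(visits, url_info, n=1)
print(top)
```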

Page 29: Pig Infrastructure
► Currently two modes:
    - Local: PigLatin programs are locally evaluated (run in a single JVM)
    - MapReduce: PigLatin programs are compiled to sequences of MapReduce programs and executed (e.g. on Hadoop)
► Example:
[Diagram: operator plan LOAD visits; LOAD url info; GROUP BY url; FOREACH url GENERATE count; JOIN on url; GROUP BY category; FOREACH category GENERATE top10(urls), divided into the stages Map1, Reduce1, Map2, Reduce2, Map3, Reduce3]
Example taken from: "Pig Latin: A Not-So-Foreign Language For Data Processing" Talk, Sigmod 2008

Page 30: UIMA

Page 31: UIMA
[Diagram phases: Pre-Processing, Analysis Phase, Post-Processing]

Page 32: UIMA
► Annotators for part-of-speech detection, named-entity detection, and relation detection

Page 33: The Stratosphere Project
► Many BI queries exceed the capabilities of today's BI systems
    - "Who are the biggest players in the Linux market?"
    - "Which insurance policy customers are at risk of being hit by a current storm?"
► The Internet offers valuable information
    - Enterprise announcements and public business reports
    - User-generated content: Blogs, Wikis, Reviews, Comments, etc.
    - News websites and feeds
► Next Generation Business Intelligence (NGBI) requires joint analysis of internet and enterprise data
    - Internet, Intranet, Data Warehouse, and Local Data must be processed
The goal of the Stratosphere Project is to build an NGBI system on a Cloud Computing Platform

Page 34: Stratosphere – Architecture
[Diagram labels: UI; Query; Query Translation; Query Plan (Crawl, Scan, Extract, Filter, Join, Group); Computing Cloud (HADOOP); Retrieve; Extract (UIMA); Process; Result; Cache; Internet]
Further data sources:
    - Data Warehouse
    - Email
    - Intranet
    - Office documents (spreadsheets)

Page 35: Stratosphere – Research Challenges
► Definition of an algebra for expressing NGBI queries
    - Includes: traditional database operators, data retrieval operators, information extraction operators, and information integration operators
► Implementation of NGBI query operators
    - Requirements: highly scalable, robust, self-tuning
    - Leveraging Hadoop and MapReduce frameworks
► Implementation of a cloud computing monitoring infrastructure
    - Enabling self-tuning NGBI operators

Page 36: Related Project: DBLife

Page 37: Related Projects: Avatar Email Search

Page 38: Situational Business Intelligence Example
Question: Which insurance policy customers are at risk of being hit by a current storm?
Severe weather: Meet Pete, an insurance agent in Louisiana.
1. He sees a news report of a severe storm. What is the company's risk?
2. Pete has an Excel spreadsheet with all policy holders he manages, which he filters to select only properties insured for more than $250,000.
3. Pete searches for a website that can predict flood levels for his area and finds www.floodlevels.com, a mashup which predicts the flood level for a geographic area based on USGS flood level forecasts, and GIS databases from
4. Pete connects his spreadsheet to www.floodlevels.com.
5. He then forwards a risk summary to executives.
Data sources shown:
    - http://water.usgs.gov/waterwatch/ (Zipcode)
    - edc.usgs.gov/ (Geocode = Latitude/Longitude)
    - http://www.dotd.florida.gov/ (HUC = Hydrological Unit Code)
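Pete's steps 2 to 5 amount to a filter, a join on location, and a summary. A hedged Python sketch follows; every name and number in it is hypothetical (the record fields, the ZIP codes, the forecast values, and the 1.0 m flood threshold), and the forecast is a plain dictionary standing in for whatever the mashup would return.

```python
def at_risk_policies(policies, flood_forecast, min_insured=250_000, flood_threshold_m=1.0):
    """Select high-value policies located where the forecast flood level is high."""
    selected = []
    for policy in policies:
        if policy["insured"] <= min_insured:
            continue  # step 2: keep only properties insured for more than the limit
        level = flood_forecast.get(policy["zipcode"], 0.0)  # step 4: join on ZIP code
        if level >= flood_threshold_m:
            selected.append({**policy, "flood_level_m": level})
    return selected


def risk_summary(selected):
    """Step 5: condense the selection into figures for a short report."""
    return {"policies": len(selected), "exposure": sum(p["insured"] for p in selected)}


# Hypothetical spreadsheet rows and forecast values.
policies = [
    {"holder": "A", "zipcode": "70112", "insured": 300_000},
    {"holder": "B", "zipcode": "70112", "insured": 200_000},
    {"holder": "C", "zipcode": "70801", "insured": 400_000},
]
flood_forecast = {"70112": 2.5, "70801": 0.2}  # predicted flood level in metres

selected = at_risk_policies(policies, flood_forecast)
summary = risk_summary(selected)
print(selected, summary)
```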

Page 39: Flood Risk Assessment Mashup
[Diagram labels: Mashup Search; Report; Standardize; www.floodlevels.com; policy XLS; water.usgs.gov; edc.usgs.gov; dotd.louisiana.gov; Screen Scraping; Lineage; Standardization]

Page 40: Situational BI Evolution
[Diagram: one axis from "Best Effort, Ad Hoc" to "Mission Critical", the other from "Limited Time, Immediate" to "Lots of Time". Labels: Mashups; SCA; Portals; New Initiatives; Proof of Concept; Line of Business; IT Dept; Data Mart; Data Warehouse]

Page 41: Select Literature
(Algebraic) Extraction
► Michele Banko, Michael J. Cafarella, Stephen Soderland, Matthew Broadhead, Oren Etzioni: Open Information Extraction from the Web. International Joint Conference on Artificial Intelligence (IJCAI) 2007: 2670-2676
► Frederick Reiss, Shivakumar Vaithyanathan, Sriram Raghavan, Rajasekar Krishnamurthy, Huaiyu Zhu: An Algebraic Approach to Rule-Based Information Extraction. International Conference on Data Engineering (ICDE) 2008: 933-942
Schema generation from extracted uncertain data
► Xin Dong, Alon Y. Halevy: Malleable Schemas: A Preliminary Report. WebDB 2005: 139-144
► Marcos Antonio Vaz Salles, Jens-Peter Dittrich, Shant Kirakos Karakashian, Olivier René Girard, Lukas Blunschi: iTrails: Pay-as-you-go Information Integration in Dataspaces. International Conference on Very Large Data Bases (VLDB) 2007: 663-674
Optimization
► Alpa Jain, AnHai Doan, Luis Gravano: Optimizing SQL Queries over Text Databases. International Conference on Data Engineering (ICDE) 2008: 636-645
BI over text
► Alpa Jain, AnHai Doan, Luis Gravano: Optimizing SQL Queries over Text Databases. International Conference on Data Engineering (ICDE) 2008: 636-645
► Raghu Ramakrishnan, Andrew Tomkins: Towards a PeopleWeb. IEEE Computer 40(8): 63-72
► Alexander Löser, Gregor Hackenbroich, Hong-Hai Do, Henrike Berthold: Web 2.0 Business Analytics. Datenbank Spektrum 25/2008
► T. S. Jayram, Andrew McGregor, S. Muthukrishnan, Erik Vee: Estimating Statistical Aggregates on Probabilistic Data Streams. PODS 2007

Page 42: Conclusion
► BI over text will tap into a huge set of additional information for BI
► The next generation of business intelligence applications will utilize technologies for scalable processing and service computing to integrate data sources from warehouses, intranet, and internet
► Situational BI will create ad hoc applications to answer complex questions over integrated data sources
► Open research problems:
    - Which is the right extraction service?
    - "How much" schema can be generated?
    - "How much" optimization does the user have to add?
    - How to optimize UIMA-based extraction plans on a Hadoop cloud?
    - What is a suitable query language over Hadoop?
    - Data cleansing, completion, and duplicate detection of extracted data?
    - Data explanation: lineage, but also: why do I NOT see that data tuple?

Page 43: Acknowledgements
► Discussions at IBM Research and IBM SWG
    - Anant Jhingran
    - Hamid Pirahesh
    - Kevin Beyer
    - David Simmen
    - Mehmet Altinel
    - et al.
► My team at TU Berlin
    - Alexander Löser
    - Fabian Hüske
    - Stephan Ewen
    - Helko Glathe

Page 44: Thank You
Thank You (English), Merci (French), Grazie (Italian), Gracias (Spanish), Obrigado (Brazilian Portuguese), Danke (German)
Also shown in: Japanese, Russian, Arabic, Traditional Chinese, Simplified Chinese, Hindi, Tamil, Thai, Korean