
© Chair for Database Systems and Information Management 1

Database Systems and Information Management
Technische Universität Berlin

Situational Business Intelligence

Volker Markl
Technische Universität Berlin


Agenda

► Traditional Business Intelligence
► Next Generation Business Intelligence
► Building Blocks
  - Cloud Computing, MapReduce and Hadoop, PigLatin, UIMA, Social Tagging
► The Long Tail of Situational Applications
► Situational Business Intelligence
► Challenges


Traditional Business Intelligence

How Did We Get Here?

[Figure: Actual and forecasted BI tools software revenue as reported by IDC, 1990 to 2008 (vertical axis from 0 to 7000). Labels on the chart: Batch Reporting, Query/Reporting, OLAP, Client Server Business Intelligence, Web-enabled Business Intelligence, BI over Text. Sources: Gartner, IDC.]


2008 CIO Priorities

2008 CIO Technology Priorities
To what extent will each of the following technologies be a Top 5 priority for you in 2008?

Technology | 2008 Increase* | Rank 2008 | Rank 2007 | Rank 2006
Business Intelligence Applications | 11.20% | 1 | 1 | 1
Enterprise Applications (ERP, SCM, and CRM) | 8.02% | 2 | 2 | **
Server and Storage Technologies (Virtualization) | 8.45% | 3 | 5 | 9
Legacy Application Modernization | 5.79% | 4 | 3 | 10
Security Technologies | 8.53% | 5 | 6 | 2
Technical Infrastructure | 4.67% | 6 | 8 | 12
Networking, Voice, and Data Communications (VoIP) | 6.83% | 7 | 4 | 8
Collaboration Technologies | 7.75% | 8 | 10 | 4
Document Management | 7.91% | 9 | 9 | **
Service-Oriented Technologies (SOA and SOBA) | 6.71% | 10 | 7 | 6

* Unweighted average budget change
** New question for 2007

Source: 2008 Gartner Executive Programs CIO Survey, January 10, 2008


What are CIOs missing?

Survey question: Please give me an example of how your business intelligence solution could better meet your organization's main objective?

Better/more information: 22.9%
Faster/quick retrieval: 14.3%
Accurate/updated data: 11.4%
Consistent platform: 8.6%
Better integration: 8.6%
Standardization: 8.6%
Other single mentions: 40.0%

Source: Business Intelligence Survey, IDC


Next Generation Business Intelligence

The next generation of Business Intelligence (NGBI) correlates data warehouses with text and semi-structured data from web services of corporate intranets and the Internet.

Example questions:
► Who is leading in American Idol?
► Who are the biggest players in the Linux market?
► Which insurance policy customers are at risk of being hit by a current storm?

[Figure labels: Internet; Intranet; Text, XLS, and XML documents; Information Extraction; Semantic Integration; Load/Refresh or ad hoc; Analysis Schema and Entities; Data Warehouse; Data Marts]


Answering an NGBI Query

Who are the biggest players in the "Linux" market?
Source: Web 2.0 documents, 332 Wiki News docs (January to March 2007)


Data Source Identification

► Data Warehouse
► Master data
► Information Providers
► Information Marketplaces
► Crawling (Internet/Intranet)

Process steps: Data Source Identification, Atomic Entity Extraction, Schema Extraction, Data Cleansing, Data Fusion


Atomic Entity Extraction

Out-of-the-box data
► Web Services for complex, atomic, and named entities

Frameworks
► Infrastructures for extracting, managing, and scalable storage of named entities
► Web Services for extracting named entities

Basic Components
► Screen scraper

[Figure annotation: Additional extraction and data cleansing effort]



Ad hoc analysis process



Schema Extraction

Company Technology -> Technology
Company Technology -> Company

Process steps: Pre-Process, Base Extraction, Schema Extraction, Data Cleansing, Data Fusion


Data Cleansing

Duplicates

Data Fusion

Information integration steps: Schema Mapping, Duplicate Detection, Data Fusion (e.g., HumMer, U Potsdam)

Matched records from Data Source A and Data Source B:
  Apple, iPhone 3 Gen, 299.95
  Apple, iPhone 3G, 199.99
Fused record:
  Apple, iPhone 3 Gen (max length), 199.99 (min)


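Data Fusion

► Integration of complementary tuples: (a, b, c, -) and (a, b, -, d) become (a, b, c, d)
► Elimination of identical tuples: (a, b, -, -) and (a, b, -, -) become (a, b, -, -)
► Elimination of subsumed tuples: (a, b, c, -) and (a, b, -, -) become (a, b, c, -)
► Conflict resolution: (a, b, c, -) and (a, e, -, d) become (a, f(b,e), c, d)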
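The four fusion cases can be covered by one attribute-wise merge. The following is a minimal Python sketch, not the method of any particular tool: it assumes the two tuples were already matched as duplicates, uses `None` for the "-" (missing) value, and takes a caller-supplied `resolve` function playing the role of f in the slide. The names `fuse` and `resolve` are illustrative only.

```python
from typing import Callable, Optional, Tuple

Value = Optional[str]
Row = Tuple[Value, ...]


def fuse(r: Row, s: Row, resolve: Callable[[str, str], str]) -> Row:
    """Merge two tuples that are assumed to describe the same real-world entity."""
    fused = []
    for x, y in zip(r, s):
        if x is None:
            fused.append(y)              # take the value from s (may also be missing)
        elif y is None or x == y:
            fused.append(x)              # complementary, identical, or subsumed value
        else:
            fused.append(resolve(x, y))  # conflicting values: delegate to f
    return tuple(fused)


def f(x: str, y: str) -> str:
    """Placeholder conflict resolution that only records both inputs."""
    return f"f({x},{y})"


complementary = fuse(("a", "b", "c", None), ("a", "b", None, "d"), f)
identical = fuse(("a", "b", None, None), ("a", "b", None, None), f)
subsumed = fuse(("a", "b", "c", None), ("a", "b", None, None), f)
conflict = fuse(("a", "b", "c", None), ("a", "e", None, "d"), f)
```

In practice `resolve` would be chosen per attribute, for example longest string for a product name and minimum for a price.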



Address Uncertainty: Query Refinement

► Extract -> SELECT -> PROJECT -> JOIN -> (COUNT, AVG, SUM, MEAN, ...)
► "Everything" about Dell?
► The market of "Linux" from 2007-2008?
► "What's the average analyst quote about the IBM stock price for the last month?"
► Drill down on region, time, organization, ...

[Figure labels: Data, Query]
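As an illustration of the extract, select, project, aggregate chain, here is a small Python sketch for the "average analyst quote" question. Everything in it is hypothetical: the sample sentences, the regular expression, and the function names are made up for this example, and a real system would use a proper extraction service instead of a single pattern. The join step is omitted for brevity.

```python
import re
from statistics import mean

# Hypothetical sentence pattern: "<Analyst> expects <TICKER> at $<price>"
QUOTE = re.compile(
    r"(?P<analyst>[A-Z]\w+) expects (?P<company>[A-Z]+) at \$(?P<price>\d+(?:\.\d+)?)"
)

# Made-up sample documents, for illustration only
DOCS = [
    "Smith expects IBM at $120.50 next month.",
    "Jones expects IBM at $130 by year end; Lee expects DELL at $25.",
]


def extract(docs):
    """EXTRACT: turn free text into structured records."""
    for doc in docs:
        for m in QUOTE.finditer(doc):
            yield {
                "analyst": m["analyst"],
                "company": m["company"],
                "price": float(m["price"]),
            }


def average_quote(docs, company):
    """SELECT by company, PROJECT the price, then aggregate with AVG."""
    prices = [r["price"] for r in extract(docs) if r["company"] == company]
    return mean(prices) if prices else None


ibm_avg = average_quote(DOCS, "IBM")
```

Because extraction is uncertain, a user would typically inspect the extracted records and refine the query or the pattern before trusting the aggregate.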


Building Blocks

► Cloud Computing
► MapReduce
► Pig
► UIMA
► Social Tagging


Cloud Computing

► What is Cloud Computing?
  - Computing platform architecture
  - Scales to any application
  - High fault tolerance
  - No generally accepted definition available
  - Separation from Utility or Grid Computing is not obvious

Cloud Computing

► How does Cloud Computing work?
  - Lots of loosely coupled computers
  - Use of commodity hardware
  - Flexible up- or downgrading of resources
  - APIs offer access to cloud computing systems
  - Software takes care of parallelization, hardware failures, and error handling
  - Resources (e.g., storage, computing power) can be bought as services (paying for usage, e.g., Amazon)


MapReduce – Programming Model

► Program logic is split into two functions: Map(k,v) and Reduce(k,list(v))
► Functions receive and produce (Key, Value)-pairs
► Map(k,v) computes for each (k,v)-pair an intermediate (ki,vi)-pair
► Reduce(k,list(v)) merges all values with the same key k and outputs the result
► MapReduce programs are easy to develop
  - Frameworks provide libraries
  - Frameworks take care of parallelization, distribution, and error handling
  - Only application-specific source code is required (no parallelization and error handling code)


MapReduce – Group AVG Example

Input Data:
  NewYork, US, 10
  LosAngeles, US, 40
  London, GB, 20
  Berlin, DE, 60
  Glasgow, GB, 10
  Munich, DE, 30
  …

Intermediate (K,V)-Pairs emitted by MAP(k,v):
  (US,10) (US,40) (GB,20) (GB,10) (DE,60) (DE,30)

Pairs grouped by key as input to REDUCE(k,list(v)):
  (US,10) (US,40)
  (GB,20) (GB,10)
  (DE,60) (DE,30)

Result:
  (DE,45) (GB,15) (US,25)
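The group-average example can be simulated in a few lines of plain Python. This is a single-process sketch of the programming model only, not Hadoop code: the function names `map_fn`, `reduce_fn`, and `run_job` are illustrative, and the in-memory grouping stands in for the shuffle phase that a real framework performs across machines.

```python
from collections import defaultdict

INPUT = [
    "NewYork, US, 10",
    "LosAngeles, US, 40",
    "London, GB, 20",
    "Berlin, DE, 60",
    "Glasgow, GB, 10",
    "Munich, DE, 30",
]


def map_fn(_key, line):
    """Map: emit one (country, value) pair per input record."""
    _city, country, value = [part.strip() for part in line.split(",")]
    yield country, int(value)


def reduce_fn(key, values):
    """Reduce: average all values that share the same key."""
    return key, sum(values) / len(values)


def run_job(lines):
    """Simulate map, group-by-key (shuffle), and reduce in one process."""
    groups = defaultdict(list)
    for offset, line in enumerate(lines):
        for key, value in map_fn(offset, line):
            groups[key].append(value)
    return dict(reduce_fn(key, values) for key, values in sorted(groups.items()))


result = run_job(INPUT)
```

Only `map_fn` and `reduce_fn` are application-specific; everything in `run_job` is what a MapReduce engine would provide.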


MapReduce

► MapReduce Programming Model
  - For processing of huge amounts of data
  - Massive parallelization of computing tasks
  - Applicable to many real-world applications
  - MapReduce programs are easy to implement
► MapReduce Engine
  - Environment to run MapReduce programs
  - Distributes computing tasks
  - Errors are transparently handled
  - Very scalable architecture
  - Examples: Google MapReduce & Apache Hadoop


Hadoop

► What is Hadoop?
  - Free software framework for data-intensive applications
  - Enables distributed processing of vast amounts of data on cloud computing architectures
  - Supports clouds with 1000+ nodes
  - Two components:
    1) Hadoop Distributed File System (HDFS)
    2) MapReduce Engine
► Where can you get Hadoop?
  - Top-level Apache Project: http://hadoop.apache.org/core/


Hadoop - HDFS

► Inspired by Google File System
► Distributed storage for large files
► Files are split up into multiple parts (default size 64MB)
► Parts are spread over the HDFS nodes
► Each part is replicated (default 3 times)
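To make the splitting and replication arithmetic concrete, here is a small Python sketch. It uses the defaults named above (64MB parts, 3 replicas), but the round-robin placement is an assumption chosen for simplicity and is not the placement policy HDFS actually uses; the function and node names are hypothetical.

```python
BLOCK_SIZE = 64 * 1024 * 1024  # default part size from the slide: 64MB
REPLICATION = 3                # default number of replicas from the slide


def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return (offset, length) for every part of a file of file_size bytes."""
    return [
        (offset, min(block_size, file_size - offset))
        for offset in range(0, file_size, block_size)
    ]


def place_replicas(num_blocks, nodes, replication=REPLICATION):
    """Assign each part to `replication` distinct nodes, round-robin (simplified)."""
    if replication > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    return {
        block: [nodes[(block + r) % len(nodes)] for r in range(replication)]
        for block in range(num_blocks)
    }


MB = 1024 * 1024
blocks = split_into_blocks(200 * MB)  # a 200MB file
placement = place_replicas(len(blocks), ["n1", "n2", "n3", "n4"])
```

A 200MB file yields three full 64MB parts and one 8MB remainder, and every part ends up on three different nodes.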


Hadoop – MapReduce Engine

► Runs MapReduce programs
► Libraries for Java and C++
► Assigns Map and Reduce tasks to computing nodes
► Reduction of data transfer volume
  - Tasks are assigned to nodes holding the data
► Node failures are transparently handled
  - Tasks are restarted on a node holding a replica of the data

[Figure: a TaskManager assigns MAP( ) tasks to nodes; one MAP( ) task fails.]
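The two ideas on this slide, running a task where its data lives and restarting it on another replica holder after a failure, can be sketched as follows. This is a toy Python model under stated assumptions, not Hadoop's scheduler: `NodeFailure`, `assign_task`, `run_with_retry`, and the node names are all hypothetical.

```python
class NodeFailure(Exception):
    """Raised by a task when the node it runs on fails (hypothetical)."""


def assign_task(block, placement, alive):
    """Data locality: pick the first live node that holds a replica of the block."""
    for node in placement[block]:
        if node in alive:
            return node
    raise RuntimeError(f"no live replica for block {block}")


def run_with_retry(block, placement, alive, run_task):
    """Run the task; on node failure, restart it on another replica holder."""
    alive = set(alive)
    while True:
        node = assign_task(block, placement, alive)
        try:
            return node, run_task(node, block)
        except NodeFailure:
            alive.discard(node)  # treat the node as dead and try the next replica


def flaky_map(node, block):
    """Toy MAP task that always fails on node n1."""
    if node == "n1":
        raise NodeFailure(node)
    return f"map({block}) done"


placement = {0: ["n1", "n2", "n3"]}
node, output = run_with_retry(0, placement, {"n1", "n2", "n3"}, flaky_map)
```

The application code (`flaky_map` here) contains no recovery logic; the retry loop belongs to the engine, which is what "transparently handled" means on the slide.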


Hadoop

► Who uses Hadoop?
  - Amazon A9.com (Search Index Building, Analytics)
  - Facebook (Logfile Analysis)
  - Google & IBM (University Initiative to Address Internet-Scale Computing Challenges)
  - Yahoo! (Crawling, Indexing, Searching)
    Yahoo! Hadoop Cluster runs Terabyte Sort Benchmark in 209 seconds
  - And many others… (see http://wiki.apache.org/hadoop/PoweredBy)
► Hadoop resembles Google's MapReduce Framework
  - J. Dean, S. Ghemawat: "MapReduce: Simplified Data Processing on Large Clusters"


The Pig Project

► A platform for analyzing large data sets
► Pig consists of two parts:
  - PigLatin: A Data Processing Language
  - Pig Infrastructure (Grunt): An Evaluator for PigLatin programs
► Where can you get Pig?
  - Apache Incubator Project: http://incubator.apache.org/pig
► Alternatives:
  - HIVE (Facebook)
  - JAQL (IBM Research)


PigLatin Data Processing Language

► PigLatin is imperative (whereas SQL is declarative)
  - Step-by-step programming approach
  - PigLatin queries are easy to write and understand
► Fully nestable data model
  - Atomic values, tuples, bags, maps
► Operators of two flavors:
  - Relational style operators (filter, join, etc.)
  - Functional-programming style operators (map, reduce)
► Easy to extend by user functions
► Example: "Find the top 10 most visited pages in each category"

    visits = load '/data/visits' as (user, url, time);
    gVisits = group visits by url;
    visitCounts = foreach gVisits generate url, count(visits);
    urlInfo = load '/data/urlInfo' as (url, category, pRank);
    visitCounts = join visitCounts by url, urlInfo by url;
    gCategories = group visitCounts by category;
    topUrls = foreach gCategories generate top(visitCounts,10);
    store topUrls into '/data/topUrls';

Example taken from: "Pig Latin: A Not-So-Foreign Language For Data Processing" Talk, SIGMOD 2008
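For readers unfamiliar with PigLatin, the same step-by-step data flow can be mirrored in plain Python. This is only an in-memory sketch for intuition: the sample rows are invented, the function name is hypothetical, and it assumes that `top(visitCounts,10)` ranks pages by their visit count, which the slide does not spell out.

```python
from collections import Counter, defaultdict


def top_urls_per_category(visits, url_info, n=10):
    """Mirror of the PigLatin example: count, join, group, take top n."""
    # group visits by url + count(visits)
    visit_counts = Counter(url for _user, url, _time in visits)
    # load urlInfo and prepare the join key
    category_of = {url: category for url, category, _p_rank in url_info}
    # join visitCounts with urlInfo by url, then group by category
    by_category = defaultdict(list)
    for url, count in visit_counts.items():
        if url in category_of:  # inner join: drop urls without category info
            by_category[category_of[url]].append((url, count))
    # top n per category, assumed to mean "highest visit count first"
    return {
        category: sorted(rows, key=lambda row: (-row[1], row[0]))[:n]
        for category, rows in by_category.items()
    }


# Invented sample data, for illustration only
VISITS = [("u1", "a.com", 1), ("u2", "a.com", 2), ("u1", "b.com", 3), ("u3", "c.com", 4)]
URL_INFO = [("a.com", "news", 0.9), ("b.com", "news", 0.5), ("c.com", "sports", 0.3)]

top1 = top_urls_per_category(VISITS, URL_INFO, n=1)
```

Each Python statement corresponds to one PigLatin statement, which is the point of the imperative, step-by-step style.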


Pig Infrastructure

► Currently two modes:
  - Local: PigLatin programs are locally evaluated (run in a single JVM)
  - MapReduce: PigLatin programs are compiled to sequences of MapReduce programs and executed (e.g., on Hadoop)
► Example:

[Figure: the operators LOAD visits, LOAD url info, GROUP BY url, FOREACH url GENERATE count, JOIN on url, GROUP BY category, and FOREACH category GENERATE top10(urls) are mapped onto the stages Map1, Reduce1, Map2, Reduce2, Map3, Reduce3.]

Example taken from: "Pig Latin: A Not-So-Foreign Language For Data Processing" Talk, SIGMOD 2008


UIMA

UIMA

[Figure labels: Pre-Processing, Analysis Phase, Post-Processing]

UIMA

► Annotators for Part of Speech detection, Named-Entity detection and Relation detection.


The Stratosphere Project

► Many BI queries exceed the capabilities of today's BI systems
  - "Who are the biggest players in the Linux market?"
  - "Which insurance policy customers are at risk of being hit by a current storm?"
► The Internet offers valuable information
  - Enterprise announcements and public business reports
  - User generated content: Blogs, Wikis, Reviews, Comments, etc.
  - News websites and feeds
► Next Generation Business Intelligence (NGBI) requires joint analysis of Internet and enterprise data
  - Internet, Intranet, Data Warehouse, and Local Data must be processed

Goal of the Stratosphere Project is to build an NGBI System on a Cloud Computing Platform


Stratosphere - Architecture

[Figure labels: UI; Query; Query Translation; Query Plan (Crawl, Scan, Extract, Filter, Join, Group); Computing Cloud (Hadoop) with the stages Retrieve, Extract (UIMA), Process, Result; Cache; Internet. Further data sources: Data Warehouse, Email, Intranet, Office documents (spreadsheets).]


Stratosphere – Research Challenges

► Definition of an algebra for expressing NGBI queries
  - Includes: traditional database operators, data retrieval operators, information extraction operators, and information integration operators
► Implementation of NGBI query operators
  - Requirements: highly scalable, robust, self-tuning
  - Leveraging Hadoop and MapReduce frameworks
► Implementation of a cloud computing monitoring infrastructure
  - Enabling self-tuning NGBI operators


Related Project: DBLife


Related Projects: Avatar Email Search


Situational Business Intelligence Example

Which insurance policy customers are at risk of being hit by a current storm?

Severe weather: Meet Pete, an insurance agent in Louisiana.

1. He sees a news report of a severe storm. What is the company's risk?

2. Pete has an Excel spreadsheet with all policy holders he manages, which he filters to select only properties insured for more than $250,000.

3. Pete searches for a website that can predict flood levels for his area and finds www.floodlevels.com, a mashup which predicts the flood level for a geographic area based on USGS flood level forecasts, and GIS databases from

4. Pete connects his spreadsheet to www.floodlevels.com

5. He then forwards a risk summary to executives.

Data sources shown on the slide:
  - http://water.usgs.gov/waterwatch/ (Zipcode)
  - edc.usgs.gov/ (Geocode = Latitude/Longitude)
  - http://www.dotd.florida.gov/ (HUC = Hydrological Unit Code)
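Pete's ad-hoc analysis boils down to a filter on the spreadsheet followed by a join with the flood forecast. The Python sketch below illustrates that under loud assumptions: the policy rows, the zip codes, the forecast values, the 1.0 m threshold, and all function names are invented, and the forecast is modeled as a simple zip-code lookup rather than a call to any real web service.

```python
def at_risk_policies(policies, flood_level_by_zip, min_insured=250_000, threshold_m=1.0):
    """Filter high-value policies, then join them with predicted flood levels."""
    report = []
    for policy in policies:
        if policy["insured_value"] <= min_insured:
            continue  # step 2: keep only properties insured for more than $250,000
        level = flood_level_by_zip.get(policy["zipcode"])  # step 4: join on zip code
        if level is not None and level >= threshold_m:
            report.append({**policy, "predicted_flood_m": level})
    return report


def total_exposure(report):
    """Step 5: one summary number for the executives."""
    return sum(policy["insured_value"] for policy in report)


# Invented sample data, for illustration only
POLICIES = [
    {"holder": "A", "zipcode": "70112", "insured_value": 400_000},
    {"holder": "B", "zipcode": "70112", "insured_value": 150_000},
    {"holder": "C", "zipcode": "70801", "insured_value": 300_000},
    {"holder": "D", "zipcode": "71101", "insured_value": 500_000},
]
FORECAST_M = {"70112": 1.8, "70801": 0.2}  # hypothetical flood levels in meters

report = at_risk_policies(POLICIES, FORECAST_M)
exposure = total_exposure(report)
```

Holder D is dropped because no forecast exists for that zip code, which shows why lineage and "why do I not see this tuple" questions matter in situational BI.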


Flood Risk Assessment Mashup

[Figure labels: Mashup Search; Report; Standardize; www.floodlevels.com; policy XLS; water.usgs.gov; edc.usgs.gov; dotd.louisiana.gov; Screen Scraping; Lineage; Standardization]


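Situational BI Evolution

[Figure: chart with one axis ranging between "Best Effort, Ad Hoc" and "Mission Critical" and the other between "Limited Time, Immediate" and "Lots of Time". Labels on the chart: Mashups, SCA, Portals, New Initiatives, Proof of Concept, Line of Business, IT Dept, Data Mart, Data Warehouse.]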


Select Literature

(Algebraic) Extraction
► Michele Banko, Michael J. Cafarella, Stephen Soderland, Matthew Broadhead, Oren Etzioni: Open Information Extraction from the Web. International Joint Conferences on Artificial Intelligence (IJCAI) 2007: 2670-2676
► Frederick Reiss, Shivakumar Vaithyanathan, Sriram Raghavan, Rajasekar Krishnamurthy, Huaiyu Zhu: An Algebraic Approach to Rule-Based Information Extraction. International Conference on Data Engineering (ICDE) 2008: 933-942

Schema generation from extracted uncertain data
► Xin Dong, Alon Y. Halevy: Malleable Schemas: A Preliminary Report. WebDB 2005: 139-144
► Marcos Antonio Vaz Salles, Jens-Peter Dittrich, Shant Kirakos Karakashian, Olivier René Girard, Lukas Blunschi: iTrails: Pay-as-you-go Information Integration in Dataspaces. International Conference on Very Large Databases (VLDB) 2007: 663-674

Optimization
► Alpa Jain, AnHai Doan, Luis Gravano: Optimizing SQL Queries over Text Databases. International Conference on Data Engineering (ICDE) 2008: 636-645

BI over text
► Alpa Jain, AnHai Doan, Luis Gravano: Optimizing SQL Queries over Text Databases. International Conference on Data Engineering (ICDE) 2008: 636-645
► Raghu Ramakrishnan, Andrew Tomkins: Towards a PeopleWeb. IEEE Computer 40(8): 63-72
► Alexander Löser, Gregor Hackenbroich, Hong-Hai Do, Henrike Berthold: Web 2.0 Business Analytics. Datenbank Spektrum 25/2008
► T. S. Jayram, Andrew McGregor, S. Muthukrishnan, Erik Vee: Estimating Statistical Aggregates on Probabilistic Data Streams. PODS 2007


Conclusion

► BI over text will tap into a huge set of additional information for BI

► The next generation of business intelligence applications will utilize technologies for scalable processing and service computing to integrate data sources from warehouses, intranet, and Internet

► Situational BI will create ad-hoc applications to answer complex questions over integrated data sources

► Open research problems:
  - Which is the right extraction service?
  - "How much" schema can be generated?
  - "How much" optimization does the user have to add?
  - How to optimize UIMA-based extraction plans on a Hadoop cloud?
  - What is a suitable query language over Hadoop?
  - Data cleansing, completion, and duplicate detection of extracted data?
  - Data explanation: lineage, but also: why do I NOT see that data tuple?


Acknowledgements

► Discussions at IBM Research and IBM SWG
  - Anant Jhingran, Hamid Pirahesh, Kevin Beyer, David Simmen, Mehmet Altinel, et al.
► My team at TU Berlin
  - Alexander Löser, Fabian Hüske, Stephan Ewen, Helko Glathe


Thank You (English)

Merci (French)
Grazie (Italian)
Gracias (Spanish)
Obrigado (Brazilian Portuguese)
Danke (German)

[The slide also shows "thank you" in Japanese, Russian, Arabic, Traditional Chinese, Simplified Chinese, Hindi, Tamil, Thai, and Korean; the non-Latin text was not preserved.]
