
© Prof. Dr. -Ing. Wolfgang Lehner


The holistic picture of Big Data - From data capturing to complex analytics Prof. Dr.-Ing. Wolfgang Lehner

Dresden, Systema Expert Day Jan-22nd, 2015

| 2

Data, data, everywhere… The situation today

Unstructured, coming from sources that haven’t been mined before

Compounded by internet, social media, cloud computing, mobile devices, digital images…

Exponential. Every 2 days we create as much data as from the Dawn of Civilisation to 2003*

Hard to keep up. Communication operators managing petabyte-scale data expect 10-100x data growth in the next 5 years**

| 3

Smart Everything

Smart Everything Smart „things“ Smart places Smart networks Smart services Smart solutions

„Smart-*“ infrastructure

Physical and digital worlds collide!

Need to make things smart!

Requirements for “Smart Everything”
Interactive (“tangible”): low latency
High volume: high throughput

| 4

http://jerryrushing.net/wp-content/uploads/2012/04/robotic_assembly_line1.jpg
http://www.witchdoctor.co.nz/wp-content/uploads/2013/01/robot-fabrication-station.jpg

Real-Time Sensing and Decision

5G Revolution - Tactile Internet

State of the art

Massive safety and security

5G

The Tactile

Internet

Massive low latency

Massive throughput

Massive sensing

Massive resilience

Massive fractal heterogeneity

> 10 Gbit/s per user; < 1 ms RTT; > 10k sensors per cell; 10x10 heterogeneity

| 5

Big Data Analytics…

… this is soooo 2012!

| 6

…from smart phone to smart lenses

http://ngm.nationalgeographic.com

Novel Big Data Analytics apps with millisecond response times, incorporating local context as well as global state

your personal coupon arrived!!!

Buy x get y free

| 7

..beyond traditional applications

Shopping Application ________________

Product Recommendations

Record transactions,

weblogs, sensors

Refine Recommendations

Optimize the application

Mining of user transactions and

recommendation history

User Comments User on e-retail site

Inventory User Transactions

other data sources

Identify buying patterns, users likes/dislikes

| 8

Current State and Overall Question

Observation

“Things” are generating lots of data

Big Data Analytics: FIND THE NEEDLE IN THE HAYSTACK
+ You don’t know if there is a needle at all
+ The needle may turn out to be a nail.

Question

How to orchestrate different methods and techniques far beyond a pure database system to implement a data refinement process?

How to provide data management services in a scalable way, combining data-intensive and compute-intensive application characteristics?

| 9

Observation 1: Infrastructure

Massive computing power in cloud/cluster environments

Huge variety of “mobile/distributed” devices
Significant computing power in “mobile” devices
Massive memory capacity: “disk is tape”, (NV)RAM is king

Significant communication capabilities

Main Memory and data-centric architectures as the main driver

Main-Memory is KING, disk is DEAD

10

> Observation 1: Infrastructure

Microsoft Massive Data Center

| 11

Observation 2: Data Production Process

Observation 2: Analytical Processes Refinement Process

Different steps with quality gates, from raw data to knowledge extraction

Data acquisition
Data extraction / cleaning
Data integration / annotation
Data analysis and visualization
Interpretation
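The idea of refinement steps separated by quality gates can be illustrated with a minimal Python sketch. It is not taken from the talk: the step names follow the slide, but the `run_refinement` helper, the `QualityGateError` exception, and the toy sensor readings are assumptions made for illustration.

```python
class QualityGateError(Exception):
    """Raised when the output of a step does not pass its quality gate."""


def run_refinement(data, steps):
    """Run (name, transform, gate) steps in order and check each gate."""
    for name, transform, gate in steps:
        data = transform(data)
        if not gate(data):
            raise QualityGateError(f"quality gate failed after step: {name}")
    return data


# Hypothetical steps operating on raw sensor readings given as strings.
steps = [
    # Acquisition: trim whitespace; gate: at least one reading arrived.
    ("acquisition", lambda raw: [r.strip() for r in raw], lambda d: len(d) > 0),
    # Cleaning: drop empty readings and parse; gate: no negative values.
    ("extraction/cleaning",
     lambda d: [float(r) for r in d if r],
     lambda d: all(v >= 0 for v in d)),
    # Analysis: average the readings; gate: result within a plausible range.
    ("analysis", lambda d: sum(d) / len(d), lambda avg: avg < 100),
]

result = run_refinement([" 20.5", "21.5 ", ""], steps)
print(result)
```

Each gate is just a predicate on the step output, so a failing gate stops the process before poor-quality data reaches the later, more expensive steps.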

| 12

…in a nutshell

Big Data Life Cycle Management and Workflows

Efficient Big Data Architectures

Data Quality / Data Integration

Visual Analytics

Knowledge Extraction

Big Data Refinement Process

Functional Methods and Techniques

Orchestration of Analytical Process Steps

Provide efficient Data Management Runtime

… Big Data is MUCH more than just a lot of data, it‘s all about orchestration, quality control, and interpretation

| 13


Outline



Data Quality / Data Integration

Visual Analytics

Knowledge Extraction

| 14

Example: Object Matching (deduplication)

Identification of semantically equivalent objects

within one data source or between different sources

Original focus on structured (relational) data, e.g. customer data

CID | Name | Street | City | Sex
11 | Kristen Smith | 2 Hurley Pl | South Fork, MN 48503 | 0
24 | Christian Smith | Hurley St 2 | S Fork MN | 1

Cno | LastName | FirstName | Gender | Address | Phone/Fax
24 | Smith | Christoph | M | 23 Harley St, Chicago IL, 60633-2394 | 333-222-6542 / 333-222-6599
493 | Smith | Kris L. | F | 2 Hurley Place, South Fork MN, 48503-5998 | 444-555-6666
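As a rough illustration of object matching on the customer records above, the following Python sketch compares normalized names and addresses across the two sources. It is only one possible approach under stated assumptions: the abbreviation table, the equal weighting of name and address, and the helper names are hypothetical, and the string similarity comes from the standard library `difflib.SequenceMatcher`, not from any method named in the talk.

```python
from difflib import SequenceMatcher

# Hypothetical abbreviation table; a real one would be far larger.
ABBREVIATIONS = {"pl": "place", "st": "street", "s": "south"}


def normalize(text):
    """Lowercase, replace punctuation with spaces, expand abbreviations."""
    cleaned = "".join(c if c.isalnum() else " " for c in text.lower())
    return " ".join(ABBREVIATIONS.get(tok, tok) for tok in cleaned.split())


def similarity(a, b):
    """String similarity in [0, 1] based on matching character blocks."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def match_score(rec_a, rec_b):
    """Unweighted mean of name and address similarity (an assumed rule)."""
    return (similarity(rec_a["name"], rec_b["name"])
            + similarity(rec_a["address"], rec_b["address"])) / 2


def best_match(record, candidates):
    """Return the candidate with the highest match score."""
    return max(candidates, key=lambda c: match_score(record, c))


# Records from the two example tables, mapped to a common shape by hand.
source_a = [
    {"id": 11, "name": "Kristen Smith",
     "address": "2 Hurley Pl South Fork, MN 48503"},
    {"id": 24, "name": "Christian Smith",
     "address": "Hurley St 2 S Fork MN"},
]
source_b = [
    {"id": 24, "name": "Christoph Smith",
     "address": "23 Harley St, Chicago IL, 60633-2394"},
    {"id": 493, "name": "Kris L. Smith",
     "address": "2 Hurley Place, South Fork MN, 48503-5998"},
]

kristen = source_a[0]
print(best_match(kristen, source_b)["id"])
```

The example also shows why the identifiers alone are not enough: both sources contain a customer numbered 24, yet those two records describe different people.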

| 15



| 16

Process Templates for Big Data

Example Template for Analytics

| 17

Building Blocks for Analytics

How does a clustering algorithm work?

Algorithm descriptions: verbal & abstract or programmed & specific

Essentials of clustering: three core phases, nine basic tasks

Evaluation Phase: “measure similarity”
distance measure, objects, references, distances

Selection Phase: “choose similar objects”
filters, conditions

Association Phase: “group objects”
association function, adjacencies, clustering
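To make the three phases concrete, here is a small Python sketch that expresses a simple threshold-based clustering as evaluation, selection, and association. The decomposition follows the slide, but the concrete choices are assumptions for illustration: Euclidean distance as the distance measure, a fixed `eps` threshold as the filter condition, and connected components as the association function.

```python
from itertools import combinations
from math import dist


def evaluate(points):
    """Evaluation phase: measure the distance between every pair of objects."""
    return {(i, j): dist(points[i], points[j])
            for i, j in combinations(range(len(points)), 2)}


def select(distances, eps):
    """Selection phase: keep only the pairs that are similar enough."""
    return [pair for pair, d in distances.items() if d <= eps]


def associate(n, adjacencies):
    """Association phase: group objects that are linked by selected pairs."""
    parent = list(range(n))

    def find(x):
        # Follow parent links to the group representative (with compression).
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for i, j in adjacencies:
        parent[find(i)] = find(j)

    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


def cluster(points, eps):
    """Chain the three phases into one clustering run."""
    return associate(len(points), select(evaluate(points), eps))


points = [(0, 0), (0, 1), (5, 5), (5, 6), (10, 0)]
print(cluster(points, eps=1.5))
```

Swapping a single phase, for example a different distance measure in `evaluate` or a density condition in `select`, changes the clustering behaviour without touching the other two phases.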

| 18


density-based

hierarchical

| 19


| 20



21

> Visualization

Percentage of points/range not fitting under max. unimodal distribution

color dissimilarity

selected attribute

cluster-level(color amount = inhomogeneity)

dataset-level(color amount = usefulness for clustering)

22

> Example: CAVE

23


| 24



| 25

Big Data Architectures

First Phase of the next generation HRSK (HRSK-II)

7,000 cores

Second Phase (by end of March 2015)

>40,000 cores in total

| 26

Scalable Data Management Runtime

High-Performance Scale-up Computing Infrastructure (e.g. SGI UV 2000)

In-Memory Storage Engine

Programmability

… Relational Operators Data Mining Ops

Procedural Code Custom Operators

Analytical Performance

| 27



| 28

Scientific Workflows

What is it?

Series of structured activities and computations that arise in data-intensive problems

[Figure: example workflow diagram]
Node labels: S1, S2, S3, Δ snapshot, Lookup, FltNN, Function, SK, transfer/compress, Load, Join, L1, L2, SP1, SP2, V1, V2, DW1, DW2, DW3
Data sources shown: sales data table; log file with employee data; sensor data, clickstreams
Step annotations: comparison against snapshot; null value filter; schema modification
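Three of the annotated steps in the workflow figure (null value filter, comparison against snapshot, schema modification) can be sketched as plain Python functions over lists of dictionaries. This is an illustrative reading of the labels only; the function names, the record layout, and the chaining order are assumptions and do not reproduce the workflow shown on the slide.

```python
def null_value_filter(rows, required):
    """Drop rows in which any required attribute is missing or None."""
    return [r for r in rows if all(r.get(k) is not None for k in required)]


def delta_against_snapshot(rows, snapshot, key):
    """Keep rows that are new or changed compared with the last snapshot."""
    previous = {r[key]: r for r in snapshot}
    return [r for r in rows if previous.get(r[key]) != r]


def modify_schema(rows, renames):
    """Rename attributes to match the target schema."""
    return [{renames.get(k, k): v for k, v in r.items()} for r in rows]


# Hypothetical sales records and the snapshot from the previous run.
snapshot = [{"id": 1, "amount": 10}]
rows = [
    {"id": 1, "amount": 10},    # unchanged since the snapshot
    {"id": 2, "amount": None},  # incomplete record
    {"id": 3, "amount": 7},     # new record
]

clean = null_value_filter(rows, required=["id", "amount"])
changed = delta_against_snapshot(clean, snapshot, key="id")
loaded = modify_schema(changed, {"amount": "sales_amount"})
print(loaded)
```

Only the new, complete record survives all three steps, which is the point of placing such filters before the load into a warehouse.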

| 29

Data Lifecycle Management (DLCM)

What is it?

Classifying, managing, and moving information to the most cost-effective data repository, based on the value of each piece of information at that exact point in time

Implication: information value changes over time. Information ages at different rates and has a finite life cycle, and its performance needs change as it ages.

DLCM in the context of Big Data

Due to the sheer volume of data, we can no longer follow the traditional principle of ubiquity.
Ubiquity = all data with the same value at all times and in all possible locations
Need-to-know principle = provide data only where it is required, with only the required quality

Basis for Real-Time Analytics -> „Right-Time“ Analytics
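The notion that information value decays with age and determines the most cost-effective repository can be sketched in a few lines of Python. Everything concrete here is an assumption for illustration: the exponential decay with a per-item half-life, the tier names, and the thresholds are not from the talk.

```python
# Hypothetical tiers: (minimum value, repository), from fastest to cheapest.
TIERS = [(0.5, "main memory"), (0.1, "disk"), (0.0, "archive")]


def current_value(initial_value, age_days, half_life_days):
    """Assumed model: value halves every half_life_days."""
    return initial_value * 0.5 ** (age_days / half_life_days)


def choose_repository(value):
    """Pick the first tier whose minimum value the item still reaches."""
    for threshold, tier in TIERS:
        if value >= threshold:
            return tier
    return TIERS[-1][1]


# The same item, observed at different ages, moves to cheaper tiers.
for age in (0, 30, 90, 300):
    value = current_value(1.0, age, half_life_days=30)
    print(age, round(value, 4), choose_repository(value))
```

Giving each data item its own half-life is one simple way to model that information ages at different rates.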

30

> ScaDS Dresden/Leipzig

31

> German Centers for Big Data

| 32

German Centers for Big Data

Two Centers of Excellence for Big Data in Germany
ScaDS Dresden/Leipzig
Berlin Big Data Center (BBDC)

ScaDS Dresden/Leipzig (Competence Center for Scalable Data Services and Solutions Dresden/Leipzig)
scientific coordinators: Nagel (TUD), Rahm (UL)
start: Oct. 2014
duration: 4 years (option for 3 more years)
initial funding: approx. EUR 5.6 million

Overall Mission

Bundling and advancement of existing expertise on Big Data

Development of Big Data Services and Solutions

Driving Big Data Innovations

Leipzig

Dresden

| 33

Associated Partners

| 34

ScaDS Structure



Data Quality / Data Integration

Visual Analytics

Knowledge Extraction

Life sciences

Material and Engineering sciences

Digital Humanities

Environmental / Geo sciences

Business Data

Service center

| 35

Research Partners

Data-intensive computing W.E. Nagel

Data quality / Data integration E. Rahm

Databases W. Lehner, E. Rahm

Knowledge extraction/Data mining C. Rother, P. Stadler, G. Heyer

Visualization S. Gumhold, G. Scheuermann

Service Engineering, Infrastructure K.-P. Fähnrich, W.E. Nagel, M. Bogdan

| 36

TU Dresden – Database Systems Group

| 37

Group Awards and Recognitions

Premium Research Relationship with SAP HANA
NVIDIA Graduate Fellowship Program 2013/14
Amazon Research Grant 2013
IBM Smarter Planet Innovation Award 2012: Data Management in Smart Grids
Apps4Deutschland Competition 2012, 2nd and 3rd
NVIDIA Professor Partnership Award 2011
ACM SIGMOD Programming Contest, 1st (2011) and 2nd (2009)
IEEE International Services Computing Contest, 1st (2008) and 2nd (2007)
Accenture Campus Challenge, 1st in 2009
IBM Faculty Award 2006
AMD prize for best diploma thesis 2009, 2010 and 2011
Saxonia Systems Special Women Award 2012
IBM prize for best diploma thesis 2004
Lohrmann Medal 2010

| 38

Interdisciplinary Projects

Among Top 3 Universities in Engineering in Germany

– “University of Excellence” status since 2012

– 37,000+ students / 7,500+ employees

Cluster of Excellence

– 5+ years concept

– New materials beyond CMOS

– Profound expertise in electronics

– 60+ investigators and teams

Collaborative Research Center

– Highly adaptive energy-efficient computing

– 12-year project

– 16 investigators and teams

Synergies Initiative

– Regional cooperation between academia, industry, education, culture, and administration

– 22 partners

5G Lab Germany

– Founded mid 2014

– Edge Cloud applications

– SW + HW developments

Big Data National Competence Center

– Service-oriented character

| 39

What’s in it for you?

Service Center

Customers

Research Topics
• Big Data architectures
• data quality and integration

• knowledge extraction

• visual analytics

• data life cycle management and workflows

Application Areas
• Life sciences

• Material sciences

• Digital Humanities

• Business Data

• … (Manufacturing ?)

Governance

Education
