© Prof. Dr. -Ing. Wolfgang Lehner
|
The holistic picture of Big Data - From data capturing to complex analytics Prof. Dr.-Ing. Wolfgang Lehner
Dresden, Systema Expert Day Jan-22nd, 2015
| 2
Data, data, everywhere… The situation today
Unstructured, coming from sources that haven’t been mined before
Compounded by internet, social media, cloud computing, mobile devices, digital images…
Exponential. Every 2 days we create as much data as from the Dawn of Civilisation to 2003*
Hard to keep up. Communication operators managing petabyte-scale data expect 10-100x data growth in the next 5 years**
| 3
Smart Everything
Smart „things“
Smart places
Smart networks
Smart services
Smart solutions
„Smart-*“ infrastructure
Physical and digital worlds collide!
Need to make things smart…!
Requirements for “Smart Everything”
Interactive (“tangible”): low latency
High volume: high throughput
| 4
http://jerryrushing.net/wp-content/uploads/2012/04/robotic_assembly_line1.jpg
http://www.witchdoctor.co.nz/wp-content/uploads/2013/01/robot-fabrication-station.jpg
Real-Time Sensing and Decision
5G Revolution - Tactile Internet
State of the art
Massive safety and security
5G
The Tactile
Internet
Massive low latency
Massive throughput
Massive sensing
Massive resilience
Massive fractal heterogeneity
> 10 Gbit/s per user, < 1 ms RTT, > 10k sensors per cell, 10x10 heterogeneity
| 5
Big Data Analytics…
… this is soooo 2012!
| 6
…from smart phone to smart lenses
http://ngm.nationalgeographic.com
novel Big Data Analytics apps with ms-response time incorporating local context as well as global state
your personal coupon arrived!!!
Buy x get y free
| 7
…beyond traditional applications
Shopping Application
Product Recommendations
Record transactions, weblogs, sensors
Refine Recommendations
Optimize the application
Mining of user transactions and recommendation history
User on e-retail site
User Comments
Inventory
User Transactions
other data sources
Identify buying patterns, users' likes/dislikes
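A minimal sketch of how "mining of user transactions" could feed product recommendations. The slide does not name a technique, so simple co-purchase counting is an assumption here, and the transaction data and function names are hypothetical.

```python
from collections import Counter
from itertools import combinations

# Hypothetical transaction log: each entry is the set of products in one purchase.
transactions = [
    {"tent", "sleeping bag", "stove"},
    {"tent", "sleeping bag"},
    {"stove", "gas cartridge"},
    {"tent", "stove", "gas cartridge"},
]

def co_purchase_counts(baskets):
    """Count how often each ordered pair of products appears in the same basket."""
    counts = Counter()
    for basket in baskets:
        for a, b in combinations(sorted(basket), 2):
            counts[(a, b)] += 1
            counts[(b, a)] += 1
    return counts

def recommend(product, baskets, top_n=2):
    """Return up to top_n products most often bought together with `product`."""
    counts = co_purchase_counts(baskets)
    partners = {b: n for (a, b), n in counts.items() if a == product}
    # Sort by descending count, then alphabetically for a stable result.
    ranked = sorted(partners.items(), key=lambda item: (-item[1], item[0]))
    return [name for name, _ in ranked[:top_n]]

print(recommend("tent", transactions))  # ['sleeping bag', 'stove']
```

Recording new transactions and re-running the count corresponds to the "refine recommendations" loop on the slide.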
| 8
Current State and Overall Question
Observation
„Things“ are generating lots of data
Big Data Analytics: FIND THE NEEDLE IN THE HAYSTACK
+ You don’t know if there is a needle at all
+ The needle may turn out to be a nail.
Question
How to orchestrate different methods and techniques far beyond a pure database system to implement a data refinement process?
How to provide data management services in a scalable way, combining data-intensive and compute-intensive application characteristics?
| 9
Observation 1: Infrastructure
Massive computing power in cloud/cluster environments
Huge variety of „mobile/distributed“ devices
Significant computing power in “mobile” devices
Massive memory capacity
“disk is tape” – (NV)RAM is king
Significant communication capabilities
Main Memory and data-centric architectures as the main driver
Main-Memory is KING, disk is DEAD
| 10
Observation 1: Infrastructure
Microsoft Massive Data Center
| 11
Observation 2: Data Production Process
Observation 2: Analytical Processes Refinement Process
Different steps with quality gates - from raw data to knowledge extraction
Data acquisition
Data extraction / cleaning
Data integration / annotation
Data analysis and visualization
Interpretation
| 12
…in a nutshell
Big Data Life Cycle Management and Workflows
Efficient Big Data Architectures
Data Quality / Data Integration
Visual Analytics
Knowledge Extraction
Big Data Refinement Process
Functional Methods and Techniques
Orchestration of Analytical Process Steps
Provide efficient Data Management Runtime
… Big Data is MUCH more than just a lot of data, it‘s all about orchestration, quality control, and interpretation
| 13
Big Data Refinement Process
Outline
Big Data Life Cycle Management and Workflows
Efficient Big Data Architectures
Data Quality / Data Integration
Visual Analytics
Knowledge Extraction
| 14
Example: Object Matching (deduplication)
Identification of semantically equivalent objects
within one data source or between different sources
Original focus on structured (relational) data, e.g. customer data
CID | Name            | Street      | City                 | Sex
11  | Kristen Smith   | 2 Hurley Pl | South Fork, MN 48503 | 0
24  | Christian Smith | Hurley St 2 | S Fork MN            | 1

Cno | LastName | FirstName | Gender | Address                                   | Phone/Fax
24  | Smith    | Christoph | M      | 23 Harley St, Chicago IL, 60633-2394      | 333-222-6542 / 333-222-6599
493 | Smith    | Kris L.   | F      | 2 Hurley Place, South Fork MN, 48503-5998 | 444-555-6666
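A minimal sketch of object matching between the two customer sources on this slide. The slide does not prescribe a similarity measure; token-set Jaccard similarity, the small abbreviation table, and the 0.6 threshold are all assumptions chosen for illustration. Sex/gender and phone fields are left out for brevity.

```python
import re

# Tiny, hypothetical abbreviation table; a real system would use a curated one.
ABBREVIATIONS = {"pl": "place", "st": "street", "s": "south"}

def tokens(record):
    """Normalize all field values of a record into a set of lowercase tokens."""
    text = " ".join(str(value) for value in record.values()).lower()
    return {ABBREVIATIONS.get(tok, tok) for tok in re.findall(r"[a-z0-9]+", text)}

def jaccard(a, b):
    """Jaccard similarity of two token sets: |intersection| / |union|."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

# Records taken from the slide, keyed by CID and Cno respectively.
source_a = {
    11: {"name": "Kristen Smith", "street": "2 Hurley Pl", "city": "South Fork, MN 48503"},
    24: {"name": "Christian Smith", "street": "Hurley St 2", "city": "S Fork MN"},
}
source_b = {
    24: {"name": "Christoph Smith", "address": "23 Harley St, Chicago IL, 60633-2394"},
    493: {"name": "Kris L. Smith", "address": "2 Hurley Place, South Fork MN, 48503-5998"},
}

def match_pairs(left, right, threshold=0.6):
    """Return (left_id, right_id) pairs whose similarity reaches the threshold."""
    matches = []
    for lid, lrec in left.items():
        for rid, rrec in right.items():
            if jaccard(tokens(lrec), tokens(rrec)) >= threshold:
                matches.append((lid, rid))
    return matches

print(match_pairs(source_a, source_b))  # [(11, 493)]
```

With these assumptions only CID 11 and Cno 493 are reported as a match. CID 24 against Cno 493 scores about 0.46, which shows how sensitive the result is to the chosen threshold.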
| 15
Big Data Refinement Process
Outline
Big Data Life Cycle Management and Workflows
Efficient Big Data Architectures
Data Quality / Data Integration
Visual Analytics
Knowledge Extraction
| 16
Process Templates for Big Data
Example Template for Analytics
| 17
Building Blocks for Analytics
How does a clustering algorithm work?
Algorithm descriptions: verbal & abstract or programmed & specific
Essentials of clustering: three core phases, nine basic tasks
Evaluation Phase: „measure similarity“ (distance measure, objects, references, distances)
Selection Phase: „choose similar objects“ (filters, conditions)
Association Phase: „group objects“ (association function, adjacencies, clustering)
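The three core phases can be sketched as three small functions. This is only one possible instantiation, assuming Euclidean distance, a fixed distance cut-off as the filter, and connected components as the association function (in effect single-linkage clustering with a threshold). The nine basic tasks mentioned on the slide are not modelled, and all names are hypothetical.

```python
from itertools import combinations
from math import dist

def evaluate(points):
    """Evaluation phase: measure similarity as pairwise Euclidean distance."""
    return {(i, j): dist(points[i], points[j])
            for i, j in combinations(range(len(points)), 2)}

def select(distances, max_distance):
    """Selection phase: keep only the pairs that satisfy the filter condition."""
    return [pair for pair, d in distances.items() if d <= max_distance]

def associate(n_points, adjacencies):
    """Association phase: group objects that are connected by adjacencies."""
    parent = list(range(n_points))  # every object starts as its own group

    def find(i):
        # Follow parent links up to the representative of the group.
        while parent[i] != i:
            i = parent[i]
        return i

    for i, j in adjacencies:
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[max(ri, rj)] = min(ri, rj)  # merge the two groups
    return [find(i) for i in range(n_points)]

points = [(0.0, 0.0), (0.5, 0.0), (0.0, 0.4), (5.0, 5.0), (5.3, 5.1)]
clusters = associate(len(points), select(evaluate(points), max_distance=1.0))
print(clusters)  # [0, 0, 0, 3, 3]
```

Swapping one phase, for example a density condition in `select`, changes the algorithm family without touching the other two phases, which is the point of treating them as building blocks.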
| 18
Building Blocks for Analytics
density-based
hierarchical
| 19
Building Blocks for Analytics
| 20
Big Data Refinement Process
Outline
Big Data Life Cycle Management and Workflows
Efficient Big Data Architectures
Data Quality / Data Integration
Visual Analytics
Knowledge Extraction
| 21
Visualization
Percentage of points/range not fitting under max. unimodal distribution
color dissimilarity
selected attribute
cluster-level (color amount = inhomogeneity)
dataset-level (color amount = usefulness for clustering)
| 22
Example: CAVE
| 23
Visualization
Percentage of points/range not fitting under max. unimodal distribution
color dissimilarity
selected attribute
cluster-level (color amount = inhomogeneity)
dataset-level (color amount = usefulness for clustering)
| 24
Big Data Refinement Process
Outline
Big Data Life Cycle Management and Workflows
Efficient Big Data Architectures
Data Quality / Data Integration
Visual Analytics
Knowledge Extraction
| 25
Big Data Architectures
First phase of the next-generation HRSK (HRSK-II)
7,000 cores
Second phase (by end of March 2015)
>40,000 cores in total
| 26
Scalable Data Management Runtime
High-Performance Scale-up Computing Infrastructure (e.g. SGI UV 2000)
In-Memory Storage Engine
Programmability
Relational Operators, Data Mining Ops, Procedural Code, Custom Operators, …
Analytical Performance
| 27
Big Data Refinement Process
Outline
Big Data Life Cycle Management and Workflows
Efficient Big Data Architectures
Data Quality / Data Integration
Visual Analytics
Knowledge Extraction
| 28
Scientific Workflows
What is it?
Series of structured activities and computations that arise in data-intensive problems
[Figure: example workflow. Recoverable labels: sources S1, S2, S3 (sales data table; log file with employee data; sensor data, clickstreams); operators Δ snapshot, Lookup, FltNN, Function, SK, transfer/compress, Load, Join; annotations: comparison against snapshot, null value filter, schema modification; further nodes L1, L2, SP1, SP2, DW1, DW2, DW3, V1, V2.]
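A minimal sketch of such a workflow as a chain of steps, using three of the operations named on the slide (null value filter, comparison against snapshot, schema modification). The row format, function names, and sample data are hypothetical and not taken from the figure.

```python
def null_value_filter(rows, required):
    """Drop rows in which any required field is missing or None."""
    return [r for r in rows if all(r.get(f) is not None for f in required)]

def delta_against_snapshot(rows, snapshot, key):
    """Keep rows that are new or changed compared with the previous snapshot."""
    previous = {r[key]: r for r in snapshot}
    return [r for r in rows if previous.get(r[key]) != r]

def modify_schema(rows, renames):
    """Rename fields so the rows fit the target schema."""
    return [{renames.get(k, k): v for k, v in r.items()} for r in rows]

def run_workflow(rows, steps):
    """Apply the steps one after another; each step takes and returns rows."""
    for step in steps:
        rows = step(rows)
    return rows

# Hypothetical source data and the snapshot of the previous load.
snapshot = [{"id": 1, "amt": 10}, {"id": 2, "amt": 20}]
incoming = [
    {"id": 1, "amt": 10},    # unchanged, dropped by the delta step
    {"id": 2, "amt": 25},    # changed
    {"id": 3, "amt": None},  # dropped by the null value filter
    {"id": 4, "amt": 7},     # new
]

loaded = run_workflow(incoming, [
    lambda rows: null_value_filter(rows, required=["id", "amt"]),
    lambda rows: delta_against_snapshot(rows, snapshot, key="id"),
    lambda rows: modify_schema(rows, renames={"amt": "amount"}),
])
print(loaded)  # [{'id': 2, 'amount': 25}, {'id': 4, 'amount': 7}]
```

Keeping every step as a plain function over rows makes it easy to reorder steps or insert further ones, such as a join or a compression step.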
| 29
Data Lifecycle Management (DLCM)
What is it?
Classifying, managing, and moving information to the most cost effective data repository based on the value of each piece of information at that exact point in time
Implication: information value changes over time. It ages at different rates and has a finite life cycle; as data ages, its performance needs change.
DLCM in the context of Big Data
Due to the sheer volume of data, we can no longer follow the traditional principle of ubiquity.
Ubiquity = all data with the same value at all times and all possible locations
Need-to-Know Principle = provide data only where it is required, with only the required quality
Basis for Real-Time Analytics -> „Right-Time“ Analytics
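A minimal sketch of the DLCM idea of placing data according to its value at a given point in time. The slide gives no value model; exponential decay with a per-item half-life, the tier names, and the thresholds are all assumptions made for illustration.

```python
from dataclasses import dataclass

@dataclass
class DataItem:
    name: str
    initial_value: float   # value when created (hypothetical scale 0..1)
    half_life_days: float  # how quickly the value ages
    age_days: float

def current_value(item):
    """Assumed model: the value halves every `half_life_days`."""
    return item.initial_value * 0.5 ** (item.age_days / item.half_life_days)

# Hypothetical tiers as (minimum value, repository), most expensive first.
TIERS = [(0.5, "main memory"), (0.1, "disk"), (0.0, "archive")]

def choose_tier(item):
    """Pick the first tier whose minimum value the item still reaches."""
    value = current_value(item)
    for threshold, tier in TIERS:
        if value >= threshold:
            return tier
    return TIERS[-1][1]

items = [
    DataItem("sensor readings (today)", 1.0, half_life_days=1, age_days=0.5),
    DataItem("sensor readings (last week)", 1.0, half_life_days=1, age_days=7),
    DataItem("sales records", 0.8, half_life_days=365, age_days=730),
]
for item in items:
    print(item.name, "->", choose_tier(item))
```

Different half-lives model the statement that data ages at different rates: week-old sensor readings fall to the archive tier, while two-year-old sales records still sit on disk.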
| 30
ScaDS Dresden/Leipzig
| 31
German Centers for Big Data
| 32
German Centers for Big Data
Two Centers of Excellence for Big Data in Germany
– ScaDS Dresden/Leipzig
– Berlin Big Data Center (BBDC)
ScaDS Dresden/Leipzig (Competence Center for Scalable Data Services and Solutions Dresden/Leipzig)
– scientific coordinators: Nagel (TUD), Rahm (UL)
– start: Oct. 2014
– duration: 4 years (option for 3 more years)
– initial funding: approx. 5.6 million euros
Overall Mission
Bundling and advancement of existing expertise on Big Data
Development of Big Data Services and Solutions
Driving Big Data Innovations
Leipzig
Dresden
| 33
Associated Partners
| 34
ScaDS Structure
Big Data Life Cycle Management and Workflows
Efficient Big Data Architectures
Data Quality / Data Integration
Visual Analytics
Knowledge Extraction
Life sciences
Material and Engineering sciences
Digital Humanities
Environmental / Geo sciences
Business Data
Service center
| 35
Research Partners
Data-intensive computing: W.E. Nagel
Data quality / Data integration: E. Rahm
Databases: W. Lehner, E. Rahm
Knowledge extraction / Data mining: C. Rother, P. Stadler, G. Heyer
Visualization: S. Gumhold, G. Scheuermann
Service Engineering, Infrastructure: K.-P. Fähnrich, W.E. Nagel, M. Bogdan
| 36
TU Dresden – Database Systems Group
| 37
Group Awards and Recognitions
Premium Research Relationship with SAP HANA
NVIDIA Graduate Fellowship Program 2013/14
Amazon Research Grant 2013
IBM Smarter Planet Innovation Award 2012: Data Management in Smart Grids
Apps4Deutschland Competition 2012, 2nd and 3rd
NVIDIA Professor Partnership Award 2011
ACM SIGMOD Programming Contest, 1st (2011) and 2nd (2009)
IEEE International Services Computing Contest, 1st (2008) and 2nd (2007)
Accenture Campus Challenge, 1st in 2009
IBM Faculty Award 2006
AMD prize for best diploma thesis 2009, 2010 and 2011
Saxonia Systems Special Women Award 2012
IBM prize for best diploma thesis 2004
Lohrmann Medal 2010
| 38
Interdisciplinary Projects
Among Top 3 Universities in Engineering in Germany
– “University of Excellence” status since 2012
– 37,000+ students / 7,500+ employees
Cluster of Excellence
– 5+ years concept
– New materials beyond CMOS
– Profound expertise in electronics
– 60+ investigators and teams
Collaborative Research Center
– Highly adaptive energy-efficient computing
– 12-year project
– 16 investigators and teams
Synergies Initiative
– Regional cooperation between academia, industry, education, culture, and administration
– 22 partners
5G Lab Germany
– Founded mid 2014
– Edge Cloud applications
– SW + HW developments
Big Data National Competence Center
– Service-oriented character
| 39
What’s in it for you?
Service Center
Customers
Research Topics
• Big Data architectures
• data quality and integration
• knowledge extraction
• visual analytics
• data life cycle management and workflows
Application Areas
• Life sciences
• Material sciences
• Digital Humanities
• Business Data
• … (Manufacturing ?)
Governance
Education