The Evolution of Data Architecture
Post on 21-Jan-2018
Data Value Chain
AI
Machine Learning
Data Science
Analytics
Big Data
Decision making
Insight
Automated decision making
Hype (?)
Data is the new Oil
https://www.economist.com/news/leaders/21721656-data-economy-demands-new-approach-antitrust-rules-worlds-most-valuable-resource
Once upon a time, processors doubled in speed every 18 months …
That “Moore’s Law” pace
stopped 10 years ago.
CPU, RAM, and disk have
barely improved in
speed since.
Processor speed has been stagnant
But data is being generated
at an ever-increasing rate.
Hardware improvement
cannot keep up with data
generation.
Multi-threaded and
distributed systems are a
must.
Distributed Systems are hard
Programmability
Scalability
Consistency
Availability
Partition Tolerance
Fault Tolerance
Modern Data Architecture
How do you:
transmit
collect
store
compute
Petabyte+ storage on
1000+ compute nodes?
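A rough back-of-the-envelope sketch of why a single machine is not enough at this scale. The per-node throughput of 100 MB/s is an assumption made for illustration, not a figure from the talk, and the model assumes the work splits perfectly across nodes.

```python
# Time to read 1 PB once: one node versus 1000 nodes working in parallel.
PETABYTE = 10**15               # bytes (decimal petabyte)
NODES = 1000                    # "1000+ compute nodes" from the slide
DISK_THROUGHPUT = 100 * 10**6   # bytes/second per node; ASSUMED for illustration


def scan_seconds(total_bytes: int, nodes: int, per_node_bps: int) -> float:
    """Seconds to read total_bytes when split evenly across nodes."""
    return total_bytes / (nodes * per_node_bps)


single = scan_seconds(PETABYTE, 1, DISK_THROUGHPUT)
cluster = scan_seconds(PETABYTE, NODES, DISK_THROUGHPUT)

print(f"1 node:     {single / 86400:.1f} days")    # 115.7 days
print(f"{NODES} nodes: {cluster / 3600:.1f} hours")  # 2.8 hours
```

Under these assumptions a full scan drops from months to hours, which is the motivation for spreading both storage and compute across many nodes.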
Modern Data Center
Diagram: a hierarchical data center network. Tiers shown: Internet, AR, Core, Aggr, and ToR switches, with each ToR connecting Server1 through Server10. Links are labeled 10Gbps, 10Gbps, and 1Gbps.
GFS
Master – slave architecture
Separation of control plane and
data plane
Low cost, commodity hardware
Failures are the norm, rather
than the exception
Balance availability and network
partition tolerance
Diagram: a GFS client looks up a path such as /foo/bar at the GFS master (control messages) and exchanges file data directly with the GFS chunkservers (data messages).
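The separation of control plane and data plane can be illustrated with a toy, in-memory sketch. This is not GFS code: the class names, the round-robin placement, the tiny chunk size, and the `down` parameter are all hypothetical, chosen only to show that the master holds metadata while the bytes live on chunkservers.

```python
from dataclasses import dataclass, field

CHUNK_SIZE = 4  # bytes; tiny for illustration (GFS used 64 MB chunks)


@dataclass
class ChunkServer:
    """Data plane: stores chunk bytes keyed by chunk handle."""
    chunks: dict = field(default_factory=dict)


@dataclass
class Master:
    """Control plane: holds metadata only, never file contents."""
    # path -> ordered list of (chunk_handle, [names of servers holding a replica])
    files: dict = field(default_factory=dict)

    def locate(self, path: str, chunk_index: int):
        return self.files[path][chunk_index]


def write_file(master, servers, path, data: bytes, replicas: int = 2) -> None:
    """Split data into chunks, replicate each one, record locations on the master."""
    names = sorted(servers)
    layout = []
    for start in range(0, len(data), CHUNK_SIZE):
        index = start // CHUNK_SIZE
        handle = f"{path}#{index}"
        # Round-robin placement (an assumption; real placement is more involved).
        holders = [names[(index + r) % len(names)] for r in range(replicas)]
        for name in holders:
            servers[name].chunks[handle] = data[start:start + CHUNK_SIZE]
        layout.append((handle, holders))
    master.files[path] = layout


def read_chunk(master, servers, path, chunk_index, down=frozenset()) -> bytes:
    """Control message to the master, then a data message to a live replica."""
    handle, holders = master.locate(path, chunk_index)
    for name in holders:
        if name not in down:  # skip servers we pretend have failed
            return servers[name].chunks[handle]
    raise IOError(f"no live replica for {handle}")


servers = {name: ChunkServer() for name in ("cs1", "cs2", "cs3")}
master = Master()
write_file(master, servers, "/foo/bar", b"hello world!")

print(read_chunk(master, servers, "/foo/bar", 1))                # b'o wo'
print(read_chunk(master, servers, "/foo/bar", 1, down={"cs2"}))  # b'o wo'
```

The second read succeeds even with one chunkserver marked as failed, because the master knows about a second replica. The real system also handles leases, re-replication, and master recovery, none of which is modeled here.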
MapReduce
A very simple yet powerful
distributed programming model
Shared-nothing architecture
Programmability
Data-locality:
ship compute to data, rather
than shipping data to compute
Fault tolerance:
Intermediate state is stored in
storage.
Failed tasks can be restarted
easily.
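Diagram: input files are divided into splits (Split 0, Split 1, Split 2). The master assigns map and reduce tasks to workers. Map-phase workers read the splits and write intermediate files; reduce-phase workers read the intermediate files and write the output files (Output 0, Output 1).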
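The programming model on this slide can be sketched in a few lines of single-process Python. This is a minimal illustration of the map, shuffle, and reduce phases using word count; the function names are hypothetical, and nothing here is distributed or taken from the MapReduce or Hadoop APIs.

```python
from collections import defaultdict


def map_reduce(splits, mapper, reducer):
    """Run mapper over every split, group values by key, then reduce each group."""
    # Map phase: each split is processed independently of the others.
    intermediate = []
    for split in splits:
        intermediate.extend(mapper(split))

    # Shuffle: group all intermediate values by key.
    groups = defaultdict(list)
    for key, value in intermediate:
        groups[key].append(value)

    # Reduce phase: one reducer call per distinct key.
    return {key: reducer(key, values) for key, values in groups.items()}


def word_count_mapper(split: str):
    """Emit (word, 1) for every word in the split."""
    for word in split.split():
        yield word.lower(), 1


def word_count_reducer(key: str, values: list) -> int:
    """Sum the counts emitted for one word."""
    return sum(values)


splits = ["the quick brown fox", "the lazy dog", "The fox"]
counts = map_reduce(splits, word_count_mapper, word_count_reducer)
print(counts["the"], counts["fox"], counts["dog"])  # 3 2 1
```

Because each map call depends only on its own split, a failed call can simply be rerun on the same input, which is the property that makes restarting tasks cheap.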
Hadoop
GFS, MapReduce inspired Hadoop
Initially developed by Yahoo!
Released in 2006.
Used by most large enterprises
Hadoop 3.0 beta 1!
Components added to the stack, by year (each year's stack includes everything added before it):
2006: Core Hadoop (HDFS, MapReduce)
2007: Solr, Pig
2008: HBase, ZooKeeper
2009: Hive, Mahout
2010: Sqoop, Avro
2011: Flume, Bigtop, Oozie, HCatalog, Hue, YARN
2012: Spark, Tez, Impala, Kafka, Drill
2013: Parquet, Sentry
2014: Knox, Flink
2015: Kudu, RecordService, Ibis, Falcon
Evolution of the Hadoop Platform
The stack is continually evolving and growing!
Mix and match
Resource Management
YARN Mesos Kubernetes
Storage
HDFS HBase Kudu S3 ADLS
Compute
MapReduce Hive Impala Spark Presto
Pig Drill Solr Storm
Ingest
Kafka
Flume
Beam
Samza
Why open source?
It’s free ($$$)
No vendor lock-in.
Faster development and faster adoption.
A new approach to foster collaboration.
Open source software is becoming the standard.
Sell open source software, really?
Water is free, but bottled water is not.
Cloudera sells the “bottle”
Cloudera’s Distribution of Hadoop.
The integration of software.
The support and services.
The management software is proprietary. The OSS is free of charge.
Market for open source software?
Chart: revenue (million USD) for Hortonworks, Cloudera, and MongoDB, FY2015 through FY2018 (f); the vertical axis runs from 0 to 400.
Open Source Business Model
• Dual licensing: MySQL
• Support + services: Red Hat, Hortonworks
• Open core: Java EE, Qt
• Software as a Service: Databricks, Amazon AWS, Microsoft Azure
• Advertising-supported: Google Chrome, Android
• Hybrid Open Source Software: Cloudera, Confluent, MongoDB
“Big Data” finds many applications
across many industries
IT Healthcare Transportation Retail
Utilities Telecom Public sector Manufacturing
Applications and Use cases
Realtime database for serving internet traffic
Internet services: Facebook Messenger, Twitter, Uber, Airbnb …
Data analytics
Assist in the development of new drugs by analyzing millions
of medical records
Data science / Machine learning
Fraud detection
Anti-money laundering
Cybersecurity
The Cloudera Platform for IoT – Data Mgmt. Value Chain
Stages: Data Sources; Data Ingest; Data Storage & Processing; Serving, Analytics & Machine Learning
Data sources: Connected Things / Data Sources; Other Data Sources
ENTERPRISE DATA HUB
Apache Kafka: Stream or batch ingestion of IoT data
Apache Sqoop: Ingestion of data from relational sources
Apache Hadoop: Storage (HDFS) & deep batch processing
Apache Spark: Stream & iterative processing, ML
Apache Kudu: Storage & serving for fast changing data
Apache HBase: NoSQL data store for real time applications
Apache Impala: MPP SQL for fast analytics
Cloudera Search: Real time search
Security, Scalability & Easy Management
Deployment Flexibility: Datacenter, Cloud
Predictive Maintenance on Thousands of Industrial Machines in Real Time
Challenge:
• Collect and analyze data from thousands of diverse manufacturing systems in real-time
Solution:
• iTrak application using Cloudera in the Cloud to monitor the performance of individual manufacturing systems in real-time
• Predictive Maintenance - Proactively identifying & fixing issues before machines break
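The talk does not say how the monitoring application flags problems. As one possible illustration of the idea, here is a minimal sketch that flags a sensor reading when it deviates strongly from a rolling window of recent readings. The function name, the window size, the threshold, and the sample vibration values are all assumptions.

```python
from collections import deque
from statistics import mean, stdev


def detect_anomalies(readings, window=5, threshold=3.0):
    """Yield (index, value) for readings far from the mean of the previous `window` readings."""
    recent = deque(maxlen=window)
    for i, value in enumerate(readings):
        if len(recent) == window:
            mu, sigma = mean(recent), stdev(recent)
            # Flag values more than `threshold` standard deviations from the rolling mean.
            if sigma > 0 and abs(value - mu) > threshold * sigma:
                yield i, value
        recent.append(value)


# Hypothetical vibration readings from one machine, with a single spike.
vibration = [1.0, 1.1, 0.9, 1.0, 1.05, 1.0, 0.95, 4.2, 1.0, 1.1]
alerts = list(detect_anomalies(vibration))
print(alerts)  # [(7, 4.2)]
```

In a streaming deployment the same logic would run per machine over an unbounded stream of readings rather than over a list.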
MANUFACTURING
» INDUSTRIAL IoT
» PREDICTIVE MAINTENANCE
» IMPROVED EFFICIENCIES
Industrial IoT – Predictive Maintenance
DATA-DRIVEN PROCESS
CASE STUDY
DATA-DRIVEN PRODUCTS
Using Predictive Maintenance to Improve Performance and Reduce Fleet Downtime
Challenge:
• Monitor the health of 180,000+ trucks in real-time in order to minimize downtime
Solution:
• OnCommand Connection collecting telematics and geolocation data across thousands of trucks
• Identify and correct engine problems early, and increase fleet uptime
• Reduced maintenance costs to $.03 per mile from $.12-$.15 per mile
Connected Vehicles & Telematics
DATA-DRIVEN PROCESS
CASE STUDY
DATA-DRIVEN PRODUCTS
TRANSPORTATION
» PREDICTIVE MAINTENANCE
» TELEMETRY
» LOWER TCO
Enabling the State of Kentucky to manage snow and ice events in real time
Challenge:
• Kentucky Transportation Cabinet (KYTC) oversees the state’s transportation system, which includes 27,000 miles of highways, 230 airports and heliports, and more than three million drivers.
• Needed a more efficient approach to inclement weather road management
Solution:
• KYTC has built a real-time weather response system that incorporates real-time data from Waze, HERE, ESRI’s GeoEvent processor, and Automatic Vehicle Locations (providing sensor data from salt trucks).
• KYTC aggregates 15-20 million records every day and processes more than a million records per second.
Data Driven Dept. of Transportation
Source: http://www.routefifty.com/2016/09/data-drives-government/131821/
2016 Data Impact Award Winner
State of Kentucky Department of
Transportation
Improve Parkinson's Disease Monitoring and Treatment through IoT
Challenge:
• Collect and analyze data from wearables (more than 300 readings per second) from thousands of patients in real-time
Solution:
• Cloudera on Intel architecture to detect patterns in patient data streaming from wearables
• Continuously monitor the patients and symptoms to understand the progression of the disease objectively
HEALTHCARE
» WEARABLES
» PREDICTIVE ANALYTICS
» IMPROVED CARE
Connected Healthcare
DATA-DRIVEN PROCESS
CASE STUDY
DATA-DRIVEN PRODUCTS
Building a Holistic Picture of the US Securities Market From 50 Billion Daily Events
• Saving $10-20M in operational efficiencies annually
• 90-minute queries run in 10 seconds
• Supporting future market growth and a dynamic regulatory environment.
CUSTOMER 360
Using Big Data to Help Consumers Save Hundreds of Millions in Utility Bills
• Relevant insight into household energy use improves energy consciousness
• 2.7+ TWh (terawatt hours) saved to date
• Motivated consumers to save enough energy to power every household in Salt Lake City and St. Louis for a year
CUSTOMER 360
ENERGY & UTILITIES » PRODUCT INNOVATION » SERVICE IMPROVEMENT » IOT
Saving Lives by Detecting Sepsis Early Enough for Successful Treatment
• Builds a more complete picture of patients, conditions, and trends
• Has saved hundreds of lives already
• Reduces hospital readmissions
• 2PB+ in multi-tenant environment supporting 100s of clients
• Secure yet explorable
HEALTHCARE » 360° CUSTOMER VIEW » PREDICTIVE ANALYTICS » IMPROVED SERVICE
Improving Pediatric Care and Outcomes
• Quantifying effect of ambient noise on children’s vital signs
• Identifying cancerous genome variants in 20 minutes (vs. days before)
• Performing fewer CT scans and higher quality surgeries
CUSTOMER 360
HEALTHCARE » MACHINE LEARNING » IOT » 360° CUSTOMER VIEW
Government Revenue Service
Increasing Customer Convenience
• Provides view of the complete taxpayer journey
• Creates ability to pre-populate tax returns for increased ease of use
• Supports move to near-real-time oversight of operations and faster response
CUSTOMER 360
GOVERNMENT » SERVICE IMPROVEMENT » PROCESS IMPROVEMENT » 360° CUSTOMER VIEW
Driving Growth and Innovation
• Combines 80+ years’ data spanning all business units and 50 states
• Expedites holistic analysis and reports by 500X
• Enables more accurate and detailed predictive models to customize offers, optimize pricing, and minimize risk
CUSTOMER 360
INSURANCE » 360° CUSTOMER VIEW » FRAUD DETECTION » PREDICTIVE ANALYTICS
Re-Platformed 1,600 Operational Databases & Systems onto a Cloudera EDH
• Business & consumer data was spread over a dozen different customer databases
• One daily ETL job (processing 1 billion customer records) used to take 24 hours
• Increased data velocity by 15x (5 times the data in 1/3 of the time)
• Now completes in 1 ½ hours
• BT now has access to the most up-to-date and centralized data for all their customers
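The 15x figure follows from treating data velocity as data volume divided by elapsed time, so the gain is the ratio of the two changes. A small check of that arithmetic (the function name is hypothetical):

```python
from fractions import Fraction


def velocity_gain(data_ratio, time_ratio) -> Fraction:
    """Throughput gain when data volume scales by data_ratio and elapsed time by time_ratio."""
    return Fraction(data_ratio) / Fraction(time_ratio)


# 5 times the data in 1/3 of the time.
gain = velocity_gain(5, Fraction(1, 3))
print(gain)  # 15
```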
CUSTOMER 360
TELECOMMUNICATIONS
» IMPROVED SERVICE
» PROCESS IMPROVEMENT
» IT COST REDUCTION
Future
Hardware evolution:
Cloud
40Gbps, 100Gbps networks
GPU, TPU
Flash disk
Application-driven:
Machine learning, deep learning
Realtime data stream processing (IoT)
Takeaway
If you only remember 3 things from this talk:
1. Data is the new oil
2. Open source is the standard
3. Think big! Remember GFS:
failures are the norm rather
than the exception!
Thank you
jojochuang@gmail.com / weichiu@apache.org / weichiu@cloudera.com