Cloudera presentation at the Chief Analytics Officer, Fall 2016
Post on 13-Jan-2017
1 © Cloudera, Inc. All rights reserved.
Data Engineering and Data Science: Modern Analytics and Data Processing for the Enterprise
Today, Data is Everything!
Instrumentation
Consumerization
Experimentation
Today, everything that can be measured will be measured.
Today, data IS the application.
Today, becoming data-driven is a business imperative.
“It will soon be technically feasible & affordable to record & store everything…”
— New York Times
“Digital technologies will, in the near future, accomplish many tasks once considered uniquely human.”
— Second Machine Age
Data is abundant, diverse & shared freely
As is how we store, process and analyze it
Streaming Machine Learning BI
ETL Modeling
The new analytics paradigm
Determine what happened
Understand why it happened
Change what happens next
Make it happen consistently
Modern Data Engineering and Data Science require a new approach in order to handle more data, faster, with better access and a simplified architecture.
Apache Hadoop
Apache Hadoop is a platform for data storage and processing that is…
• Distributed • Scalable • Fault tolerant • Open source
(Original) Core Hadoop Components
Hadoop Distributed File System (HDFS): File Sharing & Data Protection Across Physical Servers
YARN/MapReduce v2: Distributed Computing Across Physical Servers
Flexibility
• A single repository for storing, processing & analyzing any type of data
• Not bound by a single schema
• On premises and in the cloud
Scalability + Complex Analysis
• Scale-out architecture divides workloads across multiple nodes
• Flexible file system eliminates ETL bottlenecks
• Real-time analytics
Low Cost
• Can be deployed on industry-standard hardware
• Open source platform guards against vendor lock-in
• 1-2 orders of magnitude less expensive than traditional systems
End to End Lifecycle of Data Science
Data Engineering Data Science Production (Data Engineering / App Development)
Data Wrangling
Visualization and Analysis
Model Training & Testing
Production Model
Preparation Batch Scoring
Online Scoring
Serving
Dev Tools: IDEs/Notebooks, Collaboration Ops Tools: Versioning, Scheduling, Workflow, Publishing
Data Governance Governance
Processing
Acquisition
Model Quality & Performance
Experiments
Our Goal: Bring More Data Science Users to Hadoop
Help more data scientists use the power of Hadoop
Use a powerful, familiar environment with direct access to Hadoop data and compute
Data Scientist
Data Engineer
Make it easy and secure to add new users, use cases
Offer secure self-service analytics and a faster path to production on common, affordable infrastructure
Enterprise Architect
Hadoop Admin
Who is Data Engineering for?
Data Engineer/ETL Engineer
• Needs projects to scale
• Cares about performance
• Cares about SLAs
• Needs multitenancy, security, and optimized architecture
Data Scientist/Data Analyst
• Needs better scale
• Cares about access to data
• Wants better collaboration without managing dependencies
Analytics Leader
• Cares that his team is productive
• Cares about enforcing standards
• Wants results he can share with the business
Requirements of a Data Science Platform
• Leverage Big Data – Volume, Variety, Velocity – to tackle various use cases
• Enable real-time use cases
• Provide sufficient toolset for the Data Analysts
• Provide sufficient toolset for the Data Scientists + Data Engineers
• Provide standard data governance capabilities
• Provide standard security across the stack
• Provide flexible deployment options
• Integrate with partner tools
• Provide management tools that make it easy for IT to deploy/maintain
Cloudera Enterprise, A New Way Forward
Data Engineering and Data Science Workloads
Data Ingestion (Kafka, Navigator, Search)
Cloudera enables users to build real-time, end-to-end data pipelines in order to power their business. Leadership in Apache Spark and Kafka has made Cloudera a trusted resource for users who want to capture real-time, streaming, and time series data without gaps in security.
Data Processing (Spark, Hive)
Cloudera is helping users accelerate their data pipelines with leadership in technologies like Apache Spark. Data processing in Cloudera Enterprise can help take processing windows from hours to minutes and enables faster access to data for a variety of users and skillsets.
Data Science (Spark MLlib)
Cloudera is bringing the most popular data science languages/libraries to our platform for easier collaboration, self-service exploration, and implementation at scale. Cloudera is advancing the state of distributed machine learning at scale. Cloudera enables exploratory data science and the ability to deliver robust data products.
Data Ingestion for Hadoop: Ingest Any Data Type at Any Rate
INTEGRATE
• STRUCTURED: Sqoop
• UNSTRUCTURED: Kafka, Flume
PROCESS, ANALYZE, SERVE
• BATCH: Spark, Hive, Pig, MapReduce
• STREAM: Spark
• SQL: Impala
• SEARCH: Solr
• SDK: Kite
UNIFIED SERVICES
• RESOURCE MANAGEMENT: YARN
• SECURITY: Sentry, RecordService
STORE
• FILESYSTEM: HDFS
• RELATIONAL: Kudu
• NoSQL: HBase
Apache Sqoop: SQL to Hadoop
• Efficiently bulk load data (bidirectional)
• Easily get started with custom connectors freely available (RDBMS/EDW/NoSQL)
Apache Flume: Log Aggregation for Hadoop
• Efficiently move large amounts of streaming/log data
• Reliable, scalable, manageable, and extensible for production
• Connector ecosystem for common streaming data sources
• Easily gather logs from multiple systems
Apache Kafka: Pub-Sub Messaging for Hadoop
• Move data from many “producers” to many “consumers”
• Most flexible to support a wide range of use cases
• Integrates with Flume, HBase, Spark, etc.
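The "many producers to many consumers" idea behind pub-sub messaging can be illustrated with a toy, in-memory sketch. This is not Kafka's API: the class ToyTopic, its methods, and the producer and consumer names are all hypothetical, chosen only to show an append-only log with an independent read position per consumer.

```python
from collections import defaultdict


class ToyTopic:
    """Hypothetical in-memory topic: an append-only log shared by all readers."""

    def __init__(self):
        self.log = []                    # messages in arrival order
        self.offsets = defaultdict(int)  # next unread position, per consumer

    def publish(self, producer, message):
        # Producers only append; they know nothing about who will read.
        self.log.append((producer, message))

    def poll(self, consumer):
        # Each consumer advances its own offset, so readers do not affect
        # one another and a late consumer can still read from the start.
        start = self.offsets[consumer]
        self.offsets[consumer] = len(self.log)
        return self.log[start:]


topic = ToyTopic()
topic.publish("web-server", "GET /index")
topic.publish("mobile-app", "login")

first_batch_analytics = topic.poll("analytics")   # both messages so far
topic.publish("web-server", "GET /cart")
second_batch_analytics = topic.poll("analytics")  # only the new message
all_for_archiver = topic.poll("archiver")         # late reader gets all three
```

Because producers and consumers only share the log, either side can be added or removed without changing the other, which is the decoupling the slide points at.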
Powerful Data Processing: The Most Apache Spark Experience
Spark: Data processing and data science for developers and data scientists
• Easy development
• Flexible, extensible API
• Fast batch and stream processing
Cloudera: Most experience with Spark on Hadoop for instant success
• First to ship and support
• Most Spark users trained
• Most customers running Spark
• Most engineering resources (committers, contributors, support)
• Only vendor focused on enterprise Spark
Data Science: A Unified Platform to Accelerate Data Science from Exploration to Production.
Data Scientists need to use data to…
▪ Explore
▪ Model
▪ Test
The field of data science blends math and statistics knowledge with advanced computer knowledge.
▪ “Data Scientist: Person who is better at statistics than any software engineer and better at software engineering than any statistician” (Josh Wills)
Spark MLlib: Collection of mainstream machine learning algorithms built on Spark
Including:
• Classifiers: logistic regression, boosted trees, random forests, etc.
• Clustering: k-means, Latent Dirichlet Allocation (LDA)
• Recommender Systems: Alternating Least Squares
• Dimensionality Reduction: Principal Component Analysis (PCA) and Singular Value Decomposition (SVD)
• Feature Engineering & Selection: TF-IDF, Word2Vec, Normalizer, etc.
• Statistical Functions: Chi-Squared Test, Pearson Correlation, etc.
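To make one of the listed feature-engineering techniques concrete, here is a minimal TF-IDF sketch in plain Python. It does not call Spark MLlib; the sample documents and function names are made up for illustration. The smoothed IDF form log((N + 1) / (df + 1)) is an assumption about how MLlib is documented to compute IDF and should be checked against the Spark documentation before relying on it.

```python
import math
from collections import Counter

# Hypothetical toy corpus: each document is a list of tokens.
docs = [
    "spark runs on hadoop".split(),
    "spark mllib runs on spark".split(),
    "hadoop stores data".split(),
]


def document_frequency(corpus):
    """Count, for each term, how many documents contain it at least once."""
    df = Counter()
    for doc in corpus:
        df.update(set(doc))
    return df


def tf_idf(corpus):
    """Return one {term: weight} dict per document.

    TF is the raw count of the term in the document; IDF uses the
    smoothed form log((N + 1) / (df + 1)), assumed here for illustration.
    """
    n = len(corpus)
    df = document_frequency(corpus)
    idf = {term: math.log((n + 1) / (count + 1)) for term, count in df.items()}
    return [
        {term: tf * idf[term] for term, tf in Counter(doc).items()}
        for doc in corpus
    ]


weights = tf_idf(docs)
```

Terms that appear in fewer documents, such as "mllib" here, receive a larger IDF than terms spread across the corpus, such as "spark".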
Logistic Regression Performance (Data Fits in Memory)
[Chart: Running Time (s), 0 to 4000, vs. # of Iterations (1, 5, 10, 20, 30), MapReduce vs. Spark]
MapReduce: 110 s/iteration
Spark: First iteration = 80 s; further iterations 1 s due to caching
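The per-iteration figures on this slide (110 s per iteration for MapReduce; 80 s for the first Spark iteration, then 1 s for each further iteration) imply the totals computed below. This is only arithmetic on the quoted numbers, assuming they stay constant across iterations; the function names are illustrative.

```python
# Figures quoted on the slide.
MAPREDUCE_SECONDS_PER_ITERATION = 110
SPARK_FIRST_ITERATION_SECONDS = 80
SPARK_CACHED_ITERATION_SECONDS = 1


def mapreduce_total(iterations):
    """Every iteration pays the full cost again."""
    return MAPREDUCE_SECONDS_PER_ITERATION * iterations


def spark_total(iterations):
    """The first iteration loads the data; later ones reuse the cached copy."""
    if iterations == 0:
        return 0
    return (SPARK_FIRST_ITERATION_SECONDS
            + SPARK_CACHED_ITERATION_SECONDS * (iterations - 1))


# Iteration counts taken from the chart's x-axis.
for n in (1, 5, 10, 20, 30):
    print(f"{n:>2} iterations: MapReduce {mapreduce_total(n):>4} s, "
          f"Spark {spark_total(n):>3} s")
```

Under these assumptions the gap widens with every iteration, since only the first Spark iteration pays the data-loading cost.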
Cloudera Data Science Workbench: A unified platform to accelerate data science from exploration to production.
1. Team Productivity: Cloudera Workbench
2. Automation: Cloudera Pipelines
3. Data Products: Cloudera Models
Hadoop as a Data Science Platform
• Leverage Big Data: Hadoop
• Enable real-time use cases: Kafka, Spark Streaming, Kudu
• Provide sufficient toolset for the Data Analysts: Spark, Hive, Impala, Hue
• Provide sufficient toolset for the Data Scientists + Data Engineers: Cloudera Data Science Workbench
• Provide standard data governance capabilities: Navigator + Partners
• Provide standard security across the stack: Kerberos, Sentry, RecordService, KMS/KTS
• Provide flexible deployment options: Cloudera Director
• Integrate with partner tools: Rich Ecosystem
• Provide management tools that make it easy for IT to deploy/maintain: Cloudera Manager/Director
Three Core Enterprise Applications
OPERATIONS
DATA MANAGEMENT
UNIFIED SERVICES
PROCESS, ANALYZE, SERVE
STORE
INTEGRATE
Data Engineering & Science: Process data, develop & serve predictive models
Analytic Database: ELT, reporting, exploratory business intelligence
Operational Database: Build data-driven applications to deliver real-time insights
DATA-DRIVEN PRODUCTS
Delivering Improved Cash Flow to Healthcare Providers
• Streamlined transfer of messages between payers and providers
• Reduced cost per terabyte of storage by 90%
• Delivered data encryption and security protection for HIPAA compliance
HEALTHCARE » PRODUCT IMPROVEMENT » PREDICTIVE ANALYTICS » IT COST REDUCTION
• End-to-end view of data is helping save lives by detecting sepsis early enough for successful treatment
• Has saved 100s of lives already & reduced hospital readmissions
• Centralized data from many systems available in a secure environment
• 2PB+ in multi-tenant environment supporting 100s of clients
Improve Products & Services Efficiency
Thank you
jordan.volz@cloudera.com
linkedin.com/in/jordanvolz