C-BAG Big Data Meetup, Chennai, Oct. 29, 2014: Hortonworks and Concurrent on Cascading
Post on 28-May-2015
Page 1
Big Data Meetup
October 29, 2014
Chennai
C-BAG
Page 2
C-BAG: Chennai Big Data Analytic Group
C-BAG is an open group formed in the interest of creating a good Big Data environment.
C-BAG conducts free weekly and monthly sessions, online and offline, to create awareness of Big Data technologies and to support Big Data initiatives. C-BAG's aim is to be a one-stop place for all Big Data queries, discussions and support.
Contact Us : chennaibigdataanalyticgroup@gmail.com
Page 3
Speakers
Dhruv Kumar, Solutions Architect, Concurrent Inc.
Dhruv Kumar has over six years of diverse software development experience in Big Data, Web and High Performance Computing applications. Prior to joining Concurrent, he worked at Terracotta as a Software Engineer. He has an MS degree in Computer Engineering from the University of Massachusetts-Amherst.
Vinay Shukla, Director of Product Management, Hortonworks
Vinay Shukla is a seasoned enterprise software professional with extensive experience in product management, product development and project management. Prior to Hortonworks, Vinay worked as a security architect, product manager, developer and project manager. Vinay admits to being a caffeine addict and spends his free time on a yoga mat and on hikes.
Page 4
• Founded in 2011
• Original 24 architects, developers, operators of Hadoop from Yahoo!
• Leaders in Hadoop community
• 500+ employees
Customer Momentum
• 300+ customers in seven quarters, growing at 75+/quarter
• Two thirds of customers come from F1000
Partner Momentum
• Over 1000 Partners, Hundreds of Certified Solutions
• Some key partners include:
Hortonworks and Hadoop at Scale
• HDP in production on largest clusters on planet
• Most +1000 node clusters
Hortonworks enables adoption of Apache Hadoop with Hortonworks Data Platform
Page 5
The Forrester Wave™ Big Data Hadoop Solutions Q1 2014
“Hortonworks loves and lives open source innovation”
World Class Support and Services: Hortonworks' Customer Support received a maximum score and was significantly higher than both Cloudera and MapR
A Leader in Hadoop
Page 6
HDP IS Apache Hadoop
There is ONE Enterprise Hadoop: everything else is a vendor derivation
Hortonworks Data Platform 2.2
[Figure: component version matrix across HDP 2.0 (October 2013), HDP 2.1 (April 2014) and HDP 2.2 (October 2014). Components: Hadoop & YARN, Pig, Hive & HCatalog, HBase, Sqoop, Oozie, Zookeeper, Ambari, Storm, Flume, Knox, Phoenix, Accumulo, Falcon, Ranger, Spark, Kafka, Tez, Slider, Solr. Groupings: Data Management, Data Access, Governance & Integration, Security, Operations.]
* version numbers are targets and subject to change at time of general availability in accordance with ASF release process
Page 7
The Modern Data Architecture w/ HDP
Page 8
Enterprise Goals for the Modern Data Architecture
• Consolidate siloed data sets, structured and unstructured
• Central data set on a single cluster
• Multiple workloads across batch, interactive and real time
• Central services for security, governance and operation
• Preserve existing investment in current tools and platforms
• Single view of the customer, product, supply chain
[Diagram: Modern Data Architecture. APPLICATIONS: Business Analytics, Custom Applications, Packaged Applications. DATA SYSTEM: RDBMS, EDW, MPP, alongside YARN: Data Operating System (nodes 1 to N; Interactive, Real-Time, Batch) on HDFS (Hadoop Distributed File System). SOURCES: Existing Systems (CRM, ERP, Other), Clickstream, Web & Social, Geolocation, Sensor & Machine, Server Logs, Unstructured.]
Page 9
1. Unlock New Applications from New Types of Data
Data types: Sentiment & Web, Clickstream & Behavior, Machine & Sensor, Geographic, Server Logs, Structured & Unstructured
Financial Services
• New Account Risk Screens ✔ ✔
• Trading Risk ✔
• Insurance Underwriting ✔ ✔ ✔
Telecom
• Call Detail Records (CDR) ✔ ✔
• Infrastructure Investment ✔ ✔
• Real-time Bandwidth Allocation ✔ ✔ ✔
Retail
• 360° View of the Customer ✔ ✔ ✔
• Localized, Personalized Promotions ✔
• Website Optimization ✔
Manufacturing
• Supply Chain and Logistics ✔
• Assembly Line Quality Assurance ✔
• Crowd-sourced Quality Assurance ✔
Healthcare
• Use Genomic Data in Medical Trials ✔ ✔ ✔
• Monitor Patient Vitals in Real-Time
Pharmaceuticals
• Recruit and Retain Patients for Drug Trials ✔ ✔
• Improve Prescription Adherence ✔ ✔ ✔ ✔
Oil & Gas
• Unify Exploration & Production Data ✔ ✔ ✔ ✔
• Monitor Rig Safety in Real-Time ✔ ✔ ✔
Government
• ETL Offload/Federal Budgetary Pressures ✔ ✔
• Sentiment Analysis for Government Programs ✔
Page 10
…to shift from reactive to proactive interactions
HDP and Hadoop allow organizations to shift interactions from…
Reactive: Post Transaction
Proactive: Pre Decision
• From static branding …to Real-time Personalization
• From break then fix …to repair before break
• From mass treatment …to Designer Medicine
• From Educated Investing …to Automated Algorithms
• From mass branding …to 1x1 Targeting
A shift in Advertising
A shift in Financial Services
A shift in Healthcare
A shift in Retail
A shift in Telco
Page 11
2. Or to realize a dramatic cost savings…
EDW Optimization
• Current EDW workload: Operations 50%, Analytics 20%, ETL Process 30%
• Augmented with Hadoop: Operations 50%, Analytics 50%
Current Reality
• EDW at capacity: some usage from low value workloads
• Older data archived, unavailable for ongoing exploration
• Source data often discarded
Augment w/ Hadoop
• Free up EDW resources from low value tasks
• Keep 100% of source data and historical data for ongoing exploration
• Mine data for value after loading it because of schema-on-read
Hadoop: Parse, Cleanse, Apply Structure, Transform
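The schema-on-read point above can be illustrated with a minimal plain-Java sketch (no Hadoop involved). Raw lines are stored untouched, and structure is applied only when a question is asked. The class name, the tab-separated "user, url" layout and the sample lines are hypothetical, chosen purely for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;

// Sketch of schema-on-read: raw text is kept as-is and a schema (here, a
// parsing function) is applied only at read time. All names are hypothetical.
class SchemaOnReadSketch {

    // One possible view over the raw data, decided after loading.
    static final class PageView {
        final String user;
        final String url;

        PageView(String user, String url) {
            this.user = user;
            this.url = url;
        }
    }

    // Apply a schema while reading. Lines that do not fit are skipped here,
    // but they remain in the raw store for other schemas to use later.
    static <T> List<T> read(List<String> rawLines, Function<String, T> schema) {
        List<T> out = new ArrayList<>();
        for (String line : rawLines) {
            T parsed = schema.apply(line);
            if (parsed != null) {
                out.add(parsed);
            }
        }
        return out;
    }

    // Schema for tab-separated "user<TAB>url" lines; returns null on mismatch.
    static PageView parsePageView(String line) {
        String[] parts = line.split("\t");
        return parts.length == 2 ? new PageView(parts[0], parts[1]) : null;
    }

    public static void main(String[] args) {
        List<String> raw = Arrays.asList("alice\t/home", "malformed line", "bob\t/cart");
        List<PageView> views = read(raw, SchemaOnReadSketch::parsePageView);
        System.out.println(views.size() + " page views parsed from " + raw.size() + " raw lines");
    }
}
```

A different question can later be answered by passing another parsing function to `read` over the same raw lines, which is the flexibility the slide attributes to keeping 100% of the source data.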
Page 12
2. Or to realize a dramatic cost savings…
[Chart: Fully-loaded Cost Per Raw TB of Data (Min–Max Cost), on an axis from $0 to $180,000, comparing MPP, SAN, Engineered System, NAS, Hadoop and Cloud Storage]
Commodity Compute & Storage: Hadoop Enables Scalable Compute & Storage at a Compelling Cost Structure
Storage Costs/Compute Costs from $19/GB to $0.23/GB
Page 13
3. Data Lake: An architectural shift
[Diagram: Unlocking the Data Lake. Axes: SCALE, SCOPE. Labels: RDBMS, MPP, EDW, New Analytic Apps or IT Optimization]
Data Lake Enabled by YARN
• Single data repository, shared infrastructure
• Multiple biz apps accessing all the data
• Enable a shift from reactive to proactive interactions
• Gain new insight across the entire enterprise
[Diagram: HDP 2.1 stack. Governance & Integration, Security, Operations, Data Access, Data Management, YARN]
Page 14
Case Study: 12 month Hadoop evolution at TrueCar
12 months execution plan (vertical axis: Data Platform Capabilities)
• June 2013: Begin Hadoop Execution
• July 2013: Hortonworks Partnership
• Aug 2013: Training & Dev Begins
• Nov 2013: Production Cluster, 60 Nodes, 2 PB
• Dec 2013: Three Production Apps (3 total)
• Jan 2014: 40% Dev Staff Perficient
• Feb 2014: Three More Production Apps (6 total)
• May ‘14: IPO
12 Month Results at TrueCar
• Six Production Hadoop Applications
• Sixty nodes/2PB data
• Storage Costs/Compute Costs from $19/GB to $0.23/GB
“We addressed our data platform capabilities strategically as a pre-cursor to IPO.”
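To put the quoted $19/GB and $0.23/GB figures in proportion, the small sketch below works out the reduction factor and the percentage saved. Only the two prices, the 2 PB and the 60 nodes come from the slide; the class and method names are hypothetical, and the per-node figure assumes decimal units (1 PB = 1,000 TB) and an even spread across nodes.

```java
// Back-of-the-envelope arithmetic on the figures quoted on the slide.
// Class and method names are hypothetical.
class StorageCostSketch {

    // How many times cheaper the new cost is than the old one.
    static double reductionFactor(double oldCostPerGb, double newCostPerGb) {
        return oldCostPerGb / newCostPerGb;
    }

    // Share of the original cost that is saved, as a percentage.
    static double percentSaved(double oldCostPerGb, double newCostPerGb) {
        return (1.0 - newCostPerGb / oldCostPerGb) * 100.0;
    }

    public static void main(String[] args) {
        double before = 19.00; // $/GB, from the slide
        double after = 0.23;   // $/GB, from the slide
        System.out.printf("Reduction factor: %.1fx%n", reductionFactor(before, after));
        System.out.printf("Cost saved: %.1f%%%n", percentSaved(before, after));
        // 2 PB across 60 nodes, assuming 1 PB = 1,000 TB and an even spread
        System.out.printf("Raw data per node: %.1f TB%n", 2_000.0 / 60);
    }
}
```

In other words, the quoted change is roughly an 83-fold reduction, or a saving of nearly 99% per GB.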
DRIVING INNOVATION THROUGH DATA
REDUCING DEVELOPMENT TIME FOR PRODUCTION-GRADE HADOOP APPLICATIONS
Dhruv Kumar, Solutions Architect, Concurrent Inc.
GET TO KNOW CONCURRENT
Leader in Application Infrastructure for Big Data
• Building enterprise software to simplify Big Data application development and management
Products and Technology
• CASCADING: Open Source. The most widely used application infrastructure for building Big Data apps, with over 200,000 downloads each month and 8,000 deployments worldwide.
• DRIVEN: Enterprise data application management for Big Data apps
Proven: Simple, Reliable, Robust
• Thousands of enterprises rely on Concurrent to provide their data application infrastructure.
Founded: 2008
HQ: San Francisco, CA
CEO: Gary Nakamura
CTO, Founder: Chris Wensel
www.concurrentinc.com
BIG DATA APPLICATION INFRASTRUCTURE
“It’s all about the apps”
There needs to be a comprehensive solution for building, deploying, running and managing this new class of enterprise applications.
Business Strategy / Data & Technology
Challenges: Skill sets, systems integration, standard op procedure and operational visibility
Connecting Business and Data
DATA APPLICATIONS - ENTERPRISE NEEDS
Enterprise Data Application Infrastructure
• Need reliable, reusable tooling to quickly build and consistently deliver data products
• Need the degrees of freedom to solve problems ranging from simple to complex with existing skill sets
• Need the flexibility to easily adapt an application to meet business needs (latency, scale, SLA), without having to rewrite the application
• Need operational visibility for entire data application lifecycle
WORD COUNT EXAMPLE WITH CASCADING
    import java.util.Properties;

    import cascading.flow.Flow;
    import cascading.flow.FlowDef;
    import cascading.flow.hadoop.HadoopFlowConnector;
    import cascading.operation.aggregator.Count;
    import cascading.operation.regex.RegexSplitGenerator;
    import cascading.pipe.Each;
    import cascading.pipe.Every;
    import cascading.pipe.GroupBy;
    import cascading.pipe.Pipe;
    import cascading.property.AppProps;
    import cascading.scheme.hadoop.TextDelimited;
    import cascading.tap.Tap;
    import cascading.tap.hadoop.Hfs;
    import cascading.tuple.Fields;

    public class Main {
      public static void main( String[] args ) {
        // configuration
        String docPath = args[ 0 ];
        String wcPath = args[ 1 ];
        Properties properties = new Properties();
        AppProps.setApplicationJarClass( properties, Main.class );
        HadoopFlowConnector flowConnector = new HadoopFlowConnector( properties );

        // integration: create source and sink taps
        Tap docTap = new Hfs( new TextDelimited( true, "\t" ), docPath );
        Tap wcTap = new Hfs( new TextDelimited( true, "\t" ), wcPath );

        // processing: specify a regex to split "document" text lines into token stream
        Fields token = new Fields( "token" );
        Fields text = new Fields( "text" );
        RegexSplitGenerator splitter = new RegexSplitGenerator( token, "[ \\[\\]\\(\\),.]" );
        // only returns "token"
        Pipe docPipe = new Each( "token", text, splitter, Fields.RESULTS );
        // determine the word counts
        Pipe wcPipe = new Pipe( "wc", docPipe );
        wcPipe = new GroupBy( wcPipe, token );
        wcPipe = new Every( wcPipe, Fields.ALL, new Count(), Fields.ALL );

        // scheduling: connect the taps, pipes, etc., into a flow definition
        FlowDef flowDef = FlowDef.flowDef().setName( "wc" )
          .addSource( docPipe, docTap )
          .addTailSink( wcPipe, wcTap );
        // create the Flow
        Flow wcFlow = flowConnector.connect( flowDef ); // <<-- Unit of Work
        wcFlow.complete(); // <<-- Runs jobs on Cluster
      }
    }
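To check the tokenizing and counting logic of the flow above without a cluster, the same steps can be mimicked in plain Java. This is only a sketch of what the flow computes, not Cascading code: the class and method names are hypothetical, the sample sentences are made up, and only the delimiter regex is taken from the slide. As a simplification, the sketch drops the empty tokens that appear between adjacent delimiters.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-JDK sketch of the word count logic: split each line with the slide's
// regex, group by token, count. Not Cascading code; names are hypothetical.
class WordCountSketch {

    // Same delimiter class as the RegexSplitGenerator on the slide:
    // space, square brackets, parentheses, comma, period.
    static final String DELIMITERS = "[ \\[\\]\\(\\),.]";

    static Map<String, Long> countTokens(List<String> lines) {
        Map<String, Long> counts = new TreeMap<>(); // sorted keys for stable output
        for (String line : lines) {
            for (String token : line.split(DELIMITERS)) {
                if (token.isEmpty()) {
                    continue; // simplification: skip empty tokens between adjacent delimiters
                }
                counts.merge(token, 1L, Long::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> doc = Arrays.asList(
            "A rain shadow is a dry area (on the lee side).",
            "rain, rain [go away]");
        countTokens(doc).forEach((token, count) -> System.out.println(token + "\t" + count));
    }
}
```

Counting is case-sensitive here, so "A" and "a" are separate tokens, matching the fact that the slide's flow applies no case normalization.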
SOME COMMON PROCESSING PATTERNS
• Functions
• Filters
• Joins
  ‣ Inner / Outer / Mixed
  ‣ Asymmetrical / Symmetrical
• Merge (Union)
• Grouping
  ‣ Secondary Sorting
  ‣ Unique (Distinct)
• Aggregations
  ‣ Count, Average, etc.
[Diagram: Topology. Data flows through function and filter steps arranged as Pipeline, Split, Join and Merge]
CASCADING API
• Java API
• Separates business logic from integration
• Testable at every lifecycle stage
• Works with any JVM language
• Many integration adapters
[Diagram: Cascading stack. Languages: Enterprise Java; Scripting (Scala, Clojure, JRuby, Jython, Groovy); Cascading Domain Specific Languages (DSLs). Cascading: Processing API, Integration API, Scheduler API, Process Planner, Scheduler. Underlying: Apache Hadoop, Data Stores]
FRAMEWORK AND PROGRAMMING LANGUAGE INDEPENDENCE
[Diagram: Languages: Clojure, SQL, Ruby. New Fabrics: Storm, Tez. Supported Fabrics and Data Stores: Mainframe, DB / DW, Data Stores, Hadoop, In-Memory]
• Any JVM language can use Cascading API
• Cascading applications that run on MapReduce will also run on Apache Spark, Storm, and …
THE STANDARD FOR DATA APPLICATION DEVELOPMENT
www.cascading.org
Proven application development framework for building data apps
Application platform that addresses:
• Build data apps that are scale-free: Design principles ensure best practices at any scale
• Test-Driven Development: Efficiently test code and process local files before deploying on a cluster
• Staffing Bottleneck: Use existing Java, SQL, modeling skill sets
• Operational Complexity: Simple. Package up into one jar and hand to operations
• Application Portability: Write once, then run on different computation fabrics
• Systems Integration: Hadoop never lives alone. Easily integrate to existing systems
CASCADING DATA APPLICATIONS
• Enterprise IT: Extract Transform Load; Log File Analysis; Systems Integration; Operations Analysis
• Corporate Apps: HR Analytics; Employee Behavioral Analysis; Customer Support | eCRM; Business Reporting
• Telecom: Data processing of Open Data; Geospatial Indexing; Consumer Mobile Apps; Location based services
• Marketing / Retail: Mobile, Social, Search Analytics; Funnel Analysis; Revenue Attribution; Customer Experiments; Ad Optimization; Retail Recommenders
• Consumer / Entertainment: Music Recommendation; Comparison Shopping; Restaurant Rankings; Real Estate; Rental Listings; Travel Search & Forecast
• Finance: Fraud and Anomaly Detection; Fraud Experiments; Customer Analytics; Insurance Risk Metric
• Health / Biotech: Aggregate Metrics For Govt; Person Biometrics; Veterinary Diagnostics; Next-Gen Genomics; Agronomics; Environmental Maps
STRONG ORGANIC GROWTH
• 200,000+ downloads / month
• 8,000+ deployments
BUSINESSES DEPEND ON US
• 30,000 jobs per day
• Makes complex analysis of very large data sets simple
• Machine learning, linear algebra to improve
  ‣ User experience
  ‣ Ad quality (matching users and ad effectiveness)
• All revenue applications are running on Cascading/Scalding
BUSINESSES DEPEND ON US
• Cascading Java API
• Data normalization and cleansing of search and click-through logs for use by analytics tools, Hive analysts
• Easy to operationalize heavy lifting of data in one framework
BUSINESSES DEPEND ON US
• Cascalog (Clojure)
• Weather pattern modeling to protect growers against loss
• ETL against 20+ datasets daily
• Machine learning to create models
• Purchased by Monsanto for $930M US
BROAD SUPPORT
Hadoop ecosystem supports Cascading
… AND INCLUDES RICH SET OF EXTENSIONS
http://www.cascading.org/extensions/
WORD COUNT DEMO ON HDP
SUMMARY - BUILD ROBUST DATA APPS RIGHT THE FIRST TIME WITH CASCADING
• The Cascading framework enables developers to intuitively create data applications that are scalable, robust and future-proof, supporting new execution fabrics without requiring a code rewrite
• Driven, an application visualization product, provides rich insights into how your applications execute, improving developer productivity by 10x
• Cascading 3.0 opens up the query planner: write apps once, run on any fabric
Concurrent offers training classes for Cascading & Scalding
CONTACT INFORMATION
Dhruv Kumar
Solutions Architect
Concurrent Inc.
dkumar@concurrentinc.com
DRIVING INNOVATION THROUGH DATA
THANK YOU
Dhruv Kumar
APPENDIX
USE LINGUAL TO MIGRATE ITERATIVE ETL TASKS TO SPARK
• Lingual is an extension to Cascading that executes ANSI SQL queries as Cascading apps
• Supports integrating with any data source that can be accessed through JDBC: a Cascading Tap can be created for any source supporting JDBC
• Great for migration of data and for integrating with non-Big Data assets; extends the life of existing IT assets in an organization
[Diagram: Lingual stack. Clients: CLI / Shell, Enterprise Java. Lingual: Catalog, JDBC API, Lingual API, Provider API, Query Planner. Underlying: Cascading, Apache Hadoop, Data Stores]
SCALDING
• Scalding is a language binding to Cascading for Scala
• The name Scalding comes from combining SCALa and cascaDING
• Scalding is great for Scala developers; can crisply write constructs for matrix math…
• Scalding has very large commercial deployments at:
  ‣ Twitter: use cases such as the revenue quality team, ad targeting and traffic quality
  ‣ eBay: use cases include search analytics and other production data pipelines
PATTERN ENABLES MIGRATING YOUR MODELS TO SPARK
• Pattern is an open source project that takes Predictive Model Markup Language (PMML) models and translates them into Cascading apps
• PMML is a popular XML-based standard that lets applications describe data mining and machine learning models
• PMML models from popular analytics frameworks can be reused and deployed within Cascading workflows
  ‣ Vendor frameworks: SAS, IBM SPSS, MicroStrategy, Oracle
  ‣ Open source frameworks: R, Weka, KNIME, RapidMiner
• Pattern is great for migrating your model scoring to Hadoop from your decision systems
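Model scoring, as opposed to training, needs only the fitted parameters. As a rough illustration of what scoring a logistic regression model involves, the plain-Java sketch below applies an intercept and coefficients to one feature row. It does not use Pattern and does not parse PMML; the class name, coefficient values and feature row are invented for the example.

```java
// Sketch of scoring with a fitted logistic regression model.
// Not Pattern's API; the model parameters below are invented for illustration.
class LogisticScoringSketch {

    // Probability of the positive class: sigmoid(intercept + sum(coef[i] * x[i])).
    static double score(double intercept, double[] coefficients, double[] features) {
        if (coefficients.length != features.length) {
            throw new IllegalArgumentException("coefficients and features differ in length");
        }
        double z = intercept;
        for (int i = 0; i < features.length; i++) {
            z += coefficients[i] * features[i];
        }
        return 1.0 / (1.0 + Math.exp(-z));
    }

    public static void main(String[] args) {
        double intercept = -1.5;             // hypothetical fitted intercept
        double[] coefficients = {0.8, -0.4}; // hypothetical fitted weights
        double[] newRow = {2.0, 1.0};        // one new record to score
        System.out.println("score = " + score(intercept, coefficients, newRow));
    }
}
```

Because scoring is a pure function of the parameters and one row, it can be applied to every record independently, which is why it suits a data-parallel workflow.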
PATTERN: ALGOS IMPLEMENTED
• Hierarchical Clustering
• K-Means Clustering
• Linear Regression
• Logistic Regression
• Random Forest
Algorithms are extended based on customer use cases.
BUILDING AND RUNNING PMML MODELS
[Diagram: Model Producer (Training): Data; ETL, prepare data (LINGUAL); explore data and build model using regression, clustering, etc.; PMML model; measure and improve model. Model Consumer (Scoring): New Data; ETL, prepare data (LINGUAL); PMML model (PATTERN); scores; Post Processing]