Post on 22-Mar-2018
#AnalyticsX Copyright © 2016, SAS Institute Inc. All rights reserved.
Hadoop Workflows Using SAS® Data Integration Studio
Lal Puthenveedu Rajanpillai
Solution Architect
United HealthCare
AGENDA
· Analytics Platform highlights
· Hadoop cluster – Architecture changes
· ETL Process changes
· Leveraging SAS DI Studio
· Best practices & Lessons learned
· Questions
Platform highlights
· Claims and Financial data
· 750+ users across the enterprise
· Connected to multiple data sources
· Metadata-driven SAS Grid environment
· Access from SAS clients and SAS add-ons
Pre Hadoop
[Diagram: Analytics Platform, Legacy]
· Claims, Revenue, Membership, Financial, Clinical, Operational, Call Center
· Enterprise Warehouse
· SAS Analytics Platform
· Other Analytic Tools
· SAS Client Tools
· Accounting, Regulatory, Actuarial, Data Science, Marketing, Planning, Leadership
What changed
· Unified data lake in Hadoop
· Co-location of data from multiple sources
· Hadoop cluster for storage and processing
· New users with diverse client tools
With Hadoop
[Diagram: Analytics Platform, With Hadoop]
· Claims, Revenue, Membership, Financial, Clinical, Operational, Call Center
· Enterprise Warehouse
· Hadoop Cluster
· SAS Analytics Platform
· SAS Accelerators, SAS Access, SAS In-memory, Publish
· Other Analytic Tools
· SAS Client Tools
· Direct access to Hadoop
· Accounting, Regulatory, Actuarial, Data Science, Marketing, Planning, Leadership
Benefits of new architecture
· Minimize SAN storage utilization
· Replace legacy ETL process leveraging Hadoop
· Co-location of analytics data with enterprise data lake
· Better access from non-SAS clients
· Efficient scheduling process
Old Process
· Multiple data sources
· Data loaded to staging by legacy ETL
· Triggers custom SAS jobs
· Staging tables to SAS warehouse
· Reconciles to source systems
· Updates analytics and reporting datasets
[Diagram: Data sources, Legacy ETL, Staging tables, Custom SAS ETL, SAS data (Raw, Enrich, Reconcile, Analytics)]
New Process
· HDFS for data landing and archive
· Hive staging tables
· Job processing with SAS DI Studio
· Enriched and reconciled data in SAS and Hive
· Current data (recent 3 years in SAS)
· History data in Hive
[Diagram: Data sources, Hive staging tables, SAS DI Studio, SAS data (Raw, Enrich, Reconcile, Analytics), Hive tables (Raw, Enrich, Reconcile, Analytics)]
Advantages of new process
· SAS DI Studio
· UI-based development
· Hadoop containers for HDFS, Pig and Hive
· Automatic status handling
· Better readability and maintainability of code
· Hadoop cluster for processing and storage
SAS DI Studio and Hadoop
· Containers for Hadoop jobs
· Pig container with PROC HADOOP
· Hive container with PROC SQL
· Hadoop file reader/writer for direct access to HDFS files
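To illustrate what a Pig container step might boil down to, here is a minimal sketch using PROC HADOOP. It is not the code DI Studio generates. The configuration file path, user name, password, HDFS paths and the Pig script itself are hypothetical placeholders.

```sas
/* Hypothetical Hadoop client configuration file; replace with your site's file. */
filename cfg "/opt/sas/hadoop/conf/hadoop-config.xml";

/* Write a small Pig script to a temporary file (all HDFS paths are placeholders). */
filename pigcode temp;
data _null_;
   file pigcode;
   put "raw  = LOAD '/data/landing/claims.txt' USING PigStorage('|');";
   put "good = FILTER raw BY $0 IS NOT NULL;";
   put "STORE good INTO '/data/staging/claims' USING PigStorage('|');";
run;

/* Submit the Pig script to the cluster. */
proc hadoop options=cfg username="etl_user" password="XXXXXXXX" verbose;
   pig code=pigcode;
run;
```

The same procedure also accepts HDFS statements (for example MKDIR= and COPYFROMLOCAL=), which would cover the file copy part of such a step.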
SAS DI Studio job
1: INITIALIZE. Create new run, copy file, file DQ. Pig Container
2: SAS STAGE. Load file to SAS staging. DATA step with LIBNAME to HDFS
3: SAS LOAD. Stage to SAS data. Data Append
4: HIVE STAGE. Load files to Hive staging. Hive Containers
5: HIVE LOAD. Stage to Hive data. Hive Container
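The two Hive steps of the job (stage, then load) could be approximated with explicit SQL pass-through, roughly as follows. This is a sketch under assumptions, not the code a Hive container produces. The server, port, schema, credentials, table names (claims_stg, claims) and HDFS path are all hypothetical.

```sas
proc sql;
   /* Hypothetical HiveServer2 host, port, schema and credentials. */
   connect to hadoop (server="hive.example.com" port=10000
                      schema=analytics user="etl_user" password="XXXXXXXX");

   /* HIVE STAGE: move the landed HDFS files into the staging table. */
   execute (
      load data inpath '/data/staging/claims'
      overwrite into table claims_stg
   ) by hadoop;

   /* HIVE LOAD: append the staged rows to the target table. */
   execute (
      insert into table claims
      select * from claims_stg
   ) by hadoop;

   disconnect from hadoop;
quit;
```

After each EXECUTE statement, the automatic macro variables SQLXRC and SQLXMSG hold the return code and message from the pass-through call, which is one way to feed status back to the job.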
Hadoop Containers
· Pig Container: HDFS file moves, quality checks, data filtering
· Hive Container: Hive table load, read, updates
· Hadoop file reader/writer: HDFS file read/write from/to SAS data
Status Handling
· Conditional process flow
· All status handling at job level
· Functionality-based sub-jobs
· Return codes and error messages from containers
* For SAS 9.4M1, the Pig Container needs a workaround for status handling
Compatibility
· SAS and Hadoop compatibility
· SAS 9.4M1 version
· Hive 0.13: table formats to avoid space issues and ensure proper data conversion
· Error handling of Pig container
· Hadoop error code not populated; used the error text to set the return code:
    %if "&SYSERRORTEXT" ne "" %then %do;
        %let trans_rc = 9999;
        %let job_rc = 9999;
    %end;
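Open-code %IF is only available from SAS 9.4M5 onward, so on 9.4M1 a check like the one above has to run inside a macro. A minimal sketch of such a wrapper follows. The macro name check_pig_status is hypothetical, and the %GLOBAL statement is only there to keep the example self-contained (trans_rc and job_rc are assumed to already exist in the job).

```sas
/* Assumption: trans_rc and job_rc already exist in the job;     */
/* declaring them here only makes the sketch self-contained.     */
%global trans_rc job_rc;

%macro check_pig_status;   /* hypothetical helper name */
   /* %superq avoids resolving any & or % inside the error text. */
   %if %length(%superq(syserrortext)) > 0 %then %do;
      %let trans_rc = 9999;
      %let job_rc   = 9999;
      %put NOTE: Pig step reported: %superq(syserrortext);
   %end;
%mend check_pig_status;

/* Call immediately after the Pig step. */
%check_pig_status
```

SYSERRORTEXT keeps the text of the last error in the session, so a check of this kind is only meaningful when it runs right after the Pig step and no earlier error is still pending.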
And more...
· Optimized the job stream
· Two job streams
· The first stream updated the SAS datasets, with minimal dependency on the cluster
· The second stream updated the Hive tables
· Made effective use of the loop functionality provided by SAS
· Hive with ORC SerDe and Snappy compression
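For the ORC and Snappy point, a Hive table definition submitted through SQL pass-through might look like the sketch below. The connection details, table name and columns are hypothetical; only the STORED AS ORC clause and the orc.compress table property are the point of the example.

```sas
proc sql;
   /* Hypothetical HiveServer2 host, port, schema and credentials. */
   connect to hadoop (server="hive.example.com" port=10000
                      schema=analytics user="etl_user" password="XXXXXXXX");

   /* ORC storage with Snappy compression; columns are illustrative only. */
   execute (
      create table if not exists claims_hist (
         claim_id  string,
         member_id string,
         paid_amt  double,
         paid_dt   string
      )
      stored as orc
      tblproperties ("orc.compress"="SNAPPY")
   ) by hadoop;

   disconnect from hadoop;
quit;
```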
Questions?