A Scalable Data Transformation Framework using the Hadoop Ecosystem
Raj Nair, Director – Data Platform
Kiru Pakkirisamy, CTO


Page 1:

A Scalable Data Transformation Framework using the Hadoop Ecosystem

Raj Nair, Director – Data Platform
Kiru Pakkirisamy, CTO

Page 2:

AGENDA

• About Penton and Serendio Inc

• Data Processing at Penton

• PoC Use Case

• Functional Aspects of the Use Case

• Big Data Architecture, Design and Implementation

• Lessons Learned

• Conclusion

• Questions

Page 3:

About Penton

• Professional information services company

• Provide actionable information to five core markets

– Agriculture, Transportation, Natural Products, Infrastructure, Industrial Design & Manufacturing

Success Stories

• EquipmentWatch.com – Prices, Specs, Costs, Rental

• Govalytics.com – Analytics around Gov't capital spending down to the county level

• SourceESB – Vertical directory, electronic parts

• NextTrend.com – Identify new product trends in the natural products industry

Page 4:

About Serendio

Serendio provides Big Data Science Solutions & Services for Data-Driven Enterprises.

www.serendio.com

Page 5:

Data Processing at Penton

Page 6:

What got us thinking?

• Business units process data in silos

• Heavy ETL

– Hours to process, in some cases days

• Not even using all the data we want

• Not logging what we needed to

• Can’t scale for future requirements

Page 7:

The Data Processing Pipeline

[Diagram: data moves assembly-line style through the processing pipeline, producing business value – new features, new insights, new products]

Page 8:

Penton examples

• Daily Inventory data, ingested throughout the day

(tens of thousands of parts)

• Auction and survey data gathered daily

• Aviation Fleet data, varying frequency

[Diagram: pipeline stages – ingest/store → clean/validate → apply business rules → map → analyze → report → distribute]

• Various data formats, mostly unstructured

• Slow Extract, Transform and Load = frustration + missed business SLAs

• Won't scale for the future

Page 9:

Current Design

• Survey data loaded as CSV files

• Data needs to be scrubbed/mapped

• All CSV rows loaded into one table

• Once scrubbed/mapped, data is loaded into the main tables

• Not all rows are loaded; some may be used in the future

Page 10:

What were our options?

Adopt Hadoop Ecosystem

- M/R: Ideal for Batch Processing

- Flexible for storage

- NoSQL: scale, usability and flexibility

Expand RDBMS options

- Expensive

- Complex

[Technologies considered: HBase and Drools vs. Oracle and SQL Server]

Page 11:

POC Use Case

Page 12:

Primary Use Case

• Daily model data – upload and map

– Ingest data, build buckets

– Map data (batch and interactive)

– Build Aggregates (dynamic)

Issue: Mapping time

Page 13:

Functional Aspects

Page 14:

Data Scrubbing

• Standardized names for fields/columns

• Example - Country

– United States of America -> USA

– United States -> USA
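A minimal sketch of this kind of standardization; in the deck the scrubbing is driven by Drools rules, so the plain-Java lookup below (class name, alias map and values are hypothetical) is only illustrative.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch only: the deck applies these standardizations with Drools
// rules; this plain-Java normalizer with a hypothetical alias map shows the idea.
public class CountryScrubber {

    private static final Map<String, String> COUNTRY_ALIASES = new HashMap<>();
    static {
        COUNTRY_ALIASES.put("united states of america", "USA");
        COUNTRY_ALIASES.put("united states", "USA");
        COUNTRY_ALIASES.put("u.s.a.", "USA");
    }

    /** Returns the standardized country name, or the trimmed input if no rule matches. */
    public static String scrub(String rawCountry) {
        if (rawCountry == null) {
            return null;
        }
        String key = rawCountry.trim().toLowerCase();
        return COUNTRY_ALIASES.getOrDefault(key, rawCountry.trim());
    }

    public static void main(String[] args) {
        System.out.println(scrub("United States"));            // USA
        System.out.println(scrub("United States of America")); // USA
    }
}
```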

Page 15:

Data Mapping

• Converting fields -> IDs

– Manufacturer: Caterpillar -> 25

– Model: Caterpillar/Front Loader -> 300

• Requires lookup tables and partial/fuzzy string matching (see the sketch below)
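A hedged sketch of the mapping step: an exact lookup against an imported lookup table with a simple edit-distance fallback for partial/fuzzy matches. The class name, table contents and distance threshold are assumptions, not the production logic.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch: map a free-text manufacturer name to its lookup-table id, falling
// back to the closest entry by edit distance when there is no exact match.
public class ManufacturerMapper {

    private final Map<String, Integer> lookup = new HashMap<>();

    public ManufacturerMapper() {
        lookup.put("caterpillar", 25);   // e.g. Manufacturer: Caterpillar -> 25
    }

    /** Returns the mapped id, or null when nothing is close enough. */
    public Integer mapToId(String rawName) {
        if (rawName == null) {
            return null;
        }
        String key = rawName.trim().toLowerCase();
        Integer exact = lookup.get(key);
        if (exact != null) {
            return exact;
        }
        // Fuzzy fallback: pick the lookup entry with the smallest edit distance.
        Integer bestId = null;
        int bestDistance = Integer.MAX_VALUE;
        for (Map.Entry<String, Integer> e : lookup.entrySet()) {
            int d = levenshtein(key, e.getKey());
            if (d < bestDistance) {
                bestDistance = d;
                bestId = e.getValue();
            }
        }
        return bestDistance <= 2 ? bestId : null;   // hypothetical threshold
    }

    private static int levenshtein(String a, String b) {
        int[][] dp = new int[a.length() + 1][b.length() + 1];
        for (int i = 0; i <= a.length(); i++) dp[i][0] = i;
        for (int j = 0; j <= b.length(); j++) dp[0][j] = j;
        for (int i = 1; i <= a.length(); i++) {
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                dp[i][j] = Math.min(Math.min(dp[i - 1][j] + 1, dp[i][j - 1] + 1),
                                    dp[i - 1][j - 1] + cost);
            }
        }
        return dp[a.length()][b.length()];
    }
}
```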

Page 16:

Data Exporting

• Move scrubbed/mapped data to main RDBMS

Page 17:

Key Pain Points

• CSV data table continues to grow

• The table's large size impacts operations on the rows of a single file

• CSV data could grow rapidly in the future

Page 18:

Criteria for New Design

• Ability to store an individual file and manipulate it easily

– No join/relationships across CSV files

• Solution should have good integration with RDBMS

• Could possibly host the complete application in the future

• Technology stack should possibly have advanced analytics

capabilities

A NoSQL model would allow an individual file to be quickly retrieved/addressed and manipulated

Page 19:

Big Data Architecture

Page 20:

Solution Architecture

[Solution architecture diagram]

• HBase on Hadoop HDFS is used as the store for uploaded CSV files

• Data manipulation APIs (CSV and rule management endpoints) are exposed through a REST layer, called from the Data Upload UI and existing business applications

• Drools handles rule-based data scrubbing

• Operations on individual files in the UI go through HBase Get/Put; operations on all or groups of files run as MR jobs

• Accepted data is inserted into the current Oracle schema, the master database of products/parts

Page 21:

HBase Schema Design

• One row per HBase row

• One file per HBase row

– One cell per column qualifier (simple; development started with this approach)

– One row per column qualifier (more performant approach; see the sketch below)
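A minimal sketch of the second layout, assuming a hypothetical csv_files table with a "d" column family: one uploaded file becomes one HBase row, with one CSV line per column qualifier. It uses the pre-1.0 client API, in line with the HBase version discussed later in the deck.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

// Sketch: store a whole CSV file as a single HBase row, one line per qualifier.
// Table name, family "d" and the qualifier scheme are hypothetical.
public class CsvFileStore {

    public void storeFile(byte[] rowKey, java.util.List<String> csvLines) throws java.io.IOException {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "csv_files");   // pre-1.0 client API
        try {
            Put put = new Put(rowKey);
            int i = 0;
            for (String line : csvLines) {
                // qualifier "r000001", "r000002", ... preserves the original line order
                put.add(Bytes.toBytes("d"), Bytes.toBytes(String.format("r%06d", i++)), Bytes.toBytes(line));
            }
            table.put(put);
        } finally {
            table.close();
        }
    }
}
```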

Page 22:

HBase Rowkey Design

• Row Key

– Composite

• Created Date (YYYYMMDD)

• User

• FileType

• GUID

• Salting for better region splitting

– One byte
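A sketch of building such a key. Only the component order (created date, user, file type, GUID) and the single leading salt byte come from the slide; the delimiter, bucket count and salting function are assumptions.

```java
import java.nio.charset.StandardCharsets;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.UUID;

// Sketch of the composite, salted row key: [salt byte][YYYYMMDD][user][file type][GUID].
public class RowKeyBuilder {

    private static final int SALT_BUCKETS = 16;   // hypothetical number of buckets

    public static byte[] build(Date createdDate, String user, String fileType) {
        String guid = UUID.randomUUID().toString();
        String body = new SimpleDateFormat("yyyyMMdd").format(createdDate)
                + "|" + user + "|" + fileType + "|" + guid;
        // One salt byte derived from the rest of the key spreads rows across region splits.
        byte salt = (byte) Math.floorMod(body.hashCode(), SALT_BUCKETS);
        byte[] bodyBytes = body.getBytes(StandardCharsets.UTF_8);
        byte[] key = new byte[bodyBytes.length + 1];
        key[0] = salt;
        System.arraycopy(bodyBytes, 0, key, 1, bodyBytes.length);
        return key;
    }
}
```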

Page 23:

HBase Column Family Design

• Column Family

– Data separated from Metadata into two or more

column families

– One cf for mapping data (more later)

– One cf for analytics data (used by analytics

coprocessors)
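A sketch of the corresponding table definition, assuming hypothetical family names "d" (CSV data), "m" (mapping data) and "a" (analytics data) and the pre-1.0 HBaseAdmin API.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.client.HBaseAdmin;

// Sketch: a CSV-file table with data, mapping and analytics column families kept separate.
public class CsvTableSetup {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HBaseAdmin admin = new HBaseAdmin(conf);           // pre-1.0 admin API
        HTableDescriptor table = new HTableDescriptor("csv_files");
        table.addFamily(new HColumnDescriptor("d"));       // raw CSV data
        table.addFamily(new HColumnDescriptor("m"));       // mapping data
        table.addFamily(new HColumnDescriptor("a"));       // analytics data, read by coprocessors
        admin.createTable(table);
        admin.close();
    }
}
```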

Page 24:

M/R Jobs

• Jobs

– Scrubbing

– Mapping

– Export

• Schedule

– Manually from UI

– On schedule using Oozie
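A hedged sketch of a driver for the scrubbing job: scan the CSV table with a TableMapper and write the cleaned cells back to the same table. The table name, ScrubMapper class and the no-op scrub() placeholder are hypothetical; the real job applies the Drools rules.

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.KeyValue;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.mapreduce.Job;

public class ScrubJobDriver {

    // Minimal mapper: rewrites each cell after passing its value through a
    // placeholder scrub() routine standing in for the rule-based scrubbing.
    public static class ScrubMapper extends TableMapper<ImmutableBytesWritable, Put> {
        @Override
        protected void map(ImmutableBytesWritable key, Result row, Context context)
                throws IOException, InterruptedException {
            Put put = new Put(key.get());
            for (KeyValue kv : row.raw()) {                 // pre-0.96 cell API
                put.add(kv.getFamily(), kv.getQualifier(), scrub(kv.getValue()));
            }
            context.write(key, put);
        }

        private static byte[] scrub(byte[] value) {
            return value;                                   // real job applies scrubbing rules here
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        Job job = Job.getInstance(conf, "csv-scrubbing");
        job.setJarByClass(ScrubJobDriver.class);

        Scan scan = new Scan();
        scan.setCaching(500);                               // larger scanner caching for batch reads
        scan.setCacheBlocks(false);                         // don't fill the block cache from MR scans

        TableMapReduceUtil.initTableMapperJob("csv_files", scan, ScrubMapper.class,
                ImmutableBytesWritable.class, Put.class, job);
        TableMapReduceUtil.initTableReducerJob("csv_files", null, job);  // writes mapper-emitted Puts back
        job.setNumReduceTasks(0);                           // map-only: Puts go straight to the table

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```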

Page 25:

Sqoop Jobs

• One time

– FileDetailExport (current CSV)

– RuleImport (all current rules)

• Periodic

– Lookup table data import

• Manufacturer

• Model

• State

• Country

• Currency

• Condition

• Participant

Page 26:

Application Integration - REST

• Hide the HBase/Java APIs from the rest of the application

• Language independence for PHP front-end

• REST APIs for

– CSV Management

– Drools Rule Management
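A minimal sketch of such an endpoint using JAX-RS; the path, payload and stubbed lookup are hypothetical, and only the idea of a REST layer in front of HBase comes from the deck.

```java
import javax.ws.rs.GET;
import javax.ws.rs.Path;
import javax.ws.rs.PathParam;
import javax.ws.rs.Produces;
import javax.ws.rs.core.MediaType;

// Sketch of a CSV-management endpoint that keeps the HBase Java API behind a
// language-neutral REST interface for the PHP front end.
@Path("/csv")
public class CsvResource {

    @GET
    @Path("/{fileId}")
    @Produces(MediaType.APPLICATION_JSON)
    public String getFile(@PathParam("fileId") String fileId) {
        // The real service would issue an HBase Get for the file's row and
        // serialize its cells; a stub payload keeps this sketch self-contained.
        return "{\"fileId\":\"" + fileId + "\",\"status\":\"ok\"}";
    }
}
```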

Page 27:

Lessons Learned

Page 28:

Performance Benefits

• Mapping

– 20,000 CSV files, 20 million records

– Time taken: 1/3 of the RDBMS processing time

• Metrics

– < 10 secs (vs Oracle Materialized View)

• Upload a file

– < 10 secs

• Delete a file

– < 10 secs

Page 29:

HBase Tuning

• Heap Size for

– RegionServer

– MapReduce Tasks

• Table Compression

– SNAPPY for the column family holding CSV data

• Table data caching

– IN_MEMORY for lookup tables
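A sketch of these two settings via the HBase client API (the Compression import path matches pre-0.96 HBase; table and family names are hypothetical).

```java
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.io.hfile.Compression;

// Sketch: SNAPPY compression on the CSV-data family, IN_MEMORY caching for lookup tables.
public class TableTuning {

    public static HTableDescriptor csvTable() {
        HTableDescriptor csv = new HTableDescriptor("csv_files");
        HColumnDescriptor data = new HColumnDescriptor("d");
        data.setCompressionType(Compression.Algorithm.SNAPPY);   // compress bulky CSV cells
        csv.addFamily(data);
        return csv;
    }

    public static HTableDescriptor lookupTable(String name) {
        HTableDescriptor lookup = new HTableDescriptor(name);
        HColumnDescriptor cf = new HColumnDescriptor("d");
        cf.setInMemory(true);                                     // keep small lookup data cached
        lookup.addFamily(cf);
        return lookup;
    }
}
```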

Page 30:

Application Design Challenges

• Pagination – implemented using intermediate REST layer and

scan.setStartRow.

• Translating SQL queries

– Used Scan/Filter and Java (especially in coprocessors)

– No secondary indexes – used FuzzyRowFilter

– Maybe something like Phoenix would have helped

• Some issues in mixed mode. Want to move to 0.96.0 for better/individual column family flushing, but needed to 'port' coprocessors (to protobuf)
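A hedged sketch of the two workarounds above: paging by carrying the last row key into Scan.setStartRow, and a FuzzyRowFilter scan standing in for a secondary index. The key template, mask and page size are hypothetical.

```java
import java.util.Arrays;
import java.util.List;

import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.filter.FuzzyRowFilter;
import org.apache.hadoop.hbase.util.Pair;

// Sketch of pagination and fuzzy-key lookups over the CSV-file table.
public class CsvScans {

    /** Returns up to pageSize rows starting at startRow (pass the last key of the previous page). */
    public static Result[] page(HTable table, byte[] startRow, int pageSize) throws Exception {
        Scan scan = new Scan();
        if (startRow != null) {
            scan.setStartRow(startRow);        // resume where the previous page ended
        }
        scan.setCaching(pageSize);
        ResultScanner scanner = table.getScanner(scan);
        try {
            return scanner.next(pageSize);
        } finally {
            scanner.close();
        }
    }

    /** Matches rows whose key has fixed bytes at given offsets (e.g. file type), anything elsewhere. */
    public static Scan byKeyPattern(byte[] keyTemplate, byte[] fuzzyMask) {
        List<Pair<byte[], byte[]>> fuzzyKeys =
                Arrays.asList(new Pair<byte[], byte[]>(keyTemplate, fuzzyMask)); // mask: 0 = must match, 1 = don't care
        Scan scan = new Scan();
        scan.setFilter(new FuzzyRowFilter(fuzzyKeys));
        return scan;
    }
}
```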

Page 31:

HBase Value Proposition

• Better UI response for CSV file operations – operations within a file (map, import, reject, etc.) no longer depend on database size

• Relieve load on RDBMS - no more CSV data tables

• Scale out batch processing performance on the cheap (vs

vertical RDBMS upgrade)

• Redundant store for CSV files

• Versioning to track data cleansing

Page 32:

Roadmap

• Benchmark with 0.96

• Retire Coprocessors in favor of Phoenix (?)

• Lookup data tables are small; need to find a better alternative than HBase

• Design the UI for a model more appropriate to Big Data

– Search-oriented paradigm rather than exploratory/paginated

– Add REST endpoints to support such a UI

Page 33:

Wrap-Up

Page 34:

Conclusion

• PoC demonstrated

– value of the Hadoop ecosystem

– Co-existence of Big Data technologies with current solutions

– Adoption can significantly improve scale

– New skill requirements