Building a Self Service Analytics Platform on Hadoop

Avinash Ramineni

Clairvoyant

Clairvoyant Services

Quick Poll

• Big Data Deployments in Prod

• Hadoop Distributions

• People use Ecosystems rather than tools

• Architecture was implemented on Cloudera

• Cloud Experience – AWS?

Challenges

• Data in Silos

• Data acquires perspectives as it is moved

• Data availability delays

• Legacy Systems handling the Volume, Veracity and Velocity

• Extracting data from legacy systems

• Lack of Self-Service Capabilities

• Knowledge becomes tribal – instead of institutional

• Security / Compliance Requirements

Data Lake Attributes

• Data Democratization

• Data Discovery

• Data Lineage

• Self-Service capabilities

• Metadata Management

Without Self-Service

Self-Service at all Levels

Ingest, Organize, Enrich, Analyze, Dashboards

Ingest, Organize, Enrich, Analyze, Insights

Key Design Tenets

• Separation of Compute and Storage

• Independently scale compute and storage

• Data Democratization and Governance

• Bring your own Compute (BYOC)

• HA / DR

• Open Source Stack

Separation of Compute and Storage

• Scale storage and compute independently

• Shifts bottleneck from Disk IO to Network

• Centralized Data Storage

• Data Democratization

• No data duplication

• Easier Hardware upgrade paths

• Flexible Architecture

• DR Simplified

BYOC (Bring Your Own Cluster)

• Each department/application can bring its own Hadoop cluster

• Eliminates the need for very large clusters

• Easier to administer and maintain

• Reduces multi-tenancy issues

• Clusters can be upgraded independently

• Enables usage based cost model

[Figure: Marketing Cluster, Personalization Cluster and Main Cluster sharing Centralized / Common S3 Storage]

Architecture

Architecture – Data Ingestion Layer

• DB Ingestor

• Stream Ingestor

• Kafka and Spark Streaming

• File Ingestor

• FTP / SFTP / Logs

• Ingestion using Service API

Architecture – Data Processing Layer

• Storage layer carved into logical buckets

• Landing, Raw, Derived and Delivery

• Schema stored with data (no guesswork)

• Platform Jobs

• Converting text to Parquet

• Saving streaming data as Parquet

• Derivatives

• Compaction

• Standardization
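The four logical buckets can be pictured as prefixes under one shared store. The following is a minimal sketch, not the layout used in the talk: the bucket name `example-data-lake`, the `zone/source/dataset/partition` ordering, and the Hive-style `key=value` partition directories are all assumptions made for illustration.

```python
from datetime import date

# The four logical zones named on the slide, in the order data flows through them.
ZONES = ("landing", "raw", "derived", "delivery")


def zone_path(zone: str, source: str, dataset: str, day: date,
              bucket: str = "example-data-lake") -> str:
    """Build an S3-style prefix for one day's data in a zone.

    The bucket name and the key layout are hypothetical.
    """
    if zone not in ZONES:
        raise ValueError(f"unknown zone: {zone!r}")
    # Hive-style partition directories (assumed convention).
    partition = f"year={day.year}/month={day.month:02d}/day={day.day:02d}"
    return f"s3://{bucket}/{zone}/{source}/{dataset}/{partition}/"


# The same dataset and day resolve to parallel locations in each zone.
for zone in ZONES:
    print(zone_path(zone, "crm", "customers", date(2017, 3, 9)))
```

Keeping one path function for every zone means a platform job only needs the zone name to find its input and output locations.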

Architecture – Data Delivery Layer

• Data Delivery

• SQL - Spark Thrift Server / Impala

• Tableau, SQL IDE, Applications

• Self Service

• Derivatives

• Represented via SQL on Delivery Layer

• Stored in Derived Storage Layer

• Metadata driven

• Derived Layer Generators

• Long running Spark Job

• Derivative Refresh
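One way to read "metadata driven" is that each derivative is a small record holding a name, its SQL, and the derivatives it builds on, and the generator derives a refresh order from those records. The sketch below assumes that reading; the `Derivative` class, its fields, and the table names are hypothetical and are not taken from the platform itself.

```python
from dataclasses import dataclass
from graphlib import TopologicalSorter  # standard library, Python 3.9+
from typing import Iterable, List, Tuple


@dataclass(frozen=True)
class Derivative:
    """Metadata for one self-service derivative (hypothetical shape)."""
    name: str
    sql: str
    depends_on: Tuple[str, ...] = ()


def refresh_order(derivatives: Iterable[Derivative]) -> List[str]:
    """Order derivative names so each one follows the derivatives it reads."""
    derivatives = list(derivatives)
    known = {d.name for d in derivatives}
    # Dependencies on plain raw tables are not derivatives, so they are ignored.
    graph = {d.name: {dep for dep in d.depends_on if dep in known}
             for d in derivatives}
    return list(TopologicalSorter(graph).static_order())


catalog = [
    Derivative(
        name="weekly_orders",
        sql="SELECT week, SUM(orders) AS orders FROM daily_orders GROUP BY week",
        depends_on=("daily_orders",),
    ),
    Derivative(
        name="daily_orders",
        sql="SELECT order_date, COUNT(*) AS orders FROM raw_orders GROUP BY order_date",
        depends_on=("raw_orders",),
    ),
]
print(refresh_order(catalog))
```

A long-running generator job could walk this order on each refresh cycle, running each derivative's SQL and writing the result to the derived storage layer.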

Key Takeaways - Cloud

• Hadoop cloud readiness

• Cloudera Director Limitations

• Multiple Availability Zones, regions

• Storage

• Instance Storage

• EBS Volumes

• gp2 vs st1

• S3 Eventual Consistency
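Under eventual consistency, a job that reads or lists a key immediately after another job wrote it may not see it yet. One common mitigation is to poll with backoff before the downstream step starts. The sketch below assumes that approach: `exists` is a caller-supplied, hypothetical check (for example a thin wrapper around an object HEAD request), and the attempt count and delays are arbitrary. Amazon S3 has offered strong read-after-write consistency since December 2020, so the pattern matters mainly for the platform as it stood at the time of this talk and for other eventually consistent stores.

```python
import time
from typing import Callable


def wait_until_visible(exists: Callable[[str], bool], key: str,
                       attempts: int = 5, base_delay: float = 0.5,
                       sleep: Callable[[float], None] = time.sleep) -> bool:
    """Poll until exists(key) is true, doubling the delay after each miss.

    Returns True once the key is visible, False if every attempt misses.
    """
    delay = base_delay
    for attempt in range(attempts):
        if exists(key):
            return True
        # No point sleeping after the final attempt.
        if attempt < attempts - 1:
            sleep(delay)
            delay *= 2
    return False
```

Passing `sleep` as a parameter keeps the helper testable without real waiting; a platform job would leave it at the default.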

Key Takeaways - Spark Thrift Server

• Spark Thrift Server Support

• Performance Tuning

• Concurrency

• Partition strategy

• Cache Tables

• Compression Codec for Parquet

• Snappy vs gzip

Key Takeaways - Security

• Secure by Design, Secure by Default

• Access to Data on S3

• IAM Roles

• Sentry

• Support for Spark

• Kerberos

• Spark Thrift Server

• Navigator

• Support for Spark
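Controlling access to data on S3 through IAM roles usually comes down to attaching a policy that scopes a cluster's role to specific prefixes. The sketch below builds such a policy document for read-only access to one storage zone. The bucket name and the idea of one prefix per zone are assumptions for illustration; the talk does not show the actual policies used.

```python
import json


def read_only_policy(bucket: str, prefix: str) -> dict:
    """Build an IAM policy document granting read-only access to one prefix.

    The bucket and prefix are supplied by the caller; the values used
    below are hypothetical.
    """
    prefix = prefix.strip("/")
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Listing is granted on the bucket, restricted to the prefix.
                "Effect": "Allow",
                "Action": ["s3:ListBucket"],
                "Resource": [f"arn:aws:s3:::{bucket}"],
                "Condition": {"StringLike": {"s3:prefix": [f"{prefix}/*"]}},
            },
            {
                # Reading is granted on the objects under the prefix.
                "Effect": "Allow",
                "Action": ["s3:GetObject"],
                "Resource": [f"arn:aws:s3:::{bucket}/{prefix}/*"],
            },
        ],
    }


print(json.dumps(read_only_policy("example-data-lake", "delivery"), indent=2))
```

A department cluster that only consumes published data could run under a role carrying a policy like this, while platform jobs would need a separate role with write access to the zones they populate.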

Key Takeaways - General

• Rapidly Changing Technology

• Feature addition

• Documentation

• Bugs

• Jar hell

• Small files

• Performance Issues

• Compaction
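Compaction addresses the small-files problem by rewriting many small files as fewer large ones. The planning step can be sketched as a packing problem: group small files into batches that each approach a target size. The version below is an illustration only; the 128 MiB default target and the first-fit-decreasing heuristic are assumptions, not the platform's actual compaction job.

```python
from typing import Iterable, List, Tuple


def plan_compaction(files: Iterable[Tuple[str, int]],
                    target_bytes: int = 128 * 1024 * 1024) -> List[List[str]]:
    """Group (path, size) pairs into batches totalling at most target_bytes.

    Files already at or above the target are left alone. Uses first-fit
    decreasing, a simple greedy heuristic rather than an optimal packing.
    """
    small = sorted((f for f in files if f[1] < target_bytes),
                   key=lambda f: f[1], reverse=True)
    batches: List[List[str]] = []
    remaining: List[int] = []  # free space left in each batch
    for path, size in small:
        for i, free in enumerate(remaining):
            if size <= free:
                batches[i].append(path)
                remaining[i] -= size
                break
        else:
            # No existing batch has room, so start a new one.
            batches.append([path])
            remaining.append(target_bytes - size)
    # Rewriting a batch that holds a single file gains nothing.
    return [batch for batch in batches if len(batch) > 1]
```

Each returned batch would then be read and rewritten as a single output file by the compaction job, after which the original small files can be removed.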

Key Takeaways - General

• Partition Strategy

• Parquet Files

• Balancing parallelism and throughput

• Table Partitions

• Cluster sizing, optimization and tuning

• Integrating with Corporate infrastructure

• Deployment practices

• Monitoring and Alerting

• Information Security Policies

Data Security

Questions

• Principal @ Clairvoyant

• Email: [email protected]

• LinkedIn: https://www.linkedin.com/in/avinashramineni