building a self service analytics platform on hadoop
TRANSCRIPT
![Page 1: Building A Self Service Analytics Platform on Hadoop](https://reader031.vdocument.in/reader031/viewer/2022030318/5a64965e7f8b9a63568b4c45/html5/thumbnails/1.jpg)
Building a Self Service Analytics Platform on Hadoop
Avinash Ramineni
![Page 2: Building A Self Service Analytics Platform on Hadoop](https://reader031.vdocument.in/reader031/viewer/2022030318/5a64965e7f8b9a63568b4c45/html5/thumbnails/2.jpg)
Clairvoyant
![Page 3: Building A Self Service Analytics Platform on Hadoop](https://reader031.vdocument.in/reader031/viewer/2022030318/5a64965e7f8b9a63568b4c45/html5/thumbnails/3.jpg)
Clairvoyant Services
![Page 4: Building A Self Service Analytics Platform on Hadoop](https://reader031.vdocument.in/reader031/viewer/2022030318/5a64965e7f8b9a63568b4c45/html5/thumbnails/4.jpg)
Quick Poll
• Big Data Deployments in Prod
• Hadoop Distributions
• People use Ecosystems rather than tools
• Architecture was implemented on Cloudera
• Cloud Experience – AWS ?
![Page 5: Building A Self Service Analytics Platform on Hadoop](https://reader031.vdocument.in/reader031/viewer/2022030318/5a64965e7f8b9a63568b4c45/html5/thumbnails/5.jpg)
Challenges
• Data in Silos
• Acquires Perspectives as data is moved
• Data availability delays
• Legacy Systems handling the Volume, Veracity and Velocity
• Extracting data from legacy systems
• Lack of Self-Service Capabilities
• Knowledge becomes tribal – instead of institutional
• Security / Compliance Requirements
![Page 6: Building A Self Service Analytics Platform on Hadoop](https://reader031.vdocument.in/reader031/viewer/2022030318/5a64965e7f8b9a63568b4c45/html5/thumbnails/6.jpg)
Data Lake Attributes
• Data Democratization
• Data Discovery
• Data Lineage
• Self-Service capabilities
• Metadata Management
![Page 7: Building A Self Service Analytics Platform on Hadoop](https://reader031.vdocument.in/reader031/viewer/2022030318/5a64965e7f8b9a63568b4c45/html5/thumbnails/7.jpg)
Without Self-Service
![Page 8: Building A Self Service Analytics Platform on Hadoop](https://reader031.vdocument.in/reader031/viewer/2022030318/5a64965e7f8b9a63568b4c45/html5/thumbnails/8.jpg)
Self-Service at all Levels
Ingest Organize Enrich Analyze Dashboards
Ingest Organize Enrich Analyze Insights
![Page 9: Building A Self Service Analytics Platform on Hadoop](https://reader031.vdocument.in/reader031/viewer/2022030318/5a64965e7f8b9a63568b4c45/html5/thumbnails/9.jpg)
Key Design Tenets
• Separation of Compute and Storage
• Independently scale compute and storage
• Data Democratization and Governance
• Bring your own Compute (BYOC)
• HA / DR
• Open Source Stack
![Page 10: Building A Self Service Analytics Platform on Hadoop](https://reader031.vdocument.in/reader031/viewer/2022030318/5a64965e7f8b9a63568b4c45/html5/thumbnails/10.jpg)
Separation of Compute and Storage
• Scale storage and compute independently
• Shifts bottleneck from Disk IO to Network
• Centralized Data Storage
• Data Democratization
• No data duplication
• Easier Hardware upgrade paths
• Flexible Architecture
• DR Simplified
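The independent-scaling point above can be illustrated with a little arithmetic. This is a minimal sketch, not the sizing method used in the talk: the function names, node capacities and workload figures are all hypothetical.

```python
import math


def nodes_coupled(storage_tb, compute_cores, tb_per_node, cores_per_node):
    """Coupled cluster: each node brings disk and CPU together,
    so the larger of the two requirements decides the node count."""
    by_storage = math.ceil(storage_tb / tb_per_node)
    by_compute = math.ceil(compute_cores / cores_per_node)
    return max(by_storage, by_compute)


def nodes_decoupled(compute_cores, cores_per_node):
    """Decoupled cluster: data lives in a shared store,
    so only the compute requirement decides the node count."""
    return math.ceil(compute_cores / cores_per_node)


if __name__ == "__main__":
    # Hypothetical workload: 400 TB of data, 160 cores of steady compute.
    print(nodes_coupled(400, 160, tb_per_node=20, cores_per_node=16))
    print(nodes_decoupled(160, cores_per_node=16))
```

With these assumed numbers the coupled cluster is sized by its storage requirement, while the decoupled cluster only has to cover compute.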
![Page 11: Building A Self Service Analytics Platform on Hadoop](https://reader031.vdocument.in/reader031/viewer/2022030318/5a64965e7f8b9a63568b4c45/html5/thumbnails/11.jpg)
BYOC (Bring Your Own Cluster)
• Each department/application can bring its own Hadoop cluster
• Eliminates the need for very large clusters
• Easier to administer and maintain
• Reduces multi-tenancy issues
• Clusters can be upgraded independently
• Enables usage based cost model
Centralized / Common S3 Storage
[Diagram: Marketing Cluster, Personalization Cluster and Main Cluster sharing Centralized Storage]
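One way a usage-based cost model could work with per-department clusters on shared storage is sketched below. It assumes shared storage cost is split by terabytes stored and compute is billed per node-hour; the function name, cluster names and rates are hypothetical and do not come from the talk.

```python
def allocate_costs(storage_cost, stored_tb, rate_per_node_hour, node_hours):
    """Return a per-cluster bill.

    storage_cost: total cost of the shared store for the period
    stored_tb: {cluster: TB it owns in the shared store}
    rate_per_node_hour: assumed flat compute rate
    node_hours: {cluster: node-hours consumed in the period}
    """
    total_tb = sum(stored_tb.values())
    bill = {}
    for cluster in sorted(set(stored_tb) | set(node_hours)):
        share = stored_tb.get(cluster, 0) / total_tb if total_tb else 0.0
        storage_part = storage_cost * share
        compute_part = rate_per_node_hour * node_hours.get(cluster, 0)
        bill[cluster] = round(storage_part + compute_part, 2)
    return bill


if __name__ == "__main__":
    print(allocate_costs(
        storage_cost=1000.0,
        stored_tb={"marketing": 10, "personalization": 30, "main": 60},
        rate_per_node_hour=0.5,
        node_hours={"marketing": 100, "personalization": 200, "main": 1000},
    ))
```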
![Page 12: Building A Self Service Analytics Platform on Hadoop](https://reader031.vdocument.in/reader031/viewer/2022030318/5a64965e7f8b9a63568b4c45/html5/thumbnails/12.jpg)
Architecture
![Page 13: Building A Self Service Analytics Platform on Hadoop](https://reader031.vdocument.in/reader031/viewer/2022030318/5a64965e7f8b9a63568b4c45/html5/thumbnails/13.jpg)
Architecture – Data Ingestion Layer
• DB Ingestor
• Stream Ingestor
• Kafka and Spark Streaming
• File Ingestor
• FTP / SFTP / Logs
• Ingestion using Service API
![Page 14: Building A Self Service Analytics Platform on Hadoop](https://reader031.vdocument.in/reader031/viewer/2022030318/5a64965e7f8b9a63568b4c45/html5/thumbnails/14.jpg)
Architecture – Data Processing Layer
• Storage layer carved into logical buckets
• Landing, Raw, Derived and Delivery
• Schema stored with data (no guesswork)
• Platform Jobs
• Converting text to Parquet
• Saving streaming data as Parquet
• Derivatives
• Compaction
• Standardization
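The four logical buckets named above could map to a path convention like the one sketched here. The `s3a://` prefix layout, the bucket name and the `zone_path` helper are assumptions for illustration; the talk does not show its actual layout.

```python
# Zone names taken from the slide; everything else is hypothetical.
ZONES = ("landing", "raw", "derived", "delivery")


def zone_path(bucket, zone, dataset):
    """Build the storage prefix for a dataset in one logical zone."""
    if zone not in ZONES:
        raise ValueError(f"unknown zone: {zone!r}")
    return f"s3a://{bucket}/{zone}/{dataset}/"


def all_zone_paths(bucket, dataset):
    """Return the prefix a dataset would have in every zone."""
    return {zone: zone_path(bucket, zone, dataset) for zone in ZONES}


if __name__ == "__main__":
    for zone, path in all_zone_paths("example-bucket", "orders").items():
        print(zone, path)
```

Keeping the zone in the path lets platform jobs (text-to-Parquet conversion, compaction, standardization) find their inputs and outputs by convention rather than by per-job configuration.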
![Page 15: Building A Self Service Analytics Platform on Hadoop](https://reader031.vdocument.in/reader031/viewer/2022030318/5a64965e7f8b9a63568b4c45/html5/thumbnails/15.jpg)
Architecture – Data Delivery Layer
• Data Delivery
• SQL - Spark Thrift Server / Impala
• Tableau, SQL IDE, Applications
• Self Service
• Derivatives
• Represented via SQL on Delivery Layer
• Stored in Derived Storage Layer
• Metadata driven
• Derived Layer Generators
• Long running Spark Job
• Derivative Refresh
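A metadata-driven derivative, as described above, might be stored as a small record and turned into SQL by a generator. The sketch below is one possible shape: the metadata keys (`source`, `columns`, `filter`, `group_by`) and the `derivative_select` function are hypothetical, and the input is trusted rather than sanitized.

```python
def derivative_select(meta):
    """Render a SELECT statement from a derivative's metadata record.

    Assumes the metadata comes from a trusted catalog; no escaping is done.
    """
    columns = ", ".join(meta["columns"])
    sql = f"SELECT {columns} FROM {meta['source']}"
    if meta.get("filter"):
        sql += f" WHERE {meta['filter']}"
    if meta.get("group_by"):
        sql += " GROUP BY " + ", ".join(meta["group_by"])
    return sql


if __name__ == "__main__":
    daily_orders = {
        "name": "daily_orders",  # hypothetical derivative name
        "source": "raw.orders",  # hypothetical source table
        "columns": ["order_date", "COUNT(*) AS orders"],
        "filter": "status = 'shipped'",
        "group_by": ["order_date"],
    }
    print(derivative_select(daily_orders))
```

A long-running generator job could run the rendered query on a schedule and write the result to the derived storage layer, which is what a derivative refresh would amount to under these assumptions.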
![Page 16: Building A Self Service Analytics Platform on Hadoop](https://reader031.vdocument.in/reader031/viewer/2022030318/5a64965e7f8b9a63568b4c45/html5/thumbnails/16.jpg)
Key Takeaways - Cloud
• Hadoop Cloud readiness
• Cloudera Director Limitations
• Multi-Availability zone, regions
• Storage
• Instance Storage
• EBS Volumes
• gp2 vs st1
• S3 Eventual Consistency
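One common way to cope with an eventually consistent listing is to poll with backoff until a freshly written key becomes visible. The sketch below assumes a caller-supplied `list_keys` function standing in for a real object-store listing; it is not an S3 client call, and `wait_until_visible` is a hypothetical helper.

```python
import time


def wait_until_visible(list_keys, key, attempts=5, base_delay=0.1, sleep=time.sleep):
    """Poll list_keys() until key appears, doubling the delay each time.

    Returns True once the key is listed, False if attempts run out.
    """
    for attempt in range(attempts):
        if key in list_keys():
            return True
        if attempt < attempts - 1:
            sleep(base_delay * (2 ** attempt))
    return False


if __name__ == "__main__":
    # Fake store whose listing only shows the key on the third call.
    calls = {"n": 0}

    def lagging_listing():
        calls["n"] += 1
        return {"raw/orders/part-0.parquet"} if calls["n"] >= 3 else set()

    print(wait_until_visible(lagging_listing, "raw/orders/part-0.parquet"))
```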
![Page 17: Building A Self Service Analytics Platform on Hadoop](https://reader031.vdocument.in/reader031/viewer/2022030318/5a64965e7f8b9a63568b4c45/html5/thumbnails/17.jpg)
Key Takeaways - Spark Thrift Server
• Spark Thrift Server Support
• Performance Tuning
• Concurrency
• Partition strategy
• Cache Tables
• Compression Codec for Parquet
• Snappy vs gzip
![Page 18: Building A Self Service Analytics Platform on Hadoop](https://reader031.vdocument.in/reader031/viewer/2022030318/5a64965e7f8b9a63568b4c45/html5/thumbnails/18.jpg)
Key Takeaways - Security
• Secure by Design, Secure by Default
• Access to Data on S3
• IAM Roles
• Sentry
• Support for Spark
• Kerberos
• Spark Thrift Server
• Navigator
• Support for Spark
![Page 19: Building A Self Service Analytics Platform on Hadoop](https://reader031.vdocument.in/reader031/viewer/2022030318/5a64965e7f8b9a63568b4c45/html5/thumbnails/19.jpg)
Key Takeaways - General
• Rapidly Changing Technology
• Feature addition
• Documentation
• Bugs
• Jar hell
• Small files
• Performance Issues
• Compaction
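Compaction for the small-files problem can be planned before any data is rewritten. The sketch below groups small files into batches up to a target size using a greedy first-fit pass over files sorted largest first; this is one simple heuristic, not necessarily the approach used on the platform, and the names and sizes are hypothetical.

```python
def plan_compaction(files, target_bytes):
    """Group small files into batches whose total size stays within target_bytes.

    files: iterable of (path, size_in_bytes)
    Returns a list of path lists; each list would be rewritten as one file.
    Files at or above the target, and files left alone in a batch, are skipped.
    """
    small = sorted(
        (f for f in files if f[1] < target_bytes),
        key=lambda f: f[1],
        reverse=True,
    )
    batches = []  # each entry is [running_total, [paths]]
    for path, size in small:
        for batch in batches:
            if batch[0] + size <= target_bytes:
                batch[0] += size
                batch[1].append(path)
                break
        else:
            batches.append([size, [path]])
    return [paths for _, paths in batches if len(paths) > 1]


if __name__ == "__main__":
    listing = [("a", 60), ("b", 50), ("c", 40), ("d", 30), ("e", 200)]
    print(plan_compaction(listing, target_bytes=100))
```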
![Page 20: Building A Self Service Analytics Platform on Hadoop](https://reader031.vdocument.in/reader031/viewer/2022030318/5a64965e7f8b9a63568b4c45/html5/thumbnails/20.jpg)
Key Takeaways - General
• Partition Strategy
• Parquet Files
• Balancing parallelism and throughput
• Table Partitions
• Cluster sizing, optimization and tuning
• Integrating with Corporate infrastructure
• Deployment practices
• Monitoring and Alerting
• Information Security Policies
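For the partition strategy item above, a date-based layout with a target file size is a common starting point. The sketch below assumes Hive-style `key=value` directories and a hypothetical 512 MiB target file size; neither figure comes from the talk, and the right values depend on the workload.

```python
import math
from datetime import date


def partition_path(base, day):
    """Hive-style partition directory for one day of data."""
    return f"{base}/year={day.year}/month={day.month:02d}/day={day.day:02d}"


def files_per_partition(partition_bytes, target_file_bytes):
    """How many output files to write so each lands near the target size.

    More files give more read parallelism; fewer, larger files give
    better scan throughput and less metadata overhead.
    """
    return max(1, math.ceil(partition_bytes / target_file_bytes))


if __name__ == "__main__":
    print(partition_path("s3a://example-bucket/raw/events", date(2017, 3, 5)))
    print(files_per_partition(3 * 1024**3, 512 * 1024**2))
```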
![Page 21: Building A Self Service Analytics Platform on Hadoop](https://reader031.vdocument.in/reader031/viewer/2022030318/5a64965e7f8b9a63568b4c45/html5/thumbnails/21.jpg)
Data Security
![Page 22: Building A Self Service Analytics Platform on Hadoop](https://reader031.vdocument.in/reader031/viewer/2022030318/5a64965e7f8b9a63568b4c45/html5/thumbnails/22.jpg)
Questions
• Principal @ Clairvoyant
• Email: [email protected]
• LinkedIn: https://www.linkedin.com/in/avinashramineni