architecting big data solutions · 2018-07-27 · key trends device explosion ubiquitous connection...

Architecting Big Data Solutions Michael Carducci @MichaelCarducci

Post on 12-Mar-2020

TRANSCRIPT

Page 1: Architecting Big Data Solutions · 2018-07-27 · Key Trends Device Explosion Ubiquitous Connection Social Networks Sensor Networks Cheap Storage Inexpensive Computing 23.14 Billion

Architecting Big Data Solutions Michael Carducci @MichaelCarducci

Limitations

No Prescription

Focused on “mainstream” tools

This talk is very high-level

Ecosystem is Young

Data Science is Out of Scope

Agenda

• What is Big Data

• Big Data Design Considerations

• Big Data Solution Architecture

• Big Data Ecosystem

• Architecture Template

• Data Acquisition

• Persistence

• Transform/Analytical

• Presentation

• Sample Architectures

• Closing Thoughts

What is Big Data

“90% of the data in the world today has been created in the last two years alone, at 2.5 quintillion bytes of data a day!”

– IBM Marketing Cloud, 2017

• Large quantity of data

• Many sources/formats

• Unstructured

• Large Processing Load (needs to be distributed)

• Streaming/real-time processing

• Big Deployment footprint

Key Trends

• Device Explosion: 23.14 billion connected devices

• Ubiquitous Connection: 3.3 ZB by 2021 (278 EB/month)

• Social Networks: 2.77 billion social media users by 2019

• Sensor Networks: millions of new sensors go online every hour

• Cheap Storage: $0.019 average cost of 1 GB in 2016 (15.5 million times cheaper)

• Inexpensive Computing: $0.03 cost/GFLOPS in 2018

Big Data Big Opportunities

A New Way of Thinking: What, not Why

Big Data - Challenges

Volume, Variety, Velocity

Designing Big Data Solutions

Considerations

Availability

Consistency

Access Patterns

Real-time vs Historical

How Big?

Data Security

Big Data Solution Longevity

• We are still on the bleeding-edge

• Tools are still young/evolving

• Don’t “predict” the future

Future Proofing Your Education and Effort

• Don’t follow “trends”; follow value

• Look for high-level support

• Look for product/developer support

• Look for cloud options

• Look for adoption by leading companies

• Look for open APIs and Data Formats

Design Approach:

Start with the end in mind

Leverage Everything Available

Hadoop Ecosystem

Architecture Template

Data Acquisition

Persistence

Transformation/Analytics

Presentation

Data Acquisition

Data Acquisition Responsibilities

• Connect to data sources

• Data Conversion

• Filtering

• Protocol Management (handshake)

• Track Data as it moves

• Save Data to Persistence Module

• Stream Data to Processing Module

Historical Data Acquisition (Store & Forward)

• Receive from a source “at rest”

• Move data in units

• Track Completion

• Retransmit if required
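
The store-and-forward steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the unit size, the SHA-256 checksum, the retry limit, and the `flaky_receiver` function are all hypothetical and stand in for whatever transport and acknowledgement scheme a real source and sink agree on.

```python
import hashlib
from typing import Callable, Dict, Iterator, Tuple


def split_into_units(data: bytes, unit_size: int) -> Iterator[Tuple[int, bytes]]:
    """Yield (unit_id, payload) pairs of at most unit_size bytes."""
    for unit_id, start in enumerate(range(0, len(data), unit_size)):
        yield unit_id, data[start:start + unit_size]


def store_and_forward(
    data: bytes,
    send: Callable[[int, bytes, str], bool],
    unit_size: int = 4,
    max_attempts: int = 3,
) -> Dict[int, int]:
    """Send every unit, retransmitting on failure; return attempts used per unit."""
    attempts: Dict[int, int] = {}
    for unit_id, payload in split_into_units(data, unit_size):
        checksum = hashlib.sha256(payload).hexdigest()
        for attempt in range(1, max_attempts + 1):
            attempts[unit_id] = attempt          # track completion per unit
            if send(unit_id, payload, checksum):  # True means acknowledged
                break
        else:
            raise RuntimeError(f"unit {unit_id} was never acknowledged")
    return attempts


# Hypothetical receiver that drops every third transmission.
received: Dict[int, bytes] = {}
_calls = {"count": 0}


def flaky_receiver(unit_id: int, payload: bytes, checksum: str) -> bool:
    _calls["count"] += 1
    if _calls["count"] % 3 == 0:
        return False                              # simulated network failure
    if hashlib.sha256(payload).hexdigest() != checksum:
        return False                              # corrupted unit, ask again
    received[unit_id] = payload
    return True


source = b"historical-data-at-rest"
attempts = store_and_forward(source, flaky_receiver)
reassembled = b"".join(received[i] for i in sorted(received))
```

Keying units by an id is what makes retransmission safe here: a unit that arrives twice simply overwrites itself on the receiving side.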

Real-Time Data Acquisition (Streaming)

• Continuously moving data stream

• Throttle at Source

• Throttle for Persistence

• Inflight Storage
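
One way to picture throttling and in-flight storage together is a bounded buffer between the source and the persistence writer. The sketch below is a toy built on Python's standard `queue` and `threading` modules, not a depiction of any particular streaming product; the buffer size and the `run_pipeline` name are assumptions made for illustration.

```python
import queue
import threading
from typing import Iterable, List


def run_pipeline(events: Iterable[int], buffer_size: int = 5) -> List[int]:
    """Move events from a source to a writer through a bounded in-flight buffer."""
    inflight: "queue.Queue[object]" = queue.Queue(maxsize=buffer_size)
    persisted: List[int] = []
    done = object()  # sentinel marking the end of the stream

    def source() -> None:
        for event in events:
            # put() blocks while the buffer is full, which throttles the source
            inflight.put(event)
        inflight.put(done)

    def persistence_writer() -> None:
        while True:
            event = inflight.get()
            if event is done:
                break
            persisted.append(event)  # stand-in for a write to the persistence layer

    workers = [
        threading.Thread(target=source),
        threading.Thread(target=persistence_writer),
    ]
    for worker in workers:
        worker.start()
    for worker in workers:
        worker.join()
    return persisted


persisted = run_pipeline(range(20), buffer_size=5)
```

With one producer and one consumer the FIFO buffer preserves arrival order, and the source can never run more than `buffer_size` events ahead of the writer.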

Data Acquisition - What to Architect

• Identification of new data

• Re-Acquisition and Retransmission

• Data Loss Tolerance

• Source Buffering

• Security in Transit

• Privacy

• Monitoring

• Throttling

• Redundancy

• Scalability

Data Acquisition - Best Practices

• Involve Source Owners

• Reliable Means for Identification of New Data

• Reliable Means for Identification of Missing Data

• APIs and Formats should be standardized as early as possible

• Consider Separate Channels for Real-Time vs Historical Data

• Pay Attention to Security and Privacy

• Use Reliable Inflight Storage

Data Acquisition - Popular Technologies

• Sqoop

• Flume

• Kafka/Storm

• Spark

• NiFi

Sqoop

• Command-line tool for transferring data between relational data sources and Hadoop

• Transfers an entire database or the results of a SQL query

• Supports incremental data exchange

• Supports common file formats (Avro, SequenceFile, Parquet, CSV, etc.)

• Hive and HBase support

• Supports parallelism

• BLOB support
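
The idea behind incremental data exchange can be sketched without Sqoop itself: remember the highest value of a monotonically increasing column and fetch only rows beyond it on the next run. The example below uses Python's built-in `sqlite3` as a stand-in relational source; the `orders` table, its columns, and the `fetch_increment` helper are hypothetical and not part of Sqoop.

```python
import sqlite3
from typing import List, Tuple


def fetch_increment(
    conn: sqlite3.Connection, last_value: int
) -> Tuple[List[Tuple[int, str]], int]:
    """Return rows with id greater than last_value, plus the new watermark."""
    rows = conn.execute(
        "SELECT id, payload FROM orders WHERE id > ? ORDER BY id",
        (last_value,),
    ).fetchall()
    new_last = rows[-1][0] if rows else last_value
    return rows, new_last


# Hypothetical source table standing in for an operational RDBMS.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, payload TEXT)")
conn.executemany(
    "INSERT INTO orders (id, payload) VALUES (?, ?)",
    [(1, "a"), (2, "b"), (3, "c")],
)

first_batch, watermark = fetch_increment(conn, 0)           # initial full load
conn.execute("INSERT INTO orders (id, payload) VALUES (4, 'd')")
second_batch, watermark = fetch_increment(conn, watermark)  # only the new row
third_batch, watermark = fetch_increment(conn, watermark)   # nothing new
```

A watermark on an append-only key only catches inserts; rows that are updated in place would need a last-modified timestamp column or a similar check instead.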

Flume

• A distributed service for collecting, aggregating, and moving large amounts of log/file data

• Robust/Fault tolerant (throttling, failover, recovery etc)

• Sources can span a large number of servers

• Supports multiple sources (files, strings, HTTP posts, etc.)

• Support for multiple sinks

• Can add custom sources and sinks through code

• Supports in-flight data processing through interceptors

Flume - Benefits and Challenges

• Pros

• Configuration Driven

• Massively Scalable

• Customizable through code

• Challenges

• No Ordering Guarantees

• Duplication possible
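
When a pipeline can reorder or duplicate events, a common mitigation is to make the sink tolerant of both. The sketch below assumes every event carries a unique `id` and a `timestamp` assigned at the source; both field names and the `deduplicate_and_order` helper are hypothetical, and nothing here is Flume API.

```python
from typing import Dict, Iterable, List


def deduplicate_and_order(events: Iterable[dict]) -> List[dict]:
    """Keep the first copy of each event id, then sort by source timestamp."""
    first_seen: Dict[str, dict] = {}
    for event in events:
        first_seen.setdefault(event["id"], event)  # later duplicates are ignored
    return sorted(first_seen.values(), key=lambda e: e["timestamp"])


# Events as a sink might receive them: out of order, with one duplicate.
delivered = [
    {"id": "e2", "timestamp": 2, "line": "GET /cart"},
    {"id": "e1", "timestamp": 1, "line": "GET /"},
    {"id": "e2", "timestamp": 2, "line": "GET /cart"},
    {"id": "e3", "timestamp": 3, "line": "POST /checkout"},
]
cleaned = deduplicate_and_order(delivered)
```

This batch-style cleanup only works over a bounded window of events; a continuously running sink would need to expire old ids rather than remember them forever.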

Kafka

• An open source message broker platform for real-time data feeds

• Pub/sub architecture

• Developed at LinkedIn, written in Scala

• Messages are published to topics; each topic can have multiple subscribers

• Ordering guarantees (within a partition)

• Publishers and subscribers must be coded against Kafka (not config driven)

• Replication and high availability

• Often paired with Storm for stream processing
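
The pub/sub model described above can be pictured as an append-only log per topic, with each subscriber tracking its own read position. The class below is a deliberately tiny in-memory toy and does not use the Kafka client API; `ToyBroker`, `publish`, and `poll` are names invented for this sketch, and partitions, replication, and persistence are left out entirely.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


class ToyBroker:
    """In-memory stand-in for a log-based message broker."""

    def __init__(self) -> None:
        self._logs: Dict[str, List[str]] = defaultdict(list)
        self._offsets: Dict[Tuple[str, str], int] = defaultdict(int)

    def publish(self, topic: str, message: str) -> int:
        """Append a message to the topic log and return its offset."""
        self._logs[topic].append(message)
        return len(self._logs[topic]) - 1

    def poll(self, subscriber: str, topic: str) -> List[str]:
        """Return messages this subscriber has not yet seen, in log order."""
        key = (subscriber, topic)
        start = self._offsets[key]
        messages = self._logs[topic][start:]
        self._offsets[key] = start + len(messages)
        return messages


broker = ToyBroker()
for click in ["home", "search", "cart"]:
    broker.publish("clicks", click)

analytics_first = broker.poll("analytics", "clicks")   # sees all three
broker.publish("clicks", "checkout")
analytics_second = broker.poll("analytics", "clicks")  # sees only the new one
archiver_all = broker.poll("archiver", "clicks")       # independent position
```

Because each subscriber keeps its own offset, a slow or newly added consumer can read the same topic without affecting the others.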

Kafka - Benefits and Challenges

• Pros

• Highly scalable realtime messaging system

• Multiple subscribers

• Ordering guarantees

• Challenges

• Coding is required for publishers and subscribers

Spark

“A fast and general engine for large-scale data processing”

Spark Components

• Spark Streaming

• Spark SQL

• MLlib

• GraphX

Spark - Benefits and Challenges

• Pros

• Rich Ecosystem

• Very Fast

• Supports Streaming

• Supports real-time Machine Learning model updates

• Challenges

• In-memory processing can be expensive

• Not “true” real-time

• No file-management system

NiFi

“A real-time integrated data logistics and simple event processing platform. Apache NiFi automates the movement of data between disparate data sources and systems, making data ingestion fast, easy and secure.”

Data Acquisition - Common Sources

• RDBMS

• Logs

• Files

• Social Media

• HTTP/REST

• Data Streams

RDBMS - Benefits and Challenges

• Pros

• Mature/Popular

• Extensive Support

• Incremental Fetches and Filtering

• Simple transformations at Source

• Challenges

• Proprietary data platforms

• Cross-organizational boundary access limited by security

Typically Integrated with Apache Sqoop

Files - Benefits and Challenges

• Pros

• Easy way to integrate with legacy apps

• Files easily cross organizational boundaries

• Many tools and utilities available

• Challenges

• Slow

• Data Exposure

• Potential Data/Privacy Leaks

Typically integrated with Flume

REST - Benefits and Challenges

• Pros

• de-facto standard for data exchange over the internet

• Excellent security and scalability

• Easy to integrate

• Challenges

• Not designed for real-time data

• Stateless APIs can be “chatty”

• Often data sources impose rate limiting

Typically integrated with Apache NiFi

Streaming - Benefits and Challenges

• Pros

• Real-time, instantaneous data transfer

• Data streams are typically deltas

• Supported by major cloud APIs

• Challenges

• Data is typically lost if a connection is broken

• Might need to be supplemented by historical data

• Some providers might implement rate-limiting

Typically integrated with Kafka/Storm or Spark Streaming

Persistence

Persistence Responsibilities

• Reliable Data Storage

• Data Access

• Scalability

• Redundancy

Persistence - What to Architect

• Polyglot Storage & flexible schema

• Consistency

• Reliability

• Read-intensive vs Write Intensive

• Mutable vs immutable data

• Cataloging

• Latency

Persistence - Best Practices

• Choose platform that best supports your use case

• Keep design/schema flexible

• Keep data at lowest granularity

• Summarize only where needed

• Consider real-time application needs

• Redundancy > Backups

Persistence

• HDFS

• HBase

• Cassandra

• S3

• Azure

• RDBMS

HDFS

HBase

Cassandra

S3

Azure

RDBMS

Transform/Analytics

Transformation Responsibilities

• Build/update Machine Learning Models

• Historical processing

• Analytical querying

• Transform data into intermediate representations

• Transform data for presentation representation

Historical Data Processing

• Pull data from persistence layer

• Aggregate, summarize, and analyze data

• Save to Persistence or Presentation Layer
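
The pull, aggregate, and save steps above follow the same shape as a map/reduce job. The sketch below runs that shape in plain Python on a handful of in-memory records; the `sensor` and `reading` fields and the three phase functions are assumptions for illustration, and a real job would distribute each phase across a cluster.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(records: Iterable[dict]) -> Iterator[Tuple[str, float]]:
    """Emit one (key, value) pair per raw record."""
    for record in records:
        yield record["sensor"], record["reading"]


def shuffle(pairs: Iterable[Tuple[str, float]]) -> Dict[str, List[float]]:
    """Group values by key, as the framework would between map and reduce."""
    grouped: Dict[str, List[float]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped: Dict[str, List[float]]) -> Dict[str, dict]:
    """Summarize each key's values into a count and a mean."""
    return {
        key: {"count": len(values), "mean": sum(values) / len(values)}
        for key, values in grouped.items()
    }


# Records as they might be pulled from the persistence layer.
persisted = [
    {"sensor": "a", "reading": 10.0},
    {"sensor": "b", "reading": 4.0},
    {"sensor": "a", "reading": 20.0},
    {"sensor": "b", "reading": 6.0},
    {"sensor": "b", "reading": 8.0},
]
summary = reduce_phase(shuffle(map_phase(persisted)))
```

The resulting `summary` is the kind of compact, pre-aggregated result that would be written back to persistence or handed to the presentation layer.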

Real-Time Transformation/Analytics (Streaming)

• Update ML Models in real time

• Update stream analytics
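
Updating analytics in real time usually means folding each new event into a running summary instead of recomputing over history. A minimal sketch, assuming a single numeric stream and using Welford's online algorithm for mean and variance; the `RunningStats` class is illustrative and not tied to any streaming framework.

```python
class RunningStats:
    """Running count, mean, and population variance over a stream of numbers."""

    def __init__(self) -> None:
        self.count = 0
        self.mean = 0.0
        self._m2 = 0.0  # sum of squared deviations from the current mean

    def update(self, value: float) -> None:
        """Fold one new observation into the summary in constant time."""
        self.count += 1
        delta = value - self.mean
        self.mean += delta / self.count
        self._m2 += delta * (value - self.mean)

    @property
    def variance(self) -> float:
        return self._m2 / self.count if self.count else 0.0


stats = RunningStats()
for reading in [2, 4, 4, 4, 5, 5, 7, 9]:
    stats.update(reading)  # one call per arriving event
```

Only three numbers are kept per stream, so the same pattern scales to many keys without storing the raw events in the analytics layer.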

Transformation - What to Architect

• Data Access latency

• Support for input volume

• Architect for future consumers of analytical data

• Architect for anticipated access patterns

• Real-time vs batch processing

Transformation - Best Practices

• “What” is a question for the data scientists

• Use the best tools for the job/TIMTOWTDI

Transformation - Common Tools

• MapReduce

• Mahout

• Spark

• Pig

• Hive/Impala

MapReduce

Mahout

Spark

Introducing Apache Spark

“A fast and general engine for large-scale data processing”

Spark Components

• Spark Streaming

• Spark SQL

• MLlib

• GraphX

Pig

• Abstracts Map/Reduce Complexity

• Uses the Pig Latin scripting language

• Highly extensible with UDFs

• Can perform faster than MapReduce

Oozie

Presentation

Presentation Responsibilities

• Provide analytical results in a consumable/digestible form

• Act as a reporting data source

• Bridge OLAP/OLTP Environments

• Low-latency repository

Presentation - What to Architect

• Performance

• Data Access

• Security

• Privacy

• API

• Data Refresh Frequency

• Data Lifecycle

• Availability

• Consistency

• Load

Presentation - Best Practices

• Involve Consumers

• Optimize Refresh Rate for Requirements

• Choose persistence mechanism based on access patterns and needs

• Real-time vs Digest/Summary
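
The real-time versus digest trade-off can be illustrated with a small roll-up. The sketch below assumes events arrive as (timestamp, metric name, value) tuples and pre-aggregates them into hourly buckets, so the presentation store serves a handful of summary rows instead of every raw event. The `Event` shape and the `hourly_digest` name are hypothetical, chosen only for this example.

```python
from collections import defaultdict
from datetime import datetime
from typing import Dict, Iterable, List, Tuple

# Assumed event shape for this sketch: (timestamp, metric name, value)
Event = Tuple[datetime, str, float]


def hourly_digest(events: Iterable[Event]) -> Dict[Tuple[datetime, str], Dict[str, float]]:
    """Roll raw events up into one summary row per (hour, metric)."""
    totals: Dict[Tuple[datetime, str], List[float]] = defaultdict(lambda: [0, 0.0])
    for ts, metric, value in events:
        # Truncate the timestamp to the start of its hour.
        bucket = ts.replace(minute=0, second=0, microsecond=0)
        entry = totals[(bucket, metric)]
        entry[0] += 1
        entry[1] += value
    return {
        key: {"count": count, "sum": total, "avg": total / count}
        for key, (count, total) in totals.items()
    }


if __name__ == "__main__":
    sample = [
        (datetime(2018, 7, 27, 10, 5), "latency_ms", 100.0),
        (datetime(2018, 7, 27, 10, 40), "latency_ms", 300.0),
        (datetime(2018, 7, 27, 11, 1), "latency_ms", 50.0),
    ]
    for key, row in sorted(hourly_digest(sample).items()):
        print(key, row)
```

How often such a digest is rebuilt is exactly the refresh-rate decision above: a dashboard that is read once a day rarely needs more than a nightly roll-up.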

Presentation - Common Tools

• HBase

• RDBMS

• MongoDB

• Cassandra

• Redis

HBase

• NoSQL DB built on HDFS

• Based on BigTable

• No query language; data is accessed through an API

• Auto Sharding

HBase Data Model

• Fast access to any given row

• A row is referenced by a unique key

• A row has a small number of column families

• A column family may contain arbitrary columns

• You can have a very large number of columns in a column family

• Each cell can hold many versions, each with its own timestamp

• Sparse data is ok. Missing columns in a row consume no storage
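
The data model above can be mimicked with nested Python containers. This is a toy in-memory sketch, not the HBase client API: the `SparseTable` class and its `put`/`get` methods are invented here purely to show row keys, fixed column families, arbitrary column qualifiers, timestamped cell versions, and the fact that absent columns take no space.

```python
import time
from collections import defaultdict
from typing import Any, Dict, Iterable, List, Optional, Tuple

Version = Tuple[float, Any]  # (timestamp, value)


class SparseTable:
    """Toy model of a column-family table (hypothetical, not HBase's API)."""

    def __init__(self, families: Iterable[str]) -> None:
        # Column families are fixed up front and expected to be few.
        self.families = frozenset(families)
        # row key -> (family, qualifier) -> versions, newest first
        self._rows: Dict[str, Dict[Tuple[str, str], List[Version]]] = defaultdict(dict)

    def put(self, row: str, family: str, qualifier: str, value: Any,
            ts: Optional[float] = None) -> None:
        if family not in self.families:
            raise KeyError(f"unknown column family: {family}")
        ts = time.time() if ts is None else ts
        # Qualifiers are arbitrary; a cell exists only once it is written.
        cell = self._rows[row].setdefault((family, qualifier), [])
        cell.append((ts, value))
        cell.sort(key=lambda version: version[0], reverse=True)

    def get(self, row: str, family: str, qualifier: str,
            versions: int = 1) -> List[Version]:
        """Return up to `versions` newest versions; [] if the cell is absent."""
        cell = self._rows.get(row, {}).get((family, qualifier), [])
        return cell[:versions]


if __name__ == "__main__":
    table = SparseTable(["contents", "anchor"])
    table.put("com.cnn.www", "contents", "html", "<html>v1</html>", ts=1)
    table.put("com.cnn.www", "contents", "html", "<html>v2</html>", ts=2)
    table.put("com.cnn.www", "anchor", "cnnsi.com", "CNN", ts=1)
    print(table.get("com.cnn.www", "contents", "html", versions=2))
```

Looking up a column that was never written simply returns an empty list, which is the sparse-storage property in miniature.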

Example: Web Table

Row key: com.cnn.www

Contents Column Family

• contents: three timestamped versions of the page, each beginning <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN"

Anchor Column Family

• anchor:cnnsi.com: “CNN”

• anchor:my.look.ca: “CNN.com”

Accessing HBase

• HBase shell

• API

• Wrappers in many languages

• Spark, Hive, Pig

• REST Service

• Thrift Service

• Avro Service

RDBMS

• Standard, mature database technology

• Low Latency

• Good support for additional querying/filtering/transformation

• Easy to integrate

MongoDB

• NoSQL document database

• Very Fast

• Very Scalable

• Flexible Schema

• Supports Multiple Indexes

Cassandra

• NoSQL

• High Availability

• Write Scalability

• CQL Query Language

MongoDB vs Cassandra

Redis

• Not just for caching

• Very Fast

• Distributed

• Persistent

Sample Architectures

Closing Thoughts

Thank you

Michael Carducci @MichaelCarducci