Big Data Analytics - SEI Digital Library

BIG DATA ANALYTICS REFERENCE ARCHITECTURES AND CASE STUDIES

Upload: haminh

Post on 31-Jan-2018

Relational vs. Non-Relational Architecture

Relational:
• Rational
• Predictable
• Traditional

Non-Relational:
• Agile
• Flexible
• Modern

Agenda

• Big Data Challenges
• Big Data Reference Architectures
• Case Studies
• Tips for Designing Big Data Solutions

Big Data Challenges

[Figure: data source types plotted by structure (structured to unstructured) and level (low, medium, high); dimensions shown: Velocity, Variety, Volume, Complexity]

• Data Storages: RDBMS, NoSQL, Hadoop, file systems, etc.
• Machine Log Data: application logs, event logs, server data, CDRs, clickstream data, etc.
• Sensor Data: smart electric meters, medical devices, car sensors, road cameras, etc.
• Archives: scanned documents, statements, medical records, e-mails, etc.
• Docs: XLS, PDF, CSV, HTML, JSON, etc.
• Business Apps: CRM, ERP systems, HR, project management, etc.
• Social Networks: Twitter, Facebook, Google+, LinkedIn, etc.
• Public Web: Wikipedia, news, weather, public finance, etc.
• Media: images, video, audio, etc.

Big Data Analytics

Traditional Analytics (BI) vs. Big Data Analytics

Focus on
  Traditional Analytics (BI): descriptive analytics, diagnostic analytics
  Big Data Analytics: predictive analytics, data science

Data Sets
  Traditional Analytics (BI): limited data sets, cleansed data, simple models
  Big Data Analytics: large-scale data sets, more types of data, raw data, complex data models

Supports
  Traditional Analytics (BI): causation (what happened, and why?)
  Big Data Analytics: correlation (new insight, more accurate answers)

Big Data Analytics Use Cases

• Data Discovery: Data Scientists/Analysts; drivers: Volume, Performance
• Business Reporting: Business Users; drivers: Data Quality, Self Service
• Real Time Intelligence: Intelligent Agents, Consumers; drivers: Low Latency, Reliability

Big Data Analytics Reference Architectures

Architecture Drivers:
▪ Volume
▪ Sources
▪ Throughput
▪ Latency
▪ Extensibility
▪ Data Quality
▪ Reliability
▪ Security
▪ Self-Service
▪ Cost

Reference Architectures:
▪ Extended Relational
▪ Non-Relational
▪ Hybrid

Relational Reference Architecture

Layers: Data Sources, Integration, Data Storages, Analytics, Presentation
Data Sources: Structured, Semi-Structured, Unstructured
Integration: ETL, Messaging, API/ODBC, Replication
Data Storages: Operational Data Stores, Data Marts, Data Warehouses
Analytics: Query & Reporting, OLAP Cubes, Advanced Analytics
Presentation: Web Browsers, Native Desktop, Mobile Devices, Web Services

Extended Relational Reference Architecture

Layers: Data Sources, Integration, Data Storages, Analytics, Presentation
Data Sources: Structured, Semi-Structured, Unstructured
Integration: ETL, Messaging, API/ODBC, Replication
Data Storages: Operational Data Stores, Data Marts, Data Warehouses
Analytics: Query & Reporting, OLAP Cubes, Advanced Analytics
Presentation: Web Browsers, Native Desktop, Mobile Devices, Web Services
Legend: Key components affected with Big Data challenges

Non-Relational Reference Architecture

Layers: Data Sources, Integration, Data Storages, Analytics, Presentation
Data Sources: Structured, Semi-Structured, Unstructured
Integration: ETL, Messaging, API
Data Storages / Analytics: NoSQL Databases, Distributed File Systems, Search Engines, Query & Reporting, Map Reduce, Advanced Analytics
Presentation: Web Browsers, Native Desktop, Mobile Devices, Web Services
Legend: Key components introduced with non-relational movement

Extended Relational vs. Non-Relational Architecture

Architecture Drivers (Extended Relational vs. Non-Relational):
• Large data volume
• Self-service (ad-hoc reporting)
• Unstructured data processing
• High data model extensibility
• High data quality and consistency
• Extensive security
• Reliability and fault-tolerance
• Low latency (near-real time)
• Low cost
• Skills availability

Relational vs. Non-Relational Architecture

Relational:
• Rational
• Predictable
• Traditional

Non-Relational:
• Agile
• Flexible
• Modern

Big Data Analytics Use Cases

• Data Discovery: Data Scientists; drivers: Performance, Volume
• Business Reporting: Business Users
• Real Time Intelligence: Intelligent Agents, Consumers

Data Discovery: Non-Relational Architecture

Layers: Data Sources, Integration, Data Storages, Analytics, Presentation
Data Sources: Structured, Semi-Structured, Unstructured
Integration: ETL, Messaging, API
Data Storages / Analytics: NoSQL Databases, Distributed File Systems, Search Engines, Query & Reporting, Map Reduce, Advanced Analytics
Presentation: Web Browsers, Native Desktop, Mobile Devices, Web Services

Big Data Analytics Use Cases

• Data Discovery: Data Scientists
• Business Reporting: Business Users; drivers: Data Quality, Self Service
• Real Time Intelligence: Intelligent Agents, Consumers

Business Reporting: Hybrid Architecture

Layers: Data Sources, Integration, Data Storages, Analytics, Presentation
Data Sources: Structured, Semi-Structured, Unstructured
Integration: ETL, Messaging, API
Data Storages / Analytics: Relational DWH/DM, Distributed File Systems, SQL Query & Reporting, Map Reduce, Advanced Analytics, Search Engines
Presentation: Web Browsers, Native Desktop, Mobile Devices, Web Services
Legend: Extended Relational components, Non-relational components

Big Data Analytics Use Cases

• Data Discovery: Data Scientists
• Business Reporting: Business Users
• Real Time Intelligence: Intelligent Agents, Consumers; drivers: Low Latency, Reliability

Lambda Architecture

Source:
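
The slide gives only the name of the architecture. As it is commonly described, a Lambda Architecture keeps an append-only master dataset, a batch layer that periodically recomputes views from it, a speed layer that covers events the last batch run has not yet seen, and a serving step that merges both. A minimal single-process sketch of that idea follows; the class, its methods, and the choice of a simple event counter are hypothetical and only illustrate the data flow, not any particular product.

```python
from collections import Counter


class LambdaCounter:
    """Toy event counter with a batch view and a speed view (illustrative only)."""

    def __init__(self):
        self.master = []             # append-only master dataset
        self.batch_view = Counter()  # recomputed from scratch by the batch layer
        self.speed_view = Counter()  # incremental counts since the last batch run

    def ingest(self, key):
        # Every event goes to the master dataset and to the speed layer.
        self.master.append(key)
        self.speed_view[key] += 1

    def run_batch(self):
        # Batch layer: rebuild the view from all raw data, then drop the
        # speed-layer state that the new batch view now covers.
        self.batch_view = Counter(self.master)
        self.speed_view = Counter()

    def query(self, key):
        # Serving step: merge the batch view with the real-time delta.
        return self.batch_view[key] + self.speed_view[key]
```

Queries stay correct between batch runs because the speed view holds exactly the events the batch view has not absorbed yet. A real deployment would also have to handle events that arrive while a batch job is running, which this sketch ignores.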

Case Study #1: Usage & Billing Analysis

Business Area: Cloud-based platform for building, deploying, hosting, and managing mobile applications

Business Goals:
• Provide a visual environment for building custom mobile applications
• Charge customers based on the platform they are using, the number of consumers’ applications, etc.

Architectural Decisions

Architecture Drivers:
▪ Volume (> 10 TB)
▪ Sources (Semi-structured - JSON)
▪ Throughput (> 10K/sec)
▪ Latency (2 min)
▪ Extensibility (Custom metrics)
▪ Data Quality (Consistency)
▪ Reliability (24/7)
▪ Security (Multitenancy)
▪ Self-Service (Ad-Hoc reports)
▪ Cost (The less the better)
▪ Constraints (Public Cloud)

Trade-off:
Driver          Extended Relational   Non-Relational
Extensibility   -                     +
Data Quality    +                     -
Self-Service    +                     -

Extended Relational Architecture, with extensibility via the Pre-allocated Fields pattern
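
The slides name the Pre-allocated Fields pattern without showing it. One common reading is that a fixed relational schema carries a few spare, generically named columns, and a mapping table records which tenant-specific custom metric each spare column holds. The sketch below illustrates that reading with Python's built-in sqlite3 module; the table names, column names, and the three-column limit are hypothetical and are not taken from the case study.

```python
import sqlite3

# Hypothetical spare columns reserved in the schema up front.
SPARE_COLUMNS = ["metric_01", "metric_02", "metric_03"]


def create_schema(conn):
    spare = ", ".join(f"{c} REAL" for c in SPARE_COLUMNS)
    conn.execute(f"CREATE TABLE usage_event (tenant_id TEXT, event_ts TEXT, {spare})")
    conn.execute(
        "CREATE TABLE metric_mapping ("
        "tenant_id TEXT, metric_name TEXT, column_name TEXT, "
        "PRIMARY KEY (tenant_id, metric_name))"
    )


def register_metric(conn, tenant_id, metric_name):
    """Bind a tenant's custom metric to that tenant's next free spare column."""
    used = {
        row[0]
        for row in conn.execute(
            "SELECT column_name FROM metric_mapping WHERE tenant_id = ?", (tenant_id,)
        )
    }
    free = [c for c in SPARE_COLUMNS if c not in used]
    if not free:
        raise RuntimeError("no pre-allocated columns left for this tenant")
    conn.execute(
        "INSERT INTO metric_mapping VALUES (?, ?, ?)", (tenant_id, metric_name, free[0])
    )
    return free[0]


def column_for(conn, tenant_id, metric_name):
    row = conn.execute(
        "SELECT column_name FROM metric_mapping WHERE tenant_id = ? AND metric_name = ?",
        (tenant_id, metric_name),
    ).fetchone()
    if row is None:
        raise KeyError(metric_name)
    return row[0]


def insert_event(conn, tenant_id, event_ts, metrics):
    """Store custom metrics in whichever spare columns the tenant has mapped."""
    columns = ["tenant_id", "event_ts"]
    columns += [column_for(conn, tenant_id, name) for name in metrics]
    values = [tenant_id, event_ts, *metrics.values()]
    placeholders = ", ".join("?" * len(columns))
    conn.execute(
        f"INSERT INTO usage_event ({', '.join(columns)}) VALUES ({placeholders})", values
    )


def total_metric(conn, tenant_id, metric_name):
    """Ad-hoc style query: resolve the metric name to its column, then aggregate."""
    column = column_for(conn, tenant_id, metric_name)
    return conn.execute(
        f"SELECT SUM({column}) FROM usage_event WHERE tenant_id = ?", (tenant_id,)
    ).fetchone()[0]
```

Adding a custom metric then needs only a new mapping row, not a schema change, which is how the pattern buys extensibility while the data stays in typed relational columns. The cost is a hard cap on custom metrics per tenant, set by the number of spare columns.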

Solution Architecture

Technologies:
• Amazon Redshift
• Amazon SQS
• Amazon S3
• Elastic Beanstalk
• Jaspersoft BI Professional
• Python

Case Study #2: Clickstream for retail website

Business Area: Retail. A platform for e-commerce and for collecting feedback from customers

Business Goals:
• Build an in-house Analytics Platform for ROI measurement and performance analysis of every product and feature delivered by the e-commerce platform
• Provide the ability to understand how end-users are interacting with service content, products, and features on sites
• Do clickstream analysis
• Perform A/B testing

Architectural Decisions

Architecture Drivers:
▪ Volume (45 TB)
▪ Sources (Semi-structured - JSON)
▪ Throughput (> 20K/sec)
▪ Latency (1 hour)
▪ Extensibility (Custom tags)
▪ Data Quality (Not critical)
▪ Reliability (24/7)
▪ Security (Multitenancy)
▪ Self-Service (Canned reports, Data science)
▪ Cost (The less the better)
▪ Constraints (Public Cloud)

Trade-off:
Driver               Extended Relational   Non-Relational
Volume/Scalability   +/-                   +
Throughput           +                     +
Self-Service         +                     +/-
Extensibility        -                     +

Non-Relational Architecture, with reporting via the Materialized View pattern
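
The trade-off table above can be read as a simple tally. The sketch below encodes the four rows exactly as shown and sums them per architecture. The numeric scores (+ as 1, +/- as 0, - as -1), the equal default weights, and the function name are assumptions made for illustration; the slides do not say how the marks were weighed.

```python
# Assumed numeric reading of the marks used in the trade-off table.
SCORE = {"+": 1, "+/-": 0, "-": -1}

# Rows of the Case Study #2 table: (Extended Relational, Non-Relational).
TRADE_OFF = {
    "Volume/Scalability": ("+/-", "+"),
    "Throughput": ("+", "+"),
    "Self-Service": ("+", "+/-"),
    "Extensibility": ("-", "+"),
}


def tally(trade_off, weights=None):
    """Sum the marks per architecture; drivers without a weight count once."""
    weights = weights or {}
    totals = {"Extended Relational": 0, "Non-Relational": 0}
    for driver, (relational, non_relational) in trade_off.items():
        weight = weights.get(driver, 1)
        totals["Extended Relational"] += weight * SCORE[relational]
        totals["Non-Relational"] += weight * SCORE[non_relational]
    return totals
```

With equal weights, `tally(TRADE_OFF)` gives 1 for Extended Relational and 3 for Non-Relational, which agrees with the architecture chosen on the slide. Passing weights, for example `{"Self-Service": 3}`, shows how a different priority among drivers could change the outcome.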

Solution Architecture

Technologies:
• Amazon S3
• Flume
• Hadoop/HDFS, MapReduce
• HBase
• Oozie
• Hive

[Diagram: cluster of nodes, Node 1 through Node N]
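
The Materialized View pattern chosen for this case study means that reports read a precomputed, query-shaped view instead of scanning raw events. The slides do not show how the view is built or stored, so the sketch below is a plain in-memory illustration: the event fields (`ts`, `page`), the hourly page-view aggregate, and both function names are hypothetical.

```python
from collections import Counter


def build_materialized_view(raw_events):
    """Recompute the whole view from raw events, as a scheduled batch job might."""
    view = Counter()
    for event in raw_events:
        hour = event["ts"][:13]  # ISO timestamp truncated to "YYYY-MM-DDTHH"
        view[(event["page"], hour)] += 1
    return dict(view)


def report_page_views(view, page):
    """Serve a canned report from the precomputed view, never from raw events."""
    return sorted((hour, count) for (p, hour), count in view.items() if p == page)
```

Because the view is rebuilt by a periodic job, reports are only as fresh as the last rebuild, which fits a latency driver measured in an hour rather than in seconds.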

Tips for Designing Big Data Solutions

• Understand data users and sources
• Discover architecture drivers
• Select proper reference architecture
• Do trade-off analysis, address cons
• Map reference architecture to technology stack
• Prototype, re-evaluate architecture
• Estimate implementation efforts
• Set up devops practices from the very beginning
• Advance in solution development through “small wins”
• Be ready for changes, big data technologies are evolving rapidly

▪ Leading global Product and Application Development partner founded in 1993

▪ 3,300+ employees across North America, Ukraine and Western Europe

▪ Thousands of successful outsourcing projects!

SaaS/Cloud Solutions · Mobility Solutions · UX/UI · BI/Analytics/Big Data · Software Architecture · Security

Thank You!

SoftServe US Office
One Congress Plaza, 111 Congress Avenue, Suite 2700
Austin, TX 78701
Tel: 512.516.8880

Contacts:
Serhiy Haziyev: [email protected]
Hrytsay: [email protected]