
BIG DATA TESTING: A UNIFIED VIEW BY NAM THAI

ECU, Computer Science Department, March 16, 2016

http://core.ecu.edu/STRG

PRESENTATION CONTENT

1. Overview of Big Data

A. 5 V’s of Big Data

B. Data generation

C. Data acquisition

D. Data pre-processing

E. Data analysis

F. Apache Hadoop

2. Testing Big Data

A. Database Testing

B. Application Testing

C. Performance Testing

D. Traditional Testing vs Big Data Testing


WHAT? WHY?

Moore’s law (generation / storage)

Industry demand (science, business, etc.)

Traditional RDBMS not enough

Barack Obama's $200 million Big Data initiative (2012)

Y. Demchenko; C. de Laat; P. Membrey; 2014

H. Hu; Y. Wen; T. Chua; X. Li; 2015


Comparative Definition:

“Data sets whose size is beyond the ability of typical database software tools to capture, store, manage and analyze”

- McKinsey Global Institute Report

HOW?

Increasing data storage capability

Physical storage cost: $228/GB -> $0.88/GB

Virtualized data (size)

NoSQL data management

Distributed computing networks (parallel processing/cloud computing)

Improved network latency (speed)

Advances in data analysis (machine learning)

Google invents MapReduce

Y. Demchenko; C. de Laat; P. Membrey; 2014

H. Hu; Y. Wen; T. Chua; X. Li; 2015


WHAT IS BIG DATA?

Attributive Definition:

“Big Data Technologies describe a new generation of technologies and architectures, designed to economically extract value from very large volumes of a wide variety of data, by enabling high-velocity capture, discovery and/or analysis”

- International Data Corporation (IDC)

H. Hu; Y. Wen; T. Chua; X. Li; 2015


THE 5 V’S OF BIG DATA:

Volume:

Terabytes and Petabytes of storage

Database functionality to handle TB & PB

Velocity:

TB/sec. data transfer rates

Variety:

Data Types: text, video, images, speech, etc.

Data Source: Many sources, from varying distances, at varying speeds

Value:

Analysis: data analysis applications

Veracity:

Data must be correct

Data cleansing: removing noise and correcting errors

Demchenko, Y.; de Laat, C.; Membrey, P., "Defining architecture components of the Big Data Ecosystem,"

H. Hu; Y. Wen; T. Chua; X. Li; 2015


LAYERED VIEW OF BIG DATA

Application Layer

Data analysis

Query and clustering

Data classification

Computing Layer

Programming models

Data management: NoSQL, file systems

Data Integration

Infrastructure Layer

Network storage resources

Network computation resources

Y. Demchenko; C. de Laat; P. Membrey; 2014

H. Hu; Y. Wen; T. Chua; X. Li; 2015


BIG DATA LIFE CYCLE

I. Data Generation

Attributes

Sources

II. Data Acquisition

Collection

Transmission

Pre-processing

III. Data Storage

File systems

Database technologies

Programming models

IV. Data Analysis

Analysis techniques

Analysis paradigms

Y. Demchenko; C. de Laat; P. Membrey; 2014

H. Hu; Y. Wen; T. Chua; X. Li; 2015


I. DATA GENERATION

The “Data” in Big Data

Volume: Petabytes & Exabytes

Velocity: PB/sec or real-time

Variety: text, image, video, logs, reports

Value: source of data

Domain specificity

H. Hu; Y. Wen; T. Chua; X. Li; 2015


I. DATA GENERATION CONT.

Business Data:

Stock market, internet purchases, business-to-business transactions

Billions of transactions per day

Networking Data:

Internet: Google, 30 PB/day

Social Networking: Facebook, 30 PB/day

Internet of Things: 30 million networked sensors

Scientific Data:

Astronomy: 20 TB of images a night

High-Energy Physics: LHC, 2 PB/second

Data is already available!

Amazon Redshift

Petabyte-scale data warehouse

H. Hu; Y. Wen; T. Chua; X. Li; 2015


II. DATA ACQUISITION: DATA COLLECTION

Two categories

Pull-Based approach

Push-Based approach

Common Collection methods:

Sensors: Physical to Digital

Pull-Based approach

Log Files: Record the activity of software systems

Push-Based approach

Web Crawler: Collecting URLs for search engines

Pull-Based approach
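A minimal sketch of the two approaches (names hypothetical): a pull-based collector initiates each read on its own schedule, while a push-based source delivers records to a registered handler as they occur.

```python
import time
from typing import Callable, List

def pull_collect(read_source: Callable[[], float], interval_s: float, n: int) -> List[float]:
    """Pull-based: the collector initiates every read on its own schedule."""
    readings = []
    for _ in range(n):
        readings.append(read_source())
        time.sleep(interval_s)
    return readings

class PushSource:
    """Push-based: the source delivers each record to registered handlers."""
    def __init__(self) -> None:
        self.handlers: List[Callable[[str], None]] = []

    def subscribe(self, handler: Callable[[str], None]) -> None:
        self.handlers.append(handler)

    def emit(self, line: str) -> None:  # called by the producing system
        for handler in self.handlers:
            handler(line)

print(pull_collect(lambda: 21.5, interval_s=0.01, n=3))  # poll a simulated sensor

collected: List[str] = []
source = PushSource()
source.subscribe(collected.append)  # the collector only registers interest
source.emit("2016-03-16 10:02:11 GET /index.html 200")
print(collected)
```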

H. Hu; Y. Wen; T. Chua; X. Li; 2015


II. DATA ACQUISITION: DATA TRANSMISSION

Transfer via IP backbone

Region or Internet scale

High Capacity Transfer

Data Center Transmission

Data Center Network Architecture

Consists of racks of servers

Connected by internal network

Transportation Protocols

Governs data transmission within data center

Transfer collected data into storage infrastructure

H. Hu; Y. Wen; T. Chua; X. Li; 2015


II. DATA ACQUISITION: PRE-PROCESSING

Data quality is critical

Reduce noise and redundancy

Increase consistency

Integration

Combining data into a unified view

Distributed sources

Data needs to be standardized

Cleansing

Search for inaccurate, incomplete, irrelevant data

Requires data rules

Amend or remove bad data

Redundancy Elimination

Reduce transmission overhead

Prevent wasted storage space, inconsistency

Data corruption can destroy databases
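A minimal sketch of the three steps on toy records from two hypothetical sources: integration standardizes units into a unified view, cleansing applies a data rule, and redundancy elimination drops duplicates.

```python
# Toy records from two hypothetical sources with different schemas.
raw = [
    {"src": "A", "temp_f": 98.6, "id": "p1"},
    {"src": "B", "temp_c": 37.0, "id": "p1"},   # duplicate of p1, other units
    {"src": "B", "temp_c": -500.0, "id": "p2"}, # violates a data rule
]

def integrate(rec: dict) -> dict:
    """Integration: standardize both sources to one schema (Celsius)."""
    temp_c = round((rec["temp_f"] - 32) * 5 / 9, 1) if "temp_f" in rec else rec["temp_c"]
    return {"id": rec["id"], "temp_c": temp_c}

def is_valid(rec: dict) -> bool:
    """Data rule: a body temperature must be physically plausible."""
    return 30.0 <= rec["temp_c"] <= 45.0

cleaned, seen = [], set()
for rec in map(integrate, raw):
    if not is_valid(rec):   # cleansing: amend or remove bad data
        continue
    if rec["id"] in seen:   # redundancy elimination: drop duplicates
        continue
    seen.add(rec["id"])
    cleaned.append(rec)

print(cleaned)  # [{'id': 'p1', 'temp_c': 37.0}]
```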

H. Hu; Y. Wen; T. Chua; X. Li; 2015


III. DATA STORAGE: STORAGE INFRASTRUCTURE

Store collected data in a format suitable for analysis

Physical Storage

Random access memory (RAM)

Magnetic disk (HDD)

Storage class memory (SSD)

Optical/tape storage (largely obsolete for Big Data)

Network Infrastructure

Direct Attached Storage (DAS)

Network Attached Storage (NAS)

Storage Area Network (SAN)

Attributes

Persistent and reliable

Infrastructure must be able to scale up and down to meet application demand

SANs enable storage virtualization

Virtualization allows multiple storage devices to function as a single logical device

H. Hu; Y. Wen; T. Chua; X. Li; 2015


III. DATA STORAGE: DATA MANAGEMENT FRAMEWORK

File Systems

Google File System (GFS)

Scalable distributed file system

Fault tolerance

High performance for a large number of clients

Distributed over clusters of commodity servers

Hadoop Distributed File System (HDFS)

Open source, based on GFS

Database Technologies

NoSQL systems

Schema free

Easy replication (for distribution)

Support huge amounts of data

Simple API
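As a toy illustration of the schema-free, simple-API point, here is an in-memory stand-in for a NoSQL key-value store (hypothetical; real systems add replication, persistence, and partitioning):

```python
from typing import Dict

store: Dict[str, dict] = {}

def put(key: str, value: dict) -> None:
    store[key] = value  # schema free: any document shape is accepted

def get(key: str) -> dict:
    return store[key]

put("user:1", {"name": "Ada", "tags": ["admin"]})
put("user:2", {"name": "Nam", "dept": "CS", "office": 314})  # new fields, no schema change
print(get("user:2")["dept"])  # CS
```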

H. Hu; Y. Wen; T. Chua; X. Li; 2015


III. DATA STORAGE: PROGRAMMING MODELS

Programming models provide the application logic for large-scale data processing

Generic programming model

Stream programming model

Batch programming model

Generic Programming Model

MapReduce, invented by Google

MapReduce is the most widely used model in the big data ecosystem

Allows distributed processing

Can be integrated with SQL

MapReduce consists of three main phases (sketched below):

Map() – Data objects are mapped to key-value pairs based on analysis constraints

Shuffle() – Consolidates mapped pairs into groups by key

Reduce() – Aggregates each group of shuffled pairs into a single result
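To make the three phases concrete, here is a minimal pure-Python sketch of the MapReduce flow for word counting. This is illustrative only; a real Hadoop job distributes these phases across a cluster.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

def map_phase(docs: List[str]) -> List[Tuple[str, int]]:
    """Map(): emit a (key, value) pair for every word occurrence."""
    return [(word, 1) for doc in docs for word in doc.split()]

def shuffle_phase(pairs: List[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Shuffle(): group all emitted values under their key."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups: Dict[str, List[int]]) -> Dict[str, int]:
    """Reduce(): aggregate each key's values into a single result."""
    return {key: sum(values) for key, values in groups.items()}

docs = ["big data testing", "big data analysis"]
print(reduce_phase(shuffle_phase(map_phase(docs))))
# {'big': 2, 'data': 2, 'testing': 1, 'analysis': 1}
```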

H. Hu; Y. Wen; T. Chua; X. Li; 2015


IV. DATA ANALYSIS

Data Analysis methods are domain specific!

Goals of Data analysis

Extrapolate and interpret data

Check legitimacy of data

Assist decision making

Predict future trends

Provide recommendations

Types of Data Analysis

Descriptive analytics

Uses historical data to describe a trend or occurrence

Usually translated into graphical visualizations

Associated with business intelligence

Predictive analytics

Uses data to predict future trends or probabilities

Utilizes data mining to calculate predictions

Statistical techniques used to interpret data

Prescriptive analytics

Uses data to diagnose and infer information to assist decision making

H. Hu; Y. Wen; T. Chua; X. Li; 2015


IV. DATA ANALYSIS: ANALYSIS PARADIGMS

Stream Processing Model (Real-time)

Analyze as soon as possible

Data value relies on data freshness

High processing speed

Little raw data is stored

Batch Processing Model

Analysis of large batches of data

MapReduce is the most common batch-processing model

Processing is scheduled near the data location
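A minimal sketch of the contrast, assuming a toy list of sensor values: the batch version materializes the whole dataset before computing, while the stream version keeps only a running aggregate and yields a fresh answer per record (little raw data is stored).

```python
from typing import Iterable, Iterator

readings = [3.0, 5.0, 4.0, 6.0]  # toy sensor values

def batch_mean(stored: Iterable[float]) -> float:
    """Batch: the full dataset is stored first, then processed as one job."""
    data = list(stored)  # everything is materialized
    return sum(data) / len(data)

def stream_means(stream: Iterable[float]) -> Iterator[float]:
    """Stream: each arriving record updates a running result immediately."""
    count, total = 0, 0.0
    for value in stream:
        count += 1
        total += value
        yield total / count  # fresh answer after every record

print(batch_mean(readings))          # 4.5, once the batch completes
print(list(stream_means(readings)))  # [3.0, 4.0, 4.0, 4.5]
```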

H. Hu; Y. Wen; T. Chua; X. Li; 2015


IV. DATA ANALYSIS: DATA ANALYSIS TECHNIQUES

Data Mining

Computational process of discovering patterns in data sets

Data mining is used in:

Artificial intelligence, machine learning, pattern recognition, statistics, etc.

Types of data mining algorithms

Classification

Clustering

Regression

Statistical learning

Association analysis

Data Visualization

Information graphics and visualization

Graphical data representation is easy to understand

Due to the volume and variety of big data, visualization is needed

Visualization can assist

Algorithm design

Software development

H. Hu; Y. Wen; T. Chua; X. Li; 2015


APACHE HADOOP

Leading Big Data platform in industry and academia

Open source

Companies using Hadoop: Amazon, LinkedIn, IBM, Microsoft, and Intel

Adapted from Google’s MapReduce

Scalability & Flexibility

Clusters and servers can be added/removed without interrupting the system

Because it is Java-based, it is compatible across platforms

Fault tolerance

Does not rely on hardware for fault tolerance

The Hadoop library is designed to detect and handle failures at the application level

H. Hu; Y. Wen; T. Chua; X. Li; 2015


BIG DATA TESTING

I. Database Testing

II. Application Testing

III. Performance Testing

IV. Traditional Testing vs Big Data Testing


DATABASE TESTING: VERIFICATION OF DATABASE

Testing is Domain specific

Testers must know how to cover integrity constraints

Testers must tailor data to trigger integrity constraints

Every integrity constraint must be tested

Traditional verification:

Manually filling tables with data copied from other sources

Random data (only good for performance and load testing)

Implementing custom, domain-specific data generators

Big Database Functionality validation requires:

AUTOMATION

Generating formatted and unformatted data

Parse & interpret integrity constraints

Generate data to trigger data rules

Validate the correctness of generated data

Correctness of data structure

Ability to trigger integrity constraints
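A minimal sketch of this automation idea, with hypothetical integrity constraints for an 'orders' table: generate valid rows plus one row per constraint crafted to trigger it, then validate that every constraint is actually covered by the generated data.

```python
import random
from typing import Dict, List

# Hypothetical integrity constraints, expressed as named predicates
# so each one can be targeted and checked individually.
CONSTRAINTS = {
    "quantity_positive": lambda row: row["quantity"] > 0,
    "price_non_negative": lambda row: row["price"] >= 0,
}

def valid_row() -> Dict:
    return {"quantity": random.randint(1, 100),
            "price": round(random.uniform(0, 500), 2)}

def violating_row(constraint: str) -> Dict:
    """Tailor a row that triggers exactly one named constraint."""
    row = valid_row()
    if constraint == "quantity_positive":
        row["quantity"] = 0
    elif constraint == "price_non_negative":
        row["price"] = -1.0
    return row

def generate_test_data(n_valid: int) -> List[Dict]:
    rows = [valid_row() for _ in range(n_valid)]
    rows += [violating_row(name) for name in CONSTRAINTS]  # one trigger each
    return rows

# Validate the correctness of the generated data: every constraint
# must be triggered by at least one row.
rows = generate_test_data(5)
for name, check in CONSTRAINTS.items():
    assert any(not check(row) for row in rows), f"{name} never triggered"
print(f"{len(rows)} rows generated; all {len(CONSTRAINTS)} constraints covered")
```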

Sneed, H.M.; Erdoes, K., "Testing big data (Assuring the quality of large databases),"


BIG DATA TESTING

Goal: Verify Big Data application

We will examine big data testing on a Hadoop system

Three main steps to Big Data testing

I. Data staging validation

II. “MapReduce” Validation

III. Output Validation

http://www.guru99.com/big-data-testing-functional-performance.html#1


BIG DATA TESTING STEPS

Step I: Data Staging Validation

“Pre-Hadoop Phase”

Validate data pulled from data sources

Compare source data with the data pushed into the system

Verify the data is loaded correctly into HDFS

Step II: “MapReduce” Validation

Ensure that the MapReduce process works correctly

Verify correct access to all nodes

Data aggregation/segregation rules are implemented on the data

Key-value pairs are generated by the Map function

Validate the data after the Reduce process

Step III: Output Validation

Analyze the data output of Hadoop

Confirm that transformation rules are correct

Check data integrity and destination loading

Detect corruption by comparing output with HDFS
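A minimal sketch of the comparison idea behind Steps I and III: check a record count and an order-independent content checksum between the source data and what was loaded. Here toy in-memory lists stand in for the source and HDFS; a real setup would read the records back through the HDFS client.

```python
import hashlib
from typing import Iterable, Tuple

def fingerprint(records: Iterable[str]) -> Tuple[int, str]:
    """Record count plus an order-independent content checksum."""
    count, acc = 0, 0
    for rec in records:
        count += 1
        # XOR of per-record digests ignores record order.
        acc ^= int.from_bytes(hashlib.sha256(rec.encode()).digest()[:8], "big")
    return count, format(acc, "016x")

source_records = ["u1,click,10:02", "u2,view,10:03", "u3,click,10:04"]
loaded_records = ["u2,view,10:03", "u1,click,10:02", "u3,click,10:04"]  # as read back

src, dst = fingerprint(source_records), fingerprint(loaded_records)
assert src == dst, f"staging mismatch: source={src} loaded={dst}"
print("counts and checksums match:", src)
```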

http://www.guru99.com/big-data-testing-functional-performance.html#1


PERFORMANCE TESTING

A form of non-functional testing which simulates load conditions

Detect bottlenecks and performance issues

Provide benchmark data on the system

Central Characteristics

Response time (faster responses)

Resource use (efficient utilization)

Stability (reliable operation)

A. Alexandrov, C. Brücke, and V. Markl, “Issues in Big Data Testing and Benchmarking,”

Sneed, H.M.; Erdoes, K., "Testing big data (Assuring the quality of large databases),"


PERFORMANCE TESTING

Performance testing focuses on improving the 4 V’s

Performance Test Types

Concurrent test

Tests the concurrent usage of a specific block of the Big Data system

Determines any problems with many concurrent users

Load testing

Tests the Big Data system at different load levels to determine its performance at each level

Provides reliability and stability metrics

Focuses on user transactions with the system

Stress testing

Examines system performance under the most extreme conditions of concurrent users and user transactions

Provides metrics on peak loads and failure conditions

Reveals weaknesses in system

Capacity testing

Determines the maximum resource loads available to the system

Provides metrics on physical limitations of the system

Provides metrics on maximum concurrent users and maximum simultaneous transactions
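A minimal load-test sketch along these lines: fire simulated user transactions at increasing concurrency and report throughput and latency percentiles. The transaction body is a stand-in; a real test would issue a query or request against the system under test.

```python
import statistics
import time
from concurrent.futures import ThreadPoolExecutor

def transaction() -> float:
    """Stand-in for one user transaction; returns its response time."""
    start = time.perf_counter()
    time.sleep(0.01)  # replace with a real query/request
    return time.perf_counter() - start

def load_test(concurrent_users: int, transactions: int) -> None:
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=concurrent_users) as pool:
        latencies = sorted(pool.map(lambda _: transaction(), range(transactions)))
    elapsed = time.perf_counter() - start
    print(f"users={concurrent_users} tps={transactions / elapsed:.0f} "
          f"p50={statistics.median(latencies) * 1000:.1f}ms "
          f"p95={latencies[int(0.95 * len(latencies))] * 1000:.1f}ms")

# Step up the load level to see where response times degrade.
for users in (1, 10, 50):
    load_test(users, 200)
```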

A. Alexandrov, C. Brücke, and V. Markl, 2015


TRADITIONAL VS BIG DATA TESTING

A. Alexandrov, C. Brücke, and V. Markl, 2015

Attribute: Data

Traditional Database Testing: Structured data; testing is well defined and established; manual sampling of data is possible

Big Data Testing: Structured and unstructured data; testing requires analysis of the big data system's domain; requires automation

Attribute: Infrastructure

Traditional Database Testing: Does not require a dedicated test environment

Big Data Testing: Requires a dedicated test environment due to large data sizes

Attribute: Validation Tools

Traditional Database Testing: Excel-based or UI-based automation tools; does not require domain knowledge or extensive training

Big Data Testing: No defined universal tools; tools require skills and training; requires knowledge of specific big data systems

http://www.guru99.com/big-data-testing-functional-performance.html#1


BIG DATA TESTING CHALLENGES

Automation

Most testing approaches need to be automated

Due to the size, speed, and complexity of the data

Automation cannot handle unexpected problems in the testing process

Generating Data

Create very large realistic data sets

Interpreting data rules from data

Requires machine learning and artificial intelligence

Cross-Platform Testing Tools

Testing applications are application/domain specific

Standardized Testing Framework

Testing frameworks for big data are in their infancy

The variety of big data systems and system components creates problems


CONCLUSIONS

Big Data Attributes

Volume

Velocity

Variety

Value

Big Data Life Cycle

1. Generation

2. Acquisition

3. Storage

4. Analysis

Big Data Application

Google’s MapReduce

Apache Hadoop

Big Data Testing

Database testing

Application testing

Performance Testing

Traditional vs. Big Data Testing

Big Data Testing Challenges



Thank You!