Hadoop and Measurement Data

Upload: lephuc

Post on 02-Jul-2018



What is Hadoop?

• A framework for distributed processing of large datasets

• Hadoop supports only one paradigm for distributed processing: MapReduce

• Hadoop originated in the area of web search engines

• “Moves computations to data”, the opposite of “data to computations”

• HDFS (Hadoop Distributed File System)

– Optimal for write-once, read-many semantics

– Large files (or even streams) are chunked into blocks and distributed over several network nodes

– Offers configurable replication of data

09.02.2017 2


Hadoop is based on the Map-Reduce Paradigm

– Programming paradigm for distributed computation

– Mapper transforms input key/value pairs to output key/value pairs

– Reducer forms result per key from all Mapper outputs

– Row based processing

– Operates on slices of the original dataset that can be processed completely independently of each other


Map-Reduce Sample

Calculate maximum temperature per city based on a huge set of measurement data

Flow: Big File or Stream -> Slicing (labelled "Once") -> File 1 ... File N -> Mapping -> Calc on CPU 1 ... CPU N (labelled "Many") -> Reducing

Big File or Stream:

City    Temp
City 1  22
City 2  18
City 1  25
City 1  28
City 1  12
City 2  15

Slicing:

File 1            File N
City    Temp      City    Temp
City 1  22        City 1  28
City 2  18        City 1  12
City 1  25        City 2  15

Mapping (Calc):

CPU 1                  CPU N
City    Temp (max)     City    Temp (max)
City 1  25             City 1  28
City 2  18             City 2  15

Reducing:

City    Temp (max)
City 1  28
City 2  18
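The maximum-temperature sample above can be sketched in a few lines of plain Python. This is only an illustration of the slice / map / reduce idea, not Hadoop code: the function names `slice_records`, `map_slice` and `reduce_partials` are made up for this sketch, and the per-slice maximum combines what Hadoop would split into a mapper and a combiner.

```python
# Sample records from the slide: (city, temperature)
RECORDS = [
    ("City 1", 22), ("City 2", 18), ("City 1", 25),
    ("City 1", 28), ("City 1", 12), ("City 2", 15),
]

def slice_records(records, n_slices):
    """Split the input into contiguous slices of roughly equal size."""
    size = -(-len(records) // n_slices)  # ceiling division
    return [records[i:i + size] for i in range(0, len(records), size)]

def map_slice(records):
    """Map step: the per-city maximum found within one slice."""
    partial = {}
    for city, temp in records:
        partial[city] = max(temp, partial.get(city, temp))
    return partial

def reduce_partials(partials):
    """Reduce step: combine the per-slice maxima into one maximum per city."""
    result = {}
    for partial in partials:
        for city, temp in partial.items():
            result[city] = max(temp, result.get(city, temp))
    return result

slices = slice_records(RECORDS, 2)
partials = [map_slice(s) for s in slices]   # would run on CPU 1 ... CPU N
max_per_city = reduce_partials(partials)
print(max_per_city)  # {'City 1': 28, 'City 2': 18}
```

The approach works here because the maximum of the per-slice maxima equals the maximum over the whole dataset, so no slice needs to see another slice's rows.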


Measurement Data Files

• Only small files (Excel, ASCII) have a line-oriented structure -> no Hadoop needed

• Typical test stand or road test data files (like MDF): medium size per file, but a high number of files, with a non-tabular (= non-sliceable) structure:

– Today’s file structure cannot be changed

– Channels with different measuring rates

– Channels with their own time channel, event-based data

– Resampling all channels to the maximum rate needs a factor of 10 or more file space

• Testing of assistance systems up to self-driving: a different world, with typical streams suited to Hadoop operation
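Why resampling inflates the file size can be checked with a small calculation. The sketch below is an assumption-laden illustration: the channel rates are invented, time channels are ignored, and `resampling_growth_factor` is a hypothetical helper, not part of any tool named in these slides.

```python
def resampling_growth_factor(rates_hz):
    """Ratio of samples stored after resampling every channel to the
    fastest rate versus storing each channel at its native rate.
    The recording duration cancels out, so only the rates matter."""
    native = sum(rates_hz)
    resampled = len(rates_hz) * max(rates_hz)
    return resampled / native

# Hypothetical file: one fast channel and many slow ones.
rates = [1000, 100, 10, 10, 1, 1, 1, 1, 1, 1]
factor = resampling_growth_factor(rates)
print(f"growth factor: {factor:.1f}")
```

With these invented rates the tabular form needs close to nine times the samples of the native form; every additional slow channel pushes the factor higher.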


Measurement Data Files

• Only small files (Excel, ASCII) have a line-oriented structure -> no Hadoop needed

• Typical Big-Data files (like MDF) have a non-tabular (= non-sliceable) structure:

– Channels with different measuring rates

– Channels with their own time channel, event-based data

– Resampling all channels to the maximum rate needs a factor of 10 or more file space

• Hadoop’s key/value mapping would have to use time values as keys: not applicable!


Measurement Data Analysis

• Only very few calculations allow independent row-by-row operation:

– 1D statistical frequency (histogram): Yes

– Signal filtering needs at least an overlap between slices: No

– FFT analysis needs the whole set of values: No

– Slicing is critical!

• Because slicing and mapping is a “once” process, it must fit all possible analysis operations

• Complex calculations cannot be processed as independent blocks
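The FFT case can be illustrated with a naive discrete Fourier transform. The sketch below is not an FFT implementation and the sample values are arbitrary; it only shows that a single frequency bin depends on every input sample, so a task that holds just one slice cannot produce it.

```python
import cmath

def dft_bin(samples, k):
    """Naive discrete Fourier transform, frequency bin k only."""
    n = len(samples)
    return sum(x * cmath.exp(-2j * cmath.pi * k * i / n)
               for i, x in enumerate(samples))

signal = [22, 18, 24, 28, 12, 14]
changed = signal[:-1] + [15]   # alter only the last sample

# A task holding just the first half of the data cannot know bin 1:
# the bin value differs although the first half is identical.
bin_full = dft_bin(signal, 1)
bin_changed = dft_bin(changed, 1)
print(abs(bin_full - bin_changed) > 1e-9)  # True
```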


Where is Hadoop helpful?

• Tabular data structures are very frequently used in commercial / office applications -> Hadoop helps

• Measurement data are typically not tabular -> Hadoop does not help for faster analysis

• Typical measurement data analysis is not tabular -> Hadoop does not help for faster analysis

• Measurement data are recorded once, but used frequently for analysis -> Hadoop’s HDFS is a good alternative to a regular file system for storing all the data files


AMS Solutions for Distributed Analysis of Large Amounts of Data


AMS - Big Data Technologies

• Parallel Operation

1. Single calculation (on a per-calculation basis)

2a. Analysis tree (on a per-file basis)

2b. Analysis tree with aggregation

3. Analysis job (per MaDaM/storage center)

• Statistics over meta data extracted during import

• Catalog files

• Channel-name and -unit mapping

• Resampling on a per-calculation basis


Parallel Operation – 1. Single Calculation

• Independent rows (sample: Histogram)

– Input data can be sliced and analyzed independently

• Overlapping value areas (sample: Signal-Filter)

– Data must be sliced with overlap, depending on the parameters of the calculation

• Not sliceable calculations (sample: FFT)

– Calculation of one result value needs information from all input values

The strategy depends on the individual calculation
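The overlapping case can be sketched with a moving average standing in for a signal filter. This is an illustrative assumption, not the filter used in jBEAM: the function names are invented, and the required overlap of `window - 1` samples holds for this simple filter only.

```python
def moving_average(values, window):
    """Simple FIR filter: mean over each full window of samples."""
    return [sum(values[i:i + window]) / window
            for i in range(len(values) - window + 1)]

def sliced_moving_average(values, window, slice_len):
    """Filter slice by slice; each slice is extended by window - 1
    samples so that no output value at a boundary is lost."""
    out = []
    for start in range(0, len(values) - window + 1, slice_len):
        chunk = values[start:start + slice_len + window - 1]
        out.extend(moving_average(chunk, window))
    return out

data = [22, 18, 24, 28, 12, 14]
full = moving_average(data, 3)
overlapped = sliced_moving_average(data, 3, slice_len=2)
naive = moving_average(data[:3], 3) + moving_average(data[3:], 3)
print(full == overlapped, len(naive), len(full))  # True 2 4
```

Slicing without overlap loses the outputs whose window straddles a slice boundary, while the overlapped version reproduces the unsliced result. The size of the overlap depends on the filter parameter (here the window length), which is why the slicing cannot be fixed once for all calculations.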

Parallel Operation – 1. Single Calculation

Input channel (DoubleChannel):

X [s]   Temp [°]
0.0     22
0.15    18
0.3     24
0.45    28
0.6     12
0.75    14

Slicing:

Task 1                Task N
X [s]   Temp [°]      X [s]   Temp [°]
0.0     22            0.45    28
0.15    18            0.6     12
0.3     24            0.75    14

Parallel Processing (Calculation: Histogram):

Task 1              Task N
X [°]   N []        X [°]   N []
10-15   0           10-15   2
15-20   1           15-20   0
20-25   2           20-25   0
25-30   0           25-30   1

Aggregation:

X [°]   N []
10-15   2
15-20   1
20-25   2
25-30   1
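The histogram example above can be reproduced with a short sketch. It assumes half-open bins (lower edge included, upper edge excluded), which the slide does not specify, and uses plain Python rather than any jBEAM API.

```python
from collections import Counter

BIN_EDGES = [10, 15, 20, 25, 30]  # bins 10-15, 15-20, 20-25, 25-30

def histogram(values):
    """Count values per bin; each value is handled independently."""
    counts = Counter()
    for v in values:
        for lo, hi in zip(BIN_EDGES, BIN_EDGES[1:]):
            if lo <= v < hi:
                counts[(lo, hi)] += 1
    return counts

temps = [22, 18, 24, 28, 12, 14]   # Temp column of the input
task_1 = histogram(temps[:3])      # first slice
task_n = histogram(temps[3:])      # second slice
total = task_1 + task_n            # aggregation: add the bin counts

for lo, hi in zip(BIN_EDGES, BIN_EDGES[1:]):
    print(f"{lo}-{hi}: {total[(lo, hi)]}")
```

Because every sample lands in a bin without reference to its neighbours, the sum of the per-slice histograms equals the histogram of the whole channel.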


Parallel Operation 2a. Analysis Tree - Without Aggregation

• Processing on a per-file basis

– Files are natural slices

• Independent analysis and report

– Synchronization problems are solved in jBEAM

– jBEAM project templates define the analysis

• The MultiFile module controls the jBEAM pool

– The MF module is responsible for load balancing

– Each jBEAM needs access to the files

Several jBEAM instances, each on its own server

Parallel Operation 2a. Analysis Tree - Without Aggregation

Calculate a whole report from one or several files. Process individual file sets individually and in parallel.

[Diagram: plenty of files (many) -> MultiFile-Controller (Distribution & Load Control) -> Server 1: jBEAM Instance 1 ... Server N: jBEAM Instance N, each running Import -> Analysis -> Report]
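The per-file distribution can be sketched with a worker pool. This is a stand-in only: threads replace the separate jBEAM servers, `analyse_file` is a hypothetical placeholder for import, analysis and report, and the file names are invented.

```python
from concurrent.futures import ThreadPoolExecutor

def analyse_file(name):
    """Placeholder for import -> analysis -> report of one file."""
    return f"report for {name}"

files = [f"file_{i}.dat" for i in range(1, 9)]   # hypothetical names

# The executor hands the next file to whichever worker is free,
# which gives a simple form of load balancing.
with ThreadPoolExecutor(max_workers=4) as pool:
    reports = list(pool.map(analyse_file, files))

print(len(reports), reports[0])  # 8 report for file_1.dat
```

Since each file yields its own report, the workers never need to exchange data, which is what makes files natural slices.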

Parallel Operation 2b. Analysis Tree - With Aggregation

• Processing on a per-file basis

– Files are natural slices

• Independent analysis and report

– Synchronization problems are solved in jBEAM

– jBEAM project templates define the analysis

• The MultiFile module controls the jBEAM pool

– The MF module is responsible for load balancing

– Each jBEAM needs access to the files

– The MF module is responsible for aggregation

Several jBEAM instances, each on its own server

Parallel Operation 2b. Analysis Tree - With Aggregation

Analyze file sets individually and aggregate results for a common report.

[Diagram: plenty of files (many) -> MultiFile-Controller (Distribution & Load Control) -> Server 1: jBEAM Instance 1 ... Server N: jBEAM Instance N, each running Import -> Analysis -> Aggregation & Reporting (one common report)]
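Aggregation across files works when each per-file result can be merged with the others. The sketch below assumes simple mergeable statistics (count, sum, minimum, maximum); the file names and values are invented, and the `analyse` and `merge` functions are illustrative, not the MultiFile module's interface.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import reduce

def analyse(values):
    """Per-file analysis: a partial result that can be merged later."""
    return {"n": len(values), "sum": sum(values),
            "min": min(values), "max": max(values)}

def merge(a, b):
    """Aggregation step: combine two partial results."""
    return {"n": a["n"] + b["n"], "sum": a["sum"] + b["sum"],
            "min": min(a["min"], b["min"]), "max": max(a["max"], b["max"])}

# Hypothetical file contents, keyed by file name.
files = {"a.dat": [22, 18, 24], "b.dat": [28, 12, 14], "c.dat": [19]}

with ThreadPoolExecutor(max_workers=2) as pool:
    partials = list(pool.map(analyse, files.values()))

report = reduce(merge, partials)
report["mean"] = report["sum"] / report["n"]
print(report)
```

The mean is derived only at the end from the merged count and sum, because averaging the per-file means directly would weight small files too heavily.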

Multiple-jBEAM Solution

[Diagram: MaDaM Job Centre and File System; an Analysis Job is spread over four jBEAM instances (File 1, 5, 11, …; File 2, 6, 12, …; File 3, 7, 13, …; File 4, 8, 14, …); a further jBEAM performs the Aggregation and produces the Final Report]

Parallel Operation 3. Global Solution: Multiple-MaDaM

• Data files are stored where they are most frequently used

• Analysis / computation is done near the data

• Analysis is distributed via communication between different MaDaM instances

• Individual results are aggregated to a global result

Multiple-MaDaM Solution

Search for Tests & Preview: Modern interactive web interface accessible from any browser

Standardized Reports: Server-jBEAM-generated PDF files can be viewed with a PDF reader

Interactive Analysis: jBEAM with Java Web Start running on the client desktop

Import of Tests: MaDaM Importer with Java Web Start running on the client desktop

IP Traffic:

1e-6 = 0.0001%   Only meta data exchange
1e-3 = 0.1%      EnCom-minimized traffic
1e0  = 100%      Complete file upload

*) optimized traffic with EnCom

[Diagram: three sites (USA, Germany, China). Each site has a server side (MaDaM TM, jBEAM Server, HTML-5, Lucene Database, File System) and a client side (Web Browser, jBEAM Client, MaDaM Importer); the client-server connection at each site is marked *). The long-distance links between sites are labelled 1e-6 and 1e-3; the label 1e0 = 100% marks a complete file upload.]
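The traffic fractions above translate into absolute volumes as follows. The sketch takes the three fractions from the slide; the 1 TB data volume is an assumption chosen only to make the orders of magnitude visible.

```python
# Traffic fractions from the slide, relative to a complete file upload.
FRACTIONS = {
    "meta data exchange": 1e-6,
    "EnCom-minimized traffic": 1e-3,
    "complete file upload": 1e0,
}

def traffic_bytes(data_bytes):
    """Bytes sent over the network for each kind of exchange."""
    return {kind: data_bytes * f for kind, f in FRACTIONS.items()}

one_terabyte = 10**12   # assumed data volume, for illustration only
for kind, size in traffic_bytes(one_terabyte).items():
    print(f"{kind}: {size:.0f} bytes")
```

For the assumed 1 TB of measurement data this amounts to roughly 1 MB of meta data traffic and roughly 1 GB of EnCom-minimized traffic, compared with the full 1 TB for a complete upload.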

Big Data Feature: Fast processing features inside jBEAM

For Big-Data processing the analysis tool must support special time- and storage-reducing technologies:

• “StandBy” technology of data file importers

– Read and import only channels really in use

– Create and use “overviews” for fast drawing

• Multithreading on a per-calculation basis

• Resampling on a per-calculation basis

• Quantity-based calculation (1.5 cm + 7 mm = 2.2 cm)
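Quantity-based calculation means that values carry their unit and are converted before arithmetic. The sketch below shows the idea for lengths only; the `Length` class and its conversion table are hypothetical and say nothing about how jBEAM implements quantities.

```python
from dataclasses import dataclass
from fractions import Fraction

# Conversion factors to metres; only the units needed here.
TO_METRE = {"mm": Fraction(1, 1000), "cm": Fraction(1, 100), "m": Fraction(1)}

@dataclass(frozen=True)
class Length:
    value: Fraction
    unit: str

    def to(self, unit):
        """Convert to another length unit."""
        return Length(self.value * TO_METRE[self.unit] / TO_METRE[unit], unit)

    def __add__(self, other):
        """Add two lengths; the result keeps the left operand's unit."""
        return Length(self.value + other.to(self.unit).value, self.unit)

total = Length(Fraction("1.5"), "cm") + Length(Fraction(7), "mm")
print(float(total.value), total.unit)  # 2.2 cm
```

Exact fractions are used instead of floats so that the unit conversion introduces no rounding error.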


Why MaDaM does not use Hadoop

• For Hadoop, the data structure and all related analysis logic must support up-front data slicing

• This is not possible in measurement data analysis

– Measurement data structures are not tabular or line-oriented

– Data can be sliced, if at all, only for specific use cases

– Complex calculations cannot be processed as independent blocks


AMS has its own Approach

AMS uses adequate technologies for distributed analysis of multiple files, which cannot be found in the Hadoop framework.

• Global distributed analysis by Multi-MaDaM

• Local distributed analysis by Multi-jBEAM

• Internal jBEAM technologies:

– Multithreading in components

– Sequential processing of multiple measurement files

– Multiple x-grids (time-grids) on a per-calculation basis

– Multiple unit-spaces

– Channel-name and -unit mapping


Gesellschaft für angewandte Mess- und Systemtechnik mbH

Germany
Bahnhofstraße 6
09111 Chemnitz
Germany
Tel.: +49 (371) 918 668-0
Fax: +49 (371) 918 668-99
E-Mail: [email protected]
Web: www.AMSonline.de

USA
1760 Opdyke Court
Auburn Hills, MI 48326
USA
Tel.: +1 (248) 270-7779
Fax: +1 (248) 393-0340
E-Mail: [email protected]
Web: www.AMSonline.eu

PR China
German Centre, Unit 719A
88 Keyuan Road, Pudong
Shanghai 201203 / PR China
Tel.: +86 (21) 289 866 19
Fax: +86 (21) 289 865 11
E-Mail: [email protected]
Web: www.AMSonline.cn