Increase computational power with distributed processing Neil Stein 03 Nov 2012

Upload: neil-stein

Post on 24-May-2015


Page 1: Distributed processing

Increase computational power with distributed processing

Neil Stein 03 Nov 2012

Page 2: Distributed processing
Page 3: Distributed processing

A Discussion Example

Getting the data and ordering it as needed

Familiar with grep and sort?

•  “grep” extracts all the matching lines

•  “sort” sorts all the lines

grep "some_record_parameters" hl7_transfer.data-file | sort
[2012/02/25/ 9:15] records sent to healthcare-1
[2012/02/28/ 6:15] records sent to healthcare-2
[2012/03/12/ 10:30] records sent to healthcare-3

Page 4: Distributed processing

A Discussion Example

•  As the amount of data increases, the process requires more and more resources

•  What if hl7_transfer.data-file is 500GB or bigger?

•  What if there are hundreds or thousands of data files?

•  What if there are multiple types of data files?

grep "provider 1" hl7_transfer.data-file | sort

•  Ignoring the process for a moment, how do we write all the data to disk in the first place?

Need to rethink the process

Page 5: Distributed processing
Page 6: Distributed processing

Distributed File-System – “the cloud”

•  Files can be stored across many machines

•  Files can be replicated across many machines

•  Files can be in a hybrid-cloud model

•  Share the file-system transparently

•  You simply see the usual file structure

•  Opportunity to leverage private and public cloud environments

Page 7: Distributed processing
Page 8: Distributed processing

Map-Reduce – the cloud

•  A way of processing large amounts of data across many machines

•  Must be able to split up the data into chunks for processing (Map)

•  The chunks are recombined after processing (Reduce)

•  Requires a constant flow of data from one simple state to another

•  Allows for a simple way of breaking down a large task into smaller manageable tasks

•  Increases the available computational power
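The split, process, and recombine idea can be sketched on a single machine. The snippet below is only an illustration, not how Hadoop is implemented: it redoes the earlier grep-and-sort example by filtering and sorting each chunk separately (Map) and then merging the sorted partial results (Reduce). The function names, the chunk size, and the "transfer failed" line are made up for the example.

```python
import heapq


def split_into_chunks(lines, chunk_size):
    """Split the input into chunks, one per (simulated) worker machine."""
    return [lines[i:i + chunk_size] for i in range(0, len(lines), chunk_size)]


def map_chunk(chunk, needle):
    """Map: the grep-and-sort step, applied to a single chunk."""
    return sorted(line for line in chunk if needle in line)


def reduce_chunks(sorted_chunks):
    """Reduce: merge the already-sorted partial results into one sorted list."""
    return list(heapq.merge(*sorted_chunks))


def grep_sort(lines, needle, chunk_size=2):
    """Run the whole pipeline; on a cluster each chunk would go to its own machine."""
    chunks = split_into_chunks(lines, chunk_size)
    return reduce_chunks(map_chunk(chunk, needle) for chunk in chunks)


if __name__ == "__main__":
    records = [
        "[2012/03/12/ 10:30] records sent to healthcare-3",
        "[2012/03/01/ 8:00] transfer failed",  # hypothetical non-matching line
        "[2012/02/28/ 6:15] records sent to healthcare-2",
        "[2012/02/25/ 9:15] records sent to healthcare-1",
    ]
    for line in grep_sort(records, "records sent"):
        print(line)
```

Because each chunk is handled independently, adding machines adds capacity; only the final merge needs to see every partial result.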

Page 9: Distributed processing

A look at Hadoop

Page 10: Distributed processing

What is Hadoop

•  A Map-Reduce framework

•  Designed to run applications on clusters of local and remote systems

•  HDFS

   •  The file system of Hadoop (Hadoop Distributed File System)

   •  Designed to access clusters of local and remote systems

Page 11: Distributed processing

Putting the pieces together….

Page 12: Distributed processing

First, we need some code……

Map Reduce

Page 13: Distributed processing

Map

Hadoop streams information on STDIN

Separate values with a newline (for Hadoop)
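The code shown on this slide did not survive in the transcript, so the following is a stand-in sketch of a Hadoop Streaming mapper rather than the presenter's script. It assumes input lines shaped like the sample output earlier ("[date time] records sent to healthcare-N") and, as a hypothetical choice, emits one "destination<TAB>1" record per matching line. Hadoop Streaming treats the text up to the first tab of each output line as the key.

```python
#!/usr/bin/env python3
import sys

# Assumed record layout, taken from the sample lines earlier in the talk.
MARKER = "records sent to "


def map_line(line):
    """Return 'destination<TAB>1' for a matching line, or None otherwise."""
    _, found, destination = line.rstrip("\n").partition(MARKER)
    if not found or not destination.strip():
        return None
    return f"{destination.strip()}\t1"


def main(stdin=sys.stdin, stdout=sys.stdout):
    # Hadoop feeds the input split to this process line by line on STDIN.
    for line in stdin:
        record = map_line(line)
        if record is not None:
            # One record per line: the newline is the record separator.
            stdout.write(record + "\n")


if __name__ == "__main__":
    main()
```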

Page 14: Distributed processing

Reduce

Hadoop streams back to us on STDIN

Output the aggregated records
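The reducer code from this slide is also missing from the transcript. As a hedged stand-in, the sketch below assumes the input is "key<TAB>count" lines, which Hadoop Streaming delivers to the reducer sorted by key, and sums the counts for each run of identical keys. The record format and the function names are assumptions made for illustration.

```python
#!/usr/bin/env python3
import sys
from itertools import groupby


def parse(line):
    """Split a 'key<TAB>count' line into (key, int(count))."""
    key, _, value = line.rstrip("\n").partition("\t")
    return key, int(value)


def reduce_lines(lines):
    """Yield 'key<TAB>total' for each run of identical keys in sorted input."""
    pairs = (parse(line) for line in lines if line.strip())
    for key, group in groupby(pairs, key=lambda pair: pair[0]):
        yield f"{key}\t{sum(count for _, count in group)}"


def main(stdin=sys.stdin, stdout=sys.stdout):
    # Hadoop streams the sorted map output back to this process on STDIN.
    for record in reduce_lines(stdin):
        stdout.write(record + "\n")


if __name__ == "__main__":
    main()
```

Grouping consecutive lines only works because the input arrives sorted by key; on unsorted input the same key could be reported more than once.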

Page 15: Distributed processing

Sanity Checking

This should work with small data-sets

Command

Results
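The command and results on this slide were screenshots and are not in the transcript. A common way to sanity-check streaming scripts is to pipe a small file through mapper, sort, and reducer on one machine. The sketch below does that from Python; `run_locally` is a hypothetical helper, and the two inline one-liners are stand-ins for real mapper and reducer scripts so that the example runs on its own.

```python
import subprocess
import sys


def run_stage(command, text):
    """Run one stage, feeding `text` on STDIN and returning its STDOUT."""
    done = subprocess.run(
        command, input=text, capture_output=True, text=True, check=True
    )
    return done.stdout


def run_locally(mapper_cmd, reducer_cmd, text):
    """Mimic `cat data | mapper | sort | reducer` on a single machine."""
    mapped = run_stage(mapper_cmd, text)
    # Hadoop sorts map output by key before the reduce step;
    # sorting whole lines is a rough local approximation of that.
    shuffled = "".join(line + "\n" for line in sorted(mapped.splitlines()))
    return run_stage(reducer_cmd, shuffled)


if __name__ == "__main__":
    # Stand-in stages (assumptions): emit the last word of each line, then pass through.
    emit_last_word = [
        sys.executable, "-c",
        "import sys\nfor line in sys.stdin: print(line.split()[-1])",
    ]
    pass_through = [
        sys.executable, "-c",
        "import sys; sys.stdout.write(sys.stdin.read())",
    ]
    sample = (
        "[2012/02/28/ 6:15] records sent to healthcare-2\n"
        "[2012/02/25/ 9:15] records sent to healthcare-1\n"
    )
    print(run_locally(emit_last_word, pass_through, sample), end="")
```

To test real scripts, the stand-in commands would be replaced with something like `[sys.executable, "mapper.py"]`, where the script name is whatever you called your mapper.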

Page 16: Distributed processing

Push file to “the distributed file system”

Put file on the DFS

Check that the file is in the cloud
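The exact commands from this slide are not in the transcript. One way to do both steps is with the `hadoop fs -put` and `hadoop fs -ls` shell commands. The sketch below only builds and prints those command lines; the target directory `/user/example/input` is a made-up path, and the helper names are hypothetical.

```python
import shlex


def put_command(local_path, dfs_dir):
    """Command that copies a local file into the distributed file system."""
    return ["hadoop", "fs", "-put", local_path, dfs_dir]


def ls_command(dfs_dir):
    """Command that lists a DFS directory, to confirm the file arrived."""
    return ["hadoop", "fs", "-ls", dfs_dir]


if __name__ == "__main__":
    dfs_dir = "/user/example/input"  # hypothetical DFS directory
    for command in (put_command("hl7_transfer.data-file", dfs_dir), ls_command(dfs_dir)):
        # Printed rather than executed, so this runs without a Hadoop install.
        print(shlex.join(command))
```

On a machine with Hadoop installed, each list can be passed to `subprocess.run(command, check=True)` to execute it.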

Page 17: Distributed processing

Running in “the distributed environment”

Call the Hadoop streaming command

Pass the appropriate parameters
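The parameters used in the talk are not preserved, so this is a sketch of a typical Hadoop Streaming invocation under stated assumptions. The `-files`, `-input`, `-output`, `-mapper`, and `-reducer` options are standard streaming options; the location of the streaming jar varies by installation, so it is passed in as a parameter, and every path in the demo is a placeholder.

```python
import os
import shlex


def streaming_command(streaming_jar, input_path, output_path, mapper, reducer):
    """Build a Hadoop Streaming command line as an argument list."""
    return [
        "hadoop", "jar", streaming_jar,
        # Generic options such as -files must come before the streaming options.
        "-files", f"{mapper},{reducer}",
        "-input", input_path,
        "-output", output_path,
        # Shipped files land in the task's working directory, so use bare names.
        "-mapper", os.path.basename(mapper),
        "-reducer", os.path.basename(reducer),
    ]


if __name__ == "__main__":
    command = streaming_command(
        streaming_jar="/path/to/hadoop-streaming.jar",  # placeholder path
        input_path="/user/example/input",               # hypothetical DFS paths
        output_path="/user/example/output",
        mapper="scripts/mapper.py",                     # hypothetical script names
        reducer="scripts/reducer.py",
    )
    print(shlex.join(command))
```

The scripts need a shebang line and the executable bit for this form to work, and the output directory must not already exist when the job starts.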

Page 18: Distributed processing

Running in “the distributed environment”

Page 19: Distributed processing

Running in “the distributed environment”

Page 20: Distributed processing

Running in “the distributed environment”

Page 21: Distributed processing

Running in “the distributed environment”

Page 22: Distributed processing

Checking Status

•  Cluster Summary

•  Running Jobs

•  Completed Jobs

•  Failed Jobs

•  Job Statistics

•  Detailed Job Logs

Page 23: Distributed processing

Checking Distributed Cluster Health

•  List Data-Nodes

•  Dead Nodes

•  Node Heart-beat information

•  Failed Jobs

•  Job Statistics

•  Detailed Job Logs

Page 24: Distributed processing

Conclusion

•  A different paradigm for solving large-scale problems

•  Designed to solve specific problems that can be defined in a focused map-reduce manner