distributed processing

Increase computational power with distributed processing

Neil Stein 03 Nov 2012

A Discussion Example: getting the data and ordering it as needed

Familiar with grep and sort?

• “grep” extracts all the matching lines

• “sort” sorts all the lines

grep "some_record_parameters" hl7_transfer.data-file | sort
[2012/02/25/ 9:15] records sent to healthcare-1
[2012/02/28/ 6:15] records sent to healthcare-2
[2012/03/12/ 10:30] records sent to healthcare-3

A Discussion Example

• As the amount of data increases, the process requires more and more resources

• What if hl7_transfer.data-file is 500 GB or bigger?

• What if there are hundreds or thousands of data files?

• What if there are multiple types of data files?

grep "provider 1" hl7_transfer.data-file | sort

• Ignoring the process for a moment, how do we write all the data to disk in the first place?

Need to rethink the process

Distributed File-System – “the cloud”

• Files can be stored across many machines

• Files can be replicated across many machines

• Files can be in a hybrid-cloud model

• The file system is shared transparently

• You simply see the usual file structure

• Opportunity to leverage private and public cloud environments

Map-Reduce – the cloud

• A way of processing large amounts of data across many machines

• Must be able to split up the data into chunks for processing (Map)

• The chunks are recombined after processing (Reduce)

• Requires a constant flow of data from one simple state to another

• Allows for a simple way of breaking down a large task into smaller, manageable tasks

• Increases the available computational power
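The split, map and recombine steps above can be sketched in a few lines of Python. This is only an illustration of the idea on a single machine, mirroring the earlier grep-and-sort example; the function names and the chunk size are assumptions made for the sketch, not part of Hadoop.

```python
import heapq
from itertools import islice


def split_into_chunks(lines, chunk_size):
    """Split an iterable of lines into lists of at most chunk_size lines."""
    iterator = iter(lines)
    while True:
        chunk = list(islice(iterator, chunk_size))
        if not chunk:
            return
        yield chunk


def map_chunk(chunk, pattern):
    """Map step: keep the matching lines of one chunk (the 'grep'), sorted."""
    return sorted(line for line in chunk if pattern in line)


def reduce_chunks(mapped_chunks):
    """Reduce step: merge the sorted partial results into one sorted list."""
    return list(heapq.merge(*mapped_chunks))


def grep_sort(lines, pattern, chunk_size=2):
    """Run the whole pipeline; chunk_size is an arbitrary value for the demo."""
    # Each chunk could be handled by a different machine; here they run in turn.
    mapped = [map_chunk(chunk, pattern)
              for chunk in split_into_chunks(lines, chunk_size)]
    return reduce_chunks(mapped)
```

Because every chunk is processed independently, adding machines adds capacity for the map step, while the reduce step only has to merge results that are already sorted.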

A look at Hadoop

What is Hadoop

• A Map-Reduce framework

• Designed to run applications on clusters of local and remote systems

• HDFS

  • The file system of Hadoop (Hadoop Distributed File System)

  • Designed to access clusters of local and remote systems

Putting the pieces together…

First, we need some code…

Map Reduce

Map

Hadoop streams information on STDIN

Separate each value with a newline (for Hadoop)
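The code shown on the original slide is not available, so the following is a minimal sketch of what a streaming mapper could look like. It assumes log lines worded like the sample output earlier ("records sent to healthcare-1") and emits one tab-separated key and count per line; the marker string and function names are illustrative assumptions.

```python
import sys

# Assumed log wording, taken from the sample output shown earlier.
MARKER = "records sent to "


def map_records(lines):
    """Emit one 'destination<TAB>1' record per matching input line."""
    for line in lines:
        line = line.strip()
        if MARKER in line:
            destination = line.split(MARKER, 1)[1].strip()
            if destination:
                yield f"{destination}\t1"


def main():
    # Hadoop streaming feeds the input on STDIN, one line at a time,
    # and expects each output record on its own line on STDOUT.
    for record in map_records(sys.stdin):
        sys.stdout.write(record + "\n")
```

To use this as a script, end the file with a call to `main()`.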

Reduce

Hadoop streams the data back to us on STDIN

Output the aggregated records
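As with the map step, the original slide code is not available. A reducer along these lines would fit the description, assuming the input consists of well-formed "key<TAB>count" lines and arrives sorted by key, which is how Hadoop streaming delivers it to a reducer. The function names are hypothetical.

```python
import sys
from itertools import groupby


def reduce_records(lines):
    """Sum the counts of consecutive records that share the same key.

    Assumes every non-blank line has the form 'key<TAB>count'.
    """
    pairs = (line.rstrip("\n").split("\t", 1) for line in lines if line.strip())
    for key, group in groupby(pairs, key=lambda pair: pair[0]):
        total = sum(int(count) for _, count in group)
        yield f"{key}\t{total}"


def main():
    # The sorted map output arrives on STDIN; aggregated records go to STDOUT.
    for record in reduce_records(sys.stdin):
        sys.stdout.write(record + "\n")
```

The grouping relies on equal keys being adjacent, so the reducer never needs to hold more than one key's records at a time.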

Sanity Checking

This should work with small data-sets

Command

Results
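The command and results from the original slides are not available. One way to run such a sanity check without a cluster is to imitate the map, sort and reduce pipeline in a single process. The sketch below does that; `run_locally`, `count_mapper` and `count_reducer` are hypothetical stand-ins written for this example.

```python
def run_locally(lines, mapper, reducer):
    """Imitate 'cat sample | mapper | sort | reducer' in a single process."""
    mapped = list(mapper(lines))
    mapped.sort()  # stands in for the sort Hadoop performs between the phases
    return list(reducer(mapped))


def count_mapper(lines):
    """Stand-in mapper: use the last word of each line as the key."""
    for line in lines:
        words = line.split()
        if words:
            yield f"{words[-1]}\t1"


def count_reducer(records):
    """Stand-in reducer: total the counts for each key."""
    totals = {}
    for record in records:
        key, count = record.split("\t")
        totals[key] = totals.get(key, 0) + int(count)
    for key in sorted(totals):
        yield f"{key}\t{totals[key]}"
```

If the small sample produces the expected totals here, the same map and reduce logic is a reasonable candidate for the full data set.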

Push file to “the distributed file system”

Put file on the DFS

Check that the file is in the cloud
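The two steps above map onto the `hadoop fs -put` and `hadoop fs -ls` commands. The sketch below wraps them in Python so the calls can be scripted; the helper names and any paths passed in are assumptions, and it presumes the `hadoop` executable is on the PATH.

```python
import subprocess


def put_command(local_path, dfs_path):
    """Argument list for copying a local file into the distributed file system."""
    return ["hadoop", "fs", "-put", local_path, dfs_path]


def list_command(dfs_path):
    """Argument list for listing a path in the distributed file system."""
    return ["hadoop", "fs", "-ls", dfs_path]


def push_and_verify(local_path, dfs_path, run=subprocess.run):
    """Upload the file, then list it; check=True raises if either step fails."""
    run(put_command(local_path, dfs_path), check=True)
    return run(list_command(dfs_path), check=True,
               capture_output=True, text=True)
```

The `run` parameter exists only so the helper can be exercised without a cluster, for example by passing a function that records its arguments.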

Running in “the distributed environment”

Call the Hadoop streaming command

Pass the appropriate parameters
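The exact command from the slide is not available. A streaming job is typically launched with `hadoop jar` and the streaming jar, followed by the `-input`, `-output`, `-mapper`, `-reducer` and `-file` options. The sketch below only assembles that argument list; the jar location differs between installations, so it is left as a parameter, and the function name is an assumption.

```python
def streaming_command(streaming_jar, input_path, output_path, mapper, reducer):
    """Build the argument list for a Hadoop streaming job.

    streaming_jar is the path to the streaming jar of the local installation;
    mapper and reducer are the script files to ship with the job.
    """
    return [
        "hadoop", "jar", streaming_jar,
        "-input", input_path,      # data already stored on the DFS
        "-output", output_path,    # DFS directory for results; must not exist yet
        "-mapper", mapper,
        "-reducer", reducer,
        "-file", mapper,           # ship the scripts to the worker nodes
        "-file", reducer,
    ]
```

The resulting list can be handed to `subprocess.run(..., check=True)` on a machine where Hadoop is installed.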

Checking Status

• Cluster Summary

• Running Jobs

• Completed Jobs

• Failed Jobs

• Job Statistics

• Detailed Job Logs

Checking Distributed Cluster Health

• List Data-Nodes

• Dead Nodes

• Node Heart-beat information

• Failed Jobs

• Job Statistics

• Detailed Job Logs

Conclusion

• A different paradigm for solving large-scale problems

• Designed to solve specific problems that can be defined in a focused map-reduce manner
