
Apache Hadoop MapReduce is a framework for processing large data sets in parallel across a Hadoop cluster. Data analysis uses a two-step map and reduce process.

The job configuration supplies the map and reduce analysis functions, and the Hadoop framework provides the scheduling, distribution, and parallelization services.

The top-level unit of work in MapReduce is a job. A job usually has a map and a reduce phase, though the reduce phase can be omitted. For example, consider a MapReduce job that counts the number of times each word is used across a set of documents. By default, the MapReduce framework gets input data from the Hadoop Distributed File System (HDFS).

The map phase counts the words in each document, then the reduce phase aggregates the per-document data into word counts spanning the entire collection. During the map phase, the input data is divided into input splits for analysis by map tasks running in parallel across the Hadoop cluster. The reduce phase uses results from map tasks as input to a set of parallel reduce tasks. The reduce tasks consolidate the data into final results. By default, the MapReduce framework stores results in HDFS.

Although the reduce phase depends on output from the map phase, map and reduce processing is not necessarily sequential. That is, reduce tasks can begin as soon as any map task completes. It is not necessary for all map tasks to complete before any reduce task can begin.

MapReduce operates on key-value pairs. Conceptually, a MapReduce job takes a set of input key-value pairs and produces a set of output key-value pairs by passing the data through map and reduce functions.

The map tasks produce an intermediate set of key-value pairs that the reduce tasks use as input. At a high level, the job progresses from input key-value pairs, through the intermediate key-value pairs produced by the map tasks, to the output key-value pairs produced by the reduce tasks.

Though each set of key-value pairs is homogeneous, the key-value pairs in each step need not have the same type. For example, the key-value pairs in the input set (KV1) can be (string, string) pairs, with the map phase producing (string, integer) pairs as intermediate results (KV2), and the reduce phase producing (integer, string) pairs for the final results (KV3). The keys in the map output pairs need not be unique. Between the map processing and the reduce processing, a shuffle step sorts all map output values with the same key into a single reduce input (key, value-list) pair, where the 'value' is a list of all values sharing the same key. Thus, the input to a reduce task is actually a set of (key, value-list) pairs.

The key and value types at each stage determine the interfaces to your map and reduce functions. Therefore, before coding a job, determine the data types needed at each stage in the map-reduce process. For example:

- Choose the reduce output key and value types that best represent the desired outcome.
- Choose the map input key and value types best suited to represent the input data from which to derive the final result.
- Determine the transformation necessary to get from the map input to the reduce output, and choose the intermediate map output/reduce input key-value types to match.

Control the MapReduce job characteristics through configuration properties. The job configuration specifies:

- how to gather input
- the types of the input and output key-value pairs for each stage
- the map and reduce functions
- how and where to store the final results
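As an illustrative sketch (not taken from the original notes), a word-count driver might set exactly these items as follows; the class names WordCountDriver, WordCountMapper, and WordCountReducer are assumed names, with the mapper and reducer sketched later in this example.

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
  public static void main(String[] args) throws Exception {
    Job job = new Job();                                     // the job configuration
    job.setJarByClass(WordCountDriver.class);                // locate the JAR containing this class

    FileInputFormat.addInputPath(job, new Path(args[0]));    // how to gather input
    FileOutputFormat.setOutputPath(job, new Path(args[1]));  // where to store the final results

    job.setMapperClass(WordCountMapper.class);               // the map function
    job.setReducerClass(WordCountReducer.class);             // the reduce function

    job.setOutputKeyClass(Text.class);                       // output key type: the word
    job.setOutputValueClass(IntWritable.class);              // output value type: the count

    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}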

This example demonstrates the basic MapReduce concept by calculating the number of occurrences of each word in a set of text files. Recall that MapReduce input data is divided into input splits, and the splits are further divided into input key-value pairs. In this example, the input data set is two documents, document1 and document2. The InputFormat subclass divides the data set into one split per document, for a total of 2 splits.

A (line number, text) key-value pair is generated for each line in an input document. The map function discards the line number and produces a per-line (word, count) pair for each word in the input line. The reduce phase produces (word, count) pairs representing aggregated word counts across all the input documents. Given the input data described above, the job progresses from per-line map output, through the shuffle, to the aggregated reduce output.

The output from the map phase contains multiple key-value pairs with the same key: the 'oats' and 'eat' keys appear twice. Recall that the MapReduce framework consolidates all values with the same key before entering the reduce phase, so the input to reduce is actually a set of (key, value-list) pairs. Therefore, the full progression runs from the raw map output, through the shuffle's consolidation of values by key, to the final reduced results.
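A sketch of what the word-count map and reduce functions might look like in Java; the class names and the whitespace tokenization are illustrative rather than taken from the original notes.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map: discard the line number, emit a (word, 1) pair for every word on the line.
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
  private static final IntWritable ONE = new IntWritable(1);
  private final Text word = new Text();

  @Override
  protected void map(LongWritable lineNumber, Text line, Context context)
      throws IOException, InterruptedException {
    for (String token : line.toString().split("\\s+")) {
      if (!token.isEmpty()) {
        word.set(token);
        context.write(word, ONE);     // per-line (word, count) pair
      }
    }
  }
}

// Reduce: sum the counts for each word across all documents.
class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
  @Override
  protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
      throws IOException, InterruptedException {
    int sum = 0;
    for (IntWritable count : counts) {
      sum += count.get();
    }
    context.write(word, new IntWritable(sum));   // aggregated (word, count)
  }
}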

Consider the following input data for your MapReduce program:

Welcome to Hadoop Class
Hadoop is good
Hadoop is bad

The data goes through the following phases.

Input Splits: Input to a MapReduce job is divided into fixed-size pieces called input splits. An input split is a chunk of the input that is consumed by a single map.

Mapping: This is the very first phase in the execution of a map-reduce program. In this phase, the data in each split is passed to a mapping function to produce output values. In our example, the job of the mapping phase is to count the number of occurrences of each word from the input splits and prepare a list in the form of <word, frequency>.

Shuffling: This phase consumes the output of the Mapping phase. Its task is to consolidate the relevant records from the Mapping phase output. In our example, identical words are grouped together along with their respective frequencies.

Reducing: In this phase, output values from the Shuffling phase are aggregated. This phase combines values from the Shuffling phase and returns a single output value. In short, this phase summarizes the complete data set.
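A worked trace of those phases for the three lines above (the ordering of the intermediate pairs is illustrative; only the final counts are fixed):

Splits (one per line):
  "Welcome to Hadoop Class" | "Hadoop is good" | "Hadoop is bad"

Mapping (per split):
  (Welcome,1) (to,1) (Hadoop,1) (Class,1)
  (Hadoop,1) (is,1) (good,1)
  (Hadoop,1) (is,1) (bad,1)

Shuffling (group values by key):
  (Welcome,[1]) (to,[1]) (Hadoop,[1,1,1]) (Class,[1]) (is,[1,1]) (good,[1]) (bad,[1])

Reducing (sum per key):
  (Welcome,1) (to,1) (Hadoop,3) (Class,1) (is,2) (good,1) (bad,1)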

Understanding the MapReduce Job Life Cycle

This section briefly sketches the life cycle of a MapReduce job and the roles of the primary actors in the life cycle. Though other configurations are possible, a common Hadoop cluster configuration is a single master node where the Job Tracker runs, and multiple worker nodes, each running a Task Tracker. The Job Tracker node can also be a worker node.

When the user submits a MapReduce job to Hadoop:

- The local Job Client prepares the job for submission and hands it off to the Job Tracker.
- The Job Tracker schedules the job and distributes the map work among the Task Trackers for parallel processing.
- Each Task Tracker spawns a Map Task. The Job Tracker receives progress information from the Task Trackers.
- As map results become available, the Job Tracker distributes the reduce work among the Task Trackers for parallel processing.
- Each Task Tracker spawns a Reduce Task to perform the work. The Job Tracker receives progress information from the Task Trackers.


Not all map tasks have to complete before reduce tasks begin running. Reduce tasks can begin as soon as map tasks begin completing, so the map and reduce steps often overlap.

Job Tracker

The Job Tracker is responsible for scheduling jobs, dividing a job into map and reduce tasks, distributing map and reduce tasks among worker nodes, task failure recovery, and tracking the job status. When preparing to run a job, the Job Tracker:

- Fetches input splits from the shared location where the Job Client placed the information.
- Creates a map task for each split.
- Assigns each map task to a Task Tracker (worker node).

The Job Tracker monitors the health of the Task Trackers and the progress of the job. As map tasks complete and results become available, the Job Tracker:

- Creates reduce tasks, up to the maximum enabled by the job configuration.
- Assigns each map result partition to a reduce task.
- Assigns each reduce task to a Task Tracker.

A job is complete when all map and reduce tasks successfully complete, or, if there is no reduce step, when all map tasks successfully complete.

Task Tracker

A Task Tracker manages the tasks of one worker node and reports status to the Job Tracker. Often, the Task Tracker runs on the associated worker node, but it is not required to be on the same host. When the Job Tracker assigns a map or reduce task to a Task Tracker, the Task Tracker:

- Fetches job resources locally.
- Spawns a child JVM on the worker node to execute the map or reduce task.
- Reports status to the Job Tracker.

The task spawned by the Task Tracker runs the job's map or reduce functions.

Map Task

The Hadoop MapReduce framework creates a map task to process each input split. The map task:

- Uses the InputFormat to fetch the input data locally and create input key-value pairs.
- Applies the job-supplied map function to each key-value pair.
- Performs local sorting and aggregation of the results.
- If the job includes a Combiner, runs the Combiner for further aggregation.
- Stores the results locally, in memory and on the local file system.
- Communicates progress and status to the Task Tracker.

Map task results undergo a local sort by key to prepare the data for consumption by reduce tasks. If a Combiner is configured for the job, it also runs in the map task.

A Combiner consolidates the data in an application-specific way, reducing the amount of data that must be transferred to reduce tasks. For example, a Combiner might compute a local maximum value for a key and discard the rest of the values. When a map task notifies the Task Tracker of completion, the Task Tracker notifies the Job Tracker. The Job Tracker then makes the results available to reduce tasks.

Reduce Task

The reduce phase aggregates the results from the map phase into final results. Usually, the final result set is smaller than the input set, but this is application dependent. The reduction is carried out by parallel reduce tasks. The reduce input keys and values need not have the same type as the output keys and values.

Java MapReduce

Having run through how the MapReduce program works, the next step is to express it in code. We need three things: a map function, a reduce function, and some code to run the job. The map function is represented by the Mapper class, which declares an abstract map() method.

The Mapper class is a generic type, with four formal type parameters that specify the input key, input value, output key, and output value types of the map function. For the present example, the input key is a long integer offset, the input value is a line of text, the output key is a year, and the output value is an air temperature (an integer). Rather than use built-in Java types, Hadoop provides its own set of basic types that are optimized for network serialization. These are found in the org.apache.hadoop.io package. Here we use LongWritable, which corresponds to a Java Long, Text (like Java String), and IntWritable (like Java Integer).

The map() method is passed a key and a value. We convert the Text value containing the line of input into a Java String, then use its substring() method to extract the columns we are interested in. The map() method also provides an instance of Context to write the output to. In this case, we write the year as a Text object (since we are just using it as a key), and the temperature is wrapped in an IntWritable. We write an output record only if the temperature is present and the quality code indicates the temperature reading is OK.
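A sketch of such a mapper. The column offsets, the MISSING sentinel, and the quality-code pattern below assume a fixed-width weather record layout and are illustrative; they must match whatever record format the job actually reads.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class MaxTemperatureMapper
    extends Mapper<LongWritable, Text, Text, IntWritable> {

  private static final int MISSING = 9999;   // sentinel used for absent readings

  @Override
  public void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    String line = value.toString();
    String year = line.substring(15, 19);    // illustrative fixed-width offsets
    int airTemperature;
    if (line.charAt(87) == '+') {            // signed temperature field
      airTemperature = Integer.parseInt(line.substring(88, 92));
    } else {
      airTemperature = Integer.parseInt(line.substring(87, 92));
    }
    String quality = line.substring(92, 93);
    // Emit (year, temperature) only for present, good-quality readings.
    if (airTemperature != MISSING && quality.matches("[01459]")) {
      context.write(new Text(year), new IntWritable(airTemperature));
    }
  }
}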


The reduce function is similarly defined using a Reducer. Four formal type parameters are used to specify the input and output types, this time for the reduce function. The input types of the reduce function must match the output types of the map function: Text and IntWritable. And in this case, the output types of the reduce function are Text and IntWritable, for a year and its maximum temperature, which we find by iterating through the temperatures and comparing each with a record of the highest found so far.
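A sketch of that reducer, following the description above (the class name is illustrative):

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class MaxTemperatureReducer
    extends Reducer<Text, IntWritable, Text, IntWritable> {

  @Override
  public void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {
    int maxValue = Integer.MIN_VALUE;
    // Track the highest temperature seen so far for this year.
    for (IntWritable value : values) {
      maxValue = Math.max(maxValue, value.get());
    }
    context.write(key, new IntWritable(maxValue));
  }
}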

The third piece of code runs the MapReduce job. A Job object forms the specification of the job. It gives you control over how the job is run. When we run this job on a Hadoop cluster, we will package the code into a JAR file (which Hadoop will distribute around the cluster). Rather than explicitly specify the name of the JAR file, we can pass a class in the Job's setJarByClass() method, which Hadoop will use to locate the relevant JAR file by looking for the JAR file containing this class.

Having constructed a Job object, we specify the input and output paths. An input path is specified by calling the static addInputPath() method on FileInputFormat, and it can be a single file, a directory (in which case the input forms all the files in that directory), or a file pattern. As the name suggests, addInputPath() can be called more than once to use input from multiple paths. The output path (of which there is only one) is specified by the static setOutputPath() method on FileOutputFormat. It specifies a directory where the output files from the reducer functions are written. The directory shouldn't exist before running the job, as Hadoop will complain and not run the job. This precaution is to prevent data loss.

The setOutputKeyClass() and setOutputValueClass() methods control the output types for the map and the reduce functions, which are often the same, as they are in our case. If they are different, then the map output types can be set using the methods setMapOutputKeyClass() and setMapOutputValueClass(). The input types are controlled via the input format, which we have not explicitly set since we are using the default TextInputFormat.

After setting the classes that define the map and reduce functions, we are ready to run the job. The waitForCompletion() method on Job submits the job and waits for it to finish. The method's boolean argument is a verbose flag, so in this case the job writes information about its progress to the console. The return value of the waitForCompletion() method is a boolean indicating success (true) or failure (false), which we translate into the program's exit code of 0 or 1.
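Putting those pieces together, a driver along the lines described might look like this; the class name MaxTemperature and the command-line argument layout are illustrative.

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MaxTemperature {
  public static void main(String[] args) throws Exception {
    if (args.length != 2) {
      System.err.println("Usage: MaxTemperature <input path> <output path>");
      System.exit(-1);
    }
    Job job = new Job();
    job.setJarByClass(MaxTemperature.class);                 // used to locate the JAR
    job.setJobName("Max temperature");

    FileInputFormat.addInputPath(job, new Path(args[0]));    // may be called repeatedly
    FileOutputFormat.setOutputPath(job, new Path(args[1]));  // must not already exist

    job.setMapperClass(MaxTemperatureMapper.class);
    job.setReducerClass(MaxTemperatureReducer.class);

    job.setOutputKeyClass(Text.class);                       // year
    job.setOutputValueClass(IntWritable.class);              // temperature

    System.exit(job.waitForCompletion(true) ? 0 : 1);        // verbose = true
  }
}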


Data Flow

First, some terminology. A MapReduce job is a unit of work that the client wants to be performed: it consists of the input data, the MapReduce program, and configuration information. Hadoop runs the job by dividing it into tasks, of which there are two types: map tasks and reduce tasks.

There are two types of nodes that control the job execution process: a jobtracker and a number of tasktrackers. The jobtracker coordinates all the jobs run on the system by scheduling tasks to run on tasktrackers. Tasktrackers run tasks and send progress reports to the jobtracker, which keeps a record of the overall progress of each job. If a task fails, the jobtracker can reschedule it on a different tasktracker.

Hadoop divides the input to a MapReduce job into fixed-size pieces called input splits, or just splits. Hadoop creates one map task for each split, which runs the user-defined map function for each record in the split. Having many splits means the time taken to process each split is small compared to the time to process the whole input. So if we are processing the splits in parallel, the processing is better load-balanced if the splits are small, since a faster machine will be able to process proportionally more splits over the course of the job than a slower machine. Even if the machines are identical, failed processes or other jobs running concurrently make load balancing desirable, and the quality of the load balancing increases as the splits become more fine-grained. On the other hand, if splits are too small, then the overhead of managing the splits and of map task creation begins to dominate the total job execution time. For most jobs, a good split size tends to be the size of an HDFS block, 64 MB by default, although this can be changed for the cluster (for all newly created files) or specified when each file is created.

Hadoop does its best to run the map task on a node where the input data resides in HDFS. This is called the data locality optimization, since it doesn't use valuable cluster bandwidth. Sometimes, however, all three nodes hosting the HDFS block replicas for a map task's input split are running other map tasks, so the job scheduler will look for a free map slot on a node in the same rack as one of the blocks. Very occasionally even this is not possible, so an off-rack node is used, which results in an inter-rack network transfer. The three possibilities are illustrated below.


Data-local (a), rack-local (b), and off-rack (c) map tasks

It should now be clear why the optimal split size is the same as the block size: it is the largest size of input that can be guaranteed to be stored on a single node. If the split spanned two blocks, it would be unlikely that any HDFS node stored both blocks, so some of the split would have to be transferred across the network to the node running the map task, which is clearly less efficient than running the whole map task using local data.

Map tasks write their output to the local disk, not to HDFS. Why is this? Map output is intermediate output: it's processed by reduce tasks to produce the final output, and once the job is complete the map output can be thrown away. So storing it in HDFS, with replication, would be overkill. If the node running the map task fails before the map output has been consumed by the reduce task, then Hadoop will automatically rerun the map task on another node to re-create the map output.

Reduce tasks don't have the advantage of data locality: the input to a single reduce task is normally the output from all mappers. In the present example, we have a single reduce task that is fed by all of the map tasks. Therefore, the sorted map outputs have to be transferred across the network to the node where the reduce task is running, where they are merged and then passed to the user-defined reduce function. The output of the reduce is normally stored in HDFS for reliability. As explained in Chapter 3, for each HDFS block of the reduce output, the first replica is stored on the local node, with other replicas being stored on off-rack nodes. Thus, writing the reduce output does consume network bandwidth, but only as much as a normal HDFS write pipeline consumes.

The whole data flow with a single reduce task is illustrated in Figure 2-3. The dotted boxes indicate nodes, the light arrows show data transfers on a node, and the heavy arrows show data transfers between nodes.

MapReduce data flow with a single reduce task

The number of reduce tasks is not governed by the size of the input, but is specified independently.

When there are multiple reducers, the map tasks partition their output, each creating one partition for each reduce task. There can be many keys (and their associated values) in each partition, but the records for any given key are all in a single partition. The partitioning can be controlled by a user-defined partitioning function, but normally the default partitioner, which buckets keys using a hash function, works very well. The data flow for the general case of multiple reduce tasks is illustrated in Figure 2-4. This diagram makes it clear why the data flow between map and reduce tasks is colloquially known as "the shuffle," as each reduce task is fed by many map tasks.
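For reference, hash-based partitioning as described above amounts to something like the following sketch (HashStylePartitioner is an illustrative name, not the library's HashPartitioner source):

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Buckets keys by hash so that all records for a given key land in one partition.
public class HashStylePartitioner extends Partitioner<Text, IntWritable> {
  @Override
  public int getPartition(Text key, IntWritable value, int numReduceTasks) {
    // Mask off the sign bit, then take the remainder modulo the reducer count.
    return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
  }
}

A custom partitioner would be plugged into the job with job.setPartitionerClass(HashStylePartitioner.class).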


MapReduce data flow with multiple reduce tasks

It's also possible to have zero reduce tasks.

MapReduce data flow with no reduce tasks

Many MapReduce jobs are limited by the bandwidth available on the cluster, so it pays to minimize the data transferred between map and reduce tasks. Hadoop allows the user to specify a combiner function to be run on the map output; the combiner function's output forms the input to the reduce function. Since the combiner function is an optimization, Hadoop does not provide a guarantee of how many times it will call it for a particular map output record, if at all. In other words, calling the combiner function zero, one, or many times should produce the same output from the reducer.

The contract for the combiner function constrains the type of function that may be used. This is best illustrated with an example. Suppose that for the maximum temperature example, readings for the year 1950 were processed by two maps (because they were in different splits). Imagine the first map produced the output:

(1950, 0)
(1950, 20)
(1950, 10)

And the second produced:

(1950, 25)
(1950, 15)

The reduce function would be called with a list of all the values:

(1950, [0, 20, 10, 25, 15])

with output:

(1950, 25)

since 25 is the maximum value in the list. We could use a combiner function that, just like the reduce function, finds the maximum temperature for each map output. The reduce would then be called with:

(1950, [20, 25])

and the reduce would produce the same output as before. More succinctly, we may express the function calls on the temperature values in this case as follows:

max(0, 20, 10, 25, 15) = max(max(0, 20, 10), max(25, 15)) = max(20, 25) = 25

The combiner function doesn't replace the reduce function. But it can help cut down the amount of data shuffled between the maps and the reduces, and for this reason alone it is always worth considering whether you can use a combiner function in your MapReduce job. This is the point of MapReduce: it scales to the size of your data and the size of your hardware.
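Wiring the combiner into the maximum-temperature driver sketched earlier is one extra line. Since finding a maximum is commutative and associative, the reducer class itself can serve as the combiner (a sketch under that assumption):

// In the driver, alongside the mapper and reducer settings:
job.setMapperClass(MaxTemperatureMapper.class);
job.setCombinerClass(MaxTemperatureReducer.class);  // reuse the reducer as the combiner
job.setReducerClass(MaxTemperatureReducer.class);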

Hadoop Streaming

Hadoop provides an API to MapReduce that allows you to write your map and reduce functions in languages other than Java. Hadoop Streaming uses Unix standard streams as the interface between Hadoop and your program, so you can use any language that can read standard input and write to standard output to write your MapReduce program.

Streaming is naturally suited for text processing (although, as of version 0.21.0, it can handle binary streams, too), and when used in text mode, it has a line-oriented view of data. Map input data is passed over standard input to your map function, which processes it line by line and writes lines to standard output. A map output key-value pair is written as a single tab-delimited line. Input to the reduce function is in the same format, a tab-separated key-value pair, passed over standard input. The reduce function reads lines from standard input, which the framework guarantees are sorted by key, and writes its results to standard output. In short, Streaming supports any programming language that can read from standard input and write to standard output.

Hadoop Pipes

Hadoop Pipes is the name of the C++ interface to Hadoop MapReduce. Unlike Streaming, which uses standard input and output to communicate with the map and reduce code, Pipes uses sockets as the channel over which the tasktracker communicates with the process running the C++ map or reduce function. JNI is not used.

The application links against the Hadoop C++ library, which is a thin wrapper for communicating with the tasktracker child process. The map and reduce functions are defined by extending the Mapper and Reducer classes defined in the HadoopPipes namespace and providing implementations of the map() and reduce() methods in each case. These methods take a context object (of type MapContext or ReduceContext), which provides the means for reading input and writing output, as well as accessing job configuration information via the JobConf class. The processing in this example is very similar to the Java equivalent. Unlike the Java interface, keys and values in the C++ interface are byte buffers, represented as Standard Template Library (STL) strings. This makes the interface simpler, although it does put a slightly greater burden on the application developer, who has to convert to and from richer domain-level types.


The Configuration API

Components in Hadoop are configured using Hadoop's own configuration API. An instance of the Configuration class (found in the org.apache.hadoop.conf package) represents a collection of configuration properties and their values. Each property is named by a String, and the type of a value may be one of several types, including Java primitives such as boolean, int, long, and float, and other useful types such as String, Class, java.io.File, and collections of Strings.

<configuration>
  <property>
    <name>color</name>
    <value>yellow</value>
    <description>Color</description>
  </property>
</configuration>
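Reading such a file through the Configuration API might look like the following sketch. The filename configuration-1.xml is an assumed name for the snippet above, and "size" is a hypothetical second property used only to show the default-value fallback.

import org.apache.hadoop.conf.Configuration;

public class ConfigExample {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    conf.addResource("configuration-1.xml");     // load the XML shown above from the classpath
    System.out.println(conf.get("color"));       // prints "yellow"
    System.out.println(conf.getInt("size", 0));  // prints 0: falls back to the default when unset
  }
}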

Configure the IDE

When developing Hadoop applications, it is common to switch between running the application locally and running it on a cluster. In fact, you may have several clusters you work with, or you may have a local "pseudo-distributed" cluster that you like to test on (a pseudo-distributed cluster is one whose daemons all run on the local machine).

GenericOptionsParser, Tool, and ToolRunner

Hadoop comes with a few helper classes for making it easier to run jobs from the command line. GenericOptionsParser is a class that interprets common Hadoop command-line options and sets them on a Configuration object for your application to use as desired. You don't usually use GenericOptionsParser directly, as it's more convenient to implement the Tool interface and run your application with the ToolRunner, which uses GenericOptionsParser internally. The options that GenericOptionsParser and ToolRunner support are listed below; a minimal Tool implementation is sketched after the table.

Table 5-1. GenericOptionsParser and ToolRunner options

-D property=value
    Sets the given Hadoop configuration property to the given value. Overrides any default or site properties in the configuration, and any properties set via the -conf option.

-conf filename ...
    Adds the given files to the list of resources in the configuration. This is a convenient way to set site properties or to set a number of properties at once.

-fs uri
    Sets the default filesystem to the given URI. Shortcut for -D fs.default.name=uri.

-jt host:port
    Sets the jobtracker to the given host and port. Shortcut for -D mapred.job.tracker=host:port.

-files file1,file2,...
    Copies the specified files from the local filesystem (or any filesystem if a scheme is specified) to the shared filesystem used by the jobtracker (usually HDFS) and makes them available to MapReduce programs in the task's working directory.

-archives archive1,archive2,...
    Copies the specified archives from the local filesystem (or any filesystem if a scheme is specified) to the shared filesystem used by the jobtracker (usually HDFS), unarchives them, and makes them available to MapReduce programs in the task's working directory.
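A minimal Tool implementation might look like the following sketch; ConfigurationPrinter is an illustrative class name, and "color" is the hypothetical property from the earlier XML example.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

// Extending Configured supplies getConf()/setConf(); ToolRunner parses the
// generic Hadoop options (-D, -conf, -fs, -jt, ...) before calling run().
public class ConfigurationPrinter extends Configured implements Tool {

  @Override
  public int run(String[] args) throws Exception {
    Configuration conf = getConf();
    // Print a property that the generic options may have set, e.g.
    //   hadoop ConfigurationPrinter -D color=yellow
    System.out.println("color = " + conf.get("color"));
    return 0;
  }

  public static void main(String[] args) throws Exception {
    int exitCode = ToolRunner.run(new ConfigurationPrinter(), args);
    System.exit(exitCode);
  }
}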


Writing a Unit Test

The map and reduce functions in MapReduce are easy to test in isolation, which is a consequence of their functional style. For known inputs, they produce known outputs. However, since outputs are written to a Context (or an OutputCollector in the old API), rather than simply being returned from the method call, the Context needs to be replaced with a mock so that its outputs can be verified. There are several Java mock object frameworks that can help build mocks; here we use Mockito, which is noted for its clean syntax, although any mock framework should work just as well. All of the tests can be run from within an IDE.
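A sketch of such a test, exercising the word-count mapper sketched earlier with a mocked Context. The test class name and sample line are illustrative, and depending on the Hadoop version the Context type may need a different mocking approach (MRUnit is a common alternative).

import static org.mockito.Mockito.mock;
import static org.mockito.Mockito.verify;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.junit.Test;

public class WordCountMapperTest {

  @Test
  public void mapperEmitsOneCountPerWord() throws Exception {
    WordCountMapper mapper = new WordCountMapper();

    // Mock the Context so the mapper's output can be verified without a cluster.
    @SuppressWarnings("unchecked")
    Mapper<LongWritable, Text, Text, IntWritable>.Context context =
        mock(Mapper.Context.class);

    mapper.map(new LongWritable(1), new Text("Hadoop is good"), context);

    // Each word on the line should have been written with a count of 1.
    verify(context).write(new Text("Hadoop"), new IntWritable(1));
    verify(context).write(new Text("is"), new IntWritable(1));
    verify(context).write(new Text("good"), new IntWritable(1));
  }
}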

Testing the Driver

Apart from the flexible configuration options offered by making your application implement Tool, you also make it more testable because it allows you to inject an arbitrary Configuration. You can take advantage of this to write a test that uses a local job runner to run a job against known input data, which checks that the output is as expected. There are two approaches to doing this. The first is to use the local job runner and run the job against a test file on the local filesystem.
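A sketch of the first approach. It assumes a MaxTemperatureDriver that implements Tool (and extends Configured), and the input and output paths are illustrative.

import static org.junit.Assert.assertEquals;

import org.apache.hadoop.conf.Configuration;
import org.junit.Test;

public class MaxTemperatureDriverTest {

  @Test
  public void runsAgainstLocalFilesystemWithLocalJobRunner() throws Exception {
    Configuration conf = new Configuration();
    conf.set("fs.default.name", "file:///");   // local filesystem instead of HDFS
    conf.set("mapred.job.tracker", "local");   // local job runner instead of a cluster

    // Assumed: MaxTemperatureDriver implements Tool, so a Configuration
    // can be injected and run() invoked directly.
    MaxTemperatureDriver driver = new MaxTemperatureDriver();
    driver.setConf(conf);

    int exitCode = driver.run(new String[] {
        "input/sample.txt",   // illustrative local input file
        "output"              // illustrative local output directory
    });
    assertEquals(0, exitCode);
    // A fuller test would also compare the files under "output" with expected results.
  }
}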

The second way of testing the driver is to run it using a "mini-" cluster. Hadoop has a pair of testing classes, called MiniDFSCluster and MiniMRCluster, which provide a programmatic way of creating in-process clusters. Unlike the local job runner, these allow testing against the full HDFS and MapReduce machinery. Bear in mind, too, that tasktrackers in a mini-cluster launch separate JVMs to run tasks in, which can make debugging more difficult. Mini-clusters are used extensively in Hadoop's own automated test suite, but they can be used for testing user code, too. Hadoop's ClusterMapReduceTestCase abstract class provides a useful base for writing such a test, handles the details of starting and stopping the in-process HDFS and MapReduce clusters in its setUp() and tearDown() methods, and generates a suitable configuration object that is set up to work with them. Subclasses need only populate data in HDFS (perhaps by copying from a local file), run a MapReduce job, and then confirm the output is as expected. Refer to the MaxTemperatureDriverMiniTest class in the example code that comes with this book for the listing.

Running on a Cluster

Packaging

We don't need to make any modifications to the program to run on a cluster rather than on a single machine, but we do need to package the program as a JAR file to send to the cluster.

Launching a Job

To launch the job, we need to run the driver, specifying the cluster that we want to run the job on with the -conf option.

Job, Task, and Task Attempt IDs

The format of a job ID is composed of the time that the jobtracker (not the job) started and an incrementing counter maintained by the jobtracker to uniquely identify the job to that instance of the jobtracker.


Tasks belong to a job, and their IDs are formed by replacing the job prefix of a job ID with a task prefix, and adding a suffix to identify the task within the job. For example, task_200904110811_0002_m_000003 is the fourth (000003; task IDs are 0-based) map (m) task of the job with ID job_200904110811_0002. The task IDs are created for a job when it is initialized, so they do not necessarily dictate the order in which the tasks will be executed.

Tasks may be executed more than once, due to failure or speculative execution, so to identify different instances of a task execution, task attempts are given unique IDs on the jobtracker. For example, attempt_200904110811_0002_m_000003_0 is the first (0; attempt IDs are 0-based) attempt at running task task_200904110811_0002_m_000003. Task attempts are allocated during the job run as needed, so their ordering represents the order in which they were created for tasktrackers to run. The final count in the task attempt ID is incremented by 1,000 if the job is restarted after the jobtracker is restarted and recovers its running jobs (although this behavior is disabled by default).

Job History

Job history refers to the events and configuration for a completed job. It is retained whether the job was successful or not, in an attempt to provide interesting information for the user running a job. Job history files are stored on the local filesystem of the jobtracker in a history subdirectory of the logs directory. It is possible to set the location to an arbitrary Hadoop filesystem via the hadoop.job.history.location property. The jobtracker's history files are kept for 30 days before being deleted by the system.

A second copy is also stored for the user in the _logs/history subdirectory of the job's output directory. This location may be overridden by setting hadoop.job.history.user.location. By setting it to the special value none, no user job history is saved, although job history is still saved centrally. A user's job history files are never deleted by the system. The history log includes job, task, and attempt events, all of which are stored in a plain-text file. The history for a particular job may be viewed through the web UI, or via the command line, using hadoop job -history (which you point at the job's output directory).


Debugging a Job

The time-honored way of debugging programs is via print statements, and this is certainly possible in Hadoop. However, there are complications to consider: with programs running on tens, hundreds, or thousands of nodes, how do we find and examine the output of the debug statements, which may be scattered across these nodes? For this particular case, where we are looking for (what we think is) an unusual case, we can use a debug statement to log to standard error, in conjunction with a message to update the task's status message to prompt us to look in the error log. The web UI makes this easy, as we will see.

We also create a custom counter to count the total number of records with implausible temperatures in the whole dataset. This gives us valuable information about how to deal with the condition; if it turns out to be a common occurrence, then we might need to learn more about the condition and how to extract the temperature in these cases, rather than simply dropping the record. In fact, when trying to debug a job, you should always ask yourself if you can use a counter to get the information you need to find out what's happening. Even if you need to use logging or a status message, it may be useful to use a counter to gauge the extent of the problem.

If the amount of log data you produce in the course of debugging is large, then you've got a couple of options. The first is to write the information to the map's output, rather than to standard error, for analysis and aggregation by the reduce. This approach usually necessitates structural changes to your program, so start with the other techniques first. Alternatively, you can write a program (in MapReduce, of course) to analyze the logs produced by your job.
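A sketch of how a mapper might combine a counter, a status message, and a debug line for implausible temperatures. The enum name, threshold, messages, and column offsets are illustrative, not taken from the original notes.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class DebuggingMaxTemperatureMapper
    extends Mapper<LongWritable, Text, Text, IntWritable> {

  // Custom counter; counts are aggregated across all map tasks in the job.
  enum Temperature { OVER_100 }

  @Override
  public void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    String line = value.toString();
    String year = line.substring(15, 19);          // illustrative offsets
    int airTemperature;
    if (line.charAt(87) == '+') {
      airTemperature = Integer.parseInt(line.substring(88, 92));
    } else {
      airTemperature = Integer.parseInt(line.substring(87, 92));
    }

    if (airTemperature > 1000) {   // implausible reading (temperatures are in tenths of a degree)
      System.err.println("Temperature over 100 degrees for input: " + value);  // debug to stderr
      context.setStatus("Detected possibly corrupt record: see logs.");        // visible in the web UI
      context.getCounter(Temperature.OVER_100).increment(1);                   // custom counter
    } else {
      context.write(new Text(year), new IntWritable(airTemperature));
    }
  }
}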


Remote Debugging

When a task fails and there is not enough information logged to diagnose the error, you may want to resort to running a debugger for that task. This is hard to arrange when running the job on a cluster, as you don't know which node is going to process which part of the input, so you can't set up your debugger ahead of the failure. However, there are a few other options available:

- Reproduce the failure locally. Often the failing task fails consistently on a particular input. You can try to reproduce the problem locally by downloading the file that the task is failing on and running the job locally, possibly using a debugger such as Java's VisualVM.
- Use JVM debugging options.
- Use task profiling. Java profilers give a lot of insight into the JVM, and Hadoop provides a mechanism to profile a subset of the tasks in a job.
- Use IsolationRunner. Older versions of Hadoop provided a special task runner called IsolationRunner that could rerun failed tasks in situ on the cluster.

In some cases it's useful to keep the intermediate files for a failed task attempt for later inspection, particularly if supplementary dump or profile files are created in the task's working directory. You can keep the intermediate files for successful tasks, too, which may be handy if you want to examine a task that isn't failing. To examine the intermediate files, log into the node that the task failed on and look for the directory for that task attempt.

Fine Tuning

After a job is working, the question many developers ask is, "Can I make it run faster?" There are a few Hadoop-specific "usual suspects" that are worth checking to see if they are responsible for a performance problem. You should run through the checklist in Table 5-3 before you start trying to profile or optimize at the task level.


HPROF is a profiling tool that comes with the JDK. Although basic, it can give valuable information about a program's CPU and heap usage.


MapReduce Workflows

When the processing gets more complex, this complexity is generally manifested by having more MapReduce jobs, rather than having more complex map and reduce functions. In other words, as a rule of thumb, think about adding more jobs, rather than adding complexity to jobs.

A mapper commonly performs input format parsing, projection (selecting the relevant fields), and filtering (removing records that are not of interest). In the mappers you have seen so far, we have implemented all of these functions in a single mapper. However, there is a case for splitting these into distinct mappers and chaining them into a single mapper using the ChainMapper library class that comes with Hadoop. Combined with a ChainReducer, you can run a chain of mappers, followed by a reducer and another chain of mappers, in a single MapReduce job.

JobControl

When there is more than one job in a MapReduce workflow, the question arises: how do you manage the jobs so they are executed in order? There are several approaches, and the main consideration is whether you have a linear chain of jobs or a more complex directed acyclic graph (DAG) of jobs. For a linear chain, the simplest approach is to run each job one after another, waiting until a job completes successfully before running the next:

JobClient.runJob(conf1);
JobClient.runJob(conf2);

If a job fails, the runJob() method will throw an IOException, so later jobs in the pipeline don't get executed. Depending on your application, you might want to catch the exception and clean up any intermediate data that was produced by any previous jobs.

For anything more complex than a linear chain, there are libraries that can help orchestrate your workflow (although they are suited to linear chains, or even one-off jobs, too). An instance of JobControl represents a graph of jobs to be run. You add the job configurations, then tell the JobControl instance the dependencies between jobs. You run the JobControl in a thread, and it runs the jobs in dependency order. You can poll for progress, and when the jobs have finished, you can query for all the jobs' statuses and the associated errors for any failures. If a job fails, JobControl won't run its dependencies.
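A sketch of a two-step dependency chain using the newer org.apache.hadoop.mapreduce.lib.jobcontrol classes; TwoStepWorkflow is an illustrative name, and job1 and job2 are assumed to be already-configured Job objects.

import java.util.List;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.jobcontrol.ControlledJob;
import org.apache.hadoop.mapreduce.lib.jobcontrol.JobControl;

public class TwoStepWorkflow {
  public static void run(Job job1, Job job2) throws Exception {
    ControlledJob step1 = new ControlledJob(job1, null);
    ControlledJob step2 = new ControlledJob(job2, null);
    step2.addDependingJob(step1);              // step2 runs only after step1 succeeds

    JobControl control = new JobControl("two-step-workflow");
    control.addJob(step1);
    control.addJob(step2);

    new Thread(control).start();               // JobControl runs the jobs in dependency order
    while (!control.allFinished()) {
      Thread.sleep(1000);                      // poll for progress
    }
    List<ControlledJob> failed = control.getFailedJobList();
    if (!failed.isEmpty()) {
      System.err.println(failed.size() + " job(s) failed");
    }
    control.stop();
  }
}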

Apache Oozie

If you need to run a complex workflow, or one on a tight production schedule, or you have a large number of connected workflows with data dependencies between them, then a more sophisticated approach is required. Apache Oozie (http://incubator.apache.org/oozie/) fits the bill in any or all of these cases. It has been designed to manage the executions of thousands of dependent workflows, each composed of possibly thousands of constituent actions at the level of an individual MapReduce job.

Oozie has two main parts: a workflow engine that stores and runs workflows composed of Hadoop jobs, and a coordinator engine that runs workflow jobs based on predefined schedules and data availability. The latter property is especially powerful since it allows a workflow job to wait until its input data has been produced by a dependent workflow; also, it makes rerunning failed workflows more tractable, since no time is wasted running successful parts of a workflow. Anyone who has managed a complex batch system knows how difficult it can be to catch up from jobs missed due to downtime or failure, and will appreciate this feature.

Unlike JobControl, which runs on the client machine submitting the jobs, Oozie runs as a service in the cluster, and clients submit workflow definitions for immediate or later execution. In Oozie parlance, a workflow is a DAG of action nodes and control-flow nodes. An action node performs a workflow task, such as moving files in HDFS, running a MapReduce, Streaming, Pig, or Hive job, performing a Sqoop import, or running an arbitrary shell script or Java program. A control-flow node governs the workflow execution between actions by allowing such constructs as conditional logic (so different execution branches may be followed depending on the result of an earlier action node) or parallel execution. When the workflow completes, Oozie can make an HTTP callback to the client to inform it of the workflow status. It is also possible to receive callbacks every time the workflow enters or exits an action node.

All workflows must have one start node and one end node. When the workflow job starts, it transitions to the node specified by the start node (the max-temp-mr action in this example). A workflow job succeeds when it transitions to the end node. However, if the workflow job transitions to a kill node, it is considered to have failed and reports an error message as specified by the message element in the workflow definition.

The bulk of this workflow definition file specifies the map-reduce action. The first two elements, job-tracker and name-node, are used to specify the jobtracker to submit the job to, and the namenode (actually a Hadoop filesystem URI) for input and output data. Both are parameterized so that the workflow definition is not tied to a particular cluster (which makes it easy to test). The parameters are specified as workflow job properties at submission time.
