
AWS and Hadoop
Authors: April Song and Andrew Tran

Introduction

As distributed computing emerges into the mainstream, many companies look to gain a business advantage by leveraging high-volume, high-variability, and high-velocity data, known as big data. Current technology for distributed computing favors the Apache Software Foundation's widely used framework, Apache Hadoop, whose distributed file system provides RAID-array-like redundancy for all data files within the system. Apache Hadoop is based on Google's distributed file system, which was developed to support the large amount of data gathered by Google's search engine [1]. Both file systems support the distributed, parallel batch computing framework known as MapReduce, whose algorithm was also developed by Google [2]. Apache Hadoop and MapReduce provide a reliable system that is extremely scalable, which is ideal for any business because the software is free and open source.

Apache has several other software applications that work on top of Apache Hadoop and MapReduce; for the purpose of this work, the focus is on Hadoop and MapReduce alone. Apache Hadoop can run on a single-core, single-thread processor or on thousands of cores and thousands of threads, which is the scalability mentioned above. However, the hardware and software setup is a non-trivial task for users wanting to get started with optimizing performance. Companies provide free software to help with the setup, but the information needed to build a system from scratch is not as user friendly. The goal of this project is to set up an Apache Hadoop cluster, without any financial overhead, so that processes can be created from relatively low-level instructions through Java.

Literature Survey


The barebones purpose of Hadoop is to get results faster from large, distributed datasets. Tom White's Hadoop: The Definitive Guide focuses on the implementation of Hadoop and MapReduce [3]. MapReduce is based on the concept of parallel processing, where the data is processed in batches at different locations and sorted back into an understandable format. An additional benefit inherent in Hadoop is the scalability of the system: more hardware can be added to increase the throughput for MapReduce and the storage for the file system. The information provided by White gives a better understanding of the foundations of Hadoop and MapReduce, but does not provide explicit real-world uses.

Different companies provide tools to help users perform effective analysis on this big data without the overhead of learning the lower-level coding structure and algorithm design. In a white paper, Hortonworks poses the problem of a data lake, a large pool of variably structured data, as compared to a data warehouse, which imposes a defined structure on the data [4]. As mentioned, one of the main focuses of big data is dealing with unstructured data. Companies like Google, which take in massive amounts of data, must find ways to handle their data that are most beneficial to their users. When data is unstructured, analysis becomes difficult because sorting or using the data in a meaningful way can be tedious.

Using standard non-distributed techniques yields issues with speed and performance for data analysis because of the size and lack of structure. The solution is to use Hadoop, which devotes its CPU time to data processing for analysis rather than to low-value computing like data organization. Thus, Hadoop increases the performance of data analysis tasks and provides a format to use unstructured data efficiently. As efficient as Hadoop may be, the shortcomings of this work are highlighted by the cost to create such a system and the cost of the expertise offered through Hortonworks.

Beyond the cost, Hadoop also provides different outlets for companies to manage live streaming time-series data, particularly from field equipment. Oil and gas equipment uses telemetry to transfer data to a file system. Bad measurements can occur when the equipment is faulty, and the company then needs to repair the equipment to keep collecting data. MapR, another company that leverages Hadoop, provides a solution by implementing a real-time streaming database using Hadoop and MapReduce [5]. As data is collected, MapR uses an algorithm designed to work with MapReduce to identify issues with the field equipment, rather than having technicians survey the field to find faults. MapR does not go into detail on how equipment is found to be faulty. One way to determine abnormal behavior is to look at outliers and, in a larger scope, to find patterns in the data.

Outlier detection with MapReduce is done with categorical data, where the data can be mined to find patterns that lead to a better understanding of the data. With streaming time-series data, models can be built to find data points that do not conform to patterns in the whole dataset. Common data mining techniques involve clustering, distance approaches, density-based methods, and outlier detection methods [6]. Numerical outlier detection is done using statistics, for example standard deviations or the probability of data points falling outside a defined limit. Both categorical and numeric outlier detection methods help determine whether the data being mined can provide information for making useful decisions. In 2015, Microsoft published a paper on how to find outliers in data from its search engine Bing [7]. "Sparsity" in big data structures was a commonly cited issue because the algorithms must adhere to the data structure within Hadoop; sparsity here denotes unstructured data. Multiple methods were evaluated for detecting outliers, but overall the highlight of the paper was compressive sensing because it can handle sparsely structured data. Literature on compressive (or compressed) sensing can be found, but the focus of this project is a cost-effective way to use Hadoop without relying on companies like Hortonworks and MapR.

As mentioned, the goal of this project is to build a Hadoop distributed file system using cost-effective hardware. One can use old computers to build a cluster, but Amazon Web Services (AWS) provides a cost-effective solution to scale any user's hardware needs, essentially for free [8]. After building a cluster using AWS, Hadoop can be installed on the machines and accessed over SSH from a laptop, which can be considered the master, head, or name node. Data can be stored in the cluster and MapReduce computations can be run. Techniques for outlier detection using MapReduce can also be implemented after the cluster is created.

Method

We created a JAR file of Maximum.java and typed in the command "hadoop jar Maximum2.jar Maximum GlobalTemperatures.csv gtoutput3". The result we got in gtoutput3 was the maximum average temperature in the world. This was done by downloading Cloudera and adding it to VMware.
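For context, the driver in a file like Maximum.java typically looks something like the sketch below. This is a reconstruction of the standard Hadoop job setup rather than a verbatim copy of our code; the class names MaxTemperatureMapper and MaxTemperatureReducer are the ones discussed in the Results section.

// Sketch of a driver class along the lines of Maximum.java (reconstruction, not the exact source).
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class Maximum {
  public static void main(String[] args) throws Exception {
    if (args.length != 2) {
      System.err.println("Usage: Maximum <input path> <output path>");
      System.exit(-1);
    }
    Job job = Job.getInstance(new Configuration(), "Maximum");
    job.setJarByClass(Maximum.class);

    FileInputFormat.addInputPath(job, new Path(args[0]));    // e.g. GlobalTemperatures.csv
    FileOutputFormat.setOutputPath(job, new Path(args[1]));  // e.g. gtoutput3

    job.setMapperClass(MaxTemperatureMapper.class);    // map function described in Results
    job.setReducerClass(MaxTemperatureReducer.class);  // reduce function described in Results
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);

    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}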

To test the code from Cloudera on Amazon, a cluster needs to be provisioned and set up with Apache Hadoop. A few tools are needed to set up the cluster on Amazon: a local Linux-based operating system (Mac OS was used here), access to Secure Shell (SSH), and an Amazon Web Services (AWS) account. Detailed instructions to produce this cluster can be found at Insight Engineering [9]. First, the AWS cluster is provisioned using the setup provided by Amazon. For this process, 4 unique virtual machines (VMs) are used, where 1 is the namenode and the other 3 are datanodes. Each VM needs a permissions key for access, which is saved on the local computer. Under the target specifications, the free tier of AWS VMs is used: a single core, 8 GB of hard drive space, 1 GB of memory, and the Ubuntu OS. The raw space for the cluster is the number of datanodes times the hard drive space available on each VM (for example, 3 datanodes × 8 GB ≈ 24 GB before HDFS replication). These values may be needed to estimate the amount of data that can be stored on the HDFS.
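We provisioned these instances through the AWS web console, but for completeness the same step can be scripted with the AWS CLI. The commands below are a sketch under that assumption; the AMI and security group IDs are placeholders.

# Hypothetical AWS CLI equivalent of the console provisioning (IDs are placeholders).
aws ec2 create-key-pair --key-name hadoop-key --query 'KeyMaterial' --output text > hadoop-key.pem
chmod 400 hadoop-key.pem
# Launch 4 free-tier Ubuntu instances: 1 namenode and 3 datanodes.
aws ec2 run-instances --image-id ami-xxxxxxxx --count 4 --instance-type t2.micro \
    --key-name hadoop-key --security-group-ids sg-xxxxxxxx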

Once the AWS VMs have been provisioned, the VMs will need to be split between the namenode and the datanodes. This can be done by changing the vanity name of each of the VMs and associating their vanity names with their public DNS names, shown below:

Figure 1: The images above show the associated name for the headnode/namenode and the public DNS name for its VM.

The information above is used to access the VMs from the local bash shell; the Mac OS "Terminal" window was used, although any bash shell would work. The DNS name, username (ubuntu), and permissions key information are saved in a file so that the SSH command in bash can quickly access each remote VM. A sample of the SSH configuration file is below:

Figure 2: The SSH configuration file is set up to use a passwordless connection into each VM.
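Since the screenshot is not reproduced here, a representative ~/.ssh/config of the kind shown in Figure 2 is sketched below; the host aliases, DNS names, and key path are placeholders.

# ~/.ssh/config (placeholder values)
Host namenode
    HostName ec2-xx-xx-xx-xx.compute-1.amazonaws.com
    User ubuntu
    IdentityFile ~/.ssh/hadoop-key.pem

Host datanode1
    HostName ec2-yy-yy-yy-yy.compute-1.amazonaws.com
    User ubuntu
    IdentityFile ~/.ssh/hadoop-key.pem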

After completing the configuration setup, the namenode VM must be accessed to copy the permissions key into its file system, so that the namenode has access to the other datanodes. This can be achieved using the secure copy (scp) function. Using the permissions key, an authorization key is made within the namenode and sent to each of the datanodes. To test whether the process works, the namenode should be able to access each of the datanodes using SSH, as sketched below.
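A sketch of that key-distribution step, assuming the host aliases from the SSH configuration above and the placeholder key file hadoop-key.pem:

# On the local machine: copy the AWS permissions key to the namenode.
scp ~/.ssh/hadoop-key.pem namenode:~/.ssh/

# On the namenode: create a passwordless key pair and authorize it locally.
ssh-keygen -t rsa -P "" -f ~/.ssh/id_rsa
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys

# Still on the namenode: append the public key to each datanode's authorized_keys
# (datanode DNS names are placeholders).
cat ~/.ssh/id_rsa.pub | ssh -i ~/.ssh/hadoop-key.pem ubuntu@<datanode-dns> 'cat >> ~/.ssh/authorized_keys'

# Test: this should now log in without a password prompt.
ssh ubuntu@<datanode-dns>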

Once the namenode has SSH access to each of the datanodes, the process of implementing Apache Hadoop can finally start. First, the current version of Java must be downloaded and installed onto each of the VMs. After the Java installation, Apache Hadoop can be downloaded and unpacked into the same directory on all VMs, as shown below:


Figure 3: The commands within each VM needed to download and install Java and Apache Hadoop.
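Since the screenshot is not reproduced here, the commands in Figure 3 would look roughly like the sketch below on each Ubuntu VM; the Hadoop version and mirror URL are placeholders.

# Install Java on each VM.
sudo apt-get update
sudo apt-get install -y default-jdk

# Download and unpack Hadoop into the same directory on every VM
# (version and mirror are placeholders).
wget http://<apache-mirror>/hadoop/common/hadoop-2.x.y/hadoop-2.x.y.tar.gz
tar -xzf hadoop-2.x.y.tar.gz
sudo mv hadoop-2.x.y /usr/local/hadoop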

On each of the VMs, the profile paths are set up to reach the Hadoop directories: bin, hadoop, and /etc/hadoop. These paths are used to access the configuration files that set up the connections to the namenode VM: hadoop-env.sh, core-site.xml, yarn-site.xml, and mapred-site.xml. The datanode file to configure is hdfs-site.xml. After the configuration is complete, the Hadoop cluster can be run.
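As an illustration of the kind of change involved, a minimal core-site.xml pointing every node at the namenode might look like the following sketch; the hostname and port are placeholders, and our actual values followed the Insight Engineering guide [9].

<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://<namenode-public-dns>:9000</value>
  </property>
</configuration>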

Results

When using Hadoop with Java, the org.apache.hadoop classes should be imported. One way to add a mapper is job.setMapperClass(MaxTemperatureMapper.class), where MaxTemperatureMapper contains the map function. In the map function, the write method can be used to send information to the reduce method; an example is context.write(new Text(year), new IntWritable(airTemperature)), which makes new Text(year) the key and airTemperature, of type IntWritable, the value. There can also be a class that parses the input; for example, an NcdcRecordParser class can be created, and an if statement such as if (record.charAt(87) == '+') {...} can be used to split the data based on the symbol it is divided by. The parser's methods can then be called from the map function in the mapper class, as in the sketch below.
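Putting those pieces together, a mapper that delegates to a parser might look like the sketch below. The parser methods (parse, isValidTemperature, getYear, getAirTemperature) mirror the corrupt-record example later in this section; their implementations are assumed rather than shown.

// Sketch of a mapper that delegates record parsing to a separate parser class.
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class MaxTemperatureMapper
    extends Mapper<LongWritable, Text, Text, IntWritable> {

  private NcdcRecordParser parser = new NcdcRecordParser();

  @Override
  public void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    parser.parse(value);  // pull the year and temperature out of the raw record
    if (parser.isValidTemperature()) {
      // Key is the year, value is the air temperature, as described above.
      context.write(new Text(parser.getYear()),
          new IntWritable(parser.getAirTemperature()));
    }
  }
}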


The reducer also needs to be added, for example with job.setReducerClass(MaxTemperatureReducer.class). The MaxTemperatureReducer class is then created with a reduce method, where context.write(key, new IntWritable(maxValue)) writes the maximum value to its matching key. In this case, the maximum of the values sharing the same key is chosen, as in the sketch below.
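A corresponding reducer sketch that keeps the maximum value seen for each key:

// Sketch of a reducer that emits the maximum value for each key.
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class MaxTemperatureReducer
    extends Reducer<Text, IntWritable, Text, IntWritable> {

  @Override
  public void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {
    int maxValue = Integer.MIN_VALUE;
    for (IntWritable value : values) {
      maxValue = Math.max(maxValue, value.get());
    }
    context.write(key, new IntWritable(maxValue));  // maximum value for this key
  }
}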

One of Hadoop's interfaces is Tool. This is what it looks like:

public interface Tool extends Configurable {
  int run(String[] args) throws Exception;
}

The ToolRunner class runs classes that implement Tool; it can be used to run a class through its run method. When running the program in a terminal, the output should look something like this:

14/09/12 06:38:11 INFO input.FileInputFormat: Total input paths to process : 101
14/09/12 06:38:11 INFO impl.YarnClientImpl: Submitted application application_1410450250506_0003
14/09/12 06:38:12 INFO mapreduce.Job: Running job: job_1410450250506_0003
14/09/12 06:38:26 INFO mapreduce.Job: map 0% reduce 0%
...
14/09/12 06:45:24 INFO mapreduce.Job: map 100% reduce 100%
    File System Counters
        FILE: Number of bytes read=93995
        FILE: Number of bytes written=10273563
        FILE: Number of read operations=0
        FILE: Number of large read operations=0
        FILE: Number of write operations=0
        HDFS: Number of bytes read=33485855415
        HDFS: Number of bytes written=904
        HDFS: Number of read operations=327
        HDFS: Number of large read operations=0
        HDFS: Number of write operations=16
    Job Counters
        Launched map tasks=101
        Launched reduce tasks=8
        Data-local map tasks=101
        Total time spent by all maps in occupied slots (ms)=5954495
        Total time spent by all reduces in occupied slots (ms)=74934
        Total time spent by all map tasks (ms)=5954495
        Total time spent by all reduce tasks (ms)=74934
        Total vcore-seconds taken by all map tasks=5954495
        Total vcore-seconds taken by all reduce tasks=74934
        Total megabyte-seconds taken by all map tasks=6097402880
        Total megabyte-seconds taken by all reduce tasks=76732416
    Map-Reduce Framework
        Map input records=1209901509
        Map output records=1143764653
        Map output bytes=10293881877
        Map output materialized bytes=14193
        Input split bytes=14140
        Combine input records=1143764772
        Combine output records=234
        Reduce input groups=100
        Reduce shuffle bytes=14193
        Reduce input records=115
        Reduce output records=100
        Spilled Records=379
        Shuffled Maps =808
        Failed Shuffles=0
        Merged Map outputs=808
        GC time elapsed (ms)=101080
        CPU time spent (ms)=5113180
        Physical memory (bytes) snapshot=60509106176
        Virtual memory (bytes) snapshot=167657209856
        Total committed heap usage (bytes)=68220878848
    Shuffle Errors
        BAD_ID=0
        CONNECTION=0
        IO_ERROR=0
        WRONG_LENGTH=0
        WRONG_MAP=0
        WRONG_REDUCE=0
    File Input Format Counters
        Bytes Read=33485841275
    File Output Format Counters
        Bytes Written=90

If the number of errors for each variable under "Shuffle Errors" is 0, the program ran successfully.

The bytes written depend on the size of the output, roughly one byte per output character. "Launched map tasks" and "Launched reduce tasks" refer to the number of map and reduce tasks started for the job, respectively.


Jobs can be viewed using the MapReduce Web UI. In our case, the URL for viewing the MapReduce Web UI was http://quickstart.cloudera:8088. Here is another example:

Figure 4: The Hadoop job interface is shown above, based on the Cloudera system setup.

It contains information such as the user, name, application type, and queue. In our case, the username was cloudera, the name was Maximum, the application type was MAPREDUCE, and the queue was named root.cloudera. The state can be RUNNING or FINISHED, and the FinalStatus column can be SUCCEEDED, FAILED, or UNDEFINED. All of the jobs, including the failed jobs, are listed. Progress has a bar that fills completely once the job is finished. Here is a picture of a job in the web UI:


Figure 5: More information about the specific jobs can be found in the "Job" tab of the Hadoop job interface.

In our case, our job name was "Maximum" and our state was FINISHED after it was done. We had 1 map and 1 reduce.

Configurations are in the XML files. Here is an example:

<configuration>
  <property>
    <name>color</name>
    <value>yellow</value>
    <description>Color</description>
  </property>
  <property>
    <name>size</name>
    <value>10</value>
    <description>Size</description>
  </property>
  <property>
    <name>weight</name>
    <value>heavy</value>
    <final>true</final>
    <description>Weight</description>
  </property>
  <property>
    <name>size-weight</name>
    <value>${size},${weight}</value>
    <description>Size and weight</description>
  </property>
</configuration>

Properties can be added to the configuration. The final tag has a similar function to the final keyword in Java: it prevents the value from being overridden. Each property has a name, value, and description, and can be read from Java. For example, if the configuration file is called "configuration-1.xml", then add:

Configuration conf = new Configuration();

conf.addResource("configuration-1.xml");

To assert that a value in the XML is the same as an expected value, use the assertThat method. For example, to check that size is 10 in the configuration-1.xml file, do assertThat(conf.getInt("size", 0), is(10)), where 0 is the default returned if size is not set.

A system property can be set using System.setProperty(...). For example, System.setProperty("size", "14") makes expansions of ${size} (such as in size-weight) resolve to 14, although the stored size property still reads as 10 unless it is redefined.
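A small test method combining these calls, assuming JUnit 4 and Hamcrest are available, is sketched below.

// Sketch of a test that reads properties from configuration-1.xml.
import static org.hamcrest.CoreMatchers.is;
import static org.junit.Assert.assertThat;

import org.apache.hadoop.conf.Configuration;
import org.junit.Test;

public class ConfigurationTest {

  @Test
  public void readsPropertiesFromXmlFile() {
    Configuration conf = new Configuration();
    conf.addResource("configuration-1.xml");

    assertThat(conf.get("color"), is("yellow"));
    assertThat(conf.getInt("size", 0), is(10));        // 0 is the default if "size" is unset
    assertThat(conf.get("size-weight"), is("10,heavy"));

    System.setProperty("size", "14");                  // system properties win during ${...} expansion
    assertThat(conf.get("size-weight"), is("14,heavy"));
  }
}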

Tests can be done by creating a test function and adding @Test before it. Here is an example to test a map function:

@Test
public void ignoresMissingTemperatureRecord() throws IOException, InterruptedException {
  Text value = new Text("0043011990999991950051518004+68750+023550FM-12+0382" +
                                // Year ^^^^
      "99999V0203201N00261220001CN9999999N9+99991+99999999999");
                                // Temperature ^^^^^
  new MapDriver<LongWritable, Text, Text, IntWritable>()
      .withMapper(new MaxTemperatureMapper())
      .withInput(new LongWritable(0), value)
      .runTest();
}

MaxTemperatureMapper contains the map function. The withInput method adds a key with its corresponding value. Run the test by calling runTest().

To run a program, package it into a JAR file. HADOOP_CLASSPATH needs to be set in the environment to cover any dependent classes, as in the sketch below.
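A typical invocation, with placeholder paths, looks like the following sketch:

# Package the job classes into a JAR, add any dependent classes to
# HADOOP_CLASSPATH, and run the job (paths are placeholders).
jar cf Maximum.jar Maximum*.class
export HADOOP_CLASSPATH=/path/to/dependent-classes
hadoop jar Maximum.jar Maximum GlobalTemperatures.csv gtoutput3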


Testing for potential corrupt data can be done. Here is an example:

public class MaxTemperatureMapper
    extends Mapper<LongWritable, Text, Text, IntWritable> {

  enum Temperature {
    OVER_100
  }

  @Override
  public void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    ...
    if (parser.isValidTemperature()) {
      int airTemperature = parser.getAirTemperature();
      if (airTemperature > 1000) {
        System.err.println("Temperature over 100 degrees for input: " + value);
        context.setStatus("Detected possibly corrupt record: see logs.");
        context.getCounter(Temperature.OVER_100).increment(1);
      }
      context.write(new Text(parser.getYear()), new IntWritable(airTemperature));
    }
  }
}

There is an enum with the value OVER_100, whose counter gets incremented with the increment(1) method every time a temperature over 100 degrees appears. In this example, the temperatures are stored in tenths of a degree, so 1000 represents 100 degrees [4].

After the Hadoop functions in Java are tested, the jar files are sent over to the AWS cluster, where some initial setup is still needed. The HDFS was formatted from the namenode of the cluster, then the file system was started using Hadoop's start script. Once the startup process is complete, the Hadoop summary page is available on port 50070 to check the status of all nodes in the cluster. A screenshot of the summary page for the AWS cluster is shown below:


Figure 6: The summary page for the AWS cluster used 1 live node outside of the namenode.
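The formatting and startup step described above typically amounts to commands like the following sketch; the exact script locations depend on the Hadoop version and install directory.

# Run once on the namenode to initialize HDFS, then start the daemons.
hdfs namenode -format
$HADOOP_HOME/sbin/start-dfs.sh
$HADOOP_HOME/sbin/start-yarn.sh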

Because of cost constraints, the number of VMs provisioned on AWS was reduced to 2: one namenode and one datanode. The two files needed to run on the cluster are the same files used for the Cloudera VM tests. The Maximum.jar file and the GlobalTemperatures.csv file were sent to the namenode from the local machine, using the same process as in the initial Hadoop setup. From the namenode, these files were put into the HDFS using the HDFS commands provided by Apache. Examples of making a directory and adding a file in the HDFS are shown below:


Figure 7: The commands used on the namenode to set up the HDFS for running MapReduce tasks.
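Since the screenshot is not reproduced here, the commands in Figure 7 would look roughly like this sketch; the directory names are placeholders.

# Create a working directory in HDFS and copy the data file into it.
hdfs dfs -mkdir -p /user/ubuntu/input
hdfs dfs -put ~/GlobalTemperatures.csv /user/ubuntu/input/
hdfs dfs -ls /user/ubuntu/input    # verify the file is now in HDFS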

Using the commands above, the data file, GlobalTemperatures.csv, must be sent to the HDFS. From the HDFS, the file can be partitioned across the datanodes for MapReduce processing. The namenode retains the jar file, Maximum.jar, and runs it using the Hadoop and Java setup. The invocation of the jar file is the same as in the Cloudera VM; however, the path for the data file now points into the HDFS.

namenode$ hadoop jar ~/directory/Maximum.jar Maximum hdfs:/directory/GlobalTemperature.csv hdfs:/directory/output

Command Figure: The command run on the namenode to start the MapReduce job.

After running the command above, the MapReduce process executed on the AWS cluster. The process can be viewed from the command line, as was seen in the Cloudera VM, and it can also be viewed on port 8088 of the AWS cluster. The view is shown in the figure below:


Figure 8: The AWS namenode 8088 port shows the Hadoop job status.

After completion of the job, the output directory must be exported from the HDFS into a directory on the namenode, where the output files can be viewed. The Maximum.jar run was successful, and its output is consistent with the output from the Cloudera test, shown below:

Figure 9: The output file from the AWS cluster is shown above.
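The export step described above amounts to commands like the following sketch, with placeholder paths:

# Copy the MapReduce output directory out of HDFS and inspect it on the namenode.
hdfs dfs -get hdfs:/directory/output ~/output
cat ~/output/part-r-00000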

Conclusion

The work completed in the Cloudera test environment was successfully applied to the AWS implementation of the virtual machine cluster. From this level, additional computation can be done. Future work would focus on developing data mining and machine learning algorithms from these low-level processes, optimizing performance where languages like Java falter and where functional languages (such as F# or Scala) may be better suited. Applying other Apache software like Spark would help with in-memory caching for long iterative algorithms, thereby improving execution speed and possibly model quality.

Author Contribution

Andrew Tran: For the paper, I completed the introduction, the literature review using the literature summaries, the AWS cluster method, the AWS cluster results, and the conclusion. In the code, I implemented the added computation on the data file to get mean and standard deviation values.

April Song: I created and updated the website. I contributed to the bibliography, worked on calculating the maximum number in a text file for our first prototype, contributed to the presentation, and helped with the mathematical calculations in the final Cloudera and Hadoop prototype. I gave summaries of chapters 1, 2, 3, and 6 from the Hadoop: The Definitive Guide textbook.

References

[1] "The Google File System." http://static.googleusercontent.com/media/research.google.com/en//archive/gfs-sosp2003.pdf
[2] "MapReduce: Simplified Data Processing on Large Clusters." http://static.googleusercontent.com/media/research.google.com/en//archive/mapreduce-osdi04.pdf
[3] http://zohararad.github.io/presentations/big-data-introduction/
[4] Tom White. 2009. Hadoop: The Definitive Guide (1st ed.). O'Reilly Media, Inc.
[5] http://info.hortonworks.com/rs/h2source/images/Hadoop-Data-Lake-white-paper.pdf
[6] https://www.mapr.com/resources/predictive-maintenance-using-hadoop-oil-and-gas-industry
[7] Koufakou, Anna, et al. "Fast parallel outlier detection for categorical datasets using MapReduce." IEEE International Joint Conference on Neural Networks (IJCNN 2008, IEEE World Congress on Computational Intelligence). IEEE, 2008.
[8] Yan, Ying, et al. "Distributed outlier detection using compressive sensing." Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data. ACM, 2015.
[9] http://insightdataengineering.com/blog/hadoopdevops