AWS and Hadoop
Authors: April Song and Andrew Tran
Introduction
As distributed computing moves into the mainstream, many companies look to gain a
business advantage by leveraging data of high volume, variety, and velocity, known as big data.
Current distributed computing technology favors the Apache Software Foundation's widely used
distributed file system known as Apache Hadoop, which provides RAID-like redundancy
for all data files within the system. Apache Hadoop is based on Google's distributed file
system, which was developed to support the large amount of data gathered by Google's
search engine [1]. Both file systems support the distributed and parallel batch computing
framework known as MapReduce, an algorithm also developed by Google [2].
Together, Apache Hadoop and MapReduce provide a reliable and extremely scalable system, an
attractive option for any business because the software is free and open source.
Apache offers several other software applications that run on top of Apache Hadoop and
MapReduce; for the purposes of this work, the focus is on Hadoop and MapReduce themselves.
Reflecting the scalability noted above, Apache Hadoop can run on a single single-threaded
core or on thousands of cores and thousands of threads. However, the
hardware and software setup is a non-trivial task for users who want to start optimizing
performance. Companies provide free software to help with the setup, but the information needed
to build a system from scratch is not as user friendly. The goal of this project is to set up an Apache
Hadoop cluster, without any financial overhead, so that processes can be created from relatively
low-level instructions through Java.
Literature Survey
At its core, Hadoop exists to get results faster from large, distributed
datasets. Tom White's Hadoop: The Definitive Guide focuses on the implementation of Hadoop
and MapReduce [3]. MapReduce is based on the concept of parallel processing, where the
data is processed in batches at different locations and sorted back into an understandable
format. An additional benefit inherent in Hadoop is scalability: more
hardware can be added to increase both MapReduce throughput and storage for the file
system. The material provided by White gives a solid understanding of the foundations of
Hadoop and MapReduce, but does not provide explicit real-world uses.
Various companies provide tools that help users analyze this big data effectively
without the overhead of learning low-level coding structures and algorithm design.
In a white paper, Hortonworks poses the problem of the data lake, a large pool of variably
structured data, in contrast to the data warehouse, which imposes a defined
structure on its data [4]. As mentioned, one of the main focuses of big data is
dealing with unstructured data. Companies like Google, which take in massive amounts of data,
must find ways to handle their data that most benefit their users. When data is
unstructured, analysis becomes difficult because sorting the data or using it in a meaningful way
can be tedious.
Standard non-distributed techniques run into speed and performance problems
for data analysis and usage because of the data's size and lack of structure. The solution is
Hadoop, which focuses its CPU time on data processing for analysis rather than on low-value
computing such as data organization. Hadoop thus increases the performance of data analysis
tasks and provides a format for using unstructured data efficiently. As efficient as Hadoop may be,
the shortcoming of this approach is the cost of building such a system and of buying
the expertise in the system through Hortonworks.
Beyond cost, Hadoop also gives companies ways to manage their
live streaming time series data, particularly from field equipment. Oil and gas equipment uses
telemetry to transfer data to a file system. Bad measurements occur when the
equipment is faulty, and the company must repair the equipment to keep collecting data. MapR,
another company that leverages Hadoop, provides a solution by implementing a real-time
streaming database using Hadoop and MapReduce [5]. As data is collected, an
algorithm designed to work with MapReduce identifies issues with the field equipment, rather
than having technicians survey the field to find faults. MapR does not go into detail on how
equipment is determined to be faulty. One way to detect abnormal behavior is to look at
outliers and, in a larger scope, to find patterns in the data.
Outlier detection with MapReduce can be done on categorical data, where the data is
mined for patterns that lead to a better understanding of it. With streaming time
series data, models can be built to find data points that do not conform to patterns in the whole
dataset. Common data mining techniques involve clustering, distance approaches, density-based
methods, and dedicated outlier detection methods [6]. Numerical outlier detection is done using
statistics, for example standard deviations or the probability of a data point falling outside a defined
limit. Both categorical and numeric outlier detection methods help determine whether the data
being mined can inform useful decisions. In 2015, Microsoft
published a paper on finding outliers in data from its search engine Bing [7]. "Sparsity" in big data
structures is a recurring concern in that work because the algorithms must adhere to the data structure
within Hadoop; sparsity here refers to largely unstructured data. Multiple methods were evaluated
for detecting outliers, but the highlight of the paper was compressive sensing, because it
can handle sparsely structured data. Literature on compressive (or compressed) sensing is
available, but the focus of this project is a cost-effective way to use Hadoop without
relying on companies like Hortonworks and MapR.
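The statistical approach mentioned above can be illustrated with a minimal sketch in plain Java. This is not taken from the cited papers: the three-standard-deviation threshold and the sample readings are assumptions chosen for illustration, and a real pipeline would run such a check inside MapReduce rather than on a single machine.

```java
import java.util.ArrayList;
import java.util.List;

// Minimal numeric outlier detection: flag points more than k standard
// deviations from the mean (a sketch only; threshold k is an assumption).
class SigmaOutliers {
    public static List<Double> findOutliers(double[] data, double k) {
        double sum = 0;
        for (double d : data) sum += d;
        double mean = sum / data.length;

        double sq = 0;
        for (double d : data) sq += (d - mean) * (d - mean);
        double std = Math.sqrt(sq / data.length); // population standard deviation

        List<Double> outliers = new ArrayList<>();
        for (double d : data) {
            if (Math.abs(d - mean) > k * std) outliers.add(d);
        }
        return outliers;
    }

    public static void main(String[] args) {
        // One faulty telemetry reading hidden among normal ones.
        double[] readings = {10.1, 9.9, 10.0, 10.2, 9.8, 55.0};
        System.out.println(findOutliers(readings, 2.0)); // prints [55.0]
    }
}
```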
As mentioned, the goal of this project is to build a Hadoop distributed file system using
cost-effective hardware. One could use old computers to build a cluster, but Amazon Web
Services (AWS) provides a cost-effective way to scale a user's hardware needs that is
essentially free at the entry tier [8]. With a cluster built on AWS, Hadoop can be installed on these machines
and accessed over SSH from a laptop, which can be considered the master, head, or name
node. Data can be stored in the cluster and MapReduce computations can be run. Techniques
for outlier detection using MapReduce can also be implemented after the cluster is created.
Method
Initial testing was done by downloading Cloudera and adding it to VMware. In the
Cloudera VM, we created a JAR file of Maximum.java and typed the command "hadoop jar
Maximum2.jar Maximum GlobalTemperatures.csv gtoutput3". The result in gtoutput3 was the
maximum average temperature in the world.
To test the code from Cloudera on Amazon, a cluster needs to be provisioned and
set up with Apache Hadoop. An array of tools is needed to set up the cluster on Amazon: a
local Linux-based operating system (Mac OS was used), access to Secure Shell (SSH),
and an Amazon Web Services (AWS) account. Detailed instructions for producing this cluster
can be found at Insight Engineering [9]. First, the AWS cluster is provisioned using the
setup provided by Amazon. For this process, 4 unique virtual machines (VMs) are used:
1 as the namenode and the other 3 as datanodes. Each VM needs a
permissions key for access, which is saved on the local computer. Under the target
specifications, the free tier of AWS VMs is used, each with a single core, 8 GB of hard drive
space, 1 GB of memory, and the Ubuntu OS. The raw space for the cluster is the
number of datanodes times the hard drive space available in each VM. These values may be
needed to specify the amount of data that can be stored on the HDFS.
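The sizing above implies a rough capacity calculation, sketched below. The replication factor of 3 is HDFS's default (configurable via dfs.replication) and is an assumption not stated in the text; the datanode count and disk size are the free-tier figures listed above.

```java
// Rough HDFS capacity estimate for the cluster described above.
// Raw capacity = datanodes * disk per VM; usable capacity divides by the
// HDFS replication factor (default 3 -- an assumption, not from the text).
class ClusterCapacity {
    static int rawCapacityGb(int datanodes, int diskPerVmGb) {
        return datanodes * diskPerVmGb;
    }

    static double usableGb(int rawGb, int replication) {
        return (double) rawGb / replication;
    }

    public static void main(String[] args) {
        int raw = rawCapacityGb(3, 8);          // 3 datanodes * 8 GB = 24 GB raw
        double usable = usableGb(raw, 3);       // ~8 GB of unique data
        System.out.println(raw + " GB raw, ~" + usable + " GB usable");
    }
}
```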
Once the AWS VMs have been provisioned, the VMs will need to be split between the
namenode and the datanodes. This can be done by changing the vanity name of each of the
VMs and associating their vanity names with their public DNS names, shown below:
Figure 1: The images above show the associated name for the headnode/namenode and the
public DNS name for its VM.
The information above is used to access each VM from the local bash shell; Mac
OS's "Terminal" window was used, although any bash shell would work. The DNS name,
username (ubuntu), and permissions key information are saved in a file so that the SSH
command in bash can quickly access the remote VM. A sample of the SSH configuration file is
below:
Figure 2: The SSH configuration file is set up to use a passwordless connection into each VM.
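A representative entry of this kind follows standard ssh_config syntax; the alias, DNS name, and key path shown here are placeholders, not the actual values used in the project.

```
Host namenode
    HostName ec2-xx-xx-xx-xx.compute-1.amazonaws.com
    User ubuntu
    IdentityFile ~/.ssh/aws-cluster-key.pem
```

With such an entry in ~/.ssh/config, the VM can be reached with just "ssh namenode".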
After completing the configuration setup, the namenode VM must be accessed to copy
the permissions key into its file system, so that the namenode has permission to access the
other datanodes. This can be achieved using the secure copy (scp) function. Using the
permissions key, an authorization key is created within the namenode and sent to each of the
datanodes. To verify that the process works, the namenode should be able to reach each of
the datanodes over SSH.
Once the namenode has SSH access to each of the datanodes, the process of
installing Apache Hadoop can finally be started. First, the current version of Java must be
downloaded and installed on each of the VMs. After the Java installation, Apache Hadoop can
be downloaded and unpacked into the same directory on all VMs, as shown below:
Figure 3: The commands within each VM needed to download and install Java and Apache
Hadoop.
On each of the VMs, the profile paths are set up to access the Hadoop directories:
bin, hadoop, and /etc/hadoop. These paths are used to reach the configuration files that establish
the connections to the namenode VM: hadoop-env.sh, core-site.xml, yarn-site.xml, and mapred-site.xml.
On the datanodes, the file to configure is hdfs-site.xml. After the configuration is
complete, the Hadoop cluster can be run.
Results
When using Hadoop from Java, org.apache.hadoop classes should be imported. One
way to add a mapper is job.setMapperClass(MaxTemperatureMapper.class), where
MaxTemperatureMapper contains the map function. In the map function, the write method can
be used to send information on to the reduce method. An example is context.write(new
Text(year), new IntWritable(airTemperature)), which emits new Text(year) as the key and the
air temperature, of type IntWritable, as the value. Parsing can be factored into its own class:
for example, an NcdcRecordParser class can be created, and an if statement such as
if (record.charAt(87) == '+') {...} can branch on the sign character that divides the fields.
The parser's methods can then be called from the map function in the mapper class.
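The parsing described above can be sketched in plain Java without any Hadoop dependencies. The column offsets below (year at 15–19, signed temperature at 87–92, 9999 as the missing-value sentinel) follow the NCDC record layout used in White's examples and should be treated as assumptions about the data format:

```java
// Plain-Java sketch of the NcdcRecordParser idea: pull the year and the
// signed temperature out of a fixed-width NCDC weather record.
// Offsets and the 9999 sentinel are assumed from White's sample format.
class NcdcRecordParser {
    private static final int MISSING = 9999;

    private String year;
    private int airTemperature;

    public void parse(String record) {
        year = record.substring(15, 19);
        // The temperature field starts at column 87 with an explicit sign.
        if (record.charAt(87) == '+') {
            airTemperature = Integer.parseInt(record.substring(88, 92));
        } else {
            airTemperature = Integer.parseInt(record.substring(87, 92)); // keeps '-'
        }
    }

    public boolean isValidTemperature() { return airTemperature != MISSING; }
    public String getYear() { return year; }
    public int getAirTemperature() { return airTemperature; }

    public static void main(String[] args) {
        NcdcRecordParser p = new NcdcRecordParser();
        // Same sample line used in the mapper test later in this section.
        p.parse("0043011990999991950051518004+68750+023550FM-12+0382"
              + "99999V0203201N00261220001CN9999999N9+99991+99999999999");
        System.out.println(p.getYear() + " valid=" + p.isValidTemperature());
        // prints: 1950 valid=false  (temperature field holds the 9999 sentinel)
    }
}
```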
The reducer also needs to be added, for example with
job.setReducerClass(MaxTemperatureReducer.class). The MaxTemperatureReducer
class is then created with a reduce method; context.write(key, new
IntWritable(maxValue)) writes the maximum value against its matching key. In this case, the
maximum of the values sharing the same key is chosen.
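The max-by-key behavior of the reduce step can be mimicked in plain Java. This sketch stands in for what MaxTemperatureReducer does over the grouped values; it is not Hadoop's Reducer API, and the class and method names here are illustrative.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Plain-Java sketch of the reduce step: for each key (year), keep the
// maximum of its associated values (temperatures).
class MaxByKey {
    public static Map<String, Integer> reduce(List<Map.Entry<String, Integer>> pairs) {
        Map<String, Integer> max = new HashMap<>();
        for (Map.Entry<String, Integer> e : pairs) {
            // Math::max mirrors "the maximum of the values sharing the same key".
            max.merge(e.getKey(), e.getValue(), Math::max);
        }
        return max;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Integer>> pairs = List.of(
            Map.entry("1950", 0), Map.entry("1950", 22), Map.entry("1949", 111));
        Map<String, Integer> result = reduce(pairs);
        System.out.println(result.get("1950") + " " + result.get("1949")); // prints 22 111
    }
}
```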
One of Hadoop's interfaces is Tool. This is what it looks like:

public interface Tool extends Configurable {
    int run(String[] args) throws Exception;
}

The ToolRunner class aggregates Tool and can be used to run a class through its run
function. When running the program in a terminal, the output should look something like this:

14/09/12 06:38:11 INFO input.FileInputFormat: Total input paths to process : 101
14/09/12 06:38:11 INFO impl.YarnClientImpl: Submitted application application_1410450250506_0003
14/09/12 06:38:12 INFO mapreduce.Job: Running job: job_1410450250506_0003
14/09/12 06:38:26 INFO mapreduce.Job: map 0% reduce 0%
...
14/09/12 06:45:24 INFO mapreduce.Job: map 100% reduce 100%

File System Counters
    FILE: Number of bytes read=93995
    FILE: Number of bytes written=10273563
    FILE: Number of read operations=0
    FILE: Number of large read operations=0
    FILE: Number of write operations=0
    HDFS: Number of bytes read=33485855415
    HDFS: Number of bytes written=904
    HDFS: Number of read operations=327
    HDFS: Number of large read operations=0
    HDFS: Number of write operations=16
Job Counters
    Launched map tasks=101
    Launched reduce tasks=8
    Data-local map tasks=101
    Total time spent by all maps in occupied slots (ms)=5954495
    Total time spent by all reduces in occupied slots (ms)=74934
    Total time spent by all map tasks (ms)=5954495
    Total time spent by all reduce tasks (ms)=74934
    Total vcore-seconds taken by all map tasks=5954495
    Total vcore-seconds taken by all reduce tasks=74934
    Total megabyte-seconds taken by all map tasks=6097402880
    Total megabyte-seconds taken by all reduce tasks=76732416
Map-Reduce Framework
    Map input records=1209901509
    Map output records=1143764653
    Map output bytes=10293881877
    Map output materialized bytes=14193
    Input split bytes=14140
    Combine input records=1143764772
    Combine output records=234
    Reduce input groups=100
    Reduce shuffle bytes=14193
    Reduce input records=115
    Reduce output records=100
    Spilled Records=379
    Shuffled Maps=808
    Failed Shuffles=0
    Merged Map outputs=808
    GC time elapsed (ms)=101080
    CPU time spent (ms)=5113180
    Physical memory (bytes) snapshot=60509106176
    Virtual memory (bytes) snapshot=167657209856
    Total committed heap usage (bytes)=68220878848
Shuffle Errors
    BAD_ID=0
    CONNECTION=0
    IO_ERROR=0
    WRONG_LENGTH=0
    WRONG_MAP=0
    WRONG_REDUCE=0
File Input Format Counters
    Bytes Read=33485841275
File Output Format Counters
    Bytes Written=90

If the number of errors for each variable under "Shuffle Errors" is 0, the program ran successfully.
The bytes written depend on the size of the output; in a plain-text output, each ASCII
character is one byte, plus line separators. "Launched map tasks" and "Launched reduce tasks"
refer to the number of map tasks and reduce tasks the job started, respectively.
Jobs can be viewed using the MapReduce Web UI. In our case, the URL for viewing the
MapReduce Web UI was http://quickstart.cloudera:8088. Here is another example:
Figure 4: The Hadoop job interface is shown above, based on the Cloudera system setup.
It contains information such as the user, name, application type, and queue. In our case,
the username was cloudera, the name was Maximum, the application type was MAPREDUCE,
and the queue was named root.cloudera. The State can be RUNNING or FINISHED, and the
FinalStatus column can be SUCCEEDED, FAILED, or UNDEFINED. All of the jobs, including
the failed jobs, are listed. The Progress bar fills completely once the job is completed.
Here is a picture of a job in the web UI:
Figure 5: More information about the specific jobs can be found in the “Job” tab of the Hadoop
job interface.
In our case, our job name was “Maximum” and our state was FINISHED after it was
done. We had 1 map and 1 reduce.
Configurations are in the XML files. Here is an example:
<configuration>
  <property>
    <name>color</name>
    <value>yellow</value>
    <description>Color</description>
  </property>
  <property>
    <name>size</name>
    <value>10</value>
    <description>Size</description>
  </property>
  <property>
    <name>weight</name>
    <value>heavy</value>
    <final>true</final>
    <description>Weight</description>
  </property>
  <property>
    <name>size-weight</name>
    <value>${size},${weight}</value>
    <description>Size and weight</description>
  </property>
</configuration>
Properties can be added to the configuration. The final tag has a similar function to the
final keyword in Java: it prevents the value from being overridden. Each property has a name,
value, and description, and can be read from Java. For example, if the configuration file is
called "configuration-1.xml", then add:
Configuration conf = new Configuration();
conf.addResource("configuration-1.xml");
To assert that a value in the XML matches an expected value, use the assertThat
method. For example, to check that size is 10 in the configuration-1.xml file, use
assertThat(conf.getInt("size", 0), is(10)), where 0 is the default value returned if size is not defined.
A property can also be set as a system property using System.setProperty(...); for example,
System.setProperty("size", "14") defines a system property size with the value 14. In Hadoop,
system properties take priority when expanding ${...} variables in configuration values.
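System properties themselves are plain JDK functionality, so the override can be sketched without Hadoop at all; the property name "size" is carried over from the example above.

```java
// JDK-only sketch of setting and reading a system property; no Hadoop here.
class PropertyOverride {
    public static void main(String[] args) {
        System.setProperty("size", "14");
        // Later reads of the system property see the new value:
        System.out.println(System.getProperty("size")); // prints 14
    }
}
```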
Tests can be done by creating a test function and adding @Test before it. Here is an
example to test a map function:

@Test
public void ignoresMissingTemperatureRecord() throws IOException, InterruptedException {
    Text value = new Text("0043011990999991950051518004+68750+023550FM-12+0382" +
                                      // Year ^^^^
        "99999V0203201N00261220001CN9999999N9+99991+99999999999");
                              // Temperature ^^^^^
    new MapDriver<LongWritable, Text, Text, IntWritable>()
        .withMapper(new MaxTemperatureMapper())
        .withInput(new LongWritable(0), value)
        .runTest();
}

MaxTemperatureMapper contains the map function. The withInput method adds a key with its
corresponding value. Run the test by calling runTest().
To run a program, package it into a JAR file. HADOOP_CLASSPATH needs to be added
to the environment variables for dependent classes.
Testing for potentially corrupt data can also be done. Here is an example:

public class MaxTemperatureMapper
        extends Mapper<LongWritable, Text, Text, IntWritable> {

    enum Temperature {
        OVER_100
    }

    @Override
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        ...
        if (parser.isValidTemperature()) {
            int airTemperature = parser.getAirTemperature();
            if (airTemperature > 1000) {
                System.err.println("Temperature over 100 degrees for input: " + value);
                context.setStatus("Detected possibly corrupt record: see logs.");
                context.getCounter(Temperature.OVER_100).increment(1);
            }
            context.write(new Text(parser.getYear()), new IntWritable(airTemperature));
        }
    }
}

There is an enum with the value OVER_100, whose counter is incremented with the increment(1)
method every time a value is over 100 degrees. In this example, the temperatures are listed
in tenths of a degree, so 1000 represents 100 degrees. [4]
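The counter-per-enum pattern can be imitated in plain Java with an EnumMap. This is a stand-in for Hadoop's context.getCounter(...), not the real API; the 1000 threshold and the tenths-of-a-degree convention are taken from the text above.

```java
import java.util.EnumMap;

// Plain-Java imitation of the Hadoop counter pattern: count records whose
// temperature (in tenths of a degree) exceeds 100 degrees, i.e. > 1000.
class CorruptRecordCounter {
    enum Temperature { OVER_100 }

    private final EnumMap<Temperature, Long> counters = new EnumMap<>(Temperature.class);

    public void check(int airTemperature) {
        if (airTemperature > 1000) {
            counters.merge(Temperature.OVER_100, 1L, Long::sum);
        }
    }

    public long count(Temperature t) {
        return counters.getOrDefault(t, 0L);
    }

    public static void main(String[] args) {
        CorruptRecordCounter c = new CorruptRecordCounter();
        for (int t : new int[] {250, 1200, 980, 9999}) c.check(t);
        System.out.println(c.count(Temperature.OVER_100)); // prints 2
    }
}
```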
After the Hadoop functions in Java are tested, the jar files are sent over to the AWS
cluster. From the cluster, some initial setup is needed: the HDFS is formatted from the
namenode, then the file system is started using Hadoop's start script. Once the startup
process is complete, the Hadoop summary page is available on port 50070 to check the
status of all nodes used in the cluster. A screenshot of the summary page for the AWS
cluster is shown below:
Figure 6: The summary page for the AWS cluster used 1 live node outside of the namenode.
Because of cost constraints, the number of VMs provisioned on AWS was reduced to
2: one namenode and one datanode. The two files needed to run on the cluster are the
same files used for the Cloudera VM tests. The Maximum.jar file and the
GlobalTemperatures.csv file were sent to the namenode from the local machine, using the same
process as in the initial Hadoop setup. From the namenode, these files were put into the HDFS
using the HDFS commands provided by Apache. Examples of making a directory and a file in the
HDFS are shown below:
Figure 7: The commands are used in the namenode to set up the HDFS in order to run MapReduce tasks.
Using the commands above, the data file, GlobalTemperatures.csv, must be sent to the
HDFS, from which it is partitioned across the datanodes for MapReduce processing.
The namenode retains the jar file, Maximum.jar, and runs it using the Hadoop and Java
implementation setup. The invocation is the same as in the Cloudera VM, except that the
directory for the data file points at the HDFS:

namenode$ hadoop jar ~/directory/Maximum.jar Maximum hdfs:/directory/GlobalTemperature.csv hdfs:/directory/output

Command Figure: The command above runs the MapReduce job from the namenode.
After running the command above, the MapReduce process executed on the AWS cluster.
The process can be viewed from the command line, as it was in the Cloudera VM, and also
from port 8088 of the AWS cluster. The view is shown in the figure below:
view is shown in the figure below:
Figure 8: The AWS namenode 8088 port shows the Hadoop job status.
After the job completes, the output directory must be exported from the HDFS into the
namenode's directory, where the output files can be viewed. The Maximum.jar run was
successful, and its output is consistent with the output from the Cloudera test, shown below:
Figure 9: The output file from the AWS cluster is shown above.
Conclusion
The work completed in the Cloudera test environment was successfully applied to the
AWS implementation of the virtual machine cluster. From this point, additional computation
can be done. Future work would focus on developing data mining and machine learning
algorithms from these low-level processes, optimizing performance where languages like
Java falter and where functional languages (F# or Scala) may be better suited. Applying
other Apache software such as Spark would add in-memory caching for long iterative
algorithms, improving performance and possibly model quality.
Author Contribution
Andrew Tran: For the paper, completed the introduction, the literature review based on
the literature summaries, the AWS cluster method and results, and the conclusion. In the
code, implemented the additional computation on the data file to obtain mean and standard
deviation values.
April Song: I created and updated the website. I contributed to the bibliography, worked
on calculating the maximum number in a text file for our first prototype, contributed to the
presentation, and helped with the mathematical calculations in the final Cloudera and Hadoop
prototype. I wrote summaries of chapters 1, 2, 3, and 6 from the Hadoop: The Definitive
Guide textbook.
References
[1] http://static.googleusercontent.com/media/research.google.com/en//archive/gfs-sosp2003.pdf
[2] http://static.googleusercontent.com/media/research.google.com/en//archive/mapreduce-osdi04.pdf
[3] http://zohararad.github.io/presentations/big-data-introduction/
[4] Tom White. 2009. Hadoop: The Definitive Guide (1st ed.). O'Reilly Media, Inc.
[5] http://info.hortonworks.com/rs/h2source/images/Hadoop-Data-Lake-white-paper.pdf
[6] https://www.mapr.com/resources/predictive-maintenance-using-hadoop-oil-and-gas-industry
[7] Koufakou, Anna, et al. "Fast parallel outlier detection for categorical datasets using MapReduce." IEEE International Joint Conference on Neural Networks (IJCNN), 2008.
[8] Yan, Ying, et al. "Distributed outlier detection using compressive sensing." Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data. ACM, 2015.
[9] http://insightdataengineering.com/blog/hadoopdevops