apache hadoop an introduction - todd lipcon - gluecon 2010
Apache Hadoop: an introduction
Todd Lipcon
todd@cloudera.com
@tlipcon @cloudera
May 27, 2010
Software Engineer at
Hadoop contributor, HBase committer
Previously: systems programming, operations, large scale data analysis
I love data and data systems
Hi there!
Outline
Why should you care? (Intro)
What is Hadoop?
The MapReduce Model
HDFS, Hadoop Map/Reduce
The Hadoop Ecosystem
Questions
“I keep saying that the sexy job in the next 10 years will be
statisticians, and I’m not kidding.”
Hal Varian (Google’s chief economist)
Are you throwing away data?
Data comes in many shapes and sizes: relational tuples, log files, semi-structured textual data (e.g., e-mail), and more.
Are you throwing it away because it doesn’t ‘fit’?
So, what’s Hadoop?
The Little Prince, Antoine de Saint-Exupéry, Irene Testot-Ferry
Apache Hadoop is an open-source system to reliably store and process gobs of information across many commodity computers.
Two Core Components
HDFS: self-healing, high-bandwidth clustered storage.
Map/Reduce: fault-tolerant distributed computing.
Image: MadMan the Mighty CC BY-NC-SA
Assumption 1: Machines can be reliable...
Hadoop separates distributed-system fault-tolerance code from application logic.
Systems Programmers Statisticians
Unicorns
Hadoop lets you interact with a cluster, not a bunch of machines.
Image:Yahoo! Hadoop cluster [ OSCON ’07 ]
Image: Matthew J. Stinson CC-BY-NC
Assumption 3: Your analysis fits on one machine
Hadoop scales linearly with data size or analysis complexity. Data-parallel or compute-parallel. For example:
Extensive machine learning on <100GB of image data
Simple SQL-style queries on >100TB of clickstream data
One Hadoop works for both applications!
A Typical Look...
5-4000 commodity servers (8-core, 8-24GB RAM, 4-12 TB, gig-E)
2-level network architecture
20-40 nodes per rack
STOP! REAL METAL?
Isn’t this some kind of “Cloud Computing” conference?
Image: Josh Hough CC BY-NC-SA
Hadoop runs as a cloud (a cluster) and maybe in a cloud (e.g., EC2).
dramatis personae
NameNode (metadata server and database)
SecondaryNameNode (assistant to NameNode)
JobTracker (scheduler)
DataNodes (block storage)
TaskTrackers (task execution)
Thanks to Zak Stone for earmuff image!
Starring...
The Chorus…
HDFS (fs metadata)
Namenode
Datanodes
One Rack / A Different Rack
3x64MB file, 3 replicas
4x64MB file, 3 replicas
Small file, 7 replicas
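The replica counts in the diagram above follow directly from HDFS's block model. A small sketch of the arithmetic (plain Python, not Hadoop code; the 64MB block size and replication factors come from the slide):

```python
def block_count(file_size_mb, block_size_mb=64):
    """Blocks needed to hold a file; the last block may be partial."""
    return max(1, -(-file_size_mb // block_size_mb))  # ceiling division

def replica_count(file_size_mb, replication=3, block_size_mb=64):
    """Total block replicas stored across the cluster for one file."""
    return block_count(file_size_mb, block_size_mb) * replication

# The three files on the slide:
print(replica_count(3 * 64, replication=3))  # 9  (3 blocks x 3 replicas)
print(replica_count(4 * 64, replication=3))  # 12 (4 blocks x 3 replicas)
print(replica_count(1, replication=7))       # 7  (1 block  x 7 replicas)
```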
HDFS Write Path
file in the filesystem’s namespace, with no blocks associated with it (step 2). The namenode performs various checks to make sure the file doesn’t already exist, and that the client has the right permissions to create the file. If these checks pass, the namenode makes a record of the new file; otherwise file creation fails and the client is thrown an IOException. The DistributedFileSystem returns an FSDataOutputStream for the client to start writing data to. Just as in the read case, FSDataOutputStream wraps a DFSOutputStream, which handles communication with the datanodes and namenode.
As the client writes data (step 3), DFSOutputStream splits it into packets, which it writes to an internal queue, called the data queue. The data queue is consumed by the DataStreamer, whose responsibility it is to ask the namenode to allocate new blocks by picking a list of suitable datanodes to store the replicas. The list of datanodes forms a pipeline; we’ll assume the replication level is three, so there are three nodes in the pipeline. The DataStreamer streams the packets to the first datanode in the pipeline, which stores the packet and forwards it to the second datanode in the pipeline. Similarly, the second datanode stores the packet and forwards it to the third (and last) datanode in the pipeline (step 4).
DFSOutputStream also maintains an internal queue of packets that are waiting to be acknowledged by datanodes, called the ack queue. A packet is only removed from the ack queue when it has been acknowledged by all the datanodes in the pipeline (step 5).
If a datanode fails while data is being written to it, then the following actions are taken, which are transparent to the client writing the data. First the pipeline is closed, and any packets in the ack queue are added to the front of the data queue so that datanodes that are downstream from the failed node will not miss any packets. The current block on the good datanodes is given a new identity, which is communicated to the namenode, so that the partial block on the failed datanode will be deleted if the failed datanode recovers later on. The failed datanode is removed from the pipeline and the remainder of the block’s data is written to the two good datanodes in the pipeline. The namenode notices that the block is under-replicated, and it arranges for a further replica to be created on another node. Subsequent blocks are then treated as normal.
It’s possible, but unlikely, that multiple datanodes fail while a block is being written. As long as dfs.replication.min replicas (default one) are written, the write will succeed, and the block will be asynchronously replicated across the cluster until its target replication factor is reached (dfs.replication, which defaults to three).
Figure 3-3. A client writing data to HDFS
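The data-queue/ack-queue mechanics described above can be sketched as a toy simulation (plain Python, not HDFS code; the datanode names are made up). Packets leave the data queue, are stored and forwarded down the pipeline, and leave the ack queue only once every datanode has acknowledged them:

```python
from collections import deque

def write_block(packets, pipeline=("dn1", "dn2", "dn3")):
    """Toy model of the HDFS write pipeline for one block."""
    data_queue = deque(packets)   # packets waiting to be streamed
    ack_queue = deque()           # packets streamed but not yet fully acked
    stored = {dn: [] for dn in pipeline}
    while data_queue:
        pkt = data_queue.popleft()
        ack_queue.append(pkt)
        for dn in pipeline:       # each datanode stores, then forwards
            stored[dn].append(pkt)
        ack_queue.popleft()       # acked by every replica: safe to discard
    return stored

replicas = write_block(["p1", "p2", "p3"])
print(replicas["dn3"])  # ['p1', 'p2', 'p3'] -- every datanode has every packet
```

On a real failure, HDFS re-queues the unacknowledged packets at the front of the data queue, which is why the ack queue exists at all.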
HDFS Failures?
Datanode crash?
Clients read another copy
Background rebalance/re-replicate
Namenode crash?
uh-oh
not responsible for majority of downtime!
map()
map: K₁,V₁→list K₂,V₂
public class Mapper<KEYIN, VALUEIN, KEYOUT, VALUEOUT> {
  /**
   * Called once for each key/value pair in the input split. Most applications
   * should override this, but the default is the identity function.
   */
  protected void map(KEYIN key, VALUEIN value, Context context)
      throws IOException, InterruptedException {
    // context.write() can be called many times
    // this is the default "identity mapper" implementation
    context.write((KEYOUT) key, (VALUEOUT) value);
  }
}
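To make the map: K₁,V₁→list K₂,V₂ signature concrete, here is a hypothetical word-count mapper sketched in plain Python (not the Java API above): the key is a line offset, the value is the line text, and one input pair may emit many output pairs.

```python
def map_fn(key, value):
    """Word-count map: (line offset, line text) -> list of (word, 1)."""
    return [(word, 1) for word in value.split()]

print(map_fn(0, "the quick brown fox the"))
# [('the', 1), ('quick', 1), ('brown', 1), ('fox', 1), ('the', 1)]
```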
(the shuffle)
map output is assigned to a “reducer”
map output is sorted by key
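The two steps above (assign to a reducer, sort by key) can be sketched in plain Python. This is an illustration, not Hadoop's shuffle implementation, though Hadoop's default partitioner is likewise a hash of the key modulo the reducer count:

```python
def shuffle(map_outputs, num_reducers=2):
    """Assign each (key, value) pair to a reducer by hashing its key,
    then sort each reducer's input by key."""
    partitions = [[] for _ in range(num_reducers)]
    for key, value in map_outputs:
        partitions[hash(key) % num_reducers].append((key, value))
    return [sorted(p) for p in partitions]

parts = shuffle([("b", 1), ("a", 1), ("b", 1)])
# Every pair with the same key lands in the same partition, sorted by key.
```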
reduce()
K₂, iter(V₂)→list(K₃,V₃)
public class Reducer<KEYIN, VALUEIN, KEYOUT, VALUEOUT> {
  /**
   * This method is called once for each key. Most applications will define
   * their reduce class by overriding this method. The default implementation
   * is an identity function.
   */
  @SuppressWarnings("unchecked")
  protected void reduce(KEYIN key, Iterable<VALUEIN> values, Context context)
      throws IOException, InterruptedException {
    for (VALUEIN value : values) {
      context.write((KEYOUT) key, (VALUEOUT) value);
    }
  }
}
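Continuing the hypothetical word-count example in plain Python: after the shuffle, a reducer sees its pairs sorted by key, so runs of equal keys can be grouped and handed to the reduce function as K₂, iter(V₂):

```python
from itertools import groupby
from operator import itemgetter

def reduce_fn(key, values):
    """Word-count reduce: K2, iter(V2) -> list of (K3, V3)."""
    return [(key, sum(values))]

# Sorted map output, as a reducer would receive it after the shuffle:
pairs = sorted([("adams", 1), ("dunster", 1), ("adams", 1), ("dunster", 1)])
results = []
for key, group in groupby(pairs, key=itemgetter(0)):
    results.extend(reduce_fn(key, (v for _, v in group)))
print(results)  # [('adams', 2), ('dunster', 2)]
```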
Some samples...
Build an inverted index.
Summarize data grouped by a key.
Build map tiles from geographic data.
OCRing many images.
Learning ML models. (e.g., Naive Bayes for text classification)
Augment traditional BI/DWH technologies (by archiving raw data).
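The first sample above, an inverted index, maps naturally onto the model. A hypothetical sketch in plain Python (the document IDs and function names are invented for illustration):

```python
def index_map(doc_id, text):
    """Map: emit (word, doc_id) once per distinct word in the document."""
    return [(word, doc_id) for word in set(text.split())]

def index_reduce(word, doc_ids):
    """Reduce: the posting list is the sorted set of documents seen."""
    return (word, sorted(set(doc_ids)))

print(index_reduce("hadoop", ["d2", "d1", "d2"]))  # ('hadoop', ['d1', 'd2'])
```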
M/R
Tasktrackers on the same machines as datanodes
One Rack A Different Rack
Job on stars / Different job / Idle
CHAPTER 6
How MapReduce Works
In this chapter we’ll look at how MapReduce in Hadoop works in detail. This knowledge provides a good foundation for writing more advanced MapReduce programs, which we will cover in the following two chapters.
Anatomy of a MapReduce Job Run
You can run a MapReduce job with a single line of code: JobClient.runJob(conf). It’s very short, but it conceals a great deal of processing behind the scenes. This section uncovers the steps Hadoop takes to run a job.
The whole process is illustrated in Figure 6-1. At the highest level there are four independent entities:
• The client, which submits the MapReduce job.
• The jobtracker, which coordinates the job run. The jobtracker is a Java application whose main class is JobTracker.
• The tasktrackers, which run the tasks that the job has been split into. Tasktrackers are Java applications whose main class is TaskTracker.
• The distributed filesystem (normally HDFS, covered in Chapter 3), which is used for sharing job files between the other entities.
Figure 6-1. How Hadoop runs a MapReduce job
M/R
M/R Failures
Task fails
Try again?
Try again somewhere else?
Report failure
Retries possible because of idempotence
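Why idempotence makes "try again" safe can be shown with a toy retry loop (plain Python; the task and its failure count are invented for illustration). Because a task is a deterministic function of its input split and commits output only on success, re-running it after a crash yields the same result:

```python
def run_with_retries(task, attempts=4):
    """Re-run a failed task, much as the jobtracker re-schedules tasks."""
    for _ in range(attempts):
        try:
            return task()
        except RuntimeError:
            continue  # try again (possibly on another tasktracker)
    raise RuntimeError("giving up: task failed %d times" % attempts)

state = {"runs": 0}
def flaky_wordcount():
    state["runs"] += 1
    if state["runs"] < 3:               # first two attempts crash
        raise RuntimeError("node died")
    return {"dunster": 4, "adams": 2}   # deterministic output

print(run_with_retries(flaky_wordcount))  # {'dunster': 4, 'adams': 2}
```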
There’s more than the Java API
Streaming: perl, python, ruby, whatever. Mappers and reducers talk to Hadoop via stdin/stdout/stderr.
Pig: higher-level dataflow language for easy ad-hoc analysis. Developed at Yahoo!
Hive: SQL interface. Great for analysts. Developed at Facebook.
Many tasks actually require a series of M/R jobs; that’s ok!
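Streaming's contract is just lines on stdin/stdout, with key and value separated by a tab (the framework sorts between the two stages). A minimal hypothetical word-count pair, written here as Python generators instead of two separate scripts:

```python
def streaming_map(lines):
    """Streaming-style mapper: text lines in, 'word<TAB>1' lines out."""
    for line in lines:
        for word in line.split():
            yield "%s\t1" % word

def streaming_reduce(lines):
    """Streaming-style reducer: input lines arrive sorted by key."""
    current, count = None, 0
    for line in lines:
        word, n = line.split("\t")
        if word != current:
            if current is not None:
                yield "%s\t%d" % (current, count)
            current, count = word, 0
        count += int(n)
    if current is not None:
        yield "%s\t%d" % (current, count)

# sorted() stands in for the framework's sort between map and reduce:
mapped = sorted(streaming_map(["adams dunster", "dunster adams dunster"]))
print(list(streaming_reduce(mapped)))  # ['adams\t2', 'dunster\t3']
```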
The Hadoop Ecosystem
HDFS (Hadoop Distributed File System)
HBase (Key-Value store / Column DB)
MapReduce (Job Scheduling/Execution System)
Pig (Data Flow), Hive (SQL)
BI Reporting, ETL Tools
Avro (Serialization)
Zookeeper (Coordination)
Sqoop
RDBMS
Hadoop in the Wild (yes, it’s used in production)
Yahoo! Hadoop Clusters: > 82PB, >25k machines(Eric14, HadoopWorld NYC ’09)
Facebook: 15TB new data per day; 10000+ cores, 12+ PB
Twitter: ~1TB per day, ~80 nodes
Lots of 5-40 node clusters at companies without petabytes of data (web, retail, finance, telecom, research)
Ok, fine, what next?
Get Hadoop!
Cloudera’s Distribution for Hadoop
http://hadoop.apache.org/
Try it out! (Locally, or on EC2)
Watch free training videos on
http://cloudera.com/
Door Prize
Questions?
todd@cloudera.com
(feedback? yes!)
(hiring? yes!)
Important APIs
M/R Flow:
Input Format: data → K₁,V₁
Mapper: K₁,V₁ → K₂,V₂
Combiner: K₂, iter(V₂) → K₂,V₂
Partitioner: K₂,V₂ → int
Reducer: K₂, iter(V₂) → K₃,V₃
Out. Format: K₃,V₃ → data
(→ is 1:many)
Other: Writable, JobClient, *Context, Filesystem
public int run(String[] args) throws Exception {
  if (args.length < 3) {
    System.out.println("Grep <inDir> <outDir> <regex> [<group>]");
    ToolRunner.printGenericCommandUsage(System.out);
    return -1;
  }

  Path tempDir = new Path("grep-temp-" +
      Integer.toString(new Random().nextInt(Integer.MAX_VALUE)));

  JobConf grepJob = new JobConf(getConf(), Grep.class);
  try {
    grepJob.setJobName("grep-search");

    FileInputFormat.setInputPaths(grepJob, args[0]);

    grepJob.setMapperClass(RegexMapper.class);
    grepJob.set("mapred.mapper.regex", args[2]);
    if (args.length == 4)
      grepJob.set("mapred.mapper.regex.group", args[3]);

    grepJob.setCombinerClass(LongSumReducer.class);
    grepJob.setReducerClass(LongSumReducer.class);

    FileOutputFormat.setOutputPath(grepJob, tempDir);
    grepJob.setOutputFormat(SequenceFileOutputFormat.class);
    grepJob.setOutputKeyClass(Text.class);
    grepJob.setOutputValueClass(LongWritable.class);

    JobClient.runJob(grepJob);

    JobConf sortJob = new JobConf(Grep.class);
    sortJob.setJobName("grep-sort");

    FileInputFormat.setInputPaths(sortJob, tempDir);
    sortJob.setInputFormat(SequenceFileInputFormat.class);

    sortJob.setMapperClass(InverseMapper.class);

    // write a single file
    sortJob.setNumReduceTasks(1);

    FileOutputFormat.setOutputPath(sortJob, new Path(args[1]));
    // sort by decreasing freq
    sortJob.setOutputKeyComparatorClass(LongWritable.DecreasingComparator.class);

    JobClient.runJob(sortJob);
  } finally {
    FileSystem.get(grepJob).delete(tempDir, true);
  }
  return 0;
}
the “grep” example
$ cat input.txt
adams dunster kirkland dunster
kirland dudley dunster
adams dunster winthrop

$ bin/hadoop jar hadoop-0.18.3-examples.jar grep input.txt output1 'dunster|adams'

$ cat output1/part-00000
4       dunster
2       adams
  JobConf grepJob = new JobConf(getConf(), Grep.class);
  try {
    grepJob.setJobName("grep-search");

    FileInputFormat.setInputPaths(grepJob, args[0]);

    grepJob.setMapperClass(RegexMapper.class);
    grepJob.set("mapred.mapper.regex", args[2]);
    if (args.length == 4)
      grepJob.set("mapred.mapper.regex.group", args[3]);

    grepJob.setCombinerClass(LongSumReducer.class);
    grepJob.setReducerClass(LongSumReducer.class);

    FileOutputFormat.setOutputPath(grepJob, tempDir);
    grepJob.setOutputFormat(SequenceFileOutputFormat.class);
    grepJob.setOutputKeyClass(Text.class);
    grepJob.setOutputValueClass(LongWritable.class);

    JobClient.runJob(grepJob);
  } ...
Job 1 of 2
    JobConf sortJob = new JobConf(Grep.class);
    sortJob.setJobName("grep-sort");

    FileInputFormat.setInputPaths(sortJob, tempDir);
    sortJob.setInputFormat(SequenceFileInputFormat.class);

    sortJob.setMapperClass(InverseMapper.class);

    // write a single file
    sortJob.setNumReduceTasks(1);

    FileOutputFormat.setOutputPath(sortJob, new Path(args[1]));
    // sort by decreasing freq
    sortJob.setOutputKeyComparatorClass(
        LongWritable.DecreasingComparator.class);

    JobClient.runJob(sortJob);
  } finally {
    FileSystem.get(grepJob).delete(tempDir, true);
  }
  return 0;
}
Job 2 of 2
(implicit identity reducer)