Data-Intensive Text Processing with MapReduce
Jimmy Lin, The iSchool, University of Maryland
Sunday, May 31, 2009
This work is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 United States license. See http://creativecommons.org/licenses/by-nc-sa/3.0/us/ for details.
Chris Dyer, Department of Linguistics, University of Maryland
Tutorial at the 2009 North American Chapter of the Association for Computational Linguistics – Human Language Technologies Conference (NAACL HLT 2009)
(Bonus session)
Agenda
- Hadoop “nuts and bolts”
- “Hello World” Hadoop example (distributed word count)
- Running Hadoop in “standalone” mode
- Running Hadoop on EC2
- Open-source Hadoop ecosystem
- Exercises and “office hours”
Hadoop “nuts and bolts”
[Image source: http://davidzinger.wordpress.com/2007/05/page/2/]
Hadoop Zen
- Don’t get frustrated (take a deep breath)… Remember this when you experience those W$*#T@F! moments
- This is bleeding-edge technology:
  - Lots of bugs
  - Stability issues
  - Even lost data
  - To upgrade or not to upgrade (damned either way)?
  - Poor documentation (or none)
- But… Hadoop is the path to data nirvana?
Cloud9
- Library used for teaching cloud computing courses at Maryland
- Demos, sample code, etc.:
  - Computing conditional probabilities
  - Pairs vs. stripes
  - Complex data types
  - Boilerplate code for working with various IR collections
- Dog food for our own research
- Open source, anonymous svn access
[Diagram: Hadoop cluster architecture. The master node runs the JobTracker; each slave node runs a TaskTracker; a client submits jobs to the JobTracker, which assigns tasks to the TaskTrackers.]
From Theory to Practice

[Diagram: you ↔ Hadoop cluster]
1. Scp data to cluster
2. Move data into HDFS
3. Develop code locally
4. Submit MapReduce job
4a. Go back to Step 3
5. Move data out of HDFS
6. Scp data from cluster
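A minimal sketch of steps 1–6 using standard commands; the hostname, paths, and job jar are hypothetical:

```
# 1. Copy input data to the cluster
scp corpus.txt user@cluster-gateway:~/

# 2. Move data into HDFS (run on the cluster)
hadoop fs -put corpus.txt /user/me/input/

# 3. Develop code locally, then...
# 4. Submit the MapReduce job (jar and class names are placeholders)
hadoop jar wordcount.jar WordCount /user/me/input /user/me/output

# 5. Move results out of HDFS
hadoop fs -get /user/me/output/part-* .

# 6. Copy results back to your own machine (run locally)
scp user@cluster-gateway:'part-*' .
```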
Data Types in Hadoop
- Writable: defines a de/serialization protocol. Every data type in Hadoop is a Writable.
- WritableComparable: defines a sort order. All keys must be of this type (but not values).
- IntWritable, LongWritable, Text, …: concrete classes for different data types.
Complex Data Types in Hadoop
How do you implement complex data types?
- The easiest way: encode it as Text, e.g., (a, b) = “a:b”, and use regular expressions to parse and extract the data. Works, but pretty hack-ish.
- The hard way: define a custom implementation of WritableComparable; must implement readFields, write, and compareTo. Computationally efficient, but slow for rapid prototyping. (See the sketch below.)
- Alternatives: Cloud9 offers two other choices, Tuple and JSON, plus a number of frequently-used data types.
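A minimal sketch of “the hard way”: a custom pair-of-ints key implementing WritableComparable. The class name and field layout are illustrative, not from the tutorial:

```java
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.WritableComparable;

// Hypothetical pair-of-ints key; demonstrates the three required methods.
public class IntPair implements WritableComparable<IntPair> {
  private int left;
  private int right;

  public IntPair() {}  // Hadoop needs a no-arg constructor for deserialization

  public IntPair(int left, int right) {
    this.left = left;
    this.right = right;
  }

  @Override
  public void write(DataOutput out) throws IOException {
    out.writeInt(left);   // serialization protocol
    out.writeInt(right);
  }

  @Override
  public void readFields(DataInput in) throws IOException {
    left = in.readInt();  // deserialization: same order as write
    right = in.readInt();
  }

  @Override
  public int compareTo(IntPair other) {  // sort order: left, then right
    if (left != other.left) {
      return left < other.left ? -1 : 1;
    }
    if (right != other.right) {
      return right < other.right ? -1 : 1;
    }
    return 0;
  }

  @Override
  public int hashCode() {  // used by the default partitioner
    return left * 163 + right;
  }

  @Override
  public boolean equals(Object o) {
    if (!(o instanceof IntPair)) return false;
    IntPair p = (IntPair) o;
    return left == p.left && right == p.right;
  }
}
```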
[Diagram: anatomy of a MapReduce job. An InputFormat divides the input file (on HDFS) into InputSplits; a RecordReader turns each split into records for the Mapper; a Partitioner routes the Mapper’s intermediate keys to Reducers; the OutputFormat’s RecordWriter writes each Reducer’s output to an output file (on HDFS).]
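To make the pipeline concrete, a sketch of the mapper and reducer for distributed word count (the “Hello World” example) against the org.apache.hadoop.mapreduce API; the class names are illustrative:

```java
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Mapper: for each input line, emit a (word, 1) pair per token.
public class WordCountMapper
    extends Mapper<LongWritable, Text, Text, IntWritable> {
  private static final IntWritable ONE = new IntWritable(1);
  private final Text word = new Text();

  @Override
  public void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    StringTokenizer itr = new StringTokenizer(value.toString());
    while (itr.hasMoreTokens()) {
      word.set(itr.nextToken());
      context.write(word, ONE);
    }
  }
}

// Reducer: sum the counts for each word.
class WordCountReducer
    extends Reducer<Text, IntWritable, Text, IntWritable> {
  private final IntWritable result = new IntWritable();

  @Override
  public void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {
    int sum = 0;
    for (IntWritable v : values) {
      sum += v.get();
    }
    result.set(sum);
    context.write(key, result);
  }
}
```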
What version should I use?
“Hello World” Hadoop example
Hadoop in “standalone” mode
Hadoop in EC2
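A note on “standalone” mode: out of the box, Hadoop runs everything in a single local JVM against the local filesystem, no daemons required, which is convenient for developing and debugging. A minimal sketch, assuming the word-count job above is packaged as wordcount.jar (names hypothetical):

```
# Standalone (local) mode is the default, unconfigured setup;
# input/ and output/ are ordinary local directories.
hadoop jar wordcount.jar WordCount input/ output/
cat output/part-*
```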
On Amazon: With EC2

[Diagram: you ↔ your Hadoop cluster, running inside EC2]
0. Allocate Hadoop cluster (on EC2)
1. Scp data to cluster
2. Move data into HDFS
3. Develop code locally
4. Submit MapReduce job
4a. Go back to Step 3
5. Move data out of HDFS
6. Scp data from cluster
7. Clean up!

Uh oh. Where did the data go? (Tearing down the cluster destroys HDFS, and everything in it, along with it.)
On Amazon: EC2 and S3

[Diagram: EC2 (the cloud) hosts your Hadoop cluster; S3 is the persistent store. Copy input from S3 to HDFS before running jobs; copy results from HDFS back to S3 before tearing the cluster down.]
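A minimal sketch of the S3 ↔ HDFS copies using hadoop distcp; the bucket name and paths are hypothetical, and the s3n:// credential setup depends on your Hadoop configuration:

```
# Copy input from S3 (persistent) into the cluster's HDFS
hadoop distcp s3n://my-bucket/input hdfs:///user/me/input

# ... run MapReduce jobs ...

# Copy results from HDFS back to S3 before tearing down the cluster
hadoop distcp hdfs:///user/me/output s3n://my-bucket/output
```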
Open-source Hadoop ecosystem
- Hadoop/HDFS: the core MapReduce framework and distributed filesystem
- Hadoop streaming: write mappers and reducers in any language via stdin/stdout (see the sketch below)
- HDFS/FUSE: mount HDFS as an ordinary filesystem
- EC2/S3/EBS: Amazon’s compute, storage, and block-storage services
- EMR: Amazon Elastic MapReduce, hosted Hadoop
- Pig: a dataflow language that compiles to MapReduce
- HBase: a BigTable-style column store on HDFS
- Hypertable: another open-source BigTable implementation
- Hive: SQL-style data warehousing on Hadoop
- Mahout: machine learning on Hadoop
- Cassandra: a distributed key-value/column store
- Dryad: Microsoft’s dataflow alternative to MapReduce
- CUDA: NVIDIA’s GPU computing platform
- CELL: the Cell Broadband Engine processor

Beware of toys!
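Since Hadoop streaming appears in the list above: it runs any executables that read records from stdin and write tab-separated key–value lines to stdout as the mapper and reducer. A minimal sketch using stock Unix tools (the streaming jar’s exact path varies across Hadoop versions):

```
# Identity mapper, line/word/byte-counting reducer, all via stdin/stdout.
hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-streaming-*.jar \
  -input /user/me/input \
  -output /user/me/output \
  -mapper /bin/cat \
  -reducer /usr/bin/wc
```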
Exercises
Questions? Comments?
Thanks to the organizations that support our work: [sponsor logos]