
Data-Intensive Text Processing with MapReduce

Jimmy Lin, The iSchool, University of Maryland

Sunday, May 31, 2009

This work is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 United States license. See http://creativecommons.org/licenses/by-nc-sa/3.0/us/ for details.

Chris Dyer, Department of Linguistics, University of Maryland

Tutorial at the 2009 North American Chapter of the Association for Computational Linguistics – Human Language Technologies Conference (NAACL HLT 2009)

(Bonus session)

Agenda

Hadoop “nuts and bolts”

“Hello World” Hadoop example (distributed word count)

Running Hadoop in “standalone” mode

Running Hadoop on EC2

Open-source Hadoop ecosystem

Exercises and “office hours”

Hadoop “nuts and bolts”

Source: http://davidzinger.wordpress.com/2007/05/page/2/

Hadoop Zen

Don’t get frustrated (take a deep breath)…

Remember this when you experience those W$*#T@F! moments

This is bleeding edge technology:
Lots of bugs
Stability issues
Even lost data
To upgrade or not to upgrade (damned either way)?
Poor documentation (or none)

But… Hadoop is the path to data nirvana?

Cloud9

Library used for teaching cloud computing courses at Maryland

Demos, sample code, etc.:
Computing conditional probabilities
Pairs vs. stripes
Complex data types
Boilerplate code for working with various IR collections

Dog food for research

Open source, anonymous svn access
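The pairs-vs.-stripes demo can be illustrated with a small standalone sketch. This is plain Java, not actual MapReduce code; the toy corpus and the method names are made up for illustration. "Pairs" keeps one ((w, u), count) entry per co-occurring word pair, while "stripes" keeps one associative array per word, from which conditional probabilities fall out directly:

```java
import java.util.*;

public class PairsVsStripes {
    // Hypothetical toy corpus: each inner array is one "sentence".
    static final String[][] SENTENCES = { {"a", "b", "a"}, {"b", "a"} };

    // Pairs representation: one ((w,u), count) entry per co-occurring pair.
    public static Map<String, Integer> pairs() {
        Map<String, Integer> pairs = new TreeMap<>();
        for (String[] s : SENTENCES)
            for (int i = 0; i < s.length; i++)
                for (int j = 0; j < s.length; j++)
                    if (i != j) pairs.merge(s[i] + "," + s[j], 1, Integer::sum);
        return pairs;
    }

    // Stripes representation: one {u: count, ...} map (a "stripe") per word w.
    public static Map<String, Map<String, Integer>> stripes() {
        Map<String, Map<String, Integer>> stripes = new TreeMap<>();
        for (String[] s : SENTENCES)
            for (int i = 0; i < s.length; i++)
                for (int j = 0; j < s.length; j++) {
                    if (i == j) continue;
                    stripes.computeIfAbsent(s[i], k -> new TreeMap<>())
                           .merge(s[j], 1, Integer::sum);
                }
        return stripes;
    }

    // P(u | w) needs only w's stripe: count(w,u) / total count of the stripe.
    public static double conditional(String w, String u) {
        Map<String, Integer> stripe = stripes().get(w);
        int total = 0;
        for (int c : stripe.values()) total += c;
        return (double) stripe.get(u) / total;
    }

    public static void main(String[] args) {
        System.out.println("pairs   = " + pairs());
        System.out.println("stripes = " + stripes());
        System.out.println("P(b|a)  = " + conditional("a", "b"));  // 3/5 = 0.6
    }
}
```

The difference matters in MapReduce: pairs emit many small intermediate records, while stripes emit fewer but larger ones and let a reducer compute a whole conditional distribution from a single key.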

[Diagram: Hadoop cluster architecture. A Client submits jobs to the JobTracker on the master node; a TaskTracker on each of the slave nodes executes the tasks.]

From Theory to Practice

[Diagram: you and the Hadoop cluster]

1. Scp data to cluster
2. Move data into HDFS
3. Develop code locally
4. Submit MapReduce job
4a. Go back to Step 3
5. Move data out of HDFS
6. Scp data from cluster
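Steps 1–6 can be sketched as a handful of commands. The host name, paths, and the jar and class names below are hypothetical, and the hadoop commands run on the cluster side (e.g., over ssh on a gateway node):

```shell
scp -r data/ user@cluster:data/                    # 1. copy raw data to the cluster
hadoop fs -put data/ input/                        # 2. load it into HDFS
# 3. develop code locally; build a job jar (here: wordcount.jar)
hadoop jar wordcount.jar WordCount input/ output/  # 4. submit the MapReduce job
                                                   # 4a. debug, edit, resubmit
hadoop fs -get output/ results/                    # 5. copy results out of HDFS
scp -r user@cluster:results/ .                     # 6. copy results back down
```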

Data Types in Hadoop

Writable: defines a de/serialization protocol. Every data type in Hadoop is a Writable.

WritableComparable: defines a sort order. All keys must be of this type (but not values).

IntWritable, LongWritable, Text, …

Concrete classes for different data types.

Complex Data Types in Hadoop

How do you implement complex data types?

The easiest way: encode it as Text, e.g., (a, b) = “a:b”, and use regular expressions to parse and extract the fields. Works, but pretty hack-ish.

The hard way: define a custom implementation of WritableComparable. Must implement readFields, write, and compareTo. Computationally efficient, but slow for rapid prototyping.

Alternatives: Cloud9 offers two other choices, Tuple and JSON, plus a number of frequently used data types.
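A minimal sketch of what the hard way looks like, assuming a pair-of-strings key. Hadoop's actual Writable and WritableComparable interfaces live in org.apache.hadoop.io; this standalone analogue uses the same java.io.DataInput/DataOutput protocol, so readFields, write, and compareTo have the same shape:

```java
import java.io.*;

// Standalone analogue of a custom WritableComparable: a pair of strings.
public class TextPair implements Comparable<TextPair> {
    private String left, right;

    public TextPair() {}  // Hadoop instantiates Writables via a no-arg constructor
    public TextPair(String left, String right) { this.left = left; this.right = right; }

    // Serialize fields in a fixed order.
    public void write(DataOutput out) throws IOException {
        out.writeUTF(left);
        out.writeUTF(right);
    }

    // Deserialize in exactly the same order.
    public void readFields(DataInput in) throws IOException {
        left = in.readUTF();
        right = in.readUTF();
    }

    // Sort order: keys are compared by left element, then right.
    @Override public int compareTo(TextPair o) {
        int c = left.compareTo(o.left);
        return c != 0 ? c : right.compareTo(o.right);
    }

    @Override public boolean equals(Object o) {
        return o instanceof TextPair && compareTo((TextPair) o) == 0;
    }
    @Override public int hashCode() { return left.hashCode() * 31 + right.hashCode(); }
    @Override public String toString() { return "(" + left + ", " + right + ")"; }

    public static void main(String[] args) throws IOException {
        // Round-trip through bytes, the way the framework would.
        TextPair p = new TextPair("a", "b");
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        p.write(new DataOutputStream(buf));
        TextPair q = new TextPair();
        q.readFields(new DataInputStream(new ByteArrayInputStream(buf.toByteArray())));
        System.out.println(p.equals(q));  // prints true
    }
}
```

In real Hadoop code the class would declare `implements WritableComparable<TextPair>` instead; everything else carries over.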

[Diagram: anatomy of a job. The InputFormat divides the input file (on HDFS) into InputSplits; a RecordReader turns each split into records for the Mapper; the Partitioner assigns each intermediate key to a Reducer; a RecordWriter and the OutputFormat produce the output file (on HDFS).]
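A single-process sketch of that data flow, with word count as the job. The class and method names here are made up; Hadoop's real Mapper, Partitioner, and Reducer classes are far richer, but the hand-offs between stages are the same:

```java
import java.util.*;

// Simplified, in-memory sketch of the map -> partition -> reduce hand-off.
public class MiniPipeline {
    static final int NUM_REDUCERS = 2;

    // Partitioner: decides which reducer receives a given intermediate key.
    static int partition(String key) {
        return (key.hashCode() & Integer.MAX_VALUE) % NUM_REDUCERS;
    }

    public static Map<String, Integer> wordCount(List<String> records) {
        // One in-memory "shuffle" buffer per reducer, keys kept sorted
        // (Hadoop sorts intermediate keys before reducing).
        List<Map<String, List<Integer>>> shuffle = new ArrayList<>();
        for (int i = 0; i < NUM_REDUCERS; i++) shuffle.add(new TreeMap<>());

        // Map phase: the "RecordReader" hands us one record (line) at a time;
        // the mapper emits (word, 1) for every token.
        for (String record : records)
            for (String word : record.split("\\s+"))
                shuffle.get(partition(word))
                       .computeIfAbsent(word, k -> new ArrayList<>()).add(1);

        // Reduce phase: each reducer sums the values grouped under each key.
        Map<String, Integer> out = new TreeMap<>();
        for (Map<String, List<Integer>> part : shuffle)
            for (Map.Entry<String, List<Integer>> e : part.entrySet()) {
                int sum = 0;
                for (int v : e.getValue()) sum += v;
                out.put(e.getKey(), sum);
            }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(wordCount(Arrays.asList("a b a", "b c")));  // prints {a=2, b=2, c=1}
    }
}
```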

What version should I use?

“Hello World” Hadoop example

Hadoop in “standalone” mode

Hadoop in EC2


On Amazon: With EC2

[Diagram: you and your Hadoop cluster running inside EC2]

0. Allocate Hadoop cluster
1. Scp data to cluster
2. Move data into HDFS
3. Develop code locally
4. Submit MapReduce job
4a. Go back to Step 3
5. Move data out of HDFS
6. Scp data from cluster
7. Clean up!

Uh oh. Where did the data go?

On Amazon: EC2 and S3

[Diagram: your Hadoop cluster runs in EC2 (the cloud); S3 is the persistent store. Copy input from S3 to HDFS before the job; copy output from HDFS back to S3 before tearing the cluster down.]
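These copies are typically done with hadoop distcp; the bucket and path names below are hypothetical (s3n:// was the S3 filesystem scheme of this era):

```shell
# Before the job: pull input from S3 (persistent) into HDFS (transient).
hadoop distcp s3n://my-bucket/input hdfs:///user/me/input

# After the job: push results back to S3 so they survive cluster teardown.
hadoop distcp hdfs:///user/me/output s3n://my-bucket/output
```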

Open-source Hadoop ecosystem

Hadoop/HDFS

Hadoop streaming

HDFS/FUSE

EC2/S3/EBS

EMR

Pig

HBase

Hypertable

Hive

Mahout

Cassandra

Dryad

CUDA

CELL

Beware of toys!

Exercises

Questions? Comments?

Thanks to the organizations who support our work:
