data-intensive text processing with mapreduce jimmy lin the ischool university of maryland sunday,...

36
Data-Intensive Text Processing with MapReduce Jimmy Lin The iSchool University of Maryland Sunday, May 31, 2009 This work is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 United See http://creativecommons.org/licenses/by-nc-sa/3.0/us/ for details Chris Dyer Department of Linguistics University of Maryland Tutorial at 2009 North American Chapter of the Association for Computational Linguistics―Human Language Technologies Conference (NAACL HLT 2009) (Bonus session)

Post on 21-Dec-2015

215 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Data-Intensive Text Processing with MapReduce Jimmy Lin The iSchool University of Maryland Sunday, May 31, 2009 This work is licensed under a Creative

Data-Intensive Text Processing with MapReduce

Jimmy LinThe iSchoolUniversity of Maryland

Sunday, May 31, 2009

This work is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 United StatesSee http://creativecommons.org/licenses/by-nc-sa/3.0/us/ for details

Chris DyerDepartment of LinguisticsUniversity of Maryland

Tutorial at 2009 North American Chapter of the Association for Computational Linguistics―Human Language Technologies Conference (NAACL HLT 2009)

(Bonus session)

Page 2: Data-Intensive Text Processing with MapReduce Jimmy Lin The iSchool University of Maryland Sunday, May 31, 2009 This work is licensed under a Creative

Agenda Hadoop “nuts and bolts”

“Hello World” Hadoop example(distributed word count)

Running Hadoop in “standalone” mode

Running Hadoop on EC2

Open-source Hadoop ecosystem

Exercises and “office hours”

Page 3: Data-Intensive Text Processing with MapReduce Jimmy Lin The iSchool University of Maryland Sunday, May 31, 2009 This work is licensed under a Creative

Hadoop “nuts and bolts”

Page 4: Data-Intensive Text Processing with MapReduce Jimmy Lin The iSchool University of Maryland Sunday, May 31, 2009 This work is licensed under a Creative

Source: http://davidzinger.wordpress.com/2007/05/page/2/

Page 5: Data-Intensive Text Processing with MapReduce Jimmy Lin The iSchool University of Maryland Sunday, May 31, 2009 This work is licensed under a Creative

Hadoop Zen Don’t get frustrated (take a deep breath)…

Remember this when you experience those W$*#T@F! moments

This is bleeding edge technology: Lots of bugs Stability issues Even lost data To upgrade or not to upgrade (damned either way)? Poor documentation (or none)

But… Hadoop is the path to data nirvana?

Page 6: Data-Intensive Text Processing with MapReduce Jimmy Lin The iSchool University of Maryland Sunday, May 31, 2009 This work is licensed under a Creative

Cloud9

Library used for teaching cloud computing courses at Maryland

Demos, sample code, etc. Computing conditional probabilities Pairs vs. stripes Complex data types Boilerplate code for working various IR collections

Dog food for research

Open source, anonymous svn access

Page 7: Data-Intensive Text Processing with MapReduce Jimmy Lin The iSchool University of Maryland Sunday, May 31, 2009 This work is licensed under a Creative

JobTracker

TaskTracker TaskTracker TaskTracker

Master node

Slave node Slave node Slave node

Client

Page 8: Data-Intensive Text Processing with MapReduce Jimmy Lin The iSchool University of Maryland Sunday, May 31, 2009 This work is licensed under a Creative

From Theory to Practice

Hadoop ClusterYou

1. Scp data to cluster2. Move data into HDFS

3. Develop code locally

4. Submit MapReduce job4a. Go back to Step 3

5. Move data out of HDFS6. Scp data from cluster

Page 9: Data-Intensive Text Processing with MapReduce Jimmy Lin The iSchool University of Maryland Sunday, May 31, 2009 This work is licensed under a Creative

Data Types in Hadoop

Writable Defines a de/serialization protocol. Every data type in Hadoop is a Writable.

WritableComprable Defines a sort order. All keys must be of this type (but not values).

IntWritableLongWritableText…

Concrete classes for different data types.

Page 10: Data-Intensive Text Processing with MapReduce Jimmy Lin The iSchool University of Maryland Sunday, May 31, 2009 This work is licensed under a Creative

Complex Data Types in Hadoop How do you implement complex data types?

The easiest way: Encoded it as Text, e.g., (a, b) = “a:b” Use regular expressions to parse and extract data Works, but pretty hack-ish

The hard way: Define a custom implementation of WritableComprable Must implement: readFields, write, compareTo Computationally efficient, but slow for rapid prototyping

Alternatives: Cloud9 offers two other choices: Tuple and JSON Plus, a number of frequently-used data types

Page 11: Data-Intensive Text Processing with MapReduce Jimmy Lin The iSchool University of Maryland Sunday, May 31, 2009 This work is licensed under a Creative

Input file (on HDFS)

InputSplit

RecordReader

Mapper

Partitioner

Reducer

RecordWriter

Output file (on HDFS)

InputFormat

OutputFormat

Page 12: Data-Intensive Text Processing with MapReduce Jimmy Lin The iSchool University of Maryland Sunday, May 31, 2009 This work is licensed under a Creative

What version should I use?

Page 13: Data-Intensive Text Processing with MapReduce Jimmy Lin The iSchool University of Maryland Sunday, May 31, 2009 This work is licensed under a Creative

“Hello World” Hadoop example

Page 14: Data-Intensive Text Processing with MapReduce Jimmy Lin The iSchool University of Maryland Sunday, May 31, 2009 This work is licensed under a Creative

Hadoop in “standalone” mode

Page 15: Data-Intensive Text Processing with MapReduce Jimmy Lin The iSchool University of Maryland Sunday, May 31, 2009 This work is licensed under a Creative

Hadoop in EC2

Page 16: Data-Intensive Text Processing with MapReduce Jimmy Lin The iSchool University of Maryland Sunday, May 31, 2009 This work is licensed under a Creative

From Theory to Practice

Hadoop ClusterYou

1. Scp data to cluster2. Move data into HDFS

3. Develop code locally

4. Submit MapReduce job4a. Go back to Step 3

5. Move data out of HDFS6. Scp data from cluster

Page 17: Data-Intensive Text Processing with MapReduce Jimmy Lin The iSchool University of Maryland Sunday, May 31, 2009 This work is licensed under a Creative

On Amazon: With EC2

You

1. Scp data to cluster2. Move data into HDFS

3. Develop code locally

4. Submit MapReduce job4a. Go back to Step 3

5. Move data out of HDFS6. Scp data from cluster

0. Allocate Hadoop cluster

EC2

Your Hadoop Cluster

7. Clean up!

Uh oh. Where did the data go?

Page 18: Data-Intensive Text Processing with MapReduce Jimmy Lin The iSchool University of Maryland Sunday, May 31, 2009 This work is licensed under a Creative

On Amazon: EC2 and S3

Your Hadoop Cluster

S3(Persistent Store)

EC2(The Cloud)

Copy from S3 to HDFS

Copy from HFDS to S3

Page 19: Data-Intensive Text Processing with MapReduce Jimmy Lin The iSchool University of Maryland Sunday, May 31, 2009 This work is licensed under a Creative

Open-source Hadoop ecosystem

Page 20: Data-Intensive Text Processing with MapReduce Jimmy Lin The iSchool University of Maryland Sunday, May 31, 2009 This work is licensed under a Creative

Hadoop/HDFS

Page 21: Data-Intensive Text Processing with MapReduce Jimmy Lin The iSchool University of Maryland Sunday, May 31, 2009 This work is licensed under a Creative

Hadoop streaming

Page 22: Data-Intensive Text Processing with MapReduce Jimmy Lin The iSchool University of Maryland Sunday, May 31, 2009 This work is licensed under a Creative

HDFS/FUSE

Page 23: Data-Intensive Text Processing with MapReduce Jimmy Lin The iSchool University of Maryland Sunday, May 31, 2009 This work is licensed under a Creative

EC2/S3/EBS

Page 24: Data-Intensive Text Processing with MapReduce Jimmy Lin The iSchool University of Maryland Sunday, May 31, 2009 This work is licensed under a Creative

EMR

Page 25: Data-Intensive Text Processing with MapReduce Jimmy Lin The iSchool University of Maryland Sunday, May 31, 2009 This work is licensed under a Creative

Pig

Page 26: Data-Intensive Text Processing with MapReduce Jimmy Lin The iSchool University of Maryland Sunday, May 31, 2009 This work is licensed under a Creative

HBase

Page 27: Data-Intensive Text Processing with MapReduce Jimmy Lin The iSchool University of Maryland Sunday, May 31, 2009 This work is licensed under a Creative

Hypertable

Page 28: Data-Intensive Text Processing with MapReduce Jimmy Lin The iSchool University of Maryland Sunday, May 31, 2009 This work is licensed under a Creative

Hive

Page 29: Data-Intensive Text Processing with MapReduce Jimmy Lin The iSchool University of Maryland Sunday, May 31, 2009 This work is licensed under a Creative

Mahout

Page 30: Data-Intensive Text Processing with MapReduce Jimmy Lin The iSchool University of Maryland Sunday, May 31, 2009 This work is licensed under a Creative

Cassandra

Page 31: Data-Intensive Text Processing with MapReduce Jimmy Lin The iSchool University of Maryland Sunday, May 31, 2009 This work is licensed under a Creative

Dryad

Page 32: Data-Intensive Text Processing with MapReduce Jimmy Lin The iSchool University of Maryland Sunday, May 31, 2009 This work is licensed under a Creative

CUDA

Page 33: Data-Intensive Text Processing with MapReduce Jimmy Lin The iSchool University of Maryland Sunday, May 31, 2009 This work is licensed under a Creative

CELL

Page 34: Data-Intensive Text Processing with MapReduce Jimmy Lin The iSchool University of Maryland Sunday, May 31, 2009 This work is licensed under a Creative

Beware of toys!

Page 35: Data-Intensive Text Processing with MapReduce Jimmy Lin The iSchool University of Maryland Sunday, May 31, 2009 This work is licensed under a Creative

Exercises

Page 36: Data-Intensive Text Processing with MapReduce Jimmy Lin The iSchool University of Maryland Sunday, May 31, 2009 This work is licensed under a Creative

Questions?Comments?

Thanks to the organizations who support our work: