Introduction to MapReduce - Geoinsyssoft

Upload: anandh-kumar

Post on 04-Jun-2018

  • 8/13/2019 Introduction to MapReduce -Geoinsyssoft

    1/17

    What is MapReduce?

    A programming model (and its associated implementation)

    For processing large data sets

    Exploits a large set of commodity computers

    Executes processes in a distributed manner

    Offers a high degree of transparency

    In other words: simple, and maybe suitable for your tasks!

    Distributed Grep

    [Figure: very big data is split into chunks; grep runs on each split in parallel and produces matches; cat concatenates them into all matches.]

    Distributed Word Count

    [Figure: very big data is split into chunks; count runs on each split in parallel and produces partial counts; merge combines them into a merged count.]

    Partitioning Function

    Partitioning Function (2)

    Default: hash(key) mod R

    Guarantees:

    Relatively well-balanced partitions

    Ordering guarantee within each partition
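The default rule, hash(key) mod R, can be sketched in plain Java. This is only an illustration under assumptions: the class and method names (`HashPartitionSketch`, `partition`, `assign`) are hypothetical, keys are assumed to be strings, and `String.hashCode()` stands in for whatever hash function a real implementation uses.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of the default partitioning rule: hash(key) mod R. */
final class HashPartitionSketch {

    /** Returns the index (0 to R-1) of the reduce partition for a key. */
    static int partition(String key, int numReducers) {
        // floorMod keeps the result non-negative even when hashCode() is negative.
        return Math.floorMod(key.hashCode(), numReducers);
    }

    /** Groups keys into R buckets, one per reduce task. */
    static List<List<String>> assign(List<String> keys, int numReducers) {
        List<List<String>> buckets = new ArrayList<>();
        for (int i = 0; i < numReducers; i++) {
            buckets.add(new ArrayList<>());
        }
        for (String key : keys) {
            buckets.get(partition(key, numReducers)).add(key);
        }
        return buckets;
    }

    public static void main(String[] args) {
        List<String> keys = List.of("apple", "banana", "cherry", "apple", "date");
        // Every occurrence of the same key lands in the same bucket.
        System.out.println(assign(keys, 3));
    }
}
```

Because the bucket depends only on the key, all values for one key reach the same reduce task, which is what lets a reducer see every count for a word.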

    Distributed Sort

    Map: emit(key, value)

    Reduce (with R=1): emit(key, value)
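The sort works because map and reduce are both identity functions and the ordering guarantee within a partition does the real work. A minimal in-memory sketch, assuming string keys and values and using a `TreeMap` as a stand-in for the framework's sorted shuffle (the class name `SortSketch` is hypothetical):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical sketch: distributed sort with a single reduce partition (R=1). */
final class SortSketch {

    /** Returns the input records ordered by key. */
    static List<Map.Entry<String, String>> sort(List<Map.Entry<String, String>> records) {
        // Map phase is the identity: emit(key, value).
        // The TreeMap plays the role of the shuffle, which delivers keys in order.
        TreeMap<String, List<String>> shuffled = new TreeMap<>();
        for (Map.Entry<String, String> r : records) {
            shuffled.computeIfAbsent(r.getKey(), k -> new ArrayList<>()).add(r.getValue());
        }
        // Reduce phase is also the identity: emit(key, value) for each grouped value.
        List<Map.Entry<String, String>> out = new ArrayList<>();
        for (Map.Entry<String, List<String>> e : shuffled.entrySet()) {
            for (String v : e.getValue()) {
                out.add(Map.entry(e.getKey(), v));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(sort(List.of(
                Map.entry("b", "2"), Map.entry("a", "1"), Map.entry("c", "3"))));
    }
}
```

With R=1 every key falls into the same partition, so the single reducer's output is globally sorted.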

    MapReduce

    Distributed Grep

    Map: if match(value, pattern) emit(value, 1)

    Reduce: emit(key, sum(value*))
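The grep pseudocode can be tried out locally with a small simulation. This is a sketch under assumptions: splits are modeled as in-memory lists of lines, `match` is taken to mean a regular-expression search, and the class name `GrepSketch` is hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Pattern;

/** Hypothetical in-memory simulation of distributed grep. */
final class GrepSketch {

    /** Returns each matching line with the number of times it occurs. */
    static Map<String, Integer> grep(List<List<String>> splits, String regex) {
        Pattern pattern = Pattern.compile(regex);

        // Map phase: for every split, emit (line, 1) when the line matches.
        List<Map.Entry<String, Integer>> emitted = new ArrayList<>();
        for (List<String> split : splits) {
            for (String line : split) {
                if (pattern.matcher(line).find()) {
                    emitted.add(Map.entry(line, 1));
                }
            }
        }

        // Shuffle and reduce phase: sum the values emitted for each key.
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, Integer> kv : emitted) {
            result.merge(kv.getKey(), kv.getValue(), Integer::sum);
        }
        return result;
    }

    public static void main(String[] args) {
        List<List<String>> splits = List.of(
                List.of("error: disk", "ok"),
                List.of("error: disk", "warn", "error: net"));
        System.out.println(grep(splits, "error")); // prints {error: disk=2, error: net=1}
    }
}
```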

    Distributed Word Count

    Map: for all w in value do emit(w, 1)

    Reduce: emit(key, sum(value*))
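To make the three stages of word count visible (map, group by key, reduce), here is a plain-Java simulation. It is a sketch only: it runs in one process, tokenizes on whitespace, and the names `WordCountSketch`, `map`, `reduce` and `run` are chosen for illustration rather than taken from any library.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;

/** Hypothetical in-memory simulation of MapReduce word count. */
final class WordCountSketch {

    /** Map: for all w in value do emit(w, 1). */
    static List<Map.Entry<String, Integer>> map(String value) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        StringTokenizer itr = new StringTokenizer(value);
        while (itr.hasMoreTokens()) {
            out.add(Map.entry(itr.nextToken(), 1));
        }
        return out;
    }

    /** Reduce: emit(key, sum(value*)). */
    static int reduce(String key, List<Integer> values) {
        int sum = 0;
        for (int v : values) {
            sum += v;
        }
        return sum;
    }

    /** Runs map on every line, groups the pairs by key, then reduces each group. */
    static Map<String, Integer> run(List<String> lines) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (String line : lines) {
            for (Map.Entry<String, Integer> kv : map(line)) {
                grouped.computeIfAbsent(kv.getKey(), k -> new ArrayList<>()).add(kv.getValue());
            }
        }
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : grouped.entrySet()) {
            counts.put(e.getKey(), reduce(e.getKey(), e.getValue()));
        }
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(run(List.of("the quick fox", "the lazy dog", "the end")));
        // prints {dog=1, end=1, fox=1, lazy=1, quick=1, the=3}
    }
}
```

In a real cluster the grouping step is the shuffle, and each reduce call may run on a different machine; only `map` and `reduce` are written by the user.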

    MapReduce Transparencies

    Plus the Google File System (GFS):

    Parallelization

    Fault tolerance

    Locality optimization

    Load balancing

    Suitable for your task if

    You have a cluster

    You are working with a large dataset

    You are working with independent data (or data assumed to be independent)

    The task can be cast into map and reduce

    MapReduce outside Google

    Hadoop (Java)

    Emulates MapReduce and GFS

    The architecture of Hadoop MapReduce and DFS is master/slave

                 Master       Slave
    MapReduce    jobtracker   tasktracker
    DFS          namenode     datanode

    Example Word Count (1)

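    Map

    // Imports belong at the top of the enclosing source file.
    import java.io.IOException;
    import java.util.StringTokenizer;
    import org.apache.hadoop.io.*;
    import org.apache.hadoop.mapred.*;

    public static class MapClass extends MapReduceBase implements Mapper {
        // Reused across calls to avoid allocating a new object per token.
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        // Called once per input record (a line of text here);
        // emits (word, 1) for every token in the line.
        public void map(WritableComparable key, Writable value,
                        OutputCollector output, Reporter reporter)
                throws IOException {
            String line = ((Text) value).toString();
            StringTokenizer itr = new StringTokenizer(line);
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                output.collect(word, one);
            }
        }
    }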

    Example Word Count (3)

    Main

    public static void main(String[] args) throws IOException {
        // checking goes here
        JobConf conf = new JobConf();

        // Types of the (key, value) pairs the job emits.
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(IntWritable.class);

        // Reduce doubles as the combiner.
        conf.setMapperClass(MapClass.class);
        conf.setCombinerClass(Reduce.class);
        conf.setReducerClass(Reduce.class);

        // Input and output locations come from the command line.
        conf.setInputPath(new Path(args[0]));
        conf.setOutputPath(new Path(args[1]));

        // Submit the job and wait for it to finish.
        JobClient.runJob(conf);
    }
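The job above registers the same class as both combiner and reducer. That is safe for word count because addition is associative and commutative, so partial sums computed per split give the same totals as summing every (word, 1) pair at the reducer. The sketch below illustrates the idea in plain Java; it does not use the Hadoop API, and the names `CombinerSketch`, `combine` and `reduce` are hypothetical.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical sketch: pre-aggregating counts per split before the final reduce. */
final class CombinerSketch {

    /** Combiner: sums the (word, 1) pairs of a single split locally. */
    static Map<String, Integer> combine(List<String> wordsInSplit) {
        Map<String, Integer> partial = new TreeMap<>();
        for (String w : wordsInSplit) {
            partial.merge(w, 1, Integer::sum);
        }
        return partial;
    }

    /** Reducer: sums the partial counts coming from all splits. */
    static Map<String, Integer> reduce(List<Map<String, Integer>> partials) {
        Map<String, Integer> total = new TreeMap<>();
        for (Map<String, Integer> partial : partials) {
            for (Map.Entry<String, Integer> e : partial.entrySet()) {
                total.merge(e.getKey(), e.getValue(), Integer::sum);
            }
        }
        return total;
    }

    public static void main(String[] args) {
        Map<String, Integer> first = combine(List.of("a", "b", "a"));  // {a=2, b=1}
        Map<String, Integer> second = combine(List.of("b", "b", "c")); // {b=2, c=1}
        System.out.println(reduce(List.of(first, second)));            // prints {a=2, b=3, c=1}
    }
}
```

Here the two splits send four partial records to the reducer instead of six (word, 1) pairs; on real data with many repeated words the saving in network traffic is usually much larger.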

    One time setup

    set hadoop-site.xml and slaves

    Initiate namenode

    Run Hadoop MapReduce and DFS

    Upload your data to DFS

    Run your process

    Download your data from DFS

    Summary

    A simple programming model for processing large datasets on large clusters of computers

    Fun to use: focus on the problem and let the library deal with the messy details

    References

    Original paper (http://labs.google.com/papers/mapreduce.html)

    On Wikipedia (http://en.wikipedia.org/wiki/MapReduce)

    Hadoop - MapReduce in Java (http://lucene.apache.org/hadoop/)

    Starfish - MapReduce in Ruby (http://rufy.com/starfish/)