Big Data, a space adventure - Mario Cartia - Codemotion Milan 2014

Hands On Big Data: Getting Started With NoSQL And Hadoop - Mario Cartia

Post on 14-Jul-2015


Page 1: Big Data, a space adventure - Mario Cartia -  Codemotion Milan 2014

Hands On Big Data: Getting Started With NoSQL And Hadoop

Mario Cartia

Page 4

Big Data Facts

• Google processes about 20 PB (petabytes, 10^15 bytes) of data each day

• About 5 EB (exabytes, 10^18 bytes) of data in the world, 90% of it generated over the last 2 years

• Wearable computing and IoT…

Page 6

Big Data: 3V Model

• Big Data is not only about volume
– Volume: petabytes or more, not gigabytes
– Variety: structured and unstructured data
– Velocity: real-time or near real-time

Page 7

Big Data: Risk

Page 8

Big Data: Opportunity

Page 9

Big Data Facts

Page 10

Big Data Success Stories

Amazon.com, a pioneer of targeted advertising, became a big data user when Greg Linden, one of its software engineers, realized the potential of data-driven book recommendations after seeing the average results of the company's in-house review project

When Amazon compared the sales driven by the computer-generated recommendations against those driven by the in-house reviews, the results were much better for the data-derived material, and the approach revolutionized e-commerce

Page 11

Big Data Success Stories

Google Flu Trends is a web service operated by Google. It provides estimates of influenza activity for more than 25 countries. By aggregating Google search queries, it attempts to make accurate predictions about flu activity

In the 2009 flu pandemic Google Flu Trends tracked information about flu in the United States. In February 2010, the CDC identified influenza cases spiking in the mid-Atlantic region of the United States. However, Google’s data of search queries about flu symptoms was able to show that same spike two weeks prior to the CDC report being released

Page 12

Big Data Success Stories

reCAPTCHA is a user-dialogue system originally developed by Luis von Ahn, Ben Maurer, Colin McMillen, David Abraham and Manuel Blum at Carnegie Mellon University's main Pittsburgh campus, and acquired by Google in September 2009

The reCAPTCHA service supplies subscribing websites with images of words that optical character recognition (OCR) software has been unable to read. The subscribing websites present these images for humans to decipher as CAPTCHA words, as part of their normal validation procedures. They then return the results to the reCAPTCHA service, which sends the results to the digitization projects

Secondary data usage

Page 13

Big Data Techniques

• Statistics
• Data Warehouse
• Data Visualization
• Data Mining
• Prediction
• Machine Learning
• Advanced Analytics
• Correlation Analysis
• Business Intelligence

Page 14

The Traditional Approach

ETL: Extract, Transform, Load

• Extracts data from outside sources
• Transforms it to fit operational needs, which can include quality levels
• Loads it into the end target (database, operational data store, data mart or data warehouse)

Does it fit “big data” needs?
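The three ETL stages can be sketched in a few lines of Java. This is only a minimal in-memory illustration, not a real ETL tool: the hard-coded input lines, the `Reading` class, the quality rule (drop rows whose value is missing or not a number), the Fahrenheit-to-Celsius conversion and the `TreeMap` standing in for the end target are all assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class EtlSketch {

    /** Hypothetical record type for one cleaned row. */
    static final class Reading {
        final String sensor;
        final double celsius;

        Reading(String sensor, double celsius) {
            this.sensor = sensor;
            this.celsius = celsius;
        }
    }

    /** Extract: pull raw rows from an outside source (here, a hard-coded list). */
    static List<String> extract() {
        return Arrays.asList("s1,68.0", "s2,", "s1,71.6", "s3,abc", "s2,50.0");
    }

    /** Transform: parse each row, enforce a quality level, convert the unit. */
    static List<Reading> transform(List<String> rawLines) {
        List<Reading> clean = new ArrayList<>();
        for (String line : rawLines) {
            String[] parts = line.split(",", -1);
            if (parts.length != 2) {
                continue;
            }
            try {
                double fahrenheit = Double.parseDouble(parts[1]);
                clean.add(new Reading(parts[0], (fahrenheit - 32.0) * 5.0 / 9.0));
            } catch (NumberFormatException e) {
                // Assumed quality rule: rows with a missing or malformed value are dropped.
            }
        }
        return clean;
    }

    /** Load: write the cleaned rows into the end target (here, a sorted map). */
    static Map<String, List<Double>> load(List<Reading> readings) {
        Map<String, List<Double>> target = new TreeMap<>();
        for (Reading r : readings) {
            target.computeIfAbsent(r.sensor, k -> new ArrayList<>()).add(r.celsius);
        }
        return target;
    }

    public static void main(String[] args) {
        System.out.println(load(transform(extract())));
    }
}
```

Every stage in this sketch runs in a single process over the whole data set, which is the assumption that the question on this slide puts in doubt.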

Page 16

Hadoop Basics

Apache Hadoop is an open-source software framework for distributed storage and distributed processing of Big Data on clusters of commodity hardware

Page 17

Hadoop Basics

Hadoop was created by Doug Cutting and Mike Cafarella in 2005. Cutting, who was working at Yahoo! at the time, named it after his son's toy elephant

Page 18

Hadoop 1 vs. Hadoop 2

Page 19

Hadoop Distributions

Page 20

Hadoop Market

Page 21

Hadoop vs. RDBMS

Page 23

From RDBMS to NoSQL

A NoSQL (often interpreted as Not Only SQL) database provides a mechanism for storage and retrieval of data that is modeled in means other than the tabular relations used in relational databases

Page 24

From RDBMS to NoSQL

Motivations for this approach include simplicity of design, horizontal scaling and finer control over availability. The data structure (e.g. key-value, graph, or document) differs from the RDBMS, and therefore some operations are faster in NoSQL and some in RDBMS

Page 26

NoSQL Approaches

Most popular NoSQL database types:

• Document (MongoDB, CouchDB, Clusterpoint, Couchbase, MarkLogic, etc.)
• Key-value (Redis, MemcacheDB, Dynamo, FoundationDB, Riak, FairCom c-treeACE, Aerospike, etc.)
• Column (Accumulo, Cassandra, Druid, HBase, Vertica, etc.)
• Graph (Allegro, Neo4J, InfiniteGraph, OrientDB, Virtuoso, Stardog, etc.)
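To make the difference between two of these data models a little more concrete, here is a rough sketch that stores the same two user records once in key-value style and once in document style. Plain Java collections stand in for the databases; no real client API of any product listed above is used, and the record layout, the `_id` field name and the `namesInCity` helper are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class DataModelSketch {

    /** Key-value style: the value is an opaque blob, reachable only through its key. */
    static Map<String, String> keyValueStore() {
        Map<String, String> store = new HashMap<>();
        store.put("user:1", "Ada|Milan|java,hadoop");
        store.put("user:2", "Linus|Rome|c");
        return store;
    }

    /** Document style: each record is a nested structure with named fields. */
    static List<Map<String, Object>> documentStore() {
        List<Map<String, Object>> docs = new ArrayList<>();
        docs.add(doc("user:1", "Ada", "Milan", Arrays.asList("java", "hadoop")));
        docs.add(doc("user:2", "Linus", "Rome", Arrays.asList("c")));
        return docs;
    }

    private static Map<String, Object> doc(String id, String name, String city, List<String> skills) {
        Map<String, Object> d = new LinkedHashMap<>();
        d.put("_id", id);
        d.put("name", name);
        d.put("city", city);
        d.put("skills", skills);
        return d;
    }

    /** A query on a field: possible here because the fields are visible (done as a simple scan). */
    static List<String> namesInCity(List<Map<String, Object>> docs, String city) {
        List<String> names = new ArrayList<>();
        for (Map<String, Object> d : docs) {
            if (city.equals(d.get("city"))) {
                names.add((String) d.get("name"));
            }
        }
        return names;
    }

    public static void main(String[] args) {
        // Lookup by key: the operation a key-value model is built around.
        System.out.println(keyValueStore().get("user:1"));
        // Lookup by field: natural for documents, while the key-value blob would have to be parsed by the caller.
        System.out.println(namesInCity(documentStore(), "Milan"));
    }
}
```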

Page 27

NoSQL Approaches

Page 28

NoSQL: How To Choose

CAP theorem (Brewer)

Page 30

Hadoop Architecture Overview

Page 31

Hadoop Core Components

Page 34

MapReduce Model

• MapReduce is a programming model, and an associated implementation, for processing and generating large data sets with a parallel, distributed algorithm on a cluster

• The model is inspired by the map and reduce functions commonly used in functional programming, although their purpose in the MapReduce framework is not the same as in their original forms

Page 35

MapReduce Paper

Page 36

MapReduce Overview

• Map step: Each worker node applies the map() function to its local data and writes the output to temporary storage. A master node ensures that, when redundant copies of the input data exist, only one copy is processed

• Shuffle step: Worker nodes redistribute data based on the output keys (produced by the map() function), such that all data belonging to one key is located on the same worker node

• Reduce step: Worker nodes now process each group of output data, per key, in parallel
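The three steps can be imitated inside a single JVM. The sketch below is an illustration only and does not use Hadoop: each string in `splits` plays the role of one worker's local data, the reducers are just list positions, and routing a key to a reducer by `hashCode` modulo the number of reducers is an assumption of this example. Word counting is used as the job.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class ThreeStepSketch {

    /** Map step: one worker turns its local split into (word, 1) pairs. */
    static List<Map.Entry<String, Integer>> mapStep(String localSplit) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String word : localSplit.toLowerCase().split("\\s+")) {
            if (!word.isEmpty()) {
                out.add(new SimpleEntry<>(word, 1));
            }
        }
        return out;
    }

    /** Shuffle step: route every pair to the reducer that owns its key. */
    static List<Map<String, List<Integer>>> shuffleStep(
            List<List<Map.Entry<String, Integer>>> mapOutputs, int reducers) {
        List<Map<String, List<Integer>>> partitions = new ArrayList<>();
        for (int i = 0; i < reducers; i++) {
            partitions.add(new TreeMap<>());
        }
        for (List<Map.Entry<String, Integer>> output : mapOutputs) {
            for (Map.Entry<String, Integer> pair : output) {
                // All pairs with the same key end up in the same partition.
                int owner = Math.floorMod(pair.getKey().hashCode(), reducers);
                partitions.get(owner)
                        .computeIfAbsent(pair.getKey(), k -> new ArrayList<>())
                        .add(pair.getValue());
            }
        }
        return partitions;
    }

    /** Reduce step: one reducer sums the values of each key in its partition. */
    static Map<String, Integer> reduceStep(Map<String, List<Integer>> partition) {
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> group : partition.entrySet()) {
            int sum = 0;
            for (int v : group.getValue()) {
                sum += v;
            }
            counts.put(group.getKey(), sum);
        }
        return counts;
    }

    /** Runs the three steps in order and merges the reducers' outputs. */
    static Map<String, Integer> run(List<String> splits, int reducers) {
        List<List<Map.Entry<String, Integer>>> mapOutputs = new ArrayList<>();
        for (String split : splits) {
            mapOutputs.add(mapStep(split));
        }
        Map<String, Integer> result = new TreeMap<>();
        for (Map<String, List<Integer>> partition : shuffleStep(mapOutputs, reducers)) {
            result.putAll(reduceStep(partition));
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(run(Arrays.asList("to be or not", "to be"), 2));
    }
}
```

On a real cluster the calls to `mapStep` and `reduceStep` would run on different machines in parallel; here they run one after another, so only the data flow is shown.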

Page 38

Map Reduce: A really simple introduction

Dear <Your Name>,

As you know, we are building the blogging platform blogger2.com, and I need some statistics. I need to find out, across all blogs ever written on blogger.com, how many times one-character words occur (like 'a', 'I'), how many times two-character words occur (like 'be', 'is'), and so on, up to how many times ten-character words occur.

I know it's a really big job, so I will assign all 50,000 employees working in our company to work with you on this for a week. I am going on vacation for a week, and it's really important that I have this when I return. Good luck.

Regards,

The CEO

(src: http://ksat.me/map-reduce-a-really-simple-introduction-kloudo/)

Page 39

Map Reduce: A really simple introduction

The next day, you stand with a mic on the dais before the 50,000 employees and proclaim: for a week, you will all be divided into several groups:

• The Mappers (tens of thousands of people will be in this group)
• The Grouper (assume just one person for now)
• The Reducers (around 10 of them)
• The Master (that’s you)

Page 40

Map Reduce: A really simple introduction

• Each mapper will get a set of 50 blog URLs and a really big sheet of paper. Each of you needs to go to each of those URLs and, for each word in those blogs, write one line on the paper. The format of that line should be the number of characters in the word, then a comma, and then the actual word

• For example, if you find the word “a”, you write “1,a” on a new line of your paper, since the word “a” has only 1 character. If you find the word “hello”, you write “5,hello” on a new line

Page 41

Map Reduce: A really simple introduction

Each of you has 4 days. After 4 days, your sheet might look like this:

• “1,a”
• “5,hello”
• “2,if”
• .. and a million more lines

At the end of the 4th day, each of you will give your completely filled sheet to the Grouper

Page 42

Map Reduce: A really simple introduction

• I will give you 10 papers. The first paper will be marked 1, the second paper will be marked 2, and so on, up to 10

• You collect the output from the mappers and, for each line in a mapper’s sheet, if it says “1,”, you write the word on sheet 1; if it says “2,”, you write it on sheet 2

• For example, if the first line of a mapper’s sheet says “1,a”, you write “a” on sheet 1. If it says “2,if”, you write “if” on sheet 2. If it says “5,hello”, you write “hello” on sheet 5

Page 43

Map Reduce: A really simple introduction

So at the end of your work, the 10 sheets you have might look like this:

• Sheet 1: a, a, a, I, I, i, a, i, i, i … millions more
• Sheet 2: if, of, it, of, of, if, at, im, is, is, of, of … millions more
• Sheet 3: the, the, and, for, met, bet, the, the, and, … millions more
• ..
• Sheet 10: ……

Once you are done, you distribute each sheet to one reducer. For example, sheet 1 goes to reducer 1, sheet 2 goes to reducer 2, and so on

Page 44

Map Reduce: A really simple introduction

• Each of you gets one sheet from the Grouper. For each sheet, you count the number of words written on it and write that number in big bold letters on the back of the paper

• For example, if you are reducer 2, you get sheet 2 from the Grouper, which looks like this: “Sheet 2: if, of, it, of, of, if, at, im, is, is, of, of …”

• You count the number of words on that sheet. Say the number of words is 28838380044: you write it on the back of the paper in big bold letters and give it to the Master

Page 45

Map Reduce: A really simple introduction

You essentially did MapReduce. The greatest advantages of your approach were these:

• The mappers can work independently
• The reducers can work independently
• The grouper can work really fast because he didn’t have to do any counting of words; all he had to do was look at the first number and put the word on the appropriate sheet

The process can easily be applied to other kinds of problems
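The CEO's job translates almost line by line into code. The following is a small single-machine sketch of the story, not a distributed implementation: the method names `mapper`, `grouper`, `reducer` and `master` simply mirror the roles above, blog posts are passed in as plain strings instead of URLs, and splitting words on whitespace is a simplifying assumption.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class WordLengthCount {

    /** Mapper: for every word in a blog post, write one "length,word" line. */
    static List<String> mapper(String blogText) {
        List<String> sheet = new ArrayList<>();
        for (String word : blogText.trim().split("\\s+")) {
            // The CEO only asked about words of 1 to 10 characters.
            if (!word.isEmpty() && word.length() <= 10) {
                sheet.add(word.length() + "," + word);
            }
        }
        return sheet;
    }

    /** Grouper: copy each word onto the sheet named by the number before the comma. */
    static Map<Integer, List<String>> grouper(List<List<String>> mapperSheets) {
        Map<Integer, List<String>> sheets = new TreeMap<>();
        for (List<String> sheet : mapperSheets) {
            for (String line : sheet) {
                int comma = line.indexOf(',');
                int length = Integer.parseInt(line.substring(0, comma));
                sheets.computeIfAbsent(length, k -> new ArrayList<>())
                        .add(line.substring(comma + 1));
            }
        }
        return sheets;
    }

    /** Reducer: count the words written on one sheet. */
    static long reducer(List<String> sheet) {
        return sheet.size();
    }

    /** Master: hand out the work and collect one count per word length. */
    static Map<Integer, Long> master(List<String> blogs) {
        List<List<String>> mapperSheets = new ArrayList<>();
        for (String blog : blogs) {
            mapperSheets.add(mapper(blog));
        }
        Map<Integer, Long> counts = new TreeMap<>();
        for (Map.Entry<Integer, List<String>> sheet : grouper(mapperSheets).entrySet()) {
            counts.put(sheet.getKey(), reducer(sheet.getValue()));
        }
        return counts;
    }

    public static void main(String[] args) {
        // Two tiny "blogs": word lengths are 1, 5, 2 and 1, 2, 2.
        System.out.println(master(Arrays.asList("a hello if", "I am if"))); // prints {1=2, 2=3, 5=1}
    }
}
```

As in the story, the grouper never counts anything: it only reads the number in front of the comma, which is why that step stays cheap.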

Page 46

Map Reduce: formal definition

The Map and Reduce functions of MapReduce are both defined with respect to data structured in (key, value) pairs. Map takes one pair of data with a type in one data domain, and returns a list of pairs in a different domain:

• Map(k1, v1) → list(k2, v2)

Page 47

Map Reduce: formal definition

The Map function is applied in parallel to every pair in the input dataset

This produces a list of pairs for each call

After that, the MapReduce framework collects all pairs with the same key from all lists and groups them together, creating one group for each key

Page 48

Map Reduce: formal definition

The Reduce function is then applied in parallel to each group, which in turn produces a collection of values in the same domain:

• Reduce(k2, list(v2)) → list(v3)

Each Reduce call typically produces either one value v3 or an empty return, though one call is allowed to return more than one value. The returns of all calls are collected as the desired result list
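The two signatures map naturally onto generic Java types. The sketch below is one possible way to write them down, not Hadoop's actual API: the interfaces `MapFn` and `ReduceFn`, the sequential `run` driver and the maximum-temperature-per-year example (with made-up input lines of the form "year temperature") are all hypothetical and chosen only to show a Map call that may return an empty list and a Reduce call that returns exactly one value.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class FormalMapReduce {

    /** Map(k1, v1) -> list(k2, v2) */
    interface MapFn<K1, V1, K2, V2> {
        List<Map.Entry<K2, V2>> map(K1 key, V1 value);
    }

    /** Reduce(k2, list(v2)) -> list(v3) */
    interface ReduceFn<K2, V2, V3> {
        List<V3> reduce(K2 key, List<V2> values);
    }

    /** Applies map to every input pair, groups the output by k2, then applies reduce to each group. */
    static <K1, V1, K2 extends Comparable<K2>, V2, V3> List<V3> run(
            List<Map.Entry<K1, V1>> input,
            MapFn<K1, V1, K2, V2> mapFn,
            ReduceFn<K2, V2, V3> reduceFn) {
        Map<K2, List<V2>> groups = new TreeMap<>();
        for (Map.Entry<K1, V1> in : input) {
            for (Map.Entry<K2, V2> out : mapFn.map(in.getKey(), in.getValue())) {
                groups.computeIfAbsent(out.getKey(), k -> new ArrayList<>()).add(out.getValue());
            }
        }
        List<V3> result = new ArrayList<>();
        for (Map.Entry<K2, List<V2>> group : groups.entrySet()) {
            result.addAll(reduceFn.reduce(group.getKey(), group.getValue()));
        }
        return result;
    }

    /** Example instance: (lineNo, "year temp") pairs in, one "year=max" string per year out. */
    static List<String> maxTemperaturePerYear(List<String> lines) {
        List<Map.Entry<Integer, String>> input = new ArrayList<>();
        for (int i = 0; i < lines.size(); i++) {
            input.add(new SimpleEntry<>(i, lines.get(i)));
        }
        MapFn<Integer, String, String, Integer> mapFn = (lineNo, line) -> {
            List<Map.Entry<String, Integer>> out = new ArrayList<>();
            String[] parts = line.trim().split("\\s+");
            if (parts.length == 2) { // lines in any other shape emit nothing
                out.add(new SimpleEntry<>(parts[0], Integer.parseInt(parts[1])));
            }
            return out;
        };
        ReduceFn<String, Integer, String> reduceFn =
                (year, temps) -> Collections.singletonList(year + "=" + Collections.max(temps));
        return run(input, mapFn, reduceFn);
    }

    public static void main(String[] args) {
        System.out.println(maxTemperaturePerYear(
                Arrays.asList("1950 0", "1950 22", "1949 111", "bad", "1949 78")));
    }
}
```

Here k1 is a line number, v1 a line of text, k2 a year, v2 a temperature and v3 a formatted string, so all three value domains differ, as the definition allows.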

Page 49

MapReduce job example

    package org.myorg;

    import java.io.IOException;
    import java.util.Iterator;
    import java.util.StringTokenizer;

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.*;

    public class WordCount {

        // Mapper: emits (word, 1) for every token of an input line
        public static class Map extends MapReduceBase implements Mapper<LongWritable, Text, Text, IntWritable> {
            private final static IntWritable one = new IntWritable(1);
            private Text word = new Text();

            public void map(LongWritable key, Text value, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
                String line = value.toString();
                StringTokenizer tokenizer = new StringTokenizer(line);
                while (tokenizer.hasMoreTokens()) {
                    word.set(tokenizer.nextToken());
                    output.collect(word, one);
                }
            }
        }

Page 50

MapReduce job example

        // Reducer: sums the counts collected for one word
        public static class Reduce extends MapReduceBase implements Reducer<Text, IntWritable, Text, IntWritable> {
            public void reduce(Text key, Iterator<IntWritable> values, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
                int sum = 0;
                while (values.hasNext()) {
                    sum += values.next().get();
                }
                output.collect(key, new IntWritable(sum));
            }
        }

Page 51

MapReduce job example

        // Driver: args[0] is the input path, args[1] the output path
        public static void main(String[] args) throws Exception {
            JobConf conf = new JobConf(WordCount.class);
            conf.setJobName("wordcount");

            conf.setOutputKeyClass(Text.class);
            conf.setOutputValueClass(IntWritable.class);

            conf.setMapperClass(Map.class);
            conf.setCombinerClass(Reduce.class);
            conf.setReducerClass(Reduce.class);

            conf.setInputFormat(TextInputFormat.class);
            conf.setOutputFormat(TextOutputFormat.class);

            FileInputFormat.setInputPaths(conf, new Path(args[0]));
            FileOutputFormat.setOutputPath(conf, new Path(args[1]));

            JobClient.runJob(conf);
        }
    }

Page 52

Machine Learning

Machine learning is a scientific discipline that deals with the construction and study of algorithms that can learn from data. Such algorithms operate by building a model based on inputs and using that to make predictions or decisions, rather than following only explicitly programmed instructions
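A very small example of "building a model based on inputs and using that to make predictions" is fitting a straight line to a handful of points. The sketch below uses ordinary least squares on one input variable; the technique, the class name `LinearModelSketch` and the made-up data are choices of this illustration and are not taken from the talk.

```java
final class LinearModelSketch {

    final double slope;
    final double intercept;

    private LinearModelSketch(double slope, double intercept) {
        this.slope = slope;
        this.intercept = intercept;
    }

    /** Learning: estimate slope and intercept from example (x, y) pairs by least squares. */
    static LinearModelSketch fit(double[] x, double[] y) {
        if (x.length != y.length || x.length < 2) {
            throw new IllegalArgumentException("need at least two (x, y) pairs of equal length");
        }
        double meanX = 0;
        double meanY = 0;
        for (int i = 0; i < x.length; i++) {
            meanX += x[i];
            meanY += y[i];
        }
        meanX /= x.length;
        meanY /= y.length;

        double covXY = 0;
        double varX = 0;
        for (int i = 0; i < x.length; i++) {
            covXY += (x[i] - meanX) * (y[i] - meanY);
            varX += (x[i] - meanX) * (x[i] - meanX);
        }
        if (varX == 0) {
            throw new IllegalArgumentException("all x values are identical");
        }
        double slope = covXY / varX;
        return new LinearModelSketch(slope, meanY - slope * meanX);
    }

    /** Prediction: apply the learned model to an input it has not seen. */
    double predict(double x) {
        return slope * x + intercept;
    }

    public static void main(String[] args) {
        double[] x = {1, 2, 3, 4};
        double[] y = {3, 5, 7, 9}; // made-up data that follows y = 2x + 1
        LinearModelSketch model = fit(x, y);
        System.out.println(model.predict(10));
    }
}
```

Nothing in `predict` was written by hand for this data set: the two numbers it uses come out of `fit`, which is the sense in which the program "learns" instead of following only explicitly programmed rules.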

Page 53

Machine Learning

Machine learning can be considered a subfield of computer science and statistics. It has strong ties to artificial intelligence and optimization, which deliver methods, theory and application domains to the field

Page 54

Machine Learning

Example applications include spam filtering, optical character recognition (OCR), search engines and computer vision. Machine learning is sometimes conflated with data mining

Page 55

Machine Learning Examples

Page 56

Machine Learning Examples

Page 57

Machine Learning Tools

Apache Mahout is a project of the Apache Software Foundation to produce free implementations of distributed or otherwise scalable machine learning algorithms focused primarily in the areas of collaborative filtering, clustering and classification

Page 58

Machine Learning Tools

Page 59

Data Visualization

Studies show the brain processes images 60,000x faster than text. Visualization, the final step in a big data analytics workflow, is a visual representation of the insights gained from your analysis

Page 60

Data Visualization Tools

Page 61

Data Visualization Tools