Big Data Processing, 2014/15
Lecture 1: Introduction
Claudia Hauff (Web Information Systems)
[email protected]



Page 2: Big Data Processing, 2014/15

Course organisation

• 2 lectures a week (Tuesday, Thursday) for 7 weeks
• Content:
  • Data streaming algorithms
  • MapReduce (the bulk of the course)
  • Approaches to iterative big-data problems
• One “free” lecture in January (suggestions?), e.g. big data visualisations, an industry talk, etc.
• Course style: interactive when possible

Page 3: Big Data Processing, 2014/15

Course content

• Introduction
• Data streams 1 & 2
• The MapReduce paradigm
• Looking behind the scenes of MapReduce: HDFS & scheduling
• Algorithm design for MapReduce
• A high-level language for MapReduce: Pig 1 & 2
• MapReduce is not a database, but HBase nearly is
• Let's iterate a bit: graph algorithms & Giraph
• How does all of this work together? ZooKeeper/YARN

Small changes are possible.

Page 4: Big Data Processing, 2014/15

Assignments

• A mix of pen-and-paper and programming
• Individual work
• 5 lab sessions on Hadoop & Co. between Nov. 24 and Jan. 5, on Mondays 8.45am-12.45pm; lab attendance is not compulsory
• Two lab assistants will help you with your lab work (Kilian Grashoff and Robert Carosi)
• Assignments are handed out on Thursdays and due the following Thursday (6 in total)

Page 5: Big Data Processing, 2014/15

Grading & exams

• Course grade: 25% assignments, 75% final exam
• Weekly quizzes: up to 1 additional grade point
  • +0.5 if you answer >50% correctly
  • +1 if you answer >75% correctly
• You can pass without handing in the assignments (not recommended!)
• Final exam: January 28, 2-5pm (MC & open questions)
• Resit: April 15, 2-5pm (MC & open questions)

Assignments and quizzes?? Yes!

Page 6: Big Data Processing, 2014/15

Questions?

• Questions are always welcome
• Contact preferably via mail: [email protected]
• Questions and answers may be posted on Blackboard if they are relevant for others as well
• If the pace is too slow/fast, please let me know

Page 7: Big Data Processing, 2014/15

Reading material (lectures)

• Recommended material consists of either book chapters or papers, usually available through the TU Delft campus network
• The course covers a recent topic; no single book covers all angles
• MapReduce: Data-Intensive Text Processing with MapReduce by Jimmy Lin and Chris Dyer, Morgan & Claypool Publishers, 2010. http://lintool.github.io/MapReduceAlgorithms/

Page 8: Big Data Processing, 2014/15

Reading material (assignments)

• A book on Java if you are not yet comfortable with it
• Hadoop: Hadoop: The Definitive Guide by Tom White, O’Reilly Media, 2012 (3rd edition)

Page 9: Big Data Processing, 2014/15

Big Data

Page 10: Big Data Processing, 2014/15

Course objectives

• Explain the ideas behind the “big data buzz”
• Understand and describe the three different paradigms covered in class
• Code productively in one of the most important big data software frameworks we have to date: Hadoop (and tools building on it)
• Transform big data problems into sensible algorithmic solutions

Page 11: Big Data Processing, 2014/15

Today’s learning objectives

• Explain and recognise the V’s of big data in use-case scenarios
• Explain the main differences between data streaming and MapReduce algorithms
• Identify the correct approach (streaming vs. MapReduce) to be taken in an application setting

Page 12: Big Data Processing, 2014/15

What is “big data”?

• A buzzword with fuzzy boundaries
• Requires novel infrastructure to support storage and processing

“Massive amounts of diverse, unstructured data produced by high-performance applications.”

“Data too large & complex to be effectively handled by standard database technologies currently found in most organisations.”

Page 13: Big Data Processing, 2014/15

Large-scale computing is not new

• Weather forecasting has been a long-term scientific challenge
• Supercomputers were already used in the 1970s
• Equation crunching

ECMWF Source


Page 15: Big Data Processing, 2014/15

Big data processing

• So-called big data technologies are about discovering patterns (in semi-structured and unstructured data)
• The main focus is on how to make computations on big data feasible without a supercomputer; we use cluster(s) of commodity hardware
• Next quarter: the Data Mining course (which will probably use Hadoop) focuses on how to discover those patterns

Page 16: Big Data Processing, 2014/15

Just an academic exercise?

• Cloud computing: “Anything running inside a browser that gathers and stores user-generated content”
• Utility computing
  • Computing as a metered service
  • A “cloud user” buys any amount of computing power from a “cloud provider” (pay-per-use)
  • Virtual machine instances
  • IaaS: infrastructure as a service
  • Amazon Web Services is the dominant provider

Page 17: Big Data Processing, 2014/15

Just an academic exercise? No!

AWS EC2 Pricing

You can run your own big data experiments!

Page 18: Big Data Processing, 2014/15

Progress often driven by industry

• Development of big data standards & (open-source) software is commonly driven by companies such as Google, Facebook, Twitter, Yahoo! …
• Why do they care about big data?
• More knowledge leads to
  • better customer engagement
  • fraud prevention
  • new products

More data → more knowledge → more money

Page 19: Big Data Processing, 2014/15

Big data analytics: IBM pitch


https://www.youtube.com/watch?v=1RYKgj-QK4I

Page 20: Big Data Processing, 2014/15

A concrete example: big data vs. small data

Scaling to very very large corpora for natural language disambiguation. M. Banko and E. Brill, 2001.

Task: confusion set disambiguation, e.g. then vs. than, to vs. two vs. too

[Figure: quality (poor to perfect) plotted against the amount of training data (little to a lot). With little training data simple algorithms fail and a complex algorithm is needed, but with a lot of data …]

Page 21: Big Data Processing, 2014/15

The 3 V’s

• Volume: large amounts of data

• Variety: data comes in many different forms from diverse sources

• Velocity: the content is changing quickly


Page 22: Big Data Processing, 2014/15

The 5 V’s

• Volume: large amounts of data

• Variety: data comes in many different forms from diverse sources

• Velocity: the content is changing quickly

• Value: data alone is not enough; how can value be derived from it?

• Veracity: can we trust the data? how accurate is it?


Page 23: Big Data Processing, 2014/15

The 7 V’s

• Volume: large amounts of data

• Variety: data comes in many different forms from diverse sources

• Velocity: the content is changing quickly

• Value: data alone is not enough; how can value be derived from it?

• Veracity: can we trust the data? how accurate is it?

• Validity: ensure that the interpreted data is sound

• Visibility: data from diverse sources need to be stitched together

The 3 V’s and the 5 V’s are the most commonly used.

Page 24: Big Data Processing, 2014/15

The 5 V’s, revisited

Question: how do these attributes apply to the use case of Flickr?

Page 25: Big Data Processing, 2014/15

Instantiations


Page 26: Big Data Processing, 2014/15

Terminology

• Batch processing: running a series of computer programs without human intervention

• Near real-time: a brief delay between the data becoming available and it being processed

• Real-time: guaranteed response times between the data becoming available and it being processed

Page 27: Big Data Processing, 2014/15

Terminology contd.

• Structured data (well-defined fields): the standard in the past

• Semi-structured data

• Unstructured data (by humans, for humans): most common today

Page 28: Big Data Processing, 2014/15

Unstructured text

• To get value out of unstructured text we need to impose structure automatically
  • Parse the text
  • Extract meaning from it (can be easy or difficult)
• The amount of data we create is more than doubling every two years; most new data is unstructured or at most semi-structured
• Text is not everything: images, video, audio, etc.

Page 29: Big Data Processing, 2014/15

Extracting meaning

[Figure: examples of extracting meaning, ranging from easy to difficult]

Page 30: Big Data Processing, 2014/15

IBM Watson: How it works


https://www.youtube.com/watch?v=_Xcmh1LQB9I

8 minutes worth your time, even if not in class. Contains a nice piece about the use of unstructured text.

Page 31: Big Data Processing, 2014/15

Examples of Volume & Velocity: Twitter

• >500 million tweets a day
  • On average >5,700 tweets a second
  • Peaks of >100,000 tweets a second, e.g. during the Super Bowl, US elections, New Year’s Eve, the Football World Cup (672M tweets in total for #WorldCup)
• Messages are instantly accessible for search
• Messages are used in post-hoc analyses to gather insights

Page 32: Big Data Processing, 2014/15


http://bl.ocks.org/anonymous/raw/0c64880b3a791dffb6e4/

#WorldCup

Page 33: Big Data Processing, 2014/15

Examples of Volume: the Large Hadron Collider @ CERN

• The world’s largest particle accelerator, buried in a tunnel with a 27km circumference
• Enormous numbers of sensors register the passage of particles
• 40 million events/second (1MB of raw data per event)
• Generates 15 Petabytes a year (15-25 million GB)

Image src

Page 34: Big Data Processing, 2014/15

Examples of Velocity: targeted advertising on the Web

• US revenues in 2013: ~$40 billion
• Advertisers usually pay per click
• For each search request, search engines decide
  • whether to show an ad
  • which ad to show
• Users are at best willing to wait 2 seconds for their search results
• Feedback loop via user clicks, user searches, mouse movements, etc.


Page 36: Big Data Processing, 2014/15

More interactions/data on the Web

• YouTube: 4 billion views a day; one hour of video uploaded every second
• Facebook: 483 million daily active users (Dec. 2011); 300 Petabytes of data
• Google: >1 billion searches per day (March 2011)
• Google processed 100 Terabytes of data per day in 2004 and 20 Petabytes/day in 2008
• Internet Archive: contains 2 Petabytes of data, grows 20 Terabytes per month (2011)

Page 37: Big Data Processing, 2014/15

Use case: movie recommendations

[Figure: a user-movie rating matrix, mostly unknown (“?”), for Bob, Alice, Tom, Jane, Frank and Joe. Bob likes X-MEN. Jane dislikes X-MEN. Should we suggest X-MEN to Frank?]

Question: how can we come up with a suggestion for Frank? Think about data-intensive and non-intensive approaches.

Page 38: Big Data Processing, 2014/15

Use case contd.

• Ignore the data, use experts instead (movie reviewers); assumes no large subscriber/reviewer divergence
• Use all the data but ignore individual preferences; assumes that most users are close to the average
• Lump people into preference groups based on shared likes/dislikes; compute a group-based average score per movie
• Focus computational effort on difficult movies (some are universally liked/disliked)

A whole research field is concerned with this question: recommender systems.
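The group-based approach above can be sketched in a few lines of Java. This is only an illustration, not the slides' (or Netflix's) algorithm: ratings are assumed to be +1 (like) / -1 (dislike), a user's "group" is everyone who agrees with them on all commonly rated movies, and the candidate movie's score is that group's average rating.

```java
import java.util.*;

// Minimal sketch of group-based recommendation (assumed rating scheme:
// +1 = like, -1 = dislike). A positive score suggests recommending.
public class GroupRecommender {
    // ratings: user -> (movie -> rating)
    static double groupScore(Map<String, Map<String, Integer>> ratings,
                             String user, String movie) {
        Map<String, Integer> mine = ratings.get(user);
        int sum = 0, count = 0;
        for (Map.Entry<String, Map<String, Integer>> e : ratings.entrySet()) {
            if (e.getKey().equals(user)) continue;
            Map<String, Integer> theirs = e.getValue();
            Integer theirRating = theirs.get(movie);
            if (theirRating == null) continue; // they haven't rated this movie
            // is this user in "my" group, i.e. do we agree on shared movies?
            boolean agrees = true;
            for (Map.Entry<String, Integer> r : mine.entrySet()) {
                Integer t = theirs.get(r.getKey());
                if (t != null && !t.equals(r.getValue())) { agrees = false; break; }
            }
            if (agrees) { sum += theirRating; count++; }
        }
        return count == 0 ? 0.0 : (double) sum / count;
    }
}
```

For the slide's example, if Frank agrees with Bob (who likes X-MEN) and disagrees with Jane (who dislikes it), Frank's group score for X-MEN is positive, so we would suggest it.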

Page 39: Big Data Processing, 2014/15

Use case contd.

• Netflix Prize: an open competition for the best collaborative filtering algorithm to predict user ratings for films, based on previous ratings (>100 million ratings by 0.5 million users for ~17,000 movies)
• The first competitor to improve over Netflix’s baseline by 10% receives $1,000,000
• The competition started in 2006; the prize money was paid out in 2009 (the winner was 20 minutes quicker than the runner-up, with the same performance!)

Many research teams competed: innovation driven by industry again!

Page 40: Big Data Processing, 2014/15

Example of Variety: restaurant locator

• Task: given a person’s location, list the top five restaurants in the neighbourhood
• Question: what kind of data do you need for this task?
• Required data:
  • World map
  • A list of all restaurants in the world (opening hours, GPS coordinates, menu, special offers)
  • Reviews/ratings
  • Optional: social media stream(s)
• The data is continuously changing (restaurants close, new ones open, data formats change, etc.)

Page 41: Big Data Processing, 2014/15

Society can benefit too, not just companies

• Accurate predictions of natural disasters and diseases
• Better responses in disaster recovery
  • Timely & effective decisions
  • Provide resources where they are needed the most
• Complete disease/genomics databases to enable biomedical discoveries
• Accurate models to support forecasting of ecosystem developments

Page 42: Big Data Processing, 2014/15

Idea: earthquake warnings

• Social sensors: users (humans) of Twitter, Facebook, Instagram, i.e. portals with real-time posting abilities
• Challenges: how to detect when a tweet is about an actual earthquake, which earthquake it is about, and where its centre is

[Diagram: an earthquake occurs → people in the area tweet about it → warn people further away]

Question: do you think this is possible? If so, what are the challenges?


Page 44: Big Data Processing, 2014/15

Idea: earthquake warnings contd.

• Goal: the warning should reach people earlier than the seismic waves do
• Travel times of seismic waves: 3-7 km/s; arrival time of a wave 100km away: ~20 seconds
• Performance of an existing system (Sakaki et al., 2010):

[Figure: Twitter-based earthquake warnings compared with the traditional warning system]

It is possible!

Page 45: Big Data Processing, 2014/15

A brief introduction to Streaming & MapReduce

Page 46: Big Data Processing, 2014/15

A brief introduction to Streaming

Page 47: Big Data Processing, 2014/15

Data streaming scenario

• Continuous and rapid input of data (“stream of data”)
• Limited memory to store the data: less than linear in the input size
• Limited time to process each data item: sequential access
• Algorithms have one (or very few) passes over the data
• Can be approached from a practical or mathematical point of view: metric embeddings, pseudo-random computations, sparse approximation theory …

We go for the practical setup!

Page 48: Big Data Processing, 2014/15

Data streaming example

Stream: 3 6 5 4 1 8 2 … What is the missing number?

Setting: a stream of n numbers, a permutation of 1 to n with one number missing; we are allowed one pass over the data.

Solution 1: memorise all numbers seen so far; memory requirement: n bits (impractical for large n)

Page 49: Big Data Processing, 2014/15

Data streaming example contd.

Solution 2 (closed form): the sum of all numbers from 1 to n is n(n+1)/2; subtract each number as it is seen. What remains at the end of the stream is the missing number. Memory: ~2 log n bits.
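Solution 2 fits in a few lines of Java. This is a sketch under one assumption not in the slides: the stream is modelled as an `Iterator<Integer>`.

```java
import java.util.Iterator;

// One-pass missing-number detection: keep only the expected sum of
// 1..n and subtract every number as it streams by (O(log n) bits).
public class MissingNumber {
    static long findMissing(Iterator<Integer> stream, long n) {
        long remainder = n * (n + 1) / 2; // closed form: sum of 1..n
        while (stream.hasNext()) {
            remainder -= stream.next();   // single sequential pass
        }
        return remainder;                  // the one number never subtracted
    }
}
```

For the slide's stream, `findMissing(List.of(3, 6, 5, 4, 1, 8, 2).iterator(), 8)` returns 7: the expected sum is 36, the seen numbers add up to 29.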

Page 50: Big Data Processing, 2014/15

Data streaming contd.

Stream: 3 4 9 4 9 8 2 … What is the average? What is the median?

We are allowed one pass over the data and can only store 3 numbers.

• Average: can be computed by keeping track of two numbers (the sum and the count of numbers seen)
• Median: sample data points - but how?
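Both ideas can be sketched together. The running average is exact with two numbers; for the median question, one standard answer (an illustration, not necessarily the slides' intended one) is reservoir sampling, which keeps a fixed-size uniform sample of the stream from which a median can be approximated.

```java
import java.util.Random;

// One-pass stream statistics: exact average from (sum, count), plus a
// fixed-size uniform sample of the stream (reservoir sampling) from
// which an approximate median could be read off.
public class StreamStats {
    long sum = 0, count = 0;
    final int[] reservoir;
    final Random rng = new Random();

    StreamStats(int sampleSize) { reservoir = new int[sampleSize]; }

    void observe(int x) {
        sum += x; count++;                        // average needs two numbers
        if (count <= reservoir.length) {
            reservoir[(int) count - 1] = x;       // fill the reservoir first
        } else {
            long j = (long) (rng.nextDouble() * count);   // uniform in 0..count-1
            if (j < reservoir.length) reservoir[(int) j] = x; // keep with prob. k/count
        }
    }

    double average() { return (double) sum / count; }
}
```

Each item survives in the reservoir with probability k/count, so the sample stays uniform over everything seen so far; for the slide's 3-number budget one would use `new StreamStats(3)`.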

Page 51: Big Data Processing, 2014/15

Data streaming contd.

• Typically, simple functions of the stream are computed and used as input to other algorithms
  • Median
  • Number of distinct elements
  • Longest increasing sequence
  • …
• Closed-form solutions are rare
• Common approaches are approximations of the true value: sampling, hashing

Page 52: Big Data Processing, 2014/15

A brief introduction to MapReduce

Page 53: Big Data Processing, 2014/15

MapReduce is an industry standard

QCon 2013 (San Francisco): “QCon empowers software development by facilitating the spread of knowledge and innovation in the developer community. A practitioner-driven conference …”

Hadoop is the open-source implementation of the MapReduce framework.

Page 54: Big Data Processing, 2014/15

Industry is moving fast

QCon 2014 (San Francisco)

Page 55: Big Data Processing, 2014/15

MapReduce

• Designed for batch processing over large data sets

• No limits on the number of passes, memory, or time

• A programming model for distributed computations, inspired by the functional programming paradigm

Page 56: Big Data Processing, 2014/15

MapReduce example

WordCount: given an input text, determine the frequency of each word. The “Hello World” of the MapReduce realm.

Input text: The dog walks around the house. The dog is in the house.

[Diagram: the input text is split into parts (text 1, text 2, text 3), each processed by a Mapper that emits pairs such as (dog,1), (dog,1), (walks,1), (house,1), (house,1), (in,1), (the,1), (house,1); after sorting, a Reducer adds the counts per word: dog: 2, walks: 1, house: 2, …]

Page 57: Big Data Processing, 2014/15

MapReduce example

[The same diagram as on the previous slide.]

We implement the Mapper and the Reducer. Hadoop (and other tools) are responsible for the “rest”.
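The map → sort → reduce flow can be simulated in plain Java. This is a sketch of the data flow only, not the Hadoop API (a real job would subclass Hadoop's `Mapper` and `Reducer` classes instead):

```java
import java.util.*;

// Plain-Java simulation of the WordCount data flow: map emits
// (word, 1) pairs, the "sort" step groups values by key, and
// reduce sums the counts per word.
public class WordCount {
    static Map<String, Integer> wordCount(List<String> parts) {
        // map phase: each input part yields (word, 1) pairs
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String part : parts) {
            for (String word : part.toLowerCase().split("\\W+")) {
                if (!word.isEmpty()) pairs.add(Map.entry(word, 1));
            }
        }
        // shuffle/sort phase: group the emitted values by key
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> p : pairs) {
            grouped.computeIfAbsent(p.getKey(), k -> new ArrayList<>()).add(p.getValue());
        }
        // reduce phase: add the counts per word
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> g : grouped.entrySet()) {
            counts.put(g.getKey(), g.getValue().stream().mapToInt(Integer::intValue).sum());
        }
        return counts;
    }
}
```

Running it on the slide's input text yields dog: 2, walks: 1, house: 2 (and the: 4, etc.). In Hadoop, only the map and reduce phases are ours; the grouping in the middle is what the framework does for us.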

Page 58: Big Data Processing, 2014/15

Summary

• What are the characteristics of “big data”?

• Example use cases of big data

• Hopefully a convincing argument why you should care

• A brief introduction of data streams and MapReduce


Page 59: Big Data Processing, 2014/15

Reading material

Required reading: none.

Recommended reading: Principles of Big Data: Preparing, Sharing, and Analyzing Complex Information by Jules Berman. Chapters 1, 14 and 15.

Page 60: Big Data Processing, 2014/15

THE END