Big Data Processing, 2014/15
Lecture 1: Introduction
Claudia Hauff (Web Information Systems)
[email protected]



Page 2: Big Data Processing, 2014/15

Course organisation

• 2 lectures a week (Tuesday, Thursday) for 7 weeks
• Content:
  • Data streaming algorithms
  • MapReduce (the bulk of the course)
  • Approaches to iterative big-data problems
• One “free” lecture in January (suggestions?), e.g. big data visualisations, an industry talk, etc.
• Course style: interactive when possible

Page 3: Big Data Processing, 2014/15

Course content

• Introduction
• Data streams 1 & 2
• The MapReduce paradigm
• Looking behind the scenes of MapReduce: HDFS & scheduling
• Algorithm design for MapReduce
• A high-level language for MapReduce: Pig 1 & 2
• MapReduce is not a database, but HBase nearly is
• Let's iterate a bit: graph algorithms & Giraph
• How does all of this work together? ZooKeeper/YARN

Small changes are possible.

Page 4: Big Data Processing, 2014/15

Assignments

• A mix of pen-and-paper and programming
• Individual work
• 5 lab sessions on Hadoop & Co. between Nov. 24 and Jan. 5, on Mondays 8.45am-12.45pm; lab attendance is not compulsory
• Two lab assistants will help you with your lab work (Kilian Grashoff and Robert Carosi)
• Assignments are handed out on Thursdays and due the following Thursday (6 in total)

Page 5: Big Data Processing, 2014/15

Grading & exams

• Course grade: 25% assignments, 75% final exam
• Weekly quizzes: up to 1 additional grade point
  • +0.5 if you answer >50% correctly
  • +1 if you answer >75% correctly
• You can pass without handing in the assignments (not recommended!)
• Final exam: January 28, 2-5pm (MC & open questions)
• Resit: April 15, 2-5pm (MC & open questions)

Assignments and quizzes?? Yes!

Page 6: Big Data Processing, 2014/15

Questions?

• Questions are always welcome
• Contact preferably via mail: [email protected]
• Questions and answers may be posted on Blackboard if they are relevant for others as well
• If the pace is too slow/fast, please let me know

Page 7: Big Data Processing, 2014/15

Reading material (lectures)

• Recommended material consists of either book chapters or papers, usually available through the TU Delft campus network
• The course covers a recent topic; no single book covers all angles
• MapReduce: Data-Intensive Text Processing with MapReduce by Jimmy Lin and Chris Dyer, Morgan & Claypool Publishers, 2010. http://lintool.github.io/MapReduceAlgorithms/

Page 8: Big Data Processing, 2014/15

Reading material (assignments)

• A book on Java if you are not yet comfortable with it
• Hadoop: Hadoop: The Definitive Guide by Tom White, O’Reilly Media, 2012 (3rd edition)

Page 9: Big Data Processing, 2014/15

Big Data

Page 10: Big Data Processing, 2014/15

Course objectives

• Explain the ideas behind the “big data buzz”
• Understand and describe the three different paradigms covered in class
• Code productively in one of the most important big data software frameworks we have to date: Hadoop (and tools building on it)
• Transform big data problems into sensible algorithmic solutions

Page 11: Big Data Processing, 2014/15

Today’s learning objectives

• Explain and recognise the V’s of big data in use-case scenarios
• Explain the main differences between data streaming and MapReduce algorithms
• Identify the correct approach (streaming vs. MapReduce) to be taken in an application setting

Page 12: Big Data Processing, 2014/15

What is “big data”?

• A buzzword with fuzzy boundaries
• Requires novel infrastructure to support storage and processing

“Massive amounts of diverse, unstructured data produced by high-performance applications.”

“Data too large & complex to be effectively handled by standard database technologies currently found in most organisations.”

Page 13: Big Data Processing, 2014/15

Large-scale computing is not new

• Weather forecasting has been a long-term scientific challenge
• Supercomputers were already used in the 1970s
• Equation crunching

ECMWF Source


Page 15: Big Data Processing, 2014/15

Big data processing

• So-called big data technologies are about discovering patterns (in semi-structured and unstructured data)
• The main focus is on how to make computations on big data feasible without a supercomputer; we use cluster(s) of commodity hardware
• Next quarter: the Data Mining course (which will probably use Hadoop) focuses on how to discover those patterns

Page 16: Big Data Processing, 2014/15

Just an academic exercise?

• Cloud computing: “Anything running inside a browser that gathers and stores user-generated content”
• Utility computing
  • Computing as a metered service
  • A “cloud user” buys any amount of computing power from a “cloud provider” (pay-per-use)
  • Virtual machine instances
  • IaaS: infrastructure as a service
  • Amazon Web Services is the dominant provider

Page 17: Big Data Processing, 2014/15

Just an academic exercise? No!

AWS EC2 Pricing

You can run your own big data experiments!

Page 18: Big Data Processing, 2014/15

Progress often driven by industry

• Development of big data standards & (open-source) software is commonly driven by companies such as Google, Facebook, Twitter, Yahoo! …
• Why do they care about big data?
• More knowledge leads to
  • better customer engagement
  • fraud prevention
  • new products

More data → more knowledge → more money

Page 19: Big Data Processing, 2014/15

Big data analytics: IBM pitch


https://www.youtube.com/watch?v=1RYKgj-QK4I

Page 20: Big Data Processing, 2014/15

A concrete example: big data vs. small data

Scaling to very very large corpora for natural language disambiguation. M. Banko and E. Brill, 2001.

Task: confusion set disambiguation, e.g. then vs. than, to vs. two vs. too

[Figure: quality (poor to perfect) plotted against the amount of training data (little to a lot). With little training data simple algorithms fail and a complex algorithm is needed, but with a lot of data …]

Page 21: Big Data Processing, 2014/15

The 3 V’s

• Volume: large amounts of data

• Variety: data comes in many different forms from diverse sources

• Velocity: the content is changing quickly


Page 22: Big Data Processing, 2014/15

The 5 V’s

• Volume: large amounts of data

• Variety: data comes in many different forms from diverse sources

• Velocity: the content is changing quickly

• Value: data alone is not enough; how can value be derived from it?

• Veracity: can we trust the data? how accurate is it?


Page 23: Big Data Processing, 2014/15

The 7 V’s

• Volume: large amounts of data

• Variety: data comes in many different forms from diverse sources

• Velocity: the content is changing quickly

• Value: data alone is not enough; how can value be derived from it?

• Veracity: can we trust the data? how accurate is it?

• Validity: ensure that the interpreted data is sound

• Visibility: data from diverse sources need to be stitched together

The 3 V’s and the 5 V’s are the most commonly used.

Page 24: Big Data Processing, 2014/15

The 5 V’s, revisited

Question: how do these attributes apply to the use case of Flickr?

Page 25: Big Data Processing, 2014/15

Instantiations


Page 26: Big Data Processing, 2014/15

Terminology

• Batch processing: running a series of computer programs without human intervention

• Near real-time: a brief delay between the data becoming available and it being processed

• Real-time: guaranteed response times between the data becoming available and it being processed

Page 27: Big Data Processing, 2014/15

Terminology contd.

• Structured data (well-defined fields): the standard in the past

• Semi-structured data

• Unstructured data (by humans, for humans): most common today

Page 28: Big Data Processing, 2014/15

Unstructured text

• To get value out of unstructured text we need to impose structure automatically
  • Parse the text
  • Extract meaning from it (can be easy or difficult)
• The amount of data we create is more than doubling every two years; most new data is unstructured or at most semi-structured
• Text is not everything: images, video, audio, etc.

Page 29: Big Data Processing, 2014/15

Extracting meaning

[Figure: examples of extracting meaning, ranging from easy to difficult]

Page 30: Big Data Processing, 2014/15

IBM Watson: How it works


https://www.youtube.com/watch?v=_Xcmh1LQB9I

8 minutes worth your time, even if not in class. Contains a nice piece about the use of unstructured text.

Page 31: Big Data Processing, 2014/15

Examples of Volume & Velocity: Twitter

• >500 million tweets a day
  • On average >5,700 tweets a second
  • Peaks of >100,000 tweets a second, e.g. during the Super Bowl, US elections, New Year’s Eve, the Football World Cup (672M tweets in total for #WorldCup)
• Messages are instantly accessible for search
• Messages are used in post-hoc analyses to gather insights

Page 32: Big Data Processing, 2014/15


http://bl.ocks.org/anonymous/raw/0c64880b3a791dffb6e4/

#WorldCup

Page 33: Big Data Processing, 2014/15

Examples of Volume: the Large Hadron Collider @ CERN

• The world’s largest particle accelerator, buried in a tunnel with a 27km circumference
• Enormous numbers of sensors register the passage of particles
• 40 million events/second (1MB of raw data per event)
• Generates 15 Petabytes a year (15-25 million GB)

Image src

Page 34: Big Data Processing, 2014/15

Examples of Velocity: targeted advertising on the Web

• US revenues in 2013: ~$40 billion
• Advertisers usually pay per click
• For each search request, search engines decide
  • whether to show an ad
  • which ad to show
• Users are at best willing to wait 2 seconds for their search results
• Feedback loop via user clicks, user searches, mouse movements, etc.


Page 36: Big Data Processing, 2014/15

More interactions/data on the Web

• YouTube: 4 billion views a day; one hour of video uploaded every second
• Facebook: 483 million daily active users (Dec. 2011); 300 Petabytes of data
• Google: >1 billion searches per day (March 2011)
• Google processed 100 Terabytes of data per day in 2004 and 20 Petabytes/day in 2008
• Internet Archive: contains 2 Petabytes of data, grows 20 Terabytes per month (2011)

Page 37: Big Data Processing, 2014/15

Use case: movie recommendations

[Figure: a user-movie rating matrix, mostly unknown (“?”), for Bob, Alice, Tom, Jane, Frank and Joe. Bob likes X-MEN. Jane dislikes X-MEN. Should we suggest X-MEN to Frank?]

Question: how can we come up with a suggestion for Frank? Think about data-intensive and non-intensive approaches.

Page 38: Big Data Processing, 2014/15

Use case contd.

• Ignore the data, use experts instead (movie reviewers); assumes no large subscriber/reviewer divergence
• Use all the data but ignore individual preferences; assumes that most users are close to the average
• Lump people into preference groups based on shared likes/dislikes; compute a group-based average score per movie
• Focus computational effort on difficult movies (some are universally liked/disliked)

A whole research field is concerned with this question: recommender systems.
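The group-based approach above can be sketched in a few lines of Java. This is only an illustration, not the slides' (or Netflix's) algorithm: ratings are assumed to be +1 (like) / -1 (dislike), a user's "group" is everyone who agrees with them on all commonly rated movies, and the candidate movie's score is that group's average rating.

```java
import java.util.*;

// Minimal sketch of group-based recommendation (assumed rating scheme:
// +1 = like, -1 = dislike). A positive score suggests recommending.
public class GroupRecommender {
    // ratings: user -> (movie -> rating)
    static double groupScore(Map<String, Map<String, Integer>> ratings,
                             String user, String movie) {
        Map<String, Integer> mine = ratings.get(user);
        int sum = 0, count = 0;
        for (Map.Entry<String, Map<String, Integer>> e : ratings.entrySet()) {
            if (e.getKey().equals(user)) continue;
            Map<String, Integer> theirs = e.getValue();
            Integer theirRating = theirs.get(movie);
            if (theirRating == null) continue; // they haven't rated this movie
            // is this user in "my" group, i.e. do we agree on shared movies?
            boolean agrees = true;
            for (Map.Entry<String, Integer> r : mine.entrySet()) {
                Integer t = theirs.get(r.getKey());
                if (t != null && !t.equals(r.getValue())) { agrees = false; break; }
            }
            if (agrees) { sum += theirRating; count++; }
        }
        return count == 0 ? 0.0 : (double) sum / count;
    }
}
```

For the slide's example, if Frank agrees with Bob (who likes X-MEN) and disagrees with Jane (who dislikes it), Frank's group score for X-MEN is positive, so we would suggest it.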

Page 39: Big Data Processing, 2014/15

Use case contd.

• Netflix Prize: an open competition for the best collaborative filtering algorithm to predict user ratings for films, based on previous ratings (>100 million ratings by 0.5 million users for ~17,000 movies)
• The first competitor to improve over Netflix’s baseline by 10% receives $1,000,000
• The competition started in 2006; the prize money was paid out in 2009 (the winner was 20 minutes quicker than the runner-up, with the same performance!)

Many research teams competed: innovation driven by industry again!

Page 40: Big Data Processing, 2014/15

Example of Variety: restaurant locator

• Task: given a person’s location, list the top five restaurants in the neighbourhood
• Question: what kind of data do you need for this task?
• Required data:
  • World map
  • A list of all restaurants in the world (opening hours, GPS coordinates, menu, special offers)
  • Reviews/ratings
  • Optional: social media stream(s)
• The data is continuously changing (restaurants close, new ones open, data formats change, etc.)

Page 41: Big Data Processing, 2014/15

Society can benefit too, not just companies

• Accurate predictions of natural disasters and diseases
• Better responses in disaster recovery
  • Timely & effective decisions
  • Provide resources where they are needed the most
• Complete disease/genomics databases to enable biomedical discoveries
• Accurate models to support forecasting of ecosystem developments

Page 42: Big Data Processing, 2014/15

Idea: earthquake warnings

• Social sensors: users (humans) of Twitter, Facebook, Instagram, i.e. portals with real-time posting abilities
• Challenges: how to detect when a tweet is about an actual earthquake, which earthquake it is about, and where its centre is

[Diagram: an earthquake occurs → people in the area tweet about it → warn people further away]

Question: do you think this is possible? If so, what are the challenges?


Page 44: Big Data Processing, 2014/15

Idea: earthquake warnings contd.

• Goal: the warning should reach people earlier than the seismic waves do
• Travel times of seismic waves: 3-7 km/s; arrival time of a wave 100km away: ~20 seconds
• Performance of an existing system (Sakaki et al., 2010):

[Figure: Twitter-based earthquake warnings compared with the traditional warning system]

It is possible!

Page 45: Big Data Processing, 2014/15

A brief introduction to Streaming & MapReduce

Page 46: Big Data Processing, 2014/15

A brief introduction to Streaming

Page 47: Big Data Processing, 2014/15

Data streaming scenario

• Continuous and rapid input of data (“stream of data”)
• Limited memory to store the data: less than linear in the input size
• Limited time to process each data item: sequential access
• Algorithms have one (or very few) passes over the data
• Can be approached from a practical or mathematical point of view: metric embeddings, pseudo-random computations, sparse approximation theory …

We go for the practical setup!

Page 48: Big Data Processing, 2014/15

Data streaming example

Stream: 3 6 5 4 1 8 2 … What is the missing number?

Setting: a stream of n numbers, a permutation of 1 to n with one number missing; we are allowed one pass over the data.

Solution 1: memorise all numbers seen so far; memory requirement: n bits (impractical for large n)

Page 49: Big Data Processing, 2014/15

Data streaming example contd.

Solution 2 (closed form): the sum of all numbers from 1 to n is n(n+1)/2; subtract each number as it is seen. What remains at the end of the stream is the missing number. Memory: ~2 log n bits.
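Solution 2 fits in a few lines of Java. This is a sketch under one assumption not in the slides: the stream is modelled as an `Iterator<Integer>`.

```java
import java.util.Iterator;

// One-pass missing-number detection: keep only the expected sum of
// 1..n and subtract every number as it streams by (O(log n) bits).
public class MissingNumber {
    static long findMissing(Iterator<Integer> stream, long n) {
        long remainder = n * (n + 1) / 2; // closed form: sum of 1..n
        while (stream.hasNext()) {
            remainder -= stream.next();   // single sequential pass
        }
        return remainder;                  // the one number never subtracted
    }
}
```

For the slide's stream, `findMissing(List.of(3, 6, 5, 4, 1, 8, 2).iterator(), 8)` returns 7: the expected sum is 36, the seen numbers add up to 29.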

Page 50: Big Data Processing, 2014/15

Data streaming contd.

Stream: 3 4 9 4 9 8 2 … What is the average? What is the median?

We are allowed one pass over the data and can only store 3 numbers.

• Average: can be computed by keeping track of two numbers (the sum and the count of numbers seen)
• Median: sample data points - but how?
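Both ideas can be sketched together. The running average is exact with two numbers; for the median question, one standard answer (an illustration, not necessarily the slides' intended one) is reservoir sampling, which keeps a fixed-size uniform sample of the stream from which a median can be approximated.

```java
import java.util.Random;

// One-pass stream statistics: exact average from (sum, count), plus a
// fixed-size uniform sample of the stream (reservoir sampling) from
// which an approximate median could be read off.
public class StreamStats {
    long sum = 0, count = 0;
    final int[] reservoir;
    final Random rng = new Random();

    StreamStats(int sampleSize) { reservoir = new int[sampleSize]; }

    void observe(int x) {
        sum += x; count++;                        // average needs two numbers
        if (count <= reservoir.length) {
            reservoir[(int) count - 1] = x;       // fill the reservoir first
        } else {
            long j = (long) (rng.nextDouble() * count);   // uniform in 0..count-1
            if (j < reservoir.length) reservoir[(int) j] = x; // keep with prob. k/count
        }
    }

    double average() { return (double) sum / count; }
}
```

Each item survives in the reservoir with probability k/count, so the sample stays uniform over everything seen so far; for the slide's 3-number budget one would use `new StreamStats(3)`.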

Page 51: Big Data Processing, 2014/15

Data streaming contd.

• Typically, simple functions of the stream are computed and used as input to other algorithms
  • Median
  • Number of distinct elements
  • Longest increasing sequence
  • …
• Closed-form solutions are rare
• Common approaches are approximations of the true value: sampling, hashing

Page 52: Big Data Processing, 2014/15

A brief introduction to MapReduce

Page 53: Big Data Processing, 2014/15

MapReduce is an industry standard

QCon 2013 (San Francisco): “QCon empowers software development by facilitating the spread of knowledge and innovation in the developer community. A practitioner-driven conference …”

Hadoop is the open-source implementation of the MapReduce framework.

Page 54: Big Data Processing, 2014/15

Industry is moving fast

QCon 2014 (San Francisco)

Page 55: Big Data Processing, 2014/15

MapReduce

• Designed for batch processing over large data sets

• No limits on the number of passes, memory, or time

• A programming model for distributed computations, inspired by the functional programming paradigm

Page 56: Big Data Processing, 2014/15

MapReduce example

WordCount: given an input text, determine the frequency of each word. The “Hello World” of the MapReduce realm.

Input text: The dog walks around the house. The dog is in the house.

[Diagram: the input text is split into parts (text 1, text 2, text 3), each processed by a Mapper that emits pairs such as (dog,1), (dog,1), (walks,1), (house,1), (house,1), (in,1), (the,1), (house,1); after sorting, a Reducer adds the counts per word: dog: 2, walks: 1, house: 2, …]

Page 57: Big Data Processing, 2014/15

MapReduce example

[The same diagram as on the previous slide.]

We implement the Mapper and the Reducer. Hadoop (and other tools) are responsible for the “rest”.
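The map → sort → reduce flow can be simulated in plain Java. This is a sketch of the data flow only, not the Hadoop API (a real job would subclass Hadoop's `Mapper` and `Reducer` classes instead):

```java
import java.util.*;

// Plain-Java simulation of the WordCount data flow: map emits
// (word, 1) pairs, the "sort" step groups values by key, and
// reduce sums the counts per word.
public class WordCount {
    static Map<String, Integer> wordCount(List<String> parts) {
        // map phase: each input part yields (word, 1) pairs
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String part : parts) {
            for (String word : part.toLowerCase().split("\\W+")) {
                if (!word.isEmpty()) pairs.add(Map.entry(word, 1));
            }
        }
        // shuffle/sort phase: group the emitted values by key
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> p : pairs) {
            grouped.computeIfAbsent(p.getKey(), k -> new ArrayList<>()).add(p.getValue());
        }
        // reduce phase: add the counts per word
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> g : grouped.entrySet()) {
            counts.put(g.getKey(), g.getValue().stream().mapToInt(Integer::intValue).sum());
        }
        return counts;
    }
}
```

Running it on the slide's input text yields dog: 2, walks: 1, house: 2 (and the: 4, etc.). In Hadoop, only the map and reduce phases are ours; the grouping in the middle is what the framework does for us.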

Page 58: Big Data Processing, 2014/15

Summary

• What are the characteristics of “big data”?

• Example use cases of big data

• Hopefully a convincing argument why you should care

• A brief introduction of data streams and MapReduce


Page 59: Big Data Processing, 2014/15

Reading material

Required reading: none.

Recommended reading: Principles of Big Data: Preparing, Sharing, and Analyzing Complex Information by Jules Berman. Chapters 1, 14 and 15.

Page 60: Big Data Processing, 2014/15

THE END