the what, why and how of big data

December 2nd, 2014 @NasoLuca The What, Why and How of Big Data: Definitions, Examples, Suggestions, Howtos and much more

Upload: luca-naso

Post on 12-Jul-2015


Category:

Data & Analytics



TRANSCRIPT

Page 1: The What, Why and How of Big Data

December 2nd, 2014

@NasoLuca

The What, Why and How of Big Data

Definitions, Examples, Suggestions, Howtos and much more

Page 2: The What, Why and How of Big Data

Agenda

Part I

✤ What is Big Data?

✤ Big Data Examples

✤ How to Tackle a Big Data Problem

Part II

✤ Sentiment Analysis

✤ Big Data tools

Page 3: The What, Why and How of Big Data

How relevant is it?

Big Data

Social Media

Digital Marketing

Machine Learning

Computer Vision

Which topic is more relevant to people?

Let’s ask Google!

Page 4: The What, Why and How of Big Data

How relevant is it?

Big Data

Social Media

Digital Marketing

Machine Learning

Computer Vision

Google Trends

From 2007 to end 2014

Page 5: The What, Why and How of Big Data

What is Big Data? How relevant is it?

Big Data Market

In 2012 it was $28B; $37B was expected for 2013, scattered across a number of IT landscapes. 45% for new social network analysis and content analytics tools[1]

Jobs to support Big Data

4.4 million IT jobs globally by 2015, 1.9 million in the US[1]

By 2018, the US alone could face a shortage of 200k people with deep analytical skills, as well as 1.5 million managers and analysts[2]

Page 6: The What, Why and How of Big Data

Definition

Big Data according to Oxford Dictionary[3]:
big data n. Computing (also with capital initials) data of a very large size, typically to the extent that its manipulation and management present significant logistical challenges; (also) the branch of computing involving such data.

Big Data according to Gartner[4]:
Big data is high-volume, high-velocity and high-variety information assets that demand cost-effective, innovative forms of information processing for enhanced insight and decision making.

This is where the 3 Vs come from: Volume, Velocity, Variety

Page 7: The What, Why and How of Big Data

VOLUME
About: amount of data. Unit: bytes

What is Big Data? Definition

Information about the general population, education, health, medicine, travel, geographic locations, shopping, financial transactions, jobs, scientific experiments, emails, sensors, texts, photos, videos, activity on social networks …

2.5 Exabytes of data are created each day worldwide[5]

Facebook (2012): 200 PB of data each year
In 3 years CERN collected 75 PB of data (with the LHC)
Most US companies have 100 TB[5]

1 ZB = 1000^2 PB = 1000^3 TB = 1000^4 GB

How much is Big Data? > 5 TB (as of 2014)
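The unit ladder and the 5 TB rule of thumb above can be checked with a few lines of arithmetic. This is only a sketch: the helper names `convert` and `is_big_data` are made up for illustration, and decimal (SI) units are assumed.

```python
# Decimal (SI) storage units, expressed in bytes.
UNITS = {"GB": 1000**3, "TB": 1000**4, "PB": 1000**5, "EB": 1000**6, "ZB": 1000**7}

# Rule of thumb quoted on the slide (as of 2014).
BIG_DATA_THRESHOLD_TB = 5


def convert(value, src, dst):
    """Convert a data size from unit `src` to unit `dst`."""
    return value * UNITS[src] / UNITS[dst]


def is_big_data(value, unit):
    """True when the size exceeds the 5 TB threshold."""
    return convert(value, unit, "TB") > BIG_DATA_THRESHOLD_TB


# 2.5 EB created per day worldwide (figure from the slide), scaled to one year.
yearly_zb = convert(2.5 * 365, "EB", "ZB")
print(f"{yearly_zb:.2f} ZB per year")
print(is_big_data(100, "TB"), is_big_data(500, "GB"))
```

At the quoted rate of 2.5 EB per day, the world would create a little under 1 ZB per year.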

Page 8: The What, Why and How of Big Data

VELOCITY
About: moving data. Unit: bytes per second

What is Big Data? Definition

This really has two interpretations:
Data Generation Rate or Data Processing Rate

Every minute (2014)[6]:
200M emails
4M Google searches
277k more tweets
216k pictures on Instagram

What’s the limit to be considered big data?

As of 2014:
Generation: time to reach 5 TB < project lifetime
Processing: > 1 MB/s (≈ 5 TB in 2 months)
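The equivalence between 1 MB/s and 5 TB in two months is easy to verify. The sketch below assumes decimal units and takes "two months" to mean 60 days; the function names are illustrative only.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def required_rate_mb_per_s(total_tb, days):
    """Average throughput (MB/s) needed to process `total_tb` terabytes in `days` days."""
    total_mb = total_tb * 1000**2  # 1 TB = 1000^2 MB
    return total_mb / (days * SECONDS_PER_DAY)


def days_to_reach(total_tb, rate_mb_per_s):
    """Days needed to accumulate `total_tb` terabytes at a constant generation rate."""
    total_mb = total_tb * 1000**2
    return total_mb / rate_mb_per_s / SECONDS_PER_DAY


# Assumption: two months = 60 days.
rate = required_rate_mb_per_s(5, 60)
print(f"{rate:.2f} MB/s")
print(f"{days_to_reach(5, 1.0):.1f} days at 1 MB/s")
```

Processing 5 TB in 60 days needs just under 1 MB/s on average, which matches the slide's rounded figure.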

Page 9: The What, Why and How of Big Data

VARIETY
About: form of the data.

3 Types: structured, semi-structured, unstructured

What is Big Data? Definition

1. Structured = Data in a fixed field within a record (spreadsheets, Relational Database)

2. Semi-Structured = XML, JSON, CSV (Text with columns, with a separator)

3. Unstructured = Data stored without any model, or that does not have any organisation

All of them can be Big Data

Page 10: The What, Why and How of Big Data

What is Big Data? Definition

This concludes the classical 3 Vs of Big Data. To better describe Big Data we can add a couple more Vs.

VERACITY
Lack of accuracy

Data itself is often imprecise or incomplete (typos, empty fields, errors, source changes, …)
The time of small and tidy samples is over

Page 11: The What, Why and How of Big Data

VALUE
About the actionable insights one can get

What is Big Data? Definition

People do not need data, they need insights which are hidden in the data: Value is a concentrated data-juice.

Obtaining correct, but irrelevant, information is a waste of time, effort and resources.

Close interactions between an analytics team and business managers can help you address the right questions.

Page 12: The What, Why and How of Big Data

“Datafication” is the movement behind Big Data[7]

What is Big Data? Implications

Big Data implicitly requires 3 paradigm shifts:

1. from “some” to “all”

2. from “clean” to “messy”

3. from “causation” to “correlation”

Page 13: The What, Why and How of Big Data

What is Big Data? Implications

Page 14: The What, Why and How of Big Data

What is Big Data? Implications

http://xkcd.com/552/

Page 15: The What, Why and How of Big Data

Big Data Examples

Page 16: The What, Why and How of Big Data

General Application Fields

Not only business: Big Data has implications far beyond marketing and consumer goods

It will profoundly change how governments work and alter the nature of politics and our daily life too (smart cities).

When it comes to generating economic growth, providing public services, or fighting wars, those who can harness big data effectively will have a significant edge over others.

Page 17: The What, Why and How of Big Data

Forbes thinks that it will influence us in 5 ways[8]:

1. how we spend

2. how we vote

3. how we study

4. how we stay healthy

5. how we keep/lose privacy

Big Data Examples - General Application Fields

Page 18: The What, Why and How of Big Data

1. Fire-prevention @ New York City[7]

Big Data Examples - Real Life Applications

Page 19: The What, Why and How of Big Data

Problem

Imbalance between needs and resources

Too many complaints (25,000 per year), too few inspectors (200).

You want your inspectors to tackle the most relevant cases only/first.

How to prioritise the complaints?

1. Fire-prevention @ New York City[7]

Big Data Examples - Real Life Applications

Page 20: The What, Why and How of Big Data

1. Fire-prevention @ New York City[7]

Solution
a. Database with information about buildings (crime rates, ambulance visits, utility usage, missed payments, …)
b. Compare the database to records of building fires, looking for correlations
c. Estimate the probability of fire for each complaint

Big Data Examples - Real Life Applications

Result
The efficiency of the inspectors rose from 13% to 70%
Among the predictors of a fire were:
the type of building and the year it was built
permits for exterior brickwork correlated with lower risks
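The three solution steps can be illustrated with a toy ranking. This is not the model New York City actually used, which the slides do not describe. It is a minimal sketch that estimates an empirical fire rate for one hypothetical building attribute (`built_before_1940`) from made-up records, and sorts complaints by that estimate.

```python
from collections import defaultdict


def fire_rates(history, key):
    """Empirical fire rate for each value of one building attribute."""
    counts = defaultdict(lambda: [0, 0])  # attribute value -> [fires, buildings]
    for building in history:
        bucket = counts[building[key]]
        bucket[0] += building["had_fire"]
        bucket[1] += 1
    return {value: fires / total for value, (fires, total) in counts.items()}


def prioritise(complaints, rates, key):
    """Sort complaints so the highest estimated fire risk comes first."""
    return sorted(complaints, key=lambda c: rates.get(c[key], 0.0), reverse=True)


# Made-up historical records, purely for illustration.
history = [
    {"built_before_1940": True, "had_fire": 1},
    {"built_before_1940": True, "had_fire": 1},
    {"built_before_1940": True, "had_fire": 0},
    {"built_before_1940": False, "had_fire": 0},
    {"built_before_1940": False, "had_fire": 1},
    {"built_before_1940": False, "had_fire": 0},
    {"built_before_1940": False, "had_fire": 0},
]
complaints = [
    {"id": "A", "built_before_1940": False},
    {"id": "B", "built_before_1940": True},
]

rates = fire_rates(history, "built_before_1940")
ranked = prioritise(complaints, rates, "built_before_1940")
print([c["id"] for c in ranked])
```

A real system would combine many attributes at once; a single attribute keeps the correlate-then-rank idea visible.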

Page 21: The What, Why and How of Big Data

2. Improve Formula 1 car performance[9]

Big Data Examples - Real Life Applications

Page 22: The What, Why and How of Big Data

2. Improve Formula 1 car performance[9]

Big Data Examples - Real Life Applications

Why is this Big Data?

Volume = average 10+ TB of data at each GP per team

Velocity = teams take decisions in <~ 30 seconds

Main goals

1. get real-time alarms on brakes, tires, fuel and other factors that affect car performance during a race

2. find ways to improve car performance in the long term

Page 23: The What, Why and How of Big Data

2. Improve Formula 1 car performance[9]

a. Collect data: 130-160 sensors on a car during a race, plus weather conditions, track conditions …

b. Compare data with records of successes/failures

c. Look for correlations to get (1) real-time alarms and (2) long-term insights

Big Data Examples - Real Life Applications

$1B: cost of saving 0.1s from a single lap

$60M: money spent by a team on a supercomputer

Page 24: The What, Why and How of Big Data

3. Predict Flu Outbreak in Real-Time

Big Data Examples - Real Life Applications

Page 25: The What, Why and How of Big Data

3. Predict Flu Outbreak in Real-Time

Flu can spread very fast with catastrophic consequences; traditional methods can be too slow.

Each day, millions of users around the world search for health information online. As you might expect, there are more flu-related searches during flu season.

Of course, not every person who searches for "flu" is actually sick, but a pattern emerges when all the flu-related search queries are added together.

Big Data Examples - Real Life Applications

Page 26: The What, Why and How of Big Data

3. Predict Flu Outbreak in Real-Time

a. Collect data: keywords searched on the web; data collected by national medical authorities (US Centers for Disease Control and Prevention - CDC)

b. Compare the trends of search queries (top 50M) with the records in real data

c. Find the keywords that correlate with the actual trends, to make predictions based on current searches.
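Step c can be sketched as a correlation filter. The code below is not Google's method. It only assumes that each keyword has a weekly search-count series aligned with a series of officially reported cases, and keeps keywords whose Pearson correlation exceeds a hypothetical threshold `min_r`. All numbers are invented.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long series."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def select_keywords(search_counts, reported_cases, min_r=0.9):
    """Keep keywords whose search volume tracks the reported cases."""
    selected = {}
    for keyword, series in search_counts.items():
        r = pearson(series, reported_cases)
        if r >= min_r:
            selected[keyword] = r
    return selected


# Invented weekly series, for illustration only.
reported_cases = [10, 20, 40, 30, 15]
search_counts = {
    "flu symptoms": [100, 210, 390, 310, 160],
    "football scores": [50, 48, 52, 49, 51],
}

selected = select_keywords(search_counts, reported_cases)
print(selected)
```

Once the well-correlated keywords are known, their current search volume can serve as an early signal before official figures arrive.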

Big Data Examples - Real Life Applications

There are 45 keywords that correlate well with the historical data

The predictions from this system can improve the CDC data by up to 50% [Royal Society Open Science, 2014]

Page 27: The What, Why and How of Big Data

3. Predict Flu Outbreak in Real-Time

Big Data Examples - Real Life Applications

Orange: US real data

Blue: predictions based on keywords

Page 28: The What, Why and How of Big Data

3. Predict Flu Outbreak in Real-Time

Big Data Examples - Real Life Applications

Google Flu Trends (GFT) project: www.google.org/flutrends/

Published in Nature in 2009[10]

An example of both the power and the failure of Big Data.

Page 29: The What, Why and How of Big Data

4. Reduce injuries in sports[11]

Big Data Examples - Real Life Applications

Page 30: The What, Why and How of Big Data

4. Reduce injuries in sports[11]

Big Data Examples - Real Life Applications

Injuries are probably the largest market inefficiency in pro sports

In 2013, teams in Major League Baseball spent $665 million on the salaries of injured players and replacements

Goal

anticipate when an athlete will get hurt before it actually happens, so as to avoid it

Page 31: The What, Why and How of Big Data

4. Reduce injuries in sports[11]

a. Collect data: data about how players actually move (accelerations, elevations, jumping ranges, …) and at what intensity.

b. Compare with records of injuries; let doctors analyse the data

c. Predict the chances of an injury and intervene before it happens, during both workouts and matches

Big Data Examples - Real Life Applications

Founded in 2006, Catapult has increased its sales by ~70% for six consecutive years and is on track to gross $20 million in 2013.

Page 32: The What, Why and How of Big Data

5. Running massive multiplayer games

Big Data Examples - Real Life Applications

Page 33: The What, Why and How of Big Data

“Infinity Challenge”, a massive 5-week online battle.

Two needs: handle a massive amount of data in almost real time to update leaderboards and detect cheaters.

Big Data Examples - Real Life Applications

The development team was taking these insights and updating the game almost weekly, using direct player feedback to tweak the game.

Behind the scenes there was the Microsoft Big Data cloud platform - HDInsight on Azure.

5. Running massive multiplayer games

Page 34: The What, Why and How of Big Data

6. Transparency of Governments

Improving politics for all

Big Data Examples - Real Life Applications

Page 35: The What, Why and How of Big Data

6. Transparency of Governments

Improving politics for all

In 2009 the US government started www.data.gov

Today there are 133k datasets in different fields:
Agriculture, Climate, Education, Energy, Finance, Geospatial, Global Development, Health, Jobs & Skills, Public Safety, Science & Research, Weather

Big Data Examples - Real Life Applications

Many countries have followed, including Italy (from 2011): ~9k datasets from 80 public administrations (PA)

Code4italy @Montecitorio

Page 36: The What, Why and How of Big Data

The Dark Side

There is one massive downside to this: Privacy concerns

Do we really want all our data to be logged and stored? Data that can say where we are every day, which products we buy, which movies we watch, how fast (or slow) we drive our car, where we park it, which roads we usually take, where we go with our bike, how much exercise we do (or don’t), what we eat, how much we spend, which drugs we take, …

Security issues: track my position, steal my identity

Not all applications are customer-centric: insurance companies (use data to increase costs)

Page 37: The What, Why and How of Big Data

Governments need to protect citizens against unhealthy market dominance: data antitrust

They also need to better regulate the ways companies ask for and get the data (just asking for permission with Terms of Use is not enough!)

Big Data Examples - The Dark Side

At present the control of information is being taken away from citizens

The danger is that individuals will not be able to control the ways they are monitored or what happens to the information

Page 38: The What, Why and How of Big Data

How to Tackle a Big Data Problem

Page 39: The What, Why and How of Big Data

Preliminary Steps

First things first: check if it really is a Big Data problem

From the examples we have seen that the 3 common steps are:
1. collect data
2. find correlations (compare with historical records)
3. make predictions

Do not follow these steps!

These are relevant phases to execute a Big Data project, once everything is in place.

Page 40: The What, Why and How of Big Data

Preliminary steps:

1. Goals and timescale

what you want to achieve and by when

2. Data

which data you have or need to get

3. Team

which skills you need (can change with data)

4. Silo breaking

connections you need to create (CRM, IT, marketing)

5. Budget

how much money you can put overall (business stakeholders)

How to Tackle a Big Data Problem - Preliminary Steps

Page 41: The What, Why and How of Big Data

How to Tackle a Big Data Problem - Four Universal Steps

1. Collect & store data (source, privacy, real-time)

2. Clean data (NA, errors)

3. Analyse data (correlations)

4. Visualise data (KPI)

It is very unlikely that you will get everything right (or everything you need) at the first attempt. Be prepared to iterate.

4 Universal Steps

Page 42: The What, Why and How of Big Data

Agenda

Part I

✤ What is Big Data?

✤ Big Data Examples

✤ How to Tackle a Big Data Problem

Part II

✤ Sentiment Analysis

✤ Big Data tools

Page 43: The What, Why and How of Big Data

Sentiment Analysis

Page 44: The What, Why and How of Big Data

What is Sentiment Analysis?

Sentiment Analysis according to Oxford[14]:

The process of computationally identifying and categorising opinions expressed in a piece of text, especially in order to determine whether the writer’s attitude towards a particular topic, product, etc. is positive, negative, or neutral.

Page 45: The What, Why and How of Big Data

Operative definition in steps:

Trying to understand what people think about a subject,

from what they write,

automatically,

producing a measure of what they think.

Sentiment Analysis - What is Sentiment Analysis?

Page 46: The What, Why and How of Big Data

The challenge:

Sentiment Analysis - What is Sentiment Analysis?

Hundreds (if not more) of scientific papers have been published on this topic.

None of the problems is solved, yet applications are flourishing (plenty of room for new ideas)

What humans readily grasp from context is very difficult for computers to detect.
Abbreviations, bad spelling and grammar, sarcasm, irony, slang, idiom and personality

Page 47: The What, Why and How of Big Data

Show me the data! Where is the sentiment expressed?

Activity on social networks
Surveys
CRM notes
Reviews (movies, restaurants, events, …)
Blogs
News

Sentiment Analysis - What is Sentiment Analysis?

Page 48: The What, Why and How of Big Data

Why is it important?

Today people are different, they are:

1. more digital/technological

2. more connected

3. less loyal to brands

Communication is bidirectional and people’s reach is large

The People, not the Companies, have the power …

… and they are not afraid to use it.

Page 49: The What, Why and How of Big Data

Sentiment Analysis - Why is it important?

Nestlé censors a Greenpeace video criticising the company
Domino’s Pizza employees post a video showing health code violations
United Airlines broke a guitar and did not reimburse

Page 50: The What, Why and How of Big Data

Some reasons to do sentiment analysis:

Gather feedback from customers (automatic, reliable)

• Gives you the chance to react in real time

Sentiment as a proxy for sales: opinions have a strong influence

• To make predictions

Gather information from/about competitors (so start “listening”!)

• Find ways to get new customers

Sentiment Analysis - Why is it important?

Page 51: The What, Why and How of Big Data

Sentiment Analysis - Techniques[13]

One technique consists (mainly) in looking for: lexical choice, negators, intensifiers, modal operators

Here is an (old) opinion:

I bought an iPhone a few days ago. It is such a nice phone. The touch screen is really cool. The voice quality is clear too. It is much better than my old Blackberry, which was a terrible phone and so difficult to type with its tiny keys. However, my mother was mad with me as I did not tell her before I bought the phone. She also thought the phone was too expensive.

Page 52: The What, Why and How of Big Data

Sentiment Analysis - Techniques

Lexical choice (words):
positive: nice, boost, benefit, brave
negative: terrible, conspire, catastrophe, cowardly

Negator: can flip the valence
not, never

Intensifier: gives the strength of the sentiment
really, very, most

Modal operators: distinguish hypothetical from real situations and weaken intensity
might, could, should
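The four cues above can be combined into a very small lexicon-based scorer. This is a minimal sketch, not a production technique. The word lists start from the slide and are extended with a few words from the iPhone example (an assumption), and the weights (x2 for intensifiers, x0.5 for modals) are arbitrary choices.

```python
import re

# Words from the slide, plus a few from the iPhone example (assumption).
POSITIVE = {"nice", "boost", "benefit", "brave", "cool", "clear", "better"}
NEGATIVE = {"terrible", "conspire", "catastrophe", "cowardly", "difficult", "mad", "expensive"}
NEGATORS = {"not", "never"}
INTENSIFIERS = {"really", "very", "most"}
MODALS = {"might", "could", "should"}


def score_sentence(sentence):
    """Sum word valences; modifiers adjust the next sentiment word only."""
    score = 0.0
    weight, flip = 1.0, False
    for word in re.findall(r"[a-z']+", sentence.lower()):
        if word in NEGATORS:
            flip = not flip          # negator flips the valence
        elif word in INTENSIFIERS:
            weight *= 2.0            # assumed strength boost
        elif word in MODALS:
            weight *= 0.5            # assumed weakening for hypotheticals
        elif word in POSITIVE or word in NEGATIVE:
            valence = 1.0 if word in POSITIVE else -1.0
            score += (-valence if flip else valence) * weight
            weight, flip = 1.0, False
    return score


def score_text(text):
    """Score each sentence of a text separately."""
    sentences = [s.strip() for s in re.split(r"[.!?]+", text) if s.strip()]
    return [(s, score_sentence(s)) for s in sentences]


for sentence, value in score_text("It is such a nice phone. The touch screen is really cool."):
    print(f"{value:+.1f}  {sentence}")
```

Sarcasm, irony and comparisons would defeat this scorer, which is exactly the difficulty described earlier.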

Page 53: The What, Why and How of Big Data

A text can contain multiple sentiments, which will usually be connected to each other, perhaps through a comparison (as for products)

Analyse the whole text, sentence by sentence

Sentiment Analysis - Techniques


Page 54: The What, Why and How of Big Data

Sentiment Analysis - Techniques

There is a market of fake opinions!

Page 55: The What, Why and How of Big Data

Every opinion is a quintuple: entity, feature, sentiment value, holder, time

Mike87 on 23-06-2009 “I bought an iPhone a few days ago. It is such a nice phone. The touch screen is really cool. The voice quality is clear too. It is much better than my old Blackberry, which was a terrible phone and so difficult to type with its tiny keys. However, my mother was mad with me as I did not tell her before I bought the phone. She also thought the phone was too expensive”

Sentiment Analysis - Techniques

(iPhone, GENERAL, +, Mike87, 23-06-2009)
(iPhone, touch_screen, +, Mike87, 23-06-2009)
…

We are turning unstructured data into structured data
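Once opinions are quintuples, they can be stored and queried like any structured record. The sketch below encodes only the two tuples shown on the slide. The `Opinion` type and the `positive_share` helper are illustrative names, not part of any library.

```python
from datetime import date
from typing import NamedTuple


class Opinion(NamedTuple):
    """One opinion quintuple: entity, feature, sentiment value, holder, time."""
    entity: str
    feature: str
    sentiment: str  # "+" or "-"
    holder: str
    time: date


# The two quintuples listed on the slide.
opinions = [
    Opinion("iPhone", "GENERAL", "+", "Mike87", date(2009, 6, 23)),
    Opinion("iPhone", "touch_screen", "+", "Mike87", date(2009, 6, 23)),
]


def positive_share(opinions, entity):
    """Fraction of positive opinions about `entity`, or None if there are none."""
    relevant = [o for o in opinions if o.entity == entity]
    if not relevant:
        return None
    return sum(o.sentiment == "+" for o in relevant) / len(relevant)


print(positive_share(opinions, "iPhone"))
```

Structured this way, the opinions can be filtered by feature, holder or date with ordinary database or dataframe tools.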

Page 56: The What, Why and How of Big Data

An Operative Plan

Preliminary:

What’s your goal?

e.g. Reaction to my new product launch (1 month tail)

How can you obtain it?

e.g. Twitter, Facebook and related-field blogs (want to use Google Alerts?)

How can you measure it? Which KPI? Which test?

e.g. KPI: # of mentions/comments/posts, % of positive over total; choose threshold values for the goal to be met (for each KPI)
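The KPI test can be made concrete with a small helper. It assumes each mention has already been labelled `positive`, `negative` or `neutral` by some sentiment step. The threshold values and the `kpi_report` name are hypothetical.

```python
def kpi_report(labels, min_mentions, min_positive_share):
    """Compute the two KPIs and compare each against its threshold."""
    total = len(labels)
    positive = sum(1 for label in labels if label == "positive")
    share = positive / total if total else 0.0
    return {
        "mentions": total,
        "positive_share": share,
        # Ratios to the thresholds, e.g. 2.0 means "twice the threshold".
        "mentions_ratio": total / min_mentions,
        "positive_ratio": share / min_positive_share,
        "goal_met": total >= min_mentions and share >= min_positive_share,
    }


# Invented labels and thresholds, for illustration only.
labels = ["positive"] * 6 + ["negative"] * 2 + ["neutral"] * 2
report = kpi_report(labels, min_mentions=5, min_positive_share=0.5)
print(report)
```

Reporting each KPI as a multiple of its threshold makes it easy to say whether a reaction was, for instance, twice as large as the target.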

Page 57: The What, Why and How of Big Data

Universal step 1: Collect and Store The Data

Identify the data
tweets that mention the product (or the company?), comments on your Facebook page posts, the specific blogs to follow

Set up a system that can get the data
create/buy some tool to get the data automatically and programmatically

Store the data somewhere useful for the project and for your company (you don’t want to create new silos!)

Sentiment Analysis - An Operative Plan

Page 58: The What, Why and How of Big Data

Universal step 2: Clean The Data

Act on the data
deal with writer mistakes: replace, modify text
deal with program errors: remove records
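Both kinds of cleaning can be sketched in a few lines. The record layout (`id`, `text`) and the `corrections` table are assumptions for illustration. Records with a missing or empty text are treated as program errors and dropped, while known writer mistakes are replaced.

```python
import re


def clean_records(records, corrections):
    """Drop broken records and fix known writer mistakes in the rest."""
    cleaned = []
    for record in records:
        text = record.get("text")
        if not isinstance(text, str) or not text.strip():
            continue  # program error: missing or empty payload, remove the record
        text = re.sub(r"\s+", " ", text).strip()  # normalise whitespace
        for wrong, right in corrections.items():
            # Writer mistake: replace whole words only, ignoring case.
            text = re.sub(rf"\b{re.escape(wrong)}\b", right, text, flags=re.IGNORECASE)
        cleaned.append({**record, "text": text})
    return cleaned


# Invented records and corrections, for illustration only.
raw = [
    {"id": 1, "text": "Teh new  phone is gr8 "},
    {"id": 2, "text": None},
    {"id": 3, "text": "   "},
]
corrections = {"teh": "the", "gr8": "great"}

cleaned = clean_records(raw, corrections)
print(cleaned)
```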

Sentiment Analysis - An Operative Plan

Universal step 3: Analyse The Data

Analyse the data, extract the sentiment
Build the KPI

Page 59: The What, Why and How of Big Data

Universal step 4: Visualise The Data

Learn from the numbers, you need to come out with a story

e.g. Reaction was massive on Twitter and Facebook (2x threshold), initially very positive (1.5x), then reduced but still good (1.3x); for blog posts the positive test was just passed (1x)

Visualise the story:
create a dashboard to follow the evolution in real time
create a static infographic to describe what happened

Sentiment Analysis - An Operative Plan

Page 60: The What, Why and How of Big Data

Big Data Tools

Page 61: The What, Why and How of Big Data

What is Hadoop?

Apache Hadoop is an open-source software framework for distributed storage and distributed processing of Big Data on clusters of commodity hardware

Created in 2005 by Doug Cutting and Mike Cafarella

Named after a toy elephant belonging to Cutting’s son. Originally developed to support the Nutch search engine project

Page 62: The What, Why and How of Big Data

The base Apache Hadoop framework is composed of the following modules:

1. Hadoop Common – libraries and utilities for other modules

2. Hadoop Distributed File System (HDFS) – a distributed file system that splits files into large blocks and distributes them among the machines

3. Hadoop MapReduce – a programming model for large scale data processing. MapReduce ships code (.jar files) to the nodes that have the required data, and the nodes then process the data in parallel.

4. Hadoop YARN – resource-management platform
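The MapReduce programming model can be illustrated without a cluster. The sketch below runs in a single Python process and does not use Hadoop's API. It only mimics the three phases (map, shuffle, reduce) on the classic word-count problem, with function names chosen for illustration.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    for word in document.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine the values of one key into a single result."""
    return key, sum(values)


def word_count(documents):
    pairs = (pair for doc in documents for pair in map_phase(doc))
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


counts = word_count(["big data big insights", "data tools"])
print(counts)
```

On a real cluster each document block would be mapped on the node that stores it, and the shuffle would move the grouped pairs across the network to the reducers.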

Big Data Tools - What is Hadoop?

Page 63: The What, Why and How of Big Data

The Hadoop Ecosystem

Since 2012, "Hadoop" often refers not to just the base modules but rather to the Hadoop Ecosystem,

which includes all of the additional packages that can be installed on top of or alongside Hadoop.

Page 64: The What, Why and How of Big Data

Let us meet some of the “Hadoop tools”:

Hive

Pig

Sqoop

Oozie

Big Data Tools - The Hadoop Ecosystem

Page 65: The What, Why and How of Big Data

Both Hive and Pig let you run MapReduce jobs using simple query languages

Big Data Tools - The Hadoop Ecosystem

Hive
provides a SQL-like interface to data and lets you impose a schema on the data; it is best suited for structured and semi-structured data

Pig
translates the Pig Latin language so that scripts can run on Hadoop. Best suited for data flow jobs, for semi-structured and unstructured data

Page 66: The What, Why and How of Big Data

Sqoop
tool designed for efficiently transferring bulk data between Apache Hadoop and structured data stores.

Big Data Tools - The Hadoop Ecosystem

Oozie
workflow scheduler system to manage Apache Hadoop jobs.

Oozie is integrated with the rest of the Hadoop Ecosystem supporting several types of Hadoop jobs out of the box (including Pig, Hive and Sqoop) as well as system specific jobs (such as Java programs and shell scripts).

Page 67: The What, Why and How of Big Data

Big Data Tools - The Hadoop Ecosystem

Page 68: The What, Why and How of Big Data

Big Data with Microsoft

Hadoop can be deployed on premises as well as in the cloud. The cloud allows organisations to deploy Hadoop without hardware to acquire or specific setup expertise.

Vendors who currently have an offer for the cloud include Microsoft, Amazon and Google.

Let us focus on Microsoft

The key product is: HDInsight for Microsoft Azure

Page 69: The What, Why and How of Big Data

Big Data Tools - Big Data with Microsoft

Azure is Microsoft’s cloud platform, which offers several services

Azure HDInsight
deploys and provisions Apache Hadoop clusters in the cloud; it is compatible with: Ambari, Avro, HBase, HDFS, Hive, Mahout, MapReduce and YARN, Oozie, Pig, Sqoop, Storm, Zookeeper.

Azure PowerShell
A scripting environment to control and automate the deployment and management of your workloads in Azure

Page 70: The What, Why and How of Big Data

Big Data Tools - Big Data with Microsoft

Windows Azure Blob Storage WASB

Blob Storage is a general-purpose Hadoop-compatible Azure storage solution that integrates with HDInsight.

Store data in Azure (blob) instead of in the cluster (HDFS)

(Positive) Consequences:
Data are still there after you finish MapReduce jobs and shut the cluster down
Easier to share data with other applications

Page 71: The What, Why and How of Big Data

Big Data Tools - Big Data with Microsoft

Windows Azure Blob Storage WASB

Page 72: The What, Why and How of Big Data

Big Data Tools - Big Data with Microsoft

Excel on steroids, thanks to some powerful add-ins

Power Query
simplifies data discovery and access.

You can connect to data across a wide variety of sources, including relational databases, Web and Hadoop
You can combine and refine the data
You can save queries and refresh the data

Page 73: The What, Why and How of Big Data

Big Data Tools - Big Data with Microsoft

Power Pivot
allows non-specialised users to do some Business Intelligence on different data sources and create interactive reports, sharable as web applications

Power View
is a very interactive data exploration, visualisation and presentation tool

Power Map
is a data visualisation tool that lets you plot geographic and temporal data on a 3D map, show it over time, and create visual tours

Page 74: The What, Why and How of Big Data

it.linkedin.com/in/lucanaso/

@NasoLuca

Contacts

www.edisonweb.com

Page 75: The What, Why and How of Big Data

References

Big Data & Digital Marketing

Most of the original material has been posted on: