Transcript
Page 1: Tech4Africa - Opportunities around Big Data


Big Data

Steve Watt, Technology Strategy @ HP, [email protected], @wattsteve

Page 2: Tech4Africa - Opportunities around Big Data


Agenda

Hardware Software Data

• Situational Applications

• Big Data

Page 3: Tech4Africa - Opportunities around Big Data


Situational Applications

– eaghra (Flickr)

Page 4: Tech4Africa - Opportunities around Big Data


Web 2.0 Era Topic Map (diagram): Situational Applications, Mashups, Data Explosion, Social Platforms, Enterprise SOA, LAMP, Publishing Platforms, Inexpensive Storage

Web 2.0: Produce & Process New Information

Page 5: Tech4Africa - Opportunities around Big Data


Page 6: Tech4Africa - Opportunities around Big Data


Big Data

– blmiers2 (Flickr)

Page 7: Tech4Africa - Opportunities around Big Data

The data just keeps growing…

1024 Gigabytes = 1 Terabyte
1024 Terabytes = 1 Petabyte
1024 Petabytes = 1 Exabyte

1 Petabyte = 13.3 years of HD video

20 Petabytes = amount of data processed by Google daily

5 Exabytes = all words ever spoken by humanity
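As a quick sanity check on the unit ladder above, here is a minimal sketch (binary multiples, i.e. factors of 1024, as on the slide):

```python
# Binary unit ladder from the slide: each step is a factor of 1024.
GIGABYTE = 1024 ** 3          # bytes
TERABYTE = 1024 * GIGABYTE
PETABYTE = 1024 * TERABYTE
EXABYTE  = 1024 * PETABYTE

# Google's ~20 PB/day figure from the slide, expressed in terabytes:
print(20 * PETABYTE // TERABYTE)  # 20480
```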

Page 8: Tech4Africa - Opportunities around Big Data


Web as a Platform

Web 1.0 - Connecting Machines: Infrastructure

Web 2.0 - Connecting People: API Foundation (Facebook, Twitter, LinkedIn, Google, Netflix, PayPal, eBay, Pandora, New York Times)

The Fractured Web

Service Economy: a service for this, a service for that

App Economy for Devices: an app for this, an app for that

Web 2.0 Data Exhaust of Historical and Real-time Data

Real-time Data: Mobile, Set Top Boxes, Tablets, etc.

Sensor Web: an instrumented and monitored world, with multiple sensors in your pocket

Opportunity

Page 9: Tech4Africa - Opportunities around Big Data


Data Deluge! But filter patterns can help…

Kakadu (Flickr)

Page 10: Tech4Africa - Opportunities around Big Data


Filtering With Search

Page 11: Tech4Africa - Opportunities around Big Data


Awesome

Filtering Socially

Page 12: Tech4Africa - Opportunities around Big Data


Filtering Visually

Page 13: Tech4Africa - Opportunities around Big Data

But filter patterns force you down a pre-processed path

M.V. Jantzen (Flickr)

Page 14: Tech4Africa - Opportunities around Big Data


What if you could ask your own questions?

wowwzers (Flickr)

Page 15: Tech4Africa - Opportunities around Big Data

– MrB-MMX (Flickr)

And go from discovering Something about Everything…

Page 16: Tech4Africa - Opportunities around Big Data


To discovering Everything about Something ?

Page 17: Tech4Africa - Opportunities around Big Data


How do we do this?

Let's examine a few techniques for Gathering, Storing, Processing & Delivering Data @ Scale

Page 18: Tech4Africa - Opportunities around Big Data


Gathering Data

Data Marketplaces

Page 19: Tech4Africa - Opportunities around Big Data

Page 20: Tech4Africa - Opportunities around Big Data

Page 21: Tech4Africa - Opportunities around Big Data

Gathering Data

Apache Nutch (Web Crawler)

Page 22: Tech4Africa - Opportunities around Big Data


Storing, Reading and Processing - Apache Hadoop

Cluster technology with a single master that scales out with multiple slaves. It consists of two runtimes:

The Hadoop Distributed File System (HDFS) and Map/Reduce

As data is copied onto HDFS, it is split into blocks and replicated to other machines to provide redundancy.

A self-contained job (workload) is written in Map/Reduce and submitted to the Hadoop Master, which in turn distributes the job to each slave in the cluster.

Jobs run on data that is on the local disks of the machines they are sent to, ensuring data locality.

Node (slave) failures are handled automatically by Hadoop, which may execute or re-execute a task on any node in the cluster.

Want to know more? “Hadoop – The Definitive Guide (2nd Edition)”
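The Map/Reduce model above can be sketched with a minimal word count in the Hadoop Streaming style (an illustrative single-process sketch, not the real distributed runtime): the map phase emits (word, 1) pairs, the framework groups them by key, and the reduce phase sums the counts.

```python
# Minimal word count in the Map/Reduce style used by Hadoop Streaming.
# In a real cluster, the map phase runs on each slave against local HDFS
# blocks, and the framework performs the shuffle/sort between the phases.
from itertools import groupby
from operator import itemgetter

def map_phase(lines):
    for line in lines:
        for word in line.split():
            yield (word.lower(), 1)

def reduce_phase(pairs):
    # Sorting by key simulates Hadoop's shuffle, which groups map output.
    for word, group in groupby(sorted(pairs, key=itemgetter(0)), key=itemgetter(0)):
        yield (word, sum(count for _, count in group))

counts = dict(reduce_phase(map_phase(["big data big clusters", "data locality matters"])))
print(counts)  # {'big': 2, 'clusters': 1, 'data': 2, 'locality': 1, 'matters': 1}
```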

Page 23: Tech4Africa - Opportunities around Big Data


Delivering Data @ Scale

• Structured Data

• Low Latency & Random Access

• Column Stores (Apache HBase or Apache Cassandra)

• faster seeks

• better compression

• simpler scale out

• De-normalized – Data is written as it is intended to be queried

Want to know more? “HBase – The Definitive Guide” & “Cassandra High Performance Cookbook”
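The "data is written as it is intended to be queried" point can be sketched with a toy wide-row layout (an illustrative sketch only; the row-key and column names are hypothetical, and real HBase/Cassandra data modeling is covered in the books above): each access path gets its own row, so a read is one key lookup instead of a join.

```python
# Toy denormalized column store: one row per query pattern, so a read
# is a single key lookup instead of a multi-table join.
store = {}

def put(row_key, column, value):
    store.setdefault(row_key, {})[column] = value

# Write the same investment twice, once per access path (denormalization).
put("company:Acme", "round:2011-A", 5_000_000)
put("investor:BigVC", "Acme:2011-A", 5_000_000)

# "Everything about something": all rounds for Acme in one seek.
print(store["company:Acme"])  # {'round:2011-A': 5000000}
```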

Page 24: Tech4Africa - Opportunities around Big Data


Storing, Processing & Delivering: Hadoop + NoSQL

Gather (onto HDFS):
- Flume Connector: log files
- SQOOP Connector: relational data (JDBC) from MySQL
- Nutch Crawl: web data

Read/Transform (Apache Hadoop):
- Clean and filter data
- Transform and enrich data
- Often multiple Hadoop jobs

Serve/Copy:
- Results copied into a NoSQL repository, queried by the application through a NoSQL Connector/API for low-latency access

Page 25: Tech4Africa - Opportunities around Big Data


Some things to keep in mind…

– Kanaka Menehune (Flickr)

Page 26: Tech4Africa - Opportunities around Big Data


Some things to keep in mind…

• Processing arbitrary types of data (unstructured, semi-structured, structured) requires normalizing it with many different kinds of readers.

Hadoop is really great at this!

• However, readers won't really help you process truly unstructured data such as prose. For that you're going to have to get handy with Natural Language Processing, which is really hard.

Consider using parsing services & APIs like Open Calais

Want to know more? “Programming Pig” (O’REILLY)
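The reader-normalization idea can be sketched as follows (a sketch with hypothetical field names; a real pipeline would use Pig load functions or custom Hadoop input formats for the same job): two readers turn differently shaped inputs into one common schema.

```python
# Sketch: two "readers" normalize differently shaped records (CSV and
# JSON lines) into one common schema for downstream processing.
import csv
import io
import json

def read_csv(text):
    for row in csv.reader(io.StringIO(text)):
        yield {"company": row[0], "city": row[1], "amount": int(row[2])}

def read_json_lines(text):
    for line in text.splitlines():
        rec = json.loads(line)
        yield {"company": rec["name"], "city": rec["hq"], "amount": rec["raised"]}

records = list(read_csv("Acme,Austin,1000000\n"))
records += list(read_json_lines('{"name": "Globex", "hq": "Boston", "raised": 2000000}'))
print([r["company"] for r in records])  # ['Acme', 'Globex']
```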

Page 27: Tech4Africa - Opportunities around Big Data


Open Calais (Gnosis)

Page 28: Tech4Africa - Opportunities around Big Data


Statistical real-time decision making

Capture historical information.

Use machine learning to build decision-making models (such as classification, clustering & recommendation).

Mesh real-time events (such as sensor data) against the models to make automated decisions.

Want to know more? “Mahout in Action”
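A minimal sketch of that pattern (the sensor values and labels are made up for illustration; Mahout provides industrial-strength versions of these algorithms): build a model from historical readings, then score real-time events against it.

```python
# Sketch: learn a per-class centroid from historical sensor readings,
# then classify each incoming real-time reading by nearest centroid.
def train(history):
    # history: list of (label, value) pairs
    sums = {}
    for label, value in history:
        total, count = sums.get(label, (0.0, 0))
        sums[label] = (total + value, count + 1)
    return {label: total / count for label, (total, count) in sums.items()}

def classify(model, value):
    # Pick the class whose centroid is closest to the observed value.
    return min(model, key=lambda label: abs(model[label] - value))

model = train([("normal", 20.0), ("normal", 22.0), ("overheat", 80.0), ("overheat", 90.0)])
print(classify(model, 78.0))  # overheat
```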

Page 29: Tech4Africa - Opportunities around Big Data


Tech Bubble?

What does the Data Say?

Pascal Terjan (Flickr)

Page 30: Tech4Africa - Opportunities around Big Data

Page 31: Tech4Africa - Opportunities around Big Data

Page 32: Tech4Africa - Opportunities around Big Data

Apache Nutch

Identify optimal seed URLs & crawl to a depth of 2

http://www.crunchbase.com/companies?c=a&q=privately_held

Crawl data is stored in segment dirs on the HDFS
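A depth-limited crawl like this can be sketched as a breadth-first fetch loop (the link graph below is mocked; Nutch does this at scale against real pages and writes each round's results into a new segment directory on HDFS):

```python
# Sketch of a depth-limited crawl like Nutch's generate/fetch cycle.
# LINKS mocks the web's link graph for illustration.
LINKS = {
    "seed": ["a", "b"],
    "a": ["c"],
    "b": ["c", "d"],
    "c": ["e"],  # c sits two hops from the seed: discovered but never fetched
}

def crawl(seeds, max_depth):
    fetched, frontier = set(), list(seeds)
    for _ in range(max_depth):           # one fetch round per depth level
        next_frontier = []
        for url in frontier:
            if url in fetched:
                continue
            fetched.add(url)             # "fetch" the page
            next_frontier.extend(LINKS.get(url, []))
        frontier = next_frontier
    return fetched

print(sorted(crawl(["seed"], 2)))  # ['a', 'b', 'seed']
```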

Page 33: Tech4Africa - Opportunities around Big Data

Page 34: Tech4Africa - Opportunities around Big Data

Making the data STRUCTURED

Retrieving HTML

Prelim filtering on URL

Company POJO then \t out
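Those steps can be sketched end to end (the URL filter, regexes, and page layout below are hypothetical, not CrunchBase's actual markup): filter pages by URL, pull fields out of the HTML, and emit a tab-delimited record for the "\t out" step.

```python
# Sketch: URL filtering + field extraction + tab-delimited output.
# The URL pattern and HTML structure are invented for illustration.
import re

def to_record(url, html):
    if "/companies/" not in url:          # prelim filtering on URL
        return None
    name = re.search(r"<h1>(.*?)</h1>", html)
    city = re.search(r'class="city">(.*?)<', html)
    if not (name and city):
        return None
    return "\t".join([name.group(1), city.group(1)])  # the "\t out" step

html = '<h1>Acme</h1><span class="city">Austin</span>'
print(to_record("http://example.com/companies/acme", html))
```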

Page 35: Tech4Africa - Opportunities around Big Data


Aargh!

My viz tool requires zipcodes to plot geospatially!

Page 36: Tech4Africa - Opportunities around Big Data

Apache Pig script to join on City and State to get the Zip Code, and write the results to Vertica

ZipCodes = LOAD 'demo/zipcodes.txt' USING PigStorage('\t') AS (State:chararray, City:chararray, ZipCode:int);

CrunchBase = LOAD 'demo/crunchbase.txt' USING PigStorage('\t') AS (Company:chararray, City:chararray, State:chararray, Sector:chararray, Round:chararray, Month:int, Year:int, Investor:chararray, Amount:int);

CrunchBaseZip = JOIN CrunchBase BY (City, State), ZipCodes BY (City, State);

STORE CrunchBaseZip INTO '{CrunchBaseZip(Company varchar(40), City varchar(40), State varchar(40), Sector varchar(40), Round varchar(40), Month int, Year int, Investor varchar(40), Amount int)}' USING com.vertica.pig.VerticaStorer('VerticaServer','OSCON','5433','dbadmin','');

Page 37: Tech4Africa - Opportunities around Big Data

Total Tech Investments By Year

Page 38: Tech4Africa - Opportunities around Big Data

Investment Funding By Sector

Page 39: Tech4Africa - Opportunities around Big Data


Total Investments By Zip Code for all Sectors

$7.3 Billion in San Francisco

$2.9 Billion in Mountain View

$1.2 Billion in Boston

$1.7 Billion in Austin

Page 40: Tech4Africa - Opportunities around Big Data


Total Investments By Zip Code for Consumer Web

$1.2 Billion in Chicago

$600 Million in Seattle

$1.7 Billion in San Francisco

Page 41: Tech4Africa - Opportunities around Big Data


Total Investments By Zip Code for BioTech

$1.3 Billion in Cambridge

$528 Million in Dallas

$1.1 Billion in San Diego

Page 42: Tech4Africa - Opportunities around Big Data


Questions?

