Mining the Web for Information using Hadoop


TRANSCRIPT

Page 1: Mining the Web for Information using Hadoop

Mining the web with Hadoop
Steve Watt, Emerging Technologies @ HP

Photo: Someday Soon (Flickr)

Page 2: Mining the Web for Information using Hadoop

Photo: timsnell (Flickr)

Page 3: Mining the Web for Information using Hadoop


Gathering Data

Data Marketplaces

Page 4: Mining the Web for Information using Hadoop


Page 5: Mining the Web for Information using Hadoop


Page 6: Mining the Web for Information using Hadoop


Gathering Data

Apache Nutch (Web Crawler)

Page 7: Mining the Web for Information using Hadoop


Tech Bubble?

What does the Data Say?

Photo: Pascal Terjan (Flickr)

Page 8: Mining the Web for Information using Hadoop


Page 9: Mining the Web for Information using Hadoop


Page 10: Mining the Web for Information using Hadoop


Using Apache Nutch

Identify Optimal Seed URLs for a Seed List & Crawl to a depth of 2

For example:

http://www.crunchbase.com/companies?c=a&q=private_held
http://www.crunchbase.com/companies?c=b&q=private_held
http://www.crunchbase.com/companies?c=c&q=private_held
http://www.crunchbase.com/companies?c=d&q=private_held
. . .
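The seed list itself is just a plain-text file with one URL per line. Here is a minimal Java sketch of generating the per-letter CrunchBase listing URLs shown above; the seed.txt file name is an assumption, and the file ultimately goes into whatever seed directory you hand to Nutch.

import java.io.PrintWriter;

// Writes one CrunchBase listing URL per leading letter (a-z) to a seed file.
// "seed.txt" is illustrative; drop it into the directory you pass to Nutch.
public class SeedListGenerator {
    public static void main(String[] args) throws Exception {
        try (PrintWriter out = new PrintWriter("seed.txt", "UTF-8")) {
            for (char c = 'a'; c <= 'z'; c++) {
                out.println("http://www.crunchbase.com/companies?c=" + c + "&q=private_held");
            }
        }
    }
}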

Crawl data is stored in sequence files in the segments directory on HDFS
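For reference, a rough sketch of reading those fetched pages back out of a segment, assuming the Nutch 1.x layout where a segment's content directory holds part files of Text URL keys and Nutch Content values; the part-file path passed on the command line (e.g. crawl/segments/<timestamp>/content/part-00000/data) is illustrative.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.nutch.protocol.Content;

// Prints the URL, content type, and size of every page in one segment part file.
public class SegmentDump {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path part = new Path(args[0]);   // path to a segment content data file
        SequenceFile.Reader reader = new SequenceFile.Reader(fs, part, conf);
        Text url = new Text();
        Content content = new Content();
        while (reader.next(url, content)) {
            System.out.println(url + "\t" + content.getContentType()
                    + "\t" + content.getContent().length + " bytes");
        }
        reader.close();
    }
}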

Page 11: Mining the Web for Information using Hadoop


ALSO

Page 12: Mining the Web for Information using Hadoop


Making the data STRUCTURED

Retrieving HTML

Prelim Filtering on URL

Company POJO, then tab-delimited output (sketched below)
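A hedged sketch of that step as a map-only Hadoop job, assuming the fetched pages have already been dumped to text as one URL<TAB>HTML line per page; the Company class and its field extraction are hypothetical placeholders, not CrunchBase's actual markup.

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Map-only step: keep only company detail pages, parse each page into a
// Company POJO, and emit it as one tab-delimited record.
public class CompanyMapper extends Mapper<LongWritable, Text, Text, NullWritable> {

    // Minimal POJO; the field extraction here is a placeholder, a real job
    // would parse the page markup with an HTML parser.
    static class Company {
        String name = "", city = "", state = "", country = "", amount = "";

        static Company fromHtml(String url, String html) {
            Company c = new Company();
            c.name = url.substring(url.lastIndexOf('/') + 1);
            // ... pull city/state/country/funding rounds out of the HTML here ...
            return c;
        }
    }

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        String[] parts = line.toString().split("\t", 2);
        if (parts.length < 2) return;                  // malformed line
        String url = parts[0];
        String html = parts[1];

        // Preliminary filtering on URL: only company detail pages.
        if (!url.contains("crunchbase.com/company/")) return;

        Company c = Company.fromHtml(url, html);
        String record = c.name + "\t" + c.city + "\t" + c.state + "\t"
                + c.country + "\t" + c.amount;
        context.write(new Text(record), NullWritable.get());
    }
}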

Page 13: Mining the Web for Information using Hadoop


Company | City | State | Country | Sector | Round | Day | Month | Year | Amount | Investors
InfoChimps | Austin | TX | USA | Enterprise | Angel | 14 | 9 | 2010 | 350000 | Stage One Capital
InfoChimps | Austin | TX | USA | Enterprise | A | 7 | 11 | 2010 | 1200000 | DFJ Mercury
MassRelevance | Austin | TX | USA | Enterprise | A | 20 | 12 | 2010 | 2200000 | Floodgate, AV, etc.
Masher | Calabasas | CA | USA | Games_Video | Seed | 0 | 2 | 2009 | 175000 |
Masher | Calabasas | CA | USA | Games_Video | Angel | 11 | 8 | 2009 | 300000 | Tech Coast Angels

The Result? Tab Delimited Structured Data…

Note: I dropped the ZipCode because it didn’t occur consistently

Page 14: Mining the Web for Information using Hadoop


Time to Analyze/Visualize the data…

Step 1: Select the right visual encoding for your questions

Let's start by asking questions & seeing what we can learn from some simple Bar Charts…

Page 15: Mining the Web for Information using Hadoop

Total Tech Investments By Year

Page 16: Mining the Web for Information using Hadoop

Total Tech Investments By Year

Page 17: Mining the Web for Information using Hadoop

Investment Funding By Sector

Page 18: Mining the Web for Information using Hadoop


Total Investments By Zip Code for all Sectors

$7.3 Billion in San Francisco

$2.9 Billion in Mountain View

$1.2 Billion in Boston

$1.7 Billion in Austin

Page 19: Mining the Web for Information using Hadoop


Page 20: Mining the Web for Information using Hadoop


Total Investments By Zip Code for Consumer Web

$1.2 Billion in Chicago

$600 Million in Seattle

$1.7 Billion in San Francisco

Page 21: Mining the Web for Information using Hadoop


Total Investments By Zip Code for BioTech

$1.3 Billion in Cambridge

$528 Million in Dallas

$1.1 Billion in San Diego

Page 22: Mining the Web for Information using Hadoop


Geospatial Encoding of Data

Page 23: Mining the Web for Information using Hadoop


Steve’s Not so Excellent Adventure

• Let’s try a Choropleth Encoding of the distribution of investment income by County

• Wait, what is GeoJSON?

• OK, the GeoJSON County is mapped to some code

• Each County code has a value that corresponds to a palette color (see the sketch after this list)

• So what are these codes? FIPS Codes? But Google returns 3 & 5 digit codes?!?

• I found a 5-digit code list; it has A LOT of codes in it. I'm going to assume it's correct because there is no way I can manually verify all of them
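A toy sketch of that last mapping step, county value to palette color; the dollar thresholds and the five-color palette below are assumptions for illustration, not the ones used in the talk.

// Maps a county's total investment (in dollars) to a palette color for the
// choropleth. Thresholds and colors are illustrative only.
public class PaletteBucket {
    private static final long[] THRESHOLDS = {
            10_000_000L, 100_000_000L, 500_000_000L, 1_000_000_000L };
    private static final String[] PALETTE = {
            "#f7fbff", "#c6dbef", "#6baed6", "#2171b5", "#08306b" };

    static String colorFor(long amount) {
        int bucket = 0;
        while (bucket < THRESHOLDS.length && amount >= THRESHOLDS[bucket]) {
            bucket++;
        }
        return PALETTE[bucket];
    }

    public static void main(String[] args) {
        // e.g. FIPS 06075 (San Francisco County) with $7.3B lands in the darkest bucket
        System.out.println(colorFor(7_300_000_000L));   // -> #08306b
    }
}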

Page 24: Mining the Web for Information using Hadoop


Generating Investment Income By County

-- Sum investment amounts per (City, State), then join to FIPS codes
FIPS = LOAD 'data/fips.txt' USING PigStorage('\t') AS (City, State, FIPSCode);
Amt = LOAD 'data/equity.txt' USING PigStorage('\t') AS (City, State, Amount);
AmtGroup = GROUP Amt BY (City, State);
SumGroup = FOREACH AmtGroup GENERATE FLATTEN(group) AS (City, State), SUM(Amt.Amount) AS Amount;
JoinGroup = JOIN SumGroup BY (City, State), FIPS BY (City, State);
Final = FOREACH JoinGroup GENERATE FIPS::FIPSCode, SumGroup::Amount;

RESULT:
51234   5000000
16234   1234000
(...)

ALWAYS, ALWAYS check your output…

Page 25: Mining the Web for Information using Hadoop


But wait, why are there duplicate records?

Apparently some cities can actually belong to two counties… I guess I’ll pick one.

Page 26: Mining the Web for Information using Hadoop


Yay, no duplicates. Let's visualize this!

• Wait, what happened to California?

• Aaargh, I stored the FIPS codes in Pig as ints instead of chararrays, which trimmed off the leading zeros. OK, I add them back (sketched below). Voila! We have California.
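Re-padding is trivial once you spot the problem; a minimal Java sketch of the idea (the actual fix in the talk may well have been done in Pig or in post-processing):

public class FipsPad {
    // County FIPS codes are always five digits; left-pad values that lost
    // their leading zero when they were treated as ints.
    static String padFips(int fipsAsInt) {
        return String.format("%05d", fipsAsInt);
    }

    public static void main(String[] args) {
        System.out.println(padFips(6085));   // Santa Clara County, CA -> 06085
    }
}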

Page 27: Mining the Web for Information using Hadoop


On Error Checking…

• Crowd-sourced data has LOADS of errors in it that actually influence your results. You need a good system that helps you identify those errors.

• Santa Clara, Ca

• Santa, Clara

• Santa, Clara CA

• Track (count) input and output records and examine the results. Something fishy? (A counter sketch follows below.)
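In Hadoop MapReduce this is easy with custom counters; a small sketch, where the counter group, counter names, and the 11-column check (matching the tab-delimited record from earlier) are made up for illustration.

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Counts how many input records were seen, kept, or rejected so the job
// summary makes data-quality problems visible.
public class CountingMapper extends Mapper<LongWritable, Text, Text, NullWritable> {

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        context.getCounter("DataQuality", "INPUT_RECORDS").increment(1);

        String line = value.toString();
        if (line.split("\t").length != 11) {   // expect 11 tab-delimited columns
            context.getCounter("DataQuality", "REJECTED_RECORDS").increment(1);
            return;
        }
        context.getCounter("DataQuality", "OUTPUT_RECORDS").increment(1);
        context.write(value, NullWritable.get());
    }
}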

Page 28: Mining the Web for Information using Hadoop


Page 29: Mining the Web for Information using Hadoop


Questions?

Steve Watt [email protected]

@wattsteve

emergingafrican.com