Page 1: Hadoop: Data Processing by Minions ABCD-GIS August 2015 Presentation Dave Strohschein, Harvard Center for Geographic Analysis dstrohschein@cga.harvard.edu

Hadoop: Data Processing by Minions

ABCD-GIS August 2015 Presentation
Dave Strohschein, Harvard Center for Geographic Analysis
dstrohschein@cga.harvard.edu

Page 2:

Today’s Talk
• Why use Hadoop?
• What is Hadoop?
• How does Hadoop work?
• How are we using Hadoop?
• Issues encountered
• A broader view – future directions

Page 3:

Background

“…WorldMap will be extended to be capable of gathering interactive map information from hundreds of other servers around the world and making this map layer information searchable together with the WorldMap layer information.”

http://worldmap.harvard.edu/

Page 4:

Orientation / Motivation

gathering interactive map information from hundreds of other servers around the world

KML

Shapefiles

Page 5:

Overall Process

Billions of webpages

Hundreds of terabytes of compressed HTML text data

Page 6:

We build and maintain an open repository of web crawl data that can be accessed and analyzed by anyone.

Page 7:

Page 8:

Process the Data

Hundreds of terabytes of compressed HTML text data

Thousands of CPU hours

Months of processing!

Page 9:

Common Crawl Frequency
• [ARC] s3://aws-publicdatasets/common-crawl/crawl-001/ - Crawl #1 (2008/2009)
• [ARC] s3://aws-publicdatasets/common-crawl/crawl-002/ - Crawl #2 (2009/2010)
• [ARC] s3://aws-publicdatasets/common-crawl/parse-output/ - Crawl #3 (2012)
• [WARC] s3://aws-publicdatasets/common-crawl/crawl-data/CC-MAIN-2013-20/ - Summer 2013
• [WARC] s3://aws-publicdatasets/common-crawl/crawl-data/CC-MAIN-2013-48/ - Winter 2013
• [WARC] s3://aws-publicdatasets/common-crawl/crawl-data/CC-MAIN-2014-10/ - March 2014
• [WARC] s3://aws-publicdatasets/common-crawl/crawl-data/CC-MAIN-2014-15/ - April 2014
• [WARC] s3://aws-publicdatasets/common-crawl/crawl-data/CC-MAIN-2014-23/ - July 2014
• [WARC] s3://aws-publicdatasets/common-crawl/crawl-data/CC-MAIN-2014-35/ - August 2014
• [WARC] s3://aws-publicdatasets/common-crawl/crawl-data/CC-MAIN-2014-41/ - September 2014
• [WARC] s3://aws-publicdatasets/common-crawl/crawl-data/CC-MAIN-2014-42/ - October 2014
• [WARC] s3://aws-publicdatasets/common-crawl/crawl-data/CC-MAIN-2014-49/ - November 2014
• [WARC] s3://aws-publicdatasets/common-crawl/crawl-data/CC-MAIN-2014-52/ - December 2014
• [WARC] s3://aws-publicdatasets/common-crawl/crawl-data/CC-MAIN-2015-06/ - January 2015
• [WARC] s3://aws-publicdatasets/common-crawl/crawl-data/CC-MAIN-2015-11/ - February 2015
• [WARC] s3://aws-publicdatasets/common-crawl/crawl-data/CC-MAIN-2015-14/ - March 2015
• [WARC] s3://aws-publicdatasets/common-crawl/crawl-data/CC-MAIN-2015-18/ - April 2015
• [WARC] s3://aws-publicdatasets/common-crawl/crawl-data/CC-MAIN-2015-22/ - May 2015
• [WARC] s3://aws-publicdatasets/common-crawl/crawl-data/CC-MAIN-2015-27/ - June 2015
• [WARC] s3://aws-publicdatasets/common-crawl/crawl-data/CC-MAIN-2015-32/ - July 2015

Page 10:

Master

Slaves

Page 11:

Master

Slaves

Page 12:

Process the data

Hours

Thousands of CPU hours
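
The move from months to hours is, to a first approximation, division of the same CPU hours across many workers. A rough sketch of that arithmetic follows. The 3,000 CPU hours and 500 cores are illustrative assumptions, not figures from the talk, and the estimate ignores scheduling, shuffle, and I/O overhead.

```java
/** Back-of-the-envelope estimate; all inputs are illustrative assumptions. */
final class WallClockEstimate {

    /** Ideal wall-clock hours when the work divides evenly across cores (no overhead). */
    static double wallClockHours(double cpuHours, int cores) {
        if (cores <= 0) {
            throw new IllegalArgumentException("cores must be positive");
        }
        return cpuHours / cores;
    }

    public static void main(String[] args) {
        double cpuHours = 3000.0; // hypothetical workload size
        double single = wallClockHours(cpuHours, 1);
        System.out.printf("1 core:    %.0f hours (about %.0f days)%n", single, single / 24);
        System.out.printf("500 cores: %.0f hours%n", wallClockHours(cpuHours, 500));
    }
}
```

Under these assumed numbers, one core needs about 125 days while 500 cores need about 6 hours for the same total CPU time.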

Page 13:

Master

Slaves

• Scalability
• Fault Tolerance
• Resource Sharing

Page 14:

Hadoop 1.0 Framework

Hadoop Distributed File System - HDFS

MapReduce - MR

Page 15:

MapReduce Implementation

Key : Value

or

K : V

K1 : V1

KO : VO

Page 16:

MapReduce Flow

(K ,V)

(K ,V)

(K ,[V]) (K ,V)
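
The flow above, (K, V) pairs from the map phase grouped into (K, [V]) and reduced back to (K, V), can be imitated in a single process. The sketch below is plain Java with no Hadoop classes. The class name `MiniMapReduce`, the whitespace tokenizer, and the sample records are assumptions made for illustration.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Single-process illustration of map -> shuffle -> reduce; no Hadoop classes are used. */
final class MiniMapReduce {

    /** Map phase: emit one (token, 1) pair per whitespace-separated token in each record. */
    static List<Map.Entry<String, Long>> map(List<String> records) {
        List<Map.Entry<String, Long>> pairs = new ArrayList<>();
        for (String record : records) {
            for (String token : record.trim().split("\\s+")) {
                if (!token.isEmpty()) {
                    pairs.add(new AbstractMap.SimpleEntry<>(token, 1L));
                }
            }
        }
        return pairs;
    }

    /** Shuffle phase: group values by key, (K, V) -> (K, [V]). */
    static Map<String, List<Long>> shuffle(List<Map.Entry<String, Long>> pairs) {
        Map<String, List<Long>> grouped = new TreeMap<>();
        for (Map.Entry<String, Long> pair : pairs) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }
        return grouped;
    }

    /** Reduce phase: sum the values of each key, (K, [V]) -> (K, V). */
    static Map<String, Long> reduce(Map<String, List<Long>> grouped) {
        Map<String, Long> totals = new TreeMap<>();
        for (Map.Entry<String, List<Long>> group : grouped.entrySet()) {
            long sum = 0;
            for (long value : group.getValue()) {
                sum += value;
            }
            totals.put(group.getKey(), sum);
        }
        return totals;
    }

    public static void main(String[] args) {
        List<String> records = Arrays.asList("kml shp kml", "wms kml");
        System.out.println(reduce(shuffle(map(records)))); // prints {kml=3, shp=1, wms=1}
    }
}
```

In Hadoop the three phases run on different machines and the framework performs the shuffle; here they are ordinary method calls so the data shapes are easy to inspect.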

Page 17:

Hadoop HDFS

Page 18:

Page 19:

Hadoop 1.0 Issues

Scalability – Job Tracker does it

Job Tracker – single point of failure

Resource Utilization – Map & Reduce slots

Designed for MapReduce Applications

Page 20:

Hadoop Evolution

Yet Another Resource Negotiator - YARN

Page 21:

Hadoop 2.0 Framework

Page 22:

Hadoop Environments

Cloud

Local Cluster

‘Virtual’

Page 23:

A Commodity Server

2009 – 8 cores, 16GB of RAM, 4x1TB disk

2012 – 16+ cores, 48-96GB of RAM, 12x2TB or 12x3TB of disk.

http://hortonworks.com/blog/apache-hadoop-yarn-background-and-an-overview/

Page 24:

Amazon Web Services

Page 25:

Page 26:

Hadoop on AWS EMR

Elastic Compute Cloud (EC2)

Elastic MapReduce (EMR)

Amazon Web Services (AWS)

Page 27:

Implementing Hadoop at CGA

AWS account – FREE 750 hrs/month t1.micro (Hadoop 1.0)
• Smallest Amazon EC2 instance
• Good for learning basics
• Can’t execute Hadoop 2 – needed for libraries

t1.micro m1.medium Hadoop 2 Clusters

Develop on local machine
• Create test-specific WARCs

Process on cluster m1.medium r3.xlarge

Page 28:

CommonCrawl Processing on AWS

• Local algorithm development

• Upload application (jar file) to S3

• Ruby command-line-interface for EC2/EMR initialization

Page 29:

Implementing Hadoop at CGA

WARCTagCounter.java
• Hadoop ‘configuration’
• Input data information
• Mapper selection
• Reducer selection – simple summer

TagCounterMap.java
• Mapper functionality
• Extends the Mapper class
• Mapper<Text, ArchiveReader, Text, LongWritable>

K1 : V1 (mapper input) – Text : ArchiveReader

K2 : V2 (mapper output) – Text : LongWritable

Page 30:

WARC/1.0
WARC-Type: response
WARC-Date: 2014-08-02T09:52:13Z
WARC-Record-ID: <urn:uuid:ffbfb0c0-6456-42b0-af03-3867be6fc09f>
Content-Length: 43428
Content-Type: application/http; msgtype=response
WARC-Warcinfo-ID: <urn:uuid:3169ca8e-39a6-42e9-a4e3-9f001f067bdf>
WARC-Concurrent-To: <urn:uuid:d99f2a24-158a-4c77-bb0a-3cccd40aad56>
WARC-IP-Address: 212.58.244.61
WARC-Target-URI: http://news.bbc.co.uk/2/hi/africa/3414345.stm
WARC-Payload-Digest: sha1:M63W6MNGFDWXDSLTHF7GWUPCJUH4JK3J
WARC-Block-Digest: sha1:YHKQUSBOS4CLYFEKQDVGJ457OAPD6IJO
WARC-Truncated: length

HTTP/1.1 200 OK
Server: Apache
Vary: X-CDN
Cache-Control: max-age=0
Content-Type: text/html
Date: Sat, 02 Aug 2014 09:52:13 GMT
Expires: Sat, 02 Aug 2014 09:52:13 GMT
Connection: close
Set-Cookie: BBC-UID=......
Set-Cookie: BBC-UID=......

<!doctype html public "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/.....>
<html>
<head>
<title>BBC NEWS | Africa | Namibia braces for Nujoma exit</title>
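
A WARC record like the one above starts with a version line, continues with "Name: value" header fields, and ends its header block at the first blank line. The sketch below reads those fields into a map using only the Java standard library. The class name `WarcHeaderSketch` is hypothetical, and this is not the parser used in the talk, whose mapper receives an ArchiveReader and so presumably relies on a library for this step.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Parses the named header fields of one WARC record; a sketch, not a full WARC reader. */
final class WarcHeaderSketch {

    /** Reads "Name: value" lines until the first blank line, which ends the header block. */
    static Map<String, String> parseHeaders(String record) {
        Map<String, String> headers = new LinkedHashMap<>();
        for (String line : record.split("\r?\n")) {
            if (line.isEmpty()) {
                break; // blank line separates the WARC headers from the HTTP payload
            }
            int colon = line.indexOf(':');
            if (colon > 0) { // skips the "WARC/1.0" version line, which has no colon
                headers.put(line.substring(0, colon).trim(), line.substring(colon + 1).trim());
            }
        }
        return headers;
    }

    public static void main(String[] args) {
        String record = String.join("\n",
                "WARC/1.0",
                "WARC-Type: response",
                "WARC-Date: 2014-08-02T09:52:13Z",
                "WARC-Target-URI: http://news.bbc.co.uk/2/hi/africa/3414345.stm",
                "Content-Length: 43428",
                "",
                "HTTP/1.1 200 OK");
        Map<String, String> headers = parseHeaders(record);
        System.out.println(headers.get("WARC-Type") + " " + headers.get("WARC-Target-URI"));
    }
}
```

Only the first colon on a line is treated as the separator, so values that contain colons, such as the timestamp and the target URI, are kept whole.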

Page 31:

Signature Detection

<!DOCTYPE html>

<p>… <a href="http://maps.vcgi.org/arcgis/rest/services/ ...

</a> </p>

Page 32:

Signatures

“http(s)://…/arcgis/rest/services”
“http(s)://…/arcgiscache”
“http(s)://…?request=getcapabilities”
“http(s)://… .kml” or “http(s)://… .kmz”
(“shape” || “shp”) && “.zip”
“http(s)://… "${z}/${x}/${y}" || "${z}/${y}/${x}" || "$[z]/$[x]/$[y]" || "$[z]/$[y]/$[x]" || "{z}/{x}/{y}" || "{z}/{y}/{x}" || "[z]/[x]/[y]" || "[z]/[y]/[x]"
“http(s)://… request=getmap”
“http(s)://… .jp2”
“http(s)://… .ecw”
“http(s)://… .sid”
“http(s)://… .tfw”
“http(s)://… .gpx”
“http(s)://… .geojson”
“http(s)://… .gdb”
“http(s)://…thredds…”
“http(s)://…opendap…”
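
Signatures of this kind translate naturally into regular expressions applied to each link on a page. The sketch below covers four of them. The labels, the exact patterns, the href extraction, and the class name `SignatureSketch` are my approximations for illustration, not the expressions used in the project.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Regex-based signature matching over href values; patterns are illustrative approximations. */
final class SignatureSketch {

    // Finds double-quoted href values; a real HTML parser would be more robust.
    private static final Pattern HREF =
            Pattern.compile("href=\"([^\"]+)\"", Pattern.CASE_INSENSITIVE);

    // Hypothetical labels mapped to patterns for a subset of the slide's signatures.
    private static final Map<String, Pattern> SIGNATURES = new LinkedHashMap<>();
    static {
        int ci = Pattern.CASE_INSENSITIVE;
        SIGNATURES.put("arcgis-rest", Pattern.compile("^https?://.*/arcgis/rest/services", ci));
        SIGNATURES.put("getcapabilities",
                Pattern.compile("^https?://.*\\?.*request=getcapabilities", ci));
        SIGNATURES.put("kml", Pattern.compile("^https?://.*\\.km[lz]$", ci));
        SIGNATURES.put("xyz-tiles",
                Pattern.compile("^https?://.*\\{z\\}/(\\{x\\}/\\{y\\}|\\{y\\}/\\{x\\})", ci));
    }

    /** Returns "label<TAB>url" for every href in the HTML that matches a signature. */
    static List<String> findMatches(String html) {
        List<String> matches = new ArrayList<>();
        Matcher href = HREF.matcher(html);
        while (href.find()) {
            String url = href.group(1);
            for (Map.Entry<String, Pattern> signature : SIGNATURES.entrySet()) {
                if (signature.getValue().matcher(url).find()) {
                    matches.add(signature.getKey() + "\t" + url);
                }
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        // example.org URLs are placeholders, not links found in the crawl.
        String html = "<p><a href=\"http://example.org/arcgis/rest/services/demo\">map</a> "
                + "<a href=\"http://example.org/about.html\">about</a></p>";
        for (String match : findMatches(html)) {
            System.out.println(match);
        }
    }
}
```

In a MapReduce job, each match would be emitted by the mapper as a key with a count of 1, leaving the summing to the reducer.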

Page 33:

Reducer Output

http://cinematreasures.org/theaters/10911.kml 1
(key: signature match)

http://cinematreasures.org/theaters/10911/map|||http://cinematreasures.org/theaters/10911.kml -1
(key: URI (base of URL) ||| signature match)
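
Downstream tools need to split these output lines back into their parts. The sketch below assumes each line is a key followed by whitespace and a number (Hadoop's default text output separates key and value with a tab), and that a key containing "|||" holds the page URI followed by the matched URL. The slide does not say what the values 1 and -1 mean, so the number is kept as-is. The class name `ReducerLineSketch` is hypothetical.

```java
/** Splits one reducer output line into its parts; field meanings are inferred from the slide. */
final class ReducerLineSketch {

    final String pageUri;  // page where the match was found, or null if the key has no "|||"
    final String matchUrl; // URL that matched a signature
    final long value;      // number emitted by the reducer, kept as-is

    private ReducerLineSketch(String pageUri, String matchUrl, long value) {
        this.pageUri = pageUri;
        this.matchUrl = matchUrl;
        this.value = value;
    }

    /** Parses "key<whitespace>value", where key is "matchUrl" or "pageUri|||matchUrl". */
    static ReducerLineSketch parse(String line) {
        String trimmed = line.trim();
        int sep = Math.max(trimmed.lastIndexOf(' '), trimmed.lastIndexOf('\t'));
        if (sep < 0) {
            throw new IllegalArgumentException("expected key and value: " + line);
        }
        String key = trimmed.substring(0, sep).trim();
        long value = Long.parseLong(trimmed.substring(sep + 1));
        int bar = key.indexOf("|||");
        if (bar < 0) {
            return new ReducerLineSketch(null, key, value);
        }
        return new ReducerLineSketch(key.substring(0, bar), key.substring(bar + 3), value);
    }

    public static void main(String[] args) {
        ReducerLineSketch r = parse("http://cinematreasures.org/theaters/10911/map"
                + "|||http://cinematreasures.org/theaters/10911.kml -1");
        System.out.println(r.pageUri + " -> " + r.matchUrl + " (" + r.value + ")");
    }
}
```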

Page 34:

Results

It worked!

Pre-built parsers vs. ‘homebrew’
• Jsoup parser: inconsistent processing times
• RegEx parser: much more consistent results

A wide array of geo services vis-à-vis signature choice

Page 35:

Issues Implementing Hadoop

Hadoop learning curve
• Native Java application
• Tutorial information exists

Hadoop on AWS: S3, EMR, terminology, billing / cluster size

Optimizing cluster: instance type, CPU, memory, etc.

A wide array of geo services vis-à-vis signature choice
• What’s out there and what’s its signature

Page 36:

Page 37:

Future Directions

SpatialHadoop: A MapReduce Framework for Spatial Data

GIS Tools for Hadoop

Processing GeoTweets

Page 38:

Backup