
Hadoop: Data Processing by Minions
ABCD-GIS August 2015 Presentation
Dave Strohschein, Harvard Center for Geographic Analysis
dstrohschein@cga.harvard.edu

Today’s Talk
• Why use Hadoop?
• What is Hadoop?
• How does Hadoop work?
• How are we using Hadoop?
• Issues encountered
• A broader view – future directions

Background
“…WorldMap will be extended to be capable of gathering interactive map information from hundreds of other servers around the world and making this map layer information searchable together with the WorldMap layer information.”

http://worldmap.harvard.edu/

Orientation / Motivation
Gathering interactive map information from hundreds of other servers around the world:
• KML
• Shapefiles

Overall Process
• Billions of webpages
• Hundreds of terabytes of compressed HTML text data
“We build and maintain an open repository of web crawl data that can be accessed and analyzed by anyone.” – Common Crawl

Process the Data
• Hundreds of terabytes of compressed HTML text data
• Thousands of CPU hours
• Months of processing!
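To make the "months" figure concrete, a back-of-the-envelope estimate helps. The 5,000 CPU-hour workload and machine sizes below are assumed for illustration only; the slides give just orders of magnitude.

```java
public class WallClockEstimate {
    // Wall-clock hours = total CPU-hours / total cores, assuming
    // perfectly parallel work (an idealization, of course).
    static double wallClockHours(double cpuHours, int machines, int coresPerMachine) {
        return cpuHours / (machines * coresPerMachine);
    }

    public static void main(String[] args) {
        double work = 5_000.0; // assumed total CPU-hours of processing
        // One 8-core machine: 625 hours, roughly a month of wall-clock time.
        System.out.printf("1 machine:    %.0f hours%n", wallClockHours(work, 1, 8));
        // One hundred 8-core machines: 6.25 hours.
        System.out.printf("100 machines: %.2f hours%n", wallClockHours(work, 100, 8));
    }
}
```

Under these assumed numbers, the same total work drops from about a month on one commodity server to an afternoon on a modest cluster, which is the motivation for distributing the computation.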

Common Crawl Frequency
• [ARC] s3://aws-publicdatasets/common-crawl/crawl-001/ – Crawl #1 (2008/2009)
• [ARC] s3://aws-publicdatasets/common-crawl/crawl-002/ – Crawl #2 (2009/2010)
• [ARC] s3://aws-publicdatasets/common-crawl/parse-output/ – Crawl #3 (2012)
• [WARC] s3://aws-publicdatasets/common-crawl/crawl-data/CC-MAIN-2013-20/ – Summer 2013
• [WARC] s3://aws-publicdatasets/common-crawl/crawl-data/CC-MAIN-2013-48/ – Winter 2013
• [WARC] s3://aws-publicdatasets/common-crawl/crawl-data/CC-MAIN-2014-10/ – March 2014
• [WARC] s3://aws-publicdatasets/common-crawl/crawl-data/CC-MAIN-2014-15/ – April 2014
• [WARC] s3://aws-publicdatasets/common-crawl/crawl-data/CC-MAIN-2014-23/ – July 2014
• [WARC] s3://aws-publicdatasets/common-crawl/crawl-data/CC-MAIN-2014-35/ – August 2014
• [WARC] s3://aws-publicdatasets/common-crawl/crawl-data/CC-MAIN-2014-41/ – September 2014
• [WARC] s3://aws-publicdatasets/common-crawl/crawl-data/CC-MAIN-2014-42/ – October 2014
• [WARC] s3://aws-publicdatasets/common-crawl/crawl-data/CC-MAIN-2014-49/ – November 2014
• [WARC] s3://aws-publicdatasets/common-crawl/crawl-data/CC-MAIN-2014-52/ – December 2014
• [WARC] s3://aws-publicdatasets/common-crawl/crawl-data/CC-MAIN-2015-06/ – January 2015
• [WARC] s3://aws-publicdatasets/common-crawl/crawl-data/CC-MAIN-2015-11/ – February 2015
• [WARC] s3://aws-publicdatasets/common-crawl/crawl-data/CC-MAIN-2015-14/ – March 2015
• [WARC] s3://aws-publicdatasets/common-crawl/crawl-data/CC-MAIN-2015-18/ – April 2015
• [WARC] s3://aws-publicdatasets/common-crawl/crawl-data/CC-MAIN-2015-22/ – May 2015
• [WARC] s3://aws-publicdatasets/common-crawl/crawl-data/CC-MAIN-2015-27/ – June 2015
• [WARC] s3://aws-publicdatasets/common-crawl/crawl-data/CC-MAIN-2015-32/ – July 2015

[Diagram: Hadoop master/slave cluster architecture]

Process the Data (on a cluster)
• Thousands of CPU hours
• Hours of processing

[Diagram: Hadoop master/slave cluster]
• Scalability
• Fault tolerance
• Resource sharing

Hadoop 1.0 Framework
• Hadoop Distributed File System (HDFS)
• MapReduce (MR)

MapReduce Implementation
Data moves through the framework as Key : Value (K : V) pairs: input pairs (K1 : V1) in, output pairs (KO : VO) out.

MapReduce Flow
(K1 : V1) → map → (K2 : V2) → shuffle/sort → (K2 : [V2]) → reduce → (KO : VO)
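The flow above can be sketched in plain Java (no Hadoop dependencies) using the classic word count. The class and method names here are illustrative, not part of the CGA code:

```java
import java.util.*;

public class MapReduceSketch {
    // map: one input record -> a list of intermediate (K2 : V2) pairs;
    // here, a line of text -> (word, 1) for each word.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String w : line.toLowerCase().split("\\s+"))
            if (!w.isEmpty()) out.add(Map.entry(w, 1));
        return out;
    }

    // shuffle: group intermediate values by key -> (K2 : [V2]);
    // reduce: collapse each group to one value -> (KO : VO), here a sum.
    static Map<String, Integer> run(List<String> lines) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (String line : lines)
            for (Map.Entry<String, Integer> kv : map(line))
                grouped.computeIfAbsent(kv.getKey(), k -> new ArrayList<>()).add(kv.getValue());
        Map<String, Integer> reduced = new TreeMap<>();
        grouped.forEach((k, vs) -> reduced.put(k, vs.stream().mapToInt(Integer::intValue).sum()));
        return reduced;
    }

    public static void main(String[] args) {
        System.out.println(run(List.of("hadoop maps data", "hadoop reduces data")));
    }
}
```

map emits (word : 1) pairs, the shuffle step groups values by key, and reduce sums each group — exactly the (K : V) transitions in the flow. In real Hadoop, the framework performs the shuffle across machines between the Mapper and Reducer classes.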

Hadoop HDFS

Hadoop 1.0 Issues
• Scalability – the JobTracker does everything (scheduling and monitoring)
• JobTracker – single point of failure
• Resource utilization – fixed Map and Reduce slots
• Designed only for MapReduce applications

Hadoop Evolution

Yet Another Resource Negotiator - YARN

Hadoop 2.0 Framework

Hadoop Environments
• Cloud
• Local cluster
• ‘Virtual’

A Commodity Server
• 2009 – 8 cores, 16 GB of RAM, 4×1 TB disk
• 2012 – 16+ cores, 48–96 GB of RAM, 12×2 TB or 12×3 TB of disk

http://hortonworks.com/blog/apache-hadoop-yarn-background-and-an-overview/

Hadoop on AWS EMR
• Amazon Web Services (AWS)
• Elastic Compute Cloud (EC2)
• Elastic MapReduce (EMR)

Implementing Hadoop at CGA

AWS account –FREE 750 hrs/month t1.micro (Hadoop 1.0) Smallest Amazon EC2 Instance Good for learning basics Can’t execute Hadoop 2 – needed for libraries

t1.micro m1.medium Hadoop 2 Clusters

Develop on local machine Create test specific test WARCs

Process on cluster m1.medium r3.xlarge

CommonCrawl Processing on AWS
• Local algorithm development
• Upload application (jar file) to S3
• Ruby command-line interface for EC2/EMR initialization

Implementing Hadoop at CGA
WARCTagCounter.java
• Hadoop ‘configuration’
• Input data information
• Mapper selection
• Reducer selection – a simple summer

TagCounterMap.java
• Mapper functionality
• Extends the Mapper class
• Mapper<Text, ArchiveReader, Text, LongWritable>
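Stripped of the Hadoop classes, the core of what a tag-counting mapper like TagCounterMap emits can be sketched as below. The class name, regex, and method are illustrative assumptions, not the actual CGA source:

```java
import java.util.*;
import java.util.regex.*;

public class TagCount {
    // Matches an opening HTML tag name, e.g. "<title" -> "title";
    // closing tags ("</title>") are not matched because '/' is not a letter.
    private static final Pattern TAG = Pattern.compile("<\\s*([a-zA-Z][a-zA-Z0-9]*)");

    // Emit (tagName : count) pairs for one HTML payload -- the kind of
    // (K2 : V2) output a mapper would produce per WARC record.
    static Map<String, Long> countTags(String html) {
        Map<String, Long> counts = new TreeMap<>();
        Matcher m = TAG.matcher(html);
        while (m.find())
            counts.merge(m.group(1).toLowerCase(), 1L, Long::sum);
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(countTags("<html><head><title>x</title></head><body><p>hi</p></body></html>"));
    }
}
```

In the real mapper, each ArchiveReader record's HTML payload would be fed through logic like this, with the counts written out as (Text : LongWritable) pairs for the summing reducer.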

(K1 : V1) → (K2 : V2)

Example WARC record:

WARC/1.0
WARC-Type: response
WARC-Date: 2014-08-02T09:52:13Z
WARC-Record-ID: <urn:uuid:ffbfb0c0-6456-42b0-af03-3867be6fc09f>
Content-Length: 43428
Content-Type: application/http; msgtype=response
WARC-Warcinfo-ID: <urn:uuid:3169ca8e-39a6-42e9-a4e3-9f001f067bdf>
WARC-Concurrent-To: <urn:uuid:d99f2a24-158a-4c77-bb0a-3cccd40aad56>
WARC-IP-Address: 212.58.244.61
WARC-Target-URI: http://news.bbc.co.uk/2/hi/africa/3414345.stm
WARC-Payload-Digest: sha1:M63W6MNGFDWXDSLTHF7GWUPCJUH4JK3J
WARC-Block-Digest: sha1:YHKQUSBOS4CLYFEKQDVGJ457OAPD6IJO
WARC-Truncated: length

HTTP/1.1 200 OK
Server: Apache
Vary: X-CDN
Cache-Control: max-age=0
Content-Type: text/html
Date: Sat, 02 Aug 2014 09:52:13 GMT
Expires: Sat, 02 Aug 2014 09:52:13 GMT
Connection: close
Set-Cookie: BBC-UID=......
Set-Cookie: BBC-UID=......

<!doctype html public "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/.....>
<html><head><title>BBC NEWS | Africa | Namibia braces for Nujoma exit</title>
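A record like the one above is just "Name: value" header lines followed by a payload. A minimal hand-parsing sketch (illustrative only; real code would use a library such as the ArchiveReader class that appears in the mapper's signature):

```java
import java.util.*;

public class WarcHeaders {
    // Parse the header block of a WARC record into name -> value.
    // Lines without a colon (e.g. the "WARC/1.0" version line) are skipped.
    static Map<String, String> parse(String headerBlock) {
        Map<String, String> headers = new LinkedHashMap<>();
        for (String line : headerBlock.split("\r?\n")) {
            int colon = line.indexOf(':');
            if (colon > 0)
                headers.put(line.substring(0, colon).trim(),
                            line.substring(colon + 1).trim());
        }
        return headers;
    }

    public static void main(String[] args) {
        String block = "WARC/1.0\nWARC-Type: response\n"
                     + "WARC-Target-URI: http://example.org/\nContent-Length: 43428";
        System.out.println(parse(block).get("WARC-Target-URI"));
    }
}
```

For signature detection, the interesting fields are WARC-Target-URI (where the page came from) and the HTML payload that follows the HTTP headers.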

Signature Detection

<!DOCTYPE html>

<p>… <a href="http://maps.vcgi.org/arcgis/rest/services/ ...

</a> </p>

Signatures
• “http(s)://…/arcgis/rest/services”
• “http(s)://…/arcgiscache”
• “http(s)://…?request=getcapabilities”
• “http(s)://… .kml” or “http(s)://… .kmz”
• (“shape” || “shp”) && “.zip”
• “http(s)://… "${z}/${x}/${y}" || "${z}/${y}/${x}" || "$[z]/$[x]/$[y]" || "$[z]/$[y]/$[x]" || "{z}/{x}/{y}" || "{z}/{y}/{x}" || "[z]/[x]/[y]" || "[z]/[y]/[x]"
• “http(s)://… request=getmap”
• “http(s)://… .jp2”, “http(s)://… .ecw”, “http(s)://… .sid”, “http(s)://… .tfw”
• “http(s)://… .gpx”, “http(s)://… .geojson”, “http(s)://… .gdb”
• “http(s)://…thredds…”, “http(s)://…opendap…”
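Most of these signatures are plain substring tests against a URL, which can be sketched as follows. The signature list below is a simplified subset drawn from the slide (the tile-path and shapefile patterns need more than substring matching), and the class is illustrative, not the project's matching code:

```java
import java.util.*;

public class GeoSignatures {
    // Substring signatures from the slide, lowercased for case-insensitive matching.
    private static final List<String> SIGNATURES = List.of(
        "/arcgis/rest/services", "/arcgiscache",
        "request=getcapabilities", "request=getmap",
        ".kml", ".kmz", ".gpx", ".geojson", ".gdb",
        ".jp2", ".ecw", ".sid", ".tfw",
        "thredds", "opendap");

    // Return the first signature the URL contains, if any.
    static Optional<String> match(String url) {
        String u = url.toLowerCase();
        return SIGNATURES.stream().filter(u::contains).findFirst();
    }

    public static void main(String[] args) {
        System.out.println(match("http://maps.vcgi.org/arcgis/rest/services/..."));
        System.out.println(match("http://cinematreasures.org/theaters/10911.kml"));
        System.out.println(match("http://example.org/page.html"));
    }
}
```

Because the test is a cheap substring scan, it can be applied to every link in every crawled page without dominating the mapper's runtime.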

Reducer Output
Signature match:
http://cinematreasures.org/theaters/10911.kml 1
Signature match with URI (base of URL):
http://cinematreasures.org/theaters/10911/map|||http://cinematreasures.org/theaters/10911.kml -1

Results
• It worked!
• Pre-built parsers vs. ‘homebrew’:
  – Jsoup parser: inconsistent processing times
  – RegEx parser: much more consistent results
• A wide array of geo services vis-à-vis signature choice

Issues Implementing Hadoop
• Hadoop learning curve – native Java application; tutorial information exists
• Hadoop on AWS: S3, EMR, terminology, billing / cluster size
• Optimizing the cluster: instance type, CPU, memory, etc.
• A wide array of geo services vis-à-vis signature choice – what’s out there, and what is its signature?

Future Directions
• SpatialHadoop – a MapReduce framework for spatial data
• GIS Tools for Hadoop
• Processing GeoTweets

Backup
