Hadoop: Data Processing by Minions
ABCD-GIS August 2015 Presentation
Dave Strohschein, Harvard Center for Geographic Analysis
dstrohschein@cga.harvard.edu
Today’s Talk
• Why use Hadoop?
• What is Hadoop?
• How does Hadoop work?
• How are we using Hadoop?
• Issues encountered
• A broader view – future directions
Background
“…WorldMap will be extended to be capable of gathering interactive map information from hundreds of other servers around the world and making this map layer information searchable together with the WorldMap layer information.”
http://worldmap.harvard.edu/
Orientation / Motivation
gathering interactive map information from hundreds of other servers around the world
KML
Shapefiles
Overall Process
Billions of webpages
Hundreds of terabytes of compressed HTML text data
We build and maintain an open repository of web crawl data that can be accessed and analyzed by anyone.
Process the Data
Hundreds of terabytes of compressed HTML text data
Thousands of CPU hours
Months of processing!
Common Crawl Frequency
• [ARC] s3://aws-publicdatasets/common-crawl/crawl-001/ - Crawl #1 (2008/2009)
• [ARC] s3://aws-publicdatasets/common-crawl/crawl-002/ - Crawl #2 (2009/2010)
• [ARC] s3://aws-publicdatasets/common-crawl/parse-output/ - Crawl #3 (2012)
• [WARC] s3://aws-publicdatasets/common-crawl/crawl-data/CC-MAIN-2013-20/ - Summer 2013
• [WARC] s3://aws-publicdatasets/common-crawl/crawl-data/CC-MAIN-2013-48/ - Winter 2013
• [WARC] s3://aws-publicdatasets/common-crawl/crawl-data/CC-MAIN-2014-10/ - March 2014
• [WARC] s3://aws-publicdatasets/common-crawl/crawl-data/CC-MAIN-2014-15/ - April 2014
• [WARC] s3://aws-publicdatasets/common-crawl/crawl-data/CC-MAIN-2014-23/ - July 2014
• [WARC] s3://aws-publicdatasets/common-crawl/crawl-data/CC-MAIN-2014-35/ - August 2014
• [WARC] s3://aws-publicdatasets/common-crawl/crawl-data/CC-MAIN-2014-41/ - September 2014
• [WARC] s3://aws-publicdatasets/common-crawl/crawl-data/CC-MAIN-2014-42/ - October 2014
• [WARC] s3://aws-publicdatasets/common-crawl/crawl-data/CC-MAIN-2014-49/ - November 2014
• [WARC] s3://aws-publicdatasets/common-crawl/crawl-data/CC-MAIN-2014-52/ - December 2014
• [WARC] s3://aws-publicdatasets/common-crawl/crawl-data/CC-MAIN-2015-06/ - January 2015
• [WARC] s3://aws-publicdatasets/common-crawl/crawl-data/CC-MAIN-2015-11/ - February 2015
• [WARC] s3://aws-publicdatasets/common-crawl/crawl-data/CC-MAIN-2015-14/ - March 2015
• [WARC] s3://aws-publicdatasets/common-crawl/crawl-data/CC-MAIN-2015-18/ - April 2015
• [WARC] s3://aws-publicdatasets/common-crawl/crawl-data/CC-MAIN-2015-22/ - May 2015
• [WARC] s3://aws-publicdatasets/common-crawl/crawl-data/CC-MAIN-2015-27/ - June 2015
• [WARC] s3://aws-publicdatasets/common-crawl/crawl-data/CC-MAIN-2015-32/ - July 2015
Master
Slaves
Process the data
Hours
Thousands of CPU hours
Master
Slaves
• Scalability
• Fault Tolerance
• Resource Sharing
Hadoop 1.0 Framework
Hadoop Distributed File System - HDFS
MapReduce - MR
MapReduce Implementation
Key : Value
or K : V
K1 : V1
KO : VO
MapReduce Flow
(K ,V)
(K ,V)
(K ,[V]) (K ,V)
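The flow above, (K, V) pairs out of the mapper, grouped to (K, [V]), then reduced back to (K, V), can be sketched in plain Java without a cluster. This is only an illustration of the data flow and not Hadoop code: the class name MapReduceFlowSketch, its method names, and the sample input are all assumptions made for this example, and a real job would use Hadoop's Mapper and Reducer classes instead.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical, single-process illustration of map -> shuffle -> reduce.
public class MapReduceFlowSketch {

    // Map step: one input line becomes a list of (K, V) pairs, here (token, 1).
    public static List<Map.Entry<String, Long>> map(String line) {
        List<Map.Entry<String, Long>> pairs = new ArrayList<>();
        for (String token : line.trim().split("\\s+")) {
            if (!token.isEmpty()) {
                pairs.add(new AbstractMap.SimpleEntry<>(token, 1L));
            }
        }
        return pairs;
    }

    // Shuffle step: group all values that share a key, giving (K, [V]).
    public static Map<String, List<Long>> shuffle(List<Map.Entry<String, Long>> pairs) {
        Map<String, List<Long>> grouped = new TreeMap<>();
        for (Map.Entry<String, Long> pair : pairs) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }
        return grouped;
    }

    // Reduce step: collapse (K, [V]) to (K, V) with a simple sum.
    public static long reduce(List<Long> values) {
        long sum = 0;
        for (long v : values) {
            sum += v;
        }
        return sum;
    }

    // Runs the three steps in order over all input lines.
    public static Map<String, Long> run(List<String> lines) {
        List<Map.Entry<String, Long>> pairs = new ArrayList<>();
        for (String line : lines) {
            pairs.addAll(map(line));
        }
        Map<String, Long> result = new TreeMap<>();
        for (Map.Entry<String, List<Long>> e : shuffle(pairs).entrySet()) {
            result.put(e.getKey(), reduce(e.getValue()));
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(run(Arrays.asList("kml shp kml", "shp kml")));
    }
}
```

On a cluster the map calls run in parallel on different slaves and the framework performs the shuffle; the single loop here only stands in for that.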
Hadoop HDFS
Hadoop 1.0 Issues
Scalability – Job Tracker does it
Job Tracker – single point of failure
Resource Utilization – Map & Reduce slots
Designed for MapReduce Applications
Hadoop Evolution
Yet Another Resource Negotiator - YARN
Hadoop 2.0 Framework
Hadoop Environments
Cloud
Local Cluster
‘Virtual’
A Commodity Server
2009 – 8 cores, 16GB of RAM, 4x1TB disk
2012 – 16+ cores, 48-96GB of RAM, 12x2TB or 12x3TB of disk.
http://hortonworks.com/blog/apache-hadoop-yarn-background-and-an-overview/
Amazon Web Services
Hadoop on AWS EMR
Elastic Compute Cloud (EC2)
Elastic MapReduce (EMR)
Amazon Web Services (AWS)
Implementing Hadoop at CGA
AWS account – FREE 750 hrs/month t1.micro (Hadoop 1.0)
• Smallest Amazon EC2 instance
• Good for learning basics
• Can’t execute Hadoop 2 – needed for libraries
t1.micro → m1.medium: Hadoop 2 clusters
Develop on local machine: create test-specific WARCs
Process on cluster: m1.medium → r3.xlarge
CommonCrawl Processing on AWS
• Local algorithm development
• Upload application (jar file) to S3
• Ruby command-line-interface for EC2/EMR initialization
Implementing Hadoop at CGA
WARCTagCounter.java
TagCounterMap.java
• Hadoop ‘configuration’
• Input data information
• Mapper selection
• Reducer selection – simple summer
• Mapper functionality
• Extends the Mapper class
• Mapper<Text, ArchiveReader, Text, LongWritable>
K1 : V1
K2 : V2
WARC/1.0
WARC-Type: response
WARC-Date: 2014-08-02T09:52:13Z
WARC-Record-ID: <urn:uuid:ffbfb0c0-6456-42b0-af03-3867be6fc09f>
Content-Length: 43428
Content-Type: application/http; msgtype=response
WARC-Warcinfo-ID: <urn:uuid:3169ca8e-39a6-42e9-a4e3-9f001f067bdf>
WARC-Concurrent-To: <urn:uuid:d99f2a24-158a-4c77-bb0a-3cccd40aad56>
WARC-IP-Address: 212.58.244.61
WARC-Target-URI: http://news.bbc.co.uk/2/hi/africa/3414345.stm
WARC-Payload-Digest: sha1:M63W6MNGFDWXDSLTHF7GWUPCJUH4JK3J
WARC-Block-Digest: sha1:YHKQUSBOS4CLYFEKQDVGJ457OAPD6IJO
WARC-Truncated: length

HTTP/1.1 200 OK
Server: Apache
Vary: X-CDN
Cache-Control: max-age=0
Content-Type: text/html
Date: Sat, 02 Aug 2014 09:52:13 GMT
Expires: Sat, 02 Aug 2014 09:52:13 GMT
Connection: close
Set-Cookie: BBC-UID=......
Set-Cookie: BBC-UID=......

<!doctype html public "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/.....><html><head><title>
BBC NEWS | Africa | Namibia braces for Nujoma exit</title>
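A WARC record like the one above starts with a version line, then "Name: value" header lines, then a blank line before the HTTP response. As a rough sketch of how the source page's address (WARC-Target-URI) could be pulled out, the plain-Java example below parses that header block into a map. The class WarcHeaderSketch and its method are hypothetical names for this illustration; the talk's mapper receives an ArchiveReader, which handles this parsing itself, so this is not the code used in the project.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper: parses the header block of a single WARC record.
public class WarcHeaderSketch {

    // Reads "Name: value" lines until the first blank line.
    public static Map<String, String> parseHeaders(String record) {
        Map<String, String> headers = new LinkedHashMap<>();
        for (String line : record.split("\\r?\\n")) {
            if (line.isEmpty()) {
                break; // a blank line ends the WARC header block
            }
            int colon = line.indexOf(':');
            if (colon <= 0) {
                continue; // skips the "WARC/1.0" version line, which has no colon
            }
            // Split on the first colon only, so URLs and timestamps stay intact.
            headers.put(line.substring(0, colon).trim(), line.substring(colon + 1).trim());
        }
        return headers;
    }

    public static void main(String[] args) {
        String record = "WARC/1.0\n"
                + "WARC-Type: response\n"
                + "WARC-Date: 2014-08-02T09:52:13Z\n"
                + "WARC-Target-URI: http://news.bbc.co.uk/2/hi/africa/3414345.stm\n"
                + "\n"
                + "HTTP/1.1 200 OK\n";
        System.out.println(parseHeaders(record).get("WARC-Target-URI"));
    }
}
```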
Signature Detection
<!DOCTYPE html>
<p>… <a href="http://maps.vcgi.org/arcgis/rest/services/ ...
</a> </p>
Signatures
“http(s)://…/arcgis/rest/services”
“http(s)://…/arcgiscache”
“http(s)://…?request=getcapabilities”
“http(s)://… .kml” or “http(s)://… .kmz”
(“shape” || “shp”) && “.zip”
“http(s)://…” containing "${z}/${x}/${y}" || "${z}/${y}/${x}" || "$[z]/$[x]/$[y]" || "$[z]/$[y]/$[x]" || "{z}/{x}/{y}" || "{z}/{y}/{x}" || "[z]/[x]/[y]" || "[z]/[y]/[x]"
“http(s)://… request=getmap”
“http(s)://… .jp2”
“http(s)://… .ecw”
“http(s)://… .sid”
“http(s)://… .tfw”
“http(s)://… .gpx”
“http(s)://… .geojson”
“http(s)://… .gdb”
“http(s)://…thredds…”
“http(s)://…opendap…”
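A few of the signatures above can be expressed as regular expressions and applied to every URL found in a page. The sketch below is an assumption-laden illustration, not the project's detector: the class SignatureSketch, the label names, the loose URL pattern, and the example.org test URLs are all invented here, and only four of the listed signatures are covered.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical regex-based signature matcher for a subset of the listed signatures.
public class SignatureSketch {
    private static final int CI = Pattern.CASE_INSENSITIVE;

    // Deliberately loose: any http(s) URL up to whitespace, a quote, or an angle bracket.
    private static final Pattern URL = Pattern.compile("https?://[^\\s\"'<>]+", CI);

    // Label -> pattern, checked in insertion order; labels are made up for this sketch.
    private static final Map<String, Pattern> SIGNATURES = new LinkedHashMap<>();
    static {
        SIGNATURES.put("arcgis-rest", Pattern.compile("/arcgis/rest/services", CI));
        SIGNATURES.put("getcapabilities", Pattern.compile("[?&]request=getcapabilities", CI));
        SIGNATURES.put("kml", Pattern.compile("\\.km[lz]($|[?#])", CI));
        // ("shape" || "shp") && ".zip": two lookaheads, so the order in the URL does not matter.
        SIGNATURES.put("shapefile-zip", Pattern.compile("^(?=.*(shape|shp))(?=.*\\.zip)", CI));
    }

    // Returns the label of the first matching signature, or null if none match.
    public static String classify(String url) {
        for (Map.Entry<String, Pattern> e : SIGNATURES.entrySet()) {
            if (e.getValue().matcher(url).find()) {
                return e.getKey();
            }
        }
        return null;
    }

    // Scans raw HTML and returns "label<TAB>url" for every URL that matches a signature.
    public static List<String> findMatches(String html) {
        List<String> hits = new ArrayList<>();
        Matcher m = URL.matcher(html);
        while (m.find()) {
            String url = m.group();
            String label = classify(url);
            if (label != null) {
                hits.add(label + "\t" + url);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        String html = "<p><a href=\"http://example.org/arcgis/rest/services/demo\">map</a></p>";
        System.out.println(findMatches(html));
    }
}
```

In a mapper, each hit would be emitted as a key with a count of 1, leaving the summing to the reducer.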
Reducer Output
http://cinematreasures.org/theaters/10911.kml 1
http://cinematreasures.org/theaters/10911/map|||http://cinematreasures.org/theaters/10911.kml -1
Signature match
Signature match
URI (base of URL)
Results
It worked!
Pre-built parsers vs. ‘homebrew’
• Jsoup parser: inconsistent processing times
• RegEx parser: much more consistent results
A wide array of geo services vis-à-vis signature choice
Issues Implementing Hadoop
Hadoop learning curve
• Native Java application
• Tutorial information exists
Hadoop on AWS: S3, EMR, terminology, billing / cluster size
Optimizing cluster: instance type, CPU, memory, etc.
A wide array of geo services vis-à-vis signature choice
• What’s out there and what’s its signature?
Future Directions
SpatialHadoop
A MapReduce Framework for Spatial Data
GIS Tools for Hadoop
Processing GeoTweets
Backup