Inspire 2015 - Clear Channel Outdoor: Building On-Demand Business Location Datasets

Post on 30-Jul-2015

Category:

Data & Analytics

#inspire15

Building On-Demand Business Location Datasets

Or… How I Stopped Worrying about Bad Business Location Data and Learned to Love the Download Tool

Tuesday, May 19, 2015

John Hollingsworth, GIS Manager, Clear Channel Outdoor

Business Problem

Bad Data = Unhappy Clients

Business Problem

• We create maps and analyses that contain locations of our clients, their competitors, and other points of interest.

• The data need to be current and accurate.

• The data are constantly changing and therefore require a real-time source.

• Existing solutions all have downsides.

Existing Solutions

Comprehensive Business Dataset (Dun & Bradstreet, DatabaseUSA)

• Expensive

• Often outdated

• Often poor spatial accuracy

• Duplicates in some cases (e.g. a Walmart listed separately as a pharmacy, tire store, etc.)

Existing Solutions

Aggregators (AggData, Factual)

• Not comprehensive

• On-demand requests cost money and time

• Refreshed only periodically

Existing Solutions

Data from client (spreadsheet of addresses)

• Requires geocoding/data quality checks

• Requires continual requests to ensure current data

• Not available in most cases

Alteryx-based Solution

Use the Alteryx Download tool to ‘scrape’ data from a website’s location tool.

Quick Demonstration

Yikes!!!

Is this legal?

Cuz it doesn’t feel legal.

Yes. This Is Legal.

• The US Supreme Court has ruled that “an author who claims infringement must prove ‘the existence of ... intellectual production, of thought, and conception’” and, in reference to phone number listings, that “these bits of information are uncopyrightable facts” (Feist v. Rural, 1991).

• Terms of Service agreements on websites do not protect factual information.

• A company could theoretically bring a case for damages if the download process is so intense that it disrupts service on their servers. You may need to throttle your collection so that it does not amount to this type of intrusive attack.

• All that said, caveat metentis: consult your in-house legal staff for additional clarification.

Overview Of How To Do This

• Analyze Web Page and Location App Web Traffic

• Determine Collection Method to Use Based on Website Architecture

• Configure Download Tool

• Parse Results

• Error Correct

• Troubleshoot

Analyze Web Traffic

Analyze Web Traffic: Best Practices

• Use web traffic debugging software such as Fiddler

• http://www.telerik.com/download/fiddler

• Set output to Raw in both windows

• Turn on cookies

• Determine whether you must use an iterative tool or not: sometimes all of the locations are listed on one page.

• Be rigorous: often there is an obvious, hard way and also a subtle, easy way.

• Experiment by trial and error: copy data from the Inspectors window and run it in the Composer window.

Collection Methods

Single Request

• A single request returns all addresses and latitude/longitude data

• The response may be JSON, XML, or the main web page itself

• Hint: look for a single Google Map with all points
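As a rough illustration of the single-request case, the sketch below flattens one JSON response into rows. It is written in Python rather than Alteryx, and the payload shape and field names ("locations", "lat", "lng", and so on) are assumptions made up for the example; a real locator will use its own names.

```python
import json

# Hypothetical payload: the field names here are assumptions, not a real site's schema.
SAMPLE = """{"locations": [
  {"id": 101, "name": "Store A", "address": "1 Main St", "lat": 33.45, "lng": -112.07},
  {"id": 102, "name": "Store B", "address": "9 Oak Ave", "lat": 29.42, "lng": -98.49}
]}"""


def rows_from_single_response(body):
    """Flatten one JSON response into (id, name, address, lat, lon) tuples."""
    payload = json.loads(body)
    return [
        (loc["id"], loc["name"], loc["address"], float(loc["lat"]), float(loc["lng"]))
        for loc in payload["locations"]
    ]


rows = rows_from_single_response(SAMPLE)
for row in rows:
    print(row)
```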

Collection Methods

Multi-Step

• List of store URLs on main page -> pull each page

• List of states -> List of stores -> pull file or each page

• List of states -> List of cities -> List of stores -> pull file or each page
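One step of a multi-step collection (index page to list of links) might look like the sketch below. It uses Python's standard html.parser instead of Alteryx tools, and the "/stores/" path filter and sample HTML are hypothetical; the same function would be applied again to each state page to reach the store pages.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect absolute URLs of <a href> links whose href contains a marker."""

    def __init__(self, base_url, must_contain):
        super().__init__()
        self.base_url = base_url
        self.must_contain = must_contain
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href and self.must_contain in href:
            self.links.append(urljoin(self.base_url, href))


def extract_links(html, base_url, must_contain):
    parser = LinkCollector(base_url, must_contain)
    parser.feed(html)
    return parser.links


# Hypothetical state index page; "/stores/" is an assumed URL pattern.
STATE_INDEX = (
    '<a href="/stores/az/">Arizona</a> '
    '<a href="/about">About</a> '
    '<a href="/stores/tx/">Texas</a>'
)
state_pages = extract_links(STATE_INDEX, "http://www.store.com/", "/stores/")
print(state_pages)
```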

Collection Methods

Sequential IDs

• e.g. http://www.store.com/3829

• Iterate through a set number of integers for store IDs

• Can be tricky because there are sometimes huge gaps in the IDs
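The ID-gap problem can be sketched as follows, in Python and with a stand-in fetch function so nothing is actually downloaded. The stopping rule (give up after a run of consecutive misses) and the max_consecutive_misses parameter are assumptions for illustration, not part of the original workflow; set the limit too low and stores beyond a large gap are silently missed.

```python
def crawl_sequential_ids(fetch, start, stop, max_consecutive_misses=500):
    """Try IDs start..stop-1 and stop early after a long run of misses."""
    found = {}
    misses = 0
    for store_id in range(start, stop):
        page = fetch(f"http://www.store.com/{store_id}")
        if page is None:
            misses += 1
            if misses >= max_consecutive_misses:
                break
        else:
            misses = 0
            found[store_id] = page
    return found


# Stand-in for a real download: only these IDs "exist".
FAKE_SITE = {3829: "store 3829", 3830: "store 3830", 3900: "store 3900"}


def fake_fetch(url):
    return FAKE_SITE.get(int(url.rsplit("/", 1)[1]))


patient = crawl_sequential_ids(fake_fetch, 3825, 4000, max_consecutive_misses=100)
impatient = crawl_sequential_ids(fake_fetch, 3825, 4000, max_consecutive_misses=10)
print(sorted(patient), sorted(impatient))
```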

Collection Methods

Spatial

• Use zip codes for search criteria instead of city/state

• Grid centroids based on search radius

• Grid MBR (minimum bounding rectangle) values based on search radius

• Tip: Experiment with enlarging the search radius. If there is no limit, you can get all locations in one request.

Common Spatial Searches

Grid centroids as Lat/Long input values with 100 mile radius
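A minimal sketch of generating grid centroids from a search radius, in Python. The geometry is approximate: it assumes about 69 miles per degree of latitude, scales longitude by the cosine of the row's latitude, and uses the fact that circles on a square grid cover the area when the spacing is at most radius × √2. The overlap factor and the bounding box are assumptions for illustration.

```python
import math

MILES_PER_DEG_LAT = 69.0  # rough approximation


def grid_centroids(min_lat, max_lat, min_lon, max_lon, radius_miles, overlap=0.9):
    """Return (lat, lon) search centers whose circles roughly cover the box."""
    # Shrink the theoretical maximum spacing a little to leave a safety margin.
    spacing = radius_miles * math.sqrt(2) * overlap
    lat_step = spacing / MILES_PER_DEG_LAT
    points = []
    lat = min_lat + lat_step / 2
    while lat - lat_step / 2 < max_lat:
        # A degree of longitude gets shorter away from the equator.
        lon_step = spacing / (MILES_PER_DEG_LAT * math.cos(math.radians(lat)))
        lon = min_lon + lon_step / 2
        while lon - lon_step / 2 < max_lon:
            points.append((round(lat, 4), round(lon, 4)))
            lon += lon_step
        lat += lat_step
    return points


# Rough bounding box for the contiguous US (an assumption for illustration).
centers = grid_centroids(24.5, 49.5, -125.0, -66.9, radius_miles=100)
print(len(centers), centers[:3])
```

Each centroid would then be fed to the locator as the latitude/longitude input with a 100 mile radius; overlapping circles mean the results need deduplicating afterwards.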

Common Spatial Searches

Zip codes nearest to grid centroids as input values with 100 mile radius

Configure Download Tool

• Determine whether the request uses the GET or POST method

• Watch out for Encode URL Text

• Copy the headers

• Experiment using Fiddler Composer to see which headers are necessary

• Try without the Cookie header, as cookies can expire and break your workflow.
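For readers who want to see the same choices outside Alteryx, the sketch below builds (but does not send) a GET and a POST request with Python's standard library. The "/locator" path, parameter names, and header values are hypothetical. The last line shows why URL encoding deserves care: encoding text that is already encoded turns "%" into "%25".

```python
from urllib.parse import quote, urlencode
from urllib.request import Request

# Headers copied from a captured request would go here; these values are assumptions.
HEADERS = {
    "User-Agent": "Mozilla/5.0",
    "Accept": "application/json",
    "X-Requested-With": "XMLHttpRequest",
}


def build_get(base_url, params):
    """GET: parameters are encoded into the query string."""
    return Request(f"{base_url}?{urlencode(params)}", headers=HEADERS, method="GET")


def build_post(base_url, params):
    """POST: parameters are encoded into the request body."""
    body = urlencode(params).encode("ascii")
    headers = {**HEADERS, "Content-Type": "application/x-www-form-urlencoded"}
    return Request(base_url, data=body, headers=headers, method="POST")


params = {"zip": "85004", "radius": 100, "q": "store locator"}
get_req = build_get("http://www.store.com/locator", params)
post_req = build_post("http://www.store.com/locator", params)

# Encoding an already-encoded value double-encodes it.
double_encoded = quote("store%20locator")
print(get_req.full_url, post_req.data, double_encoded)
```

Note that no Cookie header is included, in line with the advice to try without one first.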

Parse Results

Parse Results: Best Practices

• Sample: if you are iterating, run just a few iterations to test the parse logic.

• Look for meta property tags if you are on a store’s page

• Add a RecordID if iterating, as the JSON numbering will restart with each response

• Use the JSON/XML parsing tools in Alteryx
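The RecordID point can be illustrated in Python: when several responses are parsed one after another, each response's own item position starts again at zero, so a running ID is needed to keep rows distinct. The "locations" key and the sample bodies are assumptions for the example.

```python
import json
from itertools import count


def rows_with_record_id(responses):
    """Parse several JSON responses and tag every row with a running RecordID."""
    record_id = count(1)
    rows = []
    for body in responses:
        for position, loc in enumerate(json.loads(body)["locations"]):
            rows.append(
                {"RecordID": next(record_id), "position_in_response": position, **loc}
            )
    return rows


# Two hypothetical responses, e.g. from two search iterations.
responses = [
    '{"locations": [{"name": "Store A"}, {"name": "Store B"}]}',
    '{"locations": [{"name": "Store C"}]}',
]
parsed = rows_with_record_id(responses)
for row in parsed:
    print(row)
```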

• Use the Multi-Row Formula tool to parse HTML
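As an illustration of the meta property tip for HTML pages, the sketch below pulls meta property values from a store page. It uses Python's html.parser rather than the Multi-Row Formula tool, and the property names ("place:location:latitude" and so on) and the sample page are assumptions; inspect the actual page to see which properties it exposes.

```python
from html.parser import HTMLParser


class MetaPropertyParser(HTMLParser):
    """Collect <meta property="..." content="..."> pairs into a dict."""

    def __init__(self):
        super().__init__()
        self.properties = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        if attr.get("property") and attr.get("content") is not None:
            self.properties[attr["property"]] = attr["content"]


# Hypothetical store page; the property names are assumed for illustration.
STORE_PAGE = """<html><head>
<meta property="og:title" content="Store #3829">
<meta property="place:location:latitude" content="33.4484">
<meta property="place:location:longitude" content="-112.0740">
</head><body>...</body></html>"""

parser = MetaPropertyParser()
parser.feed(STORE_PAGE)
lat = float(parser.properties["place:location:latitude"])
lon = float(parser.properties["place:location:longitude"])
print(parser.properties["og:title"], lat, lon)
```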

Error Correct

• Deduplicate when the radius collection method is used: use the Unique tool

• Bad geocodes: you are at the mercy of the geocoder that created the data

• Verify counts using Wikipedia or the company's annual report
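Overlapping search radii return the same store more than once. The Python sketch below shows one way the deduplication could work outside the Unique tool; the choice of key fields and the normalization (trimmed, lower-cased text) are assumptions, and a store ID is a better key when the site provides one.

```python
def deduplicate(rows, key_fields=("address", "city", "state")):
    """Keep the first row for each normalized key, preserving input order."""
    seen = set()
    unique = []
    for row in rows:
        key = tuple(str(row[field]).strip().lower() for field in key_fields)
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


# The first two rows are the same store returned by two overlapping searches.
collected = [
    {"address": "1 Main St", "city": "Phoenix", "state": "AZ"},
    {"address": "1 main st ", "city": "PHOENIX", "state": "AZ"},
    {"address": "9 Oak Ave", "city": "San Antonio", "state": "TX"},
]
stores = deduplicate(collected)
print(len(collected), "->", len(stores))
```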

Troubleshoot

• Lat/Lon values in the Google geocode string that are not real

• sll=latitude,longitude is where the search originated, not the actual point

• IP timeouts: may need to throttle to solve

• Parse cues not in all pages, or extra lines, cause skips (e.g. address data includes shopping center name, etc.)

• Multiple pages in search results

• Some sites include closed stores
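Throttling can be as simple as pausing between requests. The sketch below is a Python illustration with an injectable fetch and sleep so it runs without touching the network; the two-second delay is an arbitrary assumption, and the right value depends on the site.

```python
import time


def fetch_all(urls, fetch, delay_seconds=1.0, sleep=time.sleep):
    """Fetch URLs one at a time, pausing between consecutive requests."""
    results = {}
    for index, url in enumerate(urls):
        if index > 0:
            sleep(delay_seconds)  # no pause is needed before the first request
        results[url] = fetch(url)
    return results


# Stand-ins so the example runs instantly: record pauses instead of sleeping.
pauses = []
pages = fetch_all(
    ["http://www.store.com/3829", "http://www.store.com/3830", "http://www.store.com/3831"],
    fetch=lambda url: f"body of {url}",
    delay_seconds=2.0,
    sleep=pauses.append,
)
print(len(pages), "pages fetched with pauses", pauses)
```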

Q & A

Free Stuff!!

Go to http://tinyurl.com/WebScrapingTools to download a zip file containing useful macros and a sample workflow.

THANK YOU!
