
Downloading the internet with Python + scrapy 💻🐍
Erin Shellman
@erinshellman
Puget Sound Programming Python meet-up
January 14, 2015

hi!

I’m a data scientist in the Nordstrom Data Lab. I’ve built scrapers to monitor the product catalogs of various sports retailers.

Getting data can be hard

Despite the open-data movement and popularity of APIs, volumes of data are locked up in DOMs all over the internet.

Monitoring competitor prices

• As a retailer, I want to strategically set prices in relation to my competitors.

• But they aren’t interested in sharing their prices and mark-down strategies with me. 😭

• “Scrapy is an application framework for crawling web sites and extracting structured data which can be used for a wide range of useful applications, like data mining, information processing or historical archival.”

• scrapin’ on rails!

scrapy startproject prices

scrapy project

prices/
    scrapy.cfg
    prices/
        __init__.py
        items.py
        pipelines.py
        settings.py
        spiders/
            __init__.py
            ...

scrapy project

prices/
    scrapy.cfg
    prices/
        __init__.py
        items.py
        pipelines.py
        settings.py
        spiders/
            __init__.py
            backcountry.py
            ...

from scrapy.item import Item, Field

class Product(Item):
    product_title = Field()
    description = Field()
    price = Field()

Define what to scrape in items.py
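Items behave like dictionaries, so a quick sanity check in a Python shell looks like this (the values here are made up for illustration):

item = Product(product_title="Cush Slipper - Men's", price=23.95)
item['product_title']  # dict-style access
# "Cush Slipper - Men's"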

protip: get to know the DOM.


Sometimes there are hidden gems.

SKU-level inventory availability? Score!
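Hidden gems like inventory counts often live in an inline script blob rather than in visible markup. A minimal sketch of digging them out, assuming a hypothetical var skuInventory = {...}; assignment somewhere in the page (the variable name and helper are mine, not Backcountry's):

import json
import re

def extract_inventory(response):
    # Hypothetical: the page embeds something like
    #   var skuInventory = {"BAF0028-BKA-XL": 40, ...};
    scripts = response.xpath(
        "//script[contains(., 'skuInventory')]/text()").extract()
    if not scripts:
        return {}
    match = re.search(r'var skuInventory\s*=\s*(\{.*?\});',
                      scripts[0], re.DOTALL)
    return json.loads(match.group(1)) if match else {}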

Spider design

Spiders have two primary components:

1. Crawling (navigation) instructions

2. Parsing instructions

Define the crawl behavior in spiders/backcountry.py

After spending some time on backcountry.com, I decided the all brands landing page was the best starting URL.
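Scrapy's interactive shell is a handy way to poke at a candidate page before committing to spider code; the XPath here is the brand-link pattern used in the spider below:

scrapy shell 'http://www.backcountry.com/Store/catalog/shopAllBrands.jsp'
>>> response.xpath("//a[@class='qa-brand-link']/@href").extract()[:5]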

import scrapy
from scrapy.spiders import CrawlSpider  # scrapy.contrib.spiders in older releases

class BackcountrySpider(CrawlSpider):
    name = 'backcountry'

    def __init__(self, *args, **kwargs):
        super(BackcountrySpider, self).__init__(*args, **kwargs)
        self.base_url = 'http://www.backcountry.com'
        self.start_urls = ['http://www.backcountry.com/Store/catalog/shopAllBrands.jsp']

    def parse_start_url(self, response):
        brands = response.xpath("//a[@class='qa-brand-link']/@href").extract()

        for brand in brands:
            brand_url = str(self.base_url + brand)
            self.log("Queued up: %s" % brand_url)

            yield scrapy.Request(url=brand_url,
                                 callback=self.parse_brand_landing_pages)

Part I: Crawl Setup

e.g. brand_url = http://www.backcountry.com/burton


def parse_brand_landing_pages(self, response):
    shop_all_pattern = "//a[@class='subcategory-link brand-plp-link qa-brand-plp-link']/@href"
    shop_all_link = response.xpath(shop_all_pattern).extract()

    if shop_all_link:
        all_product_url = str(self.base_url + shop_all_link[0])
        yield scrapy.Request(url=all_product_url,
                             callback=self.parse_product_pages)
    else:
        yield scrapy.Request(url=response.url,
                             callback=self.parse_product_pages)

def parse_product_pages(self, response):
    product_page_pattern = "//a[contains(@class, 'qa-product-link')]/@href"
    pagination_pattern = "//li[@class='page-link page-number']/a/@href"

    product_pages = response.xpath(product_page_pattern).extract()
    more_pages = response.xpath(pagination_pattern).extract()

    # Paginate!
    for page in more_pages:
        next_page = str(self.base_url + page)
        yield scrapy.Request(url=next_page,
                             callback=self.parse_product_pages)

    for product in product_pages:
        product_url = str(self.base_url + product)
        yield scrapy.Request(url=product_url,
                             callback=self.parse_item)


def parse_item(self, response):
    item = Product()
    dirty_data = {}

    dirty_data['product_title'] = response.xpath("//*[@id='product-buy-box']/div/div[1]/h1/text()").extract()
    dirty_data['description'] = response.xpath("//div[@class='product-description']/text()").extract()
    dirty_data['price'] = response.xpath("//span[@itemprop='price']/text()").extract()

    for variable in dirty_data.keys():
        if dirty_data[variable]:
            if variable == 'price':
                item[variable] = float(''.join(dirty_data[variable]).strip().replace('$', '').replace(',', ''))
            else:
                item[variable] = ''.join(dirty_data[variable]).strip()

    yield item

Part II: Parsing

Part II: Clean it now! The cleaning happens right in the parser: extracted strings are joined and stripped, and prices lose their '$' and commas before being cast to floats, so items come out of the spider analysis-ready.
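If you'd rather keep parsing and cleaning separate, the same logic fits naturally in pipelines.py, which the project skeleton already includes. A minimal sketch (the class name is mine, and you'd need to register it under ITEM_PIPELINES in settings.py):

class PriceCleanerPipeline(object):
    """Turn a raw price string like '$1,299.00' into the float 1299.0."""

    def process_item(self, item, spider):
        price = item.get('price')
        if isinstance(price, str):  # basestring on Python 2
            item['price'] = float(price.strip().replace('$', '').replace(',', ''))
        return item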

scrapy crawl backcountry -o bc.json
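A quick way to sanity-check the output, assuming the crawl wrote a single JSON array to bc.json (which is what the -o json exporter does):

import json

with open('bc.json') as f:
    products = json.load(f)

print(len(products))  # 38881 in the run below
cheapest = sorted(products, key=lambda p: p.get('price', float('inf')))[:3]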

2015-01-02 12:32:52-0800 [backcountry] INFO: Closing spider (finished)
2015-01-02 12:32:52-0800 [backcountry] INFO: Stored json feed (38881 items) in: bc.json
2015-01-02 12:32:52-0800 [backcountry] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 33068379,
 'downloader/request_count': 41848,
 'downloader/request_method_count/GET': 41848,
 'downloader/response_bytes': 1715232531,
 'downloader/response_count': 41848,
 'downloader/response_status_count/200': 41835,
 'downloader/response_status_count/301': 9,
 'downloader/response_status_count/404': 4,
 'dupefilter/filtered': 12481,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2015, 1, 2, 20, 32, 52, 524929),
 'item_scraped_count': 38881,
 'log_count/DEBUG': 81784,
 'log_count/ERROR': 23,
 'log_count/INFO': 26,
 'request_depth_max': 7,
 'response_received_count': 41839,
 'scheduler/dequeued': 41848,
 'scheduler/dequeued/memory': 41848,
 'scheduler/enqueued': 41848,
 'scheduler/enqueued/memory': 41848,
 'spider_exceptions/IndexError': 23,
 'start_time': datetime.datetime(2015, 1, 2, 20, 14, 16, 892071)}
2015-01-02 12:32:52-0800 [backcountry] INFO: Spider closed (finished)

{ "review_count": 18, "product_id": "BAF0028", "brand": "Baffin", "product_url": "http://www.backcountry.com/baffin-cush-slipper-mens", "source": "backcountry", "inventory": { "BAF0028-ESP-S3XL": 27, "BAF0028-BKA-XL": 40, "BAF0028-NVA-XL": 5, "BAF0028-NVA-L": 7, "BAF0028-BKA-L": 17, "BAF0028-ESP-XXL": 12, "BAF0028-NVA-XXL": 6, "BAF0028-BKA-XXL": 44, "BAF0028-NVA-S3XL": 10, "BAF0028-ESP-L": 50, "BAF0028-ESP-XL": 52, "BAF0028-BKA-S3XL": 19 }, "price_high": 24.95, "price": 23.95, "description_short": "Cush Slipper - Men's", "price_low": 23.95, "review_score": 4 }

prices/
    scrapy.cfg
    prices/
        __init__.py
        items.py
        pipelines.py
        settings.py
        spiders/
            __init__.py
            backcountry.py
            evo.py
            rei.py
            ...
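With several retailer spiders in the project, you can launch each with its own scrapy crawl command, or (with Scrapy 1.0+) drive them all from one script. A sketch, assuming evo.py and rei.py define spiders named 'evo' and 'rei':

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

process = CrawlerProcess(get_project_settings())
for spider_name in ['backcountry', 'evo', 'rei']:
    process.crawl(spider_name)  # looks spiders up by name in the project
process.start()  # blocks until every crawl finishes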

“Data wrangling is a huge — and surprisingly so — part of the job. It’s something that is not appreciated by data civilians. At times, it feels like everything we do.”

–Monica Rogati, VP of Data at Jawbone

Resources

• Code here!: https://github.com/erinshellman/backcountry-scraper
• Lynn Root’s excellent end-to-end tutorial: http://newcoder.io/Intro-Scrape/
• Web scraping - It’s your civic duty: http://pbpython.com/web-scraping-mn-budget.html

Ladies!! Bring your projects to hacknight!
http://www.meetup.com/Seattle-PyLadies

Thursday, January 29th, 6 PM


Intro to IPython and Matplotlib

Ada Developers Academy 1301 5th Avenue #1350