dive into scrapy - files.meetup.comfiles.meetup.com/13310742/pythonmadrid.pdf · developer tools...

Dive into Scrapy @juanriaza Juan Riaza


Post on 11-May-2018



Dive into Scrapy

@juanriaza Juan Riaza


CHAPTER 1

- THE FANTABULOUS WORLD OF DATA -


Sources of Data

RSS

EMAIL

INTERNET

DOCUMENTS


APIs


Tradeoffs

Most of the world hasn't embraced API-centric development

Most of the world's interesting data isn't API accessible


APIs Tradeoffs

Throttling

Limited Data

Availability

They know you


The web is thoroughly broken

tl;dr


Web Scraping

“is a computer software technique of extracting information from websites”


CHAPTER 2

- BASIC TOOLSET FOR THE CURIOUS -


HTTP

Methods: GET, POST, PUT, HEAD…

Status Codes: 2XX, 3XX, 4XX, 418, 5XX, 999

Headers, Query String: Accept-language, UA*…

Persistence: Cookies
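
As a rough illustration of how the pieces on this slide fit together, the sketch below builds (but does not send) a GET request with a query string, custom headers and a cookie, using only the Python 3 standard library. The URL, parameter names and header values are placeholders, not taken from the talk.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Hypothetical endpoint and parameters, for illustration only.
base_url = 'http://www.example.com/search'
query = urlencode({'q': 'scrapy', 'page': 2})

request = Request(
    base_url + '?' + query,
    headers={
        'Accept-Language': 'en',          # preferred response language
        'User-Agent': 'my-crawler/0.1',   # how the client identifies itself
        'Cookie': 'sessionid=abc123',     # persistence across requests
    },
    method='GET',
)

print(request.get_method())    # the HTTP method
print(request.full_url)        # URL including the query string
print(request.header_items())  # headers as urllib stores them
```

Passing the request to urllib.request.urlopen() would send it and expose the status code on the response; that step is left out here so the sketch runs offline.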


Developer Tools

Emulate mobile devices

Network Inspector

Resources

Search XPATH

Elements, Cookies

Filter by XHR

Mobile sites

Extensions: Hola, JS Switch…


HTTP Libraries

urllib2 (stdlib)

requests-oauthlib

python-requests

requestb.in


HTML is not a regular language
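
A small sketch of why this matters in practice: a regular expression cannot keep track of nesting, while a parser can. The markup and the DivText class are made up for this example; only the standard library is used.

```python
import re
from html.parser import HTMLParser

html = '<div>outer <div>inner</div> tail</div>'

# A non-greedy regex stops at the first closing tag, so the outer <div>
# is cut short and its trailing text is lost.
regex_match = re.search(r'<div>(.*?)</div>', html).group(1)


class DivText(HTMLParser):
    """Collect text inside <div> tags while tracking nesting depth."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.max_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == 'div':
            self.depth += 1
            self.max_depth = max(self.max_depth, self.depth)

    def handle_endtag(self, tag):
        if tag == 'div':
            self.depth -= 1

    def handle_data(self, data):
        if self.depth > 0:
            self.chunks.append(data)


parser = DivText()
parser.feed(html)
parser.close()
parser_text = ''.join(parser.chunks)

print(repr(regex_match))
print(repr(parser_text))
```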


HTML Parsers

lxml: Pythonic binding for the C libraries libxml2 and libxslt

beautifulsoup: html.parser, lxml, html5lib


Those who don't understand xpath are cursed to reinvent it, poorly.


# -*- coding: utf-8 -*-
import requests
import lxml.html

req = requests.get('https://fosdem.org/2015/schedule/events/')
tree = lxml.html.fromstring(req.text)
for tr in tree.xpath('//tr'):
    content = tr.xpath('./td[1]/a/text()')
    name = tr.xpath('./td[2]/a/text()')
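
To try the same row-by-row idea without a network call, here is a sketch against an inline snippet. It swaps lxml for the standard library's xml.etree.ElementTree, which only understands a subset of XPath and needs well-formed markup, so it is a stand-in rather than a replacement. The table contents and variable names are invented.

```python
import xml.etree.ElementTree as ET

# Made-up, well-formed markup standing in for a schedule table.
sample = """
<table>
  <tr><td><a href="/e/1">Keynote</a></td><td><a href="/s/1">Ada</a></td></tr>
  <tr><td><a href="/e/2">Scraping 101</a></td><td><a href="/s/2">Linus</a></td></tr>
</table>
"""

tree = ET.fromstring(sample)
rows = []
for tr in tree.findall('.//tr'):
    # ElementTree has no text() step, so read .text from the matched <a>.
    event = tr.find('./td[1]/a').text
    speaker = tr.find('./td[2]/a').text
    rows.append((event, speaker))

print(rows)
```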


CHAPTER 3

- TOOLSET FOR THE ADVENTUROUS -


Scrapy-ify early on

Maybe you'll need multiple HTTP requests.

Maybe you'll just want testable code.


“An open source and collaborative framework for extracting the data you need from websites. In a fast, simple, yet extensible way.”


Healthy Community

6.3k ★, 1.6k forks, 500 watchers

@scrapyproject: 1.6k followers

2.7k questions

2k members on mailing list


Start a project

$ scrapy startproject <name>

fosdem
├── fosdem
│   ├── __init__.py
│   ├── items.py
│   ├── pipelines.py
│   ├── settings.py
│   └── spiders
│       └── __init__.py
└── scrapy.cfg


import scrapy

class MySpider(scrapy.Spider):
    name = 'example.com'
    allowed_domains = ['example.com']
    start_urls = [
        'http://www.example.com/',
    ]

    def parse(self, response):
        self.log('A response from %s just arrived!' % response.url)


Spiders

Generate the initial Requests: start_urls, start_requests()

In the callback function, you parse the response and return either Item objects, Request objects, or an iterable of both


import scrapy

class MySpider(scrapy.Spider):
    name = 'example.com'
    allowed_domains = ['example.com']
    start_urls = [
        'http://www.example.com/'
    ]

    def parse(self, response):
        # Yield one item per <h3> heading on the page.
        for h3 in response.xpath('//h3/text()').extract():
            yield {'title': h3}

        # Follow every link; urljoin() resolves relative hrefs against the page URL.
        for url in response.xpath('//a/@href').extract():
            yield scrapy.Request(response.urljoin(url), callback=self.parse)


Interactive Shell

Invaluable tool for developing and debugging your spiders

$ scrapy shell <url>


Interactive Shell

IPython

Invoking the shell from spiders to inspect responses (scrapy.shell.inspect_response)

Available Scrapy Objects: spider, request, sel…

Available Shortcuts: shelp(), fetch(), view()


Avoid getting banned

Rotate your user agent

Disable cookies

Download delays

Use a pool of rotating IPs

Crawlera
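
Some of these points map onto Scrapy settings and a small downloader middleware. The sketch below is one possible arrangement, not the speaker's setup: the setting names are Scrapy's own, while the values, the user agent strings, the middleware class and the module path in DOWNLOADER_MIDDLEWARES are assumptions made for illustration.

```python
import random

# settings.py fragment: real Scrapy setting names, illustrative values.
COOKIES_ENABLED = False          # disable cookies
DOWNLOAD_DELAY = 2               # seconds to wait between requests to a site
RANDOMIZE_DOWNLOAD_DELAY = True  # vary the delay around DOWNLOAD_DELAY

# Hypothetical pool of user agent strings to rotate through.
USER_AGENTS = [
    'Mozilla/5.0 (X11; Linux x86_64) ExampleBrowser/1.0',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X) ExampleBrowser/2.0',
]

# Hypothetical module path; adjust to wherever the class below lives.
DOWNLOADER_MIDDLEWARES = {
    'myproject.middlewares.RotateUserAgentMiddleware': 400,
}


class RotateUserAgentMiddleware(object):
    """Downloader middleware sketch: pick a user agent per request."""

    def process_request(self, request, spider):
        request.headers['User-Agent'] = random.choice(USER_AGENTS)
        # Returning None lets Scrapy continue handling the request.
        return None
```

Rotating IPs is not covered here, since it depends on an external proxy pool or a service such as Crawlera.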


Everything else

Items, ItemLoaders, Middlewares, Pipelines, Stats

Feed Exports: JSON, CSV, XML, DjangoItem, S3…

Testing: Contracts
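
To give a feel for one of these pieces, here is a sketch of an item pipeline. A pipeline is a plain class with a process_item() method that receives each scraped item; the class name, the 'title' field and the module path in ITEM_PIPELINES are invented for this example.

```python
class CleanTitlePipeline(object):
    """Item pipeline sketch: normalise whitespace in a 'title' field."""

    def process_item(self, item, spider):
        title = item.get('title')
        if title is not None:
            # Collapse runs of whitespace and trim both ends.
            item['title'] = ' '.join(title.split())
        return item


# settings.py fragment; the module path is hypothetical.
ITEM_PIPELINES = {
    'myproject.pipelines.CleanTitlePipeline': 300,
}
```

Items that pass through the pipelines can then be written out by a feed export, for example with scrapy crawl <spider> -o items.json.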


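DjangoItem

from django.db import models

class Person(models.Model):
    name = models.CharField(max_length=255)
    age = models.IntegerField()

# Scrapy 1.0 and later ship DjangoItem separately, in the scrapy-djangoitem
# package (from scrapy_djangoitem import DjangoItem).
from scrapy.contrib.djangoitem import DjangoItem

class PersonItem(DjangoItem):
    django_model = Person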


scrapinghub/pycon-speakers


CHAPTER 4

- DEPLOYMENT -


Scrapyd

Provides a JSON web service to upload new project versions (as eggs) and schedule spiders

$ scrapy deploy
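
Once a project is deployed, scheduling a spider is a POST to Scrapyd's schedule.json endpoint. The sketch below only builds that request with the standard library; the host and port assume a local Scrapyd on its default port 6800, and the project and spider names are made up.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Assumed values: local Scrapyd on its default port, made-up names.
SCRAPYD_URL = 'http://localhost:6800'
payload = urlencode({'project': 'fosdem', 'spider': 'events'}).encode('ascii')

# Supplying data makes urllib issue a POST request.
schedule_request = Request(SCRAPYD_URL + '/schedule.json', data=payload)

# Sending is left commented out, since it needs a running Scrapyd instance:
# import json
# from urllib.request import urlopen
# with urlopen(schedule_request) as response:
#     print(json.load(response))
```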


Scrapy Cloud

Scrapy Cloud, our platform as a service offering, allows you to easily build crawlers, deploy them instantly and scale them on demand. Watch your Scrapy spiders as they run and collect data, and review their data through our beautiful frontend.


CHAPTER 5

- ABOUT US -


TONS of Open Source


Mandatory Sales Slide

Professional Services

Products: Scrapy Cloud, Crawlera


We’re hiring!