Scrapinghub, PyCon Philippines 2015
Scraping the web using Scrapy

Jolo Balbin, Mikko Gozalo

Post on 17-Aug-2015

“Getting information off the Internet is like taking a drink from a fire hydrant.”

– Mitchell Kapor, a variation of former MIT President Jerome Wiesner’s quote

Tons of information on the internet

๏ News / Rappler, ABS-CBN News, GMA News Online

๏ Social media / Facebook, Twitter

๏ Transportation / MMDA, Waze, DOTC website

๏ Weather / Project NOAH, PAGASA

๏ E-commerce / Lazada, Zalora, eBay, OLX

๏ Government data / PhilGEPS

Tons of information on the internet

๏ News / What’s trending? What’s happening?

๏ Social media / What is people’s sentiment on subject X?

๏ Transportation / What’s the traffic like later?

๏ Weather / What’s the effect of weather on traffic?

๏ E-commerce / Who’s selling item X at the lowest price?

๏ Government data / Where are our taxes going?

The Problem

Not all data is structured!

How do we turn unstructured data into structured data?

Web scraping

๏ A computer software technique for extracting information from websites.

๏ focuses more on the transformation of unstructured data on the web, typically in HTML format, into structured data that can be stored and analyzed in a central local database or spreadsheet.

https://en.wikipedia.org/wiki/Web_scraping

Conventional Way

๏ Fetch the webpage using urllib, httplib or requests.

๏ Use beautifulsoup4, lxml or regular expressions to extract information.

๏ Analyze/store the information!
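The three steps above can be sketched roughly as follows. This is only an illustration, not the code shown in the talk: it uses urllib for the fetch and a regular expression for the extraction, and the `h2.title` markup, the `fetch` and `extract_headlines` helpers and `SAMPLE_HTML` are all assumptions made up for the example. The sample page is inlined so the sketch runs without network access.

```python
import re
import urllib.request


def fetch(url, timeout=10):
    """Download a page and return its body as text. Blocks until the request finishes."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        charset = resp.headers.get_content_charset() or "utf-8"
        return resp.read().decode(charset, errors="replace")


# Assumed markup: each headline is an <a> inside <h2 class="title">.
TITLE_RE = re.compile(r'<h2 class="title">\s*<a href="([^"]+)">([^<]+)</a>', re.I)


def extract_headlines(html):
    """Turn unstructured HTML into a list of structured records."""
    return [{"url": url, "title": title.strip()} for url, title in TITLE_RE.findall(html)]


# Stand-in for a downloaded page; with a real site you would call fetch(url) instead.
SAMPLE_HTML = """
<html><body>
<h2 class="title"><a href="/news/1">Typhoon update</a></h2>
<h2 class="title"><a href="/news/2">Traffic advisory</a></h2>
</body></html>
"""

if __name__ == "__main__":
    for item in extract_headlines(SAMPLE_HTML):
        print(item)
```

A regular expression keeps the sketch dependency-free, but it is brittle against real-world HTML; beautifulsoup4 or lxml would be the more robust choice for the extraction step.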

Conventional Way

๏ Blocking! We have to wait for each request to finish before we move on to the next.

๏ If it encounters an error somewhere, we’re doomed. Everything will just halt!

Conventional Way

๏ We can use threading, gevent or other libraries to make it asynchronous.

๏ We can wrap parts of the code in try-except blocks to catch possible Exceptions.
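A minimal sketch of both workarounds, assuming a thread pool from the standard library's concurrent.futures (one of several possible choices next to raw threading or gevent). The `fetch_all` helper and the `fake_fetch` stand-in are hypothetical names invented for this example; `fake_fetch` simulates a download so the sketch runs offline.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def fetch_all(urls, fetch, max_workers=4):
    """Fetch several URLs concurrently; one failing URL does not stop the rest."""
    results, errors = {}, {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(fetch, url): url for url in urls}
        for future in as_completed(futures):
            url = futures[future]
            try:
                results[url] = future.result()
            except Exception as exc:
                # Record the failure and keep going instead of halting everything.
                errors[url] = exc
    return results, errors


def fake_fetch(url):
    """Hypothetical stand-in for a real download function."""
    if "broken" in url:
        raise IOError("simulated failure for " + url)
    return "<html>%s</html>" % url


if __name__ == "__main__":
    pages, failed = fetch_all(
        ["http://example.com/a", "http://example.com/broken", "http://example.com/b"],
        fake_fetch,
    )
    print(sorted(pages), list(failed))
```

This works, but the retry, delay and bookkeeping logic still has to be written by hand for every project.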

Here comes…

Why Scrapy?

๏ Processes requests and responses asynchronously.

๏ Customizable! You can override almost everything.

๏ Handles cookies, delays, timeouts, etc., so you won’t have to. No need to reinvent the wheel!

๏ Includes Selectors, a parsing library that can parse HTML and XML using XPath or CSS; or you can just use Beautiful Soup!

History of Scrapy

๏ An open source framework to scrape websites

๏ Scrapy was started by Pablo Hoffman and Shane Evans (2007)

๏ Originally a tool used by Shane’s company

๏ They saw the potential, and open sourced it.

Getting Started with Scrapy

๏ As easy as pip install Scrapy.

๏ Start a project with scrapy startproject project_name.

๏ Create your first spider!

Short Demo

Using Scrapy at Work

At Scrapinghub

๏ Company that provides scraping-related services to clients around the globe.

๏ Distributed team of 105 people around the world.

๏ Active in contributing to open-source!

๏ Project owner of Scrapy!

Academe/Research

๏ A U.S. Department of Energy national laboratory operated by a university in California.

๏ Analyzes the relation between product price, energy efficiency and other product features of typical home appliances.

๏ Partnered with Scrapinghub for academic research!

Market Analytics

๏ A UK company that provides price, promotion and online product positioning analytics.

๏ Helps consumers find the best prices!

๏ Helps online retailers compare their prices with other retailers.

๏ Helps brands check if retailers are providing accurate product information.

๏ Partnered with Scrapinghub for their scraping needs!

Government Research

๏ Scrapinghub is participating in DARPA’s Memex.

๏ Crawls the deep web.

๏ Aids in systematically tracking down criminal activity.

Using Scrapy for Side Projects

MRT Passenger Traffic

๏ Crawls the MRT3 website using Scrapy.

๏ Downloads the CCTV images for each station.

๏ Approximates the relative passenger traffic at a given moment using computer vision!

MRT Passenger Traffic

*Line status as of July 1 (Wednesday), 6:40pm

MRT Passenger Traffic

*CUBAO STATION status as of July 1

MRT Passenger Traffic

*AYALA STATION NB status as of July 1

MiniBalita.com

๏ A news reader for Philippine news.

๏ Crawls Philippine news websites such as Rappler, ABS-CBN News, Inquirer, Spin.ph, etc.

๏ Integrated with TextTeaser to produce “mini” balita.

2013 General Elections

๏ Crawled the 2013 General Elections results to find trends.

๏ 70 clustered precincts registered 100% turnout, most of them in ARMM.

๏ One clustered precinct voted for only one senator. No one voted for anyone else, even though a voter may choose up to 12 candidates!

Is scraping legal?

๏ The legality of scraping is a gray area.

๏ Scraping public data is somewhat legal.

๏ Illegality may arise from how the data is used.

๏ Some websites explicitly prohibit scraping.

๏ Always obey robots.txt.
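One way to check robots.txt before fetching a URL is the standard library's urllib.robotparser, sketched below. The robots.txt content, the "mybot" user agent and the `allowed` helper are made up for illustration; the rules are parsed from an inline string so the sketch runs offline.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real crawler would download it from the site.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def allowed(robots_txt, user_agent, url):
    """Return True if the given robots.txt rules let user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    print(allowed(ROBOTS_TXT, "mybot", "http://example.com/news"))
    print(allowed(ROBOTS_TXT, "mybot", "http://example.com/private/data"))
```

Within Scrapy itself, the ROBOTSTXT_OBEY setting in a project's settings.py enables this kind of check for every request.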

End. Any questions?

Jolo Balbin
Twitter: @mojojolo
http://www.summarizerman.com

Mikko Gozalo
Twitter: @mikkogozalo
http://www.mikkogozalo.com