Webscraping at Statistics Netherlands
TRANSCRIPT
Content
– Internet as a datasource (IAD): motivation
– Some IAD projects over past years
– Technologies used
– Summary / trends
– Observations / thoughts
– Legal
– The Dutch Business Register
The why
Administrative sources
– Tax, social security services
– Municipalities/ Provinces
– Supermarkets
– …
– …
– Surveys
Internet sources
Fuel prices (2009)
‐ Daily fuel prices from website of unmanned petrol
stations (tinq.nl)
‐ Regional prices (per station) every day
Now (2016):
‐ A direct weekly data feed from the Travelcard company
‐ Fuel prices per day and all transactions of that week
‐ Publication on website: prices per month
Airline tickets (2010)
– Pilot: 3 robots on 6 airline companies
– 2 robots by external companies, 1 by Statistics Netherlands (SN)
– Robot prices agree with manual collection
– Quite expensive; negative business case
– 2016: still manual price collection of airline tickets
[Chart: Ticket price Amsterdam - Milano, 11 Feb to 10 Aug, y-axis 0 to 250; series: Robot, Manual]
Housing market
– Housing market (from 2011):
‐ Discussions with external company for > 1 year (iWoz)
‐ We scraped 5 sites, about 250,000 observations per week, for 2 years
From 2013:
‐ Direct feed from one of the sites (Jaap.nl)
‐ Statline tables: Bestaande woningen in verkoop (existing homes for sale)
‐ “based on 80-90 percent of the market”
Bulk price collection for CPI (1)
– Bulk price collection for CPI (from 2012):
‐ Mainly clothing
‐ Software scrapes all prices and product data (id, name,
description, category, colour, size,…)
2016:
‐ About 500,000 price observations daily from 10 sites
‐ Data from 3 sites used in production of Dutch CPI
‐ Price collection process embedded in organisation
‐ Plans to extend to > 20 sites; other domains
Bulk price collection for CPI (2)
[Diagram: Processing bulk data from the Internet. Labels shown: Data collection & Feature extraction; Structured data; Big Data Index methods; Index based on internet data. Features: Fine-knit Jumper Dark blue Striped Cotton edges]
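The step from a scraped product record to structured features could look roughly like the Python sketch below. It is an illustration only, not the method used in production: the `PriceObservation` fields follow the product data named on the previous slide, while the `site`, `price` and `observed_on` fields, the `FEATURE_VOCABULARY` word lists and the `extract_features` helper are hypothetical.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class PriceObservation:
    """One scraped price with its product data (field set is an assumption)."""
    site: str
    product_id: str
    name: str
    description: str
    category: str
    price: float
    observed_on: date


# Hypothetical vocabulary: each feature maps to terms, most specific first.
FEATURE_VOCABULARY = {
    "garment": ["jumper", "shirt"],
    "colour": ["dark blue", "blue", "red"],
    "pattern": ["striped", "plain"],
    "material": ["cotton", "wool"],
}


def extract_features(obs: PriceObservation) -> dict:
    """Find the first vocabulary term per feature in name + description.

    Plain substring matching is used for brevity; a real system would need
    tokenisation to avoid accidental matches inside longer words.
    """
    text = f"{obs.name} {obs.description}".lower()
    features = {}
    for feature, terms in FEATURE_VOCABULARY.items():
        for term in terms:
            if term in text:
                features[feature] = term
                break
    return features


example = PriceObservation(
    site="shop.example",
    product_id="A-1",
    name="Fine-knit Jumper",
    description="Dark blue, striped, cotton edges",
    category="clothing",
    price=39.95,
    observed_on=date(2016, 1, 15),
)
print(extract_features(example))
```

Listing the more specific term first ("dark blue" before "blue") keeps the longest match, which matters once such features are used to group comparable products.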
Robot-assisted price collection
– Robot tool for detecting price changes on (parts of) websites
– Traffic light indicates status:
‐ Green: nothing changed, price is saved in database
‐ Red: something changed, needs the attention of a statistician
‐ Two clicks to keep the old price or store a new one
‐ In production from 2014
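The traffic-light check can be sketched as a fingerprint comparison on the watched part of a page. This is a minimal Python sketch, not the production tool: the function names, the SHA-256 fingerprint and the whitespace normalisation are assumptions made for illustration.

```python
import hashlib
from typing import Optional


def fingerprint(fragment: str) -> str:
    """Hash a page fragment after collapsing runs of whitespace."""
    normalised = " ".join(fragment.split())
    return hashlib.sha256(normalised.encode("utf-8")).hexdigest()


def traffic_light(stored_fingerprint: Optional[str], current_fragment: str) -> str:
    """Return 'green' if the fragment is unchanged since the last visit.

    Anything else, including a fragment never seen before, is 'red' and
    is left for a statistician to review.
    """
    if stored_fingerprint is not None and stored_fingerprint == fingerprint(current_fragment):
        return "green"
    return "red"


# Usage: store the fingerprint on one visit, compare on the next.
previous = fingerprint("<span class='price'>  19.95 </span>")
status_same = traffic_light(previous, "<span class='price'> 19.95  </span>")
status_changed = traffic_light(previous, "<span class='price'> 21.95 </span>")
```

Comparing only a part of the page, rather than the whole document, keeps unrelated changes such as banners from turning the light red.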
Collect data on enterprises for EGR (2013)
– Pilot: find data about EGR enterprises on the web
‐ We scraped semi-structured data from Wikipedia
‐ Multiple Wikipedia languages (NL, EN, DE, FR)
‐ 2016: something similar in ESSnet BD WP2?
Search product descriptions for classifying business activities
– Search product descriptions on web (from 2014)
‐ First time we used automated search with Google
search API for statistics
‐ Pilot, no production
‐ Some doubts about the Google results
Twitter-LinkedIn (1)
– LinkedIn-Twitter for profiling (2015)
‐ Automated search on LinkedIn based on a sample of
Twitter users
‐ Very specific and experimental
‐ “Profiling of Twitter data, a big data selectivity study”,
Piet Daas, Joep Burger, Quan Lé, Olav ten Bosch
Scraping websites of enterprises
– Identify family businesses (search and / or crawling)
(2016)
– Identify businesses with a Corporate Social Responsibility
(CSR) (search and / or crawling) (2016)
– Research program:
‐ “Extracting information from websites to improve economic figures”
– This ESSnet BD WP2 !!!
Crawling for Statistics
[Diagram: Internet, Focused Crawler (Roboto), Datastore, Search & Match (ElasticSearch). A URL base and incomplete statistical data go in; more complete statistical data comes out. Inputs to the process: search terms, navigation terms, item identifier terms (e.g. “year report, family business”)]
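The role of navigation terms and item identifier terms in a focused crawl can be illustrated with a small Python sketch. It is not Roboto's API: the `focused_crawl` function, the `get_page` callback and the in-memory `site` are hypothetical, and the matching is plain substring matching chosen for brevity.

```python
from collections import deque


def focused_crawl(start_url, get_page, navigation_terms, item_terms, max_pages=20):
    """Breadth-first crawl that only follows promising links.

    get_page(url) returns (text, [(anchor_text, href), ...]) or None.
    A link is followed when its anchor text contains a navigation term;
    a page is reported when its text contains an item identifier term.
    """
    queue = deque([start_url])
    seen = {start_url}
    hits = []
    visited = 0
    while queue and visited < max_pages:
        url = queue.popleft()
        page = get_page(url)
        if page is None:
            continue
        visited += 1
        text, links = page
        lowered = text.lower()
        matched = [term for term in item_terms if term.lower() in lowered]
        if matched:
            hits.append((url, matched))
        for anchor, href in links:
            if href in seen:
                continue
            if any(term.lower() in anchor.lower() for term in navigation_terms):
                seen.add(href)
                queue.append(href)
    return hits


# Usage with a tiny in-memory site instead of real HTTP requests.
site = {
    "/": ("Welcome", [("About us", "/about"), ("Webshop", "/shop")]),
    "/about": ("We are a family business since 1950.", [("Year report", "/report")]),
    "/report": ("Year report 2015", []),
    "/shop": ("Buy things from our family business", []),
}
hits = focused_crawl("/", site.get, ["about", "report"],
                     ["family business", "year report"])
```

In this example the "/shop" page is never fetched, because its link text matches no navigation term; that is the trade-off of a focused crawl: fewer requests at the risk of missing some relevant pages.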
Technologies used
– Perl (2009), Djuggler (2010)
– Python, Scrapy (2010)
– R (2011-2015)
– NodeJS (JavaScript on server) (2014-)
– Google Search API (2014-)
– ElasticSearch (2016)
– Roboto (nodejs package, 2015-2016)
– Nutch: tested, not used
– Generic Framework (robot framework) for bulk scraping
of prices
Summary / trends
Production Scrape Search Crawl External company
Tinq x (x) Travelcard
Airlines x 2 robots
Housing x (x) Jaap.nl
BulkCPI x x
Robottool x x (x)
EGR x x
RGS x
Twitter/LinkedIn x x
Enterprises x x Dataprovider?
Observations / thoughts …
‐ If it is there, we can get it
‐ Technology is (usually) not the problem!
‐ The internet is a living thing!
‐ It’s too simple to think we can just buy the
internet somewhere and then make statistics!
‐ It’s powerful to combine something we know
with something we observe!
‐ External companies can help, but be careful …
Legal
– Dutch Statistics Law:
‐ Enterprises have to provide data to Statistics Netherlands on request
‐ Scraping information from websites reduces response burden
‐ Statistics Netherlands uses data for official statistics only
– Dutch database legislation:
‐ Commercial re-use of intellectual property is forbidden
‐ This may also apply to internet sources
– Privacy:
‐ Dutch (statistical) legislation on protection of personal information
‐ Statistics Netherlands scrapes public sources only and processes the data within its own safe environment, just as it does with other (privacy-sensitive) data
– Netiquette:
‐ Respect robots.txt
‐ Identify yourself (user-agent)
‐ Do not overload servers; leave some idle time between requests
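The three netiquette rules translate fairly directly into code. The Python sketch below uses the standard-library `urllib.robotparser`; the `USER_AGENT` string, the helper names and the 2-second default delay are assumptions for illustration, not the values used at Statistics Netherlands. The actual HTTP call is passed in as `fetch` so the sketch stays independent of any HTTP library.

```python
import time
from urllib.robotparser import RobotFileParser

# Hypothetical identification string: name the robot and give a contact.
USER_AGENT = "example-stats-bot/0.1 (+mailto:webmaster@example.org)"


def allowed_urls(robots_txt, urls, user_agent=USER_AGENT):
    """Keep only the URLs that the site's robots.txt allows for this robot."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [url for url in urls if parser.can_fetch(user_agent, url)]


def polite_fetch_all(urls, fetch, delay_seconds=2.0, sleep=time.sleep):
    """Fetch URLs one by one, idling between consecutive requests.

    fetch(url, user_agent) performs the request and must send the
    user-agent header so the site owner can identify the robot.
    """
    results = {}
    for index, url in enumerate(urls):
        if index > 0:
            sleep(delay_seconds)  # idle time so the server is not overloaded
        results[url] = fetch(url, USER_AGENT)
    return results


# Usage with an inline robots.txt instead of a downloaded one.
robots = "User-agent: *\nDisallow: /private/\n"
candidates = ["https://example.org/prices", "https://example.org/private/a.html"]
to_fetch = allowed_urls(robots, candidates)
```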
Dutch Business Register (simplified)
[Diagram: legal units, relationships, clusters of control, enterprise groups, enterprises, local units]
Sources:
‐ Trade Register
‐ Tax Register
‐ Social security register (employees)
‐ Profilers
From administrative units to statistical units:
‐ About 1.5 million administrative entities
‐ About 0.5 million have a URL
‐ Quality of the URL field is not known, but it seems usable
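A first impression of the quality of the URL field could be obtained by normalising each value and counting how many survive. The Python sketch below shows one possible approach under stated assumptions: the `normalise_domain` and `usable_share` helpers and their rejection rules (no spaces, no e-mail addresses, host must contain a dot) are hypothetical, not the checks applied to the Business Register.

```python
from typing import Optional
from urllib.parse import urlsplit


def normalise_domain(raw: str) -> Optional[str]:
    """Return a lower-case host name without 'www.', or None if unusable."""
    value = raw.strip().lower()
    if not value or " " in value or "@" in value:
        return None  # empty, free text, or an e-mail address
    if "://" not in value:
        value = "http://" + value  # register values often lack a scheme
    try:
        parts = urlsplit(value)
        host = parts.hostname
    except ValueError:
        return None  # malformed value, e.g. a broken IPv6 literal
    if parts.scheme not in ("http", "https") or not host or "." not in host:
        return None
    if host.startswith("www."):
        host = host[4:]
    return host


def usable_share(raw_values) -> float:
    """Fraction of raw field values that normalise to a domain."""
    values = list(raw_values)
    if not values:
        return 0.0
    usable = sum(1 for value in values if normalise_domain(value) is not None)
    return usable / len(values)


# Usage on a few made-up field values.
share = usable_share(["www.example.nl", "n/a", "", "example.com"])
```

A syntactic check like this says nothing about whether the site still exists or belongs to the enterprise; it only filters out values that cannot be a URL at all.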