data quality: the elephant in the (big data) room · data quality: the elephant in the (big data)...

27
Data Quality: The elephant in the (big data) room Chris Park Data Scientist UK Data Service DataFirst Data Quality Workshop Cape Town, South Africa 6-7 July 2017

Upload: others

Post on 25-Apr-2020

2 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Data Quality: The elephant in the (big data) room · Data Quality: The elephant in the (big data) room Chris Park Data Scientist UK Data Service DataFirst Data Quality Workshop Cape

Data Quality: The elephant in the (big data) room

Chris Park Data Scientist UK Data Service

DataFirst Data Quality Workshop Cape Town, South Africa 6-7 July 2017

Page 2: Data Quality: The elephant in the (big data) room · Data Quality: The elephant in the (big data) room Chris Park Data Scientist UK Data Service DataFirst Data Quality Workshop Cape

Janitors?

“Data scientists, according to interviews and expert

estimates, spend from 50 percent to 80 percent of their time mired in this more mundane labour of collecting and

preparing unruly digital data, before it can be explored for useful nuggets.”

New York Times

Page 3: Data Quality: The elephant in the (big data) room · Data Quality: The elephant in the (big data) room Chris Park Data Scientist UK Data Service DataFirst Data Quality Workshop Cape

“Data Science”

Page 4: Data Quality: The elephant in the (big data) room · Data Quality: The elephant in the (big data) room Chris Park Data Scientist UK Data Service DataFirst Data Quality Workshop Cape

2016 CrowdFlower Survey

Page 5: Data Quality: The elephant in the (big data) room · Data Quality: The elephant in the (big data) room Chris Park Data Scientist UK Data Service DataFirst Data Quality Workshop Cape

2016 CrowdFlower Survey

Page 6: Data Quality: The elephant in the (big data) room · Data Quality: The elephant in the (big data) room Chris Park Data Scientist UK Data Service DataFirst Data Quality Workshop Cape

Key Messages

•  Data cleaning and data quality are important in the present era of ‘Big Data’ et al. It might be cheaper to store data now, but it is harder to keep track, standardize, and curate data for secondary research.

•  The way forward is to work across disciplines and sectors, e.g. academia, government, and industry, to provide standardized access to and use of data that has potential to provide public value, e.g. energy data.

Page 7: Data Quality: The elephant in the (big data) room · Data Quality: The elephant in the (big data) room Chris Park Data Scientist UK Data Service DataFirst Data Quality Workshop Cape

UK Data Service

•  Curator of the UK’s largest collection of digital social and economic research data.

•  Serving the data needs of social and economics researchers since 1967.

•  Promotes data sharing and reproducibility, a topic of increasing importance, e.g. data as academic output.

•  Undergone a number of key transformations in response to changing user needs.

Page 8: Data Quality: The elephant in the (big data) room · Data Quality: The elephant in the (big data) room Chris Park Data Scientist UK Data Service DataFirst Data Quality Workshop Cape

Decline of Survey Data, 1980 - 2010

Chetty, R. (2012). The Transformative Potential of Administrative Data for Microeconometric Research. Retrieved from http://conference.nber.org/confer/2012/SI2012/LS/ChettySlides.pdf

AER: American Economic Review JPE: Journal of Political Economy QJE: Quarterly Journal of Economics ECMA: Econometrica

Page 9: Data Quality: The elephant in the (big data) room · Data Quality: The elephant in the (big data) room Chris Park Data Scientist UK Data Service DataFirst Data Quality Workshop Cape

Rise of Administrative Data, 1980 - 2010

Chetty, R. (2012). The Transformative Potential of Administrative Data for Microeconometric Research. Retrieved from http://conference.nber.org/confer/2012/SI2012/LS/ChettySlides.pdf

AER: American Economic Review JPE: Journal of Political Economy QJE: Quarterly Journal of Economics ECMA: Econometrica

Page 10: Data Quality: The elephant in the (big data) room · Data Quality: The elephant in the (big data) room Chris Park Data Scientist UK Data Service DataFirst Data Quality Workshop Cape

Human Activity

Page 11: Data Quality: The elephant in the (big data) room · Data Quality: The elephant in the (big data) room Chris Park Data Scientist UK Data Service DataFirst Data Quality Workshop Cape

Human Activity

Page 12: Data Quality: The elephant in the (big data) room · Data Quality: The elephant in the (big data) room Chris Park Data Scientist UK Data Service DataFirst Data Quality Workshop Cape

Same architecture, different infrastructure

And also: in response to changing user needs, diversifying into new and emerging forms of data with public impact, e.g. energy data.

Page 13: Data Quality: The elephant in the (big data) room · Data Quality: The elephant in the (big data) room Chris Park Data Scientist UK Data Service DataFirst Data Quality Workshop Cape

Smarter Household Energy Data

•  Partnership between UK Data Service, UCL Centre for Energy Epidemiology, and DataFirst.

•  Explore ways to scale up research using household energy data, e.g. benefits and barriers.

•  Energy research is important: •  Energy is the linchpin of modern economic activity, •  Efficient use can help reduce negative impact on the

environment and help consumers save money on their bills, •  Linking with sociodemographic data can help Identify

and support fuel poor households, etc.

Page 14: Data Quality: The elephant in the (big data) room · Data Quality: The elephant in the (big data) room Chris Park Data Scientist UK Data Service DataFirst Data Quality Workshop Cape

Energy Research

•  Key lies in linking energy data with administrative data such as building and sociodemographic data.

•  Topics studied include: •  Forecasting based on machine learning. Helps with

estimating supply. •  Help consumers save money on their bills by shifting energy

consumption to lower-tariff times of the week. •  Disaggregating energy use to break down consumption to

the appliance level.

Page 15: Data Quality: The elephant in the (big data) room · Data Quality: The elephant in the (big data) room Chris Park Data Scientist UK Data Service DataFirst Data Quality Workshop Cape

Barriers to Energy Research

•  Heavily anonymized e.g. limited ability to link with other datasets.

•  Limited and biased sample e.g. recruitment-based studies

•  One-time dataset e.g. sprawl, limited reproducibility

•  Data governance and provenance issues e.g. no standard documentation

Page 16: Data Quality: The elephant in the (big data) room · Data Quality: The elephant in the (big data) room Chris Park Data Scientist UK Data Service DataFirst Data Quality Workshop Cape

Barriers to Energy Research

•  Missing and duplicate observations and lack of standardized markers. e.g. “”, NA, NULL, 99, ‘99’, etc.

•  Timestamp formats: different combinations of date, time, and date + time columns, and handling of time zones.

e.g. Daylight saving: false features - Duplicates when clock turns back 1 hour, - Missing when clock shifts forward 1 hour. •  80-90% of time spent in janitorial work.

Page 17: Data Quality: The elephant in the (big data) room · Data Quality: The elephant in the (big data) room Chris Park Data Scientist UK Data Service DataFirst Data Quality Workshop Cape

Key Messages

•  Data cleaning and data quality is important in the era of ‘Big Data’ et al. It might be cheaper to store data now, but it is harder to keep track, standardize, and curate data.

•  Way forward is through collaborative projects between academia, government, and industry that facilitate access to and use of data with policy implications, e.g. energy data.

Page 18: Data Quality: The elephant in the (big data) room · Data Quality: The elephant in the (big data) room Chris Park Data Scientist UK Data Service DataFirst Data Quality Workshop Cape

From ‘Dumb’ to ‘Smart’: Meters

Page 19: Data Quality: The elephant in the (big data) room · Data Quality: The elephant in the (big data) room Chris Park Data Scientist UK Data Service DataFirst Data Quality Workshop Cape

Why Smart Meters?

•  Better control and oversight over own energy use.

•  No more ‘estimated’ bills, and no more meter readers visiting your home.

•  Researchers can have access to raw, unadjusted data.

•  Opportunity to standardize how energy data stored and shared to encourage reproducibility.

Page 20: Data Quality: The elephant in the (big data) room · Data Quality: The elephant in the (big data) room Chris Park Data Scientist UK Data Service DataFirst Data Quality Workshop Cape

Smart Meter Roll-out Plans in Europe

Page 21: Data Quality: The elephant in the (big data) room · Data Quality: The elephant in the (big data) room Chris Park Data Scientist UK Data Service DataFirst Data Quality Workshop Cape

Data Quality Challenges

Retrieved from https://www.intechopen.com/source/html/50727/media/fig2.png

Page 22: Data Quality: The elephant in the (big data) room · Data Quality: The elephant in the (big data) room Chris Park Data Scientist UK Data Service DataFirst Data Quality Workshop Cape

Lessons learned and way forward

•  Academia, industry, and government all have something to offer.

•  Smart meters provide a unique opportunity to demonstrate how data-driven innovation across industries and sectors can create public value.

•  Want: a unified, standardized, and secure interface to

smart meter data that can help researchers and policymakers.

Page 23: Data Quality: The elephant in the (big data) room · Data Quality: The elephant in the (big data) room Chris Park Data Scientist UK Data Service DataFirst Data Quality Workshop Cape

Smart Meter Research Portal

•  Serve as a knowledge base for intervention and longitudinal studies using energy data across the socio-technical spectrum.

•  Provide seamless access to standardized smart meter data at half-hourly, daily, or monthly resolutions.

•  Facilitate secure data linkage service within an ISO-certified, trusted digital repository.

•  Use cutting-edge technology based on the big data platform at the UK Data Service.

Page 24: Data Quality: The elephant in the (big data) room · Data Quality: The elephant in the (big data) room Chris Park Data Scientist UK Data Service DataFirst Data Quality Workshop Cape

Data Service as a Platform

Page 25: Data Quality: The elephant in the (big data) room · Data Quality: The elephant in the (big data) room Chris Park Data Scientist UK Data Service DataFirst Data Quality Workshop Cape

Key Messages

•  Data cleaning and data quality is important in the era of ‘Big Data’ et al. It might be cheaper to store data now, but it is harder to keep track, standardize, and curate data.

•  The way forward is to work across disciplines and sectors, e.g. academia, government, and industry, to provide standardized access to and use of data that has potential to provide public value, e.g. energy data.

Page 26: Data Quality: The elephant in the (big data) room · Data Quality: The elephant in the (big data) room Chris Park Data Scientist UK Data Service DataFirst Data Quality Workshop Cape
Page 27: Data Quality: The elephant in the (big data) room · Data Quality: The elephant in the (big data) room Chris Park Data Scientist UK Data Service DataFirst Data Quality Workshop Cape

Chris Park Big Data Network Support UK Data Service [email protected]