Methodological Principles in Dealing with Big Data. Reijo Sund, University of Helsinki, Centre for Research Methods, Faculty of Social Sciences. Big Data seminar, Statistics Finland, Helsinki, 2.6.2014 (slides dated 1 June 2014).

Upload: tilastokeskus

Post on 01-Jul-2015


TRANSCRIPT

Page 1: Methodological principles in dealing with Big Data, Reijo Sund

Methodological Principles in Dealing with Big Data

Reijo Sund

University of Helsinki, Centre for Research Methods, Faculty of Social Sciences

Big Data seminar, Statistics Finland, Helsinki, 2.6.2014

Page 2: Methodological principles in dealing with Big Data, Reijo Sund

Big Data

Data have been produced for hundreds of years

The reasons for such production were originally administrative in nature

There was a need for systematically collected numerical facts on a particular subject

Advances in information technology have made it possible to more effectively collect and store larger and larger data sets


Page 3: Methodological principles in dealing with Big Data, Reijo Sund

From data to information

For as long as there have been data, there has been the challenge of transforming them into useful information

Too much data in an unusable form has always been a common complaint

Well-known hierarchy:

Data - Information - Knowledge - Wisdom - Intelligence

Page 4: Methodological principles in dealing with Big Data, Reijo Sund

Secondary data

There are more and more "big data", but the emphasis has been on technical aspects and not on the information itself

Data without explanations are useless

Big Data are often secondary data

Not tailored to the specific research question at hand

More (detailed) data would not solve the basic problems

More background information is required for utilization

Page 5: Methodological principles in dealing with Big Data, Reijo Sund

Fundamental problem

The belief that big data consist of autonomous, atom-like building blocks is fundamentally erroneous

Raw register data as such are of little value

No simple magic tricks to overcome problems arising from the fundamental limitations of empirical research

More general aspects of scientific research are needed in order to understand the related methodological challenges


Page 6: Methodological principles in dealing with Big Data, Reijo Sund

Knowledge discovery process

Process consists of several main phases: Understanding the phenomenon, Understanding the problem, Understanding data, Data preprocessing, Modeling, Evaluation, Reporting

The main difference from the "traditional" research process is the additional interpretation-operationalization phase

[Figure: research process diagram with the elements Context, Debate, Idea, Theory, Problem, Data, Analysis, Question, Answer, Perspective]

Page 7: Methodological principles in dealing with Big Data, Reijo Sund

Prerequisites

Effective use of big data presumes skills in various areas:

Measurement

Data modeling (information sciences)

Statistical computing (statistics)

Theory of the subject matter


Page 8: Methodological principles in dealing with Big Data, Reijo Sund

Principles of measurement

Reality can be confronted by recording observations that reflect the phenomenon of interest

Measurement aims to create data as symbolic representations of the observations

Operationalization determines how the phenomenon P that becomes visible via observations O is mapped to data D

Operationalization is successful if it becomes possible to make valid interpretations I of the symbolic data D in regard to the phenomenon P

Page 9: Methodological principles in dealing with Big Data, Reijo Sund

Infological equation

Information is something that has to be produced from the data and the pre-knowledge

Infological equation:

I = i(D, S, t)

Information I is produced from the data D and the pre-knowledge S (at time t, using the interpretation process i)
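The infological equation can be read as a function call. The following is a minimal sketch under that reading, not something shown in the talk: the function name `interpret`, the two code books and the example code are assumptions made for illustration. It shows that the same data D can yield different information I depending on the pre-knowledge S.

```python
from datetime import date


def interpret(data, pre_knowledge, at_time):
    """Toy interpretation process i: produce information I from D, S and t."""
    # S is modeled as a code book mapping raw symbols to meanings (an assumption).
    meaning = pre_knowledge.get(data)
    if meaning is None:
        # Without suitable pre-knowledge the data stay uninterpreted.
        return None
    return f"{meaning} (interpreted on {at_time.isoformat()})"


# Hypothetical code books: the same symbol meets different pre-knowledge.
coder_knowledge = {"S72": "fracture of femur"}
outsider_knowledge = {}

t = date(2014, 6, 2)
info_coder = interpret("S72", coder_knowledge, t)
info_outsider = interpret("S72", outsider_knowledge, t)
print(info_coder)
print(info_outsider)
```

A dictionary lookup is of course far simpler than any real interpretation process; the point is only that D alone does not determine I.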


Page 10: Methodological principles in dealing with Big Data, Reijo Sund

Data modeling

Data modeling can be used to construct (computer-based) symbol structures which capture the meaning of data and organize it in ways that make it understandable

Only what is (or can be) represented is considered to exist

 

[Figure: data model diagram with the elements Phenomenon, Concept, Object, Host, Attributes, Time, Place, Realized observation, Data component, Knowledge component, Logical component, Taxonomy, Partonomy, Theoretical measurement properties]

Page 11: Methodological principles in dealing with Big Data, Reijo Sund

Data preprocessing

Data cleaning and reduction

Correction of “global” deficiencies in the data

Dropping of “uninteresting” data

Data abstraction

“Intelligent enrichment” of data using background knowledge

This kind of preprocessing resembles qualitative analysis much more than quantitative analysis

Each rule reflects the instability of the concept and is a step further away from the "objectivity" of the study
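The three preprocessing steps can be sketched as small functions applied in sequence. This is an illustrative sketch only: the records, the field names and the rule that treats codes S72.0 to S72.2 as hip fractures are assumptions chosen for the example, not rules taken from the talk.

```python
# Hypothetical raw records; field names and values are invented for illustration.
raw = [
    {"id": 1, "age": "67", "diagnosis": "s72.0"},
    {"id": 2, "age": "", "diagnosis": "S72.1"},
    {"id": 3, "age": "45", "diagnosis": "J18.9"},
]

# Assumed rule for this sketch: these code prefixes stand for "hip fracture".
HIP_FRACTURE_PREFIXES = ("S72.0", "S72.1", "S72.2")


def clean(record):
    """Cleaning: correct 'global' deficiencies (code casing, age stored as text)."""
    age = int(record["age"]) if record["age"] else None
    return {**record, "age": age, "diagnosis": record["diagnosis"].upper()}


def is_interesting(record):
    """Reduction: keep only records relevant to the question (a subjective rule)."""
    return record["diagnosis"].startswith(HIP_FRACTURE_PREFIXES)


def abstract(record):
    """Abstraction: enrich the data with background knowledge (a concept label)."""
    return {**record, "concept": "hip fracture"}


cleaned = [clean(r) for r in raw]
reduced = [r for r in cleaned if is_interesting(r)]
abstracted = [abstract(r) for r in reduced]
print(abstracted)
```

Each function encodes a decision by the analyst, which is why every added rule moves the result a step further from the raw register content.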


Page 12: Methodological principles in dealing with Big Data, Reijo Sund

Preprocessing in practice

Need for conceptual representation of each object

Two main classes for concept-data relation:

Factual = minimal background knowledge

Abstracted = cognitive fit acceptable

Sophisticated (and subjective) preprocessing that aims to scale matters down to a size more suitable for specific analyses is the most important and time-consuming part of (big) data analysis

Page 13: Methodological principles in dealing with Big Data, Reijo Sund

Greater statistics

Statistics offers not only a set of tools for problem-solving, but also a formal way of thinking about the modeling of the actual problem

Rather than trying to squeeze the data into a predefined model or saying too much on what can and cannot be done, data analysis should work to achieve an appropriate compromise between the practical problems and the data


Page 14: Methodological principles in dealing with Big Data, Reijo Sund

Challenges

How to analyze massive data effectively when manual management is unfeasible?

How to avoid ‘snooping/dredging/fishing/shopping’ without assuming that data are automatically in concordance with the theory?

How to deal with data that cover total populations, for which sampling error and statistical significance lose their traditional meaning?

Page 15: Methodological principles in dealing with Big Data, Reijo Sund

Thank you!

For more information:

http://www.helsinki.fi/~sund


Page 16: Methodological principles in dealing with Big Data, Reijo Sund

How to calculate the annual number of hip fractures in Finland?

Background knowledge: All hip fractures are recorded in the Hospital Discharge Register

Data challenge: Difficult to separate new admissions from the care of old fractures

Change of theory: Consider only first hip fractures instead of all hip fractures

Solution in terms of data: Easy to determine the number of first hip fractures from the register if enough old data are available and deterministic record linkage can be used
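Counting first hip fractures can be sketched as follows, assuming each register record carries an exact person identifier so that records can be linked deterministically. The record layout, the identifiers and the dates are hypothetical, and a real analysis would run against the full register history.

```python
from collections import Counter
from datetime import date

# Hypothetical discharge records: (person_id, admission_date). Values are invented.
admissions = [
    ("A", date(1999, 3, 1)),
    ("A", date(1999, 4, 15)),  # follow-up care of the same fracture
    ("B", date(2000, 1, 10)),
    ("A", date(2000, 2, 2)),   # later admission, not a first fracture
    ("C", date(2000, 11, 30)),
]


def first_fractures_per_year(records):
    """Count each person once, in the year of their earliest recorded admission."""
    first_seen = {}
    for person_id, admitted in records:
        # Deterministic linkage: records are joined on an exact person identifier.
        if person_id not in first_seen or admitted < first_seen[person_id]:
            first_seen[person_id] = admitted
    return Counter(d.year for d in first_seen.values())


counts = first_fractures_per_year(admissions)
print(dict(counts))
```

The result is only as good as the look-back period: a person whose earlier fracture predates the available data would be counted as a first fracture here, which is why enough old data are needed.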


Page 17: Methodological principles in dealing with Big Data, Reijo Sund

Are there more hip fractures during winter? How to define winter?

Based on the data, "Winter" is from November to April
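Once winter is defined as November to April, the seasonal comparison reduces to grouping monthly counts by that definition. The sketch below applies the definition rather than deriving it from data, and the monthly counts are invented numbers, not values from the talk.

```python
WINTER_MONTHS = {11, 12, 1, 2, 3, 4}  # November to April, as stated on the slide


def is_winter(month):
    """Return True if the calendar month (1-12) belongs to the winter period."""
    return month in WINTER_MONTHS


# Hypothetical monthly counts keyed by (year, month); the numbers are invented.
monthly_counts = {
    (1998, 1): 14, (1998, 4): 12, (1998, 7): 9,
    (1998, 10): 10, (1998, 12): 15, (1999, 6): 8,
}


def mean_by_season(counts):
    """Return (winter mean, non-winter mean) of the monthly counts."""
    winter = [n for (_, m), n in counts.items() if is_winter(m)]
    other = [n for (_, m), n in counts.items() if not is_winter(m)]
    return sum(winter) / len(winter), sum(other) / len(other)


winter_mean, other_mean = mean_by_season(monthly_counts)
print(winter_mean, other_mean)
```

Changing `WINTER_MONTHS` changes the answer, which is the point of the slide: the definition of the concept is part of the analysis.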

[Figure: two time-series panels, "Institutionalized" and "Over 50 years old", each with a monthly time axis from 1/98 to 1/03 and a vertical axis from 0 to 20]

Page 18: Methodological principles in dealing with Big Data, Reijo Sund

Data abstracted outcomes

Commonly used outcomes for measuring the effectiveness of (hip fracture) surgery are death and complications

These are medical concepts, but must be abstracted from individual level register-based data by using some ‘rules’, such as a list of some particular diagnosis codes recorded in the data


Page 19: Methodological principles in dealing with Big Data, Reijo Sund

Stable and complex outcomes

It is typically straightforward to extract the event of death from the data by using a "one-line rule"

Extraction of complications may require tens of different rules, which are justified by using domain knowledge and by evaluating the rules against concrete data until a saturation point is reached
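The contrast between the two outcomes can be sketched in code. Everything below is hypothetical: the field names, the placeholder codes `CX1` and `CX2` and the three rules are assumptions for illustration, not a validated rule set.

```python
def died(record):
    """The one-line rule: death is recorded directly as a date."""
    return record.get("death_date") is not None


# Hypothetical complication rules; the codes are placeholders, not real diagnosis codes.
COMPLICATION_RULES = [
    lambda rec: "CX1" in rec["diagnoses"],
    lambda rec: "CX2" in rec["diagnoses"],
    lambda rec: rec.get("reoperation", False),
]


def has_complication(record):
    """A complication is flagged if any rule in the (growing) rule list fires."""
    return any(rule(record) for rule in COMPLICATION_RULES)


patient = {"diagnoses": {"S72.0", "CX1"}, "death_date": None}
print(died(patient), has_complication(patient))
```

In practice the rule list would be extended and re-checked against concrete data until new rules stop changing the set of flagged cases.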
