Statistical Methods in Computer Science: Experiment Design. Gal A. Kaminka, galk@cs.biu.ac.il

Post on 18-Mar-2020

Page 1: Empirical Methods in Computer Scienceu.cs.biu.ac.il/~fridman/Statistics/LectureNotes/06-hypotheses.pdfStatistical Methods in Computer Science Experiment Design Gal A. Kaminka galk@cs.biu.ac.il

Statistical Methods in Computer Science

Experiment Design

Gal A. Kaminka, galk@cs.biu.ac.il

Empirical Methods in Computer Science © 2006-now Gal Kaminka 2

Experimental Lifecycle

[Diagram nodes: Vague idea; “groping around” experiences; Initial observations; Hypothesis; Model/Theory; Experiment; Data, analysis, interpretation; Results & final presentation]


A Slightly Revised View...

[Diagram nodes: Model/Theory; Hypothesis; Experiment; Analysis]


Proving a Theory?

We've discussed 4 methods of proving a proposition:
  Everyone knows it
  Someone specific says it
  An experiment supports it
  We can mathematically prove it

Some propositions cannot be verified empirically:
  “This mega-compiler has linear run-time”
  Infinite possible inputs --> cannot prove empirically

But they may still be disproved:
  e.g., code that causes the compiler to run non-linearly
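A minimal sketch of what such a disproof attempt might look like. Everything here is an assumption made for illustration: `count_steps` is a hypothetical stand-in for the system under test (deliberately quadratic), and it counts basic operations instead of measuring wall-clock time so that the result is repeatable. The check relies on the fact that doubling the input of a linear-cost procedure should roughly double its cost; a ratio well above 2 is evidence against linearity, while finding no such ratio proves nothing.

```python
def count_steps(n):
    """Hypothetical system under test: returns the number of basic
    operations performed on an input of size n (quadratic on purpose)."""
    steps = 0
    for i in range(n):
        for _ in range(i):
            steps += 1
    return steps


def find_nonlinear_evidence(cost, sizes, tolerance=0.25):
    """Return (n, ratio) for the first size n where doubling the input
    more than doubles the cost (beyond the tolerance), else None."""
    for n in sizes:
        ratio = cost(2 * n) / cost(n)
        if ratio > 2 * (1 + tolerance):
            return n, ratio
    return None


# One counterexample is enough to refute the claim of linear cost.
evidence = find_nonlinear_evidence(count_steps, sizes=[100, 200, 400])

# A genuinely linear cost function yields no counterexample on these sizes.
linear_check = find_nonlinear_evidence(lambda n: 3 * n + 5, sizes=[100, 200, 400])
```

The tolerance is an arbitrary choice for this sketch; with real timing data it would have to absorb measurement noise.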


Karl Popper's Philosophy of Science

Popper advanced a particular philosophy of science: Falsifiability

For a theory to be considered scientific, it must be falsifiable
  There must be some way to refute it, in principle
  Not falsifiable <==> Not scientific

Examples:
  “All crows are black”: falsifiable by finding a white crow
  “Compile in linear time”: falsifiable by non-linear performance

Theory tested on its predictions


Proving by disproving...

Platt (“Strong Inference”, 1964) offers a specific method:
  1) Devise alternative hypotheses for observations
  2) Devise experiment(s) allowing elimination of hypotheses
  3) Carry out experiments to obtain a clean result
  4) Go to 1.

The idea is to eliminate (falsify) hypotheses


Forming Hypotheses

So, to support theory X, we:
  1) Construct falsification hypotheses X_1, ..., X_n, ...
  2) Systematically experiment to disprove X, by trying to prove each X_i
  3) If all falsification hypotheses are eliminated, then this lends support to X

Note that future falsification hypotheses may be formed
  Theory must continue to hold against “attacks”
  Popper: Scientific evolution, “survival of the fittest theory”

How does this view hold in computer science?


Forming Hypotheses in CS

(1) Carefully identify the theoretical object we are studying:
  e.g., “the relation between input-size and run-time is linear”
  e.g., “the algorithm causes robots to collect pucks better”
  e.g., “the display improves user performance”

(2) Identify falsification hypothesis (null hypothesis) H0
  e.g., “there is an input-size for which run-time is non-linear”
  e.g., “the algorithm will cause robots to collect fewer pucks”
  e.g., “the display will have no effect on user performance”

(3) Now, experiment to eliminate H0
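As one possible illustration of step (3), the sketch below tries to eliminate the H0 “the display will have no effect on user performance” with a permutation test. This is a technique chosen for the example, not one the slides prescribe, and the task-completion times are made up. If H0 were true, the group labels would be interchangeable, so the test shuffles them many times and asks how often a difference at least as large as the observed one appears by chance.

```python
import random
from statistics import mean


def permutation_test(group_a, group_b, n_resamples=10_000, seed=0):
    """Estimate a two-sided p-value for H0: 'group labels do not matter'."""
    rng = random.Random(seed)
    observed = abs(mean(group_a) - mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    at_least_as_extreme = 0
    for _ in range(n_resamples):
        rng.shuffle(pooled)  # reassign labels at random
        diff = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
        if diff >= observed - 1e-12:  # small slack for float rounding
            at_least_as_extreme += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (at_least_as_extreme + 1) / (n_resamples + 1)


# Hypothetical task-completion times in seconds (illustration only).
with_display = [12.1, 11.8, 12.5, 11.9, 12.3, 12.0]
without_display = [14.2, 13.9, 14.5, 14.1, 13.8, 14.4]

p_value = permutation_test(with_display, without_display)
```

A small `p_value` means data like these would be unlikely if H0 held, which is the sense in which the experiment "eliminates" H0.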


The Basics of Experiment Design

Experiments identify a relation between variables X, Y, ...

Simple experiments: Provide indication of relation
  Better/worse, linear or non-linear, ....

Advanced experiments: help identify causes, interactions
  e.g., linear in input size, but constant factor depends on type of data


Types of Experiments and Variables

Manipulation experiments
  Manipulate (= set value of) independent variables
  Observe (= measure value of) dependent variables

Observation experiments
  Observe predictor variables
  Observe response variables

Other variables:
  Endogenous: On causal path between independent and dependent
  Extraneous: Other variables influencing dependent variables


An example observation experiment

Theory: Gender affects score performance
Falsifying hypothesis: Gender does not affect performance
Cannot use manipulation experiments:
  Cannot control gender
Must use observation experiments


An example observation experiment (à la “Empirical Methods for Artificial Intelligence”, Cohen 1995)

Variable             Child 1    Child 2
# Siblings           2          3
Mother               artist     doctor
Gender               male       female
Height               145cm      135cm
Teacher's attitude   (no value shown)
Child confidence     (no value shown)
Test score           650        720

Highlighted on this slide: Independent (Predictor) Variables


An example observation experiment (continued)

[Same two child records as on the previous slide. Highlighted on this slide: Dependent (Response) Variables]


An example observation experiment (continued)

[Same two child records as on the previous slides. Highlighted on this slide: Endogenous Variables]


An example observation experiment (continued)

[Same two child records as on the previous slides. Highlighted on this slide: Exogenous Variables]


Experiment Design: Introduction

Different experiment types explore different hypotheses

For instance, a very simple design: treatment experiment
  Sometimes known as a lesion study

  treatment: Ind_1 & Ex_1 & Ex_2 & ... & Ex_n ==> Dep_1
  control:           Ex_1 & Ex_2 & ... & Ex_n ==> Dep_2

Treatment condition: with independent variable
Control condition: with no independent variable
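A small sketch of the treatment/control layout, under assumptions chosen only for illustration: the "system" is a recursive Fibonacci function, the independent variable is whether a cache is present (the component being lesioned), the extraneous variables are the inputs, and the dependent variable is the number of function calls. None of these names come from the slides.

```python
def count_calls(n, use_cache):
    """Dependent variable: number of calls made while computing fib(n)."""
    calls = 0
    cache = {}

    def fib(k):
        nonlocal calls
        calls += 1
        if k < 2:
            return k
        if use_cache and k in cache:
            return cache[k]
        value = fib(k - 1) + fib(k - 2)
        if use_cache:
            cache[k] = value
        return value

    fib(n)
    return calls


# Extraneous variables (Ex_1..Ex_n): kept identical in both conditions.
inputs = [5, 10, 15]

treatment = [count_calls(n, use_cache=True) for n in inputs]   # Dep_1
control = [count_calls(n, use_cache=False) for n in inputs]    # Dep_2
```

Because both conditions see exactly the same inputs, any difference between `treatment` and `control` can be attributed to the presence of the cache.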


Comparison Experiments

An improvement over treatment experiments
Allow comparison of different conditions

  treatment1: Ind_1 & Ex_1 & Ex_2 & ... & Ex_n ==> Dep_1
  treatment2: Ind_2 & Ex_1 & Ex_2 & ... & Ex_n ==> Dep_2
  control:            Ex_1 & Ex_2 & ... & Ex_n ==> Dep_3

Compare performance of algorithm A to B to C ....
Control condition: Optional (e.g., to establish baseline)


Example of Comparison Experiments

Compare performance of user interface A to B to C .... (Kaminka and Elmaliach 2006)

[Chart: deviation [degree], axis from 0 to 6, for the conditions Split&Tool, Only Tool, and split]


Careful!

An effect on the dependent variable may not be as expected

Example: An experiment
  Hypothesis: fly's ear is on its wings
  Fly with two wings. Make loud noise. Observe flight.
  Fly with one wing. Make loud noise. No flight.
  Conclusion: Fly with only one wing cannot hear!

What's going on here?
  First, interpretation by the experimenter
  But also, lack of sufficient falsifiability:
    There are other possible explanations for why the fly wouldn't fly.


Controlling for other factors

Often, we cannot manipulate all extraneous variables
Then, we need to make sure they are sampled randomly
  Randomization averages out their effect

This can be difficult
  e.g., suppose we are trying to relate gender and math
  We control for effect of # of siblings by random sampling
  But # of siblings may be related to gender:
    Parents continue to have children hoping for a boy (Beal 1994)
    Thus # of siblings tied with gender
  Must separate results based on # of siblings
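A minimal sketch of what "separating results based on # of siblings" could look like in code. The records are invented for illustration and carry no empirical meaning; the function names are likewise hypothetical. The point is only the mechanics: compute the comparison within each stratum of the extraneous variable instead of over the pooled data.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical records: (gender, number_of_siblings, test_score)
records = [
    ("F", 1, 700), ("M", 1, 690),
    ("F", 1, 710), ("M", 1, 700),
    ("F", 3, 640), ("M", 3, 650),
    ("F", 3, 660), ("F", 3, 650),
]


def mean_score_by_gender(rows):
    """Pooled comparison: ignores the number of siblings entirely."""
    scores = defaultdict(list)
    for gender, _siblings, score in rows:
        scores[gender].append(score)
    return {gender: mean(values) for gender, values in scores.items()}


def mean_score_by_stratum(rows):
    """Stratified comparison: one gender comparison per sibling count."""
    strata = defaultdict(lambda: defaultdict(list))
    for gender, siblings, score in rows:
        strata[siblings][gender].append(score)
    return {
        siblings: {gender: mean(values) for gender, values in by_gender.items()}
        for siblings, by_gender in strata.items()
    }


pooled = mean_score_by_gender(records)
by_stratum = mean_score_by_stratum(records)
```

In this made-up data the pooled means and the per-stratum means tell different stories, which is the kind of discrepancy stratification is meant to expose.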


Factorial Experiment Designs

Every combination of factor values is sampled
  Hope is to exclude or reveal interactions

This creates a combinatorial number of experiments
  N factors, k values each = k^N combinations

Strategies for eliminating values:
  Merge values, categories. Skip values.
  Focus on extremes, to get a general trend.

[Two plots: Performance as a function of Head turn velocity]
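To make the k^N growth concrete, here is a short sketch that enumerates a full factorial design with `itertools.product`. The factor names and values are hypothetical, picked only to echo the run-time examples used earlier in these slides. It also shows the "focus on extremes" strategy applied to one numerical factor.

```python
from itertools import product

# Hypothetical factors: N = 3 factors with k = 3 values each.
factors = {
    "input_size": [100, 1_000, 10_000],
    "data_type": ["sorted", "random", "reversed"],
    "algorithm": ["A", "B", "C"],
}


def full_factorial(factor_values):
    """Return one dict per combination of factor values."""
    names = list(factor_values)
    return [dict(zip(names, combo)) for combo in product(*factor_values.values())]


configurations = full_factorial(factors)  # 3 ** 3 combinations

# Strategy from the slide: keep only the extremes of a numerical factor.
extremes_only = dict(factors)
extremes_only["input_size"] = [min(factors["input_size"]), max(factors["input_size"])]
reduced = full_factorial(extremes_only)   # 2 * 3 * 3 combinations
```

Each element of `configurations` describes one experimental condition to run.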


Tips for Factorial Experiments

For “numerical” variables, 2 value ranges are not enough
  Don't give a good sense of the function relating variables.

Measure, measure, measure.
  Piggybacking measurement: cheaper than re-running experiments

Simplify comparisons:
  Use same number of data points (trials) for all configurations


Experiment Validity

Types of validity: Internal and External validity

Internal validity:
  Experiment shows relationship (independent causes dependent)

External validity:
  Degree to which results generalize to other conditions

Threats: uncontrolled conditions threatening validity


Internal validity threats: Examples

Order effects
  Practice effects in human or animal test subjects
  Bug in testing system leaves system “unclean” for next trial

Demand effects
  Experimenter influences subject
    e.g., answering questions of subjects

Confounding effects
  See “fly with no wings cannot hear”


Order Effects

Order effects can confound results

If treatment/control given two different orders
  e.g., good for treatment, bad for control (or vice versa)
  Solution: Counter-balancing (all possible orders to all groups)

If treatment/control given exact same order
  Practice effects in humans and animals
  Solution: Randomize order of presentation to subjects
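The two solutions above can be sketched as follows. The subject identifiers and the fixed random seed are assumptions for the example; in a real study the seed would normally not be fixed. Counter-balancing enumerates every possible order and spreads the orders evenly over subjects, while randomization draws an independent order for each subject.

```python
import random
from collections import Counter
from itertools import permutations

conditions = ["treatment", "control"]
subjects = ["s1", "s2", "s3", "s4"]  # hypothetical subject identifiers

# Counter-balancing: every possible order is used equally often.
orders = list(permutations(conditions))
counterbalanced = {
    subject: orders[i % len(orders)] for i, subject in enumerate(subjects)
}
order_usage = Counter(counterbalanced.values())

# Randomization: each subject gets an independently shuffled order.
rng = random.Random(42)  # fixed seed only so the sketch is repeatable
randomized = {
    subject: tuple(rng.sample(conditions, k=len(conditions)))
    for subject in subjects
}
```

With more than a handful of conditions the number of orders grows factorially, which is one practical reason to fall back on randomization.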


External threats to validity

Sampling bias: Non-representative samples
  e.g., non-representative external factors

Floor and ceiling effects
  Problems tested too hard, too easy

Regression effects
  Results have no way to go but up or down

Solution approach: Run pilot experiments


Sampling Bias

Prefer setting/measuring specific values over others

For instance:
  Including only results that were found by some deadline

Solution: Detect, and remove
  e.g., by visualization, looking for non-normal distributions
  e.g., surprising distribution of dependent data, for different values of the independent variable.
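A tiny illustration of the deadline example, using made-up run-times. Keeping only the trials that finished by the deadline silently drops the slow ones, so the reported mean looks far better than the true one. Reporting the fraction of excluded trials is one simple way to detect the problem; the names and the deadline value are assumptions for this sketch.

```python
from statistics import mean

# Hypothetical run-times in seconds; two trials are much slower than the rest.
run_times = [1.0, 2.0, 3.0, 4.0, 50.0, 60.0]
DEADLINE = 10.0  # assumed cut-off after which results were not recorded

finished_in_time = [t for t in run_times if t <= DEADLINE]

true_mean = mean(run_times)            # what we would like to report
biased_mean = mean(finished_in_time)   # what the deadline lets us see
excluded_fraction = 1 - len(finished_in_time) / len(run_times)
```

A large `excluded_fraction` is a warning that the sample no longer represents the conditions being studied.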


Baselines: Floor and Ceiling Effects

How do we know A is good? Bad?
  Maybe the problems are too simple? Too hard?

For example
  New machine learning algorithm has 95% accuracy
  Is this good?

Controlling for Floor/Ceiling
  Establish baselines
  Find range of inputs
  Show that a “silly” approach achieves close result
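One common “silly” approach for classification is to always predict the most frequent label. The sketch below, on an invented label set, shows why 95% accuracy may mean very little: when 95% of the examples share one label, this trivial baseline reaches the same number. The data and function name are assumptions for illustration.

```python
from collections import Counter

# Hypothetical ground-truth labels: a heavily imbalanced test set.
labels = ["neg"] * 95 + ["pos"] * 5


def majority_baseline_accuracy(y):
    """Accuracy obtained by always predicting the most frequent label."""
    _label, count = Counter(y).most_common(1)[0]
    return count / len(y)


baseline = majority_baseline_accuracy(labels)
claimed_accuracy = 0.95  # the figure quoted for the new algorithm
improvement_over_baseline = claimed_accuracy - baseline
```

An improvement close to zero suggests a ceiling effect: the problems are too easy to separate a good method from a trivial one.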


Regression Effects

General phenomenon: “Regression towards the mean”
  Repeated measurement converges towards mean values

Example threat: Run a program on 100 different inputs
  Problems 6, 14, 15 get a very low score
  We now fix the problem, and want to re-test
  If chance has anything to do with scoring, then we must re-run all
  Why?
    Scores on 6, 14, 15 have nowhere to go but up.
    So re-running these problems will show improvement by chance

Solution: Re-run complete tests, or sample conditions uniformly
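A short simulation of the threat described above, under strong simplifying assumptions that are not from the slides: every problem has the same underlying score, and each run adds independent Gaussian noise. The program is never changed between runs, yet the three worst problems from the first run score better when re-run on their own.

```python
import random
from statistics import mean

rng = random.Random(1)  # fixed seed only so the sketch is repeatable
TRUE_SCORE = 50.0       # assumed: identical underlying score for every problem
NOISE_SD = 10.0         # assumed: amount of chance involved in scoring


def noisy_score():
    """One measurement: the true score plus random noise."""
    return TRUE_SCORE + rng.gauss(0, NOISE_SD)


first_run = [noisy_score() for _ in range(100)]

# Select the three lowest-scoring problems from the first run.
worst = sorted(range(len(first_run)), key=lambda i: first_run[i])[:3]

# Re-run ONLY those problems, with no change to the program at all.
second_run = {i: noisy_score() for i in worst}

before = mean(first_run[i] for i in worst)
after = mean(list(second_run.values()))
apparent_improvement = after - before
```

The positive `apparent_improvement` comes purely from selecting extreme values, which is why the slide recommends re-running the complete test set.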


Summary

Defensive thinking
  If I were trying to disprove the claim, what would I do?
  Then think of ways to counter any possible attack on the claim

Strong Inference, Popper's falsification ideas
  Science moves by disproving theories (empirically)

Experiment design: Carefully think through threats
  Ideal independent variables: easy to manipulate
  Ideal dependent variables: measurable, sensitive, and meaningful

Next week: Hypothesis testing (?)

