Statistical Methods in Computer Science

Experiment Design

Gal A. Kaminka

Empirical Methods in Computer Science © 2006-now Gal Kaminka

Experimental Lifecycle

[Diagram of the experimental lifecycle, with stages: Vague idea; “groping around” experiences; Initial observations; Hypothesis; Model/Theory; Experiment; Data, analysis, interpretation; Results & final presentation]


A Slightly Revised View...

[Diagram with stages: Model/Theory; Hypothesis; Experiment; Analysis]


Proving a Theory?

We've discussed 4 methods of proving a proposition:
  Everyone knows it
  Someone specific says it
  An experiment supports it
  We can mathematically prove it

Some propositions cannot be verified empirically:
  “This mega-compiler has linear run-time”
  Infinite possible inputs --> cannot prove empirically

But they may still be disproved:
  e.g., code that causes the compiler to run non-linearly
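A minimal sketch of disproving such a claim, under stated assumptions: `count_steps` is a hypothetical toy stand-in for the compiler (it counts the comparisons made by a naive all-pairs duplicate check), and the claim under test is made concrete as "at most 50 steps per input item". No number of passing input sizes proves the claim, but a single violating size refutes it.

```python
def count_steps(n):
    """Count comparisons made by a naive all-pairs duplicate check on n distinct items."""
    items = list(range(n))  # no duplicates, so every pair gets compared
    steps = 0
    for i in range(n):
        for j in range(i + 1, n):
            steps += 1
            if items[i] == items[j]:
                return steps
    return steps


def find_counterexample(step_counter, sizes, max_steps_per_item):
    """Return the first size whose cost exceeds the claimed linear bound, else None."""
    for n in sizes:
        if step_counter(n) > max_steps_per_item * n:
            return n
    return None


# Sizes 10 and 100 are consistent with the claim; size 1000 is not.
witness = find_counterexample(count_steps, [10, 100, 1000], max_steps_per_item=50)
print("claim refuted at n =", witness)  # prints: claim refuted at n = 1000
```

Counting steps instead of measuring wall-clock time keeps the check deterministic; a timing-based version would need repeated runs to separate the trend from noise.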


Karl Popper's Philosophy of Science

Popper advanced a particular philosophy of science: Falsifiability

For a theory to be considered scientific, it must be falsifiable
  There must be some way to refute it, in principle
  Not falsifiable <==> Not scientific

Examples:
  “All crows are black”: falsifiable by finding a white crow
  “Compile in linear time”: falsifiable by non-linear performance

Theory tested on its predictions


Proving by disproving...

Platt (“Strong Inference”, 1964) offers a specific method:
1) Devise alternative hypotheses for observations
2) Devise experiment(s) allowing elimination of hypotheses
3) Carry out experiments to obtain a clean result
4) Go to 1.

The idea is to eliminate (falsify) hypotheses
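The elimination step can be sketched in a few lines. This is an illustration only: the three hypotheses (step count as a function of input size n) and the two observations are invented for the example, and `eliminate` is a hypothetical helper name.

```python
def eliminate(hypotheses, experiments):
    """Keep only the hypotheses whose prediction matches every observed outcome."""
    surviving = dict(hypotheses)
    for condition, observed in experiments:
        surviving = {
            name: predict
            for name, predict in surviving.items()
            if predict(condition) == observed
        }
    return surviving


# Hypothetical alternative hypotheses: predicted step count for input size n.
hypotheses = {
    "constant": lambda n: 1,
    "linear": lambda n: n,
    "quadratic": lambda n: n * n,
}

# Hypothetical observations, as (input size, measured step count) pairs.
experiments = [(1, 1), (3, 9)]

survivors = eliminate(hypotheses, experiments)
print(sorted(survivors))  # prints: ['quadratic']
```

The first observation (size 1) eliminates nothing, because all three hypotheses predict the same outcome there. Only an experiment on which the alternatives disagree gives a clean result.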


Forming Hypotheses

So, to support theory X, we:
1) Construct falsification hypotheses X_1, ..., X_n, ...
2) Systematically experiment to disprove X by proving X_i
3) If all falsification hypotheses are eliminated, then this lends support to X

Note that future falsification hypotheses may be formed
  Theory must continue to hold against “attacks”
  Popper: Scientific evolution, “survival of the fittest theory”

How does this view hold in computer science?


Forming Hypotheses in CS

(1) Carefully identify the theoretical object we are studying:
  e.g., “the relation between input-size and run-time is linear”
  e.g., “the algorithm causes robots to collect pucks better”
  e.g., “the display improves user performance”

(2) Identify falsification hypothesis (null hypothesis) H_0
  e.g., “there is an input-size for which run-time is non-linear”
  e.g., “the algorithm will cause robots to collect fewer pucks”
  e.g., “the display will have no effect on user performance”

(3) Now, experiment to eliminate the null hypothesis H_0
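One possible way to make step (3) concrete for the "display has no effect" example is an exact permutation test. This is a sketch under assumptions, not the method prescribed by the lecture: the scores are invented, and `permutation_p_value` is a hypothetical helper. If the null hypothesis holds, the group labels are arbitrary, so we count how many relabelings give a difference at least as large as the one observed.

```python
from itertools import combinations


def permutation_p_value(treatment, control):
    """Exact one-sided permutation test on the difference of group means."""
    pooled = treatment + control
    k = len(treatment)
    total = sum(pooled)
    observed = sum(treatment) / k - sum(control) / len(control)
    at_least_as_large = 0
    n_splits = 0
    # Try every way of relabeling k of the pooled scores as "treatment".
    for idx in combinations(range(len(pooled)), k):
        t_sum = sum(pooled[i] for i in idx)
        diff = t_sum / k - (total - t_sum) / (len(pooled) - k)
        n_splits += 1
        if diff >= observed - 1e-12:  # small tolerance for float rounding
            at_least_as_large += 1
    return at_least_as_large / n_splits


# Hypothetical user-performance scores (made up for illustration).
with_display = [8, 9, 10, 11]
without_display = [1, 2, 3, 4]

p = permutation_p_value(with_display, without_display)
print(p)  # 1 of the 70 possible splits is this extreme
```

A small value means the observed difference would be rare if the display truly had no effect, which is the sense in which the experiment "eliminates" the null hypothesis.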


The Basics of Experiment Design

Experiments identify a relation between variables X, Y, ...

Simple experiments: provide indication of relation
  Better/worse, linear or non-linear, ....

Advanced experiments: help identify causes, interactions
  Linear in input size, but constant factor depends on type of data


Types of Experiments and Variables

Manipulation experiments
  Manipulate (= set value of) independent variables
  Observe (= measure value of) dependent variables

Observation experiments
  Observe predictor variables
  Observe response variables

Other variables:
  Endogenous: on causal path between independent and dependent
  Extraneous: other variables influencing dependent variables


An example observation experiment

Theory: Gender affects score performance
Falsifying hypothesis: Gender does not affect performance
Cannot use manipulation experiments:
  Cannot control gender
  Must use observation experiments


An example observation experiment (à la “Empirical methods in AI”, Cohen 1995)

Variable              Child 1    Child 2
# Siblings            2          3
Mother                Artist     Doctor
Gender                Male       Female
Height                145cm      135cm
Teacher's attitude    -          -
Child confidence      -          -
Test score            650        720

[Figure: these two records are shown four times, highlighting in turn the Independent (Predictor) Variables, the Dependent (Response) Variables, the Endogenous Variables, and the Exogenous Variables]


Experiment Design: Introduction

Different experiment types explore different hypotheses
For instance, a very simple design: treatment experiment
  Sometimes known as a lesion study

treatment:  Ind_1 & Ex_1 & Ex_2 & .... & Ex_n ==> Dep_1
control:            Ex_1 & Ex_2 & .... & Ex_n ==> Dep_2

Treatment condition: with independent variable
Control condition: with no independent variable
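A minimal sketch of a treatment experiment, with all specifics assumed for illustration: the system under study is a bubble sort, the independent variable is a hypothetical `early_exit` optimization (present in the treatment condition, removed in the control), the dependent variable is the number of passes, and the extraneous variables (the input lists) are held identical across both conditions.

```python
import random


def bubble_sort_passes(data, early_exit):
    """Sort a copy of data and return the number of passes made (dependent variable)."""
    items = list(data)
    passes = 0
    for end in range(len(items) - 1, 0, -1):
        swapped = False
        for i in range(end):
            if items[i] > items[i + 1]:
                items[i], items[i + 1] = items[i + 1], items[i]
                swapped = True
        passes += 1
        if early_exit and not swapped:
            break  # the component being "lesioned" in the control condition
    return passes


# Same extraneous conditions for both groups: identical, seeded inputs.
rng = random.Random(0)
inputs = [[rng.random() for _ in range(20)] for _ in range(30)]

treatment = [bubble_sort_passes(x, early_exit=True) for x in inputs]
control = [bubble_sort_passes(x, early_exit=False) for x in inputs]

print("mean passes, treatment:", sum(treatment) / len(treatment))
print("mean passes, control:  ", sum(control) / len(control))
```

Because both conditions see exactly the same inputs, any difference in the number of passes can be attributed to the one thing that differs between them.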


Comparison Experiments

An improvement over treatment experiments
Allow comparison of different conditions

treatment1: Ind_1 & Ex_1 & Ex_2 & .... & Ex_n ==> Dep_1
treatment2: Ind_2 & Ex_1 & Ex_2 & .... & Ex_n ==> Dep_2
control:            Ex_1 & Ex_2 & .... & Ex_n ==> Dep_3

Compare performance of algorithm A to B to C ....
Control condition: Optional (e.g., to establish baseline)


Example of Comparison Experiments

Compare performance of user interface A to B to C .... (Kaminka and Elmaliach 2006)

[Figure: chart comparing the Split&Tool, Only Tool, and split interfaces; axis: deviation [degree], from 0 to 6]


Careful!

An effect on the dependent variable may not be as expected
Example: An experiment
  Hypothesis: fly's ear is on its wings
  Fly with two wings. Make loud noise. Observe flight.
  Fly with one wing. Make loud noise. No flight.
  Conclusion: Fly with only one wing cannot hear!

What's going on here?
  First, interpretation by the experimenter
  But also, lack of sufficient falsifiability:
    There are other possible explanations for why the fly wouldn't fly.


Controlling for other factors

Often, we cannot manipulate all extraneous variables
Then, we need to make sure they are sampled randomly
  Randomization averages out their effect

This can be difficult
  e.g., suppose we are trying to relate gender and math
  We control for effect of # of siblings by random sampling
  But # of siblings may be related to gender:
    Parents continue to have children hoping for a boy (Beal 1994)
    Thus # of siblings tied with gender
  Must separate results based on # of siblings
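Separating results by # of siblings amounts to comparing groups within each stratum. The sketch below uses entirely invented records and a hypothetical helper, `mean_score_by`, purely to show the mechanics: in this made-up sample the overall means differ between genders, but the difference disappears once records are compared within the same sibling count.

```python
from collections import defaultdict
from statistics import mean


def mean_score_by(records, *keys):
    """Mean test score for each combination of the given record fields."""
    groups = defaultdict(list)
    for record in records:
        groups[tuple(record[k] for k in keys)].append(record["score"])
    return {group: mean(scores) for group, scores in groups.items()}


# Invented records: sibling count is tied to gender in this sample.
records = [
    {"siblings": 1, "gender": "F", "score": 700},
    {"siblings": 1, "gender": "F", "score": 720},
    {"siblings": 1, "gender": "M", "score": 710},
    {"siblings": 3, "gender": "F", "score": 640},
    {"siblings": 3, "gender": "M", "score": 630},
    {"siblings": 3, "gender": "M", "score": 650},
]

overall = mean_score_by(records, "gender")
stratified = mean_score_by(records, "siblings", "gender")

print(overall)     # means differ when sibling count is ignored
print(stratified)  # means agree within each sibling count
```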


Factorial Experiment Designs

Every combination of factor values is sampled
  Hope is to exclude or reveal interactions

This creates a combinatorial number of experiments
  N factors, k values each = k^N combinations

Strategies for eliminating values:
  Merge values, categories. Skip values.
  Focus on extremes, to get a general trend.

[Figure: two plots of Performance against Head turn velocity]
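A full factorial design can be enumerated mechanically. In this sketch the factor names and values are hypothetical; the point is that the number of configurations is the product of the number of values per factor, and that keeping only the extreme values of each factor shrinks the design considerably.

```python
from itertools import product


def factorial_design(factors):
    """Return one configuration (a dict) per combination of factor values."""
    names = list(factors)
    return [dict(zip(names, combo)) for combo in product(*factors.values())]


def extremes_only(factors):
    """Keep only the first and last value of each factor."""
    return {name: [values[0], values[-1]] for name, values in factors.items()}


# Hypothetical factors: 3 factors with 3 values each.
factors = {
    "input_size": [100, 1000, 10000],
    "data_type": ["sorted", "random", "reversed"],
    "algorithm": ["A", "B", "C"],
}

full = factorial_design(factors)
reduced = factorial_design(extremes_only(factors))

print(len(full))     # prints: 27
print(len(reduced))  # prints: 8
```

Note that "extremes" only has a natural meaning for ordered factors such as `input_size`; for categorical factors, dropping the middle value is simply skipping a value.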


Tips for Factorial Experiments

For “numerical” variables, 2 value ranges are not enough
  Don't give a good sense of the function relating variables.

Measure, measure, measure.
  Piggybacking measurement: cheaper than re-running experiments

Simplify comparisons:
  Use same number of data points (trials) for all configurations


Experiment Validity

Types of validity: Internal and External validity

Internal validity:
  Experiment shows relationship (independent causes dependent)

External validity:
  Degree to which results generalize to other conditions

Threats: uncontrolled conditions threatening validity


Internal validity threats: Examples

Order effects
  Practice effects in human or animal test subjects
  Bug in testing system leaves system “unclean” for next trial

Demand effects
  Experimenter influences subject
    e.g., answering questions of subjects

Confounding effects
  See “fly with no wings cannot hear”


Order Effects

Order effects can confound results

If treatment/control given two different orders
  e.g., good for treatment, bad for control (or vice versa)
  Solution: Counter-balancing (all possible orders to all groups)

If treatment/control given exact same order
  Practice effects in humans and animals
  Solution: Randomize order of presentation to subjects
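Both solutions are easy to set up in code. The sketch below is illustrative only; the condition names, trial names, and the two helper functions are hypothetical.

```python
import random
from itertools import permutations


def counterbalanced_orders(conditions, n_subjects):
    """Assign every possible condition order equally often, round-robin over subjects."""
    orders = list(permutations(conditions))
    return [orders[i % len(orders)] for i in range(n_subjects)]


def randomized_order(trials, seed):
    """Return the trials in a random order; the seed makes the order reproducible."""
    rng = random.Random(seed)
    shuffled = list(trials)
    rng.shuffle(shuffled)
    return shuffled


# Counter-balancing: 6 subjects, 2 conditions, each order used 3 times.
assignments = counterbalanced_orders(["treatment", "control"], n_subjects=6)
for subject, order in enumerate(assignments):
    print(subject, order)

# Randomization: each subject gets an independently shuffled trial order.
trials = ["task1", "task2", "task3", "task4"]
print(randomized_order(trials, seed=1))
```

Counter-balancing grows factorially with the number of conditions, so it suits small designs; with many conditions or trials, randomizing the order per subject is the practical choice.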


External threats to validity

Sampling bias: Non-representative samples
  e.g., non-representative external factors

Floor and ceiling effects
  Problems tested too hard, too easy

Regression effects
  Results have no way to go but up or down

Solution approach: Run pilot experiments


Sampling Bias

Prefer setting/measuring specific values over others
For instance:
  Including results that were found by some deadline

Solution: Detect, and remove
  e.g., by visualization, looking for non-normal distributions
  e.g., surprising distribution of dependent data, for different values of independent variable.


Baselines: Floor and Ceiling Effects

How do we know A is good? Bad?
  Maybe the problems are too simple? Too hard?

For example
  New machine learning algorithm has 95% accuracy
  Is this good?

Controlling for Floor/Ceiling
  Establish baselines
  Find range of inputs
  Show that a “silly” approach achieves close result


Regression Effects

General phenomenon: “Regression towards the mean”
  Repeated measurement converges towards mean values

Example threat: Run a program on 100 different inputs
  Problems 6, 14, 15 get a very low score
  We now fix problem, and want to re-test
  If chance has anything to do with scoring, then must re-run all
  Why?
    Scores on 6, 14, 15 have nowhere to go but up.
    So re-running these problems will show improvement by chance

Solution: Re-run complete tests, or sample conditions uniformly


Summary

Defensive thinking
  If I were trying to disprove the claim, what would I do?
  Then think of ways to counter any possible attack on claim

Strong Inference, Popper's falsification ideas
  Science moves by disproving theories (empirically)

Experiment design: Carefully think through threats
  Ideal independent variables: easy to manipulate
  Ideal dependent variables: measurable, sensitive, and meaningful

Next week: Hypothesis testing (?)

