
A bunch of stuff you need to know

Becky and Danny

Counterbalancing

Why you need to counterbalance:

To avoid order effects – some items may influence other items

To avoid fatigue effects – subjects get tired and performance on later items suffers

To avoid practice effects – subjects learn how to do the task and performance on later items improves

Counterbalancing

2 item counterbalance:

              Subject 1   Subject 2
First Item        A           B
Second Item       B           A

Counterbalancing

3 item counterbalance:

           Sub 1   Sub 2   Sub 3   Sub 4   Sub 5   Sub 6
1st Item     A       B       C       A       B       C
2nd Item     B       C       A       C       A       B
3rd Item     C       A       B       B       C       A
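These six subjects receive all 3! = 6 possible orders of the items (complete counterbalancing). A minimal Python sketch of the same idea (the item labels are just placeholders):

    from itertools import permutations

    # Complete counterbalancing: every possible order of the items,
    # one order per subject (3 items -> 3! = 6 subjects).
    items = ["A", "B", "C"]
    for subject, order in enumerate(permutations(items), start=1):
        print(f"Sub {subject}: {' '.join(order)}")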

Counterbalancing

4 item counterbalance:

           Sub 1   Sub 2   Sub 3   Sub 4
1st Item     A       C       B       D
2nd Item     B       A       D       C
3rd Item     C       D       A       B
4th Item     D       B       C       A
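This 4 item square is a balanced Latin square: every item appears once in each position, and every item immediately precedes every other item exactly once. A sketch of the standard zigzag construction, which produces such a square for any even number of items (an illustration, not necessarily the exact square on the slide):

    def balanced_latin_square(items):
        # Row i visits i, i+1, i-1, i+2, i-2, ... (mod n); for even n
        # every ordered pair of adjacent items occurs exactly once.
        n = len(items)
        square = []
        for i in range(n):
            row, up, down = [], 0, 0
            for pos in range(n):
                if pos % 2 == 0:
                    row.append((i + up) % n)
                    up += 1
                else:
                    down += 1
                    row.append((i - down) % n)
            square.append([items[v] for v in row])
        return square

    for row in balanced_latin_square(["A", "B", "C", "D"]):
        print(" ".join(row))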

Counterbalancing

X > 4 item counterbalance:

1) Create a simple Latin Square

2) Randomize the rows

3) Randomize the columns (sketched below)
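A minimal sketch of this three-step recipe, taking the cyclic square as the "simple" Latin square (item labels are placeholders):

    import random

    def randomized_latin_square(items):
        n = len(items)
        # 1) Create a simple (cyclic) Latin square.
        square = [[(r + c) % n for c in range(n)] for r in range(n)]
        # 2) Randomize the rows.
        random.shuffle(square)
        # 3) Randomize the columns.
        col_order = random.sample(range(n), n)
        square = [[row[c] for c in col_order] for row in square]
        return [[items[v] for v in row] for row in square]

    for row in randomized_latin_square(["A", "B", "C", "D", "E"]):
        print(" ".join(row))

Note that shuffling rows and columns preserves the Latin-square property: each item still appears exactly once per row and per column.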

Counterbalancing

X >> 4 item counterbalance:

Randomize items
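With very many items, full or Latin-square counterbalancing becomes impractical, so each subject simply gets an independent random order; a minimal sketch (the 20-item pool is hypothetical):

    import random

    items = [f"item{i}" for i in range(1, 21)]   # hypothetical item pool

    def random_order(items):
        order = items[:]          # copy so the master list stays fixed
        random.shuffle(order)     # fresh random order for each subject
        return order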

Simpson’s Paradox

         Total Admit   Total Deny
Men          19            13        (59% admitted)
Women        13            19        (41% admitted)

Simpson’s Paradox

        Dept 1 Admit   Dept 1 Deny   Dept 2 Admit   Dept 2 Deny   Total Admit   Total Deny
Men          18             7              1             6             19            13
Women         7             1              6            18             13            19
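Within each department women are admitted at a higher rate than men, yet the aggregate comparison reverses: that reversal is the paradox. A quick check of the arithmetic from the table above:

    # (admit, deny) counts per department, from the table above.
    men   = {"Dept 1": (18, 7), "Dept 2": (1, 6)}
    women = {"Dept 1": (7, 1),  "Dept 2": (6, 18)}

    def rate(admit, deny):
        return admit / (admit + deny)

    for dept in ("Dept 1", "Dept 2"):
        print(dept, f"men {rate(*men[dept]):.0%}", f"women {rate(*women[dept]):.0%}")

    # Aggregating flips the comparison (Simpson's paradox):
    print("Total", f"men {rate(19, 13):.0%}", f"women {rate(13, 19):.0%}")

Dept 1 admits 72% of men vs 88% of women, and Dept 2 admits 14% of men vs 25% of women, yet the totals are 59% of men vs 41% of women: the groups apply in very different proportions to the lenient and the strict department.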

Definitions of interactions:

The whole is greater than the sum of its parts

The relationship between the variables is multiplicative instead of additive

The effectiveness of one intervention is contingent upon another intervention

Interactions

Why are interactions important?

1) Null effects can’t get published; a significant interaction solves that

2) Interactions are usually more interesting than main effects

3) Like Simpson’s paradox, interactions can mask an effect

Interactions

         Yes     No
Yes       0       5
No        3     -20

Interactions

Valium x Alcohol:

                Alcohol: Yes   Alcohol: No
Valium: Yes          0              5
Valium: No          10            100

Interactions

Gender x Movie:

          Ya Ya Sisterhood   Scorpion King
Female          10                -10
Male           -10                 10
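From these cell means, both main effects are exactly zero while the interaction is large: a pure crossover. Averaged over movies, neither gender differs; averaged over genders, neither movie differs. A minimal check of that arithmetic:

    # Cell means from the Gender x Movie table above (enjoyment scores).
    female = {"Ya Ya Sisterhood": 10, "Scorpion King": -10}
    male   = {"Ya Ya Sisterhood": -10, "Scorpion King": 10}

    main_gender = (sum(female.values()) - sum(male.values())) / 2   # 0
    main_movie  = ((female["Ya Ya Sisterhood"] + male["Ya Ya Sisterhood"]) -
                   (female["Scorpion King"] + male["Scorpion King"])) / 2  # 0
    interaction = ((female["Ya Ya Sisterhood"] - female["Scorpion King"]) -
                   (male["Ya Ya Sisterhood"] - male["Scorpion King"]))     # 40
    print(main_gender, main_movie, interaction)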


Validity

• Does the translation from concept to operationalization accurately represent the underlying concept?

• Does it measure what you think it measures?

• This is more familiarly called Construct Validity.

Types of Construct Validity

• Translation validity (Trochim's term)
  – Face validity
  – Content validity

• Criterion-related validity
  – Predictive validity
  – Concurrent validity
  – Convergent validity
  – Discriminant validity

Translation validity

• Is the operationalization a good reflection of the construct?

• This approach is definitional in nature – it assumes you have a good, detailed definition of the construct and that you can check the operationalization against it.

Face Validity

• “On its face,” does it seem like a good translation of the construct?
  – Weak version: if you read it, does it appear to ask questions directed at the concept?
  – Strong version: if experts in that domain assess it, they conclude it measures that domain.

Content Validity

• Check the operationalization against the relevant content domain for the construct.

• Assumes that a well-defined concept is being operationalized, which may not be true.

• For example, a depression measure should cover the checklist of depression symptoms.

Criterion-Related Validity

• Check the performance of the operationalization against some criterion.

• Content validity differs in that there the criterion is the construct definition itself -- it is a direct comparison.

• In criterion-related validity, a prediction is made about how the operationalization will perform based on our theory of the construct.

Predictive Validity

• Assess the operationalization's ability to predict something it should theoretically be able to predict.
  – A high correlation would provide evidence for predictive validity -- it would show that our measure can correctly predict something that we theoretically think it should be able to predict.

Concurrent Validity

• Assess the operationalization's ability to distinguish between groups that it should theoretically be able to distinguish between.

• As in any discriminating test, the results are more powerful if you are able to show that you can discriminate between two groups that are very similar.

Convergent Validity

• Examine the degree to which the operationalization is similar to (converges on) other operationalizations that it theoretically should be similar to.
  – To show the convergent validity of a test of arithmetic skills, one might correlate its scores with scores on other tests that purport to measure basic math ability; high correlations would be evidence of convergent validity.

Discriminant Validity

• Examine the degree to which the operationalization is not similar to (diverges from) other operationalizations that it theoretically should not be similar to.
  – To show the discriminant validity of a test of arithmetic skills, we might correlate its scores with scores on tests of verbal ability; low correlations would be evidence of discriminant validity.
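A minimal sketch of the convergent/discriminant logic with hypothetical score vectors (all names and numbers here are invented for illustration):

    import numpy as np

    # Hypothetical scores for the same ten subjects.
    arith_test  = np.array([55, 62, 70, 48, 81, 66, 59, 74, 68, 52])
    other_math  = np.array([58, 60, 73, 50, 79, 64, 61, 70, 71, 49])
    verbal_test = np.array([72, 55, 60, 66, 58, 75, 49, 63, 70, 57])

    convergent   = np.corrcoef(arith_test, other_math)[0, 1]   # should be high
    discriminant = np.corrcoef(arith_test, verbal_test)[0, 1]  # should be low
    print(f"convergent r = {convergent:.2f}, discriminant r = {discriminant:.2f}")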

Threats to Construct Validity

• From the discussion in Cook and Campbell (Cook, T.D. and Campbell, D.T., Quasi-Experimentation: Design and Analysis Issues for Field Settings).

• Inadequate Preoperational Explication of Constructs
• Mono-Operation Bias
• Mono-Method Bias
• Interaction of Different Treatments
• Interaction of Testing and Treatment
• Restricted Generalizability Across Constructs
• Confounding Constructs and Levels of Constructs

Inadequate Preoperational Explication of Constructs

• You didn't do a good enough job of defining (operationally) what you mean by the construct.

• Avoid by:
  – Thinking through the concepts better
  – Using methods (e.g., concept mapping) to articulate your concepts
  – Getting “experts” to critique your operationalizations

Mono-Operation Bias

• Pertains to the independent variable, cause, program, or treatment in your study, not to measures or outcomes.

• If you only use a single version of a program in a single place at a single point in time, you may not be capturing the full breadth of the concept of the program.

• Solution: try to implement multiple versions of your program.

Mono-Method Bias

• Refers to your measures or observations.

• With only a single version of a self-esteem measure, you can't provide much evidence that you're really measuring self-esteem.

• Solution: try to implement multiple measures of key constructs and try to demonstrate (perhaps through a pilot or side study) that the measures you use behave as you theoretically expect them to.

Interaction of Different Treatments

• Changes in the behaviors of interest may not be due to experimental manipulation, but may be due to an interaction of experimental manipulation with other interventions.

Interaction of Testing and Treatment

• Testing or measurement itself may make the groups more sensitive or receptive to treatment.

• If it does, then the testing is in effect a part of the treatment; it's inseparable from the effect of the treatment.

• This is a labeling issue (and, hence, a concern of construct validity) because you want to use the label “treatment” to refer to the treatment alone, but in fact it includes the testing.

Restricted Generalizability Across Constructs

• The "unintended consequences" treat to construct validity

• You do a study and conclude that Treatment X is effective. In fact, Treatment X does cause a reduction in symptoms, but what you failed to anticipate was the drastic negative consequences of the side effects of the treatment.

• When you say that Treatment X is effective, you have defined "effective" as only the directly targeted symptom.

Confounding Constructs and Levels of Constructs

• If your manipulation does not work, it may not be the case that it does not work at all, but only that it does not work at that level.

• For example, peer pressure may not work if only 2 people are applying pressure, but may work fine if 4 people are applying pressure.

The "Social" Threats to Construct Validity

• Hypothesis Guessing

• Evaluation Apprehension

• Experimenter Expectancies

Hypothesis Guessing

• Participants may try to figure out what the study is about. They "guess" at what the real purpose of the study is.

• They are likely to base their behavior on what they guess, not just on your manipulation.

• If change in the DV could be due to how they think they are supposed to behave, then the change cannot be completely attributed to the manipulation.

• It is this labeling issue that makes this a construct validity threat.

Evaluation Apprehension

• Some people may be anxious about being evaluated and consequently perform poorly.

• Or, because of wanting to look good (“social desirability”), they may try to perform better (e.g., unusual prosocial behavior).

• In both cases, the apprehension becomes confounded with the treatment itself and you have to be careful about how you label the outcomes.

Experimenter Expectancies

• The researcher can bias the results of a study in countless ways, both consciously and unconsciously.

• Sometimes the researcher can communicate what the desired outcome for a study might be (and participant desire to "look good" leads them to react that way).

• The researcher might look pleased when participants give a desired answer.

• If this is what causes the response, it would be wrong to label the response as a manipulation effect.

Reliability

• Means "repeatability" or "consistency". • A measure is considered reliable if it would

give us the same result over and over again (assuming that what we are measuring isn't changing!).

• There are four general classes of reliability estimates, each of which estimates reliability in a different way.

Reliability (continued)

• Inter-Rater or Inter-Observer Reliability

• Test-Retest Reliability

• Parallel-Forms Reliability

• Internal Consistency Reliability

Inter-Rater or Inter-Observer Reliability

• Used to assess the degree to which different raters/observers give consistent estimates of the same phenomenon.

• Establish reliability on pilot data or a subsample of data and retest often throughout.

• For categorical data a χ² (chi-square) can be used, and for continuous data a correlation (r) can be calculated.
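A minimal sketch of both cases using scipy (the ratings are hypothetical):

    import numpy as np
    from scipy import stats

    # Continuous ratings: correlate the two raters.
    rater1 = np.array([3.0, 4.5, 2.0, 5.0, 3.5, 4.0])
    rater2 = np.array([2.5, 4.0, 2.5, 5.0, 3.0, 4.5])
    r, p = stats.pearsonr(rater1, rater2)

    # Categorical ratings: chi-square on the raters' cross-tabulation.
    agreement = np.array([[12, 3],
                          [2, 13]])   # hypothetical yes/no judgments
    chi2, p2, dof, expected = stats.chi2_contingency(agreement)
    print(f"r = {r:.2f}, chi2 = {chi2:.2f}")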

Test-Retest Reliability

• Used to assess the consistency of a measure from one time to another.

• This approach assumes that there is no substantial change in the construct being measured between the two occasions.

• The amount of time allowed between measures is critical.

• The shorter the time gap, the higher the correlation; the longer the time gap, the lower the correlation

Parallel-Forms Reliability

• Used to assess the consistency of the results of two tests constructed in the same way from the same content domain.

• Create a large set of questions that address the same construct and then randomly divide the questions into two sets and administer both instruments to the same sample of people. 

• The correlation between the two parallel forms is the estimate of reliability. 

• One major problem with this approach is that you have to be able to generate lots of items that reflect the same construct. 

Parallel-Forms and Split-Half Reliability

• The parallel forms approach is very similar to the split-half reliability described below. 

• The major difference is that parallel forms are constructed so that the two forms can be used independent of each other and considered equivalent measures. 

• With split-half reliability we have an instrument that we wish to use as a single measurement instrument and only develop randomly split halves for purposes of estimating reliability.
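A sketch of the split-half estimate with the Spearman-Brown step-up (the correction compensates for each half being only half the instrument's length; the random split and the data layout are my assumptions):

    import numpy as np

    def split_half_reliability(scores, seed=0):
        # scores: subjects x items matrix of item responses
        rng = np.random.default_rng(seed)
        n_items = scores.shape[1]
        idx = rng.permutation(n_items)            # random split of the items
        half1 = scores[:, idx[: n_items // 2]].sum(axis=1)
        half2 = scores[:, idx[n_items // 2 :]].sum(axis=1)
        r = np.corrcoef(half1, half2)[0, 1]       # correlate the half scores
        return 2 * r / (1 + r)                    # Spearman-Brown step-up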

Internal Consistency Reliability

• Used to assess the consistency of results across items within a test.

• In effect we judge the reliability of the instrument by estimating how well the items that reflect the same construct yield similar results. 

• We are looking at how consistent the results are for different items for the same construct within the measure.

Kinds of Internal Reliability

• Average Inter-item Correlation

• Average Item-Total Correlation

• Split-Half Reliability

• Cronbach's Alpha (α) (see the sketch below)
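A minimal Cronbach's alpha sketch from the usual variance formula, alpha = (k / (k - 1)) * (1 - sum of item variances / variance of total scores), assuming a subjects-by-items score matrix:

    import numpy as np

    def cronbach_alpha(scores):
        # scores: subjects x items matrix
        k = scores.shape[1]
        item_vars = scores.var(axis=0, ddof=1).sum()   # sum of item variances
        total_var = scores.sum(axis=1).var(ddof=1)     # variance of total scores
        return (k / (k - 1)) * (1 - item_vars / total_var)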

Pragmatics

Gricean Maxims:

Quality: The speaker is assumed to tell the truth.

Quantity: Speakers won't burden hearers with already-known info; obvious inferences will be made.

Relation: The speaker will only talk about things relevant to the interaction.

Manner: Speakers will be brief, orderly, clear, and unambiguous.

Pragmatics

Examples of where this breaks down:

Piagetian conservation tasks

Representativeness: The Linda Problem

Dilution effect: nondiagnostic information

Implanted Memories: cooperative vs. adversarial sources

Mutual Exclusivity

Pragmatics

Examples of where this breaks down:

Framing effects

Inconsistent responses due to pragmatics: the part-whole problem

Conventional implicatures: “all” vs. “each and every”

Manipulation Checks

Have them.

Lots of them.

Validity and Reliability

Graduate Methods

Becky Ray

Winter, 2003

For further reference see:

http://trochim.human.cornell.edu/kb/