
Unit 5a: Survival Analysis: Questions about Whether and When

© Andrew Ho, Harvard Graduate School of Education Unit 5a– Slide 1

http://xkcd.com/931/


• Research questions addressed by survival analysis: Whether + When
• Contrasting 2 Data Formats: Person vs. Person-Period
• Life Table Analysis: Hazard Probability vs. Survival Probability

© Andrew Ho, Harvard Graduate School of Education Unit 5a– Slide 2

Multiple Regression Analysis (MRA): $Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \varepsilon_i$

Do your residuals meet the required assumptions?
• Test for residual normality.
• Use influence statistics to detect atypical datapoints.

If your residuals are not independent, replace OLS by GLS regression analysis.
• Use individual growth modeling.
• Specify a multilevel model.

Today’s Topic Area: If time is a predictor, you need discrete-time survival analysis …

If your outcome is categorical, you need to use …
• Binomial logistic regression analysis (dichotomous outcome).
• Multinomial logistic regression analysis (polytomous outcome).

If you have more predictors than you can deal with,
• Create taxonomies of fitted models and compare them.
• Form composites of the indicators of any common construct.
• Conduct a Principal Components Analysis.
• Use Cluster Analysis.
• Use Factor Analysis: EFA or CFA?

If your outcome vs. predictor relationship is non-linear,
• Use non-linear regression analysis.
• Transform the outcome or predictor.

Course Roadmap: Unit 5a


© Andrew Ho, Harvard Graduate School of Education Unit 5a– Slide 3

The “Whether and When” Test

You need survival analysis if your research questions ask “Whether” and “When” a critical event occurs.

Time-to-Relapse Among Treated Alcoholics
Cooney, et al. (1991).

Research Questions:
• Whether, and if so when, rehabilitated alcoholics relapse to drinking?
• Which treatment regimens are more effective in preventing relapse?

• 89 post-treatment alcoholics, randomized to either a “coping skills” or an “interaction skills” follow-up treatment.
• Prospective data collection for 2 years.
• During follow-up, 57 patients relapsed to alcoholism, 28 remained abstinent, and 4 disappeared after remaining abstinent for a short time.
• Time-to-relapse depended on the type of follow-up program and the psychopathology of the patient.

Age at 1st Suicide Ideation For Adolescents
Bolger, et al. (1989).

Research Questions:
• Whether, and if so when, an adolescent 1st considers suicide?
• Does occurrence of suicide ideation differ by gender and developmental phase?

• 391 undergraduates, aged 16 through 22.
• Retrospective data collection, through current age.
• At interview, 275 respondents had considered suicide, 116 had not.
• Time-to-first-suicide-ideation: greatest risk in middle adolescence; higher among females; higher in adolescents w/ absent parents; Race by Age interaction.

Research questions addressed by Survival Analysis


© Andrew Ho, Harvard Graduate School of Education Unit 5a– Slide 4

Classical Methods of Survival Analysis (Today)

Simple data-analytic approaches for summarizing survival data appropriately:
• Estimation of the sample hazard function.
• Estimation of the sample survivor function.
• Estimation of the median lifetime.

Simple tests of differences in survivor functions by “group”:
• Survival analytic equivalent of the t-test.

Discrete-Time Survival Analysis (Next 2-3 class meetings)

Replicates classical methods of survival analysis, using logistic regression analysis.

Extends classical survival analytic methods by making a regression format available:
• Can include multiple predictors, including interactions.
• Provides single parameter and GLH testing, using the –2LL statistic.
• Fitted hazard functions, survivor functions & median lifetimes can be recovered from the fitted logistic regression model.

Continuous-Time Survival Analysis (Time Permitting)

Replaces discrete-time survival analysis when time has been measured continuously.

Imposes additional assumptions on the data.

Extends classical survival analytic methods by making a regression format available:
• Can include predictors, including interactions.
• Has its own testing procedures, based on standard practices.
• Fitted hazard functions, survivor functions & median lifetimes are easily recovered from the fitted models.

Analytic Approaches to Survival Analysis


© Andrew Ho, Harvard Graduate School of Education Unit 5a– Slide 5

Dataset: SPEC_ED.txt

Overview: Discrete-time person-level dataset on the career duration of special education teachers who began their teaching careers in the Michigan public schools between 1972 and 1978, and who were followed uninterruptedly until 1985.

Source: State Department of Education, Michigan.

Sample size: 3941 teachers.

More Info: Singer & Willett, 2003

Note on labeling of discrete-time “bins.” We regarded a teacher’s physical first year as their zeroth year, a year in which they must have taught in order to be a part of the study. If they quit sometime during the following year, they were classified as having taught for one year and having quit in “bin one.”

Important Distinction You Must Keep In Mind

The two “modern” approaches to survival analysis are distinct in the way that duration must be measured:
• In Discrete-time Survival Analysis, time is measured in discrete units, such as semesters, years, etc.
• In Continuous-time Survival Analysis, time can be measured to any level of precision.

Research Question

Whether, and if so when, do special education teachers in Michigan leave the teaching profession for the first time?

“Multiple Cohort” Sample Design

Multiple annual cohorts of special education teachers are pooled together in the sample:
• Cohorts entered the sample sequentially between the 1972/3 and 1978/9 school years.*
• All cohorts were followed until the end of the 1984/5 school year (i.e., in June 1985).

72 |--|--|--|--|--|--|--|--|--|--|--|--|--| 85
73 |--|--|--|--|--|--|--|--|--|--|--|--| 85
74 |--|--|--|--|--|--|--|--|--|--|--| 85
75 |--|--|--|--|--|--|--|--|--|--| 85
76 |--|--|--|--|--|--|--|--|--| 85
77 |--|--|--|--|--|--|--|--| 85
78 |--|--|--|--|--|--|--| 85

The SPEC_ED Dataset


© Andrew Ho, Harvard Graduate School of Education Unit 5a– Slide 6

The dataset is straightforward, containing Teacher IDs and length of service, with one small hitch …

Structure of Dataset

Col# | Var Name | Variable Description | Variable Metric/Labels
1 | ID | Teacher identification code. | Integer
2 | YRSTCH | # of years the teacher remained in teaching, to first quit, or until the teacher was censored in 1985 by the end of the study. | Integer
3 | CENSOR | Dummy variable to indicate whether a teacher’s career was censored by the end of data collection in 1985. | Dichotomous variable: 0 = not censored, 1 = censored.

A problem intrinsic to survival data is illustrated here:
• The event of interest is “quitting teaching for the first time.”
• But not every teacher experiences this event while being observed by researchers.
• We say that these teachers are “censored” by the end of the data collection.
• We call this “right censoring” because the YRSTCH range is cut off on the right (positive) side. The actual value (if we had waited) would be higher.

Key Idea: The presence of the censored cases is telling you something about the probability that the time-to-event is longer than the period of observation.

If you want an unbiased estimate of time-to-event, you cannot ignore the censored cases, but must find a way to include them in the analysis so that they can contribute whatever information they contain.

Why Is Censoring A Problem For Data Analysis?

… because if censoring occurs, we don’t know the time-to-event for the people in the sample who may have the longest times-to-event.

Dataset variables and the issue of “Censoring”


© Andrew Ho, Harvard Graduate School of Education Unit 5a– Slide 7

Bearing this in mind, let’s explore the special educator data in the Stata do-file, Unit5a.do …

Standard data-input and labeling statements:

*---------------------------------------------------------------------------
* Input the raw dataset, name and label the variables and selected values.
*---------------------------------------------------------------------------
* Input the target dataset:
  infile ID YRSTCH CENSOR ///
      using "C:\My Documents\My Course Stuff\S052\Data\Datasets\SPEC_ED.txt"

* Label the variables:
  label variable ID "Teacher Identification Code"
  label variable YRSTCH "Number of Years in Teaching"
  label variable CENSOR "Was Teaching Career Censored?"

* Label the values of important categorical variables:
* Dichotomous censoring variable CENSOR:
  label define censorlbl 0 "Not Censored" 1 "Censored"
  label values CENSOR censorlbl

Print out the data on the first 40 teachers in the dataset for inspection …

*----------------------------------------------------------------------------
* Examining the data, for the first 40 cases.
*----------------------------------------------------------------------------
  list ID YRSTCH CENSOR in 1/40, clean

     +----------------------------+
     | ID   YRSTCH         CENSOR |
     |----------------------------|
  1. |  1        1   Not Censored |
  2. |  2        2   Not Censored |
  3. |  3        1   Not Censored |
  4. |  4        1   Not Censored |
  5. |  5       12       Censored |
     |----------------------------|
  6. |  6        1   Not Censored |
  7. |  7       12       Censored |
  8. |  8        1   Not Censored |
  9. |  9        2   Not Censored |
 10. | 10        2   Not Censored |
     |----------------------------|
 11. | 12        7   Not Censored |
 12. | 13       12       Censored |
 13. | 14        1   Not Censored |
 14. | 15       12       Censored |
 15. | 16       12       Censored |
     |----------------------------|
 16. | 17        2   Not Censored |
 17. | 18       12       Censored |
 18. | 19        1   Not Censored |
 19. | 20        3   Not Censored |
 …
 37. | 38        1   Not Censored |
 38. | 39        3   Not Censored |
 39. | 40       12       Censored |
 40. | 41        6   Not Censored |
     +----------------------------+

The “Person-Level” Dataset


© Andrew Ho, Harvard Graduate School of Education Unit 5a– Slide 8

The data listing (with cases omitted to save space) shows what is known as a person-level dataset: it contains one row of event-history data per teacher.

Teacher #2 was in the dataset for 2 years and was not censored.
• S/he experienced the event of interest in the second year.
• That is, s/he quit teaching for the first time sometime during the second year.

Teacher #5 was in the dataset for 12 years and was censored.
• S/he outlasted the data collection.
• S/he taught for at least 12 years, and possibly more.

We tend to be drawn to dangerous analyses with this dataset structure!!!

The “Person-Level” dataset encourages dangerous analyses…


© Andrew Ho, Harvard Graduate School of Education Unit 5a– Slide 9

Career Length | Uncensored Cases | Censored Cases
1 | 456 | 0
2 | 384 | 0
3 | 359 | 0
4 | 295 | 0
5 | 218 | 0
6 | 184 | 0
7 | 123 | 280
8 | 79 | 307
9 | 53 | 255
10 | 35 | 265
11 | 16 | 241
12 | 5 | 386

[Figure: Frequency of Teachers with Careers of Different Lengths. Bar chart of # of teachers (0 to 500) by career length (1 to 12), with separate bars for uncensored and censored cases.]

One sensible thing you can do in such datasets is display the frequency with which each career length occurs, in a vertical histogram that includes all the teachers in the sample, both censored and uncensored.

Note the impact of the multi-cohort research design: any teacher who began teaching after 1978 and taught longer than six years is a censored case.

[Figure: histogram of Frequency (0 to 500) by Number of Years in Teaching (0 to 12), with separate bars for Uncensored and Censored teachers.]

Comparing Uncensored and Censored Cases


© Andrew Ho, Harvard Graduate School of Education Unit 5a– Slide 10


Here are two misleading, but common, strategies for trying to summarize teachers’ career length while trying to deal with censoring …

First Misleading Approach

If you take the average of the career lengths of only the uncensored teachers, their sample mean teaching career is 3.73 years, a negatively biased estimate of the average population teaching career length.

Second Misleading Approach

If you set the career lengths of the censored teachers to their longest observed career length, then the sample mean teaching career length is 6.31 years. This, too, is a negatively biased estimate of population career length, even if only one teacher has lasted longer than the censored duration.
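As a check on the two figures quoted for these approaches, here is a minimal Python sketch (not part of the course's Stata do-file) that recomputes both naive means from the frequency table of career lengths. The counts are the ones shown in the slides' table; the variable names are illustrative assumptions.

```python
# Frequency table from the slides: career length -> (uncensored, censored) counts.
# Variable names are illustrative; they do not come from Unit5a.do.
counts = {
    1: (456, 0), 2: (384, 0), 3: (359, 0), 4: (295, 0),
    5: (218, 0), 6: (184, 0), 7: (123, 280), 8: (79, 307),
    9: (53, 255), 10: (35, 265), 11: (16, 241), 12: (5, 386),
}

# First misleading approach: average only the uncensored careers.
n_uncensored = sum(u for u, _ in counts.values())
mean_uncensored = sum(t * u for t, (u, _) in counts.items()) / n_uncensored

# Second misleading approach: treat each censored career as if it ended
# at its last observed year, then average over every teacher.
n_all = sum(u + c for u, c in counts.values())
mean_all = sum(t * (u + c) for t, (u, c) in counts.items()) / n_all

print(f"uncensored only: n = {n_uncensored}, mean = {mean_uncensored:.2f}")
print(f"all teachers:    n = {n_all}, mean = {mean_all:.2f}")
```

Neither mean uses the information that a censored teacher's career continued past the last observed year, which is why both understate the typical career length.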

[Figure: histogram of Frequency (0 to 500) by Number of Years in Teaching (0 to 12), Uncensored vs. Censored.]

Bias imparted when ignoring censoring


© Andrew Ho, Harvard Graduate School of Education Unit 5a– Slide 11

Dataset: SPEC_ED_PP.txt

Overview: Person-period dataset containing the same information as the SPEC_ED.txt person dataset, on the career duration of special education teachers who began their teaching careers in the Michigan public schools between 1972 and 1978, and who were followed uninterruptedly until 1985.

Source: State Department of Education, Michigan.

Sample size: 24875 annual person-period records.

More Info: Singer & Willett, 2003

It is easier to appreciate these data when they are reformatted into a person-period format. In a person-period dataset, you can gain a better understanding of a class of summary statistics that address the “whether” and “when” questions:
• Hazard probability: probability of failure at time $t$, conditional upon survival to that time point.
• Survival probability: probability of surviving beyond time $t$.
• Median lifetime: lifetime above which half of the persons survive.
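In symbols, the three summaries can be sketched as follows. The notation ($T$ for the period in which a person experiences the event, $t_j$ for the $j$-th discrete period, $h$ and $S$ for hazard and survival) is an assumption made here, following common discrete-time conventions; it is not taken from the slide.

```latex
% T is the discrete period in which a person experiences the event.
h(t_j) = \Pr(T = t_j \mid T \ge t_j)                            % hazard probability
S(t_j) = \Pr(T > t_j) = \prod_{k=1}^{j} \bigl(1 - h(t_k)\bigr)  % survival probability
\text{median lifetime} = \text{the time } t \text{ at which } S(t) = 0.5
```

The product form says that surviving beyond period $j$ requires not experiencing the event in each of periods 1 through $j$.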

Notice that the name of the dataset is different.

Here’s a clue to the difference between the person-level and the person-period dataset… There is a row for every person-period combination in the data.

The Person-Period Dataset

To convert from one to the other, use the dthaz library. Type “net install dthaz.pkg” or type “findit prsnperd”. The library was created by a former Ph.D. student at our School of Public Health, Alexis Dinno (now an Assistant Professor at Portland State).


© Andrew Ho, Harvard Graduate School of Education Unit 5a– Slide 12

In a person-period dataset, each person has one row of data for each discrete time-period, each containing …

Col | Var | Variable Description | Labels
1 | ID | Teacher identification code. | Integer
2 | PERIOD | Records the discrete time period to which each record refers. | Integer
3 | EVENT | Dummy variable indicating whether the teacher experienced the event of interest in this period. | 0 = no; 1 = yes
4 | P1 | |
5 | P2 | |
6 | P3 | |
Etc.

The earlier YRSTCH variable, which recorded the duration of the teaching career in the person-level dataset, has been replaced by the variable PERIOD, which labels the time-period to which each row of the person-period dataset refers.

We’ve also acquired a new variable called EVENT, which records whether a teacher experienced the event of interest (“Quit Teaching For The 1st Time”) in the particular discrete time-period in question.

The person-period dataset contains other variables too (P1, P2, P3, etc.), which are labeled and explained in the corresponding rows of the codebook. We ignore them here, but will return to them later during the presentation on discrete-time survival analysis.

The Person-Period Data Structure


© Andrew Ho, Harvard Graduate School of Education Unit 5a– Slide 13

In Unit5a.do, I input the special educator person-period dataset and list the data, including estimation of a life table …

Standard data input statements, reading in the ID, PERIOD and EVENT variables and the mystery variables, P1 through P12, that we will return to later during our discrete-time survival-analysis presentation:

*-----------------------------------------------------------------------------
* Input the person-period dataset
*-----------------------------------------------------------------------------
* Input the dataset:
  infile ID PERIOD EVENT P1-P12 ///
      using "C:\My Documents\My Course Stuff\S052\Data\Datasets\SPEC_ED_PP.txt"

* Label the variables:
  label variable ID "Teacher Identification Code"
  label variable PERIOD "Current Time Period"
  label variable EVENT "Did Teacher Quit in this Time Period"

* Label the values of important categorical variables:
* Dichotomous event occurrence variable EVENT:
  label define eventlbl 0 "No Quit" 1 "Quit"
  label values EVENT eventlbl

Print out the first 40 cases for inspection:

*------------------------------------------------------------------------------
* Inspect the structure of the new person-period dataset.
* Notice that there is one row per discrete time-period for each person.
*------------------------------------------------------------------------------
  list ID PERIOD EVENT in 1/40

Carry out a Life Table Analysis: tabulate the frequencies of EVENT by PERIOD. Kill the row & total percentage computation, but retain the estimation of percentages in the columns defined by PERIOD:

*------------------------------------------------------------------------------
* Carry out the life-table analysis, by classical contingency table analysis.
*------------------------------------------------------------------------------
  tabulate EVENT PERIOD, column

Reading in the Person-Period Dataset


Person-Level Dataset

     +----------------------------+
     | ID   YRSTCH         CENSOR |
     |----------------------------|
  1. |  1        1   Not Censored |
  2. |  2        2   Not Censored |
  3. |  3        1   Not Censored |
  4. |  4        1   Not Censored |
  5. |  5       12       Censored |
     |----------------------------|
  6. |  6        1   Not Censored |
  7. |  7       12       Censored |
  8. |  8        1   Not Censored |
  9. |  9        2   Not Censored |
 10. | 10        2   Not Censored |
     |----------------------------|
 11. | 12        7   Not Censored |
 12. | 13       12       Censored |
 13. | 14        1   Not Censored |
 14. | 15       12       Censored |
 …
 37. | 38        1   Not Censored |
 38. | 39        3   Not Censored |
 39. | 40       12       Censored |
 40. | 41        6   Not Censored |
     +----------------------------+

© Andrew Ho, Harvard Graduate School of Education Unit 5a– Slide 14

Person-Period Dataset

     +-----------------------+
     | ID   PERIOD     EVENT |
     |-----------------------|
  1. |  1        1      Quit |
  2. |  2        1   No Quit |
  3. |  2        2      Quit |
  4. |  3        1      Quit |
  5. |  4        1      Quit |
     |-----------------------|
  6. |  5        1   No Quit |
  7. |  5        2   No Quit |
  8. |  5        3   No Quit |
  9. |  5        4   No Quit |
 10. |  5        5   No Quit |
     |-----------------------|
 11. |  5        6   No Quit |
 12. |  5        7   No Quit |
 13. |  5        8   No Quit |
 14. |  5        9   No Quit |
 15. |  5       10   No Quit |
     |-----------------------|
 16. |  5       11   No Quit |
 17. |  5       12   No Quit |
 18. |  6        1      Quit |
 19. |  7        1   No Quit |
 20. |  7        2   No Quit |
     |-----------------------|
 21. |  7        3   No Quit |
 22. |  7        4   No Quit |
 23. |  7        5   No Quit |
 24. |  7        6   No Quit |
 25. |  7        7   No Quit |
     |-----------------------|
 26. |  7        8   No Quit |
 27. |  7        9   No Quit |
 …

In a person-period dataset:
• Each person contributes one row of data for each time-period.
• The data record continues until the time-period in which they either experience the event of interest, or they are censored.

Teacher #2 is not censored and so s/he experiences the event of interest (i.e., quits teaching for the first time) in the 2nd year.

Teacher #5 is censored: s/he never experiences the event of interest (i.e., doesn’t quit teaching for the first time) in all the 12 years during which teachers are observed.
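The expansion from person-level to person-period rows can be sketched in a few lines of Python. This is an illustration of the idea only, under the coding shown in the listings (CENSOR: 0 = not censored, 1 = censored; EVENT: 0 = no quit, 1 = quit); it is not the implementation of the Stata prsnperd command, and the function name `to_person_period` is hypothetical.

```python
def to_person_period(people):
    """Expand person-level rows (id, yrstch, censor) into person-period rows.

    Each person contributes one (id, period, event) row per period observed.
    EVENT is 1 only in the final period of an uncensored person.
    """
    rows = []
    for person_id, yrstch, censor in people:
        for period in range(1, yrstch + 1):
            quit_now = (period == yrstch) and (censor == 0)
            rows.append((person_id, period, int(quit_now)))
    return rows


# First five teachers from the person-level listing: (ID, YRSTCH, CENSOR).
person_level = [(1, 1, 0), (2, 2, 0), (3, 1, 0), (4, 1, 0), (5, 12, 1)]
person_period = to_person_period(person_level)

# Show the first few person-period rows as (ID, PERIOD, EVENT).
for row in person_period[:6]:
    print(row)
```

Teacher #2 expands to two rows, with the event recorded in the second; teacher #5 expands to twelve rows, none of which records an event.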

Person-Level vs. Person-Period Datasets


. tabulate EVENT PERIOD, column

Key: frequency, with column percentage beneath

EVENT (0: Event did not happen; 1: Event happened) by PERIOD

EVENT | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | Total
No Quit | 3,485 | 3,101 | 2,742 | 2,447 | 2,229 | 2,045 | 1,922 | 1,563 | 1,203 | 913 | 632 | 386 | 22,668
 | 88.43 | 88.98 | 88.42 | 89.24 | 91.09 | 91.75 | 93.99 | 95.19 | 95.78 | 96.31 | 97.53 | 98.72 | 91.13
Quit | 456 | 384 | 359 | 295 | 218 | 184 | 123 | 79 | 53 | 35 | 16 | 5 | 2,207
 | 11.57 | 11.02 | 11.58 | 10.76 | 8.91 | 8.25 | 6.01 | 4.81 | 4.22 | 3.69 | 2.47 | 1.28 | 8.87
Total | 3,941 | 3,485 | 3,101 | 2,742 | 2,447 | 2,229 | 2,045 | 1,642 | 1,256 | 948 | 648 | 391 | 24,875
 | 100.00 | 100.00 | 100.00 | 100.00 | 100.00 | 100.00 | 100.00 | 100.00 | 100.00 | 100.00 | 100.00 | 100.00 | 100.00

© Andrew Ho, Harvard Graduate School of Education Unit 5a– Slide 15

Here’s the Life Table: it’s a Two-Way Contingency Table Analysis of EVENT by PERIOD …

Use frequencies to estimate a hazard probability describing “risk of quitting teaching for the 1st time” in each time-period, given that the teacher survived earlier periods.

Hazard probability is the (conditional) probability that a teacher will experience the event of interest (i.e., quit teaching for the first time) in a particular time-period, given that s/he has “survived” up until this period.

In Discrete Time Period #1, for instance:
• There are 3941 teachers “at risk of quitting for the first time.”
• Of this “risk set,” 456 were observed to quit for the first time.
• Hence, the probability that a teacher will quit for the first time in this period (given that she entered it) is (456/3941), or 0.1157.
• So, the sample hazard probability in Discrete Time-Period #1 is $\hat{h}(t_1) = 0.1157$.
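The same division can be repeated for every period. The Python sketch below (an illustration, not part of Unit5a.do) takes the risk-set sizes and quit counts from the Total and Quit rows of the tabulate output and computes the sample hazard probability for each period; the names `life_table` and `hazard` are assumptions.

```python
# Life-table counts from the tabulate output: period -> (risk set, number who quit).
life_table = {
    1: (3941, 456), 2: (3485, 384), 3: (3101, 359), 4: (2742, 295),
    5: (2447, 218), 6: (2229, 184), 7: (2045, 123), 8: (1642, 79),
    9: (1256, 53), 10: (948, 35), 11: (648, 16), 12: (391, 5),
}

# Sample hazard: proportion of the period's risk set that quit in that period.
hazard = {t: quits / at_risk for t, (at_risk, quits) in life_table.items()}

for t, h in hazard.items():
    print(f"period {t:2d}: hazard = {h:.4f}")
```

Each value should agree with the column percentage in the Quit row of the table, expressed as a proportion.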


Life Tables: At Each Time Point, for People Who Survived

Page 16: Unit 5a: Survival Analysis: Questions about Whether and When © Andrew Ho, Harvard Graduate School of EducationUnit 5a– Slide 1


Here’s the sample hazard probability for discrete time-period #2 …

Sample hazard probability (or “risk”) in discrete time-period #2: 3485 teachers survive time-period #1 and enter the risk set for time-period #2. Of these, 384 quit for the first time. Hence, the risk that a teacher will quit for the first time in time-period #2, given that she survived to that point, is 384/3485, or 0.1102. So, the sample hazard probability in discrete time-period #2 is 11.02%.

How did we get the 3485? The survivors at the target time point are the survivors from the previous time point minus the “quitters”: 3941 - 456 = 3485. For now…
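Written as a formula, the calculation looks like this. The symbols $n_j$ (size of the risk set in period $j$) and $d_j$ (number who quit in period $j$) are introduced here for illustration only and do not appear on the slides.

```latex
\hat{h}(t_j) = \frac{d_j}{n_j},
\qquad
\hat{h}(t_2) = \frac{384}{3485} \approx 0.1102,
\qquad
n_{j+1} = n_j - d_j \quad \text{(as long as no one is censored after period } j\text{)}
```

For example, $n_3 = 3485 - 384 = 3101$, which is the risk set for time-period #3.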

Hazard Probability: For each Time Point, the Probability of “Fail”

Page 17: Unit 5a: Survival Analysis: Questions about Whether and When © Andrew Ho, Harvard Graduate School of EducationUnit 5a– Slide 1


Here’s the sample hazard probability for discrete time-period #3 …

Sample hazard probability (or “risk”) in discrete time-period #3: 3101 teachers survive time-period #2 and enter the risk set for time-period #3. Of these, 359 quit for the first time. Hence, the risk that a teacher will quit for the first time in time-period #3, given that she survived to that point, is 359/3101, or 0.1158. So, the sample hazard probability in discrete time-period #3 is 11.58%.

How did we get the 3101? The survivors at the target time point are still the survivors from the previous time point minus the “quitters”: 3485 - 384 = 3101. For now…

Hazard Probability: For each Time Point, the Probability of “Fail”

Page 18: Unit 5a: Survival Analysis: Questions about Whether and When © Andrew Ho, Harvard Graduate School of EducationUnit 5a– Slide 1


Here’s the sample hazard probability for discrete time-period #11 …

Sample hazard probability (or “risk”) in discrete time-period #11: 913 teachers survive time-period #10… BUT ONLY 648 enter the risk set for time-period #11. Of these, 16 quit for the first time. Hence, the risk that a teacher will quit for the first time in time-period #11, given that she survived to that point, is 16/648, or 0.0247. So, the sample hazard probability in discrete time-period #11 is 2.47%.

Where did the other 265 teachers go? They were censored at the end of time-period #10. They had not quit by then, but we stopped observing them, so although they survived into time-period #11 we do not know whether they quit during it. We therefore leave them out of the risk set for time-period #11 so that they do not bias the hazard estimate there. However, they still contribute to the hazard estimates for every period before #11!
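The gap between the survivors of one period and the risk set of the next can be checked for every period. This is a hedged sketch, assuming the counts from the life table and assuming that everyone who leaves the risk set without quitting was censored; `survivors` and `censored` are illustrative names.

```python
risk_set = [3941, 3485, 3101, 2742, 2447, 2229, 2045, 1642, 1256, 948, 648, 391]
quits = [456, 384, 359, 295, 218, 184, 123, 79, 53, 35, 16, 5]

# Teachers still teaching at the end of each period.
survivors = [n - q for n, q in zip(risk_set, quits)]

# Censored after period j = survivors of period j minus the risk set of period j + 1.
censored = [s - n_next for s, n_next in zip(survivors, risk_set[1:])]

for period, c in enumerate(censored, start=1):
    print(f"after period {period:2d}: {c} censored")
```

No one is censored through period 6, which is why "survivors minus quitters" worked for the early periods.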

Hazard Probability: For each Time Point, the Probability of “Fail”

Page 19: Unit 5a: Survival Analysis: Questions about Whether and When © Andrew Ho, Harvard Graduate School of EducationUnit 5a– Slide 1


Time     # Teachers in the risk set   # Teachers who quit     Sample hazard
period   in this time period          in this time period     probability
   1          3941                         456                   0.1157
   2          3485                         384                   0.1102
   3          3101                         359                   0.1158
   4          2742                         295                   0.1076
   5          2447                         218                   0.0891
   6          2229                         184                   0.0825
   7          2045                         123                   0.0601
   8          1642                          79                   0.0481
   9          1256                          53                   0.0422
  10           948                          35                   0.0369
  11           648                          16                   0.0247
  12           391                           5                   0.0128

[Figure: sample hazard function. Hazard Probability (vertical axis, 0.00 to 0.14) plotted against Year in Teaching Career (horizontal axis, 1 to 12).]

Collect the sample hazard probabilities together and plot them as a sample hazard function …

The Hazard Function

Page 20: Unit 5a: Survival Analysis: Questions about Whether and When © Andrew Ho, Harvard Graduate School of EducationUnit 5a– Slide 1


Time     Sample Hazard        Sample Survival
Period   Probability h(t)     Probability S(t)
   0                             1.0000
   1        0.1157               0.8843
   2        0.1102               0.7869
   3        0.1158               0.6958
   4        0.1076               0.6209
   5        0.0891               0.5656
   6        0.0825               0.5189
   7        0.0601               0.4877
   8        0.0481               0.4642
   9        0.0422               0.4446
  10        0.0369               0.4282
  11        0.0247               0.4177
  12        0.0128               0.4123

Once you have the sample hazard probabilities, you can cumulate them to get sample survival probabilities …

Sample Survival Probability

Survival probability in any time period is the probability of “surviving” beyond that period (i.e., the probability of not experiencing the event of interest until after the period).

Here, all teachers survived the 0th time period, so the estimated sample survival probability in the 0th period is 1.000.

The estimated hazard probability suggests that a proportion of 0.1157 of teachers in the 1st period risk set will “die” in the 1st period (i.e., quit teaching).

Because a proportion of 0.1157 of the risk set will “die” in the 1st period, we know that (1 - 0.1157) or 0.8843 of the 1st period risk set will survive.

In other words, 0.8843 of the entering “1.0000” will remain “alive” beyond the 1st time-period (and will therefore be potentially available to quit teaching for the first time at some later period).

The sample survival probability in the 1st time period is therefore 0.8843 × 1.0000, or:

$\hat{S}(t_1) = 0.8843$

The Survival Probability

Page 21: Unit 5a: Survival Analysis: Questions about Whether and When © Andrew Ho, Harvard Graduate School of EducationUnit 5a– Slide 1


And, the estimated survival probability in discrete time period #2…

Here, according to the estimated sample survival probability, a proportion of 0.8843 of the teachers survived the 1st time period.

Estimated hazard probability suggests that a proportion of 0.1102 of teachers in the 2nd period risk set will “die” in the 2nd period (i.e., quit teaching for the first time).

Because a proportion of 0.1102 of the risk set will “die” in the 2nd period, we know that (1 - 0.1102) -- or 0.8898 -- of the 2nd period risk set will survive.

In other words, a proportion of 0.8898 of the entering “0.8843” will remain “alive” beyond the 2nd time period (and be potentially available to quit teaching for the first time, later).

Sample survival probability in the 2nd time period is therefore 0.8898 × 0.8843, or:

$\hat{S}(t_2) = 0.7869$

The Survival Probability

Page 22: Unit 5a: Survival Analysis: Questions about Whether and When © Andrew Ho, Harvard Graduate School of EducationUnit 5a– Slide 1


And, the estimated survival probability in discrete time period #3 … etc

Here, according to the estimated sample survival probability, a proportion of 0.7869 of the teachers survived the 2nd time period.

The estimated hazard probability suggests that a proportion of 0.1158 of teachers in the 3rd period risk set will “die” in the 3rd period (i.e., quit teaching for the first time).

Because a proportion of 0.1158 of the risk set will “die” in the 3rd period, we know that (1 - 0.1158), or 0.8842, of the 3rd period risk set will survive.

In other words, a proportion of 0.8842 of the entering “0.7869” will remain “alive” beyond the 3rd time period (and be potentially available to quit teaching for the first time, later).

The sample survival probability in the 3rd time period is therefore 0.8842 × 0.7869, or:

$\hat{S}(t_3) = 0.6958$
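Unrolling the three steps so far shows what the cumulation is doing. The sketch below simply substitutes the life-table counts for the first three periods, in which no one was censored.

```latex
\hat{S}(t_3)
  = \left(1 - \tfrac{456}{3941}\right)\left(1 - \tfrac{384}{3485}\right)\left(1 - \tfrac{359}{3101}\right)
  = \frac{3485}{3941}\cdot\frac{3101}{3485}\cdot\frac{2742}{3101}
  = \frac{2742}{3941}
  \approx 0.6958
```

While there is no censoring, the product telescopes to the fraction of the original 3941 teachers who are still teaching. Once censoring begins, the shortcut no longer applies and the period-by-period product is needed.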

The Survival Probability

Page 23: Unit 5a: Survival Analysis: Questions about Whether and When © Andrew Ho, Harvard Graduate School of EducationUnit 5a– Slide 1


Time          Sample Hazard         Sample Survival
Period        Probability h(t)      Probability S(t)
$t_{j-1}$                           $\hat{S}(t_{j-1})$
$t_j$         $\hat{h}(t_j)$        $\hat{S}(t_j)$

Thus, as a general principle, the estimated survivor probability in any time period j can be found by substituting into a simple little rule …

So, in general, in any time period j:

$\hat{S}(t_j) = [1 - \hat{h}(t_j)]\,\hat{S}(t_{j-1})$
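The rule can be applied period by period in a short loop. This is a minimal Python sketch, assuming the risk-set and quit counts from the life table; it works from the counts rather than the rounded hazards so that the results agree with the tabled survival probabilities to four decimals. The variable names are illustrative.

```python
risk_set = [3941, 3485, 3101, 2742, 2447, 2229, 2045, 1642, 1256, 948, 648, 391]
quits = [456, 384, 359, 295, 218, 184, 123, 79, 53, 35, 16, 5]

# Sample hazard in each period.
hazard = [q / n for q, n in zip(quits, risk_set)]

# S(t_0) = 1: everyone survives period 0.
survival = [1.0]
for h in hazard:
    # S(t_j) = [1 - h(t_j)] * S(t_{j-1})
    survival.append((1 - h) * survival[-1])

for period, s in enumerate(survival):
    print(f"period {period:2d}: survival = {s:.4f}")
```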

The Survival Probability – General Equation

Page 24: Unit 5a: Survival Analysis: Questions about Whether and When © Andrew Ho, Harvard Graduate School of EducationUnit 5a– Slide 1


[Figure: Sample Survivor Function. Sample Survivor Probability (vertical axis, 0.0 to 1.0) plotted against Year in Teaching (horizontal axis, 0 to 12).]

Plotting the sample survival probabilities against time period provides the sample survivor function.

Typical monotonically decreasing survivor function …

We can also use this to estimate the median survival time, by projecting across from 0.5 on the probability axis to the curve and then down to the Time axis. Here the curve crosses 0.5 between years 6 and 7.
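The graphical projection can be approximated numerically. The sketch below assumes straight-line interpolation between the two tabled survival probabilities that bracket 0.5, which is one common convention and not necessarily how the slide's plot was read. The function name `median_lifetime` is hypothetical.

```python
# Sample survival probabilities for periods 0 through 12, from the table.
survival = [1.0000, 0.8843, 0.7869, 0.6958, 0.6209, 0.5656, 0.5189,
            0.4877, 0.4642, 0.4446, 0.4282, 0.4177, 0.4123]

def median_lifetime(survival):
    """Linearly interpolate the time at which survival crosses 0.5."""
    for t in range(len(survival) - 1):
        s_now, s_next = survival[t], survival[t + 1]
        if s_now >= 0.5 > s_next:
            return t + (s_now - 0.5) / (s_now - s_next)
    return None  # survival never drops below 0.5 in the observed window

median = median_lifetime(survival)
print(f"estimated median lifetime: {median:.2f} years")
```

With the tabled values this puts the median at roughly 6.6 years, consistent with the crossing between years 6 and 7.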

The Survival Function