
Accuracy, Reliability, and Validity of FreeSurfer Measurements

David H. [email protected]

Why Talk About This?

• This is not meant to imply that everything is perfect in FreeSurfer processing; it is a sample of the types of procedures that we and others have used to provide information about what works and what doesn’t, and to enhance confidence in our results.

• The information here should be used as a guide for how to assess the data in your own projects.

What is Accuracy?

• Accuracy: the degree of closeness of a measured or calculated quantity to its actual (true) value (e.g. a physical property such as length or thickness)

• MRI measures are indirect. We may be able to measure morphometry accurately given the contrast of the MR image; however, measurements based on this contrast may differ from measurements of the actual tissue.

What is Reliability?

• Measures obtained for the same individual on two different days, close together in time to avoid a biological influence on the reliability measure
– Reliability of a labeling procedure in the same scan
– Reliability of the labeling procedure on two different scans
– Reliability of the labeling procedure on two different scans collected on two different scanners

• The reliability of an overall effect can be assessed by replication of the experiment in an independent sample.

• This is a general principle that applies to all types of data: structural, functional, cognitive, etc.
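
One way to put a number on test-retest reliability for a single measure is sketched below. The data are hypothetical mean cortical thickness values (mm) for five subjects scanned on two days. The function names and the choice of statistics (absolute percent difference and a one-way intraclass correlation) are illustrative assumptions, not a procedure prescribed by these slides.

```python
from statistics import mean

def abs_percent_difference(scan1, scan2):
    """Per-subject absolute difference, as a percentage of the two-scan mean."""
    return [100.0 * abs(a - b) / ((a + b) / 2.0) for a, b in zip(scan1, scan2)]

def icc_oneway(scan1, scan2):
    """One-way random-effects ICC(1,1) for two measurements per subject."""
    k = 2  # number of repeated measurements per subject
    pairs = list(zip(scan1, scan2))
    n = len(pairs)
    subject_means = [mean(p) for p in pairs]
    grand_mean = mean(subject_means)
    # Between-subject and within-subject mean squares from a one-way ANOVA.
    ms_between = k * sum((m - grand_mean) ** 2 for m in subject_means) / (n - 1)
    ms_within = sum(
        (x - m) ** 2 for p, m in zip(pairs, subject_means) for x in p
    ) / (n * (k - 1))
    return (ms_between - ms_within) / (ms_between + (k - 1) * ms_within)

# Hypothetical values for illustration only (mm).
day1 = [2.51, 2.43, 2.60, 2.38, 2.55]
day2 = [2.49, 2.45, 2.58, 2.40, 2.56]

print("Mean absolute percent difference:", mean(abs_percent_difference(day1, day2)))
print("ICC(1,1):", icc_oneway(day1, day2))
```

The same two functions could be applied to scans from two different scanners, which corresponds to the third sub-case in the list above.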

What is Validity?

• Validity: the extent to which an indirect measurement is representative of what it is supposed to measure.

• For example, in fMRI we use blood flow as an indirect measure of neural activity. Is this a valid measure of neural activity?

Validity Examples

• Internal validity: What is the strength of the overall experimental design, study sample size, analysis procedures, etc.?

• External validity: Would the effect measured generalize to another sample? (replication)

• Ecological validity: Can the results be applied in the real world outside of the experimental setting? (clinical application)

• Construct validity: Does the totality of evidence support the validity of a single measure? (do the data fit with what is known?)

• Face validity: Does the measure seem to be a good measure?

• Convergent validity: How well does the measure correlate with other types of measures that it should theoretically be correlated with? (do the data correlate with ‘gold standards’?)

• Discriminant validity: Is the measure not correlated with measures it should not be correlated with? (ICV/age)
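
Convergent and discriminant validity both come down to checking correlations, so a minimal sketch may help. All values below are hypothetical, and `pearson_r` is a helper written for this illustration. The "unrelated" variable stands in for whatever your own study says the measure should not track.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

# Hypothetical values for six subjects (illustration only).
automated_thickness = [2.10, 2.35, 2.50, 2.62, 2.80, 2.95]  # mm, automated
manual_thickness = [2.05, 2.40, 2.45, 2.70, 2.75, 3.00]     # mm, manual standard
unrelated_variable = [3, 1, 4, 1, 5, 2]  # assumed to be theoretically unrelated

# Convergent validity: expect a strong correlation with the manual standard.
convergent_r = pearson_r(automated_thickness, manual_thickness)
# Discriminant validity: expect a weak correlation with the unrelated variable.
discriminant_r = pearson_r(automated_thickness, unrelated_variable)

print("Convergent r:", convergent_r)
print("Discriminant r:", discriminant_r)
```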

One does not necessarily ensure the other

• A measure can be perfectly reliable (e.g. you get the exact same value every time) but not accurate or valid.

• We can measure morphometry very precisely, but the validity of this measure depends on the quality of the input data.

• If an experiment is not reliable, then it is likely inaccurate and invalid.

Types of Error

• Random error: Unknown and unpredictable changes in the measurement
– Should be unbiased
– Accuracy, reliability, and validity are all limited by error

• Systematic error: Predictable offset or scaling of data
– Typically comes from some aspect of the data acquisition/analysis
– Can be identified and corrected by analyzing standards that closely match the real sample (e.g. do you get the same values at 1.5T as at 3T?)
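
Because a systematic error is a predictable offset or scaling, it can be estimated by measuring the same standards under two conditions and fitting a straight line. The sketch below assumes a simple linear model, measured = scale × reference + offset, and uses simulated values. The function names and numbers are hypothetical and are not taken from any of the studies cited in these slides.

```python
def fit_scale_and_offset(reference, measured):
    """Ordinary least-squares fit of measured = scale * reference + offset."""
    n = len(reference)
    mean_ref = sum(reference) / n
    mean_meas = sum(measured) / n
    sxx = sum((r - mean_ref) ** 2 for r in reference)
    sxy = sum((r - mean_ref) * (m - mean_meas) for r, m in zip(reference, measured))
    scale = sxy / sxx
    offset = mean_meas - scale * mean_ref
    return scale, offset

def correct(measured, scale, offset):
    """Map measurements back onto the reference scale by inverting the fit."""
    return [(m - offset) / scale for m in measured]

# Simulated standards: the second condition scales values by 10% and adds 0.2.
reference = [2.0, 2.4, 2.8, 3.2]
measured = [1.1 * r + 0.2 for r in reference]

scale, offset = fit_scale_and_offset(reference, measured)
corrected = correct(measured, scale, offset)

print("Estimated scale:", scale)
print("Estimated offset:", offset)
print("Corrected values:", corrected)
```

With real data the residuals around the fitted line will not be zero. That leftover scatter is the random error, which this kind of correction cannot remove.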

How do poor reliability and validity affect your studies?

• Poor reliability increases variance across individuals and across timepoints.

• Validity is directly tied to interpretation. You may have a valid measure of ‘cortical thickness’, but ‘cortical thickness’ might not be a valid measure of degeneration
– E.g. normal variation, hydration

• Many studies would benefit from the ability to measure minute changes across time.
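
To see why added measurement variance matters in practice, the sketch below uses the standard normal-approximation formula for the number of subjects per group in a two-sample comparison of means. It assumes that measurement error is independent of the true values, so the two variances add. The effect size and standard deviations are hypothetical, and `subjects_per_group` is a name invented for this illustration.

```python
from math import ceil
from statistics import NormalDist

def subjects_per_group(effect, sd_true, sd_error, alpha=0.05, power=0.80):
    """Approximate n per group for a two-sided, two-sample comparison of means."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    # Assumption: independent measurement error adds to biological variance.
    total_variance = sd_true ** 2 + sd_error ** 2
    return ceil(2 * (z_alpha + z_power) ** 2 * total_variance / effect ** 2)

# Hypothetical group difference of 0.1 mm with 0.15 mm biological SD.
n_perfect = subjects_per_group(effect=0.1, sd_true=0.15, sd_error=0.0)
n_noisy = subjects_per_group(effect=0.1, sd_true=0.15, sd_error=0.15)

print("Subjects per group, perfectly reliable measure:", n_perfect)
print("Subjects per group, unreliable measure:", n_noisy)
```

Under these assumptions, measurement error as large as the biological variation roughly doubles the number of subjects needed to detect the same effect.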

Accuracy and Validity of Spherical Averaging for Labeling Structural and Functional Anatomy

Fischl et al., 1999

Anatomical Labeling

Fischl et al., 1999

Functional Labeling

Fischl et al., 1999

Enhanced Statistical Power

Fischl et al., 1999

Face Validity: Results fall within Expected Range

• Consistent with published findings:
– crowns of gyri are thicker than the fundi of sulci
– sensory areas are among the thinnest in the cortex

Fischl et al., 1999

Validate against manual measurements of imaging data from another study

Fischl et al., 1999

Automated measures are similar in size and region to manual measures, and predict who will develop AD

Fischl et al., 2002

Comparison with Postmortem Measures

Rosas et al., 2002

Manual Measurements

• Can only be done in regions where folds are appropriate

• Calcarine also consistent across studies

Panels: Orbitofrontal, Calcarine

Kuperberg et al., 2003

Salat et al., 2004

Compared to Manually Labeled Data

• 1 volume-based and 2 surface-based labeling schemes

• Percent of subjects labeled correctly at each location across the surface

Fischl et al., 2004; Desikan et al., 2006

Panels: Volume Atlas, Surface Atlas, Surface Atlas 2
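
The "percent of subjects labeled correctly at each location" measure can be sketched as follows. The sketch assumes the automated and manual labels have already been brought into location-by-location correspondence across subjects. The label names and data are hypothetical, and this is not the code used in the cited studies.

```python
def percent_labeled_correctly(automated, manual):
    """Percent of subjects whose automated label matches the manual one, per location.

    automated[s][v] and manual[s][v] hold the label for subject s at location v.
    """
    n_subjects = len(automated)
    n_locations = len(automated[0])
    percents = []
    for v in range(n_locations):
        matches = sum(automated[s][v] == manual[s][v] for s in range(n_subjects))
        percents.append(100.0 * matches / n_subjects)
    return percents

# Hypothetical labels for four subjects at three locations.
manual = [
    ["precentral", "postcentral", "insula"],
    ["precentral", "postcentral", "insula"],
    ["precentral", "postcentral", "insula"],
    ["precentral", "postcentral", "insula"],
]
automated = [
    ["precentral", "postcentral", "insula"],
    ["precentral", "precentral", "insula"],
    ["precentral", "postcentral", "insula"],
    ["precentral", "precentral", "insula"],
]

agreement = percent_labeled_correctly(automated, manual)
print("Percent labeled correctly per location:", agreement)
```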

Replication of Result: Split Sample

• Concordant results are likely not due to statistical error

• Current study with 5 samples used in prior literature

Salat et al., 2004

Cross Sequence Parameters

Fischl et al., 2004

Comparison across time, scanner, field strength, number of scans, sequence type, scanner upgrade, and scanner manufacturer

Han et al., 2006

Effects of Pulse Sequence, Voxel Geometry, and Parallel Imaging

Wonderlick et al., 2008

Replication of Effects in Same Participants Across Scanning Conditions

Dickerson et al., 2008

WMPARC: same subjects scanned at different times (test-retest)

Salat et al., 2008

Replicable results across sex and hemisphere

Men

Women

Salat et al., 2008

Consistent Findings Across 5 samples Used To Identify Regions with Predictive Validity

• Regional measures predict who will progress to AD.

Dickerson et al., 2008

Conclusions

• Any tool used for MR analysis should be rigorously tested for accuracy, reliability, and validity

• Most of the measures from FreeSurfer have good accuracy, reliability, and validity across a range of conditions

• These results are dependent on optimal input data and correct implementation

• These data provide confidence, but do not substitute for using similar procedures to check data from each new study
