
Am J Psychiatry 161:12, December 2004 2163

Reviews and Overviews

http://ajp.psychiatryonline.org

The Hamilton Depression Rating Scale: Has the Gold Standard Become a Lead Weight?

R. Michael Bagby, Ph.D.

Andrew G. Ryder, M.A.

Deborah R. Schuller, M.D.

Margarita B. Marshall, B.Sc.

Objective: The Hamilton Depression Rating Scale has been the gold standard for the assessment of depression for more than 40 years. Criticism of the instrument has been increasing. The authors review studies published since the last major review of this instrument in 1979 that explicitly examine the psychometric properties of the Hamilton depression scale. The authors’ goal is to determine whether continued use of the Hamilton depression scale as a measure of treatment outcome is justified.

Method: MEDLINE was searched for studies published since 1979 that examine psychometric properties of the Hamilton depression scale. Seventy studies were identified and selected, and then grouped into three categories on the basis of the major psychometric properties examined: reliability, item-response characteristics, and validity.

Results: The Hamilton depression scale’s internal reliability is adequate, but many scale items are poor contributors to the measurement of depression severity; others have poor interrater and retest reliability. For many items, the format for response options is not optimal. Content validity is poor; convergent validity and discriminant validity are adequate. The factor structure of the Hamilton depression scale is multidimensional but with poor replication across samples.

Conclusions: Evidence suggests that the Hamilton depression scale is psychometrically and conceptually flawed. The breadth and severity of the problems militate against efforts to revise the current instrument. After more than 40 years, it is time to embrace a new gold standard for assessment of depression.

(Am J Psychiatry 2004; 161:2163–2177)

The Hamilton Depression Rating Scale (1) was developed in the late 1950s to assess the effectiveness of the first generation of antidepressants and was originally published in 1960. Although Hamilton (1) recognized that the scale had “room for improvement” (p. 56) and that further revision was necessary, the scale quickly became the standard measure of depression severity for clinical trials of antidepressants (2, 3). The Hamilton depression scale has retained this function and is now the most commonly used measure of depression (3). Our objective in this article is to provide a review of the Hamilton depression scale literature published since the last major evaluation of its psychometric properties, more than 20 years ago (4). More recent reviews have appeared (3, 5–7), but they have not systematically examined the literature with regard to a broad range of measurement issues. Significant developments in psychometric theory and practice have been made since the 1950s and need to be applied to instruments currently in use. We evaluate the Hamilton depression scale in light of these current standards and conclude by presenting arguments for and against retaining, revising, or rejecting the Hamilton depression scale as the gold standard for assessment of depression.

Method

Studies for the review were identified by means of MEDLINE searches for both “depression” and “Hamilton.” All studies published during the period since the last major review (January 1980 to May 2003) were considered. Studies selected for review had to be explicitly designed to evaluate empirically the psychometric properties of the instrument or to review conceptual issues related to the instrument’s development, continued use, and/or shortcomings. At least 20 published versions of the Hamilton depression scale exist, including both longer and shortened versions. This review was limited to studies that examined the original 17-item version, as the majority of the studies that evaluated the scale’s psychometrics used the 17-item version. Only a small number of studies evaluated other versions, and most of these versions contain the original 17 items. Seventy articles met the selection criteria and were categorized into three groups on the basis of the major psychometric property examined: reliability, item response, and validity. Table 1 lists the articles included in the review.

Results

Reliability

Clinician-rated instruments should demonstrate three types of reliability: 1) internal reliability, 2) retest reliability, and 3) interrater reliability. Cronbach’s alpha statistic (78) is used to evaluate internal reliability, and estimates ≥0.70





TABLE 1. Characteristics of Studies Examining the Psychometric Properties of the Hamilton Depression Rating Scale a

Study  Year  Language  N  % of Female Subjects  Subjects  Psychometric Properties Examined (Reliability, Item Response, Validity)

Aben et al. (8) 2002 Dutch 202 46 Stroke patients ×

Addington et al. (9) 1990 English 250 — b Schizophrenia inpatients ×

Addington et al. (10) 1996 English 112c 60 Schizophrenia inpatients × ×

Addington et al. (10) 1996 English 89d — b Schizophrenia inpatients × ×

Akdemir et al. (11) 2001 Turkish 94 66 Psychiatric patients × ×

Baca-García et al. (12) 2001 Spanish 1 100 Dysthymia outpatient ×

Bech (5) 1981 Danish 66 70 Depressed inpatients × ×

Bech et al. (13) 1992 Multilingual 1,128 — b Psychiatric patients × ×

Bech et al. (14) 2002 Danish 650 — b Psychiatric patients × ×

Berard and Ahmed (15) 1995 English 22 64 Elderly psychiatric outpatients × ×

Berrios and Bulbena-Villarasa (16) 1990 Castilian 1,204 59 Psychiatric outpatients × ×

Brown et al. (17) 1995 English 259 — b Medical outpatients ×

Carroll et al. (18) 1981 English 278 — b Depressed patients ×

Cicchetti and Prusoff (19) 1983

Time 1 English 86 — b Depressed outpatients ×

Time 2 English 81 — b Depressed outpatients ×

Craig et al. (20) 1985 English 32 0 Schizophrenia inpatients × ×

Daradkeh et al. (21) 1997 Arabic 73 58 Depressed inpatients × ×

Deluty et al. (22) 1986 English 70 39 Psychiatric inpatients × ×

Demitrack et al. (23) 1998 — b 85 66 Professionals/laypersons ×

Entsuah et al. (24) 2002

Sample 1 Multilingual 865 65 Psychiatric patients ×

Sample 2 Multilingual 757 64 Psychiatric patients ×

Sample 3 Multilingual 450 62 Psychiatric patients ×

Faries et al. (25) 2000 — b 1,658 — b Depressed outpatients ×

Feinberg et al. (26) 1981 English — b — b Depressed patients ×

Fleck et al. (27) 1995 French 60 77 Psychiatric outpatients ×

Fuglum et al. (28) 1996 Danish — b — b Depressed patients × ×

Gastpar and Gilsdorf (29) 1990 Multilingual 122 66 Depressed patients ×

Gibbons et al. (30) 1993 English 370 72 Psychiatric patients × ×

Gilley et al. (31) 1995

Sample 1 English 185 56 Alzheimer’s disease patients × ×

Sample 2 English 54 39 Comparison subjects with normal cognition × ×

Sample 3 English 57 37 Parkinson’s disease patients × ×

Gottlieb et al. (32) 1988 English 43 67 Neurological patients × ×

Gullion and Rush (33) 1998 English 324 67 Depressed patients ×

Hammond (34) 1998 English 100 74 Elderly medical patients ×

Hooijer et al. (35) 1991 Flemish 56 — b Mental health professionals ×

Hotopf et al. (36) 1998 English 49 65 Primary care patients ×

Kobak et al. (37) 1999 English 113 — b Psychiatric patients/community comparison subjects × ×

Koenig et al. (38) 1995 English 38 55 Elderly medical patients ×

Lambert et al. (39) 1986 — b 1,850 — b Psychiatric patients ×

Lambert et al. (40) 1988 English 13 31 Psychiatric inpatients/outpatients ×

Leentjens et al. (41) 2000 Dutch 63 37 Parkinson’s disease patients ×

Leung et al. (42) 1999 Chinese 93 56 Psychiatric inpatients × ×

McAdams et al. (43) 1996 English 101 23 Schizophrenia outpatients ×

Maier and Philipp (44) 1985 German 280 — b Psychiatric outpatients ×

Maier et al. (45) 1988

Sample 1 German 130 — b Psychiatric inpatients × × ×

Sample 2 German 48 — b Psychiatric inpatients × × ×

Maier et al. (46) 1988 German 130 — b Psychiatric inpatients ×

Marcos and Salamero (47) 1990 Spanish 234 76 Community geriatric subjects ×

Meyer et al. (48) 2001 English 196 68 Medical outpatients ×

Middelboe et al. (49) 1994 Danish 36 64 Medical outpatients ×

Moberg et al. (50) 2001 English 20 70 Geriatric consultation/liaison patients ×

Mottram et al. (51) 2000 English 433 73 Elderly psychiatric referrals ×

Naarding et al. (52) 2002

Sample 1 Dutch 44 36 Stroke inpatients ×

Sample 2 Dutch 274 60 Alzheimer’s disease patients ×

Sample 3 Dutch 85 40 Parkinson’s disease patients ×

O’Brien and Glaudin (53) 1988

Sample 1 English 183 70 Psychiatric outpatients ×

Sample 2 English 182 70 Psychiatric outpatients ×

(continued)






reflect adequate reliability (79, 80). The internal reliability of individual items is calculated by using corrected item-to-total correlation with Pearson’s r; items should have a correlation greater than 0.20 (79, 80). Retest reliability assesses the extent to which multiple administrations of the scale generate the same results. When scores on an instrument are expected to change in response to effective treatment, it is necessary to demonstrate that these scores remain the same in the absence of treatment. Interrater reliability assesses the extent to which multiple raters generate the same result. Although Pearson’s r is often used to compute these estimates, the preferred method is the intraclass r (81), which allows for adjustment for agreement by chance. Estimates of retest and interrater reliability should be at a minimum of 0.70 (Pearson’s r) and 0.60 (intraclass r) (82). For retest reliability of scale items, Pearson’s r >0.70 is considered acceptable (83).
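As an illustration of the two internal-reliability statistics named above, the following minimal sketch computes Cronbach’s alpha and a corrected item-to-total correlation from a small matrix of item ratings. The function names and the ratings are hypothetical and are not taken from any reviewed study; population variances are assumed throughout, which leaves alpha unchanged as long as the same variance definition is used for the items and for the total.

```python
from statistics import mean, pvariance


def cronbach_alpha(scores):
    """scores: one list of k item ratings per respondent."""
    k = len(scores[0])
    item_vars = [pvariance([row[i] for row in scores]) for i in range(k)]
    total_var = pvariance([sum(row) for row in scores])
    return (k / (k - 1)) * (1 - sum(item_vars) / total_var)


def pearson_r(x, y):
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def corrected_item_total(scores, i):
    """Correlate item i with the total of the remaining items."""
    item = [row[i] for row in scores]
    rest = [sum(row) - row[i] for row in scores]
    return pearson_r(item, rest)


# Hypothetical ratings: 5 respondents x 3 items (illustration only).
ratings = [[0, 1, 0], [1, 1, 1], [2, 2, 1], [3, 2, 2], [4, 3, 2]]
alpha = cronbach_alpha(ratings)
r_item0 = corrected_item_total(ratings, 0)
print(f"alpha={alpha:.3f} adequate={alpha >= 0.70}")
print(f"item 0 corrected r={r_item0:.2f} adequate={r_item0 > 0.20}")
```

The item is removed from the total before correlating so that it does not inflate its own correlation; the 0.70 and 0.20 cutoffs are the criteria cited in the text.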

Internal Reliability

Table 2 summarizes the results from studies examining internal reliability of the total Hamilton depression scale. Estimates ranged from 0.46 to 0.97, and 10 studies reported estimates ≥0.70. Table 3 summarizes the studies that examined internal reliability at the item level. The majority of Hamilton depression scale items show adequate reliability. Six items met the reliability criteria in every sample (guilt, middle insomnia, psychic anxiety, somatic anxiety, gastrointestinal, general somatic), and an additional six items met the criteria in all but one sample (depressed mood, suicide, early insomnia, late insomnia, work and interests, hypochondriasis). Loss of insight was the item with the most variable findings, suggesting a potential problem with this item.

Interrater Reliability

Total Hamilton depression scale interrater reliabilities

are displayed in Table 2. Pearson’s r ranged from 0.82 to

TABLE 1. Characteristics of Studies Examining the Psychometric Properties of the Hamilton Depression Rating Scale a (continued)

Study  Year  Language  N  % of Female Subjects  Subjects  Psychometric Properties Examined (Reliability, Item Response, Validity)

O’Hara and Rehm (54) 1983 English 20 0 Depressed outpatients ×

Olsen et al. (55) 2003 Danish 91 74 Psychiatric and medical patients ×

Onega and Abraham (56) 1997 English 206 70 Geriatric psychiatric outpatients ×

Pancheri et al. (57) 2002 Italian 186 62 Depressed outpatients × ×

Paykel (58) 1990

Sample 1 English 101 — b Depressed inpatients × ×

Sample 2 English 118 — b Psychiatric outpatients × ×

Sample 3 English 167 — b General practice outpatients × ×

Potts et al. (59) 1990 English 694 74 Depressed outpatients ×

Ramos-Brieva and Cordero-Villafafila (60) 1988 Spanish 135 70 Depressed inpatients/outpatients × ×

Rehm and O’Hara (61) 1985 English 158 100 Community (symptomatic) subjects × ×

Reynolds and Kobak (62) 1995 English 357 59 Psychiatric outpatient/nonreferred community subjects ×

Riskind et al. (63) 1987 English 191 54 Psychiatric outpatients × ×

Santor and Coyne (64) 2001

Sample 1 English 316 — b Primary care outpatients ×

Sample 2 English 318 70 Depressed outpatients ×

Santor and Coyne (65) 2001 English 732 — b Depressed patients ×

Sayer et al. (66) 1993 English 114 61 Psychiatric inpatients × ×

Senra Rivera et al. (67) 2000 Castilian 52 65 Depressed patients × ×

Shain et al. (68) 1990 English 45 64 Depressed adolescent inpatients ×

Smouse et al. (69) 1981 English — b — b Depressed patients ×

Steinmeyer and Möller (70) 1992 German 223e 68 Psychiatric inpatients ×

Steinmeyer and Möller (70) 1992 German 174f 68 Psychiatric inpatients ×

Strik et al. (71) 2001

Sample 1 Dutch 156 0 Medical patients × ×

Sample 2 Dutch 50 100 Medical patients × ×

Teri and Wagner (72) 1991 English 75 68 Alzheimer’s patients ×

Thase et al. (73) 1983 English 147 100 Depressed outpatients × ×

Thompson et al. (74) 1998 English 242 100 Psychiatric referrals ×

Whisman et al. (75) 1989 English 70 100 Depressed outpatients × ×

Williams (76) 1988 English 23 65 Psychiatric inpatients ×

Zheng et al. (77) 1988 Chinese 329 47 Psychiatric inpatients/outpatients × ×

a Studies were published between January 1980 and May 2003 and identified by means of a MEDLINE search for both “depression” and “Hamilton.”

b Not reported.

c Number of subjects providing data at time 1.

d Number of subjects providing follow-up data 3 months after admission.

e Number of subjects providing baseline (i.e., pretreatment) data.

f Number of subjects providing endpoint (week 6) data after treatment with either paroxetine or amitriptyline.






0.98, and the intraclass r ranged from 0.46 to 0.99. Some investigators provided evidence that the skill level or expertise of the interviewer and the provision of structured queries and scoring guidelines affect reliability (19, 23, 35, 54). Across studies, the best estimate mean of interrater reliability for studies reporting higher levels of interviewer skill and use of expert raters, structured queries, and scoring guidelines did not statistically differ from that for other studies (z=0.81, n.s.).

At the individual item level, interrater reliability is poor for many items. Cicchetti and Prusoff (19) assessed reliability before treatment initiation and 16 weeks later at trial end. Only early insomnia was adequately reliable before treatment, and only depressed mood was adequately reliable after treatment. Thirteen items had coefficients <0.50 before treatment, and 11 items had coefficients <0.50 after treatment. Rehm and O’Hara (61) performed a similar analysis with data from two samples. Six items showed adequate reliability in the first sample (early insomnia, middle insomnia, late insomnia, somatic anxiety, gastrointestinal, loss of libido), as did 10 in the second sample (depressed mood, guilt, suicide, early insomnia, middle insomnia, late insomnia, work/interests, psychic anxiety, somatic anxiety, gastrointestinal). Loss of insight showed the lowest interrater agreement in both samples. Craig et al. (20) found that only one item, work/interests, had adequate interrater reliability. Moberg et al. (50) reported that nine items demonstrated adequate reliability when the standard Hamilton depression scale was administered (depressed mood, guilt, suicide, early insomnia, late insomnia, agitation, psychic anxiety, hypochondriasis, loss of insight), but all items showed adequate reliability when the scale was administered with interview guidelines. Potts et al. (59) demonstrated that a single omnibus coefficient can mask specific problems. Using a structured interview version of the Hamilton depression scale, they

TABLE 2. Studies Reporting Reliability Estimates for the Total 17-Item Hamilton Depression Rating Scale a

Study  Year  Internal Reliability (Cronbach’s alpha)  Interrater Reliability (Pearson’s r)  Interrater Reliability (Intraclass r)  Retest Reliability (Pearson’s r)

Addington et al. (9) 1990 0.82
Addington et al. (10) 1996 0.93
Akdemir et al. (11) 2001 0.75 0.87–0.98 b 0.85
Baca-García et al. (12) 2001 0.97
Cicchetti and Prusoff (19) 1983
  Time 1 0.46
  Time 2 0.82
Craig et al. (20) 1985 0.95
Deluty et al. (22) 1986 0.96
Demitrack et al. (23) 1998 0.65–0.79 b
Fuglum et al. (28) 1996 0.86 0.81
Gastpar and Gilsdorf (29) 1990 0.48
Gilley et al. sample 1 (31) 1995 0.92
Gottlieb et al. (32) 1988 0.99
Hammond (34) 1998 0.46
Kobak et al. (37) 1999 0.91 0.98
Koenig et al. (38) 1995 0.97
Leung et al. (42) 1999 0.94
Maier et al. (45) 1988
  Sample 1 0.70
  Sample 2
    Time 1 0.72
    Time 2 0.70
McAdams et al. (43) 1996 0.77
Meyer et al. (48) 2001 0.57–0.80 b
Middelboe et al. (49) 1994 0.75
O’Hara and Rehm (54) 1983
  Expert raters 0.91
  Novice raters 0.76
Pancheri et al. (57) 2002 0.90
Potts et al. (59) 1990 0.82 0.92
Ramos-Brieva and Cordero-Villafafila (60) 1988 0.72
Rehm and O’Hara (61) 1985
  Study 1 0.76 0.78–0.91 b
  Study 2 0.91–0.96 b
Reynolds and Kobak (62) 1995 0.92 0.96
Riskind et al. (63) 1987 0.73
Shain et al. (68) 1990 0.97
Teri and Wagner (72) 1991 0.65–0.97 b
Whisman et al. (75) 1989 0.85
Williams (76) 1988 0.81
Zheng et al. (77) 1988 0.71 0.92

a Estimates are from studies published between January 1980 and May 2003 that measured psychometric properties of the Hamilton depression scale. Studies were identified by means of a MEDLINE search for both “depression” and “Hamilton.”

b Range over multiple pairs of raters.






found an overall intraclass coefficient of 0.92; however, two trained psychiatrists differed at least 20% of the time in their ratings of psychic anxiety, psychomotor agitation, and psychomotor retardation, and they differed by at least two points 15% of the time in their ratings of loss of libido. The ratings of trained raters disagreed with the psychiatrists’ ratings on psychomotor agitation (50% of the time), hypochondriasis (60%), loss of libido (90%), and loss of energy (100%).
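The kind of item-level check that can expose problems hidden behind a single omnibus coefficient can be sketched as follows. This is an illustrative tabulation only, not the procedure used by Potts et al. (59); the function name, item names, and ratings are hypothetical.

```python
def disagreement_rates(rater_a, rater_b, item_names, min_gap=1):
    """Per item, the share of patients on whom two raters differ by >= min_gap points.

    rater_a, rater_b: one list of item ratings per patient, in the same order.
    """
    n = len(rater_a)
    rates = {}
    for j, name in enumerate(item_names):
        differing = sum(
            1 for a, b in zip(rater_a, rater_b) if abs(a[j] - b[j]) >= min_gap
        )
        rates[name] = differing / n
    return rates


# Hypothetical ratings of four patients on two items (illustration only).
items = ["depressed mood", "loss of libido"]
rater_a = [[2, 0], [3, 2], [1, 0], [4, 2]]
rater_b = [[2, 2], [3, 0], [1, 0], [4, 1]]

any_gap = disagreement_rates(rater_a, rater_b, items)
two_point_gap = disagreement_rates(rater_a, rater_b, items, min_gap=2)
print(any_gap)
print(two_point_gap)
```

Reporting such rates alongside the total-score coefficient shows which items drive disagreement, and the `min_gap` parameter separates one-point differences from the larger discrepancies described in the text.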

Retest Reliability

Retest reliability for the Hamilton depression scale ranged from 0.81 to 0.98 (Table 2). Retest reliability at the item level (Table 3) ranged from 0.00 to 0.85. Williams (76) argued in favor of using structured interview guides to boost item and total scale reliability and developed the Structured Interview Guide for the Hamilton Depression Rating Scale. This effort increased the mean retest reliability across individual items to 0.54, although only four items met the criteria for adequate reliability (depressed mood, early insomnia, psychic anxiety, and loss of libido).

Item Characteristics

Content and scaling. Standard psychometric practice dictates that items within an instrument should measure a single symptom and contain response options linked to increasing or decreasing amounts of that symptom. Each item is assumed to contribute equally to the total score or be backed with evidence in support of differential weighting. These criteria are not consistently met by using the current scaling procedure or the options for rating symptoms. Although improperly scaled items can cause problems in quantitative measurement, evaluation of item scaling takes place first at a qualitative level. Some Hamilton depression scale items measure single symptoms along a meaningful continuum of severity; many do not. The item assessing depressed mood includes a combination of affective, behavioral, and cognitive features, such as gloomy attitude, pessimism about the future, subjective feeling of sadness, and tendency to weep. The general somatic symptoms item, which is also symptomatically heterogeneous, includes feelings of heaviness, diffuse backache, and loss of energy. Headache is coded only as part of somatic anxiety along with such symptoms as indigestion, palpitations, and respiratory difficulties. Genital symptoms for women entail loss of libido and menstrual disturbances. The problems inherent in the heterogeneity of these rating descriptors reduce the potential meaningfulness of these items, a problem exacerbated if the different components of an item actually measure multiple constructs and thus measure different effects.

Most items on the Hamilton depression scale are at least scaled so that increasing scores represent increasing severity. It is less clear whether the anchors used for different scores on certain items actually assess the same underlying construct/syndrome. This ambiguity is most obvious for severity ratings involving psychotic features. The feelings of guilt item, for example, is graded as follows: 0=absent, 1=self-reproach, 2=ideas of guilt or rumination over past errors or sinful deeds, 3=present illness is a punishment, and 4=hears accusatory or denunciatory voices and/or experiences threatening visual hallucinations. A patient with guilt-themed hallucinations may be more severely ill than a patient who has nonpsychotic guilty feelings, but is he/she feeling more guilt? The psychotic features may instead represent a qualitatively different construct/syndrome associated with more severe illness. Similarly, the hypochondriasis item progresses through bodily self-absorption (rated 1) and preoccupation with health (rated 2) before switching to querulous attitude (rated 3) and then again to hypochondriacal delusions (rated 4). These item-scoring anchors violate basic measurement principles, because nominal scaling and ordinal scaling are combined in a single item.

Although Hamilton (1) explained the rationale for the inclusion of both 3-point and 5-point items, the argument was not made on the grounds of differential weighting. Hamilton believed that certain items would be difficult to anchor dimensionally and therefore assigned them fewer response options. The end result is that certain items contribute more to the total score than others. Contrasting psychomotor retardation and psychomotor agitation, for example, reveals that a severe manifestation of the former contributes 4 points, whereas an equally severe manifestation of the latter contributes 2 points. Similarly, someone who weeps all the time can contribute 3 or 4 points on depressed mood, whereas someone who feels tired all the time can contribute only 2 points on the general somatic symptoms item.
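The implicit weighting can be made concrete with a short tabulation of each item's maximum score. In the sketch below, the maxima for retardation, agitation, depressed mood, guilt, hypochondriasis, and general somatic symptoms follow the ranges described in the text; the remaining maxima, and the resulting total, are assumptions for illustration, since published versions of the scale differ in how some items are scored.

```python
# Maximum score per item. Values not stated in the text are assumptions
# (3-point items scored 0-2, 5-point items scored 0-4).
ITEM_MAX = {
    "depressed mood": 4,
    "guilt": 4,
    "suicide": 4,            # assumed
    "early insomnia": 2,     # assumed
    "middle insomnia": 2,    # assumed
    "late insomnia": 2,      # assumed
    "work/interests": 4,     # assumed
    "retardation": 4,
    "agitation": 2,
    "psychic anxiety": 4,    # assumed
    "somatic anxiety": 4,    # assumed
    "gastrointestinal": 2,   # assumed
    "general somatic": 2,
    "loss of libido": 2,     # assumed
    "hypochondriasis": 4,
    "weight loss": 2,        # assumed
    "loss of insight": 2,    # assumed
}


def max_share(item):
    """Largest fraction of the total score that a single item can contribute."""
    return ITEM_MAX[item] / sum(ITEM_MAX.values())


for name in ("retardation", "agitation"):
    print(f"{name}: up to {ITEM_MAX[name]} points, {max_share(name):.0%} of the maximum total")
```

Under these assumed maxima, a 5-point item can carry twice the weight of a 3-point item even though no rationale for differential weighting was offered.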

Item Response Analysis

A psychiatric rating scale should measure a single psychopathological construct (i.e., an illness or syndrome) and be composed of items that adequately cover a range of symptoms that are consistently associated with the syndrome. Item response theory, a method used increasingly in the evaluation and construction of psychometric instruments, permits empirical evaluation of these premises. It is important to note that this method was not available when the original Hamilton depression scale was developed, although some researchers more recently used this method to evaluate this instrument. According to item response theory, a scale and its constituent items may have good reliability estimates but still fail to meet item response theory criteria. For example, if a depression scale were composed only of items measuring mild depression, the instrument would have great difficulty distinguishing between moderate and severe cases of depression, as both would be characterized by high scores on all items. This issue is particularly pressing in studies of clinical change; not only is a wide range of severity often represented in this research, but individual patients are expected to move






along this continuum as they improve. Continued use of items insensitive to change underestimates the strength of actual treatment effects and makes it necessary to have larger samples to demonstrate that an effect is statistically significant. Falsely identifying patients as not having changed represents an additional source of “noise” and weakens the “signal” of a true treatment effect. A pragmatic implication of such lack of sensitivity is that new compounds shown to be promising in the laboratory may appear spuriously ineffective in clinical trials.
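The link between an attenuated effect and sample size can be sketched with the standard normal-approximation formula for comparing two group means, n per group = 2((z_(1-α/2) + z_power)/d)², where d is the standardized effect size. The formula choice, the function name, and the effect sizes below are illustrative assumptions and do not come from the studies reviewed.

```python
import math
from statistics import NormalDist


def n_per_group(effect_size, alpha=0.05, power=0.80):
    """Approximate patients needed per arm for a two-sided, two-group comparison."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    return math.ceil(2 * ((z_alpha + z_power) / effect_size) ** 2)


# Hypothetical: measurement noise shrinks an observed effect from d=0.5 to d=0.4.
for d in (0.5, 0.4):
    print(f"d={d}: about {n_per_group(d)} patients per group")
```

Because the required sample grows with the inverse square of the effect size, even a modest loss of sensitivity in the outcome measure translates into a substantially larger trial.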

A related issue concerns the extent to which a severity score actually measures a single unidimensional syndrome. To summarize a syndrome with a single score requires a precise understanding of what that score represents. The implicit assumption is that the severity score represents a single dimension (84); if depression is heterogeneous, interpretation of a single summed score is unclear. If, for example, items assessing psychological and physical symptoms were only loosely related, a single score would not distinguish between two potentially different groups of depressed patients: one group whose symptoms were primarily psychological and another group with primarily vegetative symptoms. Any effects of an intervention targeting only one of these aspects would be harder to detect.

Gibbons et al. (85) presented a strategy for identifying a unidimensional set of items from a psychiatric rating scale and evaluating the extent to which these items adequately measure the full range of depression severity. Subsequently, a subset of Hamilton depression scale items that would measure a single dimension of depression across a wide range of severity was developed (30). This subset included depressed mood, which was sensitive at low levels; work/interests, psychic anxiety, and loss of libido, which were sensitive at mild levels; somatic anxiety, psychomotor agitation, and guilt, which were sensitive at moderate levels; and suicide, which was sensitive at severe levels. These items were proposed as a psychometrically stronger form of the full Hamilton depression scale.

Santor and Coyne (64, 65) used item response theory to examine the functioning of the full Hamilton depression scale and its individual items. In one of these studies (65) they examined individual Hamilton depression scale item performance in a combined sample of primary care patients and depressed patients from the National Institute of Mental Health Treatment of Depression Collaborative Research Program. One expects different item ratings at

TABLE 3. Studies Reporting Item Reliability Estimates for the 17-Item Hamilton Depression Rating Scale a

Scale Item

Reliability Measure and Study       Year  Depressed  Guilt      Suicide    Early      Middle     Late       Work/
                                          Mood                             Insomnia   Insomnia   Insomnia   Interests

Internal reliability b
Berrios and Bulbena-Villarasa (16)  1990
  Sample 1                                0.32       0.24       0.26       0.25       0.32       0.31       0.39
  Sample 2                                0.37       0.38       0.40       0.23       0.37       0.42       0.33
Gastpar and Gilsdorf (29)           1990
  Time 1                                  0.10       0.22       -0.04      0.04       0.22       0.13       0.09
  Time 2                                  0.65       0.39       0.50       0.44       0.46       0.53       0.73
Paykel (58)                         1990
  Sample 1                                0.52       0.31       0.31       0.24       0.21       0.38       0.59
  Sample 2                                0.42       0.38       0.47       0.27       0.34       0.30       0.58
  Sample 3                                0.52       0.41       0.49       0.34       0.35       0.34       0.59
Rehm and O’Hara (61)                1985  0.63       0.26       0.47       0.40       0.41       0.37       0.46

Interrater reliability c
Cicchetti and Prusoff (19)          1983
  Time 1                                  0.37       0.18       0.59       0.76       0.57       0.42       0.33
  Time 2                                  0.72       0.37       0.64       0.57       0.45       0.49       0.64
Moberg et al. (50) d                2001
  Standard administration                 0.90       0.80       0.90       0.61       0.39       0.89       0.50
  Interview guidelines                    0.96       0.83       0.81       0.97       0.78       0.89       0.87
Rehm and O’Hara (61) e              1985
  Above median split                      0.61       0.39       0.49       0.74       0.79       0.72       0.56
  Below median split                      0.84       0.82       0.92       0.91       0.79       0.92       0.73

Retest reliability f
Akdemir et al. (11)                 2001  0.61       0.78       0.67       0.69       0.79       0.76       0.73
Williams (76)                       1988  0.80       0.63       0.64       0.80       0.62       0.30       0.54

a Estimates are from studies published between January 1980 and May 2003 that measured psychometric properties of the Hamilton depression scale. Studies were identified by means of a MEDLINE search for both “depression” and “Hamilton.”

b Correlation of item scores with total scores. An uncorrected Pearson’s r >0.20 was considered significant. Significant correlations are shown in boldface type.

c Interrater Pearson’s r ≥0.70 was considered significant; intraclass r ≥0.60 was considered significant. Significant correlations are shown in boldface type.

d The study included both standard and interview guideline methods; interrater reliability was calculated by using the intraclass r.

e The subjects were assigned to two groups by means of a median split according to Hamilton depression scale total scores; interrater Pearson’s r values were calculated for both groups.

f Test-retest Pearson’s r >0.70 was considered acceptable. Acceptable correlations are shown in boldface type.






different levels of depression severity, with zeroes more common at mild levels of overall depression and higher item scores more common with more severe overall depression. Moreover, whereas most items on the Hamilton depression scale are, overall, sensitive to depression severity, 12 items had at least one problematic response option (the five items that had no such problems were depressed mood, guilt, suicide, work/interests, and psychic anxiety) (64). For example, the likelihood of receiving a rating of 1 on the insomnia items was essentially the same regardless of the overall severity of depression, but the likelihood of receiving a rating of 4 on somatic anxiety was very low even when overall depression was severe. These findings confirm that the rating scheme is not ideal for many items on the Hamilton depression scale, with the unfortunate effect of decreasing the capacity of the Hamilton depression scale to detect change (6, 7).

Rasch Analysis

Additional efforts to analyze the performance of individual Hamilton depression scale items and to identify an underlying single dimension of depression severity have benefited from a technique known as Rasch analysis, a method similar to item response theory. Rasch analysis proposes an ideal underlying dimension based on mathematical and theoretical reasoning about the construct that is being measured and then assesses the extent to which actual data correspond to this ideal. This approach was first applied to the Hamilton depression scale by Bech et al. (86), who confirmed that six items previously shown to have properties associated with unidimensionality (87) could be combined to create a shorter scale that met the formal Rasch criteria. This six-item scale was thus proposed as a better measure than the full Hamilton depression scale for assessing depression severity along a single dimension; the six-item scale is composed of items for depressed mood, guilt, work/interests, psychomotor retardation, anxiety psychic, and general somatic symptoms (87). The unidimensionality of this six-item subscale has since been confirmed in two studies that used Rasch methods (13, 14). Maier and Philipp (44) used Rasch analysis to confirm unidimensionality for a subset of Hamilton depression scale items. The resulting scale was similar to that obtained by Bech et al. (86). In another study that used Rasch analysis (46), six items were found to be problematic: suicide, psychomotor agitation, anxiety somatic, general somatic symptoms, hypochondriasis, and loss of insight.
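A minimal sketch of the simplest (dichotomous) Rasch model may help to show why the placement of items along the severity dimension matters. Hamilton depression scale items have several response options, so the studies cited would have required polytomous extensions; the dichotomous form, the function name, and all parameter values below are simplifying assumptions for illustration only.

```python
import math


def rasch_probability(theta, difficulty):
    """P(item endorsed) for a person at severity theta on an item of given difficulty."""
    return 1.0 / (1.0 + math.exp(-(theta - difficulty)))


# Hypothetical severities (moderate vs. severe) and item difficulties, in logits.
moderate, severe = 2.0, 4.0
mild_item, severe_item = -2.0, 3.0

for label, b in (("mild item", mild_item), ("severe item", severe_item)):
    p_mod = rasch_probability(moderate, b)
    p_sev = rasch_probability(severe, b)
    print(f"{label}: P(moderate)={p_mod:.2f}, P(severe)={p_sev:.2f}")
```

With these values the mild item is endorsed almost certainly by both patients and so carries little information about the difference between them, whereas the item located near the severe end separates them clearly.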

Scale Item

Reliability Measure and Study       Year  Retardation Agitation   Psychic     Somatic     Gastro-     General     Loss of     Hypochon-   Weight      Loss of     Mean
                                                                  Anxiety     Anxiety     intestinal  Somatic     Libido      driasis     Loss        Insight

Internal reliability
Berrios and Bulbena-Villarasa (16)  1990
  Sample 1                                0.24        0.24        0.42        0.35        0.33        0.29        0.29        0.34        0.29        0.06        0.29
  Sample 2                                0.31        0.35        0.36        0.29        0.37        0.34        0.30        0.36        0.26        0.06        0.32
Gastpar and Gilsdorf (29)           1990
  Time 1                                  0.03        0.07        0.39        0.34        0.28        0.32        0.05        0.34        -0.04       0.25        0.16
  Time 2                                  0.40        0.40        0.64        0.58        0.53        0.55        0.55        0.23        0.11        0.27        0.47
Paykel (58)                         1990
  Sample 1                                0.33        0.37        0.53        0.41        0.52        0.25        0.27        0.33        0.40        0.45        0.38
  Sample 2                                0.33        0.20        0.33        0.47        0.63        0.50        0.39        0.43        0.49        0.23        0.40
  Sample 3                                0.21        0.25        0.50        0.42        0.41        0.44        0.23        0.16        0.42        -0.07       0.35
Rehm and O’Hara (61)                1985  0.14        0.18        0.54        0.46        0.27        0.38        0.13        0.33        0.25        0.16        0.34

Interrater reliability
Cicchetti and Prusoff (19)          1983
  Time 1                                  0.39        0.20        0.19        0.34        0.43        0.30        0.39        0.29        0.57        -0.02       0.37
  Time 2                                  0.26        0.32        0.40        0.45        0.51        0.42        0.59        -0.04       0.06        -0.03       0.40
Moberg et al. (50)                  2001
  Standard administration                 0.46        0.89        0.67        0.57        0.34        0.57        0.39        0.76        0.58        0.63        0.64
  Interview guidelines                    0.75        0.97        0.88        0.84        0.95        0.92        0.94        0.89        1.00        1.00        0.90
Rehm and O’Hara (61)                1985
  Above median split                      0.54        0.51        0.52        0.88        0.75        0.55        0.77        0.35        0.51        0.37        0.59
  Below median split                      0.69        0.52        0.71        0.82        0.73        0.68        0.69        0.64        0.54        0.22        0.72

Retest reliability
Akdemir et al. (11)                 2001  0.85        0.66        0.80        0.79        0.71        0.66        0.76        0.79        0.08        0.79        0.70
Williams (76)                       1988  0.32        0.11        0.78        0.66        0.59        0.61        0.70        0.55        0.58        0.00        0.54






Validity

Validity of psychiatric rating scales such as the Hamil-

ton depression scale comprises 1) content, 2) convergent,

3) discriminant, 4) factorial, and 5) predictive validity.

Content validity is assessed by examining scale items to

determine correspondence with known features of a syndrome. Convergent validity is adequate when a scale shows Pearson’s r values of at least 0.50 in correlations

with other measures of the same syndrome. Discriminant

validity is established by showing that groups differing in

their diagnostic status can be separated by using the scale.

Predictive validity for symptom severity measures such as

the Hamilton depression scale is determined by a statisti-

cally significant (p<0.05) capacity to predict change with

treatment. Factorial validity is established by using factor

analysis or related techniques (e.g., principal-component

analysis) to demonstrate that a meaningful structure can

be found in multiple samples. An a priori criterion of 0.40

has been used to identify which items are part of which

factors (88).
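The convergent validity criterion stated above can be illustrated with a short sketch. It computes Pearson's r between two sets of total scores and checks it against the 0.50 threshold. The scores and function names are hypothetical and serve only to show the arithmetic.

```python
import math


def pearson_r(x, y):
    """Pearson product-moment correlation of two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def meets_convergent_criterion(r, threshold=0.50):
    """Apply the r >= 0.50 criterion stated in the text."""
    return r >= threshold


# Hypothetical total scores for eight patients on two depression measures.
hamilton_totals = [8, 12, 15, 19, 22, 25, 27, 30]
other_totals = [10, 9, 18, 21, 20, 29, 33, 31]

r = pearson_r(hamilton_totals, other_totals)
print(round(r, 2), meets_convergent_criterion(r))
```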

Content validity. Because of its wide use and long clinical tradition, the Hamilton depression scale seems both to define and to measure depression. One could criticize

DSM-IV for not adequately capturing Hamilton depres-

sion scale depression as much as one could criticize the

Hamilton depression scale for not providing full coverage

of DSM-IV depression. Nonetheless, the operational crite-

ria provided in DSM-IV are used as the official nosology

for much of psychiatry worldwide. The criteria for major

depression have been revised three times in response to

developments in field trial research and clinical consensus

based on expert opinion, most recently in 1994. Research-

ers have developed a number of longer versions of the

Hamilton depression scale that include additional symptoms such as the reverse vegetative features of atypical depression. However, the core items of the Hamilton depression scale have remained unchanged for more than 40

years. It is reasonable to ask whether this instrument cap-

tures depression as it is currently conceptualized. Several

symptoms contained within the Hamilton depression

scale are not official DSM diagnostic criteria, although

they are recognized as features associated with depression

(e.g., psychic anxiety). For other symptoms included in the

Hamilton depression scale (e.g., loss of insight, hypochon-

driasis), the link with depression is more tenuous. More

critically, important features of DSM-IV depression are of-

ten buried within more complex items and sometimes are

not captured at all. The work/interests item includes an-

hedonic features along with listlessness, indecisiveness,

social avoidance, and lowered productivity. It is impossi-

ble to determine the extent to which anhedonia per se in-

fluences severity. Guilt is captured in both Hamilton de-

pression scale depression and DSM-IV depression, but the

Hamilton depression scale contains no explicit assess-

ment of feelings of worthlessness. Decision-making diffi-

culties are buried within the work/interests item of the

Hamilton depression scale, but concentration difficulties

are not included. The reverse vegetative symptoms (weight gain, hyperphagia, and hypersomnia) were provided by Hamilton (1) as additional items but are not scored on the original Hamilton depression scale.

TABLE 4. Studies Reporting Estimates of Convergent Validity of the 17-Item Hamilton Depression Rating Scale, Compared With Other Depression Measures(a)

Comparison measures: Beck Depression Inventory; Brief Psychiatric Rating Scale; Center for Epidemiologic Studies Depression Scale; Clinical Global Impression Scale; Carroll Rating Scale for Depression; Global Assessment Scale

Study                            Year  r
Akdemir et al. (11)              2001  0.48  0.56
Berard and Ahmed (15)            1995  0.48
Brown et al. (17)                1995  0.70–0.85(b)
Carroll et al. (18)              1981  0.60  0.71
Craig et al. (20)                1985  0.56  0.65
Feinberg et al. (26)             1981  0.77  0.75
Gottlieb et al. (32)             1988
  Low-severity group                   0.89
  High-severity group                  0.57
Hotopf et al. (36)               1998  0.77
Kobak et al. (37)                1999  0.89
Leung et al. (42)                1999
Maier et al. total sample (46)   1988
Olsen et al. (55)                2003
Rehm and O’Hara (61)             1985  0.73–0.86
Senra Rivera et al. (67)         2000
  Time 1                               0.70
  Time 2                               0.92
Whisman et al. (75)              1989
  Time 1                               0.27  0.41
  Time 2                               0.67  0.68
Zheng et al. (77)                1988  -0.47

a Estimates are from studies published between January 1980 and May 2003 that measured psychometric properties of the Hamilton depression scale. Studies were identified by means of a MEDLINE search for both “depression” and “Hamilton.”
b Multiple assessments over an 8-month period.

Convergent validity. A wide range of instruments has

been used to examine the convergent validity of the Hamil-

ton depression scale (Table 4). Most of the correlation co-

efficients met the preestablished criterion, and the Hamil-

ton depression scale showed adequate convergent validity

in correlations with all but two scales, including the major

depression section of the Structured Clinical Interview for

DSM-IV. The latter finding provides evidence of noncorre-

spondence between the Hamilton depression scale and

DSM-IV.

TABLE 4 (continued)

Comparison measures: Hospital Anxiety and Depression Scale; Montgomery-Åsberg Depression Rating Scale; Major Depression Inventory; Minnesota Multiphasic Personality Inventory; Raskin Depression Scale; Structured Clinical Interview for DSM-IV; Visual Analogue Scale; Zung Self-Rating Depression Scale

r
0.37  0.38
-0.65
0.49  0.01
0.67  0.85  0.65
0.86  0.67  0.81
0.68  0.88
0.20  0.50  0.62

TABLE 5. Studies Reporting Classification Accuracy Rates for the 17-Item Hamilton Depression Rating Scale(a)

Study                      Year  Cutoff Score(b)  Sensitivity   Specificity   Positive Predictive Value  Negative Predictive Value
Aben et al. (8)            2002  12               0.78          0.75          0.37                       0.95
Leentjens et al. (41)      2000  13/14            0.88          0.86          0.84                       0.89
Leung et al. (42)          1999  12/13            0.88          0.86          0.84                       0.89
Mottram et al. (51)        2000  15/16            0.88          0.99          0.99                       0.97
Naarding et al. (52)       2002
  Sample 1                       10/11            0.73          1.00          1.00                       0.88
  Sample 2                       13/14            0.45          0.96          0.76                       0.86
  Sample 3                       15/16            0.70          0.99          0.93                       0.91
Strik et al. (71)          2001  11/12            0.76          0.86          0.41                       0.99
Thompson et al. (74)       1998  (c)              0.69–0.87(d)  0.99–1.00(d)  (c)                        (c)
Mean                             12.6/13.5        0.76          0.91          0.77                       0.92

a Rates are from studies published between January 1980 and May 2003 that measured psychometric properties of the Hamilton depression scale. Studies were identified by means of a MEDLINE search for both “depression” and “Hamilton.”
b The minimum score above which sensitivity and specificity are maximized in the detection of depression with the Hamilton depression scale for a given study. Where two scores are given, the lower score represents the threshold below which cases are classified as nondepressed, and the higher score represents the threshold above which cases are classified as depressed.
c Not reported.
d Range of scores across multiple assessments.

Discriminant validity. Two approaches have been used to evaluate the discriminant validity of the Hamilton depression scale. In the first approach, several studies used the receiver operating characteristic (ROC) curve as a statistical means of determining the cutoff scores for detecting depression and then provided corresponding rates of sensitivity, specificity, positive predictive power, and negative predictive power for the Hamilton depression scale in distinguishing depressed and nondepressed subjects. In other studies, researchers have examined the capacity of the Hamilton depression scale to distinguish different groups of clinical patients (e.g., patients with endogenous versus those with nonendogenous depression, patients with anxiety versus those with depression) using statistical techniques to detect mean group differences. Classification rates resulting from ROC curve analysis have not been widely reported in the Hamilton depression scale literature. Our search identified only seven studies (Table 5), and some of these investigations sought to detect depression in samples of patients with medical conditions other than psychiatric disorders (Table 1). Sensitivity, specificity, and negative predictive power were generally consistent and high, but positive predictive power was more variable, and

two studies reported very low positive predictive power.
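The four classification rates discussed above follow directly from a 2x2 table of test results against diagnosis. The sketch below, with hypothetical function names and hypothetical prevalences, also shows the general dependence of positive predictive power on the base rate of depression in a sample. Whether base rates account for the variability seen in the cited studies is an assumption of this sketch, not a finding reported here.

```python
def classification_rates(tp, fp, fn, tn):
    """Sensitivity, specificity, PPV, and NPV from a 2x2 table of counts."""
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "ppv": tp / (tp + fp),
        "npv": tn / (tn + fn),
    }


def predictive_values(sensitivity, specificity, prevalence):
    """PPV and NPV implied by sensitivity, specificity, and prevalence (Bayes' rule)."""
    true_pos = sensitivity * prevalence
    false_pos = (1 - specificity) * (1 - prevalence)
    true_neg = specificity * (1 - prevalence)
    false_neg = (1 - sensitivity) * prevalence
    return true_pos / (true_pos + false_pos), true_neg / (true_neg + false_neg)


# Hypothetical prevalences; sensitivity and specificity are held fixed.
for prevalence in (0.10, 0.30, 0.50):
    ppv, npv = predictive_values(0.80, 0.85, prevalence)
    print(f"prevalence={prevalence:.2f}  PPV={ppv:.2f}  NPV={npv:.2f}")
```

With sensitivity and specificity held constant, positive predictive power falls as the condition becomes rarer in the sample, while negative predictive power rises.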

The second type of discriminant validity study attempts

to distinguish different clinical groups. In a comparison of

healthy, depressed, and bipolar depressed individuals,

Rehm and O’Hara (61) found that the total Hamilton de-

pression scale score clearly differentiated these three cate-

gories, with the depressed patients scoring higher than the

healthy participants and with the bipolar depressed pa-

tients scoring higher than both of the other groups. At the

item level, four items—psychomotor agitation, gastro-

intestinal symptoms, loss of insight, and weight loss—failed to differentiate depressed from healthy subjects.

Only psychic anxiety and hypochondriasis significantly

differentiated the subjects with unipolar and bipolar de-

pression. Kobak et al. (37) showed significant total scale

score differences between individuals with major depres-

sion, individuals with minor depression, and healthy com-

parison subjects. Zheng et al. (77) reported that the Hamil-

ton depression scale was able to discriminate psychiatric

patients classified as mildly, moderately, and severely dys-

functional on the basis of Global Severity Scale scores.

Thase et al. (73) found that the Hamilton depression scale

could distinguish patients with endogenous depression

from patients with nonendogenous depression, with pa-

tients in the former category having higher scores. Gott-

lieb et al. (32) reported no significant differences between

the Hamilton depression scale scores of patients classified

as having low-severity versus high-severity Alzheimer’s disease. Several researchers have investigated the capacity

of the Hamilton depression scale to differentiate between

patients with anxiety and those with depression. Prusoff

and Klerman (89) suggested the Hamilton depression

scale could indeed separate these constructs, and Maier et

al. (45) demonstrated that the Hamilton depression scale

had a higher correlation with an external measure of depression than with an external measure of anxiety, but the saturation of the Hamilton depression scale with anxiety-related concepts was nonetheless considerable.

TABLE 6. Studies Reporting Factor Analyses and Principal-Component Analyses of the 17-Item Hamilton Depression Rating Scale(a)

Study                                      Year  Number of Factors  Depressed Mood  Guilt         Suicide  Early Insomnia  Middle Insomnia  Late Insomnia  Work/Interests
Addington et al. (10)                      1996
  Time 1                                         7                  I               I, V          V        –               II               II, V, VI      I, IV
  Time 2                                         7                  I, II, VII      II, III, VII  VII      III             III              III, V         II
Akdemir et al. (11)                        2001  6                  –               II            II       III             III              III            –
Berrios and Bulbena-Villarasa (16)         1990
  Sample 1                                       4                  I               I, II         I        I               I                I              II
  Sample 2                                       4                  I               I, II         I        IV              I, IV            I              I, II, IV
Brown et al. (17)                          1995  6                  III             III           III      I               V                V              VI
Daradkeh et al. (21)                       1997  5                  II  II, IV  I  I  III
Fleck et al. (27)                          1995  3                  I               I             –        III             III              III            I
Gibbons et al. (30)                        1993  5                  I, IV           I             I        –               II               II             I, IV
Marcos and Salamero (47)                   1990  3                  II              –             II       III             III              III            II
O’Brien and Glaudin (53)                   1988
  Sample 1                                       6                  I               I, VI         I        –               IV               IV             I, II
  Sample 2                                       8                  III             VII           VI       II              II               II             III
Onega and Abraham (56)                     1997  4                  I               I             I        II              II               II             I
Pancheri et al. (57)                       2002  4                  III             II            –        I               I                I              III
Ramos-Brieva and Cordero-Villafafila (60)  1988  5                  III             II, III       I, III   I               I                I              III
Smouse et al. (69)                         1981  3                  I               I             I, II    I, II           I, II            I, II          I
Steinmeyer and Möller (70)                 1992
  Time 1                                         6                  II              V             V        III             III              III            II
  Time 2                                         2                  I, II           II            I, II    II              I                I              I
Zheng et al. (77)                          1988  5                  III             IV            III      V               V                V              IV

a Results are from studies published between January 1980 and May 2003 that measured psychometric properties of the Hamilton depression scale. Studies were identified by means of a MEDLINE search for both “depression” and “Hamilton.” Roman numerals indicate the number of the factor on which the item loaded significantly. A factor loading of ≥0.40 was considered statistically significant.

Predictive validity. Edwards et al. (90) performed a meta-analysis of 19 studies with a total of 1,150 patients that compared the predictive validity of the Hamilton depression scale and the Beck Depression Inventory. Treatments included pharmacotherapy, behavior therapy, cognitive restructuring, dynamic psychotherapy, and various combinations. The Hamilton depression scale was found to be






more sensitive to change, compared to the Beck Depres-

sion Inventory. Lambert et al. (39) performed a meta-anal-

ysis that included 36 studies and a total of 1,850 patients

and that compared the Hamilton depression scale to the

Beck Depression Inventory and the Zung Self-Rating Depression Scale. They reported that the Hamilton depression scale was more sensitive to change than were the two self-report measures. Sayer et al. (66) also demonstrated

that the Hamilton depression scale outperformed the

Beck Depression Inventory in detecting change. Lambert

et al. (40) reported that the Beck Depression Inventory is

more likely to show treatment effects at 12 weeks than the

Zung Self-Rating Depression Scale or the Hamilton de-

pression scale; the Zung Self-Rating Depression Scale and

the Hamilton depression scale were more likely to detect

changes after 3 weeks.

One disadvantage of a multidimensional instrument

such as the Hamilton depression scale in detecting change

is that specific treatments may affect only a single dimension. If the total score includes somatic symptoms that actually reflect treatment side effects, estimates of treatment

response will be spuriously low (44). In two studies and

one meta-analysis researchers addressed this issue using

the various unidimensional core depression item sets de-

scribed earlier in the section on item characteristics (91,

92). The six-item subscale developed by Bech et al. (87)

was found to be at least as responsive as the full Hamilton

depression scale. A meta-analysis of eight fluoxetine stud-

ies with 1,658 patients showed that the different uni-

dimensional subscales (44, 87) were more sensitive to

change than was the full Hamilton depression scale score.

These results were replicated in a second meta-analysis of

four tricyclic antidepressant studies (25).
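The argument that side-effect-laden somatic items can dilute a total score's sensitivity to change can be illustrated with a small numerical sketch. The index used here, the standardized response mean (mean improvement divided by the standard deviation of improvements), is one of several possible responsiveness indices and is an assumption of this sketch, as are all scores shown. None of the numbers come from the cited studies.

```python
from statistics import mean, stdev


def standardized_response_mean(baseline, endpoint):
    """Mean improvement (baseline - endpoint) divided by the SD of the improvements."""
    changes = [b - e for b, e in zip(baseline, endpoint)]
    return mean(changes) / stdev(changes)


# Hypothetical scores for six patients (not data from any cited study).
# Core depression items improve; somatic items worsen in some patients,
# as might happen if they pick up treatment side effects.
core_baseline = [14, 12, 15, 13, 16, 11]
core_endpoint = [6, 7, 8, 5, 9, 6]
somatic_baseline = [5, 4, 6, 5, 4, 6]
somatic_endpoint = [6, 7, 5, 8, 4, 9]

total_baseline = [c + s for c, s in zip(core_baseline, somatic_baseline)]
total_endpoint = [c + s for c, s in zip(core_endpoint, somatic_endpoint)]

core_srm = standardized_response_mean(core_baseline, core_endpoint)
total_srm = standardized_response_mean(total_baseline, total_endpoint)
print(f"core subscale SRM = {core_srm:.2f}")
print(f"full scale SRM    = {total_srm:.2f}")
```

In this constructed example the core subscale yields the larger standardized change, because the somatic items both reduce the mean improvement and add variability to the total score.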

Factorial validity. A total of 15 studies with 17 samples

reported a factor analysis of the Hamilton depression scale (Table 6). In most of the studies, researchers used the

eigenvalue ≥1 rule to determine the number of factors, ex-

tracted those factors from the data using principal-com-

ponent analysis, and then determined the optimal config-

uration of items on factors using varimax rotation. The

number of factors identified ranged from two to eight. In-

somnia items appeared consistently on the same factor in

13 data sets, suggesting a sleep disturbance factor. There

was some support for the presence of a general depression

factor, as depressed mood, guilt, and suicide appeared to-

gether on the same factor in six data sets, and the combi-

nation of depressed mood, suicide, and psychic anxiety appeared on the same factor in seven data sets. Support

was also found for an anxiety/agitation factor, with the ag-

itation, psychic anxiety, and somatic anxiety items ap-

pearing together in six samples. Clearly, the Hamilton de-

pression scale is not unidimensional, as separate sets of

items do seem to reliably represent general depression

and insomnia factors; however, the exact structure of the

Hamilton depression scale’s multidimensionality remains

unclear.
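The analytic sequence described above (eigenvalue ≥1 rule, principal-component extraction, varimax rotation, and the 0.40 loading criterion) can be sketched in a few self-contained steps. The correlation matrix and item labels below are hypothetical. Power iteration with deflation and pairwise planar rotations (raw varimax, without Kaiser normalization) were chosen here only to keep the sketch free of external libraries; they are not the software or exact procedures used in the reviewed studies.

```python
import math


def mat_vec(matrix, vec):
    """Multiply a square matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vec)) for row in matrix]


def largest_eigenpair(matrix, iterations=500):
    """Power iteration for the dominant eigenpair of a symmetric PSD matrix."""
    vec = [1.0 + i for i in range(len(matrix))]  # asymmetric starting vector
    for _ in range(iterations):
        product = mat_vec(matrix, vec)
        norm = math.sqrt(sum(x * x for x in product))
        vec = [x / norm for x in product]
    value = sum(v * w for v, w in zip(vec, mat_vec(matrix, vec)))
    return value, vec


def retained_components(corr, min_eigenvalue=1.0):
    """Extract component loadings while the eigenvalue is >= min_eigenvalue."""
    work = [row[:] for row in corr]
    n = len(corr)
    loadings = []  # one list of item loadings per retained component
    for _ in range(n):
        value, vec = largest_eigenpair(work)
        if value < min_eigenvalue:
            break
        loadings.append([v * math.sqrt(value) for v in vec])
        # Deflate: remove the extracted component from the working matrix.
        work = [[work[i][j] - value * vec[i] * vec[j] for j in range(n)]
                for i in range(n)]
    return loadings


def varimax(columns, sweeps=50):
    """Raw varimax by repeated pairwise planar rotations of loading columns."""
    cols = [col[:] for col in columns]
    if len(cols) < 2:
        return cols
    p = len(cols[0])
    for _ in range(sweeps):
        for a in range(len(cols)):
            for b in range(a + 1, len(cols)):
                x, y = cols[a], cols[b]
                u = [xi * xi - yi * yi for xi, yi in zip(x, y)]
                v = [2 * xi * yi for xi, yi in zip(x, y)]
                sum_u, sum_v = sum(u), sum(v)
                c = sum(ui * ui - vi * vi for ui, vi in zip(u, v))
                d = 2 * sum(ui * vi for ui, vi in zip(u, v))
                phi = math.atan2(d - 2 * sum_u * sum_v / p,
                                 c - (sum_u ** 2 - sum_v ** 2) / p) / 4
                cos_phi, sin_phi = math.cos(phi), math.sin(phi)
                cols[a] = [xi * cos_phi + yi * sin_phi for xi, yi in zip(x, y)]
                cols[b] = [-xi * sin_phi + yi * cos_phi for xi, yi in zip(x, y)]
    return cols


def assign_items(rotated, item_names, criterion=0.40):
    """For each item, list the (1-based) factors with |loading| >= criterion."""
    return {
        name: [f + 1 for f, col in enumerate(rotated) if abs(col[i]) >= criterion]
        for i, name in enumerate(item_names)
    }


# Hypothetical correlation matrix for four items (not data from any study).
ITEMS = ["early_insomnia", "late_insomnia", "depressed_mood", "guilt"]
CORR = [
    [1.0, 0.7, 0.1, 0.1],
    [0.7, 1.0, 0.1, 0.1],
    [0.1, 0.1, 1.0, 0.5],
    [0.1, 0.1, 0.5, 1.0],
]

unrotated = retained_components(CORR)
rotated = varimax(unrotated)
assignment = assign_items(rotated, ITEMS)
print(len(unrotated), assignment)
```

Running the same steps on correlation matrices from different samples and comparing the resulting item assignments is, in essence, the replication check summarized in Table 6.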

Retardation  Agitation  Psychic Anxiety  Somatic Anxiety  Gastrointestinal  General Somatic  Loss of Libido  Hypochondriasis  Weight Loss  Loss of Insight

IV III, VI VII I VII I, VII IV, VII III I, VI VIII II, III I, II I I VI VI IV I, IV, V IV — I, II II I VI IV IV I VI

II I I I, II I, III I IV I III IIIII II I, III I, III I I I, IV I I, III IIIIII I I I II VI V IV IV II

I, II I, III V V V II, III II IVI II II II II I — II II — IV I I I V IV — — III, V IIIII — II I — I I I — —

V V I II III II I II, IV III VII

I III V, VI IV V VII VI IV VIIII IV III III II I I III II IV

— II — I I, IV — – I IV —

III II II II IV IV IV V V IIIII I, II I, III I I I

IV I II I VI IV VI I IV IV — II I I I I — — — — IV II I I I I III I I II






Conclusions

The Hamilton depression scale has been the standard for

the assessment of depression for more than 40 years. Re-

searchers and policy makers charged with the task of pro-

viding standards to evaluate treatment outcomes in de-

pression are faced with three possible solutions: retain,

revise, or reject. The last solution argues for the development of a new instrument or the replacement of the Hamilton depression scale with existing, psychometrically superior instruments.

Many of the psychometric properties of the Hamilton

depression scale are adequate and consistently meet es-

tablished criteria. The internal, interrater, and retest reli-

ability estimates for the overall Hamilton depression scale

are mostly good, as are the internal reliability estimates at

the item level. Similarly, established criteria are met for

convergent, discriminant, and predictive validity, al-

though the latter does suffer somewhat due to multidi-

mensionality. At the item level, interrater and retest coeffi-

cients are weak for many items, and the internal reliability coefficients indicate that some items are problematic. The

lack of individual item reliability is not necessarily a fatal

psychometric flaw; what is critical is that the items as a

whole provide adequate reliability.

Evaluation of item response shows that many of the

individual items are poorly designed and sum to generate a

total score whose meaning is multidimensional and un-

clear. The problem of multidimensionality was highlighted

in the evaluation of factorial validity, which showed a fail-

ure to replicate a single unifying structure across studies.

Although the unstable factor structure of the Hamilton de-

pression scale may be partly attributable to the diagnostic

diversity of population samples, well-designed scales assessing clearly defined constructs produce factor structures that are invariant across different populations (88).

Finally, the Hamilton depression scale is measuring a con-

ception of depression that is now several decades old and

that is, at best, only partly related to the operationalization

of depression in DSM-IV.

These findings indicate that continued use of the Hamil-

ton depression scale requires, at the very least, a complete

overhaul of its constituent items. Accumulated empirical

evidence offers some hope that substantial revision can

redress a number of psychometric problems, thereby pro-

viding an improved measure. Shortened versions of the

Hamilton depression scale converge on a common set of

core features and in general have proven more effective in

detecting change. The truncated item sets for these instru-

ments, however, are limited in that they do not permit

capture of the full depressive syndrome. Other studies

based on item response theory methods have indicated

that modifications of the rating scheme are readily imple-

mented and can enhance the unidimensionality of these

core symptoms in a manner that allows uniform assess-

ment of change. Identifying a core set of symptoms with

proven psychometric qualities, along with making rating

scheme changes that would allow consistent assessment

of the severity of depression, could provide a foundation

for a reconstructed scale. One advantage of such a revision

is that it would maintain continuity with the long-stand-

ing use of the original Hamilton depression scale. This sort

of transition is probably more palatable and therefore

more readily acceptable to regulatory commissions.

The Depression Rating Scale Standardization Team re-

vised the Hamilton depression scale (i.e., the GRID-HAMD

[93, 94]) by employing several of the methodological ad-

vances we have been advocating in this article. They used

item response theory methods to inform, in part, the re-

vision process; developed clear structured interview

prompts and scoring guidelines; and to some extent stan-

dardized the scoring system. We nonetheless believe that

by making an effort to retain the original 17 items, the De-

pression Rating Scale Standardization Team failed to ad-

dress many of the flaws of the original instrument. Most of

the items still measure multiple constructs, items that

have consistently been shown to be ineffective have been retained, and the scoring system still includes differential

weighting of items. Moreover, the GRID-HAMD content is

virtually unchanged from the original. All the items that

appeared on the Hamilton depression scale in 1960

are included in the GRID-HAMD. Thus, this revision has

neither removed items based on outdated concepts nor

added items that incorporate contemporary definitions of

depression.

Rejection of the Hamilton depression scale and replace-

ment with an alternative existing measure or the imple-

mentation of a new instrument has scientifically compel-

ling advantages over revision. The Inventory of Depressive

Symptomatology (95) and the Montgomery-Åsberg Depression Rating Scale (96), designed to address the limitations of the Hamilton depression scale, represent two

potential replacement alternatives. Although these instru-

ments measure contemporary definitions of depression

(33), neither item response theory methods nor other con-

temporary measurement techniques were employed in

their development. As indicated earlier, such techniques,

especially item response theory, maximize the capacity of

an instrument to detect change. On the other hand, the de-

velopment and implementation of a new instrument that

is based on current knowledge of depression and that takes

advantage of psychometric and statistical advances might

offer the best solution. The decision to replace the Hamil-

ton depression scale with either an existing instrument or a

newly developed instrument would ultimately rest on con-

sensus that such an instrument could capture more ade-

quately the full spectrum of the depression construct and

on empirical evidence of the new instrument’s superiority

in detecting treatment effects.

In conclusion, we have been struck with the marked

contrast between the effort and scientific sophistication

involved in designing new antidepressants and the con-






tinued reliance on antiquated concepts and methods for

assessing change in the severity of the depression that

these very medications are intended to affect. Effort in

both areas is critical to the accessibility of new medica-

tions for patients with depression. Many scales and instru-

ments used in psychiatry today are based on—or at least

include—current DSM symptoms, and the measurement

of depression should follow this trend. It is time to retire

the Hamilton depression scale. The field needs to move

forward and embrace a new gold standard that incorpo-

rates modern psychometric methods and contemporary

definitions of depression.

Received Dec. 7, 2003; revision received Feb. 26, 2004; accepted

March 22, 2004. From the Centre for Addiction and Mental Health,

University of Toronto; and the Department of Psychology, University

of British Columbia, Vancouver, B.C. Address reprint requests to Dr.

Bagby, Centre for Addiction and Mental Health, 250 College St., Tor-

onto, Ont., Canada M5T 1R8; [email protected] (e-mail).

Supported in part by Eli Lilly and Co. and by a Senior Research Fel-

lowship from the Ontario Mental Health Foundation to Dr. Bagby. Mr.

Ryder was supported by a postdoctoral fellowship from the Michael

Smith Foundation for Health Research, Vancouver, B.C., Canada.

The authors thank Arun Ravindrun and Sid Kennedy for their com-

ments and Natasha Owen for assistance with the manuscript.

References

1. Hamilton M: A rating scale for depression. J Neurol Neurosurg

Psychiatry 1960; 23:56 – 62

2. Demyttenaere K, De Fruyt J: Getting what you ask for: on the

selectivity of depression rating scales. Psychother Psychosom

2003; 72:61 – 70

3. Williams JB: Standardizing the Hamilton Depression Rating

Scale: past, present, and future. Eur Arch Psychiatry Clin Neu-

rosci 2001; 251(suppl 2):II6 – II12

4. Hedlund JL, Vieweg BW: The Hamilton Rating Scale for Depression: a comprehensive review. J Operational Psychiatry 1979; 10:149 – 165

5. Bech P: Rating scales for affective disorders: their validity and

consistency. Acta Psychiatr Scand Suppl 1981; 295:1 – 101

6. Bech P: Psychometric development of the Hamilton scales: the

spectrum of depression, dysthymia and anxiety, in The Hamil-

ton Scales. Edited by Bech P, Coppen A. Berlin, Springer-Verlag,

1990, pp 72 – 79

7. Maier W: The Hamilton Depression Scale and its alternatives: a

comparison of their reliability and validity, ibid, pp 64 – 71

8. Aben I, Verhey F, Lousberg R, Lodder J, Honig A: Validity of the

Beck Depression Inventory, Hospital Anxiety and Depression

Scale, SCL-90, and Hamilton Depression Rating Scale as screen-

ing instruments for depression in stroke patients. Psychoso-

matics 2002; 43:386 – 393

9. Addington D, Addington J, Schissel B: A depression rating scale

for schizophrenics. Schizophr Res 1990; 3:247 – 251

10. Addington D, Addington J, Atkinson M: A psychometric com-

parison of the Calgary Depression Scale for Schizophrenia and

the Hamilton Depression Rating Scale. Schizophr Res 1996; 19:

205 – 212

11. Akdemir A, Turkcapar MH, Orsel SD, Demirergi N, Dag I, Ozbay

MH: Reliability and validity of the Turkish version of the Hamil-

ton Depression Rating Scale. Compr Psychiatry 2001; 42:161 –

165

12. Baca-Garcia E, Blanco C, Saiz-Ruiz J, Rico F, Diaz-Sastre C, Cic-

chetti DV: Assessment of reliability in the clinical evaluation of

depressive symptoms among multiple investigators in a multi-

center clinical trial. Psychiatry Res 2001; 102:163 – 173

13. Bech P, Allerup P, Maier W, Albus M, Lavori P, Ayuso JL: The

Hamilton scales and the Hopkins Symptom Checklist (SCL-90):

a cross-national validity study in patients with panic disorders.

Br J Psychiatry 1992; 160:206 – 211

14. Bech P, Tanghoj P, Andersen HF, Overo K: Citalopram dose-re-

sponse revisited using an alternative psychometric approach

to evaluate clinical effects of four fixed citalopram doses compared to placebo in patients with major depression. Psychopharmacology (Berl) 2002; 163:20 – 25

15. Berard RMF, Ahmed N: Hospital Anxiety and Depression Scale

(HADS) as a screening instrument in a depressed adolescent

and young adult population. Int J Adolesc Med Health 1995; 8:

157 – 166

16. Berrios GE, Bulbena-Villarasa A: The Hamilton Depression

Scale and the numerical description of the symptoms of de-

pression, in The Hamilton Scales. Edited by Bech P, Coppen A.

Berlin, Springer-Verlag, 1990, pp 80 – 92

17. Brown C, Schulberg HC, Madonia MJ: Assessing depression in

primary care practice with the Beck Depression Inventory and

the Hamilton Rating Scale for Depression. Psychol Assess 1995;

7:59 – 65

18. Carroll BJ, Feinberg M, Smouse PE, Rawson SG, Greden JF: The

Carroll Rating Scale for Depression, I: development, reliability and validation. Br J Psychiatry 1981; 138:194 – 200

19. Cicchetti DV, Prusoff BA: Reliability of depression and associ-

ated clinical symptoms. Arch Gen Psychiatry 1983; 40:987 – 990

20. Craig TJ, Richardson MA, Pass R, Bregman Z: Measurement of

mood and affect in schizophrenic inpatients. Am J Psychiatry

1985; 142:1272 – 1277

21. Daradkeh T, Abou-Saleh M, Karim L: The factorial structure of

the 17-item Hamilton Depression Rating Scale. Arab J Psychia-

try 1997; 8:6 – 12

22. Deluty BM, Deluty RH, Carver CS: Concordance between clini-

cians’ and patients’ ratings of anxiety and depression as medi-

ated by private self-consciousness. J Pers Assess 1986; 50:93 –

106

23. Demitrack MA, Faries D, Herrera JM, DeBrota D, Potter WZ: The

problem of measurement error in multisite clinical trials. Psychopharmacol Bull 1998; 34:19 – 24

24. Entsuah R, Shaffer M, Zhang J: A critical examination of the

sensitivity of unidimensional subscales derived from the

Hamilton Depression Rating Scale to antidepressant drug ef-

fects. J Psychiatr Res 2002; 36:437 – 448

25. Faries D, Herrera J, Rayamajhi J, DeBrota D, Demitrack M, Pot-

ter WZ: The responsiveness of the Hamilton Depression Rating

Scale. J Psychiatr Res 2000; 34:3 – 10

26. Feinberg M, Carroll BJ, Smouse PE, Rawson SG: The Carroll Rat-

ing Scale for Depression, III: comparison with other rating in-

struments. Br J Psychiatry 1981; 138:205 – 209

27. Fleck MP, Poirier-Littre MF, Guelfi JD, Bourdel MC, Loo H: Facto-

rial structure of the 17-item Hamilton Depression Rating Scale.

Acta Psychiatr Scand 1995; 92:168 – 172

28. Fuglum E, Rosenberg C, Damsbo N, Stage K, Lauritzen L, Bech

P (Danish University Antidepressant Group): Screening and

treating depressed patients: a comparison of two controlled

citalopram trials across treatment settings: hospitalized pa-

tients vs patients treated by their family doctors. Acta Psychiatr

Scand 1996; 94:18 – 25

29. Gastpar M, Gilsdorf U: The Hamilton Depression Rating Scale in

a WHO collaborative program, in The Hamilton Scales. Edited

by Bech P, Coppen A. Berlin, Springer-Verlag, 1990, pp 10 – 19

30. Gibbons RD, Clark DC, Kupfer DJ: Exactly what does the Hamil-

ton Depression Rating Scale measure? J Psychiatr Res 1993; 27:

259 – 273






31. Gilley DW, Wilson RS, Fleischman DA, Harrison DW, Goetz CG,

Tanner CM: Impact of Alzheimer’s-type dementia and informa-

tion source on the assessment of depression. Psychol Assess

1995; 7:42 – 48

32. Gottlieb GL, Gur RE, Gur RC: Reliability of psychiatric scales in

patients with dementia of the Alzheimer type. Am J Psychiatry

1988; 145:857 – 860

33. Gullion CM, Rush AJ: Toward a generalizable model of symp-

toms in major depressive disorder. Biol Psychiatry 1998; 44:

959 – 972

34. Hammond MF: Rating depression severity in the elderly physically ill patient: reliability and factor structure of the Hamilton

and the Montgomery-Åsberg Depression Rating Scales. Int J

Geriatr Psychiatry 1998; 13:257 – 261

35. Hooijer C, Zitman FG, Griez E, van Tilburg W, Willemse A, Dink-

greve MA: The Hamilton Depression Rating Scale (HDRS);

changes in scores as a function of training and version used. J

Affect Disord 1991; 22:21 – 29

36. Hotopf M, Sharp D, Lewis G: What’s in a name? a comparison

of four psychiatric assessments. Soc Psychiatry Psychiatr Epide-

miol 1998; 33:27 – 31

37. Kobak KA, Greist JH, Jefferson JW, Mundt JC, Katzelnick DJ:

Computerized assessment of depression and anxiety over the

telephone using interactive voice response. MD Comput 1999;

16:64 – 68

38. Koenig HG, Pappas P, Holsinger T, Bachar JR: Assessing diagnostic approaches to depression in medically ill older adults: how

reliably can mental health professionals make judgments

about the cause of symptoms? J Am Geriatr Soc 1995; 43:472 –

478

39. Lambert MJ, Hatch DR, Kingston MD, Edwards BC: Zung, Beck,

and Hamilton Rating Scales as measures of treatment out-

come: a meta-analytic comparison. J Consult Clin Psychol

1986; 54:54 – 59

40. Lambert MJ, Masters KS, Astle D: An effect-size comparison of

the Beck, Zung, and Hamilton rating scales for depression: a

three-week and twelve-week analysis. Psychol Rep 1988; 63:

467 – 470

41. Leentjens AF, Verhey FR, Lousberg R, Spitsbergen H, Wilmink

FW: The validity of the Hamilton and Montgomery-Åsberg depression rating scales as screening and diagnostic tools for depression in Parkinson’s disease. Int J Geriatr Psychiatry 2000;

15:644 – 649

42. Leung CM, Wing YK, Kwong PK, Lo A, Shum K: Validation of the

Chinese-Cantonese version of the Hospital Anxiety and Depres-

sion Scale and comparison with the Hamilton Rating Scale of

Depression. Acta Psychiatr Scand 1999; 100:456 – 461

43. McAdams LA, Harris MJ, Bailey A, Fell R, Jeste DV: Validating

specific psychopathology scales in older outpatients with

schizophrenia. J Nerv Ment Dis 1996; 184:246 – 251

44. Maier W, Philipp M: Improving the assessment of severity of de-

pressive states: a reduction of the Hamilton Depression Rating

Scale. Pharmacopsychiatry 1985; 18:114 – 115

45. Maier W, Philipp M, Heuser I, Schlegel S, Buller R, Wetzel H: Improving

depression severity assessment, I: reliability, internal

validity and sensitivity to change of three observer depression

scales. J Psychiatr Res 1988; 22:3 – 12

46. Maier W, Heuser I, Philipp M, Frommberger U, Demuth W: Im-

proving depression severity assessment, II: content, concurrent

and external validity of three observer depression scales. J Psy-

chiatr Res 1988; 22:13 – 19

47. Marcos T, Salamero M: Factor study of the Hamilton Rating

Scale for Depression and the Bech Melancholia Scale. Acta Psy-

chiatr Scand 1990; 82:178 – 181

48. Meyer JS, Li YS, Thornby J: Validating mini-mental status, cog-

nitive capacity screening and Hamilton depression scales utiliz-

ing subjects with vascular headaches. Int J Geriatr Psychiatry

2001; 16:430 – 435

49. Middelboe T, Ovesen L, Mortensen EL, Bech P: Depressive

symptoms in cancer patients undergoing chemotherapy: a

psychometric analysis. Psychother Psychosom 1994; 61:171 –

177

50. Moberg PJ, Lazarus LW, Mesholam RI, Bilker W, Chuy IL, Ney-

man I, Markvart V: Comparison of the standard and structured

interview guide for the Hamilton Depression Rating Scale in

depressed geriatric inpatients. Am J Geriatr Psychiatry 2001; 9:35 – 40

51. Mottram P, Wilson K, Copeland J: Validation of the Hamilton

Depression Rating Scale and Montgomery and Åsberg Rating

Scales in terms of AGECAT depression cases. Int J Geriatr Psychi-

atry 2000; 15:1113 – 1119

52. Naarding P, Leentjens AF, van Kooten F, Verhey FR: Disease-

specific properties of the Rating Scale for Depression in pa-

tients with stroke, Alzheimer’s dementia, and Parkinson’s dis-

ease. J Neuropsychiatry Clin Neurosci 2002; 14:329 – 334

53. O’Brien KP, Glaudin V: Factorial structure and factor reliability

of the Hamilton Rating Scale for Depression. Acta Psychiatr

Scand 1988; 78:113 – 120

54. O’Hara MW, Rehm LP: Hamilton Rating Scale for Depression:

reliability and validity of judgments of novice raters. J Consult

Clin Psychol 1983; 51:318 – 319

55. Olsen LR, Jensen DV, Noerholm V, Martiny K, Bech P: The internal

and external validity of the Major Depression Inventory in

measuring severity of depressive states. Psychol Med 2003; 33:

351 – 356

56. Onega LL, Abraham IL: Factor structure of the Hamilton Rating

Scale for Depression in a cohort of community-dwelling eld-

erly. Int J Geriatr Psychiatry 1997; 12:760 – 764

57. Pancheri P, Picardi A, Pasquini M, Gaetano P, Biondi M: Psycho-

pathological dimensions of depression: a factor study of the

17-item Hamilton depression rating scale in unipolar de-

pressed outpatients. J Affect Disord 2002; 68:41 – 47

58. Paykel ES: Use of the Hamilton Depression Scale in General

Practice, in The Hamilton Scales. Edited by Bech P, Coppen A.

Berlin, Springer-Verlag, 1990, pp 40 – 47

59. Potts MK, Daniels M, Burnam MA, Wells KB: A structured inter-

view version of the Hamilton Depression Rating Scale: evi-

dence of reliability and versatility of administration. J Psychiatr

Res 1990; 24:335 – 350

60. Ramos-Brieva JA, Cordero-Villafafila A: A new validation of the

Hamilton Rating Scale for Depression. J Psychiatr Res 1988; 22:

21 – 28

61. Rehm LP, O’Hara MW: Item characteristics of the Hamilton Rat-

ing Scale for Depression. J Psychiatr Res 1985; 19:31 – 41

62. Reynolds WM, Kobak KA: Reliability and validity of the Hamil-

ton Depression Inventory: a paper-and-pencil version of the

Hamilton Depression Rating Scale clinical interview. Psychol

Assess 1995; 7:472 – 483

63. Riskind JH, Beck AT, Brown G, Steer RA: Taking the measure of

anxiety and depression: validity of the reconstructed Hamilton

scales. J Nerv Ment Dis 1987; 175:474 – 479

64. Santor DA, Coyne JC: Evaluating the continuity of symptomatol-

ogy between depressed and nondepressed individuals. J Ab-

norm Psychol 2001; 110:216 – 225

65. Santor DA, Coyne JC: Examining symptom expression as a func-

tion of symptom severity: item performance on the Hamilton

Rating Scale for Depression. Psychol Assess 2001; 13:127 – 139

66. Sayer NA, Sackeim HA, Moeller JR, Prudic J, Devanand DP,

Coleman EA, Kiersky JE: The relations between observer-rating

and self-report of depressive symptomatology. Psychol Assess

1993; 5:350 – 360

67. Senra Rivera C, Racano Perez C, Sanchez Cao E, Barba Sixto S:

Use of three depression scales for evaluation of pretreatment

severity and of improvement after treatment. Psychol Rep

2000; 87:389 – 394

68. Shain BN, Naylor M, Alessi N: Comparison of self-rated and cli-

nician-rated measures of depression in adolescents. Am J Psy-

chiatry 1990; 147:793 – 795

69. Smouse PE, Feinberg M, Carroll BJ, Park MH, Rawson SG: The

Carroll Rating Scale for Depression, II: factor analyses of the

feature profiles. Br J Psychiatry 1981; 138:201 – 204

70. Steinmeyer EM, Möller HJ: Facet theoretic analysis of the

Hamilton-D scale. J Affect Disord 1992; 25:53 – 61

71. Strik JJ, Honig A, Lousberg R, Denollet J: Sensitivity and specificity

of observer and self-report questionnaires in major and minor

depression following myocardial infarction. Psychosomatics

2001; 42:423 – 428

72. Teri L, Wagner AW: Assessment of depression in patients with

Alzheimer’s disease: concordance among informants. Psychol

Aging 1991; 6:280 – 285

73. Thase ME, Hersen M, Bellack AS, Himmelhoch JM, Kupfer DJ:

Validation of a Hamilton subscale for endogenomorphic de-

pression. J Affect Disord 1983; 5:267 – 278

74. Thompson WM, Harris B, Lazarus J, Richards C: A comparison

of the performance of rating scales used in the diagnosis of

postnatal depression. Acta Psychiatr Scand 1998; 98:224 – 227

75. Whisman MA, Strosahl K, Fruzzetti AE, Schmaling KB, Jacobson

NS, Miller DM: A structured interview version of the Hamilton

Rating Scale for Depression: reliability and validity. Psychol Assess

1989; 1:238 – 241

76. Williams JB: A structured interview guide for the Hamilton De-

pression Rating Scale. Arch Gen Psychiatry 1988; 45:742 – 747

77. Zheng YP, Zhao JP, Phillips M, Liu JB, Cai MF, Sun SQ, Huang MF:

Validity and reliability of the Chinese Hamilton Depression Rat-

ing Scale. Br J Psychiatry 1988; 152:660 – 664

78. Cronbach LJ: Coefficient alpha and the internal structure of

tests. Psychometrika 1951; 16:297 – 334

79. Briggs SR, Cheek JM: The role of factor analysis in the develop-

ment and evaluation of personality scales. J Pers 1986; 54:

106 – 148

80. Nunnally JC, Bernstein IH: Psychometric Theory, 3rd ed. New

York, McGraw-Hill, 1994

81. Fleiss JL, Shrout PE: The effects of measurement errors on

some multivariate procedures. Am J Public Health 1977; 67:

1188 – 1191

82. Landis JR, Koch GG: The measurement of observer agreement

for categorical data. Biometrics 1977; 33:159 – 174

83. Anastasi A, Urbina S: Psychological Testing, 7th ed. New York,

MacMillan, 1997

84. Bock RD, Gibbons RD, Muraki E: Full information item factor

analysis. Applied Psychol Measurement 1988; 12:261 – 280

85. Gibbons RD, Clark DC, VonAmmon CS, Davis JM: Application of

modern psychometric theory in psychiatric research. J Psychi-

atr Res 1985; 19:43 – 55

86. Bech P, Allerup P, Gram LF, Reisby N, Rosenberg R, Jacobsen O,

Nagy A: The Hamilton depression scale: evaluation of objectiv-

ity using logistic models. Acta Psychiatr Scand 1981; 63:290 – 299

87. Bech P, Gram LF, Dein E, Jacobsen O, Vitger J, Bolwig TG: Quan-

titative rating of depressive states. Acta Psychiatr Scand 1975;

51:161 – 170

88. Gorsuch RL: Factor Analysis. Hillsdale, NJ, Lawrence Erlbaum Associates,

1983

89. Prusoff B, Klerman GL: Differentiating depressed from anxious

neurotic outpatients. Arch Gen Psychiatry 1974; 30:302 – 309

90. Edwards BC, Lambert MJ, Moran PW, McCully T, Smith KC, Ell-

ingson AG: A meta-analytic comparison of the Beck Depression

Inventory and the Hamilton Rating Scale for Depression as

measures of treatment outcome. Br J Clin Psychol 1984;

23(part 2):93 – 99

91. O’Sullivan RL, Fava M, Agustin C, Baer L, Rosenbaum JF: Sensi-

tivity of the six-item Hamilton Depression Rating Scale. Acta

Psychiatr Scand 1997; 95:379 – 384

92. Hooper CL, Bakish D: An examination of the sensitivity of the

six-item Hamilton Rating Scale for Depression in a sample of

patients suffering from major depressive disorder. J Psychiatry

Neurosci 2000; 25:178 – 184

93. Kalali A, Ginertini M, Kobak K, Engelhardt N, Williams JBW,

Evans K, Bech P, Lipsitz J, Olin J, Pearson J, Rothman M: The

GRID-HAMD: a reliability study in patients with major depres-

sion, in Abstracts of the 43rd Annual New Clinical Drug Evalua-

tion Unit (NCDEU) Meeting. Bethesda, Md, NIMH, 2003, Poster

I-19

94. Kalali A, Williams JB, Kobak KA, Lipsitz J, Engelhardt N, Evans K,

Olin J, Pearson J, Rothman M, Bech P: The new GRID HAM-D: pi-

lot testing and international field trials. Int J Neuropsychophar-

macol 2002; 5:S147 – S148

95. Rush AJ, Giles DE, Schlesser MA, Fulton CL, Weissenburger J,

Burns C: The Inventory for Depressive Symptomatology (IDS):

preliminary findings. Psychiatry Res 1986; 18:65 – 87

96. Montgomery SA, Åsberg M: A new depression scale designed to

be sensitive to change. Br J Psychiatry 1979; 134:382 – 389
