
www.elsevier.com/locate/psychres

Psychiatry Research 145 (2006) 21–37

Development and reliability of the HAM-D/MADRS Interview:

An integrated depression symptom rating scale

Rebecca W. Iannuzzo a,*, Judith Jaeger a,1, Joseph F. Goldberg b,2,

Vivian Kafantaris b,3, M. Elizabeth Sublette c,4

a Center for Neuropsychiatric Outcome and Rehabilitation Research (CENORR), the Zucker Hillside Hospital, Long Island Jewish Medical Center, 75-59 263rd Street, Ambulatory Care Pavilion, Room 2219, Glen Oaks, NY 11004, USA

b Department of Psychiatry, the Zucker Hillside Hospital, Long Island Jewish Medical Center, 75-59 263rd Street, Glen Oaks, NY 11004, USA

c Department of Child Psychiatry, New York Psychiatric Institute, Columbia University, Suite 2917, Unit 42, 1051 Riverside Drive, New York, NY 10032, USA

Received 26 May 2005; received in revised form 5 October 2005; accepted 17 October 2005

Abstract

The Hamilton Rating Scale for Depression (HAM-D) and the Montgomery-Asberg Depression Rating Scale (MADRS), two

widely used depression scales, each have unique advantages and limitations for research. The HAM-D’s limited sensitivity and

multidimensionality have been criticized, despite the scale’s popularity. The MADRS, designed to be sensitive to treatment changes,

is briefer and more uniform. A limitation of the MADRS is the lack of a structured interview, which may affect reliability. The HAM-D

and the MADRS are often used conjointly as endpoints in depression trials. We designed a hybrid questionnaire that allows

administration of MADRS and 31 HAM-D items simultaneously. Seventy mood disorder patients (60 bipolar I, 10 major depressive

disorder) were administered the HAM-D/MADRS Interview (HMI) as part of a larger study. Interrater reliability for 50 patients was

excellent for the HAM-D and the MADRS (ICC=0.97–0.98). MADRS item reliabilities (ICC=0.86–0.97) were higher than obtained

in studies that did not use a structured interview. Reliability coefficients for seven HAM-D31 ‘atypical’ symptoms ranged from 0.77 to

0.95. HMI was highly correlated with the Global Clinical Impressions Scale. This is the first study we know of to investigate the

reliability of a structured interview of either the MADRS or of the HAM-D31. The HMI provides an easily administered, reliable

method of rating depression severity which may improve consistency and validity of study findings.

© 2005 Elsevier Ireland Ltd. All rights reserved.

Keywords: Depression; Rating scales; HAM-D (Hamilton Rating Scale for Depression); MADRS (Montgomery–Asberg Depression Rating Scale);

Structured interview


doi:10.1016/j.psychres.2005.10.009

* Corresponding author. Tel.: +1 718 470 8072; fax: +1 718 347 5514.

E-mail addresses: [email protected] (R.W. Iannuzzo), [email protected] (J. Jaeger), [email protected] (J.F. Goldberg), [email protected] (V. Kafantaris), [email protected] (M.E. Sublette).

1 Albert Einstein College of Medicine of Yeshiva University. Tel.: +1 718 470 8342; fax: +1 718 962 2742.

2 Tel.: +1 718 470 4134.

3 Tel.: +1 718 470 8556.

4 Tel.: +1 212 543 6241; fax: +1 212 543 6017.

1. Introduction

Rating scales that are reliable, valid, and sensitive to

treatment-related changes are critical for efficacy studies

in depression. The increase over the past decade in the

number of medications and psychotherapies under inves-

tigation for the treatment of depression has been accom-

panied by an increase in the number of rating scales,


versions of existing scales, and subscales to evaluate

treatment-related improvement.

1.1. Hamilton Rating Scale for Depression

The Hamilton Rating Scale for Depression (HAM-D)

(Hamilton, 1960) is the most widely used rating scale for

depression and is considered by many to be the “gold standard”. Despite its popularity, the HAM-D has been

widely criticized owing to its limited sensitivity to

change in depression severity (Montgomery and Asberg,

1979), heavy weighting toward behavioral and somatic

symptoms, and low item level reliability (Williams,

1988). Modified versions of the HAM-D have prolifer-

ated in response to these limitations. Modifications in-

clude the addition or omission of items, the addition of

standardized interview questionnaires to conduct the

ratings, and alterations in item definitions and anchors.

The first of these modifications came from the scale’s

original author, who added to the original 17-item ver-

sion (Hamilton, 1960) four additional items (diurnal

variation, paranoid ideation, obsessive/compulsive

symptoms, and depersonalization/derealization) (Hamil-

ton, 1967) that are, however, not included in the total

score. Subsequently, many other modified versions of

the HAM-D have been used in published depression

research, leading some investigators to question whether

the HAM-D is “one scale or many” (Grundy et al., 1994).

Among these versions, there is a wide variability in the

total number of items included, ranging from a brief six-item

version consisting of “core depressive” symptoms

(Bech et al., 1981) to an expanded 31-item version that

contains 5 “reverse vegetative” symptoms found in atypical

depression and two additional retardation items.

Several standardized interview questionnaires (Wil-

liams, 1988; Whisman et al., 1989; Potts et al., 1990)

have been developed in an effort to improve the

HAM-D’s reliability. The most widely used structured

interview version is Williams’ (1988) Structured Inter-

view Guide for the HAM-D (SIGH-D), which includes

Hamilton’s original 17 items and four supplemental items

(Hamilton, 1967). Other changes to the original HAM-

D have included modified item and anchor descrip-

tions, and variability in total number of items used to

arrive at a total depression score. The specific version

used in particular depression studies, and reliability

and validity data for the version used, are often not

reported or are inaccurately referenced. These meth-

odological differences between various HAM-D ver-

sions have contributed to difficulty in comparing,

evaluating, and drawing conclusions about depression

study findings.

1.2. Montgomery–Asberg Depression Rating Scale

The Montgomery–Asberg Depression Rating Scale

(MADRS) (Montgomery and Asberg, 1979) is a 10-item

scale that has grown in popularity among depression

researchers, partly in response to the problems inherent

in the use of the HAM-D. The 10 MADRS items, chosen

from a 65-item comprehensive psychopathology instru-

ment (CPRS) (Asberg et al., 1978), were selected for

their ability to detect changes due to antidepressant

treatment and their high correlations with overall change

in depression.

Studies that have subjected the MADRS to principal

components factor analyses have found a more uniform

internal structure compared with the HAM-D, with

most studies identifying two (Serretti et al., 1999;

Rocca et al., 2002) or three factors (Galinowski and

Lehert, 1995). However, Galinowski and Lehert (1995)

and Rocca et al. (2002) were able to substantiate only a

single factor, representing core depressive symptoms,

following antidepressant treatment.

A number of studies comparing the MADRS and the

HAM-D found the former to have greater sensitivity to

treatment-related changes in depression severity (David-

son et al., 1986; Senra, 1996; Mulder et al., 2003). At

least one study (Maier et al., 1988), however, found the

MADRS’ sensitivity was somewhat lower than that of

either the HAM-D or another measure of depression,

the Bech–Rafaelsen Melancholia Scale (BRMS) (Bech

and Rafaelsen, 1980). The MADRS’ brief length com-

pared with the HAM-D results in shorter administration

time, an advantage in large clinical trials.

A potential limitation of the MADRS is that it does

not utilize a standardized interview to guide ratings,

which may lower reliability. Use of a structured inter-

view questionnaire such as the SIGH-D (Williams, 1988)

has been demonstrated to improve the interrater reliabil-

ity of the HAM-D at both the item and total score levels.

It would be reasonable to assume that the use of a

structured interview would improve reliability of the

MADRS as well. A second potential weakness of the

MADRS is that it does not permit evaluation of atypical

and “accessory” symptoms of depression as is possible

with several versions of the HAM-D.

Thus, both the HAM-D and the MADRS have

unique advantages for depression research, as well as

potential limitations when used alone. In pharmaceuti-

cal trials, it is common to use both the HAM-D and the

MADRS simultaneously to measure outcome (Hawley

et al., 1998). Use of multiple measures also allows

researchers to take advantage of each scale’s assets

and to assure both comprehensiveness and comparability

to previous studies. In response to these factors, we

designed a questionnaire that allows administration and

rating of the MADRS and several of the most widely

used HAM-D versions in a single structured interview.

This combined interview allows depression researchers

to obtain more information than obtained from one

scale alone, while increasing reliability and efficient

administration of both instruments.

In designing the HAM-D/MADRS Interview (HMI),

it was not our intention to develop a new depression

rating scale. Rather, our aim was to increase the efficien-

cy and reliability of rating depression in clinical studies

in which both the HAM-D and MADRS are used.

We first describe the development of the

HMI and then report on the HMI’s reliability.

Study participants were patients with mood disorders

(bipolar I disorder or major depressive disorder) who

were administered the HMI as part of a larger assess-

ment battery for ongoing studies on mood disorders.

2. Methods

2.1. Development of the HAM-D/MADRS Interview

(HMI)

To evaluate which among the many HAM-D ver-

sions would be most appropriate to consider as a basis

for a hybrid interview questionnaire, an exhaustive

search of the literature was conducted using the

PsycINFO and Medline electronic databases to locate

articles published between 1960 and 2004. Addition-

ally, a manual search of the reference sections of key

articles was conducted. More than 30 different En-

glish-language versions of the HAM-D were found.

We considered each version’s reliability and validity

(when reported), frequency of use in depression re-

search, and versatility for assessing a wide range of

symptoms. The final versions selected for inclusion in

the HMI are described below.

The Structured Interview Guide for the Hamilton

Depression Rating Scale (SIGH-D) (Williams, 1988)

formed the foundation for the HMI owing to its

advantage in using a standardized interview to en-

hance reliability. Reliability of the SIGH-D has been

established at both the total score and individual item

levels (Williams, 1988) and it is widely used among

depression researchers. The SIGH-D contains Hamil-

ton’s (1967) original 21 items, of which the first 17

are scored.

The HAM-D 24-item version (HAM-D24) (Guy,

1976; Riskind et al., 1987) incorporates a standardized

interview to guide ratings that is based on the SIGH-D

interview. However, the 24-item version includes addi-

tional items to assess cognitive symptoms of depression

(helplessness, hopelessness, and worthlessness). As

with the SIGH-D, only the first 17 items are included

in the total depression score.

The HAM-D 31-item version (HAM-D31) includes,

in addition to the 24 items above, five items that assess

the reverse vegetative symptoms of atypical depression

(increased appetite, weight gain, and three hypersomnia

items) and two additional retardation items (psychic

retardation and motoric retardation). The HAM-D31 is

frequently used in antidepressant clinical trials (e.g.,

Calabrese et al., 1999; Nierenberg et al., 2003; Fava

et al., 2005) due to its ability to detect changes in

atypical depressive symptoms.
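The item composition of the HAM-D versions described above can be illustrated with a short Python sketch. The consecutive item numbering (1 to 31) and the names `total_score`, `HAMD17_ITEMS`, and so on are assumptions made for illustration only; the item counts and the use of the first 17 items for the standard total score come from the descriptions above.

```python
# Hypothetical item numbering: items 1-17 form the scored core, 18-21 are
# Hamilton's (1967) supplemental items, 22-24 the cognitive items, and
# 25-31 the reverse vegetative and retardation items.
HAMD17_ITEMS = range(1, 18)   # items entering the 17-item total score
SIGHD_ITEMS = range(1, 22)    # the 21 items contained in the SIGH-D
HAMD24_ITEMS = range(1, 25)   # adds helplessness, hopelessness, worthlessness
HAMD31_ITEMS = range(1, 32)   # adds 5 reverse vegetative and 2 retardation items


def total_score(ratings, items):
    """Sum the ratings (dict: item number -> rating) over the given items."""
    missing = [i for i in items if i not in ratings]
    if missing:
        raise ValueError(f"missing ratings for items: {missing}")
    return sum(ratings[i] for i in items)


# Usage with made-up ratings: every item rated 1.
example_ratings = {i: 1 for i in HAMD31_ITEMS}
hamd17_total = total_score(example_ratings, HAMD17_ITEMS)
hamd31_total = total_score(example_ratings, HAMD31_ITEMS)
```

Keeping the item sets separate from the ratings means a single set of interview ratings can yield a total for each version without re-rating.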

A limitation of the HAM-D31 is its lack of a stan-

dardized interview to guide ratings, which may ad-

versely affect its reliability. We were unable to find

any study examining the reliability of the HAM-D31

in our search of the HAM-D literature. We found only

one study, a factor analysis, investigating the psycho-

metric properties of the HAM-D31 (Jamerson et al.,

2003). Additionally, we were unable to locate a pub-

lished primary reference for the HAM-D31. Among

clinical trials in which the HAM-D31 has been used,

the authors have either incorrectly cited or completely

omitted any references to this expanded version. Much

more information is needed on the reliability and va-

lidity of the HAM-D31 if it is to continue to be used to

assess changes in depression severity in clinical trials.

O’Sullivan et al. (1997) demonstrated that a brief

HAM-D subscale identified by Bech et al. (1981)

including six core items of depression (depressed

mood, guilt, work and interests, psychomotor retarda-

tion, psychic anxiety, and somatic symptoms) discri-

minates between typical and atypical depression as

measured by the 28-item HAM-D. However, the

HAM-D31 has the advantage of allowing researchers

to assess changes in severity of specific atypical symp-

toms that use of a briefer version does not allow. This

might affect the generalizability of antidepressant trials

in mood disorder patients with atypical depressive

symptoms and preclude examination of whether atyp-

ical depression is associated with treatment response

(Zimmerman et al., 2005). This may have particular

importance in the study of treatment response in bipo-

lar disorder patients. Patients with bipolar spectrum

disorders have been found to have more atypical

major depressive episodes, and more individual atypi-

cal symptoms, than patients with unipolar depression

(Benazzi, 2001). Use of the HAM-D31 would be en-

hanced, however, with the addition of a structured


interview to facilitate reliable administration and addi-

tional studies on its psychometric properties.

Based upon the above review, and guided by previous

work on structured interview formats, the three versions

of the HAM-D were integrated with one another and with

the 10 MADRS items into a single structured interview

questionnaire. Since neither the MADRS nor the HAM-

D31 uses a standardized interview, existing structured

interview questions from the SIGH-D and the HAM-

D24 interviews were supplemented with new interview

questions corresponding to the 10 MADRS and addi-

tional seven items on the HAM-D31. In order to gain

optimal benefit from interviewing tools with established

reliability, additional interview questions were, wherever

suitable, extracted from, or closely based on, Structured

Clinical Interview for DSM-IV questions (e.g., for

MADRS item #6, “Concentration difficulties”).

2.2. Structure of interview questionnaire

HAM-D and MADRS items of similar content (e.g.,

HAM-D ‘depressed mood’ and MADRS ‘apparent sadness’ and ‘reported sadness’) were assembled together,

preserving the original item anchors on each scale, but

facilitating their being rated together, based on a single

line of inquiry. Where there were discrepancies between

item anchors or interview questions between HAM-D

versions, a consensus decision as to which questions or

item anchors to include was made by the authors, all

experienced mood disorder researchers.

Most of the corresponding items were very similar or identical in wording. Substantive differences were, however, found on the item and anchor descriptions for item 13 (HAM-D24 “somatic energy”, HAM-D31 “anergia”, and SIGH-D “general somatic symptoms”) that could result in different ratings being given to the same patient. The SIGH-D item “general somatic symptoms” was least similar in content to the corresponding item on the original scale (see Table 1). Of the three alternatives, we decided to retain the “anergia” item from the HAM-D31 because it was similar to the original HAM-D17 and because doing so offered consistency with alterations made to the HAM-D31 that better characterize atypical depression features.

Table 1

Comparison of item 13 on various HAM-D versions

Version | 17-item (Hamilton, 1967); 24-item (Guy, 1976) | 21-item SIGH-D (Williams, 1988) | 31-item HAM-D

Item label | “Somatic energy” | “Somatic symptoms - general” | “Anergia”

Item anchors | 0=Normal. | 0=None. | 0=Absent.

 | 1=Occasional, mild fatigue, easy tiring, aching. | 1=Heaviness in limbs, back, or head. Backaches, headaches, muscle aches. Loss of energy and fatigability. | 1=Mild; infrequent; feelings noted but not marked.

 | 2=Obviously low in energy, tired all the time; frequent headaches, backaches, heavy feeling in limbs. | 2=Any clear-cut symptom. | 2=Obvious and severe; Tires very quickly; exhausted much of the time; spontaneously mentions these symptoms.

The final HMI instrument is arranged in a user-friendly three-column tabular form in which interview questions and secondary probes are listed on the left, and aligned vertically with item descriptions and anchors for HAM-D and corresponding MADRS items (see Appendix A). Important distinctions between similar items are made clear and secondary probes assure that these distinctions are preserved. All HMI items are linked on the form to the original scale by their item number and name to permit the investigator to individually calculate total MADRS and total HAM-D scores for each of the versions included in this hybrid instrument (Appendix A).

2.3. Subjects and procedures

Seventy affective disorder patients were administered the HMI as part of a larger assessment battery for studies of affective disorders. Subjects were recruited for the study as inpatients hospitalized for an acute manic or depressive episode, as part of a larger longitudinal study on disability in severe mood disorders (J. Jaeger, P.I.). In the larger study, patients were followed and reassessed monthly over a 1- or 2-year period. Therefore, data were collected for subjects with a wide range of depression severity, ranging from euthymic to severely depressed.

2.3.1. Reliability

Interrater reliability data were obtained for the first 50 subjects, 42 of whom were diagnosed with bipolar I disorder (BPI) and eight of whom were diagnosed with major depressive disorder (MDD), through conjoint interviews by pairs of raters who had been previously trained in the use of the HAM-D24 and the MADRS instruments, and who were later familiarized with the newly developed HMI. One rater conducted the interview while the other observed, and then each made his or her ratings independently. Ratings were discussed afterward, but no ratings were changed based on those discussions. In all, nine raters participated. Diagnoses were established using the Structured Clinical Interview for DSM-IV (SCID-I/P, version 2.0) (First et al., 1998), administered by an experienced team of master’s or doctoral level research psychologists who had undergone extensive training in its administration and scoring. Diagnostic confidence was confirmed through a consensus committee review by at least three senior research psychologists and psychiatrists using all available data from SCID interviews and clinical records.

Table 2

Intraclass correlation coefficients (ICC) for individual HAM-D and MADRS scale total scores, using the HAM-D/MADRS Interview

Scale | ICCs (n = 50)

MADRS | 0.98

HAM-D 17-item | 0.98

HAM-D 31-item | 0.97

2.3.2. Concurrent validity

Concurrent validity (the degree to which the HMI

correlates with another measure of depression) was

assessed for all 70 subjects by computing Spearman’s

ρ correlation coefficients for HMI scores with scores on

the Clinical Global Impressions (CGI) Depression Scale,

a global measure of depression severity. The CGI uses a

7-point Likert scale to rate illness severity, with a score of

1 indicating absence of depression, and a score of 7

indicating severe depression.
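As an illustration of the statistic used here, the following Python sketch computes Spearman's rank correlation as the Pearson correlation of average ranks, which handles tied scores. It is a minimal sketch, not the authors' analysis code; the function names and the example scores are hypothetical.

```python
import math


def average_ranks(values):
    """Return 1-based ranks, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        shared = (pos + end) / 2 + 1  # mean of 1-based positions pos+1 .. end+1
        for k in range(pos, end + 1):
            ranks[order[k]] = shared
        pos = end + 1
    return ranks


def spearman_rho(x, y):
    """Spearman's rho: Pearson correlation computed on the average ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(x)
    mean_x, mean_y = sum(rx) / n, sum(ry) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("rho is undefined when one variable is constant")
    return cov / math.sqrt(var_x * var_y)


# Usage with made-up depression totals and CGI severity ratings (1-7).
example_totals = [4, 12, 25, 9, 31, 17]
example_cgi = [1, 3, 5, 2, 6, 4]
rho = spearman_rho(example_totals, example_cgi)
```

A rank-based coefficient suits this comparison because the CGI is an ordinal 7-point rating rather than an interval scale.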

2.3.3. Ratings of mania severity

Mania severity ratings were obtained for the 42 BPI

patients using the Young Mania Rating scale (YMRS)

(Young et al., 1978) and the Clinician-Administered

Rating Scale for Mania (CARS-M) (Altman et al.,

1994). The 11-item YMRS is rated on a 5-point scale

from 0 to 4, with higher scores indicating increased

symptom severity. The CARS-M contains 15 items that

are rated on a scale of 0–5. The first 10 CARS-M items

are summed to derive a mania subscale score, with

severity cutoff scores suggested by Altman et al.

(1994) as follows: 0–7 (no mania), 8–15 (mild), 16–

25 (moderate), and 26 or greater (severe).
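The CARS-M subscale scoring and severity cutoffs just described can be written out as a small Python sketch. The function names are hypothetical, and the assumption that the 15 item ratings are supplied in scale order is ours; the item count, rating range, and cutoffs are those stated above.

```python
def cars_m_mania_subscale(item_ratings):
    """Sum the first 10 of the 15 CARS-M item ratings (each rated 0-5)."""
    if len(item_ratings) != 15:
        raise ValueError("expected 15 CARS-M item ratings")
    if any(not 0 <= r <= 5 for r in item_ratings):
        raise ValueError("each item rating must lie between 0 and 5")
    return sum(item_ratings[:10])


def mania_severity(subscale_score):
    """Map a mania subscale score to the severity bands of Altman et al. (1994)."""
    if not 0 <= subscale_score <= 50:
        raise ValueError("subscale score must lie between 0 and 50")
    if subscale_score <= 7:
        return "no mania"
    if subscale_score <= 15:
        return "mild"
    if subscale_score <= 25:
        return "moderate"
    return "severe"


# Usage with made-up ratings: every item rated 1 gives a subscale score of 10.
example_score = cars_m_mania_subscale([1] * 15)
example_band = mania_severity(example_score)
```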

3. Results

3.1. Patient characteristics

The 50 subjects (53% male) for whom interrater

reliability data were obtained ranged in age from 18 to

59 years (mean=39.4, S.D.=11.72). Years of education

completed ranged from 10 to 20 years (mean=15.05,

S.D.=2.55). A total of 72% of study subjects were

Caucasian, 16% African-American, 5% Hispanic, 5%

Asian-American, and 2% Native American (χ²=75.49,

P < 0.001). Patients had a diagnosis of either bipolar I

disorder (N = 42) or major depressive disorder (N = 8)

(χ²=21.49, P < 0.001), based on SCID interviews

using DSM-IV criteria. HAM-D 17-item total scores

ranged from 0 to 33 (mean=10.07, S.D.=7.15, medi-

an=8), indicating a wide range of depression severity,

from euthymic to severely depressed. Similarly,

MADRS scores ranged from 0 to 49 (mean=11.49,

S.D.=10.57, median=9). Of the 50 subjects, 30%

(n =15) had CGI depression scores of 4 or greater, indi-

cating that the depression should be treated. However,

only 8% of the sample had HAM-D17 scores in the severe

range (26 or greater), which may limit generalizability of

our results to patients with mild to moderate levels of

depression.

YMRS scores ranged from 0 to 22 (mean=5.22,

S.D.=5.52), indicating an absence of mania in our bipo-

lar subsample (scores of 20 or above suggest a manic

episode is present). Similarly, CARS-M mania subscale

scores ranged from 0 to 21 (mean=4.48, S.D.=4.68).

3.2. Data analysis

3.2.1. Interrater reliability

Intraclass correlation coefficients (ICCs) were

used to obtain interrater reliability data for the

HAM-D17, 21, and 31 versions and the MADRS, at both

the total score and individual item levels for the 50

paired HMI interviews. In addition, Spearman’s ρ correlation

coefficients were calculated to allow comparison

of interrater reliability on the MADRS with another

MADRS reliability study (Davidson et al., 1986).

Interrater reliability for all individual scale total scores

was excellent (ICC=0.97–0.98), and ranged from good

to excellent for individual items (ICC=0.72–0.97).

Table 2 displays the total score ICCs for the HMI, the

HAM-D17 and 31 versions, and the MADRS.
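The text does not state which form of the ICC was computed. As a hedged illustration only, the Python sketch below implements the one-way random-effects, single-rater coefficient (ICC(1,1) in the notation of Shrout and Fleiss), a form often chosen when different rater pairs rate different subjects, as was the case here. The function name and the example ratings are hypothetical.

```python
def icc_one_way(ratings):
    """One-way random-effects, single-rater ICC.

    ratings: list with one entry per subject, each a list of k ratings.
    Returns (MSB - MSW) / (MSB + (k - 1) * MSW), where MSB and MSW are the
    between-subject and within-subject mean squares.
    """
    if not ratings:
        raise ValueError("no ratings supplied")
    n = len(ratings)
    k = len(ratings[0])
    if n < 2 or k < 2 or any(len(r) != k for r in ratings):
        raise ValueError("need at least 2 subjects, each with the same k >= 2 ratings")
    grand_mean = sum(sum(r) for r in ratings) / (n * k)
    subject_means = [sum(r) / k for r in ratings]
    ms_between = k * sum((m - grand_mean) ** 2 for m in subject_means) / (n - 1)
    ms_within = sum(
        (x - m) ** 2 for r, m in zip(ratings, subject_means) for x in r
    ) / (n * (k - 1))
    denominator = ms_between + (k - 1) * ms_within
    if denominator == 0:
        raise ValueError("ICC is undefined when all ratings are identical")
    return (ms_between - ms_within) / denominator


# Usage with made-up total scores from two raters for five subjects.
example_pairs = [[10, 11], [3, 3], [25, 24], [17, 18], [8, 8]]
example_icc = icc_one_way(example_pairs)
```

Other ICC forms (for example, two-way models) treat rater effects differently and can give different values for the same data, so the form used should be reported alongside the coefficient.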

3.2.2. Intercorrelations between scales

Spearman’s ρ correlation coefficients were used to examine the correlations between individual scales and the HMI for all 70 subjects (60 BPI and 10 MDD). These 70 participants ranged in age from 18 to 58 years (mean=39.28, S.D.=11.23). Education completed ranged from 10 to 20 years (mean=14.85, S.D.=2.55). The mean HAM-D17 score was 10.00 (S.D.=7.20, median=9). The mean MADRS score was 11.93 (S.D.=10.17, median=10). Of the 70 subjects, 16 (22.9%) had CGI scores of 4 or above, indicating severe depression.

Table 4

HAM-D 24-item and 31-item additional item reliabilities

Item | Intraclass correlation coefficients (ICCs) (n = 50)

HAM-D24 additional items

Helplessness | 0.734

Hopelessness | 0.933

Worthlessness | 0.788

HAM-D31 additional items

Hypersomnia (early) | 0.800

Hypersomnia (middle) | 0.836

Hypersomnia (late) | 0.813

Increased appetite | 0.767

Weight gain | 0.947

Psychic retardation | 0.854

Motoric retardation | 0.775

3.2.3. MADRS

Table 3 shows MADRS total scale and item-level

reliability obtained using the HMI. We compared our

interrater reliability results to those obtained by David-

son et al. (1986). In that study, 44 pairs of MADRS

ratings were obtained through conjoint interviews by a

psychiatrist and a psychiatric nurse in inpatients with

major depression, without the use of a structured in-

terview. Davidson et al. (1986) used Spearman correla-

tions to measure the agreement between raters.

Therefore, to permit this comparison with their study,

in addition to calculating ICCs to assess interrater

reliability for the MADRS scale, we included Spearman’s

ρ correlation coefficients as an additional measure

of interrater reliability for the MADRS (see Table

3). Although that study did measure item-level reliabil-

ity, the authors unfortunately did not report those

correlations, which prevented us from drawing com-

parisons at that level.

Other reliability studies of the unstructured MADRS scale have also been conducted. Kørner et al. (1990), employing joint interviews, obtained good MADRS total score reliability (ICC=0.86) in a sample of 40 inpatients (age 26–89 years old) with major depression or dysthymic disorder but did not obtain data on individual item reliability. Maier et al. (1988), also using conjoint interviews, reported moderate interrater reliability for MADRS total scores (ICC=0.66 and 0.73) in two inpatient subsamples (n = 48 and n = 130, respectively) with major depression.

Table 3

MADRS total scale and item interrater reliability

MADRS item | ICCs for MADRS items: this study (n = 50) | Spearman’s ρ correlations: this study (n = 50) (for comparison with Davidson et al.’s (1986) study) | Davidson et al. (1986) (n = 44) (Spearman ρ)

MADRS total score | 0.98 | 0.91 | 0.76

Apparent sadness | 0.92 | 0.86 | 0.69

Reported sadness | 0.94 | 0.89 | 0.57

Inner tension | 0.92 | 0.86 | 0.61

Reduced sleep | 0.86 | 0.77 | 0.60

Reduced appetite | 0.94 | 0.86 | 0.75

Concentration | 0.90 | 0.81 | 0.70

Lassitude | 0.90 | 0.79 | 0.69

Inability to feel | 0.94 | 0.72 | 0.76

Pessimistic thoughts | 0.93 | 0.87 | 0.59

Suicidal thoughts | 0.97 | 0.93 | 0.63

Comparison between the HAM-D/MADRS Interview and the study of Davidson et al. (1986).

Total score (ICC=0.98; Spearman’s ρ = 0.91) and

item reliabilities for the MADRS (ICC=0.86–0.97;

Spearman’s ρ = 0.72–0.93) were both higher using the

HMI, compared with the study of Davidson et al. (1986),

in which only low to moderate item-level agreement on

the MADRS was obtained (Spearman’s ρ = 0.57–0.76).

3.2.4. HAM-D

Using Cicchetti and Sparrow’s (1981) guidelines for

evaluating reliability coefficients, we obtained excellent

total score reliability for the HAM-D17 (ICC=0.98). At

the item level, all but one of the first 24 items had

excellent reliability (range=0.76–0.97) using Cicchetti

and Sparrow’s (1981) criteria. Item 22, helplessness,

had good reliability (ICC=0.73). These reliability coef-

ficients are as high or higher than those obtained by

others using a structured interview version of the HAM-

D. Williams (1988) reported test–retest reliabilities of

0.81 for 17-item total score, and 0.00 to 0.80 for

individual items on the SIGH-D. Whisman et al.

(1989) examined interrater reliability for a 17-item

structured interview version of the HAM-D and

obtained an ICC of 0.55 following treatment for the

total score, and ICCs ranging from 0.94 to 1.00 (medi-

an=0.64) for the 17 individual items. Miller et al.

(1985) obtained item ICCs ranging from 0.53 to 0.94

for their 25-item modified HAM-D, which also utilized

a structured interview format.
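The verbal labels used in this section follow fixed cutoffs. A minimal Python sketch of that classification is given below, assuming the commonly cited Cicchetti and Sparrow (1981) bands (below 0.40 poor, 0.40 to 0.59 fair, 0.60 to 0.74 good, 0.75 and above excellent); the function name is hypothetical.

```python
def reliability_label(coefficient):
    """Classify a reliability coefficient using the assumed cutoff bands."""
    if coefficient > 1:
        raise ValueError("a reliability coefficient cannot exceed 1")
    if coefficient < 0.40:
        return "poor"
    if coefficient < 0.60:
        return "fair"
    if coefficient < 0.75:
        return "good"
    return "excellent"


# Usage with coefficients reported in this section.
helplessness_label = reliability_label(0.73)
hamd17_total_label = reliability_label(0.98)
```

Under these bands, the helplessness item (ICC=0.73) falls in the good range and the HAM-D17 total score (ICC=0.98) in the excellent range, matching the labels used in the text.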

Table 5

Spearman’s ρ correlation coefficients for HAM-D/MADRS Interview (HMI), HAM-D, MADRS, and CGI (N = 70)

 | HAM-D17 | HAM-D31 | HAM-D6 subscale* | MADRS | HMI | CGI

HAM-D17 | - | 0.947 | 0.911 | 0.895 | 0.951 | 0.842

HAM-D31 | - | - | 0.895 | 0.871 | 0.968 | 0.810

HAM-D6 subscale* | - | - | - | 0.894 | 0.928 | 0.858

MADRS | - | - | - | - | 0.959 | 0.844

HMI | - | - | - | - | - | 0.847

* Includes the following HAM-D items: depressed mood, guilt, work and interests, psychomotor retardation, psychic anxiety, and anergia/somatic symptoms (Bech et al., 1981).

** P < 0.001 for all correlations.


The additional seven items included in the HAM-

D31 are shown in Table 4. Excellent interrater agree-

ment for the HAM-D31 total score (ICC=0.97) was

obtained. Item-level ICCs for the additional “reverse vegetative” and motor items on the HAM-D31 ranged

from 0.73 to 0.95, indicating good to excellent inter-

rater agreement (see Table 4).

3.2.5. Intercorrelations between scales

The HMI was highly correlated with all individual

scales (Spearman’s ρ = 0.951–0.968, P < 0.001 for all

correlations).

3.2.6. Concurrent validity

HMI scores were highly correlated with scores on a

global measure of depression severity, the CGI (Spearman’s

ρ = 0.847, P < 0.001). CGI severity scores were

also highly correlated with scores on all individual

depression scales (see Table 5).

4. Discussion

We have described the development of a single semi-structured

interview questionnaire designed to permit simultaneous

administration of the MADRS and three of the

most widely used versions of the HAM-D. Advantages of

the HMI for depression research include improved

interrater reliability (demonstrably so for the MADRS

items), decreased time needed to administer and rate

both scales, and enhanced ability to compare findings

with other studies that use only one of the two scales or

different HAM-D versions.

In our sample of BPI and MDD patients with a

wide range of depression severity, we demonstrated

that the HMI is a reliable and efficient method for

administering both rating scales. Total time to admin-

ister the HMI was approximately 30 min, about the

same length of time required to administer the 21-item

HAM-D alone (Hamilton, 1967). Williams (1988)

reported that it took an average of 28 min to administer

the HAM-D21 using a structured interview question-

naire (SIGH-D).

4.1. Study limitations and directions for future research

A limitation of the present study is that interrater

reliability data were obtained through joint, rather than

independent, rating interviews. Due to the design of

the larger study from which our data were obtained,

independent interviews were not possible. However,

future studies of the reliability and validity of the

HMI should be based upon interviews conducted by

independent raters. Williams (1988) used independent

raters to assess interviewer agreement on the SIGH-D

to avoid inflated reliability estimates that might occur

in joint interviews.

A second limitation of our study was that it did

not compare reliability of ratings using the HMI to

ratings using a nonstructured interview. Ideally, vali-

dation of the HMI would entail comparison of valid-

ity and reliability between a structured interview

approach and a nonstructured approach in the same

population. However, this is the first study we know

of to examine the impact of a standardized interview

on interrater reliability in the MADRS and showed

higher reliability on the MADRS at both the item and

total score levels than found in studies of the

MADRS in which a structured interview was not

utilized.

Our sample was composed primarily of patients

with BPI disorder. Only 16% of subjects had a diag-

nosis of MDD, and we did not include subjects with

bipolar II disorder. Additionally, only 8% of our sam-

ple had HAM-D17 scores in the severely depressed

range. These limitations should be addressed in future

replication studies.

Scores on the HMI subscales may have been influ-

enced by the different order and interrelation of items,

as well as the addition of new ones. This may affect

comparability with the scores of the original versions of

these rating scales.


4.2. Conclusions

The findings of the present study suggest that use

of the HMI to administer and rate the HAM-D31 and

the MADRS may increase the reliability of both de-

pression scales. Use of the HMI should also promote

more efficient administration and rating of both scales.

This is the first study to use a structured interview to

administer either the 31-item HAM-D or the MADRS,

or to document the psychometric properties of a struc-

tured interview version of either of these scales. It has

been previously demonstrated by Williams (1988) and

others that use of a structured interview improves both

total score and item reliability of the HAM-D21.

In this study, HMI scores correlated highly with

scores on the CGI, a frequently used measure of global

depression severity, indicating that the HMI is a valid

instrument for assessing depression in adult patients

with affective disorders.

Reliable administration and rating of depression, using

valid measures, is important for quantifying improve-

ments in depression severity in treatment outcome studies.

Inconsistency in ratings between interviewers can lead to

inaccurate conclusions about treatment efficacy and study

findings, as well as to exclusion or over-inclusion of

potential study subjects. The HAM-D/MADRS Interview provides

an easy-to-administer and reliable method of rating de-

pression severity which may be used to improve consis-

tency and validity of depression study findings.

Acknowledgements

The research reported was supported in part by

NIMH-R01MH60904, "Targeting Disabilities for Rehabilitation

in Bipolar Disorder", J. Jaeger, Principal Investigator;

NARSAD Independent Investigator Award, "Limits of Recovery

in Major Depression: The Role of Neurocognitive Factors in

Persistent Disability", J. Jaeger, Principal Investigator; and the

Stanley Medical Research Institute. The authors have

no conflicts of interest relevant to this article.

Appendix A

This interview questionnaire and the accompanying

rating guidelines are based on the Hamilton Rating Scale

for Depression (HAM-D), the Structured Interview Guide

for the Hamilton Depression Rating Scale (SIGH-D),

and the Montgomery–Asberg Depression Rating Scale

(MADRS). Rating instructions were kept as close to orig-

inal scale instructions as possible in order to maintain the

integrity and standardized administration of both scales.

A.1. HAM-D/MADRS Interview (HMI) Rating

Guidelines

The HMI, as with other structured interviews, was

developed to be administered by clinicians trained in

the use of symptom rating scales.

A.1.1. Interview questions

Interview questions are primarily from the HAM-D

interview. However, these questions should be used to

facilitate MADRS ratings as well. Always begin the

interview with the Overview section before moving on

to specific items. The first question for each item

should be asked exactly as written. Often this will elicit

enough information about symptom severity and fre-

quency to rate the item. Follow-up questions should be

asked when further information is necessary to rate the

item. You may also ask your own follow-up questions

to elicit the needed information. When a patient cannot

give adequate information to rate an item, other sources

(e.g., chart notes, clinical observation) should be used

to facilitate accurate rating of the item.

A.1.2. Time period

The ratings for each item should be based on the

patient’s condition in the past week (past 7 days).

However, for studies in which more frequent symptom

monitoring is desired, questions should be reworded to

reflect the actual number of days between interviews

(e.g., "in the past 3 days..."). Time intervals other than

1 week should be clearly noted on the HMI.

A.1.3. Rating of individual items

Circle the rating for each item that most accurately

describes the patient during the past week. Unlike the

HAM-D, the MADRS contains midpoint ratings that

allow the interviewer to give a rating that falls between

two defined anchors. In rating the MADRS items, the rater

should decide whether the rating lies on these defined

scale anchors (0, 2, 4, 6) or between them (1, 3, 5).
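The anchor and midpoint rule can be expressed as a small check, for example when ratings are entered electronically. This is a minimal sketch; the function name `classify_madrs_rating` is hypothetical and is not part of the HMI.

```python
MADRS_ANCHORS = (0, 2, 4, 6)    # ratings with defined anchor descriptions
MADRS_MIDPOINTS = (1, 3, 5)     # ratings that fall between two anchors


def classify_madrs_rating(rating: int) -> str:
    """Return 'anchor' or 'midpoint' for a valid MADRS item rating (0-6)."""
    if isinstance(rating, bool) or not isinstance(rating, int):
        raise ValueError("MADRS ratings must be whole numbers")
    if rating in MADRS_ANCHORS:
        return "anchor"
    if rating in MADRS_MIDPOINTS:
        return "midpoint"
    raise ValueError("MADRS ratings must be between 0 and 6")


if __name__ == "__main__":
    for value in (2, 3):
        print(value, classify_madrs_rating(value))
```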

A.1.4. Scoring

Only HAM-D items 1–17 should be used to arrive at

a total HAM-D score. A space is provided next to each

of those 17 items to record ratings. After rating all

items, sum the ratings for items 1–17 to arrive at a

HAM-D total score, and record this total in the space

provided at the end of the HMI. For the MADRS,

include all 10 items in the total score. MADRS and

HAM-D items do not necessarily appear in numerical

order, as items were grouped together based on simi-

larity of content.
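The scoring rule above amounts to two sums, sketched below under the assumption that ratings are stored as dictionaries keyed by each scale's own item number (not by the order in which items appear in the HMI). The function names and the example ratings are hypothetical.

```python
from typing import Mapping

HAMD_TOTAL_ITEMS = range(1, 18)   # only HAM-D items 1-17 enter the total
MADRS_ITEMS = range(1, 11)        # all 10 MADRS items enter the total


def hamd_total(ratings: Mapping[int, int]) -> int:
    """Sum HAM-D items 1-17; any additional rated items are ignored."""
    missing = [i for i in HAMD_TOTAL_ITEMS if i not in ratings]
    if missing:
        raise ValueError(f"missing HAM-D items: {missing}")
    # Per-item maxima differ across HAM-D items, so they are not checked here.
    return sum(ratings[i] for i in HAMD_TOTAL_ITEMS)


def madrs_total(ratings: Mapping[int, int]) -> int:
    """Sum all 10 MADRS items, each rated from 0 to 6."""
    missing = [i for i in MADRS_ITEMS if i not in ratings]
    if missing:
        raise ValueError(f"missing MADRS items: {missing}")
    if any(not 0 <= ratings[i] <= 6 for i in MADRS_ITEMS):
        raise ValueError("MADRS item ratings must be between 0 and 6")
    return sum(ratings[i] for i in MADRS_ITEMS)


if __name__ == "__main__":
    # Hypothetical ratings: every HAM-D item rated 1, every MADRS item rated 2.
    hamd = {item: 1 for item in range(1, 32)}
    madrs = {item: 2 for item in range(1, 11)}
    print("HAM-D total:", hamd_total(hamd))
    print("MADRS total:", madrs_total(madrs))
```

Keying by item number keeps the totals independent of the content-based grouping of items in the interview.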

A.2. HAM-D/MADRS Interview (HMI)

OVERVIEW: I’d like to ask you some questions about the past week. How have you been feeling since last (DAY OF THE WEEK)? IF OUTPATIENT:

Have you been working? IF NOT: Why not?

R.W. Iannuzzo et al. / Psychiatry Research 145 (2006) 21–37


References

Altman, E.G., Hedeker, D.R., Janicak, P.G., Peterson, J.L., Davis,

J.M., 1994. The Clinician-Administered Rating Scale for Mania

(CARS-M): development, reliability, and validity. Biological Psy-

chiatry 36, 124–134.

Asberg, S.A., Montgomery, S., Perris, C., Schalling, D., Sedvall, G.,

1978. A comprehensive psychopathological rating scale. Acta

Psychiatrica Scandinavica. Supplementum, 272.

Bech, P., Rafaelsen, O.J., 1980. The use of rating scales exemplified

by a comparison of the Hamilton and the Bech–Rafaelsen mel-

ancholia scale. Acta Psychiatrica Scandinavica 62 (suppl 285),

128–131.

Bech, P., Allerup, P., Gram, L.F., Reisby, N., Rosenberg, R., Jacobsen,

O., Nagy, A., 1981. The Hamilton depression scale: evaluation of

objectivity using logistic models. Acta Psychiatrica Scandinavica

63, 290–299.

Benazzi, F., 2001. Factor analysis of the Montgomery Asberg depres-

sion rating scale in 251 bipolar II and 306 unipolar depressed

outpatients. Progress in Neuro-Psychopharmacology & Biological

Psychiatry 25, 1369–1376.

Calabrese, J.R., Bowden, C.L., McElroy, S.L., Cookson, J., Andersen,

J., Keck Jr., P.E., Rhodes, L., Bolden-Watson, C., Zhou, J.,

Ascher, J.A., 1999. Spectrum of activity of lamotrigine in treat-

ment–refractory bipolar disorder. American Journal of Psychiatry

156, 1019–1023.

Cicchetti, D.V., Sparrow, S.S., 1981. Developing criteria for establish-

ing the interrater reliability of specific items in an inventory:

applications for the assessment of adaptive behaviors. American

Journal of Mental Deficiency 86, 127.

Davidson, J., Turnbull, C.D., Strickland, R., Miller, R., Graves, K.,

1986. The Montgomery–Asberg depression scale: reliability and

validity. Acta Psychiatrica Scandinavica 73, 544–548.

Fava, M., Thase, M.E., DeBattista, C., 2005. A multicenter, pla-

cebo-controlled study of modafinil augmentation in partial

responders to selective serotonin reuptake inhibitors with per-

sistent fatigue and sleepiness. Journal of Clinical Psychiatry 66,

85–93.

First, M.B., Spitzer, R.L., Gibbon, M., Williams, J.B.W., 1998. Struc-

tured Clinical Interview for DSM-IV Axis I Disorders – Patient

Edition (SCID–I/P), version 2.0. Biometrics Research, New York

State Psychiatric Institute, New York.

Galinowski, A., Lehert, P., 1995. Structural validity of MADRS

during antidepressant treatment. International Clinical Psycho-

pharmacology 10, 157–161.

Grundy, C.T., Lunnen, K.M., Lambert, M.J., Ashton, J.E., Tovey, D.R.,

1994. The Hamilton Rating Scale for Depression: one scale or

many? Clinical Psychology: Science and Practice 1 (2), 197–205.

Guy, W. (Ed.), 1976. ECDEU Assessment Manual for Psychophar-

macology, Publication No. ADM 76-336. US Department of

Health, Education, and Welfare, Rockville, MD.

Hamilton, M., 1960. A rating scale for depression. Journal of Neu-

rology, Neurosurgery and Psychiatry 23, 56–62.

Hamilton, M., 1967. Development of a rating scale for primary

depressive illness. British Journal of Social and Clinical Psychol-

ogy 6, 278–296.

Hawley, C.J., Gale, T.M., Smith, V.R.H., Sen, P., 1998. Depres-

sion rating scales can be related to each other by simple

equations. International Journal of Psychiatry in Clinical Practice

2, 215–219.

Jamerson, B.D., Krishnan, K.R.R., Roberts, J., Krishen, A., Modell,

J.G., 2003. Effect of bupropion SR on specific symptom clusters

of depression: analysis of the 31-item Hamilton Rating Scale for

Depression. Psychopharmacology Bulletin 37 (2), 67–78.

Kørner, A., Nielsen, B.M., Eschen, F., Møller-Madsen, S., Stender,

A., Christensen, E.M., Aggernaes, H., Kastrup, M., Larsen, J.K.,

1990. Quantifying depressive symptomatology: inter-rater reli-

ability and inter-item correlations. Journal of Affective Disorders

20 (2), 143–149.

Maier, W., Philipp, M., Heuser, I., Schlegel, S., Buller, R., Wetzel, H.,

1988. Improving depression severity assessment: I. Reliability,

internal validity and sensitivity to change of three observer de-

pression scales. Journal of Psychiatric Research 22 (1), 3–12.

Miller, I.W., Bishop, S., Norman, W.H., Maddever, H., 1985. The

Modified Hamilton Rating Scale for Depression: reliability and

validity. Psychiatry Research 14, 131–142.

Montgomery, S.A., Asberg, M., 1979. A new depression scale

designed to be sensitive to change. British Journal of Psychiatry

134, 382–389.

Mulder, R.T., Joyce, P.R., Frampton, C., 2003. Relationships among

measures of treatment outcome in depressed patients. Journal of

Affective Disorders 76 (1–3), 127–135.

Nierenberg, A.A., Papakostas, G.I., Petersen, T., Montoya, H.D.,

Worthington, J.J., Tedlow, J., Alpert, J.E., Fava, M., 2003. Lithium

augmentation of nortriptyline for subjects resistant to multiple

antidepressants. Journal of Clinical Psychopharmacology 23,

92–95.

O’Sullivan, R.L., Fava, M., Agustin, C., Baer, L., Rosenbaum, J.F.,

1997. Sensitivity of the six-item Hamilton Depression Rating

Scale. Acta Psychiatrica Scandinavica 95, 379–384.

Potts, M.K., Daniels, M., Burnam, M.A., Wells, K.B., 1990. A

structured interview version of the Hamilton Depression Rating

Scale: evidence of reliability and versatility of administration.

Journal of Psychiatric Research 24 (4), 335–350.

Riskind, J.H., Beck, A.T., Brown, G., Steer, R.A., 1987. Taking the

measure of anxiety and depression: validity of the reconstructed

Hamilton scales. Journal of Nervous and Mental Disease 175,

474–479.

Rocca, P., Fonzo, V., Ravizza, L., Rocca, G., Scotta, M., Zanalda, E.,

Bogetto, F., 2002. A comparison of paroxetine and amisulpride in

the treatment of dysthymic disorder. Journal of Affective Disor-

ders 70, 313–317.

Senra, C., 1996. Evaluation and monitoring of symptom severity and

change in depressed outpatients. Journal of Clinical Psychology

52 (3), 317–324.

Serretti, A., Jori, M.C., Casadei, G., Ravizza, L., Smeraldi, E., Akis-

kal, H., 1999. Delineating psychopathologic clusters within dys-

thymia: a study of 512 outpatients without major depression.

Journal of Affective Disorders 56, 17–25.

Whisman, M.A., Strosahl, K., Fruzzetti, A.E., Schmaling, K.B.,

Jacobson, N.S., Miller, D.M., 1989. A structured interview ver-

sion of the Hamilton Rating Scale for Depression: reliability and

validity. Psychological Assessment: A Journal of Consulting and

Clinical Psychology 1 (3), 238–241.

Williams, J.B.W., 1988. A structured interview guide for the Hamilton

Depression Rating Scale. Archives of General Psychiatry 45,

742–747.

Young, R.C., Biggs, J.T., Ziegler, V.E., Meyer, D.A., 1978. A rating

scale for mania: reliability, validity, and sensitivity. British Journal

of Psychiatry 133, 429–435.

Zimmerman, M., Posternak, M.A., Chelminski, I., 2005. Is it time to

replace the Hamilton Depression Rating Scale as the primary

outcome measure in treatment studies of depression? Journal of

Clinical Psychopharmacology 25, 105–110.
