
Page 1

Evaluation of Objective Features for Classification of Clinical Depression in Speech by Genetic Programming

Juan Torres 1, Ashraf Saad 2, Elliot Moore 1

1 School of Electrical and Computer Engineering
Georgia Institute of Technology
Savannah, GA 31407, USA
[email protected], [email protected]

2 Computer Science Department, School of Computing
Armstrong Atlantic State University
Savannah, GA 31419, USA
[email protected]

Page 2

Clinical Depression Classification

- Goal: detect clinical depression by analyzing a patient's speech.
- Binary decision classification problem.
- Large number of features in the dataset. Feature selection is necessary for:
  - Designing a robust classifier
  - Identifying a small set of useful features, which may in turn provide physiological insight

Page 3

Speech Database

- 15 patients (6 male, 9 female)
- 18 control subjects (9 male, 9 female)
- Corpus: 65-sentence short story
- Observation groupings:
  - G1: 13 observations/speaker (5 sentences each)
  - G2: 5 observations/speaker (13 sentences each)

Page 4

Speech Features

- Prosodics
- Vocal tract resonant frequencies (formants)
- Glottal waveform
- Teager FM

Page 5

Speech Features (cont.)

- Raw features extracted frame by frame (25-30 ms) and grouped into 10 categories:
  - Teager FM (TFM)
  - Glottal Timing (GLT)
  - Formant Bandwidths (FBW)
  - Speaking Rate (SPR)
  - Formant Locations (FMT)
  - Energy Deviation Statistics (EDS), where EDS = STD(DFS(Ev))
  - Glottal Spectrum (GLS)
  - Energy Median Statistics (EMS), where EMS = MED(DFS(Ev))
  - Glottal Ratios (GLR)
  - Pitch (PCH)

Page 6

Statistics

- Sentence-level statistics were computed for each raw feature → Direct Feature Statistics (DFS)
- The same set of statistics, applied to the DFS over each entire observation → Observation-Level Statistics

Statistic                  Equation
Average (AVG)              (1/N) * Sum{x_i}
Median (MED)               50th percentile
Standard Deviation (STD)   sqrt( (1/(N-1)) * Sum{(x_i - Mean(x))^2} )
Minimum (MIN)              5th percentile
Maximum (MAX)              95th percentile
Range (RNG)                MAX - MIN
Dynamic Range (DRNG)       log10(MAX) - log10(MIN)
Interquartile Range (IQR)  75th percentile - 25th percentile
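These statistics map directly onto numpy. A minimal sketch (the function name is mine, not the deck's), assuming a 1-D array of per-frame or per-sentence feature values; DRNG assumes the feature is positive, as it is for quantities like energy and pitch:

```python
import numpy as np

def feature_statistics(x: np.ndarray) -> dict:
    """AVG, MED, STD, MIN, MAX, RNG, DRNG, IQR as defined in the table."""
    mn, mx = np.percentile(x, 5), np.percentile(x, 95)   # robust min/max
    p25, p75 = np.percentile(x, 25), np.percentile(x, 75)
    return {
        'AVG': np.mean(x),
        'MED': np.median(x),
        'STD': np.std(x, ddof=1),      # unbiased (N-1) form from the table
        'MIN': mn,
        'MAX': mx,
        'RNG': mx - mn,
        'DRNG': np.log10(mx) - np.log10(mn),
        'IQR': p75 - p25,
    }
```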

Page 7

Final Feature Sets

- Result: 2000+ distinct features (OFS)
- Statistical significance tests (ANOVA) were used to initially prune the feature set (a sketch follows the table).
- Final size: 298-1246 features → a large feature-selection problem.

Experiment   Observations   Features
MG1          195            724
MG2          75             298
FG1          234            1246
FG2          90             857
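A sketch of the ANOVA pruning step, assuming features are kept when a one-way ANOVA across the two classes is significant; the 0.05 cutoff and the function name are my assumptions, since the deck does not state them:

```python
import numpy as np
from scipy.stats import f_oneway

def anova_prune(X: np.ndarray, y: np.ndarray, alpha: float = 0.05) -> list:
    """Return indices of features whose class means differ significantly."""
    keep = []
    for j in range(X.shape[1]):
        groups = [X[y == c, j] for c in np.unique(y)]
        _, p = f_oneway(*groups)       # one-way ANOVA over the class groups
        if p < alpha:
            keep.append(j)
    return keep
```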

Page 8

Feature Selection

- Goal: select a (small) group of features that maximizes classifier performance
- Approaches:
  - Filter: optimize a computationally inexpensive fitness function
  - Wrapper: fitness function = classification performance (see the sketch below)
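A minimal sketch of a wrapper-style fitness function: the fitness of a candidate feature subset is its leave-one-out classification accuracy (the evaluation scheme the deck uses later). The nearest-class-mean classifier here is purely illustrative; the actual wrapper in this work scores the evolved GP tree itself:

```python
import numpy as np

def loo_accuracy(X: np.ndarray, y: np.ndarray, subset: list) -> float:
    """Leave-one-out accuracy using only the feature columns in `subset`."""
    Xs, correct = X[:, subset], 0
    for i in range(len(y)):
        mask = np.arange(len(y)) != i          # hold out observation i
        means = {c: Xs[mask & (y == c)].mean(axis=0)
                 for c in np.unique(y[mask])}  # per-class mean vectors
        pred = min(means, key=lambda c: np.linalg.norm(Xs[i] - means[c]))
        correct += (pred == y[i])
    return correct / len(y)
```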

Page 9

Genetic Programming for Classification and FS (GPFS)

- Estimate the optimal feature set and the classifier simultaneously → "online approach" (Muni, Pal, Das 2006)
- Advantages:
  - Evolutionary search explores a (potentially) large portion of the feature space
  - The resulting classifier is a simple algebraic expression (easy to read and interpret)
  - Stochastic: multiple runs should yield different solutions, so given a large number of runs, a feature's frequency of selection can be regarded as an approximate measure of its fitness

Page 10

Genetic Programming

- Classifier consists of expression trees
- Binary decision → a single tree T
  - Class assigned by the algebraic sign of the tree's output (T > 0 → c1, T < 0 → c2), as in the sketch below
- Internal nodes: { +, -, ×, / (protected) }
- External nodes: { features, rnd_dbl(0-10) }
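A minimal sketch of this kind of expression-tree classifier. The deck used lil-gp rather than giving code, so the Node/evaluate/classify names and the example tree are hypothetical; the sign rule and protected division follow the bullets above:

```python
from dataclasses import dataclass
from typing import Optional, Sequence

@dataclass
class Node:
    op: str                      # '+', '-', '*', '/', 'feat', or 'const'
    left: Optional['Node'] = None
    right: Optional['Node'] = None
    index: int = 0               # feature index when op == 'feat'
    value: float = 0.0           # constant when op == 'const'

def evaluate(node: Node, x: Sequence[float]) -> float:
    """Recursively evaluate an expression tree on one observation x."""
    if node.op == 'feat':
        return x[node.index]
    if node.op == 'const':
        return node.value
    a, b = evaluate(node.left, x), evaluate(node.right, x)
    if node.op == '+': return a + b
    if node.op == '-': return a - b
    if node.op == '*': return a * b
    # Protected division: fall back to 1.0 when the denominator is near zero
    return a / b if abs(b) > 1e-12 else 1.0

def classify(tree: Node, x: Sequence[float]) -> int:
    """Class 1 if the tree evaluates positive, class 2 otherwise."""
    return 1 if evaluate(tree, x) > 0 else 2

# Example: the tree (x0 - 3.7) / x2 is a one-line, readable decision rule.
tree = Node('/', Node('-', Node('feat', index=0), Node('const', value=3.7)),
            Node('feat', index=2))
print(classify(tree, [5.0, 0.0, 2.0]))   # -> 1
```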

Page 11

Genetic Programming (Cont.)

- A large population of classifier trees is evolved over several generations.
- Population initialization:
  - Random trees (height 2-6), ramped half-and-half method
- Fitness function = classification performance
- Evolutionary operators (the selection schemes are sketched below):
  - Reproduction (fitness-proportional selection)
  - Mutation (random selection)
  - Crossover (tournament selection)
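A sketch of the three selection schemes named above, assuming a population of trees with a parallel list of fitness values; the function names are mine, not from the deck:

```python
import random

def tournament_select(pop: list, fits: list, k: int = 10):
    """Pick k individuals at random; return the fittest (used for crossover)."""
    contenders = random.sample(range(len(pop)), k)   # requires k <= len(pop)
    return pop[max(contenders, key=lambda i: fits[i])]

def fitness_proportional_select(pop: list, fits: list):
    """Roulette-wheel choice, probability proportional to fitness (reproduction)."""
    return random.choices(pop, weights=fits, k=1)[0]

def random_select(pop: list):
    """Uniform random choice (mutation)."""
    return random.choice(pop)
```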

Page 12

Evolutionary Rules for Simultaneous Feature Selection

- Initial tree generation:
  - The probability of selecting a feature set decreases linearly with feature set size.
- Fitness:
  - Biased toward trees that use few features.
- Crossover:
  - Homogeneous: only between parents with the same feature set
  - Heterogeneous: biased toward selecting parents with similar feature sets

Page 13

Dynamic Parameters

- The fitness bias toward smaller subsets decreases with generations.
- The probability of heterogeneous crossover decreases with generations.
- Motivation: explore the feature space during the first few generations, then gradually concentrate on improving classification performance with the current feature sets. (A linear decay of this kind is sketched below.)
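A plausible schedule for these dynamic parameters: both the size bias and the heterogeneous-crossover probability decay linearly over the run. The deck does not give the exact schedule, so the constants and function names here are illustrative only:

```python
def linear_decay(gen: int, n_gens: int, start: float, end: float) -> float:
    """Linearly interpolate from start (generation 0) to end (last generation)."""
    frac = gen / max(n_gens - 1, 1)
    return start + (end - start) * frac

def size_biased_fitness(accuracy: float, n_feats: int, n_total: int,
                        gen: int, n_gens: int) -> float:
    """Classification accuracy minus a subset-size penalty that fades with generations."""
    bias = linear_decay(gen, n_gens, start=0.2, end=0.0)   # hypothetical weights
    return accuracy - bias * (n_feats / n_total)

def p_heterogeneous(gen: int, n_gens: int) -> float:
    """Probability of heterogeneous crossover: high early (exploration), low late."""
    return linear_decay(gen, n_gens, start=0.8, end=0.1)   # hypothetical endpoints
```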

Page 14

GP Parameters

Parameter                                             Value
Population size                                       3000 for G1 / 2000 for G2
Maximum height of a tree                              12
Maximum allowed nodes in a tree                       350
Initial height of trees                               2-6
Number of generations                                 30 for G1 / 20 for G2
Tournament size                                       10
Prob. of selecting int./ext. node during mutation     0.7 / 0.3
Prob. of selecting int./ext. node during crossover    0.8 / 0.2
Mutation probability                                  0.15
Reproduction probability                              0.05
Crossover probability                                 0.80

Page 15

GP Results

Classification performance, averaged over 10 runs of leave-one-out cross-validation:

                          Mean   Female G1   Female G2   Male G1   Male G2
Classification Accuracy   77.4   82.2        84.9        71.3      71.2
Sensitivity               80.9   82.7        85.4        74.7      80.9
Specificity               75.0   81.8        84.4        69.1      64.8
Feature Set Size          16.0   14.2        16.1        15.3      18.5

Page 16

Feature Selection Histograms

Page 17

“Best” Features -- Males

Male - G1:
- GLT: Max((CP)MIN)
- GLT: DRng((CP)IQR)
- GLS: Med((gSt1000)MAX)
- GLT: Std((OP)IQR)
- GLR: Rng((rCPO)IQR)
- GLS: Avg((gSt1000)MAX)
- EDS: Avg(AVG)
- EDS: Avg(MED)
- EDS: Med(MED)
- GLT: Med((CP)MIN)

Male - G2:
- GLT: Max((CP)MIN)
- PCH: Med(A1)
- EDS: Avg(MED)
- GLT: IQR((CP)IQR)
- GLR: Min((rOPO)IQR)
- EDS: Avg(AVG)
- GLR: Med((rCPOP)MIN)
- GLT: Std((CP)MIN)
- GLR: Max((rCPOP)MIN)
- GLS: Avg((gSt1000)MAX)

Page 18

“Best” Features -- Females

Female - G1:
- EMS: Med(MR)
- EMS: Med(STD_1)
- EMS: Max(MR)
- EMS: Med(RNG)
- EMS: Max(STD_1)
- EMS: Med(AVG)
- EMS: Max(MAX)
- EMS: Avg(STD_1)
- EMS: Avg(MED)
- EMS: Avg(AVG)

Female - G2:
- EMS: IQR(AVG_1)
- EMS: Med(STD_1)
- PCH: IQR(IQR)
- EMS: Med(STD)
- EMS: Max(MR)
- TFM: Avg(MAX(IQR))
- EMS: Med(MR)
- FBW: Med((bwF3)IQR)
- EMS: Med(MAX)
- EMS: Med(RNG)

Page 19

GP Results (Cont.)

- GP results were not as good as hoped. However, the fact that certain features were selected in the final solutions more frequently than others can be regarded as a measure of their usefulness.
- To test this hypothesis, we train Bayesian classifiers using the 16 features most frequently selected by GP.

Page 20

Naive Bayesian Classification

- Assign the class Cj with the highest probability given the observation X (features).
- This posterior can be estimated using Bayes' rule:

    P(Cj | X) = p(X | Cj) P(Cj) / p(X)

- Under the naive assumption, the class-conditional distributions factor into per-feature terms (a Gaussian instance is sketched below):

    p(X | Cj) = ∏_i p(x_i | Cj)
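A minimal sketch of this decision rule under the 1-D Gaussian assumption (one of the PDF-estimation methods on the next page). Function names are mine; priors P(Cj) are taken from class counts, and the per-feature means and unbiased variances come from the training data, as the deck describes:

```python
import numpy as np

def fit_gaussian_nb(X: np.ndarray, y: np.ndarray) -> dict:
    """Per-class, per-feature sample means, unbiased variances, and priors."""
    params = {}
    for c in np.unique(y):
        Xc = X[y == c]
        params[c] = (Xc.mean(axis=0), Xc.var(axis=0, ddof=1), len(Xc) / len(X))
    return params

def predict(params: dict, x: np.ndarray):
    """argmax_j  log P(Cj) + sum_i log p(x_i | Cj)  (log-domain Bayes rule)."""
    best, best_score = None, -np.inf
    for c, (mu, var, prior) in params.items():
        log_lik = -0.5 * np.sum(np.log(2 * np.pi * var) + (x - mu) ** 2 / var)
        score = np.log(prior) + log_lik
        if score > best_score:
            best, best_score = c, score
    return best
```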

Page 21

PDF Estimation Methods

- Uniform Bins (sketched after this list)
  - A histogram with N uniformly spaced intervals (bins) is computed for each feature and each class using the training data.
  - The optimum value of N was found by exhaustive search.
- Optimal Threshold
  - Similar to uniform bins with N = 2, but the cutoff threshold between the two bins is chosen separately for each feature.
- (Naive) Gaussian Assumption
  - The PDF of each feature and each class is modeled as a 1-D Gaussian density function whose mean and variance are taken as the sample mean and (unbiased) variance of the training data.
- Gaussian Mixtures
  - Each likelihood function p(X | Cj) is modeled as a weighted sum of multivariate Gaussian densities.
  - The expectation-maximization (EM) algorithm is used to estimate the means, covariance matrices, and weights. We use diagonal covariance matrices and limit the number of mixtures to 3 for the G1 experiments and 2 for the G2 experiments in order to reduce the number of parameters to be estimated.
- Multivariate Gaussian
  - Each (class-conditional) likelihood function is modeled as a single multivariate Gaussian PDF with a full covariance matrix. Like the GMM, this method does not follow the naive assumption.
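A sketch of the Uniform Bins estimator for a single feature: one N-bin histogram per class, normalized into a discrete PDF. The shared bin edges and Laplace smoothing are my assumptions rather than stated details of the deck:

```python
import numpy as np

def fit_uniform_bins(x_train: np.ndarray, y_train: np.ndarray, n_bins: int):
    """Return shared bin edges plus one normalized histogram per class."""
    edges = np.linspace(x_train.min(), x_train.max(), n_bins + 1)
    pdfs = {}
    for c in np.unique(y_train):
        counts, _ = np.histogram(x_train[y_train == c], bins=edges)
        pdfs[c] = (counts + 1) / (counts.sum() + n_bins)  # Laplace smoothing
    return edges, pdfs

def bin_likelihood(x: float, edges: np.ndarray, pdf: np.ndarray) -> float:
    """p(x | C) read off the class histogram, clipping to the edge bins."""
    i = np.clip(np.searchsorted(edges, x) - 1, 0, len(pdf) - 1)
    return pdf[i]
```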

Page 22

Results

Male G1:
Method            Acc    Sen    Spec
Unif Bin (N = 8)  86.7   83.3   88.9
Opt Thresh        82.6   82.1   82.9
Gaussian          87.2   88.5   86.3
GMM               88.7   87.2   89.7
MVG               84.1   83.3   84.6

Female G1:
Method            Acc    Sen    Spec
Unif Bin (N = 9)  88.0   85.5   90.6
Opt Thresh        78.6   65.8   91.5
Gaussian          87.2   91.5   82.9
GMM               87.6   88.0   87.7
MVG               85.5   83.8   87.2

Male G2:
Method            Acc    Sen    Spec
Unif Bin (N = 2)  90.7   93.3   88.9
Opt Thresh        73.3   50.0   88.9
Gaussian          89.3   93.3   86.7
GMM               90.7   90.0   91.1
MVG               86.7   80.0   91.1

Female G2:
Method            Acc    Sen    Spec
Unif Bin (N = 5)  93.3   93.3   93.3
Opt Thresh        86.7   75.6   97.8
Gaussian          91.1   95.6   86.7
GMM               88.0   83.3   91.1
MVG               92.2   86.7   97.8

Average improvement over the GP results: 18.5% (Males), 7.1% (Females)

Page 23

Conclusion

- GPFS was successful in finding small sets of highly discriminating features.
- Need to measure selection frequency not just for single features, but for groups of features.
- GPFS may be performing feature selection too quickly; it may be beneficial to encourage more exploration of the feature space.

Page 24

References

1. E. Moore, M. Clements, J. Peifer, and L. Weisser, Comparing objective feature statistics of speech for classifying clinical depression. In Proceedings of the 26th Annual Conf. on Eng. in Medicine and Biology, pages 17-20, San Francisco, CA, 2004.

2. E. Moore, M. Clements, J. Peifer, and L. Weisser, Analysis of prosodic variation in speech for clinical depression. In Proceedings of the 25th Annual Conf. on Eng. in Medicine and Biology, pages 2849-2852, Cancun, Mexico, 2003.

3. M. Dash, H. Liu, Feature selection for classification. Intelligent Data Analysis, 1(3):131-156, 1997.

4. D. Muni, N. Pal, and J. Das, A novel approach to design classifiers using genetic programming. IEEE Transactions on Evolutionary Computation, 8(2):183-196, 2004.

5. D. Muni, N. Pal, and J. Das, Genetic programming for simultaneous feature selection and classifier design. IEEE Transactions on Systems, Man and Cybernetics, Part B, 36(1):106-117, 2006.

6. D. Zongker and W. Punch, lil-gp 1.01 User's Manual. Genetic Algorithms Research and Applications Group, Michigan State University, East Lansing, MI, 1998. http://garage.cse.msu.edu/software/lil-gp/index.html.

7. T.F. Quatieri, Discrete-Time Speech Signal Processing, Prentice Hall, Upper Saddle River, NJ, 2001.

Page 25

References (cont.)

8. G. Zhou, J. Hansen, J. Kaiser, Nonlinear feature based classification of speech under stress. IEEE Transactions on Speech and Audio Processing, 9(3):201-216, 2001.

9. J.R. Koza, Genetic Programming: On the Programming of Computers by Means of Natural Selection. MIT Press, Cambridge, MA, 1992.

10. R. Duda, P. Hart, and D. Stork, Pattern Classification. Wiley, New York, 2001.

11. C. Elkan, Naive Bayesian learning. Adapted from Technical Report No. CS97-557, Dept. of Computer Science and Engineering, University of California, San Diego, CA, 1997.

12. Y. Yang and G. Webb, On why discretization works for naïve-Bayes classifiers. In Proceedings of the 16th Australian Joint Conference on Artificial Intelligence (AI), pages 440-452, Perth, Australia, 2003.

13. M. Wiggins, A. Saad, B. Litt, and G. Vachtsevanos, Genetic Algorithm-Evolved Bayesian Network Classifier for Medical Applications. In Proceedings of the Tenth World Soft Computing Conference, 2005.

14. S. Theodoridis, and K. Koutroumbas, Pattern Recognition. Academic Press, San Diego, CA, 1999.

15. H.B. Amor and A. Rettinger, Intelligent exploration for genetic algorithms: using self-organizing maps in evolutionary computation. In Proceedings of the 2005 Conference on Genetic and Evolutionary Computation (GECCO), pages 1531-1538, Washington, D.C., 2005.