Validation of Clustering Methods for PET Data

Prasanna K Velamuru

Advisors:
Dr. Rosemary A. Renaut
Dr. Hongbin Guo
Dr. Huan Liu
Outline
• Motivation
• Clustering Algorithms Validated
• Validation Metrics
• Results
• Significance of Results
• Conclusions and Future Work
PET – Quick Overview
• Diagnostic imaging technique
• Provides functional information
• Gives knowledge of the biochemical basis of both normal and abnormal function
• Integral part of clinical care
  - Oncology, Cardiology, Neurology/Psychiatry
Domain of PET Data Used in This Study
• Measurement of biochemical and physiological parameters from dynamic human brain PET data
• Three-compartment FDG tracer kinetic model
Dynamic Human Brain PET Data
• Siemens 951/31 scanner
• Output image matrix: 128 x 128 x 31 x 21
• 4-dimensional: 3 in space (X, Y, Z), 1 in time
[Figure: sample frames at T = 0.1 min, 0.55 min, 5.75 min, 45 min]
Characteristics of Dynamic PET Data
• Output data is very noisy
  - Low signal-to-noise ratio (SNR)
• Data obtained at irregular time intervals
  - Initial time intervals are very short: to observe the gradient in the output
  - The last time interval is long
    - Output changes little
    - Useful for determining the long-term decay term in the output
    - Contributes significantly to the estimation of the kinetic rates
Time Activity Curves (TACs)
• Vector: [X1, X2, …, X20, X21]
• Defined for each voxel/pixel
• Total for each slice: 128 x 128 = 16,384 TACs
• For the entire brain volume: 128 x 128 x 31 = 507,904 TACs
Integrals
• Obtained by multiplying each individual TAC by the vector of time intervals:
  [X1, X2, …, X20, X21] · [∆t1, ∆t2, …, ∆t20, ∆t21]^T,
  with frame durations [∆t1, …, ∆t21] = [0.0333, 0.0333, …, 0.5000, …, 1.5000, …, 10.000, 30.000]
• Yields a single scalar value
• Approximate estimate
• Reduces the dimension of the data
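The reduction above is just a dot product of each TAC with the frame-duration vector. A minimal NumPy sketch (not the study's code; six illustrative durations stand in for the real scan's 21 frames):

```python
import numpy as np

# Illustrative frame durations patterned on the slide (real scan: 21 frames)
dt = np.array([0.0333, 0.0333, 0.5, 1.5, 10.0, 30.0])

# One TAC per row: toy data with constant activity 1.0 in every frame
tacs = np.ones((4, dt.size))

# Weighted sum over time collapses each TAC to a single scalar integral
integrals = tacs @ dt

print(integrals)  # every voxel reduces to sum(dt)
```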
Clustering: In the Context of PET Data
• Important preprocessing step performed prior to parametric estimation from dynamic PET data
• A form of segmentation
• Provides better information
• Partially overcomes the effect of noise in the data
• Improves the accuracy of voxel-level quantification in PET images
Motivation
• New preprocessing clustering techniques designed and published by Guo et al.
  - Significantly reduce the overall time for clustering
• Basic initial validation done in the past
  - Not comprehensively compared against and validated with classical clustering methods
Preliminary Concepts: Distance Measures
• Measurement of (dis)similarity between multivariate vectors / integrals
• Weighted Minkowski p-norm:
  d_p(x, y) = ( Σ_{l=1}^{n} |(x_l − y_l) w_l|^p )^{1/p}
• Distances weighted by the time-duration intervals [∆t1, ∆t2, …, ∆t20, ∆t21]
• Weighted Euclidean ("timeL2") and weighted Manhattan ("timeL1")
• Found to be good distance measures for PET
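Spelled out in code, the two special cases of the weighted Minkowski norm look like this (an illustrative sketch; the function names are mine, with the frame durations used as weights):

```python
import numpy as np

def minkowski_weighted(x, y, w, p):
    """Weighted Minkowski p-norm distance between two TACs x and y."""
    return np.sum(np.abs((x - y) * w) ** p) ** (1.0 / p)

def time_l1(x, y, dt):
    """Weighted Manhattan ("timeL1"): p = 1, weights = frame durations."""
    return minkowski_weighted(x, y, dt, p=1)

def time_l2(x, y, dt):
    """Weighted Euclidean ("timeL2"): p = 2, weights = frame durations."""
    return minkowski_weighted(x, y, dt, p=2)

dt = np.array([0.5, 1.0, 2.0])   # toy frame durations
x = np.array([1.0, 2.0, 3.0])
y = np.array([1.0, 1.0, 5.0])
print(time_l1(x, y, dt))  # |0*0.5| + |1*1.0| + |2*2.0| = 5.0
print(time_l2(x, y, dt))  # sqrt(0 + 1^2 + 4^2) = sqrt(17)
```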
Preliminary Concepts: Histogram-Based Thresholding
Properties of the last frame:
• Provides more accurate image reconstruction
• Higher SNR compared to the short-time-interval frames
• FDG accumulation is higher
• The distribution of voxel intensities correlates more closely with the spatial variation of tracer in tissue
• Supports thresholding to identify active voxels
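One way last-frame thresholding might be sketched (the quantile rule below is a placeholder assumption of mine, not necessarily the histogram rule used in the study):

```python
import numpy as np

def active_voxel_mask(last_frame, quantile=0.6):
    """Mark voxels whose last-frame intensity exceeds a threshold.
    The quantile-based threshold here is illustrative only."""
    threshold = np.quantile(last_frame, quantile)
    return last_frame > threshold

rng = np.random.default_rng(0)
last_frame = rng.random((8, 8))      # toy 8x8 final-frame image
mask = active_voxel_mask(last_frame)
print(mask.sum(), "active voxels out of", mask.size)
```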
Preliminary Concepts: Preclustering
• Initial thresholding
  - Does not provide enough information to cluster the data
  - Features over the entire set of TACs should be included
• Compute a mean TAC
  - From the set of TACs containing the voxels with the highest frequency
• The mean TAC is used to find an initial cluster
• The search is performed over all voxels
  - Distances are measured with respect to the entire TAC
Preclustering: Illustration
[Figure: precluster example; frequency = 5934, density ≈ 0.03]
Clustering Algorithms Used for Validation: Hierarchical Clustering
• Agglomerative (bottom-up)
• Algorithm:
  - Initialize: each item as a cluster
  - Iterate:
    - select the two most similar clusters
    - merge them
  - Halt: when the required number of clusters is reached
[Figure: dendrogram over items 1-5, from the 5-clustering up to the 1-clustering]
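The iterate-and-merge loop above can be sketched as a naive O(n^3) agglomerative procedure (illustrative only; practical implementations use smarter data structures):

```python
import numpy as np

def agglomerate(points, k, linkage):
    """Merge the two most similar clusters until only k remain.
    `linkage(a, b)` returns the dissimilarity of two clusters,
    each given as an array of rows from `points`."""
    clusters = [[i] for i in range(len(points))]  # each item starts alone
    while len(clusters) > k:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = linkage(points[clusters[i]], points[clusters[j]])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]   # merge j into i
        del clusters[j]
    return clusters

# Centroid distance as one possible linkage rule
centroid = lambda a, b: np.linalg.norm(a.mean(axis=0) - b.mean(axis=0))

pts = np.array([[0.0], [0.1], [5.0], [5.1]])
clusters = agglomerate(pts, 2, centroid)
print(clusters)  # the two nearby pairs end up together
```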
Clustering Algorithms Used for Validation: HAL
• Average linkage
  - Cluster similarity = average similarity of all pairs
Clustering Algorithms Used for Validation: HCL
• Centroid linkage
  - Cluster similarity = similarity between centroids
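The two linkage rules differ only in how cluster-to-cluster dissimilarity is defined; a minimal sketch (function names are mine):

```python
import numpy as np

def average_linkage(A, B):
    """HAL: average pairwise distance between all points of A and B."""
    diffs = A[:, None, :] - B[None, :, :]
    return np.linalg.norm(diffs, axis=2).mean()

def centroid_linkage(A, B):
    """HCL: distance between the two cluster centroids."""
    return np.linalg.norm(A.mean(axis=0) - B.mean(axis=0))

A = np.array([[0.0, 0.0], [2.0, 0.0]])
B = np.array([[0.0, 3.0], [2.0, 3.0]])
print(average_linkage(A, B))   # mean of the 4 pairwise distances
print(centroid_linkage(A, B))  # centroids (1,0) and (1,3): distance 3.0
```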
Clustering Algorithms Used for Validation: HCL1/HAL1 (with Preclustering)
• Algorithm outline:
  - Initialization phase
    - Filter out active and inactive voxels
    - Thresholding using the final frame
  - Determine preclusters within each active-voxel interval
  - Repeat the preclustering process until
    - no more active-voxel intervals are available, or
    - the maximum iteration number is reached
  - Perform HAL/HCL on the reduced data set containing active voxels
    - Preclusters and isolated voxels
Clustering Algorithms Used for Validation: HCL2/HAL2 (with Preclustering and Merging)
• Algorithm outline:
  - Perform the same initialization and preclustering steps as in HCL1/HAL1
  - Calculate the mean TAC for all preclusters
  - For all isolated voxels:
    - calculate the distance to the mean TACs of the preclusters
    - merge the voxel with the precluster of closest similarity (minimum distance)
  - Perform HAL/HCL on the reduced data set containing active voxels
    - Preclusters only
Clustering Algorithms Used for Validation: K-Means
• Partitional clustering algorithm
  - Iterative relocation
• Locally minimizes the sum of squared distances ("energy") between the data points and their corresponding cluster centers:
  E = Σ_{l=1}^{k} Σ_{x_i ∈ X_l} [d(x_i, µ_l)]^2
• Initialization of the K cluster centers:
  - Totally random
  - Random perturbation from the global mean
  - Heuristic to ensure well-separated centers
Clustering Algorithms Used for Validation: K-Means Illustration
[Figure sequence: randomly initialize means → assign points to clusters → re-estimate means → re-assign points to clusters → re-estimate means → re-assign points to clusters → re-estimate means and converge]
Clustering Algorithms Used for Validation: K-Means
• Algorithm outline:
  - Initialize the K cluster centers µi randomly, 1 ≤ i ≤ K
  - Repeat until convergence:
    - Cluster assignment step: assign each data point x to the cluster Xi such that the "timeL2"/"timeL1" distance of x from µi (the center of Xi) is minimum
    - Center re-estimation step: re-estimate each cluster center µi as the mean of the points in that cluster
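The two alternating steps can be sketched in plain NumPy (Euclidean distance here; the study's variants substitute the timeL1/timeL2 weighted distances; no empty-cluster handling, toy data only):

```python
import numpy as np

def kmeans(X, k, iters=100, seed=0):
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), k, replace=False)]  # random init
    for _ in range(iters):
        # Cluster assignment step: nearest center for each point
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # Center re-estimation step: mean of each cluster's points
        new = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new, centers):
            break  # converged
        centers = new
    return labels, centers

X = np.array([[0.0], [0.2], [9.8], [10.0]])
labels, centers = kmeans(X, 2)
print(labels, centers.ravel())  # the two tight pairs form the clusters
```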
Clustering Algorithms for Dynamic PET: Intangibles
• Unsupervised
  - No predefined classes
  - No specific examples showing which relationships within the data are of biological significance
• Optimal number of clusters
  - Not known a priori
  - Difficult to rely on visual perception due to noise and high dimension
• How appropriate is the clustering method for the data at hand?
  - Would it be possible to have a better set of clusters?
• Trade-off between computational cost and cluster quality
  - How much tolerance is accepted?
Clustering of PET Data: Average Computational Costs (Minutes) (2-D, Slice 16)

#Clusters        7             6             5             4             3
K-Means    (0.34,0.79)   (0.28,0.33)   (0.16,0.19)   (0.18,0.20)   (0.12,0.16)
HCL2       (0.19,0.41)   (0.20,0.40)   (0.27,0.47)   (0.52,0.65)   (0.44,1.43)
HCL1       (0.71,3.59)   (0.84,3.85)   (0.83,3.23)   (1.29,3.49)   (1.70,3.83)
HCL        (113,183)     (120,184)     (122,184)     (122,184)     (124,185)
HAL2       (1.20,5.59)   (1.58,6.23)   (1.22,6.11)   (1.29,6.89)   (2.06,8.28)
HAL1       (0.53,7.3)    (1.04,7.68)   (0.56,8.13)   (1.53,7.95)   (2.31,8.28)
HAL        (113,186)     (120,186)     (122,185)     (122,187)     (128,191)
Cluster Validation
• Procedure for evaluating the results of a clustering algorithm in a quantitative and objective manner
• Index of cluster validity
  - Used to measure the quality and appropriateness of a clustering structure
• Approach employed in this study:
  - Intra-cluster characteristics (compactness; low magnitude)
  - Inter-cluster characteristics (well-separated; high magnitude)
  - Measures based on a combination of both (overall assessment)
Cluster Validation Metrics: Intra-Cluster
• Average distance from mean
  - For each element x_i in cluster j, 1 ≤ j ≤ p:
    m_{ji} = d(x_i, µ_j), where µ_j = (1/n_j) Σ_m x_{jm}
  - For each cluster:
    m_j = (1/n_j) Σ_i m_{ji}
  - Over all p clusters:
    m = (1/p) Σ_j m_j
  - Sensitive to noisy points
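In code, the three levels of averaging look like this (an illustrative sketch using Euclidean distance; the function name is mine):

```python
import numpy as np

def avg_distance_from_mean(clusters):
    """clusters: list of (n_j, d) arrays. Returns the per-cluster
    averages m_j and the overall average m over all p clusters."""
    m_per_cluster = []
    for Xj in clusters:
        mu = Xj.mean(axis=0)                    # cluster mean
        m_ji = np.linalg.norm(Xj - mu, axis=1)  # element distances
        m_per_cluster.append(m_ji.mean())       # m_j
    return m_per_cluster, float(np.mean(m_per_cluster))  # m

c1 = np.array([[0.0, 0.0], [2.0, 0.0]])  # mean (1,0): both at distance 1
c2 = np.array([[5.0, 5.0], [5.0, 9.0]])  # mean (5,7): both at distance 2
per_cluster, overall = avg_distance_from_mean([c1, c2])
print(per_cluster, overall)  # [1.0, 2.0] 1.5
```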
Cluster Validation Metrics: Intra-Cluster
• Maximum distance from mean
  - Measures the distance from edge to center (radius); influences the HCL algorithm
  - Highly sensitive to noise
• Maximum diameter
  - Distance between the farthest points within a cluster
  - Tests the width of a cluster; sensitive to noisy points
Cluster Validation Metrics: Intra-Cluster
• Average spread
  - Average distance of elements within a cluster to all other elements of the same cluster
  - Measure of the homogeneity of a cluster
• Total energy
  - Sum of squared distances to the mean over all cluster points (recall: the K-means objective function)
  - Expected to be smallest for K-means
  - Useful measure for comparing hierarchical methods with K-means
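Average spread and total energy can be sketched the same way (illustrative, Euclidean distance, per single cluster):

```python
import numpy as np

def average_spread(Xj):
    """Mean distance of each element to all other elements of its cluster."""
    d = np.linalg.norm(Xj[:, None, :] - Xj[None, :, :], axis=2)
    n = len(Xj)
    return d.sum() / (n * (n - 1))  # exclude the zero self-distances

def total_energy(Xj):
    """Sum of squared distances to the cluster mean (K-means objective)."""
    return float(((Xj - Xj.mean(axis=0)) ** 2).sum())

Xj = np.array([[0.0], [2.0]])
print(average_spread(Xj))  # the only pair is at distance 2 -> 2.0
print(total_energy(Xj))    # mean 1: (0-1)^2 + (2-1)^2 = 2.0
```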
Cluster Validation Metrics: Inter-Cluster
• Separation
  - For each element: average distance of element i in cluster j to the elements of cluster k (k ≠ j)
  - For each cluster: average of the individual separation values
  - Measure of the average distance of separation
• Minimum separation
  - min(separation values for a cluster)
  - Sensitive to noisy points
  - Useful for detecting questionable cluster assignments
Cluster Validation Metrics: Inter-Cluster
• Average split
  - Split: distance from point i in cluster j to the closest point in cluster k
  - Average the split values over all points in a cluster, then over all clusters
  - How close are two neighboring clusters?
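The two inter-cluster measures for a single pair of clusters can be sketched as (illustrative; function names are mine):

```python
import numpy as np

def separation(Xj, Xk):
    """Average distance from the elements of cluster j to the
    elements of cluster k (k != j)."""
    d = np.linalg.norm(Xj[:, None, :] - Xk[None, :, :], axis=2)
    return d.mean()

def average_split(Xj, Xk):
    """Distance from each point of cluster j to its closest point
    in cluster k, averaged over cluster j."""
    d = np.linalg.norm(Xj[:, None, :] - Xk[None, :, :], axis=2)
    return d.min(axis=1).mean()

Xj = np.array([[0.0], [1.0]])
Xk = np.array([[3.0], [5.0]])
print(separation(Xj, Xk))     # mean of 3, 5, 2, 4 = 3.5
print(average_split(Xj, Xk))  # mean of min(3,5)=3 and min(2,4)=2 -> 2.5
```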
Cluster Validation Metrics: Silhouette
• Measures the standardized difference between b(i) and a(i):
  s(i) = (b(i) − a(i)) / max(a(i), b(i))
  - a(i): average dissimilarity of object i to all other objects in its own cluster (average spread)
  - b(i): average dissimilarity of object i to all objects in its nearest cluster (separation)
  - s(i) close to 1: well classified; close to 0: not certain; close to −1: misclassified
• Average silhouette width
  - Average value over all clusters
  - > 0.5: good classification
  - < 0.2: lack of cluster structure
  - Useful for determining the optimal number of clusters (maximum width)
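A direct translation of the silhouette definition (an illustrative sketch; scikit-learn's `silhouette_score` offers a production version):

```python
import numpy as np

def silhouette_values(X, labels):
    """s(i) = (b(i) - a(i)) / max(a(i), b(i)) for every point."""
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    s = np.empty(len(X))
    for i in range(len(X)):
        own = (labels == labels[i])
        # a(i): mean dissimilarity to the rest of i's own cluster
        a = d[i][own].sum() / max(own.sum() - 1, 1)
        # b(i): smallest mean dissimilarity to any other cluster
        b = min(d[i][labels == l].mean() for l in set(labels) - {labels[i]})
        s[i] = (b - a) / max(a, b)
    return s

X = np.array([[0.0], [0.5], [10.0], [10.5]])
labels = np.array([0, 0, 1, 1])
s = silhouette_values(X, labels)
print(s.mean())  # well-separated clusters -> close to 1
```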
Cluster Validation Metrics: Modified Dunn’s Index*
• General form of Dunn’s index for a partition matrix U
• Modified Dunn’s index (average linkage)
• Modified Dunn’s index (complete linkage)

*Bezdek, J. C. and Pal, N. R., Some new indexes of cluster validity, IEEE Trans. on Systems, Man, and Cybernetics, Part B: Cybernetics 28 (1998), no. 3, 301-315.
Cluster Validation Metrics: Modified Dunn’s Index
• Modified Dunn’s index (combined average)
• All three measures:
  - Useful for determining the optimal number of clusters
  - Maximum value: indicator of the optimal number
  - Complete linkage: affected by noisy points
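The index formulas on these two slides did not survive the transcript; for reference, the classical Dunn's index for a c-partition {X_1, …, X_c}, which the cited Bezdek and Pal paper generalizes, has the form:

```latex
% delta(X_i, X_j): an inter-cluster distance; Delta(X_k): a cluster diameter.
% The modified indexes substitute average- or complete-linkage variants
% of delta and Delta.
V_D = \min_{1 \le i \le c} \;
      \min_{\substack{1 \le j \le c \\ j \ne i}}
      \left\{ \frac{\delta(X_i, X_j)}{\max_{1 \le k \le c} \Delta(X_k)} \right\}
```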
Results
Conclusions
• Methods comparable to each other
  - Fast hierarchical methods are comparable to the more expensive traditional methods
• K-means
  - Greater average dissimilarity within clusters
  - Fast
  - Modestly well-separated clusters
• Optimal number of clusters
  - Not very conclusive based on the given measures
  - Most instances give a value between 3 and 6
  - Domain-expert knowledge is required to determine biological relevance
Future Work
• Fuzzy clustering schemes, SOM, PCA
• Using the fast methods to exclusively study the entire brain volume (in progress)
• Advanced statistical techniques
• More indices to obtain the optimal number of clusters (Davies-Bouldin, modified Hubert’s Gamma statistic)
• Publish all results and scripts
Things I Learned
• Practical application and use of clustering concepts
  - Very broad field, several applications, active research area
• Honed Matlab scripting skills
• Opportunity to deal with real data and understand the associated challenges
• Patience: invaluable in such studies
References
• Phelps, M. E., 1992. Positron Emission Tomography. In: Mazziotta, J. and Gilman, S. (Eds.), Clinical Brain Imaging: Principles and Applications.
• Guo, H., Renaut, R., Chen, K. and Reiman, E., 2003. Clustering huge data sets for parametric PET imaging. Biosystems 71(1-2), 81-92.
• Jain, A. K., Murty, M. N. and Flynn, P. J., 1999. Data clustering: a review. ACM Computing Surveys 31(3), 264-323.
• Kaufman, L. and Rousseeuw, P., 1990. Finding Groups in Data: An Introduction to Cluster Analysis. John Wiley and Sons, New York, NY.
• Everitt, B. S., Landau, S. and Leese, M., 2001. Cluster Analysis, 4th Edition. Edward Arnold, London, UK.
• Halkidi, M., Batistakis, Y. and Vazirgiannis, M., 2001. On clustering validation techniques. Journal of Intelligent Information Systems 17(2/3), 107-145.
• Kimura, Y., Senda, M. and Alpert, N., 2002. Fast formation of statistically reliable FDG parametric images based on clustering and principal components. Phys. Med. Biol. 47(3), 455-468.
• Jain, A. K. and Dubes, R. C., 1988. Algorithms for Clustering Data. Prentice Hall.
Acknowledgements
• Dr. Rosemary Renaut
• Dr. Hongbin Guo
• Dr. Huan Liu