
  • 7/24/2019 K Means Spectral

    1/10

    Spectral Comparison Using k-Means Clustering

    Vignesh R. Ramachandran
    Johns Hopkins Applied Physics Laboratory
    Laurel, MD, [email protected]

    Samantha K. Jacobs
    Johns Hopkins Applied Physics Laboratory
    Laurel, MD, [email protected]

    Alexer H. Firpi
    Johns Hopkins Applied Physics Laboratory
    Laurel, MD, [email protected]

    Herbert J. Mitchell
    Naval Postgraduate School
    Monterey, CA, [email protected]

    Nigel H. Tzeng
    Johns Hopkins Applied Physics Laboratory
    Laurel, MD, [email protected]

    Benjamin M. Rodriguez
    Johns Hopkins Applied Physics Laboratory
    Laurel, MD, [email protected]

    Abstract: There is a growing volume of infrared (IR) spectral signature data in the scientific community, gathered from a variety of sensors using a variety of collection techniques. As the quantity of collected data grows, automated solutions for searching and matching signatures need to be developed. When searching and matching signatures, reducing computational complexity and increasing matching accuracy are essential. We present a signature classification method via k-means clustering using a novel application of spectral angle mapping to efficiently determine spectral similarity. We evaluate the method against spectral data in the SigDB spectral analysis software application developed by the Johns Hopkins University Applied Physics Laboratory (JHU/APL). The key component of this approach is the set of characteristic functions used to map signature similarity into a spatial representation. Existing methods used to autonomously identify and classify IR spectral data include spectral angle mapping and key feature detection. Spectral angle mapping is computationally slow due to the need for direct individual comparison, and key feature detection improves computation time but is limited by the specific features selected for comparison. The accuracy and computation time of the spectral cluster classification method are evaluated against spectral angle mapping and visual analyses on the ASTER NASA spectral library. The goal of this method is to improve both the accuracy and speed of classifying newly collected unlabeled spectra. We find that the proposed method of scoring signatures offers a speed increase of three orders of magnitude in comparing spectra at the expense of a high false positive rate, making it suitable for use as a first-pass filter. We further find that the k-means cluster-based classification is highly sensitive to the selection of initial cluster centroids, and offer alternative solutions to use with our scoring method.

    TABLE OF CONTENTS

    1 INTRODUCTION
    2 RELATED WORK
    3 METHODOLOGY
    4 FINDINGS AND ANALYSIS
    5 CONCLUSION AND FUTURE WORK
    REFERENCES
    BIOGRAPHY

    978-1-4799-1622-1/14/$31.00 ©2014 IEEE. IEEEAC Paper #2635, Version 1, Updated 11/15/2013.

    1. INTRODUCTION

    Growing use of infrared spectral signature data in scientific and forensic analysis requires collecting large quantities of data from a variety of spectrometers using a variety of techniques in diverse environmental conditions. Variations in observed spectral features, regardless of the quality of the data, make signature classification and comparison both challenging for spectral analysts and often impossible for automated systems. As the quantity of collected data continues to grow, automated solutions are increasingly critical. For example, the national Integrated Signatures Program (ISP) has collected approximately one million infrared (IR) spectra, which largely rely on manually produced metadata for identification and classification. The manual production of metadata, at this scale, requires significant and rising cost and time investment to reduce errors and inconsistencies.

    The primary methods currently used to autonomously identify and classify IR spectral data are spectral angle mapping and key feature detection [1]. Spectral angle mapping cannot compare spectra with differing domains (e.g., spectral range, spectral resolution or inconsistently removed bands) without significant preprocessing. Thus, spectral angle mapping is a computationally slow process, running in linear time against an entire reference library to identify a single new or unknown signature. Key feature detection improves computation time by comparing only predetermined feature locations in the reference spectra, but this method requires the user to specifically identify the spectral features of interest.

    A novel approach to signature classification via scoring and clustering is presented. A set of characteristic mathematical functions is used as artificial reference spectra to score library signatures, and a k-means clustering algorithm determines classification clusters in the score space. New signatures are scored against the same characteristic functions to determine their location in score space, and thus their likely classification. Since new signatures need only be compared against cluster centroids to determine classification, the algorithm performs in O(k) time, where $k \approx \sqrt{n/2}$ and n is the number of library signatures; i.e., the computation time increases linearly with respect to the predetermined number of clusters k. We apply this approach to a reference sampling of signatures from the ASTER spectral library [2] and evaluate accuracy and computation time versus direct spectral angle comparison.

    This methodology identifies close matches in a spectral library to newly collected spectra three orders of magnitude faster than direct spectral angle mapping. Used in tandem with traditional signature analysis, this method provides a first-pass coarse screening of spectral classification to reduce the size of the identification pool. Reducing the workload on a more intensive secondary analysis allows much larger reference libraries to be used in the near-real-time classification of field-collected spectra. Greater access to spectral data, in addition to the ability to provide a preliminary classification of newly collected spectra, provides forensic analysts and first responders with enhanced chemical detection capabilities when they need them most.

    The remainder of this paper is structured as follows. In Section 2, related work in the area of spectral signature matching is presented. Our proposed method, signature classification via scoring and clustering, is described in Section 3. Findings and analyses are provided in Section 4. Finally, we conclude and describe opportunities for future work in Section 5.

    2. RELATED WORK

    The need for spectral signature comparison and identification has driven substantial work in the application of pattern recognition, unsupervised and semi-supervised learning, and data clustering [3] [4] [5]. Further, the need to quantify and analyze enormous quantities of spectral data has spawned many attempts at spectral collections or databases, with mixed results [6]. The inability to reduce various methods of collection and phenomenologies of spectra into a least common denominator representation has made the problem computationally challenging, especially without the use of copious metadata to explain the exact conditions and context in which the spectra were collected.

    The classification and identification of unlabeled data has been studied in great detail in the spatial domain, and a wealth of effective solutions has been developed to address the problem [3] [7]. Spatial clustering algorithms in general attempt to determine natural boundaries between non-uniformly distributed spatial data points. Of these, k-means is a relatively simple approach: given some known number of clusters k, cluster center points are randomly distributed among the sample data, then iteratively updated to reflect the average of their nearby constituents. A key assumption is advance knowledge of k, as this algorithm has no ability to merge or split existing clusters. However, it is the simplest in a large field of clustering solutions that includes hierarchical clustering, fuzzy k-means, DBSCAN, expectation-maximization, and many others [3] [8]. If the spectral classification problem can be effectively adapted into the spatial domain, any of these existing methods can be applied.

    Several direct and indirect methods exist to compare signatures against each other, such as Spectral Angle Mapping (SAM) [4], multiple endmember spectral mixture analysis (MESMA) [9], peak detection, and similarity indices such as the Pearson correlation [1]. In general practice, the sensitivity and utility of each method is inversely correlated with its runtime computational complexity [1]. The SAM algorithm is of specific interest due to its ability to precisely describe the difference between two signatures without regard for the relative illumination within the spectra (which is irrelevant to the spectral features of the observed material) [4]. SAM measures similarity by taking a signature in two dimensions (X and Y) and creating a spectral vector consisting of its Y values. The vector's dimensionality is equal to the number of data points in the signature. Two spectral vectors are compared by simply computing the angle between them via their vector dot product. This method inherently assumes that the two spectra being compared share precisely the same domain: not only the same spectral band, but also the same sampling resolution and specific domain values. Thus a signature sampled at 5, 10, 15... 100 microns cannot be immediately compared with another signature sampled at 7, 12, 17... 102 microns, though the domains almost entirely overlap. Any mismatch in domain must be resolved by resampling through interpolation and extrapolation. Resolving the mismatch is computationally inefficient since, pathologically, every possible pair of spectra may require resampling.
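    The angle computation described above can be sketched in a few lines of Python. This is a minimal illustration rather than the SigDB implementation: the function name `spectral_angle` and the sample reflectance values are assumptions made up for the example. It shows that scaling a signature (a change in illumination) leaves the angle at essentially zero, and that vectors of different length are rejected because they cannot share a domain.

```python
import math

def spectral_angle(a, b):
    """Angle in degrees between two spectral vectors of Y values.

    Both vectors must be sampled on exactly the same domain, so they
    must at least have the same length.
    """
    if len(a) != len(b):
        raise ValueError("spectra must share the same sampling domain")
    dot = sum(x * y for x, y in zip(a, b))
    mag_a = math.sqrt(sum(x * x for x in a))
    mag_b = math.sqrt(sum(y * y for y in b))
    # Clamp to guard against floating-point drift just outside [-1, 1].
    cos_theta = max(-1.0, min(1.0, dot / (mag_a * mag_b)))
    return math.degrees(math.acos(cos_theta))

# Made-up reflectance values, for illustration only.
reflectance = [0.12, 0.35, 0.41, 0.28, 0.19]
brighter = [2.5 * y for y in reflectance]   # same shape, stronger illumination
different = [0.40, 0.22, 0.10, 0.31, 0.45]  # a differently shaped spectrum

print(spectral_angle(reflectance, brighter))   # essentially zero
print(spectral_angle(reflectance, different))  # clearly non-zero
```

    Because only the direction of the vector matters, the second signature is a perfect match for the first despite being 2.5 times brighter.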

    3. METHODOLOGY

    We formulate a methodology to rapidly classify a new, unknown signature by identifying signatures in a spectral library with similar spectral features. Instead of individually comparing the unknown signature against each member of the library, the proposed method precomputes a score representation of the library against a small number (N) of artificial reference spectra. Using a derivative of the SAM method, the scalar spectral angle values between each library signature and the reference spectra are treated as coordinates of a point in N-dimensional space, and cached within the library; when a new signature is introduced, it is compared against the reference spectra to produce a corresponding set of coordinates. Then, sets of spatial coordinates with smaller Euclidean distances correspond to library spectra with greater similarity to the new signature. This method is intended to filter the library down to a small subset of candidate matches.

    Though the primary goal is to optimize spectral comparison, this paper also investigates the opportunity to classify and identify unlabeled signatures by characterizing the generated N-dimensional map as a spatial clustering problem. k-Means, an elementary but very popular [3] clustering algorithm, generates cluster associations among spatial coordinates. A new, unlabeled signature's spectral scores will place it within a defined cluster; then, the labels of the library spectra sharing the same cluster become preliminary guesses at the unknown signature's classification. Since the labels of library signatures are predetermined, and signatures with the same label are expected to have very similar spectral characteristics, the score clustering process also serves to validate the choice of reference spectra.

    Characteristic Spectral Angle Mapping (cSAM)

    Our proposed method adapts SAM spectral comparison as a measure of indirect, rather than direct, spectral similarity. Traditionally, SAM operates on the principle that if signature A has spectral angle $\theta_{AC}$ to signature C, then a small $\theta_{AC}$ implies spectral similarity between A and C. Our modified application instead proposes that a characteristic function, such as y = cos x, can serve as an artificial reference signature B. If signatures A and C have spectral angles $\theta_{AB}$ and $\theta_{BC}$ to B respectively, then $\theta_{AB} \approx \theta_{BC}$ implies a degree of similarity between A and C. This relationship is not as precise as SAM: the set of spectral vectors satisfying a given spectral angle to B traces a surface around the reference signature's vector, as shown in Figure 1: here both A and C present the same spectral angle to B, but $A \neq C$. In this three-dimensional representation, the area of ambiguity presents as the surface of a cone; the spectral vector of a 500-point signature has 500 dimensions, and thus the equal-angle ambiguity surface cannot be represented graphically.

    Figure 1. Conical Surface of 3D Vectors Satisfying Spectral Angle to Reference Vector B

    The solution set of spectral vectors represents many spectra that are equally similar to the reference signature, at any magnitude. The magnitude of the spectral vector represents the illumination of the signature, which is an artifact of the collection environment and irrelevant for the purposes of determining the similarity of spectral characteristics [4]. This leaves a cross-section of spectral vectors that all exhibit the same degree of similarity to B; for vectors in three dimensions, this presents as a circle orthogonal to B. This level of ambiguity is irreducible using only one reference signature. Using multiple reference signatures further constrains the solution to the intersection of the corresponding vector sets, as shown in Figure 2. Thus if two signatures A and C exhibit spectral angles $\theta_{AB_1} \approx \theta_{CB_1}$ to $B_1$ and $\theta_{AB_2} \approx \theta_{CB_2}$ to $B_2$, it becomes increasingly likely that $A \approx C$. Introducing additional reference signatures $B_i$ can constrain the solutions still further, at the expense of additional calculations.
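    A small three-dimensional example makes the ambiguity concrete. The vectors below are hypothetical values chosen for illustration: A and C lie on the same cone around a first reference vector, so a single reference cannot tell them apart, while a second reference separates them.

```python
import math

def angle_deg(u, v):
    """Angle in degrees between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return math.degrees(math.acos(max(-1.0, min(1.0, dot / norm))))

# Hypothetical reference vectors and signatures.
B1 = (0.0, 0.0, 1.0)
B2 = (1.0, 0.0, 0.0)
A = (1.0, 0.0, 1.0)
C = (0.0, 1.0, 1.0)

# Against B1 alone, A and C are indistinguishable (both 45 degrees) ...
print(angle_deg(A, B1), angle_deg(C, B1))
# ... although they are 60 degrees apart from each other.
print(angle_deg(A, C))
# A second reference breaks the tie: 45 degrees versus 90 degrees.
print(angle_deg(A, B2), angle_deg(C, B2))
```

    The converse also holds in this sketch: two vectors that agree on the angles to every reference may still differ, which is why the approach is positioned as a coarse filter rather than a final match.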

    The application of cSAM allows every signature in a spectral library to be reduced into a score vector:

    $$\vec{\theta} = \begin{bmatrix} \theta_1 \\ \vdots \\ \theta_N \end{bmatrix} \qquad (1)$$

    where N is the number of characteristic functions B. The $\theta$ values can then be considered the coordinates of a point in N-dimensional space, where $B_1 .. B_N$ serve as axes (orthogonality is not required, but is desirable). In this new spatial representation, score similarity can be characterized as the Euclidean distance between two points; this enables the use of existing spatial clustering algorithms, such as k-means, to perform classification of spectra.
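    Written out, the Euclidean distance between the score vectors of two signatures A and C takes the following form. The symbol $d(A, C)$ and the numeric values in the worked case are assumptions introduced here for illustration; the paper does not name the distance.

```latex
d(A, C) = \sqrt{\sum_{j=1}^{N} \left( \theta_{AB_j} - \theta_{CB_j} \right)^2}

% Worked case with N = 2 and hypothetical scores in degrees:
% A scores (45, 45) and C scores (45, 90) against B_1 and B_2.
d(A, C) = \sqrt{(45 - 45)^2 + (45 - 90)^2} = 45
```

    A distance of zero means the two signatures are indistinguishable to the chosen reference functions, which is a necessary but not sufficient condition for the spectra themselves to match.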

    Preconditions

    The choice of mathematical functions used to produce reference spectra, and thus the spectral angle scores used for comparison, is a critical factor in the utility of this approach. Poorly chosen functions result in spectral angles that are highly similar for many or all library spectra. Functions that perform well in one spectral band may perform poorly in others, thus requiring different sets of functions for different bands within the library. For the purposes of this paper, a small set of simplistic functions has been chosen to illustrate the general approach. Selection and evaluation of more appropriate characteristic functions will be the subject of further work, as are methods of dynamically generating appropriate characteristic functions for a given spectral library.

    Figure 2. Intersection of 3D Solutions with Spectral Angles $\theta_1$, $\theta_2$ to Reference Vectors $B_1$, $B_2$ Respectively

    The k-means algorithm requires, as initialization parameters, the expected number of classification clusters (k); some initial selection of centroid locations for the clusters; and an error threshold, to limit the number of iterations. The number of expected classifications was estimated using a common rule-of-thumb value, $k \approx \sqrt{n/2}$ [10], and the initialization centroids were randomly chosen from the sample set. However, given that the library spectra are generally well-labeled, the clustering problem can instead be tackled with a semi-supervised solution: that is, the known material and chemical composition of the samples can be used to intelligently select a diverse set of signatures to serve as the initial centroid locations. Guided initialization can significantly affect the determination of cluster associations, as starting-point selection is known to strongly affect the result of the k-means algorithm [11]. Initialization is also a focus of ongoing work.
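    The two initialization choices discussed above can be sketched as follows. The rule-of-thumb function follows the formula in the text; `label_guided_seeds` is a hypothetical illustration of one possible guided strategy (one seed per distinct label), not a method described by the authors, and the signature IDs, labels, and scores are invented.

```python
import math

def rule_of_thumb_k(n):
    """Rule-of-thumb cluster count k ~ sqrt(n / 2), rounded to an integer."""
    return max(1, round(math.sqrt(n / 2)))

def label_guided_seeds(scores, labels, k):
    """Pick up to k initial centroids, one per distinct label, in first-seen order.

    Hypothetical strategy for illustration: labelled library signatures
    stand in for a 'diverse set' of starting points. If fewer than k
    distinct labels exist, fewer than k seeds are returned.
    """
    seeds, seen = [], set()
    for sig_id, label in labels.items():
        if label not in seen:
            seen.add(label)
            seeds.append(scores[sig_id])
        if len(seeds) == k:
            break
    return seeds

# Invented score vectors and labels.
scores = {"s1": (1.0, 2.0), "s2": (1.1, 2.1), "s3": (50.0, 60.0), "s4": (80.0, 10.0)}
labels = {"s1": "mineral", "s2": "mineral", "s3": "soil", "s4": "vegetation"}

print(rule_of_thumb_k(1800))                  # 30
print(label_guided_seeds(scores, labels, 2))  # [(1.0, 2.0), (50.0, 60.0)]
```

    For a library of 1,800 signatures the rule of thumb gives k = 30, which is the value used in the evaluation.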

    Detailed Approach

    Each signature in the spectral database consists of columns of spectral data accompanied by various optional metadata properties, such as sensor identification and calibration, environmental conditions, sample identification and description, axis units and labels, and any known observable associations. The data is in the wavelength domain with value columns representing either reflectivity or emissivity, as indicated by axis properties (see Figure 3, an example signature from the ASTER library [12]). Note that NaN float values are used to represent invalid or removed data points within the spectra, such as deliberately suppressed water bands. A hash of the spectral data uniquely identifies the signatures; therefore, two signatures having the same identifier are assumed to be identical, and cannot both exist in the database. The phenomenology of the signature (LWIR, MWIR, SWIR, VIS/NIR) is also indicated by metadata properties.

    Figure 3. Sulphur Signature from the ASTER Spectral Library

    A set of characteristic mathematical functions is selected based on a set of desirable properties:

    • Each function is two-dimensional, because the comparison spectra are two-dimensional.
    • Each function is real and differentiable within the domain of interest, so that a spectral angle mapping against any real spectrum will yield a real-valued solution.
    • Each function generates a broad range of score values against available library spectra within the domain of interest.
    • Each function is linearly independent of, and preferably orthogonal to, the other selected characteristic functions within the domain of interest, so that each presents a unique spectral vector.

    With each of the desirable properties, we attempt to guide and constrain the selection of characteristic functions to those that generate a wide distribution of real-valued spectral angle scores within the database of spectra. Thus, the selection of appropriate specific characteristic functions is contingent on the nature of the library against which it is applied. The number of functions to be used is likewise flexible; more axes for comparison lead to less ambiguity in the set of solutions, at the expense of computation time and volume of score data. The choice of characteristic functions can be validated by a set of empirical desirability tests against the database:

    1. When the function is calculated against the library spectra, are any of the produced scores NaN (undefined, infinite, or otherwise non-real)? If so, this indicates the function is not real and/or differentiable throughout the entire domain of interest.

    2. Are the produced spectral scores broadly distributed? If not, the function is not a good discriminator for the signatures of interest.

    3. Does the function produce scores that are highly correlated with scores produced by another function? If so, the functions may not be linearly independent, or they may measure highly correlated spectral features. One of the functions may be used, but not both.
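    The three tests above could be automated along the following lines. This is a sketch under stated assumptions: the helper names are hypothetical, the score lists are invented, and the paper does not specify numeric thresholds for "broadly distributed" or "highly correlated", so none are hard-coded here.

```python
import math

def has_non_real(scores):
    """Test 1: does any score come back NaN or infinite?"""
    return any(math.isnan(s) or math.isinf(s) for s in scores)

def spread(scores):
    """Test 2: width of the score range, in the same units as the scores."""
    return max(scores) - min(scores)

def pearson(xs, ys):
    """Test 3: Pearson correlation between the scores of two functions."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    dev_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    dev_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (dev_x * dev_y)

# Invented scores (degrees) of four signatures against three candidate functions.
narrow = [88.2, 89.0, 90.5, 92.1]        # poor discriminator: tiny spread
broad = [12.0, 35.5, 61.0, 84.2]         # good discriminator: wide spread
redundant = [2 * s + 1 for s in broad]   # carries no information beyond `broad`

print(has_non_real(broad))               # False
print(spread(narrow), spread(broad))
print(pearson(broad, redundant))         # close to 1.0, so keep only one of the two
```

    In practice the acceptable spread and correlation limits would have to be chosen per library, since they depend on how tightly the library's materials cluster.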

    The cSAM algorithm is used to determine score values for the database, as shown in Algorithm 1. The procedure compares each signature against the characteristic functions at the exact domain locations in which the former is defined. Because the characteristic functions are real and continuous, real Y values are always returned at any signature-specified domain location. NaN values are ignored in the comparison (lines 9-10), and thus removed or suppressed bands have no impact on the score. For each real-valued datapoint, the product is accumulated through the algebraic definition of the dot product (line 12 of the algorithm):

    $$A \cdot B = \sum_{i=1}^{M} a_i b_i = a_1 b_1 + \dots + a_M b_M \qquad (2)$$

    M is the dimensionality of the spectral vector and each $a_i$ is the Y value of one data point in the signature. Here, B is an artificial vector that is automatically generated based on a function $f_j$. Correspondingly, $b_i = f_j(x_i)$, where $x_i$ is the X value corresponding to the Y value $a_i$. The geometric definition of the dot product then determines the angle between the spectral vectors (line 17):

    $$A \cdot B = |A||B| \cos\theta_{AB} = \sum_{i=1}^{M} a_i b_i \qquad (3)$$

    $$\Rightarrow \theta_{AB} = \cos^{-1}\left( \frac{\sum_{i=1}^{M} a_i b_i}{|A||B|} \right) \qquad (4)$$

    Thus the magnitudes of the vectors are also accumulated in the loop (lines 13-14). Each resulting $\theta_{AB}$ is that signature's score against a characteristic function $f_j$; together they result in a score vector:

    $$S_i = \begin{bmatrix} \theta_{S_i f_1} \\ \theta_{S_i f_2} \\ \vdots \\ \theta_{S_i f_N} \end{bmatrix} \qquad (5)$$

    The procedure is performed against every signature in the library to create a baseline set of score values. The two-dimensional array of scores is indexed by each signature's unique identifier. If additional signatures are added to the library, sets of scores are calculated for the new additions and stored.

    When an unknown signature S is introduced, the same method is used to calculate a set of scores. A Euclidean distance value is then computed against each existing signature's set of scores. The scores of each database signature serve as its coordinates in the N-dimensional function space. The unknown signature's location, as determined by its own scores, should then be located closest to other signatures that share similar spectral characteristics. These nearest neighbors are then selected for further automated or visual analysis, at the user's discretion.
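    The nearest-neighbor selection described above amounts to ranking cached score vectors by distance to the unknown's scores. The sketch below assumes scores are stored as a dictionary keyed by signature identifier; the function name, the identifiers, and the score values are all hypothetical.

```python
import math

def nearest_neighbors(unknown_scores, db_scores, count):
    """Return the `count` library signatures closest to the unknown in score space.

    db_scores maps a signature identifier to its cached score vector.
    The result is a list of (identifier, distance) pairs, nearest first.
    """
    ranked = sorted(
        db_scores.items(),
        key=lambda item: math.dist(unknown_scores, item[1]),
    )
    return [(sig_id, math.dist(unknown_scores, s)) for sig_id, s in ranked[:count]]

# Invented three-function score vectors in degrees.
db_scores = {
    "sig-a": (10.0, 40.0, 90.0),
    "sig-b": (10.5, 41.0, 90.2),
    "sig-c": (60.0, 75.0, 89.0),
}
unknown = (10.2, 40.5, 90.1)

for sig_id, distance in nearest_neighbors(unknown, db_scores, 2):
    print(sig_id, round(distance, 3))
```

    Each comparison costs only N subtractions and multiplications regardless of how many data points the original spectra contain, which is where the speed advantage over direct SAM comes from.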

    The k-means clustering algorithm (Algorithm 2) is applied any time the spectral library is updated. The set of k cluster centroids C is initialized as a random sampling of locations within the dataset (line 5). Each signature $S_i$ is associated with the centroid closest to it by Euclidean distance (lines 7-9), then new centroid locations are calculated for each cluster representing the average of the cluster constituents' locations (lines 10-12). Then the sum change in centroid positions $\delta$ between the current iteration and the previous is calculated (lines 13-16). The associate / update loop continues until $\delta$ falls below the threshold parameter $\epsilon$ (line 6), indicating that the cluster associations have stabilized. The algorithm returns the final cluster centroid locations, along with the mapping of signatures to assigned cluster (line 19).

    Algorithm 1 cSAM Signature Scoring

     1: procedure GENERATESCORES(S, f)
     2:     scores := 2D array of [signature IDs][score values]
     3:     for each S_i in S do
     4:         for each f_j(x) in f do
     5:             product ← 0
     6:             sMag ← 0
     7:             fMag ← 0
     8:             for each datapoint (X, Y) in S_i do
     9:                 if X or Y is NaN then
    10:                     skip datapoint
    11:                 else
    12:                     product = product + (Y × f_j(X))
    13:                     sMag = sMag + Y²
    14:                     fMag = fMag + f_j(X)²
    15:                 end if
    16:             end for
    17:             scores[S_i][f_j] = cos⁻¹( product / (√sMag × √fMag) )
    18:         end for
    19:     end for
    20:     return scores
    21: end procedure
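    Algorithm 1 translates almost line for line into Python. The version below is a sketch, not the SigDB plug-in (which the paper describes as Java): the data layout (dictionaries of `(x, y)` tuples), the function names, and the choice to return NaN for an all-zero or empty signature are assumptions. The three characteristic functions are those listed in Table 1, with x in microns, and scores are returned in degrees.

```python
import math

def generate_scores(signatures, functions):
    """Score every signature against every characteristic function.

    signatures: {signature_id: [(x, y), ...]}
    functions:  {function_name: callable taking x and returning y}
    Returns {signature_id: {function_name: spectral angle in degrees}}.
    """
    scores = {}
    for sig_id, points in signatures.items():
        scores[sig_id] = {}
        for name, f in functions.items():
            product = s_mag = f_mag = 0.0
            for x, y in points:
                if math.isnan(x) or math.isnan(y):
                    continue  # removed or suppressed datapoint: no effect on the score
                fx = f(x)
                product += y * fx
                s_mag += y * y
                f_mag += fx * fx
            denom = math.sqrt(s_mag) * math.sqrt(f_mag)
            if denom == 0.0:
                # Assumption: an empty or all-zero vector has no defined angle.
                scores[sig_id][name] = math.nan
            else:
                cos_theta = max(-1.0, min(1.0, product / denom))
                scores[sig_id][name] = math.degrees(math.acos(cos_theta))
    return scores

# Characteristic functions from Table 1 (x in microns).
CHARACTERISTIC_FUNCTIONS = {
    "10-nm cosine": lambda x: math.cos(100 * x),
    "1-um cosine": lambda x: math.cos(x),
    "100-um cosine": lambda x: math.cos(x / 100),
}

# Invented signatures: "a" is a sampled cosine, "b" adds a suppressed point,
# and "c" is "a" under three times the illumination.
xs = [0.5, 1.0, 2.0]
demo = {
    "a": [(x, math.cos(x)) for x in xs],
    "b": [(0.5, math.cos(0.5)), (0.75, math.nan), (1.0, math.cos(1.0)), (2.0, math.cos(2.0))],
    "c": [(x, 3 * math.cos(x)) for x in xs],
}
demo_scores = generate_scores(demo, CHARACTERISTIC_FUNCTIONS)
print(demo_scores["a"])
```

    Signature "a" scores essentially zero against the 1-µm cosine because it is that function; "b" and "c" receive the same scores as "a", illustrating that suppressed bands and illumination changes do not move a signature in score space.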

    Algorithm 2 k-Means Clustering

     1: procedure KMEANS(dbScores[S(f)], Number of classes k, Error Threshold ε)
     2:     C, C′ := size-k arrays of centroid locations
     3:     A := mapping of signatures to assigned cluster #
     4:     δ := ∞    (change between iterations)
     5:     C ← k points randomly selected from dbScores
     6:     while δ > ε do
     7:         for each S_i in S do
     8:             A ← (S_i → nearest cluster in C)
     9:         end for
    10:         for each C_j in C do
    11:             C′_j = average of all points mapped to C_j in A
    12:         end for
    13:         δ = 0
    14:         for each C_j in C do
    15:             δ = δ + |C_j − C′_j|
    16:         end for
    17:         C ← C′
    18:     end while
    19:     return C, A
    20: end procedure
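    Algorithm 2 can likewise be sketched in Python. The `initial` parameter is a hypothetical extension that lets a caller supply starting centroids (for example, from a guided initialization) instead of the random sampling on line 5, and leaving an empty cluster's centroid where it was is an assumption, since the pseudocode does not say what happens in that case. The sample points are invented.

```python
import math
import random

def kmeans(points, k, threshold, initial=None, seed=None):
    """k-means following the associate / update loop of Algorithm 2.

    points:  list of equal-length tuples (score vectors).
    initial: optional list of k starting centroids; when omitted,
             k points are sampled at random from `points`.
    Returns (centroids, assignment); assignment[i] is the cluster of points[i].
    """
    if initial is not None:
        centroids = list(initial)
    else:
        centroids = random.Random(seed).sample(points, k)
    delta = math.inf
    while delta > threshold:
        # Associate each point with its nearest centroid (lines 7-9).
        assignment = [
            min(range(k), key=lambda j: math.dist(p, centroids[j])) for p in points
        ]
        # Move each centroid to the mean of its members (lines 10-12).
        updated = []
        for j in range(k):
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:
                updated.append(tuple(sum(axis) / len(members) for axis in zip(*members)))
            else:
                updated.append(centroids[j])  # assumption: an empty cluster stays put
        # Total centroid movement in this iteration (lines 13-16).
        delta = sum(math.dist(old, new) for old, new in zip(centroids, updated))
        centroids = updated
    return centroids, assignment

# Two invented, well-separated groups of two-dimensional score vectors.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
centroids, assignment = kmeans(points, 2, 1e-5, initial=[points[0], points[3]])
print(centroids)
print(assignment)  # [0, 0, 0, 1, 1, 1]
```

    Passing explicit starting centroids makes the run repeatable, which is convenient when comparing initialization strategies against one another.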

    Implementation

    In support of ISP spectral analysis, JHU/APL has developed a standardized database schema representation of spectral signatures, and an associated Java-language software application, SigDB, to aid in exchange, preliminary analysis, comparison and classification of collected spectra. The scoring and clustering methodology described herein was developed as a plug-in capability for the SigDB application, which enabled immediate access to a large quantity of spectral data and a framework for analysis. SigDB stores signature data in an SQL relational database as IEEE 754 64-bit floating point values. Signature, sample, environment, sensor, and observable metadata are all stored in various database tables and referenced by the signature's unique hash identifier.

    Table 1. Selected Characteristic Functions (x in microns)

    Name             Equation
    10-nm Cosine     y = cos(100x)
    1-µm Cosine      y = cos(x)
    100-µm Cosine    y = cos(x/100)

    The data used for comparison was selected from the Advanced Spaceborne Thermal Emission Reflection Radiometer (ASTER) Spectral Library 2.0, a collection of spectra of natural and man-made materials produced by a collaboration of the Jet Propulsion Laboratory, the Johns Hopkins University, and the United States Geological Survey [2]. The data spans the 0.4 to 15.4 µm wavelength range, which includes the visual and near-infrared (VIS/NIR), shortwave (SWIR), and thermal infrared (TIR) electromagnetic bands. All of the selected signatures describe directional hemispherical reflectance as collected by the NASA Terra spaceborne hyperspectral imaging platform, are represented in percent reflectivity, and consist of approximately 400-600 data points each. The data are not uniformly sampled; for example, some begin at 0.43 µm and others at 0.3 µm. However, all of the data do exhibit the same sampling resolution: 2 nm up to 0.8 µm, 20 nm between 0.8 µm and 5 µm, and 100 nm between 5 µm and 14 µm.

    Table 1 lists the preliminary characteristic functions selected as reference spectra for this paper. The selections, all cosine functions of varying frequency, are intended to mirror the desirable properties identified above while remaining computationally trivial to execute. Under the assumption that all signature wavelength values are represented in microns, the selected functions capture spectral features at the 10-nanometer, 1-micrometer and 100-micrometer resolutions. The computation of discrete values for these functions with respect to the described Signature Scoring algorithm was hard-coded into the software implementation.

    1,800 samples of various minerals, soils, vegetation, and manmade materials were selected from the ASTER library for comparison. Spectral angle scores were generated against the three characteristic functions above and stored in the database. The k-means algorithm was performed against the score dataset using $k = \sqrt{n/2} = 30$ randomly selected signatures as initial centroid locations, where n = 1,800 is the number of signatures. The selected error threshold was $\epsilon = 1 \times 10^{-5}$. The centroid locations converged on a stable solution at this error threshold after 29 iterations.

    One signature, 14259.61 (a sample of lunar dust collected from the Apollo 11 mare site), was chosen to represent the unknown signature (Figure 4). Its score data was manually removed from the database, then the cSAM plugin was run to recalculate the score and determine its cluster association. The runtimes of both the database scoring / clustering process and the signature classification process were recorded. Also, to evaluate the accuracy and run-time of the scoring process without clustering, the unknown signature's score values were recomputed and spatially compared against all 1,799 database spectra via Euclidean distance computation.

    In addition, a traditional SAM algorithm was also implemented as a SigDB plug-in capability and run against the same dataset. The same signature was selected as the unknown and compared against the 1,799 other signatures. This SAM implementation does not perform any interpolation or extrapolation to align signature domains; as a result, database signatures whose minimum/maximum domain values differ from those of the unknown were automatically filtered out before comparison. Further, if the code detects any mismatch in domain values within the two compared spectra during comparison (such as a missing/suppressed datapoint in one but not the other), it immediately terminates that comparison, reports a NaN spectral score, and moves on to the next comparison; however, these terminated comparisons still impact the run-time of the algorithm. These are known and accepted limitations of the traditional SAM algorithm, and are usually worked around via data interpolation and extrapolation. The reported matches and run-times of each process were recorded and compared.

    Figure 4. Selected Unknown Signature for Search Comparison
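    The strict comparison behavior described for the baseline SAM plug-in can be sketched as follows. This is an illustration of the described rules, not the plug-in's code: the function names, the `(x, y)` tuple layout, the NaN result for an all-zero vector, and the sample values are assumptions.

```python
import math

def same_extent(sig_a, sig_b):
    """Pre-filter: do both signatures start and end at the same X values?"""
    return sig_a[0][0] == sig_b[0][0] and sig_a[-1][0] == sig_b[-1][0]

def strict_sam(sig_a, sig_b):
    """Spectral angle in degrees, or NaN as soon as any domain mismatch is found.

    No interpolation or extrapolation is attempted. A datapoint suppressed
    (NaN) in one signature but not the other counts as a mismatch.
    """
    if len(sig_a) != len(sig_b):
        return math.nan
    dot = mag_a = mag_b = 0.0
    for (xa, ya), (xb, yb) in zip(sig_a, sig_b):
        if xa != xb or math.isnan(ya) != math.isnan(yb):
            return math.nan  # terminate this comparison immediately
        if math.isnan(ya):
            continue  # suppressed in both signatures: skip the point
        dot += ya * yb
        mag_a += ya * ya
        mag_b += yb * yb
    if mag_a == 0.0 or mag_b == 0.0:
        return math.nan  # assumption: no angle is defined for a zero vector
    cos_theta = max(-1.0, min(1.0, dot / math.sqrt(mag_a * mag_b)))
    return math.degrees(math.acos(cos_theta))

# Invented signatures sampled in microns.
unknown = [(5.0, 0.2), (10.0, 0.4), (15.0, 0.1)]
aligned = [(5.0, 0.4), (10.0, 0.8), (15.0, 0.2)]          # same domain
shifted = [(7.0, 0.2), (12.0, 0.4), (17.0, 0.1)]          # offset domain
gapped = [(5.0, 0.2), (10.0, math.nan), (15.0, 0.1)]      # one suppressed point

print(same_extent(unknown, shifted))  # False: filtered out before comparison
print(strict_sam(unknown, aligned))   # essentially zero
print(strict_sam(unknown, gapped))    # nan
```

    A comparison that ends in NaN has still consumed time up to the point of the mismatch, which is consistent with the observation that terminated comparisons contribute to the measured run-time.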

    Results

    Table 2 shows the number of computations and runtimes for each of the algorithm processes performed. The first row is a traditional spectral angle mapping comparison of the unknown signature against the database spectra. Although the database contains 1,800 signatures, only 477 match the same minimum/maximum domain, so only these were selected for comparison; of these, only 17 signatures matched every datapoint's domain precisely. Thus, the other 460 calculations were terminated before completion. This process took approximately forty seconds; the generated scores, along with the name of each signature, are shown in Table 3. The second row includes computation of scores against each of the three characteristic functions, k-means cluster classification (which included 29 iterations of the clustering algorithm), and storage in the database. All database spectra were scored, but no actual comparisons were performed in this step. This process took approximately four minutes. The third row includes calculation of the unknown signature's scores against the characteristic functions, then Euclidean distance comparison of those scores against each of the 30 cluster centroids to determine classification. This process took ten milliseconds. The final row, which is not performed in the normal course of k-means analysis, was a recalculation of the unknown signature's scores (in order to fairly compare run-time) and Euclidean distance comparison against all 1,799 other score sets in the database. This process took fifteen milliseconds, and the cSAM Euclidean distances to the signatures scored by traditional SAM are also shown in Table 3.

    Figure 5. Scores of Selected ASTER Data against 1-µm and 100-µm Cosine

    Figure 6. Signatures Scored by SAM (Lunar Dust and Sea Water)

    Figure 5 illustrates the overall spread of spectral angle scores of the 1,800 ASTER signatures against two of the three selected characteristic functions. The third function, 10-nm cosine, results in minimal differentiation between signatures; all fall in the range [88.2, 92.1] degrees, so this axis is omitted in the figure. The narrow differentiation of scores by the 10-nm cosine and the relatively broad distribution of scores against the other two functions illustrate both implications of the second empirical test of desirability: the former indicates that the 10-nm cosine is undesirable for the dataset at hand, while the latter indicates that the 1-µm and 100-µm cosines do perform well as discriminators. Table 4 describes the location of the cluster centroids in the score space produced by the three functions.


    Table 2. Run-time Comparison

    Process                           # Score Calculations   # Comparisons   Run-Time (ms)   Avg ms per Comparison
    SAM Algorithm                     477 (17)               477 (17)        39,794          83
    cSAM Score Computation            1800                   0               223,340         N/A
    cSAM Cluster Comparison           1                      30              10              0.33
    cSAM Euclidean Score Comparison   1                      1800            15              0.01
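The per-comparison averages in Table 2 follow directly from dividing each run-time by the number of comparisons. The short check below reproduces that arithmetic from the table values; the dictionary layout and variable names are only for illustration.

```python
# Values copied from Table 2: (run-time in ms, number of comparisons).
rows = {
    "SAM Algorithm": (39_794, 477),
    "cSAM Cluster Comparison": (10, 30),
    "cSAM Euclidean Score Comparison": (15, 1_800),
}

avg_ms = {name: run_ms / n for name, (run_ms, n) in rows.items()}
for name, value in avg_ms.items():
    print(f"{name}: {value:.2f} ms per comparison")

# The one-time scoring cost, expressed in minutes.
score_minutes = 223_340 / 60_000
print(f"cSAM score computation: {score_minutes:.1f} minutes")
```

The scoring cost is paid once per library, while the per-comparison cost is paid for every new unknown signature.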

    Table 3. SAM Direct Comparison Results (Scores in Degrees) and cSAM Euclidean Distances

    Signature Name   Spectral Angle   cSAM Distance
    14148.183        1.333            1.063
    12024.69         1.656            0.695
    12023.139        2.651            1.901
    14149.18         2.670            2.032
    12070.405        2.760            2.094
    12030.135        3.041            2.046
    61241.98         3.369            3.017
    64801.34         4.159            3.642
    68501.609        4.561            4.303
    10084.1939       4.643            4.810
    62231.15         4.776            4.329
    60051.19         5.550            5.080
    14141.146        7.066            5.637
    67941.72         8.133            7.648
    67701.36         8.356            7.972
    61221.79         8.855            7.852
    Sea water        24.929           8.786

    4. FINDINGS AND ANALYSIS

    The results were evaluated by comparing the results of the traditional SAM approach to the score values produced by cSAM, as well as the cluster classifications produced by k-means. The SAM algorithm was only able to compare the unknown signature with those that were in precisely the same domain, which coincided with data produced by the same sensor, and thus largely correlated with the most probable association: as shown in the second column of Table 3 and Figure 6, lunar dust signatures all scored below 12.9 degrees, and the one non-lunar signature compared, sea water, scored 23.003 degrees. For the same signatures, cSAM produced Euclidean distances as shown in the third column of Table 3. The full spread of Euclidean distances to the unknown signature is shown in Figure 7, which illustrates the spatial distribution of spectral scores with respect to the unknown. All of the lunar dust signatures, which we consider true matches, scored within the closest 11% of spectra within the database.
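One way to quantify how closely the cSAM distances track the SAM angles is to compute correlation coefficients over the lunar dust rows of Table 3. The sketch below does this with NumPy; the choice of Pearson and rank (Spearman-style) correlation is an assumption made for illustration, not the evaluation method used by the authors.

```python
import numpy as np

# Lunar dust rows of Table 3: SAM spectral angle (degrees) and cSAM distance.
sam = np.array([1.333, 1.656, 2.651, 2.670, 2.760, 3.041, 3.369, 4.159,
                4.561, 4.643, 4.776, 5.550, 7.066, 8.133, 8.356, 8.855])
csam = np.array([1.063, 0.695, 1.901, 2.032, 2.094, 2.046, 3.017, 3.642,
                 4.303, 4.810, 4.329, 5.080, 5.637, 7.648, 7.972, 7.852])

def ranks(values):
    """Rank positions (0 = smallest); valid here because there are no ties."""
    order = np.argsort(values)
    result = np.empty(len(values))
    result[order] = np.arange(len(values))
    return result

pearson_r = float(np.corrcoef(sam, csam)[0, 1])
spearman_rho = float(np.corrcoef(ranks(sam), ranks(csam))[0, 1])
print(f"Pearson r = {pearson_r:.3f}, rank correlation = {spearman_rho:.3f}")
```

A high rank correlation matters most for a first-pass filter, since the filter only needs to preserve the ordering of candidates, not the exact distances.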

    The k-means clustering placed the unknown signature in Cluster #3. The other lunar dust signatures were placed in Clusters 3 (7 signatures), 11 (8 signatures) and 27 (1 signature). The sea water signature was also placed in Cluster 27.
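Cluster assignment of a new signature amounts to finding the nearest centroid in score space. The sketch below uses three of the centroids listed in Table 4; the unknown's score vector and the function name `nearest_cluster` are hypothetical, since the unknown's actual scores are not given in the text.

```python
import numpy as np

# Three of the thirty centroids from Table 4
# (10-nm, 1-um, 100-um cosine scores, in degrees).
centroids = {
    3: (89.874, 124.72, 33.820),
    11: (90.630, 107.24, 37.788),
    27: (90.121, 124.53, 13.485),
}

def nearest_cluster(scores, centroids):
    """Return (cluster label, Euclidean distance) of the closest centroid."""
    labels = list(centroids)
    points = np.array([centroids[k] for k in labels], dtype=float)
    distances = np.linalg.norm(points - np.asarray(scores, dtype=float), axis=1)
    best = int(np.argmin(distances))
    return labels[best], float(distances[best])

# Hypothetical score vector for an unknown signature.
label, dist = nearest_cluster((89.9, 123.0, 31.0), centroids)
print(label, round(dist, 2))
```

Note that Clusters 3 and 27 have nearly identical centroids on the first two axes and differ mainly on the third, so a small change in one score can move a signature between them.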

    Table 4. Cluster Centroid Locations (Scores in Degrees)

    Cluster   10-nm    1-µm      100-µm   Signatures
    0         90.443   90.380    16.014   61
    1         89.401   123.94    42.293   62
    2         90.636   93.475    38.517   57
    3         89.874   124.72    33.820   70
    4         90.451   80.886    7.3951   59
    5         89.293   119.21    48.691   44
    6         90.587   116.74    22.568   40
    7         90.633   114.72    29.288   68
    8         90.104   126.14    20.840   89
    9         90.366   118.52    37.312   43
    10        90.575   106.13    27.278   53
    11        90.630   107.24    37.788   72
    12        90.175   131.17    24.880   122
    13        90.522   86.036    9.4432   49
    14        90.289   125.50    27.491   61
    15        90.282   119.15    9.6618   89
    16        90.859   75.508    58.144   13
    17        90.380   117.53    14.352   72
    18        89.745   140.77    30.669   48
    19        90.624   97.332    24.690   64
    20        90.476   74.903    7.8526   77
    21        90.355   68.062    15.130   48
    22        89.594   122.00    1.4625   39
    23        89.747   134.34    29.652   78
    24        90.483   57.068    30.304   47
    25        89.429   130.32    36.741   56
    26        90.591   78.987    2.5388   120
    27        90.121   124.53    13.485   38
    28        90.413   76.509    24.125   18
    29        90.546   105.788   19.860   43


    Figure 7. Sorted Euclidean Distances of 1,799 Database Spectral Scores to the Unknown Signature (sorted ascending from left)

    Based on these results and run-time comparisons, we find:

    1. Scoring against characteristic functions via the cSAM algorithm generally approximates the spectral similarity between signatures, as appropriate for a first-pass filter.

    2. Though the cSAM method requires a greater up-front time investment to perform characteristic computations, the time to compare a newly captured signature against a large library is reduced from a linear-scale operation to near-constant time.

    3. The unsupervised k-means clustering algorithm does not effectively partition the cSAM score-space into usable classifications. However, the use of semi-supervised approaches (using existing classification information stored in the spectral library), better heuristic selection of the number of likely classes k, and informed selection of the cluster-initialization centroids are all likely to dramatically improve classification accuracy.
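The informed-initialization idea in the third finding could take the form of a seeded variant of k-means, in which the initial centroids are the means of signatures that already carry labels. The sketch below is one possible reading under stated assumptions, not a method evaluated in this paper: it assumes one cluster per labeled class, uses -1 for unlabeled points, and the function name and demo data are hypothetical.

```python
import numpy as np

def seeded_kmeans(points, labels, n_iter=50):
    """Lloyd's algorithm with centroids initialized from labeled class means.

    labels[i] is a class index for labeled points and -1 for unlabeled ones.
    Assumes one cluster per labeled class (an illustrative simplification).
    """
    points = np.asarray(points, dtype=float)
    labels = np.asarray(labels)
    classes = np.unique(labels[labels >= 0])
    centroids = np.array([points[labels == c].mean(axis=0) for c in classes])
    for _ in range(n_iter):
        # Distance of every point to every centroid, then nearest assignment.
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        assign = dists.argmin(axis=1)
        # Recompute centroids; keep the old one if a cluster becomes empty.
        updated = np.array([
            points[assign == j].mean(axis=0) if np.any(assign == j) else centroids[j]
            for j in range(len(classes))
        ])
        if np.allclose(updated, centroids):
            break
        centroids = updated
    return classes[assign], centroids

# Hypothetical score-space data: two well-separated groups, one label each.
rng = np.random.default_rng(0)
group_a = rng.normal([90.0, 124.0, 33.0], 1.0, size=(20, 3))
group_b = rng.normal([90.0, 80.0, 8.0], 1.0, size=(20, 3))
pts = np.vstack([group_a, group_b])
lab = np.full(40, -1)
lab[0], lab[20] = 0, 1
assigned, cents = seeded_kmeans(pts, lab)
```

Seeding from labeled data also fixes the number of clusters to the number of known classes, which addresses the selection of k at the same time.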

    The importance of careful selection of characteristic functions was clearly illustrated by the 10-nm cosine function's inability to discriminate amongst the library spectra. Intuitively, a 10-nm-scale curvature is negligible when compared against spectra on the micron scale; therefore, the range of spectral resolutions of the library spectra is a significant factor in the efficacy of the functions.

    5. CONCLUSION AND FUTURE WORK

    Characteristic spectral angle mapping is a potentially powerful approach to reducing the run-time cost of autonomous spectral classification and identification against large signature data sets. By converting the spectral classification problem into a spatial problem, cSAM enables the application of many existing well-developed classification approaches. Our preliminary results indicate a good correlation between the chosen characteristic functions' spatial scores and brute-force SAM comparison values, while offering a significant decrease in the time to identify potential matches. This could vastly improve the performance and accuracy of existing spectral systems in use by scientific, defense and emergency response stakeholders.

    Two primary areas have been identified for further investigation. Characteristic functions appropriate for use with libraries consisting of different spectral bands and spectral resolutions should be considered and evaluated, based on the desirable properties and empirical tests described above. Automated polynomial-based approaches may also allow the characteristic functions to be generated dynamically based on the actual content of the spectral library. Other methods besides k-means should also be considered and evaluated. This includes fuzzy k-means, which would reduce the partitioning of dense areas of the score space; semi-supervised approaches, which can take advantage of the copious label data within the library; and dynamic determination of the number of classes/clusters, based on the known material content of the library.
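To illustrate why fuzzy k-means softens hard partition boundaries, the sketch below computes only the membership step of the standard fuzzy c-means formulation for fixed centroids. It is not a full clustering implementation; the fuzzifier m = 2, the function name, and the test point are assumptions, and the two centroids are those of Clusters 3 and 27 from Table 4.

```python
import numpy as np

def fuzzy_memberships(points, centroids, m=2.0, eps=1e-12):
    """Fuzzy c-means membership of each point in each cluster (rows sum to 1)."""
    points = np.asarray(points, dtype=float)
    centroids = np.asarray(centroids, dtype=float)
    d = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
    d = np.maximum(d, eps)  # avoid division by zero when a point sits on a centroid
    inv = d ** (-2.0 / (m - 1.0))
    return inv / inv.sum(axis=1, keepdims=True)

# Centroids of Clusters 3 and 27 from Table 4 (scores in degrees).
c3 = np.array([89.874, 124.72, 33.820])
c27 = np.array([90.121, 124.53, 13.485])

# A hypothetical signature halfway between the two centroids, plus one near c3.
pts = np.array([(c3 + c27) / 2.0, c3 + np.array([0.0, 0.1, -0.5])])
u = fuzzy_memberships(pts, np.array([c3, c27]))
print(np.round(u, 3))
```

A signature lying between two centroids receives a split membership rather than being forced into one cluster, which is the behavior that would help in dense regions of the score space.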

    ACKNOWLEDGMENT

    The authors would like to thank the Integrated Signatures Program for their support, Thomas Spisz (JHU/APL) for information on the Spectral Angle Mapper algorithm, and Edward Birrane (JHU/APL) and Jason Oxenrider (JHU/APL) for editing and review.

    REFERENCES

    [1] J. Li, D. B. Hibbert, S. Fuller, and G. Vaughn, "A comparative study of point-to-point algorithms for matching spectra," Chemometrics and Intelligent Laboratory Systems, vol. 82, no. 1-2, pp. 50-58, May 2006.

    [2] A. M. Baldridge, S. J. Hook, C. I. Grove, and G. Rivera, "The ASTER spectral library version 2.0," Remote Sensing of Environment, vol. 113, pp. 711-715, 2009.

    [3] R. O. Duda, P. E. Hart, and D. G. Stork, Pattern Classification, 2nd ed. John Wiley and Sons, 2001.

    [4] Y. Sohn and N. S. Rebello, "Supervised and unsupervised spectral angle classifiers," Photogrammetric Engineering and Remote Sensing, vol. 68, no. 12, pp. 1271-1280, December 2002.

    [5] F. A. Kruse, J. W. Boardman, and J. F. Huntington, "Comparison of airborne hyperspectral data and EO-1 Hyperion for mineral mapping," IEEE Transactions on Geoscience and Remote Sensing, vol. 41, no. 6, pp. 1388-1400, June 2003.

    [6] C. Salvaggio, L. E. Smith, and E. J. Antoine, "Spectral signature databases and their application/misapplication to modeling and exploitation of multispectral/hyperspectral data," in Algorithms and Technologies for Multispectral, Hyperspectral, and Ultraspectral Imagery XI, S. S. Shen and P. E. Lewis, Eds., vol. 5806. SPIE, 2005.

    [7] K. Fukunaga, Introduction to Statistical Pattern Recognition, 2nd ed., W. Rheinboldt, Ed. New York: Academic Press, October 1990.

    [8] M. Ester, H.-P. Kriegel, J. Sander, and X. Xu, "A density-based algorithm for discovering clusters in large spatial databases with noise," in Proceedings of 2nd International Conference on Knowledge Discovery and Data Mining, E. Simoudis, J. Han, and U. Fayyad, Eds., American Association for Artificial Intelligence. Menlo Park, California: The AAAI Press, 1996, pp. 226-231.

    [9] P. E. Dennison, K. Q. Halligan, and D. A. Roberts, "A comparison of error metrics and constraints for multiple endmember spectral mixture analysis and spectral angle mapper," Remote Sensing of Environment, vol. 93, no. 3, pp. 359-367, November 2004.

    [10] K. V. Mardia, J. T. Kent, and J. M. Bibby, Multivariate Analysis. London: Academic Press, 1979, pp. 360-384.

    [11] F. Robinson, A. Apon, D. Brewer, L. Dowdy, D. Hoffman, and B. Lu, "Initial starting point analysis for k-means clustering: a case study," in Proceedings of ALAR 2006 Conference on Applied Research in Information Technology, 2006.

    [12] (2008, December) ASTER spectral library. [Online]. Available: http://speclib.jpl.nasa.gov/

    BIOGRAPHY

    Vignesh Ramachandran received a B.S. in Computer Science from the Georgia Institute of Technology in 2007 and an M.S. in Aerospace Engineering from the University of Maryland, College Park in 2013. He has worked at the Johns Hopkins University Applied Physics Laboratory since 2008 as a Ground Software Engineer, designing command, telemetry, data processing and network engineering solutions for NASA missions (such as the Van Allen Probes and MESSENGER spacecraft) as well as a variety of other civil and defense applications. Mr. Ramachandran currently serves as the Vice-Chair of the American Institute of Aeronautics and Astronautics (AIAA) Mid-Atlantic Section, and has twice served as the General Conference Chair of the AIAA Young Professionals, Students and Education Conference (YPSE).

    Herbert Mitchell received a B.S. in Chemistry from Washington and Lee University and an M.S. in Analytical Chemistry from the University of Virginia. He entered the U.S. Navy after graduation and served as a scientist dealing with the effects of nuclear weapons on humans and on the chemistry of the atmosphere. In his government roles and afterwards as a contractor supporting the defense department and other agencies, he authored several reports, worked on several special projects, and served on several committees investigating scientific phenomena. He has a record of leading them to successful conclusions. Often these projects were of interest to high levels of government. His interests have generally been to develop novel ways to use wide ranges of sensors to better acquire data of interest to the Defense Department. For the last decade he has been working for the Physics Department of the Naval Postgraduate School and has been working at several agencies in the Washington, DC area, most recently at the Joint IED Defeat agency (JIEDDO).

    Samantha Jacobs received a B.S. in Physics from Georgia Southern University in 2012. In 2013 she joined the Johns Hopkins University Applied Physics Laboratory as an associate Ground Software Engineer in the Space Department. Her work in the Space Department includes automated testing, network engineering solutions, and data processing.


    Nigel Tzeng received a B.S. in Computer Science and an M.S. in Software Engineering from the University of Maryland, College Park. Mr. Tzeng has over 20 years of experience in spacecraft ground systems, command and control (C2) systems, data visualization and software engineering. He joined the Johns Hopkins University Applied Physics Laboratory (JHUAPL) in 2003 and is currently a senior member of the Space Department technical staff. Mr. Tzeng leads the development of signature and geospatial analysis/exploitation software systems and served as the Group Chief Scientist for the C2 Systems Engineering Group from 2007-2009, as well as serving as the Principal Investigator of several C2 research initiatives. His primary areas of research are command and control, geospatial visualization and collaboration. Prior to joining JHUAPL, Mr. Tzeng worked in telecommunications, e-commerce, advanced traffic management systems, spacecraft simulation (Landsat, SOHO), spacecraft command and control (SAMPEX, TRMM, FUSE), and science data processing/visualization (COBE). He was the lead software architect and designer of the City of Louisville Advanced Traffic Management System and developer of the DIRBE, FIRAS and DMR sky map visualization software on COBE.

    Alexer Firpi received a B.S. in electrical engineering from Polytechnic University (San Juan, Puerto Rico), an M.S. in electrical engineering from the University of Puerto Rico (Mayaguez, Puerto Rico), and a Ph.D. in electrical engineering from Michigan State University (East Lansing, MI). After concluding his doctoral studies, Dr. Firpi did post-doctoral work at different institutions in diverse research areas such as intelligent control, biomedical engineering, imaging genetics, and bioinformatics. He is currently a senior staff member at Johns Hopkins University - Applied Physics Lab. Dr. Firpi's research focuses on machine learning, brain-computer interfaces, computational intelligence, and any other research problem that can be automated using machine-learning approaches. He is the author of more than 20 peer-reviewed publications and two book chapters.

    Benjamin Rodriguez received a Bachelor of Science (B.S.) and Master of Science (M.S.) in Electrical Engineering from the University of Texas, and received a Doctor of Philosophy (Ph.D.) in Electrical and Computer Engineering from the Air Force Institute of Technology, Graduate School of Engineering and Management, Electrical and Computer Engineering Department, Wright-Patterson Air Force Base, OH. He is the Section Supervisor for Space Systems and Architectures in the Space Department with The Johns Hopkins University Applied Physics Laboratory. He is also an instructor at The Johns Hopkins University, Whiting School of Engineering for the Department of Electrical and Computer Engineering as well as the Department of Computer Science.
