cpsc 340: machine learning and data miningfwood/cs340/lectures/l10.pdf · other clustering methods...

CPSC 340: Machine Learning and Data Mining

Outlier Detection
Fall 2020

Last Time: Hierarchical Clustering
• We discussed hierarchical clustering:
  – Performs clustering at multiple scales.
  – Output is usually a tree diagram (“dendrogram”).
  – Reveals much more structure in data.
  – Usually non-parametric:
    • At the finest scale, every point is its own cluster.
• We discussed some application areas:
  – Animals (phylogenetics).
  – Languages.
  – Stories.
  – Fashion.

http://www.nature.com/nature/journal/v438/n7069/fig_tab/nature04338_F10.html

Biclustering
• Biclustering:
  – Cluster the training examples and features.
  – Also gives feature relationship information.
• Simplest and most popular method:
  – Run clustering method on ‘X’ (examples).
  – Run clustering method on ‘X^T’ (features).
• Often plotted with ‘X’ as a heatmap.
  – Where rows/columns are arranged by clusters.
  – Helps you ‘see’ why things are clustered.
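The "simplest method" above can be sketched in a few lines of standard-library Python. This is only an illustration under stated assumptions: k-means is used as the clustering method (the slide does not fix one), the farthest-point initialisation is a choice made here to keep the example deterministic, and the matrix `X` and the helper names (`sq_dist`, `kmeans_labels`, `bicluster`) are hypothetical.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans_labels(points, k, iters=20):
    """Tiny k-means with farthest-point initialisation (an assumption of this sketch)."""
    means = [list(points[0])]
    while len(means) < k:
        farthest = max(points, key=lambda p: min(sq_dist(p, m) for m in means))
        means.append(list(farthest))
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: each point goes to its closest mean.
        labels = [min(range(k), key=lambda c: sq_dist(p, means[c])) for p in points]
        # Update step: recompute means, keeping the old mean if a cluster is empty.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                means[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def bicluster(X, k_rows, k_cols):
    """Cluster the rows of X (examples) and the rows of X^T (features)."""
    X_T = [list(col) for col in zip(*X)]
    return kmeans_labels(X, k_rows), kmeans_labels(X_T, k_cols)


# Hypothetical 4x4 data with two example groups and two feature groups.
X = [[5, 5, 0, 0],
     [5, 4, 0, 0],
     [0, 0, 5, 5],
     [0, 1, 5, 5]]
row_labels, col_labels = bicluster(X, k_rows=2, k_cols=2)
```

Sorting the rows by `row_labels` and the columns by `col_labels` before drawing the heatmap gives the block structure that the plots on these slides show.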

http://openi.nlm.nih.gov/detailedresult.php?img=2731891_gkp491f3&req=4

Biclustering
• Visualization: hierarchical biclustering + heatmap + dendrograms.
  – Popular in biology/medicine.

https://arxiv.org/pdf/1408.0856v1.pdf

Application: Medical data
• Hierarchical clustering is very common in medical data analysis.
  – Biclustering different samples of breast cancer:

http://members.cbio.mines-paristech.fr/~jvert/svn/bibli/local/Finetti2008Sixteen-kinase.pdf

Other Clustering Methods
• Mixture models:
  – Probabilistic clustering.
• Mean-shift clustering:
  – Finds local “modes” in density of points.
  – Alternative approach to vector quantization.
• Bayesian clustering:
  – A variant on ensemble methods.
  – Averages over models/clusterings, weighted by “prior” belief in the model/clustering.

Graph-Based Clustering
• Spectral clustering and graph-based clustering:
  – Clustering of data described by graphs.

https://griffsgraphs.wordpress.com/tag/clustering/
http://ascr-discovery.science.doe.gov/2013/09/sifting-genomes/
https://www.hackdiary.com/2012/04/05/extracting-a-social-graph-from-wikipedia-people-pages/

(pause)

Motivating Example: Finding Holes in Ozone Layer
• The huge Antarctic ozone hole was “discovered” in 1985.
• It had been in satellite data since 1976:
  – But it was flagged and filtered out by a quality-control algorithm.

https://en.wikipedia.org/wiki/Ozone_depletion

Outlier Detection
• Outlier detection:
  – Find observations that are “unusually different” from the others.
  – Also known as “anomaly detection”.
  – May want to remove outliers, or be interested in the outliers themselves (security).
• Some sources of outliers:
  – Measurement errors.
  – Data entry errors.
  – Contamination of data from different sources.
  – Rare events.

http://mathworld.wolfram.com/Outlier.html

Applications of Outlier Detection
• Data cleaning.
• Security and fault detection (network intrusion, DOS attacks).
• Fraud detection (credit cards, stocks, voting irregularities).
• Detecting natural disasters (underwater earthquakes).
• Astronomy (find new classes of stars/planets).
• Genetics (identifying individuals with new/ancient genes).

Classes of Methods for Outlier Detection
1. Model-based methods.
2. Graphical approaches.
3. Cluster-based methods.
4. Distance-based methods.
5. Supervised-learning methods.
• Warning: this is the topic with the most ambiguous “solutions”.

But first…
• Usually it’s good to do some basic sanity checking…
  – Would any values in the column cause a Python/Julia “type” error?
  – What is the range of numerical features?
  – What are the unique entries for a categorical feature?
  – Does it look like parts of the table are duplicated?
• These types of simple errors are VERY common in real data.

Egg    Milk   Fish   Wheat    Shellfish   Peanuts   Peanuts   Sick?
0      0.7    0      0.3      0           0         0         1
0.3    0.7    0      0.6      -1          3         3         1
0      0      0      “sick”   0           1         1         0
0.3    0.7    1.2    0        0.10        0         0         2
900    0      1.2    0.3      0.10        0         0         1
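The four sanity checks can be run mechanically on the table above. The following is a rough sketch rather than a complete data-validation routine; the function name `sanity_report` and the decision to treat "Sick?" as the categorical column are assumptions made for illustration.

```python
# The table from the slide, including its deliberate errors.
columns = ["Egg", "Milk", "Fish", "Wheat", "Shellfish", "Peanuts", "Peanuts", "Sick?"]
rows = [
    [0, 0.7, 0, 0.3, 0, 0, 0, 1],
    [0.3, 0.7, 0, 0.6, -1, 3, 3, 1],
    [0, 0, 0, "sick", 0, 1, 1, 0],
    [0.3, 0.7, 1.2, 0, 0.10, 0, 0, 2],
    [900, 0, 1.2, 0.3, 0.10, 0, 0, 1],
]


def sanity_report(columns, rows, categorical=("Sick?",)):
    """Collect type problems, numeric ranges, category values and duplicated columns."""
    report = {"non_numeric": [], "ranges": {}, "categories": {}, "duplicate_columns": []}
    cols = list(zip(*rows))
    for j, (name, values) in enumerate(zip(columns, cols)):
        # Entries that would raise a type error in numerical code.
        bad = [v for v in values if not isinstance(v, (int, float))]
        if bad:
            report["non_numeric"].append((j, name, bad))
        numeric = [v for v in values if isinstance(v, (int, float))]
        # Keyed by column index because column names can repeat.
        report["ranges"][j] = (min(numeric), max(numeric))
        if name in categorical:
            report["categories"][name] = sorted(set(numeric))
    # Pairs of columns with identical contents.
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):
            if cols[i] == cols[j]:
                report["duplicate_columns"].append((i, j))
    return report


report = sanity_report(columns, rows)
```

On this table the report surfaces the string in the Wheat column, the value 900 for Egg, the negative Shellfish entry, a third value for the binary "Sick?" label, and the duplicated Peanuts column.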

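Model-Based Outlier Detection
• Model-based outlier detection:
  1. Fit a probabilistic model.
  2. Outliers are examples with low probability.
• Example:
  – Assume data follows a normal distribution.
  – The z-score for 1D data is given by:
    $$z_i = \frac{x_i - \mu}{\sigma}, \quad \text{where } \mu = \frac{1}{n}\sum_{i=1}^{n} x_i \text{ and } \sigma = \sqrt{\frac{1}{n}\sum_{i=1}^{n} (x_i - \mu)^2}.$$
  – “Number of standard deviations away from the mean”.
  – Say “outlier” if |z| > 4, or some other threshold.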
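As a minimal sketch of the z-score rule for 1D data, the following uses only the standard library. The data values and the function names are hypothetical; the threshold of 4 is the one quoted on the slide.

```python
import statistics


def z_scores(xs):
    """Number of standard deviations each value lies from the mean."""
    mu = statistics.fmean(xs)
    sigma = statistics.pstdev(xs)  # population standard deviation; must be non-zero
    return [(x - mu) / sigma for x in xs]


def z_outliers(xs, threshold=4.0):
    """Indices whose absolute z-score exceeds the threshold."""
    return [i for i, z in enumerate(z_scores(xs)) if abs(z) > threshold]


# Hypothetical measurements: 30 values near 10 and one value at 100.
data = [10.0, 10.5, 9.5] * 10 + [100.0]
flagged = z_outliers(data)
```

With this data only the last value is flagged; the 30 ordinary values all have |z| well below 1.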


Problems with Z-Score
• Unfortunately, the mean and variance are sensitive to outliers.
  – Possible fixes: use quantiles, or sequentially remove the worst outlier.
• The z-score also assumes that data is “uni-modal”.
  – Data is concentrated around the mean.
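The sensitivity of the mean and variance can be seen on a small example in which two large values inflate the standard deviation enough to hide themselves. The quantile-based score below (distance from the median in units of the interquartile range) is one possible reading of "use quantiles", not necessarily the variant intended in the lecture, and reusing the threshold of 4 for it is an assumption.

```python
import statistics


def z_outliers(xs, threshold=4.0):
    """Indices flagged by the ordinary z-score."""
    mu, sigma = statistics.fmean(xs), statistics.pstdev(xs)
    return [i for i, x in enumerate(xs) if abs(x - mu) / sigma > threshold]


def quantile_outliers(xs, threshold=4.0):
    """Indices flagged by a median/interquartile-range score."""
    q1, median, q3 = statistics.quantiles(xs, n=4)
    iqr = q3 - q1  # must be non-zero
    return [i for i, x in enumerate(xs) if abs(x - median) / iqr > threshold]


# Hypothetical data: nine ordinary values and two extreme ones.
data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 1000, 1000]
by_z = z_outliers(data)
by_quantiles = quantile_outliers(data)
```

Here the two extreme values pull the mean to about 186 and the standard deviation to about 384, so their z-scores are only about 2.1 and the z-score rule flags nothing, while the quantile-based score flags both.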


Global vs. Local Outliers
• Is the red point an outlier?
• What if we add the blue points?
• Red point has the lowest z-score.
  – In the first case it was a “global” outlier.
  – In this second case it’s a “local” outlier:
    • Within normal data range, but far from other points.

• It’s hard to precisely define “outliers”.
  – Can we have outlier groups?
  – What about repeating patterns?

Graphical Outlier Detection
• Graphical approach to outlier detection:
  1. Look at a plot of the data.
  2. Human decides if data is an outlier.
• Examples:
  1. Box plot:
    • Visualization of quantiles/outliers.
    • Only 1 variable at a time.

http://bolt.mph.ufl.edu/6050-6052/unit-1/one-quantitative-variable-introduction/boxplot/



  2. Scatterplot:
    • Can detect complex patterns.
    • Only 2 variables at a time.




  3. Scatterplot array:
    • Look at all combinations of variables.
    • But laborious in high dimensions.
    • Still only 2 variables at a time.

https://randomcriticalanalysis.wordpress.com/2015/05/25/standardized-tests-correlations-within-and-between-california-public-schools/



  4. Scatterplot of 2-dimensional PCA:
    • ‘See’ high-dimensional structure.
    • But loses information and is sensitive to outliers.

http://scienceblogs.com/gnxp/2008/08/14/the-genetic-map-of-europe/

Cluster-Based Outlier Detection
• Detect outliers based on clustering:
  1. Cluster the data.
  2. Find points that don’t belong to clusters.
• Examples:
  1. K-means:
    • Find points that are far away from any mean.
    • Find clusters with a small number of points.



  2. Density-based clustering:
    • Outliers are points not assigned to a cluster.

http://www-users.cs.umn.edu/~kumar/dmbook/dmslides/chap10_anomaly_detection.pdf



  3. Hierarchical clustering:
    • Outliers take longer to join other groups.
    • Also good for outlier groups.

http://www.nature.com/nature/journal/v438/n7069/fig_tab/nature04338_F10.html

Distance-Based Outlier Detection
• Most outlier detection approaches are based on distances.
• Can we skip the model/plot/clustering and just measure distances?
  – How many points lie in a radius ‘epsilon’?
  – What is the distance to the kth nearest neighbour?
• UBC connection (first paper on this topic):

Global Distance-Based Outlier Detection: KNN
• KNN outlier detection:
  – For each point, compute the average distance to its KNN.
  – Choose points with biggest values (or values above a threshold) as outliers.
• “Outliers” are points that are far from their KNNs.
• Goldstein and Uchida [2016]:
  – Compared 19 methods on 10 datasets.
  – KNN best for finding “global” outliers.
  – “Local” outliers best found with local distance-based methods…
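The KNN outlier score described on this slide can be sketched as follows. It is a brute-force illustration (every pairwise distance is computed), Euclidean distance is assumed, and the five points are made up.

```python
import math


def knn_scores(points, k):
    """Average Euclidean distance from each point to its k nearest neighbours."""
    scores = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        scores.append(sum(dists[:k]) / k)
    return scores


# Hypothetical data: four corners of a unit square and one distant point.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10)]
scores = knn_scores(points, k=2)
most_outlying = scores.index(max(scores))
```

The four corners each get a score of 1, while the distant point gets a score of about 13, so it would be reported first under either the "biggest values" or the "above a threshold" rule.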

http://journals.plos.org/plosone/article?id=10.1371%2Fjournal.pone.0152173

Local Distance-Based Outlier Detection
• As with density-based clustering, problem with differing densities:
• Outlier o2 has similar density as elements of cluster C1.
• Basic idea behind local distance-based methods:
  – Outlier o2 is “relatively” far compared to its neighbours.

http://www.dbs.ifi.lmu.de/Publikationen/Papers/LOF.pdf

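Local Distance-Based Outlier Detection
• “Outlierness” ratio of example ‘i’:
  $$\text{outlierness}(x_i) = \frac{\text{average distance of } x_i \text{ to its KNNs}}{\text{average distance of the neighbours of } x_i \text{ to their KNNs}}$$
• If outlierness > 1, x_i is further away from neighbours than expected.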

http://www.dbs.ifi.lmu.de/Publikationen/Papers/LOF.pdf
https://en.wikipedia.org/wiki/Local_outlier_factor

Isolation Forests
• Recent method based on random trees is isolation forests.
  – Grow a tree where each stump uses a random feature and random split.
  – Stop when each example is “isolated” (each leaf has one example).
  – The “isolation score” is the depth before example gets isolated.
• Outliers should be isolated quickly, inliers should need lots of rules to isolate.
  – Repeat for different random trees, take average score.
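The isolation score can be illustrated without building whole trees: to get the depth of one example it is enough to follow that example down a random tree, discarding the points that fall on the other side of each split. The sketch below takes that shortcut, so it draws fresh random splits for every example instead of sharing trees across examples; this simplification, the function names and the data are all assumptions of the example.

```python
import random


def isolation_depth(x, data, rng):
    """Number of random splits needed before x (a member of data) is alone."""
    depth = 0
    while len(data) > 1:
        # Features on which the remaining points still differ.
        splittable = [j for j in range(len(x))
                      if min(p[j] for p in data) < max(p[j] for p in data)]
        if not splittable:
            break  # remaining points are identical; x cannot be isolated further
        j = rng.choice(splittable)
        split = rng.uniform(min(p[j] for p in data), max(p[j] for p in data))
        # Keep only the points that fall on the same side of the split as x.
        data = [p for p in data if (p[j] < split) == (x[j] < split)]
        depth += 1
    return depth


def isolation_score(x, data, n_trees=200, seed=0):
    """Average isolation depth of x over many random trees."""
    rng = random.Random(seed)
    return sum(isolation_depth(x, data, rng) for _ in range(n_trees)) / n_trees


# Hypothetical data: a 5x5 grid of inliers and one far-away point.
cluster = [(float(a), float(b)) for a in range(5) for b in range(5)]
outlier = (100.0, 100.0)
data = cluster + [outlier]
outlier_score = isolation_score(outlier, data)
inlier_score = isolation_score((2.0, 2.0), data)
```

A low average depth means the example was isolated quickly, so the far-away point ends up with a much smaller score than the point in the middle of the grid.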

https://cs.nju.edu.cn/zhouzh/zhouzh.files/publication/icdm08b.pdf

Problem with Unsupervised Outlier Detection
• Why wasn’t the hole in the ozone layer discovered for 9 years?
• Can be hard to decide when to report an outlier:
  – If you report too many non-outliers, users will turn you off.
  – Most antivirus programs do not use ML methods (see “base-rate fallacy”).

https://en.wikipedia.org/wiki/Ozone_depletion

Supervised Outlier Detection
• Final approach to outlier detection is to use supervised learning:
  • y_i = 1 if x_i is an outlier.
  • y_i = 0 if x_i is a regular point.
• We can use our methods for supervised learning:
  – We can find very complicated outlier patterns.
  – Classic credit card fraud detection methods used decision trees.
• But it needs supervision:
  – We need to know what outliers look like.
  – We may not detect new “types” of outliers.
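To make the supervised setup concrete, here is a small sketch in which a decision stump (a single threshold on a single feature) stands in for the decision trees mentioned above. The transactions, their two features (amount, hour of day) and the labels are invented for illustration.

```python
def fit_stump(X, y):
    """Return (feature, threshold, errors) for the rule 'outlier if x[feature] > threshold'."""
    best = None
    for j in range(len(X[0])):
        for t in sorted({x[j] for x in X}):
            errors = sum((x[j] > t) != bool(label) for x, label in zip(X, y))
            if best is None or errors < best[2]:
                best = (j, t, errors)
    return best


def predict_outlier(stump, x):
    """Apply a fitted stump to one example."""
    feature, threshold, _ = stump
    return x[feature] > threshold


# Hypothetical labelled transactions: (amount, hour of day); y = 1 marks an outlier.
X = [(20, 14), (35, 9), (15, 20), (50, 11), (900, 3), (1200, 2)]
y = [0, 0, 0, 0, 1, 1]
stump = fit_stump(X, y)
```

The fitted rule flags large amounts, so `predict_outlier(stump, (700, 12))` is `True`. A small purchase at 3am, `(10, 3)`, is not flagged, because that kind of outlier never appeared in the labels; this is the "new types of outliers" limitation in the last bullet.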

(pause)

Motivation: Product Recommendation
• A customer comes to your website looking to buy an item:
• You want to find similar items that they might also buy:

User-Product Matrix

Amazon Product Recommendation
• Amazon product recommendation method:
• Return the KNNs across columns.
  – Find ‘j’ values minimizing ||x_i – x_j||.
  – Products that were bought by similar sets of users.
• But first divide each column by its norm, x_i/||x_i||.
  – This is called normalization.
  – Reflects whether product is bought by many people or few people.
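The two steps on this slide (normalize each product column, then return its nearest columns) can be sketched as below. The user-product matrix is a made-up example with 5 users and 4 products, and the function names are hypothetical.

```python
import math


def normalize(col):
    """Divide a column by its Euclidean norm (left unchanged if the norm is zero)."""
    norm = math.sqrt(sum(v * v for v in col))
    return [v / norm for v in col] if norm > 0 else list(col)


def similar_products(X, i, k):
    """Indices of the k product columns closest to column i after normalization."""
    cols = [normalize(col) for col in zip(*X)]
    others = [j for j in range(len(cols)) if j != i]
    return sorted(others, key=lambda j: math.dist(cols[i], cols[j]))[:k]


# Hypothetical user-product matrix: rows are users, columns are products,
# and a 1 means the user bought the product.
X = [[1, 1, 0, 0],
     [1, 1, 0, 1],
     [1, 0, 0, 0],
     [0, 0, 1, 1],
     [0, 0, 1, 0]]
recommended = similar_products(X, i=0, k=2)
```

For product 0 the closest column is product 1, which shares two of its three buyers, followed by product 3, which shares one; product 2 shares none and comes last.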

End of Part 2: Key Concepts
• We focused on 3 unsupervised learning tasks:
  – Clustering.
    • Partitioning (k-means) vs. density-based.
    • “Flat” vs. hierarchical (agglomerative).
    • Vector quantization.
    • Label switching.
  – Outlier Detection.
    • Surveyed common approaches (and said that problem is ill-defined).

Summary
• Biclustering: clustering of the examples and the features.
• Outlier detection is the task of finding unusually different examples.
  – A concept that is very difficult to define.
  – Model-based methods find unlikely examples given a model of the data.
  – Graphical methods plot data and use a human to find outliers.
  – Cluster-based methods check whether examples belong to clusters.
  – Distance-based outlier detection: measure (relative) distance to neighbours.
  – Supervised-learning for outlier detection: turns task into supervised learning.
• Next time: supervised learning with continuous labels.

Application: Medical data
• Hierarchical clustering is very common in medical data analysis.
  – Clustering different samples of colorectal cancer:
  – This plot is different, it’s not a biclustering:
    • The matrix is ‘n’ by ‘n’.
    • Each matrix element gives correlation.
    • Clusters should look like “blocks” on diagonal.
    • Order of examples is reversed in columns.
      – This is why diagonal goes from bottom-to-top.
      – Please don’t do this reversal, it’s confusing to me.

https://gut.bmj.com/content/gutjnl/66/4/633.full.pdf

“Quality Control”: Outlier Detection in Time-Series
• A field primarily focusing on outlier detection is quality control.
• One of the main tools is plotting z-score thresholds over time:
• Usually don’t do tests like “|z_i| > 3”, since this happens normally.
• Instead, identify problems with tests like “|z_i| > 2 twice in a row”.
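A rule of the form "|z_i| > 2 twice in a row" is simple to express in code. The sketch below follows the wording on the slide literally (it looks at |z| only and does not require both values to lie on the same side of the mean); the sequence of z-scores is invented.

```python
def flag_runs(z, threshold=2.0, run=2):
    """Indices at which |z| has exceeded the threshold for `run` consecutive points."""
    flags, streak = [], 0
    for i, zi in enumerate(z):
        streak = streak + 1 if abs(zi) > threshold else 0
        if streak >= run:
            flags.append(i)
    return flags


# Hypothetical z-scores of a measurement tracked over time.
z = [0.1, 3.2, -0.5, 2.1, 2.4, 0.3, -2.2, -2.6, -2.1]
flags = flag_runs(z)
```

The isolated spike of 3.2 is ignored, while the two runs of consecutive large values are reported at the point where the second exceedance occurs and for as long as the run continues.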

https://en.wikipedia.org/wiki/Laboratory_quality_control

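Outlierness (Symbol Definition)
• Let N_k(x_i) be the k-nearest neighbours of x_i.
• Let D_k(x_i) be the average distance to the k-nearest neighbours:
  $$D_k(x_i) = \frac{1}{k} \sum_{j \in N_k(x_i)} \|x_i - x_j\|$$
• Outlierness is the ratio of D_k(x_i) to the average D_k(x_j) for its neighbours ‘j’:
  $$O_k(x_i) = \frac{D_k(x_i)}{\frac{1}{k} \sum_{j \in N_k(x_i)} D_k(x_j)}$$
• If outlierness > 1, x_i is further away from neighbours than expected.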
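These definitions translate directly into code. The following brute-force sketch assumes Euclidean distance; the function names and the small dataset (a dense cluster, a sparse cluster, and one point sitting just outside the dense cluster) are made up for illustration.

```python
import math


def knn_indices(points, i, k):
    """Indices of the k nearest neighbours of points[i] (the set N_k)."""
    others = [j for j in range(len(points)) if j != i]
    return sorted(others, key=lambda j: math.dist(points[i], points[j]))[:k]


def avg_knn_distance(points, i, k):
    """Average distance from points[i] to its k nearest neighbours (D_k)."""
    return sum(math.dist(points[i], points[j]) for j in knn_indices(points, i, k)) / k


def outlierness(points, i, k):
    """Ratio of D_k for points[i] to the average D_k of its neighbours."""
    neighbours = knn_indices(points, i, k)
    expected = sum(avg_knn_distance(points, j, k) for j in neighbours) / k
    return avg_knn_distance(points, i, k) / expected


points = [(0, 0), (0, 0.1), (0.1, 0), (0.1, 0.1),   # dense cluster
          (1, 1),                                   # local outlier near the dense cluster
          (10, 10), (10, 13), (13, 10), (13, 13)]   # sparse cluster
local_outlier = outlierness(points, 4, k=2)
sparse_member = outlierness(points, 5, k=2)
```

The point at (1, 1) has a smaller average KNN distance (about 1.3) than the members of the sparse cluster (3.0), so the global KNN score would rank it as less unusual than them. Its outlierness ratio is about 13, whereas a member of the sparse cluster has ratio 1.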

Outlierness with Close Clusters
• If clusters are close, outlierness gives unintuitive results:
• In this example, ‘p’ has higher outlierness than ‘q’ and ‘r’:
  – The green points are not part of the KNN list of ‘p’ for small ‘k’.

http://www.comp.nus.edu.sg/~atung/publication/pakdd06_outlier.pdf

Outlierness with Close Clusters
• ‘Influenced outlierness’ (INFLO) ratio:
  – Include in denominator the ‘reverse’ k-nearest neighbours:
    • Points that have ‘p’ in KNN list.
  – Adds ‘s’ and ‘t’ from bigger cluster that includes ‘p’:
• But still has problems:
  – Dealing with hierarchical clusters.
  – Yields many false positives if you have “global” outliers.
  – Goldstein and Uchida [2016] recommend just using KNN.

http://www.comp.nus.edu.sg/~atung/publication/pakdd06_outlier.pdf

Training/Validation/Testing (Supervised)
• A typical supervised learning setup:
  – Train parameters on dataset D1.
  – Validate hyper-parameters on dataset D2.
  – Test error evaluated on dataset D3.
• What should we choose for D1, D2, and D3?
• Usual answer: should all be IID samples from data distribution Ds.

Training/Validation/Testing (Outlier Detection)
• A typical outlier detection setup:
  – Train parameters on dataset D1 (there may be no “training” to do).
    • For example, find z-scores.
  – Validate hyper-parameters on dataset D2 (for outlier detection).
    • For example, see which z-score threshold separates D1 and D2.
  – Test error evaluated on dataset D3 (for outlier detection).
    • For example, check whether z-score recognizes D3 as outliers.
• D1 will still be samples from Ds (data distribution).
• D2 could use IID samples from another distribution Dm.
  – Dm represents the “none” or “outlier” class.
  – Tune parameters so that Dm samples are outliers and Ds samples aren’t.
• Could just fit a binary classifier here.
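The validation step ("see which z-score threshold separates D1 and D2") can be sketched as a small search over candidate thresholds. The datasets, the candidate list and the function names below are hypothetical, and counting false alarms plus misses is just one reasonable way to score a threshold.

```python
import statistics


def fit(d1):
    """'Training': estimate the mean and standard deviation from D1."""
    return statistics.fmean(d1), statistics.pstdev(d1)


def abs_z(x, mu, sigma):
    """Absolute z-score of x under the fitted mean and standard deviation."""
    return abs(x - mu) / sigma


def choose_threshold(d1, d2, candidates):
    """'Validation': pick the threshold with the fewest false alarms on D1 plus misses on D2."""
    mu, sigma = fit(d1)

    def errors(t):
        false_alarms = sum(abs_z(x, mu, sigma) > t for x in d1)
        misses = sum(abs_z(x, mu, sigma) <= t for x in d2)
        return false_alarms + misses

    return min(candidates, key=errors)


d1 = [9.0, 10.0, 11.0, 10.5, 9.5, 10.0, 10.2, 9.8]  # hypothetical samples from Ds
d2 = [20.0, 25.0, 0.0, 30.0]                        # hypothetical samples from Dm
threshold = choose_threshold(d1, d2, candidates=[1, 2, 3, 4, 5])
```

With these numbers a threshold of 1 raises false alarms on D1, while every threshold from 2 upward separates the two sets perfectly; `min` returns the first of those. A test set D3 would then be scored with the same fitted mean, standard deviation and threshold.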

Training/Validation/Testing (Outlier Detection)
• A typical outlier detection setup:
  – Train parameters on dataset D1 (there may be no “training” to do).
    • For example, find z-scores.
  – Validate hyper-parameters on dataset D2 (for outlier detection).
    • For example, see which z-score threshold separates D1 and D2.
  – Test error evaluated on dataset D3 (for outlier detection).
    • For example, check whether z-score recognizes D3 as outliers.
• D1 will still be samples from Ds (data distribution).
• D2 could use IID samples from another distribution Dm.
• D3 could use IID samples from Dm.
  – How well do you do at recognizing “data” samples from “none” samples?

Training/Validation/Testing (Outlier Detection)
• Seems like a reasonable setup:
  – D1 will still be samples from Ds (data distribution).
  – D2 could use IID samples from another distribution Dm.
  – D3 could use IID samples from Dm.
• What can go wrong?
• You needed to pick a distribution Dm to represent “none”.
  – But in the wild, your outliers might follow another “none” distribution.
  – This procedure can overfit to your Dm.
• You can overestimate your ability to detect outliers.

OD-Test: a better way to evaluate outlier detections
• A reasonable setup:
  – D1 will still be samples from Ds (data distribution).
  – D2 could use IID samples from another distribution Dm.
  – D3 could use IID samples from yet-another distribution Dt, rather than from Dm.
• “How do you perform at detecting different types of outliers?”
  – Seems like a harder problem, but arguably closer to reality.

OD-Test: a better way to evaluate outlier detections
• “How do you perform at detecting different types of outliers?”

https://arxiv.org/pdf/1809.04729.pdf
