SAS Programming for Data Mining



  • 4/28/15, 7:02 PM SAS Programming for Data Mining: AUC calculation using Wilcoxon Rank Sum Test

    http://www.sas-programming.com/2009/10/auc-calculation-using-wilcoxon-rank-sum.html

    © Copyright 2006-2014 / SAS® and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of SAS Institute Inc. in the USA and other countries. ® indicates USA registration. Other brand and product names are trademarks of their respective companies.

    SAS Programming for Data Mining

    Friday, October 23, 2009

    AUC calculation using Wilcoxon Rank Sum Test

    Accurately calculate AUC (Area Under the Curve) in SAS for data rank-ordered by a binary classifier

    To calculate the AUC for a SAS data set that is already rank ordered by a binary classifier (such as linear logistic regression), where we have the binary outcome Y and the rank-order measurement P_0 or P_1 (for class 0 and 1 respectively), we can use PROC NPAR1WAY to obtain the Wilcoxon rank-sum statistics and, from there, an accurate measurement of the AUC for the data.

    The relationship between the AUC and the Wilcoxon rank-sum statistics is: AUC = (W-W0)/(N1*N0)+0.5, where N1 and N0 are the frequencies of class 1 and class 0, W is the Wilcoxon rank sum of class 1 (with the score oriented so that higher values favor class 1), and W0 is its expected sum of ranks under H0 (random ordering).
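The formula follows from the Mann-Whitney U statistic. The short derivation below is a sketch under stated assumptions: W is the rank sum of class 1 over the pooled sample (midranks for ties), larger scores favor class 1, and U_1 is a symbol introduced here that does not appear in the post.

```latex
\begin{align*}
U_1 &= W - \frac{N_1 (N_1 + 1)}{2}, \qquad \mathrm{AUC} = \frac{U_1}{N_1 N_0}, \\
W_0 &= \frac{N_1 (N_1 + N_0 + 1)}{2}, \\
W - W_0 &= U_1 - \frac{N_1 N_0}{2}, \\
\frac{W - W_0}{N_1 N_0} + \frac{1}{2} &= \frac{U_1}{N_1 N_0} = \mathrm{AUC}.
\end{align*}
```

It also gives Gini = 2*(AUC-0.5) = 2*(W-W0)/(N1*N0). If W is taken from the other class, or the score is oriented the other way, W-W0 changes sign, so taking its absolute value returns an AUC of at least 0.5.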

    In one application example shown below, PROC LOGISTIC reports c=0.911960, while this method calculates it as AUC=0.9119491555.

    %macro AUC(dsn, Target, score);
        /* Suppress printed output and capture the Wilcoxon scores table. */
        ods select none;
        ods output WilcoxonScores=WilcoxonScore;
        proc npar1way wilcoxon data=&dsn;
            where &Target^=.;
            class &Target;
            var &score;
        run;
        ods select all;

        data AUC;
            set WilcoxonScore end=eof;
            retain v1 v2 1;
            /* v1 = |W0 - W| taken from the first class row. abs() makes the */
            /* result independent of class order, so AUC is always >= 0.5.   */
            if _n_=1 then v1=abs(ExpectedSum - SumOfScores);
            /* v2 accumulates the product of the class sizes, N1*N0. */
            v2=N*v2;
            if eof then do;
                d=v1/v2;
                Gini=d * 2;
                AUC=d+0.5;
                put AUC= GINI=;
                keep AUC Gini;
                output;
            end;
        run;
    %mend;

    data test;
        do i = 1 to 10000;
            x = ranuni(1);
            y=(x + rannor(2315)*0.2 > 0.35);
            output;
        end;
    run;

    ods select none;
    ods output Association=Asso;
    proc logistic data = test desc;
        model y = x;
        score data = test out = predicted;
    run;
    ods select all;

    /* Print the c statistic reported by PROC LOGISTIC. */
    data _null_;
        set Asso;
        if Label2='c' then put 'c-stat=' nValue2;
    run;

    /* Compute the AUC from the scored data with the macro. */
    %AUC( predicted, y, p_0);

    NPAR1WAY gets AUC = 0.91766634744;
    LOGISTIC reports c-statistic = 0.917659
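The rank-sum figure can also be sanity-checked outside SAS. The following is a minimal Python sketch, not the post's code: it applies AUC = (W-W0)/(N1*N0)+0.5 with midranks and compares the result with brute-force pair counting. The function names, the sample size and the random seed are illustrative assumptions, and the simulated data only imitates the post's test set, so the value will not match the SAS output.

```python
import random


def midranks(values):
    """Return 1-based ranks, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole run of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def auc_rank_sum(y, score):
    """AUC = (W - W0) / (N1 * N0) + 0.5, with W the rank sum of class 1."""
    ranks = midranks(score)
    n1 = sum(y)
    n0 = len(y) - n1
    w = sum(r for r, t in zip(ranks, y) if t == 1)
    w0 = n1 * (n1 + n0 + 1) / 2  # expected rank sum of class 1 under H0
    return (w - w0) / (n1 * n0) + 0.5


def auc_pairwise(y, score):
    """Brute-force AUC: share of (1, 0) pairs ordered correctly, ties count 1/2."""
    pos = [s for s, t in zip(score, y) if t == 1]
    neg = [s for s, t in zip(score, y) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Simulated data in the spirit of the post's test set (smaller, different RNG).
rng = random.Random(1)
x = [round(rng.random(), 2) for _ in range(500)]  # rounding forces some ties
y = [int(xi + rng.gauss(0, 0.2) > 0.35) for xi in x]

print(auc_rank_sum(y, x), auc_pairwise(y, x))
```

Unlike the macro, this sketch keeps the sign of W-W0, so a classifier that ranks worse than random would show an AUC below 0.5.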

    So, which one is more accurate? I would say NPAR1WAY. The reason is that we can use yet another procedure, PROC FREQ, to verify the Gini value, which is 2*(AUC-0.5). The Gini index is called Somers' D in PROC FREQ. Here, the Gini value calculated from NPAR1WAY is 0.8353269487, the same as the Somers' D C|R reported by PROC FREQ (C|R because the outcome y is the row variable and the predictor x is the column variable, so only pairs with different outcomes are counted):

    proc freq data=test noprint;
        tables y*x / measures;
        output out=_measures measures;
    run;

    data _null_;
        set _measures;
        put _SMDCR_=;
    run;
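To see why Somers' D and 2*(AUC-0.5) agree, here is a small Python sketch under stated assumptions. It computes Somers' D as (concordant - discordant) divided by the number of pairs that differ on the outcome, which is the C|R form when y is the row variable. The function names and the six-row data set are hypothetical and only illustrate the identity.

```python
def somers_d_given_outcome(y, score):
    """Somers' D over pairs that differ on y: (concordant - discordant) / (N1 * N0)."""
    pos = [s for s, t in zip(score, y) if t == 1]
    neg = [s for s, t in zip(score, y) if t == 0]
    concordant = sum(p > n for p in pos for n in neg)
    discordant = sum(p < n for p in pos for n in neg)
    return (concordant - discordant) / (len(pos) * len(neg))


def auc_from_pairs(y, score):
    """AUC as the share of (1, 0) pairs ordered correctly, ties count 1/2."""
    pos = [s for s, t in zip(score, y) if t == 1]
    neg = [s for s, t in zip(score, y) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical toy data with one tied (1, 0) pair at score 0.4.
y_toy = [0, 0, 1, 0, 1, 1]
score_toy = [0.1, 0.4, 0.4, 0.8, 0.7, 0.7]

gini = 2 * (auc_from_pairs(y_toy, score_toy) - 0.5)
print(gini, somers_d_given_outcome(y_toy, score_toy))
```

Tied pairs add 1/2 to the AUC numerator and nothing to concordant minus discordant, which is why the two quantities match even when scores are tied.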

    Then why not just use PROC FREQ, since the coding is so simple? Well, the answer is really about SPEED! Check the log below for a data set with only 100,000 observations: 37.63 sec vs. 0.15 sec in real time:

    3546
    3547  data one;
    3548     call streaminit(98676876);
    3549     do id=1 to 1e5;
    3550        score=ranuni(0)*1000;
    3551        if score+rannor(0)>0 then y=1;
    3552        else y=0;
    3553        output;
    3554        drop id;
    3555     end;
    3556  run;

    NOTE: The data set WORK.ONE has 100000 observations and 2 variables.
    NOTE: DATA statement used (Total process time):
          real time 0.04 seconds
          cpu time 0.04 seconds

    3557
    3558  proc freq data=one noprint;
    3559     tables score*y/measures noprint;
    3560     output out=_freq_out measures;
    3561  run;

    NOTE: There were 100000 observations read from the data set WORK.ONE.
    NOTE: The data set WORK._FREQ_OUT has 1 observations and 27 variables.
    NOTE: PROCEDURE FREQ used (Total process time):
          real time 37.63 seconds
          cpu time 37.56 seconds

    3562
    3563  data _null_;
    3564     set _freq_out;
    3565     AUC=_smdrc_/2 + 0.5;
    3566     put "AUC = " AUC " SOMER'S D R|C = " _smdrc_;
    3567  run;

    AUC = 0.9995285252 SOMER'S D R|C = 0.9990570504
    NOTE: There were 1 observations read from the data set WORK._FREQ_OUT.
    NOTE: DATA statement used (Total process time):
          real time 0.00 seconds
          cpu time 0.00 seconds

    3568
    3569  %AUC(one, y, score);

    NOTE: The data set WORK.WILCOXONSCORE has 2 observations and 7 variables.


    NOTE: There were 100000 observations read from the data set WORK.ONE.
          WHERE y not = .;
    NOTE: PROCEDURE NPAR1WAY used (Total process time):
          real time 0.10 seconds
          cpu time 0.09 seconds

    AUC=0.9995285252 Gini=0.9990570504
    NOTE: There were 2 observations read from the data set WORK.WILCOXONSCORE.
    NOTE: The data set WORK.AUC has 1 observations and 2 variables.
    NOTE: DATA statement used (Total process time):
          real time 0.01 seconds
          cpu time 0.00 seconds

    Posted by Liang Xie at 10/23/2009 02:09:00 PM

    Labels: AUC, Macro Programming, predictive modeling, PROC NPAR1WAY


    6 comments:

    Charlie Shipp Family said...

    Looks great .!.

    Thanks for your work in sasCommunity.

    Charlie Shipp

    11:25 PM, February 27, 2010

    eskimokitty said...

    It is awesome. Thank you so much!!

    4:37 PM, June 08, 2011

    raspcompote said...

    Thank you for this useful post.

    3:47 PM, March 13, 2012

    Luis Gustavo said...

    I'd like to know where you found this relationship between AUC and the Wilcoxon Rank Sum Test. I'm trying to study more about it and it would really help!

    Thanks

    10:32 AM, February 06, 2014

    Liang Xie said...

    The relationship is well explained at the Wikipedia page below: http://en.wikipedia.org/wiki/Mann%E2%80%93Whitney_U

    7:37 PM, February 13, 2014

    Jon Dickens said...

    You are confusing the Gini with the accuracy ratio, but you are not alone; several people at SAS make the same mistake.

    If you are interested in discussing this issue, then contact me via LinkedIn.

    Jon Dickens

    3:11 PM, July 20, 2014


    Copyright (c) Liang Xie.