
http://www.sas-programming.com/2009/10/auc-calculation-using-wilcoxon-rank-sum.html


SAS Programming for Data Mining


Friday, October 23, 2009

AUC calculation using Wilcoxon Rank Sum Test

Accurately calculate AUC (Area Under the Curve) in SAS for data rank ordered by a binary classifier

To calculate AUC for a SAS data set that has already been rank ordered by a binary classifier (such as linear logistic regression), where we have the binary outcome Y and the rank-order measurement P_0 or P_1 (for class 0 and 1, respectively), we can use PROC NPAR1WAY to obtain the Wilcoxon Rank Sum statistics, and from these obtain an accurate measurement of AUC for the data.

The relationship between AUC and the Wilcoxon Rank Sum test statistics is: AUC = (W-W0)/(N1*N0)+0.5, where N1 and N0 are the frequencies of class 1 and class 0, W is the Wilcoxon Rank Sum, and W0 is the Expected Sum of Ranks under H0 (randomly ordered).
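The identity follows from the Mann-Whitney U statistic. A short derivation, assuming that W is the rank sum of class 1, that higher scores indicate class 1, and that tied scores receive midranks (if W is taken from the other class, or the score is P_0, the sign of W - W0 flips):

```latex
\begin{align*}
N &= N_1 + N_0, \qquad W_0 = \frac{N_1 (N + 1)}{2} \\
U &= W - \frac{N_1 (N_1 + 1)}{2}
   \quad \text{(class-1/class-0 pairs won by class 1, ties counted as } \tfrac{1}{2}\text{)} \\
\mathrm{AUC} &= \frac{U}{N_1 N_0} \\
W - W_0 &= U + \frac{N_1 (N_1 + 1)}{2} - \frac{N_1 (N_1 + N_0 + 1)}{2}
         = U - \frac{N_1 N_0}{2} \\
\Rightarrow \quad \mathrm{AUC} &= \frac{W - W_0}{N_1 N_0} + \frac{1}{2}
\end{align*}
```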

In one application example shown below, PROC LOGISTIC reports c=0.911960, while this method calculates it as AUC=0.9119491555.

%macro AUC(dsn, Target, score);
  /* Capture the Wilcoxon scores table and suppress printed output */
  ods select none;
  ods output WilcoxonScores=WilcoxonScore;
  proc npar1way wilcoxon data=&dsn;
    where &Target^=.;
    class &Target;
    var &score;
  run;
  ods select all;

  data AUC;
    set WilcoxonScore end=eof;
    retain v1 v2 1;
    /* v1 = |W0 - W|, taken from the first class row */
    if _n_=1 then v1=abs(ExpectedSum - SumOfScores);
    /* v2 accumulates N1*N0 across the two class rows */
    v2=N*v2;
    if eof then do;
      d=v1/v2;
      Gini=d * 2;
      AUC=d+0.5;
      put AUC= GINI=;
      keep AUC Gini;
      output;
    end;
  run;
%mend;
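As a sanity check on what the macro computes, AUC can also be obtained by brute force as the share of (class 1, class 0) pairs in which the class 1 observation has the higher score, counting ties as one half. The following is a minimal sketch, assuming a hypothetical toy data set `small` with variables `y` and `score` (neither is from the original post). The cross join is quadratic in the number of observations, so it is only practical for small data.

```sas
/* Hypothetical toy data: three observations per class, one tied score */
data small;
  input y score @@;
  datalines;
1 0.9  1 0.8  1 0.4  0 0.7  0 0.3  0 0.4
;
run;

/* Brute-force AUC: compare every class 1 score with every class 0 score */
proc sql;
  select (sum(a.score > b.score) + 0.5*sum(a.score = b.score)) / count(*)
         as AUC_pairs
  from (select score from small where y=1) as a,
       (select score from small where y=0) as b;
quit;

/* Same data through the rank-sum macro defined above */
%AUC(small, y, score);
```

By hand, 7.5 of the 9 pairs go to class 1 (the tie counts as 0.5), so the pair count works out to 7.5/9, about 0.8333. With midranks, the class 1 rank sum is 13.5 against an expected 10.5, and (13.5 - 10.5)/9 + 0.5 gives the same value, so the two approaches should agree.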

data test;
  do i = 1 to 10000;
    x = ranuni(1);
    y = (x + rannor(2315)*0.2 > 0.35);
    output;
  end;
run;

ods select none;
ods output Association=Asso;
proc logistic data = test desc;
  model y = x;
  score data = test out = predicted;
run;
ods select all;

data _null_;


  set Asso;
  if Label2='c' then put 'c-stat=' nValue2;
run;

%AUC(predicted, y, p_0);

NPAR1WAY gets AUC = 0.91766634744; LOGISTIC reports c-statistic = 0.917659.

So, which one is more accurate? I would say NPAR1WAY. The reason is that we can use yet another procedure, PROC FREQ, to verify the Gini value, which is 2*(AUC-0.5). The Gini index is called Somers' D in PROC FREQ. Here, the Gini value calculated from NPAR1WAY is 0.8353269487, the same as the Somers' D C|R reported by PROC FREQ (C|R because the predictor is the column variable):

proc freq data=test noprint;
  tables y*x / measures;
  output out=_measures measures;
run;

data _null_;
  set _measures;
  put _SMDCR_=;
run;
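Why the Gini value 2*(AUC-0.5) coincides with Somers' D can be seen by counting pairs. A short sketch, where the symbols n_c, n_d and n_t (concordant, discordant and score-tied pairs among the N1*N0 pairs of one class 1 and one class 0 observation) are notation introduced here rather than taken from the post:

```latex
\begin{align*}
n_c + n_d + n_t &= N_1 N_0 \\
\mathrm{AUC} &= \frac{n_c + \tfrac{1}{2} n_t}{N_1 N_0} \\
2\,\mathrm{AUC} - 1 &= \frac{2 n_c + n_t - (n_c + n_d + n_t)}{N_1 N_0}
                     = \frac{n_c - n_d}{N_1 N_0}
\end{align*}
```

The last expression is Somers' D with the binary outcome treated as the conditioning variable, since only pairs that differ on the outcome enter the denominator.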

Then why not just use PROC FREQ, since the coding is so simple? Well, the answer is really about the SPEED! Check the log below for a data set with only 100,000 observations: 37.63 sec vs. 0.15 sec in real time:

data one;
  call streaminit(98676876);
  do id=1 to 1e5;
    score=ranuni(0)*1000;
    if score+rannor(0)>0 then y=1;
    else y=0;
    output;
    drop id;
  end;
run;

NOTE: The data set WORK.ONE has 100000 observations and 2 variables.
NOTE: DATA statement used (Total process time):
      real time           0.04 seconds
      cpu time            0.04 seconds

proc freq data=one noprint;
  tables score*y/measures noprint;
  output out=_freq_out measures;
run;

NOTE: There were 100000 observations read from the data set WORK.ONE.
NOTE: The data set WORK._FREQ_OUT has 1 observations and 27 variables.
NOTE: PROCEDURE FREQ used (Total process time):
      real time           37.63 seconds
      cpu time            37.56 seconds

data _null_;
  set _freq_out;
  AUC=_smdrc_/2 + 0.5;
  put "AUC = " AUC " SOMER'S D R|C = " _smdrc_;
run;

AUC = 0.9995285252 SOMER'S D R|C = 0.9990570504
NOTE: There were 1 observations read from the data set WORK._FREQ_OUT.
NOTE: DATA statement used (Total process time):
      real time           0.00 seconds
      cpu time            0.00 seconds

%AUC(one, y, score);

NOTE: The data set WORK.WILCOXONSCORE has 2 observations and 7 variables.




NOTE: There were 100000 observations read from the data set WORK.ONE.
      WHERE y not = .;
NOTE: PROCEDURE NPAR1WAY used (Total process time):
      real time           0.10 seconds
      cpu time            0.09 seconds

AUC=0.9995285252 Gini=0.9990570504
NOTE: There were 2 observations read from the data set WORK.WILCOXONSCORE.
NOTE: The data set WORK.AUC has 1 observations and 2 variables.
NOTE: DATA statement used (Total process time):
      real time           0.01 seconds
      cpu time            0.00 seconds

Posted by Liang Xie at 10/23/2009 02:09:00 PM

Labels: AUC, Macro Programming, predictive modeling, PROC NPAR1WAY


6 comments:

Charlie Shipp Family said...

Looks great .!.

Thanks for your work in sasCommunity.

Charlie Shipp

11:25 PM, February 27, 2010

eskimokitty said...

It is awesome. Thank you so much!!

4:37 PM, June 08, 2011

raspcompote said...

Thank you for this useful post.

3:47 PM, March 13, 2012

Luis Gustavo said...

I'd like to know where you found this relationship between AUC and the Wilcoxon Rank Sum Test. I'm trying to study more about it and it would really help!

Thanks

10:32 AM, February 06, 2014

Liang Xie said...

The relationship is well explained at the Wikipedia page below:
http://en.wikipedia.org/wiki/Mann%E2%80%93Whitney_U

7:37 PM, February 13, 2014

Jon Dickens said...

You are confusing the Gini with the accuracy ratio, but you are not alone; several people at SAS make the same mistake.

If you are interested in discussing this issue, then contact me via LinkedIn.

Jon Dickens

3:11 PM, July 20, 2014
