
Automating Role-based Provisioning by Learning from Examples∗

Qun Ni, Purdue University

Jorge Lobo, IBM T. J. Watson

Seraphin Calo, IBM T. J. Watson

Pankaj Rohatgi, IBM T. J. Watson, USA

Elisa Bertino, Purdue University, USA

ABSTRACT

Role-based provisioning has been adopted as a standard component in leading Identity Management products due to its low administration cost. However, the cost of adjusting existing roles to entitlements from newly deployed applications is usually very high. In this paper, a learning-based approach to automate the provisioning process is proposed and its effectiveness is verified by real provisioning data. Specific learning issues related to provisioning are identified and relevant solutions are presented.

Categories and Subject Descriptors

C.2.0 [Computer Communication Networks]: General—security and protection; D.4.6 [Operating Systems]: Security and Protection—Access Controls; K.6.5 [Management of Computing and Information Systems]: Security and Protection

General Terms

Management, Security, Standardization

Keywords

Provisioning, Access Control, Classification, Role

1. INTRODUCTION

User provisioning, that is, the process of providing users with access to data and technology resources, is a fundamental process in enterprise-level resource management. Provisioning can be thought of as a combination of duties of the human resources and IT departments in an enterprise, where users are given access to data repositories or granted authorization to use systems, applications, and databases based on a unique user identity and possibly other context information.

∗The work reported in this paper has been partially supported by IBM under the OCR project “Privacy and Security Policy Management”, the NSF grant 0712846 “IPS: Security Services for Healthcare Applications”, and MURI award FA9550-08-1-0265 from the Air Force Office of Scientific Research.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. SACMAT’09, June 3–5, 2009, Stresa, Italy. Copyright 2009 ACM 978-1-60558-537-6/09/06 ...$5.00.

In the past decade, role-based user provisioning has been incorporated into the Identity Management (IdM) products from leading vendors [11], such as Oracle, IBM, Sun, and Novell. In such role-based access control (RBAC) systems, privileges are assigned to roles, and each user is classified into one or more roles. These systems assign privileges to users strictly through their role membership. Role-based user provisioning promises some important benefits:

• Where many users require the same access privileges, those privileges can be defined once and applied to multiple recipients. This is cost effective.

• In the event that the set of privileges required by a whole set of users changes, the role definition can be adjusted and the changes can then be applied to many users at once.

• By changing a user’s role classification, it is possible to quickly and reliably revoke some privileges and grant new ones.
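The three benefits listed above follow from the indirection between users and privileges. The sketch below is only an illustration of that indirection, not the data model of any IdM product; the class name `RBAC`, its methods, and the privilege strings are all hypothetical.

```python
from collections import defaultdict


class RBAC:
    """Toy role-based access control store (all names are illustrative)."""

    def __init__(self):
        self.role_privileges = defaultdict(set)  # role -> privileges
        self.user_roles = defaultdict(set)       # user -> roles

    def grant(self, role, privilege):
        # Defined once on the role, effective for every member of the role.
        self.role_privileges[role].add(privilege)

    def assign(self, user, role):
        self.user_roles[user].add(role)

    def reclassify(self, user, old_role, new_role):
        # Changing the role revokes the old privileges and grants the new ones.
        self.user_roles[user].discard(old_role)
        self.user_roles[user].add(new_role)

    def privileges(self, user):
        # Users hold privileges strictly through their role membership.
        privs = set()
        for role in self.user_roles[user]:
            privs |= self.role_privileges[role]
        return privs


rbac = RBAC()
rbac.grant("Receptionist", "view:DLGGeneralMenu")
rbac.grant("Lead Manager", "modify:DLGMainLeadArea")
rbac.assign("alice", "Receptionist")
rbac.assign("bob", "Receptionist")

before = rbac.privileges("alice")
rbac.reclassify("alice", "Receptionist", "Lead Manager")
after = rbac.privileges("alice")
```

Here a single `grant` call reaches both members of the role, and `reclassify` swaps one user's privilege set without touching any individual privilege.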

The benefits of the role-based provisioning approach come, however, with some costs, namely the cost of role definition and the cost of role maintenance. In order to reduce the cost of role definition, approaches based on role mining have been proposed [10, 15, 7]. Role mining refers to the process of mining data about the actual user-to-resource permission assignments to extract role definitions. Role maintenance, and especially the adjustment of the mapping between existing roles and new privileges from new applications, has however received very little attention. Efficient and effective role maintenance is crucial. Routine business events, such as changing the responsibilities of some employees, extending the function of a department, or deploying new applications, require the remodeling and readjustment of the roles. The ongoing nature of such events means that a significant team of expert staff is required to continuously maintain the role model. It is also important to note that the mapping between roles and application privileges is more complicated than the mapping between users and roles because of the scale of privileges and complexity of applications. In a successful role-based provisioning case for SunTrust [18], one of the US’s largest banking organizations, roles are discovered by the combination of top-down role analysis and bottom-up role mining. However, SunTrust still “took great pains” to go down into the internal application table settings to define technical level access privileges in order to ensure the implementation of the regulatory requirements imposed by the Sarbanes-Oxley Act [18].

Since modern enterprises have to rapidly adapt to constantly changing environments and requirements, new applications and services need to be deployed, current applications and services need to be reconfigured, and role assignments to tasks need to be modified. This in turn requires continuous role adjustments, a process that ends up representing a significant proportion of the total cost of role maintenance, possibly outweighing the benefits of the adoption of role-based user provisioning. Such an observation explains why there are proposals [1] suggesting that role-based provisioning is not suitable for general cases and should be limited to special situations.

In this paper, we investigate the role adjustment problem, that is, how to automate the process that provisions existing roles with entitlements from newly deployed applications. The approach proposed in this paper can also be applied to a related but simpler case: provisioning new users with existing roles. Thus our approach represents a general method to automate role-based provisioning. Our contributions are as follows:

• We formalize role-based provisioning as a function of relevant attributes and roles, and thus propose a supervised learning-based approach as a mechanism to discover such a function and to automate role-based provisioning.

• We suggest solutions for specific problems that arise in learning-based provisioning.

• We verify the effectiveness of our approach based on real provisioning data. We have essentially shown that a recommendation system based on our approach for assigning entitlements to roles is feasible in a real-world scenario.

• We present experimental results that show the advantages of the learning-based approach over a rule-based approach.

• We identify two important and specific problems in the learning-based approach, namely, adapting to changes in role definitions and the enforcement of constraints, and suggest solutions to these problems.

The rest of the paper is organized as follows. Section 2 presents a motivation case that is used in the subsequent sections. Section 3 proposes a supervised learning-based solution to the provisioning problem. Section 4 demonstrates the effectiveness of our approach on real provisioning data and discusses the experimental results. Section 5 compares our approach with a rule-based approach. Section 6 focuses on how to maintain a training set and how to enforce relevant RBAC constraints, and presents solutions to these issues. Section 7 discusses related work. Section 8 concludes the paper.

2. A MOTIVATION CASE

In this section we discuss a real application scenario from a well known enterprise, whose name cannot be disclosed for business confidentiality reasons. This scenario, which was the initial motivation for this work, deals with a role-aware point of service product (rPOS) being deployed by the enterprise at some renowned retailers that had already adopted role-based provisioning. In what follows, we focus on a specific retailer, referred to as retailer A.

A business role in retailer A, e.g. “Sales consultant MB in US-West” and “Monitoring and Reporting User”, is able to perform multiple business processes and functionalities through user interface elements in rPOS, which are classified by organization units (Country, Group/Zone, e.g. US-Central, US-East, US-West) and sub-teams (e.g. marketing, sales).

An entitlement in rPOS is defined by the entitlement resource (e.g. access to user interface (UI) elements or functionality), action, parameter, and various context constraints, such as context scope, context source, and context type. Figure 1 shows a fraction of the mapping between business roles in retailer A and UI entitlements in rPOS. The meaning of the attributes in the mapping is shown in Table 1.

For instance, the mapping between entitlement UI_1_002 and business role “Receptionist” means that “Receptionist” can perform the “view” action on the “DLGMainLeadArea” dialog in the “Information area” if the context type is “Org-Unit”, the context source is “Lead”, and the context scope is “1”.

Such entitlements implement fine-grained access control policies, able to take into account relevant context information. For example, even though a user is assigned to a business role “Receptionist”, whether this user can perform the “view” action still depends on the result of the verification of the context constraints, which is dynamically performed by a run-time entitlement checker.

Although such an rPOS design looks fine, its deployment by the enterprise revealed a problem not expected at design time, namely that the creation of the mapping between the fine-grained entitlements and existing business roles is time consuming and costly. Currently such role-based provisioning is a manual process and could involve the user calling a help-desk to obtain access to a particular function. Rules regarding provisioning may not be well specified, which can lead to inconsistent decisions by the help-desk staff. Calls to the help-desk are expensive, and mistakes can be made.

The rPOS deployment in retailer A has 35 business roles, which may not look like a large number. However, the problem is that even for a sub-application, the number of mappings between the UI entitlements and the business roles is 659. It is easy to see how much larger this number would be in an on-going rPOS deployment in another worldwide retailer, referred to as retailer B, that has over 500 business roles.

The motivation case clearly shows that, whereas finer entitlements are needed for security and are required in many applications, they make the mapping between entitlements and business roles very complex and manually unmanageable.

3. A SUPERVISED LEARNING-BASED APPROACH

It is clear from the discussion in the previous section that we need mechanisms to simplify the management of assignments of entitlements to roles, based on the attributes of the entitlements. In this section we show how machine learning techniques can be adapted with this aim.

3.1 Supervised Learning

In a machine learning context, we may consider the mapping between the various attributes of the entitlements and business roles to be a function, f : X → Y, where:

1. X is a set of tuples; each such tuple represents an entitlement; the components of each such tuple can be either strings or numbers (e.g. values of the attributes of the entitlement represented by the tuple);

2. Y is a set of interchangeable, arbitrarily numbered labels (e.g. business roles).

The goal is thus to learn the function f. Once f is learned, we can automatically provision business roles with entitlements by running a classifier based on f. In our approach we adopted supervised learning-based techniques to learn f from examples of provisioning. Unsupervised learning, which does not need examples, can only group entitlements into different clusters based on their similarities and thus is not suitable for our case. The application of a supervised learning-based approach to our problem requires, however, addressing several issues and carrying out several activities, which we discuss in what follows.

| E_ID | Dialog ID | UI-Element | UI-Type | Context-Type | Context-Source | Context-Scope | Action | Required by | Default xxx rPOS Role |
|---|---|---|---|---|---|---|---|---|---|
| UI_1_001 | DLGGeneralMenu | | Dialog | None | | 6 | View | SP1-FK01 | Receptionist |
| UI_1_001 | DLGGeneralMenu | | Dialog | None | | 6 | View | SP1-FK01 | Lead Qualifier |
| UI_1_001 | DLGGeneralMenu | | Dialog | None | | 6 | View | SP1-FK01 | Sales Consultant - Product X |
| UI_1_001 | DLGGeneralMenu | | Dialog | None | | 6 | View | SP1-FK01 | Sub-Team Manager |
| UI_1_001 | DLGGeneralMenu | | Dialog | None | | 6 | View | SP1-FK01 | Lead Manager |
| UI_1_002 | DLGMainLeadArea | Information area | Dialog | Org-Unit | Lead | 1 | View | SP1-FK01 | Receptionist |
| UI_1_002 | DLGMainLeadArea | Information area | Dialog | Org-Unit | Lead | 1 | View | SP1-FK01 | Lead Qualifier |
| UI_1_002 | DLGMainLeadArea | Information area | Dialog | Org-Unit | Lead | 1 | View | SP1-FK01 | Sales Consultant - Product X |
| UI_1_002 | DLGMainLeadArea | Information area | Dialog | Org-Unit | Lead | 1 | View | SP1-FK01 | Sub-Team Manager |
| UI_1_002 | DLGMainLeadArea | Information area | Dialog | Org-Unit | Lead | 1 | View | SP1-FK01 | Lead Manager |
| UI_1_003 | DLGMainLeadArea | Inbound message area | PageSection | Owner | Lead | 0 | Modify | SP1-FK01 | Receptionist |
| UI_1_003 | DLGMainLeadArea | Inbound message area | PageSection | Owner | Lead | 0 | Modify | SP1-FK01 | Lead Qualifier |
| UI_1_003 | DLGMainLeadArea | Inbound message area | PageSection | Owner | Lead | 0 | Modify | SP1-FK01 | Sales Consultant - Product X |

Figure 1: The Mapping between UI Entitlements and Business Roles

Table 1: The meaning of attributes in Figure 1

| Attribute | Meaning | Example |
|---|---|---|
| E_ID | a unique entitlement identifier; it may need a prefix depending on entitlement type | <UI_1_nnn> |
| Dialog_ID | a UI dialog name | <DLGMainLeadArea> |
| UI-Element | a UI element name | <Process Area> |
| UI-Type | a UI element type | <Dialog, Tab, PageSection, Field> |
| Context-Type | the scope type for which entitlement will be set | <Org-Unit, Sub-Team, Owner, None> |
| Context-Source | define source business objects | <Lead> |
| Context-Scope | a scope value, see Table 2 | <0,1,2,3,4,6> |
| Actions | actions which can be set by the entitlement | <View, Modify> |
| Required by | additional information required by software developers | <SPx-FKxy> |
| xxx rPOS Role | business roles deployed in xxx | <Sales Consultant MB US-West> |
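To make the idea of learning f : X → Y from examples concrete, the following minimal sketch builds a classifier for a single role from labelled entitlement tuples. It uses a 1-nearest-neighbour rule with Hamming distance purely as a stand-in learner; this is an assumption for illustration, not one of the algorithms evaluated in the paper. The attribute subset and the True/False labels are made up.

```python
def hamming(a, b):
    """Number of attribute positions where two entitlement tuples differ."""
    return sum(x != y for x, y in zip(a, b))


def learn(examples):
    """Return a classifier f: X -> Y built from (tuple, label) examples.

    Uses 1-nearest-neighbour as a stand-in learner; ties are broken by
    the order of the training examples.
    """
    examples = list(examples)

    def f(x):
        _, label = min(examples, key=lambda ex: hamming(ex[0], x))
        return label

    return f


# Hypothetical training set for one role.
# Tuple layout: (UI-Type, Context-Type, Context-Scope, Action).
training = [
    (("Dialog", "None", 6, "View"), True),
    (("Dialog", "Org-Unit", 1, "View"), True),
    (("PageSection", "Owner", 0, "Modify"), False),
    (("Field", "Owner", 0, "Modify"), False),
]

f = learn(training)
print(f(("Dialog", "Org-Unit", 6, "View")))  # nearest examples are positive
print(f(("Tab", "Owner", 0, "Modify")))      # nearest examples are negative
```

Any supervised learner with the same interface (examples in, function out) could be substituted for `learn`.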

3.2 Learning Procedure

The learning-based provisioning process follows the steps outlined in Figure 2. The first step in the provisioning process is data selection. In this step, we need to collect some provisioning data from human experts, e.g. role-entitlement mappings, as the training set for the classifier. Another crucial activity in this step is the so-called “feature selection”, also known as attribute selection. It is possible that not all attributes are relevant to the provisioning mapping. In Section 5, we will show that irrelevant attributes may reduce the classification accuracy. Features can be selected either by experts with domain knowledge or automatically by the use of feature selection algorithms, e.g. feature ranking or subset selection [16]. The optimal feature set should be both necessary and sufficient. If some relevant attributes are missing, whether before or after feature selection, the accuracy would decrease as well. Even though in our specific scenario the features (attributes) had already been chosen by application developers, we have identified feature selection as an important element of our approach.
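As an illustration of automatic feature ranking, the sketch below scores each nominal attribute by its information gain with respect to a role label. Information gain is one common ranking criterion and is used here as an assumption; the paper does not state which feature selection algorithm, if any, was applied. The rows and labels are invented.

```python
import math
from collections import Counter, defaultdict


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())


def information_gain(rows, labels, attr):
    """Entropy reduction obtained by splitting on attribute index `attr`."""
    groups = defaultdict(list)
    for row, label in zip(rows, labels):
        groups[row[attr]].append(label)
    remainder = sum(len(g) / len(labels) * entropy(g)
                    for g in groups.values())
    return entropy(labels) - remainder


# Hypothetical rows: (UI-Type, Required by); label = assigned to the role?
rows = [
    ("Dialog", "SP1-FK01"),
    ("Dialog", "SP1-FK02"),
    ("PageSection", "SP1-FK01"),
    ("PageSection", "SP1-FK02"),
]
labels = [True, True, False, False]

gains = [information_gain(rows, labels, a) for a in range(2)]
ranking = sorted(range(2), key=lambda a: gains[a], reverse=True)
print(gains, ranking)
```

In this toy data the first attribute fully determines the label and the second carries no information. Note that information gain tends to favour attributes with many distinct values, such as identifiers, so a ranking of this kind should be checked by someone with domain knowledge.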

The second step deals with pre-processing the provisioning data. Original provisioning data may not be suitable for direct use by classification algorithms. Missing attribute values are common in many real application domains and may impact the accuracy of classifications. Although some classification algorithms include mechanisms for dealing with missing values, these mechanisms typically do not take into account the semantics of the data. For instance, a missing value in a numerical attribute is typically replaced with the average value of the attribute and a missing value in a nominal attribute is replaced with the most frequent nominal value.

However, it is usually the case that a missing value in an attribute referred to in an access control policy results in a “not applicable” or “not available” evaluation for the policy. Returning such an evaluation result to the application is important in order to let the application know that the policy could not be evaluated because of a missing value for one of the attributes referenced in the policy. The application is then able to manage such a situation, for example by evaluating alternative access control policies or taking alternative actions. In such a situation, if the missing value were replaced with the most frequent value of the attribute, the application would not be provided with an accurate policy evaluation; for example, it could result in a “permit” decision, thus causing potential violations of the intended policies. Moreover, such a replacement may undermine the accuracy of the learned classifier as well. Therefore, in our provisioning data, we replace all missing values with a new value “not applicable” in order to differentiate it from current values and to conform with the common semantics of missing values in policies.
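The difference between the two treatments of missing values can be sketched as follows. The function names and the representation of a missing value as `None` or an empty string are assumptions made for this illustration.

```python
from collections import Counter

NOT_APPLICABLE = "not applicable"
MISSING = (None, "")


def fill_not_applicable(rows):
    """Replace missing values with a dedicated 'not applicable' value."""
    return [tuple(NOT_APPLICABLE if v in MISSING else v for v in row)
            for row in rows]


def fill_most_frequent(rows):
    """Conventional imputation for nominal attributes, shown for contrast.

    Assumes every column has at least one non-missing value.
    """
    columns = list(zip(*rows))
    modes = [Counter(v for v in col if v not in MISSING).most_common(1)[0][0]
             for col in columns]
    return [tuple(m if v in MISSING else v for v, m in zip(row, modes))
            for row in rows]


# Tuple layout: (Dialog ID, UI-Element, Context-Source).
rows = [
    ("DLGGeneralMenu", None, None),
    ("DLGMainLeadArea", "Information area", "Lead"),
    ("DLGMainLeadArea", "Inbound message area", "Lead"),
]

print(fill_not_applicable(rows)[0])
print(fill_most_frequent(rows)[0])
```

With most-frequent imputation the first entitlement silently acquires the context source “Lead”, which it never had; the dedicated value keeps that fact visible to the classifier.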

Another relevant issue specific to our provisioning problem is that the mapping between roles and entitlements is many-to-many, that is, an entitlement can be assigned to several roles and each role can have several entitlements. Standard classification algorithms, like decision trees or naive Bayes, can only assign one label (role) to one instance (entitlement) at a time. In other words, even if we may have multiple labels (roles), each instance (entitlement) can only have exactly one label (role).

To address such an issue, we adopted a simple yet effective approach based on the transformation of each “many-to-many” mapping into several “many-to-one” mappings, one for each label [8]. The idea is illustrated in Figure 3. Given a “many-to-many” mapping between entitlements and roles, we create several “many-to-one” mappings between entitlements and each role. Each role has its own new mapping table that lists all entitlements together with a label. If there is a mapping between an entitlement and the role in the original mapping, the corresponding label is “true” in the new mapping. Otherwise, the label is “false”. For each label r, the classifier is trained based on the corresponding role mapping table. In the classification step (that is, when classifying new provisioning data), a new instance is associated with a label r if and only if the function f learned for r evaluates to “true” on it. Such an approach has the benefit that no information is lost during the transformation and all classical classification algorithms can be directly applied without any modification. The drawback is that we have to learn many functions, one for each particular role, instead of one function for all roles.

Table 2: The meaning of scope

| Scope | Meaning |
|---|---|
| 0 | Enabled for data or functions on data which is personally assigned to / owned by the user |
| 1 | Enabled for the organizational unit or sub-team the business role is assigned to |
| 2 | Enabled for the organizational unit the business role is assigned to and all subordinate units |
| 3 | Enabled for the primary organizational unit the user is assigned to |
| 4 | Enabled for the primary organizational unit the user is assigned to and all subordinate units |
| 6 | Enabled regardless of the ownership of data / function data |

Figure 2: Organization of the learning-based user provisioning process. [Flowchart with four stages: (1) Provisioning Data Selection: identify relevant attributes; collect existing provisioning data provided by experts. (2) Provisioning Data Pre-processing: handle missing values; process data for multi-label classification, i.e. entitlements and users can belong to several roles. (3) Classifiers Comparison: run different classifiers on data; choose one or several best classifiers by standard evaluation methods. (4) Classification: classify (assign roles to) new provisioning data; ensemble methods may be applied if needed.]

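Original many-to-many mapping:

| Entitlements | Roles |
|---|---|
| EID_1 | R1 |
| EID_1 | R2 |
| EID_2 | R1 |
| EID_2 | R3 |
| EID_3 | R2 |
| EID_3 | R3 |

Per-role many-to-one mappings:

| Entitlements | R1 |
|---|---|
| EID_1 | True |
| EID_2 | True |
| EID_3 | False |

| Entitlements | R2 |
|---|---|
| EID_1 | True |
| EID_2 | False |
| EID_3 | True |

| Entitlements | R3 |
|---|---|
| EID_1 | False |
| EID_2 | True |
| EID_3 | True |

Figure 3: Transformation of Mapping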
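The transformation of Figure 3 can be sketched in a few lines. The function names are hypothetical, and the per-role “classifiers” at the end are simple look-ups standing in for trained models, so the example only demonstrates the data flow.

```python
def to_binary_tables(mapping):
    """Split (entitlement, role) pairs into one True/False table per role."""
    entitlements = sorted({e for e, _ in mapping})
    roles = sorted({r for _, r in mapping})
    assigned = set(mapping)
    return {
        role: {e: (e, role) in assigned for e in entitlements}
        for role in roles
    }


def roles_for(entitlement, classifiers):
    """Assign role r iff the classifier learned for r returns True."""
    return {r for r, f in classifiers.items() if f(entitlement)}


# The many-to-many mapping of Figure 3.
mapping = [
    ("EID_1", "R1"), ("EID_1", "R2"),
    ("EID_2", "R1"), ("EID_2", "R3"),
    ("EID_3", "R2"), ("EID_3", "R3"),
]

tables = to_binary_tables(mapping)

# Stand-in classifiers that simply look up the training label.
classifiers = {r: (lambda e, t=t: t[e]) for r, t in tables.items()}

print(tables["R1"])
print(roles_for("EID_2", classifiers))
```

Because every per-role table lists all entitlements, the original mapping can be recovered exactly from the tables, which is the sense in which no information is lost.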

The third step is classifier training and comparison. There are many different classification algorithms based on different methodologies, for instance, support vector machines and Bayesian networks. Different algorithms usually have different accuracy with respect to different applications. As a general rule, no particular algorithm is better than all other algorithms in all possible applications. Given different provisioning data sets, a particular algorithm may show quite different performance from others, as we have verified with our experiments (see Section 4). Therefore, at this step, it is important not only to train and tune classifiers but also to compare their performance based on standard estimation methods, e.g., 10-fold cross validation. Only the best classifiers are then used in the last step, that is, classification.
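A comparison of classifiers by k-fold cross validation can be sketched as below. The two learners are deliberately trivial toys on synthetic one-dimensional data, chosen only so that the example is self-contained; they are not the algorithms compared in Section 4, and all function names are hypothetical.

```python
import random


def k_fold_indices(n, k, seed=0):
    """Shuffle range(n) and cut it into k folds of near-equal size."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def cross_val_accuracy(train_fn, data, k=10, seed=0):
    """Average held-out accuracy of `train_fn` over k folds.

    `train_fn(examples)` must return a function mapping an instance
    to a label.
    """
    scores = []
    for held_out in k_fold_indices(len(data), k, seed):
        test_idx = set(held_out)
        train = [ex for i, ex in enumerate(data) if i not in test_idx]
        model = train_fn(train)
        correct = sum(model(data[i][0]) == data[i][1] for i in held_out)
        scores.append(correct / len(held_out))
    return sum(scores) / k


def majority_learner(examples):
    """Baseline: always predict the most common training label."""
    labels = [y for _, y in examples]
    top = max(set(labels), key=labels.count)
    return lambda x: top


def threshold_learner(examples):
    """Toy learner: predict True above the midpoint of the class means."""
    pos = [x for x, y in examples if y]
    neg = [x for x, y in examples if not y]
    cut = (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2
    return lambda x: x > cut


# Synthetic data: one numeric feature, label True when it is at least 50.
data = [(x, x >= 50) for x in range(100)]

results = {fn.__name__: cross_val_accuracy(fn, data, k=10)
           for fn in (majority_learner, threshold_learner)}
best = max(results, key=results.get)
print(results, best)
```

Each example is held out exactly once, so every classifier is scored on data it did not see during training, and the one with the highest average score is carried forward.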

Sometimes, even the best classifier may turn out to be a “weak” or “unstable” classifier. A weak learner is defined to be a classifier which is only slightly correlated with a “true” or “perfect” classifier. An unstable classifier is one for which small changes in the training set result in large changes in predictions. These behaviors can arise because of properties of the provisioning data and the limitations of learning algorithms.

The last step in our provisioning process classifies the unlabelled provisioning data to automatically assign entitlements to roles. If only a weak or unstable classifier has been obtained in the previous step, standard ensemble methods, e.g., bootstrap aggregating (bagging) in the case of an “unstable” classifier and boosting in the case of a “weak” classifier, may be applied to improve classification performance. Given a standard training set D of size n, bagging generates m new training sets Di of size n′ ≤ n, by sampling examples from D uniformly and with replacement. When sampling with replacement is used it is likely that some examples will be repeated in each Di. If n′ = n, then it can be shown that for large n the set Di is expected to have a fraction 1 − 1/e ≈ 63.2% of the unique examples of D, the rest being duplicates [4]. This kind of sample is known as a bootstrap sample. The m models are fitted using the above m bootstrap samples and the results from these models on a test set are combined by voting for classification.
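The 63.2% figure follows because a given example is missed by all n draws with probability (1 − 1/n)^n, which tends to 1/e as n grows. The sketch below checks this empirically and shows the voting step; the helper names are hypothetical and the three constant “models” are placeholders for fitted classifiers.

```python
import random
from collections import Counter


def bootstrap_sample(data, size=None, rng=random):
    """Draw `size` examples uniformly with replacement (default: len(data))."""
    size = len(data) if size is None else size
    return [rng.choice(data) for _ in range(size)]


def bagging_predict(models, x):
    """Combine the fitted models by majority vote."""
    votes = Counter(model(x) for model in models)
    return votes.most_common(1)[0][0]


rng = random.Random(42)
D = list(range(10_000))
sample = bootstrap_sample(D, rng=rng)
unique_fraction = len(set(sample)) / len(D)
print(round(unique_fraction, 3))  # close to 1 - 1/e for large n

# Placeholder models standing in for classifiers fitted on bootstrap samples.
models = [lambda x: True, lambda x: True, lambda x: False]
print(bagging_predict(models, "some entitlement"))
```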

Most boosting algorithms consist of iteratively learning weak classifiers with respect to a distribution and adding them to a final strong classifier. When they are added, they are typically weighted in some way that is usually related to the weak learner’s accuracy. After a weak learner is added, the data is re-weighted: examples that are misclassified gain weight and examples that are classified correctly lose weight (some boosting algorithms actually decrease the weight of repeatedly misclassified examples, e.g., boost by majority). Thus, future weak learners focus more on the examples that previous weak learners misclassified.
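The re-weighting step can be illustrated with the update used by AdaBoost, one well-known boosting algorithm. The paper does not name a specific boosting algorithm, so the choice of AdaBoost here is an assumption, and the function name and toy labels are hypothetical.

```python
import math


def reweight(weights, predictions, labels):
    """One AdaBoost-style round: return (alpha, new_weights).

    alpha is the vote given to the weak learner; misclassified examples
    gain weight and correctly classified ones lose weight.
    """
    error = sum(w for w, p, y in zip(weights, predictions, labels) if p != y)
    if not 0 < error < 0.5:
        raise ValueError("weak learner must have weighted error in (0, 0.5)")
    alpha = 0.5 * math.log((1 - error) / error)
    scaled = [w * math.exp(alpha if p != y else -alpha)
              for w, p, y in zip(weights, predictions, labels)]
    total = sum(scaled)
    return alpha, [w / total for w in scaled]


labels = [True, True, False, False]
predictions = [True, False, False, False]  # the second example is wrong
weights = [0.25] * 4

alpha, new_weights = reweight(weights, predictions, labels)
print(alpha, new_weights)
```

After the update, the single misclassified example carries half of the total weight, so the next weak learner is pushed to classify it correctly.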

4. EXPERIMENTS

To evaluate our approach and in particular to assess how the various supervised learning techniques perform for our specific problem, we have performed an extensive experimental evaluation. In our evaluation, we have used both real provisioning data from our motivation case and synthetic provisioning data. The various learning techniques have been evaluated with respect to accuracy, false positive rates, and false negative rates. By accuracy we mean the ratio of the number of correct assignments by a classifier to the total number of assignments. By false positive we mean that a role receives a privilege that the role should not have received. False positives may result in security breaches, potentially leading to major losses for the organization.

By false negative we mean that a role does not receive a privilege that the role should have received. Even though false negatives do not lead to security breaches, they result in some overhead such as sending a request form to a help desk to add the entitlement to the user or role that is missing it. Due to the nature of our problem, our approach must achieve high accuracy and a very low false positive rate to be usable in practice. Since our purpose is to improve the efficiency of assigning the entitlements of new applications to existing roles, a low false negative rate is also important but not strictly necessary.
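The three measures can be computed from predicted and actual labels as sketched below. The text does not spell out the denominators of the two rates; the sketch assumes the usual convention (false positives over actual negatives, false negatives over actual positives). The always-negative classifier is a hypothetical example, applied to a mapping with 6 positive and 208 negative cases.

```python
def provisioning_metrics(predicted, actual):
    """Accuracy, false positive rate and false negative rate for one role."""
    pairs = list(zip(predicted, actual))
    tp = sum(1 for p, a in pairs if p and a)
    tn = sum(1 for p, a in pairs if not p and not a)
    fp = sum(1 for p, a in pairs if p and not a)  # privilege wrongly granted
    fn = sum(1 for p, a in pairs if not p and a)  # privilege wrongly withheld
    return {
        "accuracy": (tp + tn) / len(pairs),
        "fp_rate": fp / (fp + tn) if fp + tn else 0.0,
        "fn_rate": fn / (fn + tp) if fn + tp else 0.0,
    }


# A role with 6 positive and 208 negative entitlements, and a
# hypothetical classifier that never assigns anything to the role.
actual = [True] * 6 + [False] * 208
predicted = [False] * 214

metrics = provisioning_metrics(predicted, actual)
print(metrics)
```

The example shows why accuracy alone is misleading for roles with few positive cases: the classifier is right on 208 of 214 entitlements while missing every entitlement the role should have.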

4.1 Setup

The user provisioning data set used in the experiments is from a sub-application in the rPOS deployment at retailer A with the following characteristics (a snapshot of such data is shown in Figure 1). Note that there are two types of entitlements: function entitlements and user interface entitlements.

• 32 business roles.

• 316 distinct user interface entitlements with 8 attributes (see Figure 1).

• 658 mappings between business roles and user interface entitlements.

• 214 distinct function entitlements with 8 attributes: Application.Functions, UI.Button.Link, Context.Type, Context.Source, Context.Scope, Parameter.1, Parameter.2, Required.by.

• 410 mappings between business roles and function entitlements.

As we can see from Figure 1, the relation is a “many-to-many” mapping and there are some missing values.

4.2 Choice of the Classification Algorithms

We do not report all the results concerning the experiments on the selection of the best algorithm because of space reasons. Since the main purpose here is a proof of concept, we adopt the following two steps to choose the best overall algorithms:

1. We run different algorithms on 4 mappings related to 3 roles that are chosen randomly from 35 roles;

2. We choose the algorithm with the best overall performance (see Figure 4) on these 4 mappings to be used in the next classification step.

The four chosen mappings are as follows. The 3rd mapping has only 6 positive cases; therefore we call the role “Sales Consultant - FS” an unpopular role in Function Entitlements, while the roles in the other 3 mappings are popular roles in the relevant entitlements. Note that “Sales Consultant - FS” is a popular role in UI Entitlements.

1. Role “Receptionist” w.r.t. Function Entitlements, 42 positive cases and 172 negative cases;

2. Role “Lead Manager” w.r.t. Function Entitlements, 49 positive cases and 165 negative cases;

3. Role “Sales Consultant - FS” w.r.t. Function Entitlements, 6 positive cases and 208 negative cases;

4. Role “Sales Consultant - FS” w.r.t. UI Entitlements, 69 positive cases and 247 negative cases.

We analyzed over 20 classification algorithms [17] from different categories, such as probabilistic classifiers, distance-based classifiers or tree-based classifiers. Figure 4 reports results for 18 of those algorithms. The other two algorithms did not have acceptable performance. “cv” in the legend stands for cross validation, “fp” stands for false positive, and “fn” stands for false negative.

Generally speaking, several classification algorithms perform well in terms of false positive rates, which is the most important requirement for security. However, some algorithms are much better than others in terms of false negative rates. Therefore the false negative rate plays a key role in determining the choice of the best algorithm. The accuracy of the two best algorithms, that is, support vector machine (SVM) and C4.5 decision tree, is generally over 90%.

Different classification algorithms have different parameters. SVM, for instance, has a cost parameter (C) and a kernel parameter (γ) if a radial basis function (RBF) kernel is chosen. It is not known beforehand which values of C and γ are the best for a specific problem; consequently some kind of model selection (parameter search) must be done, referred to as parameter tuning. Its goal is to identify good values for C and γ so that the classifier can accurately predict unlabelled data, i.e. testing data.
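A common way to carry out such a parameter search is an exhaustive grid search over exponentially spaced candidate values, keeping the pair with the best cross-validated score. The sketch below shows only the search loop. The grid ranges are a frequently used starting point for RBF-kernel SVMs and are an assumption, not the setting used in the paper, and `toy_cv_accuracy` is a hypothetical stand-in for training an SVM and measuring its cross-validated accuracy.

```python
import itertools
import math


def grid_search(score, c_values, gamma_values):
    """Return the (C, gamma) pair with the highest score, and that score."""
    best = max(itertools.product(c_values, gamma_values),
               key=lambda pair: score(*pair))
    return best, score(*best)


# Exponentially spaced candidates (assumed ranges, see the lead-in).
c_grid = [2.0 ** k for k in range(-5, 16, 2)]
gamma_grid = [2.0 ** k for k in range(-15, 4, 2)]


def toy_cv_accuracy(C, gamma):
    """Hypothetical stand-in for the cross-validated accuracy of an SVM.

    It peaks at C = 2**3 and gamma = 2**-5 so that the search has a
    well-defined optimum to find.
    """
    return 1.0 - 0.01 * (abs(math.log2(C) - 3) + abs(math.log2(gamma) + 5))


best, best_score = grid_search(toy_cv_accuracy, c_grid, gamma_grid)
print(best, best_score)
```

In practice `score` would train the classifier with the given parameters and return, for example, its 10-fold cross-validated accuracy, so the cost of the search grows with the product of the two grid sizes.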

It should be noted that some classification algorithms are more sensitive to parameters than other algorithms. For instance, the performance of the C4.5 decision tree is poor in our case when applying its default parameters, but it can be quickly improved by adjusting its parameters. By contrast, in our case the SVM performance changes slowly w.r.t. its parameters.

Because SVM represents the state of the art for learning algorithms and has fewer parameters, which translates into a shorter parameter tuning time, SVM is chosen to verify the classification performance on other mappings in both function entitlements and UI entitlements. The phenomenon of a 100% false negative rate for the unpopular role “Sales Consultant - FS” in function entitlements is discussed in Section 4.3.

4.3 SVM Performance on Different Roles

The classification performance of SVM [5, 12] on the majority of roles is shown in Figure 5. The first graph represents the mapping between business roles and function entitlements, and the second graph reports the number of corresponding positive cases and negative cases. The third graph represents the mapping between business roles and UI entitlements, and the fourth graph reports the number of corresponding positive cases and negative cases.

The first observation is that the performance of 10-fold cross validation is usually better than that of 2-fold cross validation. For some unpopular roles in function entitlements, the improvement of the false negative rates is huge. This result is not surprising because 10-fold cross validation has a larger training set and unpopular roles are more sensitive to the size of training sets.

Figure 4: Different classification algorithms on different role-entitlement mappings. [Four bar-chart panels, each plotting 10-fold cv accuracy, 10-fold cv fp rates, and 10-fold cv fn rates per algorithm: (1) business role Receptionist, function entitlements, 214 total entitlements, 42 positive and 172 negative cases; (2) business role Lead Manager, function entitlements, 214 total entitlements, 49 positive and 165 negative cases; (3) business role Sales Consultant - FS, function entitlements, 214 total entitlements, 6 positive and 208 negative cases; (4) business role Sales Consultant - FS, UI entitlements, 316 total entitlements, 69 positive and 247 negative cases.]

Figure 5: SVM on Different Roles in Different Entitlements. [Four panels: (1) SVM with RBF kernel on function entitlements (214 total entitlements), plotting 2-fold and 10-fold cv accuracy, fp rates, and fn rates per role; (2) number of positive and negative cases per role for function entitlements; (3) SVM with RBF kernel on UI entitlements (316 total entitlements), plotting the same six measures per role; (4) number of positive and negative cases per role for UI entitlements.]

The second observation is that the overall performance is very good. Most false positive rates are between 0% and 5%, which is very important for security. Most false negative rates are between 0% and 30%, which means that the learning-based approach works well: 70% or more of the assignments can be executed automatically and no more than 30% of the assignments will need some assistance.

The last but not least important observation is that false negative rates are 100% on two unpopular mappings, “Sales Consultant - FS” w.r.t. function entitlements and “Business Administrator” w.r.t. UI entitlements. A 100% false negative rate means that the SVM algorithm cannot learn anything from the samples in these two mappings.

We have already mentioned that all tested classification algorithms perform almost equally badly¹ on the mapping “Sales Consultant - FS” w.r.t. function entitlements (see Figure 4). The mapping “Business Administrator” w.r.t. UI entitlements is similar.

To understand the reason for such behavior, we needed to analyze the data in more detail. These two mappings have some common features. The values of the “Dialog ID” attribute of all positive (true) entitlements for “Business Administrator” are not only distinct from each other but also distinct from those of all other UI entitlements. The same happens for the values of the “Application.Functions” attribute of all positive entitlements for “Sales Consultant - FS”. This situation means that these two important attributes are useless for classification. Unfortunately, the values of all other attributes in their positive cases are the same as those in their negative cases, which means that all the other attributes are useless for classification as well. In other words, there is simply no way to distinguish the samples, and no experience from past samples can predict future instances. Some relatively high false negative rates (around 30-40%) in popular mappings have the same explanation: some positive cases in these mappings have exactly the features described above, and thus they cannot be recognized by classifiers.

These phenomena happen in practice. Some roles with very distinct permissions do exist, such as “Sales Consultant - FS” for function entitlements. It is difficult or impossible to discover a meaningful distribution for their entitlement-role mappings, and administrators can only assign entitlements to roles by enumerating all possibly useful cases or by handling requests case by case (on-demand assignment). This is not the “fault” of the learning-based approach, but a consequence of the problem itself. Fortunately, extreme cases like these two roles do not often arise.

If we consider these phenomena purely from the algorithm’s perspective, it is clear that the feature set given in the provisioning data is not sufficient to distinguish positive instances from negative instances. Therefore, one possible approach to improve accuracy is to add useful attributes whose values can distinguish positive instances from negative instances.

One may argue that the problem is trivial: perhaps some simple rules govern the assignment of entitlements to roles, and learning them is a trivial matter. To address this question, we looked into the details of the functions between the role “Lead Manager” and the entitlement type “Function” that are generated by two popular algorithms: the C4.5 decision tree and the LIBSVM SVM.

¹ Astute readers may notice that the SVM performance reported in Figure 4 is better than the performance reported in Figure 5. The reason is that the results reported in Figure 4 correspond to an SVM with a linear kernel function, while the results reported in Figure 5 are from an SVM with an RBF kernel function. Although RBF is better in most cases, a linear kernel function is slightly better in this particular case.

The decision tree for the mapping has 10 leaves and 15 nodes. Given 107 samples, the number of support vectors is 64. We obtained roughly the same data for other business roles. Therefore, the mapping between business roles and entitlements in this particular problem is definitely not trivial. Indeed, it is difficult even for experts to discover the mapping from the provided samples.

5. COMPARISON WITH A RULE-BASED APPROACH

Rule-based user provisioning has been recently proposed by Al-Kahtani et al. [2]. Rules are predicates on the values of relevant attributes, which determine whether a user should be provisioned with a role or not. Although the approach by Al-Kahtani et al. only addresses the problem of assigning users to roles, the idea can be applied to permission assignments as well. Rule-based user provisioning has indeed been adopted by some leading commercial IdM products [11], such as Oracle IdM (under the name “policy based”) and Sun IdM. A hybrid rule-based and script-based automated user provisioning approach has been adopted by IBM Tivoli IdM. Administrators can use predicates to assign users to roles, and they can write provisioning scripts to verify the values of user attributes. The verification result is used to determine which privileges to provision. Because scripts are essentially used to describe rules, we use the term “rule-based approach” to refer to both approaches.

Rule-based provisioning, though it works, does not reduce cost or expert involvement by much. It only provides a language for administrators to manually describe the relations between attribute values and assignments. Such an approach is not truly automatic: it reduces expert involvement in each user/role or role/permission assignment at the cost of expert effort on two non-trivial problems, namely discovering the relations from samples and describing these relations in terms of rules. Usually, it is not trivial to discover such relations or to describe them correctly. For instance, to be eligible for a senior software engineer role, a software engineer may have to show good scores on many factors other than the number of years of experience. These factors could include bug rates, peer reviews, awards, product impact, etc. Therefore, these tasks need “expensive” experts.

By contrast, a learning-based approach does not need human expert effort for discovering and describing the “relation”; such tasks are executed by classification algorithms, at the cost of finding a set of sample assignments. Notice, however, that a set of sample assignments is also the minimal requirement for manual user provisioning, because administrators usually need some good references to do their job. A set of sample assignments is required by rule-based provisioning as well, because experts, in order to specify rules, need to elicit information from the CIO and/or IT/Business/HR managers and need some good samples to test these rules. Compared with manual provisioning and rule-based provisioning, learning-based provisioning requires the least involvement of human experts.

We have already pointed out that learning algorithms try to learn a probability distribution function between attribute values and labels from examples. One may wonder whether a learning algorithm can precisely discover the rule in the examples if a clearly defined (non-probabilistic) rule exists between attribute values and entitlements. The short answer to this question is a conditional “Yes”. One condition is that a learning-based approach needs good feature selection to reach a “crisp” mapping guided by rules. To verify our answer, we generated synthetic user provisioning data based on some rules (predicates on attributes). Then we ran a classification algorithm on the data to see whether the algorithm was able to rediscover those rules. Of course, the learned function may have a form different from the form of the rules. In the case of SVM, that function consists of some support vectors and tuning parameters. To be general, we defined a set of rules on both numerical attributes and nominal attributes: a, b, c, and d are nominal attributes with finite domains [0,1], [0,1,2,3,4,5], [0,1], and [0-15] respectively; e and f are integers representing salary [5000-20000] and age [3-80] respectively; and g is an integer in [0-20], representing, for example, years of experience. The generated rules are as follows.

• Rule 0: a == 1 & b != 2. A role is assigned if and only if a equals 1 and b does not equal 2.

• Rule 1: a == 1 & b != 2 & c == 0. A role is assigned if and only if a equals 1, b does not equal 2, and c equals 0.

• Rule 2: a == 1 & (d == 1 | d == 3 | d == 7 | d == 9 | d == 13). A role is assigned if and only if a equals 1 and d is one of {1,3,7,9,13}.

• Rule 3: e > 10000 & f > 50. A role is assigned if and only if e is greater than 10000 and f is greater than 50.

• Rule 4: rbinom(maxnum, 1, pnorm(g, mean = 10, sd = 3)) == 1². If g equals 10, a role has a 50% chance of being assigned. The larger g is, the higher the probability that a role is assigned, and vice versa.
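The five rules can be turned into a labelled data generator along the following lines. This is a minimal sketch rather than the script used for our experiments: the attribute domains come from the text, but uniform sampling over each domain, the helper names, and the use of `math.erf` in place of R's `pnorm` are assumptions made for illustration.

```python
import math
import random


def normal_cdf(x, mean=10.0, sd=3.0):
    """Normal CDF, standing in for R's pnorm(x, mean, sd)."""
    return 0.5 * (1.0 + math.erf((x - mean) / (sd * math.sqrt(2.0))))


def random_user(rng):
    """Draw one user; uniform sampling over each domain is an assumption."""
    return {
        "a": rng.randint(0, 1),
        "b": rng.randint(0, 5),
        "c": rng.randint(0, 1),
        "d": rng.randint(0, 15),
        "e": rng.randint(5000, 20000),  # salary
        "f": rng.randint(3, 80),        # age
        "g": rng.randint(0, 20),        # e.g. years of experience
    }


RULES = {
    0: lambda u, rng: u["a"] == 1 and u["b"] != 2,
    1: lambda u, rng: u["a"] == 1 and u["b"] != 2 and u["c"] == 0,
    2: lambda u, rng: u["a"] == 1 and u["d"] in {1, 3, 7, 9, 13},
    3: lambda u, rng: u["e"] > 10000 and u["f"] > 50,
    # Rule 4 is probabilistic: one Bernoulli draw with p = pnorm(g, 10, 3).
    4: lambda u, rng: rng.random() < normal_cdf(u["g"]),
}


def generate(rule_id, n=200, seed=0):
    """Return n (user, label) pairs labelled by the chosen rule."""
    rng = random.Random(seed)
    rule = RULES[rule_id]
    data = []
    for _ in range(n):
        user = random_user(rng)
        data.append((user, int(rule(user, rng))))
    return data


samples = generate(rule_id=2)
print(len(samples), sum(label for _, label in samples))
```

Rules 0 to 3 are deterministic, so every generated label can be recomputed from the attributes alone; only rule 4 attaches genuinely random labels to identical attribute values.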

In order to verify the effect of feature selection, we generated two data sets for each rule, each of which includes 200 assignments. One data set includes only the attributes that are used in the relevant rule, in order to simulate the “after feature selection” case. The other data set includes all attributes, a, b, c, d, e, f, g, in order to simulate the “before feature selection” case. We ran SVM classifiers on these data sets and used 10-fold cross validation to compare SVM performance. The results are reported in Figure 6. The results clearly show that “after feature selection” is always better than “before feature selection”. For rules 0, 1, 2, and 3, the performance in the “after feature selection” case does not differ from that of the rule-based approach: the functions learnt have 100% accuracy. Therefore, if we really want to enforce a rule, we can synthetically generate some samples based on the rule and train a function based on these samples. We can also see that, in some cases, the performance in the “before feature selection” case is really bad, e.g., for rule 2. Therefore, feature selection really matters here. Due to the choice of the mean and the standard deviation in rule 4, the accuracy for rule 4 is roughly 88%, as expected.
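The figure of roughly 88% for rule 4 can be checked with a short calculation. Because rule 4 assigns the role with probability pnorm(g, 10, 3), even a perfect classifier can only predict the more likely label for each value of g. The sketch below averages that best-case probability over g; it assumes that g is uniformly distributed over the integers 0 to 20, which the text does not state, and the function names are illustrative.

```python
import math


def normal_cdf(x, mean=10.0, sd=3.0):
    """Normal CDF, standing in for R's pnorm(x, mean, sd)."""
    return 0.5 * (1.0 + math.erf((x - mean) / (sd * math.sqrt(2.0))))


def best_possible_accuracy(g_values):
    """Average, over g, of the probability of the more likely label."""
    total = 0.0
    for g in g_values:
        p = normal_cdf(g)         # P(role assigned | g) under rule 4
        total += max(p, 1.0 - p)  # an ideal classifier predicts the likelier label
    return total / len(g_values)


# Assumption: g is uniform over the integers 0..20.
print(f"{best_possible_accuracy(range(21)):.3f}")
```

Under this assumption the ceiling is about 0.885, so an accuracy of roughly 88% means the classifier has recovered essentially all of the structure that rule 4 contains.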

[Figure 6 plot. x-axis: rule 0, rule 1, rule 2, rule 3, rule 4; y-axis: 0% to 120%; series: accuracy, false positive rates, and false negative rates, each before and after feature selection.]

Figure 6: Performance of SVM on different rules’ data

6. ADDITIONAL ISSUES

In this section, we discuss two specific issues related to learning-based provisioning for RBAC systems.

² We directly use the normal distribution and binomial distribution functions in the R language [12].

6.1 Maintaining An Up-to-Date Training Set

Typical classification problems, e.g. iris species recognition or breast cancer detection, usually have rather “stable” mappings between attribute values and labels. By “stable” we mean that such mappings, if they exist, change very slowly. Thus the more labelled samples are obtained, the more precise the relevant classifier is. Unfortunately, as indicated in [1], mappings in provisioning applications are dynamic, which means that the mappings themselves may change. The change frequency depends on the business properties of organizations. Different mappings may result in different labels for the same instance at different time periods. A new sample representing a new relation may contradict an old sample representing an obsolete relation. Obviously, the presence of contradictory samples in a training set confuses any learning algorithm. Thus it is not always true that the more samples a training set has, the better the result a classifier obtains. A healthy training set should exclude obsolete samples and be kept up-to-date.

Adding new samples is relatively straightforward. New samples mainly come from two sources. The first source is the CIO and Business/Project/IT/HR managers; often permissions need to be changed to comply with new regulations or to address new requirements. The second source is wrongly labelled data. By adding wrongly labelled data with correct labels to the training set, we can prevent new classifiers from repeating previous mistakes.

Removing or updating obsolete samples is equally important and even more complicated. One naive approach to removing obsolete samples is to simply run a conflicting sample detection after adding new samples. Two samples conflict with each other if they have the same attribute values but conflicting answers, Yes and No, for one particular label. It is usually the case that the new sample is the most up-to-date and the old sample should be removed. However, it is still possible that both are correct. One possible situation is that the current feature set is not sufficient to distinguish these two samples. In this situation, some new attributes might be added to the feature set.
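The naive conflict detection described above can be sketched as follows. The sample layout (a tuple of attribute values, a Yes/No label, and a time stamp) and the function names are assumptions made for illustration. The second function implements only the "newest sample wins" heuristic; since both samples of a conflicting pair may be correct, its output is best treated as a proposal for an administrator to review.

```python
from collections import defaultdict


def find_conflicts(samples):
    """Group samples by attribute values; return groups holding both labels.

    Each sample is (attributes, label, timestamp); attributes is a tuple.
    """
    by_attrs = defaultdict(list)
    for attrs, label, ts in samples:
        by_attrs[attrs].append((label, ts))
    return {
        attrs: entries
        for attrs, entries in by_attrs.items()
        if len({label for label, _ in entries}) > 1
    }


def drop_older_conflicts(samples):
    """Within each conflicting group, keep only the newest sample(s)."""
    conflicts = find_conflicts(samples)
    kept = []
    for attrs, label, ts in samples:
        if attrs in conflicts:
            newest_ts = max(t for _, t in conflicts[attrs])
            if ts < newest_ts:
                continue  # older than the newest conflicting sample: presumed obsolete
        kept.append((attrs, label, ts))
    return kept


# Hypothetical training set: the first two samples conflict.
training = [
    (("sales", "EU"), "Yes", 1),
    (("sales", "EU"), "No", 5),
    (("hr", "US"), "Yes", 2),
]
print(drop_older_conflicts(training))
```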

The fact that an old sample does not conflict with any new sample cannot guarantee, however, that the old sample is not obsolete. Due to the limitations of sample collection methods, which are mainly based on experience, we cannot expect the new samples to represent all individual cases arising from new regulations. Therefore it is possible that some old sample contradicts new regulations but is consistent with the new samples. One possible solution is to attach a time stamp (TS) and a time to live (TTL) to each sample. When its TTL expires, a sample has to be reevaluated for its representativeness.
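A minimal sketch of the TS/TTL idea is given below. The `Sample` class, its field names, and the choice of a numeric clock are hypothetical; the text above fixes neither a representation nor a policy for choosing TTL values.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    attributes: tuple
    label: str
    ts: float   # time stamp: when the sample was added or last confirmed
    ttl: float  # time to live, in the same unit as ts


def split_expired(samples, now):
    """Return (still_valid, needs_review) based on TS + TTL."""
    valid, review = [], []
    for s in samples:
        # A sample whose lifetime has elapsed is queued for reevaluation.
        (review if now >= s.ts + s.ttl else valid).append(s)
    return valid, review


old = Sample(("sales", "EU"), "Yes", ts=0, ttl=10)
recent = Sample(("hr", "US"), "No", ts=5, ttl=10)
valid, review = split_expired([old, recent], now=10)
print(len(valid), len(review))
```

Expired samples are separated rather than deleted, because expiry only signals that a sample must be reevaluated; a sample that is confirmed can be given a fresh time stamp and returned to the training set.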

6.2 Enforcing RBAC Constraints

Another distinct feature of role-based provisioning is the need to enforce constraints. Separation of duty (SoD) constraints and cardinality constraints are well-known examples of constraints. Thus an effective training set should incorporate samples that represent these constraints.

One approach to address this requirement consists of two steps. The first step is to interpret each constraint by means of some additional attributes. This can always be done because constraints are essentially additional predicates controlling assignments in RBAC. The second step is to update old samples and add new samples with these additional attributes to enforce the constraints. We first use SoD constraints as an example to illustrate the approach and then apply it to cardinality constraints.

There are many different representations of SoD constraints in RBAC. To be simple and general, an SoD constraint in this paper means that no user can be assigned to both R1 and R2. Now further assume that we have two positive samples in the training set of R1. Negative samples in the training set are not useful for this problem. If the current feature set for R1 is (a1, a2, .., ak), then what we need is a new feature set (a1, a2, .., ak, r2), where the new attribute r2 represents whether a user is in role R2 or not. From the two positive samples, four new samples are generated: two positive samples and two negative samples. In the two new positive samples, the old attributes have the same values that they had in the old positive samples, but attribute r2 has value 0, indicating that a user is assigned to role R1 if the user is not in role R2. In the two new negative samples, the old attributes have the same values as in the old positive samples, but attribute r2 has value 1, indicating that a user cannot be assigned to R1 if the user is in role R2. An important advantage of our solution is that these two steps are mechanical and can thus be automated.
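The mechanical expansion of positive samples described above can be sketched in a few lines. The function name and the attribute values in the usage lines are hypothetical; labels are encoded as 1 (positive) and 0 (negative).

```python
def add_sod_attribute(positive_samples):
    """Expand positive samples of R1 with a new last attribute r2.

    r2 = 0 (user not in R2) keeps the sample positive;
    r2 = 1 (user in R2) turns it into a negative sample.
    """
    augmented = []
    for attrs in positive_samples:
        augmented.append((attrs + (0,), 1))  # not in R2: may be assigned to R1
        augmented.append((attrs + (1,), 0))  # in R2: must not be assigned to R1
    return augmented


# Two hypothetical positive samples for R1 with feature set (a1, a2).
old_positives = [("dev", "senior"), ("dev", "junior")]
for sample in add_sod_attribute(old_positives):
    print(sample)
```

Two positive samples yield four new samples, two positive and two negative, matching the count given in the text.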

To enforce a cardinality constraint, for example 20, what we need is an additional attribute cardinalitytest that represents the result of the cardinality test (20). All samples in the training set are updated to new samples with this additional attribute. The process is similar to the one in the previous example and is thus omitted.
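Since the details are omitted here, the following is only one possible reading, offered as a sketch by analogy with the SoD case. It assumes that the constraint means "at most 20 users per role", that cardinalitytest is 1 when the role can still accept a member and 0 otherwise, and that a failed test always yields a negative label. The function names are hypothetical.

```python
def cardinality_test(current_members, limit=20):
    """Return 1 if the role can still accept a member, else 0 (assumed semantics)."""
    return 1 if current_members < limit else 0


def add_cardinality_attribute(samples):
    """Update (attributes, label) samples with a last attribute cardinalitytest."""
    updated = []
    for attrs, label in samples:
        updated.append((attrs + (1,), label))  # test passed: keep the old label
        updated.append((attrs + (0,), 0))      # test failed: never assign
    return updated


print(cardinality_test(19), cardinality_test(20))
print(add_cardinality_attribute([(("dev", "senior"), 1)]))
```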

7. RELATED WORK

To the best of our knowledge, there are no proposals comparable to ours, since the problem of assigning entitlements to roles, taking contextual constraints into account, has never been investigated before. However, there are approaches using machine learning techniques to mine roles [9] and assign users to roles [13]. Association rule mining and clustering techniques have been applied to discover roles [9]. A manually constructed decision tree was used by Sheng et al. [13] to illustrate how a learned decision tree could assign a user to a group. However, no specific issues related to a learning-based approach, like those identified in this paper, were discussed, and no experiments were carried out to demonstrate that the approach was feasible.

Many different approaches have been proposed to solve the problem of “many-to-many mapping” in classification. We can group the existing methods into two main categories: a) problem transformation methods, which transform the multi-label classification problem into one or more single-label classification problems [3, 8], and b) algorithm modification methods, which extend specific learning algorithms to handle multi-label data directly [6, 14]. We have chosen the first approach for two reasons. First, the second approach is limited to one particular classification algorithm, because different classification algorithms require different modifications. For instance, in order to adapt to multi-label problems, Clare et al. [6] modified the entropy calculation in the C4.5 decision tree, while Tsochantaridis et al. [14] introduced discriminant functions and modified the loss function in the SVM light support vector machine. As a general framework, we prefer to preserve the ability for users to choose the best classification algorithm for different provisioning data sets. Second, transformations can preserve all information, while modified algorithms may lose information [14] during the additional data processing required by the multi-label approaches, which can decrease performance. For security purposes, we usually want the best accuracy.
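As an illustration of category a), the sketch below shows one common problem transformation, which builds one binary (single-label) data set per label so that any off-the-shelf classifier can be trained on each. The data layout, the function name, and the example values are assumptions made for illustration, not a description of our implementation.

```python
def to_binary_problems(samples, all_labels):
    """Problem transformation: build one binary data set per label.

    samples: list of (attributes, set_of_labels).
    Returns {label: [(attributes, 0 or 1), ...]}.
    """
    return {
        label: [(attrs, int(label in labels)) for attrs, labels in samples]
        for label in all_labels
    }


# Hypothetical entitlements, each mapped to the set of roles that receive it.
entitlements = [
    (("app1", "read"), {"Lead Manager"}),
    (("app1", "write"), {"Lead Manager", "Business Administrator"}),
    (("app2", "read"), set()),
]
roles = ["Lead Manager", "Business Administrator"]
for role, dataset in to_binary_problems(entitlements, roles).items():
    print(role, dataset)
```

Each resulting data set is an ordinary two-class problem, which is what leaves the choice of classification algorithm open.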

8. CONCLUSION

We have proposed a learning-based approach to reduce the involvement of human experts in role-based provisioning. Specific learning problems are identified and solved. The effectiveness of our approach has been confirmed by extensive experiments on real provisioning data. One interesting problem is how to verify whether the functions learnt comply with high-level security goals. We thus plan to further investigate the functions learnt and to compare them with functions representing high-level security goals in the future.

9. REFERENCES

[1] Beyond Roles: A Practical Approach to Enterprise User Provisioning. Technical report, Hitachi ID System, INC.

[2] M. A. Al-Kahtani and R. S. Sandhu. Rule-based RBAC with negative authorization. In ACSAC, pages 405–415. IEEE Computer Society, 2004.

[3] M. R. Boutell, J. Luo, X. Shen, and C. M. Brown. Learning multi-label scene classification. Pattern Recognition, 37(9):1757–1771, 2004.

[4] L. Breiman. Bagging predictors. Machine Learning, 24(2):123–140, 1996.

[5] C.-C. Chang and C.-J. Lin. LIBSVM: a library for support vector machines. National Taiwan University, version 2.86 edition, April 2008.

[6] A. Clare and R. D. King. Knowledge discovery in multi-label phenotype data. In L. D. Raedt and A. Siebes, editors, PKDD, volume 2168 of Lecture Notes in Computer Science, pages 42–53. Springer, 2001.

[7] A. Ene, W. Horne, N. Milosavljevic, P. Rao, R. Schreiber, and R. E. Tarjan. Fast exact and heuristic methods for role minimization problems. In SACMAT 2008, pages 1–10.

[8] R.-E. Fan and C.-J. Lin. A study on threshold selection for multi-label classification. Department of Computer Science, National Taiwan University, 2007.

[9] M. Kuhlmann, D. Shohat, and G. Schimpf. Role mining - revealing business roles for security administration using data mining technology. In SACMAT 2003, pages 179–186.

[10] I. Molloy, H. Chen, T. Li, Q. Wang, N. Li, E. Bertino, S. B. Calo, and J. Lobo. Mining roles with semantic meanings. In SACMAT 2008, pages 21–30.

[11] E. Perkins and P. Carpenter. Magic Quadrant for User Provisioning, Aug 2008. Gartner RAS Core Research Note G00159740.

[12] R Development Core Team. R: a language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria, 2005. ISBN 3-900051-07-0. Available at http://www.R-project.org.

[13] S. Sheng and S. L. Osborn. A classifier-based approach to user-role assignment for web applications. In Secure Data Management, volume 3178 of LNCS, pages 163–171. Springer, 2004.

[14] I. Tsochantaridis, T. Hofmann, T. Joachims, and Y. Altun. Support vector machine learning for interdependent and structured output spaces. In ICML 2004.

[15] J. Vaidya, V. Atluri, Q. Guo, and N. R. Adam. Migrating to optimal RBAC with minimal perturbation. In SACMAT 2008, pages 11–20.

[16] H.-L. Wei and S. A. Billings. Feature subset selection and ranking for data dimensionality reduction. IEEE Trans. Pattern Anal. Mach. Intell., 29(1):162–166, 2007.

[17] I. H. Witten and E. Frank. Data Mining: Practical Machine Learning Tools and Techniques. Morgan Kaufmann, San Francisco, second edition, June 2005. ISBN 0-12-088407-0.

[18] R. J. Witty. Suntrust implements role-based provisioning to a successful conclusion. Technical Report G00132894, Gartner RAS Core Research, Jan 12 2006.
