[acm press the 27th annual computer security applications conference - orlando, florida...

BLOCK: A Black-box Approach for Detection of State Violation Attacks Towards Web Applications

Xiaowei Li
Department of Electrical Engineering and Computer Science
Vanderbilt University

[email protected]

Yuan Xue
Department of Electrical Engineering and Computer Science
Vanderbilt University

[email protected]

ABSTRACT

State violation attacks towards web applications exploit logic flaws and allow restrictive functions and sensitive information to be accessed at inappropriate states. Since application logic flaws are specific to the intended functionality of a particular web application, it is difficult to develop a general approach that addresses state violation attacks. To date, existing approaches all require web application source code for analysis or instrumentation in order to detect state violations.

In this paper, we present BLOCK, a BLack-bOx approach for detecting state violation attaCKs. We regard the web application as a stateless system and infer the intended web application behavior model by observing the interactions between the clients and the web application. We extract a set of invariants from the web request/response sequences and their associated session variable values during its attack-free execution. The set of invariants is then used for evaluating web requests and responses at runtime. Any web request or response that violates the associated invariants is identified as a potential state violation attack. We develop a system prototype based on the WebScarab proxy and evaluate our detection system using a set of real-world web applications. The experiment results demonstrate that our approach is effective at detecting state violation attacks and incurs acceptable performance overhead. Our approach is valuable in that it is independent of the web application source code and can easily scale up.

Keywords: black-box approach, state violation attack, web application security, invariant

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
ACSAC '11 Dec. 5-9, 2011, Orlando, Florida USA
Copyright 2011 ACM 978-1-4503-0672-0/11/12 ...$10.00.

1. INTRODUCTION

During the past decade, web applications have become the most prevalent way for service delivery over the Internet. As they become deeply embedded in business activities and are required to support sophisticated functionalities, the design and implementation of web applications are becoming more and more complicated. The increasing popularity and complexity make web applications a primary target for hackers on the Internet. According to a recent survey [22], attacks against web applications account for 63% of all Internet exploits. These attacks are commonly classified into two classes [7]: 1) input validation attacks, which exploit the application's insufficient or erroneous sanitization of user inputs, allowing malicious code to be injected into web applications, and 2) state violation attacks, which exploit logic flaws in web applications, allowing restrictive functions and sensitive information to be accessed at inappropriate states. For example, an authentication bypass attack allows the attacker to perform administrative operations over the web application.

This paper focuses on state violation attacks. While there is a large body of literature on input validation attacks (e.g., [12, 25, 15, 11]), very few works address state violation attacks [4, 7, 10]. The major challenge for defending against state violation attacks comes from the fact that application logic flaws are specific to the intended functionality of a particular web application. Thus it is difficult to develop a general approach that addresses state violation attacks towards different applications. Clearly, the key to approaching state violation attacks is to derive the intended behavior model of a particular web application, i.e., the specification of the web application. The existing works have presented both static and dynamic techniques to infer the web application specification. For example, MiMoSA [4] analyzes the application source code to derive an intended workflow graph; Swaddler [7] establishes models of session variables for each program block of the application based on the execution traces; Waler [10] infers likely invariants of session variables at each program point during execution. Such specifications can then be either leveraged by model checking (e.g., MiMoSA, Waler) to identify vulnerabilities within the implementation or used at runtime for detection of relevant attacks (e.g., Swaddler). However, these existing works are limited in two aspects. First, they all require the web application source code for instrumentation, which may not be available in practice. Second, they all infer the application specification at the program level, which makes their approach closely coupled with the programming languages (e.g., PHP, JSP) and frameworks. The correctness and accuracy of their derived specifications are highly dependent on and thus limited by their capability of dealing with language-level details. For example, MiMoSA and Swaddler cannot handle object-oriented programs gracefully.

In this paper, we present BLOCK, a BLack-bOx approach for detecting state violation attaCKs against web applications. Due to the stateless nature of the HTTP protocol, session variables are explicitly defined in web applications to maintain the state of a web session. Session variables can be maintained either at the client side (i.e., via cookies, URL rewriting or hidden forms) or at the server side with a session ID issued to the client for indexing. The key idea of BLOCK is to infer the intended behavior model of the web application (i.e., its specification) by observing the web request/response sequences and their associated session variable values during attack-free executions. Then, the inferred model is used for evaluating web requests and responses at runtime, combined with the current session information. Any web request or response that violates the model is identified as a potential state violation attack and blocked. In particular, we leverage the stateless property of HTTP and regard the vector of current values of session variables, along with the web request, as the input to the application, and the web response and the updated session variables as the output. In this way, the web application can be approximated as a stateless system. Under this stateless system model, we characterize the application behavior from three aspects in the form of likely invariants: 1) input invariants, which model the relationship between the web requests and the session variable values, 2) input/output invariants, which capture the relationship between the web request and response as well as the changes in the session variables after the web request is processed, and 3) input/output sequence invariants, which leverage the historical web request/response pair sequences to capture the application states that are not revealed by defined session variables.

To our knowledge, BLOCK is the first black-box technique that addresses state violation attacks towards web applications. Our approach is independent of the application source code and able to handle a variety of programming frameworks. Thus, it can scale up to protect a large number of web applications.

Our contributions are summarized as follows:

• We propose a black-box approach for detecting state violation attacks. We regard the web application as a stateless system and model the relations within web requests, responses and session variables using a set of invariants.

• We implement a prototype of our detection system, which is able to observe and analyze the interactions between the clients and the web application, and to detect and block state violation attacks.

• We evaluate our detection system using a set of open source web applications. The detection results show that our approach is effective at detecting state violation attacks and incurs acceptable performance overhead.

The rest of this paper is organized as follows. Section 2 illustrates the state violation attacks we target. Our approach is presented in detail in Section 3. Section 4 describes the implementation of our detection system prototype. Evaluation setup and results are given in Section 5. Section 6 discusses related work and Section 7 concludes this paper.

2. STATE VIOLATION ATTACK

A web application manages the clients' session states to control the access over its restrictive functions and sensitive information, as well as to enforce desired state transitions. Although most current web application development frameworks provide session management mechanisms, it is still the developer's responsibility to define and check session variables at appropriate program points, which is usually done in an ad-hoc manner. Three types of vulnerabilities are possibly introduced into the web application: (1) insufficient definition of session variables for differentiating all possible states; (2) insufficient checking of session variables at appropriate program points; (3) erroneous checking of session variables that can be bypassed. They all make the web application vulnerable to state violation attacks (also referred to as the workflow violation attack in Swaddler [7]). The attacker can launch state violation attacks by sending web requests that violate the requirements the developers expect web requests to meet at the current application state. We use a small PHP web application, shown in Fig. 1, which contains several state management vulnerabilities, as an example to illustrate state violation attacks. This example is also used throughout the paper to demonstrate how we address these attacks.

The first example of state violation attack is authentication/authorization (simplified as auth hereafter) bypass. The web application controls the access over its functions by checking session variables indicating the user privilege before its restrictive functions can be executed. If the application is not at the required state, the web application will redirect the user to the login page, authorization page or an error page. However, if there exists a path leading to the restrictive function with insufficient or erroneous checking of session variables, the attacker is able to bypass the authentication/authorization. The example application demonstrates three cases of auth bypass attacks. admin.php and admin2.php contain restrictive functions, which should only be accessed by admin users when the session variable $_SESSION['privilege'] is set to the value of admin.

• In admin.php, there is no check on the session variable $_SESSION['privilege']. The attacker, being either a guest or a regular user, can directly request the page and access the admin functions.

• In admin2.php, even though there is an if condition check on the variable $privilege, the attacker can append an additional parameter privilege to the URL, for example http://example.com/admin2.php?privilege=admin, and bypass the auth check. The reason is that when the register_globals option of the PHP interpreter is enabled, a parameter attached to the web request is automatically bound to a global variable if no such variable exists in the current session state. This vulnerability results from the inappropriate or erroneous check on the session variable.

• In admin2.php, even when the auth check fails, the attacker is able to execute the restrictive functions after the redirection (i.e., the header function) by submitting a POST request with the parameter title, and thus change the application's title successfully. This is because there is no exit function or additional check after the redirection.

login.php

    <?php
    include_once("header.php");
    if (isset($_GET['logout'])) {
        session_start();
        unset($_SESSION['username']);
        unset($_SESSION['privilege']);
        session_destroy();
        print "You are logged out.<br>";
    } else if (isset($_POST['email'])) {
        if (validateLogin($_POST['email'], $_POST['passwd'])) {
            $_SESSION['username'] = $_POST['email'];
            if ($_POST['email'] == $admin_email) {
                $_SESSION['privilege'] = "admin";
            } else {
                $_SESSION['privilege'] = "user";
            }
            header("Location: index.php?username=" . $_SESSION['username']);
            exit();
        } else {
            die("Wrong username or password");
        }
    }
    ?>
    <form action='login.php' method=post>
    username: <input name="email" type="text"><br>
    password: <input name="passwd" type="password"><br>
    <input name="submit" type="submit">
    </form>
    <?php include_once 'footer.html'; ?>

admin.php

    <?php
    include_once 'header.php';
    logIdentity();
    print "<a href='admin2.php'>Next step: change the title</a>";
    include 'footer.html';
    ?>

index.php

    <?php
    include_once 'header.php';
    if (isset($_GET['username'])) {
        $userid = $_GET['username'];
        showUserInfo($userid);
        if ($_SESSION['privilege'] == "admin") {
            print "<a href='admin.php'>Admin link</a><br >";
        }
    }
    print "<a href='login.php?logout=1'>Logout</a><br>";
    include_once 'footer.html';
    ?>

admin2.php

    <?php
    include_once 'header.php';
    if ($privilege != "admin") {
        header("Location: index.php?username=" . $_SESSION['username']);
    }
    if (isset($_POST['title'])) {
        modifyTitle($_POST['title']);
    }
    ?>
    <form action='admin2.php' method=post>
    New title: <input name="title" type="text"><br>
    <input name="submit" type="submit">
    </form>
    <a href='login.php?logout=1'>Logout</a>
    <?php include_once 'footer.html'; ?>

Figure 1: Example Application

The second example of state violation attack is parameter manipulation. In many cases, the web application assumes implicit relations between the user's input parameters within web requests and the session state. Such a relationship may also be reflected in the web responses returned by the web application. If the application doesn't check the session state when accepting the web request, the attacker is able to manipulate the input parameters and gain access to unauthorized information. In the example application, after the user logs in, he/she will be redirected to the index.php page, which displays his/her personal information. The web application assumes the request parameter username is always equal to the value of the session variable $_SESSION['username']. If the equality relationship is not examined when the user's personal information is retrieved, the attacker is able to view any user's information by modifying the username parameter within the web request.

The third state violation attack is workflow bypass. A web application usually has an intended workflow, which requires the user to perform a predefined sequence of operations to complete a certain task. For example, an e-commerce website has a predefined checkout procedure, which instructs the customer to first fill in the shipping information and then the credit card information before the order can be confirmed and submitted. Such a temporal relationship is enforced by the restrictions over the session state transitions. However, if the session variables are insufficiently defined or checked for guarding the desired state transitions, the attacker is able to bypass certain required steps and violate the intended workflow. The example application requires the admin user to first access admin.php, which logs his/her identity (by the logIdentity function), before he/she can modify the application title in admin2.php. The two steps indicate two different session states and the transition between them should be guarded by the web application. However, there is no session variable defined for indicating whether the identity of the admin user has been logged or not. The attacker can directly request the admin2.php page without his/her identity being logged.

3. APPROACH

Our approach for detecting state violation attacks has two key phases. In the training phase, the intended behavior model of the web application (i.e., the specification) is derived by observing the web request/response sequences and the corresponding session variable values during its attack-free execution. In the detection phase, the inferred model is used to evaluate each incoming web request and outgoing web response and detect any violations.

Due to the stateless nature of the HTTP protocol, session variables are explicitly defined in web applications to maintain the state of a web session. There are two ways of maintaining session states: 1) client side only, where session states are directly carried in cookies, hidden forms, or URLs; 2) collaboration of the client and the server, where the server stores the session states and issues a session ID to the client for indexing its session states. In either case, session states can be retrieved at runtime for each web request independent of the web application implementation. For example, when session states are carried in cookies, hidden forms, or rewritten URLs, they can be directly retrieved from the web requests. When session states are kept at the server side, they can be found either in a file or a database table. In the case of PHP, the session state is by default stored in temporary files located at /var/lib/php5, which are indexed by the session ID within web requests, while in the case of JSP, the session state is persisted in database tables.

One straightforward approach to modeling the application behavior is to derive its states from the session variables and their values directly. Yet, this approach has several issues: 1) at one application state, session variables may exhibit a large range of values. For example, $_SESSION['username'] can assume as many possible values as the number of registered users in the same application state. Thus, directly using session variable values to differentiate application states may result in a large number of spurious states; 2) the definition of session variables may be missing from the application implementation. As a result, two application states in the specification cannot be differentiated by the collection of all session variables. For example, in the application shown in Fig. 1, there is no session variable defined for indicating whether the admin user identity has been logged.

Our approach follows the stateless property of HTTP and regards the session variables as part of the input to the web application along with web requests. Similarly, the output of the application consists of the web response and the session variables. In this way, the web application can be regarded as a stateless system, as shown in Fig. 2. Under this stateless system model, we characterize the application behavior in the form of three types of likely invariants. 1) Type I input invariants: recall that the web application input consists of the web request and the values of the session variables when the request is made. This type of invariants models the relationship between the web requests and the session variable values. Essentially, it tries to capture the constraints on the web requests at certain session states. By identifying the invariant component of session variables, this approach avoids the introduction of spurious states by unnecessary session variables. 2) Type II input/output invariants: this type of invariants models the relationship between the web request and response as well as the changes in the session variables after the web request is processed. Essentially, it tries to capture the constraints on the application state transition and the input/output dependency at a certain state. Both type I and II invariants rely on the session variables to infer the application states. When the session variables are not sufficiently defined, we need a third type of invariant. 3) Type III input/output sequence invariants: this type of invariants models the relationship between consecutive web request/response pairs. Essentially, it tries to capture the application states that are not revealed by defined session variables by leveraging the historical request/response information. In the following sections, we first formalize our system model and then illustrate how to extract the three types of invariants and apply them to runtime detection.

3.1 System Model

As shown in Fig. 2, a web application is regarded as a stateless system F, which accepts an input m_in and emits an output m_out, expressed as F(m_in) = m_out. An input m_in consists of a web request and a set of session variable name/value pairs S(m_in). To facilitate detection, we further decompose a web request into two components: a web request key r(m_in), which includes the HTTP request method and the target file, and a set of input parameter name/value pairs P(m_in). In this paper, we only consider GET and POST methods and focus on PHP pages. For example, the web request keys in the application shown in Fig. 1 include GET-login.php and POST-login.php. Similarly, an output consists of a web response and a set of session variable name/value pairs S(m_out). A web response is a synthesized web page, which is usually generated by filling dynamic contents into a static web page structure (i.e., a template). To deal with the infinite number of possible web responses, we decompose a web page into a web template, the number of which is finite, and a set of dynamic contents, which become output parameters. If we assign a unique ID to each static template, a web response can be symbolized as a web template ID (i.e., web response key v(m_out)) and a set of output parameter name/value pairs Q(m_out). In the next section, we illustrate how to symbolize a web page into a web template with a set of output parameters.

Input m_in:
  r (request key): POST http://example.com/login.php
  P (input parameter): username=testuser&passwd=xx
  S (session variable): $_SESSION['privilege']=null & $_SESSION['username']=null

Web application F

Output m_out:
  v (response key): Template: t.index_user
  Q (output parameter): /html/body: user info for testuser
  S (session variable): $_SESSION['privilege']="user" & $_SESSION['username']="testuser"

Figure 2: A stateless view of web application
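The decomposition of Section 3.1 can be pictured as two plain records. The following Python sketch is only an illustration, not the prototype's data format: the class names, field names and the helper `request_key` are hypothetical, and the sample values are copied from Fig. 2.

```python
from dataclasses import dataclass
from typing import Dict, Optional
from urllib.parse import urlparse


@dataclass
class Input:
    """m_in: what the stateless system F consumes."""
    request_key: str                    # r(m_in), e.g. "POST-login.php"
    params: Dict[str, str]              # P(m_in), input parameter name/value pairs
    session: Dict[str, Optional[str]]   # S(m_in), session variables before the request


@dataclass
class Output:
    """m_out: what the stateless system F emits."""
    response_key: str                   # v(m_out), the web template ID
    params: Dict[str, str]              # Q(m_out), output parameter name/value pairs
    session: Dict[str, Optional[str]]   # S(m_out), session variables after the request


def request_key(method: str, url: str) -> str:
    """Combine the HTTP method and the target file into a request key."""
    target = urlparse(url).path.rsplit("/", 1)[-1]
    return f"{method.upper()}-{target}"


# Sample input/output pair, with the values shown in Fig. 2.
m_in = Input(
    request_key=request_key("POST", "http://example.com/login.php"),
    params={"username": "testuser", "passwd": "xx"},
    session={"privilege": None, "username": None},
)
m_out = Output(
    response_key="t.index_user",
    params={"/html/body: user info for": "testuser"},
    session={"privilege": "user", "username": "testuser"},
)
```

Because the query string is not part of the request key, a request such as http://example.com/admin2.php?privilege=admin still maps to GET-admin2.php, and the appended parameter shows up in P(m_in) instead.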

3.2 Web Page Symbolization

To symbolize a web page, we first extract the web templates (O) from all the observed web pages (D). Then, given a web page d ∈ D, we classify it into the most likely template (v) and extract the set of output parameters (Q) accordingly. Techniques for extraction of templates from web pages have been presented in the existing literature [19, 13]. In this paper, we leverage the method from TEXT [13], which expresses the DOM tree structure of a web page as a set of essential paths. Our template extraction procedure contains the following four steps. Steps 1 and 2 are similar to TEXT, and steps 3 and 4 are designed to fulfill the purpose of template extraction in our context.

(1) Transformation: the DOM tree structure of a web page d is first transformed into a set of paths P_d. Here, we focus on the paths that lead to the leaf text nodes, which carry the information sent back to the clients within web pages. An index page from our example application can be expressed as three paths: "/html/body/Welcome to the application", "/html/body/user information for: testuser." and "/html/body/a/logout", as shown in Fig. 3.

Web page (all paths):
  /html/body/Welcome to the example application.
  /html/body/user information for: testuser
  /html/body/a/Logout

Template paths (t.index_user):
  /html/body/Welcome to the example application.
  /html/body/a/Logout

Output parameter:
  /html/body/user information for: testuser

Figure 3: An example of page symbolization

(2) Pruning: to extract templates from all the paths, those paths that lead to dynamic contents should be pruned. To do so, we define the support of a path as the number of pages in D that contain the path. Since the occurrence of a path that belongs to a template is generally higher, paths with low support are most likely dynamic contents and should be pruned. For each page d, the minimum support threshold t_d is defined as the mode (i.e., the most frequent value) of the support values of the paths that are contained in the page. Note that using one threshold for all the pages is inappropriate, as each template may generate a different number of pages. After the paths with support lower than the threshold are pruned, each page is expressed as a set of "essential" paths. We use ep(d) to denote the number of essential paths contained in the page d.

(3) Clustering: two web pages are probably generated from the same web template if they have similar sets of essential paths. The similarity (Dist) between two pages d_i and d_j is defined as follows:

\[ \mathrm{Dist}(d_i, d_j) = \frac{cp(d_i, d_j)}{\sqrt{ep(d_i) \times ep(d_j)}} \tag{1} \]

where cp(d_i, d_j) is the number of common essential paths contained in d_i and d_j. We then perform hierarchical agglomerative clustering over all pages based on the above similarity metric. Each resulting cluster corresponds to a web template. The essential path set of a new template is the intersection of the path sets from the two templates that are merged together.

(4) Parametrization: for each page in D, after eliminating the essential paths contained in the template it belongs to, the remaining paths in its path set belong to output parameters. The parameter is identified by the path leading to the text node and its value is the content of the text node. We extract those parameters that are observed in all the pages that belong to the template as the set of output parameters of the template. For each parameter, we put the common parts (i.e., tokens) from all the observed values into the parameter name and only extract the variable part as its value.
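Steps 1, 2 and 4 can be tried out on a toy example. The sketch below uses Python's standard `html.parser` rather than the TEXT implementation the paper builds on, so it should be read as an approximation under stated assumptions: the markup produced by `index_page` is invented, the tie-break when several support values are equally frequent (take the largest) is not specified in the text, and `parametrize` assumes exactly one leftover path per page.

```python
from collections import Counter
from html.parser import HTMLParser
from statistics import multimode

VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}  # elements without a closing tag


class PathCollector(HTMLParser):
    """Step 1: collect the paths that lead to leaf text nodes."""

    def __init__(self):
        super().__init__()
        self.stack = []
        self.paths = set()

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag in self.stack:
            while self.stack.pop() != tag:  # also drops unclosed inner tags
                pass

    def handle_data(self, data):
        text = " ".join(data.split())
        if text:
            self.paths.add("/" + "/".join(self.stack + [text]))


def page_paths(html):
    collector = PathCollector()
    collector.feed(html)
    collector.close()
    return collector.paths


def prune(pages):
    """Step 2: keep the paths whose support reaches the page's own threshold t_d."""
    support = Counter(path for paths in pages for path in paths)
    essential = []
    for paths in pages:
        if not paths:
            essential.append(set())
            continue
        # t_d is the mode of the support values; taking the largest mode is an assumption.
        threshold = max(multimode(support[p] for p in paths))
        essential.append({p for p in paths if support[p] >= threshold})
    return essential


def parametrize(leftovers):
    """Step 4: shared leading tokens form the parameter name, the rest is the value."""
    tokens = [path.split(" ") for path in leftovers]
    shared = 0
    for column in zip(*tokens):
        if len(set(column)) != 1:
            break
        shared += 1
    name = " ".join(tokens[0][:shared])
    return name, [" ".join(t[shared:]) for t in tokens]


def index_page(user):
    """Hypothetical markup for the example application's index page."""
    return ("<html><body>Welcome to the example application.<br>"
            f"user information for: {user}<br>"
            "<a href='login.php?logout=1'>Logout</a></body></html>")


pages = [page_paths(index_page(u)) for u in ("alice", "bob", "carol")]
essential = prune(pages)
leftovers = [next(iter(p - e)) for p, e in zip(pages, essential)]
name, values = parametrize(leftovers)
```

With three observed index pages, the two static paths have support 3 and each user-specific path has support 1, so the per-page threshold is 3 and only the user-specific path is left over as an output parameter.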

For the example application, we obtain seven templates. They are the login page (t.login_form), logout page (t.logout), wrong login page (t.wrong_login), regular user information page (t.index_user), admin user information page (t.index_admin), logging identity page (t.admin) and the title change page (t.title_form). As shown in Fig. 3, the template t.index_user has a parameter "/html/body/user information for:", which displays the user's information and whose value is the current user name.

Given a web page d, it is first transformed into a set of paths. Then, it is classified into the template v that has the highest similarity with its path set (i.e., v = argmax Dist(d, v_i), v_i ∈ O). The corresponding output parameters for the template are finally extracted.
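Equation (1), the merge rule for templates and the final classification step are small enough to sketch directly. The Python code below is a simplified stand-in: the greedy pairwise merging and the stopping value `threshold=0.8` are assumptions (the text does not state when clustering stops), and the form path and the user name dave are made up for the example.

```python
from math import sqrt


def dist(paths_i, paths_j):
    """Equation (1): common essential paths over the geometric mean of the set sizes."""
    if not paths_i or not paths_j:
        return 0.0
    return len(paths_i & paths_j) / sqrt(len(paths_i) * len(paths_j))


def cluster(pages, threshold=0.8):
    """Greedy agglomerative clustering; `threshold` is an assumed stopping rule."""
    clusters = [set(p) for p in pages]
    while len(clusters) > 1:
        best, pair = -1.0, None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                score = dist(clusters[i], clusters[j])
                if score > best:
                    best, pair = score, (i, j)
        if best < threshold:
            break
        i, j = pair
        merged = clusters[i] & clusters[j]  # new template = intersection of path sets
        clusters = [c for k, c in enumerate(clusters) if k not in pair] + [merged]
    return clusters


def classify(paths, templates):
    """Return the ID of the template with the highest similarity to `paths`."""
    return max(templates, key=lambda tid: dist(paths, templates[tid]))


WELCOME = "/html/body/Welcome to the example application."
LOGOUT = "/html/body/a/Logout"
FORM = "/html/body/form/username:"  # hypothetical path on the login page

observed = [{WELCOME, LOGOUT}, {WELCOME, LOGOUT}, {FORM}]
found = cluster(observed)

templates = {"t.index_user": {WELCOME, LOGOUT}, "t.login_form": {FORM}}
new_page = {WELCOME, LOGOUT, "/html/body/user information for: dave"}
best = classify(new_page, templates)
```

For the new page, two of its three paths are shared with the two essential paths of t.index_user, giving Dist = 2/√6 ≈ 0.82, while it shares nothing with t.login_form.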

3.3 Invariant Extraction

We extract three types of invariants: (1) type I input invariants, indexed by the web request key r; (2) type II input/output invariants, indexed by the key pair (r, v); (3) type III input/output sequence invariants, also indexed by the request key r. We also show some example invariants extracted from the application in Fig. 1.

3.3.1 Type I Invariant

The inputs with the same request key r are grouped together. We extract the following types of invariants for each request key r.

(1) A set of session variables S_inv(r) that are always present. An example invariant of this type is S_inv(GET-index.php) = {$_SESSION['username'], $_SESSION['privilege']}.

(2) A set of input parameters P_inv(r) that are always present. An example invariant of this type is P_inv(POST-login.php) = {email, passwd}.

(3) For a specific session variable s ∈ S_inv(r), its value is drawn from an enumeration set V(s, r). For example, invariants of this type include V($_SESSION['privilege'], GET-admin.php) = {admin} and V($_SESSION['privilege'], GET-index.php) = {admin, user}.

(4) For a specific input parameter p ∈ P_inv(r), its value is drawn from an enumeration set V(p, r).

(5) The value of an input parameter p ∈ P_inv(r) is always equal to the value of a session variable s ∈ S_inv(r). For the request key GET-index.php, the session variable $_SESSION['username'] is always equal to the input parameter username.
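The five invariant kinds above can be approximated with a few set operations over the grouped traces. The prototype described later delegates value-related invariants to Daikon, so the plain-Python version below is only a stand-in: the trace layout, the function name `extract_type1`, the cut-off `max_enum` (how many distinct values still count as an enumeration) and the sample users are all assumptions.

```python
from collections import defaultdict


def extract_type1(traces, max_enum=3):
    """traces: iterable of (request_key, params, session) tuples from attack-free runs."""
    grouped = defaultdict(list)
    for r, params, session in traces:
        grouped[r].append((params, session))

    invariants = {}
    for r, observations in grouped.items():
        # (1) and (2): names that are present in every observation of this request key.
        p_inv = set.intersection(*(set(p) for p, _ in observations))
        s_inv = set.intersection(*(set(s) for _, s in observations))

        # (3) and (4): enumeration sets, kept only when few distinct values were seen.
        enums = {}
        for kind, index, names in (("session", 1, s_inv), ("param", 0, p_inv)):
            for name in names:
                values = {obs[index][name] for obs in observations}
                if len(values) <= max_enum:
                    enums[(kind, name)] = values

        # (5): input parameter always equal to a session variable.
        equal = {
            (p, s)
            for p in p_inv
            for s in s_inv
            if all(params[p] == session[s] for params, session in observations)
        }
        invariants[r] = {"S_inv": s_inv, "P_inv": p_inv, "V": enums, "equal": equal}
    return invariants


traces = [
    ("GET-index.php", {"username": "alice"}, {"username": "alice", "privilege": "user"}),
    ("GET-index.php", {"username": "bob"}, {"username": "bob", "privilege": "user"}),
    ("GET-index.php", {"username": "root"}, {"username": "root", "privilege": "admin"}),
    ("GET-admin.php", {}, {"username": "root", "privilege": "admin"}),
]
inv = extract_type1(traces, max_enum=2)
```

On these traces the sketch recovers V($_SESSION['privilege'], GET-admin.php) = {admin} and the username equality for GET-index.php, while the user name itself takes too many values to be treated as an enumeration.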

3.3.2 Type II Invariant

The input/output pairs with the same key pair (r, v) are grouped together. We first extract the same set of invariants as type I for the key pair. For example, an invariant drawn for the key pair (GET-login.php, t.logout) is that V(logout, (GET-login.php, t.logout)) = {1}, and the input parameter logout is added into P_inv(GET-login.php, t.logout).

We also extract two new invariants for each key pair (r, v):

(1) The value of an output parameter is always equal to the value of an input parameter and/or a session variable. This invariant reflects the dataflow within the web application. An invariant for the key pair (POST-login.php, t.index_user) is that the output parameter /html/body/ of the template t.index_user is always equal to the session variable $_SESSION['username'] and the input parameter username.

(2) The session state is unchanged. For example, the user's session state always stays the same for the key pair (GET-login.php, t.login_form), but evolves for the key pair (POST-login.php, t.index_user).
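The two additional type II invariants can be sketched in the same style. In this Python illustration each trace is a dictionary with hypothetical keys (`r`, `v`, `P`, `S_in`, `S_out`, `Q`); comparing output parameters against the session variables after the request (`S_out`) rather than before it is an assumption, since the text does not say which snapshot is used.

```python
from collections import defaultdict


def extract_type2(traces):
    """Group by key pair (r, v) and derive the dataflow and session-unchanged invariants."""
    grouped = defaultdict(list)
    for trace in traces:
        grouped[(trace["r"], trace["v"])].append(trace)

    invariants = {}
    for key, observations in grouped.items():
        q_inv = set.intersection(*(set(t["Q"]) for t in observations))
        output_equals = {}
        for q in q_inv:
            sources = set()
            for kind, field in (("param", "P"), ("session", "S_out")):
                names = set.intersection(*(set(t[field]) for t in observations))
                sources |= {
                    (kind, n)
                    for n in names
                    if all(t[field][n] == t["Q"][q] for t in observations)
                }
            if sources:
                output_equals[q] = sources  # (1) output parameter mirrors these inputs
        unchanged = all(t["S_in"] == t["S_out"] for t in observations)  # (2)
        invariants[key] = {"output_equals": output_equals, "session_unchanged": unchanged}
    return invariants


PARAM = "/html/body/user information for:"


def index_trace(user):
    session = {"username": user, "privilege": "user"}
    return {"r": "GET-index.php", "v": "t.index_user", "P": {"username": user},
            "S_in": session, "S_out": dict(session), "Q": {PARAM: user}}


traces = [
    index_trace("alice"),
    index_trace("bob"),
    {"r": "GET-login.php", "v": "t.logout", "P": {"logout": "1"},
     "S_in": {"username": "alice", "privilege": "user"}, "S_out": {}, "Q": {}},
]
inv2 = extract_type2(traces)
```

Here the index page's output parameter is tied to both the username input parameter and the username session variable, and the logout pair is recognised as one that changes the session state.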

3.3.3 Type III Invariant

For each request key r, we extract the following invariant:

(1) A set of input/output key pairs that always precede the web request key in one session. Invariants of this type are that the key pair (GET-admin.php, t.admin) always precedes the request key GET-admin2.php and that the key pair (GET-admin2.php, t.title_form) always occurs before POST-admin2.php.
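One way to compute "always precedes" is to intersect, over every occurrence of a request key, the set of key pairs already seen in that session. The Python sketch below assumes each training session is available as an ordered list of (r, v) pairs; the function name and the two sample sessions (including the response templates attached to each request) are hypothetical.

```python
def extract_type3(sessions):
    """Map each request key to the key pairs that preceded every one of its occurrences."""
    preceding = {}
    for session in sessions:
        seen = set()  # key pairs logged so far in this session
        for r, v in session:
            if r in preceding:
                preceding[r] &= seen
            else:
                preceding[r] = set(seen)
            seen.add((r, v))
    return preceding


sessions = [
    [("POST-login.php", "t.index_admin"),
     ("GET-admin.php", "t.admin"),
     ("GET-admin2.php", "t.title_form"),
     ("POST-admin2.php", "t.title_form")],
    [("GET-login.php", "t.login_form"),
     ("POST-login.php", "t.index_admin"),
     ("GET-index.php", "t.index_admin"),
     ("GET-admin.php", "t.admin"),
     ("GET-admin2.php", "t.title_form")],
]
preceding = extract_type3(sessions)
```

Both sample sessions visit admin.php before admin2.php, so (GET-admin.php, t.admin) survives the intersection for GET-admin2.php, whereas nothing is required before POST-login.php because the first session starts with it.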

3.4 Detection

Each web request key r is associated with a set of invariants, including both type I and type III invariants. Each input/output key pair (r, v) is also associated with a set of type II invariants. For detection, each invariant is transformed into an evaluation function, which operates on an input or an input/output pair. If the input or input/output pair satisfies the invariant, the function returns true. Otherwise, the function returns false. The runtime detection is performed in two phases:

(1) Validating the input m_in: the web request is accepted if and only if the request key has been observed and all the invariants associated with it are satisfied. Otherwise, the web request is dropped.

(2) Validating the input/output pair (m_in, m_out): the web page is sent back to the user if and only if the corresponding key pair has been observed and all the invariants associated with it are satisfied. Otherwise, the web page is blocked.
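The idea of turning each invariant into an evaluation function can be illustrated with closures. In this Python sketch every check receives a context dictionary whose layout (`P`, `S`, `history`, `Q`, `S_out`) is an assumption, as are the helper names; the same `validate` function serves phase 1 when called with a request key and phase 2 when called with a key pair.

```python
def enum_check(var, allowed):
    """Type I: a session variable takes one of the values seen in training."""
    return lambda ctx: ctx["S"].get(var) in allowed


def equal_check(param, var):
    """Type I: an input parameter equals a session variable."""
    return lambda ctx: param in ctx["P"] and ctx["P"][param] == ctx["S"].get(var)


def preceded_by(key_pair):
    """Type III: the key pair was already logged in this session."""
    return lambda ctx: key_pair in ctx["history"]


def output_equals(q, var):
    """Type II: an output parameter equals a session variable after the request."""
    return lambda ctx: q in ctx["Q"] and ctx["Q"][q] == ctx["S_out"].get(var)


def validate(key, ctx, invariants):
    """Accept only keys observed in training whose evaluation functions all return true."""
    checks = invariants.get(key)
    if checks is None:  # key never observed during training
        return False
    return all(check(ctx) for check in checks)


INPUT_INVARIANTS = {  # phase 1, indexed by request key r
    "GET-admin.php": [enum_check("privilege", {"admin"})],
    "GET-index.php": [equal_check("username", "username")],
    "GET-admin2.php": [enum_check("privilege", {"admin"}),
                       preceded_by(("GET-admin.php", "t.admin"))],
}
PAIR_INVARIANTS = {  # phase 2, indexed by key pair (r, v)
    ("GET-index.php", "t.index_user"):
        [output_equals("/html/body/user information for:", "username")],
}

guest = {"P": {}, "S": {"username": "bob", "privilege": "user"}, "history": []}
admin = {"P": {}, "S": {"username": "root", "privilege": "admin"}, "history": []}
snoop = {"P": {"username": "alice"},
         "S": {"username": "bob", "privilege": "user"}, "history": []}
```

A guest requesting admin.php, a user asking for another user's index page, and an admin jumping straight to admin2.php are all rejected in phase 1; the last request passes once (GET-admin.php, t.admin) has been logged in the session history.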

All the attacks that exploit the example application can be detected by our extracted invariants. (1) The three auth bypass attack instances violate the invariants associated with the request keys GET-admin.php, GET-admin2.php and POST-admin2.php, respectively, and are detected in the first phase. For example, the first attack instance violates the invariant V($_SESSION['privilege'], GET-admin.php) = {admin}. (2) The parameter manipulation attack violates the invariant associated with the request key GET-index.php that the input parameter username is always equal to the session variable $_SESSION['username'], and is detected in the first phase. It also violates the invariant of the key pair (GET-index.php, t.index_user) that the output parameter "/html/body/user information for:" is equal to both the input parameter username and $_SESSION['username']. (3) The workflow bypass attack violates the invariant associated with the request key GET-admin2.php that the key pair (GET-admin.php, t.admin) always precedes the request key, and is detected in the first phase.

4. IMPLEMENTATION

We implement the prototype of our detection system BLOCK as a proxy that sits between the web application and the client, as shown in Fig. 4. BLOCK is capable of intercepting all the messages exchanged between the web application and the client and taking snapshots of the user's session information stored at the server side. To capture the web requests and responses, we build BLOCK on top of WebScarab [18], an open source web application testing tool, which is deployed at the web server and configured as a reverse proxy. PHP web applications, which are our focus in this paper, by default store the users' session information in temporary files in the directory /var/lib/php5. BLOCK is able to locate the correct session files, indexed by the session ID within the web request, and read the user's session information. BLOCK can be operated in two modes: training and detection. In the training mode, BLOCK collects the observed web requests, responses and their associated session information, analyzes those execution traces and extracts the set of relevant invariants. In the detection mode, BLOCK monitors the interactions between the clients and the web application, and dynamically detects and blocks those potential attacks that violate the extracted invariants. We note that BLOCK can be easily extended to platforms other than PHP by just modifying the component that accesses the session information. For example, in the case of a Tomcat servlet, the component should be able to access, via JDBC drivers, the database tables that store persistent session information. Our implementation is independent of the web application (i.e., it doesn't require the source code for analysis or instrumentation). Thus, it can scale up to protect a large number of web applications.
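As an illustration of the session-reading component, the Python sketch below assumes PHP's default file-based session handler, its default cookie name PHPSESSID, and the default `php` serialization format, in which a session file named sess_<session ID> holds entries of the form name|serialized-value. Only null, string, integer and boolean values are decoded; arrays and objects are out of scope. The function names are hypothetical and this is not the prototype's code.

```python
import os
import re
from http.cookies import SimpleCookie


def session_id_from_cookie(cookie_header, name="PHPSESSID"):
    """Pull the session ID out of a Cookie header (PHPSESSID is PHP's default name)."""
    cookie = SimpleCookie()
    cookie.load(cookie_header)
    return cookie[name].value if name in cookie else None


def parse_php_session(raw):
    """Decode `name|value` entries; supports N, s, i and b scalars only."""
    result, pos = {}, 0
    while pos < len(raw):
        bar = raw.index(b"|", pos)
        name = raw[pos:bar].decode()
        pos = bar + 1
        tag = raw[pos:pos + 1]
        if tag == b"N":                        # N;
            value, pos = None, pos + 2
        elif tag == b"s":                      # s:<byte length>:"<bytes>";
            colon = raw.index(b":", pos + 2)
            length = int(raw[pos + 2:colon])
            start = colon + 2
            value = raw[start:start + length].decode()
            pos = start + length + 2
        elif tag in (b"i", b"b"):              # i:<int>;  b:<0|1>;
            end = raw.index(b";", pos)
            number = int(raw[pos + 2:end])
            value = bool(number) if tag == b"b" else number
            pos = end + 1
        else:
            raise ValueError(f"unsupported value type {tag!r} for {name!r}")
        result[name] = value
    return result


def read_session(session_id, save_dir="/var/lib/php5"):
    """Load the session snapshot for one session ID from its sess_<id> file."""
    if not re.fullmatch(r"[A-Za-z0-9,-]+", session_id):
        raise ValueError("suspicious session ID")
    with open(os.path.join(save_dir, "sess_" + session_id), "rb") as handle:
        return parse_php_session(handle.read())
```

Checking the session ID against a restricted character set before building the file name keeps a crafted cookie value from steering the proxy to files outside the session directory.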

[Figure 4: Overview of BLOCK. Components shown: Client/Simulator, Proxy, Web Server, Web Application, Database, Session Info (indexed by session ID).]

4.1 Training Mode

The components of BLOCK in the training mode are shown in Fig. 5. Whenever a web request or a web page is captured, the message constructor takes a snapshot of the current session state and composes the corresponding message, which is sent to the trace collector. After sufficient traces have been collected, BLOCK performs offline learning. The trace processor first extracts web templates from the observed web pages, then parses both the input and output messages into the designated format: a request or response key associated with a set of key/value pairs for both parameters and session variables. The parsed traces are fed into the invariant extractor, where all three types of invariants are derived. In particular, the value-related invariants (e.g., the equality relationship between variables, the enumeration value set of variables) are inferred by leveraging the Daikon engine [9], a well-known tool for dynamic inference of program invariants. The traces are transformed into the format required by the Daikon engine, and the output is a set of invariants extracted for each declared entry. Presence-related invariants are extracted by programs we developed ourselves. Together, all the extracted invariants constitute the web application's specification.
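The kind of inference performed in this step can be illustrated with a small sketch. It is a deliberately simplified stand-in and does not use or reproduce the Daikon engine: enumeration value sets and parameter/session equalities are derived by direct counting, and presence-related invariants are derived by intersecting the key pairs seen before each request key. The trace record layout (`key`, `params`, `session`, `template`) and the `max_values` cut-off are assumptions made for the example.

```python
from collections import defaultdict


def infer_value_sets(traces, max_values=3):
    """Enumeration invariants: values observed per (request key, session variable).
    Variables with more than max_values distinct values are not reported."""
    seen = defaultdict(set)
    for session in traces:
        for rec in session:
            for var, val in rec["session"].items():
                seen[(rec["key"], var)].add(val)
    return {k: v for k, v in seen.items() if len(v) <= max_values}


def infer_equalities(traces):
    """Equality invariants: (parameter, session variable) pairs that were equal
    in every observation of a request key."""
    candidates = {}
    for session in traces:
        for rec in session:
            holds = {(p, v)
                     for p, pv in rec["params"].items()
                     for v, sv in rec["session"].items() if pv == sv}
            key = rec["key"]
            candidates[key] = holds if key not in candidates else candidates[key] & holds
    return {k: v for k, v in candidates.items() if v}


def infer_precedence(traces):
    """Presence invariants: key pairs that occurred earlier in the session
    every time a request key was observed."""
    always_before = {}
    for session in traces:
        earlier = set()
        for rec in session:
            key = rec["key"]
            if key not in always_before:
                always_before[key] = set(earlier)
            else:
                always_before[key] &= earlier
            earlier.add((key, rec["template"]))
    return {k: v for k, v in always_before.items() if v}
```

Because every candidate is dropped as soon as one trace contradicts it, a larger training set can only remove value and precedence candidates for an already observed key, which is one way false invariants learnt from few samples get eliminated.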

[Figure 5: Training Mode. Components shown: User Simulator, Web Server, Web Application, Session Info, and BLOCK (Message Constructor, Trace Collector, Template Extractor, Invariant Extractor). Data shown: Web Request, Web Page, Templates, Invariants.]

4.2 Detection Mode

Once the invariants are extracted, BLOCK switches to the detection mode, as shown in Fig. 6. The invariant interpreter loads and interprets the extracted invariants. At runtime, the message constructor combines session information with the intercepted web request, composes an input and sends it to the detector for evaluation. If the input is accepted, the web request is forwarded to the web application and logged as the current input for the web application. Otherwise, the web request is dropped. When the message constructor receives a web response, if the response is a redirection, the subsequent web request will not be evaluated or logged. If the response is a web page, the message constructor assigns the web page a response key based on its web template, composes an output and sends it to the detector, where the output is paired with the current input and evaluated. If the output is accepted, the web page is returned to the client and the key pair is logged for the current user session. Otherwise, the web response is blocked and the current input is invalidated. After the user's session has terminated, all of the logged key pairs are cleaned up.
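The request/response handling described above amounts to a small per-session state machine, sketched below. The class and method names are hypothetical, and output evaluation is reduced to a membership test against the key pairs seen in training, which is a simplifying assumption; the actual detector also evaluates the invariants attached to each key pair.

```python
class SessionMonitor:
    """Per-session bookkeeping for the detection mode (illustrative sketch)."""

    def __init__(self, request_invariants, known_pairs):
        # request key -> list of predicates taking (message, logged key pairs)
        self.request_invariants = request_invariants
        # set of (request key, response key) pairs observed during training
        self.known_pairs = known_pairs
        self.logged_pairs = []
        self.current_input = None
        self.skip_next = False

    def on_request(self, key, params, session):
        """Return True to forward the request, False to drop it."""
        if self.skip_next:
            # Request triggered by a redirection: neither evaluated nor logged.
            self.skip_next = False
            return True
        msg = {"key": key, "params": params, "session": session}
        checks = self.request_invariants.get(key, [])
        if all(check(msg, self.logged_pairs) for check in checks):
            self.current_input = msg
            return True
        return False

    def on_response(self, response_key, is_redirect=False):
        """Return True to deliver the response, False to block it."""
        if is_redirect:
            self.skip_next = True
            return True
        if self.current_input is None:
            return False
        pair = (self.current_input["key"], response_key)
        if pair in self.known_pairs:
            self.logged_pairs.append(pair)
            return True
        self.current_input = None          # invalidate the current input
        return False

    def on_session_end(self):
        """Clean up all logged key pairs when the session terminates."""
        self.logged_pairs.clear()
        self.current_input = None
```

Keeping the current input unchanged across a redirection means the page that finally arrives is paired with the request that caused the redirect, in line with the rule that the redirected request is not logged.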

[Figure 6: Detection Mode. Components shown: Client, Web Server, Web Application, Session Info, and BLOCK (Message Constructor, Invariant Interpreter, Detector). Data shown: Web Request, Web Page, Invariants, Templates.]

5. EVALUATION

We evaluate our approach using a set of open source PHP web applications, which are representative of different types of functionality. (1) Scarf is a conference management system, which is used for managing sessions, papers, users and comments. It has a known auth bypass vulnerability (CVE-2006-5909). The attacker can directly visit the administrative page generaloptions.php and modify the system settings and user accounts, since the admin page does not check the privilege of the current user. It echoes the first case of auth bypass in the example application. (2) Simplecms is a simple content management system that allows the admin to publish and manage contents. It is also vulnerable to an auth bypass attack in the Auth.php page (BID 19386). It uses the register_globals mechanism insecurely. An attacker can append a parameter loggedin to the web request and bypass the authentication check. It echoes the second case of auth bypass in the example application. (3) Bloggit is a blog application that supports web blog management. It also has an auth bypass vulnerability (CVE-2006-7014) in the admin.php page, where the restricted code continues to execute after the auth check fails. It echoes the third case of auth bypass in the example application. (4) Wackopicko [24] is an online photo sharing website that allows users to upload pictures, comment on and purchase other people's pictures, etc. It was originally written for testing web application vulnerability scanners. It is designed with a number of vulnerabilities, such as cross-site scripting, SQL injection, file inclusion, etc. Here, we focus on its parameter manipulation vulnerability. After a user logs in, he/she can view the personal information in the home.php page. However, an attacker can manipulate the userid parameter to view any other user's information and owned pictures. (5) OsCommerce [17] is a widely-used open source e-commerce application. To evaluate how our approach handles workflow bypass attacks, we inject one vulnerability into the checkout procedure, which allows the attacker to go directly to the payment page without selecting the shipping method, so that the total charge does not include shipping fees. Table 1 shows a summary of the web applications we use for evaluation.

All the web applications and BLOCK (based on WebScarab) are deployed on a 2.13GHz Core 2 Linux server with 2GB RAM, running Ubuntu 10.10, Apache web server (version 2.2.16) and PHP (version 5.3.3). To collect training traces, each web application is driven by a user simulator, which emulates the interactions between a normal user and the web application. For each web application, user roles and atomic operations are first identified manually. Then, the user simulator is developed based on the Selenium WebDriver [21] to emulate a normal user operating a web application. The simulator leverages a library of user information of all the undergraduate students from a network security class and is able to automatically explore the web application, such as clicking the links, filling in and submitting forms. Among the available atomic operations for the currently chosen user, it randomly selects one as the emulated user's next step. The user simulator is set up on a 2.83GHz Core 2 desktop with 8GB RAM running Windows 7 and Firefox 4. The client is connected to the web server via Ethernet.

Table 1: Summary of Evaluated Web Applications

Application   PHP files   Description                       Vulnerability
Scarf         21          Conference management system      Auth bypass (CVE-2006-5909)
Simplecms     23          Content management system         Auth bypass (BID 19386)
BloggIt       24          Blog engine                       Auth bypass (CVE-2006-7104)
Wackopicko    53          Photo sharing website             Parameter manipulation
OsCommerce    533         Open source e-commerce solution   Workflow bypass (instrumented)

5.1 Detection Effectiveness

BLOCK first runs in the training mode to collect the execution traces generated by the user simulators. Table 2 shows the summary of our collected traces (see footnote 1). Then, it analyzes those traces and extracts web request keys, web page templates, as well as all three types of invariants. To observe the impact of the training set size on the number of derived invariants, we vary the training set size and calculate the resulting invariants. Fig. 7 shows the experiment result we obtain for the Scarf application. We can see that the numbers of type I and III invariants initially decrease and then converge as the training set size increases, indicating the elimination of false invariants learnt from insufficient training samples. The number of type II invariants first increases, due to the exploration of new state space that has not been revealed by the small training set, and then also slowly converges. Based on this observation, for each application we use a training set at which the number of invariants has converged.

[Figure 7: Number of invariants vs. training set size (Scarf application). x-axis: training set size (# of requests), 500 to 3000; y-axis: number of invariants, 0 to 800; series: Type I, Type II and Type III invariants.]

Footnote 1: Here, we note that our training only covers the most-used customer functions of the OsCommerce application. Also, we do not count redirection headers as web pages.

Table 2: Summary of Training Set

Application   Requests   Web Pages   Request Keys   Web Templates   Key Pairs   Type I Inv   Type II Inv   Type III Inv
Scarf         3225       3200        21             26              69          90           640           11
Simplecms     2661       2555        17             12              34          56           190           28
BloggIt       2657       2645        16             13              47          65           377           9
Wackopicko    2949       2946        20             12              30          36           155           37
OsCommerce    3879       3444        25             36              123         374          4609          26

Then BLOCK switches to the detection mode. The clean test set is generated by both the user simulators and the undergraduate students, who manually operate the web applications. Ten attack instances are manually generated under different circumstances against each web application. Table 3 shows the summary of the test set and all the detection results. All of the attacks are successfully detected by BLOCK, and the false positives for both web requests and responses are few: at most 3 falsely blocked requests and 10 falsely blocked responses per application. This demonstrates the effectiveness of our approach at detecting state violation attacks.

We further investigate those false positives and identify two major sources. One is the incomplete exploration of the web application performed by the user simulator. The capability of the user simulator determines the state space that our detection system can characterize for the web application. The more the simulator explores, the richer and more accurate these invariants are. In our evaluation, some false positives result from error pages that are not explored by the simulator, and thus not observed and profiled by the invariant extractor. In practice, if real-world traces are available, our detection system can be readily applied and work effectively. The other source of false positives is the inaccurate symbolization of web pages. Page symbolization affects both the training and detection phases. In the training phase, both the number and the quality of the inferred invariants, especially for type II, are closely related to the number of extracted web templates. We can see that the number of type I and III invariants converges very fast, thus leading to an extremely low number of false positives for web requests, while type II invariants bring more false positives for web responses. In the detection phase, due to the content drift of web pages, it is possible that a web page is classified into a wrong template, which likely results in an unobserved pair of input/output and thus a false positive. We use the same clustering threshold for all applications to extract web templates, which also introduces a certain level of inaccuracy. Since web template extraction is not our focus in this paper, we adapt the methods from TEXT [13], which work well with the web applications we use for evaluation. To increase the accuracy and robustness of web page symbolization, advanced algorithms or manual audit can be introduced to guide the process.
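The role of the clustering threshold in page symbolization can be illustrated with a toy classifier. This sketch does not implement TEXT [13]; as a stand-in, it represents each page by its set of tag paths and compares pages with Jaccard similarity. The function names and the threshold value of 0.8 are assumptions chosen for the example.

```python
from html.parser import HTMLParser


class _PathCollector(HTMLParser):
    """Collect the set of root-to-element tag paths of an HTML page."""

    VOID = {"br", "img", "input", "meta", "link", "hr"}   # elements without end tags

    def __init__(self):
        super().__init__()
        self.stack, self.paths = [], set()

    def handle_starttag(self, tag, attrs):
        self.paths.add("/" + "/".join(self.stack + [tag]))
        if tag not in self.VOID:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag in self.stack:
            # Pop until the matching open tag has been removed.
            while self.stack.pop() != tag:
                pass


def tag_paths(html):
    collector = _PathCollector()
    collector.feed(html)
    return frozenset(collector.paths)


def jaccard(a, b):
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def classify(html, templates, threshold=0.8):
    """Return the response key of the most similar template, or None when no
    template reaches the threshold (an unrecognised page)."""
    paths = tag_paths(html)
    best_key, best_score = None, 0.0
    for key, template_paths in templates.items():
        score = jaccard(paths, template_paths)
        if score > best_score:
            best_key, best_score = key, score
    return best_key if best_score >= threshold else None
```

With a single threshold shared by all applications, a page whose structure drifts may fall below the cut-off or land on a neighbouring template, which is the misclassification effect discussed above.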

The detection results also show the types of invariants violated by different attacks. Auth bypass attacks on insufficient checking of session variables result in violations of type I invariants that are imposed on the session state when web requests are received. They would also violate type III invariants due to the missing step of authentication/authorization. Parameter manipulation attacks can be detected by type I invariants, if the input parameters are related to the session variables. They may also be identified by type II invariants, if the corresponding web pages contain output parameters that are related to the session state. Workflow bypass attacks will be blocked in the same manner as auth bypass attacks, if the session variables that are used for guarding the state transitions are not checked. If there are no such guarding session variables, e.g., in the example application, type III invariants would help to identify workflow bypass attacks due to the constraints imposed on the sequence of operations.

5.2 Performance Overhead

Since our detection system sits between the client and the web application, it affects the response time of the web application. First, the WebScarab proxy intercepts and forwards all the messages exchanged between the user and the web application, which increases the response time. Second, the integrated detector evaluates the web requests and web pages, which also introduces additional delay. To measure the performance overhead brought by our detection system, we use the simulators to perform a designated sequence of operations and log the response time for every web request. We compare the performance under three configurations: 1) without the WebScarab proxy, 2) with the WebScarab proxy deployed but the detector disabled, 3) with the WebScarab proxy deployed and the detector enabled. Figure 8 shows the summary of the average response time for each application under the above three scenarios. We can see that the average response time increases by a factor of 1.5 to around 5 if BLOCK is deployed and enabled. While the resulting response time is still acceptable, we notice that more than 90% of the overhead is brought by the WebScarab proxy and only a small amount is introduced by the detector. For our current prototype implementation, no modifications or configurations are made to the WebScarab proxy to enhance its performance. If a more lightweight and efficient proxy (e.g., Apache mod_proxy) is employed for integrating our detection system, it is possible to reduce the response time; we leave this as future work.
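The way the three configurations separate the proxy cost from the detector cost can be written down as a short calculation. The function below is an illustrative sketch, and the timings in the usage note are made-up values chosen for easy arithmetic; they are not measurements from Figure 8.

```python
def overhead_breakdown(t_direct, t_proxy, t_full):
    """Split the added response time into proxy and detector shares.

    t_direct: average response time without the proxy (configuration 1)
    t_proxy:  average response time with the proxy, detector disabled (configuration 2)
    t_full:   average response time with the proxy and detector enabled (configuration 3)
    """
    total_overhead = t_full - t_direct
    if total_overhead <= 0:
        raise ValueError("configuration 3 must be slower than configuration 1")
    return {
        "slowdown": t_full / t_direct,                        # factor relative to no proxy
        "proxy_share": (t_proxy - t_direct) / total_overhead,
        "detector_share": (t_full - t_proxy) / total_overhead,
    }
```

For hypothetical averages of 10 ms, 28 ms and 30 ms, the slowdown factor is 3 and the proxy accounts for 18 of the 20 ms added, i.e. 90% of the overhead.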

[Figure 8: Summary of performance overhead. x-axis: application (Scarf, Simplecms, Bloggit, Wackopicko, OsCommerce); y-axis: response time (ms), 0 to 80; series: without WebScarab, with WebScarab (without Detector), with WebScarab (with Detector).]

Table 3: Summary of Detection Result

Application   Requests (clean test set)   Web pages (clean test set)   Blocked requests (false positive)   Blocked responses (false positive)   Attacks   Detected   Invariant violations
Scarf         1364                        1360                         0                                   6                                    10        10         type I and III
Simplecms     1731                        1688                         0                                   8                                    10        10         type I and III
BloggIt       1044                        1024                         0                                   0                                    10        10         type I and III
Wackopicko    1322                        1314                         0                                   1                                    10        10         type I and II
OsCommerce    1505                        1460                         3                                   10                                   10        10         type I and III
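The counts in Table 3 can be turned into rates with a few lines of code. The sketch below only restates the table's numbers; the `TABLE3` dictionary and the `rates` helper are names introduced for this example.

```python
# Rows copied from Table 3:
# application -> (requests, web pages, blocked requests, blocked responses, attacks, detected)
TABLE3 = {
    "Scarf":      (1364, 1360, 0, 6, 10, 10),
    "Simplecms":  (1731, 1688, 0, 8, 10, 10),
    "BloggIt":    (1044, 1024, 0, 0, 10, 10),
    "Wackopicko": (1322, 1314, 0, 1, 10, 10),
    "OsCommerce": (1505, 1460, 3, 10, 10, 10),
}


def rates(app):
    """Return (request FP rate, response FP rate, detection rate) for one application."""
    requests, pages, blocked_req, blocked_resp, attacks, detected = TABLE3[app]
    return blocked_req / requests, blocked_resp / pages, detected / attacks


for app in TABLE3:
    req_fp, resp_fp, detection = rates(app)
    print(f"{app:<11} request FP {req_fp:.2%}  response FP {resp_fp:.2%}  detected {detection:.0%}")
```

On these figures every false positive rate stays below 1%; the largest is the response rate for OsCommerce (10 of 1460 pages).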

5.3 Discussion

There is one limitation of BLOCK we would like to point out. BLOCK only observes and models the relations between web requests, web responses and the session variables. Thus it cannot handle attacks that violate the persistent state that exists in database tables. If BLOCK is extended to capture and analyze the SQL queries/responses from a database, it has the potential to handle this type of state violation attack. We leave this as future work.

Our technique bears the same limitations as other dynamic analysis techniques. The completeness and correctness of inferred invariants cannot be guaranteed. In order to put BLOCK into practice, some manual intervention is preferable to ensure sufficient training and suppress false positives. In the future, we would like to investigate mechanisms for automatic verification of likely invariants.

6. RELATED WORK

Our work falls within the category of web application security, and our approach is closely related to the specification inference of software.

6.1 Web Application Security

Web application security has been a popular research topic in recent years. A large body of existing work investigates input validation attacks, such as cross-site scripting and SQL injection, which exploit the applications' insufficient or erroneous sanitization of user inputs. Compared with state violation flaws, which are the focus of this paper, input validation flaws are independent of the application logic and thus can be captured via a general specification. For example, the information flow model has been applied to the input validation problem, where a set of data input points are defined as sources, and the security sensitive operations are modeled as sinks. Based on this model, both static and dynamic program analysis techniques are employed to identify the insufficient or erroneous sanitizations within the web application, which result in insecure information flow [12]. It is worth noting that the black-box approach [20], techniques that analyze the external request/response flow [1, 23], and approaches of inferring a DFA for web requests [15, 11] have been presented to address input validation attacks. Black-box techniques have also been applied to address other problems within web applications, such as post-migration testing (Splitter [8]), insider threats (CADS [6]), form tampering (NoTamper [5]) and HTTP parameter pollution [3]. However, because they target problems of a different nature, they do not take into account the internal state of the web application and cannot be applied to state violation attacks.

MiMoSA [4] and Waler [10] employ white-box analysis techniques to identify vulnerabilities within web applications that attract state violation attacks. While they may achieve better accuracy (i.e., fewer false positives) than black-box techniques, their capability is limited in that they rely on precise modeling of the application source code and programming frameworks, which is difficult and not scalable. The work most closely related to ours is Swaddler [7], which also detects state violation attacks at runtime by evaluating the deviations of session variables when entering a specific program block. One major deficiency of Swaddler is that it completely depends on user-defined session variables. In cases where insufficient session variables are defined, as shown in the example application, it cannot detect those attacks. In contrast, our type III invariants, which are defined based on the web request/response history, can capture the application state that is not revealed by the defined session variables. Thus even when the session variables are insufficient or unreliable, our approach is still effective.

6.2 Specification Inference of Software

Software specification is essential for verification of program behaviors and program testing. However, a complete and machine-understandable specification is rarely available. Thus, researchers are motivated to study the problem of inferring software specifications. Static inference techniques analyze the program code to extract the partial orders of function calls [14], while dynamic inference techniques try to profile the program behavior through mining program execution traces. The Daikon engine [9], the best-known tool in this field, extracts value-related invariants by matching invariant templates to expressions. Strauss [2] formalizes specification mining as a grammar inference problem and learns probabilistic finite state automata (PFSA) from traces. Perracotta [26] mines two-letter alternating patterns of functions from imperfect traces. Gk-tail [16] builds an extended finite state machine (EFSM) combining both value-related and temporal properties. Our approach falls into the category of dynamic inference techniques. Unlike these generic software specification inference methods, our work leverages the stateless nature of the HTTP protocol and its associated session management mechanism, and can be applied to distributed client/server web applications.

7. CONCLUSION

This paper presents BLOCK, a black-box approach for detecting state violation attacks, and evaluates its prototype implementation using a set of open source PHP web applications. The results validate the effectiveness of BLOCK. Our approach is valuable in that it is independent of the web application source code and can fit into a large variety of web application hosting scenarios based on different application frameworks, where the source code may not be available.

Acknowledgment

This work was supported by NSF TRUST (The Team for Research in Ubiquitous Secure Technology) Science and Technology Center (CCF-0424422). We would like to thank Ryan Burns, Brandon Conway and Russ Amos for their help in developing user simulators and Vanderbilt ITS for valuable discussion.

8. REFERENCES

[1] M. Almgren, H. Debar, and M. Dacier. A lightweight tool for detecting web server attacks. In Proceedings of the ISOC Symposium on Network and Distributed Systems Security, pages 157–170, 2000.
[2] G. Ammons, R. Bodík, and J. R. Larus. Mining specifications. In Symposium on Principles of Programming Languages, volume 37, pages 4–16, 2002.
[3] M. Balduzzi, C. Gimenez, D. Balzarotti, and E. Kirda. Automated discovery of parameter pollution vulnerabilities in web applications. In NDSS’11: Proceedings of the 18th Network and Distributed System Security Symposium, 2011.
[4] D. Balzarotti, M. Cova, V. V. Felmetsger, and G. Vigna. Multi-module vulnerability analysis of web-based applications. In CCS’07: Proceedings of the 14th ACM conference on Computer and communications security, pages 25–35, 2007.
[5] P. Bisht, T. Hinrichs, N. Skrupsky, R. Bobrowicz, and V. N. Venkatakrishnan. NoTamper: automatic blackbox detection of parameter tampering opportunities in web applications. In CCS’10: Proceedings of the 17th ACM conference on Computer and communications security, pages 607–618, 2010.
[6] Y. Chen and B. Malin. Detection of anomalous insiders in collaborative environments via relational analysis of access logs. In CODASPY ’11: Proceedings of the first ACM conference on Data and application security and privacy, pages 63–74, 2011.
[7] M. Cova, D. Balzarotti, V. Felmetsger, and G. Vigna. Swaddler: An Approach for the Anomaly-based Detection of State Violations in Web Applications. In RAID’07: Proceedings of the 10th International Symposium on Recent Advances in Intrusion Detection, pages 63–86, 2007.
[8] X. Ding, H. Huang, Y. Ruan, A. Shaikh, B. Peterson, and X. Zhang. Splitter: a proxy-based approach for post-migration testing of web applications. In EuroSys’10: Proceedings of the 5th European conference on Computer systems, pages 97–110, 2010.
[9] M. D. Ernst, J. Cockrell, W. G. Griswold, and D. Notkin. Dynamically discovering likely program invariants to support program evolution. IEEE Transactions on Software Engineering, 27(2):99–123, Feb. 2001.
[10] V. Felmetsger, L. Cavedon, C. Kruegel, and G. Vigna. Toward Automated Detection of Logic Vulnerabilities in Web Applications. In USENIX’10: Proceedings of the 19th conference on USENIX Security Symposium, pages 143–160, 2010.
[11] K. L. Ingham, A. Somayaji, J. Burge, and S. Forrest. Learning DFA representations of HTTP for protecting web applications. Computer Networks and ISDN Systems, 51:1239–1255, 2007.
[12] N. Jovanovic, C. Kruegel, and E. Kirda. Pixy: A static analysis tool for detecting web application vulnerabilities (short paper). In S&P’06: Proceedings of the 27th IEEE Symposium on Security & Privacy, pages 258–263, 2006.
[13] C. Kim and K. Shim. Text: Automatic template extraction from heterogeneous web pages. IEEE Trans. Knowl. Data Eng., 23(4):612–626, 2011.
[14] T. Kremenek, P. Twohey, G. Back, A. Ng, and D. Engler. From uncertainty to belief: inferring the specification within. In OSDI ’06: Proceedings of the 7th symposium on Operating systems design and implementation, pages 161–176, 2006.
[15] C. Kruegel and G. Vigna. Anomaly detection of web-based attacks. In CCS’03: Proceedings of the 10th ACM conference on Computer and communications security, pages 251–261, 2003.
[16] D. Lorenzoli, L. Mariani, and M. Pezze. Automatic generation of software behavioral models. In ICSE ’08: Proceedings of the 30th international conference on Software engineering, pages 501–510, 2008.
[17] OsCommerce Inc. http://www.oscommerce.com/.
[18] OWASP WebScarab Project. https://www.owasp.org/index.php/category:owasp_webscarab_project.
[19] D. C. Reis, P. B. Golgher, A. S. Silva, and A. F. Laender. Automatic web news extraction using tree edit distance. In WWW ’04: Proceedings of the 13th international conference on World Wide Web, pages 502–511, 2004.
[20] R. Sekar. An efficient black-box technique for defeating web application attacks. In NDSS’09: 16th Annual Network and Distributed System Security Symposium, 2009.
[21] SeleniumHQ: Web Application Testing System. http://seleniumhq.org/.
[22] Symantec internet security threat report 2009. http://www.symantec.com/business/threatreport/.
[23] G. Vigna, W. Robertson, V. Kher, and R. A. Kemmerer. A stateful intrusion detection system for world-wide web servers. In ACSAC’03: Proceedings of the Annual Computer Security Applications Conference, pages 34–43, 2003.
[24] Wackopicko. https://github.com/adamdoupe/wackopicko.
[25] G. Wassermann and Z. Su. Static detection of cross-site scripting vulnerabilities. In ICSE’08: ACM/IEEE 30th International Conference on Software Engineering, pages 171–180, 2008.
[26] J. Yang, D. Evans, D. Bhardwaj, T. Bhat, and M. Das. Perracotta: mining temporal API rules from imperfect traces. In ICSE ’06: Proceedings of the 28th international conference on Software engineering, pages 282–291, 2006.
