1. INTRODUCTION
A Web user-session, simply called user-session in the remainder of
this paper, is informally defined as the set of TCP connections created by a
given user while surfing the Web during a given time frame. The study of
telecommunication networks has often been based on traffic measurements, which are used to
create traffic models and obtain performance estimates. While a lot of attention has
traditionally been devoted to traffic characterization at the packet and transport layers, few
studies address traffic properties at the session/user layer.
This is due to the difficulty of defining the "session" concept itself, which depends on the
considered application. In addition, the generally accepted conjecture that such sessions follow a
Poisson arrival process may have reduced interest in analyzing the user-session process.
User-session identification and characterization play an important role
both in Internet traffic modeling and in the proper dimensioning of network
resources. Besides increasing our knowledge of network traffic and user
behavior, they yield workload models which may be exploited both for
performance evaluation and for dimensioning network elements. Synthetic
workload generators may be defined to assess network performance.
Finally, knowledge of user-session behavior is important for service
providers, for example to dimension access links and router capacity. User
behavior can be modeled by a few parameters, e.g., session arrival rates and data
volumes.
Web activity alternates between active (ON) periods and silent (OFF) periods,
during which the user is inactive on the Internet.
Clustering techniques:
Clustering techniques are exploratory techniques used in many areas
to analyze large data sets. Given a proper notion of similarity, they find
groups of similar variables/objects by partitioning the data set into "similar"
subsets.
The HTTP protocols have a significant impact on the characteristics of web traffic in the Internet. For
example, measurements of TCP connection usage for early versions of the HTTP protocols
pointed to clear inefficiencies in design, notably the creation of a different TCP connection for
each web object referenced. Recent revisions to the HTTP protocol, notably version 1.1, have
introduced the concepts of persistent connections and pipelining.
Persistent connections enable the reuse of a single TCP connection for multiple
object references at the same IP address (typically embedded components of a web page).
Pipelining allows the client to make a series of requests on a persistent connection without
waiting for a response between requests. We make a number of observations about the
evolving nature of web traffic, the use of the web, the structure of web pages, the organization of
the web servers providing this content, and the use of packet tracing as a method for understanding
the dynamics of web traffic.
Packet: a chunk of information with atomic delivery characteristics; it is either delivered correctly to
the destination or lost (and most often must be retransmitted).
Flow: a concatenation of correlated packets, as in a TCP connection.
Traffic characteristics:
Traffic characteristics, or traffic behavior, can be measured at the following levels:
1) the packet level
2) the flow level
3) the session level
The packet level:
Measuring packet-level characteristics is complex, since substantial theory is needed to
develop traffic engineering tools.
The packet size distribution, averaged over one week in January 2001, confirms that packets
are either very small, i.e., pure control packets, or very large. The distribution of the Time To Live
(TTL) field content distinguishes between incoming and outgoing packets.
Flows:
Flows are defined as streams of packets corresponding to the transfer of a document. The
document may be a web page or a video clip, which may require a number of TCP connections.
Most flows are small, but most of the traffic is carried by a small number of very large flows.
Flows do not occur independently.
Sessions:
A session is a set of TCP connections generated by a given user,
and every session can contain any number of flows. Since a session aggregates flows,
the session arrival process determines the workload offered to servers. An important
measure in traffic analysis is the variability of both the number of flows in a session and
the size of those flows.
1) A user session begins when a user who has been idle opens a TCP connection.
2) A user session ends, and the next idle period begins, when the user has had no open
connections for r consecutive seconds.
In choosing a particular value of r, we wanted to use a time on the order of Web
browsing speed. In typical Web browsing behavior, a human user clicks on a link, and the Web
browser opens several simultaneous
connections to retrieve the information. Then, when the information is presented, the user may
take a few seconds or minutes to digest it before locating the next desired link. We
wanted r to be large enough to keep such a sequence of clicks together.
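A minimal sketch of this threshold rule in Java; the class and method names (Conn, sessionize) are illustrative, not taken from the paper, and connections are assumed to be sorted by opening time.

import java.util.ArrayList;
import java.util.List;

// Illustrative sketch of the threshold rule: a new session starts
// when the user has had no open connections for r consecutive seconds.
public class ThresholdSessionizer {

    /** A TCP connection reduced to its opening and closing times (seconds). */
    public static class Conn {
        final double open, close;
        Conn(double open, double close) { this.open = open; this.close = close; }
    }

    /**
     * Groups connections (sorted by opening time) into sessions: a
     * connection starts a new session if it opens more than r seconds
     * after every earlier connection has closed.
     */
    public static List<List<Conn>> sessionize(List<Conn> conns, double r) {
        List<List<Conn>> sessions = new ArrayList<List<Conn>>();
        List<Conn> current = null;
        double lastClose = Double.NEGATIVE_INFINITY; // latest close time so far
        for (Conn c : conns) {
            if (current == null || c.open - lastClose > r) {
                current = new ArrayList<Conn>(); // idle gap exceeded r: new session
                sessions.add(current);
            }
            current.add(c);
            lastClose = Math.max(lastClose, c.close);
        }
        return sessions;
    }
}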
2. SYSTEM ANALYSIS
2.1 Existing System
Some methods for characterizing the "session" arrival process have been presented
in the literature. However, their focus is on Telnet and FTP sessions, where each session is related to a
single TCP data connection. No measurements of HTTP sessions are
reported.
To identify HTTP user-sessions, traditional approaches rely on the
adoption of a threshold. TCP connections are aggregated into the same
session if the inter-arrival time between two TCP connections is
smaller than the threshold value. Otherwise, the TCP connection is
associated with a newly created user-session.
The threshold-based approach works well only if the threshold value is
correctly matched to the values of the connection and session inter-arrival
times.
Furthermore, different users may show different idle times, and even
the same user may have different idle periods depending on the
service he is accessing, e.g., news or e-commerce. Thus, a priori
knowledge of the proper threshold value is an unrealistic assumption.
If the threshold value is not correctly matched to the session statistical
behavior, threshold-based mechanisms are significantly error prone. To
avoid this drawback, we propose a more robust algorithm to identify
user-sessions.
Many existing systems perform a data analysis of server logs to define
user-sessions. While the server-log approach can be very reliable, it
lacks the capability offered by passive measurements performed at the
packet level, which permit simultaneous monitoring of a user browsing
several servers.
Some models adopt a passive sniffing methodology to rebuild HTTP-layer
transactions and infer client/user behavior. By inspecting HTTP
protocol headers, the sequence of objects referred to by the initial
request is rebuilt. This allows grouping several TCP connections to form
a user-session.
While this approach can be very effective, it does not scale
well and, by leveraging a specific application-level protocol, can
hardly be generalized. Furthermore, since the payload of all packets must
be analyzed, this approach is not practical when, for security or privacy
reasons, data payloads (and application-layer headers) are not
available.
2.2 Proposed System
The main goals of this paper are:
i) to devise a technique that correctly identifies user-sessions, and
ii) to determine their statistical properties by analyzing traces of
measured data.
The aim of the proposed system is to define a clustering technique to
identify user-sessions. Its performance is compared with that of traditional
threshold-based approaches, which partition samples by comparing the
sample-to-sample distance against a given threshold value.
The main advantage of the clustering approach is that it avoids the need to
define a priori any threshold value to separate and group samples. Thus, this
methodology is more robust than simpler threshold-based mechanisms.
By running a clustering algorithm, we avoid the need to set a priori a
threshold value, since clustering techniques automatically adapt to the
actual user behavior. Furthermore, the algorithm does not require any training
phase to run properly. We test the proposed methodology on artificially
generated traces
i) to ensure its ability to correctly identify a set of TCP connections
belonging to the same user-session,
ii) to assess the error performance of the proposed technique, and
iii) to compare it with traditional threshold based mechanisms.
Finally, we run the algorithms over real traffic traces to obtain statistical
information on user-sessions, such as the distributions of
i) session duration,
ii) the amount of data transferred in a single session, and
iii) the number of connections within a single session;
a sketch of computing these statistics is given below.
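The following is a hedged sketch of how such per-session statistics might be computed once sessions have been identified; the Conn fields and class names are hypothetical, not the paper's.

import java.util.List;

// Illustrative per-session summary statistics.
public class SessionStats {

    /** A connection with its opening/closing times (seconds) and byte count. */
    public static class Conn {
        final double open, close;
        final long bytes;           // client plus server payload bytes
        Conn(double open, double close, long bytes) {
            this.open = open; this.close = close; this.bytes = bytes;
        }
    }

    /** i) Session duration: last close minus first open. */
    static double duration(List<Conn> session) {
        double first = Double.POSITIVE_INFINITY, last = Double.NEGATIVE_INFINITY;
        for (Conn c : session) {
            first = Math.min(first, c.open);
            last = Math.max(last, c.close);
        }
        return last - first;
    }

    /** ii) Amount of data transferred in the session. */
    static long bytesTransferred(List<Conn> session) {
        long total = 0;
        for (Conn c : session) total += c.bytes;
        return total;
    }

    /** iii) Number of connections within the session. */
    static int connections(List<Conn> session) {
        return session.size();
    }
}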
In this paper, only TCP headers are analyzed, limiting privacy issues and
significantly reducing probe complexity. Furthermore, the proposed
approach may be adopted for any set of sessions generated by the same
application.
1) The Hierarchical Agglomerative Approach:
Each sample is initially associated with a different cluster when the
procedure starts. Then, on the basis of a cluster-to-cluster distance
definition, the two clusters at minimum distance are merged to form a new
cluster. The algorithm iterates this step until all samples belong to the
same cluster.
This approach can be quite time consuming, especially when the data
set is very large, since the initial number of clusters is equal to the number
of samples in the data set. For this reason, non-hierarchical approaches,
named partitioned, are often preferred, since they show better scalability
properties.
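As an illustration of the agglomerative procedure, here is a naive single-linkage sketch over one-dimensional samples (e.g., connection opening times). The stopping criterion (a target number k of clusters) and all names are illustrative; the paper's procedure iterates until a single cluster remains.

import java.util.ArrayList;
import java.util.List;

// Illustrative single-linkage agglomerative clustering of 1-D samples.
// Naive O(n^3) implementation, matching the description above.
public class Agglomerative {

    /** Repeatedly merges the two closest clusters until k clusters remain. */
    public static List<List<Double>> cluster(List<Double> samples, int k) {
        List<List<Double>> clusters = new ArrayList<List<Double>>();
        for (double s : samples) {             // each sample starts as its own cluster
            List<Double> c = new ArrayList<Double>();
            c.add(s);
            clusters.add(c);
        }
        while (clusters.size() > k) {
            int bi = 0, bj = 1;
            double best = Double.POSITIVE_INFINITY;
            for (int i = 0; i < clusters.size(); i++)          // find the two clusters
                for (int j = i + 1; j < clusters.size(); j++)  // at minimum distance
                    if (dist(clusters.get(i), clusters.get(j)) < best) {
                        best = dist(clusters.get(i), clusters.get(j));
                        bi = i; bj = j;
                    }
            clusters.get(bi).addAll(clusters.remove(bj));      // merge them
        }
        return clusters;
    }

    /** Single-linkage cluster-to-cluster distance: minimum pairwise distance. */
    static double dist(List<Double> a, List<Double> b) {
        double m = Double.POSITIVE_INFINITY;
        for (double x : a) for (double y : b) m = Math.min(m, Math.abs(x - y));
        return m;
    }
}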
2) The Partitioned Approach:
This technique is used when the final number of clusters is known. The procedure starts with an
initial configuration of clusters, selected according to some criterion. The final cluster
definition is obtained through an iterative procedure.
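A minimal k-means-style sketch of such an iterative partitioned procedure, for one-dimensional samples; the paper does not prescribe this specific algorithm, so it is offered purely as an illustration.

// Illustrative 1-D k-means: an iterative partitioned clustering,
// starting from an initial set of k centroids (mutated in place).
public class KMeans1D {

    /** Returns, for each sample, the index of its final cluster. */
    public static int[] cluster(double[] samples, double[] centroids, int iterations) {
        int k = centroids.length;
        int[] assign = new int[samples.length];
        for (int it = 0; it < iterations; it++) {
            // 1) assign each sample to its nearest centroid
            for (int i = 0; i < samples.length; i++) {
                int best = 0;
                for (int j = 1; j < k; j++)
                    if (Math.abs(samples[i] - centroids[j])
                            < Math.abs(samples[i] - centroids[best])) best = j;
                assign[i] = best;
            }
            // 2) recompute each centroid as the mean of its assigned samples
            double[] sum = new double[k];
            int[] cnt = new int[k];
            for (int i = 0; i < samples.length; i++) {
                sum[assign[i]] += samples[i];
                cnt[assign[i]]++;
            }
            for (int j = 0; j < k; j++)
                if (cnt[j] > 0) centroids[j] = sum[j] / cnt[j];
        }
        return assign;
    }
}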
For each TCP connection, the following information is collected:
1) the 4-tuple identifying the connection, i.e., the IP addresses and TCP port
numbers of the client and the server;
2) the connection opening time, identified by the timestamp of the first
client SYN message;
3) the connection ending time, identified by the time instant at which the
TCP connection is terminated;
4) the net amount of bytes sent by the client and the server respectively
(excluding retransmissions); a sketch of such a record is given below.
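The record might look like this in Java (field names are illustrative, not the paper's):

// Illustrative record of the per-connection measurements listed above.
public class ConnectionRecord {
    // 1) the 4-tuple identifying the connection
    String clientIp, serverIp;
    int clientPort, serverPort;
    // 2) opening time: timestamp of the first client SYN (seconds)
    double openTime;
    // 3) ending time: instant at which the TCP connection terminates
    double closeTime;
    // 4) net bytes sent by client and server (retransmissions excluded)
    long clientBytes, serverBytes;
}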
2.3 Modules
i) Authentication
ii) Data set formulation
iii) Session clustering
Module 1: Authentication:
This module provides security assurance for the website users are browsing. It is
essential, since the sites in this system maintain large amounts of data that must be kept
secure. These sites are used by a large number of people, and any illegal activity on a site
would affect all of its members. Security can therefore be improved by requiring some
proof of identity, such as a student ID.
Module 2: Dataset formulation:
Once the user starts browsing, a session for the user is created automatically. If the user is
valid, further TCP connections from the user are sent to the server. A user can
have any number of connections in a single session, and can also have any number of
sessions at a time. All details of the session arrival and TCP connection arrival processes
are stored in a log file or dataset on the server.
Module 3: Session clustering:
The TCP flows in a session are clustered based on the user behavior recorded in the dataset.
The session's active (ON) time can thus last as long as the user needs, and sessions belonging
to the same user are not incorrectly split into inactive periods, because they are clustered
together.
Data Fusion:
At the beginning of data preprocessing, we have the Log, containing the Web
server log files collected by several Web servers, as well as the Web site maps (in XGML format
[PK01], which is an XML application). First, we join the log files and then anonymise the
resulting log file for privacy reasons.
Data cleaning:
A variety of files are accessed as a result of a request by a client to view a particular
Web page. These include image, sound and video files, executable CGI files, coordinates of
clickable regions in image map files, and HTML files. Thus the server logs contain many entries
that are redundant or irrelevant for the data mining tasks.
User Request: Page1.html
Browser Request: Page1.html, a.gif, b.gif
Several entries in the server log correspond to the same user request, hence the redundancy.
The cleaning procedure is straightforward, consisting of the removal of all requests issued by
(Host, User Agent) pairs identified as being Web robots.
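A hedged sketch of this cleaning pass; the Entry fields, the suffix list, and the robot set are illustrative placeholders, not the system's actual data structures.

import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Illustrative cleaning pass over parsed log entries.
public class LogCleaner {

    public static class Entry {
        String host, userAgent, url;   // fields of one parsed log line
    }

    // Hypothetical suffixes of embedded objects to drop from the log.
    static final String[] DROP_SUFFIXES = { ".gif", ".jpg", ".png", ".css", ".js" };

    /** Keeps only page requests issued by non-robot (host, user-agent) pairs. */
    public static List<Entry> clean(List<Entry> log, Set<String> robotPairs) {
        List<Entry> kept = new ArrayList<Entry>();
        for (Entry e : log) {
            if (robotPairs.contains(e.host + "|" + e.userAgent)) continue; // robot
            boolean embedded = false;
            for (String s : DROP_SUFFIXES)
                if (e.url.toLowerCase().endsWith(s)) { embedded = true; break; }
            if (!embedded) kept.add(e);  // keep only page-level requests
        }
        return kept;
    }
}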
Data Summarization:
The last step of data preprocessing contains what we call the "advanced preprocessing
step for WUM". In this step, we first transfer the structured file containing visits or episodes (if
identified) to a relational database. Afterwards, we apply data generalization at the request
level (for URLs) and compute aggregated data for episodes, visits and user sessions to
completely fill in the database.
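A sketch of the aggregated-data computation for a single visit; the aggregate fields are illustrative assumptions, not the schema actually used by the system.

import java.util.List;

// Illustrative aggregation of one visit's requests into the summary
// fields that would be stored in the relational database.
public class VisitSummary {

    public static class Request {
        String url;
        long timestamp;    // seconds
        long bytes;
    }

    int requestCount;
    long startTime, endTime;
    long totalBytes;

    static VisitSummary aggregate(List<Request> visit) {
        VisitSummary s = new VisitSummary();
        s.startTime = Long.MAX_VALUE;
        s.endTime = Long.MIN_VALUE;
        for (Request r : visit) {
            s.requestCount++;
            s.totalBytes += r.bytes;
            s.startTime = Math.min(s.startTime, r.timestamp);
            s.endTime = Math.max(s.endTime, r.timestamp);
        }
        return s;
    }
}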
2.4 Software Specifications
Java Tools:
Language : Java JDK 1.6
Java Technologies : Swing, Servlets, JSP
IDE : NetBeans 6.0
Operating System : Windows 2000/XP
3. SOFTWARE SYSTEM ATTRIBUTES
Reliability
The system shall fail only under unavoidable circumstances, such as network connection failure.
Availability
The system shall allow users to restart the application after a failure, with the loss of the old Peer
ID along with its trust data, and shall provide a new Peer ID by treating the user as a new peer.
Security
The factors that protect the software from accidental or malicious access, use, modification,
destruction, or disclosure are:
The software can be accessed only by authorized users, since they are provided with a
user name and password.
The trust value of one peer is encrypted using public key cryptography and then
transferred to the other peer.
Maintainability
The software is organized into modules with well-defined interfaces, which eases its
maintenance, including maintenance by developers other than the original authors.
Portability
Since the software is developed using host-independent code, it can be executed on different
hosts with different platforms.
4. IMPLEMENTATION
4.1 Java
Java was conceived by James Gosling, Patrick Naughton, Chris Warth, Ed Frank and Mike
Sheridan at Sun Microsystems in 1991. It took 18 months to develop the first
working version. The language was initially called "Oak", but was renamed "Java" in 1995.
Between the initial implementation of Oak in 1992 and the public announcement of Java in 1995,
many more people contributed to the design and evolution of the language.
Java overview
Java is a powerful but lean object-oriented programming language. It has generated a lot of
excitement because it makes it possible to program for the Internet by creating applets, programs that
can be embedded in a web page. The content of an applet is limited only by one's imagination.
For example, an applet can be an animation with sound, an interactive game, or a ticker tape with
constantly updated stock prices. Applets can be just little decorations that liven up a web page, or
they can be serious applications like word processors or spreadsheets.
But Java is more than a programming language for writing applets. It is being
used more and more for writing standalone applications as well. It is becoming so popular that
many people believe it will become the standard language for both general-purpose and Internet
programming.
Java builds on the strengths of C++: it has taken the best features of C++ and added
garbage collection, multithreading and security capabilities. The result is a language that is
powerful and easy to use.
Java is actually a platform consisting of three components:
1. Java programming language
2. Java library of classes and interfaces
3. Java virtual machine
Java is portable
One of the biggest advantages Java offers is that it is portable. An application written in Java will
run on all the major platforms. Any computer with a Java-based browser can run the applications
or applets written in the Java programming language. A programmer no longer has to write one
program for each platform: one version to run on a UNIX machine, another for Windows, and so on.
Developers write code once: Java code is compiled into byte codes rather than a machine language.
These byte codes go to the Java virtual machine, which executes them on the host system.
Java is object oriented
Classes
A class combines data and the code that operates on it into a single user-defined data type. A
class defines the shape and behavior of its objects and data. Once a class has been defined, we
can create any number of objects belonging to that class. As said already, classes are user-defined
data types that behave like the built-in types of the programming language.
Data Abstraction
Data abstraction is an act of representing essential features without including the background
details and explanations.
Encapsulation
Data encapsulation is one of the most striking features of Java. Encapsulation is the wrapping up
of data and functions into a single unit called a class. The wrapper defines the behavior and
protects the code and data from being arbitrarily accessed by the outside world; only the
functions wrapped in the class can access the data. This insulation of data from
direct access by the program is called 'data hiding'.
Inheritance
Inheritance is the process by which objects of one class acquire the properties of objects of
another class. In Java, the concept of inheritance provides the idea of reusability, giving the
means to add features to an existing class without modifying it. This is done by
deriving a new class from the existing one; the newly created class then has the combined
features of both the parent and the child class.
Polymorphism
Polymorphism means the ability to take more than one form, i.e., one object, many shapes.
Polymorphism plays an important role in allowing objects having different internal structures to
share the same external interface. This means that a general class of operations may be accessed
in the same manner even though the specific action associated with each operation may differ.
Dynamic Binding
Binding refers to the linking of a procedure call to the code to be executed in response to the
call. Dynamic binding means that the code associated with a given procedure call is not known
until the time of the call, at run time.
Dynamic binding is closely associated with polymorphism: the method executed through a
reference depends on the dynamic type of that reference.
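A small example tying these concepts together (the classic shapes illustration, not code from this project):

// Inheritance, polymorphism and dynamic binding in one small example.
class Shape {
    double area() { return 0; }               // general behavior
}

class Circle extends Shape {                  // inherits from Shape
    double radius;
    Circle(double r) { radius = r; }
    @Override double area() { return Math.PI * radius * radius; }
}

class Square extends Shape {
    double side;
    Square(double s) { side = s; }
    @Override double area() { return side * side; }
}

public class ShapeDemo {
    public static void main(String[] args) {
        Shape[] shapes = { new Circle(1.0), new Square(2.0) };
        for (Shape s : shapes)
            // dynamic binding: the area() actually run depends on the
            // runtime type of s, not on the declared type Shape
            System.out.println(s.area());
    }
}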
Java programming structure
A Java source file is a text file that contains one or more class definitions. The Java compiler
expects these files to be stored with the '.java' filename extension. When Java source code is
compiled, each individual class is put into its own output file, named after the class with a '.class'
extension. Since there are no global functions or variables in Java, the only thing a Java source
file can contain is one or more class definitions; Java requires that all code reside inside a
named class. Java is case sensitive with respect to all keywords and identifiers. In Java,
the code of any method must start with an open brace and end with a close brace. Every
Java application must have a 'main' method, which is simply the starting place for the
interpreter. Java applets do not use a main method at all, since the web browser's Java
runtime has a different convention for bootstrapping applets. In Java every statement must end
with a semicolon, and there is no limit on the length of a statement. Java is a free-form
language: programs can be laid out in any way, provided there is at least one space between
tokens. Java programs are collections of white space, comments, keywords, identifiers, literals,
operators and separators.
Packages and interfaces
Java allows classes to be grouped in collections called packages. Packages are a convenient way
of organizing classes and libraries, and they can be nested. A number of classes with the same
kind of behavior can be grouped under one package.
Packages are imported into the Java programs that require them using the import
keyword. Interfaces provide a mechanism that allows unrelated classes to implement the same
set of methods. An interface is a collection of method prototypes and constant values that is
free from dependency on a specific class. Interfaces are implemented using the implements
keyword.
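For example (all names are illustrative):

// A package groups related classes; import brings them into scope;
// implements binds a class to an interface.
import java.util.List;                 // importing from the java.util package

interface Reportable {                 // method prototypes, no implementation
    String report();
}

class SessionLog implements Reportable {
    private final List<String> entries;
    SessionLog(List<String> entries) { this.entries = entries; }
    public String report() {           // the class supplies the behavior
        return entries.size() + " log entries";
    }
}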
4.2 Introduction to API
The application programming interface (API) forms the heart of any Java program. These APIs are
defined in the corresponding Java packages and are imported into the program. Some of the packages
available in Java are:
* java.lang - includes all language libraries
* java.awt - includes AWT libraries, such as windows, scrollbars, etc.,
for GUI applications
* java.applet - includes the API for applet programming
* java.io - includes all libraries required for input-output applications
* java.awt.image - includes libraries for image processing
* java.net - includes the networking APIs
* java.util - includes general-purpose APIs such as vectors, stacks, etc.
* javax.swing - includes the Swing GUI component libraries
4.3 Java Database Connectivity (JDBC)
The Java Database Connectivity (JDBC) API is a standard Java extension for data access that allows
Java programmers to code to a unified relational database API. Using JDBC, a Java
programmer can represent database connections, issue SQL statements, and process database
results. The API is implemented by a JDBC driver, an adapter that knows how to talk to a
particular database in a proprietary way. JDBC is similar to the Open Database Connectivity
(ODBC) standard, and the two are quite interoperable through JDBC-ODBC bridges.
JDBC
JDBC is a set of interfaces for connecting to SQL tables. With JDBC, we can query and update
the data stored in SQL tables from within our Java programs; in this way, any Java object can be
saved into SQL tables. This Java API is essential for EJB, the core API in J2EE.
JDBC creates a programming-level interface for communicating with databases in a
uniform manner, similar in concept to Microsoft's Open Database Connectivity (ODBC)
component, which has become the standard for personal computers and LANs. The JDBC
standard itself is based on the X/Open SQL Call Level Interface, the same basis as that of
ODBC. This is one of the reasons why the initial development of JDBC progressed so fast.
Object classes for opening transactions with these databases are written completely in Java, to
allow much closer interaction than you would get by embedding C language function calls in
Java programs, as you would have to do with ODBC. This way we can still maintain the security,
the robustness, and the portability that make Java so exciting.
However, to promote its use and maintain some level of backward compatibility, JDBC can be
implemented on top of ODBC and other common SQL APIs from vendors.
JDBC consists of two main layers: the JDBC API supports application-to-JDBC Manager
communications; the JDBC Driver API supports JDBC Manager-to-driver implementation
communications.
In terms of Java classes, the JDBC API consists of:
java.sql.DriverManager - allows the creation of new database connections;
java.sql.Connection - connection-specific data structures;
java.sql.Statement - container class for embedded SQL statements;
java.sql.ResultSet - access control to the results of a statement.
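A typical usage sketch; the connection URL, credentials, table and column names are placeholders, not the project's actual configuration.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

// Typical JDBC usage: open a connection, issue a query, walk the results.
public class JdbcDemo {
    public static void main(String[] args) throws Exception {
        // Loads the JDBC-ODBC bridge driver (needed before JDBC 4.0 auto-loading).
        Class.forName("sun.jdbc.odbc.JdbcOdbcDriver");
        // URL, user and password are placeholders for a real database.
        Connection con = DriverManager.getConnection(
                "jdbc:odbc:sessiondb", "user", "password");
        try {
            Statement st = con.createStatement();
            ResultSet rs = st.executeQuery("SELECT id, duration FROM sessions");
            while (rs.next())
                System.out.println(rs.getInt("id") + " " + rs.getDouble("duration"));
            rs.close();
            st.close();
        } finally {
            con.close();
        }
    }
}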
5. UML DIAGRAMS
Use case diagram: [diagram omitted]
Class Diagram: [diagram omitted]
Sequence Diagram: [diagram omitted]
6. TESTING
Introduction:
Testing is a scheduled process carried out by the software development team to capture all
possible errors and missing operations, and to perform a complete verification that objectives are
met and user requirements are satisfied. The design of tests for software and other engineering
products can be as challenging as the initial design of the product itself.
Testing Types:
A software engineering product can be tested in one of two ways:
Black box testing
White box testing
Black box testing:
Knowing the specified functions that a product has been designed to perform, tests determine
whether each function is fully operational.
White box testing:
Knowing the internal workings of a software product, tests determine whether the internal
operations implementing the functions perform according to the specification, and whether all
internal components have been adequately exercised.
Testing Strategies:
Four testing strategies often adopted by software development teams are:
Unit Testing
Integration Testing
Validation Testing
System Testing
This system was tested using the Unit Testing and Integration Testing strategies,
because they were the most relevant approaches for this project.
Unit Testing:
We adopted white box testing when using this technique. The testing was carried
out on the individual components of the software as they were designed; each module was
tested using this technique during the coding phase. Every component was checked to make sure
that it adheres strictly to the specifications spelt out in the data flow diagram and that it
performs its intended purpose.
All variable names were scrutinized to make sure that they truly reflect the elements they
represent. All looping mechanisms were verified to ensure that they behave as designed.
Besides this, we traced through the code manually to capture syntax errors and logic
errors.
Integration Testing:
After finishing the unit testing process, the next step is integration testing. In this
process we focused on identifying the interfaces between components and their
functionality as dictated by the DFD diagram. The bottom-up incremental approach was
adopted during this testing: low-level modules were integrated and combined as a cluster before
testing.
The black box testing technique was employed here. The interfaces between the
components were tested first. This allowed any wrong linkages or parameter passing to be
identified early in the development process, since a set of data can simply be passed in and the
returned result checked for acceptance.
Validation Testing:
Software testing and validation is achieved through a series of black box tests that
demonstrate conformity with requirements. A test procedure defines specific test cases that will
be used to demonstrate conformity with requirements. Both the plan and the procedure are
designed to ensure that all functional requirements are achieved, documentation is correct, and
other requirements are met. After each validation test case has been conducted, one of two
possible conditions exists:
The function or performance characteristics conform to the specification and are accepted.
A deviation from the specification is uncovered and a deficiency list is created.
Deviations or errors discovered at this stage of a project can rarely be corrected prior to
scheduled completion; it is then necessary to negotiate with the customer to establish a method for
resolving the deficiencies.
System Testing:
System testing is a series of different tests whose primary purpose is to fully exercise the
computer-based system. Although each test has a different purpose, all of them verify
that the system elements have been properly integrated and perform their allocated functions.
System testing also ensures that the project works well in its environment. It traps
errors and allows convenient processing of errors without exiting the program abruptly.
Recovery testing is done in such a way that a failure is forced on the software system and
it is checked whether the recovery is proper and accurate. The performance of the system is highly
effective.
Software testing is a critical element of software quality assurance and represents the ultimate
review of specification, design and coding. Test case design focuses on a set of techniques for the
creation of test cases that meet the overall testing objectives. Planning and testing of a programming
system involve formulating a set of test cases which are similar to the real data that the system is
intended to manipulate. Test cases consist of input specifications, a description of the system
functions exercised by the input, and a statement of the expected output. Thorough testing
involves producing cases to ensure that the program responds as expected to both valid and
invalid inputs, that the program performs to specification, and that it does not corrupt other
programs or data in the system.
In principle, testing of a program must be extensive: every statement in the program
should be exercised and every possible path combination through the program should be
executed at least once. In practice, it is necessary to select a subset of the possible test cases and
conjecture that this subset will adequately test the program.
Approach to testing:
Testing a system's capabilities is more important than testing its components. This means
that test cases should be chosen to identify aspects of the system which would stop it doing
its job.
Testing old capabilities is more important than testing new capabilities. If the program is a
revision of an existing system, users expect existing features to keep working.
Testing typical situations is more important than testing boundary value cases. It is more
important that a system works under normal usage conditions than under occasional conditions
which only arise with extreme data values.
Test data:
Test data are the inputs which have been devised to test the system. It is sometimes
possible to generate test data automatically. The data entered at run time by the user are
compared to the data types given in the code; if they do not match, an error is displayed in a
message box and the user is not allowed to proceed. Thus the data are tested.
Algorithm Used for Joining Log Files:
First, we join the different log files from the Log, putting together the requests from all
log files into a joint log file. Generally, the requests in the log files do not include the name of
the server. However, we need the Web server name to distinguish between requests made to
different Web servers, so we add this information to the requests (before the file path).
Moreover, we have to take into account the synchronization of the Web server clocks, including
the time zone differences. Figure 2.2 shows our algorithm for joining Web server log files. In
this algorithm we used the following notations:
[Figure 2.2 and its notation are not reproduced in this transcript.]
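A hedged sketch of the join step just described: each request is tagged with its server name and shifted by a per-server clock correction before the logs are merged in time order. All names are illustrative, since the actual algorithm of Figure 2.2 is not reproduced here.

import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

// Illustrative join of several per-server logs into one joint log:
// tag each request with its server name, normalize clocks, merge by time.
public class LogJoiner {

    public static class Request {
        String serverName;   // added during the join (before the file path)
        String filePath;
        long timestamp;      // seconds, after clock/time-zone correction
    }

    public static class ServerLog {
        String serverName;
        long clockOffset;    // correction for clock skew and time zone
        List<Request> requests;
    }

    public static List<Request> join(List<ServerLog> logs) {
        List<Request> joint = new ArrayList<Request>();
        for (ServerLog log : logs)
            for (Request r : log.requests) {
                r.serverName = log.serverName;   // distinguish servers
                r.timestamp += log.clockOffset;  // synchronize clocks
                joint.add(r);
            }
        Collections.sort(joint, new Comparator<Request>() {
            public int compare(Request a, Request b) {
                return a.timestamp < b.timestamp ? -1
                     : a.timestamp > b.timestamp ? 1 : 0;
            }
        });
        return joint;
    }
}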
7. SCREEN SHOTS
[Screenshots omitted.]