1. INTRODUCTION
A Web user-session, simply called user-session in the remainder of
this paper, is informally defined as the set of TCP connections created by a
given user while surfing the Web during a given time frame. The study of
telecommunication networks has often been based on traffic measurements, which are used to
create traffic models and obtain performance estimates. While a lot of attention has
traditionally been devoted to traffic characterization at the packet and transport layers, few
studies address traffic properties at the session/user layer.
This is due to the difficulty of defining the "session" concept itself, which depends on the
considered application. In addition, the generally accepted conjecture that such sessions follow a
Poisson arrival process may have reduced interest in analyzing the user-session process.
User-session identification and characterization play an important role
both in Internet traffic modeling and in the proper dimensioning of network
resources. Besides increasing our knowledge of network traffic and user
behavior, they yield workload models which may be exploited both for
performance evaluation and for dimensioning network elements. Synthetic
workload generators may be defined to assess network performance.
Finally, knowledge of user-session behavior is important for service
providers, for example to dimension access links and router capacity. User
behavior can be modeled by a few parameters, e.g., session arrival rates and data
volumes.
Web activity alternates between active (ON) periods and silent (OFF) periods,
during which the user is inactive on the Internet.
Clustering techniques:
Clustering techniques are exploratory techniques used in many areas
to analyze large data sets. Given a proper notion of similarity, they find
groups of similar variables/objects by partitioning the data set into "similar"
subsets.
The HTTP protocols have a significant impact on the characteristics of web traffic in the Internet. For
example, measurements of TCP connection usage for early versions of the HTTP protocols
pointed to clear inefficiencies in design, notably the creation of a different TCP connection for
each web object referenced. Recent revisions to the HTTP protocol, notably version 1.1, have
introduced the concepts of persistent connections and pipelining.
Persistent connections enable the reuse of a single TCP connection for multiple
object references at the same IP address (typically embedded components of a web page).
Pipelining allows the client to make a series of requests on a persistent connection without
waiting for a response between requests. We make a number of observations about the
evolving nature of web traffic, the use of the web, the structure of web pages, the organization of
the web servers providing this content, and the use of packet tracing as a method for understanding
the dynamics of web traffic.
Packet: a chunk of information with atomic delivery characteristics; it is either delivered correctly to
the destination or lost (and most often must be retransmitted).
Flow: a concatenation of correlated packets, as in a TCP connection.
Traffic characteristics:
Traffic characteristics, or traffic behavior, can be measured at the following levels:
1) the packet level
2) the flow level
3) the session level
The packet level:
Measuring packet-level characteristics is complex, since substantial theory is needed to
develop traffic engineering tools.
The packet size distribution, averaged over one week in January 2001, confirms that packets
are either very small, i.e., pure control packets, or very large. The distribution of the Time To Live
(TTL) field content distinguishes between incoming and outgoing packets.
Flows:
Flows are defined as streams of packets corresponding to the transfer of a document. The
document may be a web page or a video clip, which may require a number of TCP connections.
Most flows are small, but most of the traffic is carried by a small number of very large flows.
Flows do not occur independently.
Sessions:
A session is a set of TCP connections generated by a given user,
and every session can contain any number of flows. Since a session aggregates flows,
the session arrival process determines the workload offered to servers. An important
measure in traffic analysis is the variability of both the number of flows in a session and
the size of those flows.
1) A user session begins when a user who has been idle opens a TCP connection.
2) A user session ends, and the next idle period begins, when the user has had no open
connections for r consecutive seconds.
In choosing a particular value of r, we wanted to use a time on the order of Web
browsing speed. In typical Web browsing behavior, a human user clicks on a link, and the Web
browser opens several simultaneous
connections to retrieve the information. Then, when the information is presented, the user may
take a few seconds or minutes to digest it before locating the next desired link. We
wanted r to be large enough to keep such a sequence of clicks together.
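A minimal sketch of this threshold rule in Java; the class and method names (Conn, sessionize) are illustrative, not taken from the paper, and connections are assumed to be sorted by opening time.

import java.util.ArrayList;
import java.util.List;

// Illustrative sketch of the threshold rule: a new session starts
// when the user has had no open connections for r consecutive seconds.
public class ThresholdSessionizer {

    /** A TCP connection reduced to its opening and closing times (seconds). */
    public static class Conn {
        final double open, close;
        Conn(double open, double close) { this.open = open; this.close = close; }
    }

    /**
     * Groups connections (sorted by opening time) into sessions: a
     * connection starts a new session if it opens more than r seconds
     * after every earlier connection has closed.
     */
    public static List<List<Conn>> sessionize(List<Conn> conns, double r) {
        List<List<Conn>> sessions = new ArrayList<List<Conn>>();
        List<Conn> current = null;
        double lastClose = Double.NEGATIVE_INFINITY; // latest close time so far
        for (Conn c : conns) {
            if (current == null || c.open - lastClose > r) {
                current = new ArrayList<Conn>(); // idle gap exceeded r: new session
                sessions.add(current);
            }
            current.add(c);
            lastClose = Math.max(lastClose, c.close);
        }
        return sessions;
    }
}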
2. SYSTEM ANALYSIS
2.1 Existing System
Some methods for characterizing the "session" arrival process have been presented
in the literature. However, their focus is on Telnet and FTP sessions, where each session is related to a
single TCP data connection. No measurements of HTTP sessions are
reported.
To identify HTTP user-sessions, traditional approaches rely on the
adoption of a threshold. TCP connections are aggregated into the same
session if the inter-arrival time between two TCP connections is
smaller than the threshold value. Otherwise, the TCP connection is
associated with a newly created user-session.
The threshold-based approach works well only if the threshold value is
correctly matched to the values of the connection and session inter-arrival
times.
Furthermore, different users may show different idle times, and even
the same user may have different idle periods depending on the
service he is accessing, e.g., news or e-commerce. Thus, a priori
knowledge of the proper threshold value is an unrealistic assumption.
If the threshold value is not correctly matched to the session statistical
behavior, threshold-based mechanisms are significantly error prone. To
avoid this drawback, we propose a more robust algorithm to identify
user-sessions.
Many existing systems perform a data analysis of server logs to define
user-sessions. While the server-log approach can be very reliable, it
lacks the capability offered by passive measurements performed at the
packet level, which permit simultaneous monitoring of a user browsing
several servers.
Some models adopt a passive sniffing methodology to rebuild HTTP-layer
transactions and infer client/user behavior. By inspecting HTTP
protocol headers, the sequence of objects referred to by the initial
request is rebuilt. This allows grouping several TCP connections to form
a user-session.
While this approach can be very effective, it does not scale
well and, by leveraging a specific application-level protocol, can
hardly be generalized. Furthermore, since the payload of all packets must
be analyzed, this approach is not practical when, for security or privacy
reasons, data payloads (and application-layer headers) are not
available.
2.2 Proposed System
The main goals of this paper are:
i) to devise a technique that correctly identifies user-sessions, and
ii) to determine their statistical properties by analyzing traces of
measured data.
The aim of the proposed system is to define a clustering technique to
identify user-sessions. Its performance is compared with that of traditional
threshold-based approaches, which partition samples by comparing the
sample-to-sample distance against a given threshold value.
The main advantage of the clustering approach is that it avoids the need to
define a priori any threshold value to separate and group samples. Thus, this
methodology is more robust than simpler threshold-based mechanisms.
By running a clustering algorithm, we avoid the need to set a priori a
threshold value, since clustering techniques automatically adapt to the
actual user behavior. Furthermore, the algorithm does not require any training
phase to run properly. We test the proposed methodology on artificially
generated traces
i) to ensure its ability to correctly identify a set of TCP connections
belonging to the same user-session,
ii) to assess the error performance of the proposed technique, and
iii) to compare it with traditional threshold based mechanisms.
Finally, we run the algorithms over real traffic traces to obtain statistical
information on user-sessions, such as the distributions of
i) session duration,
ii) the amount of data transferred in a single session, and
iii) the number of connections within a single session;
a sketch of computing these statistics is given below.
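The following is a hedged sketch of how such per-session statistics might be computed once sessions have been identified; the Conn fields and class names are hypothetical, not the paper's.

import java.util.List;

// Illustrative per-session summary statistics.
public class SessionStats {

    /** A connection with its opening/closing times (seconds) and byte count. */
    public static class Conn {
        final double open, close;
        final long bytes;           // client plus server payload bytes
        Conn(double open, double close, long bytes) {
            this.open = open; this.close = close; this.bytes = bytes;
        }
    }

    /** i) Session duration: last close minus first open. */
    static double duration(List<Conn> session) {
        double first = Double.POSITIVE_INFINITY, last = Double.NEGATIVE_INFINITY;
        for (Conn c : session) {
            first = Math.min(first, c.open);
            last = Math.max(last, c.close);
        }
        return last - first;
    }

    /** ii) Amount of data transferred in the session. */
    static long bytesTransferred(List<Conn> session) {
        long total = 0;
        for (Conn c : session) total += c.bytes;
        return total;
    }

    /** iii) Number of connections within the session. */
    static int connections(List<Conn> session) {
        return session.size();
    }
}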
In this paper, only TCP headers are analyzed, limiting privacy issues and
significantly reducing probe complexity. Furthermore, the proposed
approach may be adopted for any set of sessions generated by the same
application.
1) The Hierarchical Agglomerative Approach:
Each sample is initially associated with a different cluster when the
procedure starts. Then, on the basis of a cluster-to-cluster distance
definition, the two clusters at minimum distance are merged to form a new
cluster. The algorithm iterates this step until all samples belong to the
same cluster.
This approach can be quite time consuming, especially when the data
set is very large, since the initial number of clusters is equal to the number
of samples in the data set. For this reason, non-hierarchical approaches,
named partitioned, are often preferred, since they show better scalability
properties.
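As an illustration of the agglomerative procedure, here is a naive single-linkage sketch over one-dimensional samples (e.g., connection opening times). The stopping criterion (a target number k of clusters) and all names are illustrative; the paper's procedure iterates until a single cluster remains.

import java.util.ArrayList;
import java.util.List;

// Illustrative single-linkage agglomerative clustering of 1-D samples.
// Naive O(n^3) implementation, matching the description above.
public class Agglomerative {

    /** Repeatedly merges the two closest clusters until k clusters remain. */
    public static List<List<Double>> cluster(List<Double> samples, int k) {
        List<List<Double>> clusters = new ArrayList<List<Double>>();
        for (double s : samples) {             // each sample starts as its own cluster
            List<Double> c = new ArrayList<Double>();
            c.add(s);
            clusters.add(c);
        }
        while (clusters.size() > k) {
            int bi = 0, bj = 1;
            double best = Double.POSITIVE_INFINITY;
            for (int i = 0; i < clusters.size(); i++)          // find the two clusters
                for (int j = i + 1; j < clusters.size(); j++)  // at minimum distance
                    if (dist(clusters.get(i), clusters.get(j)) < best) {
                        best = dist(clusters.get(i), clusters.get(j));
                        bi = i; bj = j;
                    }
            clusters.get(bi).addAll(clusters.remove(bj));      // merge them
        }
        return clusters;
    }

    /** Single-linkage cluster-to-cluster distance: minimum pairwise distance. */
    static double dist(List<Double> a, List<Double> b) {
        double m = Double.POSITIVE_INFINITY;
        for (double x : a) for (double y : b) m = Math.min(m, Math.abs(x - y));
        return m;
    }
}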
2) The Partitioned Approach:
This technique is used when the final number of clusters is known. The procedure starts with an
initial configuration of clusters, selected according to some criterion. The final cluster
definition is obtained through an iterative procedure.
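A minimal k-means-style sketch of such an iterative partitioned procedure, for one-dimensional samples; the paper does not prescribe this specific algorithm, so it is offered purely as an illustration.

// Illustrative 1-D k-means: an iterative partitioned clustering,
// starting from an initial set of k centroids (mutated in place).
public class KMeans1D {

    /** Returns, for each sample, the index of its final cluster. */
    public static int[] cluster(double[] samples, double[] centroids, int iterations) {
        int k = centroids.length;
        int[] assign = new int[samples.length];
        for (int it = 0; it < iterations; it++) {
            // 1) assign each sample to its nearest centroid
            for (int i = 0; i < samples.length; i++) {
                int best = 0;
                for (int j = 1; j < k; j++)
                    if (Math.abs(samples[i] - centroids[j])
                            < Math.abs(samples[i] - centroids[best])) best = j;
                assign[i] = best;
            }
            // 2) recompute each centroid as the mean of its assigned samples
            double[] sum = new double[k];
            int[] cnt = new int[k];
            for (int i = 0; i < samples.length; i++) {
                sum[assign[i]] += samples[i];
                cnt[assign[i]]++;
            }
            for (int j = 0; j < k; j++)
                if (cnt[j] > 0) centroids[j] = sum[j] / cnt[j];
        }
        return assign;
    }
}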
For each TCP connection, the following information is collected:
1) the 4-tuple identifying the connection, i.e., the IP addresses and TCP port
numbers of the client and the server;
2) the connection opening time, identified by the timestamp of the first
client SYN message;
3) the connection ending time, identified by the time instant at which the
TCP connection is terminated;
4) the net amount of bytes sent by the client and the server respectively
(excluding retransmissions); a sketch of such a record is given below.
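The record might look like this in Java (field names are illustrative, not the paper's):

// Illustrative record of the per-connection measurements listed above.
public class ConnectionRecord {
    // 1) the 4-tuple identifying the connection
    String clientIp, serverIp;
    int clientPort, serverPort;
    // 2) opening time: timestamp of the first client SYN (seconds)
    double openTime;
    // 3) ending time: instant at which the TCP connection terminates
    double closeTime;
    // 4) net bytes sent by client and server (retransmissions excluded)
    long clientBytes, serverBytes;
}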
2.3 Modules
i) Authentication
ii) Data set formulation
iii) Session clustering
Module 1: Authentication:
This module provides security assurance for the website users are browsing. It is
essential, since the sites in this system maintain large amounts of data that must be kept
secure. These sites are used by a large number of people, and any illegal activity on a site
would affect all of its members. Security can therefore be improved by requiring some
proof of identity, such as a student ID.
Module 2: Dataset formulation:
Once the user starts browsing, a session for the user is created automatically. If the user is
valid, further TCP connections from the user are sent to the server. A user can
have any number of connections in a single session, and can also have any number of
sessions at a time. All details of the session arrival and TCP connection arrival processes
are stored in a log file or dataset on the server.
Module 3: Session clustering:
The TCP flows in a session are clustered based on the user behavior recorded in the dataset.
The session's active (ON) time can thus last as long as the user needs, and sessions belonging
to the same user are not incorrectly split into inactive periods, because they are clustered
together.
Data Fusion:
At the beginning of data preprocessing, we have the Log, containing the Web
server log files collected by several Web servers, as well as the Web site maps (in XGML format
[PK01], which is an XML application). First, we join the log files and then anonymise the
resulting log file for privacy reasons.
Data cleaning:
A variety of files are accessed as a result of a request by a client to view a particular
Web page. These include image, sound and video files, executable CGI files, coordinates of
clickable regions in image map files, and HTML files. Thus the server logs contain many entries
that are redundant or irrelevant for the data mining tasks.
User Request: Page1.html
Browser Request: Page1.html, a.gif, b.gif
Several entries in the server log correspond to the same user request, hence the redundancy.
The cleaning procedure is straightforward, consisting of the removal of all requests issued by
(Host, User Agent) pairs identified as being Web robots.
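A hedged sketch of this cleaning pass; the Entry fields, the suffix list, and the robot set are illustrative placeholders, not the system's actual data structures.

import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Illustrative cleaning pass over parsed log entries.
public class LogCleaner {

    public static class Entry {
        String host, userAgent, url;   // fields of one parsed log line
    }

    // Hypothetical suffixes of embedded objects to drop from the log.
    static final String[] DROP_SUFFIXES = { ".gif", ".jpg", ".png", ".css", ".js" };

    /** Keeps only page requests issued by non-robot (host, user-agent) pairs. */
    public static List<Entry> clean(List<Entry> log, Set<String> robotPairs) {
        List<Entry> kept = new ArrayList<Entry>();
        for (Entry e : log) {
            if (robotPairs.contains(e.host + "|" + e.userAgent)) continue; // robot
            boolean embedded = false;
            for (String s : DROP_SUFFIXES)
                if (e.url.toLowerCase().endsWith(s)) { embedded = true; break; }
            if (!embedded) kept.add(e);  // keep only page-level requests
        }
        return kept;
    }
}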
Data Summarization:
The last step of data preprocessing contains what we call the "advanced preprocessing
step for WUM". In this step, we first transfer the structured file containing visits or episodes (if
identified) to a relational database. Afterwards, we apply data generalization at the request
level (for URLs) and compute aggregated data for episodes, visits and user sessions to
completely fill in the database.
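A sketch of the aggregated-data computation for a single visit; the aggregate fields are illustrative assumptions, not the schema actually used by the system.

import java.util.List;

// Illustrative aggregation of one visit's requests into the summary
// fields that would be stored in the relational database.
public class VisitSummary {

    public static class Request {
        String url;
        long timestamp;    // seconds
        long bytes;
    }

    int requestCount;
    long startTime, endTime;
    long totalBytes;

    static VisitSummary aggregate(List<Request> visit) {
        VisitSummary s = new VisitSummary();
        s.startTime = Long.MAX_VALUE;
        s.endTime = Long.MIN_VALUE;
        for (Request r : visit) {
            s.requestCount++;
            s.totalBytes += r.bytes;
            s.startTime = Math.min(s.startTime, r.timestamp);
            s.endTime = Math.max(s.endTime, r.timestamp);
        }
        return s;
    }
}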
2.4 Software Specifications
Java Tools:
Language : Java JDK 1.6
Java Technologies : Swing, Servlets, JSP
IDE : NetBeans 6.0
Operating System : Windows 2000/XP
3. SOFTWARE SYSTEM ATTRIBUTES
Reliability
The system shall fail only under unavoidable circumstances, such as network connection failure.
Availability
The system shall allow users to restart the application after a failure, with the loss of the old Peer
ID along with its trust data, and shall provide a new Peer ID by treating the user as a new peer.
Security
The factors that protect the software from accidental or malicious access, use, modification,
destruction, or disclosure are:
The software can be accessed only by authorized users, since they are provided with a
user name and password.
The trust value of one peer is encrypted using public key cryptography and then
transferred to the other peer.
Maintainability
The software is organized into modules with well-defined interfaces, which eases its
maintenance, including maintenance by developers other than the original authors.
Portability
Since the software is developed using host-independent code, it can be executed on different
hosts with different platforms.
4. IMPLEMENTATION
4.1 Java
Java was conceived by James Gosling, Patrick Naughton, Chris Warth, Ed Frank and Mike
Sheridan at Sun Microsystems in 1991. It took 18 months to develop the first
working version. The language was initially called "Oak", but was renamed "Java" in 1995.
Between the initial implementation of Oak in 1992 and the public announcement of Java in 1995,
many more people contributed to the design and evolution of the language.
Java overview
Java is a powerful but lean object-oriented programming language. It has generated a lot of
excitement because it makes it possible to program for the Internet by creating applets, programs that
can be embedded in a web page. The content of an applet is limited only by one's imagination.
For example, an applet can be an animation with sound, an interactive game, or a ticker tape with
constantly updated stock prices. Applets can be just little decorations that liven up a web page, or
they can be serious applications like word processors or spreadsheets.
But Java is more than a programming language for writing applets. It is being
used more and more for writing standalone applications as well. It is becoming so popular that
many people believe it will become the standard language for both general-purpose and Internet
programming.
Java builds on the strengths of C++: it has taken the best features of C++ and added
garbage collection, multithreading and security capabilities. The result is a language that is
powerful and easy to use.
Java is actually a platform consisting of three components:
1. Java programming language
2. Java library of classes and interfaces
3. Java virtual machine
Java is portable
One of the biggest advantages Java offers is that it is portable. An application written in Java will
run on all the major platforms. Any computer with a Java-based browser can run the applications
or applets written in the Java programming language. A programmer no longer has to write one
program for each platform: one version to run on a UNIX machine, another for Windows, and so on.
Developers write code once: Java code is compiled into byte codes rather than a machine language.
These byte codes go to the Java virtual machine, which executes them on the host system.
Java is object oriented
Classes
A class combines data and the code that operates on it into a single user-defined data type. A
class defines the shape and behavior of its objects and data. Once a class has been defined, we
can create any number of objects belonging to that class. As said already, classes are user-defined
data types that behave like the built-in types of the programming language.
Data Abstraction
Data abstraction is an act of representing essential features without including the background
details and explanations.
Encapsulation
Data encapsulation is one of the most striking features of Java. Encapsulation is the wrapping up
of data and functions into a single unit called a class. The wrapper defines the behavior and
protects the code and data from being arbitrarily accessed by the outside world; only the
functions wrapped in the class can access the data. This insulation of data from
direct access by the program is called 'data hiding'.
Inheritance
Inheritance is the process by which objects of one class acquire the properties of objects of
another class. In Java, the concept of inheritance provides the idea of reusability, giving the
means to add features to an existing class without modifying it. This is done by
deriving a new class from the existing one; the newly created class then has the combined
features of both the parent and the child class.
Polymorphism
Polymorphism means the ability to take more than one form, i.e., one object, many shapes.
Polymorphism plays an important role in allowing objects having different internal structures to
share the same external interface. This means that a general class of operations may be accessed
in the same manner even though the specific action associated with each operation may differ.
Dynamic Binding
Binding refers to the linking of a procedure call to the code to be executed in response to the
call. Dynamic binding means that the code associated with a given procedure call is not known
until the time of the call, at run time.
Dynamic binding is closely associated with polymorphism: the method executed through a
reference depends on the dynamic type of that reference.
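A small example tying these concepts together (the classic shapes illustration, not code from this project):

// Inheritance, polymorphism and dynamic binding in one small example.
class Shape {
    double area() { return 0; }               // general behavior
}

class Circle extends Shape {                  // inherits from Shape
    double radius;
    Circle(double r) { radius = r; }
    @Override double area() { return Math.PI * radius * radius; }
}

class Square extends Shape {
    double side;
    Square(double s) { side = s; }
    @Override double area() { return side * side; }
}

public class ShapeDemo {
    public static void main(String[] args) {
        Shape[] shapes = { new Circle(1.0), new Square(2.0) };
        for (Shape s : shapes)
            // dynamic binding: the area() actually run depends on the
            // runtime type of s, not on the declared type Shape
            System.out.println(s.area());
    }
}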
Java programming structure
A Java source file is a text file that contains one or more class definitions. The Java compiler
expects these files to be stored with the '.java' filename extension. When Java source code is
compiled, each individual class is put into its own output file, named after the class with a '.class'
extension. Since there are no global functions or variables in Java, the only thing a Java source
file can contain is one or more class definitions; Java requires that all code reside inside a
named class. Java is case sensitive with respect to all keywords and identifiers. In Java,
the code of any method must start with an open brace and end with a close brace. Every
Java application must have a 'main' method, which is simply the starting place for the
interpreter. Java applets do not use a main method at all, since the web browser's Java
runtime has a different convention for bootstrapping applets. In Java every statement must end
with a semicolon, and there is no limit on the length of a statement. Java is a free-form
language: programs can be laid out in any way, provided there is at least one space between
tokens. Java programs are collections of white space, comments, keywords, identifiers, literals,
operators and separators.
Packages and interfaces
Java allows classes to be grouped in collections called packages. Packages are a convenient way
of organizing classes and libraries, and they can be nested. A number of classes with the same
kind of behavior can be grouped under one package.
Packages are imported into the Java programs that require them using the import
keyword. Interfaces provide a mechanism that allows unrelated classes to implement the same
set of methods. An interface is a collection of method prototypes and constant values that is
free from dependency on a specific class. Interfaces are implemented using the implements
keyword.
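For example (all names are illustrative):

// A package groups related classes; import brings them into scope;
// implements binds a class to an interface.
import java.util.List;                 // importing from the java.util package

interface Reportable {                 // method prototypes, no implementation
    String report();
}

class SessionLog implements Reportable {
    private final List<String> entries;
    SessionLog(List<String> entries) { this.entries = entries; }
    public String report() {           // the class supplies the behavior
        return entries.size() + " log entries";
    }
}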
4.2 Introduction to API
The application programming interface (API) forms the heart of any Java program. These APIs are
defined in the corresponding Java packages and are imported into the program. Some of the packages
available in Java are:
* java.lang - includes all language libraries
* java.awt - includes AWT libraries, such as windows, scrollbars, etc.,
for GUI applications
* java.applet - includes the API for applet programming
* java.io - includes all libraries required for input-output applications
* java.awt.image - includes libraries for image processing
* java.net - includes the networking APIs
* java.util - includes general-purpose APIs such as vectors, stacks, etc.
* javax.swing - includes the Swing GUI component libraries
4.3 Java Database Connectivity (JDBC)
The Java Database Connectivity (JDBC) API is a standard Java extension for data access that allows
Java programmers to code to a unified relational database API. Using JDBC, a Java
programmer can represent database connections, issue SQL statements, and process database
results. The API is implemented by a JDBC driver, an adapter that knows how to talk to a
particular database in a proprietary way. JDBC is similar to the Open Database Connectivity
(ODBC) standard, and the two are quite interoperable through JDBC-ODBC bridges.
JDBC
JDBC is a set of interfaces for connecting to SQL tables. With JDBC, we can query and update
the data stored in SQL tables from within our Java programs; in this way, any Java object can be
saved into SQL tables. This Java API is essential for EJB, the core API in J2EE.
JDBC creates a programming-level interface for communicating with databases in a
uniform manner, similar in concept to Microsoft's Open Database Connectivity (ODBC)
component, which has become the standard for personal computers and LANs. The JDBC
standard itself is based on the X/Open SQL Call Level Interface, the same basis as that of
ODBC. This is one of the reasons why the initial development of JDBC progressed so fast.
Object classes for opening transactions with these databases are written completely in Java, to
allow much closer interaction than you would get by embedding C language function calls in
Java programs, as you would have to do with ODBC. This way we can still maintain the security,
the robustness, and the portability that make Java so exciting.
However, to promote its use and maintain some level of backward compatibility, JDBC can be
implemented on top of ODBC and other common SQL APIs from vendors.
JDBC consists of two main layers: the JDBC API supports application-to-JDBC Manager
communications; the JDBC Driver API supports JDBC Manager-to-driver implementation
communications.
In terms of Java classes, the JDBC API consists of:
java.sql.DriverManager - allows the creation of new database connections;
java.sql.Connection - connection-specific data structures;
java.sql.Statement - container class for embedded SQL statements;
java.sql.ResultSet - access control to the results of a statement.
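A typical usage sketch; the connection URL, credentials, table and column names are placeholders, not the project's actual configuration.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

// Typical JDBC usage: open a connection, issue a query, walk the results.
public class JdbcDemo {
    public static void main(String[] args) throws Exception {
        // Loads the JDBC-ODBC bridge driver (needed before JDBC 4.0 auto-loading).
        Class.forName("sun.jdbc.odbc.JdbcOdbcDriver");
        // URL, user and password are placeholders for a real database.
        Connection con = DriverManager.getConnection(
                "jdbc:odbc:sessiondb", "user", "password");
        try {
            Statement st = con.createStatement();
            ResultSet rs = st.executeQuery("SELECT id, duration FROM sessions");
            while (rs.next())
                System.out.println(rs.getInt("id") + " " + rs.getDouble("duration"));
            rs.close();
            st.close();
        } finally {
            con.close();
        }
    }
}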
5. UML DIAGRAMS
Use case diagram: [diagram omitted]
Class Diagram: [diagram omitted]
Sequence Diagram: [diagram omitted]
6. TESTING
Introduction:
Testing is a scheduled process carried out by the software development team to capture all
possible errors and missing operations, and to perform a complete verification that objectives are
met and user requirements are satisfied. The design of tests for software and other engineering
products can be as challenging as the initial design of the product itself.
Testing Types:
A software engineering product can be tested in one of two ways:
Black box testing
White box testing
Black box testing:
Knowing the specified functions that a product has been designed to perform, tests determine
whether each function is fully operational.
White box testing:
Knowing the internal workings of a software product, tests determine whether the internal
operations implementing the functions perform according to the specification, and whether all
internal components have been adequately exercised.
Testing Strategies:
Four testing strategies often adopted by software development teams are:
Unit Testing
Integration Testing
Validation Testing
System Testing
This system was tested using the Unit Testing and Integration Testing strategies,
because they were the most relevant approaches for this project.
Unit Testing:
We adopted white box testing when using this technique. The testing was carried
out on the individual components of the software as they were designed; each module was
tested using this technique during the coding phase. Every component was checked to make sure
that it adheres strictly to the specifications spelt out in the data flow diagram and that it
performs its intended purpose.
All variable names were scrutinized to make sure that they truly reflect the elements they
represent. All looping mechanisms were verified to ensure that they behave as designed.
Besides this, we traced through the code manually to capture syntax errors and logic
errors.
Integration Testing:
After finishing the unit testing process, the next step is integration testing. In this
process we focused on identifying the interfaces between components and their
functionality as dictated by the DFD diagram. The bottom-up incremental approach was
adopted during this testing: low-level modules were integrated and combined as a cluster before
testing.
The black box testing technique was employed here. The interfaces between the
components were tested first. This allowed any wrong linkages or parameter passing to be
identified early in the development process, since a set of data can simply be passed in and the
returned result checked for acceptance.
Validation Testing:
Software testing and validation is achieved through a series of black box tests that
demonstrate conformity with requirements. A test procedure defines specific test cases that will
be used to demonstrate conformity with requirements. Both the plan and the procedure are
designed to ensure that all functional requirements are achieved, documentation is correct, and
other requirements are met. After each validation test case has been conducted, one of two
possible conditions exists:
The function or performance characteristics conform to the specification and are accepted.
A deviation from the specification is uncovered and a deficiency list is created.
Deviations or errors discovered at this stage of a project can rarely be corrected prior to
scheduled completion; it is then necessary to negotiate with the customer to establish a method for
resolving the deficiencies.
System Testing:
System testing is a series of different tests whose primary purpose is to fully exercise the
computer-based system. Although each test has a different purpose, all of them verify
that the system elements have been properly integrated and perform their allocated functions.
System testing also ensures that the project works well in its environment. It traps
errors and allows convenient processing of errors without exiting the program abruptly.
Recovery testing is done in such a way that a failure is forced on the software system and
it is checked whether the recovery is proper and accurate. The performance of the system is highly
effective.
Software testing is a critical element of software quality assurance and represents the ultimate
review of specification, design and coding. Test case design focuses on a set of techniques for the
creation of test cases that meet the overall testing objectives. Planning and testing of a programming
system involve formulating a set of test cases which are similar to the real data that the system is
intended to manipulate. Test cases consist of input specifications, a description of the system
functions exercised by the input, and a statement of the expected output. Thorough testing
involves producing cases to ensure that the program responds as expected to both valid and
invalid inputs, that the program performs to specification, and that it does not corrupt other
programs or data in the system.
In principle, testing of a program must be extensive: every statement in the program
should be exercised and every possible path combination through the program should be
executed at least once. In practice, it is necessary to select a subset of the possible test cases and
conjecture that this subset will adequately test the program.
Approach to testing:
Testing a system's capabilities is more important than testing its components. This means
that test cases should be chosen to identify aspects of the system which would stop it doing
its job.
Testing old capabilities is more important than testing new capabilities. If the program is a
revision of an existing system, users expect existing features to keep working.
Testing typical situations is more important than testing boundary value cases. It is more
important that a system works under normal usage conditions than under occasional conditions
which only arise with extreme data values.
Test data:
Test data are the inputs which have been devised to test the system. It is sometimes
possible to generate test data automatically. The data entered at run time by the user are
compared to the data types given in the code; if they do not match, an error is displayed in a
message box and the user is not allowed to proceed. Thus the data are tested.
Algorithm Used for Joining Log Files:
First, we join the different log files from the Log, putting together the requests from all
log files into a joint log file. Generally, the requests in the log files do not include the name of
the server. However, we need the Web server name to distinguish between requests made to
different Web servers, so we add this information to the requests (before the file path).
Moreover, we have to take into account the synchronization of the Web server clocks, including
the time zone differences. Figure 2.2 shows our algorithm for joining Web server log files. In
this algorithm we used the following notations:
[Figure 2.2 and its notation are not reproduced in this transcript.]
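A hedged sketch of the join step just described: each request is tagged with its server name and shifted by a per-server clock correction before the logs are merged in time order. All names are illustrative, since the actual algorithm of Figure 2.2 is not reproduced here.

import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

// Illustrative join of several per-server logs into one joint log:
// tag each request with its server name, normalize clocks, merge by time.
public class LogJoiner {

    public static class Request {
        String serverName;   // added during the join (before the file path)
        String filePath;
        long timestamp;      // seconds, after clock/time-zone correction
    }

    public static class ServerLog {
        String serverName;
        long clockOffset;    // correction for clock skew and time zone
        List<Request> requests;
    }

    public static List<Request> join(List<ServerLog> logs) {
        List<Request> joint = new ArrayList<Request>();
        for (ServerLog log : logs)
            for (Request r : log.requests) {
                r.serverName = log.serverName;   // distinguish servers
                r.timestamp += log.clockOffset;  // synchronize clocks
                joint.add(r);
            }
        Collections.sort(joint, new Comparator<Request>() {
            public int compare(Request a, Request b) {
                return a.timestamp < b.timestamp ? -1
                     : a.timestamp > b.timestamp ? 1 : 0;
            }
        });
        return joint;
    }
}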
7. SCREEN SHOTS
[Screenshots omitted.]