
DATA LEAKAGE DETECTION


OBJECTIVE / MOTIVATION

A data distributor has given sensitive data to a set of supposedly trusted agents (third parties).

Some of the data is leaked and found in an unauthorized place (e.g., on the web or somebody’s laptop).

The distributor must assess the likelihood that the leaked data came from one or more agents, as opposed to having been independently gathered by other means.

Data allocation strategies (across the agents) that improve the probability of identifying leakages are studied.

These methods do not rely on alterations of the released data (e.g., watermarks). In some cases the distributor can also inject “realistic but fake” data records to further improve the chances of detecting leakage and identifying the guilty party.

Our goal is to detect when the distributor’s sensitive data has been leaked by agents, and if possible to identify the agent that leaked the data.


AIM

The aim is to detect when the distributor’s sensitive data has been leaked by agents and, if possible, to identify the agent that leaked the data.

ABSTRACT

A data distributor has given sensitive data to a set of supposedly trusted agents (third parties). Some of the data is leaked and found in an unauthorized place (e.g., on the web or on somebody’s laptop). The distributor must assess the likelihood that the leaked data came from one or more agents, as opposed to having been independently gathered by other means. Data allocation strategies (across the agents) that improve the probability of identifying leakages have been proposed. These methods do not rely on alterations of the released data (e.g., watermarks). In some cases the distributor can also inject “realistic but fake” data records to further improve the chances of detecting leakage and identifying the guilty party.


LITERATURE SURVEY

GENERAL INTRODUCTION

The guilt detection approach presented here is related to the data provenance problem: tracing the lineage of the leaked objects essentially amounts to detecting the guilty agents.

Suggested solutions are domain specific, such as lineage tracing for data warehouses, and assume some prior knowledge of the way a data view is created from data sources.

Watermarks were initially used in images, video, and audio data, whose digital representations include considerable redundancy. Watermarking is similar in the sense of providing agents with some kind of receiver-identifying information. However, by its very nature, a watermark modifies the item being watermarked. If the object to be watermarked cannot be modified, then a watermark cannot be inserted; in such cases, methods that attach watermarks to the distributed data are not applicable.

More recent work has also studied the insertion of marks into relational data.

There is also a large body of work on mechanisms that allow only authorized users to access sensitive data through access control policies. Such approaches prevent data leakage, in some sense, by sharing information only with trusted parties. However, these policies are restrictive and may make it impossible to satisfy agents’ requests.


ACHIEVING K-ANONYMITY PRIVACY PROTECTION USING GENERALIZATION AND SUPPRESSION

Abstract: Often a data holder, such as a hospital or bank, needs to share person-specific records in such a way that the identities of the individuals who are the subjects of the data cannot be determined. One way to achieve this is to have the released records adhere to k-anonymity, which means each released record has at least (k−1) other records in the release whose values are indistinct over those fields that appear in external data. So, k-anonymity provides privacy protection by guaranteeing that each released record will relate to at least k individuals even if the records are directly linked to external information. This paper provides a formal presentation of combining generalization and suppression to achieve k-anonymity. Generalization involves replacing (or recoding) a value with a less specific but semantically consistent value. Suppression involves not releasing a value at all. The Preferred Minimal Generalization Algorithm (MinGen), a theoretical algorithm presented therein, combines these techniques to provide k-anonymity protection with minimal distortion. The real-world algorithms Datafly and µ-Argus are compared to MinGen. Both Datafly and µ-Argus use heuristics to make approximations, and so they do not always yield optimal results. It is shown that Datafly can over-distort data and µ-Argus can additionally fail to provide adequate protection.
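To make the two operations concrete, here is a minimal illustrative sketch in Java; the fields (ZIP code, age), the recoding rules, and the class names are assumptions chosen for the example, not taken from the surveyed paper.

import java.lang.StringBuilder;

// Illustrative generalization and suppression on hypothetical record fields.
public class KAnonymityOps {

    // Generalization: recode a ZIP code to a less specific but consistent
    // value, e.g., "02139" with 3 digits kept becomes "021**".
    static String generalizeZip(String zip, int digitsKept) {
        StringBuilder sb = new StringBuilder(zip.substring(0, digitsKept));
        for (int i = digitsKept; i < zip.length(); i++) {
            sb.append('*');
        }
        return sb.toString();
    }

    // Generalization: recode an exact age into a decade range, e.g., 34 -> "30-39".
    static String generalizeAge(int age) {
        int lo = (age / 10) * 10;
        return lo + "-" + (lo + 9);
    }

    // Suppression: do not release the value at all.
    static String suppress() {
        return "*";
    }

    public static void main(String[] args) {
        System.out.println(generalizeZip("02139", 3)); // 021**
        System.out.println(generalizeAge(34));         // 30-39
        System.out.println(suppress());                // *
    }
}

A release built from such recoded fields can then be checked for k-anonymity by counting how many records share each combination of recoded values.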

WATERMARKING DIGITAL IMAGES FOR COPYRIGHT PROTECTION

A watermark is an invisible mark placed on an image that is designed to identify both the source of an image and its intended recipient. The authors present an overview of watermarking techniques and demonstrate a solution to one of the key problems in image watermarking, namely how to hide robust invisible watermarks.


WHY AND WHERE: A CHARACTERIZATION OF DATA PROVENANCE

Abstract: With the proliferation of database views and curated databases, the issue of data provenance (where a piece of data came from and the process by which it arrived in the database) is becoming increasingly important, especially in scientific databases where understanding provenance is crucial to the accuracy and currency of data. In this paper we describe an approach to computing provenance when the data of interest has been created by a database query. We adopt a syntactic approach and present results for a general data model that applies to relational databases as well as to hierarchical data such as XML. A novel aspect of our work is a distinction between “why” provenance (the source data that had some influence on the existence of the data) and “where” provenance (the location(s) in the source databases from which the data was extracted).

PRIVACY, PRESERVATION AND PERFORMANCE: THE 3 P’S OF DISTRIBUTED DATA MANAGEMENT

Privacy, preservation and performance (the “3 P’s”) are central design objectives for secure distributed data management systems. However, these objectives tend to compete with one another. This paper introduces a model for describing distributed data management systems, along with a framework for measuring privacy, preservation and performance. The framework enables a system designer to quantitatively explore the tradeoffs between the 3 P’s.


EXISTING SYSTEM

Perturbation

Applications where the original sensitive data cannot be perturbed have been considered. Perturbation is a very useful technique in which the data is modified and made “less sensitive” before being handed to agents. For example, one can add random noise to certain attributes, or one can replace exact values by ranges. However, in some cases it is important not to alter the original distributor’s data. For example, if an outsourcer is doing our payroll, he must have the exact salary and customer bank account numbers. If medical researchers will be treating patients (as opposed to simply computing statistics), they may need accurate data for the patients.

Watermarking

Traditionally, leakage detection is handled by watermarking, e.g., a unique code is embedded in each distributed copy. If that copy is later discovered in the hands of an unauthorized party, the leaker can be identified. Watermarks can be very useful in some cases but, again, involve some modification of the original data. Furthermore, watermarks can sometimes be destroyed if the data recipient is malicious.


Disadvantages

Consider applications where the original sensitive data cannot be perturbed.

Perturbation is a very useful technique where the data is modified and made “less sensitive” before being handed to agents.

However, in some cases it is important not to alter the original distributor’s data.

Traditionally, leakage detection is handled by watermarking, e.g., a unique code is embedded in each distributed copy.

If that copy is later discovered in the hands of an unauthorized party, the leaker can be identified.

Watermarks can be very useful in some cases, but again, involve some modification of the original data.

Furthermore, watermarks can sometimes be destroyed if the data recipient is malicious.


PROPOSED SYSTEM

Unobtrusive techniques for detecting leakage of a set of objects or records have been studied. After giving a set of objects to agents, the distributor discovers some of those same objects in an unauthorized place. (For example, the data may be found on a web site, or may be obtained through a legal discovery process.) At this point the distributor can assess the likelihood that the leaked data came from one or more agents, as opposed to having been independently gathered by other means. Using an analogy with cookies stolen from a cookie jar: if Freddie is caught with a single cookie, he can argue that a friend gave him the cookie; but if Freddie is caught with five cookies, it will be much harder for him to argue that his hands were not in the cookie jar. If the distributor sees “enough evidence” that an agent leaked data, he may stop doing business with him, or may initiate legal proceedings.

A model for assessing the “guilt” of agents has been developed, and an algorithm for distributing objects to agents in a way that improves the chances of identifying a leaker has been proposed. The option of adding “fake” objects to the distributed set has also been considered. Such objects do not correspond to real entities but appear realistic to the agents. In a sense, the fake objects act as a type of watermark for the entire set, without modifying any individual members. If it turns out an agent was given one or more fake objects that were leaked, then the distributor can be more confident that the agent was guilty.


Advantages

After giving a set of objects to agents, the distributor discovers some of those same objects in an unauthorized place.

At this point the distributor can assess the likelihood that the leaked data came from one or more agents, as opposed to having been independently gathered by other means.

If the distributor sees “enough evidence” that an agent leaked data, he may stop doing business with him, or may initiate legal proceedings.

A model for assessing the “guilt” of agents is developed.

Algorithms are presented for distributing objects to agents in a way that improves the chances of identifying a leaker.

The option of adding “fake” objects to the distributed set is considered. Such objects do not correspond to real entities but appear realistic to the agents.

If it turns out an agent was given one or more fake objects that were leaked, then the distributor can be more confident that agent was guilty.


REQUIREMENT ANALYSIS

User Requirements

The user needs to perform the following activities to create the data leakage model.

Distributor

The distributor forwards the requested data to the agents along with the created fake objects. The distributor must load a text file containing the leaked data; he can then compare the data in the text file with the data that was allocated to each agent.

Agents

Agents receive the data from the distributor, which they can use for their own purposes. Sometimes an agent may leak data to an unauthorized person.

Unauthorized person

An unauthorized person may obtain data from an agent.


System Requirements:

Hardware requirements:

Processor : Any processor above 500 MHz.
RAM : 128 MB.
Hard disk : 10 GB.
Compact disk : 650 MB.
Input device : Standard keyboard and mouse.
Output device : VGA or high-resolution monitor.

Software requirements:

Operating system : Windows family.
Language : JDK 1.5
Front end : Java Swing
Database : MS Access

Introduction to JAVA:

Java is a programming language developed by Sun Microsystems. Java is an object-oriented programming language that is used in conjunction with Java-enabled web browsers. These browsers can interpret the byte codes created by the language compiler. The technical design of Java is architecture neutral; the term “architecture” in this sense refers to computer hardware. Programmers can create Java programs without having to worry about the underlying architecture of a user’s computer; instead, the HotJava browser adapts the program to the user’s machine.

Features and advantages of JAVA:

To see how Java fits the needs of network communication, programmers need to understand its more specific technical characteristics. Java is a simple, object-oriented, distributed, interpreted, robust, secure, architecture-neutral, portable, high-performance, multithreaded, and dynamic language.

Java is platform independent. Because Java was designed to support distributed applications over computer networks, it can be used with a variety of CPUs and operating systems. To achieve this goal, a compiler was created that produces architecture-neutral object files from Java code.


Java is secure. Java allows virus-free, tamper-free systems to be created because of the number of security checks performed before a piece of code can be executed. This is possible because pointers and direct memory allocation are removed at compile time.

Java is distributed. Java is built with network communications in mind, and this capability comes for free: Java has a comprehensive library of routines for dealing with network protocols such as Transmission Control Protocol/Internet Protocol (TCP/IP).

Java is object oriented. The object-oriented features of Java are essentially those of C++. The object-oriented approach supplies a generic form of Lego bricks, from which all other Lego bricks can be derived.

Java is robust. Java requires declarations, ensuring that the data types passed to a routine are exactly the data types the routine requires. Programmers must write casts explicitly; furthermore, Java does not allow automatic casting between incompatible data types.

Java is dynamic. Java is designed to adapt to a constantly evolving environment. Java is capable of dynamically linking in new class libraries, methods, and instance variables as it goes, without breaking.

Java is strongly associated with the Internet because the first application program written in Java was HotJava, a web browser that runs applets on the Internet. Internet users can use Java to create applet programs and run them locally using a “Java-enabled browser”; they can also use a Java-enabled browser to download an applet located on a computer anywhere on the Internet and run it on their local computer.


IMPLEMENTATION

5.1. SOFTWARE DEVELOPMENT FLOW:

SOFTWARE DEVELOPMENT LIFE CYCLE

Fig: Waterfall software development life cycle: concept exploration, design, implementation, test, installation and checkout, operation and maintenance, and eventual replacement, with feedback flowing between phases (what, how, operation, evolve).


5.1 Waterfall Cycle

The waterfall life-cycle model describes a sequence of activities that begins with concept exploration and concludes with maintenance and eventual replacement. The waterfall model caters to forward engineering of software products. This means starting with a high-level conceptual model for a system. After a description of the conceptual model for a system has been worked out, the software process continues with the design, implementation, and testing of a physical model of the system.

5.2 Design:

System design is the process of planning a new system, moving from the problem domain to the solution domain. The design phase translates the logical aspects of the system into physical aspects of the system.

JAVA:

Java is a new computer programming language developed by Sun Microsystems. Java has a good chance of being the first really successful new computer language in several decades. Advanced programmers like it because it has a clean, well-designed definition. Businesses like it because it dominates an important new application area, Web programming.

Java has several important features:

A Java program runs exactly the same way on all computers. Most other languages allow small differences in interpretation of the standards.

It is not just the source that is portable: a compiled Java program is a stream of bytes that can be run on any machine. An interpreter program is built into Web browsers, though it can also run separately. Java programs can be distributed through the Web to any client computer.

Java applets are safe. The interpreter program does not allow Java code loaded from the network to access local disk files, other machines on the local network, or local databases. The code can display information on the screen and communicate back to the server from which it was loaded.

A group at Sun reluctantly invented Java when they decided that existing computer languages could not solve the problem of distributing applications over the network. C++ inherited many unsafe practices from the old C language, and Basic was too static and constrained to support the development of large applications and libraries.

Today, every major vendor supports Java. Netscape incorporates Java support in every version of its browser and server products. Oracle supports Java on the client, the Web server, and the database server. IBM looks to Java to solve the problems caused by its heterogeneous product line.

The Java programming language and environment is designed to solve a number of problems in modern programming practice. It has many interesting features that make it an ideal language for software development. It is a high-level language that can be characterized by all of the following buzzwords:

Features

Sun describes Java as:


Simple

Object-oriented

Distributed

Robust

Secure

Architecture neutral

Portable

Interpreted

High performance

Multithreaded

Dynamic

Java is simple.

By “simple”, Sun means small and familiar. Sun designed Java as closely to C++ as possible in order to make the system more comprehensible, but removed many rarely used, poorly understood, confusing features of C++. These primarily include operator overloading, multiple inheritance, and extensive automatic coercions. The most important simplification is that Java does not use pointers and implements automatic garbage collection, so that we do not need to worry about dangling pointers, invalid pointer references, memory leaks, or manual memory management.

Java is object-oriented.

This means that the programmer can focus on the data in the application and the interface to it. In Java, everything must be done via method invocations on Java objects. We must view the whole application as an object, an object of a particular class.


Java is distributed.

Java is designed to support applications on networks. Java supports various levels of network connectivity through classes in java.net. For instance, the URL class provides a very simple interface to networking. If we want more control over the downloaded data than the simpler URL methods provide, we can use the URLConnection object returned by URL.openConnection(). We can also do our own networking with the Socket and ServerSocket classes.
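As a small illustration of the standard java.net classes mentioned above (the address used is only a placeholder), the following sketch opens a URLConnection and prints the response:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;
import java.net.URLConnection;

public class UrlDemo {
    public static void main(String[] args) throws Exception {
        // Placeholder address; any reachable HTTP URL works.
        URL url = new URL("http://example.com/");
        // URLConnection gives more control than URL's convenience methods.
        URLConnection conn = url.openConnection();
        conn.setRequestProperty("User-Agent", "UrlDemo");
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(conn.getInputStream()))) {
            String line;
            while ((line = in.readLine()) != null) {
                System.out.println(line); // print each line of the response
            }
        }
    }
}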

Java is robust.

Java is designed for writing highly reliable, robust software. Java puts a lot of emphasis on early checking for possible problems, later dynamic (runtime) checking, and eliminating situations that are error prone. The removal of pointers eliminates the possibility of overwriting memory and corrupting data.

Java is secure.

Java is intended to be used in networked environments. Toward that end, Java implements several security mechanisms to protect us against malicious code that might try to invade our file system. Java provides a firewall between a networked application and our computer.

Java is architecture-neutral.

Java programs are compiled to an architecture-neutral byte-code format. The primary advantage of this approach is that it allows a Java application to run on any system that implements the Java Virtual Machine. This is useful not only for networks but also for single-system software distribution. With the multiple flavors of Windows 95 and Windows NT on the PC, and the new PowerPC Macintosh, it is becoming increasingly difficult to produce software that runs on all platforms.

Java is portable.

The portability actually comes from architecture neutrality, but Java goes even further by explicitly specifying the size of each of the primitive data types to eliminate implementation dependence. The Java system itself is quite portable: the Java compiler is written in Java, while the Java run-time system is written in ANSI C with a clean portability boundary.

Java is interpreted.

The Java compiler generates byte codes, and the Java interpreter executes the translated byte codes directly on any system that implements the Java Virtual Machine. Java's linking phase is only a process of loading classes into the environment.

Java is high-performance.

Compared with high-level, fully interpreted scripting languages, Java is high-performance. If just-in-time compilers are used, Sun claims that the performance of byte codes converted to machine code is nearly as good as native C or C++. Java, moreover, was designed to perform well even on very low-power CPUs.

Java is multithreaded.

Java provides support for multiple threads of execution that can handle different tasks, via the Thread class in the java.lang package. The Thread class supports methods to start a thread, run a thread, stop a thread, and check the status of a thread. This makes programming in Java with threads much easier than programming in the conventional single-threaded C and C++ style.
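A minimal sketch of the java.lang.Thread usage described above; the task and the thread name are illustrative:

public class ThreadDemo {
    public static void main(String[] args) throws InterruptedException {
        // Each Thread wraps a Runnable task; start() begins concurrent execution.
        Thread worker = new Thread(new Runnable() {
            public void run() {
                for (int i = 0; i < 3; i++) {
                    System.out.println("worker step " + i);
                }
            }
        }, "worker-1");
        worker.start();                        // start the thread
        System.out.println(worker.getState()); // check on its status
        worker.join();                         // wait for it to finish
    }
}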

Java is dynamic.

The Java language was designed to adapt to an evolving environment; it is a more dynamic language than C or C++. Java loads in classes as they are needed, even from across a network. This makes upgrading software much easier and more effective. With the compiler, we first translate a program into an intermediate language called Java byte codes, the platform-independent codes interpreted by the interpreter on the Java platform. The interpreter parses and runs each Java byte-code instruction on the computer. Compilation happens just once; interpretation occurs each time the program is executed. Java byte codes can be thought of as the machine code instructions for the Java Virtual Machine (JVM). Every Java interpreter, whether it is a development tool or a web browser that can run applets, is an implementation of the Java Virtual Machine.

The Java Platform

A platform is the hardware or software environment in which a program runs. Most platforms can be described as a combination of the operating system and hardware. The Java platform differs from most other platforms in that it is a software-only platform that runs on top of other, hardware-based platforms.

The Java platform has two components:

1. The Java Virtual Machine (JVM).

2. The Java Application Programming Interfaces (Java API).

The JVM has been explained above. It is the base for the Java platform and is ported onto various hardware-based platforms.

The Java API is a large collection of ready-made software components that provide many useful capabilities, such as graphical user interfaces (GUIs). The Java API is grouped into libraries of related classes and interfaces; these libraries are known as packages.


Native code is code that, after compilation, runs on a specific hardware platform. As a platform-independent environment, the Java platform can be a bit slower than native code. However, smart compilers, well-tuned interpreters, and just-in-time byte-code compilers can bring performance close to that of native code without threatening portability.

MODULE DESCRIPTION:

DATA ALLOCATION:

This module is mainly designed to transfer data from the distributor to the agents. The same module can also be used to model illegal data transfer from authorized agents to other agents.

The distributor intelligently gives data to agents in order to improve the chances of detecting a guilty agent. Four instances of this problem can be addressed, depending on the type of data requests made by agents and whether “fake objects” are allowed. The two types of requests handled are sample and explicit. Fake objects are objects generated by the distributor; they are designed to look like real objects and are distributed to agents together with the real data, in order to increase the chances of detecting agents that leak data.

Either all agents make explicit requests or all agents make sample requests; the results can be extended to handle mixed cases, with some explicit and some sample requests.
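The following is an illustrative sketch of the allocation step, not the report’s exact algorithm; the class, method, and record representations are assumptions. The distributor answers an explicit request and appends a fake object unique to the agent:

import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Sketch: allocate explicitly requested records to an agent, plus one
// agent-specific fake record (names and structure are illustrative).
public class Allocator {
    private int fakeCounter = 0;

    // Returns the records the agent asked for, with a unique fake appended.
    public List<String> allocateExplicit(String agentId,
                                         Set<String> requested,
                                         List<String> dataset) {
        List<String> allocation = new ArrayList<String>();
        for (String record : dataset) {
            if (requested.contains(record)) {
                allocation.add(record);
            }
        }
        // One fake object per agent; it looks like a real record but is traceable.
        allocation.add(makeFake(agentId));
        return allocation;
    }

    private String makeFake(String agentId) {
        fakeCounter++;
        return "FAKE-" + agentId + "-" + fakeCounter; // unique to this agent
    }
}

A sample request would differ only in how the records are chosen (a random subset of the dataset rather than an explicit list).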


GUILT MODEL

This module is designed using the agent guilt model. When the distributor sends data to an agent, at run time this module allocates a unique fake object to each tuple provided to the agent. A copy of the data transferred to the agents is stored in the distributor’s database.

The distributor adds fake objects to the distributed data in order to improve the effectiveness of detecting guilty agents. However, fake objects may impact the correctness of what agents do, so they may not always be allowable.

In this module, the set of distributed objects is perturbed by adding fake elements. In some applications, fake objects may cause fewer problems than perturbing real objects. For example, say the distributed data objects are patient records and the agents are researchers. In this case, even small modifications to the records of actual patients may be undesirable. However, the addition of some fake objects may be acceptable.


AGENT-GUILT MODEL:

This module is mainly designed for determining guilty agents. It uses the fake objects (stored in the database by the guilt model module) and determines the guilty agent along with the corresponding probability.

Once the distributor finds his data in unauthorized places, he can compare the released data with his copy of the data distributed to the agents, and thereby identify the guilty agent.

To compute this probability, we need an estimate of the probability that values can be “guessed” by the target. We use the probability of guessing to identify agents that have leaked information. These probabilities are estimated based on experiments.
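One plausible way to turn this description into code is sketched below; it is not the report’s exact model. The simplifying assumption is that each leaked value could have been guessed independently with probability p, so an agent holding many of the leaked objects is unlikely to be innocent. All names are illustrative.

import java.util.Set;

// Sketch: probability that an agent's overlap with the leaked set is NOT
// explained by independent guessing (assumption: each leaked value is
// guessable with probability p, independently of the others).
public class GuiltModel {
    private final double p; // estimated probability a value can be guessed

    public GuiltModel(double guessProbability) {
        this.p = guessProbability;
    }

    // Higher result = stronger evidence that the agent leaked the data.
    public double guiltProbability(Set<String> leaked, Set<String> allocatedToAgent) {
        double probAllGuessed = 1.0;
        for (String obj : leaked) {
            if (allocatedToAgent.contains(obj)) {
                // Each matching object could have been guessed with probability p.
                probAllGuessed *= p;
            }
        }
        // 1 - P(every matching object was independently guessed elsewhere).
        return 1.0 - probAllGuessed;
    }
}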

OVERLAP MINIMIZATION

When the distributor allocates data upon request from agents, there is a chance that the same tuples are requested by more than one agent. This creates overlap between agents when the same tuple is given to more than one of them; when the data is leaked, the distributor must still be able to assess which agent is guilty.

Thus a solution for overlap minimization is arrived at: when the distributor allocates data along with fake objects, the fake objects should be constructed so that, for each agent and each tuple, the fake object is unique. A sketch of this is given below.
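A minimal sketch of this uniqueness requirement (class and field names are assumptions):

import java.util.HashMap;
import java.util.Map;

// Sketch: guarantee a distinct fake object for every (agent, tuple) pair,
// so overlapping allocations remain distinguishable.
public class FakeObjectFactory {
    private final Map<String, String> issued = new HashMap<String, String>();
    private int counter = 0;

    public String fakeFor(String agentId, String tupleId) {
        String key = agentId + "|" + tupleId;
        String fake = issued.get(key);
        if (fake == null) {
            counter++;
            fake = "FAKE-" + counter; // never reused across agents or tuples
            issued.put(key, fake);
        }
        return fake;
    }
}

When the same tuple is allocated to two agents, each copy carries a different fake object, so a leaked copy points back to its source.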


FINDING PROBABILITY

When the distributor finds the leaked data in an unauthorized place, such as a website or a laptop, he must be able to determine which agent is guilty. To do this, he can compute, for each agent, the ratio of the number of that agent’s allocated records found in the unauthorized place to the total number of records found there.

From this it can be concluded that the agent with the highest probability is most likely the guilty agent. Fake objects are used here to confirm the guilty agent more conclusively.
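A minimal sketch of this ratio computation (names are illustrative; it assumes the leaked records and each agent’s allocation are available as collections, and that at least one leaked record was found):

import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Sketch: for each agent, score = |leaked ∩ allocated| / |leaked|.
public class ProbabilityFinder {
    public static Map<String, Double> scores(List<String> leaked,
                                             Map<String, Set<String>> allocations) {
        Map<String, Double> result = new HashMap<String, Double>();
        for (Map.Entry<String, Set<String>> e : allocations.entrySet()) {
            int matches = 0;
            for (String record : leaked) {
                if (e.getValue().contains(record)) {
                    matches++; // leaked record that this agent was given
                }
            }
            result.put(e.getKey(), (double) matches / leaked.size());
        }
        return result; // the agent with the highest score is the prime suspect
    }
}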


SYSTEM ARCHITECTURE

Fig: System architecture. Agents (Agent 1 through Agent 5) send explicit or sample requests to the distributor; at run time the distributor creates fake objects and allocates them, together with the sensitive data, to the agents. Data leaks out to an unauthorized person; when the distributor finds his data in an unauthorized place, he compares the leaked file with his database (DB) to find the guilty agent.


Data Flow Diagram:

LEVEL 1: DATA TRANSFER PROCESS

Fig: Level 1 data flow diagram. Entities: distributor, agents, unauthorized person (third party). Flows: an agent gives a request (agent id) to the distributor; the distributor adds a fake object and transfers the data (agent id, fake object, number of files transferred) to the agent; a guilty agent's data then flows on to the unauthorized third party.


LEVEL 2: GUILT MODEL ANALYSIS

Fig: Level 2 data flow diagram (guilt model analysis). The distributor views the released data (agent id, fake object), finds the probability for each agent, and identifies the guilty agent.


UML DIAGRAM

USE CASE DIAGRAM:

Fig: Use case diagram. Distributor: add fake objects, receive explicit or sample requests, distribute data to agents, find probability, find guilty agent. Agent: transfer data to an unauthorized person.


SEQUENCE DIAGRAM:

Fig: Sequence diagram. Login as distributor; distribute data to agents and store the data in the database; view the distributed data (checking the database for data leakage); find guilty agents; compute the probability distribution of data transferred to agents.


COLLABORATION DIAGRAM:

Fig: Collaboration diagram. 1: login as distributor; 2: store data into the database; 3: view from the database for data leakage; 4: find the probability of data transfer to agents.


ACTIVITY DIAGRAM:

Fig: Activity diagram. Login → distribute data to agents → view data distributed to agents → find guilty agents → find probability of data leakage.


ARCHITECTURE DIAGRAM:


4. Major software functions

Number  Description                   Function
1       Windows XP/7 with MS Office   Maintaining compatibility with the software versions used.
2       MS Access                     Maintains the records of user, intruder, and signature logs.
3       NetBeans IDE 6.9              Coding and creation of the user interface.
4       JDK 1.6.0                     Java development and runtime environment.


4 RISK MANAGEMENT

This section discusses project risks and the approach to managing them.

4.3.1 Project Risks

The RMMM plan tackles risk through risk assessment and risk control. Risk assessment involves risk identification, risk analysis, and risk prioritization, while risk control involves risk management planning, risk resolution, and risk monitoring.

Project name: Data Leakage Detection.

Purpose:

The RMMM plan outlines the risk management strategy adopted. We adopt a proactive approach to tackling risks and thus reduce the performance, schedule, and cost overruns we might incur due to the occurrence of unexpected problems. This “Risk Mitigation, Monitoring and Management Plan” identifies the risks associated with our project, Data Leakage Detection. In addition to project risks and technical risks, business risks are also identified, analyzed, and documented. This document outlines the strategy we have adopted to avoid these risks. A contingency plan is also prepared for each risk, in case it becomes a reality. Only those risks whose probability and impact are relatively high, i.e., above a referent level, have been treated.

4.3.2 Risk Table

Impact levels: the risks are categorized on the basis of their probability of occurrence and the impact they would have if they do occur. Impact is rated as follows:

Catastrophic  1
Critical      2
Marginal      3
Negligible    4

Sr No  Risk                                                           Category   Probability  Impact
1      Increase of work load                                          Personnel  20%          3
2      Inexperience in project software environment                   Technical  25%          3
3      Overly optimistic schedules                                    Project    20%          3
4      Lack of sufficient research                                    Technical  50%          3
5      Modules require more testing and further implementation work   Project    50%          2
6      Inconsistency in input                                         Project    30%          3

Table: 4.3

4.4 PROJECT TIMELINE

AUGUST–SEPTEMBER

1 Requirement gathering
   a. Information from the Internet
   b. Information from books
   c. Information from stakeholders

2 Requirement analysis
   a. Analysis of information related to mobile agents
   b. Analysis of information related to RMI
   c. Analysis of information related to the wireless environment
   d. Analysis of NetBeans IDE 6.9

OCTOBER–NOVEMBER

3 Problem definition
   a. Meet internal guide
   b. Identify project constraints
   c. Establish project statement

4 Feasibility
   a. Economic feasibility
   b. Technical feasibility
   c. Behavioral feasibility

DECEMBER–JANUARY

5 Planning
   a. Scheduling of tasks
   b. Task division and time allocation
   c. Effort allocation
   d. Resource allocation

6 Designing
   a. RMI logic studied
   b. Designing of GUI
   c. Designing of database

FEBRUARY–MARCH

7 Coding
   a. Coding of GUI
   b. Coding of RMI
   c. Coding of RMI registry
   d. Designing of database

APRIL–MAY

8 Implementation details
   a. Linking GUI and project files

9 Testing
   a. Unit testing
   b. Integration testing
   c. System testing

10 Evaluation
   a. Project evaluation
   b. Documentation review and recommendation

Table: 4.4 (in the original report, each task is scheduled against weeks W1–W4 of each month)


Testing

6.1 INTRODUCTION

Testing is the process of executing a program with the intent of finding errors. Testing is a process used to help identify the correctness, completeness, and quality of developed computer software. Testing helps in verifying and validating that the software is working as it is intended to work.

What is software testing?

Software testing means exercising (analyzing) a system or component with defined inputs, capturing the monitored outputs, and comparing those outputs with the specified or intended requirements, so as to maximize the number of errors found by a finite number of test cases. Testing is successful if it shows that the product does something it should not do, or fails to do something it should do.

Test cases are devised with this purpose in mind. A test case is a set of data that the system will process as input. The data are created with the intent of determining whether the system will process them correctly, without any errors, to produce the required output.

6.2 INTEGRATION TESTING

When the individual components are working correctly and meeting the specified objectives, they are combined into a working system. This integration is planned and coordinated so that when a failure occurs, there is some idea of what caused it. In addition, the order in which components are tested affects the choice of test cases and tools. The test strategy explains why and how the components are combined to test the working system. It affects not only the integration timing and coding order, but also the cost and thoroughness of the testing.


6.2.1 BOTTOM-UP INTEGRATION

One popular approach for merging components into the larger system is bottom-up testing. When this method is used, each component at the lowest level of the system hierarchy is tested individually. Then, the next components to be tested are those that call the previously tested ones. This approach is followed repeatedly until all components are included in the testing. The bottom-up method is useful when many of the low-level components are general-purpose utility routines that are invoked often by others, when the design is object-oriented, or when the system is integrated using a large number of stand-alone reused components.

6.2.2 TOP-DOWN INTEGRATION

Many developers prefer to use a top-down approach, which in many ways is the reverse of bottom-up. The top level, usually one controlling component, is tested by itself. Then, all components called by the tested components are combined and tested as a larger unit. This approach is reapplied until all components are incorporated.

6.3 BLACK BOX TESTING

Black box testing involves testing without knowledge of the internal workings of the item being tested. For example, when black box testing is applied to software engineering, the tester would only know the “legal” inputs and what the expected outputs should be, but not how the program actually arrives at those outputs. Because of this, black box testing can be considered testing with respect to the specifications; no other knowledge of the program is necessary. For this reason, the tester and the programmer can be independent of one another, avoiding programmer bias toward his own work. For this kind of testing, test groups are often used. Also, due to the nature of black box testing, test planning can begin as soon as the specifications are written. The opposite of this is glass box testing, where test data are derived from direct examination of the code to be tested. For glass box testing, the test cases cannot be determined until the code has actually been written. Both of these testing techniques have advantages and disadvantages, but when combined, they help to ensure thorough testing of the product.


6.3.1 TESTING STRATEGIES AND TECHNIQUES

Black box testing should make use of randomly generated inputs (only a test range should be specified by the tester), to eliminate any guesswork by the tester.

Data outside of the specified input range should be tested to check the robustness of the program.

Boundary cases should be tested (top and bottom of the specified range) to make sure that the highest and lowest allowable inputs produce proper output; a minimal sketch of such a test follows this list.

The number zero should be tested when numerical data is given as input.

Stress testing (trying to overload the program with inputs to see where it reaches its maximum capacity) should be performed, especially with real-time systems.

Crash testing should be performed to identify the scenarios that would bring the system down.

Test monitoring tools should be used whenever possible to track which tests have already been performed; the outputs of these tests are used to avoid repetition and to aid in software maintenance.

Other functional testing techniques include transaction testing, syntax testing, domain testing, logic testing, and state testing.

Finite state machine models can be used as a guide to design the functional tests.
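As a hedged illustration of boundary-case testing (JUnit 4 style; the AgeValidator class and its 0–120 range are hypothetical examples invented for this sketch, not part of this project):

import static org.junit.Assert.assertFalse;
import static org.junit.Assert.assertTrue;
import org.junit.Test;

// Hypothetical unit under test: accepts ages in the specified range 0..120.
class AgeValidator {
    static boolean validateAge(int age) {
        return age >= 0 && age <= 120;
    }
}

public class AgeValidatorBoundaryTest {
    @Test
    public void boundariesOfSpecifiedRangeAreAccepted() {
        assertTrue(AgeValidator.validateAge(0));    // lowest allowable input
        assertTrue(AgeValidator.validateAge(120));  // highest allowable input
    }

    @Test
    public void valuesJustOutsideRangeAreRejected() {
        assertFalse(AgeValidator.validateAge(-1));  // just below the range
        assertFalse(AgeValidator.validateAge(121)); // just above the range
    }
}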

6.3.2 ADVANTAGES OF BLACK BOX TESTING

It is more effective on larger units of code than glass box testing.

The tester needs no knowledge of the implementation, including specific programming languages.

The tester and the programmer are independent of one another.


6.3.3 DISADVANTAGES OF BLACK BOX TESTING

Only a small number of possible inputs can actually be tested.

The test cases are hard to design without clear and concise specifications.

6.4 WHITE BOX TESTING

White box testing uses an internal perspective of the system to design test cases based on the internal structure. It is also known as glass box, structural, clear box, and open box testing. It requires programming skills to identify all paths through the software. The tester chooses test case inputs to exercise all paths and determines the appropriate outputs. An analogy from electrical hardware is in-circuit testing (ICT), where every node in a circuit may be probed and measured.

Since the tests are based on the actual implementation, when the implementation changes the tests probably have to change too. For instance, an ICT needs an update if a component value changes, and needs a modified or new fixture if the circuit changes. This adds financial resistance to the change process, so buggy products may stay buggy. Automated Optical Inspection (AOI) offers similar component-level correctness checking without the cost of ICT fixtures; however, changes still require test updates.

While white box testing is applicable at the unit, integration, and system levels of the software testing process, it is typically applied at the unit level. Although it normally tests paths within a unit, it can also test paths between units during integration, and between subsystems during a system-level test. Though this method of test design can uncover an overwhelming number of test cases, it might not detect unimplemented parts of the specification or missing requirements. It does, however, ensure that all paths through the test object are executed.

6.4.1 WHITE BOX TESTING STRATEGY

The white box testing strategy deals with the internal logic and structure of the code. Tests written based on the white box testing strategy incorporate coverage of the code written: branches, paths, statements, and the internal logic of the code. In order to implement white box testing, the tester has to work with the code and hence needs knowledge of coding and logic, i.e., the internal workings of the code. White box testing also requires the tester to look into the code and find out which unit, statement, or chunk of the code is malfunctioning.

6.4.2 ADVANTAGES OF WHITE BOX TESTING

As knowledge of the internal coding structure is a prerequisite, it becomes very easy to find out which type of input or data can help in testing the application effectively.

It helps in optimizing the code.

It helps in removing the extra lines of code, which can bring in hidden defects.

Introspection, the ability to look inside the application, means that testers can identify objects programmatically. This is helpful when the GUI is changing frequently or is yet unknown, as it allows testing to proceed. It can also, in some situations, decrease the fragility of test scripts, provided the name of an object does not change.

6.4.3 DISADVANTAGES OF WHITE BOX TESTING

As knowledge of the code and internal structure is a prerequisite, a skilled tester is needed to carry out this type of testing, which increases the cost.

It is nearly impossible to look into every bit of code to find hidden errors, which may create problems and result in failure of the application.

Time is the single biggest disadvantage of white box testing; testing time can be extremely expensive.


SCREEN SHOTS

(Application screenshots appear on pages 50–63 of the original report; the images are not reproduced in the transcript.)

REFERENCES

[1] R. Agrawal and J. Kiernan. Watermarking relational databases. In VLDB ’02: Proceedings of the 28th International Conference on Very Large Data Bases, pages 155–166. VLDB Endowment, 2002.

[2] P. Bonatti, S. D. C. di Vimercati, and P. Samarati. An algebra for composing access control policies. ACM Transactions on Information and System Security, 5(1):1–35, 2002.

[3] P. Buneman, S. Khanna, and W. C. Tan. Why and where: A characterization of data provenance. In J. Van den Bussche and V. Vianu, editors, Database Theory - ICDT 2001, 8th International Conference, London, UK, January 4–6, 2001, Proceedings, volume 1973 of Lecture Notes in Computer Science, pages 316–330. Springer, 2001.

[4] P. Buneman and W.-C. Tan. Provenance in databases. In SIGMOD ’07: Proceedings of the 2007 ACM SIGMOD International Conference on Management of Data, pages 1171–1173, New York, NY, USA, 2007. ACM.

[5] Y. Cui and J. Widom. Lineage tracing for general data warehouse transformations. The VLDB Journal, pages 471–480, 2001.