Chapter II
LITERATURE SURVEY
This chapter provides a literature review on Artificial Intelligence systems, computer-aided medical diagnosis, image denoising and feature extraction. In the AI systems section, literature reviews on ANNs, fuzzy systems and GAs are provided. In the image denoising section, its evolution and classification are dealt with.
2.1 ARTIFICIAL INTELLIGENCE SYSTEMS
AI is the intelligence of machines and the branch of computer science which aims to
create it. Computational intelligence (CI) was seen as a comprehensive framework to
design and analyze intelligent systems with a focus on all fundamentals of autonomy,
learning, and reasoning (Duch 2007). The idea is to consider computing systems that
are able to learn and deal with new situations using reasoning, generalization,
association, abstraction, and discovery capabilities (Eberhart et al 1996). The
paradigm of CI is shown in Figure 2.1.
Figure 2.1 Paradigm of Computational Intelligence systems
(The figure depicts Computational Intelligence as encompassing neural networks, fuzzy set theory and evolutionary systems, their hybrids (neural fuzzy systems, genetic fuzzy systems, neural evolutionary systems), and the related fields of swarm intelligence, evolving systems and immune systems.)
Growing as a stand-alone field in itself, CI nowadays contains evolving systems (Angelov 2002), swarm intelligence (Kennedy and Eberhart 2001; Dorigo and Stutzle 2004), immune systems (Castro and Timmis 2002), and other forms of natural (viz., biologically inspired) computation. A key issue in CI is adaptation of behavior as a strategy to handle changing environments and deal with unforeseen situations. CI exhibits interesting links with machine intelligence (Mitchell et al 1997), statistical learning (Tibshirani et al 2001), intelligent data analysis and data mining (Berthold and Hand 2006), pattern recognition and classification (Duda et al 2001), control systems (Dorf and Bishop 2004), team learning in robotic soccer (Geetha Ramani 2009) and operations research (Hillier and Lieberman 2005). This research focuses on CI using an ANN, FL and GA hybrid for efficient medical diagnosis, and a detailed literature review on each is provided.
2.1.1 Artificial Neural Networks
Much of the research in ANNs is related to understanding the nonlinear dynamics of specific architectures and to searching for fast, efficient, convergent, stable and robust ways of training and adaptation from noisy sampled data; it is this, rather than developments in fundamental theory, that dominates work in the field. This is unfortunate, as there is no rigorous mathematical foundation for determining the characteristics of a sampled data set and a specific network's ability to generalize from that set (Kung 1993).
There is no general design theory to determine the allocation of neurons to data, or which weights to alter and in what ways, to make an accurate record of the data. The distant goal of 'neural networkers' is to understand how to store, retrieve and process data in neural networks (Judd 1990). In 1988, Hecht-Nielsen listed some of the best-known ANNs in chronological order (entries 1-13). In the same year, Specht announced the PNN, which is placed 14th. The 15th, the Cerebellar Model Arithmetic Computer (CMAC) (Albus 1975; Miller et al 1990), is based on the Cerebellatron (4th in Hecht-Nielsen's list) and has found applications in robotics and other nonlinear industrial control systems. The names of some ANN developers, features, advantages, disadvantages and some possible areas of application (Hecht-Nielsen 1988) are summarized in Table 2.1.
Table 2.1 Milestones of ANN Research

PERCEPTRON, 1958 (Rosenblatt)
Features: The oldest artificial neural network; built in hardware.
Applications: Rarely used today.
Disadvantages: Cannot recognize complex characters (e.g. Chinese); sensitive to differences in scale, translation and distortion.

MADALINE, 1960-62 (Widrow)
Features: Multiple adaptive linear elements; in commercial use for over 20 years.
Applications: Control systems; image processing; antenna systems; pattern recognition; noise cancellation; adaptive nulling of radar jammers; adaptive modems; adaptive equalizers (echo cancellers) in telephone lines.

AVALANCHE, 1967 (Grossberg)
Features: A class of networks; no single network can do all the tasks.
Applications: Continuous speech recognition; teaching motor commands to robotic arms.
Disadvantages: Requires a literal playback of motor sequences; no simple way to alter speed or interpolate movements.

CEREBELLATRON, 1969 (Marr, Albus and Pellionisz)
Features: Similar to the avalanche network; can blend several command sequences with different weights to interpolate motions smoothly.
Applications: Controlling the motor action of robotic arms.
Disadvantages: Requires complicated control input.

MULTI-LAYER PERCEPTRON (BACKPROPAGATION-OF-ERROR), 1974-86 (Werbos, Parker and Rumelhart)
Applications: Many classification applications; speech synthesis from text; adaptive control of robotic arms; scoring bank loan applications; signal processing; control systems.
Advantages: Most popular network; works well generally; simple to learn.
Disadvantages: Supervised training only; abundant correct input/output examples needed; slow to train; may converge to an inferior solution or not at all.

BRAIN STATE IN A BOX, 1977 (Anderson)
Features: Similar to bi-directional associative memory in completing fragmented inputs.
Applications: Extraction of knowledge from databases; psychological experimentation.
Disadvantages: One-shot decision making; no iterative reasoning.

NEOCOGNITRON, 1978-84 (Fukushima)
Features: The most complicated network ever developed; insensitive to differences in scale, rotation and translation; able to identify complex characters (e.g. Chinese).
Applications: Hand-printed character recognition.
Disadvantages: Usually requires a large number of processing elements and connections.

ADAPTIVE RESONANCE THEORY (ART), 1978-86 (Carpenter and Grossberg)
Features: Very sophisticated.
Applications: Pattern recognition, especially of patterns complicated or unfamiliar to humans (e.g. radar, sonar and voiceprints); decision making under risk; neurobiological connections and classical conditioning.
Disadvantages: Sensitive to translation, distortion and changes in scale.

SELF-ORGANISING MAP (SOM), 1972-82 (Kohonen)
Features: Maps one geometric region onto another (e.g. a rectangle onto an aircraft).
Advantages: More effective than many algorithmic techniques for numerical aerodynamic flow calculations.
Disadvantages: Requires extensive training.

HOPFIELD, 1982 (Hopfield)
Features: Can be implemented on a large scale; normally used with binary inputs.
Applications: Retrieval of complete data or images from fragments; olfactory processing; signal processing.
Disadvantages: The weights must be set in advance; the number of patterns that can be stored and accurately recalled is severely limited; an exemplar pattern will be unstable if it shares many bits in common with another exemplar.

BI-DIRECTIONAL ASSOCIATIVE MEMORY, 1985-88 (Kosko)
Features: Associates fragmented pairs of objects with completed pairs.
Applications: Content-addressable associative memory; resource allocation.
Disadvantages: Low storage density; data must be properly coded.

BOLTZMANN / CAUCHY MACHINE, 1985-86 (Hinton and Sejnowski)
Features: Simple network in which noise functions find the global minimum.
Applications: Pattern recognition for images, radar and sonar; graph search and optimization.
Disadvantages: Boltzmann: long training time; Cauchy: generating noise in the proper statistical distribution.

COUNTER PROPAGATION, 1986 (Hecht-Nielsen)
Features: Functions as a self-programming look-up table; similar to backpropagation but less powerful.
Applications: Image compression; statistical analysis; scoring of bank loan applications.
Disadvantages: A large number of processing elements and connections is required for high accuracy for any size of problem.

PROBABILISTIC NEURAL NETWORK (PNN), 1990 (Specht)
Features: Training is much faster than MLP and is done easily in one pass; decision surfaces are guaranteed to approach the Bayes-optimal boundaries as the size of the training sample grows; sparse samples can be adequate for good performance.
Applications: Pattern recognition and classification; mapping; direct estimation of posterior probability density functions.
Disadvantages: All training sample points must be stored and used to classify new patterns, so a large memory is required; classification can be slower than MLP in software realizations.

CEREBELLAR MODEL ARITHMETIC COMPUTER (CMAC), 1971 (Albus)
Features: Training is much faster than MLP; large networks can be used and trained in practical time; practical hardware realization using logic cell arrays.
Applications: Real-time robotics; pattern recognition; signal processing; speech processing.
Disadvantages: Generalization is local rather than global; design care is necessary to assure a low-error solution.

SUPPORT VECTOR MACHINE (SVM), 1995 (Cortes and Vapnik)
Features: Maximizes the margin between classes; suitable for numerical data.
Applications: Classification of web pages; image recognition; shape description.
Disadvantages: For the classic categorization problem the boundary is limited.

NEURO-DYNAMIC PROGRAMMING, 1996-2006 (Bertsekas and Tsitsiklis)
Features: Can deal explicitly with state and control constraints; implemented using standard deterministic optimal control methodology.
Advantages: Applies to both deterministic and stochastic problems; connection with infinite-time reachability.
Amari (1998) indicated that most learning rules of ANNs are formulated as

Δwi = η r x (2.1)

where r is a learning signal function, x is the input vector, w is the weight vector and η is the learning rate. The function r depends on whether the teacher (or target) signal t is available:

r = r(x, w, t) for supervised learning, or
r = r(x, w) for unsupervised learning.

Equation 2.1 corresponds to the Widrow-Hoff rule when r = t - wT x.

For convenience, the learning signal vector (Amari 1998) is defined so that

Δwi = η (learning signal vector) (2.2)
Table 2.2 summarizes typical neural learning rules, with t the target output, y the neural network's output, x the input vector, w the weight vector, J the Jacobian matrix, e the error and α the learning rate (Jang et al 1997). Unsupervised learning is useful for analyzing data without desired outputs; the networks evolve to capture the density characteristics of a data set.

Table 2.2 Typical Neural Network Learning Formulas

Learning algorithm / Learning signal vector / Learning mode
Hebbian (Hebb 1949) / y x / Unsupervised
Perceptron (Rosenblatt 1962) / {t - sgn(wT x)} x / Supervised
Outstar (Grossberg 1973) / t - wi / Supervised
Oja's (reversed least mean square) (Oja 1982) / (x - y w) y / Unsupervised
Winner-take-all (competitive) (Lippmann 1987) / x - w / Unsupervised
Correlation (Grossberg 1988) / t x / Supervised
Least mean square (Widrow 1990) / (t - wT x) x / Supervised
Delta (Widrow 1990) / {(t - y) α} x / Supervised
Recursive Levenberg-Marquardt (Ngia and Sjöberg 2000) / wi+1 = wi + (JiT Ji + αI)^-1 JiT e / Supervised
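The generic update of Equation 2.1 combined with the least-mean-square learning signal from Table 2.2 can be sketched as follows; the linear target and the parameter values are invented for this toy illustration.

```python
import numpy as np

def lms_update(w, x, t, eta=0.1):
    """One step of the least-mean-square (Widrow-Hoff) rule:
    learning signal r = t - w^T x, weight change dw = eta * r * x."""
    r = t - w @ x          # scalar learning signal
    return w + eta * r * x

# Toy example: fit a single linear neuron to the target y = 2*x1 - x2.
rng = np.random.default_rng(0)
w = np.zeros(2)
for _ in range(2000):
    x = rng.uniform(-1, 1, size=2)
    t = 2 * x[0] - x[1]
    w = lms_update(w, x, t)
print(np.round(w, 2))  # prints [ 2. -1.]
```

Because the target is exactly linear and noise-free, the weights converge to the generating coefficients, illustrating why abundant correct input/output examples are needed for supervised rules of this kind.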
In addition to the standard learning methods in Table 2.2, many other learning algorithms are available in the literature, such as the natural gradient method (Amari 1998), the stochastic learning algorithm (Sheta and De Jong 2001) and the terminal attractor-based BP algorithm (Jiang and Yu 2001). The Levenberg-Marquardt learning algorithm is widely used with slight modifications.
Table 2.3 shows the classifiers used in different studies to perform the classification task. Although many classifiers are available in the literature, this study has been restricted to the most commonly used standard classifiers.

Table 2.3 Classifiers used in the Literature

1. Neural networks: Choi et al (1997); Einstein et al (1998); Furundzic et al (1998); Spyridonos et al (2002); Papadopoulos et al (2002); Demir et al (2004); Gunduz et al (2004); Cho and Won (2006); Rodríguez et al (2010)
2. K-nearest neighbour: Schnorrenberg et al (1996); Weyn et al (1999); Ginneken and Mendrik (2006); Huang et al (2009)
3. Logistic regression: Wolberg et al (1995); Einstein et al (1998); Hong and Mitchell (2007)
4. Fuzzy systems: Blekas et al (1998); Nauck et al (1999); Jin (2000); Monzon and Pisarello (2005); Abu-Amara and Abdel-Qader (2009)
5. Linear discriminant analysis: Hamilton et al (1997); Esgiar et al (1998); Smolle (2000); Li and Yuan (2005)
6. Decision trees: Wiltgen et al (2003); Silahtaroğlu (2009)
A classification system requires separate data for the training and testing processes in order to evaluate its success. Since the data available for training are limited, it is important to test the system with additional data, and how to use this limited amount of data for both training and testing is an issue: more data used in training leads to better system designs, whereas more data used in testing leads to a more reliable evaluation of the system (Kaski 1997). Evaluating the system according to the success obtained on the training set brings the risk of memorization of the data and over-optimistic error rates.
To circumvent the memorization problem, the system should be evaluated on a separate data set that is not used in training. One approach is to split the data into two disjoint sets and use these sets to train and test the system. When it is not feasible to use a significant portion of the data as the test set, k-fold cross-validation can be used. This approach randomly partitions the data set into k groups; it then uses k-1 groups to train the system and the remaining group to estimate an error rate. The procedure is repeated k times so that each group is used for testing the system. Leave-one-out is a special case of k-fold cross-validation where k is selected to be the size of the data set, so only a single sample is used to estimate the error rate in each step.

Table 2.4 List of Classifier Evaluation Techniques used in different Studies

1. No separate evaluation set: Thiran and Macq (1996); Anderson et al (1997); Smolle (2000); Cho and Won (2006)
2. Separate training and test sets: Choi et al (1997); Blekas et al (1998); Esgiar et al (1998); Pena-Reyes and Sipper (1999); Wiltgen et al (2003); Gunduz et al (2004); Demir et al (2005); Resul Das et al (2009); Posawang et al (2009); Muthu Rama Krishnan et al (2010)
3. K-fold cross-validation: Wolberg et al (1995); Zhou et al (2002); Wiens et al (2008); Rodríguez (2010)
4. Leave-one-out: Schnorrenberg et al (1996); Einstein et al (1998); Weyn et al (1999); Albregtsen et al (2000); Spyridonos et al (2001); Hong and Mitchell (2007); Kemal Polat and Salih Gunes (2009)
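The k-fold partitioning just described can be sketched as follows; this is a minimal generic illustration, not the procedure of any of the cited studies.

```python
import random

def k_fold_indices(n, k, seed=0):
    """Randomly partition sample indices 0..n-1 into k groups; in each
    round, k-1 groups are used for training and the held-out group for
    testing, so every sample is tested exactly once."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [j for f in folds if f is not folds[i] for j in f]
        yield train, test

# Leave-one-out is the special case k = n: a single sample tests per round.
splits = list(k_fold_indices(10, 5))
print(len(splits))  # prints 5
```

Each of the 5 rounds here holds out 10/5 = 2 samples; calling `k_fold_indices(10, 10)` would yield the leave-one-out splits.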
Since the testing stage should measure how well the system will work on unknown samples in the future, the test set should consist of samples that are independent of those used in training. In the case of k-fold cross-validation, however, random partitioning may result in test sets that do not contain such independent samples, and over-optimistic results may be obtained. An illustrative example of the effect of each approach on system success can be found in Schulerud et al (1998): using the same data, 95% accuracy is achieved when the entire data set is used for both training and testing; 87% testing accuracy is obtained with k-fold cross-validation; but only 60% testing accuracy is obtained when separate training and test sets are used. Table 2.4 lists the classifier evaluation techniques used in different studies; the approach of using separate training and test data sets is becoming a standard method for classifier evaluation.
2.1.2 Fuzzy Systems
To devise a concise theory of logic, and later mathematics, Aristotle postulated the so-called 'Laws of Thought'. One of these, the 'Law of the Excluded Middle', states that every proposition must either be True (T) or False (F). Even when Parmenides proposed the first version of this law (around 400 B.C.) there were strong and immediate objections: for example, Heraclitus proposed that things could be simultaneously True and not True. It was Plato who laid the foundation for what would become Fuzzy Logic (FL), indicating that there was a third region (beyond T and F) where these opposites 'tumbled about'. The notion of an infinite-valued logic
![Page 11: Chapter II LITERATURE SURVEY - Shodhgangashodhganga.inflibnet.ac.in/bitstream/10603/5550/11/12_chapter 2.pdf · focuses on CI using ANNs, FL and GA hybrid for efficient medical diagnosis](https://reader034.vdocument.in/reader034/viewer/2022042414/5f2e99f96a62cb3c0b0131d7/html5/thumbnails/11.jpg)
17
was introduced in Zadeh’s (1965) seminal work ‘Fuzzy Sets’ where he described the
mathematics of fuzzy set theory, and by extension FL. This theory proposed making
the membership function (or the values F and T) operate over the range of real
numbers [0, 1]. The door to the development of fuzzy computers was opened in 1985 by the design of the first fuzzy logic chip by Masaki Togai and Hiroyuki Watanabe (1986) at Bell Telephone Laboratories. In the years to come, fuzzy computers will employ both fuzzy hardware and extended fuzzy software, and they will be much closer in structure to the human brain than present-day computers (Zadeh 2009). The principle of incompatibility, lucidly formulated by Zadeh (1965), states that:
‘As the complexity of a system increases, our ability to make precise and yet
significant statements about its behavior diminishes until a threshold is reached
beyond which precision and significance (or relevance) become almost mutually
exclusive characteristics’
In a fuzzy inference system (FIS), the knowledge base comprises a fuzzy rule base and a database. FISs are universal approximators capable of performing nonlinear mappings between inputs and outputs. The Mamdani (Mamdani 1974) and TSK (Takagi and Sugeno 1985) models are two popular FISs. The Mamdani model is a non-additive fuzzy model that aggregates the outputs of fuzzy rules using the maximum operator, while the TSK model is an additive fuzzy model that aggregates the outputs of rules using the addition operator. Kosko's standard additive model (Kosko 1997) is another additive fuzzy model. All these models can be derived from the fuzzy graph (Yen 1999) and are universal approximators (Kosko 1997). There are two types of fuzzy rules, namely fuzzy mapping rules and fuzzy implication rules (Yen 1999).
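As a minimal illustration of additive (TSK-style) aggregation, the sketch below evaluates a hypothetical zeroth-order TSK system with Gaussian membership functions; the two rules and their parameters are invented for the example.

```python
import numpy as np

def tsk_inference(x, rules):
    """Zeroth-order TSK inference for one input: each rule is
    (MF centre, MF width, crisp consequent). The output is the
    firing-strength-weighted average of the rule consequents,
    i.e. the rule outputs are combined additively."""
    w = np.array([np.exp(-((x - c) ** 2) / (2 * s ** 2)) for c, s, _ in rules])
    y = np.array([b for _, _, b in rules])
    return float(w @ y / w.sum())

# Hypothetical rules: "x near 0 -> output 1" and "x near 2 -> output 3".
rules = [(0.0, 1.0, 1.0), (2.0, 1.0, 3.0)]
print(tsk_inference(1.0, rules))  # equal firing strengths, prints 2.0
```

At x = 1 both rules fire equally, so the additive combination returns the midpoint of the two consequents; a Mamdani system would instead aggregate output fuzzy sets with the maximum operator before defuzzification.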
Complex fuzzy sets and logic are mathematical extensions of fuzzy sets and logic from the real domain to the complex domain (Ramot et al 2003). Japanese industry has acquired more than 2000 patents in applications of the fuzzy technique, spanning a wide spectrum from consumer products and electronic instruments to automobile and traffic monitoring systems. In Wang and Lu (2003), a fuzzy system with nth-order B-spline MFs and a CMAC network (Albus 1971) with nth-order B-spline basis functions are proven to be universal approximators for a smooth function and its derivatives up to the (n-2)th order. Fuzzy systems are widely used in medicine as expert systems for providing disease diagnosis (Friedrich Steimann 2001; Abu-Amara and Abdel-Qader 2009).
2.1.3 Neuro-Fuzzy System (NFS)
A NFS (Jang 1993) is based on a fuzzy system which is trained by a learning
algorithm derived from neural network theory. The (heuristical) learning procedure
operates on local information, and causes only local modifications in the underlying
fuzzy system. ANN learning provides a good way to adjust the expert’s knowledge
and automatically generate additional fuzzy rules and Membership Function (MFs), to
meet certain specifications and reduce design time and costs.The strength of NFS
involves two contradictory requirements in fuzzy modeling: interpretability versus
accuracy. In practice, one of the two properties prevails. Two universal approximation
theorems (A.1 and A.2), one based on linear basis function and the other based on
radial basis function are given in Appendix A and its proof can be referred from Kung
(1993). In ANN context, the theorems suggest that a feed forward network with a
single hidden layer with nonlinear units can approximate any arbitrary function but do
not suggest any method of determining the parameters, such as number of hidden
units and weights to achieve the given accuracy.
ANFIS is a well-known neuro-fuzzy model (Jang et al 1997) and is a graphical representation of the TSK model. The sigmoid-ANFIS (Zhang et al 2004) is a special form of ANFIS in which only sigmoidal MFs are employed. ANFIS unfolded-in-time (Sisman-Yilmaz et al 2004) duplicates the ANFIS T times to integrate temporal information, where T is the number of time intervals needed in the specific problem. Neuro-fuzzy systems are usually trained by the gradient-descent method (Jang et al 1997). Similar to the ANFIS architecture, the self-organizing fuzzy neural network (Leng et al 2004) has a five-layer fuzzy neural network architecture and is an on-line implementation of TSK (Takagi and Sugeno 1985). The merits of both neural and fuzzy systems can be integrated in a neuro-fuzzy approach (Pal et al 2000).
The Dynamic Evolving Neuro-Fuzzy Inference System (Kasabov and Song 2002) is used to perform prediction, with new fuzzy rules created and updated during the operation of the system. An online sequential fuzzy extreme learning machine (OS-Fuzzy-ELM) has been developed by Rong et al (2000) for function approximation and classification problems. The optimized Takagi-Sugeno-type (TS) neuro-fuzzy model proposed in Siekmann et al (1997) for the stock exchange consisted of 31 rules and 22 membership functions.
Fuzzy sets are considered advantageous in the logical field and in handling higher-order processing easily, whereas the higher flexibility produced by learning is a characteristic feature of neural nets and hence suits data-driven processing better (Kosko 1997). The equivalence between fuzzy rule-based systems and neural networks is studied in Zhang et al (2004). Jang et al (1997) have shown that fuzzy systems are functionally equivalent to a class of Radial Basis Function (RBF) networks, based on the similarity between the local receptive fields of the network and the membership functions of the fuzzy system. CANFIS extends ANFIS by using nonlinearity in the TSK rules and can handle multiple inputs (Jang et al 1997).
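The functional equivalence noted by Jang et al (1997) can be checked numerically: a single-input zeroth-order Sugeno system with Gaussian MFs and weighted-average defuzzification computes exactly the same mapping as a normalized RBF network sharing its centres and widths. The centres, width and consequents below are arbitrary illustrative values.

```python
import numpy as np

centers = np.array([-1.0, 0.0, 1.0])
sigma = 0.7
consequents = np.array([0.5, 2.0, -1.0])

def fuzzy_sugeno(x):
    """Zeroth-order Sugeno system: Gaussian MF firing strengths,
    output is the weighted average of the rule consequents."""
    w = np.exp(-((x - centers) ** 2) / (2 * sigma ** 2))
    return w @ consequents / w.sum()

def normalized_rbf(x):
    """Normalized RBF network with the same Gaussian receptive fields."""
    g = np.exp(-((x - centers) ** 2) / (2 * sigma ** 2))
    return (g / g.sum()) @ consequents

print(np.isclose(fuzzy_sugeno(0.3), normalized_rbf(0.3)))  # prints True
```

The two functions are algebraically identical term by term, which is precisely the equivalence condition: Gaussian MFs on the fuzzy side playing the role of the RBF receptive fields.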
2.1.4 GA for Optimization
Genetic algorithms are a class of general-purpose stochastic optimization algorithms
under the universally accepted neo-Darwinian paradigm which is a combination of
classical Darwinian evolutionary theory, the selection of Weismann and the genetics
of Mendel. Generally, the performance of GA is measured by the speed of the search
on one hand and the reliability of the search on the other. Reliability denotes the
chance of getting good results even if the problem is very complex (Back 1995).
There is always a tradeoff between the two factors and the success of GA for a
particular problem optimization depends on the choice of the right set of GA
parameters. GAs have been theoretically and empirically proven to provide robust
search in complex space and have found wide applicability in scientific and
engineering areas including function optimization, machine learning, scheduling, and
others (Buckles and Petry 1992).
Goldberg and Grefenstette (2005) chose individuals for birth according to their objective function values. Variants of unbiased tournament selection were analyzed by Sokolov and Whitley (2005). In uniform crossover, each gene in the offspring is created by copying the corresponding gene from one or the other parent according to a randomly generated crossover mask (Syswerda 1989). There are also crossovers such as cycle crossover, partially mapped crossover, segmented crossover and shuffle crossover, as mentioned in Potts et al (1994). For two-dimensional applications like image processing, conventional mutation and reproduction operators can be applied in the normal way, but an unbiased crossover such as uniform block crossover has to be used. The uniform block crossover is a two-dimensional wraparound crossover and can sample all the matrix positions equally (Cartwright and Harris 1993). The convergence rates of two-dimensional GAs are higher than that of the simple GA for bitmaps (Cartwright and Harris 1993).
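The mask-based uniform crossover described above can be sketched as follows; this is a generic one-dimensional illustration, not the two-dimensional block variant.

```python
import random

def uniform_crossover(p1, p2, seed=None):
    """Create two offspring by copying each gene from one parent or the
    other according to a randomly generated boolean crossover mask."""
    rng = random.Random(seed)
    mask = [rng.random() < 0.5 for _ in p1]
    c1 = [a if m else b for m, a, b in zip(mask, p1, p2)]
    c2 = [b if m else a for m, a, b in zip(mask, p1, p2)]
    return c1, c2

# With all-zero and all-one parents the offspring are complementary.
c1, c2 = uniform_crossover([0] * 8, [1] * 8, seed=42)
print(c1, c2)
```

Because each mask bit is drawn independently, every gene position is exchanged with equal probability, which is the "unbiased" property the two-dimensional block crossover extends to matrix positions.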
The parameters of GA that play a vital role in determining the exploitation and
exploration characteristics of the genetic algorithm are the population size, number of
generations, termination condition, elitism strategy, reproduction, crossover and
mutation percentages (DeJong and Spears 1990). The convergence analysis of a
simple GA is based on the concept of schema (Holland 1973).
In the practice of designing efficient GAs, there is strong empirical evidence that the population size is one of the most important and most critical parameters for the performance of genetic algorithms (Lobo and Lima 2005). This parameter is hard to estimate: if it is too small, the GA converges to poor solutions; if it is too large, the GA spends unnecessary computational resources. Determining an appropriate population size is therefore an important task in the design of genetic algorithms and is closely related to the principle of implicit parallelism. The methods of handling the population-size parameter in various GAs can be classified as static, when the size of the population remains unchanged throughout the GA run, and dynamic, when the population size is adjusted on the fly during the GA execution (Arabas et al 1994; Hinterding et al 1996; Back et al 1995).
De Jong and Spears (1990) suggest the following parameters for GA: population size = 50, crossover rate = 0.6, mutation rate = 0.001, crossover type = typically two-point, mutation type = bit flip and number of generations = 1000. Schaffer et al (1989) suggested the following parameter setting after extensive research on these parameters: population size = 20-30, crossover rate = 0.75-0.95, mutation rate = 0.005-0.01. The genetic diversity of the population can be improved, so as to prevent premature convergence, by adapting the size of the population (Goldberg et al 2005; Michalewicz 1996).
Initialization
  1. Generate the initial population randomly.
Fitness evaluation
  2. Evaluate the fitness of each individual.
Group and breeding
  3. Sort the individuals according to their fitness values.
  4. Arrange the population into groups based on their fitness.
  5. For each group:
     a. Select individuals from the group.
     b. Apply crossover/mutation operators.
     c. Evaluate the fitness of the offspring.
     d. Add the offspring to the same group.
Migration
  6. Combine all the groups into a single population.
  7. Sort the population based on fitness values and trim it to the size of the groups.
Iteration
  8. Repeat the process from Step 4 for the required number of generations.
  9. Select the best (highest-fitness) individual.

Figure 2.2 Steps of GA (Melanie 1998)
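The GA loop of Figure 2.2 can be sketched in code. The version below is a simplified single-population variant (without the grouping/migration refinement), using tournament selection, one-point crossover, bit-flip mutation and elitism; the one-max fitness function is a hypothetical stand-in for a real objective.

```python
import random

def genetic_algorithm(fitness, n_bits=10, pop_size=30, generations=50,
                      cx_rate=0.75, mut_rate=0.01, seed=1):
    """Minimal GA: generate, evaluate, select, breed, iterate."""
    rng = random.Random(seed)
    pop = [[rng.randint(0, 1) for _ in range(n_bits)] for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        nxt = [pop[0][:]]                       # elitism: keep current best
        while len(nxt) < pop_size:
            # tournament selection of two parents (copies)
            a, b = (max(rng.sample(pop, 3), key=fitness)[:] for _ in range(2))
            if rng.random() < cx_rate:          # one-point crossover
                cut = rng.randrange(1, n_bits)
                a, b = a[:cut] + b[cut:], b[:cut] + a[cut:]
            for child in (a, b):
                for i in range(n_bits):
                    if rng.random() < mut_rate:  # bit-flip mutation
                        child[i] ^= 1
                nxt.append(child)
        pop = nxt[:pop_size]                    # trim to population size
    return max(pop, key=fitness)

# One-max problem: fitness is the number of ones; the optimum is all ones.
best = genetic_algorithm(fitness=sum)
print(sum(best))  # typically reaches the optimum of 10
```

The crossover and mutation rates follow the ranges suggested by Schaffer et al (1989) quoted above; the remaining constants are illustrative defaults.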
To evolve an optimal solution, the steps to be followed are given in the algorithm of
Figure 2.2, adapted from Melanie (1998). Genetic programming (Koza 1992) is a
variant of GA for symbolic regression and can be used for the discovery of empirical
laws. Summarizing, evolutionary algorithms (Goldberg et al 2005; Koza 1992) are:
• easy to use, modular, and supportive of multi-objective optimization
• inherently parallel and easily distributed
• easy to exploit for previous or alternate solutions
• flexible in forming building blocks for hybrid applications
• good in noisy environments, yielding a solution that improves with time
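The loop of Figure 2.2 can be sketched in a few lines. The following is an illustrative one-max example, not an implementation from the thesis; the fitness function, bit length and generation count are arbitrary choices, while the default crossover and mutation rates follow the Schaffer et al (1989) ranges quoted above.

```python
import random

def genetic_algorithm(fitness, n_bits=16, pop_size=30, cx_rate=0.9,
                      mut_rate=0.01, generations=100, seed=0):
    """Minimal GA following the initialise/evaluate/breed/iterate loop of
    Figure 2.2; default rates follow Schaffer et al (1989)."""
    rng = random.Random(seed)
    # Step 1: generate the initial population randomly
    pop = [[rng.randint(0, 1) for _ in range(n_bits)] for _ in range(pop_size)]
    for _ in range(generations):
        # Steps 2 and 4: evaluate and sort individuals by fitness
        pop.sort(key=fitness, reverse=True)
        parents = pop[:pop_size // 2]            # rank-based selection
        offspring = []
        while len(offspring) < pop_size - len(parents):
            a, b = rng.sample(parents, 2)
            child = a[:]
            if rng.random() < cx_rate:           # two-point crossover
                i, j = sorted(rng.sample(range(n_bits), 2))
                child[i:j] = b[i:j]
            for k in range(n_bits):              # bit-flip mutation
                if rng.random() < mut_rate:
                    child[k] = 1 - child[k]
            offspring.append(child)
        pop = parents + offspring                # retained parents act as elitism
    return max(pop, key=fitness)                 # Step 10: best individual

# toy fitness: number of ones (the "one-max" problem)
best = genetic_algorithm(fitness=sum)
```

Here elitism (retaining the sorted top half) stands in for the migration and trimming steps, and the grouping of Steps 5-6 is collapsed into a single population for brevity.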
Table 2.5 Applications of GA (Goldberg et al 2005)

Domains: Application Types
Control: Gas pipeline, Pole balancing, Missile evasion, Pursuit
Design: Semiconductor layout, Aircraft design, Keyboard configuration, Communication networks
Scheduling: Manufacturing, Facility scheduling, Resource allocation
Robotics: Trajectory planning
Machine learning: Designing neural networks, Classification algorithms
Signal processing: Filter design
Game playing: Poker, Checkers, Prisoner's dilemma
Combinatorial optimization: Set covering, Traveling salesman, Routing, Bin packing, Graph colouring, Partitioning
Table 2.5 lists the fields where GAs have been successfully applied and utilized to
solve multiobjective, multimodal and constraint-satisfaction optimization problems
(Du and Swamy 2008).
2.2 COMPUTER AIDED MEDICAL DIAGNOSIS
Today, cancer constitutes a major health problem. Approximately one out of every
two men and one out of every three women develop cancer at some point during their
lifetime (Cigdem Demir and Bulent Yener 2009). Furthermore, the risk of developing
cancer has been increasing due to changes in lifestyle in this century, such as
increased tobacco use, deteriorating dietary habits, and lack of physical activity.
Fortunately, recent advances in medicine have significantly increased the possibility
of curing cancer. However, the chance of a cure relies primarily on early diagnosis,
and the selection of treatment depends on the malignancy level (Abbass 2002).
It is therefore critical to detect cancer, distinguish cancerous structures from benign
and healthy ones, and identify the malignancy level.
Breast cancer is considered one of the most common and fatal cancers among
women in the USA (http://www.cancer.gov/cancertopics/types/breast, 2008).
According to the National Cancer Institute, 40,480 women died due to this disease,
and on average one woman is diagnosed with this cancer every three minutes. Pisani
et al (1993) estimated worldwide mortality from eighteen major cancers, including
breast cancer. Li et al (1995) developed a method for detecting tumors using a
segmentation process, adaptive thresholding and modified Markov random fields,
followed by a classification step based on a fuzzy binary decision tree. Li (1995)
used Markov random fields for tumor detection in digital mammography. Smart et al
(1995) analyzed the benefits of mammographic screening and showed that it has an
overall accuracy rate of 90%. Tsujii et al (1999) proposed classification of
microcalcifications in mammograms using radial basis function (RBF) networks.
Peña-Reyes and Sipper (1999) applied a combined fuzzy-genetic approach with new
methods as a CAD system. Kim et al (1999) proposed statistical textural features for
the detection of microcalcifications in digitized mammograms. These systems are
regarded as a second reader, and the final
decision is left to the radiologist. CAD algorithms have improved the radiologist's
overall accuracy in detecting cancerous tissues (Giger et al 2001).
Sickles (2000) proposed mammographic follow-up of lesions, and Rudy Setiono
(2000) proposed concise and accurate classification rules for breast cancer diagnosis.
Sheybani (2001) took up the challenge of creating a tele-radiology system consisting
of a fiber-optic network driven by a set of asynchronous transfer mode (ATM)
switches in association with CAD algorithms. This research explored a new
technology, the ATM tele-radiology network, with a high-speed fiber backbone
architecture offering real-time, online, more accurate screening, detection and
diagnosis of breast cancer. The ATM tele-radiology network has thus been an
important tool in the development of tele-mammography (Sheybani 2001). Zhen and
Chan (2001) combined AI methods and the discrete wavelet transform (WT) to build
an algorithm for mass detection. Lisboa (2002) reviewed the evidence of health
benefits from ANNs. Early detection of breast cancer via mammography improves
treatment chances and survival rates (Lee 2002). Unfortunately, mammography is
not perfect: False Positive (FP) rates are 15-30% due to the overlap in the appearance
of malignant and benign abnormalities, while False Negative (FN) rates are 10-30%.
Bocchi et al (2004) developed an algorithm for microcalcification detection and
classification in which existing tumors are detected using a region-growing method
combined with an ANN-based classifier; microcalcification clusters are then detected
and classified using a second fractal model. Hassanien and Ali (2004) proposed an
enhanced rough set technique for feature reduction and classification. Swiniarski and
Lim (2006) integrated independent component analysis (ICA) with a rough set model
for breast cancer detection: features are first reduced and extracted using ICA, the
extracted features are then selected using a rough set model, and finally a rough
set-based method is used for rule-based classifier design. Bommanna Raja (2008)
proposed a hybrid fuzzy neural system for CAD of kidney images. Park et al (2009)
proposed a method for improving the performance of a CAD scheme by combining
results from machine learning classifiers. The first preprocessing step in computer
aided diagnosis is removing noise from the image, thereby enhancing image quality.
2.3 IMAGE DENOISING
In image denoising, a compromise has to be achieved between noise reduction and the
preservation of significant edges, corners and other image details (Civicioglu et al
2004). Window-based filtering algorithms (Yüksel 2006) such as median-based filters
are well known to suppress noise, but a major drawback is that they often remove
important details and blur the image when large window sizes are used, while noise
suppression is insufficient for small window sizes. Another pitfall is that
median-based filters use only local information and do not consider the long-range
correlation within natural images. To overcome these drawbacks, various generations
of the median filter have been proposed, such as switching median filters (Zhang and
Karim 2002), center weighted median filters (Chen and Wu 2001), rank-ordered
median filters, iterative median filters, and noise detection-based median filters with
thresholding operations (Fried et al 2006). Histogram-based fuzzy filters were
proposed by Wang et al (2002), and neuro-fuzzy filters for impulse noise removal
were proposed by Yuksel and Bestok (2004). Besdok et al (2005) used ANFIS to
remove impulse noise. Most natural images have additive random noise, which is
modeled as Gaussian. Speckle noise (Guo et al 1994) is observed in ultrasound (US)
images, and Rician noise (Robert Nowak 1999) affects MRI images.
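The three noise models mentioned above can be simulated directly. The sketch below, with an arbitrary flat test image and arbitrary noise levels, illustrates how additive Gaussian, multiplicative speckle and magnitude Rician corruption are typically generated:

```python
import numpy as np

rng = np.random.default_rng(0)
img = np.full((64, 64), 100.0)          # constant test image

# Additive Gaussian noise: y = x + n, with n ~ N(0, sigma^2)
gaussian = img + rng.normal(0.0, 10.0, img.shape)

# Multiplicative speckle noise (a common ultrasound model): y = x * (1 + n)
speckle = img * (1.0 + rng.normal(0.0, 0.1, img.shape))

# Rician noise (magnitude of a complex MRI signal with Gaussian noise
# in both channels): y = sqrt((x + n1)^2 + n2^2)
n1 = rng.normal(0.0, 10.0, img.shape)
n2 = rng.normal(0.0, 10.0, img.shape)
rician = np.sqrt((img + n1) ** 2 + n2 ** 2)
```

Note that the Rician model is not zero-mean: taking the magnitude biases the noisy image upward, which is one reason MRI denoising is treated separately from the Gaussian case.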
2.3.1 Evolution of Image Denoising
Wavelets give superior performance in image denoising due to properties such as
sparsity and multiresolution structure. Donoho's (1995) methods did not require
tracking or correlation of the wavelet maxima and minima across different scales.
Researchers have described different ways to compute the parameters for thresholding
wavelet coefficients, and these thresholding techniques were applied to non-orthogonal
wavelet coefficients to reduce artifacts. Data-adaptive thresholds (Imola K Fodor and
Chandrika Kamath 2003) were introduced to achieve the optimum threshold value.
Hidden Markov models (HMM) and Gaussian scale mixtures (GSM) have also
become popular, and more research work continues to be reported. Research in
higher-dimensional wavelet transforms has given rise to ridgelets, shearlets, curvelets,
contourlets, etc. (Latha Parthiban and Subramanian 2006).
2.3.2 Classification of Denoising Algorithms
As shown in Figure 2.3, there are two basic approaches to image denoising: spatial
filtering methods and transform domain filtering methods. Transform domain
filtering methods have less computational complexity than spatial filtering methods,
but with a small trade-off in quality.
Figure 2.3 Classification of Image Denoising Techniques

IMAGE DENOISING METHODS
  Spatial Domain
    Linear: Mean, Wiener
    Non-linear: Median, Weighted Median
  Transform Domain
    Data-Adaptive Transform: ICA
    Non-Data-Adaptive Transform
      Spatial-Frequency Domain
      Wavelet Domain
        Linear Filtering: Wiener
        Non-linear Threshold Filtering
          Non-adaptive: VISUShrink
          Adaptive: SUREShrink, BayesShrink, Cross Validation
        Wavelet Coefficient Model
          Deterministic: Tree Approximation
          Statistical: Marginal (GMM, GGD); Joint (RMF, HMM)
        Non-Orthogonal Wavelet Transform: UDWT, SIWPD, Multiwavelets
        Contourlet
2.3.2.1 Spatial Filtering
A traditional way to remove noise from image data is to employ spatial filters, which
can be further classified into linear and non-linear filters. Linear filters include the
mean and Wiener filters, while non-linear filters include the median and weighted
median filters.
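A minimal sketch of the non-linear branch follows, assuming a small grayscale array. The `median_filter` function below is a hypothetical helper, not a thesis implementation; it slides a window over the interior pixels and illustrates why the median removes impulse noise:

```python
import numpy as np

def median_filter(image, size=3):
    """Plain sliding-window median filter; edge pixels are left unchanged.
    Illustrative only - libraries such as scipy.ndimage provide optimised
    versions."""
    out = image.copy()
    r = size // 2
    for i in range(r, image.shape[0] - r):
        for j in range(r, image.shape[1] - r):
            # median of the size x size neighbourhood around (i, j)
            out[i, j] = np.median(image[i - r:i + r + 1, j - r:j + r + 1])
    return out

# salt-and-pepper corruption of a flat image
img = np.full((16, 16), 50.0)
img[4, 4], img[8, 8] = 255.0, 0.0      # isolated impulse pixels
clean = median_filter(img)              # impulses replaced by the local median
```

A single 3x3 pass removes isolated impulses; as noted above, larger windows suppress more noise at the cost of removing detail.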
2.3.2.2 Transform Domain Filtering
The transform domain filtering methods can be subdivided according to the choice of
basis functions, which can be classified as data-adaptive or non-adaptive.
Non-adaptive transforms are discussed first since they are more popular.

a Spatial-Frequency Filtering

Spatial-frequency filtering refers to the use of low-pass filters implemented via the
fast Fourier transform. In frequency smoothing methods (Jain 1989), noise removal is
achieved by designing a frequency domain filter and adapting a cut-off frequency
where the noise components are decorrelated from the useful signal in the frequency
domain. These methods are time consuming and depend on filter function behavior.
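The cut-off-based frequency smoothing described above can be sketched with an ideal low-pass filter. `fft_lowpass` and its `cutoff` parameter are illustrative names only, and a real design would use a smoother filter function to reduce ringing:

```python
import numpy as np

def fft_lowpass(image, cutoff=0.1):
    """Ideal low-pass filter in the Fourier domain: zero every frequency
    whose normalised radius exceeds `cutoff` (illustrative sketch)."""
    F = np.fft.fftshift(np.fft.fft2(image))      # DC moved to the centre
    h, w = image.shape
    y, x = np.ogrid[:h, :w]
    radius = np.sqrt((y - h / 2) ** 2 + (x - w / 2) ** 2)
    F[radius > cutoff * min(h, w) / 2] = 0       # discard high frequencies
    return np.real(np.fft.ifft2(np.fft.ifftshift(F)))

rng = np.random.default_rng(1)
noisy = np.full((64, 64), 10.0) + rng.normal(0, 2.0, (64, 64))
smooth = fft_lowpass(noisy, cutoff=0.2)          # most noise energy removed
```

Because the DC component is kept, the image mean is preserved while the broadband noise variance drops sharply; the hard cut-off, however, is exactly what introduces the ringing that makes these methods dependent on filter function behavior.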
b Wavelet domain Filtering
Several wavelet-based methods, categorized as 'denoising from singularity detection',
have been reported in the literature (Hsung et al 1999). An area that has attracted
much attention is adaptive wavelet-based denoising (Mihcak et al 1999; Li and
Orchard 2000). Such methods often assume a general type of statistical model for the
image wavelet coefficients: for each wavelet coefficient, the parameters of the
statistical model are estimated and used to estimate the clean wavelet coefficient
value. Filtering operations in the wavelet domain can be either linear or nonlinear.
b.1 Linear Filters
Linear filters such as the Wiener filter in the wavelet domain yield optimal results
when the signal corruption can be modeled as a Gaussian process and the accuracy
criterion is the Mean Square Error (MSE). However, designing a filter based on this
assumption frequently results in a filtered image that is more visually displeasing than
the original noisy signal, even though the filtering operation successfully reduces the
MSE (Choi and Baraniuk 1998). Zhang et al (2000) proposed wavelet-domain
spatially adaptive finite impulse response Wiener filtering for image denoising, in
which Wiener filtering is performed only within each scale and interscale filtering is
not allowed; it has become the standard linear filter in this domain.
b.2 Non-Linear Threshold Filtering
The most investigated approach to denoising with the WT is non-linear coefficient
thresholding. The procedure exploits the sparsity property of the WT and the fact that
the WT maps white noise in the signal domain to white noise in the transform domain:
while signal energy becomes concentrated in a few transform coefficients, noise
energy does not.

The procedure in which small coefficients are removed while others are left
untouched is called hard thresholding (Donoho 1995). However, this method
generates spurious blips, better known as artifacts, in the images as a result of
unsuccessful attempts to remove moderately large noise coefficients. To overcome
the demerits of hard thresholding, soft thresholding was also introduced by Donoho
(1995); in this scheme, coefficients above the threshold are shrunk by the absolute
value of the threshold itself. Related techniques are semi-soft thresholding and
Garrote thresholding (Imola K Fodor and Chandrika Kamath 2003). Most of the
wavelet shrinkage literature is based on methods for choosing the optimal threshold,
which can be adaptive or non-adaptive to the image.
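The two rules above reduce to one line each. The coefficient array and threshold below are arbitrary illustrative values:

```python
import numpy as np

def hard_threshold(w, t):
    """Keep coefficients whose magnitude exceeds t, zero the rest
    (Donoho 1995)."""
    return np.where(np.abs(w) > t, w, 0.0)

def soft_threshold(w, t):
    """Zero small coefficients and shrink the survivors toward zero by t,
    avoiding the discontinuity that causes hard-thresholding artifacts."""
    return np.sign(w) * np.maximum(np.abs(w) - t, 0.0)

w = np.array([-4.0, -0.5, 0.2, 3.0])
hard = hard_threshold(w, 1.0)   # large coefficients kept unchanged
soft = soft_threshold(w, 1.0)   # large coefficients shrunk by the threshold
```

Applied to the detail coefficients of a wavelet decomposition, hard thresholding preserves the amplitude of retained coefficients (risking blips), while soft thresholding trades a small bias for smoother output.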
b.3 Wavelet Coefficient Model
This approach exploits the multiresolution properties of the WT, identifying close
correlations of the signal at different resolutions by observing it across multiple
scales. It produces excellent results but is computationally much more complex and
expensive. The modeling of the wavelet coefficients can be either deterministic or
statistical.
i. Deterministic

The deterministic modeling approach creates a tree structure of wavelet coefficients,
with every level of the tree representing a scale of the transformation and the nodes
representing the wavelet coefficients; this approach is adopted in Baraniuk (1999).
The optimal tree approximation displays a hierarchical interpretation of the wavelet
decomposition. Singularities produce large wavelet coefficients that persist along the
branches of the tree: if a coefficient has a strong presence at a particular node, then,
in the case of signal, its presence should be more pronounced at its parent nodes,
whereas for a noisy coefficient, for instance a spurious blip, such consistent presence
will be missing. Lu et al (1992) tracked wavelet local maxima in scale space using a
tree structure. Another denoising method based on wavelet coefficient trees was
proposed by Donoho (1995).
ii. Statistical Modeling of Wavelet Coefficients
This approach focuses on appealing properties of the WT, such as the multiscale
correlation between wavelet coefficients and the local correlation between
neighboring coefficients, and aims at accurately modeling image data in the wavelet
domain. A good review of the statistical properties of wavelet coefficients can be
found in Buccigrossi and Simoncelli (1999) and Romberg et al (2001). While the
objective of denoising is to remove noise from a signal, it is also very important
that the edges in the image are not blurred by the denoising operation. The Lipschitz
regularity theory is widely used to detect edge and non-edge wavelet coefficients,
based on the dyadic discrete WT (Mallat 1999). The following two techniques exploit
the statistical properties of the wavelet coefficients based on a probabilistic model.
Marginal Probabilistic Model
A number of researchers have developed homogeneous local probability models for
images in the wavelet domain. Specifically, the marginal distributions of wavelet
coefficients are highly kurtotic, and usually have a marked peak at zero and heavy
tails. The Gaussian mixture model (GMM) (Chipman et al 1997) and the generalized
Gaussian distribution (GGD) (Liu and Moulin 1999) are commonly used to model the
wavelet coefficient distribution. Although the GGD is more accurate, the GMM is
simpler to use. Mihcak et al (1999) proposed a methodology in which the wavelet
coefficients are assumed to be conditionally independent zero-mean Gaussian random
variables, with variances modeled as identically distributed, highly correlated random
variables. An approximate maximum a posteriori probability rule is used to estimate
the marginal prior distribution of the wavelet coefficient variances. All the methods
mentioned above require a noise estimate, which may be difficult to obtain in
practical applications. Simoncelli and Adelson (1996) used a two-parameter
generalized Laplacian distribution for the wavelet coefficients of the image, estimated
from the noisy observations. Chang et al (2000) proposed adaptive wavelet
thresholding for image denoising, modeling the wavelet coefficients as generalized
Gaussian random variables whose parameters are estimated locally (i.e., within a
given neighborhood).
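As one concrete example of such locally estimated parameters, a BayesShrink-style threshold (Chang et al 2000) can be sketched as follows. The robust noise estimate median(|d|)/0.6745 and the toy sparse detail band are standard illustrative choices, not the thesis method:

```python
import numpy as np

def bayes_shrink_threshold(detail):
    """BayesShrink-style adaptive threshold T = sigma_n^2 / sigma_x
    (Chang et al 2000).  The noise level sigma_n is estimated robustly
    from the detail coefficients via the median absolute deviation."""
    sigma_n = np.median(np.abs(detail)) / 0.6745       # noise std estimate
    sigma_y2 = np.mean(detail ** 2)                    # noisy signal power
    # signal std, clamped so the division below stays well defined
    sigma_x = np.sqrt(max(sigma_y2 - sigma_n ** 2, 1e-12))
    return sigma_n ** 2 / sigma_x

# toy "detail band": a sparse signal plus unit-variance Gaussian noise
rng = np.random.default_rng(2)
signal = np.zeros(1024)
signal[::16] = 8.0
noisy = signal + rng.normal(0.0, 1.0, 1024)
t = bayes_shrink_threshold(noisy)       # threshold adapted to this band
```

The resulting threshold can then be fed into a soft-thresholding rule; because it scales with the estimated noise-to-signal ratio, busy subbands are thresholded lightly and noise-dominated subbands heavily.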
Joint Probabilistic Model
Hidden Markov models (HMM) (Romberg et al 2001) are efficient in capturing
inter-scale dependencies, whereas random Markov field (RMF) models are more
efficient at capturing intrascale correlations (Malfait and Roose 1997). The
correlation between coefficients at the same scale but residing in a close
neighborhood is modeled by a hidden Markov chain model, whereas the correlation
between coefficients across scales is modeled by hidden Markov trees (HMT). Once
the correlation is captured by the HMM, expectation maximization is used to estimate
the required parameters, and from those the denoised signal is estimated from the
noisy observation using the well-known maximum a posteriori estimator. Portilla et
al (2002) described a model in which each neighborhood of wavelet coefficients is
described as a GSM, i.e., the product of a Gaussian random vector and an independent
hidden random scalar multiplier. Strela (2000) described the joint densities of clusters
of wavelet coefficients as a GSM and developed a maximum likelihood solution for
estimating the relevant wavelet coefficients from the noisy observations. A
disadvantage of the HMT is the computational burden of the training stage; to
overcome this problem, a simplified HMT, named uHMT (Romberg et al 2001), was
proposed.
b.4 Non-orthogonal Wavelet Transforms
The undecimated WT (UDWT) has also been used for decomposing the signal to
provide a visually better solution. Since the UDWT is shift invariant, it avoids visual
artifacts such as the pseudo-Gibbs phenomenon, but it adds a large computational
overhead, making it less feasible. In Lang et al (1995), normal hard/soft thresholding
was extended to the shift-invariant discrete WT. In Cohen et al (1999), shift-invariant
wavelet packet decomposition (SIWPD) is exploited to obtain a number of basis
functions; using the minimum description length principle, the best basis function is
found, which yields the smallest code length required to describe the given data, and
thresholding is then applied to denoise the data. Multiwavelets are obtained by
applying more than one mother (scaling) function to the given dataset, and possess
properties such as short support, symmetry and, most importantly, a higher order of
vanishing moments. The combination of shift invariance and multiwavelets is
implemented in Bui et al (1998), giving superior results for the Lena image in terms
of MSE.
b.5 Contourlet Domain
In 2002, Do and Vetterli pioneered a sparse representation for two-dimensional
piecewise smooth signals that resemble images, named the contourlet transform.
However, image denoising by means of the contourlet transform introduces many
visual artifacts because of Gibbs-like phenomena around singularities (Ramin Eslami
and Hayder Radha 2003). The contourlet transform has a fast iterated filter bank
algorithm that requires O(N) operations for N-pixel images and is easily adjustable
for detecting fine details in any orientation (Do and Vetterli 2005) at various scale
levels. Due to the lack of translation invariance of the contourlet transform, the
nonsubsampled contourlet transform (NSCT) was proposed (Arthur L da Cunha
2006); its structure consists of a bank of filters and can be divided into the following
two shift-invariant parts:
i. Nonsubsampled pyramid (NSP) and
ii. Nonsubsampled directional filter bank (NSDFB)
2.4 FEATURE EXTRACTION
After preprocessing the image, features have to be extracted from it. Although it is
possible to extract a large set of features, only a small subset is used in classification
due to the curse of dimensionality, which states that as the dimensionality increases,
the amount of required training data increases exponentially. There may also be
strong correlation between different features, which is a further incentive to reduce
the size of the feature set. The features selected during feature extraction quantify
properties of the biological structures of interest, at either the cellular level or the
tissue level: cellular-level features capture deviations in the cell structures, while
tissue-level features capture changes in the cell distribution across the tissue. The
five groups of features of interest are:
• Textural features provide information about the variation in intensity of a
surface and quantify properties like smoothness, coarseness and regularity.
• Morphological features provide information about the size and shape of a
nucleus/cell.
• Fractal-based features provide information on the regularity and complexity
of a cell/tissue by quantifying its self-similarity level.
• Topological features provide information on the cellular structure of a tissue
by quantifying the spatial distribution of its cells.
• Intensity-based features provide information on the intensity (gray-level or
color) histogram of the pixels located in a nucleus/cell.
The types of features used for the diagnosis of different types of cancer are given in
Table 2.6. A single feature type can be used, as in the case of prostate cancer; two
can be used, as for skin, lung and liver cancers; three can be grouped, as for cervical,
colorectal and gastric cancers; or multiple feature types can be grouped, as for
bladder, breast and mesothelioma cancers (Table 2.6).
The nuclear (morphological) features to be computed for each identified nucleus
(Street et al 1993) from Figure 2.4 for ultimate cancer diagnosis are:
Table 2.6 Types of Features used in the Diagnosis of different Types of Cancer

S.No | Type of Cancer | Features | References
1 | Bladder | Morphological, Textural, Fractal-based, Topological | Choi et al (1997); Rajesh and Dey (2003); Bommanna Raja (2008)
2 | Brain | Textural, Topological | Spyridonos et al (2002); Gunduz et al (2004); Demir et al (2005)
3 | Breast | Morphological, Textural, Fractal-based, Intensity-based | Schnorrenberg et al (1996); Anderson et al (1997); Einstein et al (1998); Dey and Mohanty (2003); Swiniarski and Lim (2006); Zografos et al (2010)
4 | Cervical | Textural, Fractal-based, Topological | Keenan et al (2000); McGregor and Olaitan (2010)
5 | Colorectal | Textural, Fractal-based, Intensity-based | Hamilton et al (1997); Esgiar et al (1998); Vegard et al (2009)
6 | Gastric | Morphological, Textural, Intensity-based | Blekas et al (1998); Cunningham and Schulick (2009)
7 | Liver | Textural, Fractal-based | Nielsen et al (1999); Albregtsen et al (2000); Rong Mu et al (2010)
8 | Lung | Morphological, Intensity-based | Thiran and Macq (1996); Zhou et al (2002); Jiang et al (2009)
9 | Mesothelioma | Morphological, Textural, Topological, Intensity-based | Weyn et al (1999); Wong et al (2009)
10 | Prostate | Textural | Diamond et al (2004); Wong et al (2009)
11 | Skin | Textural, Intensity-based | Smolle (2000); Wiltgen et al (2003); Quéreux et al (2010)
Figure 2.4 A Digital image taken from a breast FNA (Street et al 1993).
Radius: average length of a radial line segment, from the centre of mass to a
snake point.
Perimeter: distance around the boundary.
Area: number of pixels in the interior of the nucleus.
Compactness: (Perimeter)² / Area.
Smoothness: average difference in length of adjacent radial lines.
Concavity: size of any indentations in the nuclear border.
Concave points: number of points on the boundary that lie on an indentation.
Symmetry: relative difference in length between line segments perpendicular to
and on either side of the major axis.
Fractal dimension: fractal dimension of the boundary based on the
'coastline approximation'.
Texture: variance of grey-scale level of internal pixels.
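A few of these morphological features can be computed directly from a sampled boundary. The sketch below uses a hypothetical `nuclear_features` helper and treats the boundary as a closed polygon (shoelace area) rather than the snake of Street et al (1993):

```python
import numpy as np

def nuclear_features(boundary):
    """Area, perimeter and compactness from a closed nuclear boundary
    given as an (N, 2) array of (x, y) points; a simplified take on the
    morphological features of Street et al (1993)."""
    x, y = boundary[:, 0], boundary[:, 1]
    xn, yn = np.roll(x, -1), np.roll(y, -1)        # next vertex, wrapping
    area = 0.5 * abs(np.sum(x * yn - xn * y))      # shoelace formula
    perimeter = np.sum(np.hypot(xn - x, yn - y))   # sum of edge lengths
    compactness = perimeter ** 2 / area            # minimised by a circle
    return area, perimeter, compactness

# regular 64-gon approximating a circular nucleus of radius 10
theta = np.linspace(0, 2 * np.pi, 64, endpoint=False)
circle = np.column_stack([10 * np.cos(theta), 10 * np.sin(theta)])
area, perim, comp = nuclear_features(circle)
```

For a perfect circle the compactness is 4π (about 12.57), its minimum possible value; irregular, indented nuclear borders drive it upward, which is what makes the feature diagnostically useful.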
2.5 SUMMARY
This chapter provides a detailed literature survey on soft computing techniques,
computer aided medical diagnosis, image denoising algorithms and feature extraction.