B. Verma Sir TIT: Application of GA

Upload: anuraggupta

Post on 07-Jul-2018


TRANSCRIPT

  • 8/18/2019 b. Verma Sir Tit Application of Ga

    1/30

  • 2/30

    Neural Network GA Based Hybrid System

    • Neural Networks can learn various tasks from training examples, and can classify and model non-linear relationships.

    • GAs have been used to optimize the parameters of NNs.

    • The GA encodes the parameters of the NN as strings representing chromosomes.

    • GA-NN technology has the ability to locate the neighborhood of the optimal solution quickly.

    • A large amount of memory is required to handle and manipulate the chromosomes.

  • 3/30

    Example: GA Based Backpropagation Networks

    Network size: 2-2-2; number of weights: 8

    Input layer to hidden layer: 4

    Hidden layer to output layer: 4

    Conventional NNs make use of gradient descent learning to obtain their weights.

    In conventional NNs there is the problem of local minima. The GA, although it does not guarantee the global optimum, has been found to obtain acceptably good solutions.

  • 4/30

    Chromosome Representation

    [ W11, W12, W21, W22, V11, V12, V21, V22 ]

    Each gene is a real value coded in decimal digits.

    We are considering weights up to three decimal places, so the number of digits required is 4.

    One digit is required for the sign (+/-).

    Fitness function: F = 1/RMSE
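The encoding above can be sketched in code. This is a minimal illustration rather than the slide's exact scheme: it assumes the sign digit leads each gene ('0' for +, '1' for -) and the remaining four digits encode a weight with three decimal places, and it leaves the RMSE computation to a caller-supplied function.

```python
import random

GENE_DIGITS = 5   # 1 sign digit + 4 magnitude digits (weight to 3 decimal places)
NUM_WEIGHTS = 8   # 2-2-2 network: W11..W22 and V11..V22

def random_gene():
    """One weight as a digit string: sign digit ('0' => +, '1' => -), then 4 digits."""
    return random.choice("01") + "".join(random.choice("0123456789") for _ in range(4))

def random_chromosome():
    """A chromosome is the concatenation of the eight digit-coded genes."""
    return "".join(random_gene() for _ in range(NUM_WEIGHTS))

def decode_gene(gene):
    """e.g. '12345' -> -2.345, '00500' -> 0.5"""
    sign = -1.0 if gene[0] == "1" else 1.0
    return sign * int(gene[1:]) / 1000.0

def decode_chromosome(chrom):
    """Split the digit string into genes and decode each into a weight."""
    return [decode_gene(chrom[i:i + GENE_DIGITS])
            for i in range(0, len(chrom), GENE_DIGITS)]

def fitness(chrom, rmse_fn):
    """Slide's fitness F = 1/RMSE; rmse_fn evaluates the network built from the
    decoded weights. A small epsilon guards against division by zero."""
    return 1.0 / (rmse_fn(decode_chromosome(chrom)) + 1e-9)
```

A GA would then evolve these digit strings with ordinary crossover and mutation, ranking individuals by `fitness`.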

  • 5/30

    Need for Data Mining

    Traditional statistical techniques and data management tools are no longer adequate for analyzing the vast collections of data being generated.

    Example domains:

    Financial Investment: stock indexes and prices, interest rates, credit card data, fraud detection

    Health Care: diagnostic information stored by hospital management systems

    Manufacturing and Production: process optimization and troubleshooting

    Telecommunication networks: calling patterns and fault management systems

    Scientific Domain: astronomical observations, genomic data, biological data

    The World Wide Web

  • 6/30

    Knowledge Discovery in Databases (KDD)

    • The term KDD refers to the overall process of knowledge discovery in databases. Data mining is a particular step in this process, involving the application of specific algorithms for extracting patterns (models) from data.

    • The additional steps in the KDD process, such as data preparation, data selection, data cleaning, data integration and proper interpretation of the results of mining, ensure that useful knowledge is derived from the data.

  • 7/30

     

    The Common Functions of Data Mining

    • Classification: classifies a data item into one of several predefined categorical classes.

    • Regression: maps a data item to a real-valued prediction variable.

    • Clustering: maps a data item into one of several clusters, where clusters are natural groupings of data items based on similarity metrics or probability density models.

    • Rule generation: extracts classification rules from the data.

  • 8/30

    The Common Functions of Data Mining

    • Discovering association rules: describes association relationships among different attributes.

    • Summarization: provides a compact description for a subset of data.

    • Dependency modeling: describes significant dependencies among variables.

    • Sequence analysis: models sequential patterns, as in time-series analysis. The goal is to model the states of the process generating the sequence, or to extract and report deviations and trends over time.

  • 9/30

    Challenges of Data Mining

    • Massive data sets and high dimensionality. Huge data sets create a combinatorially explosive search space and increase the chance that a data mining algorithm will find spurious patterns that are not generally valid. Possible solutions include robust and efficient algorithms, sampling approximation methods, and parallel processing.

    • User interaction and prior knowledge. Data mining is inherently an interactive and iterative process. User interaction is required at various stages, and domain knowledge may be used either in the form of a high-level specification of the model, or at a more detailed level.

    • Overfitting and assessing statistical significance. Data sets used for mining are usually huge and available from distributed sources. As a result, the presence of spurious data points often leads to overfitting of the models. Regularization and re-sampling methodologies need to be emphasized in model design.

  • 10/30

    Challenges of Data Mining

    • Understandability of patterns. It is necessary to make the discoveries more understandable to humans. Possible solutions include rule structuring, natural language representation, and the visualization of data and knowledge.

    • Nonstandard and incomplete data. The data can be missing and/or noisy.

    • Mixed media data. Learning from data that is represented by a combination of various media, like numeric, symbolic, images and text.

    • Management of changing data and knowledge. Rapidly changing data, in a database that is modified/deleted/augmented, may make previously discovered patterns invalid. Possible solutions include incremental methods for updating the patterns.

    • Integration. Data mining tools are often only a part of the entire decision-making system. It is desirable that they integrate smoothly, both with the database and with the final decision-making procedure.

  • 11/30

    GA for Classification Rule Discovery

    • Michigan approach: the population consists of individuals (chromosomes) where each individual encodes a single prediction rule.

    • Pittsburgh approach: each individual encodes a set of prediction rules.

    • Pluses and minuses:

    - The Pittsburgh approach directly takes rule interaction into account when computing the fitness function of an individual.

    - This approach leads to syntactically longer individuals.

    - In the Michigan approach the individuals are simpler and syntactically shorter.

    - It simplifies the design of genetic operators.

    • Take the rule: IF cond_1 AND cond_2 AND ... AND cond_n THEN class = c_i

    - Representation of the rule antecedent (the IF part)

    - Representation of the rule consequent (the THEN part)

  • 12/30

    The Rule Antecedent (using GA)

    • Often there is a conjunction of conditions.

    • Usually a binary encoding is used.

    • A given attribute can take on k discrete values. Its encoding can consist of k bits ('1' for "on", '0' for "off"):

    - 0 0 1 1 1 0 1 1 0 0 ... 0

    • All bits can be turned into '1's in order to "turn off" this condition (the attribute then matches any value).

    • Non-binary encoding is possible. Variable-length individuals will arise; crossover may have to be modified to cope with variable-length individuals.
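The k-bit condition encoding above can be sketched as follows; the attribute and its three values are hypothetical, chosen only to illustrate the idea.

```python
# A hypothetical attribute "colour" with k = 3 discrete values.
VALUES = ["red", "green", "blue"]

def matches(bits, value):
    """A k-bit condition matches an instance's value if that value's bit is 1."""
    return bits[VALUES.index(value)] == 1

def is_turned_off(bits):
    """All bits set to 1: the condition accepts every value, i.e. it is 'turned off'."""
    return all(b == 1 for b in bits)
```

For instance, the condition `[1, 0, 1]` matches "red" and "blue" but not "green", while `[1, 1, 1]` effectively removes the attribute from the rule.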

  • 13/30

    Representing the Rule Consequent (Predicted Class)

    • Three ways of representing the predicted class (the THEN part):

    - First, encode it in the genome of an individual (possibly making it subject to evolution).

    - Second, associate all individuals of the population with the same predicted class, which is never modified during the running of the algorithm.

    - Third, choose the predicted class most suitable for a rule (in a deterministic way) as soon as the corresponding rule antecedent is formed (e.g. to maximize fitness).

  • 14/30

  • 15/30

    • Generalizing/specializing crossover: the basic idea of this special kind of crossover is to generalize or specialize a given rule, depending on whether it is currently overfitting or underfitting the data.

    • Example: with the Michigan approach (where each individual represents a single rule) using a binary encoding, the generalizing and specializing crossover operators can be implemented as the logical OR and the logical AND, respectively.
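Over bit-vector rule antecedents, the two operators reduce to bitwise OR and AND, as a minimal sketch:

```python
def generalizing_crossover(parent_a, parent_b):
    """Bitwise OR: the child accepts any value accepted by either parent,
    yielding a more general rule."""
    return [a | b for a, b in zip(parent_a, parent_b)]

def specializing_crossover(parent_a, parent_b):
    """Bitwise AND: the child accepts only values accepted by both parents,
    yielding a more specific rule."""
    return [a & b for a, b in zip(parent_a, parent_b)]
```

For parents `[1, 0, 0, 1]` and `[0, 1, 0, 1]`, OR gives `[1, 1, 0, 1]` (broader antecedent) and AND gives `[0, 0, 0, 1]` (narrower antecedent).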

  • 16/30

    Fitness Function for Rule Discovery

    • Let a rule be IF A THEN C.

    • The predictive performance of a rule can be summarized by a 2×2 matrix, sometimes called a confusion matrix:

    TP (True Positives): number of examples satisfying A and C

    FP (False Positives): number of examples satisfying A but not C

    FN (False Negatives): number of examples not satisfying A but satisfying C

    TN (True Negatives): number of examples satisfying neither A nor C

  • 17/30

    • The 2×2 confusion matrix:

                            Actual class
                            C        not C
      Predicted    C        TP       FP
      class        not C    FN       TN

    • Confidence factor: CF = TP / (TP + FP)

    • Completeness measure: COMP = TP / (TP + FN)

    • Fitness = CF × COMP = (TP)(TP) / ((TP + FP)(TP + FN))

    • Fitness = w1 × (CF × COMP) + w2 × Sim, where Sim is a measure of rule simplicity (0 ≤ Sim ≤ 1) and w1 and w2 are user-defined weights.
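The fitness formulas above can be written as a small helper. The defaults w1 = 1 and w2 = 0 are illustrative only, and recover the plain CF × COMP product:

```python
def rule_fitness(tp, fp, fn, tn, sim=1.0, w1=1.0, w2=0.0):
    """Weighted rule fitness from the 2x2 confusion matrix counts.
    sim is the rule-simplicity measure in [0, 1]; tn is unused by the
    formula but kept so the full matrix can be passed in."""
    cf = tp / (tp + fp)     # confidence factor
    comp = tp / (tp + fn)   # completeness measure
    return w1 * (cf * comp) + w2 * sim
```

For example, a rule with TP = 8, FP = 2, FN = 2 has CF = COMP = 0.8 and fitness 0.64 under the default weights.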

  • 18/30

    GA for Clustering

    • A crucial issue in the design of a GA for clustering is deciding what kind of individual representation will be used to specify the clusters.

    - Cluster-description-based representation: each individual explicitly represents the parameters necessary to precisely specify each cluster. The nature of the parameters depends on the shape of the cluster.

  • 19/30

    Centroid/Medoid-based representation

    • In this case each individual represents the coordinates of each cluster's centroid or medoid.

    • A centroid is simply a point in the data space whose coordinates specify the centre of the cluster.

    • A medoid is the data instance which is nearest to the cluster's centroid.

    • The positions of the centroids/medoids and the procedure used to assign instances to clusters implicitly determine the precise shape and size of the clusters.

  • 20/30

    Instance-based representation

    • In this case each individual consists of a string of n elements (genes), where n is the number of data instances. Each gene i, i = 1, ..., n, represents the index (id) of the cluster to which the i-th data instance is assigned. Hence, each gene i can take one out of K values, where K is the number of clusters.

    • Example: suppose that n = 10 and K = 3. The individual {2 1 2 3 3 2 1 1 2 3} corresponds to a candidate clustering where the second, seventh and eighth instances are assigned to cluster 1; the first, third, sixth and ninth instances are assigned to cluster 2; and the other instances are assigned to cluster 3.
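Decoding such an individual into explicit clusters can be sketched as:

```python
def decode_clustering(individual):
    """Map an instance-based individual to {cluster_id: [1-based instance indices]}."""
    clusters = {}
    for i, cluster_id in enumerate(individual, start=1):
        clusters.setdefault(cluster_id, []).append(i)
    return clusters
```

Applied to the slide's example `[2, 1, 2, 3, 3, 2, 1, 1, 2, 3]`, this yields cluster 1 = {2, 7, 8}, cluster 2 = {1, 3, 6, 9}, and cluster 3 = {4, 5, 10}.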

  • 21/30

    Comparison of the various representations

    • In both the centroid/medoid-based and the instance-based representations, clusters are mutually exclusive and exhaustive.

    • The cluster descriptions may have some overlapping, so that an instance may be located within two or more clusters.

    • The instance-based representation has the disadvantage that it does not scale very well for large data sets, since each individual's length is directly proportional to the number of instances being clustered.

    • This representation also involves a considerable degree of redundancy, which may lead to problems in the application of conventional genetic operators. For instance, let n = 4 and K = 2, and consider the individuals {1 2 1 2} and {2 1 2 1}. These two individuals have different gene values in all four genes, but they represent the same candidate clustering solution, i.e., assigning the first and third instances to one cluster and assigning the second and fourth instances to another cluster. They create very different results in crossover.

  • 22/30

    Fitness Evaluation for Clustering

    • The fitness of an individual is a measure of the quality of the clustering represented by the individual.

    • Basic fitness ideas usually involve the following principles:

    - The smaller the intra-cluster (within-cluster) distance, the better the fitness.

    - The larger the inter-cluster (between-cluster) distance, the better the fitness.
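One common way to combine these two principles is as a ratio. The sketch below is an illustration under that assumption, not a formula from the slides: it scores a centroid-based clustering by the minimum distance between centroids divided by the mean within-cluster distance.

```python
import math

def clustering_fitness(points, assignment, centroids):
    """Higher fitness for tight clusters (small intra-cluster distance)
    that are far apart (large inter-cluster distance)."""
    # mean distance from each point to its assigned centroid
    intra = sum(math.dist(p, centroids[c])
                for p, c in zip(points, assignment)) / len(points)
    # smallest distance between any pair of centroids
    inter = min(math.dist(a, b)
                for i, a in enumerate(centroids)
                for b in centroids[i + 1:])
    return inter / (intra + 1e-9)  # epsilon avoids division by zero
```

Many other validity indices (e.g. Davies-Bouldin or silhouette) encode the same two principles in different ways.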

  • 23/30

    Genetic Algorithms (GAs) for Pre-processing

    • "The use of GAs for attribute selection seems natural. The main reason is that the major source of difficulty in attribute selection is attribute interaction, and one of the strengths of GAs is that they usually cope well with attribute interactions."

    • The standard individual representation for attribute selection consists simply of a string of N bits, where N is the number of original attributes and the i-th bit, i = 1, ..., N, can take the value 1 or 0, indicating whether or not, respectively, the i-th attribute is selected.

    + This individual representation is simple, and traditional crossover and mutation operators can be easily applied.

    - However, it has the disadvantage that it does not scale very well with the number of attributes.

  • 24/30

    • An alternative individual representation: each individual represents a candidate attribute subset. A candidate attribute subset can be represented as a string with m binary genes, where m is the number of attributes and each gene can take on a '1' or a '0'.

    • For instance, the individual (0 0 1 0 1), where m = 5, represents a candidate solution where only the 3rd and the 5th attributes are selected.

    • One advantage of this representation is that it scales up better with respect to a large number of original attributes.

    • Crossover and mutation follow the usual procedures.
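Decoding such a bit string into the selected attribute indices is a one-liner:

```python
def selected_attributes(individual):
    """Return the 1-based indices of the attributes whose gene is 1."""
    return [i for i, bit in enumerate(individual, start=1) if bit == 1]
```

Applied to the example above, `selected_attributes([0, 0, 1, 0, 1])` yields `[3, 5]`, the 3rd and 5th attributes.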

  • 25/30

    Fitness Function for Attribute Selection

    • GAs for attribute selection can be roughly divided into two approaches:

    - Wrapper approach: the GA uses the classification algorithm to compute the fitness of individuals.

    - Filter approach: the GA does not use the classification algorithm.
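The two approaches differ only in how an individual is scored. A schematic contrast, in which both scoring inputs (the accuracy callback and the per-attribute relevance scores) are hypothetical stand-ins, not anything defined in the slides:

```python
def wrapper_fitness(individual, evaluate_accuracy):
    """Wrapper approach: fitness comes from the classification algorithm itself.
    evaluate_accuracy is an assumed callback that trains and tests a classifier
    on the selected attribute subset and returns its accuracy."""
    subset = [i for i, bit in enumerate(individual) if bit == 1]
    return evaluate_accuracy(subset)

def filter_fitness(individual, attribute_scores):
    """Filter approach: score the subset from data statistics alone, here a
    precomputed per-attribute relevance score (purely illustrative)."""
    return sum(score for score, bit in zip(attribute_scores, individual) if bit == 1)
```

Wrapper fitness is usually more accurate but far more expensive, since every evaluation retrains the classifier; filter fitness is cheap but ignores how the classifier actually uses the attributes.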

  • 26/30

    Genetic Algorithms (GAs) for Post-processing

    • GAs can be used in the post-processing step when an ensemble of classifiers (e.g. rule sets) has been created. Generating an ensemble of classifiers is a relatively recent trend in machine learning when the primary goal is to maximize predictive accuracy.

    • Generating an ensemble of classifiers is useful since it has been shown that in several cases an ensemble of classifiers has better predictive accuracy than a single classifier.

    • A fitness function may be created using weights for each classifier in the ensemble. (A user may help.) There are also GA schemes to optimize the weights of the classifiers.

    • There is a risk of generating too many classifiers which end up overfitting the training data; hence pruning is sometimes used.

  • 27/30

    Research Problems

    • Discovering surprising rules:

    Evolutionary algorithms seem to have good potential to discover truly surprising rules, due to their ability to cope well with attribute interaction.

    - An interesting research direction is to design new surprisingness measures to evaluate the rules produced by evolutionary algorithms.

  • 28/30

    • Scaling up Evolutionary Algorithms with Parallel Processing:

    In the context of mining very large databases, the vast majority of the processing time of an evolutionary algorithm is spent on evaluating an individual's fitness.

    - One strategy distributes the population individuals across the available processors and computes their fitness in parallel. However, this strategy reduces scalability for large databases.

    - Alternatively, the fitness of each individual is computed in parallel by all processors, with the data being mined partitioned across the processors.

  • 29/30

    EA for KDD may be applied to other domains

    • KDD has a very interdisciplinary nature and uses many different paradigms of knowledge discovery algorithms. This motivates the integration of evolutionary algorithms with other knowledge discovery paradigms.

    • KDD tasks involve some kind of prediction, where generalization performance on a separate test set is much more important than performance on the training set. This principle may be applied to other domains as well.

  • 30/30

    References

    • Alex A. Freitas, "A Review of Evolutionary Algorithms for Data Mining", Computing Laboratory, University of Kent, UK.

    • Sushmita Mitra, Sankar K. Pal, and Pabitra Mitra, "Data Mining in Soft Computing Framework: A Survey", IEEE Transactions on Neural Networks, Vol. 13, No. 1, January 2002.

    • Behrouz Minaei-Bidgoli and William F. Punch, "Using Genetic Algorithms for Data Mining Optimization in an Educational Web-Based System", GARAGe, Department of Computer Science & Engineering, Michigan State University, http://garage.cse.msu.edu

    • Alex Alves Freitas, "Evolutionary Computation", http://www.ppgia.pucpr.br/~alex

    • Sid Bhattacharyya, "Genetic Algorithms For Data Mining", www.uic.edu/classes/idsc/ids572cna/GADataMiningCNA.pdf

    • www.site.uottawa.ca/~nat/Courses/csi5388/.../KimSlides.ppt