dwm course

Upload: nareshkosuri6966

Post on 01-Jun-2018


  • 8/9/2019 Dwm Course

    1/67

    Geethanjali College of Engineering and Technology

    DEPARTMENT OF INFORMATION TECHNOLOGY

    (Name of the Subject/Lab Course): Operating Systems

    (JNTU CODE): ++,-..

    Programme: UG/PG

    Branch: IT    Version No: 1 *

    Year: III    Document Number: GCET/IT/-,2 **

    Semester: I    No. of Pages: 4,

    Classification status (Unrestricted/Restricted):

    Distribution List:

    Prepared by:
    1) Name: Y. RAJU
    2) Sign:
    3) Design: ASSOC. PROFF
    4) Date: 8

    Updated by:
    1) Name:
    2) Sign:
    3) Design:
    4) Date:

    Verified by:        *For Q.C only
    1) Name:            1) Name:
    2) Sign:            2) Sign:
    3) Design:          3) Design:
    4) Date:            4) Date:

    Approved by (HOD):
    1) Name:
    2) Sign:
    3) Date:

    *If it is prepared for the first time, 1; if it is updated, 2 or 3

    **GCET/Dept./3 indicates 3rd year; 4 indicates fourth in the list of JNTU syllabus book

    SYLLABUS

    UNIT-I
    INTRODUCTION: Fundamentals of data mining, Data Mining Functionalities,
    Classification of Data Mining systems, Major issues in Data Mining.
    Data Preprocessing: Needs Preprocessing the Data, Data Cleaning, Data Integration and
    Transformation, Data Reduction, Discretization and Concept Hierarchy Generation.

    UNIT-II
    Data Warehouse and OLAP Technology for Data Mining: Data Warehouse, Multidimensional
    Data Model, Data Warehouse Architecture, Data Warehouse Implementation, Further
    Development of Data Cube Technology, From Data Warehousing to Data Mining.

    UNIT-III
    DATA MINING PRIMITIVES, LANGUAGES AND SYSTEM ARCHITECTURES: Data Mining
    Primitives, Data Mining Query Languages, Designing Graphical User Interfaces Based
    on a Data Mining Query Language, Architectures of Data Mining Systems.

    UNIT-IV
    CONCEPTS DESCRIPTION: Characterization and Comparison: Data Generalization and
    Summarization-Based Characterization, Analytical Characterization: Analysis of Attribute
    Relevance, Mining Class Comparisons: Discriminating between Different Classes, Mining
    Descriptive Statistical Measures in Large Databases.

    UNIT-V
    MINING ASSOCIATION RULES IN LARGE DATABASES: Association Rule Mining, Mining
    Single-Dimensional Boolean Association Rules from Transactional Databases, Mining
    Multilevel Association Rules from Transaction Databases, Mining Multidimensional
    Association Rules from Relational Databases and Data Warehouses, From Association
    Mining to Correlation Analysis, Constraint-Based Association Mining.

    UNIT-VI
    CLASSIFICATION AND PREDICTION: Issues Regarding Classification and Prediction,
    Classification by Decision Tree Induction, Bayesian Classification, Classification by
    Backpropagation, Classification Based on Concepts from Association Rule Mining, Other
    Classification Methods, Prediction, Classifier Accuracy.

    UNIT-VII
    CLUSTER ANALYSIS INTRODUCTION: Types of Data in Cluster Analysis, A
    Categorization of Major Clustering Methods, Partitioning Methods, Density-Based Methods,
    Grid-Based Methods, Model-Based Clustering Methods, Outlier Analysis.

    UNIT-VIII
    MINING COMPLEX TYPES OF DATA: Multidimensional Analysis and Descriptive Mining of
    Complex Data Objects, Mining Spatial Databases, Mining Multimedia Databases, Mining
    Time-Series and Sequence Data, Mining Text Databases, Mining the World Wide Web.

    TEXT BOOKS :

    1. Data Mining - Concepts and Techniques - JIAWEI HAN & MICHELINE KAMBER,
    Harcourt India.

    REFERENCES :

    1. Data Mining Introductory and advanced topics - MARGARET H DUNHAM,
    PEARSON EDUCATION

    2. Data Mining Techniques - ARUN K PUJARI, University Press.

    3. Data Warehousing in the Real World - SAM ANAHORY & DENNIS MURRAY.
    Pearson Edn Asia.

    4. Data Warehousing Fundamentals - PAULRAJ PONNAIAH, WILEY STUDENT EDITION.

    5. The Data Warehouse Life cycle Tool kit - RALPH KIMBALL, WILEY STUDENT EDITION.

    For more details, visit http://www.jntu.

    GEETHANJALI COLLEGE OF ENGINEERING & TECHNOLOGY

    CHEERYAL (V), KEESARA (M), RR District.

    Department of: IT

    Year and Semester to Whom Subject is Offered: III B.Tech, II Sem

    Name of the Subject: Datawarehousing And Data Mining

    Name of the Faculty: Y. RAJU    Designation: Asso. Professor

    Department: IT

    1.1. Introduction to the subject:

    Data mining, the extraction of hidden predictive information from large databases, is
    a powerful new technology with great potential to help companies focus on the most
    important information in their data warehouses. Data mining tools predict future trends and
    behaviors, allowing businesses to make proactive, knowledge-driven decisions. The
    automated, prospective analyses offered by data mining move beyond the analyses of past
    events provided by retrospective tools typical of decision support systems. Data mining tools
    can answer business questions that traditionally were too time consuming to resolve. They
    scour databases for hidden patterns, finding predictive information that experts may miss
    because it lies outside their expectations.

    Most companies already collect and refine massive quantities of data. Data mining
    techniques can be implemented rapidly on existing software and hardware platforms to enhance
    the value of existing information resources, and can be integrated with new products and systems
    as they are brought online. When implemented on high performance client/server or parallel
    processing computers, data mining tools can analyze massive databases to deliver answers to
    questions such as, "Which clients are most likely to respond to my next promotional mailing, and
    why?"

    This white paper provides an introduction to the basic technologies of data mining.
    Examples of profitable applications illustrate its relevance to today's business environment as
    well as a basic description of how data warehouse architectures can evolve to deliver the value of
    data mining to end users.

    1.2. Objectives of the subject

    Improve Quality of Data

    Since a common DSS deficiency is "dirty data," it is almost guaranteed that you will have
    to address the quality of your data during every data warehouse iteration. Data cleansing is a
    sticky problem in data warehousing. On one hand, a data warehouse is supposed to provide
    clean, integrated, consistent and reconciled data from multiple sources. On the other hand, we are
    faced with a development schedule of 6-12 months. It is almost impossible to achieve both
    without making some compromises. The difficulty lies in determining what compromises to
    make. Here are some guidelines for determining your specific goal to cleanse your source data:

    Never try to cleanse ALL the data. Everyone would like to have all the data perfectly
    clean, but nobody is willing to pay for the cleansing or to wait for it to get done. To clean it all
    would simply take too long. The time and cost involved often exceed the benefit.

    Never cleanse NOTHING. In other words, always plan to clean something. After all,
    one of the reasons for building the data warehouse is to provide cleaner and more reliable data
    than you have in your existing OLTP or DSS systems.

    Determine the benefits of having clean data. Examine the reasons for building the data
    warehouse:

        Do you have inconsistent reports?
        What is the cause for these inconsistencies?
        Is the cause dirty data or is it programming errors?
        What dollars are lost due to dirty data?
        Which data is dirty?

    Determine the cost for cleansing the data. Before you make cleansing all the dirty data
    your goal, you must determine the cleansing cost for each dirty data element. Examine how long
    it would take to perform the following tasks:

        Analyze the data
        Determine the correct data values and correction algorithms
        Write the data cleansing programs
        Correct the old files and databases if appropriate

    Compare cost for cleansing to dollars lost by leaving it dirty. Everything in business
    must be cost-justified. This applies to data cleansing as well. For each data element, compare the
    cost for cleansing it to the business loss being incurred by leaving it dirty and decide whether to
    include it in your data cleansing goal. If dollars lost exceeds the cost of cleansing, put the data on
    the "to be cleansed" list. If cost for cleansing exceeds dollars lost, do not put the data on the "to
    be cleansed" list.
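The comparison rule above can be sketched in a few lines of Python. This is only an illustration: the `DirtyElement` class, the element names and all the cost figures are hypothetical, and ordering the list by net benefit is an assumed way of ranking, not something the notes prescribe.

```python
from dataclasses import dataclass

@dataclass
class DirtyElement:
    name: str
    cleansing_cost: float   # estimated cost to analyze, correct and reload (hypothetical)
    dollars_lost: float     # estimated business loss if the element stays dirty (hypothetical)

def to_be_cleansed(elements):
    """Keep elements whose loss exceeds the cleansing cost, largest net benefit first."""
    chosen = [e for e in elements if e.dollars_lost > e.cleansing_cost]
    return sorted(chosen, key=lambda e: e.dollars_lost - e.cleansing_cost, reverse=True)

elements = [
    DirtyElement("customer_address", 20_000, 75_000),
    DirtyElement("legacy_fax_number", 15_000, 1_000),
    DirtyElement("product_code", 40_000, 50_000),
]
cleanse_list = [e.name for e in to_be_cleansed(elements)]
print(cleanse_list)
```

Elements whose cleansing cost is higher than the loss they cause (here the fax number) never reach the list.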

    Prioritize the dirty data you considered for your data cleansing goal. A difficult part
    of compromising is balancing the time you have for the project with the goals you are trying to
    achieve. Even though you may have been cautious in selecting dirty data for your cleansing goal,
    you may still have too much dirty data on your "to be cleansed" list. Prioritize your list.

    For each prioritized dirty data item ask

    Minimize Inconsistent Reports

    Addressing another common complaint about current DSS environments, namely
    inconsistent reports, will most likely become one of your data warehouse goals. Inconsistent
    reports are mainly caused by misuse of data, and the primary reason for misuse of data is
    disagreement or misunderstanding of the meaning or the content of data. Correcting this problem
    is another predicament in data warehousing, because it requires the interested business units to
    resolve their disagreements or misunderstandings. This type of effort has more than once
    torpedoed a data warehouse project because it took too long to resolve the disputes. Ignoring the
    issue is not a solution either. We suggest the following guidelines:

    1.3. JNTU Syllabus with Additional Topics

    S.no | Unit no | Topic | Additional topics

    1 | 1 | Introduction: Fundamentals of data mining, Data Mining Functionalities, Classification of Data Mining systems, Major issues in Data Mining. Data Preprocessing: Needs Preprocessing the Data, Data Cleaning, Data Integration and Transformation, Data Reduction, Discretization and Concept Hierarchy Generation |

    2 | 2 | Data Warehouse and OLAP Technology for Data Mining: Data Warehouse, Multidimensional Data Model, Data Warehouse Architecture, Data Warehouse Implementation, Further Development of Data Cube Technology, From Data Warehousing to Data Mining |

    3 | 3 | Data Mining Primitives, Languages, and System Architectures: Data Mining Primitives, Data Mining Query Languages, Designing Graphical User Interfaces Based on a Data Mining Query Language, Architectures of Data Mining Systems | Testing methods, Black box testing

    4 | 4 | Concepts Description: Characterization and Comparison, Data Generalization and Summarization-Based Characterization, Analytical Characterization: Analysis of Attribute Relevance, Mining Class Comparisons: Discriminating between Different Classes, Mining Descriptive Statistical Measures in Large Databases |

    5 | 5 | Mining Association Rules in Large Databases: Association Rule Mining, Mining Single-Dimensional Boolean Association Rules from Transactional Databases, Mining Multilevel Association Rules from Transaction Databases, Mining Multidimensional Association Rules from Relational Databases and Data Warehouses, From Association Mining to Correlation Analysis, Constraint-Based Association Mining | Association Mining

    6 | 6 | Classification and Prediction: Issues Regarding Classification and Prediction, Classification by Decision Tree Induction, Bayesian Classification, Classification by Back propagation, Classification Based on Concepts from Association Rule Mining, Other Classification Methods, Prediction, Classifier Accuracy |

    7 | 7 | Cluster Analysis Introduction: Types of Data in Cluster Analysis, A Categorization of Major Clustering Methods, Partitioning Methods, Grid-Based Methods, Model-Based Clustering Methods, Density-Based Methods, Outlier Analysis |

    8 | 8 | Mining Complex Types of Data: Multidimensional Analysis and Descriptive Mining of Complex Data Objects, Mining Spatial Databases, Mining Multimedia Databases, Mining Time-Series and Sequence Data, Mining Text Databases, Mining the World Wide Web |

    1.4. Sources of Information

    1.4.1. Text books

    3. Data Warehousing in the Real World - SAM ANAHORY & DENNIS MURRAY. Pearson Edn Asia.

    4. Data Warehousing Fundamentals - PAULRAJ PONNAIAH, WILEY STUDENT EDITION.

    5. The Data Warehouse Life cycle Tool kit - RALPH KIMBALL, WILEY STUDENT EDITION.

    1.4.3. Websites:- http://www.jntu.ac.in/

    1.4.4. Journals:-

    1.5. Unit wise Summary

    S.no | Unit no | Topic | Additional topics

    1 | 1 | Introduction: Fundamentals of data mining, Data Mining Functionalities, Classification of Data Mining systems, Major issues in Data Mining. Data Preprocessing: Needs Preprocessing the Data, Data Cleaning, Data Integration and Transformation, Data Reduction, Discretization and Concept Hierarchy Generation | QTP

    2 | 2 | Data Warehouse and OLAP Technology for Data Mining: Data Warehouse, Data Warehouse Architecture, Data Warehouse Implementation, Further Development of Data Cube Technology, From Data Warehousing to Data Mining, Multidimensional Data Model |

    3 | 3 | Data Mining Primitives, Languages, and System Architectures: Data Mining Primitives, Data Mining Query Languages, Designing Graphical User Interfaces Based on a Data Mining Query Language, Architectures of Data Mining Systems |

    4 | 4 | Concepts Description: Characterization and Comparison: Data Generalization and Summarization-Based Characterization, Analytical Characterization: Analysis of Attribute Relevance, Mining Class Comparisons: Discriminating between Different Classes, Mining Descriptive Statistical Measures in Large Databases | Silk Testing

    5 | 5 | Mining Association Rules in Large Databases: Association Rule Mining, Mining Single-Dimensional Boolean Association Rules from Transactional Databases, Mining Multilevel Association Rules from Transaction Databases, Mining Multidimensional Association Rules from Relational Databases and Data Warehouses, From Association Mining to Correlation Analysis, Constraint-Based Association Mining |

    6 | 6 | Classification and Prediction: Issues Regarding Classification and Prediction, Classification by Decision Tree Induction, Bayesian Classification, Classification by Back propagation, Classification Based on Concepts from Association Rule Mining, Other Classification Methods, Prediction, Classifier Accuracy | KVCHAT APPLICATION

    7 | 7 | Cluster Analysis Introduction: Types of Data in Cluster Analysis, A Categorization of Major Clustering Methods, Partitioning Methods, Density-Based Methods, Grid-Based Methods, Model-Based Clustering Methods, Outlier Analysis | Automation Techniques

    8 | 8 | Mining Complex Types of Data: Multidimensional Analysis and Descriptive Mining of Complex Data Objects, Mining Time-Series and Sequence Data, Mining Text Databases, Mining the World Wide Web, Mining Spatial Databases, Mining Multimedia Databases | Agile model

    1.6. Micro Plan

    S.no | Unit no | Total no of periods | Topics to be covered | Reg/Additional | Teaching aids used (LCD/OHP) | Remarks

    1 | 1 | 1 | Introduction: Fundamentals of data mining | Regular | OHP |
      |   | 2 | Data Mining Functionalities | Regular | OHP |
      |   | 3 | Classification of Data Mining systems | Regular | OHP |
      |   | 4 | Major issues in Data Mining | Regular | OHP |
      |   | 5 | Data Preprocessing: Needs Preprocessing the Data | Regular | |
      |   |   | Data Cleaning, Data Integration and Transformation | | |
    2 | 2 | 6 | Data Warehouse and | Regular | OHP |
      |   | 7 | OLAP Technology for Data Mining: Data Warehouse | Regular | |
      |   | 8 | Multidimensional Data Model | Regular | OHP |
      |   | 9 | Data Warehouse Architecture | Regular | |
      |   | 10 | Data Warehouse Implementation | Regular | |
      |   | 11 | Further Development of Data Cube Technology | Regular | OHP |
      |   |   | From Data Warehousing to Data Mining | Regular | |
    3 | 3 | 12 | Data Mining Primitives | Regular | OHP |
      |   | 13 | Data Mining Primitives | Regular | |
      |   | 14 | Data Mining Query Languages | Regular | |
      |   | 15 | Designing Graphical User Interfaces Based on a Data Mining Query | Regular | OHP |
      |   | 16 | Language, Architectures of Data Mining Systems | Regular | |
      |   |   | Languages, and System Architectures | Regular | OHP |
    4 | 4 | 17 | Concepts Description | Regular | |
      |   | 18 | Characterization and Comparison | Regular | |
      |   | 19 | Data Generalization and Summarization-Based Characterization | Regular | |
      |   | 20 | Analytical Characterization: Analysis of Attribute Relevance | Regular | |
      |   | 21 | Mining Class Comparisons: Discriminating between Different Classes | Regular | |
      |   | 22 | Mining Descriptive Statistical Measures in Large Databases | | |
    5 | 5 | 23 | Mining Association Rules in Large Databases | Regular | |
      |   | 24 | Association Rule Mining | Regular | |
      |   | 25 | Mining Single-Dimensional Boolean Association Rules from Transactional Databases | Regular | |
      |   | 26 | Mining Multilevel Association Rules from Transaction Databases | Regular | |
      |   | 27 | Mining Multidimensional Association Rules from Relational Databases and Data Warehouses | Regular | |
      |   |   | From Association Mining to Correlation Analysis, Constraint-Based Association Mining | | |
    6 | 6 | 28 | Classification and Prediction | Regular | OHP |
      |   | 29 | Issues Regarding Classification and Prediction | Regular | |
      |   | 30 | Classification by Decision Tree Induction | Regular | |
      |   | 31 | Bayesian Classification | Regular | OHP |
      |   | 32 | Classification by Back propagation, Classification Based on Concepts from Association Rule Mining | Regular | OHP |
      |   |   | Other Classification Methods, Prediction, Classifier Accuracy | Regular | OHP |
    7 | 7 | 33 | Cluster Analysis Introduction | Regular | |
      |   | 34 | Types of Data in Cluster Analysis | Regular | |
      |   | 35 | A Categorization of Major Clustering Methods | Regular | OHP |
      |   | 36 | Partitioning Methods | Regular | |
    8 | 8 | 37 | Mining Complex Types of Data | Regular | OHP |
      |   | 38 | Multidimensional Analysis and Descriptive Mining of Complex | Regular | |
      |   | 39 | Data Objects, Mining Spatial Databases | Regular | LCD, OHP |

    1. PPTs

    2. OHP slides

    3. Subjective type questions (approximately = tL in no)

    UNIT-I

    DEFINITIONS:

    DATA MINING: Data mining refers to extracting or mining knowledge from large amounts of data.

    DATA MINING FUNCTIONALITIES: Characterization and discrimination,
    Mining Frequent Patterns, Associations, and Correlations, Association Analysis,
    Classification and Prediction, Cluster analysis, Outlier analysis, Trend and evolution analysis

    CLASSIFICATION OF DATA MINING SYSTEMS:

    General functionality
        Descriptive data mining
        Predictive data mining

    Data mining various criteria:
        Kinds of databases to be mined
        Kinds of knowledge to be discovered
        Kinds of techniques utilized
        Kinds of applications adapted

    Databases to be mined
        Relational, transactional, object-oriented, object-relational, active, spatial, time-series,
        text, multimedia, heterogeneous, legacy, WWW, etc.

    Knowledge to be mined
        Characterization, discrimination, association, classification, clustering, trend,
        deviation and outlier analysis, etc.
        Multiple/integrated functions and mining at multiple levels

    Techniques utilized
        Database-oriented, data warehouse (OLAP), machine learning, statistics,
        visualization, neural network, etc.

    Applications adapted
        Retail, telecommunication, banking, fraud analysis, DNA mining, stock market
        analysis, Web mining, Weblog analysis, etc.

    MAJOR ISSUES IN DATA MINING
        Mining methodology and user interaction issues
        Performance issues
        Issues relating to the diversity of data types

    DATA PREPROCESSING
        integrating multiple, heterogeneous data sources

    DATA CLEANSING
        Ensure consistency in naming conventions, encoding structures, attribute measures,
        etc. among different data sources

    BITS

    1. Regression is the oldest and most well-known statistical technique that the data mining
    community utilizes

    2. Data mining is the use of automated data analysis techniques to uncover previously
    undetected relationships among data items

    3. Three of the major data mining techniques are regression, classification and clustering.

    7. Explain Preprocess techniques?

    UNIT-II

    DATA WAREHOUSING

        A decision support database that is maintained separately from the organization's
        operational database

        A data warehouse is a subject-oriented, integrated, time-variant, and non-volatile
        collection of data in support of management's decision-making process.

    DEFINITIONS:

    OLAP (on-line analytical processing)
        Major task of data warehouse system
        Data analysis and decision making

    MULTIDIMENSIONAL DATA MODEL
        Star schema
        Snowflake schema
        Fact constellations

    CUBE DEFINITION (Fact Table)
        define cube <cube_name> [<dimension_list>]: <measure_list>
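To see what a cube definition describes, the sketch below builds a tiny "cube" in Python by summing one measure over a chosen list of dimensions. The `build_cube` helper, the `fact_rows` data and the dimension and measure names are all hypothetical; this only imitates the idea of a dimension list plus a measure list and is not DMQL itself.

```python
from collections import defaultdict

def build_cube(fact_rows, dimensions, measure):
    """Sum `measure` for every distinct combination of values of `dimensions`."""
    cube = defaultdict(float)
    for row in fact_rows:
        key = tuple(row[d] for d in dimensions)
        cube[key] += row[measure]
    return dict(cube)

# Hypothetical fact table: three dimensions and one measure.
fact_rows = [
    {"time": "Q1", "item": "TV", "location": "Hyderabad", "dollars_sold": 400.0},
    {"time": "Q1", "item": "TV", "location": "Chennai", "dollars_sold": 250.0},
    {"time": "Q2", "item": "TV", "location": "Hyderabad", "dollars_sold": 300.0},
    {"time": "Q1", "item": "Phone", "location": "Hyderabad", "dollars_sold": 150.0},
]

by_time_item = build_cube(fact_rows, ["time", "item"], "dollars_sold")
by_item = build_cube(fact_rows, ["item"], "dollars_sold")  # roll-up over time and location
print(by_time_item[("Q1", "TV")], by_item[("TV",)])
```

Dropping a dimension from the list corresponds to a roll-up: the same measure is aggregated at a coarser level.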

    DATA WAREHOUSE APPLICATIONS
        supports querying, basic statistical analysis, and reporting using crosstabs, tables,
        charts and graphs
        multidimensional analysis of data warehouse data
        supports basic OLAP operations, slice-dice, drilling, pivoting

    Major Task
        combines data from multiple sources into a coherent store

    Redundant data occur often when integrating multiple databases
        The same attribute may have different names in different databases
        One attribute may be a derived attribute in another table, e.g., annual revenue
        Redundant data may be detected by correlation analysis
        Careful integration of the data from multiple sources may help reduce/avoid
        redundancies and inconsistencies and improve mining speed and quality
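As an illustration of detecting redundancy by correlation analysis, the sketch below computes the Pearson correlation coefficient between two attributes. The attribute names and values are made up, and treating a coefficient close to 1 as a sign of redundancy is a rule of thumb assumed here, not a threshold given in the notes.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    # Assumes neither attribute is constant (otherwise the denominator is zero).
    return cov / math.sqrt(var_x * var_y)

# Hypothetical attributes from two source tables; the second is derived from the first.
monthly_revenue = [10.0, 12.0, 9.0, 15.0]
annual_revenue = [12 * m for m in monthly_revenue]

r = pearson(monthly_revenue, annual_revenue)
print(round(r, 6))
```

A coefficient this close to 1 suggests that one of the two attributes can be dropped or derived during integration.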

    Data reduction strategies
        Data cube aggregation
        Attribute subset selection
        Dimensionality reduction
        Numerosity reduction
        Discretization and concept hierarchy generation

    What is the lowest level of a data cube?
        the aggregated data for an individual entity of interest
        e.g., a customer in a phone calling data warehouse.

    Parametric methods
        Assume the data fits some model, estimate model parameters, store only the
        parameters, and discard the data (except possible outliers)
        Log-linear models: obtain value at a point in m-D space as the product on
        appropriate marginal subspaces

    Non-parametric methods
        Do not assume models
        Major families: histograms, clustering, sampling

    Discretization
        reduce the number of values for a given continuous attribute by dividing the range
        of the attribute into intervals. Interval labels can then be used to replace actual data
        values.

    Concept hierarchies
        reduce the data by collecting and replacing low level concepts (such as numeric
        values for the attribute age) by higher level concepts (such as young, middle-aged, or senior).
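Both ideas can be tried out with a short Python sketch. The cut points 30 and 60 used for the age concepts, the sample ages and the choice of equal-width binning are assumptions made for the example; the notes do not fix any of them.

```python
def age_to_concept(age):
    """Replace a numeric age by a higher-level concept (cut points are assumed)."""
    if age < 30:
        return "young"
    if age < 60:
        return "middle-aged"
    return "senior"

def equal_width_bins(values, k):
    """Discretize values into k equal-width intervals and return each value's interval."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k          # assumes the values are not all identical
    intervals = []
    for v in values:
        idx = min(int((v - lo) / width), k - 1)   # the maximum falls in the last bin
        intervals.append((lo + idx * width, lo + (idx + 1) * width))
    return intervals

ages = [22, 35, 47, 61, 70, 18]
concepts = [age_to_concept(a) for a in ages]
bins = equal_width_bins(ages, 4)
print(concepts)
print(bins[0], bins[4])
```

In both cases six distinct numeric values are replaced by a handful of labels, which is the data reduction effect described above.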

    BITS

    1. A Data Warehouse is a Structured Repository of Historic Data.

    2. A data warehouse integrates data from multiple data sources

    3. A data warehouse is a copy of transaction data specifically structured for query and analysis.

    4. OLAP stands for On-Line Analytical Processing

    5. OLAP can be broadly divided into two different ways, that is: MOLAP and ROLAP

    6. A data warehouse maintains its functions in three layers: staging, integration, and access

    7. The data accessed for reporting and analyzing and the tools for reporting and analyzing
    data is also called the data mart.

    8. Data access layer is the interface between the operational and informational access layer

    9. The data warehousing concept was intended to provide an architectural model for the flow
    of data from operational systems to decision support environments

    10. The integration layer is used to integrate data and to have a level of abstraction from
    users

    Easy Questions

    1. Explain Preprocessing procedure?

    2. Explain data Transformation?

    3. Explain data Integration?

    UNIT-III

    DEFINITIONS

    DATA MINING PRIMITIVES
        More flexible user interaction
        Foundation for design of graphical user interface
        Standardization of data mining industry and practice

    DATA MINING QUERY LANGUAGES
        A DMQL can provide the ability to support ad-hoc and interactive data mining
        By providing a standardized language like SQL
            to achieve a similar effect like that SQL has on relational database
            Foundation for system development and evolution
            Facilitate information exchange, technology transfer, commercialization and
            wide acceptance

    What tasks
        Data collection and data mining query composition
        Presentation of discovered patterns
        Hierarchy specification and manipulation
        Manipulation of data mining primitives
        Interactive multilevel mining
        Other miscellaneous information

    What Defines a Data Mining Task?
        Task-relevant data
        Type of knowledge to be mined
        Background knowledge
        Pattern interestingness measurements
        Visualization of discovered patterns

    Mine_Knowledge_Specification ::=
        mine associations [as pattern_name]

    What tasks

    Mine_Knowledge_Specification ::=
        mine comparison [as pattern_name]
        for target_class where target_condition
        {versus contrast_class_i where contrast_condition_i}
        analyze measures

    What is the Syntax for task
        DW systems, efficient implementation of a few DM primitives.

    Easy Questions

    1. Explain Data Mining Primitives?

    2. Explain a data mining query language?

    3. Architecture of data mining systems?

    Visualization techniques:
        Pie charts, bar charts, curves, cubes, and other visual forms.

    Quantitative characteristic rules
        Mapping generalized result into characteristic rules with quantitative information
        associated with it

    Decision tree
        each internal node tests an attribute
        each branch corresponds to attribute value
        each leaf node assigns a classification

    ID3 algorithm
        build decision tree based on training objects with known class labels to classify
        testing objects
        rank attributes with information gain measure
        minimal height
            the least number of tests to classify an object
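The attribute-ranking step of ID3 can be sketched as below. Only the information gain measure is shown, not the full recursive tree construction, and the four training rows with their attribute names are invented for the example.

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def information_gain(rows, attribute, target):
    """Reduction in entropy of `target` obtained by splitting `rows` on `attribute`."""
    base = entropy([r[target] for r in rows])
    remainder = 0.0
    for value in {r[attribute] for r in rows}:
        subset = [r[target] for r in rows if r[attribute] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return base - remainder

# Hypothetical training objects with a known class label "buys".
rows = [
    {"student": "yes", "income": "high", "buys": "yes"},
    {"student": "yes", "income": "low", "buys": "yes"},
    {"student": "no", "income": "high", "buys": "no"},
    {"student": "no", "income": "low", "buys": "no"},
]

ranked = sorted(["student", "income"],
                key=lambda a: information_gain(rows, a, "buys"),
                reverse=True)
print(ranked)
```

The attribute with the highest gain is tested at the root; the same ranking would be repeated on each branch to keep the tree short.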


    Data dispersion characteristics
        median, max, min, quantiles, outliers, variance, etc.

    Numerical dimensions - correspond to sorted intervals
        Data dispersion: analyzed with multiple granularities of precision
        Boxplot or quantile analysis on sorted intervals

    Dispersion analysis on computed measures
        Folding measures into numerical dimensions
        Boxplot or quantile analysis on the transformed cube

    Quartiles, outliers and boxplots
        Quartiles: Q1 (25th percentile), Q3 (75th percentile)
        Inter-quartile range: IQR = Q3 - Q1
        Five number summary: min, Q1, M, Q3, max
        Boxplot: ends of the box are the quartiles, median is marked, whiskers, and plot
        outlier individually
        Outlier: usually, a value more than 1.5 x IQR above Q3 or below Q1

    Five-number summary of a distribution:
        Minimum, Q1, M, Q3, Maximum

    Boxplot
        Data is represented with a box
        The ends of the box are at the first and third quartiles, i.e., the height of the box is
        IQR
        The median is marked by a line within the box
        Whiskers: two lines outside the box extend to Minimum and Maximum
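The five-number summary and the 1.5 x IQR outlier rule can be computed as in the sketch below. The sample readings are made up, and the "inclusive" quartile method of Python's `statistics.quantiles` is one of several quartile conventions, chosen here only for the example.

```python
import statistics

def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum)."""
    data = sorted(values)
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    return data[0], q1, median, q3, data[-1]

def iqr_outliers(values):
    """Values lying more than 1.5 x IQR below Q1 or above Q3."""
    _, q1, _, q3, _ = five_number_summary(values)
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [v for v in values if v < low or v > high]

readings = [1, 2, 3, 4, 5, 6, 7, 8, 9, 40]   # hypothetical data
summary = five_number_summary(readings)
outliers = iqr_outliers(readings)
print(summary, outliers)
```

These five numbers are exactly what a boxplot draws; values returned by `iqr_outliers` would be plotted individually.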

    Standard deviation:
        the square root of the variance
        Measures spread about the mean
        It is zero if and only if all the values are equal
        Both the deviation and the variance are algebraic
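"Algebraic" here means the measure can be computed from a fixed number of partial aggregates, so each data partition only has to supply its count, sum and sum of squares. The sketch below shows this for the population variance; the helper names and the two partitions are hypothetical.

```python
import math

def partial_stats(values):
    """Per-partition aggregates: (count, sum, sum of squares)."""
    return (len(values), sum(values), sum(v * v for v in values))

def merge(a, b):
    """Combine the aggregates of two partitions."""
    return tuple(x + y for x, y in zip(a, b))

def variance(stats):
    """Population variance from (count, sum, sum of squares)."""
    n, total, total_sq = stats
    mean = total / n
    return total_sq / n - mean * mean

part1 = partial_stats([2.0, 4.0, 4.0, 4.0])
part2 = partial_stats([5.0, 5.0, 7.0, 9.0])
combined = merge(part1, part2)

var = variance(combined)
std = math.sqrt(var)
print(var, std)
```

The raw values of each partition are never revisited after the three aggregates are stored, which is what makes the measure cheap to maintain in a data cube.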

    Difference in philosophies and basic assumptions
        Positive and negative samples in learning-from-example: positive used for
        generalization, negative for specialization
        Positive samples only in data mining: hence generalization-based, to drill-down
        backtrack the generalization to a previous state

    Difference in methods of generalizations
        Machine learning generalizes on a tuple by tuple basis
        Data mining generalizes on an attribute by attribute basis

    BITS

    1. Characterization of the composition of the postsynaptic proteome (PSP) provides a
    framework for understanding the overall organization and function

    2. Clustering using representatives is called CURE

    3. The Data Mining Server must be integrated with the data warehouse and the OLAP
    server to embed ROI-focused business analysis directly into this infrastructure

    4. A decision tree is a technique used for classification of a dataset

    5. Classification: the process of dividing a dataset into mutually exclusive groups

    6. Data cleansing is the process of ensuring that all values in a dataset are consistent and
    correctly recorded.

    7. Data warehouse is a system for storing and delivering massive quantities of data.

    8. Analytical model is a structure and process for analyzing a dataset

    9. Data navigation: the process of viewing different dimensions, slices, and levels of detail
    of a multidimensional database.

    10. Logistic regression: a linear regression that predicts the proportions of a categorical target
    variable, such as type of customer, in a population.

    Easy Questions

    1. Explain what is concept description?

    2. Data generalization and summarization-based characterization?

    3. Analytical characterization: Analysis of attribute relevance?

  • 8/9/2019 Dwm Course

    44/67

    UNIT-V

    Association rule mining

    Finding frequent patterns, associations, correlations, or causal structures among sets of items or objects in transaction databases, relational databases, and other information repositories.

    Basic Concepts of Association Rule

    Given a database of transactions, each transaction is a list of items purchased by a customer in a visit

    Find all rules that correlate the presence of one set of items with that of another set of items

    Find frequent patterns

    An example of frequent itemset mining is market basket analysis.

    Association rule performance measures

    Confidence

    Support

    Minimum support threshold

    Minimum confidence threshold
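
    The two measures and their thresholds can be sketched in a few lines. This is only an illustration: the function names and the sample baskets are made up for the example, and support is taken as the fraction of transactions containing the itemset, confidence as support(A and B together) divided by support(A).

```python
def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = frozenset(itemset)
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)


def confidence(antecedent, consequent, transactions):
    """support(A and B together) / support(A) for the rule A => B."""
    a = frozenset(antecedent)
    both = a | frozenset(consequent)
    return support(both, transactions) / support(a, transactions)


def is_strong(antecedent, consequent, transactions, min_sup, min_conf):
    """A rule is kept only if it meets both minimum thresholds."""
    both = frozenset(antecedent) | frozenset(consequent)
    return (support(both, transactions) >= min_sup
            and confidence(antecedent, consequent, transactions) >= min_conf)


# Hypothetical market-basket data, invented for illustration.
TRANSACTIONS = [
    frozenset({"milk", "bread"}),
    frozenset({"milk", "bread", "butter"}),
    frozenset({"milk"}),
    frozenset({"bread", "butter"}),
]
```

    For the rule milk => bread on these four baskets, the support is 2/4 and the confidence is 2/3, so the rule passes thresholds of 0.5 and 0.6 but fails a minimum confidence of 0.7.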

    Market basket analysis

    Identify patterns from Boolean vector

    Patterns can be represented by association rules.

    Apriori Algorithm

    Single-dimensional, single-level, Boolean frequent item sets

    Finding frequent item sets using candidate generation

    Generating association rules from frequent item sets
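
    A minimal sketch of the candidate-generation step, assuming transactions are given as sets of items and the minimum support is an absolute count. The function name and data layout are illustrative choices, and rule generation from the resulting item sets is left out.

```python
from itertools import combinations


def apriori(transactions, min_count):
    """Return {itemset: count} for every itemset with count >= min_count."""
    transactions = [frozenset(t) for t in transactions]

    # Level 1: count single items.
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    frequent = {s: c for s, c in counts.items() if c >= min_count}
    result = dict(frequent)

    k = 2
    while frequent:
        prev = set(frequent)
        # Join step: unions of frequent (k-1)-itemsets that have exactly k items.
        candidates = {a | b for a in prev for b in prev if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be frequent.
        candidates = {c for c in candidates
                      if all(frozenset(s) in prev for s in combinations(c, k - 1))}
        # Count step: one pass over the transactions per level.
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        frequent = {s: c for s, c in counts.items() if c >= min_count}
        result.update(frequent)
        k += 1
    return result
```

    The prune step is what keeps the candidate sets small: an itemset is counted only if all of its subsets already passed the threshold.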

    Single-dimensional rules

    buys(X, "milk") ⇒ buys(X, "bread")

    Multi-dimensional rules

    Inter-dimension association rules (no repeated predicates)

    age(X, "19-25") ∧ occupation(X, "student") ⇒ buys(X, "coke")

    Hybrid-dimension association rules (repeated predicates)

    age(X, "19-25") ∧ buys(X, "popcorn") ⇒ buys(X, "coke")

    Categorical Attributes

    finite number of possible values, no ordering among values

    Quantitative Attributes

    numeric, implicit ordering among values

    Static Discretization of Quantitative Attributes

    Discretized prior to mining using concept hierarchy.

    Numeric values are replaced by ranges.
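
    As a small illustration of replacing numeric values by ranges before mining, the sketch below bins an age attribute. The cut points, labels, and record layout are hypothetical; a real concept hierarchy would supply its own ranges.

```python
import bisect

# Hypothetical age ranges standing in for one level of a concept hierarchy.
AGE_CUTS = [20, 30, 40]                     # boundaries between adjacent ranges
AGE_LABELS = ["<20", "20-29", "30-39", ">=40"]


def discretize(value, cuts=AGE_CUTS, labels=AGE_LABELS):
    """Replace a numeric value by the label of the range it falls in."""
    return labels[bisect.bisect_right(cuts, value)]


def to_items(record):
    """Turn a record into categorical 'attribute=value' items for mining."""
    return frozenset({
        f"age={discretize(record['age'])}",
        f"buys={record['buys']}",
    })
```

    After this step every record is a set of categorical items, so the same frequent item set machinery used for Boolean rules applies unchanged.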

    In a relational database, finding all frequent k-predicate sets will require k or k+1 table scans.

    Data cube is well suited for mining.

    The cells of an n-dimensional cuboid correspond to the predicate sets.

    Mining from data cubes can be much faster.

    Objective measures

    Two popular measurements

    support

    confidence

    Subjective measures

    A rule (pattern) is interesting if

    it is unexpected (surprising to the user); and/or

    actionable (the user can do something with it)

    Level constraints

    Rule constraint

    Interestingness constraints

    A constraint C_a is anti-monotone iff, for any pattern S not satisfying C_a, none of the super-patterns of S can satisfy C_a

    A constraint C_m is monotone iff, for any pattern S satisfying C_m, every super-pattern of S also satisfies it
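
    To see why anti-monotonicity helps, consider the constraint sum(S.price) <= v with non-negative prices: once a set violates it, every superset does too, so violating sets never need to be extended. The sketch below is an illustration under that assumption; the price table and function names are invented for the example.

```python
# Hypothetical item prices, all non-negative.
PRICE = {"pen": 2, "book": 12, "bag": 30, "lamp": 45}


def sum_at_most(itemset, v, price=PRICE):
    """Constraint C_a: sum(S.price) <= v (anti-monotone for non-negative prices)."""
    return sum(price[i] for i in itemset) <= v


def grow_with_pruning(items, v):
    """Enumerate all itemsets satisfying C_a without extending any violating set."""
    items = sorted(items)
    level = [frozenset([i]) for i in items if sum_at_most([i], v)]
    satisfying = list(level)
    while level:
        nxt = set()
        for s in level:
            for i in items:
                if i > max(s):                # extend in sorted order to avoid duplicates
                    cand = s | {i}
                    if sum_at_most(cand, v):  # violators are dropped here for good
                        nxt.add(cand)
        level = list(nxt)
        satisfying.extend(level)
    return satisfying
```

    A monotone constraint works the other way round: once a set satisfies it, no further checking is needed for its supersets.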

    Succinctness Property of Constraints

    For any sets S1 and S2 satisfying C, S1 ∪ S2 satisfies C

    Given A1, the collection of sets of size 1 satisfying C, any set S satisfying C is based on A1, i.e., it contains a subset belonging to A1

    Example:

    sum(S.Price) ≥ v is not succinct

    min(S.Price) ≤ v is succinct

    BITS

    1. An association rule is a pattern that states that when X occurs, Y occurs with a certain probability.

    2. Goal: Find all rules that satisfy the user-specified minimum support (minsup) and minimum confidence (minconf).

    3. Table data need to be converted to transaction form for association mining.

    Easy Questions

    1. Explain association rule mining.

    2. Mining single-dimensional Boolean association rules from transactional databases?

    3. Explain mining multilevel association rules from transactional databases.

    UNIT-VI

    Classification: predicts categorical class labels

    classifies data (constructs a model) based on the training set and the values (class labels) in a classifying attribute, and uses it in classifying new data

    Prediction:

    models continuous-valued functions

    predicts unknown or missing values

    Supervised learning (classification)

    Supervision: The training data (observations, measurements, etc.) are accompanied by labels indicating the class of the observations

    New data is classified based on the training set

    Unsupervised learning (clustering)

    The class labels of the training data are unknown

    Given a set of measurements, observations, etc., the aim is to establish the existence of classes or clusters in the data

    Issues regarding classification and prediction: Comparing Classification Methods

    Accuracy

    Speed and scalability

    Robustness

    Scalability

    Interpretability

    Decision tree

    A flow-chart-like tree structure

    Internal node denotes a test on an attribute

    Branch represents an outcome of the test

    Leaf nodes represent class labels or class distribution

    Decision tree generation consists of two phases

    Tree construction

    At start, all the training examples are at the root

    Partition examples recursively based on selected attributes

    Tree pruning

    Identify and remove branches that reflect noise or outliers

    Use of decision tree: classifying an unknown sample

    Extracting Classification Rules from Trees

    Represent the knowledge in the form of IF-THEN rules

    One rule is created for each path from the root to a leaf

    Each attribute-value pair along a path forms a conjunction

    The leaf node holds the class prediction

    Rules are easier for humans to understand
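
    The path-to-rule idea can be sketched as a short recursive walk. The nested-dict tree layout, the attribute names, and the sample tree below are assumptions made for illustration, not a format prescribed by these notes.

```python
# A hypothetical decision tree as nested dicts: internal nodes test one
# attribute, leaves are class labels (plain strings).
TREE = {
    "attribute": "age",
    "branches": {
        "youth": {
            "attribute": "student",
            "branches": {"no": "buys=no", "yes": "buys=yes"},
        },
        "middle_aged": "buys=yes",
        "senior": "buys=no",
    },
}


def extract_rules(node, conditions=()):
    """Return one IF-THEN rule string per root-to-leaf path."""
    if isinstance(node, str):  # leaf: holds the class prediction
        body = " AND ".join(f"{a} = {v}" for a, v in conditions) or "TRUE"
        return [f"IF {body} THEN {node}"]
    rules = []
    for value, child in node["branches"].items():
        # Each attribute-value pair on the path adds one conjunct.
        rules += extract_rules(child, conditions + ((node["attribute"], value),))
    return rules
```

    Calling extract_rules(TREE) yields four rules, one per leaf, the first being "IF age = youth AND student = no THEN buys=no".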

    Two approaches to avoid overfitting

    Prepruning: Halt tree construction early; do not split a node if this would result in the goodness measure falling below a threshold

    Difficult to choose an appropriate threshold

    Postpruning: Remove branches from a "fully grown" tree; get a sequence of progressively pruned trees

    Use a set of data different from the training data to decide which is the "best pruned tree"

    Approaches to Determine the Final Tree Size

    Separate training and testing sets

    Use cross validation, e.g., 10-fold cross validation

    Use all the data for training

    Use minimum description length (MDL) principle

    Enhancements to basic decision tree induction

    Allow for continuous-valued attributes

    Handle missing attribute values

    Attribute construction

    Classification: a classical problem extensively studied by statisticians and machine learning researchers

    Scalability: Classifying data sets with millions of examples and hundreds of attributes with reasonable speed

    Why decision tree induction in data mining?

    relatively faster learning speed (than other classification methods)

    convertible to simple and easy to understand classification rules

    can use SQL queries for accessing databases

    comparable classification accuracy with other methods

    Bayesian Classification

    Statistical classifiers

    Based on Bayes' theorem

    Naïve Bayesian classification

    Class conditional independence
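
    Under class conditional independence, the score of a class is P(C) multiplied by P(x_j | C) for each attribute value x_j. The sketch below estimates these probabilities by simple counting on categorical data. The function names are illustrative, and no smoothing is applied, so an unseen value gives a zero probability.

```python
from collections import Counter, defaultdict


def train_naive_bayes(rows, labels):
    """Count class frequencies and per-class attribute-value frequencies."""
    class_counts = Counter(labels)
    value_counts = defaultdict(Counter)  # (attribute index, class) -> value counts
    for row, label in zip(rows, labels):
        for j, value in enumerate(row):
            value_counts[(j, label)][value] += 1
    return class_counts, value_counts


def predict(row, model):
    """Pick the class maximizing P(C) * product over j of P(x_j | C)."""
    class_counts, value_counts = model
    total = sum(class_counts.values())
    best_label, best_score = None, -1.0
    for label, n in class_counts.items():
        score = n / total                                  # prior P(C)
        for j, value in enumerate(row):
            score *= value_counts[(j, label)][value] / n   # P(x_j | C), no smoothing
        if score > best_score:
            best_label, best_score = label, score
    return best_label
```

    In practice a Laplace correction is usually added so that a single unseen attribute value does not zero out the whole product.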

    Bayesian belief network

    Extracting rules from a trained network

    Major method for prediction is regression

    Linear and multiple regressions

    Nonlinear regression

    Linear regression: Y = α + β X

    Two parameters, α and β, specify the line and are to be estimated by using the data at hand.

    Using the least squares criterion to the known values of Y1, Y2, ..., X1, X2, ...
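
    For a single predictor, the least squares criterion has a closed-form solution: β is the sum of (x - mean x)(y - mean y) divided by the sum of (x - mean x) squared, and α is mean y minus β times mean x. A minimal sketch, with an illustrative function name:

```python
def fit_line(xs, ys):
    """Least-squares estimates (alpha, beta) for Y = alpha + beta * X."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance-like and variance-like sums around the means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    beta = sxy / sxx
    alpha = mean_y - beta * mean_x
    return alpha, beta
```

    For points lying exactly on Y = 1 + 2X, such as (1, 3), (2, 5), (3, 7), (4, 9), the estimates recover α = 1 and β = 2.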

    Multiple regression: Y = b0 + b1 X1 + b2 X2.

    Many nonlinear functions can be transformed into the above.

    Log-linear models:

    The multi-way table of joint probabilities is approximated by a product of lower-order tables.

    Probability: p(a, b, c, d) = α_ab β_ac χ_ad δ_bcd

    1. Model construction: describing a set of predetermined classes.

    2. Scalability: Classifying data sets with millions of examples and hundreds of attributes with reasonable speed.

    3. Classification predicts categorical class labels.

    2. Explain issues regarding classification and prediction.

    3. Explain classification by decision tree induction.

    Interval-scaled variables

    Binary variables

    Categorical, Ordinal, and Ratio-Scaled variables

    Variables of mixed types

    Major Clustering Approaches

    Partitioning algorithms

    Hierarchy algorithms

    Density-based

    Grid-based

    Model-based

    Outlier Analysis

    CLARA (Clustering Large Applications) (1990)

    CLARA (Kaufmann and Rousseeuw in 1990)

    Built in statistical analysis packages, such as S+

    It draws multiple samples of the data set, applies PAM on each sample, and gives the best clustering as the output

    Strength: deals with larger data sets than PAM

    Weakness:

    Efficiency depends on the sample size

    A good clustering based on samples will not necessarily represent a good clustering of the whole data set if the sample is biased

    BIRCH: Balanced Iterative Reducing and Clustering using Hierarchies, by Zhang, Ramakrishnan, Livny (SIGMOD'96)

    CHAMELEON: hierarchical clustering using dynamic modeling, by G. Karypis, E.H. Han and V. Kumar (1999)

    DBSCAN Algorithm

    Arbitrarily select a point p

    Retrieve all points density-reachable from p wrt Eps and MinPts.

    If p is a core point, a cluster is formed.

    If p is a border point, no points are density-reachable from p and DBSCAN visits the next point of the database.

    Continue the process until all of the points have been processed.
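
    The steps above can be sketched as follows, assuming points are coordinate tuples compared by Euclidean distance. The function name, the NOISE marker, and the brute-force neighborhood search are simplifications for illustration; practical implementations use a spatial index.

```python
from math import dist

NOISE = -1


def dbscan(points, eps, min_pts):
    """Return one cluster id per point (NOISE for points in no cluster)."""
    labels = [None] * len(points)

    def neighbors(i):
        # Eps-neighborhood of point i (includes i itself).
        return [j for j, q in enumerate(points) if dist(points[i], q) <= eps]

    cluster = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:       # not a core point: noise for now,
            labels[i] = NOISE          # may later be claimed as a border point
            continue
        labels[i] = cluster            # core point: a new cluster is formed
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:     # border point reached from a core point
                labels[j] = cluster
            if labels[j] is not None:  # already assigned: nothing more to do
                continue
            labels[j] = cluster
            nbrs = neighbors(j)
            if len(nbrs) >= min_pts:   # j is a core point too: keep expanding
                queue.extend(nbrs)
        cluster += 1
    return labels
```

    Points first marked as noise can still join a cluster later as border points, which is why the noise label is only provisional during the scan.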

    Limitations of COBWEB

    The assumption that the attributes are independent of each other is often too strong because correlation may exist

    Not suitable for clustering large database data: skewed tree and expensive probability distributions

    Neural network approaches

    Represent each cluster as an exemplar, acting as a "prototype" of the cluster

    New objects are distributed to the cluster whose exemplar is the most similar according to some distance measure

    Outliers

    The set of objects that are considerably dissimilar from the remainder of the data

    Example: Sports: Michael Jordan, Wayne Gretzky, ...

    Distance-based outlier: A DB(p, D)-outlier is an object O in a dataset T such that at least a fraction p of the objects in T lies at a distance greater than D from O
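
    The definition can be checked directly with a pairwise scan, as in the sketch below. This is a naive illustration of the definition only, assuming Euclidean distance on coordinate tuples; it is not one of the optimized mining algorithms, and the function name is made up.

```python
from math import dist


def db_outliers(points, p, d):
    """Return the DB(p, D)-outliers: objects O for which at least a fraction p
    of the objects in the dataset lie at a distance greater than d from O."""
    n = len(points)
    outliers = []
    for o in points:
        # Count objects farther than d from o (o itself is at distance 0).
        far = sum(1 for q in points if dist(o, q) > d)
        if far / n >= p:
            outliers.append(o)
    return outliers
```

    The scan compares every pair of objects, so its cost grows quadratically with the size of the dataset.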

    Algorithms for mining distance-based outliers

    Index-based algorithm

    Nested-loop algorithm

    Cell-based algorithm

    Sequential exception technique

    Simulates the way in which humans can distinguish unusual objects from among a series of supposedly like objects

    OLAP data cube technique

    Uses data cubes to identify regions of anomalies in large multidimensional data

    BITS

    1. Clustering is the assignment of a set of observations into subsets.

    2. Subspace clustering methods look for clusters that can only be seen in a particular projection of the data.

    3. Many clustering algorithms require the specification of the number of clusters to produce in the input data set, prior to execution of the algorithm.

    4. Distance measure, which will determine how the similarity of two elements is calculated.

    5. Hierarchical clustering creates a hierarchy of clusters which may be represented in a tree structure called a dendrogram.

    6. QT clustering is an alternative method of partitioning data, invented for gene clustering.

    7. In QT clustering, QT stands for Quality Threshold.

    8. Formal concept analysis is a technique for generating clusters (called formal concepts) of objects and attributes.

    9. Evaluation of clustering is sometimes referred to as cluster validation.

    10. Several different clustering systems based on mutual information have been proposed.

    Easy Questions

    1. What is Cluster Analysis?

    2. Explain types of data in cluster analysis.

    3. Explain a categorization of major clustering methods.

    7. Explain grid-based methods.

    8. Explain model-based clustering methods.

    9. Explain outlier analysis.

    UNIT-VIII

    Set-valued attribute

    Generalization of each value in the set into its corresponding higher-level concepts

    Derivation of the general behavior of the set, such as the number of elements in the set, the types or value ranges in the set, or the weighted average for numerical data

    hobby = {tennis, hockey, chess, violin, nintendo_games} generalizes to {sports, music, video_games}
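
    The hobby example can be reproduced with a one-level concept hierarchy stored as a lookup table. The mapping below is an assumption made for illustration (in particular, chess is placed under sports so that the result matches the example), and the function names are invented.

```python
# Hypothetical one-level concept hierarchy for the hobby attribute.
HOBBY_HIERARCHY = {
    "tennis": "sports",
    "hockey": "sports",
    "chess": "sports",            # assumed placement, see the note above
    "violin": "music",
    "nintendo_games": "video_games",
}


def generalize_set(values, hierarchy):
    """Map each value to its higher-level concept; unknown values are kept."""
    return {hierarchy.get(v, v) for v in values}


def summarize_set(values, hierarchy):
    """Derive simple general behavior: element count and per-concept counts."""
    concepts = [hierarchy.get(v, v) for v in values]
    return {
        "size": len(values),
        "concepts": {c: concepts.count(c) for c in set(concepts)},
    }
```

    Because the result is a set, duplicate concepts collapse, which is why five hobbies generalize to three higher-level values.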

    List-valued or a sequence-valued attribute

    Same as set-valued attributes except that the order of the elements in the sequence should be observed in the generalization

    Spatial data:

    Generalize detailed geographic points into clustered regions, such as business, residential, industrial, or agricultural areas, according to land usage

    Require the merge of a set of geographic areas by spatial operations

    Image data:

    Extracted by aggregation and/or approximation

    Size, color, shape, texture, orientation, and relative positions and structures of the contained objects or regions in the image

    Music data:

    Summarize its melody: based on the approximate patterns that repeatedly occur in the segment

    Summarize its style: based on its tone, tempo, or the major musical instruments played

    Object identifier: generalize to the lowest level of class in the class/subclass hierarchies

    Class composition hierarchies

    generalize nested structured data

    generalize only objects closely related in semantics to the current one

    Plan: a variable sequence of actions

    E.g., Travel (flight): (traveler, departure, arrival, d-time, a-time, airline, price, seat)

    Plan mining: extraction of important or significant generalized (sequential) patterns from a planbase (a large collection of plans)

    E.g., Discover travel patterns in an air flight database, or find significant patterns from the sequences of actions in the repair of automobiles

    Spatial data warehouse: Integrated, subject-oriented, time-variant, and nonvolatile spatial data repository for data analysis and decision making

    Spatial data integration: a big issue

    Structure-specific formats (raster- vs. vector-based, OO vs. relational models, different storage and indexing)

    Vendor-specific formats (ESRI, MapInfo, Intergraph)

    Spatial data cube: multidimensional spatial database

    Both dimensions and measures may contain spatial components

    Spatial association rule: A ⇒ B [s%, c%]

    A and B are sets of spatial or nonspatial predicates

    Topological relations: intersects, overlaps, disjoint, etc.

    Spatial orientations: left_of, west_of, under, etc.

    Distance information: close_to, within_distance, etc.

    Hierarchy of spatial relationship:

    g_close_to: near_by, touch, intersect, contain, etc.

    First search for rough relationship and then refine it

    Spatial classification

    Analyze spatial objects to derive classification schemes, such as decision trees in relevance to certain spatial properties (district, highway, river, etc.)

    Example: Classify regions in a province into rich vs. poor according to the average family income

    Description-based retrieval systems

    Build indices and perform object retrieval based on image descriptions, such as keywords, captions, size, and time of creation

    Labor-intensive if performed manually

    Results are typically of poor quality if automated

    Content-based retrieval systems

    Support retrieval based on the image content, such as color histogram, texture, shape, objects, and wavelet transforms

    Image sample-based queries:

    Find all of the images that are similar to the given image sample

    Compare the feature vector (signature) extracted from the sample with the feature vectors of images that have already been extracted and indexed in the image database

    Image feature specification queries:

    Specify or sketch image features like color, texture, or shape, which are translated into a feature vector

    Match the feature vector with the feature vectors of the images in the database

    Time-series database

    Consists of sequences of values or events changing with time

    Data is recorded at regular intervals

    Characteristic time-series components

    Trend, cycle, seasonal, irregular

    Estimation of cyclic variations

    If (approximate) periodicity of cycles occurs, cyclic index can be constructed in much the same manner as seasonal indexes

    Estimation of irregular variations

    By adjusting the data for trend, seasonal and cyclic variations

    Steps for performing a similarity search

    Atomic matching

    Find all pairs of gap-free windows of a small length that are similar

    Window stitching

    Stitch similar windows to form pairs of large similar subsequences allowing gaps between atomic matches

    Subsequence Ordering

    Linearly order the subsequence matches to determine whether enough similar pieces exist

    Problems with the Web link structure

    Not every hyperlink represents an endorsement

    Other purposes are for navigation or for paid advertisements

    If the majority of hyperlinks are for endorsement, the collective opinion will still dominate

    One authority will seldom have its Web page point to its rival authorities in the same field

    Authoritative pages are seldom particularly descriptive

    Hub

    Set of Web pages that provides collections of links to authorities

    HITS (Hyperlink-Induced Topic Search)

    Design of a Web Log Miner

    Web log is filtered to generate a relational database

    A data cube is generated from the database

    OLAP is used to drill-down and roll-up in the cube

    OLAM is used for mining interesting knowledge

    Benefits of Multi-Layer Meta-Web

    Multidimensional Web info summary analysis

    Approximate and intelligent query answering

    Web high-level query answering (WebSQL, WebML)

    Web content and structure mining

    Observing the dynamics/evolution of the Web

    BITS

    1. A Time Series Database consists of sequences of values or events obtained over repeated measurements of time.

    2. Sequential Pattern Mining is the discovery of frequently occurring ordered events as patterns.

    3. DSMS stands for Data Stream Management System.

    1. Multidimensional analysis and descriptive mining of complex data objects

    2. Explain mining spatial databases

    3. Explain multidimensional analysis and descriptive mining of complex data objects?