Data Mining Session

Upload: ali-nguyen

Post on 09-Apr-2018

  • 8/7/2019 Data Mining Session

    1/16

    Introduction

    Data mining is the process of extracting valid, previously unknown, and ultimately comprehensible information from large databases and using it to make crucial business decisions. The extracted information can be used to form a prediction or classification model, identify relations between database records, or provide a summary of the database being mined.

    Introduction

    The goal of identifying and utilizing information hidden in data has three requirements.

    - First, the captured data must be integrated into organization-wide views, instead of department-specific views, and often supplemented with open source and/or purchased data.

    - Second, the information contained in the integrated data must be extracted, or mined.

    - Third, the mined information must be organized in ways that enable decision-making.

    Hypothesis Verification

    The verification model takes a hypothesis from the user and tests its validity against the data. The emphasis is on the user, who is responsible for formulating the hypothesis and issuing the query on the data to affirm or negate the hypothesis.

    A system that supports this operation is called a verification-driven data mining system. Such systems suffer from two problems: 1) They require the decision-maker to hypothesize the desired information. 2) The quality of the extracted information depends on the user's interpretation of the posed query's results.
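
    The verification model can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `orders` records, their field names, and the hypothesis being tested are hypothetical and do not come from the slides.

```python
# Hypothetical records; field names and values are illustrative only.
orders = [
    {"customer_age": 25, "amount": 40.0},
    {"customer_age": 31, "amount": 55.0},
    {"customer_age": 47, "amount": 120.0},
    {"customer_age": 52, "amount": 95.0},
    {"customer_age": 63, "amount": 110.0},
]


def mean_amount(records, predicate):
    """Average order amount over the records selected by the user's query."""
    selected = [r["amount"] for r in records if predicate(r)]
    return sum(selected) / len(selected) if selected else None


# User-formulated hypothesis (assumed): customers aged 40 or over spend more per order.
older = mean_amount(orders, lambda r: r["customer_age"] >= 40)
younger = mean_amount(orders, lambda r: r["customer_age"] < 40)

# The query can only affirm or negate what the user asked about.
hypothesis_holds = older > younger
print(f"older={older:.2f} younger={younger:.2f} holds={hypothesis_holds}")
```

    Both problems named above are visible here: the user had to think of the age split, and the conclusion rests on how the user reads two averages.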

    Information Discovery

    The discovery model differs in its emphasis: here the system automatically discovers important information hidden in the data. The data is sifted in search of frequently occurring patterns, trends and generalisations about the data without intervention or guidance from the user. The discovery or data mining tools aim to reveal a large number of facts about the data in as short a time as possible. The corresponding systems are called discovery-driven data mining systems.

    In summary, verification-driven data mining allows the decision-maker to express and verify organizational and personal domain knowledge and hypotheses, while discovery-driven data mining is used to refine these hypotheses, as well as to identify information not previously hypothesized by the user.

    Data Mining Process

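    [Figure: the data mining process as a flow of data states linked by steps]

    Data -> Selection -> Target Data -> Preprocessing -> Preprocessed Data -> Transformation -> Transformed Data -> Data Mining -> Patterns -> Interpretation & evaluation -> Knowledge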

    Data Mining Process

    [Figure: the data mining process starting from a data warehouse]

    Data Warehouse -> Select -> Selected Data -> Transform -> Transformed Data -> Mine -> Extracted Information -> Assimilate -> Assimilated Information
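
    The four steps named in this diagram (Select, Transform, Mine, Assimilate) can be read as a chain of functions. The sketch below is one possible reading, not the presenter's implementation: the `warehouse` rows, the field names, and what each step does are assumptions chosen to keep the example small.

```python
# Hypothetical warehouse rows; names and values are illustrative only.
warehouse = [
    {"region": "north", "sales": "120"},
    {"region": "south", "sales": "80"},
    {"region": "north", "sales": "200"},
    {"region": "east", "sales": None},  # incomplete record
]


def select(rows):
    """Select: keep only the rows usable for the question at hand."""
    return [r for r in rows if r["sales"] is not None]


def transform(rows):
    """Transform: convert raw text fields into numeric values."""
    return [{"region": r["region"], "sales": float(r["sales"])} for r in rows]


def mine(rows):
    """Mine: extract a simple summary, here total sales per region."""
    totals = {}
    for r in rows:
        totals[r["region"]] = totals.get(r["region"], 0.0) + r["sales"]
    return totals


def assimilate(totals):
    """Assimilate: order the result so it can support a decision."""
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)


report = assimilate(mine(transform(select(warehouse))))
print(report)
```

    Each function consumes the data state produced by the previous one, which mirrors the boxes in the diagram.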

    Data Mining Operations

    1) Creation of prediction and classification models: the goal of this operation is to use the contents of the database, which reflect historical data, i.e., data about the past, to automatically generate a model that can predict future behavior.

    2) Link analysis: the goal of link analysis is to establish relations between the records in a database.

    3) Database segmentation: it is often necessary to partition databases into collections of related records, either as a means of obtaining a summary of each database, or before performing a data mining operation such as model creation or link analysis.

    4) Deviation detection: its goal is to identify outlying points in a particular data set, and explain whether they are due to noise or other impurities being present in the data, or due to causal reasons.
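
    As an illustration of the fourth operation, the sketch below flags outlying points with a modified z-score built on the median and the median absolute deviation. The technique, the 0.6745 scaling constant, and the 3.5 cutoff are a common convention swapped in for illustration; the slides do not name a method, and the `readings` values are hypothetical.

```python
from statistics import median


def find_outliers(values, threshold=3.5):
    """Flag values whose median-based modified z-score exceeds the threshold."""
    med = median(values)
    mad = median([abs(v - med) for v in values])  # median absolute deviation
    if mad == 0:
        return []  # no spread to measure against, so nothing is flagged
    return [v for v in values if 0.6745 * abs(v - med) / mad > threshold]


# Hypothetical measurements with one suspicious value.
readings = [10, 12, 11, 13, 12, 11, 95]
outliers = find_outliers(readings)
print(outliers)
```

    The code covers only the identification half of the operation. Deciding whether a flagged point is noise or has a causal explanation still requires domain knowledge.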

    Data Mining Techniques

    Supervised Induction

    Association Discovery

    Sequence Discovery

    Conceptual Clustering

    Visualization

    Neural Network

    Supervised Induction

    Supervised induction refers to the process of automatically creating a classification model from a set of records (examples), called the training set. The training set may be a sample of the database or warehouse being mined, the entire database, or a data warehouse. A supervised induction technique is particularly suitable for data mining if it has three characteristics:

    1. It can produce high quality models even when the data in the training set is noisy and incomplete.

    2. The resulting models are comprehensible and explainable, so that the user can understand how decisions are made by the system.

    3. It can accept domain knowledge. Such knowledge can expedite the induction task while simultaneously improving the quality of the induced model.
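
    To make the idea of inducing a comprehensible model from a training set concrete, here is a minimal sketch of a one-attribute rule learner, in the spirit of the 1R method. It is a stand-in chosen for brevity, not the technique the slides had in mind, and the training records, attribute names, and labels are hypothetical.

```python
from collections import Counter, defaultdict

# Hypothetical training set; attributes and labels are illustrative only.
training_set = [
    {"income": "high", "owns_home": "yes", "label": "approve"},
    {"income": "high", "owns_home": "no", "label": "approve"},
    {"income": "low", "owns_home": "yes", "label": "reject"},
    {"income": "low", "owns_home": "no", "label": "reject"},
    {"income": "medium", "owns_home": "yes", "label": "approve"},
    {"income": "medium", "owns_home": "yes", "label": "approve"},
    {"income": "medium", "owns_home": "no", "label": "reject"},
]


def induce_one_rule(records, attributes, target="label"):
    """Return (attribute, value-to-class rule, training errors) for the best single attribute."""
    best = None
    for attr in attributes:
        counts = defaultdict(Counter)
        for r in records:
            counts[r[attr]][r[target]] += 1
        # For each attribute value, predict its most frequent class.
        rule = {value: c.most_common(1)[0][0] for value, c in counts.items()}
        errors = sum(1 for r in records if rule[r[attr]] != r[target])
        if best is None or errors < best[2]:
            best = (attr, rule, errors)
    return best


attribute, rule, errors = induce_one_rule(training_set, ["income", "owns_home"])
print(attribute, rule, errors)
```

    The induced model is a small lookup table that a user can read directly, which is the kind of explainability the second characteristic asks for. It tolerates the one inconsistent record rather than failing on it, a very modest form of the first characteristic.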

    Association Discovery

    An association discovery function is an operation against a set of records that returns the affinities that exist among the collection of items in those records. These affinities can be expressed by rules such as "72% of all the records that contain items A, B, and C also contain items D and E." The specific percentage of occurrences is called the confidence factor of the association.
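
    The confidence factor is a ratio: among the records that contain the first item set, the fraction that also contain the second. A minimal sketch follows, with hypothetical transactions and an assumed helper name `confidence`.

```python
def confidence(transactions, antecedent, consequent):
    """Fraction of records containing `antecedent` that also contain `consequent`."""
    antecedent, consequent = set(antecedent), set(consequent)
    with_antecedent = [t for t in transactions if antecedent <= set(t)]
    if not with_antecedent:
        return None  # the rule never applies in this data
    with_both = [t for t in with_antecedent if consequent <= set(t)]
    return len(with_both) / len(with_antecedent)


# Hypothetical records, each a set of items.
transactions = [
    {"A", "B", "C", "D", "E"},
    {"A", "B", "C", "D", "E"},
    {"A", "B", "C", "D"},
    {"A", "B", "C", "D", "E"},
    {"A", "B"},
    {"C", "D", "E"},
]

conf = confidence(transactions, {"A", "B", "C"}, {"D", "E"})
print(f"{conf:.0%} of records with A, B, C also contain D and E")
```

    In this toy data four records contain A, B, and C, and three of those also contain D and E, so the confidence factor is 75%.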

    Sequence Discovery

    Sequence discovery applies when sets of related records accumulate over time, a situation typical of a direct mail application. In this case, a catalog merchant has, for each customer, the sets of products that the customer buys in every purchase order. A sequence discovery function will analyze such collections of related records and will detect frequently occurring patterns of products bought over time.
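
    A very small version of this idea counts, for every ordered pair of products, how many customers bought the first in an earlier order and the second in a later one. The sketch below assumes hypothetical purchase histories and a made-up helper `frequent_pairs`; practical sequence discovery handles longer patterns and far larger data.

```python
from collections import Counter

# Hypothetical purchase histories: each customer's orders in time order.
histories = {
    "c1": [{"tent"}, {"sleeping bag"}, {"stove"}],
    "c2": [{"tent", "boots"}, {"sleeping bag"}],
    "c3": [{"boots"}, {"tent"}, {"sleeping bag", "stove"}],
    "c4": [{"stove"}, {"tent"}],
}


def frequent_pairs(histories, min_support):
    """Find (earlier item, later item) pairs bought in that order by enough customers."""
    support = Counter()
    for orders in histories.values():
        seen = set()
        for i, earlier in enumerate(orders):
            for later in orders[i + 1:]:
                for a in earlier:
                    for b in later:
                        if a != b:
                            seen.add((a, b))
        support.update(seen)  # count each pattern once per customer
    return {pair: n for pair, n in support.items() if n >= min_support}


patterns = frequent_pairs(histories, min_support=3)
print(patterns)
```

    Only "tent, then sleeping bag" is supported by three customers here, so it is the only pattern reported at that threshold.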

    Conceptual Clustering

    Clustering is used to segment a database into subsets, the clusters, with the members of each cluster sharing a number of interesting properties. The results of a clustering operation are used in one of two ways. First, for summarizing the contents of the target database by considering the characteristics of each created cluster rather than those of each record in the database. Second, as an input to other methods, e.g., supervised induction. A cluster is a smaller and more manageable data set for the supervised inductive learning component.
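
    The segmentation step can be illustrated with a one-dimensional k-means loop. Note the substitution: k-means is a generic numeric clustering method used here for brevity, not a conceptual clustering algorithm, and the `ages` values and starting centers are hypothetical.

```python
def k_means_1d(values, centers, iterations=10):
    """Assign each value to its nearest center, then move each center to its cluster mean."""
    clusters = []
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for v in values:
            nearest = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            clusters[nearest].append(v)
        # An empty cluster keeps its previous center.
        centers = [sum(c) / len(c) if c else centers[i] for i, c in enumerate(clusters)]
    return centers, clusters


# Hypothetical customer ages.
ages = [22, 25, 27, 24, 61, 65, 58, 63]
centers, clusters = k_means_1d(ages, centers=[20.0, 70.0])
print(centers, clusters)
```

    The two uses described above map onto the two return values: `centers` summarizes the data by cluster rather than by record, and each list in `clusters` is a smaller data set that could be handed to a supervised induction step.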

    Visualization

    Visualization provides analysts with visual summaries of data from a database. It can also be used as a method for understanding the information extracted using other data mining methods. Data mining necessitates the use of interactive visualization techniques that allow the user to quickly and easily change the type of information displayed, as well as the particular visualization method used. Visualizations are particularly useful for noticing phenomena that hold for a relatively small subset of the data, and thus are drowned out by the rest of the data when statistical tests are used, since these tests generally check for global features.

    The advantage of using visualization is that the analyst does not have to know what type of phenomenon to look for in order to notice something unusual or interesting.

    Neural Network

    Neural networks are an approach to computing that involves developing mathematical structures with the ability to learn. The methods are the result of academic investigations to model nervous system learning. Neural networks have the remarkable ability to derive meaning from complicated or imprecise data and can be used to extract patterns and detect trends that are too complex to be noticed by either humans or other computer techniques. A trained neural network can be thought of as an "expert" in the category of information it has been given to analyse. This expert can then be used to provide projections given new situations of interest and answer "what if" questions.

    [Figure: a network with inputs X1 to X6 and outputs Z1 and Z2, and a single unit that combines inputs X1 to X6 with weights W1 to W6 through F(I)]

    F(I) = X1*W1 + X2*W2 + ... + X6*W6
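
    The weighted sum F(I) over the six inputs X1 to X6 and weights W1 to W6 shown on this slide can be computed directly. The input and weight values below are hypothetical, and the helper name `weighted_sum` is an assumption; the slide gives no numbers and does not say what is applied to F(I) afterwards.

```python
def weighted_sum(inputs, weights):
    """Compute F(I) = X1*W1 + X2*W2 + ... for paired inputs and weights."""
    if len(inputs) != len(weights):
        raise ValueError("inputs and weights must have the same length")
    return sum(x * w for x, w in zip(inputs, weights))


# Hypothetical values for X1..X6 and W1..W6.
x = [1.0, 0.0, 2.0, 1.0, 0.5, 3.0]
w = [0.5, 1.0, 0.25, -1.0, 2.0, 0.0]

f_i = weighted_sum(x, w)
print(f_i)
```

    Learning, in this picture, amounts to adjusting the weights so that the unit's output moves closer to the desired answer for the training examples.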
