chapter one(2004)

Upload: yared-berhe

Post on 05-Apr-2018

219 views

Category:

Documents


0 download

TRANSCRIPT

  • 8/2/2019 Chapter One(2004)

    1/60

    1

    Department of Information

    Technology

    Course Title: Data Warehousing and Data Mining

    Course No: IT4204

    Credit Hr: 3

    By: TWW

  • 8/2/2019 Chapter One(2004)

    2/60

    2

    Chapter one

    Introduction In this chapter we will cover the following issues in brief

    Motivation: Why data mining?

    What is data mining & data warehousing?

    Architecture of a typical DM system

    Why DM: Application of data mining

    Data Mining: On what kind of data?

    Data mining functionalities

    Are all the patterns interesting?

    Classification of data mining systems

    Major issues in data mining

  • 8/2/2019 Chapter One(2004)

    3/60

    3

    Motivation:

    Necessity is the Mother of Invention

    Our capacity of generating and collecting data have been increased

    rapidly in the last several decades

    Huge amount of data is available at the tip of our hand

    It is predicted that more data will be produced in the next year than

    has been generated during the entire existence of humankind!

    A study by IBM predicts that by 2005 the amount of information on

    the planet will increase from 3.2 million exabytes to 43 million

    exabytes (1 exabyte=250 bytes) .

  • 8/2/2019 Chapter One(2004)

    4/60

    4

    Motivation:

    Necessity is the Mother of Invention

    Contributing factors include

    Widespread use of bar code for most commercial products,

    Computerization of many business, scientific, and governmental

    transactions, Advances in data collection tools (audio, video, satellite remote

    sensing, scanning, image capturing tools)

    Usage of WWW as a global information system (the Internet in

    general), Development comprehensive application software,

    new computing and storage technologies

  • 8/2/2019 Chapter One(2004)

    5/60

    5

    Motivation:

    Necessity is the Mother of Invention

    All this have made it easier to create, collect, and store all types of

    data.

    As a result it creates a problem what is called data exposition

    Data explosion is the problem of having huge amount of data in anenterprise stored in databases, data warehouses and other information

    repositories generated by automated data collection tools and mature

    database technology in large databases which has to be processed to

    make a decision.

    As the size of data get larger, analyzing the data becomes very

    difficult

  • 8/2/2019 Chapter One(2004)

    6/60

    6

    Motivation:

    Necessity is the Mother of Invention

    Data can be managed and stored in

    structured relational databases;

    in semi-structured file systems, such as e-mail;

    unstructured fixed content, like documents and graphic files.

    Companies rely on this enterprise data to improve decision-makingand to gain a competitive advantage;

    Data has indeed become a highly valued business asset.

    The huge amount of data exceeds our human ability to make

    comprehension on the data and to put the best decision without tools

    Generating and storing of large volumes of data has reached a critical

    mass and appropriate tools for comprehend the data becomes vital.

  • 8/2/2019 Chapter One(2004)

    7/60

    7

    Motivation:

    Necessity is the Mother of Invention

    We are drowning in data, but starving for knowledge!

    The Solution: Data warehousing and data mining

    Data mining can be viewed as a result of the natural evolution of information

    technology.

    This can be more explained if we look at the evolution of database technology

    since 19th century.

  • 8/2/2019 Chapter One(2004)

    8/60

    8

    Motivation:

    Necessity is the Mother of Invention 1960s:

    Known to be the era of primitive file processing

    There were activities such as

    Data collection,

    database creation (No DBMS),

    Information management system (IMS), mainly using COBOL

    1970s:

    Relational data model, relational DBMS implementation

    Data modeling tools like ER diagram

    Indexing and data organization techniques such as B+ tree, hashing, etc

    Query language such as SQL

    User interfaces, forms and reports

    Query processing and optimization techniques

    Transaction management: recovery, concurrency control, etc

    Online Transaction processing (OLTP)

  • 8/2/2019 Chapter One(2004)

    9/60

    9

    Motivation:

    Necessity is the Mother of Invention

    1980s:

    Period of advanced DB Systems

    advanced data models

    extended-relational, Object Oriented, Object-Relational, deductive, etc.)

    application-oriented DBMS

    spatial, temporal, multimedia, active, scientific, engineering, Knowledgebase,

    etc.)

    1990s2000s:

    Data mining and data warehousing, Knowledge discovery, OLAP and Web

    based databases

  • 8/2/2019 Chapter One(2004)

    10/60

    10

    What is Data Mining & Data warehousing?

    Data mining is extraction of interesting (non-trivial, implicit, previously unknownand potentially useful) information or patterns from data in large databases (data

    warehouse)

    The term Data mining is a misnomer as it doesnt directly related to what is does.

    For exampling mining gold from rock is called Gold mining but not rock mining.

    Similarly oil mining is mining oil from the ground.

    Data mining should best describe as knowledge mining from data rather that datamining

    Any way, we will use the term with this understanding

  • 8/2/2019 Chapter One(2004)

    11/60

    11

    What is Data Mining & Data warehousing?

    Alternative names

    Knowledge discovery(mining) from databases (KDD), knowledge extraction,

    data/pattern analysis,

    data archeology,

    data dredging,

    information harvesting,

    business intelligence, etc.

    Note that: query processing systems, Expert systems (knowledge base systems) or

    Information retrieval systems are not data mining tasks

  • 8/2/2019 Chapter One(2004)

    12/60

    12

    What is Data Mining & Data warehousing?

    Data can now be stored in many different types of databases.

    Special DB architecture that has recently emerged is data warehouse

    Data warehouse is a repository of multiple heterogeneous data sources organized

    under a unified schema at a single site in order to facilitate management decisionmaking process

    Data warehouse technology includes

    Data cleansing (removing noise and inconsistent data)

    Data integration (combining multiple data sources into one data warehouse)

    On-Line Analytical Processing (OLAP)

  • 8/2/2019 Chapter One(2004)

    13/60

    13

    What is Data Mining & Data warehousing?

    OLAP is analysis technique which have functionalities such as

    Summarization

    Consolidation

    Aggregation as well as

    The ability to view information from different angle

    Although OLAP tools support multidimensional analysis and decision making,

    additional data analysis tools are required for in depth data analysis such as

    Data classification

    Clustering and

    Characterization of data changes over time

    The abundance of data, coupled with the need for powerful data analysis tools has

    been described as data rich but information poor situation

  • 8/2/2019 Chapter One(2004)

    14/60

    14

    What is Data Mining & Data warehousing?

    Data mining (Knowledge Discovery in Databases) consists of iterative sequences

    of the following seven steps

    1. Learning the application domain:

    Learn relevant prior knowledge to be used in the DM process

    Learn the goals of DM application

    2. Create target data set (data selection)

    Data cleaning and preprocessing: (may take 60% of effort!)

    Data reduction and transformation:

    Find useful features, dimensionality/variable reduction, invariant representation.

    3. Choosing functions of data mining

    summarization, classification, regression, association, clustering.

  • 8/2/2019 Chapter One(2004)

    15/60

    15

    What is Data Mining & Data warehousing?

    4. Choosing the mining algorithm(s)

    Data mining: search for patterns of interest

    5. Identify the relevant knowledge by measuring the mining resultinterestingness

    Pattern evaluation

    6. Present the knowledge to the user

    knowledge presentation visualization, transformation, removing redundant patterns, etc.

    7. Use of discovered knowledge

  • 8/2/2019 Chapter One(2004)

    16/60

    16April 7, 2012 16

    Data mining Models

    The above seven steps are usually standardized with various data models

    The most common standard in data mining is the CRoss-Industry Standard

    Process for Data Mining (CRISP-DM),

    CRISP-DM has the following steps

    Business/research understanding (Learning the application domain),

    Data understanding (data selection for the problem)

    Data preparation which involves

    collecting, cleaning, consolidating and amalgamating records, summarizing fields,

    checking for data integrity, detecting irregularities and illegal attributes, filling in for

    missing values, trimming outliers.

    Data modeling involves

    selecting data mining tools, transforming the data if the tools require it, generating

    samples for training and testing the model and finally using the tools to build and

    select a model

    Evaluating the model and

    Deploying the model

  • 8/2/2019 Chapter One(2004)

    17/60

    17April 7, 2012 17

    Data mining Models

  • 8/2/2019 Chapter One(2004)

    18/60

    18April 7, 2012 18

    Data Mining: A KDD Process

    Data mining: the core of knowledgediscovery in Database

    Data Cleaning

    Data Integration

    Databases

    Data Warehouse

    Task-relevant Data

    Selection & Transformation

    Data Mining

    Pattern Evaluation

  • 8/2/2019 Chapter One(2004)

    19/60

    19April 7, 2012 19

    Data Mining and Business Intelligence

    Increasing potentialto support

    business decisions End User

    Business Analyst

    Data Analyst

    DBA

    Making

    Decisions

    Data PresentationVisualization Techniques

    Data Mining

    Information Discovery

    Data Exploration

    OLAP

    Statistical Analysis, Querying and Reporting

    Data Warehouses / Data Marts

    Data SourcesPaper, Files, Information Providers, Database Systems, OLTP

  • 8/2/2019 Chapter One(2004)

    20/60

    20April 7, 2012 20

    Architecture of a Typical Data Mining

    System

    Data

    Warehouse

    Data cleaning & data integration Filtering

    Databases

    Databaseordata

    warehouseserver

    Dataminingengine

    Patternevaluation

    Graphical user interface

    Knowledge-base

  • 8/2/2019 Chapter One(2004)

    21/60

    21April 7, 2012 21

    Why Data Mining?

    Potential Applications

    Data mining has many and varied fields of application

    some of which are listed below.1 Retail/Marketing

    2 Banking

    3 Insurance and Health Care

    4 Transportation

    5 Medicine

  • 8/2/2019 Chapter One(2004)

    22/60

    22April 7, 2012 22

    Potential Applications: Retail/Marketing Analysis

    Market analysis is a tool companies use in order to better understand the

    environment in which they operate.

    It is one of the main steps in the development of a marketing plan which

    involves critically reviewing and organizing collected data so that it can be

    used in making strategic marketing decisions

    Retailers can use information collected through affinity programs (e.g.,

    shoppers club cards, frequent flyer points, contests) to assess the effectiveness

    of product selection and placement decisions, coupon offers, and which

    products are often purchased together.

  • 8/2/2019 Chapter One(2004)

    23/60

    23April 7, 2012 23

    Potential Applications: Retail/Marketing Analysis

    Companies such asbanking service providers andmusic clubs can use data

    mining to create a churn analysis, to assess which customers are likely to

    remain as subscribers and which ones are likely to switch to a competitor

    The source of data can be credit card transactions, loyalty cards, discount

    coupons, customer complaint calls, plus (public) lifestyle studies

  • 8/2/2019 Chapter One(2004)

    24/60

    24April 7, 2012 24

    Potential Applications: Retail/Marketing Analysis

    The potential application involves

    Target marketing analysis which involves finding clusters model

    customers who share the same characteristics, interest, income level,

    spending habits, etc to participate in specific market.

    Predict response to mailing, calling, advertizing campaigns

    Determine customer purchasing patterns over time: Finding pattern of

    customer on buying items: based on attribute or combination of attributes

    such as marital status, age, salary, citizen, type of credit card, family

    condition, etc

    Cross-market analysis Associations/co-relations between product sales

    Prediction based on the association information

  • 8/2/2019 Chapter One(2004)

    25/60

    25April 7, 2012 25

    Potential Applications: Retail/Marketing Analysis

    The potential application involves

    Customer profiling

    what types of customers buy what products (clustering or classification)

    Find associations among customer demographic characteristics (age, name,

    address, profession, sex, etc)

    Identify buying patterns from customers

    Identifying customer requirements

    identifying the best products for different customers

    use prediction to find what factors will attract new customers

    Provides summary information

    various multidimensional summary reports for decision making

    statistical summary information (data central tendency and variation)

  • 8/2/2019 Chapter One(2004)

    26/60

    26April 7, 2012 26

    Potential Applications: Retail/Marketing Analysis

    The potential application involves

    market basket analysis:

    identifying group of items that customers usually buy together

    Market segmentation

    Gathering demographic, geographic, behavioral and physiological information

    about a customer and cluster them for proper handling

    Risk analysis and management

    Forecasting, customer retention, improved underwriting, quality control,

    competitive analysis

  • 8/2/2019 Chapter One(2004)

    27/60

    27April 7, 2012 27

    Potential Applications: Fraud detection andmanagement

    Detecting outliers and manage them before they destroy the organization

    environment

    Widely used in health care, retail, credit card services, telecommunications

    (phone card fraud), etc.

    Use historical data to build models of fraudulent behavior and use data

    mining to help identify similar instances

    Potential Applications: Fraud detection and

  • 8/2/2019 Chapter One(2004)

    28/60

    28April 7, 2012 28

    Potential Applications: Fraud detection andmanagement

    Examples

    Auto insurance:

    detect a group of people who stage accidents to collect on insurance

    Money laundering:

    detect suspicious money transactions in banking network

    Medical insurance:

    detect professional patients and ring of doctors and ring of references

    Detecting inappropriate medical treatment

    Detecting telephone fraud

    detect users of telephone line which either hijacked or stolen from acustomer

    Potential Applications: Fraud detection and

  • 8/2/2019 Chapter One(2004)

    29/60

    29April 7, 2012 29

    Potential Applications: Fraud detection andmanagement

    Examples

    Telephone call model:

    destination of the call, duration, time of day or week. Analyze patterns that deviatefrom an expected norm.

    Telecom can identify discrete groups of callers with frequent intra-groupcalls, especially mobile phones, and broke a possible multimillion dollar

    fraud.

    Retail: Analysts estimate that 38% of retail shrink is due to dishonest employees.

  • 8/2/2019 Chapter One(2004)

    30/60

    30April 7, 2012 30

    Potential Applications: Banking

    Detect patterns of fraudulent credit card use

    Identify loyal' customers

    Predict customers likely to change their credit card affiliation

    Determine credit card spending by customer groups

    Find hidden correlations between different financial indicators Identify stock trading rules from historical market data

  • 8/2/2019 Chapter One(2004)

    31/60

    31April 7, 2012 31

    Potential Applications: Insurance and Health Care

    Claims analysis - i.e which medical procedures are claimed together

    Predict which customers will buy new policies

    Identify behavior patterns of risky customers

    Identify fraudulent behavior

  • 8/2/2019 Chapter One(2004)

    32/60

    32April 7, 2012 32

    Potential Applications: Transportation

    Determine the distribution schedules among outlets

    Analyze loading patterns

  • 8/2/2019 Chapter One(2004)

    33/60

    P t ti l A li ti

  • 8/2/2019 Chapter One(2004)

    34/60

    34April 7, 2012 34

    Potential Applications: Others

    Text mining (news group, email, documents) and Web analysis.

    Intelligent query answering

    Sports

    Astronomy

  • 8/2/2019 Chapter One(2004)

    35/60

    35April 7, 2012 35

    Potential Applications: Corporate Analysisand Risk Management

    Finance planning and asset evaluation

    cash flow analysis and prediction

    contingent claim analysis to evaluate assets

    cross-sectional and time series analysis (financial-ratio, trend analysis,

    etc.) Resource planning:

    summarize and compare the resources and spending

    Competition:

    monitor competitors and market directions

    group customers into classes and a class-based pricing procedure

    set pricing strategy in a highly competitive market

  • 8/2/2019 Chapter One(2004)

    36/60

    36April 7, 2012 36

    Data Mining: On What Kind of Data?

    Relational databases

    Data warehouses

    Transactional databases

    Advanced DB and information repositories

    Object-oriented and object-relational databases

    Spatial databases

    Time-series data and temporal data

    Text databases and multimedia databases

    Heterogeneous and legacy databases

    WWW

  • 8/2/2019 Chapter One(2004)

    37/60

    37April 7, 2012 37

    Data Mining Functionalities

    Data mining can be performed on various types of data stores and

    Databases

    Data mining functionalities are used to specify the kind of patterns to

    be found in data mining task

    The kind of pattern mined form a given data is not known for the

    user.

    Techniques should be implemented to extract various pattern from

    the available data so that user can choose what they need to use.

    There are different kinds of data mining functionalities (tasks) that

    can be used to extract various types of pattern from data

  • 8/2/2019 Chapter One(2004)

    38/60

    38April 7, 2012 38

    Data Mining Functionalities

    Generally data mining task can be broadly classified as

    Descriptive (un supervised)

    Predictive (supervised)

    Descriptive data mining task characterize the general properties of the data

    in a database.

    This kind of data mining is usually unsupervised

    Predictive data mining task perform inference on the current data in order tomake prediction to the future reference

    This is usually supervised technique

  • 8/2/2019 Chapter One(2004)

    39/60

    39April 7, 2012 39

    Data Mining Functionalities

    The supervised predictive data mining functionalities includes

    Classification

    Regression

    Time series

    Estimation

    The unsupervised descriptive data mining functionalities includes Concept /class description: Characterization (summarization) and discrimination

    Association Analysis

    Clustering analysis

    Outlier analysis

    Evolution analysis

    Data Mining Functionalities:

  • 8/2/2019 Chapter One(2004)

    40/60

    40April 7, 2012 40

    Data Mining Functionalities:Classification

    Classification is the process of finding a set of models (or functions) that

    describe and distinguish data classes or concepts for the purpose of being able

    to use the model to predict the class of an object whose class is unknown

    The derived class is based on training data set and can be represented in

    various forms such as classification IFTHEN rule, decision tree,

    mathematical formulae or neural networks

    Prediction is the process of predicting some missing or unavailable data values

    rather than class labels.

    Finding models (functions) that describe and distinguish classes or concepts

    for future prediction

    Data Mining Functionalities:

  • 8/2/2019 Chapter One(2004)

    41/60

    41April 7, 2012 41

    Data Mining Functionalities:Regression

    Regression is the process of predicting some missing or unavailable data values

    rather than class labels.

    Finding models (functions) that describe and distinguish classes or concepts

    for future prediction

    Data Mining Functionalities

  • 8/2/2019 Chapter One(2004)

    42/60

    42April 7, 2012 42

    Data Mining FunctionalitiesConcept/class description: Characterization and discrimination

    Given a class/classes with data that belongs to the class, describe the class by

    making observation of its members. Hence one can describe individual classes or concepts in a summarized,

    concise and yet precise terms which is called class/concept description.

    These description can be derived via data characterization or data

    discrimination or both

    Data characterization refers to summarizing the data of the class under

    consideration (target class) in general term

    For example one may characterize the item class as a class in which 90% of the objects are

    computer and its peripheral

    Data discrimination is description made by making comparative analysisbetween the target class with the other comparative class (contrasting classes)

    For example one may discriminate item class from other class like customer and order class

    by saying the item class attributes get modified more frequently than others

    a a n ng unc ona es:

  • 8/2/2019 Chapter One(2004)

    43/60

    43April 7, 2012 43

    a a n ng unc ona es:Association Analysis

    Association analysis is the discovery of association rules showing attribute-

    value conditions that occur frequently together in a given set of data

    Association rules are of the form XY [Support = s%, confidence = c%] where

    X is conjunctions of attributes and Y is conjunctions of values and interpreted as

    if X then it is likely to happen Y with support s% and confidence c%.

    For example

    age(X, 20..29) ^ income(X, 20..29K)buys(X, PC) [support = 2%,confidence = 60%]

    Interpreted as any one whose age ranges from 20 to 29 and income

    range is from 20 to 29K likely buy PC with support 2% and confidence

    of 60%

    Support shows the probability that all the predicates in X and Y fulfill

    together. i.e. P(X U Y)

    Confidence shows if predicates in X fulfilled then the predicate in Y is also

    fulfilled with the stated percentage. i.e. P(Y | X)

    Data Mining Functionalities

  • 8/2/2019 Chapter One(2004)

    44/60

    44April 7, 2012 44

    Data Mining FunctionalitiesAssociation Analysis

    Example 2:

    contains(T, computer)contains(T, software) [1%, 75%]

    Interpreted as if Item T contains computer it is also likely to contain

    software with support 1% and confidence 75%

    In the above two examples, Age, Income, buys and Contains are calledattributes or predicates

    An attribute is a value if it is after the implication sign

    Association rule can be Multi-dimensional (more than 1 predicate in X and Y)

    or single-dimensional association rule (only one predicate in both X and Y)

    For example the association rule in example 1 is multi-dimensional where as in

    example two is single dimensional

    Data Mining Functionalities

  • 8/2/2019 Chapter One(2004)

    45/60

    45April 7, 2012 45

    Data Mining Functionalities

    Cluster analysis

    In cluster Analysis, class labels are unknown and a group of data is

    given to be classified.

    Cluster analysis group data to form new classes, e.g., cluster houses

    to find distribution patterns

    Clustering based on the principle:

    maximizing the intra-class similarity and minimizing the interclass

    similarity

  • 8/2/2019 Chapter One(2004)

    46/60

    46April 7, 2012 46

    Data Mining Functionalities

    Outlier analysis

    Database may contain data object that do not comply with the general

    behavior or model of the data.

    These data objects are outliers.

    Usually outlier data items are considered as noise or exception in many

    data mining applications

    However, in some application such as fraud detection, the rare events can

    be more interesting than the more regularly occurring ones.

    The analysis of outlier data is referred to as outlier mining

  • 8/2/2019 Chapter One(2004)

    47/60

    47April 7, 2012 47

    Data Mining Functionalities

    Trend and evolution analysis

    Describe and model regularities or trends for objects whose

    behavior changes over time.

    Though this may include characterization, discrimination,

    association or clustering of time related data, distinct features of

    such an analysis include time series data analysis, sequence or

    periodicity pattern matching and similarity based data analysis.

    It is also referred as regression analysis, sequential pattern mining,

    periodicity analysis, similarity-based analysis

    A All th Di d P tt I t ti ?

  • 8/2/2019 Chapter One(2004)

    48/60

    48April 7, 2012 48

    Are All the Discovered Patterns Interesting?

    A data mining system/query may generate thousands of patterns, not all of them

    are interesting Questions

    What makes a pattern interesting?

    Can a data mining system generate all of the interesting patterns?

    Can a data mining system generate only interesting patterns?

    Answers for all the three questions will be given bellow

    Q i 1

  • 8/2/2019 Chapter One(2004)

    49/60

    49April 7, 2012 49

    A pattern is interesting if it is easily understood by humans, valid on

    new or test data with some degree of certainty, potentially useful, novel,

    or validates some hypothesis that a user seeks to confirm

    An interesting pattern represents knowledge

    Measure of Interestingness measures

    Two types (Objective vs. subjective)

    Objective: based on statistics and structures of patterns, e.g., support,

    confidence, etc.

    Subjective:based on users belief in the data, e.g., unexpectedness

    (contradicting a users belief), novelty, actionability, etc.

    Question 1

    Are All the Discovered Patterns Interesting?

    Q i 2

  • 8/2/2019 Chapter One(2004)

    50/60

    50April 7, 2012 50

    Referred as Completeness of the data mining algorithm

    No single data mining system is complete but users can set a constraint on

    the type of pattern they are looking for in which the data mining function

    generate all the pattern with the specified constraints

    Association algorithms dont find classification pattern and others for

    example

    Question 2

    Can We Find All Interesting Patterns?

  • 8/2/2019 Chapter One(2004)

    51/60

    51April 7, 2012 51

    This is an Optimization problem in data mining system

    it remain an challenging issue

    Approaches in the research topic

    First generate all the patterns and then filter out the uninteresting

    ones.

    Generate only the interesting patternsmining query optimization

    Question 3

    Can We Find Only Interesting Patterns?

  • 8/2/2019 Chapter One(2004)

    52/60

    52April 7, 2012 52

    Data Mining: Confluence of Multiple Disciplines

    Data Mining

    Database

    TechnologyStatistics

    Other

    Disciplines

    Information

    Science

    Machine

    LearningVisualization

    D Mi i Cl ifi i S h

  • 8/2/2019 Chapter One(2004)

    53/60

    53April 7, 2012 53

    Data Mining: Classification Schemes

    General functionality

    Descriptive data mining

    Predictive data mining

    Different views, different classifications Kinds of databases to be mined

    Kinds of knowledge to be discovered

    Kinds of techniques utilized

    Kinds of applications adapted

  • 8/2/2019 Chapter One(2004)

    54/60

    54April 7, 2012 54

    A Multi-Dimensional View of Data Mining Classification

    Databases to be mined Relational, transactional, object-oriented, object-relational, active,

    spatial, time-series, text, multi-media, heterogeneous, legacy, WWW,etc.

    Knowledge to be mined

    Characterization, discrimination, association, classification, clustering,trend, deviation and outlier analysis, etc.

    Multiple/integrated functions and mining at multiple levels

    Techniques utilized

    Database-oriented, data warehouse (OLAP) oriented, machinelearning, statistics, visualization, neural network, etc.

    Applications adapted

    Retail, telecommunication, banking, fraud analysis, DNA mining, stockmarket analysis, Web mining, Weblog analysis, etc.

  • 8/2/2019 Chapter One(2004)

    55/60

    55April 7, 2012 55

    Major Issues in Data Mining

    The major issue in data mining includes the following aspects Mining methodology and user interaction

    Performance and scalability

    Issues related to the diversity of data types

    Issues related to applications and social impacts These will be briefly discussed in the following slides

    a or ssues:

  • 8/2/2019 Chapter One(2004)

    56/60

    56April 7, 2012 56

    Mining methodology and user interaction Reflects the kind of knowledge mined, the ability to mine knowledge at different

    granularities, the use of domain knowledge, and knowledge visualization

    Mining different kinds of knowledge in databases for different users

    Require different data mining functionalities and algorithms

    Interactive mining ofknowledge at multiple levels of abstraction

    Enable to verify the discovered knowledge is relevant or not for the intended purpose

    Incorporation of background knowledge Background knowledge is vital in refining the data mining performance and evaluate its

    relevance

    Data mining query languages and ad-hoc data mining

    Such query language enable to retrieve knowledge from a data warehouse just like what SQL

    does from the database in a DBMS

    Presentation and visualization of data mining results

    Result should be presented visually or using high level natural language

    Handling noise, incomplete data and exceptions

    Pattern evaluation: the interestingness problem

    Major Issues:

  • 8/2/2019 Chapter One(2004)

    57/60

    57April 7, 2012 57

    Major Issues:Performance and scalability

    This includes Efficiency and scalability of data mining algorithms

    Parallel, distributed and incremental mining methods for huge amount of

    data

    Major Issues:

  • 8/2/2019 Chapter One(2004)

    58/60

    58April 7, 2012 58

    Major Issues:Issues related to the diversity of data types

    This involves

    Handling relational and complex types of data.

    Relational data need efficient and effective data mining system as it is

    widely used

    Complex data refers to data such as hyper text, multimedia, spatial, temporaland transactional which further require attention to do data mining on such

    data

    Mining information from heterogeneous databases and

    global information systems (WWW)

    Which is a major issue to use as a source of data in data mining

    Major Issues:

  • 8/2/2019 Chapter One(2004)

    59/60

    59April 7, 2012 59

    Major Issues:Issues related to applications and social impacts

    Data mining also concerned on issues related to its application

    and the impact it may have on the social aspect

    These includes

    Identifying the application of discovered knowledge which can be

    Domain-specific data mining tools

    Intelligent query answering

    Process control and decision making

    How to integrate the discovered knowledge with existing knowledge:A

    knowledge fusion problem

    Protection of data security, integrity, and privacy that may have socialimpact

  • 8/2/2019 Chapter One(2004)

    60/60

    Summary Data mining: discovering interesting patterns from large amounts of data

    A natural evolution of database technology, in great demand, with wideapplications

    A KDD process includes data cleaning, data integration, data selection,

    transformation, data mining, pattern evaluation, and knowledge presentation

    Mining can be performed in a variety of information repositories

    Data mining functionalities: characterization, discrimination, association,

    classification, clustering, outlier and trend analysis, etc.

    Classification of data mining systems

    Major issues in data mining