14 data mining and warehousing

Upload: ramana-yellapu

Post on 06-Apr-2018


  • 8/3/2019 14 Data Mining and Warehousing


    ABSTRACT

    Data mining is a combination of database and artificial intelligence technologies. Although the AI field has taken a major dive in the last decade, this new emerging field has shown that AI can make major contributions to existing fields in computer science. In fact, many experts believe that data mining is the third hottest field in the industry, behind the Internet and data warehousing.

    Data mining is really just the next step in the process of analyzing data. Instead of answering queries on standard or user-specified relationships, data mining goes a step further by finding meaningful relationships in data: relationships that were thought not to exist, or ones that give a more insightful view of the data. For example, a computer-generated graph may not give the user any insight; however, data mining can find trends in the same data that show the user more precisely what is going on, using trends that the end user would never have thought to query the computer about.

    A data warehouse is a repository of an organization's electronically stored data. Data warehouses are designed to facilitate reporting and analysis. This classic definition of the data warehouse focuses on data storage. However, the means to retrieve and analyze data, to extract, transform and load data, and to manage the data dictionary are also considered essential components of a data warehousing system. Many references to data warehousing use this broader context. Thus, an expanded definition of data warehousing includes business intelligence tools, tools to extract, transform, and load data into the repository, and tools to manage and retrieve metadata.

    A data warehouse is a relational database that is designed for query and analysis rather than for transaction processing. It usually contains historical data derived from transaction data, but can include data from other sources. Data warehouses separate analysis workload from transaction workload and enable an organization to consolidate data from several sources. This helps in:

    Maintaining historical records

    Analyzing the data to gain a better understanding of the business and to improve the business.
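
    As a rough illustration of the extract, transform and load steps mentioned above, the following Python sketch copies a few transaction rows into a separate in-memory warehouse table and then runs an analysis query against it. The table name, column names and sample rows are hypothetical and chosen only for this example; a real warehouse would rely on dedicated ETL tooling.

```python
import sqlite3

# Hypothetical operational records: (order_id, date, store, amount_in_cents)
TRANSACTIONS = [
    (1, "2008-01-15", " north ", 1250),
    (2, "2008-01-20", "South", 900),
    (3, "2008-02-03", "north", 600),
    (4, "2008-02-10", "SOUTH", 1500),
]


def extract():
    """Extract: read rows from the source system (here, a plain list)."""
    return list(TRANSACTIONS)


def transform(rows):
    """Transform: normalise store names and derive a month column."""
    return [
        (order_id, date[:7], store.strip().lower(), cents / 100.0)
        for order_id, date, store, cents in rows
    ]


def load(conn, rows):
    """Load: append the cleaned rows to the warehouse fact table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS sales_fact "
        "(order_id INTEGER, month TEXT, store TEXT, amount REAL)"
    )
    conn.executemany("INSERT INTO sales_fact VALUES (?, ?, ?, ?)", rows)
    conn.commit()


def monthly_sales(conn):
    """Analysis query: total sales per store and month."""
    cur = conn.execute(
        "SELECT store, month, SUM(amount) FROM sales_fact "
        "GROUP BY store, month ORDER BY store, month"
    )
    return cur.fetchall()


warehouse = sqlite3.connect(":memory:")
load(warehouse, transform(extract()))
report = monthly_sales(warehouse)
```

    Keeping the analysis query on the warehouse table, rather than on the transaction list itself, mirrors the separation of analysis workload from transaction workload described above.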

    Introduction:

    Data mining, the extraction of hidden predictive information from large databases, is a powerful new technology with great potential to help companies focus on the most important information in their data warehouses. Data mining tools predict future trends and behaviors, allowing businesses to make proactive, knowledge-driven decisions. Data mining tools can answer business questions that traditionally were too time consuming to resolve.

    This evolution began when business data was first stored on computers, continued with improvements in data access, and more recently generated technologies that allow users to navigate through their data in real time.

    Massive data collection

    Powerful multiprocessor computers

    Data mining algorithms

    DATA MINING AND WAREHOUSING

    Presented by:

    P.Satya vathi

    M.Divya

    SRI SIVANI COLLEGE OF ENGINEERING, SRIKAKULAM

    The Scope of Data Mining

    Data mining derives its name from the similarities between searching for valuable business information in a large database (for example, finding linked products in gigabytes of store scanner data) and mining a mountain for a vein of valuable ore. Both processes require either sifting through an immense amount of material, or intelligently probing it to find exactly where the value resides.

    Automated prediction of trends and behaviors. Data mining automates the process of finding predictive information in large databases. Questions that traditionally required extensive hands-on analysis can now be answered directly from the data.

    Automated discovery of previously unknown patterns. Data mining tools sweep through databases and identify previously hidden patterns in one step. An example of pattern discovery is the analysis of retail sales data to identify seemingly unrelated products that are often purchased together.
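
    The retail example above can be sketched in a few lines of Python by counting how often each pair of products appears in the same basket. The baskets and the support threshold below are hypothetical; this is a minimal co-occurrence count, not a full association rule algorithm.

```python
from collections import Counter
from itertools import combinations

# Hypothetical store-scanner baskets; each set is one customer's purchase.
BASKETS = [
    {"bread", "milk", "diapers"},
    {"beer", "diapers", "chips"},
    {"bread", "milk"},
    {"beer", "diapers", "milk"},
    {"bread", "beer", "diapers"},
]


def frequent_pairs(baskets, min_support):
    """Return product pairs bought together in at least min_support baskets."""
    counts = Counter()
    for basket in baskets:
        # sorted() gives each pair a canonical order, so (a, b) and (b, a)
        # are counted as the same pair.
        for pair in combinations(sorted(basket), 2):
            counts[pair] += 1
    return {pair: n for pair, n in counts.items() if n >= min_support}


pairs = frequent_pairs(BASKETS, min_support=3)
```

    With this sample data, only the beer and diapers pair reaches the threshold of three baskets.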

    Techniques:

    Neural networks

    Neural networks have broad applicability to real world business problems and have already been successfully applied in many industries. Since neural networks are best at identifying patterns or trends in data, they are well suited for prediction or forecasting needs including:

    sales forecasting

    industrial process control

    customer research

    data validation

    risk management

    target marketing etc.

    The bottom layer represents the input layer, in this case with 5 inputs labeled X1 through X5. In the middle is the hidden layer, with a variable number of nodes. It is the hidden layer that performs much of the work of the network. The output layer in this case has two nodes, Z1 and Z2, representing the output values we are trying to determine from the inputs.
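
    The layered structure just described can be sketched as a single forward pass in Python. The hidden layer size of 3, the sigmoid activation and the random, untrained weights are assumptions made for illustration; the text fixes only the 5 inputs and the 2 outputs, and training is not shown.

```python
import math
import random


def sigmoid(x):
    """Squash a weighted sum into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def layer(inputs, weights, biases):
    """Compute one fully connected layer: sigmoid(w . inputs + b) per node."""
    return [
        sigmoid(sum(w * x for w, x in zip(node_weights, inputs)) + b)
        for node_weights, b in zip(weights, biases)
    ]


def forward(x, hidden_w, hidden_b, output_w, output_b):
    """Propagate inputs X1..X5 through the hidden layer to outputs Z1, Z2."""
    hidden = layer(x, hidden_w, hidden_b)
    return layer(hidden, output_w, output_b)


def random_weights(n_nodes, n_inputs, rng):
    """Create one weight list per node, with values in [-1, 1]."""
    return [[rng.uniform(-1, 1) for _ in range(n_inputs)] for _ in range(n_nodes)]


rng = random.Random(0)  # fixed seed so the sketch is repeatable
N_INPUTS, N_HIDDEN, N_OUTPUTS = 5, 3, 2  # the hidden size is an arbitrary choice
hidden_w = random_weights(N_HIDDEN, N_INPUTS, rng)
output_w = random_weights(N_OUTPUTS, N_HIDDEN, rng)
z = forward([0.2, 0.7, 0.1, 0.9, 0.5], hidden_w, [0.0] * N_HIDDEN,
            output_w, [0.0] * N_OUTPUTS)
```

    Here z holds the two output values Z1 and Z2. In practice the weights would be learned from historical data rather than drawn at random.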

    Decision trees

    Decision trees are a simple knowledge representation that classifies examples into a finite number of classes. The nodes are labeled with attribute names, the edges are labeled with the possible values of that attribute, and the leaves are labeled with the different classes.

    The following is an example of objects that describe the weather at a given time. The objects contain information on the outlook, humidity, etc. Some objects are positive examples, denoted by P, and others are negative, denoted by N.

    Decision tree structure
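
    A decision tree of this kind can be written down directly as nested data. The tree below is a hypothetical one in the style of the familiar weather example; its attributes, values and leaf labels are assumptions, not the tree from the original figure.

```python
# Internal nodes are (attribute, {value: subtree}); leaves are the class
# labels "P" (positive) or "N" (negative). The tree itself is hypothetical.
WEATHER_TREE = (
    "outlook",
    {
        "sunny": ("humidity", {"high": "N", "normal": "P"}),
        "overcast": "P",
        "rain": ("windy", {"true": "N", "false": "P"}),
    },
)


def classify(tree, example):
    """Walk from the root to a leaf, following the edge that matches the
    example's value for each node's attribute."""
    while not isinstance(tree, str):
        attribute, branches = tree
        tree = branches[example[attribute]]
    return tree


label = classify(
    WEATHER_TREE,
    {"outlook": "sunny", "humidity": "normal", "windy": "false"},
)
```

    Each test on the path corresponds to one labeled node, and the leaf that is reached gives the class.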

    Genetic algorithms: Optimization techniques that use processes such as genetic combination, mutation, and natural selection in a design based on the concepts of evolution.

    Nearest neighbor method: A technique that classifies each record in a dataset based on a combination of the classes of the k record(s) most similar to it in a historical dataset (where k ≥ 1). Sometimes called the k-nearest neighbor technique.

    Rule induction: The extraction of useful if-then rules from data based on statistical significance.
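
    Of the techniques listed above, the nearest neighbor method is the simplest to sketch. The historical records, the two features and the class labels below are hypothetical, and the sketch uses plain Euclidean distance with a majority vote; the features are not rescaled, which a real application would usually need.

```python
import math
from collections import Counter

# Hypothetical historical records: ((age, monthly_spend), class_label)
HISTORY = [
    ((25, 40.0), "low"),
    ((30, 45.0), "low"),
    ((35, 60.0), "low"),
    ((45, 120.0), "high"),
    ((50, 130.0), "high"),
    ((55, 150.0), "high"),
]


def knn_classify(record, history, k=3):
    """Label a record by majority vote among its k nearest historical records."""
    if k < 1:
        raise ValueError("k must be at least 1")
    # math.dist computes the Euclidean distance between two points.
    nearest = sorted(history, key=lambda item: math.dist(record, item[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


prediction = knn_classify((48, 125.0), HISTORY, k=3)
```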


    How Data Mining Works

    How exactly is data mining able to tell you important things that you didn't know, or what is going to happen next? The technique that is used to perform these feats in data mining is called modeling. For instance, if you were looking for a sunken Spanish galleon on the high seas, the first thing you might do is research the times when Spanish treasure had been found by others in the past. You might note that these ships often tend to be found off the coast of Bermuda, and that there are certain characteristics to the ocean currents and certain routes that were likely taken by the ships' captains in that era. You note these similarities and build a model that includes the characteristics that are common to the locations of these sunken treasures. With these models in hand, you sail off looking for treasure where your model indicates it is most likely to be, given a similar situation in the past.

    This act of model building is thus something that people have been doing for a long time, certainly before the advent of computers or data mining technology. What happens on computers, however, is not much different from the way people build models. Computers are loaded up with lots of information about a variety of situations where an answer is known, and then the data mining software on the computer must run through that data and distill the characteristics of the data that should go into the model. For example, say that you are the director of marketing for a telecommunications company and you'd like to acquire some new long distance phone customers.
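
    The modeling loop described above can be sketched for the telecommunications example: learn from customers for whom the answer is known, then apply the result to prospects for whom only general information is available. The customer records, the segment attributes and the simple rate-per-segment model below are all hypothetical simplifications.

```python
from collections import defaultdict

# Hypothetical customers where the answer is already known:
# (age_band, region, is_long_distance_user)
CUSTOMERS = [
    ("18-30", "urban", True),
    ("18-30", "urban", True),
    ("18-30", "rural", False),
    ("31-50", "urban", True),
    ("31-50", "urban", False),
    ("31-50", "rural", False),
    ("51+", "rural", False),
    ("51+", "urban", False),
]


def build_model(records):
    """Distill the data into a model: the observed long distance usage
    rate for each (age_band, region) segment."""
    totals = defaultdict(int)
    positives = defaultdict(int)
    for age_band, region, uses_long_distance in records:
        segment = (age_band, region)
        totals[segment] += 1
        positives[segment] += int(uses_long_distance)
    return {segment: positives[segment] / totals[segment] for segment in totals}


def score(model, prospect, default=0.0):
    """Apply the model to a prospect for whom only general data is known."""
    return model.get(prospect, default)


model = build_model(CUSTOMERS)
prospects = [("51+", "rural"), ("31-50", "urban"), ("18-30", "urban")]
ranked = sorted(prospects, key=lambda p: score(model, p), reverse=True)
```

    The ranked list puts the prospects whose segment showed the highest usage rate among existing customers first.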

    Table 2 - Data Mining for Prospecting

                                                          Customers   Prospects
    General information (e.g. demographic data)           Known       Known
    Proprietary information (e.g. customer transactions)  Known       Target

    Table 3 shows another common scenario for building models: predict what is going to happen in the future.

    Table 3 - Data Mining for Predictions

                                                                                   Yesterday   Today   Tomorrow
    Static information and current plans (e.g. demographic data, marketing plans)  Known       Known   Known
    Dynamic information (e.g. customer transactions)                               Known       Known   Target

    Architecture for Data Mining:

    To best apply these advanced techniques, they must be fully integrated with a data warehouse as well as flexible interactive business analysis tools. The resulting analytic data warehouse can be applied to improve business processes throughout the organization, in areas such as promotional campaign management, fraud detection, new product rollout, and so on. Figure 1 illustrates an architecture for advanced analysis in a large data warehouse.

    Figure 1 - Integrated Data Mining Architecture

    The ideal starting point is a data warehouse containing a combination of internal data tracking all customer contact coupled with external market data about competitor activity. Background information on potential customers also provides an excellent basis for prospecting.

    Applications

    A wide range of companies have deployed successful applications of data mining.

    Combating Terrorism

    Data mining has been cited as the method by which the U.S. Army unit Able Danger had identified the leader of the September 11, 2001 attacks, Mohamed Atta, and three other 9/11 hijackers as possible members of an Al Qaeda cell operating in the U.S. more than a year before the attack.

    A pharmaceutical company can analyze its recent sales force activity and their results to improve targeting of high-value physicians and determine which marketing activities will have the greatest impact in the next few months.

    A credit card company can leverage its vast warehouse of customer transaction data to identify customers most likely to be interested in a new credit product. Using a small test mailing, the attributes of customers with an affinity for the product can be identified.

    A diversified transportation company with a large direct sales force can apply data mining to identify the best prospects for its services.

    A large consumer package goods company can apply data mining to improve its sales process to retailers.

    Introduction:

    Most firms want to set up transaction processing systems so there is a high probability that transactions will be completed in what is judged to be an acceptable amount of time. Reports and queries, which can require a much greater range of limited server and disk resources than transaction processing, can lower the probability that transactions complete in an acceptable amount of time when they run on the servers and disks used by transaction processing systems. Alternatively, running queries and reports, with their variable resource requirements, on those same servers and disks can make it quite complex to manage them so that there is a high enough probability of achieving acceptable response times.

    Definition:

    Data Warehouse:

    The term Data Warehouse was coined by Bill Inmon in 1990, which he defined in the following way: "A warehouse is a subject-oriented, integrated, time-variant and non-volatile collection of data in support of management's decision making process". He defined the terms in the sentence as follows:

    Subject Oriented:

    Data that gives information about a particular subject

    instead of about a company's ongoing operations.

    Integrated:

    Data that is gathered into the data warehouse from a variety of sources and merged into a coherent whole.

    Time-variant:

    All data in the data warehouse is identified with a

    particular time period.

    Non-volatile:

    Data is stable in a data warehouse. More data is added but data is never removed. This enables management to gain a consistent picture of the business.
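
    The four properties in this definition can be made concrete with a toy Python class. The class name, its methods and the sample rows are hypothetical; the sketch only shows one way the properties could show up in code and is not a description of any real warehouse product.

```python
class MiniWarehouse:
    """Toy store that is time-variant (every row carries a period) and
    non-volatile (rows can be added but never changed or removed)."""

    def __init__(self):
        self._rows = []

    def load(self, subject, period, source, value):
        # Integrated: rows from different sources land in one common layout.
        self._rows.append((subject, period, source, value))

    def history(self, subject):
        """Subject-oriented view: all periods recorded for one subject."""
        return sorted(
            (period, value) for s, period, _, value in self._rows if s == subject
        )


dw = MiniWarehouse()
dw.load("customer-42", "2008-Q1", "billing", 310.0)
dw.load("customer-42", "2008-Q2", "billing", 295.0)
dw.load("customer-7", "2008-Q1", "web-orders", 120.0)
trend = dw.history("customer-42")
```

    Because the class offers no update or delete method, earlier periods stay available, which is what gives a consistent picture over time.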

    Data Warehouse Architectures

    Data warehouses and their architectures vary depending upon the specifics of an organization's situation. Three common architectures are:

    Data Warehouse Architecture: Basic

    Data Warehouse Architecture: with a Staging Area

    Data Warehouse Architecture: with a Staging Area and Data Marts

    Data Warehouse Architecture: Basic

    Figure 1-2 shows a simple architecture for a data warehouse. End users directly access data derived from several source systems through the data warehouse.

    Figure 1-2 Architecture of a Data Warehouse

    In Figure 1-2, the metadata and raw data of a traditional OLTP system are present, as is an additional type of data, summary data.

    Data Warehouse Architecture: with a Staging Area

    You need to clean and process your operational data before putting it into the warehouse, as shown in Figure 1-3.


    Figure 1-3 Architecture of a Data Warehouse with a Staging Area

    Data Warehouse Architecture: with a Staging Area and Data Marts

    Although the architecture in Figure 1-3 is quite common, you may want to customize your warehouse's architecture for different groups within your organization.

    Figure 1-4 Architecture of a Data Warehouse with a Staging Area and Data Marts

    Note:

    Data marts are an important part of many data warehouses, but they are not the focus of this paper.

    Data warehouse Components

    Data warehousing is essentially what you need to do in order to create a data warehouse, and what you do with it. It is the process of creating, populating, and then querying a data warehouse, and it can involve a number of discrete technologies.

    Application Uses

    DW appliances provide solutions for many analytic application uses, including:

    Enterprise data warehousing

    Super-sized sandboxes that isolate power users with resource-intensive queries

    Pilot projects or projects requiring rapid prototyping and rapid time-to-value

    Off-loading projects from the enterprise data warehouse, i.e. large analytical query projects that affect the overall workload of the enterprise data warehouse

    Applications with specific performance or loading requirements

    Data marts that have outgrown their present environment

    Turnkey data warehouses or data marts

    Solutions for applications with high data growth and high performance requirements

    Applications requiring data warehouse encryption

    Disadvantages of data warehouses

    There are also disadvantages to using a data warehouse. Some of them are:

    Over their life, data warehouses can have high costs. The data warehouse is usually not static. Maintenance costs are high.

    Data warehouses can get outdated relatively quickly. There is a cost of delivering suboptimal information to the organization.

    There is often a fine line between data warehouses and operational systems. Duplicate, expensive functionality may be developed. Or, functionality may be developed in the data warehouse that, in retrospect, should have been developed in the operational systems, and vice versa.

    The future of data warehousing

    Data warehousing, like any technology niche, has a history of innovations that did not receive market acceptance.

    Service Oriented Architecture

    Search capabilities integrated into reporting and analysis technology

    Software as a Service

    Analytic tools that work in memory

    Visualization

    Another prediction is that data warehouse performance will continue to be improved by use of data warehouse appliances.

    Difference between data mining and data warehousing:


    Data mining: A method of comparing large amounts of data to find patterns. Normally this is used for models and forecasting.

    Or: The process of discovering meaningful correlations, patterns, and trends by sifting through large amounts of data stored in repositories, using pattern recognition technologies as well as statistical and mathematical techniques.

    Data warehousing: The ability of a system to store data resulting from data mining to be used in future inquiries of that database.

    Or: A data warehouse is a central repository (or storehouse) for data that an enterprise's various business systems collect. Data from various online applications and other sources is selectively extracted and organized in the data warehouse for useful analysis.

    Conclusion :

    Comprehensive data warehouses that integrate operational data with customer, supplier, and market information have resulted in an explosion of information. Competition requires timely and sophisticated analysis on an integrated view of the data. However, there is a growing gap between more powerful storage and retrieval systems and the users' ability to effectively analyze and act on the information they contain. A new technological leap is needed to structure and prioritize information for specific end-user problems. Data mining tools can make this leap. Quantifiable business benefits have been proven through the integration of data mining with current information systems, and new products are on the horizon that will bring this integration to an even wider audience of users.

