data warehousing and data mining

Upload: abbas-hashmi

Post on 23-Nov-2014

DESCRIPTION

This paper describes an application that was implemented at my college. It explains Data Warehousing and Data Mining and gives a full description of the project.

Page 1: Data Warehousing and Data Mining

WALCHAND INSTITUTE OF TECHNOLOGY

A

PAPER PRESENTATION ON

DATA WAREHOUSING AND DATA MINING

AT

SUBMITTED BY:

AMOL P. NITAVE ABBAS HASHMI

B.E. (C.S.E) B.E. (C.S.E)

GUIDED BY:

Prof. R. B. Kulkarni (CSE Dept. WIT, Solapur)

INDEX

1. ABSTRACT

2. DATA WAREHOUSING

Introduction

Need of Data Warehousing

Purpose of Data Warehousing

Characteristics

Life cycle

Architecture

Tools and technologies

Applications

3. DATA MINING

Introduction

Types of Data Mining

Major elements of Data Mining

Data Mining: A KDD process

Steps in KDD process

Methods of Data Mining

4. PROJECT ON DATA MINING: Website Data Mining

Aim of project

Implementation

Working

Advantages

5. CONCLUSION

6. REFERENCE

DATA WAREHOUSING AND DATA MINING

ABSTRACT:

Fast, accurate and scalable data analysis techniques are needed to extract useful information from huge piles of data. A data warehouse is a single, integrated source of decision support information formed by collecting data from multiple sources, internal to the organization as well as external, and transforming and summarizing this information to enable improved decision making. A data warehouse is designed for easy access by users to large amounts of information, and data access is typically supported by specialized analytical tools and applications. Typical applications include decision support systems and executive information systems.

Data mining is the exploration and analysis of large quantities of data in order to discover valid, novel, potentially useful, and ultimately understandable patterns in data. It is “an information extraction activity whose goal is to discover hidden facts contained in databases”.

It can also be described as the process of extracting valid, previously unknown, comprehensible and actionable information from large databases and using it to make crucial business decisions.

The project entitled “Website Data Mining” is an application of data mining built to help website developers create effective websites on the Internet.

Data mining finds patterns and subtle relationships in data and infers rules that allow

the prediction of future results. It produces output values for an assigned set of input values.

Typical applications include market segmentation, customer profiling, fraud detection,

evaluation of retail promotions, and credit risk analysis.

DATA WAREHOUSING

Increasingly, organizations are analyzing current and historical data to identify useful patterns and support business strategies.

A large amount of the right information is the key to survival in today’s competitive environment. This kind of information can be made available only if there is a totally integrated enterprise data warehouse.

What is data warehousing?

A data warehouse is a subject-oriented, integrated, non-volatile & time-variant

collection of data in support of management’s decisions.

Need for Data Warehousing:

• IT or business staff spending a lot of time developing special reports for decision-makers.

• Lots of PC-based or small server systems obtaining extracts of data incapable of presenting a

holistic view of the entire gamut of information.

• The same data is present on different systems, in different departments, and users may be unaware of this fact.

• Difficulty in getting meaningful information in a timely manner.

• Multiple systems giving different answers to the same business questions.

• Less analysis by decision makers and policy planners due to the non-availability of sophisticated tools and of easily decipherable, timely and comprehensive information.

Purpose of Data Warehousing:

• Better business intelligence for end users.

• Reduction in time to access and analyze information.

• Consolidation of disparate information sources.

• Replacement of older, less-responsive decision support systems

• Faster time to market for products and services

Data Warehouse Characteristics:

1. Subject-oriented: the warehouse is organized around the major subjects of the enterprise rather than the major application areas. This is reflected in the need to store decision-support data rather than application-oriented data.

2. Integrated: the source data comes together from different enterprise-wide application systems. The source data is often inconsistent, so the integrated data source must be made consistent to present a unified view of the data to the users.

3. Time-variant: the source data in the warehouse is only accurate and valid at some point in time or over some time interval. The time-variance of the data warehouse is also shown in the extended time that the data is held, the implicit or explicit association of time with all data, and the fact that the data represents a series of snapshots.

4. Non-volatile: data is not updated in real time but is refreshed from the operational systems on a regular basis. New data is always added as a supplement to the database, rather than as a replacement. The database continually absorbs this new data, incrementally integrating it with the previous data.
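
The time-variant and non-volatile characteristics can be illustrated with a minimal Java sketch. The class and method names (`SnapshotStore`, `load`, `asOf`, `history`) are hypothetical and are not part of the project described later; the sketch only assumes that each periodic refresh appends a dated snapshot instead of overwriting earlier data.

```java
import java.time.LocalDate;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical illustration: an append-only store of dated snapshots.
final class SnapshotStore {
    // One row of warehouse data, stamped with the date it was loaded.
    static final class Snapshot {
        final LocalDate loadedOn;
        final String subject;
        final double value;

        Snapshot(LocalDate loadedOn, String subject, double value) {
            this.loadedOn = loadedOn;
            this.subject = subject;
            this.value = value;
        }
    }

    private final List<Snapshot> rows = new ArrayList<>();

    // Non-volatile: a refresh only appends; existing rows are never changed.
    void load(LocalDate loadedOn, String subject, double value) {
        rows.add(new Snapshot(loadedOn, subject, value));
    }

    // Time-variant: return the most recent value loaded on or before a date,
    // or null if nothing had been loaded for that subject by then.
    Double asOf(String subject, LocalDate date) {
        Snapshot best = null;
        for (Snapshot s : rows) {
            if (s.subject.equals(subject) && !s.loadedOn.isAfter(date)
                    && (best == null || s.loadedOn.isAfter(best.loadedOn))) {
                best = s;
            }
        }
        return best == null ? null : best.value;
    }

    // Read-only view of every snapshot ever loaded.
    List<Snapshot> history() {
        return Collections.unmodifiableList(rows);
    }
}
```

Because old snapshots are kept, a question such as "what was the value in mid-January?" can still be answered after later refreshes.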

DATA WAREHOUSE LIFE CYCLE :

Data warehousing is a concept. It is not a product that can be purchased off the shelf. It

is a set of hardware and software components integrated together which can be used to analyze

the massive amount of data stored in an efficient manner. It is a process through which one

can build a successful data warehouse. Following are the five steps towards building a

successful data warehouse.

1) JUSTIFICATION

2) REQUIREMENT ANALYSIS

3) DESIGN

4) DEVELOPMENT & IMPLEMENTATION

5) DEPLOYMENT

DATA WAREHOUSE ARCHITECTURE :

Main Components:

• Operational data sources: the data for the warehouse is supplied from mainframe operational data held in first-generation hierarchical and network databases, departmental data held in proprietary file systems, private data held on workstations and private servers, and external systems such as the Internet, commercially available databases, or databases associated with an organization’s suppliers or customers.

• Operational data store (ODS): a repository of current and integrated operational data used for analysis. It is often structured and supplied with data in the same way as the data warehouse, but may in fact simply act as a staging area for data to be moved into the warehouse.

• Load manager: also called the frontend component, it performs all the operations associated with the extraction and loading of data into the warehouse. These operations include simple transformations of the data to prepare the data for entry into the warehouse.

[Figure: Typical architecture of a data warehouse. Components shown: operational data sources 1 to n, operational data store (ODS), load manager, warehouse manager, query manager, DBMS (meta-data, highly summarized data, lightly summarized data, detailed data), archive/backup data, and end-user access tools (reporting, query, application development, and EIS tools; OLAP tools; data mining tools).]

• Warehouse manager: performs all the operations associated with the management of the data in the warehouse. The operations performed by this component include analysis of data to ensure consistency, transformation and merging of source data, creation of indexes and views, generation of denormalizations and aggregations, and archiving and backing up data.

• Query manager: also called the backend component, it performs all the operations associated with the management of user queries. The operations performed by this component include directing queries to the appropriate tables and scheduling the execution of queries.

• End-user access tools: can be categorized into five main groups: data reporting and query tools, application development tools, executive information system (EIS) tools, online analytical processing (OLAP) tools, and data mining tools.

Tools and Technologies:

• The critical steps in the construction of a data warehouse are:

a. Extraction b. Cleansing c. Transformation

• After the critical steps, loading the results into the target system can be carried out either by separate products or by a single product, which falls into one of the following categories:

• Code generators

• Database data replication tools

• Dynamic transformation engines
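
The extraction, cleansing, transformation and loading steps listed above can be sketched in a few lines of Java. This is only an illustration under simple assumptions: the class `EtlSketch`, its methods and the sample records are hypothetical, the "source system" is a list of text lines, and the "target system" is an in-memory list rather than a real warehouse table.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Hypothetical illustration of extraction, cleansing, transformation and loading.
final class EtlSketch {
    // Extraction: here the source system is just a list of raw "city,amount" lines.
    static List<String> extract() {
        return Arrays.asList(" Pune ,120", "solapur,80", "", "MUMBAI,abc", "Mumbai,200");
    }

    // Cleansing: drop lines that are empty, malformed, or have a non-numeric amount.
    static List<String[]> cleanse(List<String> rawLines) {
        List<String[]> clean = new ArrayList<>();
        for (String line : rawLines) {
            String[] parts = line.split(",");
            if (parts.length != 2) continue;
            String city = parts[0].trim();
            String amount = parts[1].trim();
            if (city.isEmpty() || !amount.matches("\\d+")) continue;
            clean.add(new String[] {city, amount});
        }
        return clean;
    }

    // Transformation: bring every record to one consistent format "CITY|amount".
    static List<String> transform(List<String[]> cleanRows) {
        List<String> out = new ArrayList<>();
        for (String[] row : cleanRows) {
            out.add(row[0].toUpperCase(Locale.ROOT) + "|" + Integer.parseInt(row[1]));
        }
        return out;
    }

    // Loading: append the transformed records to the target.
    static void load(List<String> target, List<String> rows) {
        target.addAll(rows);
    }
}
```

Keeping each step in its own method mirrors the choice described above between separate products for each step and a single product that covers all of them.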

Applications:

• Online Transaction Processing:

OLTP systems are the major kinds of enterprise applications:

Examples:

Order entry systems, Inventory control systems, Reservation systems, Point-of-sale

systems, Tracking systems, etc.

• Executive information system (EIS) :

EIS tools present information at the highest level of summarization using corporate business measures. They are designed for extreme ease of use and, in many cases, only a mouse is required. Graphics are usually generously incorporated to provide at-a-glance indications of performance.

• Decision Support Systems (DSS) :

They ideally present information in graphical and tabular form, providing the user

with the ability to drill down on selected information. Note the increased detail and

data manipulation options presented.
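
The difference between the highly summarized view of an EIS and the drill-down offered by a DSS can be shown with a small Java sketch. The class `DrillDownSketch`, the `Sale` type and the sample figures are hypothetical and serve only as an illustration; a real tool would run such aggregations against the warehouse.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration: summary view (EIS style) and drill-down (DSS style).
final class DrillDownSketch {
    static final class Sale {
        final String region;
        final String city;
        final int amount;

        Sale(String region, String city, int amount) {
            this.region = region;
            this.city = city;
            this.amount = amount;
        }
    }

    // Highest level of summarization: total amount per region.
    static Map<String, Integer> totalsByRegion(List<Sale> sales) {
        Map<String, Integer> totals = new TreeMap<>();
        for (Sale s : sales) {
            totals.merge(s.region, s.amount, Integer::sum);
        }
        return totals;
    }

    // Drill-down: total amount per city inside one selected region.
    static Map<String, Integer> drillDown(List<Sale> sales, String region) {
        Map<String, Integer> totals = new TreeMap<>();
        for (Sale s : sales) {
            if (s.region.equals(region)) {
                totals.merge(s.city, s.amount, Integer::sum);
            }
        }
        return totals;
    }

    // Made-up sample data for the illustration.
    static List<Sale> sampleData() {
        return Arrays.asList(
            new Sale("West", "Pune", 100),
            new Sale("West", "Mumbai", 250),
            new Sale("West", "Pune", 50),
            new Sale("South", "Chennai", 70));
    }
}
```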

DATA MINING

What is data mining?

Data mining refers to the process of analyzing data from different perspectives and summarizing it into useful information. Data mining software is one of a number of tools for analyzing data. It allows users to analyze data from many different dimensions or angles, categorize it, and summarize the relationships identified.

Definition:

Data mining is the process of finding correlations or patterns among fields in large relational databases. It is “the process of extracting valid, previously unknown, comprehensible, and actionable information from large databases and using it to make crucial business decisions”.

Different Types of Data Mining: Business, Scientific and Internet Data Mining

Five major elements of Data Mining:

1. Extract, transform, and load transaction data onto the data warehouse system.

2. Store and manage the data in a multidimensional database system.

3. Provide data access to business analysts and IT professionals.

4. Analyze the data with application software.

5. Present the data in a useful format, such as a graph or table.

DATA MINING: A KDD Process:

Steps of KDD Process:

1. Learning the application domain: relevant prior knowledge and goals of the application.

2. Creating a target data set: data selection.

3. Data cleaning and preprocessing.

4. Data reduction and transformation: finding useful features, dimensionality or variable reduction, and invariant representation.

5. Choosing the functions of data mining: summarization, classification, regression, association, clustering.

6. Choosing the mining algorithm(s).

7. Data mining: search for patterns of interest.

8. Pattern evaluation and knowledge presentation: visualization, transformation, removing redundant patterns, etc.

9. Use of discovered knowledge.
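
A few of the KDD steps above (data selection, cleaning, mining by summarization, and pattern evaluation) can be sketched in Java on a tiny set of survey answers. The class `KddSketch`, its method names, the sample questions and the minimum-count threshold are all hypothetical; they are assumptions chosen for illustration, not the algorithm used in the project.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration of some KDD steps on a tiny list of survey answers.
final class KddSketch {
    // Data selection: keep only the answers for the question of interest.
    // Each row is {question, answer}.
    static List<String> select(List<String[]> survey, String question) {
        List<String> answers = new ArrayList<>();
        for (String[] row : survey) {
            if (row[0].equals(question)) answers.add(row[1]);
        }
        return answers;
    }

    // Cleaning and transformation: drop blanks, normalize case and spacing.
    static List<String> clean(List<String> answers) {
        List<String> cleaned = new ArrayList<>();
        for (String a : answers) {
            String t = a.trim().toLowerCase(Locale.ROOT);
            if (!t.isEmpty()) cleaned.add(t);
        }
        return cleaned;
    }

    // Data mining (summarization): count how often each answer occurs.
    static Map<String, Integer> mine(List<String> cleaned) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String a : cleaned) counts.merge(a, 1, Integer::sum);
        return counts;
    }

    // Pattern evaluation: keep only answers that reach a minimum count.
    static Map<String, Integer> evaluate(Map<String, Integer> counts, int minCount) {
        Map<String, Integer> kept = new TreeMap<>();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            if (e.getValue() >= minCount) kept.put(e.getKey(), e.getValue());
        }
        return kept;
    }

    // Made-up sample data for the illustration.
    static List<String[]> sampleSurvey() {
        return Arrays.asList(
            new String[] {"favourite site type", "Social "},
            new String[] {"favourite site type", "social"},
            new String[] {"favourite site type", "news"},
            new String[] {"favourite site type", "  "},
            new String[] {"age group", "18-25"});
    }
}
```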

Methods of Data Mining:

1. Classification 2. Regression 3. Clustering 4. Association rules 5. Visualization
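
As an illustration of one of these methods, association rules, the following Java sketch computes the two standard measures of a rule: support (the fraction of transactions containing all the items) and confidence (support of both sides divided by support of the left side). The class `AssociationRuleSketch`, its helper methods and the sample transactions are hypothetical.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical illustration: support and confidence of one association rule.
final class AssociationRuleSketch {
    // Fraction of transactions that contain every item in the given set.
    static double support(List<Set<String>> transactions, Set<String> items) {
        int hits = 0;
        for (Set<String> t : transactions) {
            if (t.containsAll(items)) hits++;
        }
        return (double) hits / transactions.size();
    }

    // Confidence of the rule "left => right":
    // support(left together with right) / support(left).
    static double confidence(List<Set<String>> transactions,
                             Set<String> left, Set<String> right) {
        Set<String> both = new HashSet<>(left);
        both.addAll(right);
        return support(transactions, both) / support(transactions, left);
    }

    // Small convenience helper to build an item set.
    static Set<String> setOf(String... items) {
        return new HashSet<>(Arrays.asList(items));
    }

    // Made-up sample data for the illustration.
    static List<Set<String>> sampleTransactions() {
        return Arrays.asList(
            setOf("bread", "butter"),
            setOf("bread", "butter", "milk"),
            setOf("bread", "milk"),
            setOf("milk"));
    }
}
```

With the sample transactions, bread appears in three of four transactions and bread together with butter in two of four, so the rule "bread => butter" has support 0.5 and confidence 2/3.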

PROJECT ON DATA MINING : “Website Data Mining”

We have created an application that works as a data mining tool for website developers. The project has been implemented successfully on a local server and has received excellent feedback.

• Aim of the project:

To give the user a simple graph that summarizes the collected information about websites.

• Implementation:

The data warehouse used for the project is information gathered by a survey. The data has been collected into a database, which is used in the project. This large database contains information on many websites. It was built from the questionnaires that were submitted by the users of those websites.

The application we created is web based. It creates a particular type of graph, such as a pie chart, line chart or bar graph. These graphs are generated according to the parameters selected by the website builder. The parameters that can be selected are shown in the figure below:

The constraints entered by the user are used to generate the charts. The data is extracted from the database efficiently, and the user sees only the result. For example, if a website builder wants to know where social networking sites are used the most according to the database, the result will look as shown below:

• Working:

JavaServer Pages (JSP) is used to program the application. The database is stored in Microsoft Access. For implementation purposes, a local Tomcat 6.0 server is used. For generating the charts in JSP, we made use of the JFreeChart package.

The page navigation is designed around taking the inputs. The traversal is as follows:

index.jsp → genechart.jsp

In index.jsp, the parameters are taken from the user. These parameters are posted to the genechart.jsp file on the server. The SQL queries are set up to retrieve the appropriate records. These records are used to build the charts. An example of the SQL code in JSP is as follows:

    // Connect to the Access database through the JDBC-ODBC bridge driver
    String url = "jdbc:odbc:Driver={Microsoft Access Driver (*.mdb)};"
               + "DBQ=/FinalDB.mdb;DriverID=22;READONLY=true";
    Class.forName("sun.jdbc.odbc.JdbcOdbcDriver");
    con = DriverManager.getConnection(url, "", "");
    st = con.createStatement();
    rs = st.executeQuery(sSql);   // sSql holds the SQL query text

    // Write each record as one HTML table row
    while (rs.next()) {
        out.println("</tr><tr>");
        for (int i = 1; i <= n; i++)   // note: the first column is 1, not 0
            out.println("<td nowrap>" + rs.getString(i) + "</td>");
    }
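
The loop above writes database values directly into the page. A minimal sketch of a helper that escapes each value before placing it in a table cell is given below; `RowRenderer` and its methods are hypothetical and are not part of the project's code. Note also that the JDBC-ODBC bridge driver used above was removed in Java 8, so running the application on a newer Java version would require a different driver.

```java
import java.util.List;

// Hypothetical helper: turn one database record into an HTML table row.
final class RowRenderer {
    // Escape the characters that have a special meaning in HTML.
    static String escape(String value) {
        if (value == null) return "";
        return value.replace("&", "&amp;")
                    .replace("<", "&lt;")
                    .replace(">", "&gt;")
                    .replace("\"", "&quot;");
    }

    // Build "<tr><td nowrap>...</td>...</tr>" from the column values of one record.
    static String renderRow(List<String> columns) {
        StringBuilder row = new StringBuilder("<tr>");
        for (String column : columns) {
            row.append("<td nowrap>").append(escape(column)).append("</td>");
        }
        return row.append("</tr>").toString();
    }
}
```

In the JSP page, the values read with `rs.getString(i)` could be collected into a list and passed to `renderRow` before being printed.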

Once these records have been retrieved, an algorithm computes statistics from the data. These statistics summarize the website information and are used to generate the chart. The charts are generated with the following code:

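    // Fill the chart dataset: one entry per record, with its computed value
    while (rs3.next()) {
        data.setValue(rs3.getString(1), cvi[i++]);
    }

    // Render the chart and save it as a 600 x 400 PNG image
    final ChartRenderingInfo info = new ChartRenderingInfo(new StandardEntityCollection());
    final File file1 = new File("../piechart3.png");
    ChartUtilities.saveChartAsPNG(file1, chart, 600, 400, info);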

The generated chart is saved as a ‘.png’ image file, which is then displayed to the user as output.
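
The paper does not list the algorithm that computes the statistics behind the chart. One plausible form, given here only as a sketch under that assumption, counts how often each category occurs and converts the counts to percentages that a pie chart could display. The class `ChartStatistics` and its methods are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration: turn raw survey answers into chart-ready statistics.
final class ChartStatistics {
    // Count how many times each category occurs, keeping first-seen order.
    static Map<String, Integer> countByCategory(List<String> categories) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String c : categories) counts.merge(c, 1, Integer::sum);
        return counts;
    }

    // Convert counts to percentages of the total, as a pie chart would show them.
    static Map<String, Double> toPercentages(Map<String, Integer> counts) {
        int total = 0;
        for (int n : counts.values()) total += n;
        Map<String, Double> shares = new LinkedHashMap<>();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            shares.put(e.getKey(), total == 0 ? 0.0 : 100.0 * e.getValue() / total);
        }
        return shares;
    }
}
```

Each entry of the resulting map corresponds to one name and value pair of the kind passed to `data.setValue` in the chart code above.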

• Advantages:

The website builder can retrieve the factors that he or she wants to know before creating a site. Large survey results can be condensed from the records into a simple, understandable chart that can be used by the surveyors.

CONCLUSION

Data warehousing provides the means to change raw data into information for making effective business decisions; the emphasis is on information, not data. The data warehouse is the hub for decision support data.

Data mining is a useful tool with multiple algorithms that can be tuned for specific tasks. It can benefit business, medicine, and science. It needs more efficient algorithms to speed up the data mining process.

REFERENCE

Multidimensional Data analysis and Data Mining

- Arinjay Choudhary, Dr. P.S. Deshande

Data Warehousing, Data Mining, and OLAP

-A. Berson, S.J. Smith

www.datawarehousingonline.com AND www.Wikipedia.com