DWM Mini Project

Upload: harish-pawar

Post on 07-Apr-2015

Manjra Charitable Trust's

RAJIV GANDHI INSTITUTE OF TECHNOLOGY

JUHU VERSOVA LINK ROAD, ANDHERI (W), MUMBAI 400 053

Introduction:

TANAGRA is free data mining software for academic and research purposes. It offers several data mining methods drawn from exploratory data analysis, statistical learning, machine learning, and databases.

This project is the successor of SIPINA, which implements various supervised learning algorithms, in particular the interactive and visual construction of decision trees. TANAGRA is more powerful: it contains some supervised learning methods but also other paradigms such as clustering, factorial analysis, parametric and nonparametric statistics, association rules, and feature selection and construction algorithms.

TANAGRA is an "open source project": every researcher can access the source code and add their own algorithms, provided they agree and conform to the software distribution license.

The main purpose of the Tanagra project is to give researchers and students easy-to-use data mining software that conforms to the present norms of software development in this domain (especially in the design of its GUI and the way it is used) and that allows them to analyze either real or synthetic data.

The second purpose of TANAGRA is to offer researchers an architecture that lets them easily add their own data mining methods and compare their performance. In this respect TANAGRA acts as an experimental platform that lets them focus on the essentials of their work and spares them the unpleasant part of programming this kind of tool: data management.

The third and last purpose, aimed at novice developers, is to share a possible methodology for building this kind of software. They can take advantage of free access to the source code to see how this sort of software is built, which problems to avoid, what the main steps of the project are, and which tools and code libraries to use. In this way, Tanagra can be considered a pedagogical tool for learning programming techniques.

TANAGRA does not presently include what gives commercial software in this domain its strength: a wide set of data sources, direct access to data warehouses and databases, data cleansing, interactive use, and so on.

DEPARTMENT: INFORMATION TECHNOLOGY Page 1

Import dataset into Tanagra:

1. Choose “File/New…” in the main menu of TANAGRA.
2. Enter a title for the diagram: « TANAGRA : Importing Data ».
3. Enter the name of the associated file in which you will save your work (« TANAGRA_ImportingData.bdm »).
4. Before clicking the Save button, browse the hard disk and move to the directory « …\TANAGRA\Tutorials ».
5. Click the open button icon to locate the file you have created, “weather.txt”.

6. Validate with OK to start data importation.

A new diagram is created, based on the file « weather.txt ». You can see the description of its contents in the right frame.

This project is undertaken in the subject Data Warehouse and Mining and Business Intelligence. It is a tool-based project: we use the Tanagra tool, and the database is a weather report. The project examines the attributes that describe the weather, such as temperature, humidity, windy, etc., which gives a brief idea of the weather of the area. Using visualization, regression, association, and k-means techniques in Tanagra, we derive different observations and conclusions about the database. To view the data in graphical form we use a scatter plot. Overall, Tanagra gives us an overview of this database.

Database details:

The database used in this mini project holds weather observations and the related information. The data consists of various fields and is available as an Excel document containing 15 weather records.

Tanagra loads data from text files with a tab separator, built in the following way:
- 1st line: names of the attributes
- Next lines: values of the attributes for the sample (one line for each record)

This text file (dataset) includes the following attributes:
1. Outlook
2. Temp
3. Humidity
4. Windy
5. Class

The dataset contains two continuous attributes and three discrete attributes.

The discrete attributes take the following values:
- Outlook = “sunny”, “overcast”, “rain”
- Windy = “yes”, “no”
- Class = “play”, “dontplay”
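The file layout described above can also be checked outside Tanagra. The following is a minimal Python sketch, not part of the original project: the three sample rows are made up for illustration, and the helper names (`load_tab_file`, `attribute_types`) are hypothetical. It reads a tab-separated text whose first line holds the attribute names and labels each attribute as continuous or discrete.

```python
import csv
import io

# Hypothetical sample in the tab-separated layout described above.
# These rows are made up for illustration; they are not the project's data.
SAMPLE = (
    "Outlook\tTemp\tHumidity\tWindy\tClass\n"
    "sunny\t75\t70\tyes\tplay\n"
    "overcast\t72\t90\tyes\tplay\n"
    "rain\t71\t80\tyes\tdontplay\n"
)


def load_tab_file(handle):
    """Read a tab-separated file whose first line holds the attribute names."""
    return list(csv.DictReader(handle, delimiter="\t"))


def is_number(text):
    """True if the text can be parsed as a number."""
    try:
        float(text)
        return True
    except ValueError:
        return False


def attribute_types(rows):
    """Label an attribute 'continuous' if every value parses as a number."""
    return {
        name: "continuous" if all(is_number(r[name]) for r in rows) else "discrete"
        for name in rows[0].keys()
    }


rows = load_tab_file(io.StringIO(SAMPLE))
types = attribute_types(rows)
print(types)
```

With a real file, `open("weather.txt", newline="")` could be passed to `load_tab_file` instead of the in-memory sample.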

This project contains an analysis of the above database in terms of:
1. Scatter plot with label
2. Clustering
3. Association
4. Regression tree

Scatter plot with label:-

Problem statement: The dataset provides information about the weather. This dataset is plotted to form a scatter plot with label. The features used to plot the scatter graph are:
1. Temp
2. Humidity

The scatter plot with label provides a graphical view that includes all of this information.

Steps for creating the scatter plot with label:
1. Load the dataset (weather.txt) into the Tanagra statistics data editor.
2. Open the data visualization tab from the component bar.
3. Select the scatter plot with label option from the visualization tab.
4. Drag this option onto the dataset and open it.
5. The output appears in the right frame.
6. It contains the scatter plot for the chosen features.

By using data visualization we have derived the scatter plot of the attributes humidity and temperature.
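For comparison, a similar labelled scatter plot can be sketched in Python. This is only an illustration under stated assumptions: the three records are made up, using Class as the label is our assumption, and the helper names are hypothetical. The plotting function relies on the matplotlib library and is defined but not called here.

```python
# Hypothetical records, made up for illustration; not the project's data.
RECORDS = [
    {"Temp": 75, "Humidity": 70, "Class": "play"},
    {"Temp": 80, "Humidity": 90, "Class": "dontplay"},
    {"Temp": 64, "Humidity": 65, "Class": "play"},
]


def labelled_points(records, x_name, y_name, label_name):
    """Turn records into (x, y, label) triples ready for plotting."""
    return [
        (float(r[x_name]), float(r[y_name]), str(r[label_name]))
        for r in records
    ]


def save_scatter(points, x_name, y_name, path):
    """Draw a labelled scatter plot and save it to `path` (needs matplotlib)."""
    import matplotlib
    matplotlib.use("Agg")  # render to a file, no display needed
    import matplotlib.pyplot as plt

    fig, ax = plt.subplots()
    ax.scatter([p[0] for p in points], [p[1] for p in points])
    for x, y, label in points:
        ax.annotate(label, (x, y))  # write the label next to its point
    ax.set_xlabel(x_name)
    ax.set_ylabel(y_name)
    fig.savefig(path)
    plt.close(fig)


points = labelled_points(RECORDS, "Temp", "Humidity", "Class")
```

Calling `save_scatter(points, "Temp", "Humidity", "scatter.png")` would write the figure to a file; the file name is only an example.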

Clustering:-

Problem statement: The dataset provides information based on different characteristics and features.

Clustering is the task of segmenting a diverse group into a number of more similar subgroups, or clusters. Here the clustering is done on the attributes Temp and Humidity.

Steps for creating the clustering (k-means):
1. Add a Define Status operator under the “Dataset” node by clicking its icon in the shortcuts toolbar. A dialog box appears automatically, allowing you to define the status of the attributes.
2. First, make sure that the active tab in the dialog is the “Input” one. Then select the continuous attributes in the left list by clicking the corresponding button below the list (as shown in the following screenshot), and click the arrow button to move them to the Input list.

3. Select two continuous attributes as input values.
4. The descriptors are now defined. Click OK to validate and close this dialog box.
5. Drag the k-means option onto Define Status 1, for which the descriptors were defined.
6. Right-click the k-means 1 option and select View.
7. The output appears in the right frame.
8. It contains the clustering for the chosen features.

By performing the k-means clustering operation we have grouped the data into a fixed number of distinct, more manageable clusters.
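To make the idea behind k-means concrete, here is a minimal Python sketch of the standard assign-and-update loop on two continuous attributes. It is not Tanagra's implementation, which may initialize or normalize differently. The (Temp, Humidity) pairs are made up for illustration, and the choice of starting centroids is an assumption.

```python
import math

# Hypothetical (Temp, Humidity) pairs, made up for illustration.
POINTS = [(64, 65), (65, 70), (68, 75), (80, 90), (83, 86), (85, 85)]


def nearest(point, centroids):
    """Index of the centroid closest to `point` (Euclidean distance)."""
    return min(range(len(centroids)), key=lambda i: math.dist(point, centroids[i]))


def mean(points):
    """Coordinate-wise mean of a non-empty list of points."""
    return tuple(sum(axis) / len(points) for axis in zip(*points))


def k_means(points, centroids, max_iter=100):
    """Alternate assignment and centroid update until the centroids stop moving."""
    centroids = list(centroids)
    for _ in range(max_iter):
        assignment = [nearest(p, centroids) for p in points]
        updated = []
        for i, old in enumerate(centroids):
            members = [p for p, a in zip(points, assignment) if a == i]
            # Keep the old centroid if a cluster happens to be empty.
            updated.append(mean(members) if members else old)
        if updated == centroids:
            break
        centroids = updated
    return assignment, centroids


# Assumption: start from two of the data points as initial centroids.
assignment, centroids = k_means(POINTS, [POINTS[0], POINTS[3]])
print(assignment, centroids)
```

The number of clusters is fixed in advance by the number of starting centroids, which matches the "fixed number of clusters" noted above.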

Association:-

Problem statement: The dataset provides information based on different characteristics and features.

Association analysis is used to find relationships in a database. Here the relationship is shown between three attributes: outlook, windy, and class.

Steps for creating the association:
1. Add a Define Status operator under the “Dataset” node by clicking its icon in the shortcuts toolbar. A dialog box appears automatically, allowing you to define the status of the attributes.
2. First, make sure that the active tab in the dialog is the “Input” one. Then select the continuous attributes in the left list by clicking the corresponding button below the list (as shown in the following screenshot), and click the arrow button to move them to the Input list.
3. Select one continuous attribute, temp, as the input value.
4. Select one continuous attribute, humidity, as the target value.
5. In the same dialog box, activate the Target tab. Select the « class » attribute in the list and click the arrow button.
6. You have now defined the class attribute (« class » = Target) and the descriptors (the others = Input).
7. Click OK to validate and close this dialog box.

8. Drag Apriori onto Define Status; the output appears in the right frame.
9. It contains the association rules for the chosen features.

By performing association analysis we have managed to show a distinct link between two attributes.
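The quantities behind association rules, support and confidence, can be sketched in a few lines of Python. This is a simplified level-wise search in the spirit of Apriori, not Tanagra's component. The five records over the discrete attributes are made up for illustration, and the minimum support of 0.4 is an assumption.

```python
from itertools import combinations

# Hypothetical records over the discrete attributes, made up for illustration.
RECORDS = [
    {"Outlook": "sunny", "Windy": "no", "Class": "play"},
    {"Outlook": "sunny", "Windy": "yes", "Class": "dontplay"},
    {"Outlook": "overcast", "Windy": "no", "Class": "play"},
    {"Outlook": "rain", "Windy": "yes", "Class": "dontplay"},
    {"Outlook": "rain", "Windy": "no", "Class": "play"},
]

# Each record becomes a set of "attribute=value" items.
TRANSACTIONS = [frozenset(f"{k}={v}" for k, v in r.items()) for r in RECORDS]


def support(itemset, transactions):
    """Fraction of transactions that contain every item of `itemset`."""
    return sum(itemset <= t for t in transactions) / len(transactions)


def frequent_itemsets(transactions, min_support):
    """Level-wise search: grow frequent itemsets one item at a time."""
    items = sorted({item for t in transactions for item in t})
    level = [frozenset([i]) for i in items
             if support(frozenset([i]), transactions) >= min_support]
    found = {}
    while level:
        for s in level:
            found[s] = support(s, transactions)
        # Candidates: unions of two frequent k-itemsets that have k+1 items.
        candidates = {a | b for a, b in combinations(level, 2)
                      if len(a | b) == len(a) + 1}
        level = [c for c in candidates
                 if support(c, transactions) >= min_support]
    return found


def confidence(antecedent, consequent, transactions):
    """Support of the whole rule divided by support of its left-hand side."""
    return (support(antecedent | consequent, transactions)
            / support(antecedent, transactions))


found = frequent_itemsets(TRANSACTIONS, min_support=0.4)
rule_conf = confidence(frozenset({"Windy=no"}), frozenset({"Class=play"}), TRANSACTIONS)
print(rule_conf)
```

On this made-up sample the rule "Windy=no implies Class=play" holds in every record where Windy=no, so its confidence is 1.0.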

Regression Tree:-

Problem statement: The dataset contains all the information according to the various attributes. We use a regression tree to find the relationship between the variables temp and humidity.

Steps for creating the regression tree:
1. Add a Define Status operator under the “Dataset” node by clicking its icon in the shortcuts toolbar. A dialog box appears automatically, allowing you to define the status of the attributes.
2. First, make sure that the active tab in the dialog is the “Input” one. Then select the continuous attributes in the left list by clicking the corresponding button below the list (as shown in the following screenshot), and click the arrow button to move them to the Input list.
3. Select one continuous attribute, temp, as the input value.
4. Select one continuous attribute, humidity, as the target value.

5. In the same dialog box, activate the Target tab. Select the « class » attribute in the list and click the arrow button.
6. You have now defined the class attribute (« class » = Target) and the descriptors (the others = Input).
7. Click OK to validate and close this dialog box.
8. Drag Regression tree onto Define Status; the output appears in the right frame.
9. It contains the regression tree for the chosen features.

By constructing the regression tree we have been able to show the relationship between temp and humidity.
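The core step of a regression tree, choosing the split that best separates the target values, can be sketched as follows. This is a single-split illustration in Python, not Tanagra's algorithm. The (Temp, Humidity) pairs are made up, and using the sum of squared errors as the split criterion is an assumption.

```python
# Hypothetical (Temp, Humidity) pairs, made up for illustration.
DATA = [(64, 65), (65, 70), (68, 75), (80, 90), (83, 86), (85, 85)]


def sse(values):
    """Sum of squared deviations from the mean (0 for an empty list)."""
    if not values:
        return 0.0
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values)


def best_split(data):
    """Find the input threshold that minimises the total SSE of the target."""
    xs = sorted({x for x, _ in data})
    best = None
    # Try the midpoint between each pair of neighbouring input values.
    for lo, hi in zip(xs, xs[1:]):
        threshold = (lo + hi) / 2
        left = [y for x, y in data if x <= threshold]
        right = [y for x, y in data if x > threshold]
        cost = sse(left) + sse(right)
        if best is None or cost < best[1]:
            best = (threshold, cost,
                    sum(left) / len(left), sum(right) / len(right))
    return best


threshold, cost, left_mean, right_mean = best_split(DATA)


def predict(temp):
    """Predict humidity as the mean of the leaf that `temp` falls into."""
    return left_mean if temp <= threshold else right_mean


print(threshold, left_mean, right_mean)
```

A full regression tree would apply the same search recursively to each side of the split until a stopping rule is met.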

This was a mini project in DWMI using the Tanagra tool.

We have successfully completed the analysis of the above dataset, which contains weather details. Using Tanagra, we carried out the analysis of the data with the tools provided.

The scatter plot with label was carried out on the dataset. The features used to plot the scatter graph are the temperature and humidity of the weather. The scatter plot provides a graphical view that includes all of this information.

Clustering analysis segments a diverse group into a number of more similar subgroups; here the clustering is done on the attributes Temp and Humidity.

Association analysis is used to find relationships in a database. The relationship has been shown between three attributes: outlook, windy, and class.

A regression tree was built on the data to find the relationship between the variables temperature and humidity.

Hence, we have successfully completed the mini project.
