data mining self study report

8/17/2019 Data mining self study report


Self-Study seminar

Topic :- Data mining

Submitted by :-

Akash Shinde2k14/S/!"



ACKNOWLEDGEMENT

I would like to express my deepest gratitude to the people who have helped and supported me throughout this Self-Study Seminar Report.

I am grateful to my guide, Ms. Ruchika Malhotra, for her continuous support for the project, from initial advice and contacts in the early stages of conceptual inception through ongoing advice and encouragement to this day.

I wish to thank them for their undivided support and interest, which inspired and encouraged me to go the right way.

At last, but not the least, I want to thank my friends, who appreciated my work and motivated me, and finally God, who made all things possible.

Akash Shinde



Contents

• Abstract

• Introduction

• Data Mining Algorithms

• Applications

• Advantages and Disadvantages

• Conclusion



Abstract

Mining information and knowledge from large databases has been recognized by many researchers as a key research topic in database systems and machine learning, and by many industrial companies as an important area with an opportunity of major revenues. Researchers in many different fields have shown great interest in data mining. Several emerging applications in information-providing services, such as data warehousing and online services over the Internet, also call for various data mining techniques to better understand user behavior, to improve the service provided and to increase business opportunities. In response to such a demand, this article provides a survey, from a database researcher's point of view, on the data mining techniques developed recently. A classification of the available data mining techniques is provided and a comparative study of such techniques is presented.



Introduction

Data mining is an interdisciplinary subfield of computer science. It is the computational process of discovering patterns in large data sets involving methods at the intersection of artificial intelligence, machine learning, statistics, and database systems. The overall goal of the data mining process is to extract information from a data set and transform it into an understandable structure for further use. Aside from the raw analysis step, it involves database and data management aspects, data pre-processing, model and inference considerations, interestingness metrics, complexity considerations, post-processing of discovered structures, visualization, and online updating. Data mining is the analysis step of the "knowledge discovery in databases" process, or KDD.

The term is a misnomer, because the goal is the extraction of patterns and knowledge from large amounts of data, not the extraction (mining) of data itself. It also is a buzzword and is frequently applied to any form of large-scale data or information processing (collection, extraction, warehousing, analysis, and statistics) as well as any application of computer decision support systems, including artificial intelligence, machine learning, and business intelligence. The book Data Mining: Practical Machine Learning Tools and Techniques with Java (which covers mostly machine learning material) was originally to be named just Practical Machine Learning, and the term data mining was only added for marketing reasons. Often the more general terms (large scale) data analysis and analytics, or, when referring to actual methods, artificial intelligence and machine learning, are more appropriate.

The actual data mining task is the automatic or semi-automatic analysis of large quantities of data to extract previously unknown, interesting patterns such as groups of data records (cluster analysis), unusual records (anomaly detection), and dependencies (association rule mining). This usually involves using database techniques such as spatial indices. These patterns can then be seen as a kind of summary of the input data, and may be used in further analysis or, for example, in machine learning and predictive analytics. For example, the data mining step might identify multiple groups in the data, which can then be used to obtain more accurate prediction results by a decision support system. Neither the data collection, data preparation, nor result interpretation and reporting is part of the data mining step, but they do belong to the overall KDD process as additional steps.

The related terms data dredging, data fishing, and data snooping refer to the use of data mining methods to sample parts of a larger population data set that are (or may be) too small for reliable statistical inferences to be made about the validity of any patterns discovered. These methods can, however, be used in creating new hypotheses to test against the larger data populations.



Data Mining Algorithms

Clustering Algorithm

The Microsoft Clustering algorithm is a segmentation algorithm provided by Analysis Services. The algorithm uses iterative techniques to group cases in a dataset into clusters that contain similar characteristics. These groupings are useful for exploring data, identifying anomalies in the data, and creating predictions.

Clustering models identify relationships in a dataset that you might not logically derive through casual observation. For example, you can logically discern that people who commute to their jobs by bicycle do not typically live a long distance from where they work. The algorithm, however, can find other characteristics about bicycle commuters that are not as obvious. In the following diagram, cluster A represents data about people who tend to drive to work, while cluster B represents data about people who tend to ride bicycles to work.

The clustering algorithm differs from other data mining algorithms, such as the Microsoft Decision Trees algorithm, in that you do not have to designate a predictable column to be able to build a clustering model. The clustering algorithm trains the model strictly from the relationships that exist in the data and from the clusters that the algorithm identifies.

Example

Consider a group of people who share similar demographic information and who buy similar products from the Adventure Works company. This group of people represents a cluster of data. Several such clusters may exist in a database. By observing the columns that make up a cluster, you can more clearly see how records in a dataset are related to one another.

How the Algorithm Works

The Microsoft Clustering algorithm first identifies relationships in a dataset and generates a series of clusters based on those relationships. A scatter plot is a useful way to visually represent how the algorithm groups data, as shown in the following diagram. The scatter plot represents all the cases in the dataset, and each case is a point on the graph. The clusters group points on the graph and illustrate the relationships that the algorithm identifies.

After first defining the clusters, the algorithm calculates how well the clusters represent groupings of the points, and then tries to redefine the groupings to create clusters that better represent the data.


The algorithm iterates through this process until it cannot improve the results any further by redefining the clusters.

You can customize the way the algorithm works by specifying a clustering technique, limiting the maximum number of clusters, or changing the amount of support required to create a cluster.
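The define, evaluate and redefine loop described above can be illustrated with a small sketch. It uses plain k-means as a stand-in; this is an assumption for illustration and not necessarily the exact method the Microsoft Clustering algorithm applies. The `kmeans` function, the seeding rule and the commuter data are all hypothetical.

```python
import math

def kmeans(points, k, max_iter=100):
    """Plain k-means on a list of numeric tuples; returns (centroids, labels)."""
    # Assumption: seed the centroids with the first k points so the run is deterministic.
    centroids = [points[i] for i in range(k)]
    labels = [0] * len(points)
    for _ in range(max_iter):
        # Assignment step: attach each point to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for c in range(k):
            members = [p for p, lab in zip(points, new_labels) if lab == c]
            if members:
                new_centroids.append(
                    tuple(sum(v) / len(members) for v in zip(*members))
                )
            else:
                new_centroids.append(centroids[c])  # keep an empty cluster in place
        # Stop when redefining the clusters no longer changes anything.
        if new_labels == labels and new_centroids == centroids:
            break
        labels, centroids = new_labels, new_centroids
    return centroids, labels

# Made-up cases: (commute distance in km, age).
commuters = [(2, 25), (3, 30), (2.5, 27), (30, 45), (28, 50), (32, 48)]
centroids, labels = kmeans(commuters, k=2)
print(labels)     # short-distance commuters share one label, long-distance the other
print(centroids)
```

Limiting `k` plays the role of limiting the maximum number of clusters; no predictable column is needed, because the grouping comes only from the relationships among the input values.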

Data Required for Clustering Models

When you prepare data for use in training a clustering model, you should understand the requirements for the particular algorithm, including how much data is needed, and how the data is used.

The requirements for a clustering model are as follows:

• A single key column: Each model must contain one numeric or text column that uniquely identifies each record. Compound keys are not allowed.

• Input columns: Each model must contain at least one input column that contains the values that are used to build the clusters. You can have as many input columns as you want, but depending on the number of values in each column, the addition of extra columns can increase the time it takes to train the model.

• Optional predictable column: The algorithm does not need a predictable column to build the model, but you can add a predictable column of almost any data type. The values of the predictable column can be treated as input to the clustering model, or you can specify that it be used for prediction only. For example, if you want to predict customer income by clustering on demographics such as region or age, you would specify income as PredictOnly and add all the other columns, such as region or age, as inputs.

Association Algorithm

The Microsoft Association algorithm is an association algorithm provided by Analysis Services that is useful for recommendation engines. A recommendation engine recommends products to customers based on items they have already bought, or in which they have indicated an interest. The Microsoft Association algorithm is also useful for market basket analysis.

Association models are built on datasets that contain identifiers both for individual cases and for the items that the cases contain. A group of items in a case is called an itemset. An association model consists of a series of itemsets and the rules that describe how those items are grouped together within the cases. The rules that the algorithm identifies can be used to predict a customer's likely future purchases, based on the items that already exist in the customer's shopping cart. The following diagram shows a series of rules in an itemset.

As the diagram illustrates, the Microsoft Association algorithm can potentially find many rules within a dataset. The algorithm uses two parameters, support and probability, to describe the itemsets and rules that it generates. For example, if X and Y represent two items that could be in a shopping cart, the support parameter is the number of cases in the dataset that contain the combination of items, X and Y. By using the support parameter in combination with the user-defined parameters, MINIMUM_SUPPORT and MAXIMUM_SUPPORT, the algorithm controls the number of itemsets that are generated. The probability parameter, also named confidence, represents the fraction of cases in the dataset that contain X and that also contain Y. By using the probability parameter in combination with the MINIMUM_PROBABILITY parameter, the algorithm controls the number of rules that are generated.
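As a rough illustration of the two parameters, the sketch below counts support and computes probability (confidence) over a handful of made-up shopping baskets. The function names and the basket contents are hypothetical, and the `minimum_support` argument only mimics the role of MINIMUM_SUPPORT; it is not the Analysis Services implementation.

```python
from itertools import combinations

def support(transactions, items):
    """Number of cases that contain every item in `items`."""
    wanted = set(items)
    return sum(1 for t in transactions if wanted <= set(t))

def probability(transactions, x, y):
    """Fraction of the cases containing x that also contain y (confidence of x -> y)."""
    with_x = support(transactions, [x])
    return support(transactions, [x, y]) / with_x if with_x else 0.0

def frequent_pairs(transactions, minimum_support):
    """Two-item itemsets whose support reaches the threshold."""
    items = sorted(set().union(*transactions))
    counts = {pair: support(transactions, pair) for pair in combinations(items, 2)}
    return {pair: n for pair, n in counts.items() if n >= minimum_support}

# Made-up baskets, one set of items per case.
baskets = [
    {"bike", "helmet"},
    {"bike", "helmet", "bottle"},
    {"bike", "bottle"},
    {"helmet"},
    {"bike", "helmet", "gloves"},
]

print(support(baskets, ["bike", "helmet"]))      # 3 cases contain both items
print(probability(baskets, "bike", "helmet"))    # 3 of the 4 bike baskets -> 0.75
print(frequent_pairs(baskets, minimum_support=2))
```

Raising `minimum_support` shrinks the set of itemsets, and a probability threshold would prune rules in the same way.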

Example


The Adventure Works Cycle company is redesigning the functionality of its Web site. The goal of the redesign is to increase sell-through of products. Because the company records each sale in a transactional database, they can use the Microsoft Association algorithm to identify sets of products that tend to be purchased together. They can then predict additional items that a customer might be interested in, based on items that are already in the customer's shopping basket.

How the Algorithm Works

The Microsoft Association algorithm traverses a dataset to find items that appear together in a case. The algorithm then groups into itemsets any associated items that appear, at a minimum, in the number of cases that are specified by the MINIMUM_SUPPORT parameter. For example, an itemset could be "Mountain 200=Existing, Sport 100=Existing", and could have a support of 710.

The algorithm then generates rules from the itemsets. These rules are used to predict the presence of an item in the database, based on the presence of other specific items that the algorithm identifies as important. For example, a rule could be "if Touring 1000=existing and Road bottle cage=existing, then Water bottle=existing", and could have a probability of 0.812. In this example, the algorithm identifies that the presence in the basket of the Touring 1000 tire and the water bottle cage predicts that a water bottle would also likely be in the basket.

Data Required for Association Models

When you prepare data for use in an association rules model, you should understand the requirements for the particular algorithm, including how much data is needed, and how the data is used.

The requirements for an association rules model are as follows:

• A single key column: Each model must contain one numeric or text column that uniquely identifies each record. Compound keys are not permitted.

• A single predictable column: An association model can have only one predictable column. Typically it is the key column of the nested table, such as the field that lists the products that were purchased. The values must be discrete or discretized.

• Input columns: The input columns must be discrete. The input data for an association model often is contained in two tables. For example, one table might contain customer information while another table contains customer purchases. You can input this data into the model by using a nested table.

Linear Regression Algorithm

The Microsoft Linear Regression algorithm is a variation of the Microsoft Decision Trees algorithm that helps you calculate a linear relationship between a dependent and independent variable, and then use that relationship for prediction.

The relationship takes the form of an equation for a line that best represents a series of data. For example, the line in the following diagram is the best possible linear representation of the data.




Each data point in the diagram has an error associated with its distance from the regression line. The coefficients a and b in the regression equation adjust the angle and location of the regression line. You can obtain the regression equation by adjusting a and b until the sum of the errors that are associated with all the points reaches its minimum.
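A minimal sketch of this idea follows. It assumes the line has the form y = a*x + b (a as the slope, b as the intercept) and that "error" means squared vertical distance, so that the minimum can be computed directly with the ordinary least-squares formulas instead of by trial adjustment. The function names and data are hypothetical.

```python
def fit_line(xs, ys):
    """Least-squares slope a and intercept b for y = a*x + b."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    a = sxy / sxx              # the angle (slope) of the line
    b = mean_y - a * mean_x    # the location (intercept) of the line
    return a, b

def sum_squared_errors(xs, ys, a, b):
    """Total squared vertical distance between the points and the line."""
    return sum((y - (a * x + b)) ** 2 for x, y in zip(xs, ys))

# Made-up points that lie exactly on y = 2x + 1.
xs = [1, 2, 3, 4, 5]
ys = [3, 5, 7, 9, 11]

a, b = fit_line(xs, ys)
print(a, b)
print(sum_squared_errors(xs, ys, a, b))          # error of the fitted line
print(sum_squared_errors(xs, ys, a + 0.5, b))    # any other line does worse
```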

There are other kinds of regression that use multiple variables, and also nonlinear methods of regression. However, linear regression is a useful and well-known method for modeling a response to a change in some underlying factor.

Example

You can use linear regression to determine a relationship between two continuous columns. For example, you can use linear regression to compute a trend line from manufacturing or sales data. You could also use the linear regression as a precursor to development of more complex data mining models, to assess the relationships among data columns.

Although there are many ways to compute linear regression that do not require data mining tools, the advantage of using the Microsoft Linear Regression algorithm for this task is that all the possible relationships among the variables are automatically computed and tested. You do not have to select a computation method, such as solving for least squares. However, linear regression might oversimplify the relationships in scenarios where multiple factors affect the outcome.


The Microsoft Linear Regression algorithm is a variation of the Microsoft Decision Trees algorithm. When you select the Microsoft Linear Regression algorithm, a special case of the Microsoft Decision Trees algorithm is invoked, with parameters that constrain the behavior of the algorithm and require certain input data types. Moreover, in a linear regression model, the whole data set is used for computing relationships in the initial pass, whereas a standard decision trees model splits the data repeatedly into smaller subsets or trees.

Data Required for Linear Regression Models

When you prepare data for use in a linear regression model, you should understand the requirements for the particular algorithm. This includes how much data is needed, and how the data is used. The requirements for this model type are as follows:

• A single key column: Each model must contain one numeric or text column that uniquely identifies each record. Compound keys are not permitted.


probability that they will not buy a bike is 0.287. In this example, the algorithm uses the numeric information, derived from customer characteristics (such as commute distance), to predict whether a customer will buy a bike.

Data Required for Naive Bayes Models

When you prepare data for use in training a Naive Bayes model, you should understand the requirements for the algorithm, including how much data is needed, and how the data is used.

The requirements for a Naive Bayes model are as follows:

• A single key column: Each model must contain one numeric or text column that uniquely identifies each record. Compound keys are not allowed.

• Input columns: In a Naive Bayes model, all columns must be either discrete or discretized columns.

For a Naive Bayes model, it is also important to ensure that the input attributes are independent of each other. This is particularly important when you use the model for prediction.

The reason is that, if you use two columns of data that are already closely related, the effect would be to multiply the influence of those columns, which can obscure other factors that influence the outcome.

Conversely, the ability of the algorithm to identify correlations among variables is useful when you are exploring a model or dataset, to identify relationships among inputs.

• At least one predictable column: The predictable attribute must contain discrete or discretized values.

The values of the predictable column can be treated as inputs. This practice can be useful when you are exploring a new dataset, to find relationships among the columns.

Decision Trees Algorithm

The Microsoft Decision Trees algorithm is a classification and regression algorithm provided by Microsoft SQL Server Analysis Services for use in predictive modeling of both discrete and continuous attributes.

For discrete attributes, the algorithm makes predictions based on the relationships between input columns in a dataset. It uses the values, known as states, of those columns to predict the states of a column that you designate as predictable. Specifically, the algorithm identifies the input columns that are correlated with the predictable column. For example, in a scenario to predict which customers are likely to purchase a bicycle, if nine out of ten younger customers buy a bicycle, but only two out of ten older customers do so, the algorithm infers that age is a good predictor of bicycle purchase. The decision tree makes predictions based on this tendency toward a particular outcome.

For continuous attributes, the algorithm uses linear regression to determine where a decision tree splits.

If more than one column is set to predictable, or if the input data contains a nested table that is set to predictable, the algorithm builds a separate decision tree for each predictable column.
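One common way to quantify how good a predictor an input column is uses entropy and information gain. The sketch below applies it to the bicycle example (nine of ten younger customers buy, two of ten older customers buy). It is an illustration under that assumption only; the exact scoring used by the Microsoft Decision Trees algorithm may differ, and the function names are hypothetical.

```python
import math

def entropy(counts):
    """Shannon entropy (in bits) of a list of class counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c)

def information_gain(parent, children):
    """Entropy of the parent minus the size-weighted entropy of the child groups."""
    total = sum(parent)
    weighted = sum(sum(child) / total * entropy(child) for child in children)
    return entropy(parent) - weighted

# Counts are [buyers, non-buyers], taken from the example in the text.
younger = [9, 1]
older = [2, 8]
everyone = [younger[0] + older[0], younger[1] + older[1]]   # [11, 9]

gain = information_gain(everyone, [younger, older])
print(round(entropy(everyone), 3))   # close to 1 bit: buyers and non-buyers are mixed
print(round(gain, 3))                # a clearly positive gain from splitting on age
```

A column whose split left the buyer ratio unchanged in every group would have a gain of zero, so it would not justify a new node.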

Example

The marketing department of the Adventure Works Cycles company wants to identify the characteristics of previous customers that might indicate whether those customers are likely to buy a product in the future. The AdventureWorks2012 database stores demographic information that describes previous customers. By using the Microsoft Decision Trees algorithm to analyze this information, the marketing department can build a model that predicts whether a particular customer will purchase products, based on the states of known columns about that customer, such as demographics or past buying patterns.


The Microsoft Decision Trees algorithm builds a data mining model by creating a series of splits in the tree. These splits are represented as nodes. The algorithm adds a node to the model every time that an input column is found to be significantly correlated with the predictable column. The way that the algorithm determines a split is different depending on whether it is predicting a continuous column or a discrete column.

The Microsoft Decision Trees algorithm uses feature selection to guide the selection of the most useful attributes. Feature selection is used by all Analysis Services data mining algorithms to improve performance and the quality of analysis. Feature selection is important to prevent unimportant attributes from using processor time. If you use too many input or predictable attributes when you design a data mining model, the model can take a very long time to process, or even run out of memory. Methods used to determine whether to split the tree include industry-standard metrics for entropy and Bayesian networks.

A common problem in data mining models is that the model becomes too sensitive to small differences in the training data, in which case it is said to be over-fitted or over-trained. An overfitted model cannot be generalized to other data sets. To avoid overfitting on any particular set of data, the Microsoft Decision Trees algorithm uses techniques for controlling the growth of the tree.

Predicting Discrete Columns

The way that the Microsoft Decision Trees algorithm builds a tree for a discrete predictable column can be demonstrated by using a histogram. The following diagram shows a histogram that plots a predictable column, Bike Buyers, against an input column, Age. The histogram shows that the age of a person helps distinguish whether that person will purchase a bicycle.

The correlation that is shown in the diagram would cause the Microsoft Decision Trees algorithm to create a new node in the model.

As the algorithm adds new nodes to a model, a tree structure is formed. The top node of the tree describes the breakdown of the predictable column for the overall population of customers. As the model continues to grow, the algorithm considers all columns.

Predicting Continuous Columns

When the Microsoft Decision Trees algorithm builds a tree based on a continuous predictable column, each node contains a regression formula. A split occurs at a point of non-linearity in the regression formula. For example, consider the following diagram.

The diagram contains data that can be modeled either by using a single line or by using two connected lines. However, a single line would do a poor job of representing the data. Instead, if you use two lines, the model will do a much better job of approximating the data. The point where the two lines come together is the point of non-linearity, and is the point where a node in a decision tree model would split. For example, the node that corresponds to the point of non-linearity in the previous graph could be represented by the following diagram.
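The search for a point of non-linearity can be sketched as a brute-force scan: try every split position, fit one line to each side, and keep the split with the smallest combined error. This is a simplified illustration, not the algorithm's actual procedure; the two lines are fitted independently rather than forced to connect, and all names and data are hypothetical.

```python
def fit_line(xs, ys):
    """Least-squares slope and intercept for one segment."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    a = sxy / sxx if sxx else 0.0
    return a, mean_y - a * mean_x

def segment_error(xs, ys):
    """Sum of squared errors of the best single line through the segment."""
    a, b = fit_line(xs, ys)
    return sum((y - (a * x + b)) ** 2 for x, y in zip(xs, ys))

def best_split(xs, ys, min_points=2):
    """Return (index, error); the left segment is xs[:index], the right xs[index:]."""
    best = None
    for i in range(min_points, len(xs) - min_points + 1):
        err = segment_error(xs[:i], ys[:i]) + segment_error(xs[i:], ys[i:])
        if best is None or err < best[1]:
            best = (i, err)
    return best

# Made-up data: slope 1 up to x = 5, slope 3 afterwards.
xs = list(range(10))
ys = [x if x < 5 else 5 + 3 * (x - 5) for x in xs]

index, split_error = best_split(xs, ys)
print(index, split_error)         # the split lands at the kink near x = 5
print(segment_error(xs, ys))      # a single line fits the same data much worse
```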

Data Required for Decision Tree Models

When you prepare data for use in a decision trees model, you should understand the requirements for the particular algorithm, including how much data is needed, and how the data is used.

The requirements for a decision trees model are as follows:

• A single key column: Each model must contain one numeric or text column that uniquely identifies each record. Compound keys are not permitted.

• A predictable column: Requires at least one predictable column. You can include multiple predictable attributes in a model, and the predictable attributes can be of different types, either numeric or discrete. However, increasing the number of predictable attributes can increase processing time.

• Input columns: Requires input columns, which can be discrete or continuous. Increasing the number of input attributes affects processing time.




Data Mining Applications

Data mining is a process that analyzes a large amount of data to find new and hidden information that improves business efficiency. Various industries have been adopting data mining in their mission-critical business processes to gain competitive advantages and help their businesses grow. This section illustrates some data mining applications in sales/marketing, banking/finance, health care and insurance, transportation and medicine.

Data Mining Applications in Sales/Marketing

Data mining enables businesses to understand the hidden patterns inside historical purchasing transaction data, thus helping in planning and launching new marketing campaigns in a prompt and cost-effective way. The following illustrates several data mining applications in sales and marketing.

Data mining is used for market basket analysis to provide information on what product combinations were purchased together, when they were bought and in what sequence. This information helps businesses promote their most profitable products and maximize the profit. In addition, it encourages customers to purchase related products that they may have missed or overlooked.

Retail companies use data mining to identify customers' buying behavior patterns.

Data Mining Applications in Banking / Finance

Several data mining techniques, e.g., distributed data mining, have been researched, modeled and developed to help credit card fraud detection.

Data mining is used to identify customer loyalty by analyzing data on customers' purchasing activities, such as the frequency of purchase in a period of time, the total monetary value of all purchases and when the last purchase was made. After analyzing those dimensions, a relative measure is generated for each customer. The higher the score, the more loyal the customer is.
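A relative loyalty measure of this kind can be sketched as follows. The scoring rule is an assumption made for illustration: each of the three dimensions (recency, frequency, monetary value) is rescaled to the range 0 to 1 across all customers and the three values are added with equal weight. The customers, dates and amounts are made up.

```python
from datetime import date

def loyalty_scores(customers, today):
    """Map each customer to a relative score between 0 and 3 (higher is more loyal)."""
    raw = {}
    for name, purchases in customers.items():
        recency_days = (today - max(d for d, _ in purchases)).days
        frequency = len(purchases)
        monetary = sum(amount for _, amount in purchases)
        # Negate recency so that, in every dimension, larger means better.
        raw[name] = (-recency_days, frequency, monetary)

    scores = {name: 0.0 for name in raw}
    for dim in range(3):
        values = [v[dim] for v in raw.values()]
        low, high = min(values), max(values)
        for name, v in raw.items():
            # Min-max rescaling; a dimension with no spread contributes nothing.
            scores[name] += (v[dim] - low) / (high - low) if high > low else 0.0
    return scores

# Hypothetical purchase histories: (purchase date, amount).
customers = {
    "alice": [(date(2019, 8, 1), 120), (date(2019, 7, 15), 80), (date(2019, 6, 20), 60)],
    "bob": [(date(2019, 3, 2), 40)],
    "carol": [(date(2019, 7, 30), 200), (date(2019, 5, 5), 30)],
}

scores = loyalty_scores(customers, today=date(2019, 8, 17))
for name, score in sorted(scores.items(), key=lambda item: -item[1]):
    print(name, round(score, 2))
```

Because the score is relative, adding or removing customers changes everyone's value; only the ordering within one batch is meaningful.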

To help banks retain credit card customers, data mining is applied. By analyzing past data, data mining can help banks predict which customers are likely to change their credit card affiliation, so that they can plan and launch different special offers to retain those customers.

Credit card spending by customer groups can be identified by using data mining.

The hidden correlations between different financial indicators can be discovered by using data mining.

From historical market data, data mining makes it possible to identify stock trading rules.

Data Mining Applications in Health Care and Insurance

The growth of the insurance industry entirely depends on the ability to convert data into knowledge, information or intelligence about customers, competitors, and its markets. Data mining has been applied in the insurance industry only recently, but it has brought tremendous competitive advantages to the companies that have implemented it successfully. The data mining applications in the insurance industry are listed below.

Data mining is applied in claims analysis, such as identifying which medical procedures are claimed together. Data mining makes it possible to forecast which customers will potentially purchase new policies. Data mining allows insurance companies to detect risky customers' behavior patterns. Data mining helps detect fraudulent behavior.

Data Mining Applications in Transportation

Data mining helps determine the distribution schedules among warehouses and outlets and analyze loading patterns.

Data Mining Applications in Medicine

Data mining makes it possible to characterize patient activities to see incoming office visits. Data mining helps identify the patterns of successful medical therapies for different illnesses. Data mining applications are continuously developing in various industries to provide more hidden knowledge that increases business efficiency and grows businesses.

Financial Data Analysis

Financial data in the financial industry is generally reliable and of high quality, which facilitates systematic data analysis and data mining. Some of the typical cases are as follows:

• Design and construction of data warehouses for multidimensional data analysis and data mining.

• Loan payment prediction and customer credit policy analysis.

• Classification and clustering of customers for targeted marketing.

• Detection of money laundering and other financial crimes.

Retail Industry

Data mining has great application in the retail industry because the industry collects large amounts of data on sales, customer purchasing history, goods transportation, consumption and services. It is natural that the quantity of data collected will continue to expand rapidly because of the increasing ease, availability and popularity of the web.

Data mining in the retail industry helps in identifying customer buying patterns and trends that lead to improved quality of customer service and good customer retention and satisfaction. Here is a list of examples of data mining in the retail industry:

• Design and construction of data warehouses based on the benefits of data mining.

• Multidimensional analysis of sales, customers, products, time and region.

• Analysis of effectiveness of sales campaigns.

• Customer retention.

• Product recommendation and cross-referencing of items.

Telecommunication Industry

Today the telecommunication industry is one of the fastest-emerging industries, providing various services such as fax, pager, cellular phone, internet messenger, images, e-mail, web data transmission, etc. Due to the development of new computer and communication technologies, the telecommunication industry is rapidly expanding. This is the reason why data mining has become very important to help understand the business.

Data mining in the telecommunication industry helps in identifying telecommunication patterns, catching fraudulent activities, making better use of resources, and improving quality of service. Here is a list of examples for which data mining improves telecommunication services:

• Multidimensional analysis of telecommunication data.

• Fraudulent pattern analysis.

• Identification of unusual patterns.

• Multidimensional association and sequential patterns analysis.

• Mobile telecommunication services.

• Use of visualization tools in telecommunication data analysis.

Biological Data Analysis

In recent times, we have seen tremendous growth in fields of biology such as genomics, proteomics, functional genomics and biomedical research. Biological data mining is a very important part of bioinformatics. The following are the aspects in which data mining contributes to biological data analysis:

• Semantic integration of heterogeneous, distributed genomic and proteomic databases.

• Alignment, indexing, similarity search and comparative analysis of multiple nucleotide sequences.

• Discovery of structural patterns and analysis of genetic networks and protein pathways.

• Association and path analysis.

• Visualization tools in genetic data analysis.

Other Scientific Applications

The applications discussed above tend to handle relatively small and homogeneous data sets for which statistical techniques are appropriate. Huge amounts of data have been collected from scientific domains such as geosciences, astronomy, etc. A large number of data sets are being generated because of fast numerical simulations in various fields such as climate and ecosystem modeling, chemical engineering, fluid dynamics, etc. The following are the applications of data mining in the field of scientific applications:

• Data warehouses and data preprocessing.

• Graph-based mining.

• Visualization and domain-specific knowledge.



Advantages and Disadvantages of Data Mining

Data mining is an important part of the knowledge discovery process, through which we can analyze an enormous set of data and obtain hidden and useful knowledge. Data mining is applied effectively not only in the business environment but also in other fields such as weather forecasting, medicine, transportation, healthcare, insurance, government



Security is a big issue. Businesses own information about their employees and customers, including social security numbers, birthdays, payroll, etc. However, how properly this information is taken care of is still in question. There have been many cases in which hackers accessed and stole large amounts of customer data from big corporations such as Ford Motor Credit Company and Sony. With so much personal and financial information available, credit card theft and identity theft have become a big problem.

Misuse of information/inaccurate information

Information collected through data mining for ethical purposes can be misused. This information may be exploited by unethical people or businesses to take advantage of vulnerable people or to discriminate against a group of people.

In addition, data mining techniques are not perfectly accurate. Therefore, if inaccurate information is used for decision-making, it will cause serious consequences.



Conclusion

Data mining is a tool that is used by governments and corporations to predict and establish trends with specific purposes in mind (Alexander, 2006). The TIA and ADVISE use data mining as an anti-terrorist measure by looking for specific data to identify the terrorists before a terrorist attack (Solove, 2004; Birrer, 2005). Corporations use data mining to examine buying patterns and predict future trends (Cullen, 2005). Amazon.com uses it to promote its sales by pre-selecting items, using data mining algorithms (Arnold, 2001). Libraries use data mining to become more efficient in developing their collections and management of staff (Cummins, 2006).

There are privacy risks in data mining (Birrer, 2005). Governments recognize the privacy risks involved with data mining, and have developed legislation to provide recourse to individuals against potential privacy violations (PIPEDA, 2004; "Security views", 2006). Legislation like Canada's PIPEDA establishes guidelines that corporations and government agencies must adhere to when they are collecting personal information (PIPEDA, 2004). Likewise, corporations that are developing data mining software are sensitive to privacy concerns, and are building into their software measures to limit how much personal information is collected (Wait, 2005).

Your personal information is valuable, and you give away more of it than you realize (Kumagai & Cherry, 2004). There are many uses for information, and someone will always be thinking of a different way to use it (Kumagai & Cherry, 2004). Data mining is too valuable a tool for governments and corporations to abandon (Birrer, 2005; Kumagai & Cherry, 2004; Lipowicz, 2006). The key is being conscious of how you give away your personal information, and protecting your personal privacy (Larkin, 2006). Opt out of sharing your data with the companies you deal with (Larkin, 2006). Ask your telephone provider to remove call details from your account (Larkin, 2006). Read the privacy policies of websites and organizations you visit (Larkin, 2006). By doing these small things, you protect yourself and your personal information from potential misuse (Larkin, 2006).

