data mining self study report
8/17/2019 Data mining self study report
Self-Study seminar
Topic :- Data mining
Submitted by :-
Akash Shinde2k14/S/!"
ACKNOWLEDGEMENT
I would like to express my greatest gratitude to the people who have helped and supported me throughout this Self-Study Seminar Report.
I am grateful to my guide, Ms. Ruchika Malhotra, for her continuous support for the project, from initial advice and contacts in the early stages of conceptual inception, through ongoing advice and encouragement to this day.
I wish to thank them for their undivided support and interest, which inspired and encouraged me to go the right way.
At last, but not the least, I want to thank my friends who appreciated me for my work and motivated me, and finally God, who made all the things possible.
Akash Shinde
Contents
• Abstract
• Introduction
• Data Mining Algorithms
• Applications
• Advantages and Disadvantages
• Conclusion
Abstract
Mining information and knowledge from large databases has been recognized by many researchers as a key research topic in database systems and machine learning, and by many industrial companies as an important area with an opportunity of major revenues. Researchers in many different fields have shown great interest in data mining. Several emerging applications in information-providing services, such as data warehousing and online services over the Internet, also call for various data mining techniques to better understand user behavior, to improve the service provided and to increase business opportunities. In response to such a demand, this article provides a survey, from a database researcher's point of view, on the data mining techniques developed recently. A classification of the available data mining techniques is provided and a comparative study of such techniques is presented.
Introduction
Data mining is an interdisciplinary subfield of computer science. It is the computational process of discovering patterns in large data sets involving methods at the intersection of artificial intelligence, machine learning, statistics, and database systems. The overall goal of the data mining process is to extract information from a data set and transform it into an understandable structure for further use. Aside from the raw analysis step, it involves database and data management aspects, data pre-processing, model and inference considerations, interestingness metrics, complexity considerations, post-processing of discovered structures, visualization, and online updating. Data mining is the analysis step of the "knowledge discovery in databases" process, or KDD.
The term is a misnomer, because the goal is the extraction of patterns and knowledge from large amounts of data, not the extraction (mining) of data itself. It also is a buzzword and is frequently applied to any form of large-scale data or information processing (collection, extraction, warehousing, analysis, and statistics) as well as any application of computer decision support systems, including artificial intelligence, machine learning, and business intelligence. The book Data mining: Practical machine learning tools and techniques with Java (which covers mostly machine learning material) was originally to be named just Practical machine learning, and the term data mining was only added for marketing reasons. Often the more general terms (large scale) data analysis and analytics, or, when referring to actual methods, artificial intelligence and machine learning, are more appropriate.
The actual data mining task is the automatic or semi-automatic analysis of large quantities of data to extract previously unknown, interesting patterns such as groups of data records (cluster analysis), unusual records (anomaly detection), and dependencies (association rule mining). This usually involves using database techniques such as spatial indices. These patterns can then be seen as a kind of summary of the input data, and may be used in further analysis or, for example, in machine learning and predictive analytics. For example, the data mining step might identify multiple groups in the data, which can then be used to obtain more accurate prediction results by a decision support system. Neither the data collection, data preparation, nor result interpretation and reporting is part of the data mining step, but they do belong to the overall KDD process as additional steps.
The related terms data dredging, data fishing, and data snooping refer to the use of data mining methods to sample parts of a larger population data set that are (or may be) too small for reliable statistical inferences to be made about the validity of any patterns discovered. These methods can, however, be used in creating new hypotheses to test against the larger data populations.
Data Mining Algorithms
Clustering Algorithm
The Microsoft Clustering algorithm is a segmentation algorithm provided by Analysis Services. The algorithm uses iterative techniques to group cases in a dataset into clusters that contain similar characteristics. These groupings are useful for exploring data, identifying anomalies in the data, and creating predictions.
Clustering models identify relationships in a dataset that you might not logically derive through casual observation. For example, you can logically discern that people who commute to their jobs by bicycle do not typically live a long distance from where they work. The algorithm, however, can find other characteristics about bicycle commuters that are not as obvious. In the following diagram, cluster A represents data about people who tend to drive to work, while cluster B represents data about people who tend to ride bicycles to work.
The clustering algorithm differs from other data mining algorithms, such as the Microsoft Decision Trees algorithm, in that you do not have to designate a predictable column to be able to build a clustering model. The clustering algorithm trains the model strictly from the relationships that exist in the data and from the clusters that the algorithm identifies.
Example
Consider a group of people who share similar demographic information and who buy similar products from the Adventure Works company. This group of people represents a cluster of data. Several such clusters may exist in a database. By observing the columns that make up a cluster, you can more clearly see how records in a dataset are related to one another.
How the Algorithm Works
The Microsoft Clustering algorithm first identifies relationships in a dataset and generates a series of clusters based on those relationships. A scatter plot is a useful way to visually represent how the algorithm groups data, as shown in the following diagram. The scatter plot represents all the cases in the dataset, and each case is a point on the graph. The clusters group points on the graph and illustrate the relationships that the algorithm identifies.
After first defining the clusters, the algorithm calculates how well the clusters represent groupings of the points, and then tries to redefine the groupings to create clusters that better represent the data.
The algorithm iterates through this process until it cannot improve the results any further by redefining the clusters.
You can customize the way the algorithm works by specifying a clustering technique, limiting the maximum number of clusters, or changing the amount of support required to create a cluster.
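The define, evaluate and redefine loop described above can be illustrated with a minimal k-means sketch. This is not the Microsoft Clustering implementation, which the text describes only at a high level; the function name `kmeans`, the Euclidean distance measure and the sample cases are assumptions made for illustration.

```python
import math

def kmeans(points, centers, max_iter=100):
    """Assign each point to its nearest center, then move each center to the
    mean of its points; stop when the assignments no longer change."""
    assignment = None
    for _ in range(max_iter):
        new_assignment = [
            min(range(len(centers)), key=lambda c: math.dist(p, centers[c]))
            for p in points
        ]
        if new_assignment == assignment:
            break  # redefining the clusters no longer changes the grouping
        assignment = new_assignment
        for c in range(len(centers)):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # keep the old center if a cluster ends up empty
                centers[c] = tuple(sum(v) / len(members) for v in zip(*members))
    return assignment, centers

# Hypothetical cases: (commute distance in miles, age)
cases = [(1.0, 25), (2.0, 30), (1.5, 27), (20.0, 45), (25.0, 50), (22.0, 48)]
labels, centers = kmeans(cases, centers=[cases[0], cases[3]])
print(labels)
print(centers)
```

With these sample cases the short-commute and long-commute records end up in separate clusters. Limiting the maximum number of clusters corresponds here to the number of starting centers that are passed in.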
Data Required for Clustering Models
When you prepare data for use in training a clustering model, you should understand the requirements for the particular algorithm, including how much data is needed, and how the data is used.
The requirements for a clustering model are as follows:
• A single key column: Each model must contain one numeric or text column that uniquely identifies each record. Compound keys are not allowed.
• Input columns: Each model must contain at least one input column that contains the values that are used to build the clusters. You can have as many input columns as you want, but depending on the number of values in each column, the addition of extra columns can increase the time it takes to train the model.
• Optional predictable column: The algorithm does not need a predictable column to build the model, but you can add a predictable column of almost any data type. The values of the predictable column can be treated as input to the clustering model, or you can specify that it be used for prediction only. For example, if you want to predict customer income by clustering on demographics such as region or age, you would specify income as PredictOnly and add all the other columns, such as region or age, as inputs.
Association Algorithm
The Microsoft Association algorithm is an association algorithm provided by Analysis Services that is useful for recommendation engines. A recommendation engine recommends products to customers based on items they have already bought, or in which they have indicated an interest. The Microsoft Association algorithm is also useful for market basket analysis.
Association models are built on datasets that contain identifiers both for individual cases and for the items that the cases contain. A group of items in a case is called an itemset. An association model consists of a series of itemsets and the rules that describe how those items are grouped together within the cases. The rules that the algorithm identifies can be used to predict a customer's likely future purchases, based on the items that already exist in the customer's shopping cart. The following diagram shows a series of rules in an itemset.
As the diagram illustrates, the Microsoft Association algorithm can potentially find many rules within a dataset. The algorithm uses two parameters, support and probability, to describe the itemsets and rules that it generates. For example, if X and Y represent two items that could be in a shopping cart, the support parameter is the number of cases in the dataset that contain the combination of items, X and Y. By using the support parameter in combination with the user-defined parameters, MINIMUM_SUPPORT and MAXIMUM_SUPPORT, the algorithm controls the number of itemsets that are generated. The probability parameter, also named confidence, represents the fraction of cases in the dataset that contain X and that also contain Y. By using the probability parameter in combination with the MINIMUM_PROBABILITY parameter, the algorithm controls the number of rules that are generated.
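The two measures defined above can be computed directly from a list of cases. The sketch below is only an illustration of the definitions; the function names and the sample carts are hypothetical and are not part of Analysis Services.

```python
def support(cases, items):
    """Number of cases that contain every item in `items`."""
    return sum(1 for case in cases if items <= case)

def probability(cases, x, y):
    """Fraction of the cases containing X that also contain Y (confidence)."""
    with_x = support(cases, {x})
    return support(cases, {x, y}) / with_x if with_x else 0.0

# Hypothetical shopping carts, one set of items per case
carts = [
    {"bike", "helmet"},
    {"bike", "helmet", "bottle"},
    {"bike"},
    {"helmet", "bottle"},
]
print(support(carts, {"bike", "helmet"}))     # 2
print(probability(carts, "bike", "helmet"))   # 2 of the 3 carts with a bike
```

Note that probability is directional: the value for "bike, then helmet" generally differs from the value for "helmet, then bike", because the denominator changes.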
Example
The Adventure Works Cycle company is redesigning the functionality of its Web site. The goal of the redesign is to increase sell-through of products. Because the company records each sale in a transactional database, they can use the Microsoft Association algorithm to identify sets of products that tend to be purchased together. They can then predict additional items that a customer might be interested in, based on items that are already in the customer's shopping basket.
How the Algorithm Works
The Microsoft Association algorithm traverses a dataset to find items that appear together in a case. The algorithm then groups into itemsets any associated items that appear, at a minimum, in the number of cases that are specified by the MINIMUM_SUPPORT parameter. For example, an itemset could be "Mountain 200=Existing, Sport 100=Existing", and could have a support of 710.
The algorithm then generates rules from the itemsets. These rules are used to predict the presence of an item in the database, based on the presence of other specific items that the algorithm identifies as important. For example, a rule could be "if Touring 1000=existing and Road bottle cage=existing, then Water bottle=existing", and could have a probability of 0.812. In this example, the algorithm identifies that the presence in the basket of the Touring 1000 tire and the water bottle cage predicts that a water bottle would also likely be in the basket.
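The two stages described above, itemsets filtered by a minimum support and then rules filtered by a minimum probability, can be sketched as follows. This is a brute-force illustration for small data, not the algorithm Analysis Services uses internally; the function names, the `max_size` limit and the sample baskets are assumptions.

```python
from itertools import combinations

def frequent_itemsets(cases, minimum_support, max_size=3):
    """Return {itemset: support} for itemsets that appear in at least
    `minimum_support` cases (exhaustive counting, fine for small data)."""
    counts = {}
    for case in cases:
        for size in range(1, max_size + 1):
            for combo in combinations(sorted(case), size):
                key = frozenset(combo)
                counts[key] = counts.get(key, 0) + 1
    return {s: n for s, n in counts.items() if n >= minimum_support}

def rules(itemsets, minimum_probability):
    """Yield (antecedent, consequent, probability) for every itemset split
    into 'all items but one' => 'the remaining item'."""
    for itemset, n in itemsets.items():
        if len(itemset) < 2:
            continue
        for item in itemset:
            antecedent = itemset - {item}
            p = n / itemsets[antecedent]  # a subset is at least as frequent
            if p >= minimum_probability:
                yield antecedent, item, p

# Hypothetical baskets using the product names from the example above
baskets = [
    {"Touring 1000", "Road bottle cage", "Water bottle"},
    {"Touring 1000", "Road bottle cage", "Water bottle"},
    {"Touring 1000", "Road bottle cage"},
    {"Water bottle"},
]
found = frequent_itemsets(baskets, minimum_support=2)
for antecedent, item, p in rules(found, minimum_probability=0.6):
    print(sorted(antecedent), "=>", item, round(p, 3))
```

Raising `minimum_support` shrinks the set of itemsets, and raising `minimum_probability` shrinks the set of rules, which mirrors the role the text assigns to MINIMUM_SUPPORT and MINIMUM_PROBABILITY.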
Data Required for Association Models
When you prepare data for use in an association rules model, you should understand the requirements for the particular algorithm, including how much data is needed, and how the data is used.
The requirements for an association rules model are as follows:
• A single key column: Each model must contain one numeric or text column that uniquely identifies each record. Compound keys are not permitted.
• A single predictable column: An association model can have only one predictable column. Typically it is the key column of the nested table, such as the field that lists the products that were purchased. The values must be discrete or discretized.
• Input columns: The input columns must be discrete. The input data for an association model often is contained in two tables. For example, one table might contain customer information while another table contains customer purchases. You can input this data into the model by using a nested table.
Linear Regression Algorithm
The Linear Regression algorithm is a variation of the Microsoft Decision Trees algorithm that helps you calculate a linear relationship between a dependent and independent variable, and then use that relationship for prediction.
The relationship takes the form of an equation for a line that best represents a series of data. For example, the line in the following diagram is the best possible linear representation of the data.
Each data point in the diagram has an error associated with its distance from the regression line. The coefficients a and b in the regression equation adjust the angle and location of the regression line. You can obtain the regression equation by adjusting a and b until the sum of the errors that are associated with all the points reaches its minimum.
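Rather than adjusting a and b by trial and error, the minimum can be computed in closed form. The sketch below assumes that the regression equation has the form y = a*x + b and that the error of a point is its squared vertical distance from the line (ordinary least squares); neither detail is stated in the text above, and the sales figures are made up.

```python
def fit_line(xs, ys):
    """Closed-form least-squares fit of y = a*x + b."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    a = sxy / sxx            # slope: sets the angle of the line
    b = mean_y - a * mean_x  # intercept: sets the location of the line
    return a, b

def sum_squared_error(xs, ys, a, b):
    """Sum of the squared vertical distances between the points and the line."""
    return sum((y - (a * x + b)) ** 2 for x, y in zip(xs, ys))

# Hypothetical monthly sales figures
months = [1, 2, 3, 4, 5]
sales = [12.0, 15.0, 19.0, 20.0, 24.0]
a, b = fit_line(months, sales)
print(a, b, sum_squared_error(months, sales, a, b))
```

For this data the fitted slope is about 2.9 and the intercept about 9.3. Nudging either coefficient away from these values increases the summed error, which is the sense in which the line is the best linear representation.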
There are other kinds of regression that use multiple variables, and also nonlinear methods of regression. However, linear regression is a useful and well-known method for modeling a response to a change in some underlying factor.
Example
You can use linear regression to determine a relationship between two continuous columns. For example, you can use linear regression to compute a trend line from manufacturing or sales data. You could also use linear regression as a precursor to the development of more complex data mining models, to assess the relationships among data columns.
Although there are many ways to compute linear regression that do not require data mining tools, the advantage of using the Microsoft Linear Regression algorithm for this task is that all the possible relationships among the variables are automatically computed and tested. You do not have to select a computation method, such as solving for least squares. However, linear regression might oversimplify the relationships in scenarios where multiple factors affect the outcome.
How the Algorithm Works
The Microsoft Linear Regression algorithm is a variation of the Microsoft Decision Trees algorithm. When you select the Microsoft Linear Regression algorithm, a special case of the Microsoft Decision Trees algorithm is invoked, with parameters that constrain the behavior of the algorithm and require certain input data types. Moreover, in a linear regression model, the whole data set is used for computing relationships in the initial pass, whereas a standard decision trees model splits the data repeatedly into smaller subsets or trees.
Data Required for Linear Regression Models
When you prepare data for use in a linear regression model, you should understand the requirements for the particular algorithm. This includes how much data is needed, and how the data is used. The requirements for this model type are as follows:
• A single key column: Each model must contain one numeric or text column that uniquely identifies each record. Compound keys are not permitted.
probability that they will not buy a bike is 0.287. In this example, the algorithm uses the numeric information, derived from customer characteristics (such as commute distance), to predict whether a customer will buy a bike.
Data Required for Naive Bayes Models
When you prepare data for use in training a Naive Bayes model, you should understand the requirements for the algorithm, including how much data is needed, and how the data is used.
The requirements for a Naive Bayes model are as follows:
• A single key column: Each model must contain one numeric or text column that uniquely identifies each record. Compound keys are not allowed.
• Input columns: In a Naive Bayes model, all columns must be either discrete or discretized columns.
For a Naive Bayes model, it is also important to ensure that the input attributes are independent of each other. This is particularly important when you use the model for prediction.
The reason is that, if you use two columns of data that are already closely related, the effect would be to multiply the influence of those columns, which can obscure other factors that influence the outcome.
Conversely, the ability of the algorithm to identify correlations among variables is useful when you are exploring a model or dataset, to identify relationships among inputs.
• At least one predictable column: The predictable attribute must contain discrete or discretized values.
The values of the predictable column can be treated as inputs. This practice can be useful when you are exploring a new dataset, to find relationships among the columns.
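The requirements above (discrete inputs, a discrete predictable column, and inputs treated as independent) map onto a small counting model. The following is a generic Naive Bayes sketch and not the Analysis Services implementation; the column names, the sample customers and the add-one smoothing are assumptions made for illustration.

```python
from collections import Counter, defaultdict

def train(rows, target):
    """Count class frequencies, per-class value frequencies and column states."""
    class_counts = Counter(row[target] for row in rows)
    value_counts = defaultdict(Counter)  # (column, class) -> value counts
    states = defaultdict(set)            # column -> distinct values seen
    for row in rows:
        for column, value in row.items():
            if column != target:
                value_counts[(column, row[target])][value] += 1
                states[column].add(value)
    return class_counts, value_counts, states

def predict(model, case):
    """Return normalized class probabilities, multiplying one factor per input
    column as if the columns were independent of each other."""
    class_counts, value_counts, states = model
    total = sum(class_counts.values())
    scores = {}
    for cls, n in class_counts.items():
        score = n / total
        for column, value in case.items():
            # add-one smoothing so an unseen value does not zero the product
            seen = value_counts[(column, cls)][value]
            score *= (seen + 1) / (n + len(states[column]))
        scores[cls] = score
    norm = sum(scores.values())
    return {cls: s / norm for cls, s in scores.items()}

# Hypothetical, already discretized customer records
customers = [
    {"commute": "0-1 miles", "age": "young", "buyer": "yes"},
    {"commute": "0-1 miles", "age": "young", "buyer": "yes"},
    {"commute": "1-2 miles", "age": "old", "buyer": "yes"},
    {"commute": "10+ miles", "age": "old", "buyer": "no"},
    {"commute": "10+ miles", "age": "young", "buyer": "no"},
]
model = train(customers, target="buyer")
result = predict(model, {"commute": "0-1 miles", "age": "young"})
print(result)
```

Because each column contributes its own factor to the product, adding a second column that nearly duplicates an existing one would count the same evidence twice, which is the multiplication of influence that the text warns about.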
Decision Trees Algorithm
The Decision Trees algorithm is a classification and regression algorithm provided by Microsoft SQL Server Analysis Services for use in predictive modeling of both discrete and continuous attributes.
For discrete attributes, the algorithm makes predictions based on the relationships between input columns in a dataset. It uses the values, known as states, of those columns to predict the states of a column that you designate as predictable. Specifically, the algorithm identifies the input columns that are correlated with the predictable column. For example, in a scenario to predict which customers are likely to purchase a bicycle, if nine out of ten younger customers buy a bicycle, but only two out of ten older customers do so, the algorithm infers that age is a good predictor of bicycle purchase. The decision tree makes predictions based on this tendency toward a particular outcome.
For continuous attributes, the algorithm uses linear regression to determine where a decision tree splits.
If more than one column is set to predictable, or if the input data contains a nested table that is set to predictable, the algorithm builds a separate decision tree for each predictable column.
Example
The marketing department of the Adventure Works Cycles company wants to identify the characteristics of previous customers that might indicate whether those customers are likely to buy a product in the future. The AdventureWorks2012 database stores demographic information that describes previous customers. By using the Microsoft Decision Trees algorithm to analyze this information, the marketing department can build a model that predicts whether a particular customer will purchase products, based on the states of known columns about that customer, such as demographics or past buying patterns.
How the Algorithm Works
The Microsoft Decision Trees algorithm builds a data mining model by creating a series of splits in the tree. These splits are represented as nodes. The algorithm adds a node to the model every time that an input column is found to be significantly correlated with the predictable column. The way that the algorithm determines a split is different depending on whether it is predicting a continuous column or a discrete column.
The Microsoft Decision Trees algorithm uses feature selection to guide the selection of the most useful attributes. Feature selection is used by all Analysis Services data mining algorithms to improve performance and the quality of analysis. Feature selection is important to prevent unimportant attributes from using processor time. If you use too many input or predictable attributes when you design a data mining model, the model can take a very long time to process, or even run out of memory. Methods used to determine whether to split the tree include industry-standard metrics for entropy and Bayesian networks.
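As an illustration of an entropy-based split metric, the sketch below scores the age split from the bicycle example given earlier (nine of ten younger customers buy, two of ten older customers buy). Information gain is one common entropy metric; the text does not say exactly which score Analysis Services computes, so the function names and the choice of metric are assumptions.

```python
import math

def entropy(counts):
    """Shannon entropy, in bits, of a list of class counts."""
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c)

def information_gain(parent, children):
    """Entropy of the parent minus the size-weighted entropy of the children."""
    total = sum(parent)
    weighted = sum(sum(child) / total * entropy(child) for child in children)
    return entropy(parent) - weighted

# Counts are [buyers, non-buyers]
younger, older = [9, 1], [2, 8]
overall = [11, 9]
gain = information_gain(overall, [younger, older])
print(round(gain, 3))
```

The gain for this split is roughly 0.40 bits, while a column whose groups have the same buyer ratio as the overall population would score zero. A tree builder would prefer the column with the higher score when deciding where to add a node.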
A common problem in data mining models is that the model becomes too sensitive to small differences in the training data, in which case it is said to be over-fitted or over-trained. An overfitted model cannot be generalized to other data sets. To avoid overfitting on any particular set of data, the Microsoft Decision Trees algorithm uses techniques for controlling the growth of the tree.
Predicting Discrete Columns
The way that the Microsoft Decision Trees algorithm builds a tree for a discrete predictable column can be demonstrated by using a histogram. The following diagram shows a histogram that plots a predictable column, Bike Buyers, against an input column, Age. The histogram shows that the age of a person helps distinguish whether that person will purchase a bicycle.
The correlation that is shown in the diagram would cause the Microsoft Decision Trees algorithm to create a new node in the model.
As the algorithm adds new nodes to a model, a tree structure is formed. The top node of the tree describes the breakdown of the predictable column for the overall population of customers. As the model continues to grow, the algorithm considers all columns.
Predicting Continuous Columns
When the Microsoft Decision Trees algorithm builds a tree based on a continuous predictable column, each node contains a regression formula. A split occurs at a point of non-linearity in the regression formula. For example, consider the following diagram.
The diagram contains data that can be modeled either by using a single line or by using two connected lines. However, a single line would do a poor job of representing the data. Instead, if you use two lines, the model will do a much better job of approximating the data. The point where the two lines come together is the point of non-linearity, and is the point where a node in a decision tree model would split. For example, the node that corresponds to the point of non-linearity in the previous graph could be represented by the following diagram.
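One simple way to locate such a point of non-linearity is to try every split position, fit a separate line to each side, and keep the position where the combined error is smallest. The sketch below does exactly that. The exhaustive search, the function names and the sample data are assumptions for illustration; the text does not describe how Analysis Services searches for the split.

```python
def fit_line(points):
    """Least-squares line through (x, y) points; returns (slope, intercept, sse)."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx if sxx else 0.0
    intercept = mean_y - slope * mean_x
    sse = sum((y - (slope * x + intercept)) ** 2 for x, y in points)
    return slope, intercept, sse

def best_split(points, min_size=2):
    """Try every split position and keep the one whose two lines fit best.
    Returns (x value where the right-hand segment starts, combined error)."""
    points = sorted(points)
    best = None
    for i in range(min_size, len(points) - min_size + 1):
        total = fit_line(points[:i])[2] + fit_line(points[i:])[2]
        if best is None or total < best[1]:
            best = (points[i][0], total)
    return best

# Hypothetical data: slope 1 up to x = 4, then slope 3 from x = 5 onward
data = [(x, float(x)) for x in range(5)] + [(x, 3.0 * x - 9) for x in range(5, 10)]
split_x, error = best_split(data)
print(split_x, error, fit_line(data)[2])
```

For this data the single-line fit leaves a clearly larger error than the two-line fit, and the best split falls where the slope changes, which is where a tree node would be created under this scheme.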
Data Required for Decision Tree Models
When you prepare data for use in a decision trees model, you should understand the requirements for the particular algorithm, including how much data is needed, and how the data is used.
The requirements for a decision trees model are as follows:
• A single key column: Each model must contain one numeric or text column that uniquely identifies each record. Compound keys are not permitted.
• A predictable column: Requires at least one predictable column. You can include multiple predictable attributes in a model, and the predictable attributes can be of different types, either numeric or discrete. However, increasing the number of predictable attributes can increase processing time.
• Input columns: Requires input columns, which can be discrete or continuous. Increasing the number of input attributes affects processing time.
Data Mining Applications
Data mining is a process that analyzes a large amount of data to find new and hidden information that improves business efficiency. Various industries have been adopting data mining in their mission-critical business processes to gain competitive advantages and help the business grow. This section illustrates some data mining applications in sales/marketing, banking/finance, healthcare and insurance, transportation and medicine.
Data Mining Applications in Sales/Marketing
Data mining enables businesses to understand the hidden patterns inside historical purchasing transaction data, thus helping in planning and launching new marketing campaigns in a prompt and cost-effective way. The following illustrates several data mining applications in sales and marketing.
Data mining is used for market basket analysis to provide information on which product combinations were purchased together, when they were bought and in what sequence. This information helps businesses promote their most profitable products and maximize profit. In addition, it encourages customers to purchase related products that they may have missed or overlooked.
Retail companies use data mining to identify customers' buying behavior patterns.
Data Mining Applications in Banking / Finance
Several data mining techniques, e.g., distributed data mining, have been researched, modeled and developed to help credit card fraud detection.
Data mining is used to identify customer loyalty by analyzing data on customers' purchasing activities, such as the frequency of purchase in a period of time, the total monetary value of all purchases and when the last purchase was made. After analyzing those dimensions, a relative measure is generated for each customer. The higher the score, the more loyal the customer is.
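The three dimensions named above (how recently, how often and how much a customer buys) can be combined into a relative score in many ways. The sketch below uses a simple sum of ranks; this scoring rule, the function name and the purchase history are assumptions, since the text does not specify how the relative measure is calculated.

```python
from datetime import date

def loyalty_scores(purchases, today):
    """purchases: {customer: [(purchase_date, amount), ...]}.
    Returns {customer: score}; each of the three dimensions contributes a rank
    from 1 (worst) to N (best), so a higher total means a more loyal customer."""
    stats = {
        customer: (
            -(today - max(d for d, _ in rows)).days,  # recency: newer is better
            len(rows),                                # frequency of purchase
            sum(amount for _, amount in rows),        # total monetary value
        )
        for customer, rows in purchases.items()
    }
    scores = dict.fromkeys(stats, 0)
    for dim in range(3):
        ranked = sorted(stats, key=lambda c: stats[c][dim])
        for rank, customer in enumerate(ranked, start=1):
            scores[customer] += rank
    return scores

# Hypothetical purchase history
history = {
    "alice": [(date(2019, 8, 1), 120.0), (date(2019, 7, 15), 80.0), (date(2019, 6, 2), 60.0)],
    "bob": [(date(2019, 1, 10), 40.0)],
    "carol": [(date(2019, 7, 20), 300.0), (date(2019, 3, 5), 50.0)],
}
scores = loyalty_scores(history, today=date(2019, 8, 17))
print(scores)
```

In this sample the customer who bought most recently and most often ranks highest overall, even though another customer spent more in total. Ties are broken by input order here; a production scoring scheme would need an explicit rule for them.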
Data mining is applied to help banks retain credit card customers. By analyzing past data, data mining can help banks predict which customers are likely to change their credit card affiliation, so they can plan and launch different special offers to retain those customers.
Credit card spending by customer groups can be identified by using data mining.
The hidden correlations between different financial indicators can be discovered by using data mining.
From historical market data, data mining makes it possible to identify stock trading rules.
Data Mining Applications in Health Care and Insurance
The growth of the insurance industry depends entirely on the ability to convert data into knowledge, information or intelligence about customers, competitors, and its markets. Data mining has been applied in the insurance industry only lately, but it has brought tremendous competitive advantages to the companies that have implemented it successfully. The data mining applications in the insurance industry are listed below.
Data mining is applied in claims analysis, such as identifying which medical procedures are claimed together. Data mining makes it possible to forecast which customers will potentially purchase new policies. Data mining allows insurance companies to detect risky customers' behavior patterns. Data mining helps detect fraudulent behavior.
Data Mining Applications in Transportation
Data mining helps determine the distribution schedules among warehouses and outlets and analyze loading patterns.
Data Mining Applications in Medicine
Data mining makes it possible to characterize patient activities in order to foresee incoming office visits. Data mining helps identify the patterns of successful medical therapies for different illnesses. Data mining applications are continuously developing in various industries to provide more hidden knowledge that increases business efficiency and grows businesses.
Financial Data Analysis
The financial data in the financial industry is generally reliable and of high quality, which facilitates systematic data analysis and data mining. Some of the typical cases are as follows:
• Design and construction of data warehouses for multidimensional data analysis and data mining.
• Loan payment prediction and customer credit policy analysis.
• Classification and clustering of customers for targeted marketing.
• Detection of money laundering and other financial crimes.
Retail Industry
Data mining has great application in the retail industry because the industry collects large amounts of data on sales, customer purchasing history, goods transportation, consumption and services. It is natural that the quantity of data collected will continue to expand rapidly because of the increasing ease, availability and popularity of the web.
Data mining in the retail industry helps in identifying customer buying patterns and trends that lead to improved quality of customer service and good customer retention and satisfaction. Here is the list of examples of data mining in the retail industry:
• Design and construction of data warehouses based on the benefits of data mining.
• Multidimensional analysis of sales, customers, products, time and region.
• Analysis of effectiveness of sales campaigns.
• Customer retention.
• Product recommendation and cross-referencing of items.
Telecommunication Industry
Today the telecommunication industry is one of the fastest-emerging industries, providing various services such as fax, pager, cellular phone, internet messenger, images, e-mail, web data transmission, etc. Due to the development of new computer and communication technologies, the telecommunication industry is rapidly expanding. This is the reason why data mining has become very important to help understand the business.
Data mining in the telecommunication industry helps in identifying telecommunication patterns, catching fraudulent activities, making better use of resources, and improving quality of service. Here is the list of examples for which data mining improves telecommunication services:
• Multidimensional analysis of telecommunication data.
• Fraudulent pattern analysis.
• Identification of unusual patterns.
• Multidimensional association and sequential patterns analysis.
• Mobile telecommunication services.
• Use of visualization tools in telecommunication data analysis.
Biological Data Analysis
In recent times, we have seen tremendous growth in fields of biology such as genomics, proteomics, functional genomics and biomedical research. Biological data mining is a very important part of bioinformatics. Following are the aspects in which data mining contributes to biological data analysis:
• Semantic integration of heterogeneous, distributed genomic and proteomic databases.
• Alignment, indexing, similarity search and comparative analysis of multiple nucleotide sequences.
• Discovery of structural patterns and analysis of genetic networks and protein pathways.
• Association and path analysis.
• Visualization tools in genetic data analysis.
Other Scientific Applications
The applications discussed above tend to handle relatively small and homogeneous data sets for which the statistical techniques are appropriate. Huge amounts of data have been collected from scientific domains such as geosciences, astronomy, etc. A large number of data sets are being generated because of fast numerical simulations in various fields such as climate and ecosystem modeling, chemical engineering, fluid dynamics, etc. Following are the applications of data mining in the field of scientific applications:
• Data warehouses and data preprocessing.
• Graph-based mining.
• Visualization and domain-specific knowledge.
Advantages and Disadvantages of Data Mining
Data mining is an important part of the knowledge discovery process, through which we can analyze an enormous set of data and obtain hidden and useful knowledge. Data mining is applied effectively not only in the business environment but also in other fields such as weather forecast, medicine, transportation, healthcare, insurance, government
Security is a big issue. Businesses hold information about their employees and customers, including social security numbers, birthdays, payroll and so on. However, how properly this information is taken care of is still in question. There have been many cases in which hackers accessed and stole large amounts of customer data from big corporations such as Ford Motor Credit Company and Sony. With so much personal and financial information available, credit card theft and identity theft become a big problem.
Misuse of information/inaccurate information
Information that is collected through data mining for ethical purposes can be misused. This information may be exploited by unethical people or businesses to take advantage of vulnerable people or to discriminate against a group of people.
In addition, data mining techniques are not perfectly accurate. Therefore, if inaccurate information is used for decision-making, it will cause serious consequences.
Conclusion
Data mining is a tool that is used by governments and corporations to predict and establish trends with specific purposes in mind (Alexander, 2006). The TIA and ADVISE use data mining as an anti-terrorist measure by looking for specific data to identify the terrorists before a terrorist attack (Solove, 2004; Birrer, 2005). Corporations use data mining to examine buying patterns and predict future trends (Cullen, 2005). Amazon.com uses it to promote its sales by pre-selecting items, using data mining algorithms (Arnold, 2001). Libraries use data mining to become more efficient in developing their collections and management of staff (Cummins, 2006).
There are privacy risks in data mining (Birrer, 2005). Governments recognize the privacy risks involved with data mining, and have developed legislation to provide recourse to individuals against potential privacy violations (PIPEDA, 2004; "Security views", 2006). Legislation like Canada's PIPEDA establishes guidelines that corporations and government agencies must adhere to when they are collecting personal information (PIPEDA, 2004). Likewise, corporations that are developing data mining software are sensitive to privacy concerns, and are building into their software measures to limit how much personal information is collected (Wait, 2005).
Your personal information is valuable, and you give away more of it than you realize (Kumagai & Cherry, 2004). There are many uses for information, and someone will always be thinking of a different way to use it (Kumagai & Cherry, 2004). Data mining is too valuable a tool for governments and corporations to abandon (Birrer, 2005; Kumagai & Cherry, 2004; Lipowicz, 2006). The key is being conscious of how you give away your personal information, and protecting your personal privacy (Larkin, 2006). Opt out of sharing your data with the companies you deal with (Larkin, 2006). Ask your telephone provider to remove call details from your account (Larkin, 2006). Read privacy policies of websites and organizations you visit (Larkin, 2006). By doing these small things, you protect yourself and your personal information from potential misuse (Larkin, 2006).