data mining self study report

Upload: akash

Post on 06-Jul-2018


  • 8/17/2019 Data mining self study report


    Self-Study seminar

    Topic :- Data mining

    Submitted by :-

    Akash Shinde2k14/S/!"


      ACKNOWLEDGEMENT

    I would like to express my greatest gratitude to the people who have helped and supported me throughout this Self-Study Seminar Report.

    I am grateful to my guide, Ms. Ruchika Malhotra, for her continuous support for the project, from initial advice and contacts in the early stages of conceptual inception through ongoing advice and encouragement to this day.

    I wish to thank them for their undivided support and interest, which inspired and encouraged me to go the right way.

    Last but not least, I want to thank my friends, who appreciated my work and motivated me, and finally God, who made all things possible.

    Akash Shinde


    Contents

    • Abstract

    • Introduction

    • Data Mining Algorithms

    • Applications

    • Advantages and Disadvantages

    • Conclusion


      Abstract

    Mining information and knowledge from large databases has been recognized by many researchers as a key research topic in database systems and machine learning, and by many industrial companies as an important area with an opportunity of major revenues. Researchers in many different fields have shown great interest in data mining. Several emerging applications in information-providing services, such as data warehousing and online services over the Internet, also call for various data mining techniques to better understand user behavior, to improve the service provided and to increase business opportunities. In response to such a demand, this article provides a survey, from a database researcher's point of view, on the data mining techniques developed recently. A classification of the available data mining techniques is provided and a comparative study of such techniques is presented.


      Introduction

    Data mining is an interdisciplinary subfield of computer science. It is the computational process of discovering patterns in large data sets involving methods at the intersection of artificial intelligence, machine learning, statistics, and database systems. The overall goal of the data mining process is to extract information from a data set and transform it into an understandable structure for further use. Aside from the raw analysis step, it involves database and data management aspects, data pre-processing, model and inference considerations, interestingness metrics, complexity considerations, post-processing of discovered structures, visualization, and online updating. Data mining is the analysis step of the "knowledge discovery in databases" process, or KDD.

    The term is a misnomer, because the goal is the extraction of patterns and knowledge from large amounts of data, not the extraction (mining) of data itself. It is also a buzzword and is frequently applied to any form of large-scale data or information processing (collection, extraction, warehousing, analysis, and statistics) as well as any application of computer decision support systems, including artificial intelligence, machine learning, and business intelligence. The book Data Mining: Practical Machine Learning Tools and Techniques with Java (which covers mostly machine learning material) was originally to be named just Practical Machine Learning, and the term data mining was only added for marketing reasons. Often the more general terms (large scale) data analysis and analytics, or, when referring to actual methods, artificial intelligence and machine learning, are more appropriate.

    The actual data mining task is the automatic or semi-automatic analysis of large quantities of data to extract previously unknown, interesting patterns such as groups of data records (cluster analysis), unusual records (anomaly detection), and dependencies (association rule mining). This usually involves using database techniques such as spatial indices. These patterns can then be seen as a kind of summary of the input data, and may be used in further analysis or, for example, in machine learning and predictive analytics. For example, the data mining step might identify multiple groups in the data, which can then be used to obtain more accurate prediction results by a decision support system. Neither the data collection, data preparation, nor result interpretation and reporting is part of the data mining step, but they do belong to the overall KDD process as additional steps.

    The related terms data dredging, data fishing, and data snooping refer to the use of data mining methods to sample parts of a larger population data set that are (or may be) too small for reliable statistical inferences to be made about the validity of any patterns discovered. These methods can, however, be used in creating new hypotheses to test against the larger data populations.


    Data Mining Algorithms

    Clustering Algorithm

    The Microsoft Clustering algorithm is a segmentation algorithm provided by Analysis Services. The algorithm uses iterative techniques to group cases in a dataset into clusters that contain similar characteristics. These groupings are useful for exploring data, identifying anomalies in the data, and creating predictions.

    Clustering models identify relationships in a dataset that you might not logically derive through casual observation. For example, you can logically discern that people who commute to their jobs by bicycle do not typically live a long distance from where they work. The algorithm, however, can find other characteristics about bicycle commuters that are not as obvious. In the following diagram, cluster A represents data about people who tend to drive to work, while cluster B represents data about people who tend to ride bicycles to work.

    The clustering algorithm differs from other data mining algorithms, such as the Microsoft Decision Trees algorithm, in that you do not have to designate a predictable column to be able to build a clustering model. The clustering algorithm trains the model strictly from the relationships that exist in the data and from the clusters that the algorithm identifies.

    Example

    Consider a group of people who share similar demographic information and who buy similar products from the Adventure Works company. This group of people represents a cluster of data. Several such clusters may exist in a database. By observing the columns that make up a cluster, you can more clearly see how records in a dataset are related to one another.

    How the Algorithm Works

    The Microsoft Clustering algorithm first identifies relationships in a dataset and generates a series of clusters based on those relationships. A scatter plot is a useful way to visually represent how the algorithm groups data, as shown in the following diagram. The scatter plot represents all the cases in the dataset, and each case is a point on the graph. The clusters group points on the graph and illustrate the relationships that the algorithm identifies.

    After first defining the clusters, the algorithm calculates how well the clusters represent groupings of the points, and then tries to redefine the groupings to create clusters that better represent the data.


    The algorithm iterates through this process until it cannot improve the results more by redefining the clusters.

    You can customize the way the algorithm works by specifying a clustering technique, limiting the maximum number of clusters, or changing the amount of support required to create a cluster.
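
    The iterate-until-no-improvement loop described above can be sketched with a plain k-means procedure. This is only an illustration under assumptions: the text does not say which clustering technique is used internally, and the function name `kmeans`, the sample points and the choice of starting centers are all hypothetical.

```python
from math import dist

def kmeans(points, centers, max_iter=100):
    """Assign points to the nearest center, recompute centers, repeat until stable."""
    centers = list(centers)  # work on a copy so the caller's list is untouched
    assignment = None
    for _ in range(max_iter):
        # Step 1: assign each point to its nearest current center.
        new_assignment = [
            min(range(len(centers)), key=lambda c: dist(p, centers[c]))
            for p in points
        ]
        # Stop when redefining the clusters no longer changes anything.
        if new_assignment == assignment:
            break
        assignment = new_assignment
        # Step 2: move each center to the mean of the points assigned to it.
        for c in range(len(centers)):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # keep the old center if a cluster ends up empty
                centers[c] = tuple(sum(x) / len(members) for x in zip(*members))
    return assignment, centers

# Hypothetical cases: (commute distance in km, age), forming two separate groups.
points = [(1.0, 25.0), (2.0, 27.0), (1.5, 24.0),
          (20.0, 45.0), (22.0, 50.0), (19.0, 47.0)]
assignment, centers = kmeans(points, centers=[points[0], points[3]])
print(assignment)
print(centers)
```

    Limiting the maximum number of clusters corresponds here to the number of starting centers passed in; a production algorithm would also choose those starting points more carefully.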

    Data Required for Clustering Models

    When you prepare data for use in training a clustering model, you should understand the requirements for the particular algorithm, including how much data is needed, and how the data is used.

    The requirements for a clustering model are as follows:

    • A single key column: Each model must contain one numeric or text column that uniquely identifies each record. Compound keys are not allowed.

    • Input columns: Each model must contain at least one input column that contains the values that are used to build the clusters. You can have as many input columns as you want, but depending on the number of values in each column, the addition of extra columns can increase the time it takes to train the model.

    • Optional predictable column: The algorithm does not need a predictable column to build the model, but you can add a predictable column of almost any data type. The values of the predictable column can be treated as input to the clustering model, or you can specify that it be used for prediction only. For example, if you want to predict customer income by clustering on demographics such as region or age, you would specify income as PredictOnly and add all the other columns, such as region or age, as inputs.

    Association Algorithm

    The Microsoft Association algorithm is an association algorithm provided by Analysis Services that is useful for recommendation engines. A recommendation engine recommends products to customers based on items they have already bought, or in which they have indicated an interest. The Microsoft Association algorithm is also useful for market basket analysis.

    Association models are built on datasets that contain identifiers both for individual cases and for the items that the cases contain. A group of items in a case is called an itemset. An association model consists of a series of itemsets and the rules that describe how those items are grouped together within the cases. The rules that the algorithm identifies can be used to predict a customer's likely future purchases, based on the items that already exist in the customer's shopping cart. The following diagram shows a series of rules in an itemset.

    As the diagram illustrates, the Microsoft Association algorithm can potentially find many rules within a dataset. The algorithm uses two parameters, support and probability, to describe the itemsets and rules that it generates. For example, if X and Y represent two items that could be in a shopping cart, the support parameter is the number of cases in the dataset that contain the combination of items, X and Y. By using the support parameter in combination with the user-defined parameters, MINIMUM_SUPPORT and MAXIMUM_SUPPORT, the algorithm controls the number of itemsets that are generated. The probability parameter, also named confidence, represents the fraction of cases in the dataset containing X that also contain Y. By using the probability parameter in combination with the MINIMUM_PROBABILITY parameter, the algorithm controls the number of rules that are generated.
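
    The two parameters can be made concrete with a few lines of Python. This is a minimal sketch, assuming each case is simply a set of item names; the helper names `support` and `probability` and the sample baskets are hypothetical and are not part of Analysis Services.

```python
def support(cases, items):
    """Number of cases that contain every item in `items`."""
    items = set(items)
    return sum(1 for case in cases if items <= set(case))

def probability(cases, x, y):
    """Fraction of the cases containing x that also contain y (confidence of x -> y)."""
    with_x = support(cases, [x])
    return support(cases, [x, y]) / with_x if with_x else 0.0

# Hypothetical shopping carts.
cases = [
    {"bike", "helmet"},
    {"bike", "helmet", "bottle"},
    {"bike"},
    {"helmet"},
]
print(support(cases, ["bike", "helmet"]))     # two carts contain both items
print(probability(cases, "bike", "helmet"))   # two of the three bike carts
```

    Note that support is a count while probability is a ratio, which is why the two thresholds in the text are expressed differently.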

    Example


    The Adventure Works Cycle company is redesigning the functionality of its Web site. The goal of the redesign is to increase sell-through of products. Because the company records each sale in a transactional database, they can use the Microsoft Association algorithm to identify sets of products that tend to be purchased together. They can then predict additional items that a customer might be interested in, based on items that are already in the customer's shopping basket.

    How the Algorithm Works

    The Microsoft Association algorithm traverses a dataset to find items that appear together in a case. The algorithm then groups into itemsets any associated items that appear, at a minimum, in the number of cases that are specified by the MINIMUM_SUPPORT parameter. For example, an itemset could be "Mountain 200=Existing, Sport 100=Existing", and could have a support of 710.

    The algorithm then generates rules from the itemsets. These rules are used to predict the presence of an item in the database, based on the presence of other specific items that the algorithm identifies as important. For example, a rule could be "if Touring 1000=existing and Road bottle cage=existing, then Water bottle=existing", and could have a probability of 0.812. In this example, the algorithm identifies that the presence in the basket of the Touring 1000 tire and the water bottle cage predicts that a water bottle would also likely be in the basket.
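
    The two stages described above (keep itemsets that meet a minimum support, then keep rules that meet a minimum probability) can be sketched as follows. To stay short, the sketch only considers itemsets of two items; the function names and sample carts are hypothetical, and the argument names merely echo the MINIMUM_SUPPORT and MINIMUM_PROBABILITY parameters from the text.

```python
from itertools import combinations

def frequent_pairs(cases, minimum_support):
    """Count every pair of items; keep pairs seen in at least minimum_support cases."""
    counts = {}
    for case in cases:
        for pair in combinations(sorted(case), 2):
            counts[pair] = counts.get(pair, 0) + 1
    return {pair: n for pair, n in counts.items() if n >= minimum_support}

def rules(cases, minimum_support, minimum_probability):
    """Turn each frequent pair into 'if left then right' rules above the probability cut."""
    item_counts = {}
    for case in cases:
        for item in case:
            item_counts[item] = item_counts.get(item, 0) + 1
    found = []
    for (a, b), n in frequent_pairs(cases, minimum_support).items():
        for left, right in ((a, b), (b, a)):
            prob = n / item_counts[left]  # cases with both / cases with left
            if prob >= minimum_probability:
                found.append((left, right, prob))
    return found

# Hypothetical carts using product names from the example in the text.
cases = [
    {"Touring 1000", "Road bottle cage", "Water bottle"},
    {"Road bottle cage", "Water bottle"},
    {"Touring 1000", "Water bottle"},
    {"Touring 1000"},
]
for left, right, prob in rules(cases, minimum_support=2, minimum_probability=0.9):
    print(f"if {left} then {right} (probability {prob:.2f})")
```

    Raising either threshold shrinks the output, which is how the two parameters control the number of itemsets and rules.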

    Data Required for Association Models

    When you prepare data for use in an association rules model, you should understand the requirements for the particular algorithm, including how much data is needed, and how the data is used.

    The requirements for an association rules model are as follows:

    • A single key column: Each model must contain one numeric or text column that uniquely identifies each record. Compound keys are not permitted.

    • A single predictable column: An association model can have only one predictable column. Typically it is the key column of the nested table, such as the field that lists the products that were purchased. The values must be discrete or discretized.

    • Input columns: The input columns must be discrete. The input data for an association model often is contained in two tables. For example, one table might contain customer information while another table contains customer purchases. You can input this data into the model by using a nested table.

    Linear Regression Algorithm

    The Microsoft Linear Regression algorithm is a variation of the Microsoft Decision Trees algorithm that helps you calculate a linear relationship between a dependent and independent variable, and then use that relationship for prediction.

    The relationship takes the form of an equation for a line that best represents a series of data. For example, the line in the following diagram is the best possible linear representation of the data.


    Each data point in the diagram has an error associated with its distance from the regression line. The coefficients a and b in the regression equation adjust the angle and location of the regression line. You can obtain the regression equation by adjusting a and b until the sum of the errors that are associated with all the points reaches its minimum.

    There are other kinds of regression that use multiple variables, and also nonlinear methods of regression. However, linear regression is a useful and well-known method for modeling a response to a change in some underlying factor.
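
    As an illustration of choosing a and b so that the total error is as small as possible, the sketch below uses the standard least-squares closed form. Two assumptions are made that the text does not state: the line is written as y = a*x + b, and the error of each point is its squared vertical distance from the line. The function names and sample data are hypothetical.

```python
def fit_line(xs, ys):
    """Return (a, b) for y = a*x + b minimizing the sum of squared errors."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    a = sxy / sxx              # slope: the angle of the line
    b = mean_y - a * mean_x    # intercept: the location of the line
    return a, b

def sum_squared_errors(xs, ys, a, b):
    """Total squared vertical distance between the points and the line."""
    return sum((y - (a * x + b)) ** 2 for x, y in zip(xs, ys))

# Hypothetical trend data, e.g. month number against units sold.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.1, 4.9, 7.2, 8.8]
a, b = fit_line(xs, ys)
print(a, b, sum_squared_errors(xs, ys, a, b))
```

    Nudging either coefficient away from the fitted value increases the total error, which is the sense in which the fitted line is the best linear representation of the data.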

    Example

    You can use linear regression to determine a relationship between two continuous columns. For example, you can use linear regression to compute a trend line from manufacturing or sales data. You could also use the linear regression as a precursor to development of more complex data mining models, to assess the relationships among data columns.

    Although there are many ways to compute linear regression that do not require data mining tools, the advantage of using the Microsoft Linear Regression algorithm for this task is that all the possible relationships among the variables are automatically computed and tested. You do not have to select a computation method, such as solving for least squares. However, linear regression might oversimplify the relationships in scenarios where multiple factors affect the outcome.

    How the Algorithm Works

    The Microsoft Linear Regression algorithm is a variation of the Microsoft Decision Trees algorithm. When you select the Microsoft Linear Regression algorithm, a special case of the Microsoft Decision Trees algorithm is invoked, with parameters that constrain the behavior of the algorithm and require certain input data types. Moreover, in a linear regression model, the whole data set is used for computing relationships in the initial pass, whereas a standard decision trees model splits the data repeatedly into smaller subsets or trees.

    Data Required for Linear Regression Models

    When you prepare data for use in a linear regression model, you should understand the requirements for the particular algorithm. This includes how much data is needed, and how the data is used. The requirements for this model type are as follows:

    • A single key column: Each model must contain one numeric or text column that uniquely identifies each record. Compound keys are not permitted.


    probability that they will not buy a bike is 0.287. In this example, the algorithm uses the numeric information, derived from customer characteristics (such as commute distance), to predict whether a customer will buy a bike.

    Data Required for Naive Bayes Models

    When you prepare data for use in training a Naive Bayes model, you should understand the requirements for the algorithm, including how much data is needed, and how the data is used.

    The requirements for a Naive Bayes model are as follows:

    • A single key column: Each model must contain one numeric or text column that uniquely identifies each record. Compound keys are not allowed.

    • Input columns: In a Naive Bayes model, all columns must be either discrete or discretized columns.

    For a Naive Bayes model, it is also important to ensure that the input attributes are independent of each other. This is particularly important when you use the model for prediction. The reason is that, if you use two columns of data that are already closely related, the effect would be to multiply the influence of those columns, which can obscure other factors that influence the outcome.

    Conversely, the ability of the algorithm to identify correlations among variables is useful when you are exploring a model or dataset, to identify relationships among inputs.

    • At least one predictable column: The predictable attribute must contain discrete or discretized values.

    The values of the predictable column can be treated as inputs. This practice can be useful when you are exploring a new dataset, to find relationships among the columns.
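
    The remark about closely related input columns multiplying their influence can be seen in a small numeric sketch. It assumes the usual Naive Bayes rule (multiply the prior by one likelihood per input column, then normalize); the function name, class labels and likelihood values are hypothetical.

```python
def naive_bayes_posterior(prior, likelihoods):
    """Posterior P(class | inputs), treating the inputs as independent given the class.

    prior:       {class: P(class)}
    likelihoods: one {class: P(observed value | class)} dict per input column
    """
    scores = dict(prior)
    for column in likelihoods:
        for cls in scores:
            scores[cls] *= column[cls]
    total = sum(scores.values())
    return {cls: score / total for cls, score in scores.items()}

prior = {"buys": 0.5, "does not buy": 0.5}
short_commute = {"buys": 0.6, "does not buy": 0.4}  # hypothetical likelihoods

once = naive_bayes_posterior(prior, [short_commute])
# The same evidence entered twice, as two closely related columns would be.
twice = naive_bayes_posterior(prior, [short_commute, short_commute])
print(once["buys"], twice["buys"])
```

    The second posterior is more extreme than the first even though no new information was added, which is why related columns can obscure other factors.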

    Decision Trees Algorithm

    The Microsoft Decision Trees algorithm is a classification and regression algorithm provided by Microsoft SQL Server Analysis Services for use in predictive modeling of both discrete and continuous attributes.

    For discrete attributes, the algorithm makes predictions based on the relationships between input columns in a dataset. It uses the values, known as states, of those columns to predict the states of a column that you designate as predictable. Specifically, the algorithm identifies the input columns that are correlated with the predictable column. For example, in a scenario to predict which customers are likely to purchase a bicycle, if nine out of ten younger customers buy a bicycle, but only two out of ten older customers do so, the algorithm infers that age is a good predictor of bicycle purchase. The decision tree makes predictions based on this tendency toward a particular outcome.

    For continuous attributes, the algorithm uses linear regression to determine where a decision tree splits.

    If more than one column is set to predictable, or if the input data contains a nested table that is set to predictable, the algorithm builds a separate decision tree for each predictable column.
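
    One common way to quantify "age is a good predictor" is information gain based on entropy. The sketch below applies it to the nine-in-ten and two-in-ten figures from the example; using information gain here is an assumption made for illustration, since the text does not say which score the algorithm actually computes.

```python
from math import log2

def entropy(counts):
    """Shannon entropy (in bits) of a list of class counts."""
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c)

def information_gain(parent, children):
    """Entropy of the parent minus the size-weighted entropy of the child groups."""
    total = sum(parent)
    weighted = sum(sum(child) / total * entropy(child) for child in children)
    return entropy(parent) - weighted

# Counts are [buyers, non-buyers], taken from the example in the text.
younger = [9, 1]
older = [2, 8]
everyone = [11, 9]
gain = information_gain(everyone, [younger, older])
print(gain)
```

    A gain well above zero means that splitting on age leaves purer groups than the unsplit population; an input column that does not separate buyers from non-buyers would score close to zero.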

    Example

    The marketing department of the Adventure Works Cycles company wants to identify the characteristics of previous customers that might indicate whether those customers are likely to buy a product in the future. The AdventureWorks2012 database stores demographic information that describes previous customers. By using the Microsoft Decision Trees algorithm to analyze this information, the marketing department can build a model that predicts whether a particular customer will purchase products, based on the states of known columns about that customer, such as demographics or past buying patterns.

    How the Algorithm Works

    The Microsoft Decision Trees algorithm builds a data mining model by creating a series of splits in the tree. These splits are represented as nodes. The algorithm adds a node to the model every time that an input column is found to be significantly correlated with the predictable column. The way that the algorithm determines a split is different depending on whether it is predicting a continuous column or a discrete column.

    The Microsoft Decision Trees algorithm uses feature selection to guide the selection of the most useful attributes. Feature selection is used by all Analysis Services data mining algorithms to improve performance and the quality of analysis. Feature selection is important to prevent unimportant attributes from using processor time. If you use too many input or predictable attributes when you design a data mining model, the model can take a very long time to process, or even run out of memory. Methods used to determine whether to split the tree include industry-standard metrics for entropy and Bayesian networks.

    A common problem in data mining models is that the model becomes too sensitive to small differences in the training data, in which case it is said to be over-fitted or over-trained. An overfitted model cannot be generalized to other data sets. To avoid overfitting on any particular set of data, the Microsoft Decision Trees algorithm uses techniques for controlling the growth of the tree.

    Predicting Discrete Columns

    The way that the Microsoft Decision Trees algorithm builds a tree for a discrete predictable column can be demonstrated by using a histogram. The following diagram shows a histogram that plots a predictable column, Bike Buyers, against an input column, Age. The histogram shows that the age of a person helps distinguish whether that person will purchase a bicycle.

    The correlation that is shown in the diagram would cause the Microsoft Decision Trees algorithm to create a new node in the model.


    As the algorithm adds new nodes to a model, a tree structure is formed. The top node of the tree describes the breakdown of the predictable column for the overall population of customers. As the model continues to grow, the algorithm considers all columns.

    Predicting Continuous Columns

    When the Microsoft Decision Trees algorithm builds a tree based on a continuous predictable column, each node contains a regression formula. A split occurs at a point of non-linearity in the regression formula. For example, consider the following diagram.

    The diagram contains data that can be modeled either by using a single line or by using two connected lines. However, a single line would do a poor job of representing the data. Instead, if you use two lines, the model will do a much better job of approximating the data. The point where the two lines come together is the point of non-linearity, and is the point where a node in a decision tree model would split. For example, the node that corresponds to the point of non-linearity in the previous graph could be represented by the following diagram.
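
    The idea of replacing one poorly fitting line with two better ones can be sketched as a brute-force search: try every split position, fit a least-squares line on each side, and keep the split with the lowest combined error. This search is an assumption made for illustration and is not claimed to be how the product locates the point of non-linearity; the function names and data are hypothetical.

```python
def fit_line(xs, ys):
    """Least-squares line; returns (slope, intercept, sum of squared errors)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx if sxx else 0.0
    intercept = my - slope * mx
    sse = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    return slope, intercept, sse

def best_split(xs, ys, min_points=2):
    """Return (first x of the right-hand segment, combined error) for the best split.

    xs must be sorted; each side keeps at least min_points points.
    """
    best = None
    for i in range(min_points, len(xs) - min_points + 1):
        error = fit_line(xs[:i], ys[:i])[2] + fit_line(xs[i:], ys[i:])[2]
        if best is None or error < best[1]:
            best = (xs[i], error)
    return best

# Hypothetical data: slope 1 up to x = 3, then a flatter slope of 0.5 from x = 4.
xs = [0, 1, 2, 3, 4, 5, 6, 7]
ys = [0, 1, 2, 3, 3.75, 4.25, 4.75, 5.25]
split_x, split_error = best_split(xs, ys)
print(split_x, split_error, fit_line(xs, ys)[2])
```

    The combined error of the two-line fit is far smaller than the error of the single line, which mirrors the argument in the text for splitting the node at that point.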

    Data Required for Decision Tree Models

    When you prepare data for use in a decision trees model, you should understand the requirements for the particular algorithm, including how much data is needed, and how the data is used.

    The requirements for a decision trees model are as follows:

    • A single key column: Each model must contain one numeric or text column that uniquely identifies each record. Compound keys are not permitted.

    • A predictable column: Requires at least one predictable column. You can include multiple predictable attributes in a model, and the predictable attributes can be of different types, either numeric or discrete. However, increasing the number of predictable attributes can increase processing time.

    • Input columns: Requires input columns, which can be discrete or continuous. Increasing the number of input attributes affects processing time.


    Data Mining Applications

    Data mining is a process that analyzes a large amount of data to find new and hidden information that improves business efficiency. Various industries have been adopting data mining in their mission-critical business processes to gain competitive advantages and help the business grow. This section illustrates some data mining applications in sales/marketing, banking/finance, health care and insurance, transportation and medicine.

    Data Mining Applications in Sales/Marketing

    Data mining enables businesses to understand the hidden patterns inside historical purchasing transaction data, thus helping in planning and launching new marketing campaigns in a prompt and cost-effective way. The following illustrates several data mining applications in sales and marketing.

    Data mining is used for market basket analysis to provide information on what product combinations were purchased together, when they were bought and in what sequence. This information helps businesses promote their most profitable products and maximize the profit. In addition, it encourages customers to purchase related products that they may have missed or overlooked.

    Retail companies use data mining to identify customers' buying behavior patterns.

    Data Mining Applications in Banking / Finance

    Several data mining techniques, e.g., distributed data mining, have been researched, modeled and developed to help credit card fraud detection.

    Data mining is used to identify customer loyalty by analyzing data on customers' purchasing activities, such as the frequency of purchases in a period of time, the total monetary value of all purchases and the date of the last purchase. After analyzing those dimensions, a relative measure is generated for each customer. The higher the score, the more loyal the customer is relative to the others.
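
    A relative loyalty measure built from those three dimensions (recency, frequency and monetary value) might look like the sketch below. The rank-and-sum scoring, the equal weighting of the dimensions, the function name and the sample purchases are all assumptions made for illustration; the text does not specify how the measure is computed.

```python
from datetime import date

def loyalty_scores(purchases, today):
    """Rank customers on recency, frequency and monetary value, then sum the ranks.

    purchases: {customer: [(purchase_date, amount), ...]}
    Each dimension contributes a rank from 1 (lowest) to N (highest).
    """
    summary = {
        customer: (
            -(today - max(d for d, _ in rows)).days,  # recency: fewer days ago is better
            len(rows),                                 # frequency of purchase
            sum(amount for _, amount in rows),         # total monetary value
        )
        for customer, rows in purchases.items()
    }
    scores = {customer: 0 for customer in summary}
    for dimension in range(3):
        ordered = sorted(summary, key=lambda c: summary[c][dimension])
        for rank, customer in enumerate(ordered, start=1):
            scores[customer] += rank
    return scores

# Hypothetical purchase histories.
purchases = {
    "alice": [(date(2018, 6, 1), 120.0), (date(2018, 6, 20), 80.0), (date(2018, 7, 1), 60.0)],
    "bob": [(date(2018, 1, 15), 40.0)],
    "carol": [(date(2018, 5, 5), 90.0), (date(2018, 6, 10), 30.0)],
}
scores = loyalty_scores(purchases, today=date(2018, 7, 6))
print(scores)
```

    Because the score is built from ranks, it only says how loyal a customer is compared with the others in the same data set, which matches the "relative measure" wording above.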

    Data mining is applied to help banks retain credit card customers. By analyzing past data, data mining can help banks predict which customers are likely to change their credit card affiliation, so that they can plan and launch different special offers to retain those customers.

    Credit card spending by customer groups can be identified by using data mining.

    The hidden correlations between different financial indicators can be discovered by using data mining.

    From historical market data, data mining makes it possible to identify stock trading rules.

    Data Mining Applications in Health Care and Insurance

    The growth of the insurance industry entirely depends on the ability to convert data into knowledge, information or intelligence about customers, competitors, and its markets. Data mining has been applied in the insurance industry only lately, but it has brought tremendous competitive advantages to the companies that have implemented it successfully. The data mining applications in the insurance industry are listed below.

    Data mining is applied in claims analysis, such as identifying which medical procedures are claimed together. Data mining makes it possible to forecast which customers will potentially purchase new policies. Data mining allows insurance companies to detect risky customers' behavior patterns. Data mining helps detect fraudulent behavior.

    Data Mining Applications in Transportation

    Data mining helps determine the distribution schedules among warehouses and outlets and analyze loading patterns.

    Data Mining Applications in Medicine

    Data mining makes it possible to characterize patient activities to see incoming office visits. Data mining helps identify the patterns of successful medical therapies for different illnesses. Data mining applications are continuously developing in various industries to provide more hidden knowledge that increases business efficiency and grows businesses.

    Financial Data Analysis

    Financial data in the financial industry is generally reliable and of high quality, which facilitates systematic data analysis and data mining. Some of the typical cases are as follows:

    • Design and construction of data warehouses for multidimensional data analysis and data mining.

    • Loan payment prediction and customer credit policy analysis.

    • Classification and clustering of customers for targeted marketing.

    • Detection of money laundering and other financial crimes.

    Retail Industry

    Data mining has great application in the retail industry because it collects large amounts of data on sales, customer purchasing history, goods transportation, consumption and services. It is natural that the quantity of data collected will continue to expand rapidly because of the increasing ease, availability and popularity of the web.

    Data mining in the retail industry helps in identifying customer buying patterns and trends that lead to improved quality of customer service and good customer retention and satisfaction. Here is a list of examples of data mining in the retail industry:

    • Design and construction of data warehouses based on the benefits of data mining.

    • Multidimensional analysis of sales, customers, products, time and region.

    • Analysis of effectiveness of sales campaigns.

    • Customer retention.

    • Product recommendation and cross-referencing of items.

    Telecomm#nication 'n$#str&

    Today the telecommunication industry is one of the most emerging industries pro%iding %arious

    ser%ices such as fa) pager) cellular phone) internet messenger) images) e-mail) 'eb data

    transmission) etc+ Due to the de%elopment of ne' computer and communication technologies) the

    telecommunication industry is rapidly epanding+ This is the reason 'hy data mining is become

    %ery important to help and understand the business+

    Data mining in telecommunication industry helps in identifying the telecommunication patterns)

    catch fraudulent acti%ities) make better use of resource) and impro%e uality of ser%ice+ 5ere is the

    list of eamples for 'hich data mining impro%es telecommunication ser%ices

    • $ultidimensional Analysis of Telecommunication data+

    • 0raudulent pattern analysis+

  • 8/17/2019 Data mining self study report

    17/21

    • Identification of unusual patterns.

    • Multidimensional association and sequential pattern analysis.

    • Mobile telecommunication services.

    • Use of visualization tools in telecommunication data analysis.
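
    As a rough illustration of identifying unusual patterns, the sketch below flags call durations that sit far from the typical value, using a modified z-score built on the median and the median absolute deviation. The sample durations, the function name `unusual_calls`, and the 3.5 cutoff are assumptions for the example only; real fraud detection would combine many more signals than call length.

```python
from statistics import median


def unusual_calls(durations, threshold=3.5):
    """Return indices of durations whose modified z-score exceeds `threshold`."""
    med = median(durations)
    # Median absolute deviation: a spread measure that outliers barely affect.
    mad = median([abs(d - med) for d in durations])
    if mad == 0:
        # All values are (nearly) identical, so nothing can be ranked as unusual.
        return []
    return [
        i
        for i, d in enumerate(durations)
        if 0.6745 * abs(d - med) / mad > threshold
    ]


# Hypothetical call durations in minutes for one subscriber.
calls = [3, 4, 5, 4, 3, 5, 4, 120, 4, 3]
print(unusual_calls(calls))
```

    The median-based score is used here instead of the mean and standard deviation because a single extreme call would otherwise inflate the spread and hide itself.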

    Biological Data Analysis

    In recent times, we have seen tremendous growth in fields of biology such as genomics, proteomics, functional genomics, and biomedical research. Biological data mining is a very important part of bioinformatics. The following are the aspects in which data mining contributes to biological data analysis:

    • Semantic integration of heterogeneous, distributed genomic and proteomic databases.

    • Alignment, indexing, similarity search, and comparative analysis of multiple nucleotide sequences.

    • Discovery of structural patterns and analysis of genetic networks and protein pathways.

    • Association and path analysis.

    • Visualization tools in genetic data analysis.
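
    To make the similarity search item more concrete, here is a minimal sketch that compares nucleotide sequences by edit distance and returns the closest match from a small collection. The sequences and the names `edit_distance` and `most_similar` are hypothetical, and unit-cost edit distance is only a simple stand-in: dedicated alignment tools use biologically informed scoring and indexing that this example does not attempt.

```python
def edit_distance(a, b):
    """Minimum number of single-character insertions, deletions, or substitutions."""
    # prev[j] holds the distance between the processed prefix of `a` and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(
                min(
                    prev[j] + 1,         # delete ca
                    curr[j - 1] + 1,     # insert cb
                    prev[j - 1] + cost,  # match or substitute
                )
            )
        prev = curr
    return prev[-1]


def most_similar(query, database):
    """Return the database sequence with the smallest edit distance to `query`."""
    return min(database, key=lambda seq: edit_distance(query, seq))


# Hypothetical nucleotide sequences.
database = ["GACTATA", "GATTACC", "TTTTTTT"]
print(most_similar("GATTACA", database))
```

    A linear scan like this only suits small collections; searching large sequence databases is where the indexing mentioned in the list becomes necessary.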

    Other Scientific Applications

    The applications discussed above tend to handle relatively small and homogeneous data sets for which statistical techniques are appropriate. Huge amounts of data have been collected from scientific domains such as geosciences, astronomy, etc. Large numbers of data sets are being generated because of fast numerical simulations in various fields such as climate and ecosystem modeling, chemical engineering, fluid dynamics, etc. The following are the applications of data mining in the field of scientific applications:

    • Data warehouses and data preprocessing.

    • Graph-based mining.

    • Visualization and domain-specific knowledge.

     

      Advantages and Disadvantages of Data Mining

    Data mining is an important part of the knowledge discovery process, in which we can analyze an enormous set of data and extract hidden and useful knowledge. Data mining is applied effectively not only in the business environment but also in other fields such as weather forecasting, medicine, transportation, healthcare, insurance, and government.

    Security is a big issue. Businesses own information about their employees and customers, including social security numbers, birthdays, payroll, etc. However, how properly this information is taken care of is still in question. There have been many cases in which hackers accessed and stole large amounts of customer data from big corporations such as Ford Motor Credit Company and Sony. With so much personal and financial information available, stolen credit cards and identity theft have become a big problem.

    Misuse of information/inaccurate information

    Information collected through data mining for ethical purposes can be misused. This information may be exploited by unethical people or businesses to take advantage of vulnerable people or to discriminate against a group of people.

    In addition, data mining techniques are not perfectly accurate. Therefore, if inaccurate information is used for decision-making, it can cause serious consequences.

     Conclusion

    Data mining is a tool that is used by governments and corporations to predict and establish trends with specific purposes in mind (Alexander, 2006). The TIA and ADVISE programs use data mining as an anti-terrorist measure by looking for specific data to identify terrorists before a terrorist attack (Solove, 2004; Birrer, 2005). Corporations use data mining to examine buying patterns and predict future trends (Cullen, 2005). Amazon.com uses it to promote its sales by pre-selecting items using data mining algorithms (Arnold, 2001). Libraries use data mining to become more efficient in developing their collections and managing staff (Cummins, 2006).

    There are privacy risks in data mining (Birrer, 2005). Governments recognize the privacy risks involved with data mining, and have developed legislation to provide recourse to individuals against potential privacy violations (PIPEDA, 2004; "Security views", 2006). Legislation like Canada's PIPEDA establishes guidelines that corporations and government agencies must adhere to when they are collecting personal information (PIPEDA, 2004). Likewise, corporations that are developing data mining software are sensitive to privacy concerns, and are building into their software measures to limit how much personal information is collected (Wait, 2005).

    Your personal information is valuable, and you give away more of it than you realize (Kumagai & Cherry, 2004). There are many uses for information, and someone will always be thinking of a different way to use it (Kumagai & Cherry, 2004). Data mining is too valuable a tool for governments and corporations to abandon (Birrer, 2005; Kumagai & Cherry, 2004; Lipowicz, 2006). The key is being conscious of how you give away your personal information, and protecting your personal privacy (Larkin, 2006). Opt out of sharing your data with the companies you deal with (Larkin, 2006). Ask your telephone provider to remove call details from your account (Larkin, 2006). Read the privacy policies of websites and organizations you visit (Larkin, 2006). By doing these small things, you protect yourself and your personal information from potential misuse (Larkin, 2006).

     
