10/30/021 me data mining overview margaret h. dunham cse department southern methodist university...
![Page 1: 10/30/021 ME DATA MINING OVERVIEW Margaret H. Dunham CSE Department Southern Methodist University Dallas, Texas 75275 mhd@engr.smu.edu](https://reader035.vdocument.in/reader035/viewer/2022062421/56649e205503460f94b0c7d1/html5/thumbnails/1.jpg)
10/30/02 1
DATA MINING OVERVIEW
Margaret H. Dunham
CSE Department
Southern Methodist University
Dallas, Texas 75275
Data is growing at a phenomenal rate
Users expect more sophisticated information
How?
UNCOVER HIDDEN INFORMATION
DATA MINING
Data Mining Definition
Finding hidden information in a database
Fit data to a model
Similar terms
• Exploratory data analysis
• Data driven discovery
• Deductive learning
Database Processing vs. Data Mining Processing
Database Processing
• Query: Well defined; SQL
• Data: Operational data
• Output: Precise; Subset of database

Data Mining Processing
• Query: Poorly defined; No precise query language
• Data: Not operational data
• Output: Fuzzy; Not a subset of database
Data Mining Development
KDD Process
Selection: Obtain data from various sources.
Preprocessing: Cleanse data.
Transformation: Convert to common format. Transform to new format.
Data Mining: Obtain desired results.
Interpretation/Evaluation: Present results to user in meaningful manner.
Modified from [FPSS96C]
KDD Process Ex: Web Log
Selection: Select log data (dates and locations) to use
Preprocessing:
• Remove identifying URLs
• Remove error logs
Transformation: Sessionize (sort and group)
Data Mining:
• Identify and count patterns
• Construct data structure
Interpretation/Evaluation: Identify and display frequently accessed sequences.
Potential User Applications:
• Cache prediction
• Personalization
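The transformation and data mining steps of the Web log example can be sketched in Python. This is only an illustration under stated assumptions: the log records, the 30-minute inactivity timeout, and the names `sessionize` and `count_sequences` are hypothetical and do not come from the slides.

```python
from collections import Counter, defaultdict

# Hypothetical, already-cleaned log records: (user, timestamp in seconds, page).
LOG = [
    ("u1", 0, "/home"), ("u1", 40, "/products"), ("u1", 90, "/cart"),
    ("u2", 10, "/home"), ("u2", 55, "/products"),
    ("u1", 5000, "/home"), ("u1", 5030, "/products"),
]

def sessionize(records, timeout=1800):
    """Group requests by user, sort them by time, and start a new session
    whenever the gap between two requests exceeds `timeout` seconds
    (the 30-minute default is an assumption)."""
    by_user = defaultdict(list)
    for user, ts, page in records:
        by_user[user].append((ts, page))
    sessions = []
    for hits in by_user.values():
        hits.sort()
        current, last_ts = [], None
        for ts, page in hits:
            if last_ts is not None and ts - last_ts > timeout:
                sessions.append(current)
                current = []
            current.append(page)
            last_ts = ts
        if current:
            sessions.append(current)
    return sessions

def count_sequences(sessions, length=2):
    """Count every run of `length` consecutive pages across all sessions."""
    counts = Counter()
    for pages in sessions:
        for i in range(len(pages) - length + 1):
            counts[tuple(pages[i:i + length])] += 1
    return counts

sessions = sessionize(LOG)
frequent = count_sequences(sessions).most_common(1)
```

Here `frequent` holds the most frequently accessed two-page sequence together with its count, which is the kind of result the interpretation step would display.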
Basic Data Mining Tasks
Classification maps data into predefined groups
• Pattern Recognition
• Regression
Clustering partitions database into groups
• Groups not known a priori
• Determined by the data (similarity)
Link Analysis uncovers relationships among data
• Association Rules
  • Ex: 60% of the time bread is sold, so is peanut butter
• Sequence Analysis
  • Ex: Most people who purchase CD players will purchase a CD within one week
• Not causal
• Not functional dependencies
Survey of Data Mining Tasks
Classification
• Decision Trees
• Neural Networks
Clustering
• Agglomerative
• Partitional
Association Rules
Web Mining
Classification Problem
Given a database D={t1,t2,…,tn} and a set of classes C={C1,…,Cm}, the Classification Problem is to define a mapping f: D → C where each ti is assigned to one class.
Actually divides D into equivalence classes.
Prediction is similar, but may be viewed as having an infinite number of classes.
Classification Examples
Pattern matching
Fraud detection
Identification of plant/animal species
Profiling (this is not a bad word)
Predicting terrorists or potential terrorist events
Web searches (Information Retrieval)
Defining Classes
Partitioning Based
Distance Based
Decision Trees
Decision Tree (DT):
• Tree where the root and each internal node is labeled with a question.
• The arcs represent each possible answer to the associated question.
• Each leaf node represents a prediction of a solution to the problem.
Popular technique for classification; leaf node indicates class to which the corresponding tuple belongs.
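A decision tree of this shape can be sketched as a nested Python dictionary. The attribute `height_m`, the thresholds, and the class labels are hypothetical and chosen only to illustrate the structure of questions, arcs, and leaves.

```python
# Hypothetical tree: each internal node holds a question (attribute, threshold),
# the "yes"/"no" keys are the arcs for the possible answers, and each leaf
# holds the class label it predicts.
TREE = {
    "question": ("height_m", 1.7),        # is height_m < 1.7 ?
    "yes": {"label": "short"},
    "no": {
        "question": ("height_m", 1.95),   # is height_m < 1.95 ?
        "yes": {"label": "medium"},
        "no": {"label": "tall"},
    },
}

def classify(node, tuple_):
    """Walk from the root to a leaf, following the arc that matches each answer."""
    while "label" not in node:
        attribute, threshold = node["question"]
        node = node["yes"] if tuple_[attribute] < threshold else node["no"]
    return node["label"]

labels = [classify(TREE, {"height_m": h}) for h in (1.6, 1.8, 2.0)]
```

Each call to `classify` answers one question per level, so the cost of classifying a tuple is bounded by the depth of the tree.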
Decision Tree Example
Neural Networks
Based on observed functioning of human brain. (Artificial Neural Networks (ANN))
Our view of neural networks is very simplistic.
We view a neural network (NN) from a graphical viewpoint.
Alternatively, a NN may be viewed from the perspective of matrices.
Used in pattern recognition, speech recognition, computer vision, and classification.
Classification Using Neural Networks
Typical NN structure for classification:
• One output node per class
• Output value is class membership function value
Supervised learning
For each tuple in training set, propagate it through NN. Adjust weights on edges to improve future classification.
Algorithms: Propagation, Backpropagation, Gradient Descent
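The propagate-then-adjust loop can be sketched with a very small network in plain Python. This is an assumption-laden illustration, not the network from the slides: it has no hidden layer, so the weight update is the simplest case of gradient descent on squared error rather than full backpropagation, and the training tuples, learning rate, and epoch count are hypothetical.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def propagate(weights, inputs):
    """Forward pass: one sigmoid output node per class."""
    x = inputs + [1.0]  # constant bias input
    return [sigmoid(sum(w * xi for w, xi in zip(row, x))) for row in weights]

def train(data, n_classes, epochs=2000, rate=0.5, seed=0):
    """For each training tuple, propagate it, then nudge every edge weight
    in the direction that reduces the squared error (gradient descent)."""
    rng = random.Random(seed)
    n_inputs = len(data[0][0]) + 1
    weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_inputs)]
               for _ in range(n_classes)]
    for _ in range(epochs):
        for inputs, cls in data:
            outputs = propagate(weights, inputs)
            x = inputs + [1.0]
            for j, out in enumerate(outputs):
                target = 1.0 if j == cls else 0.0
                delta = (target - out) * out * (1.0 - out)
                for i, xi in enumerate(x):
                    weights[j][i] += rate * delta * xi
    return weights

def classify(weights, inputs):
    """Pick the class whose output node has the largest membership value."""
    outputs = propagate(weights, inputs)
    return outputs.index(max(outputs))

# Hypothetical tuples with two binary attributes; the class follows the first one.
DATA = [([0.0, 0.0], 0), ([0.0, 1.0], 0), ([1.0, 0.0], 1), ([1.0, 1.0], 1)]
weights = train(DATA, n_classes=2)
predictions = [classify(weights, x) for x, _ in DATA]
```

A network with hidden layers would need backpropagation to pass the output error back to the earlier edges; the per-edge update rule stays the same in spirit.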
Neural Network Example
Propagation
Tuple Input
Output
Backpropagation
Error
Clustering Problem
Given a database D={t1,t2,…,tn} of tuples and an integer value k, the Clustering Problem is to define a mapping f: D → {1,…,k} where each ti is assigned to one cluster Kj, 1 ≤ j ≤ k.
A Cluster, Kj, contains precisely those tuples mapped to it.
Unlike classification problem, clusters are not known a priori.
Clustering Examples
Segment customer database based on similar buying patterns.
Group houses in a town into neighborhoods based on similar features.
Identify new plant species
Identify similar Web usage patterns
Agglomerative Example
Distance matrix:

    A  B  C  D  E
A   0  1  2  2  3
B   1  0  2  4  3
C   2  2  0  1  5
D   2  4  1  0  3
E   3  3  5  3  0

[Figure: graph of the points A, B, C, D, E and a dendrogram over A B C D E, with the threshold axis marked 1 to 5]
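The merging behind a dendrogram like this can be sketched in Python from the distance matrix. The slide does not say which linkage is used; the sketch below assumes single link (two clusters merge when their closest pair of points is within the threshold), and the function name is hypothetical.

```python
LABELS = ["A", "B", "C", "D", "E"]
DIST = [
    [0, 1, 2, 2, 3],
    [1, 0, 2, 4, 3],
    [2, 2, 0, 1, 5],
    [2, 4, 1, 0, 3],
    [3, 3, 5, 3, 0],
]

def single_link_clusters(dist, labels, threshold):
    """Start with one cluster per point and keep merging any two clusters
    that contain a pair of points at distance <= threshold (single link,
    which is an assumption about the slide's example)."""
    clusters = [{i} for i in range(len(labels))]
    merged = True
    while merged:
        merged = False
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                if any(dist[i][j] <= threshold
                       for i in clusters[a] for j in clusters[b]):
                    clusters[a] |= clusters[b]
                    del clusters[b]
                    merged = True
                    break
            if merged:
                break
    return sorted("".join(sorted(labels[i] for i in c)) for c in clusters)

# Clusters present at each threshold level, from the bottom of the dendrogram up.
levels = {t: single_link_clusters(DIST, LABELS, t) for t in (1, 2, 3)}
```

Under the single-link assumption, raising the threshold from 1 to 3 takes the data from three clusters to one; complete or average link would merge at different levels.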
Association Rule Problem
Given a set of items I={I1,I2,…,Im} and a database of transactions D={t1,t2,…,tn} where ti={Ii1,Ii2,…,Iik} and Iij ∈ I, the Association Rule Problem is to identify all association rules X ⇒ Y with a minimum support and confidence.
Link Analysis
NOTE: Support of X ⇒ Y is the same as support of X ∪ Y.
Example: Market Basket Data
Items frequently purchased together:
• Bread ⇒ PeanutButter
Uses:
• Placement
• Advertising
• Sales
• Coupons
Objective: increase sales and reduce costs
Association Rule Definitions
Set of items: I={I1,I2,…,Im}
Transactions: D={t1,t2,…,tn}, tj ⊆ I
Itemset: {Ii1,Ii2,…,Iik} ⊆ I
Support of an itemset: Percentage of transactions which contain that itemset.
Large (Frequent) itemset: Itemset whose number of occurrences is above a threshold.
Association Rules Example
I = { Beer, Bread, Jelly, Milk, PeanutButter}
Support of {Bread,PeanutButter} is 60%
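Support and confidence can be computed directly from a list of transactions. The transaction table for this example is not present in the transcript, so the five transactions below are hypothetical stand-ins, chosen so that the support of {Bread, PeanutButter} comes out at the 60% quoted above. Confidence is taken here in its usual sense: the support of X ∪ Y divided by the support of X.

```python
# Hypothetical transactions over I = {Beer, Bread, Jelly, Milk, PeanutButter}.
TRANSACTIONS = [
    {"Bread", "Jelly", "PeanutButter"},
    {"Bread", "PeanutButter"},
    {"Bread", "Milk", "PeanutButter"},
    {"Beer", "Bread"},
    {"Beer", "Milk"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(x, y, transactions):
    """For a rule X => Y: support of (X union Y) divided by support of X."""
    return support(set(x) | set(y), transactions) / support(x, transactions)

s = support({"Bread", "PeanutButter"}, TRANSACTIONS)
c = confidence({"Bread"}, {"PeanutButter"}, TRANSACTIONS)
```

An itemset counts as large (frequent) when `support` returns a value above the chosen threshold; a rule is reported only if its confidence also clears its own minimum.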
Web Data
Web pages
Intra-page structures
Inter-page structures
Usage data
Supplemental data
• Profiles
• Registration information
• Cookies
Web Structure Mining
Mine structure (links, graph) of the Web
PageRank
Create a model of the Web organization.
May be combined with content mining to more effectively retrieve important pages.
PageRank
Used by Google
Prioritize pages returned from search by looking at Web structure.
Importance of page is calculated based on number of pages which point to it – Backlinks.
Weighting is used to provide more importance to backlinks coming from important pages.
PR(p) = c (PR(1)/N1 + … + PR(n)/Nn)
• PR(i): PageRank for a page i which points to target page p.
• Ni: number of links coming out of page i
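The PageRank formula can be evaluated by repeated substitution, starting from equal ranks. The sketch below is a simplified illustration: it treats c as a normalizing constant that keeps the ranks summing to 1 (an assumption), it omits the damping factor used in practical implementations, and the three-page link graph is hypothetical.

```python
def pagerank(links, iterations=100):
    """`links` maps each page to the list of pages it points to.
    Every page must appear as a key and have at least one outgoing link."""
    pages = list(links)
    ranks = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iterations):
        raw = {p: 0.0 for p in pages}
        for i, targets in links.items():
            for p in targets:
                raw[p] += ranks[i] / len(targets)   # PR(i) / Ni
        c = 1.0 / sum(raw.values())                 # normalizing constant (assumption)
        ranks = {p: c * r for p, r in raw.items()}
    return ranks

# Hypothetical Web graph: A links to B and C, B links to C, C links to A.
LINKS = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
ranks = pagerank(LINKS)
```

Page C ends up ranked above B even though both receive a link from A, because C also collects all of B's rank through B's single outgoing link.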
Web Usage Mining
Extends work of basic search engines
Search Engines
• IR application
• Keyword based
• Similarity between query and document
• Crawlers
• Indexing
• Profiles
• Link analysis
Web Usage Mining Applications
Personalization
Improve structure of a site’s Web pages
Aid in caching and prediction of future page references
Improve design of individual pages
Improve effectiveness of e-commerce (sales and advertising)