lecture 01, introduction
DESCRIPTION
INTRODUCTION TO DATA MINING by Hari Sundaram Adapted from: slides by Jiawei Han and Kevin ChangTRANSCRIPT
![Page 1: Lecture 01, Introduction](https://reader034.vdocument.in/reader034/viewer/2022042721/577c83eb1a28abe054b6ced7/html5/thumbnails/1.jpg)
INTRODUCTION TO DATA MINING
Hari Sundaram
adapted from slides by Jiawei Han and Kevin Chang
[email protected]://sundaram.cs.illinois.edu
![Page 2: Lecture 01, Introduction](https://reader034.vdocument.in/reader034/viewer/2022042721/577c83eb1a28abe054b6ced7/html5/thumbnails/2.jpg)
DAIS@UIUC: DATA MINING,
DATABASE SYSTEMS, TEXT INFORMATION
SYSTEMS, NETWORKS
Different classes in Database and Information Systems
2
Zhai
Parameswaran
Han
Chang
Sundaram
![Page 3: Lecture 01, Introduction](https://reader034.vdocument.in/reader034/viewer/2022042721/577c83eb1a28abe054b6ced7/html5/thumbnails/3.jpg)
DATA MINING
Intro. to data mining (CS412: Han, Chang, Sundaram, Spring and Fall)
Data mining: Principles and algorithms (CS512: Han, Chang, Spring and Fall)
Seminar: Advanced Topics in Data mining (CS591Han: Fall and Spring, 1 credit)
3
![Page 4: Lecture 01, Introduction](https://reader034.vdocument.in/reader034/viewer/2022042721/577c83eb1a28abe054b6ced7/html5/thumbnails/4.jpg)
DATABASE SYSTEMS
Introduction to database systems (CS411: Chang, Parameswaran, Sinha — Spring and Fall)
Advanced database systems (CS511: Chang, Parameswaran — Fall or Spring)
Seminar: Human in the Loop Data Management (CS598: Parmeswaran, Fall)
4
![Page 5: Lecture 01, Introduction](https://reader034.vdocument.in/reader034/viewer/2022042721/577c83eb1a28abe054b6ced7/html5/thumbnails/5.jpg)
TEXT INFORMATION SYSTEMS
Text information system (CS410 Zhai: Spring)
Advanced text information systems (CS510) Zhai: Fall)
5
![Page 6: Lecture 01, Introduction](https://reader034.vdocument.in/reader034/viewer/2022042721/577c83eb1a28abe054b6ced7/html5/thumbnails/6.jpg)
NETWORKS + ADVERTISING
Seminar: Social & Information Networks (CS598, Sundaram, Fall, every two years)
Social & Information Networks (CS498, Sundaram, Fall, every two years; next class: Fall 2016)
Computational Advertising (CS498, Spring every two years, starting Spring 2017)
6
![Page 7: Lecture 01, Introduction](https://reader034.vdocument.in/reader034/viewer/2022042721/577c83eb1a28abe054b6ced7/html5/thumbnails/7.jpg)
Keep in MindBIOINFORMATICS
YAHOO DAIS SEMINAR
7
![Page 8: Lecture 01, Introduction](https://reader034.vdocument.in/reader034/viewer/2022042721/577c83eb1a28abe054b6ced7/html5/thumbnails/8.jpg)
CS412 CLASS MECHANICSEverything you wanted to
know 8
![Page 9: Lecture 01, Introduction](https://reader034.vdocument.in/reader034/viewer/2022042721/577c83eb1a28abe054b6ced7/html5/thumbnails/9.jpg)
class website:
9
https://wiki.cites.illinois.edu/wiki/display/cs412sp16/Syllabus
![Page 10: Lecture 01, Introduction](https://reader034.vdocument.in/reader034/viewer/2022042721/577c83eb1a28abe054b6ced7/html5/thumbnails/10.jpg)
lectures are online
10
http://bit.ly/1ZBxRuY
![Page 11: Lecture 01, Introduction](https://reader034.vdocument.in/reader034/viewer/2022042721/577c83eb1a28abe054b6ced7/html5/thumbnails/11.jpg)
BUT..11
![Page 12: Lecture 01, Introduction](https://reader034.vdocument.in/reader034/viewer/2022042721/577c83eb1a28abe054b6ced7/html5/thumbnails/12.jpg)
WHY YOU SHOULD COME
A lot of research shows that students who come to class tend to score better in exams
We’ll be solving problems in class that will help with understanding of the material
12
![Page 13: Lecture 01, Introduction](https://reader034.vdocument.in/reader034/viewer/2022042721/577c83eb1a28abe054b6ced7/html5/thumbnails/13.jpg)
sign up on piazza!
13
piazza.com/illinois/spring2016/cs412
![Page 14: Lecture 01, Introduction](https://reader034.vdocument.in/reader034/viewer/2022042721/577c83eb1a28abe054b6ced7/html5/thumbnails/14.jpg)
assignments
14
Written assignments (3) 7x3=21 points 5x3=15%
Programming assignments (2) 13.5x2=27 points 10x2=20%
![Page 15: Lecture 01, Introduction](https://reader034.vdocument.in/reader034/viewer/2022042721/577c83eb1a28abe054b6ced7/html5/thumbnails/15.jpg)
exams
15
Mid Term Exam 41 points 30%
Final Exam 48 points 35%
![Page 16: Lecture 01, Introduction](https://reader034.vdocument.in/reader034/viewer/2022042721/577c83eb1a28abe054b6ced7/html5/thumbnails/16.jpg)
project
16
Master’s students (mandatory) 34 points 25%
extra credit, undergrad 13.5 points 10%
rest of the grade scaled to 75%
kaggle.com
![Page 17: Lecture 01, Introduction](https://reader034.vdocument.in/reader034/viewer/2022042721/577c83eb1a28abe054b6ced7/html5/thumbnails/17.jpg)
GRADING
Will grade on a curve
Max points is 137; makes no difference to the grade, but there is empirical evidence to suggest that students are happier.
Will grade undergrads and grads on the same curve— there is no difference in performance.
Grad students taking the 4 credit class will need to do an extra project worth 25% of the grade
17
![Page 18: Lecture 01, Introduction](https://reader034.vdocument.in/reader034/viewer/2022042721/577c83eb1a28abe054b6ced7/html5/thumbnails/18.jpg)
regular feedback
18
why are you excited about this class?
what are you concerned about?
![Page 19: Lecture 01, Introduction](https://reader034.vdocument.in/reader034/viewer/2022042721/577c83eb1a28abe054b6ced7/html5/thumbnails/19.jpg)
MEET THE TA’S
Meet the TA’s
19
De Liao
Xiangyu “Joe” Chen
![Page 20: Lecture 01, Introduction](https://reader034.vdocument.in/reader034/viewer/2022042721/577c83eb1a28abe054b6ced7/html5/thumbnails/20.jpg)
Hari Sundaram Associate Professor, Department of Computer Science
Web: http://sundaram.cs.illinois.edu/ Email: [email protected]
What I do: design algorithms and build systems to understand and to shape crowd behavior in large scale social networks
Current Projects • Belief Evolution • Continuum representations for
networks (Meidani, Dey) • Persuasive message synthesis
(Karahalios) • Data Inference with privacy
policies (Kravets) • Sampling of large graphs • Network Summarization (Chang) • Crowd Clustering (Parmeswaran)
![Page 21: Lecture 01, Introduction](https://reader034.vdocument.in/reader034/viewer/2022042721/577c83eb1a28abe054b6ced7/html5/thumbnails/21.jpg)
Why are you taking this class?
21
![Page 22: Lecture 01, Introduction](https://reader034.vdocument.in/reader034/viewer/2022042721/577c83eb1a28abe054b6ced7/html5/thumbnails/22.jpg)
what do you want to do?22
wade
swim
dive
![Page 23: Lecture 01, Introduction](https://reader034.vdocument.in/reader034/viewer/2022042721/577c83eb1a28abe054b6ced7/html5/thumbnails/23.jpg)
WHY DATA MINING?
23
![Page 24: Lecture 01, Introduction](https://reader034.vdocument.in/reader034/viewer/2022042721/577c83eb1a28abe054b6ced7/html5/thumbnails/24.jpg)
There is an explosive growth of data: from
terabytes to petabytes
24
![Page 25: Lecture 01, Introduction](https://reader034.vdocument.in/reader034/viewer/2022042721/577c83eb1a28abe054b6ced7/html5/thumbnails/25.jpg)
MAJOR SOURCES OF DATA
Business:
Web, e-commerce, transactions, stocks, …
Science:
Remote sensing, astronomy, bioinformatics, scientific simulation, …
Society and everyone:
news, digital cameras, YouTube
25
![Page 26: Lecture 01, Introduction](https://reader034.vdocument.in/reader034/viewer/2022042721/577c83eb1a28abe054b6ced7/html5/thumbnails/26.jpg)
“We are drowning in data, but starving for knowledge
-John Naisbitt, 1982.
![Page 27: Lecture 01, Introduction](https://reader034.vdocument.in/reader034/viewer/2022042721/577c83eb1a28abe054b6ced7/html5/thumbnails/27.jpg)
WHAT IS DATA
MINING?27
![Page 28: Lecture 01, Introduction](https://reader034.vdocument.in/reader034/viewer/2022042721/577c83eb1a28abe054b6ced7/html5/thumbnails/28.jpg)
Extraction of interesting (non-trivial, implicit,
previously unknown and potentially useful) patterns or knowledge from huge
amount of data
28
![Page 29: Lecture 01, Introduction](https://reader034.vdocument.in/reader034/viewer/2022042721/577c83eb1a28abe054b6ced7/html5/thumbnails/29.jpg)
Is Data Mining a misnomer?
29
We don’t mine for data!
![Page 30: Lecture 01, Introduction](https://reader034.vdocument.in/reader034/viewer/2022042721/577c83eb1a28abe054b6ced7/html5/thumbnails/30.jpg)
Also known as
30
Knowledge discovery (mining) in databases (KDD), knowledge extraction, data/pattern analysis, data archeology, data dredging, information harvesting, business intelligence, etc.
![Page 31: Lecture 01, Introduction](https://reader034.vdocument.in/reader034/viewer/2022042721/577c83eb1a28abe054b6ced7/html5/thumbnails/31.jpg)
Is everything data mining?
31
Simple search and query processing
(Deductive) expert systems
![Page 32: Lecture 01, Introduction](https://reader034.vdocument.in/reader034/viewer/2022042721/577c83eb1a28abe054b6ced7/html5/thumbnails/32.jpg)
The Knowledge discovery process
32
databases task relevant data pattern evaluation
data mining
Knowledgedata warehouse
data cleaning
database, data warehousing community view
![Page 33: Lecture 01, Introduction](https://reader034.vdocument.in/reader034/viewer/2022042721/577c83eb1a28abe054b6ced7/html5/thumbnails/33.jpg)
EXAMPLE: WEB MINING
Data cleaning
Data integration from multiple sources
Warehousing the data
Data cube construction
Data selection for data mining
Data mining
Presentation of the mining results
Patterns and knowledge to be used or stored into knowledge-base
33
![Page 34: Lecture 01, Introduction](https://reader034.vdocument.in/reader034/viewer/2022042721/577c83eb1a28abe054b6ced7/html5/thumbnails/34.jpg)
Data mining in business intelligence
34
Increasing potentialto supportbusiness decisions End User
BusinessAnalyst
DataAnalyst
DBA
DecisionMaking
Data PresentationVisualization Techniques
Data MiningInformation Discovery
Data ExplorationStatistical Summary, Querying, and Reporting
Data Preprocessing/Integration, Data WarehousesData Sources
Paper, Files, Web documents, Scientific experiments, Database Systems
![Page 35: Lecture 01, Introduction](https://reader034.vdocument.in/reader034/viewer/2022042721/577c83eb1a28abe054b6ced7/html5/thumbnails/35.jpg)
The Machine Learning / Statistics View
35
input
Knowledgedata preprocessing
Data integration, Normalization, Feature selection, Dimension reduction
data mining
Pattern discovery, Association & correlation, Classification, Clustering, Outlier analysis
post-processing
Pattern evaluation, Pattern selection, Pattern interpretation, Pattern visualization
![Page 36: Lecture 01, Introduction](https://reader034.vdocument.in/reader034/viewer/2022042721/577c83eb1a28abe054b6ced7/html5/thumbnails/36.jpg)
WHICH VIEW DO YOU PREFER?
Which view do you prefer?
KDD vs. ML/Stat. vs. Business Intelligence
Depending on the data, applications, and your focus
Data Mining vs. Data Exploration
Business intelligence view
Warehouse, data cube, reporting but not much mining
Business objectives vs. data mining tools
Supply chain example: mining vs. OLAP vs. presentation tools
Data presentation vs. data exploration
36
![Page 37: Lecture 01, Introduction](https://reader034.vdocument.in/reader034/viewer/2022042721/577c83eb1a28abe054b6ced7/html5/thumbnails/37.jpg)
APPLICATIONS OF DATA MINING
37
![Page 38: Lecture 01, Introduction](https://reader034.vdocument.in/reader034/viewer/2022042721/577c83eb1a28abe054b6ced7/html5/thumbnails/38.jpg)
Web page analysis: from web page classification, clustering to PageRank & HITS algorithms
38
![Page 39: Lecture 01, Introduction](https://reader034.vdocument.in/reader034/viewer/2022042721/577c83eb1a28abe054b6ced7/html5/thumbnails/39.jpg)
Collaborative analysis & recommender systems
39
![Page 40: Lecture 01, Introduction](https://reader034.vdocument.in/reader034/viewer/2022042721/577c83eb1a28abe054b6ced7/html5/thumbnails/40.jpg)
Basket data analysis to targeted marketing
40
![Page 41: Lecture 01, Introduction](https://reader034.vdocument.in/reader034/viewer/2022042721/577c83eb1a28abe054b6ced7/html5/thumbnails/41.jpg)
Biological and medical data analysis: classification, cluster analysis (microarray data analysis), biological sequence analysis, biological network analysis
41
microarray
biological network
![Page 42: Lecture 01, Introduction](https://reader034.vdocument.in/reader034/viewer/2022042721/577c83eb1a28abe054b6ced7/html5/thumbnails/42.jpg)
Software engineering and
data mining
42
fix bugs
code completioncode optimization
estimate costs
source code
binaries
execution traces
commits
![Page 43: Lecture 01, Introduction](https://reader034.vdocument.in/reader034/viewer/2022042721/577c83eb1a28abe054b6ced7/html5/thumbnails/43.jpg)
dedicated data mining tools
43
Orcale data mining tools
SAS
MS SQL Server Tools
![Page 44: Lecture 01, Introduction](https://reader034.vdocument.in/reader034/viewer/2022042721/577c83eb1a28abe054b6ced7/html5/thumbnails/44.jpg)
DATA MINING: A MULTI-
DIMENSIONAL VIEW
44
![Page 45: Lecture 01, Introduction](https://reader034.vdocument.in/reader034/viewer/2022042721/577c83eb1a28abe054b6ced7/html5/thumbnails/45.jpg)
The Data
45
Database data (extended-relational, object-oriented, heterogeneous, legacy), data warehouse, transactional data, stream, spatiotemporal, time-series, sequence, text and web, multi-media, graphs & social and information networks
![Page 46: Lecture 01, Introduction](https://reader034.vdocument.in/reader034/viewer/2022042721/577c83eb1a28abe054b6ced7/html5/thumbnails/46.jpg)
Data Mining Functions
46
Characterization, discrimination, association, classification, clustering, trend/deviation, outlier analysis, etc.
Descriptive vs. predictive data mining
Multiple/integrated functions and mining at multiple levels
![Page 47: Lecture 01, Introduction](https://reader034.vdocument.in/reader034/viewer/2022042721/577c83eb1a28abe054b6ced7/html5/thumbnails/47.jpg)
Techniques
47
Data-intensive, data warehouse (OLAP), machine learning, statistics, pattern recognition, visualization, high-performance, etc.
![Page 48: Lecture 01, Introduction](https://reader034.vdocument.in/reader034/viewer/2022042721/577c83eb1a28abe054b6ced7/html5/thumbnails/48.jpg)
Applications
48
Retail, telecommunication, banking, fraud analysis, bio-data mining, stock market analysis, text mining, Web mining, etc.
![Page 49: Lecture 01, Introduction](https://reader034.vdocument.in/reader034/viewer/2022042721/577c83eb1a28abe054b6ced7/html5/thumbnails/49.jpg)
WHAT KINDS OF DATA?
49
![Page 50: Lecture 01, Introduction](https://reader034.vdocument.in/reader034/viewer/2022042721/577c83eb1a28abe054b6ced7/html5/thumbnails/50.jpg)
Database-oriented datasets and applications
50
Relational databases, data warehouse, transactional databases
Object-relational databases, Heterogeneous databases and legacy databases
![Page 51: Lecture 01, Introduction](https://reader034.vdocument.in/reader034/viewer/2022042721/577c83eb1a28abe054b6ced7/html5/thumbnails/51.jpg)
Advanced datasets and
advanced applications
51
Data streams and sensor data
Structure data, graphs, social networks and information networks
Time-series data, temporal data, sequence data (incl. bio-sequences)
Multimedia databases
Spatial data and spatiotemporal data
Text databases The World-Wide Web
![Page 52: Lecture 01, Introduction](https://reader034.vdocument.in/reader034/viewer/2022042721/577c83eb1a28abe054b6ced7/html5/thumbnails/52.jpg)
WHAT CAN WE DISCOVER
WITH THIS DATA?
52
![Page 53: Lecture 01, Introduction](https://reader034.vdocument.in/reader034/viewer/2022042721/577c83eb1a28abe054b6ced7/html5/thumbnails/53.jpg)
Generalization
53
Information integration and data warehouse construction
Data cleaning, transformation, integration, and multidimensional data model
![Page 54: Lecture 01, Introduction](https://reader034.vdocument.in/reader034/viewer/2022042721/577c83eb1a28abe054b6ced7/html5/thumbnails/54.jpg)
Generalization
54
Data cube
Scalable methods for computing (i.e., materializing) multidimensional aggregates
OLAP (online analytical processing)
![Page 55: Lecture 01, Introduction](https://reader034.vdocument.in/reader034/viewer/2022042721/577c83eb1a28abe054b6ced7/html5/thumbnails/55.jpg)
Generalization
55
Multidimensional concept description: characterization and discrimination
Generalize, summarize, and contrast data characteristics, e.g., dry vs. wet region
![Page 56: Lecture 01, Introduction](https://reader034.vdocument.in/reader034/viewer/2022042721/577c83eb1a28abe054b6ced7/html5/thumbnails/56.jpg)
Association and Correlation Analysis
56
Frequent patterns
What items are frequently purchased together at the local grocery store?
![Page 57: Lecture 01, Introduction](https://reader034.vdocument.in/reader034/viewer/2022042721/577c83eb1a28abe054b6ced7/html5/thumbnails/57.jpg)
Association and Correlation Analysis
57
Association, Correlation, and Causality
there are subtle differences between them!
![Page 58: Lecture 01, Introduction](https://reader034.vdocument.in/reader034/viewer/2022042721/577c83eb1a28abe054b6ced7/html5/thumbnails/58.jpg)
An Association Rule
58
Diapers ⇒ Beer, [0.5%, 75%]support confidence
How to mine such patterns and rules efficiently in large datasets?
How to use such patterns for classification, clustering, and other applications?
![Page 59: Lecture 01, Introduction](https://reader034.vdocument.in/reader034/viewer/2022042721/577c83eb1a28abe054b6ced7/html5/thumbnails/59.jpg)
Association and Correlation Analysis
59
correlation measures the linear dependence between two numeric variables
![Page 60: Lecture 01, Introduction](https://reader034.vdocument.in/reader034/viewer/2022042721/577c83eb1a28abe054b6ced7/html5/thumbnails/60.jpg)
What is the difference between association
and correlation?
60
Association and Correlation Analysis
Diapers ⇒ Beer, [0.5%, 75%]
![Page 61: Lecture 01, Introduction](https://reader034.vdocument.in/reader034/viewer/2022042721/577c83eb1a28abe054b6ced7/html5/thumbnails/61.jpg)
correlation ≠
causation
61
Association and Correlation Analysis
![Page 62: Lecture 01, Introduction](https://reader034.vdocument.in/reader034/viewer/2022042721/577c83eb1a28abe054b6ced7/html5/thumbnails/62.jpg)
Classification
62
Classification and label prediction
Construct models (functions) based on some training examples
Describe and distinguish classes or concepts for future prediction
Predict some unknown class labels
climate, gas mileage
![Page 63: Lecture 01, Introduction](https://reader034.vdocument.in/reader034/viewer/2022042721/577c83eb1a28abe054b6ced7/html5/thumbnails/63.jpg)
Classification
63
Typical methods
Decision trees, naïve Bayesian classification, support vector machines, neural networks, rule-based classification, pattern-based classification, logistic regression, …
![Page 64: Lecture 01, Introduction](https://reader034.vdocument.in/reader034/viewer/2022042721/577c83eb1a28abe054b6ced7/html5/thumbnails/64.jpg)
Classification
64
Typical applicationsCredit card fraud detection, direct marketing, classifying stars, diseases, web-pages, …
![Page 65: Lecture 01, Introduction](https://reader034.vdocument.in/reader034/viewer/2022042721/577c83eb1a28abe054b6ced7/html5/thumbnails/65.jpg)
CLUSTER ANALYSIS
Unsupervised learning (i.e., Class label is unknown)
Group data to form new categories (i.e., clusters), e.g., cluster houses to find distribution patterns
Principle: Maximizing intra-class similarity & minimizing interclass similarity
Many methods and applications
65
![Page 66: Lecture 01, Introduction](https://reader034.vdocument.in/reader034/viewer/2022042721/577c83eb1a28abe054b6ced7/html5/thumbnails/66.jpg)
OUTLIER ANALYSIS
Noise or exception? ― the value of the outlier is application dependent
Methods: byproduct of clustering or regression analysis, …
Useful in fraud detection, rare events analysis
66
A data object that does not comply with the general behavior of the data
![Page 67: Lecture 01, Introduction](https://reader034.vdocument.in/reader034/viewer/2022042721/577c83eb1a28abe054b6ced7/html5/thumbnails/67.jpg)
SEQUENTIAL PATTERN, TREND AND EVOLUTION ANALYSIS
Trend, time-series, and deviation analysis: e.g., regression and value prediction
Sequential pattern mining
e.g., first buy digital camera, then buy large SD memory cards
Periodicity analysis
Motifs and biological sequence analysis
Approximate and consecutive motifs
Similarity-based analysis
Mining data streams
Ordered, time-varying, potentially infinite, data streams
67
![Page 68: Lecture 01, Introduction](https://reader034.vdocument.in/reader034/viewer/2022042721/577c83eb1a28abe054b6ced7/html5/thumbnails/68.jpg)
MINING GRAPHS
Finding frequent subgraphs (e.g., chemical compounds), trees (XML), substructures (web fragments)
68
![Page 69: Lecture 01, Introduction](https://reader034.vdocument.in/reader034/viewer/2022042721/577c83eb1a28abe054b6ced7/html5/thumbnails/69.jpg)
NETWORK ANALYSIS
Social networks: actors (objects, nodes) and relationships (edges)
e.g., author networks in CS, terrorist networks
Multiple heterogeneous networks
A person could be multiple information networks: friends, family, classmates, …
Links carry a lot of semantic information: Link mining
69
political blogs
terrorist networks
![Page 70: Lecture 01, Introduction](https://reader034.vdocument.in/reader034/viewer/2022042721/577c83eb1a28abe054b6ced7/html5/thumbnails/70.jpg)
MINING THE WEB
70
Web community discovery, opinion mining, usage mining, …
![Page 71: Lecture 01, Introduction](https://reader034.vdocument.in/reader034/viewer/2022042721/577c83eb1a28abe054b6ced7/html5/thumbnails/71.jpg)
EVALUATION
Is all discovered knowledge interesting?
71
May not be representative, be transient, …
One can discover a very large number of “patterns”
Some may fit only certain dimension space (time, location, …)
![Page 72: Lecture 01, Introduction](https://reader034.vdocument.in/reader034/viewer/2022042721/577c83eb1a28abe054b6ced7/html5/thumbnails/72.jpg)
EVALUATION
Why not discover only interesting knowledge?
72
Coverage
Descriptive vs. predictive
Typicality vs. novelty
Accuracy
Timeliness
![Page 73: Lecture 01, Introduction](https://reader034.vdocument.in/reader034/viewer/2022042721/577c83eb1a28abe054b6ced7/html5/thumbnails/73.jpg)
WHAT KINDS OF
TECHNOLOGIES ARE USED?
73
![Page 74: Lecture 01, Introduction](https://reader034.vdocument.in/reader034/viewer/2022042721/577c83eb1a28abe054b6ced7/html5/thumbnails/74.jpg)
A Confluence of Technologies
74
Data MiningMachine Learning
Pattern Recognition
Statistics
Applications
Algorithms
Database Technology
High Performance Computing
Visualization
![Page 75: Lecture 01, Introduction](https://reader034.vdocument.in/reader034/viewer/2022042721/577c83eb1a28abe054b6ced7/html5/thumbnails/75.jpg)
Why a confluence?
![Page 76: Lecture 01, Introduction](https://reader034.vdocument.in/reader034/viewer/2022042721/577c83eb1a28abe054b6ced7/html5/thumbnails/76.jpg)
Massive76
Algorithms must be scalable to handle big data
![Page 77: Lecture 01, Introduction](https://reader034.vdocument.in/reader034/viewer/2022042721/577c83eb1a28abe054b6ced7/html5/thumbnails/77.jpg)
High dimensional
77
Micro-array may have tens of thousands of
dimensions
![Page 78: Lecture 01, Introduction](https://reader034.vdocument.in/reader034/viewer/2022042721/577c83eb1a28abe054b6ced7/html5/thumbnails/78.jpg)
Complex, Diverse
78
Data streams and sensor data
Time-series data, temporal data, sequence data
Structure data, graphs, social and information networks
Spatial, spatiotemporal, multimedia, text and Web data
Software programs, scientific simulations
New and sophisticated applications
![Page 79: Lecture 01, Introduction](https://reader034.vdocument.in/reader034/viewer/2022042721/577c83eb1a28abe054b6ced7/html5/thumbnails/79.jpg)
MAJOR ISSUES IN
DATA MINING79
![Page 80: Lecture 01, Introduction](https://reader034.vdocument.in/reader034/viewer/2022042721/577c83eb1a28abe054b6ced7/html5/thumbnails/80.jpg)
Mining Methodology
80
Mining various and new kinds of knowledge Mining knowledge
in multi-dimensional space
An interdisciplinary effort Boosting the power of discovery in a networked environment
Handling noise, uncertainty, and incompleteness of data
Pattern evaluation and pattern- or constraint-guided mining
![Page 81: Lecture 01, Introduction](https://reader034.vdocument.in/reader034/viewer/2022042721/577c83eb1a28abe054b6ced7/html5/thumbnails/81.jpg)
User Interaction
81
Interactive miningPresentation and visualization of data mining results
Incorporation of background knowledge
![Page 82: Lecture 01, Introduction](https://reader034.vdocument.in/reader034/viewer/2022042721/577c83eb1a28abe054b6ced7/html5/thumbnails/82.jpg)
Efficiency and Scalability
82
Parallel, distributed, stream, and incremental mining methods
Space and Time complexity of data mining algorithms
![Page 83: Lecture 01, Introduction](https://reader034.vdocument.in/reader034/viewer/2022042721/577c83eb1a28abe054b6ced7/html5/thumbnails/83.jpg)
Data Type diversity
83
Handling complex data types
Mining dynamic, networked, and global data repositories
![Page 84: Lecture 01, Introduction](https://reader034.vdocument.in/reader034/viewer/2022042721/577c83eb1a28abe054b6ced7/html5/thumbnails/84.jpg)
Data mining and society
84
what is the social impact of data mining?
Privacy-preserving
Invisible
![Page 85: Lecture 01, Introduction](https://reader034.vdocument.in/reader034/viewer/2022042721/577c83eb1a28abe054b6ced7/html5/thumbnails/85.jpg)
Privacy is an important issue
![Page 86: Lecture 01, Introduction](https://reader034.vdocument.in/reader034/viewer/2022042721/577c83eb1a28abe054b6ced7/html5/thumbnails/86.jpg)
Summary
86
Major issues
Data mining: Discovering interesting patterns and knowledge from massive amount of data
A natural evolution of science and information technology, in great demand, with wide applications
A KDD process includes data cleaning, data integration, data selection, transformation, data mining, pattern evaluation, and knowledge presentation
Mining can be performed on a variety of data
Data mining functionalities: characterization, discrimination, association, classification, clustering, trend and outlier analysis, etc.
Data mining technologies and applications
![Page 87: Lecture 01, Introduction](https://reader034.vdocument.in/reader034/viewer/2022042721/577c83eb1a28abe054b6ced7/html5/thumbnails/87.jpg)
A BRIEF HISTORY1989 IJCAI Workshop on Knowledge Discovery in Databases
Knowledge Discovery in Databases (G. Piatetsky-Shapiro and W. Frawley, 1991)
1991-1994 Workshops on Knowledge Discovery in Databases
Advances in Knowledge Discovery and Data Mining (U. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy, 1996)
1995-1998 International Conferences on Knowledge Discovery in Databases and Data Mining (KDD’95-98)
Journal of Data Mining and Knowledge Discovery (1997)
ACM SIGKDD conferences since 1998 and SIGKDD Explorations
More conferences on data mining
PAKDD (1997), PKDD (1997), SIAM-Data Mining (2001), (IEEE) ICDM (2001), WSDM (2008), etc.
ACM Transactions on KDD (2007)
87
![Page 88: Lecture 01, Introduction](https://reader034.vdocument.in/reader034/viewer/2022042721/577c83eb1a28abe054b6ced7/html5/thumbnails/88.jpg)
KDD CONFERENCES
ACM SIGKDD Int. Conf. on Knowledge Discovery in Databases and Data Mining (KDD)
SIAM Data Mining Conf. (SDM)
(IEEE) Int. Conf. on Data Mining (ICDM)
European Conf. on Machine Learning and Principles and practices of Knowledge Discovery and Data Mining (ECML-PKDD)
Pacific-Asia Conf. on Knowledge Discovery and Data Mining (PAKDD)
Int. Conf. on Web Search and Data Mining (WSDM)
88
![Page 89: Lecture 01, Introduction](https://reader034.vdocument.in/reader034/viewer/2022042721/577c83eb1a28abe054b6ced7/html5/thumbnails/89.jpg)
RELATED JOURNALS AND CONFERENCES
Journals:
Data Mining and Knowledge Discovery (DAMI or DMKD)
IEEE Trans. On Knowledge and Data Eng. (TKDE)
KDD Explorations
ACM Trans. on KDD
Conferences:
DB conferences: ACM SIGMOD, VLDB, ICDE, EDBT, ICDT, …
Web and IR conferences: WWW, SIGIR, WSDM
ML conferences: ICML, NIPS
PR conferences: CVPR, ICCV
89
![Page 90: Lecture 01, Introduction](https://reader034.vdocument.in/reader034/viewer/2022042721/577c83eb1a28abe054b6ced7/html5/thumbnails/90.jpg)
REFERENCE BOOKSE. Alpaydin. Introduction to Machine Learning, 2nd ed., MIT Press, 2011
J. Han, M. Kamber, and J. Pei, Data Mining: Concepts and Techniques. Morgan Kaufmann, 3rd ed. , 2011
T. Hastie, R. Tibshirani, and J. Friedman, The Elements of Statistical Learning: Data Mining, Inference, and Prediction, 2nd ed., Springer, 2009
B. Liu, Web Data Mining, Springer 2006
Y. Sun and J. Han, Mining Heterogeneous Information Networks, Morgan & Claypool, 2012
P.-N. Tan, M. Steinbach and V. Kumar, Introduction to Data Mining, Wiley, 2005
I. H. Witten and E. Frank, Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations, Morgan Kaufmann, 2nd ed. 2005
T. M. Mitchell, Machine Learning, McGraw Hill, 1997
S. Chakrabarti. Mining the Web: Statistical Analysis of Hypertex and Semi-Structured Data. Morgan Kaufmann, 2002
R. O. Duda, P. E. Hart, and D. G. Stork, Pattern Classification, 2ed., Wiley-Interscience, 2000
T. Dasu and T. Johnson. Exploratory Data Mining and Data Cleaning. John Wiley & Sons, 2003
U. M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy. Advances in Knowledge Discovery and Data Mining. AAAI/MIT Press, 1996
U. Fayyad, G. Grinstein, and A. Wierse, Information Visualization in Data Mining and Knowledge Discovery, Morgan Kaufmann, 2001
S. M. Weiss and N. Indurkhya, Predictive Data Mining, Morgan Kaufmann, 1998
90