DATA MINING and VISUALIZATION
Instructor: Dr. Matthew Iklé, Adams State University
Remote Instructor: Dr. Hong Liu, Embry-Riddle Aeronautical University
Fall 2014
COURSE INFORMATION
Course Website: datamined.wordpress.com
Instructor email: [email protected]
Instructor cell phone: +1 719-588-4487
Instructor office hours: MWF 10-11 and TR 8:30-9:30 and by appointment (Mountain time)
Required text: Tan, Steinbach, Kumar, Introduction to Data Mining, ISBN: 0-321-32136-7, Pearson Education, 2006.
Recommended text: Witten, Frank, Hall, Data Mining: Practical Machine Learning Tools and Techniques, ISBN: 978-0-12-374856-0, Elsevier, 2011.
COURSE REQUIREMENTS
Minimal prerequisites
Modest background in statistics and mathematics
Necessary material integrated into the course
Will utilize basic machine learning toolkits such as WEKA and Waffles
Projects may require elementary programming, but each team will include at least one “programmer”
WHAT IS DATA MINING?
The process of automatically extracting useful information from large amounts of data.
Uses traditional data analysis techniques (statistics) and sophisticated computer algorithms to discover patterns.
Uses machine learning techniques to find structural patterns within the data.
Draws ideas from machine learning/AI, pattern recognition, statistics, and database systems
Traditional techniques may be unsuitable due to:
Enormity of data
High dimensionality of data
Heterogeneous, distributed nature of data
Origins of Data Mining
[Diagram: Data Mining draws on Machine Learning/Pattern Recognition, Statistics/AI, and Database systems]
Two Basic Problem Classes
Prediction Methods: Use some variables to predict unknown or future values of other variables.
Description Methods: Find human-interpretable patterns that describe the data.
Basic Types of Data Mining Tasks
Classification (predictive)
Clustering (descriptive)
Association rules (descriptive)
Sequential patterns (descriptive or predictive)
Regression (predictive)
Anomaly Detection (predictive)
Data Mining Techniques
Statistical techniques
Clustering
Decision trees
Subsampling (bootstrapping)
Nearest-neighborhoods
SOM
Bayesian methods
Data Mining Techniques
Artificial Neural Nets
Deep Learning (Google DeepMind)
PCA
Universal Prediction
Reinforcement Learning
“Compression” Sequence Prediction Techniques
Time Series Analysis
Data Mining Techniques
Hidden Markov Models
MLN
PLN
EDA (MOSES)
Random Forests
Feature Engineering
Unsupervised and Semi-Supervised Learning
DATA MINING TECHNIQUES
Entropy methods
Multifractal methods (time series)
Log-linear power laws (crash prediction)
Wavelet transforms
….
….
….
CLASSIFICATION: Definition
Given a collection of records (training set): Each record contains a set of attributes; one of the attributes is the class.
Find a model for class attribute as a function of the values of other attributes.
Goal: previously unseen records should be assigned a class as accurately as possible.
A test set is used to determine the accuracy of the model. Usually, the given data set is divided into training and test sets, with the training set used to build the model and the test set used to validate it.
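The train/test workflow described above can be sketched in a few lines. This is only an illustration: the 1-nearest-neighbor classifier, the 30% test fraction, the function names, and the toy records are all assumptions made for the example, not something the course prescribes.

```python
import math
import random

def split_train_test(records, test_fraction=0.3, seed=0):
    """Shuffle the records and divide them into training and test sets."""
    shuffled = records[:]
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

def predict_1nn(training_set, attributes):
    """Assign the class of the closest training record (1-nearest-neighbor)."""
    nearest = min(training_set, key=lambda rec: math.dist(rec[0], attributes))
    return nearest[1]

def accuracy(training_set, test_set):
    """Fraction of test records whose predicted class matches the true class."""
    correct = sum(predict_1nn(training_set, attrs) == label
                  for attrs, label in test_set)
    return correct / len(test_set)

# Hypothetical toy data: (attributes, class) pairs in two well-separated groups.
records = [((x, y), "low") for x in range(5) for y in range(5)]
records += [((x, y), "high") for x in range(10, 15) for y in range(10, 15)]

train, test = split_train_test(records)
print(f"train={len(train)} test={len(test)} accuracy={accuracy(train, test):.2f}")
```

The model is built from the training set only; the test set is held back and used solely to measure accuracy on records the model has not seen.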
CLUSTERING: Definition
Given a set of data points, each having a set of attributes, and a similarity measure among them, find clusters such that:
Data points in one cluster are more similar to one another.
Data points in separate clusters are less similar to one another.
Similarity Measures:
Euclidean Distance if attributes are continuous.
Other Problem-specific Measures.
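As one possible illustration of clustering with Euclidean distance, here is a minimal k-means sketch. The choice of k-means, the starting centroids, and the sample points are assumptions for the example; the definition above does not commit to any particular algorithm.

```python
import math

def kmeans(points, centroids, iterations=10):
    """Plain k-means: assign each point to its nearest centroid, then move
    each centroid to the mean of the points assigned to it."""
    clusters = []
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # New centroid = coordinate-wise mean; keep the old one if a cluster is empty.
        centroids = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster)) if cluster else old
            for cluster, old in zip(clusters, centroids)
        ]
    return centroids, clusters

# Hypothetical 2-D points forming two visibly separate groups.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
centroids, clusters = kmeans(points, centroids=[(0, 0), (10, 10)])
for center, members in zip(centroids, clusters):
    print(center, members)
```

Points that end up in the same cluster are closer to their shared centroid than to the other one, which is the "more similar within, less similar across" property in its simplest form.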
ASSOCIATION RULE: Definition
Given a set of records, each of which contains some number of items from a given collection:
Produce dependency rules which will predict the occurrence of an item based on occurrences of other items.
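A rough sketch of how such dependency rules can be scored, using the standard support and confidence measures. The transactions, the thresholds, and the restriction to single-item rules (A -> B) are assumptions chosen to keep the example short.

```python
from itertools import permutations

def support(transactions, itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def find_rules(transactions, min_support, min_confidence):
    """Return (antecedent, consequent, support, confidence) for rules A -> B."""
    items = sorted(set().union(*transactions))
    rules = []
    for a, b in permutations(items, 2):
        s_ab = support(transactions, {a, b})
        s_a = support(transactions, {a})
        # confidence(A -> B) = support(A and B) / support(A)
        if s_ab >= min_support and s_ab / s_a >= min_confidence:
            rules.append((a, b, s_ab, s_ab / s_a))
    return rules

# Hypothetical market-basket records.
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

for a, b, s, c in find_rules(transactions, min_support=0.5, min_confidence=0.8):
    print(f"{a} -> {b}  support={s:.2f} confidence={c:.2f}")
```

With these thresholds the only surviving rule is Beer -> Diaper: every basket containing Beer also contains Diaper, while the reverse rule holds in only three of four Diaper baskets.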
SEQUENTIAL PATTERN: Definition
Given a set of objects, with each object associated with its own timeline of events, find rules that predict strong sequential dependencies among different events.
Rules are formed by first discovering patterns. Event occurrences in the patterns are governed by timing constraints.
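To make the timing constraint concrete, the following sketch measures how often one event is followed by another within a maximum gap. The timelines, event names, and the `max_gap` parameter are hypothetical; real sequential-pattern miners handle longer patterns and richer constraints.

```python
def supports_pattern(timeline, first, second, max_gap):
    """True if `first` occurs and `second` follows it within `max_gap` time units."""
    return any(
        e1 == first and e2 == second and 0 < t2 - t1 <= max_gap
        for t1, e1 in timeline
        for t2, e2 in timeline
    )

def pattern_support(timelines, first, second, max_gap):
    """Fraction of objects whose timeline contains the pattern first -> second."""
    hits = sum(supports_pattern(tl, first, second, max_gap)
               for tl in timelines.values())
    return hits / len(timelines)

# Hypothetical customers, each with a timeline of (day, purchase) events.
timelines = {
    "c1": [(1, "shoes"), (3, "socks"), (9, "laces")],
    "c2": [(2, "shoes"), (4, "socks")],
    "c3": [(1, "socks"), (5, "shoes")],   # wrong order
    "c4": [(1, "shoes"), (20, "socks")],  # gap too long
}

print(pattern_support(timelines, "shoes", "socks", max_gap=5))
```

Only c1 and c2 satisfy both the ordering and the timing constraint, so the pattern is supported by half of the objects.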
REGRESSION: Definition
Predict a value of a given continuous valued variable based on the values of other variables, assuming a linear or nonlinear model of dependency.
Extensively studied in the statistics and neural network fields.
Examples:
Predicting sales amounts of a new product based on advertising expenditure.
Predicting wind velocities as a function of temperature, humidity, air pressure, etc.
Time series prediction of stock market indices.
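For the linear case, a minimal least-squares sketch of the sales-versus-advertising example might look like this. The numbers are invented for illustration, and a single predictor variable is assumed.

```python
def fit_line(xs, ys):
    """Ordinary least squares for one predictor: returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical data: advertising expenditure vs. sales amount (arbitrary units).
advertising = [1, 2, 3, 4, 5]
sales = [3, 5, 7, 9, 11]

slope, intercept = fit_line(advertising, sales)
predicted = slope * 6 + intercept  # predicted sales for an unseen expenditure of 6
print(slope, intercept, predicted)
```

Because the toy data lie exactly on a line, the fit recovers it; with real data the same formulas give the line that minimizes the squared prediction error.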
ANOMALY DETECTION: Definition
Detect significant deviations from normal behavior
Applications: Credit Card Fraud Detection
Network Intrusion Detection
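One simple way to flag deviations from normal behavior is a z-score test, sketched below on made-up card transaction amounts. The threshold of 2.0 and the data are assumptions; production fraud or intrusion detectors generally use far richer models.

```python
import statistics

def find_anomalies(values, threshold=2.0):
    """Return values lying more than `threshold` standard deviations from the mean."""
    mean = statistics.mean(values)
    stdev = statistics.pstdev(values)
    if stdev == 0:
        return []  # all values identical: nothing deviates
    return [v for v in values if abs(v - mean) / stdev > threshold]

# Hypothetical card transaction amounts with one suspicious charge.
amounts = [20, 22, 19, 21, 23, 20, 18, 22, 21, 500]
print(find_anomalies(amounts))
```

Note that a single extreme value inflates the standard deviation it is measured against, so in small samples the threshold has to be chosen with care.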