TRANSCRIPT
Statistical Data Mining - 1
Edward J. Wegman
A Short Course for Interface ‘01
Outline of Lecture
- Complexity
- Data Mining: What is it?
- Data Preparation
Complexity

The Huber Taxonomy of Data Set Sizes

| Descriptor | Data Set Size in Bytes | Storage Mode |
|---|---|---|
| Tiny | 10^2 | Piece of Paper |
| Small | 10^4 | A Few Pieces of Paper |
| Medium | 10^6 | A Floppy Disk |
| Large | 10^8 | Hard Disk |
| Huge | 10^10 | Multiple Hard Disks, e.g. RAID Storage |
| Massive | 10^12 | Robotic Magnetic Tape Storage Silos |
Complexity

Algorithmic Complexity

| Complexity | Example Operations |
|---|---|
| O(r), O(n^1/2) | Plot a scatterplot |
| O(n) | Calculate means, variances, kernel density estimates |
| O(n log(n)) | Calculate fast Fourier transforms |
| O(nc) | Calculate the singular value decomposition of an r × c matrix; solve a multiple linear regression |
| O(n^2) | Solve most clustering algorithms |
Complexity

Table 2: Number of Operations for Algorithms of Various Computational Complexities and Various Data Set Sizes

| n | n^1/2 | n | n log(n) | n^3/2 | n^2 |
|---|---|---|---|---|---|
| tiny | 10 | 10^2 | 2x10^2 | 10^3 | 10^4 |
| small | 10^2 | 10^4 | 4x10^4 | 10^6 | 10^8 |
| medium | 10^3 | 10^6 | 6x10^6 | 10^9 | 10^12 |
| large | 10^4 | 10^8 | 8x10^8 | 10^12 | 10^16 |
| huge | 10^5 | 10^10 | 10^11 | 10^15 | 10^20 |
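The entries of Table 2 can be reproduced directly. A minimal Python sketch, assuming the table takes logarithms base 10 (which matches entries such as n log(n) = 4x10^4 for n = 10^4):

```python
import math

# Huber's taxonomy (Table 1): data set sizes in bytes.
sizes = {"tiny": 1e2, "small": 1e4, "medium": 1e6, "large": 1e8, "huge": 1e10}

# Complexity classes from Table 2.  Logs are base 10, which is what
# reproduces entries such as n log(n) = 4x10^4 for n = 10^4.
classes = {
    "n^1/2":    lambda n: n ** 0.5,
    "n":        lambda n: n,
    "n log(n)": lambda n: n * math.log10(n),
    "n^3/2":    lambda n: n ** 1.5,
    "n^2":      lambda n: n ** 2,
}

for label, n in sizes.items():
    row = "  ".join(f"{name}={f(n):.3g}" for name, f in classes.items())
    print(f"{label:6s} {row}")
```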
Complexity

Table 4: Computational Feasibility on a Pentium PC (10 megaflop performance assumed)

| n | n^1/2 | n | n log(n) | n^3/2 | n^2 |
|---|---|---|---|---|---|
| tiny | 10^-6 seconds | 10^-5 seconds | 2x10^-5 seconds | .0001 seconds | .001 seconds |
| small | 10^-5 seconds | .001 seconds | .004 seconds | .1 seconds | 10 seconds |
| medium | .0001 seconds | .1 seconds | .6 seconds | 1.67 minutes | 1.16 days |
| large | .001 seconds | 10 seconds | 1.3 minutes | 1.16 days | 31.7 years |
| huge | .01 seconds | 16.7 minutes | 2.78 hours | 3.17 years | 317,000 years |
Complexity

Table 5: Computational Feasibility on a Silicon Graphics Onyx Workstation (300 megaflop performance assumed)

| n | n^1/2 | n | n log(n) | n^3/2 | n^2 |
|---|---|---|---|---|---|
| tiny | 3.3x10^-8 seconds | 3.3x10^-7 seconds | 6.7x10^-7 seconds | 3.3x10^-6 seconds | 3.3x10^-5 seconds |
| small | 3.3x10^-7 seconds | 3.3x10^-5 seconds | 1.3x10^-4 seconds | 3.3x10^-3 seconds | .33 seconds |
| medium | 3.3x10^-6 seconds | 3.3x10^-3 seconds | .02 seconds | 3.3 seconds | 55 minutes |
| large | 3.3x10^-5 seconds | .33 seconds | 2.7 seconds | 55 minutes | 1.04 years |
| huge | 3.3x10^-4 seconds | 33 seconds | 5.5 minutes | 38.2 days | 10,464 years |
Complexity

Table 6: Computational Feasibility on an Intel Paragon XP/S A4 (4.2 gigaflop performance assumed)

| n | n^1/2 | n | n log(n) | n^3/2 | n^2 |
|---|---|---|---|---|---|
| tiny | 2.4x10^-9 seconds | 2.4x10^-8 seconds | 4.8x10^-8 seconds | 2.4x10^-7 seconds | 2.4x10^-6 seconds |
| small | 2.4x10^-8 seconds | 2.4x10^-6 seconds | 9.5x10^-6 seconds | 2.4x10^-4 seconds | .024 seconds |
| medium | 2.4x10^-7 seconds | 2.4x10^-4 seconds | .0014 seconds | .24 seconds | 4.0 minutes |
| large | 2.4x10^-6 seconds | .024 seconds | .19 seconds | 4.0 minutes | 27.8 days |
| huge | 2.4x10^-5 seconds | 2.4 seconds | 24 seconds | 66.7 hours | 761 years |
Complexity

Table 7: Computational Feasibility on a Teraflop Grand Challenge Computer (1000 gigaflop performance assumed)

| n | n^1/2 | n | n log(n) | n^3/2 | n^2 |
|---|---|---|---|---|---|
| tiny | 10^-11 seconds | 10^-10 seconds | 2x10^-10 seconds | 10^-9 seconds | 10^-8 seconds |
| small | 10^-10 seconds | 10^-8 seconds | 4x10^-8 seconds | 10^-6 seconds | 10^-4 seconds |
| medium | 10^-9 seconds | 10^-6 seconds | 6x10^-6 seconds | .001 seconds | 1 second |
| large | 10^-8 seconds | 10^-4 seconds | 8x10^-4 seconds | 1 second | 2.8 hours |
| huge | 10^-7 seconds | .01 seconds | .1 seconds | 16.7 minutes | 3.2 years |
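Tables 4 through 7 above all follow from the same arithmetic: time = operation count divided by machine speed in floating-point operations per second. A minimal sketch that reproduces them, using the speeds stated in the table captions (the unit breakpoints are chosen to roughly match the tables' own units):

```python
import math

sizes = {"tiny": 1e2, "small": 1e4, "medium": 1e6, "large": 1e8, "huge": 1e10}
classes = {"n^1/2": lambda n: n ** 0.5, "n": lambda n: n,
           "n log(n)": lambda n: n * math.log10(n),
           "n^3/2": lambda n: n ** 1.5, "n^2": lambda n: n ** 2}

# Machine speeds in floating-point operations per second, as assumed
# in the captions of Tables 4 through 7.
machines = {"Pentium PC": 1e7, "SGI Onyx": 3e8,
            "Intel Paragon XP/S A4": 4.2e9, "teraflop computer": 1e12}

def pretty(s):
    """Render a time in seconds using units comparable to the tables'."""
    if s < 60:      return f"{s:.2g} seconds"
    if s < 3600:    return f"{s / 60:.3g} minutes"
    if s < 86400:   return f"{s / 3600:.3g} hours"
    if s < 3.156e7: return f"{s / 86400:.3g} days"
    return f"{s / 3.156e7:.3g} years"

for machine, flops in machines.items():
    print(machine)
    for label, n in sizes.items():
        print(f"  {label:6s} " +
              "  ".join(pretty(f(n) / flops) for f in classes.values()))
```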
Complexity

Table 8: Types of Computers for Interactive Feasibility (Response Time < 1 Second)

| n | n^1/2 | n | n log(n) | n^3/2 | n^2 |
|---|---|---|---|---|---|
| tiny | Personal Computer | Personal Computer | Personal Computer | Personal Computer | Personal Computer |
| small | Personal Computer | Personal Computer | Personal Computer | Personal Computer | Supercomputer |
| medium | Personal Computer | Personal Computer | Personal Computer | Supercomputer | Teraflop Computer |
| large | Personal Computer | Workstation | Supercomputer | Teraflop Computer | --- |
| huge | Personal Computer | Supercomputer | Teraflop Computer | --- | --- |
Complexity

Table 9: Types of Computers for Feasibility (Response Time < 1 Week)

| n | n^1/2 | n | n log(n) | n^3/2 | n^2 |
|---|---|---|---|---|---|
| tiny | Personal Computer | Personal Computer | Personal Computer | Personal Computer | Personal Computer |
| small | Personal Computer | Personal Computer | Personal Computer | Personal Computer | Personal Computer |
| medium | Personal Computer | Personal Computer | Personal Computer | Personal Computer | Personal Computer |
| large | Personal Computer | Personal Computer | Personal Computer | Personal Computer | Teraflop Computer |
| huge | Personal Computer | Personal Computer | Personal Computer | Supercomputer | --- |
Complexity

Table 10: Transfer Times for a Variety of Data Transfer Regimes

| n | standard ethernet, 10 megabits/sec (1.25x10^6 bytes/sec) | fast ethernet, 100 megabits/sec (1.25x10^7 bytes/sec) | hard disk transfer, 2027 kilobytes/sec (2.027x10^6 bytes/sec) | cache transfer @ 200 megahertz (2x10^8 bytes/sec) |
|---|---|---|---|---|
| tiny | 8x10^-5 seconds | 8x10^-6 seconds | 4.9x10^-5 seconds | 5x10^-7 seconds |
| small | 8x10^-3 seconds | 8x10^-4 seconds | 4.9x10^-3 seconds | 5x10^-5 seconds |
| medium | .8 seconds | .08 seconds | .49 seconds | 5x10^-3 seconds |
| large | 1.3 minutes | 8 seconds | 49 seconds | .5 seconds |
| huge | 2.2 hours | 13.3 minutes | 1.36 hours | 50 seconds |
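The entries of Table 10 are simply the data set size in bytes divided by the channel rate in bytes per second. A minimal sketch (times printed in seconds):

```python
# Table 10's entries are just data set size in bytes divided by the
# channel rate in bytes per second.
rates = {
    "standard ethernet (10 megabits/sec)": 1.25e6,
    "fast ethernet (100 megabits/sec)":    1.25e7,
    "hard disk transfer (2027 KB/sec)":    2.027e6,
    "cache transfer @ 200 MHz":            2e8,
}
sizes = {"tiny": 1e2, "small": 1e4, "medium": 1e6, "large": 1e8, "huge": 1e10}

for label, nbytes in sizes.items():
    row = "  ".join(f"{nbytes / r:.3g} s" for r in rates.values())
    print(f"{label:6s} {row}")
```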
Complexity

Table 11: Resolvable Number of Pixels Across Screen for Several Viewing Scenarios

| | 19 inch monitor @ 24 inches | 25 inch TV @ 12 feet | 15 foot screen @ 20 feet | immersion |
|---|---|---|---|---|
| Angle | 39.005° | 9.922° | 41.112° | 140° |
| 5 seconds of arc resolution (Valyus) | 28,084 | 7,144 | 29,601 | 100,800 |
| 1 minute of arc resolution | 2,340 | 595 | 2,467 | 8,400 |
| 3.6 minutes of arc resolution (Wegman) | 650 | 165 | 685 | 2,333 |
| 4.38 minutes of arc resolution (Maar 1) | 534 | 136 | 563 | 1,918 |
| .486 minutes of arc/foveal cone (Maar 2) | 4,815 | 1,225 | 5,076 | 17,284 |
Complexity

Scenarios:
- Typical high resolution workstations: 1280x1024 = 1.31x10^6 pixels
- Realistic, using Wegman, immersion, 4:5 aspect ratio: 2333x1866 = 4.35x10^6 pixels
- Very optimistic, using 1 minute of arc, immersion, 4:5 aspect ratio: 8400x6720 = 5.65x10^7 pixels
- Wildly optimistic, using Maar (2), immersion, 4:5 aspect ratio: 17,284x13,828 = 2.39x10^8 pixels
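Each entry of Table 11 is the viewing angle divided by the angular resolution, once both are in the same units, and the scenario pixel counts multiply the width by a 4:5 aspect-ratio height. A short Python check:

```python
# Pixels resolvable across a display = viewing angle / angular resolution.
# Angles are in degrees (Table 11); resolutions are in minutes of arc.
angles = {"19in monitor @ 24in": 39.005, "25in TV @ 12ft": 9.922,
          "15ft screen @ 20ft": 41.112, "immersion": 140.0}
resolutions = {"Valyus, 5 arcsec": 5 / 60, "1 arcmin": 1.0,
               "Wegman, 3.6 arcmin": 3.6, "Maar 1, 4.38 arcmin": 4.38,
               "Maar 2, .486 arcmin/cone": 0.486}

for rname, res in resolutions.items():
    row = ", ".join(f"{60 * a / res:,.0f}" for a in angles.values())
    print(f"{rname}: {row}")

# Total pixel counts for the scenarios (4:5 aspect ratio):
for w, h in [(1280, 1024), (2333, 1866), (8400, 6720), (17284, 13828)]:
    print(f"{w} x {h} = {w * h:.3g} pixels")
```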
Massive Data Sets

One Terabyte Data Set
vs.
One Million Megabyte Data Sets

Both are difficult to analyze, but for different reasons.
Massive Data Sets: Commonly Used Language

- Data Mining = DM
- Knowledge Discovery in Databases = KDD
- Massive Data Sets = MD
- Data Analysis = DA
Massive Data Sets

Data Mining of Massive Data Sets

Data Mining is Exploratory Data Analysis with little or no human interaction using computationally feasible techniques, i.e., the attempt to find interesting structure unknown a priori.
Statistical Data Mining

Techniques:
- Classification
- Clustering
- Neural Networks & Genetic Algorithms
- CART
- Nonparametric Regression
- Time Series: Trend & Spectral Estimation
- Density Estimation, Bumps and Ridges
Massive Data Sets

Major Issues:
- Complexity
- Non-homogeneity

Examples:
- Huber's Air Traffic Control
- Highway Maintenance
- Ultrasonic NDE
Massive Data Sets

Air Traffic Control:
- 6 to 12 radar stations, several hundred aircraft, a 64-byte record per radar per aircraft per antenna turn
- Roughly a megabyte of data per minute
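A back-of-the-envelope check on the megabyte-per-minute figure. The antenna turn rate and exact counts below are assumptions chosen for illustration; the slide gives only ranges:

```python
# Back-of-the-envelope data rate for Huber's air-traffic-control example.
radars = 10            # "6 to 12 radar stations" (assumed value)
aircraft = 300         # "several hundred aircraft" (assumed value)
record_bytes = 64      # one record per radar per aircraft per antenna turn
turns_per_minute = 6   # assumed ~10-second antenna rotation

bytes_per_minute = radars * aircraft * record_bytes * turns_per_minute
print(f"{bytes_per_minute / 1e6:.1f} MB per minute")  # roughly a megabyte
```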
Massive Data Sets

Highway Maintenance:
- Maintenance records and measurements of road quality for several decades
- Records of uneven quality
- Records missing
Massive Data Sets

NDE Using Ultrasound:
- Inspection of cast iron projectiles
- A time series of length 256 at each of 360 degrees and 550 levels = 50,688,000 observations per projectile
- Several thousand projectiles per day
Massive Data Sets: A Distinction

Human analysis of the structure of the data and its pitfalls
vs.
human analysis of the data itself.

The limits of the human visual system (HVS) and computational complexity constrain the latter; the former is the basis for the design of the analysis engine.
Massive Data Sets

Data Types:
- Experimental
- Observational
- Opportunistic

Data Types:
- Numerical
- Categorical
- Image
Data Preparation
Data Preparation
[Bar chart: Effort (%) devoted to each stage of the process: Objectives Determination, Data Preparation, Data Mining, Analysis & Assimilation. Vertical axis runs 0 to 60 percent.]
Data Preparation
- Data Cleaning and Quality
- Types of Data
- Categorical versus Continuous Data
- Problem of Missing Data
  - Imputation
  - Missing Data Plots
- Problem of Outliers
- Dimension Reduction, Quantization, Sampling
Data Preparation
Quality:
- Data may not have any statistically significant patterns or relationships
- Results may be inconsistent with other data sets
- Data are often of uneven quality, e.g. made up by the respondent
- Opportunistically collected data may have biases or errors
- Discovered patterns may be too specific or too general to be useful
Data Preparation
Noise - Incorrect Values:
- Faulty data collection instruments, e.g. sensors
- Transmission errors, e.g. intermittent errors from satellite or Internet transmissions
- Data entry problems
- Technology limitations
- Naming conventions misused
Data Preparation
Noise - Incorrect Classification:
- Human judgment
- Time-varying data
- Uncertainty/probabilistic nature of data
Data Preparation
Redundant/Stale Data:
- Variables have different names in different databases
- A raw variable in one database is a derived variable in another
- Irrelevant variables destroy speed (dimension reduction needed)
- Changes in a variable over time are not reflected in the database
Data Preparation
- Data cleaning
- Selecting an appropriate data set and/or sampling strategy
- Transformations
Data Preparation
Data Cleaning:
- Duplicate removal (tool based)
- Missing value imputation (manual, statistical)
- Identify and remove data inconsistencies
- Identify and refresh stale data
- Create a unique record (case) ID
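A minimal pandas sketch of a few of these steps on a small hypothetical table; the column names and the median imputation rule are illustrative, not part of the course material:

```python
import pandas as pd

# Hypothetical customer records; names and the median imputation rule
# are illustrative only.
df = pd.DataFrame({
    "name":   ["Ann", "Ann", "Bob", "Cal"],
    "income": [52000, 52000, None, 61000],
})

df = df.drop_duplicates()                                   # duplicate removal
df["income"] = df["income"].fillna(df["income"].median())   # value imputation
df = df.reset_index(drop=True)
df["case_id"] = df.index                                    # unique record ID
print(df)
```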
Data Preparation
Categorical versus Continuous Data:
- Most statistical theory and many graphics tools were developed for continuous data
- Much, if not most, of the data in databases is categorical
- The computer science view often turns continuous data into categorical data, e.g. salaries categorized as low, medium, high, because categories are more suited to Boolean operations
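A small illustration of the discretization the slide describes, using pandas; the salary cut points are hypothetical:

```python
import pandas as pd

salaries = pd.Series([23000, 41000, 58000, 87000, 150000])

# The cut points are hypothetical.  Binning like this suits Boolean
# queries but discards the information a continuous method could use.
bands = pd.cut(salaries, bins=[0, 40000, 80000, float("inf")],
               labels=["low", "medium", "high"])
print(bands.tolist())  # ['low', 'medium', 'medium', 'high', 'high']
```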
Data Preparation
Problem of Missing Values:
- Missing values in massive data sets may or may not be a problem
- Missing data may be irrelevant to the desired result, e.g. cases with missing demographic data may not help if I am trying to create a selection mechanism for good customers based on demographics
- Massive data sets acquired by instrumentation may have few missing values anyway
- Imputation carries model assumptions
- Suggest making a missing value plot
Data Preparation
Missing Value Plot:
- A plot of variables by cases, with missing values colored red
- A special case of the "color histogram" with binary data
- The "color histogram" is also known as the "data image"
- This example is 67 dimensions by 1000 cases
- This example is also fake
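A minimal matplotlib sketch of such a missing value plot. The data here are simulated (as the slide notes, its own example is fake), at the slide's scale of 67 variables by 1000 cases:

```python
import numpy as np
import matplotlib.pyplot as plt

# Simulated data matching the slide's example: 67 variables by 1000
# cases, with about 5% of the values knocked out at random.
rng = np.random.default_rng(0)
data = rng.normal(size=(67, 1000))
data[rng.random(data.shape) < 0.05] = np.nan

# One pixel per (variable, case) cell, colored red where missing:
# a binary special case of the color histogram / data image.
plt.imshow(np.isnan(data), aspect="auto", cmap="Reds", interpolation="nearest")
plt.xlabel("case")
plt.ylabel("variable")
plt.title("Missing value plot")
plt.show()
```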
Data Preparation
Problem of Outliers:
- Outliers are easy to detect in low dimensions
- A high dimensional outlier may not show up in low dimensional projections
- The MVE and MCD algorithms are exponentially computationally complex
- Fisher information matrix and convex hull peeling methods are more feasible, but still too complex for massive data sets
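MVE and MCD are not sketched here, but the projection point is easy to demonstrate: a small numpy example, on simulated data, in which an outlier looks unremarkable in every one-dimensional projection yet is extreme once the correlation structure is taken into account via the Mahalanobis distance:

```python
import numpy as np

rng = np.random.default_rng(1)

# A tightly correlated 2-D cloud, plus one point that lies inside both
# marginal ranges but far from the joint structure.
n = 1000
x = rng.normal(size=n)
cloud = np.column_stack([x, x + 0.1 * rng.normal(size=n)])
outlier = np.array([2.0, -2.0])        # unremarkable in either coordinate
X = np.vstack([cloud, outlier])

# Coordinatewise z-scores miss it; the Mahalanobis distance, which
# accounts for the correlation, ranks it first.
mu = X.mean(axis=0)
inv_cov = np.linalg.inv(np.cov(X, rowvar=False))
d = np.sqrt(np.einsum("ij,jk,ik->i", X - mu, inv_cov, X - mu))
print("outlier z-scores:", np.abs((outlier - mu) / X.std(axis=0)))
print("outlier Mahalanobis rank:", list(np.argsort(-d)).index(n))  # 0 = largest
```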
Data Preparation

Database Sampling:
- Exhaustive search may not be practically feasible because of the database's size
- KDD systems must be able to assist in the selection of appropriate parts of the databases to be examined
- For sampling to work, the data must satisfy certain conditions (not ordered, no systematic biases)
- Sampling can be a very expensive operation, especially when the sample is taken from data stored in a DBMS; sampling 5% of the database can be more expensive than a sequential full scan of the data
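One way to keep sampling down to a single sequential scan, offered as an illustration rather than as part of the course material, is reservoir sampling, which draws a uniform sample of k records in one pass:

```python
import random

def reservoir_sample(stream, k, seed=None):
    """Draw a uniform random sample of k records in one sequential pass."""
    rng = random.Random(seed)
    reservoir = []
    for i, record in enumerate(stream):
        if i < k:
            reservoir.append(record)           # fill the reservoir first
        else:
            j = rng.randint(0, i)              # keep record with prob k/(i+1)
            if j < k:
                reservoir[j] = record
    return reservoir

print(reservoir_sample(range(10**6), 5, seed=42))
```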
Data Compression

Data preparation often involves data compression:
- Sampling
- Quantization

This is the subject of my talk later in the conference; see that talk for more details.