Jun Yan
Geography Department, SUNY at Buffalo
July 29, 2004
Geographic Knowledge Discovery in Spatial Interaction With Self-Organizing Maps
Ph.D. Dissertation Defense
Dissertation Committee: Dr. Jean-Claude Thill (Chair), Dr. Ling Bian, Dr. David Mark
Outline
Background
Spatial Interaction Data
Methodology
  Self-Organizing Maps
  Visual Data Mining
Case studies
Conclusions and Future Research
Background
Information technologies
More tools available
More data available
Two Legs!!!
Data-rich vs computation-rich:
challenge?
opportunity !!!
Background (Cont.)
Data Mining & Knowledge Discovery: “useful information from large databases”
useful, novel, valid, understandable
Geographic data mining (GDM) and geographic knowledge discovery (GKD)?
Background (Cont.)
Mining techniques: statistics, pattern recognition, machine learning, visualization, high performance computing …
Knowledge discovery process
[Figure: knowledge discovery process. Labels: User; Controller; DBMS; DB; Interface; Target Data; Selection; Data Mining; Evaluation; Discoveries; Domain Knowledge; Knowledge Base]
Background (Cont.)
Finding all the patterns autonomously in a database?: unrealistic
because the patterns could be too many but uninteresting
Data mining: an iterative, interactive, semi-automated process
people direct what is to be mined
Visualization: Geovisualization (GVis)
visual data mining !!!
Visualization in KDD Process
Selecting Application Domain
Selecting Target Data
Processing Data
Extracting Information/Knowledge
Interpretation and Evaluation
Understanding basic data distribution, selecting meaningful target datasets
Locating missing data, noise removing, data smoothing
Parameters setting, process tracking, process steering
Interpretation, reporting, comparison, validity checking
Background (Cont.)
Machine learning & Neural Networks
[Figure: a learning algorithm takes examples and (sometimes) background knowledge, and produces a concept description or other knowledge]
[Figure: neural network mapping inputs to outputs through an input layer, a hidden layer, and an output layer]
Background (Cont.)
Objectives:
Explore the effectiveness of neural networks in GKD
Examine the roles of GVis in GKD
Spatial Interaction Data
What is spatial interaction? Pairs of places
Elemental: trips made by individuals
Aggregate: flows from origins to destinations
Examples: migration, freight shipment, movement of capital & information …
Spatial Interaction Data (Cont.)
Aggregate level
Basic O-D matrix
            Region 1   Region 2   Region 3
Region 1
Region 2
Region 3
Dyadic O-D matrix
                      Type 1   Type 2   Type 3
Region 1 > Region 1
Region 1 > Region 2
Region 1 > Region 3
Elemental level
Trip table
          Origin   Destination   Distance
Trip 1
Trip 2
Trip 3
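The step from the elemental level to the aggregate level can be sketched in a few lines of Python. This is only an illustration: the trip records, the field names (`origin`, `destination`, `type`), and both helper functions are hypothetical and not taken from the dissertation.

```python
from collections import defaultdict

# Hypothetical trip table (elemental level): one record per trip.
trips = [
    {"origin": "Region 1", "destination": "Region 2", "type": "Type 1"},
    {"origin": "Region 1", "destination": "Region 2", "type": "Type 2"},
    {"origin": "Region 1", "destination": "Region 3", "type": "Type 1"},
    {"origin": "Region 2", "destination": "Region 1", "type": "Type 1"},
]


def basic_od_matrix(trips):
    """Basic O-D matrix: total number of trips per (origin, destination) pair."""
    flows = defaultdict(int)
    for trip in trips:
        flows[(trip["origin"], trip["destination"])] += 1
    return dict(flows)


def dyadic_od_matrix(trips):
    """Dyadic O-D matrix: one row per O-D pair, one column per interaction type."""
    rows = defaultdict(lambda: defaultdict(int))
    for trip in trips:
        rows[(trip["origin"], trip["destination"])][trip["type"]] += 1
    return {pair: dict(counts) for pair, counts in rows.items()}


basic = basic_od_matrix(trips)
dyadic = dyadic_od_matrix(trips)
```

In the dyadic form each origin-destination pair becomes one record described by several interaction types, which is the kind of multidimensional record a SOM can take as input.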
Spatial Interaction Data (Cont.)
Exploring the Patterns of Interaction
Very necessary!!!
Existing Exploratory Data Analysis (EDA): lack of interactivity
Challenges:
a large number of interactions
wide range of interaction magnitudes
multiple semantics
Spatial Interaction Data (Cont.)
Origin
Destination
Interaction semantics
O-D Matrices
Multidimensionality!!!
Spatial Interaction Data (Cont.)
[Figure panels: Electronic products; Machinery; Vehicle and parts; Photographic products]
Methodology
Self-Organizing Maps (SOM)
Visual Data Mining (VDM):
SOM as core DM engine
Interactivity
Self-Organizing Maps
A crucial task of KDD: reduce data complexity
1) Data Quantization: number of records, here number of spatial interactions
2) Data Projection: number of variables, here number of interaction semantics
By reducing data complexity, identification of meaningful geographic structures becomes possible
Traditional multivariate statistical methods have their limitations
Self-Organizing Maps (Cont.)
[Figure: competitive network with an input layer and a competitive output layer; the output shows one winning node and the losing nodes]
1. A special type of competitive neural network;
2. Based on some measure of dissimilarity in the attribute space;
3. Capable of reducing data complexity on two dimensions simultaneously
4. Actually an unsupervised pattern classifier.
$$ m_k(t+1) = m_k(t) + \alpha(t)\, h_{ck}(t)\, \left[ x(t) - m_k(t) \right] $$
Self-Organizing Maps (Cont.)
1. Best match unit (BMU) changes its value to fit with the input data;
2. Its neighboring nodes change their values to fit with the input data as well. Only the magnitude decreases with distance;
3. Like a flexible net;
4. Similar data will locate close to each other in the mapping
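The behavior listed above (the BMU moves toward the input, and its neighbors move too, by an amount that shrinks with grid distance) can be sketched in plain Python. This is a minimal illustration, not the implementation used in the dissertation: the Gaussian neighborhood function, the linear decay of the learning rate and radius, and every function and parameter name are assumptions.

```python
import math
import random


def best_matching_unit(weights, x):
    """Return the grid position (row, col) of the node whose weights are closest to x."""
    return min(
        weights,
        key=lambda pos: sum((w - v) ** 2 for w, v in zip(weights[pos], x)),
    )


def som_step(weights, x, alpha, sigma):
    """One update: m_k <- m_k + alpha * h_ck * (x - m_k) for every node k."""
    c = best_matching_unit(weights, x)
    for k, m_k in weights.items():
        grid_dist_sq = (k[0] - c[0]) ** 2 + (k[1] - c[1]) ** 2
        # Assumed Gaussian neighborhood: 1 at the BMU, smaller farther away.
        h_ck = math.exp(-grid_dist_sq / (2.0 * sigma ** 2))
        weights[k] = [m + alpha * h_ck * (v - m) for m, v in zip(m_k, x)]
    return c


def train_som(data, rows, cols, epochs=20, alpha0=0.5, sigma0=1.5, seed=0):
    """Train a rows x cols map; alpha and sigma decay linearly (assumed schedule)."""
    rng = random.Random(seed)
    dim = len(data[0])
    weights = {
        (r, c): [rng.random() for _ in range(dim)]
        for r in range(rows)
        for c in range(cols)
    }
    total = epochs * len(data)
    t = 0
    for _ in range(epochs):
        for x in data:
            frac = 1.0 - t / total
            som_step(weights, x, alpha0 * frac, max(sigma0 * frac, 0.3))
            t += 1
    return weights
```

Because neighboring nodes are pulled along with the BMU, the map behaves like the flexible net mentioned in the slide, which is what places similar inputs near each other.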
Visual Data Mining
[Figure: VDM framework. Labels: Visualization Forms; Interaction Forms; Operation; Assignment; Focusing; Brushing; Colormap manipulation; Dynamic linking]
Visualization Forms
Case Studies
Airline Origin and Destination Survey Market Table (DB1BMarket): http://www.bts.org
10% of air flight itineraries
Geographic scale: airport level
280 metros in the contiguous US
Temporal range: 1993 to 2002
Two case studies on DB1BMarket:
Cross-sectional analysis
Temporal changes
[Figure: SOM feature map with clusters labeled 1 through 9]
Clustering Analysis
1. A cluster is an area of low values (distance) surrounded by areas of high values (distance).
2. There are several clusters in the feature map
[Figure: SOM feature map with clusters labeled 1 through 8 and cluster 9 split into 9-1 through 9-5]
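The idea that a cluster is an area of low distance values surrounded by high ones can be illustrated with a distance map computed from trained SOM weights. The sketch below assumes a U-matrix-style measure (the average distance between a node and its four grid neighbors); whether the feature maps in the slides were computed exactly this way is not stated, and the function name and toy weights are hypothetical.

```python
import math


def u_matrix(weights, rows, cols):
    """Average distance between each node's weights and those of its grid neighbors."""
    u = {}
    for r in range(rows):
        for c in range(cols):
            dists = []
            for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                nb = (r + dr, c + dc)
                if nb in weights:
                    dists.append(math.dist(weights[(r, c)], weights[nb]))
            u[(r, c)] = sum(dists) / len(dists) if dists else 0.0
    return u


# Toy 1 x 6 map: two groups of similar nodes separated by a large gap.
toy_weights = {
    (0, 0): [0.0], (0, 1): [0.1], (0, 2): [0.2],
    (0, 3): [5.0], (0, 4): [5.1], (0, 5): [5.2],
}
toy_u = u_matrix(toy_weights, 1, 6)
```

In the toy map, the nodes inside each group get low values and the two nodes at the gap get high values, so each group appears as a low area bounded by a ridge.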
Clustering Analysis (Cont.)
A cluster is a valley in a 3-D map
Cluster Analysis (Cont.)
[Figure panels: Market Share; Contribution]
Cluster Analysis (Cont.)
Cluster #   Cluster Property (Airline)
1           America West (HP)
2           US Air (US)
3           Continental (CO), Continental Express (RU)
4           Northwest (NW), Mesaba (XJ)
5           Horizon (QX)
6           United (UA)
7           Air Wisconsin (ZW)
8           American (AA), American Eagle (MQ)
9-1         No dominant airlines
9-2         Southwest (WN)
9-3         Comair (OH)
9-4         Delta (DL)
9-5         Delta (DL), Atlantic Southeast (EV)
[Figure: feature map with each cluster labeled by its dominant airline code(s), plus an area labeled "Multiple"]
Cluster Analysis (Cont.)
Markets with US Airways Market Share >= 50%
Markets Represented by Cluster 2
Cluster 2
Cluster Analysis: Markets From Nashville
[Figure: markets from Nashville, labeled by airline code: AA, US, NW, UA, DL, CO, RU, WN, EV]
Cluster Analysis: Markets From Nashville (Cont.)
[Figure: markets from Nashville, labeled by airline code: AA, US, NW, UA, DL, CO, RU, WN, EV]
Association Analysis
[Figure panels: Market Share; Average Airfare]
Association Analysis (Cont.)
[Figure panels: American; Delta]
Association Analysis (Cont.)
Average Airfare, Delta (without competition of Airtran)
Average Airfare, Delta (with competition of Airtran)
Temporal Changes
Temporal Changes (Cont.)
[Figure panels: AA 1993; TWA 2001; AA 2001; AA 2002]
Temporal Changes (Cont.)
[Figure panels: Continental share; Northwest share]
Temporal Changes: Trajectory
Market from Buffalo to DC
[Figure panels: US Airways share; Southwest share; US Airways fare. Each panel shows the market's trajectory with points labeled 93, 96, 98, 00, 01]
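One common way to build such a trajectory is to map a single market's yearly data vector onto a trained SOM and connect the resulting best matching units in time order. The sketch below assumes that approach; the tiny map, the two-component vectors, and the yearly values are invented purely for illustration and are not data from DB1BMarket.

```python
def best_matching_unit(weights, x):
    """Return the grid position of the node whose weights are closest to x."""
    return min(
        weights,
        key=lambda pos: sum((w - v) ** 2 for w, v in zip(weights[pos], x)),
    )


def trajectory(weights, yearly_vectors):
    """Map each year's vector for one market to its BMU, in chronological order."""
    return [
        (year, best_matching_unit(weights, yearly_vectors[year]))
        for year in sorted(yearly_vectors)
    ]


# Hypothetical 1 x 3 map over [carrier A share, carrier B share].
toy_weights = {(0, 0): [0.9, 0.1], (0, 1): [0.5, 0.5], (0, 2): [0.1, 0.9]}

# Hypothetical yearly vectors for one market (illustrative values only).
toy_market = {1993: [0.85, 0.10], 1998: [0.55, 0.40], 2001: [0.20, 0.80]}

path = trajectory(toy_weights, toy_market)
```

Plotting `path` on top of a component plane (for example, one carrier's share) would show how the market drifts across the map as competition changes.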
Conclusions
Data rich environment: large databases, and high dimensionality
Data complexity reduction is crucial
Results suggest that SOM:
summarizes the overall data distribution well
is capable of detecting clustered structures
can be used to analyze the properties of clustered structures
can be used to study the associations among input variables
Conclusions (Cont.)
Interactive visual data mining can: examine subset data more closely
study relationships among interaction types
analyze how detected clusters are distributed in the actual geographic space
Help us gain a better understanding of the factors and spatial processes behind
Future Research
SOM/VDM analysis:
DB1BMarket
Other types of spatial interaction data
Data at elemental level
Improved VDM environment:
Human subject testing
Seamlessly coupled
Thank You!
Questions? Comments?
Background (Cont.)
Geographic database fits the profile:
massive volume: GIS, GPS, Remote Sensing …
high dimensionality
Geographic data mining (GDM) and geographic knowledge discovery (GKD)?
Current topic in GIS research
Background (Cont.)
Roles of Visualization
[Figure: stages ordered over time, from data driven to model driven, each with its visualization role]
Exploratory analysis: visual exploration & visual data mining
Knowledge construction: visual knowledge construction & refinement
Analysis and modeling: visual model tracking, model steering
Evaluation of results: data presentation, visualization of uncertainty
Modeling Flows
Spatial interaction models: “Gravity Models”
Other geographic factors:
Geographic relationships among origins?
Geographic relationships among destinations?
Association among types of interaction?
Modeling Flows
Spatial interaction models: “Gravity Models”
Push: origin
Pull: destination
Transportation cost: distance decay
$$ I_{ij} = k \frac{P_i P_j}{d_{ij}^{a}} = k\, P_i P_j\, d_{ij}^{-a} $$
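As a small worked example of the gravity formulation, the function below evaluates I_ij = k * P_i * P_j * d_ij^(-a). The function name, the default values of k and a, and the sample numbers are assumptions chosen for illustration only; in practice k and a would be calibrated from observed flows.

```python
def gravity_flow(p_i, p_j, d_ij, k=1.0, a=2.0):
    """Predicted interaction between origin i and destination j.

    p_i, p_j: push and pull masses (e.g. populations); d_ij: distance;
    k: scaling constant; a: distance-decay exponent.
    """
    if d_ij <= 0:
        raise ValueError("distance must be positive")
    return k * p_i * p_j * d_ij ** (-a)


# Illustrative values: doubling the distance with a = 2 quarters the flow.
near = gravity_flow(100, 200, 10)
far = gravity_flow(100, 200, 20)
```

With a = 2, the flow at distance 20 is one quarter of the flow at distance 10, which is the distance-decay effect named in the slide.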
Spatial Interaction Data (Cont.)
Spatial Interaction Data (Cont.)
Limitations of Traditional Multivariate Methods
Data Projection: factor analysis, projection pursuit, multi-dimensional scaling
Data Quantization: partitioning methods, hierarchical methods
o Linearity
o Stationary
o Normal distribution
o Limited data amount
o One dimension compression
versus
o Non-linear
o Non-stationary
o Distribution unknown
o Sparse
o Large data amount
o Multi-dimensional
Visualization Forms
Interaction Forms
Interaction Forms
Data Distribution
1. Similar data distributions
2. But greatly reduced number of low values
3. SOM prototype represents original data well
Cluster Analysis (Cont.)
Markets with Southwest Market Share >= 50%
Markets Represented by Cluster 9-2
Cluster 9-2
Markets with Southwest Market Share >= 20%
Temporal Changes (Cont.)
US Airways share
American share
Temporal Changes (Cont.)
Delta share
United share
Temporal Changes (Cont.)
Temporal Trend: Trajectory (Cont.)
Market from Buffalo to NYC
[Figure panels: US Airways share; JetBlue share; US Airways fare. Each panel shows the market's trajectory with points labeled 93, 96, 00, 01]
Temporal Trend: Trajectory (Cont.)
Market from Buffalo to Atlanta
[Figure panels: Airtran Airways share; Delta share; Delta fare. Each panel shows the market's trajectory with points labeled 93, 98]