Data Mining in Earth Sciences
Rahul Ramachandran, Sara Graves and Ken Keiser
Mathematical Challenges in Scientific Data Mining, IPAM, January 14-18, 2002
Information Technology and Systems Center
University of Alabama in Huntsville
http://datamining.itsc.uah.edu
Outline
• Introduction
• ADaM System
• Data Mining Taxonomy for Earth Science
  • Event/Relationship based
  • Application Examples
• Dimensionality Reduction
• References
Reasons for Data Mining of Earth Science Data
• Greatly increased data volume due to improvements in data collection/access/availability/storage technology (instruments, computational resources, internet…)
• Data volumes from Terra are about 1 terabyte per day - more than can be analyzed by conventional means
• High variability in data formats and content
• Need for high returns on expensive data investments
• Need for improved access/availability of data, information and knowledge
• Need for higher level products for the non-specialist and interdisciplinary/cross-domain researchers
• Questions/queries are getting more complex due, in part, to heterogeneous nature of the data
Characteristics of Earth Science Data
• High variability
  • Type:
    • Geostationary
    • Polar Orbiting
  • Structure
    • Raster
    • Vector
  • Resolution
    • Fine – AVHRR 1km
    • Coarse – SSM/I 20km
  • Multi/Hyper Spectral
  • Processing stage:
    • Level 0: Raw data – instrument counts
    • Level 1: Annotated with Geo-reference information
    • Level 2: Transformed by algorithm into geophysical parameter
    • Level 3: Spatial/Temporal resampling
    • Level 4: Includes additional model data
Characteristics of Earth Science Data
• Need to know physical basis (domain knowledge) before applying statistical techniques
• Multiple time scales
• Wide variety of data formats
• Includes spatial/temporal information
• Typically needs domain-specific algorithms
ADaM History
• Algorithm Development and Mining (ADaM) System
• The system provides knowledge discovery, feature detection and content-based searching for data values, as well as for metadata.
• It contains over 120 different operations to be performed on the input data stream.
• Operations vary from specialized atmospheric science data-set specific algorithms to different digital image processing techniques, processing modules for automatic pattern recognition, machine perception, neural networks and genetic algorithms.
• Developed an Event/Relationship Search System for the environment
ADaM Engine Architecture
[Diagram: data flow labeled Data, Translated Data, Preprocessed Data, Patterns/Models, Results, with Input, Processing (Preprocessing and Analysis) and Output stages]
• Input: HDF, HDF-EOS, GIF, PIP-2, SSM/I Pathfinder, SSM/I TDR, SSM/I NESDIS Lvl 1B, SSM/I MSFC Brightness Temp, US Rain, Landsat, ASCII Grass, Vectors (ASCII Text), Intergraph Raster, Others...
• Preprocessing
  • Selection and Sampling: Subsetting, Subsampling, Select by Value, Coincidence Search
  • Grid Manipulation: Grid Creation, Bin Aggregate, Bin Select, Grid Aggregate, Grid Select, Find Holes
  • Image Processing: Cropping, Inversion, Thresholding
  • Others...
• Analysis
  • Clustering: K Means, Isodata, Maximum
  • Pattern Recognition: Bayes Classifier, Min. Dist. Classifier
  • Image Analysis: Boundary Detection, Cooccurrence Matrix, Dilation and Erosion, Histogram Operations, Polygon Circumscript, Spatial Filtering, Texture Operations
  • Genetic Algorithms
  • Neural Networks
  • Others...
• Output: GIF Images, HDF-EOS, HDF Raster Images, HDF SDS, Polygons (ASCII, DXF), SSM/I MSFC Brightness Temp, TIFF Images, Others...
ADaM Mining Environment
[Diagram components:
• Distributed Clients: Web-based, Workstation-based, Other Systems
• Common Client API
• Data Mining Server
• Mining Engine (ADaM): Input Modules, Analysis Modules, Output Modules
• Knowledge Base
• Event/Relationship Search System
• Analysis/Vis Tools
• Data Stores
• Mining Results]
Data Mining Taxonomy
• Event-based mining
• Relationship-based mining
Event-based Mining
• Known events/Known algorithms
  • Tropical Cyclones from AMSU-A data
• Known events/Learned algorithms
  • Rainfall estimation from SSM/I data
  • Lightning Detection from OLS data
• Unknown event/Unknown algorithm
  • Target Independent Mining
Known Event/Known Algorithm
I know what phenomena to detect and I have the algorithm to do so!
[Diagram labels: Earth Science Data Sets; Add algorithm to Mining Environment; Results]
• Relationship analysis
• Coincidence searches
• Input for other algorithms
Tropical Cyclone Detection: Estimating Maximum Wind Speed
• Scientist: Dr. Roy Spencer (GHCC/MSFC NASA)
• Data used: Advanced Microwave Sounding Unit-A
• Radiometer can detect temperatures at different levels of the atmosphere
• Surface winds in tropical cyclones are directly related to the warm middle- and upper-atmosphere temperatures which exist around the cyclone center
• AMSU-A measures this warmth at several frequencies near 55 gigahertz (GHz)
• Calibrated using aircraft reconnaissance measurements in tropical depressions, tropical storms, and hurricanes from the 1998 Atlantic hurricane season
• Tropical cyclone detection based on ice scattering, water vapor and wind speed
Tropical Cyclone Detection: Estimating Maximum Wind Speed
[Diagram labels: Advanced Microwave Sounding Unit (AMSU-A) Data; Calibration/Limb Correction/Converted to Tb; Data Archive; Mining Environment; Result. Example shown: Hurricane Floyd]
Results are placed on the web and made available to the National Hurricane Center & Joint Typhoon Warning Center
• Water cover mask to eliminate land
• Laplacian filter to compute temperature gradients
• Science Algorithm to estimate wind speed
• Contiguous regions with wind speeds above a desired threshold identified
• Additional test to eliminate false positives
• Maximum wind speed and location produced
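The region-finding steps listed above can be sketched in a few lines. This is a minimal illustration, not the ADaM implementation: it assumes the wind speed grid has already been produced by the science algorithm (which is not reproduced here), and the function name `find_wind_regions`, the 4-connected neighborhood and the sample grid are all hypothetical. The additional false-positive test is omitted.

```python
from collections import deque

def find_wind_regions(wind, water_mask, threshold):
    """Return contiguous over-water regions with wind speed above threshold.

    wind: 2-D list of estimated wind speeds (assumed to come from the
          science algorithm, which is not shown here).
    water_mask: 2-D list of booleans, True where the cell is over water.
    Each region is reported as (max_speed, (row, col) of the max, cell_count).
    """
    rows, cols = len(wind), len(wind[0])
    seen = [[False] * cols for _ in range(rows)]
    regions = []
    for r in range(rows):
        for c in range(cols):
            if seen[r][c] or not water_mask[r][c] or wind[r][c] <= threshold:
                continue
            # Breadth-first flood fill over 4-connected neighbours.
            queue = deque([(r, c)])
            seen[r][c] = True
            best, best_pos, count = wind[r][c], (r, c), 0
            while queue:
                y, x = queue.popleft()
                count += 1
                if wind[y][x] > best:
                    best, best_pos = wind[y][x], (y, x)
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if (0 <= ny < rows and 0 <= nx < cols
                            and not seen[ny][nx]
                            and water_mask[ny][nx]
                            and wind[ny][nx] > threshold):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            regions.append((best, best_pos, count))
    return regions

# Hypothetical 3x4 grid; the bottom-right cell is land and is masked out.
wind = [[10, 40, 45, 5],
        [12, 50, 8, 5],
        [3, 4, 6, 70]]
water = [[True, True, True, True],
         [True, True, True, True],
         [True, True, True, False]]
print(find_wind_regions(wind, water, threshold=30))
```

A size or shape check on each returned region would be one place to hang the false-positive test mentioned in the list.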
Known Event/Learned Algorithm
I know what phenomena I want to detect but I do not know the characteristics of the phenomena
[Diagram labels: Earth Science Data Sets; Data Mining System; Results; Refine your algorithm using iteration]
• Relationship analysis
• Coincidence searches
• Input for other algorithms
Rainfall Estimation and Identification Study using SSM/I data
• Scientist: Dr. Steve Goodman (GHCC/MSFC NASA)
• To determine whether generic pattern recognition techniques could be applied to SSM/I data to detect rain
• Minimum Distance Classifier, Back-propagation Neural Network and Discrete Bayes Classifier were compared against a Science Algorithm (WetNet PIP Algorithm)
• US Composite rainfall product was used as ground truth
[Figure panels: Subsetted SSM/I data; NEXRAD Composite data]
Rainfall Estimation and Identification Study using SSM/I data
• SSM/I and US rain data over southeastern United States for the period January and July 1995 were compared in the study
• SSM/I and Radar data were gridded and registered to establish spatial and temporal coincidence
• BPNN performance was comparable to that of the WetNet PIP SSM/I rain rate algorithm
• Performance of Bayes classifier was not as good as that of the WetNet PIP SSM/I rain rate algorithm. This is perhaps due to the small sample size used for estimating density functions of the two classes (rain and non-rain)
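Of the classifiers compared in this study, the minimum distance classifier is the simplest to illustrate. The sketch below is a generic version, not the study's code: each class is represented by the mean of its training vectors, and a new vector is assigned to the class with the nearest mean. The function names and the two-channel feature values are made up for illustration and are not SSM/I measurements.

```python
import math

def train_min_distance(samples, labels):
    """Compute one mean vector (centroid) per class label."""
    sums, counts = {}, {}
    for vec, lab in zip(samples, labels):
        acc = sums.setdefault(lab, [0.0] * len(vec))
        for i, v in enumerate(vec):
            acc[i] += v
        counts[lab] = counts.get(lab, 0) + 1
    return {lab: [s / counts[lab] for s in acc] for lab, acc in sums.items()}

def classify_min_distance(centroids, vec):
    """Assign vec to the class whose centroid is nearest (Euclidean distance)."""
    return min(centroids, key=lambda lab: math.dist(centroids[lab], vec))

# Hypothetical two-channel feature vectors with ground-truth labels.
samples = [[250, 260], [255, 265], [200, 210], [205, 215]]
labels = ["rain", "rain", "no_rain", "no_rain"]
centroids = train_min_distance(samples, labels)
print(classify_min_distance(centroids, [248, 258]))
```

In a study like the one described, the labels would come from the gridded radar product and the feature vectors from the coincident satellite pixels.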
Lightning Detection in Operational Linescan System (OLS) Images
• Scientist: Dr. Steve Goodman (GHCC/MSFC NASA)
• To identify lightning streaks in night time portions of OLS images
• OLS is carried by DMSP satellites and produces a visible and thermal image
• Lightning shows up as bright horizontal streaks as do city lights and moonlight reflected off the clouds
• Approach based on morphological filtering and gradient detection was selected
• Both visible and thermal band used
Lightning Detection in Operational Linescan System (OLS) Images
• Erosion and dilation were used to find areas in/near clouds; other areas were removed
• Gradient detection in the direction of satellite propagation was applied to the visible image to extract horizontal streaks
• Texture measures were used to identify areas of small patchy cloud cover which exhibited small bright streaks
• Genetic algorithm was used to tune parameters of the classification during training

Results (% Accuracy)
                    Correctly Detected   False Positives   False Negatives
Training Results    80                   0.7               19.2
Test Results        78.2                 4.3               17.3
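The morphological filtering step can be illustrated with binary erosion and dilation on a small mask. This is a generic sketch with a 3x3 structuring element and hypothetical function names; the structuring element and thresholds used for the OLS work are not given in the slides.

```python
def _window(img, r, c):
    """Values in the 3x3 window centred on (r, c), clipped at the image edge."""
    rows, cols = len(img), len(img[0])
    return [img[y][x]
            for y in range(max(r - 1, 0), min(r + 2, rows))
            for x in range(max(c - 1, 0), min(c + 2, cols))]

def erode(img):
    """A pixel stays 1 only if every pixel in its window is 1."""
    return [[1 if all(_window(img, r, c)) else 0
             for c in range(len(img[0]))] for r in range(len(img))]

def dilate(img):
    """A pixel becomes 1 if any pixel in its window is 1."""
    return [[1 if any(_window(img, r, c)) else 0
             for c in range(len(img[0]))] for r in range(len(img))]

def opening(img):
    """Erosion followed by dilation: removes specks smaller than the window."""
    return dilate(erode(img))

# Hypothetical cloud mask: a 3x3 cloud plus one isolated bright pixel.
cloud = [[0, 0, 0, 0, 0],
         [0, 1, 1, 1, 0],
         [0, 1, 1, 1, 0],
         [0, 1, 1, 1, 0],
         [0, 0, 0, 0, 0]]
noisy = [row[:] for row in cloud]
noisy[0][0] = 1
print(opening(noisy) == cloud)
```

Dilating the cleaned mask once more would extend it to pixels near the cloud, which matches the "in/near clouds" wording above.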
Unknown Event/Unknown Algorithm
I want to find anomalies in the data sets!
Let the miner “discover” it
[Diagram labels: Earth Science Data Sets; Data Mining System; Results]
• Relationship analysis
• Coincidence searches
• Input for other algorithms
Example: Target Independent Mining
Target Independent Mining of SSM/I Data
• Mine for data in a target independent manner (no specific phenomena under consideration)
• Interested in transient phenomena that move through an area
• Transient phenomena characterized as deviation from normal
• Objective: Data Reduction with minimum loss of information
  • Size of remotely sensed data prevents it from being maintained on-line
  • Data is archived in much slower tertiary storage
  • Need to develop techniques to minimize the need for data access from the tertiary storage
• Procedure: Overlay the earth’s surface with a constant grid consisting of cells
  • For each cell a maximum and minimum trend line is computed
  • Maximum trend line is computed by forming a set of maximum values for a day over some period (month)
  • Median for a series of months is used to form the maximum trend line
  • Same procedure used to calculate minimum trend line
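One possible reading of the trend-line procedure for a single grid cell is sketched below. The slides do not spell out how the median is taken, so this sketch assumes the daily maxima (and minima) from a series of months are pooled and their median forms the trend value. The rule that anything outside the resulting band counts as a transient is also an assumption, as are the function names and data.

```python
from statistics import median

def trend_band(history):
    """Compute (min_trend, max_trend) for one grid cell.

    history: list of months; each month is a list of days; each day is a
             non-empty list of observations for the cell.
    Assumption: daily extremes are pooled over all months and the median
    of each pool is used as the trend value.
    """
    daily_max = [max(day) for month in history for day in month]
    daily_min = [min(day) for month in history for day in month]
    return median(daily_min), median(daily_max)

def flag_transients(observations, band):
    """Return the observations that fall outside the normal band."""
    low, high = band
    return [v for v in observations if v < low or v > high]

# Hypothetical history: two months of three days, two observations per day.
history = [
    [[1, 5], [2, 6], [3, 7]],
    [[0, 4], [2, 8], [1, 6]],
]
band = trend_band(history)
print(band, flag_transients([2, 9, 1, 6], band))
```

Only the flagged values would need to be kept as metadata, which is where the data reduction comes from.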
Target Independent Mining of SSM/I Data
[Figure: Trend Lines Represent What Is Normal]
Target Independent Mining of SSM/I Data
• Extracted metadata not oriented toward any particular transient phenomena
• Laboratory tests show 98% data compression while preserving 92% of mesoscale convective systems (MCSs) detectable in raw data
• MCS events represented only 6.7% of extracted metadata
Relationship-based Mining
• Coincident Association
  • VARGA Algorithm for multispectral data
• Localized Spatial Association
  • Cumulus Cloud Classification in GOES Imagery
• Temporal Association
Coincident Association Mining
• Use Market Basket analysis to mine for association rules in vector data
• Rule has form [X -> Y]
• Rule characterized by
  • Support: % of vector instances that contain both X and Y (how likely is the rule to be applicable?)
  • Confidence: % of vector instances containing X that also contain Y (an estimate of the conditional probability of Y given X)
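Support and confidence for a rule X -> Y can be computed directly from their definitions. The sketch below treats each vector instance as a set of (channel, value) items; the item encoding, function name and sample instances are hypothetical and are not taken from VARGA.

```python
def support_confidence(instances, antecedent, consequent):
    """Return (support, confidence) of the rule antecedent -> consequent.

    instances: iterable of sets of items.
    support    = fraction of instances containing antecedent and consequent.
    confidence = fraction of instances containing the antecedent that also
                 contain the consequent.
    """
    instances = [set(inst) for inst in instances]
    x, y = set(antecedent), set(consequent)
    n_x = sum(1 for inst in instances if x <= inst)
    n_xy = sum(1 for inst in instances if (x | y) <= inst)
    support = n_xy / len(instances)
    confidence = n_xy / n_x if n_x else 0.0
    return support, confidence

# Hypothetical quantized multi-channel pixels.
pixels = [
    {("19V", 180), ("37H", 140), ("37V", 200)},
    {("19V", 180), ("37H", 140), ("37V", 200)},
    {("19V", 180), ("37H", 140), ("37V", 210)},
    {("19V", 190), ("37H", 150), ("37V", 200)},
]
rule_x = {("19V", 180), ("37H", 140)}
rule_y = {("37V", 200)}
print(support_confidence(pixels, rule_x, rule_y))
```

Here the antecedent occurs in three of the four instances and the full rule in two, so support is 2/4 and confidence is 2/3.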
Coincident Association Applied to Multi-spectral Data Mining
• Developed and implemented Vector Association Rule Generation Algorithm (VARGA) as a modification to market-basket association rule mining.
• Modified to minimize memory usage for large multi-spectral satellite data such as SSM/I (90 megabytes per day uncompressed)
• Example SSM/I Rule:
  • [19V, 180.0] [37H, 140.0] -> [37V, 200.0] : 0.117037 0.945986
Localized Spatial Association Mining
• Extract association rules to characterize texture (Dissertation of Dr. John Rushing)
• Each pixel in an nxn neighborhood is characterized by the triple (X,Y,I)
  • X and Y: the offsets from the pixel at the neighborhood center
  • I: its intensity
• Association rules can then be characterized by relationships between the triples
Association Rule Example
• The rule specified in the figure can be applied to this image in 9 of the 16 pixel locations due to the pixel offsets in the rule.
• Of these 9 locations, the antecedent matches at 5 locations, and both the antecedent and consequent match at 3 locations.
• This yields a support of 3/9 = 33.33% and a confidence of 3/5 = 60%.
[Figure: example rule and 16-pixel image. Support: 3/9 = 33.33%; Confidence: 3/5 = 60%]
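The counting behind these numbers can be sketched as follows. The figure's image is not recoverable from the transcript, so the 4x4 image below is a hypothetical one constructed to give the same counts (9 applicable locations, 5 antecedent matches, 3 full matches). The rule is written as (X offset, Y offset, intensity) triples, and the illustrative rule used is (0,0,2) ^ (1,1,2) => (1,0,0).

```python
def rule_stats(img, antecedent, consequent):
    """Count where a spatial association rule applies and matches.

    img: 2-D list indexed as img[y][x].
    antecedent, consequent: lists of (dx, dy, intensity) triples with
    non-negative offsets from the reference pixel.
    Returns (applicable, antecedent_matches, full_matches).
    """
    items = antecedent + consequent
    max_dx = max(dx for dx, _, _ in items)
    max_dy = max(dy for _, dy, _ in items)
    rows, cols = len(img), len(img[0])
    applicable = n_ant = n_both = 0
    # Only locations where every offset stays inside the image are applicable.
    for y in range(rows - max_dy):
        for x in range(cols - max_dx):
            applicable += 1
            if all(img[y + dy][x + dx] == i for dx, dy, i in antecedent):
                n_ant += 1
                if all(img[y + dy][x + dx] == i for dx, dy, i in consequent):
                    n_both += 1
    return applicable, n_ant, n_both

# Hypothetical image chosen to reproduce the counts quoted above.
image = [[2, 0, 2, 0],
         [1, 2, 1, 2],
         [2, 0, 2, 1],
         [0, 2, 1, 2]]
applicable, n_ant, n_both = rule_stats(
    image, antecedent=[(0, 0, 2), (1, 1, 2)], consequent=[(1, 0, 0)])
print(f"support = {n_both}/{applicable}, confidence = {n_both}/{n_ant}")
```

Support is measured against the applicable locations rather than all 16 pixels, which is why the denominator is 9.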
Association Rule Example
[Figure panels: Original Image; Segmented Image; Normal “J” Elements; Mirrored “J” Elements]
Normal “J” Rules
1,1,1 1,0,1 1,1,1 0,1,0 0,0,0 1,1,0 0,1,1 1,0,1 1,1,1
0,1,1 0,0,1 1,1,1 1,1,0 0,0,0 1,1,0 1,1,1 1,0,1 1,1,1
1,1,1 1,0,1 0,1,1 1,1,0 0,0,0 0,1,0 1,1,1 1,0,1 1,1,1
1,1,1 1,0,1 1,1,1 1,1,0 0,0,0 1,1,0 1,1,1 0,0,1 0,1,1
Mirrored “J” Rules
0,1,1 1,0,1 1,1,1 0,1,0 0,0,0 1,1,0 1,1,1 1,0,1 1,1,1
1,1,1 0,0,1 0,1,1 1,1,0 0,0,0 1,1,0 1,1,1 1,0,1 1,1,1
1,1,1 1,0,1 1,1,1 1,1,0 0,0,0 0,1,0 1,1,1 1,0,1 0,1,1
1,1,1 1,0,1 1,1,1 1,1,0 0,0,0 1,1,0 0,1,1 0,0,1 1,1,1
GOES Cumulus Cloud Classification: Why Texture Features?
• Cumulus cloud fields have a very characteristic texture signature in the GOES visible imagery
GOES Cumulus Cloud Classification: The Need
• Cloud systems are important modulators of earth’s radiation budget
• Large uncertainties are associated with cloud radiative forcing
• Radiative energy budget is impacted by change in distribution of clouds
• Cumulus clouds are a cloud field type that could respond strongly to climate change
• Knowledge of cloud geometry, size and spatial distribution is needed for the representation of cumulus clouds in radiative transfer models
• To derive models of cloud field characteristics, automated cumulus cloud detection schemes are required to analyze large amounts of data
GOES Cumulus Cloud Classification: Purpose of this study
• Compare different techniques for detecting cumulus cloud fields in Geostationary Operational Environmental Satellite (GOES) imagery
• Comparison based on
  • Accuracy of detection
  • Amount of time required to classify
• Feature measures used along with the Maximum Likelihood Classifier
  • Texture features
    • Gray Level Co-Occurrence Matrix
    • Gray Level Run Length Features
    • Association Rules
  • Edge Detection Features
    • Sobel Filter
    • Laplacian Filter
    • Combination of Sobel and Laplacian Filter
GOES Cumulus Cloud Classification: Texture Features (1)
• Gray Level Co-Occurrence Matrix (GLCM):
  • First texture feature vector to be developed
  • GLCM is used as a benchmark
  • It is based on a positional operator
  • Positional operator defines relationship between pixels in terms of x,y offset or as a distance, angle offset
  • Co-occurrence matrix is an NxN matrix where N is the number of gray levels; functions are computed on the matrix
• Gray Level Run Length (GLRL) Features
  • Gray level statistical features based on homogeneous gray level runs
  • Run is a series of consecutive pixels of the same intensity
  • Run lengths are computed at orientations in increments of 45 degrees starting at 0 degrees
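Both texture descriptions above reduce to simple counting, sketched here in plain Python. The co-occurrence matrix is built for one x,y offset, and contrast is shown as one example of a function computed on the matrix. Runs are extracted only along rows (the 0 degree orientation). The function names and the tiny 3-level image are illustrative; the features actually used in the study are not listed in the slides.

```python
def glcm(img, dx, dy, levels):
    """Count how often gray level i occurs with gray level j at offset (dx, dy)."""
    m = [[0] * levels for _ in range(levels)]
    rows, cols = len(img), len(img[0])
    for y in range(rows):
        for x in range(cols):
            ny, nx = y + dy, x + dx
            if 0 <= ny < rows and 0 <= nx < cols:
                m[img[y][x]][img[ny][nx]] += 1
    return m

def contrast(m):
    """Contrast feature: sum of (i - j)^2 weighted by the normalized counts."""
    total = sum(sum(row) for row in m)
    return sum((i - j) ** 2 * v
               for i, row in enumerate(m)
               for j, v in enumerate(row)) / total

def horizontal_runs(img):
    """Return (gray_level, run_length) pairs for runs along each row."""
    runs = []
    for row in img:
        start = 0
        for x in range(1, len(row) + 1):
            if x == len(row) or row[x] != row[start]:
                runs.append((row[start], x - start))
                start = x
    return runs

# Hypothetical image with 3 gray levels.
img = [[0, 0, 1],
       [1, 1, 1],
       [2, 2, 0]]
matrix = glcm(img, dx=1, dy=0, levels=3)
print(matrix, contrast(matrix), horizontal_runs(img))
```

Run-length statistics such as the mean or longest run per gray level would then be computed from the list of runs, once per orientation.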
GOES Cumulus Cloud Classification: Texture Features (2)
• Association Rules
  • Often used in business applications to identify relationships in databases
  • Adapted to discriminate textures in images
  • Based on frequently occurring local image structures
• Triples (Pos X, Pos Y, Pixel Intensity)
• Rule: (0,0,2) ^ (1,1,2) => (1,0,0)
• Then calculate Support and Confidence of this Rule
[Figure panels (a) and (b): example rule and image. Support = 3/9 = 33.33%; Confidence = 3/5 = 60%]
GOES Cumulus Cloud Classification: Edge Detection Features
• These techniques are used for detecting discontinuities in an image
• These techniques apply a local derivative operator on the image
• Sobel Filters
  • It calculates the magnitude of rate of change of gray level and the direction of this change vector
  • Magnitude = |Gx| + |Gy|
  • Direction = tan^-1(Gx/Gy)
  • Gx = (z7 + 2z8 + z9) – (z1 + 2z2 + z3)
  • Gy = (z3 + 2z6 + z9) – (z1 + 2z4 + z7)
• Laplacian Filters
  • It is a second order derivative
  • F(z) = 4z5 – (z2 + z4 + z6 + z8)
where z1 ... z9 are the gray levels of the 3x3 neighborhood:
z1 z2 z3
z4 z5 z6
z7 z8 z9
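The Sobel and Laplacian formulas above translate directly into code for a single 3x3 neighborhood. This is a minimal sketch with hypothetical function names; it computes only Gx, Gy, the magnitude and the Laplacian response, and leaves out the direction term and the sliding of the window over a full image.

```python
def sobel(z):
    """Return (Gx, Gy, magnitude) for z = [[z1, z2, z3], [z4, z5, z6], [z7, z8, z9]]."""
    (z1, z2, z3), (z4, z5, z6), (z7, z8, z9) = z
    gx = (z7 + 2 * z8 + z9) - (z1 + 2 * z2 + z3)  # bottom row minus top row
    gy = (z3 + 2 * z6 + z9) - (z1 + 2 * z4 + z7)  # right column minus left column
    return gx, gy, abs(gx) + abs(gy)

def laplacian(z):
    """Return 4*z5 minus the sum of the four edge-adjacent neighbours."""
    (_, z2, _), (z4, z5, z6), (_, z8, _) = z
    return 4 * z5 - (z2 + z4 + z6 + z8)

ramp = [[1, 2, 3],
        [4, 5, 6],
        [7, 8, 9]]
spike = [[0, 0, 0],
         [0, 5, 0],
         [0, 0, 0]]
print(sobel(ramp), laplacian(ramp), laplacian(spike))
```

On a smooth ramp the Laplacian response is zero while the Sobel magnitude is large; on an isolated bright pixel the Sobel terms cancel and the Laplacian responds.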
GOES Cumulus Cloud Classification: Experiment Process
• Training
  • Samples selected from 1000x1000 GOES scene
  • Only two classes are used: Cumulus and Others (includes background)
  • For validation, samples were labeled by at least two experts and only pixels where experts agreed were used for training
  • Maximum likelihood classifier was trained using GLCM, GLRL, Association Rules and Edge detection features
  • Window size was varied: 5x5 – 11x11
• Testing
  • 12 different GOES images (512x512) were used for testing
  • Classification results were compared against expert labeled images
  • Confusion matrix, classification accuracy and experiment run times were calculated
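The evaluation step, comparing classifier output with expert labels, can be sketched as a confusion matrix plus overall accuracy. The labels below are made up for illustration and the function names are hypothetical; the study's actual per-image results are not reproduced here.

```python
def confusion_matrix(truth, predicted, classes):
    """Rows are expert (true) classes, columns are predicted classes."""
    index = {c: i for i, c in enumerate(classes)}
    m = [[0] * len(classes) for _ in classes]
    for t, p in zip(truth, predicted):
        m[index[t]][index[p]] += 1
    return m

def accuracy(m):
    """Fraction of samples on the diagonal of the confusion matrix."""
    correct = sum(m[i][i] for i in range(len(m)))
    total = sum(sum(row) for row in m)
    return correct / total

# Hypothetical expert labels and classifier output for five pixels.
expert = ["cumulus", "cumulus", "other", "other", "other"]
output = ["cumulus", "other", "other", "other", "cumulus"]
cm = confusion_matrix(expert, output, classes=["cumulus", "other"])
print(cm, accuracy(cm))
```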
GOES Cumulus Cloud Classification: Sample Result
[Figure panels: Original; GLRL; Association Rules; GLCM; Expert Labeled; Sobel; Sobel + Laplacian; Laplacian]
GOES Cumulus Cloud Classification: Conclusions
• Accuracy
  • Best results using texture features
    • GLRL (78%) with a filter size of 11x11
    • Association Rules (75%) with a filter size of 5x5
    • GLCM gave the worst results (51-55%)
  • Best results using edge detection filters
    • Sobel Filter (78%) with a filter size of 11x11
    • Laplacian (73%) with a filter size of 9x9
    • Laplacian and Sobel (75%) with a filter size of 9x9
• Timing Results
  • Times were calculated on a 933 MHz Pentium III processor PC with 512 MB memory
  • Texture feature techniques in general required an order of magnitude more time than edge detection filters
Dimensionality Reduction: Mesoscale Convective System (MCS) Detection
• Define the Experiment
• Select algorithm (Devlin)
• Automatic extraction of MCSs from SSM/I data
• Populating Knowledge Base (reducing data volume)
[Diagram labels: Scientists; SSM/I Data; Mining Engine (Input Modules, Analysis Modules, Output Modules); Mining Results: MCSs; Knowledge Base; Event/Relationship Search System]
Dimensionality Reduction: Research Analysis
Analysis:
• Find MCSs over river basins in Middle East?
• Data Sets
  • MCSs
  • River basin data set
  • Political boundaries
Benefits:
• Reduced amount of data
• Allow scientists to pose questions and get “results”
• Allow easy visualization
• Maximize knowledge discovery/minimize data handling
• Scientists can refine their knowledge repository
• Answer the science questions
[Diagram labels: Scientists; SSM/I Data; Mining Engine (Input Modules, Analysis Modules, Output Modules); Mining Results: MCSs; Knowledge Base; Event/Relationship Search System]
Dimensionality Reduction: Knowledge Reuse
Climatological Study of MCSs:
• What is the latitudinal distribution of MCSs?
• Which continent has more MCSs?
• What is the size distribution of the MCSs for JUN-JUL-AUG?
• What is the relationship between the number of MCSs and their intensities?
• Do results vary for El-Nino years?
[Diagram labels: Scientists; SSM/I Data; Mining Engine (Input Modules, Analysis Modules, Output Modules); Mining Results: MCSs; Event/Relationship Search System]
Knowledge Reuse
[Figure: Latitudinal Distribution of MCS for 1998-1999 (series: Mar98-Mar99). Horizontal axis: Number of MCS's, 0 to 18000; vertical axis: Latitude, -90 to 90. Diagram labels: Knowledge Base; Event/Relationship Search System]
Event/Relationship Search System
Allows users to conduct coincidence searches and relationship tests between mined phenomena and a variety of parameters
Parameters include geographic regions, political boundaries, or other named phenomena for a specific time period
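A coincidence search of this kind can be pictured as a filter over mined events by region and time period. The sketch below is a simplified illustration and not the Event/Relationship Search System's interface: the `Event` record, the bounding-box region and the sample events are all hypothetical, and real political boundaries or river basins would need polygon tests rather than a box.

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class Event:
    name: str    # phenomenon type, e.g. "MCS"
    lat: float   # degrees north
    lon: float   # degrees east
    day: date

def coincidence_search(events, lat_range, lon_range, start, end):
    """Return events inside the bounding box and within [start, end]."""
    return [e for e in events
            if lat_range[0] <= e.lat <= lat_range[1]
            and lon_range[0] <= e.lon <= lon_range[1]
            and start <= e.day <= end]

# Hypothetical mined events.
events = [
    Event("MCS", 33.0, 44.0, date(1998, 6, 10)),
    Event("MCS", -5.0, 20.0, date(1998, 6, 12)),
    Event("MCS", 34.5, 45.5, date(1999, 1, 3)),
]
hits = coincidence_search(events, lat_range=(25.0, 40.0), lon_range=(35.0, 60.0),
                          start=date(1998, 6, 1), end=date(1998, 8, 31))
print([e.day.isoformat() for e in hits])
```

Because the search runs over compact event records rather than raw swath data, questions like the climatological ones above can be answered without returning to the archive.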
References
• Graves, Sara J., Thomas Hinke, Shanlini Kansal, "Metadata: The Golden Nuggets of Data Mining", First IEEE Metadata Conference, Bethesda, Maryland, April 16-18, 1996
• Hinke, Thomas, John Rushing, Shanlini Kansal, Sara J. Graves, Heggere S. Ranganath, "For Scientific Data Discovery: Why Can't the Archive be More Like the Web", Proceedings Ninth International Conference on Scientific Database Management, Evergreen State College, Olympia, Washington, August 11-13, 1997
• Hinke, Thomas, John Rushing, Heggere S. Ranganath, Sara J. Graves, "Techniques and Experience in Mining Remotely Sensed Satellite Data", Artificial Intelligence Review 14 (6): Issues on the Application of Data Mining, pp 503-531, December 2000
• Hinke, Thomas, John Rushing, Shanlini Kansal, Sara J. Graves, Heggere S. Ranganath, Evans Criswell, "Eureka Phenomena Discovery and Phenomena Mining System", AMS 13th Int’l Conference on Interactive Information and Processing Systems (IIPS) for Meteorology, Oceanography and Hydrology, 1997
• Hinke, Thomas, John Rushing, Heggere S. Ranganath, Sara J. Graves, "Target-Independent Mining for Scientific Data: Capturing Transients and Trends for Phenomena Mining", Proceedings Third International Conference on Data Mining (KDD-97), Newport Beach, California, August 14-17, 1997
• Keiser, Ken, John Rushing, Helen Conover, Sara J. Graves, "Data Mining System Toolkit for Earth Science Data", Earth Observation (EO) & Geo-Spatial (GEO) Web and Internet Workshop, Washington, D.C., February 1999
• Rushing, John, Heggere S. Ranganath, Thomas Hinke, Sara J. Graves, "Using Association Rules as Texture Features", IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol 23, No. 8, 845-858, 2001
• Nair, Udaysankar J., John Rushing, Rahul Ramachandran, Kwo-Sen Kuo, Sara J. Graves, Ron Welch, "Detection of Cumulus Cloud Fields in Satellite Imagery", The International Symposium on Optical Science, Engineering and Instrumentation, Denver, 1999
• Nair, U., J. Rushing, R. Ramachandran, R. Welch, and S. J. Graves, "Detection of boundary layer cumulus cloud fields in GOES satellite imagery", submitted to Journal of Applied Meteorology, September, 2001