regression algorithm
-
8/7/2019 Regression Algorithm
1/31
Logistic Regression
Often, the spatial phenomenon under investigation can only be described by a categorical variable.
Wildfires are typically depicted with polygons showing burned vs. not burned
Or, bird distribution indicating presence or absence of birds
The previous regression technique is not suitable because the dependent variable is neither interval nor ratio
Logistic regression treats the distribution in a probabilistic manner; that is, the occurrence of the study phenomenon is evaluated in terms of probability
-
Logistic Regression
If the probability of presence of a phenomenon is $P_a$, then $P_b$ represents the absence of the phenomenon and
$P_a + P_b = 1$
$U_a = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_n X_n + \varepsilon$
$U_a$ is the utility function of event $a$ expressed as a linear combination of a number of explanatory variables $X_1, X_2, \dots$, and $\beta_n$ is the estimated parameter of variable $X_n$
$P_a = \dfrac{\exp(U_a)}{1 + \exp(U_a)}$
-
Logistic Regression
A greater value of $U_a$ implies a greater probability for the event to take place. When $U_a$ approaches infinity, $P_a$ approaches 1, indicating a high likelihood for the event to occur. When $U_a$ approaches negative infinity, $P_a$ approaches 0.
When $U_a$ equals zero, the probability is .50, implying a 50/50 chance for the event to occur.
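The limiting behavior described above can be checked numerically. The following is a minimal Python sketch, not part of the original slides; the function name `event_probability` is made up for illustration.

```python
import math


def event_probability(utility: float) -> float:
    """Return P_a = exp(U_a) / (1 + exp(U_a)) for a utility value U_a."""
    # Branch on the sign so exp() never receives a large positive argument.
    if utility >= 0:
        return 1.0 / (1.0 + math.exp(-utility))
    e = math.exp(utility)
    return e / (1.0 + e)


for u in (-10.0, -1.0, 0.0, 1.0, 10.0):
    print(f"U_a = {u:6.1f}  ->  P_a = {event_probability(u):.4f}")
```

Both branches are algebraically the same formula; the split only avoids overflow for utilities of large magnitude.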
-
Logistic Regression Example
Example from Chou
Fires in the San Jacinto Ranger District of the San Bernardino National Forest were examined to map the distribution of fire occurrence probability. The basic model consisted of eight independent variables
Area, perimeter, vegetation, proximity to buildings, proximity to campgrounds, proximity to roads, maximum temperature in July, and annual precipitation
-
Variables in Fire Distribution Study
X1 Area: area of geographic unit
X2 Perimeter: perimeter of geographic unit
X3 Vegetation: vegetation computed by rotation period
X4 Building: proximity to structures
X5 Campground: proximity to campgrounds
X6 Road: proximity to roads
X7 Temperature: maximum temperature in July
X8 Precipitation: annual precipitation
The dependent variable is a code indicating whether or not a geographic unit is burned. Area and perimeter provide general geometric characteristics. Vegetation, precipitation, and temperature represent environmental factors, while building, campground, and road represent human-related factors
-
Results of Logistic Regression
The model indicates that perimeter, vegetation, campground, road, and temperature are variables to be included in the model. Other variables are not included as they are not statistically different from 0
Variable         Coefficient   Chi-square   P-Value
X0               -6.3246       31.13        0
X1               0             1.42         0.234
X2               -0.0002       8.13         0.0043
X3               1.5577        43.65        0
X4               -1.1451       1.93         0.1648
X5               -294.58       4.61         0.0318
X6               -0.5244       4.46         0.0348
X7               0.179         28.19        0
X8               0.0023        0.21         0.6493
Log Likelihood   -1366
PCE              60
Chi-square       3.84 for alpha = .05
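The variable selection above can be reproduced from the chi-square column. The following Python sketch is an illustration, not the original analysis. It assumes each coefficient is tested on 1 degree of freedom, for which the critical chi-square value at alpha = .05 is 3.84. The helper `p_value_1df` is a made-up name and uses the identity that a 1-degree-of-freedom chi-square tail equals `erfc(sqrt(x / 2))`.

```python
import math

# Chi-square statistics copied from the results table (intercept X0 omitted).
chi_square = {
    "X1": 1.42, "X2": 8.13, "X3": 43.65, "X4": 1.93,
    "X5": 4.61, "X6": 4.46, "X7": 28.19, "X8": 0.21,
}
CRITICAL_VALUE = 3.84  # assumption: chi-square, 1 degree of freedom, alpha = .05


def p_value_1df(statistic: float) -> float:
    """Upper-tail probability of a chi-square variable with 1 degree of freedom."""
    return math.erfc(math.sqrt(statistic / 2.0))


significant = [name for name, stat in chi_square.items() if stat > CRITICAL_VALUE]
print(significant)
for name, stat in chi_square.items():
    print(f"{name}: chi-square = {stat:6.2f}, p = {p_value_1df(stat):.4f}")
```

The variables that pass the cutoff are perimeter, vegetation, campground, road, and temperature, matching the list on the slide.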
-
Results of Logistic Regression
Percentage-correctly-estimated (PCE) index shows the maximum level of estimation accuracy of a model.
In this example, PCE is 60%, not much better than a random 50/50 chance.
Therefore, another parameter was evaluated
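One way to read the PCE index is as the share of units whose predicted class matches the observed class. The sketch below makes that reading concrete in Python. The 0.5 cutoff, the function name, and the data are all assumptions for illustration; the slides do not state how Chou computed the index.

```python
def percentage_correctly_estimated(observed, predicted_probability, threshold=0.5):
    """Percentage of units whose predicted class matches the observed class."""
    if len(observed) != len(predicted_probability):
        raise ValueError("observed and predicted_probability differ in length")
    correct = sum(
        1
        for y, p in zip(observed, predicted_probability)
        if (p >= threshold) == bool(y)
    )
    return 100.0 * correct / len(observed)


# Made-up values for illustration only: 1 = burned, 0 = not burned.
observed = [1, 0, 1, 1, 0, 0, 1, 0, 0, 1]
predicted = [0.8, 0.3, 0.4, 0.7, 0.6, 0.2, 0.9, 0.55, 0.1, 0.45]
print(percentage_correctly_estimated(observed, predicted))
```

With these made-up inputs six of the ten units are classified correctly, so the function returns 60.0.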
-
Alternative Model
Included an additional variable to determine whether it makes any significant difference in model performance
New variable represents neighborhood effects, or conditions of the surrounding geographic units
Assumes that fire occurrence probability is not only affected by the environmental and human-related variables listed in the basic model, but by the distribution of fire occurrence probability of adjacent units
The new spatial term X9 is defined by the percentage of neighboring units that were burned during the study period
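The definition of X9 can be sketched in Python as follows. The adjacency list, the unit names, and the choice to assign 0 to a unit with no neighbors are hypothetical; the slides do not say how adjacency was determined or whether X9 was scaled as a percentage or a fraction.

```python
def percent_neighbors_burned(neighbors, burned):
    """Return {unit: percentage of its adjacent units that burned}."""
    x9 = {}
    for unit, adjacent in neighbors.items():
        if not adjacent:
            x9[unit] = 0.0  # assumption: isolated units get 0
            continue
        burned_count = sum(1 for other in adjacent if burned[other])
        x9[unit] = 100.0 * burned_count / len(adjacent)
    return x9


# Hypothetical adjacency list and burn codes for four geographic units.
neighbors = {"A": ["B", "C"], "B": ["A", "C", "D"], "C": ["A", "B"], "D": ["B"]}
burned = {"A": True, "B": False, "C": True, "D": False}
print(percent_neighbors_burned(neighbors, burned))
```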
-
New Results
Results from the new study are quite different
Only two variables are statistically significant: vegetation and neighborhood effects
Vegetation appears to be the determining environmental variable in the distribution of wildfires in the study area
Finally, wildfires are influenced by neighborhood conditions
Variable         Coefficient   Chi-square   P-Value
X1               0             1.03         0.3106
X2               0.0003        0.97         0.3249
X3               1.6738        6.88         0.0087
X4               0.8416        0.19         0.6669
X5               42.28         0            0.9701
X6               1.0241        3            0.0831
X7               0.1121        1            0.3168
X8               0.0127        0.55         0.4597
X9               17.951        2359.3       0
Log Likelihood   164.788
PCE              97
Chi-square       3.84 for alpha = .05
-
Testing Statistical Significance
Did the neighborhood effects significantly change the model? Need to apply the chi-square test of the likelihood ratio
$\lambda = \dfrac{L_0}{L_1}$
Where $L_0$ denotes the likelihood of the basic model and $L_1$ denotes the likelihood of the study model
$-2\,\mathrm{Log}\,\lambda = -2\,(\mathrm{Log}\,L_0 - \mathrm{Log}\,L_1) = 2\,(1366.197 - 167.914) = 2 \times 1198.283 = 2396.566$
Statistical testing suggests that the neighborhood variable significantly improved the performance of the model
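The arithmetic of the likelihood ratio test can be written out as a short Python sketch. It reads the slide figures as log-likelihoods of -1366.197 for the basic model and -167.914 for the study model; the negative signs are an assumption based on log-likelihoods of a binary model being non-positive. Because the study model adds one variable, the statistic is compared with the 1-degree-of-freedom critical value of 3.84 at alpha = .05.

```python
LOG_L0 = -1366.197  # basic model (sign assumed, see lead-in)
LOG_L1 = -167.914   # study model with the neighborhood term X9 (sign assumed)
CRITICAL_VALUE = 3.84  # chi-square, 1 degree of freedom, alpha = .05

# -2 Log(lambda) with lambda = L0 / L1
statistic = -2.0 * (LOG_L0 - LOG_L1)
print(f"-2 Log(lambda) = {statistic:.3f}")
print("significant" if statistic > CRITICAL_VALUE else "not significant")
```

The statistic is far above the critical value, which is why the slide concludes that the neighborhood variable significantly improved the model.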
-
Procedure for Regression Analysis (Barber, p. 448)
Specify the variables in the model and the exact form of the relationship between them
Collect data
Estimate the parameters of the model
Statistically test the utility of the developed model, and check whether the assumptions of the simple linear regression model are satisfied
Use the model for prediction
-
Example of Data Manipulation and Programming in ArcView
Manipulating Yield Data with DataManipulation.ave
-
Spatial Prediction of Landslide Hazard Using Logistic Regression and GIS
Art Lembo
620 Presentation
Based on paper by Gorsevski, Gessler, and Foltz
-
Introduction
Landslides are natural geologic processes that cause billions of dollars in damage and thousands of deaths each year
95% of landslides occur in developing countries
-
Causes of Landslides
Human activities, such as deforestation and urban expansion, accelerate the process of landslides
Roads and harvest activities in timberlands increase the occurrence of landslides
In undisturbed forest, soil erosion is generally negligible
-
Clearwater National Forest 1995-1996
Major landslides occurred during the winter following heavy rains, snowmelt, and high river flow
Over 900 landslides were recorded on the unstable slopes of the forest
Landslide occurrence was widely distributed and included artificial slopes such as road cuts and fills, or natural slopes in clearcut areas
-
Landslide Data
Within the large remote area, a DEM was used to generate quantitative topographic attributes
Slope, elevation, aspect, profile curvature, tangent curvature, plan curvature, flow path, and contributing area
Photo interpretation and field inventory identified landslide areas
-
Considerations in Creating Hazard Models
Datasets combined and stored in a GIS database
Hazard model assumptions
Strength of a model depends on the quality of the data collected
Data-driven models are not appropriate to extrapolate to neighboring areas
Climatic conditions may change so that the past is not an indicator of the future
Uncertainty exists when a hazard map is derived from a statistically based model
-
Models Used in Study
Logistic regression was used, which correlated the environmental attributes and landslide distribution
Because of the existence of uncertainty, a Receiver Operating Characteristic (ROC) curve was used; it plots the proportion of false positives against the proportion of true positives at each level of the criterion
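The construction of an ROC curve can be sketched in Python as below. This is a generic illustration with made-up scores, not the procedure or data from the paper; `roc_points` is a hypothetical helper name.

```python
def roc_points(observed, scores):
    """Return (false positive rate, true positive rate) pairs, one per cutoff."""
    positives = sum(1 for y in observed if y == 1)
    negatives = len(observed) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("need at least one positive and one negative case")
    points = [(0.0, 0.0)]
    # Sweep the cutoff (the "criterion") from the highest score to the lowest.
    for cutoff in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(observed, scores) if s >= cutoff and y == 1)
        fp = sum(1 for y, s in zip(observed, scores) if s >= cutoff and y == 0)
        points.append((fp / negatives, tp / positives))
    return points


# Made-up data: 1 = landslide present, 0 = absent.
observed = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
curve = roc_points(observed, scores)
for fpr, tpr in curve:
    print(f"false positive rate = {fpr:.2f}, true positive rate = {tpr:.2f}")
```

Each cutoff gives one trade-off between false alarms and detected landslides; a decision maker picks the cutoff whose trade-off is acceptable.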
-
Assessing Landslide Hazard
Field inspection using a check list to identify sites susceptible to landsliding
Projection of future patterns of instability from analysis of landslide inventories
Multivariate analysis of factors characterizing observed sites of slope instability
Stability ranking based on criteria such as slope, land forms, or geologic structure
Failure probability analysis based on slope stability models with stochastic hydrologic simulation
-
Preparing the Data
Primary and secondary attributes are derived from a DEM (30 m), reducing the high cost of collecting the data
Landslides assessed through aerial reconnaissance
Landslide hazard areas are then identified based on spatial correlation between the attributes derived from the DEM
ROC curves used for decision making
-
Data Sampling
15% of non-landslide cells were randomly sampled for an absence of landslides
Multivariate subset was derived from the coverages where landslides were absent
The landslide coverage was a point data set that sampled grid cells where landslides were present
Both samples were joined together where the dependent variable had a binary response (present or absent)
Final output stored in ASCII and used in SAS
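The sampling and joining steps above might look like the following Python sketch. The cell identifiers, the fixed random seed, and the function name `build_sample` are hypothetical; the original work was done with GIS coverages and SAS, not with this code.

```python
import random


def build_sample(landslide_cells, non_landslide_cells, fraction=0.15, seed=0):
    """Join presence cells with a random subset of absence cells."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    n_absent = round(fraction * len(non_landslide_cells))
    absent = rng.sample(non_landslide_cells, n_absent)
    # Binary response: 1 = landslide present, 0 = landslide absent.
    rows = [(cell, 1) for cell in landslide_cells]
    rows += [(cell, 0) for cell in absent]
    return rows


# Hypothetical cell identifiers; a real grid would carry DEM attributes too.
landslide_cells = list(range(1000, 1020))
non_landslide_cells = list(range(0, 200))
sample = build_sample(landslide_cells, non_landslide_cells)
print(len(sample), sum(label for _, label in sample))
```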
-
Statistical Analysis
Normal plot of data to determine if the data followed a normal distribution
Plot showed that data points do not fall along a straight line. The data is not multivariate normal
Logistic regression is used when the predictor variables are not normally distributed, and some predictor variables are categorical
Factor analysis was applied to determine the number of underlying variables
Only significantly loaded variables were considered
-
Statistical Analysis
The form of the logistic regression model is defined as:
$P(y = 1 \mid x) = \dfrac{\exp(B'x)}{1 + \exp(B'x)}$
Where x is the data vector for a randomly selected experimental unit and y is the value of the binary outcome variable. Maximum likelihood was used to estimate B for the predictive equation
Variables not significant at the .1 level were eliminated
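Maximum likelihood estimation of B can be illustrated with a small Python sketch. It uses plain gradient ascent on the log-likelihood, which is a stand-in chosen for readability and not the algorithm SAS uses; the data, learning rate, and iteration count are made up.

```python
import math


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def log_likelihood(B, X, y):
    """Log-likelihood of binary outcomes y given rows X and coefficients B."""
    total = 0.0
    for row, target in zip(X, y):
        p = sigmoid(sum(b * v for b, v in zip(B, row)))
        total += math.log(p) if target == 1 else math.log(1.0 - p)
    return total


def fit_logistic(X, y, learning_rate=0.1, iterations=5000):
    """Estimate B by gradient ascent on the log-likelihood."""
    B = [0.0] * len(X[0])
    for _ in range(iterations):
        gradient = [0.0] * len(B)
        for row, target in zip(X, y):
            error = target - sigmoid(sum(b * v for b, v in zip(B, row)))
            for j, v in enumerate(row):
                gradient[j] += error * v
        B = [b + learning_rate * g / len(X) for b, g in zip(B, gradient)]
    return B


# Made-up data; the leading 1.0 in each row is the intercept term.
X = [[1.0, 0.5], [1.0, 1.0], [1.0, 1.5], [1.0, 2.0], [1.0, 2.5], [1.0, 3.0]]
y = [0, 0, 1, 0, 1, 1]
B = fit_logistic(X, y)
print(B, log_likelihood(B, X, y))
```

The example data overlap (one absence sits above one presence), so the likelihood has a finite maximum and the iteration settles instead of drifting toward infinite coefficients.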
-
Logit Results
Logit showed that the most important variables contributing to the slope instability were Flow Path and mean slope of upland area
$\log\left(\dfrac{p}{1 - p}\right) = -2.2642 + 0.4969\,\mathrm{FACTOR8} + 0.6039\,\mathrm{FLPATH}$
or
$p = \dfrac{\exp(-2.2642 + 0.4969\,\mathrm{FACTOR8} + 0.6039\,\mathrm{FLPATH})}{1 + \exp(-2.2642 + 0.4969\,\mathrm{FACTOR8} + 0.6039\,\mathrm{FLPATH})}$
p: probability of landslide hazard
FACTOR8: factor with underlying characteristics of aspect
FLPATH: maximum distance of water to the point in the catchment
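The fitted equation above can be evaluated per grid cell with a few lines of Python. The coefficients are the ones on the slide; the input values and the function name are hypothetical, since the slides do not give the units or scaling of FACTOR8 and FLPATH.

```python
import math

INTERCEPT = -2.2642
COEF_FACTOR8 = 0.4969
COEF_FLPATH = 0.6039


def landslide_probability(factor8: float, flpath: float) -> float:
    """Evaluate the fitted logit equation for one grid cell."""
    logit = INTERCEPT + COEF_FACTOR8 * factor8 + COEF_FLPATH * flpath
    # 1 / (1 + exp(-logit)) is the same as exp(logit) / (1 + exp(logit)).
    return 1.0 / (1.0 + math.exp(-logit))


# Input values are hypothetical and chosen only to show the trend.
for factor8, flpath in [(0.0, 0.0), (1.0, 1.0), (2.0, 3.0)]:
    print(factor8, flpath, round(landslide_probability(factor8, flpath), 4))
```

Because both coefficients are positive, the probability rises as either input increases.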
-
Logit Results
The coefficients of the predictors in the logit model are positive. Therefore, higher scores would increase the probability of landslide hazard.
Logit model assumes a nonlinear relationship between the probability and the explanatory variables
Hazard map based on the ROC curve technique groups the hazard into two classes, Low Hazard and High Hazard, and shows five classes of probabilities of landslide hazard
-
Final Results
59.1% of the landslides and 69.8% of non-landslides were correctly determined
Model can be applied to large geographic areas
ROC curves are incorporated as a sophisticated tool for decision makers for the spatial prediction of landslide hazard