regression algorithm

8/7/2019 Regression Algorithm

1/31

Logistic Regression

Often, the spatial phenomenon under investigation can only be described by a categorical variable.

Wild fires are typically depicted with polygons showing burned vs. not burned

Or bird distribution indicating presence or absence of birds

The previous regression technique is not suitable because the dependent variable is neither interval nor ratio

Logistic regression treats the distribution in a probabilistic manner, that is, the occurrence of the study phenomenon is evaluated in terms of probability


2/31

Logistic Regression

If the probability of presence of a phenomenon is $P_a$, then $P_b$ represents the absence of the phenomenon and

$P_a + P_b = 1$

$U_a = F_0 + F_1 X_1 + F_2 X_2 + \dots + F_n X_n + I$

$U_a$ is the utility function of event $a$ expressed as a linear combination of a number of explanatory variables $X_1, X_2, \dots$, and $F_n$ is the estimated parameter of variable $X_n$

$$P_a = \frac{\exp(U_a)}{1 + \exp(U_a)}$$


3/31

Logistic Regression

A greater value of $U_a$ implies a greater probability for the event to take place. When $U_a$ approaches infinity, $P_a$ approaches 1, indicating a high likelihood for the event to occur. When $U_a$ approaches negative infinity, $P_a$ approaches 0.

When $U_a$ equals zero, the probability is .50, implying a 50/50 chance for the event to occur.
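The limiting behavior described above can be checked numerically. The following is a minimal sketch, not part of the original slides; the function name `event_probability` is an assumption made for illustration, and the branch on the sign of the utility is only there to avoid floating-point overflow.

```python
import math


def event_probability(utility: float) -> float:
    """Return P_a = exp(U_a) / (1 + exp(U_a)) for a utility value U_a."""
    # Both branches are algebraically the same formula; splitting on the sign
    # keeps math.exp from overflowing for very large |U_a|.
    if utility >= 0:
        return 1.0 / (1.0 + math.exp(-utility))
    exp_u = math.exp(utility)
    return exp_u / (1.0 + exp_u)


# Probability approaches 0 for very negative U_a, equals .50 at U_a = 0,
# and approaches 1 for very positive U_a.
for u in (-10.0, -1.0, 0.0, 1.0, 10.0):
    print(f"U_a = {u:6.1f}  ->  P_a = {event_probability(u):.4f}")
```

Because the curve is symmetric about zero, the probabilities for utilities u and -u always sum to 1.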


4/31

Logistic Regression Example

Example from Chou

Fires in the San Jacinto Ranger District of the San Bernardino National Forest were examined to map the distribution of fire occurrence probability. The basic model consisted of eight independent variables:

Area, perimeter, vegetation, proximity to buildings, proximity to campgrounds, proximity to roads, maximum temperature in July, and annual precipitation


5/31

Variables in Fire Distribution Study

X1 Area: area of geographic unit
X2 Perimeter: perimeter of geographic unit
X3 Vegetation: vegetation computed by rotation period
X4 Building: proximity to structures
X5 Campground: proximity to campgrounds
X6 Road: proximity to roads
X7 Temperature: maximum temperature in July
X8 Precipitation: annual precipitation

The dependent variable is a code indicating whether or not a geographic unit is burned. Area and perimeter provide general geometric characteristics. Vegetation, precipitation, and temperature represent environmental factors, while building, campground, and road represent human-related factors


6/31

Results of Logistic Regression

The model indicates that perimeter, vegetation, campground, road, and temperature are variables to be included in the model. Other variables are not included as they are not statistically different from 0

Variable   Coefficient   Chi-square   P-Value
X0            -6.3246        31.13    0
X1             0              1.42    0.234
X2            -0.0002         8.13    0.0043
X3             1.5577        43.65    0
X4            -1.1451         1.93    0.1648
X5          -294.58           4.61    0.0318
X6            -0.5244         4.46    0.0348
X7             0.179         28.19    0
X8             0.0023         0.21    0.6493

Log likelihood   -1366
PCE              60
Chi-square       3.84 for alpha = .05
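The P-Value column can be reproduced from the Chi-square column if each statistic is assumed to have one degree of freedom. That assumption is not stated on the slide, although it agrees with the critical value of 3.84 at alpha = .05. The sketch below uses only the Python standard library; the function name is hypothetical.

```python
import math


def chi_square_p_value_1df(chi_square: float) -> float:
    """Upper-tail p-value of a chi-square statistic with 1 degree of freedom."""
    # For 1 df, P(X > x) = erfc(sqrt(x / 2)).
    return math.erfc(math.sqrt(chi_square / 2.0))


# Chi-square values copied from the basic-model table.
chi_squares = {"X1": 1.42, "X2": 8.13, "X4": 1.93, "X5": 4.61, "X6": 4.46, "X8": 0.21}

for name, stat in chi_squares.items():
    p = chi_square_p_value_1df(stat)
    verdict = "significant" if p < 0.05 else "not significant"
    print(f"{name}: chi-square = {stat:5.2f}, p = {p:.4f} ({verdict} at alpha = .05)")
```

Small differences from the tabulated p-values are expected because the chi-square values on the slide are rounded to two decimals.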


7/31

Results of Logistic Regression

The percentage-correctly-estimated (PCE) index shows the maximum level of estimation accuracy of a model.

In this example, PCE is 60%, not much better than a random 50/50 chance.

Therefore, another parameter was evaluated
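The slides do not spell out how PCE is calculated. One plausible reading, assumed here, is the percentage of geographic units whose predicted class (probability of at least .50 counted as burned) matches the observed burned/not-burned code. The data in this sketch are hypothetical and chosen only to illustrate a 60% result.

```python
def percentage_correctly_estimated(observed, predicted_prob, threshold=0.5):
    """Percent of units whose thresholded prediction matches the observed 0/1 code."""
    if len(observed) != len(predicted_prob):
        raise ValueError("observed and predicted_prob must have the same length")
    correct = sum(
        1 for y, p in zip(observed, predicted_prob) if (p >= threshold) == bool(y)
    )
    return 100.0 * correct / len(observed)


# Hypothetical burned (1) / not burned (0) codes and fitted probabilities.
observed = [1, 0, 1, 1, 0, 0, 1, 0, 0, 1]
predicted_prob = [0.8, 0.3, 0.4, 0.7, 0.6, 0.2, 0.9, 0.55, 0.1, 0.45]
print(percentage_correctly_estimated(observed, predicted_prob))
```

With a balanced sample, always guessing one class already scores 50%, which is why a PCE of 60% is treated as weak.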


8/31

Alternative Model

Included an additional variable to determine whether it makes any significant difference in model performance

The new variable represents neighborhood effects, or conditions of the surrounding geographic units

Assumes that fire occurrence probability is affected not only by the environmental and human-related variables listed in the basic model, but also by the distribution of fire occurrence probability of adjacent units

The new spatial term X9 is defined by the percentage of neighboring units that were burned during the study period
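A neighborhood term of this kind could be derived from an adjacency list of geographic units. The sketch below is illustrative only: the unit identifiers, the adjacency structure, and the choice to assign 0 to a unit with no neighbors are all assumptions, not details from Chou's study.

```python
def neighborhood_term(burned, neighbors):
    """Return, for each unit, the percentage of its neighboring units that burned."""
    x9 = {}
    for unit, adjacent in neighbors.items():
        if not adjacent:
            x9[unit] = 0.0  # assumption: a unit with no neighbors gets 0
            continue
        burned_count = sum(1 for n in adjacent if burned[n])
        x9[unit] = 100.0 * burned_count / len(adjacent)
    return x9


# Hypothetical units: True means the unit burned during the study period.
burned = {"A": True, "B": False, "C": True, "D": False}
neighbors = {"A": ["B", "C"], "B": ["A", "C", "D"], "C": ["A", "B"], "D": ["B"]}

for unit, value in neighborhood_term(burned, neighbors).items():
    print(f"{unit}: X9 = {value:.1f}%")
```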


9/31

New Results

Results from the new study are quite different

Only two variables are statistically significant: vegetation and neighborhood effects

Vegetation appears to be the determining environmental variable in the distribution of wildfires in the study area

Finally, wildfires are influenced by neighborhood conditions

Variable   Coefficient   Chi-square   P-Value
X1             0              1.03    0.3106
X2             0.0003         0.97    0.3249
X3             1.6738         6.88    0.0087
X4             0.8416         0.19    0.6669
X5            42.28           0       0.9701
X6             1.0241         3       0.0831
X7             0.1121         1       0.3168
X8             0.0127         0.55    0.4597
X9            17.951       2359.3     0

Log likelihood   164.788
PCE              97
Chi-square       3.84 for alpha = .05


10/31

Testing Statistical Significance

Did the neighborhood effects significantly change the model? Need to apply the chi-square test of the likelihood ratio

$$\lambda = \frac{L_0}{L_1}$$

$$-2 \log \lambda = -2(\log L_0 - \log L_1) = -2(-1366.197 + 167.914) = -2(-1198.283) = 2396.566$$

Where $L_0$ denotes the likelihood of the basic model and $L_1$ denotes the likelihood of the study model

Statistical testing suggests that the neighborhood variable significantly improved the performance of the model
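The likelihood ratio calculation can be reproduced in a few lines. This sketch takes the two log likelihoods used in the calculation on this slide (-1366.197 for the basic model and -167.914 for the model with the neighborhood term) and compares the statistic with the critical chi-square value of 3.84, which assumes one degree of freedom for the single added variable. The names are illustrative.

```python
LOG_L0 = -1366.197  # log likelihood of the basic model
LOG_L1 = -167.914   # log likelihood of the model with the neighborhood term
CRITICAL_CHI_SQUARE = 3.84  # alpha = .05, assuming 1 degree of freedom


def likelihood_ratio_statistic(log_l0: float, log_l1: float) -> float:
    """Return -2 * log(L0 / L1) computed from the two log likelihoods."""
    return -2.0 * (log_l0 - log_l1)


statistic = likelihood_ratio_statistic(LOG_L0, LOG_L1)
print(f"-2 log(lambda) = {statistic:.3f}")
if statistic > CRITICAL_CHI_SQUARE:
    print("The added variable significantly improves the model at alpha = .05")
else:
    print("No significant improvement at alpha = .05")
```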


11/31

Procedure for Regression Analysis (Barber, p. 448)

Specify the variables in the model and the exact form of the relationship between them

Collect data

Estimate the parameters of the model

Statistically test the utility of the developed model, and check whether the assumptions of the simple linear regression model are satisfied

Use the model for prediction


12/31

Example of Data Manipulation and Programming in ArcView

Manipulating Yield Data with DataManipulation.ave


13/31

Spatial Prediction of Landslide Hazard Using Logistic Regression and GIS

Art Lembo
620 Presentation

Based on paper by Gorsevski, Gessler, and Foltz


14/31

Introduction

Landslides are natural geologic processes that cause billions of dollars in damage and thousands of deaths each year

95% of landslides occur in developing countries


15/31

Causes of Landslides

Human activities, such as deforestation and urban expansion, accelerate the process of landslides

Roads and harvest activities in timberlands increase the occurrence of landslides

In undisturbed forest, soil erosion is generally negligible


16/31

Clearwater National Forest 1995-1996

Major landslides occurred during the winter following heavy rains, snowmelt, and high river flow

Over 900 landslides were recorded on the unstable slopes of the forest

Landslide occurrence was widely distributed and included artificial slopes such as road cuts and fills, or natural slopes in clearcut areas


17/31


18/31

Landslide Data

Within the large remote area, a DEM was used to generate quantitative topographic attributes

Slope, elevation, aspect, profile curvature, tangent curvature, plan curvature, flow path, and contributing area

Photo interpretation and field inventory identified landslide areas


19/31


20/31

Considerations in Creating Hazard Models

Datasets combined and stored in a GIS database

Hazard model assumptions:

Strength of a model depends on the quality of the data collected

Data-driven models are not appropriate to extrapolate to neighboring areas

Climatic conditions may change so that the past is not an indicator of the future

Uncertainty exists when a hazard map is derived from a statistically based model


21/31

Models Used in Study

Logistic regression was used, which correlated the environmental attributes and landslide distribution

Because of the existence of uncertainty, a receiver operating characteristic (ROC) curve was used. It plots the proportion of true positives against the proportion of false positives at each level of the criterion
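To make the ROC idea concrete, the sketch below computes one (false positive rate, true positive rate) point for each level of the criterion. It is a minimal illustration with hypothetical labels and scores, not the procedure used in the paper, and the function name `roc_points` is an assumption.

```python
def roc_points(labels, scores):
    """Return (false positive rate, true positive rate) pairs, one per threshold."""
    positives = sum(1 for y in labels if y == 1)
    negatives = len(labels) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("need at least one positive and one negative label")
    points = [(0.0, 0.0)]
    # Lower the criterion step by step; cells scoring at or above it are flagged.
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 1)
        fp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 0)
        points.append((fp / negatives, tp / positives))
    return points


# Hypothetical landslide labels (1 = present) and model probabilities.
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.7, 0.6, 0.4, 0.3, 0.1]
for fpr, tpr in roc_points(labels, scores):
    print(f"FPR = {fpr:.2f}, TPR = {tpr:.2f}")
```

A decision maker can read the trade-off directly from these points: lowering the criterion catches more landslides at the cost of more false alarms.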


22/31

Assessing Landslide Hazard

Field inspection using a check list to identify sites susceptible to landsliding

Projection of future patterns of instability from analysis of landslide inventories

Multivariate analysis of factors characterizing observed sites of slope instability

Stability ranking based on criteria such as slope, land forms, or geologic structure

Failure probability analysis based on slope stability models with stochastic hydrologic simulation


23/31

Preparing the Data

Primary and secondary attributes are derived from a 30 m DEM, reducing the high cost of collecting the data

Landslides assessed through aerial reconnaissance

Landslide hazard areas are then identified based on spatial correlation between the attributes derived from the DEM

ROC curves used for decision making


24/31

Data Sampling

15% of non-landslide cells were randomly sampled for an absence of landslides

Multivariate subset was derived from the coverages where landslides were absent

The landslide coverage was a point data set that sampled grid cells where landslides were present

Both samples were joined together where the dependent variable had a binary response (present or absent)

Final output stored in ASCII and used in SAS
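The sampling and joining steps might look like the following sketch. It is written in Python rather than the GIS and SAS tools named on the slide, and the cell identifiers, the fixed random seed, and the function name `build_sample` are assumptions made for illustration.

```python
import random


def build_sample(absent_cells, present_cells, fraction=0.15, seed=0):
    """Join a random sample of absence cells with all presence cells."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    n_absent = round(fraction * len(absent_cells))
    sampled_absent = rng.sample(absent_cells, n_absent)
    # Binary response: 0 = landslide absent, 1 = landslide present.
    rows = [(cell, 0) for cell in sampled_absent]
    rows += [(cell, 1) for cell in present_cells]
    return rows


# Hypothetical grid-cell identifiers.
absent_cells = list(range(1000))
present_cells = list(range(1000, 1040))
rows = build_sample(absent_cells, present_cells)
print(len(rows), "rows,", sum(label for _, label in rows), "with landslides present")
```

Each row could then be extended with the terrain attributes for that cell and written out as delimited ASCII text for the statistical package.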


25/31

Statistical Analysis

A normal plot of the data was used to determine whether the data followed a normal distribution

The plot showed that the data points do not fall along a straight line. The data are not multivariate normal

Logistic regression is used when the predictor variables are not normally distributed, and some predictor variables are categorical

Factor analysis was applied to determine the number of underlying variables

Only significantly loaded variables were considered


26/31

Statistical Analysis

The form of the logistic regression model is defined as:

Where x is the data vector for a randomly selected experimental unit and y is the value of the binary outcome variable. Maximum likelihood was used to estimate B for the predictive equation

Variables not significant at the .1 level were eliminated
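A standard way to write a logistic regression model with the symbols x, y, and B used above is sketched here. The separate intercept $B_0$ and the vector notation are assumptions chosen to agree with the fitted equation on the Logit Results slide; they are not necessarily the notation of the original slide or paper.

```latex
P(y = 1 \mid x) = \frac{\exp(B_0 + B^{\top} x)}{1 + \exp(B_0 + B^{\top} x)}
\qquad\Longleftrightarrow\qquad
\log\frac{P(y = 1 \mid x)}{1 - P(y = 1 \mid x)} = B_0 + B^{\top} x
```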


27/31

Logit Results

Logit showed that the most important variables contributing to the slope instability were Flow Path and mean slope of upland area

log(p / (1 - p)) = -2.2642 + FACTOR8 * 0.4969 + FLPATH * 0.6039

or

p = exp(-2.2642 + FACTOR8 * 0.4969 + FLPATH * 0.6039) / (1 + exp(-2.2642 + FACTOR8 * 0.4969 + FLPATH * 0.6039))

p: probability of landslide hazard
FACTOR8: factor with underlying characteristics of aspect
FLPATH: maximum distance of water to the point in the catchment
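The fitted equation can be applied cell by cell to produce a probability surface. The sketch below evaluates it for a few hypothetical input values; the coefficients come from the equation on this slide, while the units and scaling of FACTOR8 and FLPATH are not given there and are left as assumptions.

```python
import math

INTERCEPT = -2.2642
COEF_FACTOR8 = 0.4969
COEF_FLPATH = 0.6039


def landslide_probability(factor8: float, flpath: float) -> float:
    """Probability of landslide hazard from the fitted logit equation."""
    logit = INTERCEPT + COEF_FACTOR8 * factor8 + COEF_FLPATH * flpath
    return 1.0 / (1.0 + math.exp(-logit))  # same as exp(logit) / (1 + exp(logit))


# Hypothetical predictor values; their units and scaling are assumptions.
for factor8, flpath in [(0.0, 0.0), (1.0, 1.0), (2.0, 3.0)]:
    p = landslide_probability(factor8, flpath)
    print(f"FACTOR8 = {factor8}, FLPATH = {flpath}: p = {p:.4f}")
```

Because both coefficients are positive, the computed probability rises as either predictor increases.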


28/31


29/31

Logit Results

The coefficients of the logit model are positive. Therefore, higher scores increase the probability of landslide hazard.

The logit model assumes a nonlinear relationship between the probability and the explanatory variables

The hazard map based on the ROC curve technique groups the hazard into two classes, Low Hazard and High Hazard, and shows five classes of probability of landslide hazard


30/31

Final Results

59.1% of the landslides and 69.8% of non-landslides were correctly determined

Model can be applied to large geographic areas

ROC curves are incorporated as a sophisticated tool for decision makers for the spatial prediction of landslide hazard


31/31
