regression algorithm

Upload: smarttag99

Post on 08-Apr-2018


  • 8/7/2019 Regression Algorithm


    Logistic Regression

    Often, the spatial phenomenon under investigation can only be described by a categorical variable.

    Wildfires are typically depicted with polygons showing burned vs. not burned

    Or, bird distribution indicating presence or absence of birds

    The previous regression technique is not suitable because the dependent variable is neither interval nor ratio

    Logistic regression treats the distribution in a probabilistic manner; that is, the occurrence of the study phenomenon is evaluated in terms of probability


    Logistic Regression

    If the probability of presence of a phenomenon is $P_a$, then $P_b$ represents the absence of the phenomenon and

    $$P_a + P_b = 1$$

    $$U_a = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_n X_n + \varepsilon$$

    $U_a$ is the utility function of event $a$ expressed as a linear combination of a number of explanatory variables $X_1, X_2, \dots, X_n$, and $\beta_n$ is the estimated parameter of variable $X_n$

    $$P_a = \frac{\exp(U_a)}{1 + \exp(U_a)}$$


    Logistic Regression

    A greater value of $U_a$ implies a greater probability for the event to take place. When $U_a$ approaches infinity, $P_a$ approaches 1, indicating a high likelihood for the event to occur. When $U_a$ approaches negative infinity, $P_a$ approaches 0.

    When $U_a$ equals zero, the probability is .50, implying a 50/50 chance for the event to occur.
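The limiting behavior described on this slide can be checked numerically. The following is a minimal sketch of the logistic transform, not code from the presentation; the function name `event_probability` is a hypothetical choice, and the two-branch form is only there to avoid overflow for large utilities.

```python
import math

def event_probability(utility: float) -> float:
    """Logistic transform P = exp(U) / (1 + exp(U)), written to avoid overflow."""
    if utility >= 0:
        # Algebraically the same as exp(U) / (1 + exp(U)).
        return 1.0 / (1.0 + math.exp(-utility))
    e = math.exp(utility)
    return e / (1.0 + e)

# Large negative, zero, and large positive utilities.
for u in (-10.0, 0.0, 10.0):
    print(u, round(event_probability(u), 4))
```

A utility of zero gives exactly .50, and the probability moves toward 0 or 1 as the utility becomes strongly negative or positive.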


    Logistic Regression Example

    Example from Chou

    Fires in San Jacinto Ranger District of the San Bernardino National Forest were examined to map the distribution of fire occurrence probability. The basic model consisted of eight independent variables

    Area, perimeter, vegetation, proximity to buildings, proximity to campgrounds, proximity to roads, maximum temperature in July, and annual precipitation


    Variables in Fire Distribution Study

    X1  Area: area of geographic unit
    X2  Perimeter: perimeter of geographic unit
    X3  Vegetation: vegetation computed by rotation period
    X4  Building: proximity to structures
    X5  Campground: proximity to campgrounds
    X6  Road: proximity to roads
    X7  Temperature: maximum temperature in July
    X8  Precipitation: annual precipitation

    The dependent variable is a code indicating whether or not a geographic unit is burned. Area and perimeter provide general geometric characteristics. Vegetation, precipitation, and temperature represent environmental factors, while building, campground, and road represent human-related factors


    Results of Logistic Regression

    The model indicates that perimeter, vegetation, campground, road, and temperature are variables to be included in the model. Other variables are not included as their coefficients are not statistically different from 0

    Variable   Coefficient   Chi-square   P-value
    X0         -6.3246       31.13        0
    X1         0             1.42         0.234
    X2         -0.0002       8.13         0.0043
    X3         1.5577        43.65        0
    X4         -1.1451       1.93         0.1648
    X5         -294.58       4.61         0.0318
    X6         -0.5244       4.46         0.0348
    X7         0.179         28.19        0
    X8         0.0023        0.21         0.6493

    Log likelihood: -1366
    PCE: 60
    Chi-square critical value: 3.84 for alpha = .05
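The variable selection stated on this slide can be reproduced from the chi-square column. The sketch below assumes a critical value of 3.84 (chi-square with one degree of freedom at alpha = .05); the dictionary and the function name `significant_variables` are illustrative, not part of the original study.

```python
# Critical chi-square value for 1 degree of freedom at alpha = .05.
CHI_SQUARE_CRITICAL = 3.84

# Chi-square statistics of the basic model, copied from the results table.
basic_model = {
    "X1 Area": 1.42,
    "X2 Perimeter": 8.13,
    "X3 Vegetation": 43.65,
    "X4 Building": 1.93,
    "X5 Campground": 4.61,
    "X6 Road": 4.46,
    "X7 Temperature": 28.19,
    "X8 Precipitation": 0.21,
}

def significant_variables(chi_squares, critical=CHI_SQUARE_CRITICAL):
    """Keep the variables whose chi-square statistic exceeds the critical value."""
    return [name for name, chi2 in chi_squares.items() if chi2 > critical]

print(significant_variables(basic_model))
```

The retained variables are perimeter, vegetation, campground, road, and temperature, which matches the list given on the slide.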


    Results of Logistic Regression

    Percentage-correctly-estimated (PCE) index shows the maximum level of estimation accuracy of a model.

    In this example, PCE is 60%, not much better than a random 50/50 chance.

    Therefore, another parameter was evaluated


    Alternative Model

    Included an additional variable to determine whether it makes any significant difference in model performance

    New variable represents neighborhood effects, or conditions of the surrounding geographic units

    Assumes that fire occurrence probability is affected not only by the environmental and human-related variables listed in the basic model, but also by the distribution of fire occurrence probability of adjacent units

    The new spatial term X9 is defined by the percentage of neighboring units that were burned during the study period
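One way to compute such a neighborhood term is sketched below. The adjacency dictionary, the unit labels, the 0 to 100 scale, and the treatment of units without neighbors are all assumptions made for illustration; the slides do not say how adjacency was derived or how the percentage was scaled.

```python
def neighborhood_effect(burned, neighbors):
    """Return, for each unit, the percentage of its adjacent units that burned."""
    x9 = {}
    for unit, adjacent in neighbors.items():
        if adjacent:
            burned_count = sum(1 for n in adjacent if burned[n])
            x9[unit] = 100.0 * burned_count / len(adjacent)
        else:
            x9[unit] = 0.0  # assumption: a unit with no neighbors gets 0
    return x9

# Hypothetical four-unit study area.
burned = {"A": True, "B": False, "C": True, "D": False}
neighbors = {"A": ["B", "C"], "B": ["A", "C", "D"], "C": ["A", "B"], "D": ["B"]}

print(neighborhood_effect(burned, neighbors))
```

In a GIS, the `neighbors` mapping would typically come from polygon adjacency, which is outside the scope of this sketch.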


    New Results

    Results from the new study are quite different

    Only two variables are statistically significant: vegetation and neighborhood effects

    Vegetation appears to be the determining environmental variable in the distribution of wildfires in the study area

    Finally, wildfires are influenced by neighborhood conditions

    Variable   Coefficient   Chi-square   P-value
    X1         0             1.03         0.3106
    X2         0.0003        0.97         0.3249
    X3         1.6738        6.88         0.0087
    X4         0.8416        0.19         0.6669
    X5         42.28         0            0.9701
    X6         1.0241        3            0.0831
    X7         0.1121        1            0.3168
    X8         0.0127        0.55         0.4597
    X9         17.951        2359.3       0

    Log likelihood: 164.788
    PCE: 97
    Chi-square critical value: 3.84 for alpha = .05


    Testing Statistical Significance

    Did the neighborhood effects significantly change the model? Need to test the chi-square test of likelihood ratio

    $$\lambda = \frac{L_0}{L_1}$$

    Where $L_0$ denotes the likelihood of the basic model and $L_1$ denotes the likelihood of the study model

    $$\log \lambda = \log L_0 - \log L_1 = -1366.197 - (-167.914) = -1198.283$$

    $$-2 \log \lambda = 2396.566$$

    Statistical testing suggests that the neighborhood variable significantly improved the performance of the model
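The likelihood-ratio arithmetic on this slide can be sketched as follows. The two log-likelihood magnitudes (1366.197 and 167.914) are read from the slide; their negative signs and the use of one degree of freedom (one added variable, critical value 3.84 at alpha = .05) are assumptions of this sketch, and the function name is hypothetical.

```python
# Critical chi-square value for 1 degree of freedom at alpha = .05.
CHI_SQUARE_CRITICAL = 3.84

def likelihood_ratio_statistic(log_l0: float, log_l1: float) -> float:
    """Return -2 * log(L0 / L1), computed from the two log-likelihoods."""
    return -2.0 * (log_l0 - log_l1)

# Log-likelihoods of the basic model (L0) and the model with the neighborhood term (L1).
statistic = likelihood_ratio_statistic(-1366.197, -167.914)

print(round(statistic, 3), statistic > CHI_SQUARE_CRITICAL)
```

A statistic far above the critical value is what supports the conclusion that the neighborhood variable improved the model.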


    Procedure for Regression Analysis (Barber, p. 448)

    Specify the variables in the model and the exact form of the relationship between them

    Collect data

    Estimate the parameters of the model

    Statistically test the utility of the developed model, and check whether the assumptions of the simple linear regression model are satisfied

    Use the model for prediction


    Example of Data Manipulation and Programming in ArcView

    Manipulating Yield Data with DataManipulation.ave


    Spatial Prediction of Landslide Hazard Using Logistic Regression and GIS

    Art Lembo
    620 Presentation

    Based on paper by Gorsevski, Gessler, and Foltz


    Introduction

    Landslides are natural geologic processes that cause billions of dollars in damage and thousands of deaths each year

    95% of landslides occur in developing countries


    Causes of Landslides

    Human activities, such as deforestation and urban expansion, accelerate the process of landslides

    Roads and harvest activities in timberlands increase the occurrence of landslides

    In undisturbed forest, soil erosion is generally negligible


    Clearwater National Forest 1995-1996

    Major landslides occurred during the winter following heavy rains, snowmelt, and high river flow

    Over 900 landslides were recorded on the unstable slopes of the forest

    Landslide occurrence was widely distributed and included artificial slopes such as road cuts and fills, or natural slopes in clearcut areas


    Landslide Data

    Within the large remote area, a DEM was used to generate quantitative topographic attributes

    Slope, elevation, aspect, profile curvature, tangent curvature, plan curvature, flow path, and contributing area

    Photo interpretation and field inventory identified landslide areas


    Considerations in Creating Hazard Models

    Datasets combined and stored in a GIS database

    Hazard Model assumptions

    Strength of a model depends on the quality of the data collected

    Data driven models are not appropriate to extrapolate to neighboring areas

    Climatic conditions may change so that the past is not an indicator of the future

    Uncertainty exists when a hazard map is derived from a statistically based model


    Models Used in Study

    Logistic regression was used, which correlated the environmental attributes and landslide distribution

    Because of the existence of uncertainty, a Receiver Operating Characteristic (ROC) curve was used; it plots the proportion of false positives against the proportion of true positives at each level of the criterion
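The construction of such a curve can be sketched in a few lines. This is a generic illustration with made-up scores and labels, not the procedure or data of the study; `roc_points` is a hypothetical helper and it assumes both classes are present in `labels`.

```python
def roc_points(scores, labels):
    """Return (false positive rate, true positive rate) pairs, one per cutoff."""
    positives = sum(labels)
    negatives = len(labels) - positives
    points = [(0.0, 0.0)]  # cutoff above every score: nothing is flagged
    # Lower the cutoff step by step; a cell is flagged when its score >= cutoff.
    for cutoff in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= cutoff and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= cutoff and y == 0)
        points.append((fp / negatives, tp / positives))
    return points

# Hypothetical predicted probabilities and observed outcomes (1 = landslide).
scores = [0.9, 0.8, 0.6, 0.4, 0.2]
labels = [1, 0, 1, 0, 0]

for fpr, tpr in roc_points(scores, labels):
    print(round(fpr, 3), round(tpr, 3))
```

Each point corresponds to one level of the criterion, so a decision maker can pick the cutoff whose trade-off between false and true positives is acceptable.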


    Assessing Landslide Hazard

    Field inspection using a check list to identify sites susceptible to landsliding

    Projection of future patterns of instability from analysis of landslide inventories

    Multivariate analysis of factors characterizing observed sites of slope instability

    Stability ranking based on criteria such as slope, land forms, or geologic structure

    Failure probability analysis based on slope stability models with stochastic hydrologic simulation


    Preparing the Data

    Primary and secondary attributes are derived from a DEM (30 m), reducing the high cost of collecting the data

    Landslides assessed through aerial reconnaissance

    Landslide hazard areas are then identified based on spatial correlation between the attributes

    Identifying landslide hazard is based on spatial correlation between the attributes derived from the DEM

    ROC curves used for decision making


    Data Sampling

    15% of non-landslide cells were randomly sampled for an absence of landslides

    Multivariate subset was derived from the coverages where landslides were absent

    The landslide coverage was a point data set that sampled grid cells where landslides were present

    Both samples were joined together, with the dependent variable as a binary response (present or absent)

    Final output stored in ASCII and used in SAS
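The sampling and joining steps can be sketched as below. The cell identifiers, the fixed random seed, and the function name `build_sample` are hypothetical; the study itself worked with GIS coverages and SAS rather than Python.

```python
import random

def build_sample(landslide_cells, non_landslide_cells, fraction=0.15, seed=0):
    """Join all landslide cells (response 1) with a random subset of
    non-landslide cells (response 0)."""
    rng = random.Random(seed)  # fixed seed only so the sketch is repeatable
    k = round(fraction * len(non_landslide_cells))
    absent = rng.sample(non_landslide_cells, k)
    rows = [(cell, 1) for cell in landslide_cells]
    rows += [(cell, 0) for cell in absent]
    return rows

# Hypothetical grid-cell identifiers.
landslide_cells = [101, 102, 103]
non_landslide_cells = list(range(1, 41))

sample = build_sample(landslide_cells, non_landslide_cells)
print(len(sample))
```

Each row pairs a cell with its binary response, which is the shape a logistic regression routine expects once the topographic attributes are attached.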


    Statistical Analysis

    Normal plot of data to determine if the data followed a normal distribution

    Plot showed that data points do not fall along a straight line. The data is not multivariate normal

    Logistic regression is used when the predictor variables are not normally distributed, and some predictor variables are categorical

    Factor analysis was applied to determine the number of underlying variables

    Only significantly loaded variables were considered


    Statistical Analysis

    The form of the logistic regression model is defined as:

    Where $x$ is the data vector for a randomly selected experimental unit and $y$ is the value of the binary outcome variable. Maximum likelihood was used to estimate $\beta$ for the predictive equation

    Variables not significant at the .1 level were eliminated
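The equation for this slide is not present in the transcript. A standard way to write a logistic regression model that is consistent with the description here, and with the fitted logit on the following slide, is given below; the notation $x^{\top}\beta$ for the linear predictor is an assumption, not a recovered formula.

```latex
P(y = 1 \mid x) = \frac{\exp(x^{\top}\beta)}{1 + \exp(x^{\top}\beta)}
\qquad\Longleftrightarrow\qquad
\log\frac{P(y = 1 \mid x)}{1 - P(y = 1 \mid x)} = x^{\top}\beta
```

The two forms are equivalent: the left one gives the probability directly, and the right one shows that the log-odds are linear in the predictors.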


    Logit Results

    Logit showed that the most important variables contributing to the slope instability were Flow Path and mean slope of upland area

    $$\log\frac{p}{1 - p} = -2.2642 + 0.4969\,\mathrm{FACTOR8} + 0.6039\,\mathrm{FLPATH}$$

    or

    $$p = \frac{\exp(-2.2642 + 0.4969\,\mathrm{FACTOR8} + 0.6039\,\mathrm{FLPATH})}{1 + \exp(-2.2642 + 0.4969\,\mathrm{FACTOR8} + 0.6039\,\mathrm{FLPATH})}$$

    p        probability of landslide hazard
    FACTOR8  factor with underlying characteristics of aspect
    FLPATH   maximum distance of water flow to the point in the catchment
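The fitted equation can be evaluated for a single grid cell as in the sketch below. The coefficients are the ones given on this slide; the function name and the input values are hypothetical, since the slides do not state the units or scaling of FACTOR8 and FLPATH.

```python
import math

# Coefficients of the fitted logit model as given on the slide.
INTERCEPT = -2.2642
COEF_FACTOR8 = 0.4969
COEF_FLPATH = 0.6039

def landslide_probability(factor8: float, flpath: float) -> float:
    """Return p = exp(z) / (1 + exp(z)) for the fitted linear predictor z."""
    z = INTERCEPT + COEF_FACTOR8 * factor8 + COEF_FLPATH * flpath
    # 1 / (1 + exp(-z)) is algebraically identical to exp(z) / (1 + exp(z)).
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical inputs: a baseline cell and a cell with higher scores.
print(landslide_probability(0.0, 0.0))
print(landslide_probability(1.0, 2.0))
```

Because both coefficients are positive, raising either input raises the predicted probability.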


    Logit Results

    The coefficients of the logit model are positive. Therefore, higher scores would increase the probability of landslide hazard.

    Logit model assumes a nonlinear relationship between the probability and the explanatory variables

    Hazard map based on ROC curve technique groups the hazard into two classes: Low Hazard and High Hazard, showing five classes of probabilities of landslide hazard


    Final Results

    59.1% of the landslides and 69.8% of non-landslides were correctly determined

    Model can be applied to large geographic areas

    ROC curves are incorporated as a sophisticated tool for decision makers for the spatial prediction of landslide hazard
