title of projects: - applying linear regression model … · title of projects: - applying linear...


Upload: nguyenbao

Post on 01-Sep-2018


Title of Project: Applying a Linear Regression Model to a Real-World Problem

One of the most popular and frequently used techniques in statistics is linear regression, where you predict a real-valued output from one or more input values. Technically, linear regression is a statistical technique to analyze and predict the linear relationship between a dependent variable and one or more independent variables.

Say you want to predict the price of a house. The price is the dependent variable, because it depends on the other variables; factors like the size of the house, the locality, and the season of purchase act as independent variables.
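A minimal sketch of the house-price example, assuming made-up data: the variable names size_sqft and price, and every number in the simulation, are hypothetical and only illustrate how a dependent variable is regressed on an independent one with lm.

```r
# Hypothetical, simulated data: none of these numbers come from a real market.
set.seed(42)
n <- 100
size_sqft <- runif(n, min = 500, max = 3000)                 # independent variable
price <- 50000 + 120 * size_sqft + rnorm(n, sd = 20000)      # dependent variable
houses <- data.frame(size_sqft, price)

# Fit price as a linear function of size.
fit <- lm(price ~ size_sqft, data = houses)

# Estimated intercept and slope; the slope should land near the simulated 120.
coef(fit)
```

Because the data are simulated with a known slope, the estimate can be compared with the value used to generate it, which is not possible with real data.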

R comes with many default data sets. The data() function lists them, and loading the MASS library adds its data sets to the list.

Steps:-

Open RStudio, then install and load the MASS package and call data():

> install.packages("MASS")
> library(MASS)
> data()

This gives you a list of available data sets, from which you can get a clear idea of typical linear regression problems.
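The same list can also be read programmatically. This is a small sketch, assuming the MASS package is installed; the object name ds is only illustrative.

```r
# data(package = ...) returns an object whose $results matrix has the
# columns Package, LibPath, Item and Title.
ds <- as.data.frame(data(package = "MASS")$results[, c("Item", "Title")])

head(ds)   # first few data set names with their descriptions
nrow(ds)   # number of data sets shipped with MASS
```

Having the list as a data frame makes it easy to search, for example with grepl("Housing", ds$Title).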

Data sets in package ‘datasets’:

AirPassengers Monthly Airline Passenger Numbers 1949-1960

BJsales Sales Data with Leading Indicator

BJsales.lead (BJsales) Sales Data with Leading Indicator

BOD Biochemical Oxygen Demand

CO2 Carbon Dioxide Uptake in Grass Plants

ChickWeight Weight versus age of chicks on different diets

DNase Elisa assay of DNase

EuStockMarkets Daily Closing Prices of Major European Stock Indices, 1991-1998

Formaldehyde Determination of Formaldehyde

HairEyeColor Hair and Eye Color of Statistics Students

Harman23.cor Harman Example 2.3

Harman74.cor Harman Example 7.4

Indometh Pharmacokinetics of Indomethacin

InsectSprays Effectiveness of Insect Sprays

JohnsonJohnson Quarterly Earnings per Johnson & Johnson Share

LakeHuron Level of Lake Huron 1875-1972

LifeCycleSavings Intercountry Life-Cycle Savings Data

Loblolly Growth of Loblolly pine trees

Nile Flow of the River Nile

Orange Growth of Orange Trees

OrchardSprays Potency of Orchard Sprays

PlantGrowth Results from an Experiment on Plant Growth

Puromycin Reaction Velocity of an Enzymatic Reaction

Seatbelts Road Casualties in Great Britain 1969-84

Theoph Pharmacokinetics of Theophylline

Titanic Survival of passengers on the Titanic

ToothGrowth The Effect of Vitamin C on Tooth Growth in Guinea Pigs

UCBAdmissions Student Admissions at UC Berkeley

UKDriverDeaths Road Casualties in Great Britain 1969-84

UKgas UK Quarterly Gas Consumption

USAccDeaths Accidental Deaths in the US 1973-1978

USArrests Violent Crime Rates by US State

USJudgeRatings Lawyers' Ratings of State Judges in the US Superior Court

USPersonalExpenditure Personal Expenditure Data

UScitiesD Distances Between European Cities and Between US Cities

VADeaths Death Rates in Virginia (1940)

WWWusage Internet Usage per Minute

WorldPhones The World's Telephones

ability.cov Ability and Intelligence Tests

airmiles Passenger Miles on Commercial US Airlines, 1937-1960

airquality New York Air Quality Measurements

anscombe Anscombe's Quartet of 'Identical' Simple Linear Regressions

attenu The Joyner-Boore Attenuation Data

attitude The Chatterjee-Price Attitude Data

austres Quarterly Time Series of the Number of Australian Residents

beaver1 (beavers) Body Temperature Series of Two Beavers

beaver2 (beavers) Body Temperature Series of Two Beavers

cars Speed and Stopping Distances of Cars

chickwts Chicken Weights by Feed Type

co2 Mauna Loa Atmospheric CO2 Concentration

crimtab Student's 3000 Criminals Data

discoveries Yearly Numbers of Important Discoveries

esoph Smoking, Alcohol and (O)esophageal Cancer

euro Conversion Rates of Euro Currencies

euro.cross (euro) Conversion Rates of Euro Currencies

eurodist Distances Between European Cities and Between US Cities

faithful Old Faithful Geyser Data

fdeaths (UKLungDeaths) Monthly Deaths from Lung Diseases in the UK

freeny Freeny's Revenue Data

freeny.x (freeny) Freeny's Revenue Data

freeny.y (freeny) Freeny's Revenue Data

infert Infertility after Spontaneous and Induced Abortion

iris Edgar Anderson's Iris Data

iris3 Edgar Anderson's Iris Data

islands Areas of the World's Major Landmasses

ldeaths (UKLungDeaths) Monthly Deaths from Lung Diseases in the UK

lh Luteinizing Hormone in Blood Samples

longley Longley's Economic Regression Data

lynx Annual Canadian Lynx trappings 1821-1934

mdeaths (UKLungDeaths) Monthly Deaths from Lung Diseases in the UK

morley Michelson Speed of Light Data

mtcars Motor Trend Car Road Tests

nhtemp Average Yearly Temperatures in New Haven

nottem Average Monthly Temperatures at Nottingham, 1920-1939

npk Classical N, P, K Factorial Experiment

occupationalStatus Occupational Status of Fathers and their Sons

precip Annual Precipitation in US Cities

presidents Quarterly Approval Ratings of US Presidents

pressure Vapor Pressure of Mercury as a Function of Temperature

quakes Locations of Earthquakes off Fiji

randu Random Numbers from Congruential Generator RANDU

rivers Lengths of Major North American Rivers

rock Measurements on Petroleum Rock Samples

sleep Student's Sleep Data

stack.loss (stackloss) Brownlee's Stack Loss Plant Data

stack.x (stackloss) Brownlee's Stack Loss Plant Data

stackloss Brownlee's Stack Loss Plant Data

state.abb (state) US State Facts and Figures

state.area (state) US State Facts and Figures

state.center (state) US State Facts and Figures

state.division (state) US State Facts and Figures

state.name (state) US State Facts and Figures

state.region (state) US State Facts and Figures

state.x77 (state) US State Facts and Figures

sunspot.month Monthly Sunspot Data, from 1749 to "Present"

sunspot.year Yearly Sunspot Data, 1700-1988

sunspots Monthly Sunspot Numbers, 1749-1983

swiss Swiss Fertility and Socioeconomic Indicators (1888) Data

treering Yearly Treering Data, -6000-1979

trees Girth, Height and Volume for Black Cherry Trees

uspop Populations Recorded by the US Census

volcano Topographic Information on Auckland's Maunga Whau Volcano

warpbreaks The Number of Breaks in Yarn during Weaving

women Average Heights and Weights for American Women

Data sets in package ‘MASS’:

Aids2 Australian AIDS Survival Data

Animals Brain and Body Weights for 28 Species

Boston Housing Values in Suburbs of Boston

Cars93 Data from 93 Cars on Sale in the USA in 1993

Cushings Diagnostic Tests on Patients with Cushing's Syndrome

DDT DDT in Kale

GAGurine Level of GAG in Urine of Children

Insurance Numbers of Car Insurance claims

Melanoma Survival from Malignant Melanoma

OME Tests of Auditory Perception in Children with OME

Pima.te Diabetes in Pima Indian Women

Pima.tr Diabetes in Pima Indian Women

Pima.tr2 Diabetes in Pima Indian Women

Rabbit Blood Pressure in Rabbits

Rubber Accelerated Testing of Tyre Rubber

SP500 Returns of the Standard and Poors 500

Sitka Growth Curves for Sitka Spruce Trees in 1988

Sitka89 Growth Curves for Sitka Spruce Trees in 1989

Skye AFM Compositions of Aphyric Skye Lavas

Traffic Effect of Swedish Speed Limits on Accidents

UScereal Nutritional and Marketing Information on US Cereals

UScrime The Effect of Punishment Regimes on Crime Rates

VA Veteran's Administration Lung Cancer Trial

abbey Determinations of Nickel Content

accdeaths Accidental Deaths in the US 1973-1978

anorexia Anorexia Data on Weight Change

bacteria Presence of Bacteria after Drug Treatments

beav1 Body Temperature Series of Beaver 1

beav2 Body Temperature Series of Beaver 2

biopsy Biopsy Data on Breast Cancer Patients

birthwt Risk Factors Associated with Low Infant Birth Weight

cabbages Data from a cabbage field trial

caith Colours of Eyes and Hair of People in Caithness

cats Anatomical Data from Domestic Cats

cement Heat Evolved by Setting Cements

chem Copper in Wholemeal Flour

coop Co-operative Trial in Analytical Chemistry

cpus Performance of Computer CPUs

crabs Morphological Measurements on Leptograpsus Crabs

deaths Monthly Deaths from Lung Diseases in the UK

drivers Deaths of Car Drivers in Great Britain 1969-84

eagles Foraging Ecology of Bald Eagles

epil Seizure Counts for Epileptics

farms Ecological Factors in Farm Management

fgl Measurements of Forensic Glass Fragments

forbes Forbes' Data on Boiling Points in the Alps

galaxies Velocities for 82 Galaxies

gehan Remission Times of Leukaemia Patients

genotype Rat Genotype Data

geyser Old Faithful Geyser Data

gilgais Line Transect of Soil in Gilgai Territory

hills Record Times in Scottish Hill Races

housing Frequency Table from a Copenhagen Housing Conditions Survey

immer Yields from a Barley Field Trial

leuk Survival Times and White Blood Counts for Leukaemia Patients

mammals Brain and Body Weights for 62 Species of Land Mammals

mcycle Data from a Simulated Motorcycle Accident

menarche Age of Menarche in Warsaw

michelson Michelson's Speed of Light Data

minn38 Minnesota High School Graduates of 1938

motors Accelerated Life Testing of Motorettes

muscle Effect of Calcium Chloride on Muscle Contraction in Rat Hearts

newcomb Newcomb's Measurements of the Passage Time of Light

nlschools Eighth-Grade Pupils in the Netherlands

npk Classical N, P, K Factorial Experiment

npr1 US Naval Petroleum Reserve No. 1 data

oats Data from an Oats Field Trial

painters The Painter's Data of de Piles

petrol N. L. Prater's Petrol Refinery Data

phones Belgium Phone Calls 1950-1973

quine Absenteeism from School in Rural New South Wales

road Road Accident Deaths in US States

rotifer Numbers of Rotifers by Fluid Density

ships Ships Damage Data

shoes Shoe wear data of Box, Hunter and Hunter

shrimp Percentage of Shrimp in Shrimp Cocktail

shuttle Space Shuttle Autolander Problem

snails Snail Mortality Data

steam The Saturated Steam Pressure Data

stormer The Stormer Viscometer Data

survey Student Survey Data

synth.te Synthetic Classification Problem

synth.tr Synthetic Classification Problem

topo Spatial Topographic Data

waders Counts of Waders at 15 Sites in South Africa

whiteside House Insulation: Whiteside's Data

wtloss Weight Loss Data from an Obese Patient

Analysing a default data set in R

I will use a default data set called “airquality”. The data set has various air quality parameters measured in New York City.

These are the parameters in the data set:

Daily temperature from May to September

Solar radiation data

Ozone data

Wind data

Our goal is to predict the temperature for a particular month in New York using solar radiation, ozone and wind data. I am going to use Linear Regression (LR) to make the prediction.

To start using LR or any other algorithm, the first step is to state a hypothesis.

The hypothesis is: “Temperature depends on ozone, wind and solar radiation”. The null hypothesis of linear regression says there is no relation between the dependent and independent variables, i.e. all slope coefficients are zero in the equation Temp = a0 + a1*Solar.R + a2*Ozone + a3*Wind + error, where a0 is the intercept.

The alternative hypothesis, on the other hand, says that at least one coefficient is non-zero and hence a relationship exists between the dependent and independent variables.

In mathematical notation this can be written as:

H0: a1=a2=a3=0

Ha: at least one of a1, a2, a3 ≠ 0

Let’s test the hypothesis using a linear regression model and draw a conclusion.

To test the hypothesis, we check the p-values of the variables against a significance level. If a p-value is below the accepted level (generally 0.05, which corresponds to 95% confidence), we reject the null hypothesis and conclude that there is a relation between the dependent and independent variables.

If the p-value is above the accepted level, we fail to reject the null hypothesis: the data do not provide enough evidence of a relationship between the dependent and independent variables.
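As a rough illustration of this decision rule, the sketch below fits the model on an untouched copy of airquality and compares the p-value of the overall F-test with 0.05. The names aq, fit, alpha and reject_h0 are illustrative, and rows with missing values are simply dropped by lm here.

```r
aq <- datasets::airquality                      # fresh copy, independent of the session above
fit <- lm(Temp ~ Solar.R + Ozone + Wind, data = aq)   # lm drops rows containing NA

# summary(fit)$fstatistic is a named vector: value, numdf, dendf.
f <- summary(fit)$fstatistic
p_value <- pf(f[["value"]], f[["numdf"]], f[["dendf"]], lower.tail = FALSE)

alpha <- 0.05                                   # assumed significance level
reject_h0 <- p_value < alpha                    # TRUE means H0 (all slopes zero) is rejected
reject_h0
```

The overall F-test addresses H0: a1=a2=a3=0 as a whole, whereas the per-variable p-values in summary(fit) test one coefficient at a time.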

Before that, let’s understand the data by exploring it in R.

> data("airquality")
> attach(airquality)
> head(airquality,10) # to see first 10 rows
   Ozone Solar.R Wind Temp Month Day
1     41     190  7.4   67     5   1
2     36     118  8.0   72     5   2
3     12     149 12.6   74     5   3
4     18     313 11.5   62     5   4
5     NA      NA 14.3   56     5   5
6     28      NA 14.9   66     5   6
7     23     299  8.6   65     5   7
8     19      99 13.8   59     5   8
9      8      19 20.1   61     5   9
10    NA     194  8.6   69     5  10
> summary(airquality)
     Ozone           Solar.R           Wind             Temp
 Min.   :  1.00   Min.   :  7.0   Min.   : 1.700   Min.   :56.00
 1st Qu.: 18.00   1st Qu.:115.8   1st Qu.: 7.400   1st Qu.:72.00
 Median : 31.50   Median :205.0   Median : 9.700   Median :79.00
 Mean   : 42.13   Mean   :185.9   Mean   : 9.958   Mean   :77.88
 3rd Qu.: 63.25   3rd Qu.:258.8   3rd Qu.:11.500   3rd Qu.:85.00
 Max.   :168.00   Max.   :334.0   Max.   :20.700   Max.   :97.00
 NA's   :37       NA's   :7
     Month            Day
 Min.   :5.000   Min.   : 1.0
 1st Qu.:6.000   1st Qu.: 8.0
 Median :7.000   Median :16.0
 Mean   :6.993   Mean   :15.8
 3rd Qu.:8.000   3rd Qu.:23.0
 Max.   :9.000   Max.   :31.0

The data() function loads the airquality data set.

The attach() function adds the data set to the R search path, so its columns can be referred to by name.

The summary() function gives the range, quartiles, median and mean for numerical variables, and a table of frequencies for categorical variables.

Data visualization

I use boxplots to visualize the daily temperature for months 5, 6, 7, 8 and 9.

> par(mfrow = c(3,2)) # 3 rows and 2 columns: one panel per month
> month5=subset(airquality,Month==5) # '==' filters rows; 'Month=5' would return every row
> boxplot(Temp~Day,data=month5,main="month5",col=rainbow(3))
> month6=subset(airquality,Month==6)
> boxplot(Temp~Day,data=month6,main="month6",col=rainbow(3))
> month7=subset(airquality,Month==7)
> boxplot(Temp~Day,data=month7,main="month7",col=rainbow(3))
> month8=subset(airquality,Month==8)
> boxplot(Temp~Day,data=month8,main="month8",col=rainbow(3))
> month9=subset(airquality,Month==9)
> boxplot(Temp~Day,data=month9,main="month9",col=rainbow(3))

Output of box plot:
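Since each month holds only one temperature reading per day, a single boxplot of temperature grouped by month may be easier to read. This is an alternative sketch, not the original author's plot; aq and med are illustrative names.

```r
aq <- datasets::airquality

# One box per month: the formula Temp ~ Month groups temperatures by month.
boxplot(Temp ~ Month, data = aq,
        main = "Daily temperature by month",
        xlab = "Month", ylab = "Temp",
        col = rainbow(5))

# The same comparison in numbers: median temperature per month.
med <- tapply(aq$Temp, aq$Month, median)
med
```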

Now use a histogram to see the distribution of the temperature data.

> hist(airquality$Temp,col=rainbow(2))

Now use scatter plots to see if there is a linear pattern between temperature and the other variables.

> plot(Temp~Day+Solar.R+Wind+Ozone,data=airquality,col=blues9)
Hit <Return> to see next plot:

It seems that Solar.R, Ozone and Wind each have a roughly linear pattern with temperature. Solar.R and Ozone have a positive relationship and Wind has a negative one.
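The visual impression can be cross-checked with correlation coefficients. The sketch below assumes a fresh copy of airquality and uses only complete rows; r is an illustrative name.

```r
aq <- datasets::airquality

# Pearson correlation of each candidate predictor with Temp,
# computed on rows where all four values are present.
r <- cor(aq[, c("Solar.R", "Ozone", "Wind")], aq$Temp, use = "complete.obs")
r   # positive entries move with temperature, negative entries move against it
```

A correlation only measures the strength of a linear association, so it complements the scatter plots rather than replacing them.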

Now use a co-plot to see how Ozone varies with Solar.R at different levels of Wind.

> coplot(Ozone~Solar.R|Wind,panel=panel.smooth,airquality,col="green")
Missing rows: 5, 6, 10, 11, 25, 26, 27, 32, 33, 34, 35, 36, 37, 39, 42, 43, 45, 46, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 65, 72, 75, 83, 84, 96, 97, 98, 102, 103, 107, 115, 119, 150

Now it’s time to run linear regression on our data set, using the lm function.

The function lm fits a linear model to the data, where Temperature (the dependent variable) is on the left-hand side, separated by a ~ from the independent variables.

Data preparation:

The input data needs to be processed before we use it in our algorithm. This means dealing with rows that have missing values, and checking for correlation and outliers. While building the model, lm() by default drops the rows where data is missing (NA). This results in data loss.

There are different methods to deal with missing values, such as imputing the mean for numerical variables and the mode for categorical variables. Another method is to replace missing values with a sentinel value that lies outside the valid range of the variable.

For example, we can replace a missing value with -1 when the variable is age. Since age cannot be negative, the sentinel is easy to recognize; note, however, that R does not treat it specially, so such rows must be flagged or handled explicitly before fitting the model.

We can use the following command to find the column-wise count of missing values in the data.

> sapply(airquality,function(x){sum(is.na(x))})
  Ozone Solar.R    Wind    Temp   Month     Day
     37       7       0       0       0       0

You can see that there are missing values in Ozone and Solar.R. We could drop those rows, but that would result in data loss: there are just 153 rows in our data, so dropping the 37 rows with missing Ozone alone would be a loss of about 24%.

Hence, we will replace the missing values with the mean (since both variables are numerical).

> airquality$Ozone[is.na(airquality$Ozone)]=mean(airquality$Ozone,na.rm=TRUE)
> airquality$Solar.R[is.na(airquality$Solar.R)]=mean(airquality$Solar.R,na.rm=TRUE)
> sapply(airquality,function(x){sum(is.na(x))})
  Ozone Solar.R    Wind    Temp   Month     Day
      0       0       0       0       0       0
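The trade-off between dropping rows and imputing can be sketched as follows. impute_mean is a hypothetical helper written for this example, not a built-in R function, and aq is a fresh copy of the data with its original missing values.

```r
aq <- datasets::airquality                   # untouched copy with the original NAs

# Share of rows that listwise deletion (dropping any row with an NA) would lose.
lost <- 1 - mean(complete.cases(aq))
lost

# Hypothetical helper: replace NAs in a numeric vector with the vector's mean.
impute_mean <- function(x) {
  x[is.na(x)] <- mean(x, na.rm = TRUE)
  x
}

aq_imputed <- aq
aq_imputed[c("Ozone", "Solar.R")] <- lapply(aq[c("Ozone", "Solar.R")], impute_mean)

colSums(is.na(aq_imputed))                   # all zeros after imputation
```

Mean imputation keeps every row and leaves the column means unchanged, but it shrinks the variance of the imputed columns, which is worth keeping in mind when reading the standard errors later.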

Now, let’s check the correlation between independent variables.

We use the corrplot library to visualize the correlation between variables.

> install.packages("corrplot") # only needed once
> library(corrplot)
> o=corrplot(cor(airquality),method = 'number') # the method can be changed, try method = 'circle'

When two variables are highly correlated, we can drop one of them. Alternatively, if we have good knowledge of the data, we can form a new variable from the two.

For example, if ‘expenditure’ and ‘income’ are highly correlated, we can create a new variable called ‘savings’ by taking the difference between ‘income’ and ‘expenditure’. We can do this only if we have domain knowledge.

Let’s see the effect of multicollinearity (without dropping a parameter) on our model.

> Model_lm1=lm(Temp~.,data=airquality)
> summary(Model_lm1)

Call:
lm(formula = Temp ~ ., data = airquality)

Residuals:
     Min       1Q   Median       3Q      Max
-19.6924  -4.3178   0.6003   3.8249  16.2020

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) 59.349707   4.052346  14.646  < 2e-16 ***
Ozone        0.142977   0.023515   6.080 9.87e-09 ***
Solar.R      0.014273   0.006589   2.166   0.0319 *
Wind        -0.423795   0.182787  -2.319   0.0218 *
Month        2.252433   0.389039   5.790 4.11e-08 ***
Day         -0.106120   0.061414  -1.728   0.0861 .
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 6.628 on 147 degrees of freedom
Multiple R-squared: 0.5258, Adjusted R-squared: 0.5097
F-statistic: 32.6 on 5 and 147 DF, p-value: < 2.2e-16

Before we interpret the results, I am going to tune the model for a low AIC value.

The Akaike Information Criterion (AIC) is a measure of the relative quality of statistical models for a given set of data.

Given a collection of models for the data, AIC estimates the quality of each model, relative to each of the other models. Hence, AIC provides a means for model selection.

You can tune the model for a low AIC in two ways:

1) By eliminating some less significant variables and re-running the model

2) By using the step function in R. By default, starting from the full model, step drops one variable at a time and keeps the change that gives the lowest AIC, stopping when no change lowers it further.

I am going to use the second method here.

> Model_lm_best=step(Model_lm1)
Start: AIC=584.62
Temp ~ Ozone + Solar.R + Wind + Month + Day

          Df Sum of Sq    RSS    AIC
<none>                 6457.8 584.62
- Day      1    131.17 6588.9 585.69
- Solar.R  1    206.13 6663.9 587.43
- Wind     1    236.15 6693.9 588.11
- Month    1   1472.59 7930.3 614.05
- Ozone    1   1624.07 8081.8 616.94
> summary(Model_lm_best)

Call:
lm(formula = Temp ~ Ozone + Solar.R + Wind + Month + Day, data = airquality)

Residuals:
     Min       1Q   Median       3Q      Max
-19.6924  -4.3178   0.6003   3.8249  16.2020

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) 59.349707   4.052346  14.646  < 2e-16 ***
Ozone        0.142977   0.023515   6.080 9.87e-09 ***
Solar.R      0.014273   0.006589   2.166   0.0319 *
Wind        -0.423795   0.182787  -2.319   0.0218 *
Month        2.252433   0.389039   5.790 4.11e-08 ***
Day         -0.106120   0.061414  -1.728   0.0861 .
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 6.628 on 147 degrees of freedom
Multiple R-squared: 0.5258, Adjusted R-squared: 0.5097
F-statistic: 32.6 on 5 and 147 DF, p-value: < 2.2e-16
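Both tuning routes can be tried side by side. This sketch assumes complete rows only (na.omit) rather than the mean-imputed data used above, so its numbers will differ from the output shown; reduced is an arbitrary smaller model chosen for illustration. Note that AIC() and step() report AIC on different scales, but for a fixed data set they rank models identically.

```r
aq <- na.omit(datasets::airquality)     # same rows for every model being compared

# Route 1: drop variables by hand and compare AIC values (lower is better).
full    <- lm(Temp ~ Ozone + Solar.R + Wind + Month + Day, data = aq)
reduced <- lm(Temp ~ Ozone + Month, data = aq)   # illustrative smaller model
AIC(full, reduced)

# Route 2: let step() search; trace = 0 suppresses the printed search path.
best <- step(full, trace = 0)
formula(best)                           # the model step() settled on
```

Models should only be compared by AIC when they are fitted to exactly the same rows, which is why the missing values are removed once, up front.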

This summary gives the estimated coefficients of the independent variables, with their standard errors and significance levels.

The ‘Signif. codes’ line in the result shows how to read the level of significance. Three asterisks mean that the p-value in the Pr(>|t|) column is below 0.001, i.e. the variable is significant at the 99.9% level; two asterisks mean p < 0.01, and one asterisk means p < 0.05.

The R square and adjusted R square values tell how much of the variance of the dependent variable is explained by the model; the rest is left to the error term. Hence, the higher the R square or adjusted R square, the better the model fits.

Adjusted R square is a better indicator of explained variance when comparing models, because plain R square never decreases when a variable is added, even an irrelevant one. In other words, adjusted R square penalizes the inclusion of many variables in the model for the sake of a high percentage of variance explained.
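How the two values in the summary are obtained can be sketched by hand. This assumes complete rows only and a three-predictor model chosen for illustration; rss, tss, n and p are illustrative names.

```r
aq <- na.omit(datasets::airquality)
fit <- lm(Temp ~ Ozone + Solar.R + Wind, data = aq)

rss <- sum(residuals(fit)^2)                 # variation left unexplained by the model
tss <- sum((aq$Temp - mean(aq$Temp))^2)      # total variation in Temp
n <- nrow(aq)                                # number of observations
p <- length(coef(fit)) - 1                   # number of predictors (intercept excluded)

r2 <- 1 - rss / tss
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - p - 1)   # penalty grows with p

c(r2 = r2, adj_r2 = adj_r2)
```

The factor (n - 1) / (n - p - 1) is what makes the adjusted value fall when an added variable does not reduce rss enough to justify itself.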

VIF and Multicollinearity

The variance inflation factor (VIF) measures multicollinearity, and it is computed from a coefficient of determination (R2).

If two independent variables are highly correlated, the variance of their estimated coefficients is inflated, which makes the estimates unstable.

To deal with this, we can check VIF of the model before and after dropping one of the two highly correlated variables.

Formula for VIF:

VIF(k) = 1/(1 - Rk^2)

where Rk^2 is the R2 value obtained by regressing the kth predictor on the remaining predictors.

So to calculate VIF, we fit a model for each independent variable, with all other independent variables as predictors. Then we calculate VIF for each variable. Whenever VIF is high, it means that the other variables are highly correlated with the selected variable.

We will use an R library called ‘fmsb’ to calculate VIF.

Here we check VIF for the predictors Ozone, Solar.R and Month.

> library(fmsb)
> Model_lm1=lm(Temp~ Ozone+Solar.R+Month,data=airquality)
> VIF(lm(Month ~ Ozone+Solar.R,data=airquality))
[1] 1.039042
> VIF(lm(Ozone ~ Solar.R+Month, data=airquality))
[1] 1.137975
> VIF(lm(Solar.R ~ Ozone +Month, data=airquality))
[1] 1.118629

As a general rule, VIF < 5 is acceptable (VIF = 1 means there is no multicollinearity), while VIF > 5 is not acceptable and means we need to re-check our model.

In our example, every VIF is below 5, and hence no additional verification is needed.
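The VIF calculation can also be done directly from its definition, without fmsb. A sketch under the assumption that complete rows of a fresh airquality copy are used, so the values will differ slightly from the output above; vif_by_hand is an illustrative name.

```r
aq <- na.omit(datasets::airquality)
predictors <- c("Ozone", "Solar.R", "Month")

# For each predictor k: regress it on the other predictors,
# take that regression's R-squared, and apply 1 / (1 - R^2).
vif_by_hand <- sapply(predictors, function(k) {
  others <- setdiff(predictors, k)
  r2_k <- summary(lm(reformulate(others, response = k), data = aq))$r.squared
  1 / (1 - r2_k)
})

vif_by_hand
```

Because R-squared lies between 0 and 1, a VIF can never be below 1, and it grows without bound as a predictor becomes a near-perfect linear combination of the others.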

Interpretation of results

Basic assumptions of linear regression:

Linear relationship between variables

Normal distribution of residuals

No or little multi-collinearity: we have seen this using VIF

Homoscedasticity: Variance across the regression line should be uniform

R displays the summary of the model and gives the intercept and the coefficients of all independent variables, along with their standard errors and a summary of the residuals.

The linear relationship between variables is supported by the significance (p value) of the variables.

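In the ‘Residuals vs fitted values’ plot it can be seen that the residuals are scattered evenly around zero without a clear pattern, and hence the variance is roughly uniform.

In the ‘Normal Q-Q’ plot it can be seen that the residuals are approximately normally distributed, because the points lie close to the straight line. This can also be seen by plotting a histogram of the residuals.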

> hist(Model_lm_best$residuals)
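The two diagnostic plots mentioned here can be drawn from any fitted lm object. The sketch below assumes a model fitted on complete rows of a fresh airquality copy, so it is not the exact Model_lm_best from above; the Shapiro-Wilk test is an optional numerical check added for illustration.

```r
aq <- na.omit(datasets::airquality)
fit <- lm(Temp ~ Ozone + Solar.R + Wind + Month + Day, data = aq)

par(mfrow = c(1, 2))      # two panels side by side
plot(fit, which = 1)      # Residuals vs Fitted: look for an even band around zero
plot(fit, which = 2)      # Normal Q-Q: look for points close to the reference line

# Numerical companion to the Q-Q plot: a small p-value casts doubt on normality.
sw <- shapiro.test(residuals(fit))
sw$p.value
```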

There are many ways to measure the quality of a model, but the residual sum of squares is the most common one.

Least-squares fitting finds the ‘line of best fit’ through the scattered data points by minimizing the residual sum of squares, so that the line has the least error with respect to the actual data points.

If Y is the actual data point and Y’ is the value predicted by the equation, then the error is Y-Y’. Simply summing these errors is misleading, because positive and negative values cancel each other, so the total would understate the real error. To overcome this, a general method is to square each error, which serves two purposes:

1) Cancel out the effect of signs

2) Penalize large errors in prediction more heavily
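The cancellation of signs is easy to see numerically. A sketch, assuming a three-predictor model on complete rows; errors and rss are illustrative names.

```r
aq <- na.omit(datasets::airquality)
fit <- lm(Temp ~ Ozone + Solar.R + Wind, data = aq)

errors <- aq$Temp - fitted(fit)   # Y - Y' for every observation

sum(errors)                       # practically zero: positive and negative errors cancel
rss <- sum(errors^2)              # residual sum of squares: no cancellation
rss
```

For a model with an intercept, the raw errors always sum to (numerically) zero, which is exactly why the plain sum is useless as a quality measure.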

Prediction

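To make a prediction, let’s build a data frame from the values of Solar.R, Wind, Ozone and Month. These columns come from the attached copy of airquality, which was made before the missing values were imputed, so rows with NA in Solar.R or Ozone produce NA predictions. The model also uses Day, which predict() finds on the search path through the same attached copy.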

> new_data=data.frame(Solar.R,Wind,Ozone,Month)
> new_data
    Solar.R Wind Ozone Month
1       190  7.4    41     5
2       118  8.0    36     5
3       149 12.6    12     5
4       313 11.5    18     5
5        NA 14.3    NA     5
6        NA 14.9    28     5
7       299  8.6    23     5
8        99 13.8    19     5
9        19 20.1     8     5
10      194  8.6    NA     5
11       NA  6.9     7     5
12      256  9.7    16     5
13      290  9.2    11     5
14      274 10.9    14     5
15       65 13.2    18     5
16      334 11.5    14     5
17      307 12.0    34     5
18       78 18.4     6     5
19      322 11.5    30     5
20       44  9.7    11     5
21        8  9.7     1     5
22      320 16.6    11     5
23       25  9.7     4     5
24       92 12.0    32     5
25       66 16.6    NA     5
26      266 14.9    NA     5
27       NA  8.0    NA     5
28       13 12.0    23     5
29      252 14.9    45     5
30      223  5.7   115     5
31      279  7.4    37     5
32      286  8.6    NA     6
33      287  9.7    NA     6
34      242 16.1    NA     6
35      186  9.2    NA     6
36      220  8.6    NA     6
37      264 14.3    NA     6
38      127  9.7    29     6
39      273  6.9    NA     6
40      291 13.8    71     6
41      323 11.5    39     6
42      259 10.9    NA     6
43      250  9.2    NA     6
44      148  8.0    23     6
45      332 13.8    NA     6
46      322 11.5    NA     6
47      191 14.9    21     6
48      284 20.7    37     6
49       37  9.2    20     6
50      120 11.5    12     6
51      137 10.3    13     6
52      150  6.3    NA     6
53       59  1.7    NA     6
54       91  4.6    NA     6
55      250  6.3    NA     6
56      135  8.0    NA     6
57      127  8.0    NA     6
58       47 10.3    NA     6
59       98 11.5    NA     6
60       31 14.9    NA     6
61      138  8.0    NA     6
62      269  4.1   135     7
63      248  9.2    49     7
64      236  9.2    32     7
65      101 10.9    NA     7
66      175  4.6    64     7
67      314 10.9    40     7
68      276  5.1    77     7
69      267  6.3    97     7
70      272  5.7    97     7
71      175  7.4    85     7
72      139  8.6    NA     7
73      264 14.3    10     7
74      175 14.9    27     7
75      291 14.9    NA     7
76       48 14.3     7     7
77      260  6.9    48     7
78      274 10.3    35     7
79      285  6.3    61     7
80      187  5.1    79     7
81      220 11.5    63     7
82        7  6.9    16     7
83      258  9.7    NA     7
84      295 11.5    NA     7
85      294  8.6    80     7
86      223  8.0   108     7
87       81  8.6    20     7
88       82 12.0    52     7
89      213  7.4    82     7
90      275  7.4    50     7
91      253  7.4    64     7
92      254  9.2    59     7
93       83  6.9    39     8
94       24 13.8     9     8
95       77  7.4    16     8
96       NA  6.9    78     8
97       NA  7.4    35     8
98       NA  4.6    66     8
99      255  4.0   122     8
100     229 10.3    89     8
101     207  8.0   110     8
102     222  8.6    NA     8
103     137 11.5    NA     8
104     192 11.5    44     8
105     273 11.5    28     8
106     157  9.7    65     8
107      64 11.5    NA     8
108      71 10.3    22     8
109      51  6.3    59     8
110     115  7.4    23     8
111     244 10.9    31     8
112     190 10.3    44     8
113     259 15.5    21     8
114      36 14.3     9     8
115     255 12.6    NA     8
116     212  9.7    45     8
117     238  3.4   168     8
118     215  8.0    73     8
119     153  5.7    NA     8
120     203  9.7    76     8
121     225  2.3   118     8
122     237  6.3    84     8
123     188  6.3    85     8
124     167  6.9    96     9
125     197  5.1    78     9
126     183  2.8    73     9
127     189  4.6    91     9
128      95  7.4    47     9
129      92 15.5    32     9
130     252 10.9    20     9
131     220 10.3    23     9
132     230 10.9    21     9
133     259  9.7    24     9
134     236 14.9    44     9
135     259 15.5    21     9
136     238  6.3    28     9
137      24 10.9     9     9
138     112 11.5    13     9
139     237  6.9    46     9
140     224 13.8    18     9
141      27 10.3    13     9
142     238 10.3    24     9
143     201  8.0    16     9
144     238 12.6    13     9
145      14  9.2    23     9
146     139 10.3    36     9
147      49 10.3     7     9
148      20 16.6    14     9
149     193  6.9    30     9
150     145 13.2    NA     9
151     191 14.3    14     9
152     131  8.0    18     9
153     223 11.5    20     9

> pred_temp=predict(Model_lm_best,newdata=new_data)
> pred_temp
        1         2         3         4         5         6         7
 75.94367  73.84070  68.79616  72.35493        NA        NA  73.78063
        8         9        10        11        12        13        14
 68.04417  62.55352        NA        NA  71.16926  71.04545  70.41943
       15        16        17        18        19        20        21
 66.92733  70.80933  72.96546  62.87507  72.60731  66.57943  64.52970
       22        23        24        25        26        27        28
 67.38249  64.98904  68.86786        NA        NA        NA  66.02899
       29        30        31        32        33        34        35
 71.25071  84.63794  73.45851        NA        NA        NA        NA
       36        37        38        39        40        41        42
       NA        NA  73.96970        NA  80.36578  77.11588        NA
       43        44        45        46        47        48        49
       NA  73.49532        NA        NA  70.58058  71.63151  70.44288
       50        51        52        53        54        55        56
 69.40292  70.19098        NA        NA        NA        NA        NA
       57        58        59        60        61        62        63
       NA        NA        NA        NA        NA  96.41447  81.55126
       64        65        66        67        68        69        70
 78.84326        NA  84.28504  80.06159  87.16122  89.27762  89.49714
       71        72        73        74        75        76        77
 85.57032        NA  72.98099  73.78086        NA  69.15063  81.06862
       78        79        80        81        82        83        84
 77.86272  83.32619  84.90340  80.26839  72.35157        NA        NA
       85        86        87        88        89        90        91
 84.55975  87.69784  72.72866  75.77116  83.77363  79.97721  81.55875
       92        93        94        95        96        97        98
 79.98919  81.09965  72.93791  77.30141        NA        NA        NA
       99       100       101       102       103       104       105
 96.01404  88.14867  91.70577        NA        NA  80.25357  79.01597
      106       107       108       109       110       111       112
 83.30709        NA  75.46506  82.05879  77.25284  78.64853  79.88461
      113       114       115       116       117       118       119
 75.27117  70.77489        NA  80.17140 100.69243  84.72578        NA
      120       121       122       123       124       125       126
 84.05074  93.39974  86.90851  86.24597  92.70072  91.21206  91.16596
      127       128       129       130       131       132       133
 92.95623  84.03080  78.30447  80.71585  80.83618  80.33257  81.57786
      134       135       136       137       138       139       140
 81.79925  78.47868  82.97257  75.14591  76.61348  84.95924  77.74003
      141       142       143       144       145       146       147
 75.59043  80.06876  79.26544  77.30905  76.87634  79.94693  74.40987
      148       149       150       151       152       153
 72.22074  80.98238        NA  75.31788  77.59717  77.60688
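Predicting for genuinely new conditions works the same way. In this sketch the values in new_day are hypothetical, and the model is refit on complete rows without Day so that every predictor is supplied explicitly.

```r
aq <- na.omit(datasets::airquality)
fit <- lm(Temp ~ Ozone + Solar.R + Wind + Month, data = aq)

# Hypothetical conditions for one July day; the column names must match the predictors.
new_day <- data.frame(Ozone = 40, Solar.R = 200, Wind = 10, Month = 7)

# interval = "prediction" adds lower and upper bounds for a single new observation.
pred <- predict(fit, newdata = new_day, interval = "prediction", level = 0.95)
pred   # columns: fit (point prediction), lwr, upr
```

Reporting the interval alongside the point prediction shows how much uncertainty remains, which a single predicted temperature hides.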

Conclusion

The regression algorithm assumes that the residuals are normally distributed and that there is a linear relation between the dependent and independent variables.

It is a good mathematical model to analyze the relationship and significance of various variables.

Reference :- https://www.edvancer.in/step-step-guide-to-execute-linear-regression-r/

**************************************THE END*********************************