Anderson-Darling Test


Upload: pandyakavi

Post on 05-Feb-2016


DESCRIPTION

The Anderson–Darling test is a statistical test of whether a given sample of data is drawn from a given probability distribution. In its basic form, the test assumes that there are no parameters to be estimated in the distribution being tested, in which case the test and its set of critical values are distribution-free.

TRANSCRIPT

Page 1: Anderson Darling Test

Outline

1. Background

2. Motivation

3. Objective

4. Description of Data

5. Intelligence extracted from Data

a. Using Scatter Plots and Null Hypothesis

b. Graphs of Correlation

6. Use of R Programming

a. R-code

b. Module wise description

7. What you have learnt from this Project?

8. Summary

9. Innovation finds in the field of Communication


1. Background

With ever-increasing data traffic, both on the web and in inventories, we have reached the stage of dealing with Big Data. We have abundant data ready to be exploited, but it is of no use until we analyze it and make some meaning out of it.

We therefore use the probability-based approach to data analysis to understand data, its behaviour and its underlying characteristics. The reasons why probability-based data analysis is important are discussed below.

Why is probability-based data analysis important?

● Probability-based data analysis comprises many statistical techniques for analyzing facts.

● It helps to identify which data is correct and to understand in detail how the data is processed.

● Through such techniques, we can identify patterns in the given data.

● We can compute statistics on the data and extract useful information that supports our conclusions.

● It supports decision making: what we interpret from the data can guide important decisions.

● Data analysis should yield a definite probability value from which information can be extracted easily.

● It can be used in business, in social science and in any other area where a statistical conclusion is needed.

● For example, business decisions, population studies and consumer data from various places call for probability-based data analysis. It is also useful for weather data and for banking data.

2. Motivation


The AD test is one of the statistical tests applied to data to understand its behaviour and exploit its characteristics. Its advantages over other tests are listed below:

Advantages of AD Test

1. Determine type of Distribution:

-> The AD test can be used to determine which distribution a data set follows. It can test the data against a list of candidate distributions such as the Weibull, exponential, log-normal and normal distributions.

-> Once the type of distribution is known, we can describe the characteristics the data follows and comment on its behaviour.

2. Better Test-Statistics than others:

-> According to M. A. Stephens, the test statistic of the AD test is one of the best, as it can easily be used to find deviations and departures of data from normality. [1]

3. Critical values are distribution specific:

-> Critical values within ad.test() depend on the distribution being studied. This makes the AD test preferable to the KS test, where critical values are independent of the distribution being studied.

-> Because the critical values depend on the distribution, ad.test is distribution oriented, which also increases the sensitivity of the test. [2]

4. Best Fit both for small and large samples:

-> The modified test statistic is given by A*(1+0.75/n+2.25/n^2), where A is the test statistic and n is the sample size. The modification accounts for small sample sizes, so the test can cater to both small and large data sets and thus acts as the best distance test. [3]
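To see how much the small-sample modification matters, the multiplying factor can be evaluated for a few sample sizes. This is only a sketch: ad_correction is a hypothetical helper name, and the sample sizes are chosen for illustration.

```r
# Multiplying factor applied to the test statistic A (hypothetical helper name).
ad_correction <- function(n) 1 + 0.75 / n + 2.25 / n^2

# The factor shrinks towards 1 as n grows, so it only matters for small samples.
n <- c(8, 20, 100, 1000, 525206)
data.frame(n = n, factor = round(ad_correction(n), 6))
```

For n = 10 the factor is 1.0975, while for the full data set it is indistinguishable from 1 at six decimal places.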

3. Objective


What are you going to do with an AD test in data analysis and communication algorithms?

-> The main objective of the AD test is to find the type of distribution the data follows and to predict its behaviour accordingly.

-> Each type of distribution has specific characteristics of its own. These characteristics tell us about the behaviour of the data being studied, and analyzing data according to its behaviour helps us reach refined conclusions and find the particular pattern being followed.

-> Anderson-Darling statistics are used in goodness-of-fit tests for the Gompertz distribution, which in turn is used to model life spans, such as the life cycle of an electronic item or the rate at which code fails, and is widely used for the life span of living organisms. With some modifications, the Anderson-Darling test is also used to examine the upper and lower tails of many distributions. [4] [5]

-> The Anderson-Darling technique is used in cognitive radio. Cognitive radio is the concept in which the unused part of the spectrum is supplied to a secondary user while still catering to the requirements of the primary user. In such a system the distribution of the signal can be modeled by a Gaussian distribution, and the received signal is then compared with the noise distribution. If we have a priori information about the noise distribution, we can use the Anderson-Darling test to check whether the received signals are drawn from the noise distribution. This method is also called Anderson-Darling sensing. [6]

4. Description of Data

The collected data monitors the weather and atmospheric conditions in and around the James Clerk Maxwell Building, located in Edinburgh, U.K.

Data is taken from: http://www.ed.ac.uk/schools-departments/geosciences/weather-station/download-weather-data

It records the following parameters:

i. Atmospheric Pressure (mBar)
ii. Rainfall (mm)
iii. Wind Speed (m/s)
iv. Wind Direction (degrees)
v. Surface Temperature (Celsius)
vi. Relative Humidity (%)
vii. Solar Flux (kW/m2)
viii. Battery (Volts)

All of the above readings are taken every minute, from January 1 to December 31 of the year 2014. This results in a total of 525,206 records.

The data file “JCMB_2014.csv” is thus minute-by-minute data on weather conditions such as atmospheric pressure and rainfall at the James Clerk Maxwell Building. A full year of one-minute readings would give 60 minutes x 24 hours x 365 days = 525,600 entries; the file contains 525,206, slightly fewer than that nominal count.
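As a quick check of the record count in R (525206 is the figure quoted for the file; the variable names are illustrative):

```r
nominal  <- 60 * 24 * 365   # minutes in a non-leap year
recorded <- 525206          # record count quoted for JCMB_2014.csv
nominal                     # one reading per minute for the whole year
nominal - recorded          # how many readings short of that the file is
```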

5. Intelligence extracted from Data

a. Using Scatter Plots and other tools

b. Null hypothesis testing

What is Normal Distribution?

A normal distribution is the probability distribution of a normal random variable X with mean $\mu$ and variance $\sigma^2$. It is a statistical probability distribution with probability density function (PDF): [7]

$$f(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$$

This probability distribution is a symmetric “bell-shaped” curve. The centre peak of the bell moves as we change the value of the mean ($\mu$), and the broadness of the bell curve varies as we vary the value of the variance ($\sigma^2$).
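The effect of the mean and the variance on the bell curve can be checked numerically. The sketch below writes out the standard normal density by hand (normal_pdf is a hypothetical helper name) and compares it with R's built-in dnorm().

```r
# Normal density written out from its definition (hypothetical helper).
normal_pdf <- function(x, mu, sigma) {
  exp(-(x - mu)^2 / (2 * sigma^2)) / (sigma * sqrt(2 * pi))
}

x <- seq(-4, 4, by = 0.5)
all.equal(normal_pdf(x, 0, 1), dnorm(x, mean = 0, sd = 1))

# Changing the mean moves the peak along the x axis.
x[which.max(normal_pdf(x, mu = 1, sigma = 1))]

# A larger standard deviation gives a lower, broader bell.
normal_pdf(0, mu = 0, sigma = 2) < normal_pdf(0, mu = 0, sigma = 1)
```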

Not every bell-shaped curve represents the normal distribution. The normal distribution has no shape parameter: the mean and variance only shift and scale the curve. Symmetry alone is also not enough to identify it, because other distributions have a symmetric bell-shaped curve as well, as we can see from the following:

Therefore, in order to determine a specific distribution, one has to perform many tests and also test the alternative models. [8]

NOTE: The reason for using t.test()

The Anderson-Darling test in the nortest package can be used to determine whether the data follows a normal distribution or not. If we want to build a hypothesis about the mean, it is preferable to use t.test(), as it directly compares the actual mean with the assumed mean. The Anderson-Darling test, in contrast, is distribution specific, with the test statistic changing for different distributions.

5.a Using Scatter Plots and Null Hypothesis


5.1 Atmospheric Pressure

1. AD Normality Test

H0 = Atmospheric Pressure is following Normal Distribution

H1 = Atmospheric Pressure is not following Normal Distribution

● Atmospheric Pressure does not follow a Normal Distribution. Thus, we reject our null hypothesis.

● As we can observe from the histogram, the value of Atmospheric Pressure ranges mainly from 1000 mBar to 1025 mBar.

● The average yearly Atmospheric Pressure assumed for Edinburgh is 1013.25 mBar.

2. Student’s T Test

H0 = The mean of Atmospheric Pressure is 1013.25 mBar

H1 = The mean of Atmospheric Pressure is not 1013.25 mBar

2.a. For 100 Samples

● The null hypothesis that Atmospheric Pressure has a mean of 1013.25 mBar is rejected at the 95% confidence level.

● All of the values are nearly the same, so the graph is almost constant and t.test() is not reliable for this sample.

2.b. For 1000 Samples

● The null hypothesis that Atmospheric Pressure has a mean of 1013.25 mBar is rejected at the 95% confidence level.

● The first 1000 samples cover 16.67 hours of January 1, i.e. winter, so the temperature is low.

● Taking atmospheric pressure to rise with temperature, the pressure is much lower than expected, which agrees with our test results.

2.c. For 10,000 Samples

● The null hypothesis that Atmospheric Pressure has a mean of 1013.25 mBar is rejected at the 95% confidence level.

● The first 10,000 samples cover about 7 days from January 1, i.e. winter, so the temperature is low.

● As the samples span a small range and the sample size has increased, the mean is lower than that of the 1000 samples.

● Taking atmospheric pressure to rise with temperature, the pressure is much lower than expected, which agrees with our test results.

2.d. For 100,000 Samples

● The null hypothesis that Atmospheric Pressure has a mean of 1013.25 mBar is rejected at the 95% confidence level.

● The first 100,000 samples cover about 2.3 months, still winter, so the temperature is low.

● Over these 2.3 months the temperature has gradually started increasing, but not by much, so the atmospheric pressure is higher than in the previous case.

● Taking atmospheric pressure to rise with temperature, the pressure is still lower than expected, which agrees with our test results.

2.e. For ALL Samples

● The null hypothesis that Atmospheric Pressure has a mean of 1013.25 mBar is rejected at the 95% confidence level.

● The data covers a full year at a place in the UK. As the UK is cold and the site is above sea level, the temperature is lower than at a place at sea level, whereas the null hypothesis uses 1013.25 mBar, which is the general pressure at sea level.

● So the atmospheric pressure is lower than expected, which agrees with our test results.
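The procedure used in 2.a to 2.e, running the same one-sample t-test on the first n readings for growing n, can be sketched as follows. The pressure vector here is synthetic, drawn independently from a normal distribution with an arbitrary mean and spread; it is a stand-in for the real column, not a reproduction of it.

```r
set.seed(1)
# Hypothetical stand-in for the pressure column: one reading per minute.
pressure <- rnorm(200000, mean = 1005, sd = 12)

sizes <- c(100, 1000, 10000, 100000, length(pressure))
results <- sapply(sizes, function(n) {
  # Test H0: mean = 1013.25 on the first n readings only.
  tt <- t.test(head(pressure, n), mu = 1013.25, conf.level = 0.95)
  c(n = n, mean = unname(tt$estimate), p.value = tt$p.value)
})
t(results)   # one row per sample size
```

A p-value below 0.05 in a row corresponds to rejecting the null hypothesis at the 95% confidence level for that sample size.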

5.2 Relative Humidity

1. AD Normality Test

H0 = Relative Humidity is following Normal Distribution

H1 = Relative Humidity is not following Normal Distribution

● Relative Humidity does not follow a Normal Distribution. Thus, we reject our null hypothesis.

● As we can observe from the histogram, the value of Relative Humidity ranges mainly from 72% to 90%.

● The average yearly Relative Humidity of Edinburgh is 80.18249 %.

2. Student’s T Test

H0 = The mean of Relative Humidity is 82.91667 %

H1 = The mean of Relative Humidity is not 82.91667 %

2.a. For 100 Samples

● The null hypothesis that Relative Humidity has a mean of 82.91667 % is rejected at the 95% confidence level.

● The first 100 samples cover 1.667 hours of January 1, i.e. winter, so the temperature is low.

● Taking humidity to be lower in a cold atmosphere, the humidity is much lower than expected, which agrees with our test results.

2.b. For 1000 Samples

● The null hypothesis that Relative Humidity has a mean of 82.91667 % is rejected at the 95% confidence level.

● The first 1000 samples cover 16.67 hours of January 1, i.e. winter, so the temperature is low.

● The atmosphere is still cool, so the humidity does not change much. The average changes but does not reach the expected value, which agrees with the test.

2.c. For 10,000 Samples

● The null hypothesis that Relative Humidity has a mean of 82.91667 % is rejected at the 95% confidence level.

● The first 10,000 samples cover about 7 days from January 1, i.e. winter, so the temperature is low.

● The atmosphere is still cool, so the humidity does not change much. The average decreases as the variations over several days come in, but it does not reach the expected value, which agrees with the test.

2.d. For 100,000 Samples

● The null hypothesis that Relative Humidity has a mean of 82.91667 % is rejected at the 95% confidence level.

● The first 100,000 samples cover about 2.3 months, still winter, so the temperature is low.

● Over these 2.3 months the temperature has gradually started increasing, but not by much, and the humidity changes with it compared with the previous case.

● Hence the relative humidity has increased, but not up to the expected mean.

2.e. For ALL Samples

● The null hypothesis that Relative Humidity has a mean of 82.91667 % is rejected at the 95% confidence level.

● Here, although the relative humidity increases as we reach the monsoon season, the mean over the large sample of 525206 readings decreases instead of increasing.

● Hence, instead of increasing and satisfying the condition, the mean decreases, as confirmed by the test.
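The elapsed times quoted for each sample size follow from the rate of one reading per minute. A minimal sketch of the conversion, where the 30-day month in the last column is an assumption made only for this illustration:

```r
samples <- c(100, 1000, 10000, 100000)
elapsed <- data.frame(
  samples = samples,
  hours   = round(samples / 60, 3),
  days    = round(samples / (60 * 24), 2),
  months  = round(samples / (60 * 24 * 30), 2)  # assumes 30-day months
)
elapsed
```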

5.3 Surface Temperature

1. AD Normality Test

H0 = Surface Temperature is following Normal Distribution

H1 = Surface Temperature is not following Normal Distribution

● Surface Temperature does not follow a Normal Distribution. Thus, we reject our null hypothesis.

● As we can observe from the histogram, the value of Surface Temperature ranges mainly from 5 °C to 15 °C.

● The average yearly Surface Temperature of Edinburgh is 9.41 °C.

2. Student’s T Test

H0 = The mean of Surface Temperature is 13 °C

H1 = The mean of Surface Temperature is not 13 °C

2.a. For 100 Samples

● The null hypothesis that Surface Temperature has a mean of 13 °C is rejected at the 95% confidence level.

● The first 100 samples cover 1.667 hours of January 1, i.e. winter, so the temperature is low.

● As it is winter, the temperature is much lower than the expected value.

2.b. For 1000 Samples

● The null hypothesis that Surface Temperature has a mean of 13 °C is rejected at the 95% confidence level.

● The first 1000 samples cover 16.67 hours of January 1, i.e. winter, so the temperature is low.

● As the day passes the temperature drops even further, so the mean goes down further.

2.c. For 10,000 Samples

● The null hypothesis that Surface Temperature has a mean of 13 °C is rejected at the 95% confidence level.

● The first 10,000 samples cover about 7 days from January 1, i.e. winter, so the temperature is low.

● As time passes and winter recedes, the average temperature rises, but not up to the expected yearly average.

2.d. For 100,000 Samples

● The null hypothesis that Surface Temperature has a mean of 13 °C is rejected at the 95% confidence level.

● The first 100,000 samples cover about 2.3 months, still winter, so the temperature is low.

● Over these 2.3 months the temperature has gradually started increasing, but not by much.

2.e. For ALL Samples

● The null hypothesis that Surface Temperature has a mean of 13 °C is rejected at the 95% confidence level.

● Over the year we are measuring the temperature of a cold place, and the temperature is very low compared with the expected average value.

5.4 Wind Speed

1. AD Normality Test

H0 = Wind Speed is following Normal Distribution

H1 = Wind Speed is not following Normal Distribution

● By observing the graph and checking with ad.test(), we find that Wind Speed does not follow a Normal Distribution. Thus, we reject our null hypothesis.

● As we can observe from the histogram, the value of Wind Speed ranges mainly from 1.042 m/s to 4.396 m/s.

● It shows a linear decrease from 1 m/s to 14 m/s.

● The mean Wind Speed is 2.952 m/s, indicating that on a given regular day the wind speed is most likely to be around 3 m/s.

● Thus, wind speeds beyond 7.5 m/s are less likely, as they occur during unsettled weather conditions.

● The average yearly Wind Speed in Edinburgh is 2.952 m/s.

● Overall Conclusion:

o The null hypothesis is rejected, as both the test and the graphical observation support the same result.

o Mean wind speed remains at about 3 m/s during regular days.

2. Student’s T Test

H0 = The mean of Wind Speed is 2.83 m/s

H1 = The mean of Wind Speed is not 2.83 m/s

2.a. For 100 Samples

● The null hypothesis that Wind Speed has a mean of 2.83 m/s is rejected at the 95% confidence level.

● The mean speed from the data is around 6.3 m/s, whereas we are testing for 2.83 m/s. Thus, there is a large difference between the two means.

2.b. For 1000 Samples

● The null hypothesis that Wind Speed has a mean of 2.83 m/s is rejected at the 95% confidence level.

2.c. For 10,000 Samples

● The null hypothesis that Wind Speed has a mean of 2.83 m/s is rejected at the 95% confidence level.

● The first 10,000 samples correspond to wind speed data from the first week of January. Roughly, the wind speed in that period is 14 km/h, or 3.9 m/s. Thus 2.83 m/s deviates considerably from the recorded mean.

2.d. For 100,000 Samples

● The null hypothesis that Wind Speed has a mean of 2.83 m/s is accepted (not rejected) at the 95% confidence level.

● The 95% confidence interval for the mean ranges from 2.819304 to 2.85. The assumed mean of 2.83 lies within this interval, and hence the null hypothesis is accepted.
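The link between the interval and the decision can be checked directly: for a two-sided one-sample t-test, the assumed mean lies inside the 95% confidence interval exactly when the p-value is at least 0.05. The sketch below uses a synthetic wind vector with arbitrary parameters, not the recorded data.

```r
set.seed(2)
# Hypothetical wind-speed sample; the real column is not reproduced here.
wind <- rnorm(5000, mean = 2.83, sd = 1.5)

tt <- t.test(wind, mu = 2.83, conf.level = 0.95)
ci <- tt$conf.int                       # lower and upper limit for the mean
inside <- ci[1] <= 2.83 && 2.83 <= ci[2]

# Both ways of reading the result lead to the same decision.
inside == (tt$p.value >= 0.05)
```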

2.e. For ALL Samples

● The null hypothesis that Wind Speed has a mean of 2.83 m/s is rejected at the 95% confidence level.

5.5 Wind Direction

1. AD Normality Test

H0 = Wind Direction is following Normal Distribution

H1 = Wind Direction is not following Normal Distribution

● By observing the graph and checking with ad.test(), we find that Wind Direction does not follow a Normal Distribution. Thus, we reject our null hypothesis.

● The distribution is concentrated around two peaks, one at around 225°-250° and the other at around 301°-320°. Thus, wind direction does not follow a normal distribution.

● The first peak corresponds to the Southwest and parts of the West, and the other peak corresponds to the Northwest.

● Thus, the majority of the time the wind blows from the west (north-west as well as south-west).

● This can also be validated from the fact that there is a huge open golf course (Craigmillar Park) surrounding the western and southern parts of the observatory.

● The range from 45° to 135° corresponds to the North-East, East and South-East. There is no wind from that side.

● The average yearly Wind Direction in Edinburgh is 159.6°.
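The mapping from bearings in degrees to compass sectors used in the bullets above can be sketched as follows. compass_sector is a hypothetical helper, and the eight 45-degree sectors centred on N, NE, E and so on are an assumption about how the histogram was read.

```r
# Map a bearing in degrees to one of eight compass sectors (hypothetical helper).
compass_sector <- function(deg) {
  labels <- c("N", "NE", "E", "SE", "S", "SW", "W", "NW")
  # Shift by half a sector (22.5 degrees) so each label is centred on its bearing.
  labels[(floor(((deg %% 360) + 22.5) / 45) %% 8) + 1]
}

compass_sector(c(225, 250))       # edges of the first peak
compass_sector(c(301, 320))       # edges of the second peak
compass_sector(c(45, 90, 135))    # the range with no wind
```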

2. Student’s T Test

H0 = The mean of Wind Direction is 238°

H1 = The mean of Wind Direction is not 238°

2.a. For 100 Samples

● The null hypothesis that Wind Direction has a mean of 238° is rejected at the 95% confidence level.

2.b. For 1000 Samples

● The null hypothesis that Wind Direction has a mean of 238° is accepted (not rejected) at the 95% confidence level.

2.c. For 10,000 Samples

● The null hypothesis that Wind Direction has a mean of 238° is rejected at the 95% confidence level.

2.d. For 100,000 Samples

● The null hypothesis that Wind Direction has a mean of 238° is rejected at the 95% confidence level.

2.e. For ALL Samples

● The null hypothesis that Wind Direction has a mean of 238° is rejected at the 95% confidence level.

5.b. Graphs of Correlation

1. Surface temperature and rainfall

SCATTER PLOT

Observation through plot:

● This is the relationship between rainfall and surface temperature: a scatter plot for the correlation of surface temperature and rainfall.

● Rainfall is on the y axis and surface temperature on the x axis.

● The data points are widely scattered in this plot, which shows that the relationship is weak to a great extent.

● There is no linear or curvilinear relationship.

● There are very few influential data points in the range of -9 to 25 of surface temperature.

● The correlation is very low, as seen from the graph, because of the scattered data points.

R CODE:

plot(data$surface.temperature..C.,data$rainfall..mm.)

WITH REGRESSION LINE

lines(lowess(data$surface.temperature..C.,data$rainfall..mm.),col="blue")

Statistical Observation:

CORRELATION

● On the basis of the r value, the strength of the relationship is very weak, tending almost to 0.

● The statistics confirm what the graph showed: the relationship is almost zero, since the data points are widely scattered.

● Since the regression line is horizontal, there is no correlation between the data.

● The correlation function also gives a value near zero.

● Since the correlation coefficient is negative but close to zero, we find that they are not correlated.
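What a near-zero r and a flat smoothing line look like can be reproduced on synthetic data. In the sketch below, temp and rain are hypothetical, independently generated columns (uniform and exponential, chosen arbitrarily), so any correlation between them is due to chance.

```r
set.seed(3)
# Hypothetical, unrelated columns standing in for temperature and rainfall.
temp <- runif(2000, min = -9, max = 25)
rain <- rexp(2000, rate = 5)

r   <- cor(temp, rain)      # Pearson correlation coefficient
fit <- lowess(temp, rain)   # smoothed trend, returned as list(x, y)

r                           # close to zero for unrelated columns
range(fit$y)                # a nearly flat smoother has a narrow range
```

The same vectors can be passed to plot(temp, rain) and lines(fit, col = "blue") to draw the picture in the style of the R code listed above.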

2. Rainfall and humidity

SCATTER PLOT

Observation through plot:

● This is the relationship between relative humidity and rainfall: a scatter plot for the correlation of rainfall and relative humidity.

● Relative humidity is on the y axis and rainfall on the x axis.

● The relative humidity is mainly clustered over rainfall values between 0 and 2.

● From the graph it is seen that the data points are clustered only in one area. This type of clustering indicates no correlation or a very low correlation. We can say that the correlation is weak, but it cannot be negative.

● It does not follow any linear or curvilinear relationship.

● The data points in the range of 0 to 2 of rainfall can be said to be somewhat influential.

R CODE

plot(data$rainfall..mm.,data$relative.humidity....)

WITH REGRESSION LINE

lines(lowess(data$rainfall..mm.,data$relative.humidity....),col="blue")

Statistical Observation:

CORRELATION

● On the basis of the r value, the relationship is very weak but positive.

● The regression line is slightly curvilinear and then constant, and thus r is near zero.

● On calculating the correlation, we get a value of nearly zero, which confirms that they are not correlated.

3. Wind speed and rainfall

SCATTER PLOT

Observation through plot:

● This is the relationship between rainfall and wind speed: a scatter plot for the correlation of rainfall and wind speed.

● Rainfall is on the y axis and wind speed on the x axis.

● The data is not clustered at any place.

● The slope of the line is very small, which shows that there is little correlation. The correlation is weak, but the relationship is not negative.

● It follows a somewhat linear relationship with an almost negligible slope, which also shows that the correlation is weak.

R CODE

plot(data$wind.speed..m.s.,data$rainfall..mm.)

WITH REGRESSION LINE

lines(lowess(data$wind.speed..m.s.,data$rainfall..mm.),col="blue")

Statistical Observation:

CORRELATION

● On the basis of the r value, the relationship is very weak but positive.

● From the r value it is clearly seen that there is very weak correlation.

● Wind speed and rainfall are not correlated, as the regression line is horizontal, yet we can see a slight positive correlation between the data.

● This is due to the scattered points above the regression line.

4. Surface temperature and atmospheric pressure

SCATTER PLOT

Observation through plot:

● This is the relationship between atmospheric pressure and surface temperature: a scatter plot for the correlation of surface temperature and atmospheric pressure.

● Atmospheric pressure is on the y axis and surface temperature on the x axis.

● The graph is mainly clustered over surface temperature values between approximately -9 and 25.

● These data points do not have a specific pattern, so we can say that the correlation is weak.

● It does not follow any linear or curvilinear relationship.

● The data points in the range of -9 to 25 can be said to be somewhat influential data points.

R CODE

plot(data$surface.temperature..C.,data$atmospheric.pressure..mBar.)

WITH REGRESSION LINE

lines(lowess(data$surface.temperature..C.,data$atmospheric.pressure..mBar.),col="blue")

CORRELATION

● On the basis of the r value, the relationship is weak.

● The surface temperature and atmospheric pressure are positively correlated with each other, as we get a regression line with a positive slope.

5. Relative humidity and atmospheric pressure

Observation through plot:

● This is the relationship between relative humidity and atmospheric pressure: a scatter plot for the correlation of atmospheric pressure and relative humidity.

● Relative humidity is on the y axis and atmospheric pressure on the x axis.

● The graph is mainly clustered over atmospheric pressure values between approximately 950 and 1100.

● The cluster gradually slopes downward, so the direction is downward and the association is negative: as the atmospheric pressure increases, the relative humidity decreases. Thus, by observing the plot, we can say that the correlation is negative.

● The form cannot be stated clearly. As the points are all clustered, they do not follow any linear or curvilinear relationship.

● The data points are closer together in the right corner, which shows that they are closely related there, i.e. they have a higher correlation in that corner. They show a higher negative correlation there, but overall it can be concluded that the correlation is low.

● The data points in the right corner can be said to be influential, as they lie in the flow of the major cluster of data points.

Statistical Observation:

● On the basis of the r value, the relationship is weak and negative.

● From the regression line, we observe a negatively sloped linear regression line, which contributes to the negative correlation of the data.

6. Use of R Programming

a. Module wise description of Functions used

1. ad.test()

    function (x)
    {
        DNAME <- deparse(substitute(x))
        x <- sort(x[complete.cases(x)])                       # STEP-1
        n <- length(x)                                        # STEP-2
        if (n < 8)                                            # STEP-3
            stop("sample size must be greater than 7")
        logp1 <- pnorm((x - mean(x))/sd(x), log.p = TRUE)     # STEP-4
        logp2 <- pnorm(-(x - mean(x))/sd(x), log.p = TRUE)    # STEP-4
        h <- (2 * seq(1:n) - 1) * (logp1 + rev(logp2))        # STEP-5
        A <- -n - mean(h)                                     # STEP-5
        AA <- (1 + 0.75/n + 2.25/n^2) * A                     # STEP-6
        if (AA < 0.2) {                                       # STEP-7 begins
            pval <- 1 - exp(-13.436 + 101.14 * AA - 223.73 * AA^2)
        }
        else if (AA < 0.34) {
            pval <- 1 - exp(-8.318 + 42.796 * AA - 59.938 * AA^2)
        }
        else if (AA < 0.6) {
            pval <- exp(0.9177 - 4.279 * AA - 1.38 * AA^2)
        }
        else if (AA < 10) {
            pval <- exp(1.2937 - 5.709 * AA + 0.0186 * AA^2)
        }
        else pval <- 3.7e-24                                  # STEP-7 ends
        RVAL <- list(statistic = c(A = A), p.value = pval,
            method = "Anderson-Darling normality test", data.name = DNAME)
        class(RVAL) <- "htest"
        return(RVAL)
    }

Step-1:-> All NA values are removed from the data and it is sorted in ascending order.

Step-2:-> The length of the data is calculated.

Step-3:-> The length of the data is validated. If the length is greater than 7 we proceed, else we stop.

Step-4:-> The log of the normal CDF is calculated for the standardized data using pnorm((x - mean(x))/sd(x), log.p = TRUE). mean(x) returns a single value, the mean of the data vector, and sd(x) calculates the standard deviation of x.

Step-5:-> The log-CDF values are combined to get the test statistic, which also involves taking the mean of the combined terms.

Step-6:-> The test statistic is then modified so that it gives the correct result even for very small sample sizes.

Step-7:-> Depending on the range in which the modified test statistic falls, we calculate the p-value.
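Steps 1 to 6 can be traced on a small synthetic sample. The sketch below follows the listed code, then recomputes the statistic from the textbook sum -n - (1/n) * sum((2i - 1) * (ln F(z_i) + ln(1 - F(z_(n+1-i))))) to confirm that the vectorised form with rev() gives the same value. The sample and the name A_direct are illustrative.

```r
set.seed(4)
x <- c(rnorm(50), NA)                    # hypothetical sample with one missing value

x <- sort(x[complete.cases(x)])          # Step 1: drop NA, sort ascending
n <- length(x)                           # Step 2
stopifnot(n > 7)                         # Step 3
z <- (x - mean(x)) / sd(x)               # standardise with sample mean and sd
logp1 <- pnorm(z, log.p = TRUE)          # Step 4: log F(z_i)
logp2 <- pnorm(-z, log.p = TRUE)         # Step 4: log(1 - F(z_i))
h <- (2 * seq_len(n) - 1) * (logp1 + rev(logp2))
A <- -n - mean(h)                        # Step 5: test statistic
AA <- (1 + 0.75 / n + 2.25 / n^2) * A    # Step 6: small-sample modification

# Same statistic written as an explicit sum over i = 1..n.
i <- seq_len(n)
A_direct <- -n - sum((2 * i - 1) *
                     (log(pnorm(z)) + log(1 - pnorm(z[n + 1 - i])))) / n
all.equal(A, A_direct)
```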

2. cor(x,y)

This function takes two columns as input and calculates the correlation coefficient r, which indicates whether the two data sets are correlated, i.e. follow one another. The r value lies between -1 and 1. Values from -1 to 0 indicate negative correlation and values from 0 to 1 indicate positive correlation. For r = 0, there is no linear correlation between the data.
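The r value returned by cor() by default is the Pearson coefficient, which can be written out by hand. In this sketch pearson_r is a hypothetical helper and the vectors are toy values.

```r
# Pearson correlation coefficient from its definition (hypothetical helper).
pearson_r <- function(x, y) {
  dx <- x - mean(x)
  dy <- y - mean(y)
  sum(dx * dy) / sqrt(sum(dx^2) * sum(dy^2))
}

x <- c(1, 2, 3, 4, 5)
y <- c(2, 1, 4, 3, 5)

pearson_r(x, 2 * x + 1)                # perfectly increasing line
pearson_r(x, -3 * x)                   # perfectly decreasing line
all.equal(pearson_r(x, y), cor(x, y))  # agrees with the built-in function
```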

3. plot(x,y)


Plots the scatter plot of two data columns.

4. hist(x) or plot(table(x))

Either function gives a graphical representation of the frequency of a data column with respect to its values. The X axis shows the data values and the Y axis shows the frequency of occurrence.
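The frequencies behind both plots can be inspected without drawing anything. The sketch below uses a toy vector; the breaks are chosen (as an assumption for this example) so that each bin is centred on one integer value.

```r
x <- c(1, 1, 2, 2, 2, 3, 5, 5)   # toy data column

table(x)                         # frequency of each distinct value

# plot = FALSE returns the bin counts instead of drawing the histogram.
h <- hist(x, breaks = seq(0.5, 5.5, by = 1), plot = FALSE)
h$counts                         # frequency per unit-wide bin, including empty bins
```

table() lists only the values that occur, whereas hist() also reports the empty bin for the value 4.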

5. t.test(x, mu = assumed mean value, conf.level = confidence level)

It compares the assumed mean value with the actual mean value of the data and accordingly decides on the null hypothesis at the given confidence level.
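The comparison t.test() performs can be reproduced by hand for the one-sample, two-sided case. The readings below are made-up values, and t_manual and p_manual are illustrative names.

```r
x   <- c(2.9, 3.1, 2.7, 3.4, 3.0, 2.8, 3.3, 3.2)   # hypothetical readings
mu0 <- 2.83                                        # assumed mean

# t statistic: distance of the sample mean from mu0 in standard-error units.
t_manual <- (mean(x) - mu0) / (sd(x) / sqrt(length(x)))
# Two-sided p-value from the t distribution with n - 1 degrees of freedom.
p_manual <- 2 * pt(-abs(t_manual), df = length(x) - 1)

tt <- t.test(x, mu = mu0, conf.level = 0.95)
all.equal(unname(tt$statistic), t_manual)
all.equal(tt$p.value, p_manual)
```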

6. lines(lowess)

Gives the regression line of the scatter plot, which is used for interpreting the correlation of the data. The smoother span, passed to lowess() as the argument f, controls how much the line is smoothed.
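The effect of the smoothing argument of lowess(), the span f, can be seen on synthetic data. In this sketch the noisy sine curve and the two span values are arbitrary choices; a smaller span lets the line follow the data more closely, while the default span gives a smoother line.

```r
set.seed(5)
x <- seq(0, 10, length.out = 200)
y <- sin(x) + rnorm(200, sd = 0.3)       # hypothetical noisy curve

smooth_wide   <- lowess(x, y, f = 2/3)   # default span: smoother line
smooth_narrow <- lowess(x, y, f = 0.1)   # small span: follows local structure

# Residual sum of squares of each line against the data.
rss <- function(fit) sum((y - fit$y)^2)
rss(smooth_narrow) < rss(smooth_wide)
```

Either result can be drawn over a scatter plot with lines(smooth_wide, col = "blue").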

7. What you have learnt from this Project?

- Programming Skills● We got an opportunity to Explore R Programming Language● We got to learn in depth about many functions inside the various packages of R

○ i.e., Nortest Package, ad.test, t.test, cor, distfit, MASS Package, FitDistPlus package

● It has enabled/familiarized us to determine Whether a data is drawn from a specific probability distribution or not.

- Data Analysis and Basic Concept of Hypothesis○ If the Test Statistic (Ac

2) exceeds the critical value then the Null Hypothesis is rejected.

○ Another approach can be if the P-value is less than 0.05 significance level then the Null Hypothesis is rejected.

- Interpretation of data Graphically○ How to correlate two different types of data through graphically and by

the correlation coefficient○ We learned to interpret from plot of two types and associate as well as

correlate the data graphically. [We learned to interpret from various

Page 41: Anderson Darling Test

types of plots such as scatter, histogram, normal-plot etc and are able to associate as well as correlate the data graphically.Thus we learnt the data analysis part.]

○ From the graph, we can approximately read off the mean, which lies around the peak, and the standard deviation, from the width of the graph, provided that the data follow a normal distribution.

○ Correlation is also an important part. We can draw many conclusions from it.

○ We learned about the form, strength and direction of plots.

○ For example, if the points of linear data lie close to a straight line, it can be interpreted as a stronger relationship; the sign of the slope gives the direction.

○ Whether the curve is linear or has some other shape leads to different interpretations.

○ There can be positive correlation, negative correlation and no correlation.

○ There can be combinations also, like stronger negative correlation, stronger positive correlation, weaker positive correlation, weaker negative correlation and no correlation.

○ The main conclusions depend on the r value (correlation coefficient), which lies between -1 and 1.

○ If the data points form a shapeless cluster with no trend, there is no correlation.

● We understood the functionality of ad.test and its internal structure, such as: CDF calculations, the modified Test Statistic that accounts for small sample sizes, and the calculation of the p-value depending upon the range of the Test Statistic.
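The steps listed in the last point (CDF calculations, the small-sample modification, and the range-dependent p-value) together with the decision rule on the Null Hypothesis can be sketched in base R as below. This is not the nortest source code: the function name ad_normal is hypothetical, and the correction factor and p-value coefficients are the commonly published approximation for normality testing with estimated mean and standard deviation, reproduced here as an assumption.

```r
# Hypothetical helper: Anderson Darling normality test, mean and sd estimated
ad_normal <- function(x) {
  x <- sort(x[!is.na(x)])
  n <- length(x)
  stopifnot(n > 7)                      # approximation is meant for n >= 8

  # CDF calculations on the standardized, sorted sample (kept on the log scale)
  z     <- (x - mean(x)) / sd(x)
  logp1 <- pnorm(z, log.p = TRUE)       # log F(z_i)
  logp2 <- pnorm(-z, log.p = TRUE)      # log(1 - F(z_i))

  # A^2 = -n - (1/n) * sum (2i - 1) * [log F(z_i) + log(1 - F(z_(n+1-i)))]
  h  <- (2 * seq_len(n) - 1) * (logp1 + rev(logp2))
  A2 <- -n - mean(h)

  # Modified statistic for small sample sizes (assumed correction factor)
  A2m <- (1 + 0.75 / n + 2.25 / n^2) * A2

  # p-value formula depends on the range of the modified statistic
  p <- if (A2m < 0.2) {
    1 - exp(-13.436 + 101.14 * A2m - 223.73 * A2m^2)
  } else if (A2m < 0.34) {
    1 - exp(-8.318 + 42.796 * A2m - 59.938 * A2m^2)
  } else if (A2m < 0.6) {
    exp(0.9177 - 4.279 * A2m - 1.38 * A2m^2)
  } else if (A2m < 10) {
    exp(1.2937 - 5.709 * A2m + 0.0186 * A2m^2)
  } else {
    0                                   # effectively zero in this sketch
  }

  list(statistic = A2, modified = A2m, p.value = p)
}

set.seed(5)
res_normal <- ad_normal(rnorm(100))     # data drawn from a normal distribution
res_skewed <- ad_normal(rexp(200))      # clearly non-normal data

print(res_normal)
print(res_skewed)
print(res_skewed$p.value < 0.05)        # reject normality at the 0.05 level?
```

If the nortest package is installed, the output can be compared against ad.test on the same data.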

8. Summary

The Anderson Darling test provided in the nortest package is used to accept or reject the null hypothesis that the data follow a normal distribution. The normal distribution is the Gaussian distribution, calculated from the mean and standard deviation of the data. When our Null Hypothesis is based on means, we also apply Student's t-test on individual data columns and arrive at certain conclusions based on varying means. The two-sample Student's t-test familiarizes us with the interrelation of data. We have also used correlation to arrive at certain conclusions on the interrelation of data. Overall, the Anderson Darling test can also be used to detect various other distributions, such as Weibull and logarithmic, if the critical values corresponding to the given distributions and required significance levels are known. Also, the test can be modified to be used with a mean and standard deviation of our choice, and the p-values of the actual and defined tests can then be compared, provided we have the modified formula of the Test Statistic for the above-mentioned distribution.

Correlation leads us to a broader picture of data analysis, where we see relations between various columns of data; it might help us in predicting future trends in the given data. It can thus be used for weather prediction, market analysis, product future analysis, etc. We also found that the Anderson Darling test is not sufficient for data analysis tasks like correlation, or for the case of known mean and unknown variance, yet for normality analysis it is one of the most powerful tests.

9. Innovation finds in the field of Communication

SPECTRUM SENSING

Based on the paper "Spectrum sensing in cognitive radio using goodness of fit testing" by H. Wang et al., the goodness-of-fit test is used to formulate a signal sensing technique as an alternative to energy-detection-based sensing. The Anderson Darling test is used here to develop the Anderson Darling sensing technique. The main issue today is the costly spectrum bands and their optimum utilization. In this scenario it becomes important to have a wide range of users sharing the same spectrum, to improve its utilization and, on the other hand, to save spectrum bandwidth for newer products and technologies. Cognitive radio is a radio technology that lets non-licensed users use the spectrum of licensed users when the band is free and hands it back to the primary user when needed. Thus, it is adaptive to changing environment parameters. Also, Anderson Darling sensing gives a higher probability of detecting a signal than energy detection for certain sensing parameters. Besides energy detection, covariance-based and waveform-based detection may be used; these are parametric sensing methods. They are used where the signal-to-noise ratio is low, since energy detection then gives a low probability of correct sensing. We need to detect the existence of a signal quickly and efficiently for better utilization.
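The idea of Anderson Darling sensing can be sketched as a goodness-of-fit test of the received samples against the known noise distribution: if the samples do not fit the noise-only distribution, a signal is declared present. The sketch below is a simplified illustration and not the exact scheme of the paper. It assumes real-valued samples, noise known to be N(0, 1), and a constant signal level of 1; the function name ad_stat_known is hypothetical, and the threshold 2.492 is the tabulated 5% critical value for a fully specified null distribution, quoted here as an assumption.

```r
# Hypothetical helper: A^2 against a fully specified N(mu0, sigma0),
# i.e. no parameters are estimated from the data
ad_stat_known <- function(x, mu0 = 0, sigma0 = 1) {
  z <- sort((x - mu0) / sigma0)
  n <- length(z)
  i <- seq_len(n)
  -n - mean((2 * i - 1) * (pnorm(z, log.p = TRUE) + rev(pnorm(-z, log.p = TRUE))))
}

threshold <- 2.492   # assumed 5% critical value (false-alarm probability 0.05)

set.seed(6)
n <- 100
noise_only  <- rnorm(n)        # band is free: noise only
with_signal <- 1 + rnorm(n)    # band is occupied: assumed constant signal plus noise

a2_noise  <- ad_stat_known(noise_only)
a2_signal <- ad_stat_known(with_signal)

print(c(noise = a2_noise, signal = a2_signal))
print(c(noise = a2_noise > threshold, signal = a2_signal > threshold))  # signal declared?
```

With noise only, the statistic exceeds the threshold in roughly 5% of trials by construction; with the signal present it is typically far above the threshold.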

MOBILE CELLULAR SYSTEM

Location registration provides the location details of the mobile station. Frequent contact with the mobile station is needed for higher accuracy. One of the registration types is distance-based registration. During random movement the mobile terminal shows a central tendency, i.e., its position is distributed around the center. The benefit of this tendency is that the probability density function of the random variable describing the movement of the mobile station can be approximated by a normal distribution. Here, the Anderson Darling test is used as a goodness-of-fit test for this approximation, by finding the p-value for each of the multiple contacts.


10. References:

a. Text References

1. http://en.wikipedia.org/wiki/Anderson%E2%80%93Darling_test#cite_note-Stephens74-1

2. http://www.isixsigma.com/dictionary/anderson-darling-normality-test/

3. https://www.scribd.com/doc/234252923/Anderson-Darling-Test

4. http://iussp.org/sites/default/files/event_call_for_papers/lenartMissov.pdf

5. http://maths.york.ac.uk/www/sites/default/files/QilinHu.pdf

6. http://maths.york.ac.uk/www/sites/default/files/QilinHu.pdf

7. http://mathworld.wolfram.com/NormalDistribution.html

8. http://www.mathwave.com/articles/distribution_fitting_faq.html#q4

b. Other references

1. http://en.wikipedia.org/wiki/Predictive_analytics

2. http://www.mathwave.com/articles/goodness_of_fit.html

3. http://www.mathwave.com/articles/distribution_fitting_faq.html#q3

4. http://www.cde.ca.gov/ta/tg/hs/documents/mathstudysec2.pdf

5. http://www.westga.edu/assetsCOE/virtualresearch/scatterplots_and_correlation_notes.pdf

6. http://math.tutorvista.com/statistics/scatter-plot.html

7. Haiquan Wang, Member, IEEE, En-hui Yang, Fellow, IEEE, Zhijin Zhao and Wei Zhang, Member, IEEE, "Spectrum sensing in cognitive radio using goodness of fit testing"

8. Information Networking: Advances in Data Communications and Wireless Networks: International Conference, ICOIN 2006, Sendai, Japan, January 16-19, 2006, Revised Selected Papers (https://books.google.co.in)

c. Weather Related Information [Edinburgh, U.K.]

i. http://www.accuweather.com/en/gb/edinburgh/eh1-3/weather-forecast/327336

ii. http://www.bbc.com/weather/2650225

iii. http://www.weatherhq.co.uk/weather-station/edinburgh-airport

iv. https://weatherspark.com/averages/28753/Edinburgh-Scotland-United-Kingdom