Basics of Statistical Notation

You will be introduced to a large number of formulas in this section on statistical concepts. These formulas use a relatively standardized notation to simplify the description of how a statistic should be computed. This section introduces the logic and basic concepts behind that notation. With each new formula, we will remind you what the notation means, but this section provides a heads-up before we get to those formulas and a helpful summary in case you forget a notational concept.

Designating a Variable

Statistical formulas use algebraic notation, which relies on letters to designate variables. By convention, if there is just one variable in a formula, the letter X is used to designate it. If there is a second variable in the formula, the letter Y is traditionally used, and if there is a third, the letter Z. After that, there are no universal traditions, but it is rare for statistical formulas to involve more than three variables.

The capital letter N traditionally refers to the total number of participants in a study.

The single letter in a statistical formula refers to the variable. Individual scores on that variable are indicated by subscripts, which are numbers written below the letter to refer to a specific score. For example, X1 refers to the score for the first person on the X variable, and X27 refers to the score for the 27th person on the X variable. Y11 refers to the score on the Y variable for the 11th person.

If there are several groups of participants, the number of participants in each group is indicated by a lower-case n with a subscript to indicate the group number. For example, n1 refers to the number of participants in the first group.

Traditionally, the number of groups in a study is referred to by the lowercase letter k, although in complex designs this tradition is modified. Therefore, nk refers to the number of participants in the kth group, which is the last group.

Algebraic Rules

Algebraic operations are carried out in a specified order. The order is:

o The highest priority action is to raise any variables to a power. For example, to compute 2X², you would first square the value of X and then multiply by 2.

o The next highest priority action is multiplication or division. For example, to compute 2X + 1, you would multiply the value of X by 2 and then add 1.

o The lowest priority action is addition or subtraction.

You can override any of these priorities by using parentheses. Anything in parentheses should be done before other actions. For example, X + Y² is computed by squaring Y and adding it to X. In contrast, (X + Y)² is computed by adding X and Y first and then squaring the sum. In other words, the parentheses in the second expression override the normal priority order (raise to a power before adding).
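For a concrete check, suppose X = 3 and Y = 2. Then 2X² = 2 × 9 = 18 (square first, then multiply), 2X + 1 = 7 (multiply first, then add), X + Y² = 3 + 4 = 7 (square first, then add), and (X + Y)² = 5² = 25 (the parentheses force the addition to happen first).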

Summation Notation

Many statistical formulas involve adding a series of numbers. The notation for adding a series of numbers is the capital Greek letter sigma (Σ). The sigma stands for "add up everything that follows." Therefore, if the sigma is followed by the letter X, it means that you should add up all of the X scores.

Parentheses indicate that you should perform the operation in parentheses before you do the summation. For example, the notation below indicates that you should subtract Y from X for each person before you sum the differences.
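    Σ(X − Y)

That is, compute X − Y for each person and then add up the resulting differences.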

Standard Notation for Statistics

A distinction is made between a statistic that is computed on everyone in a population and the same statistic computed on everyone in a sample drawn from that population.

o A statistic computed on everyone in the population is called a population parameter.

o A statistic computed on everyone in a sample is called a sample statistic.

The population mean is designated by the Greek letter mu, whereas the sample mean is designated by an X with a bar over the top (read X bar). Both are illustrated below.
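    Population mean:  μ
    Sample mean:      X̄ (an X with a bar over it)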

A similar distinction is made for the standard deviation, which is a measure of variability. The population standard deviation is indicated by the lowercase Greek letter sigma, whereas the sample standard deviation is indicated by the lowercase letter s, as shown below.
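    Population standard deviation:  σ
    Sample standard deviation:      s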


The lower case letter r is used to designate a correlation. If there is any doubt about which two variables were used to compute the correlation, the two variables are listed as subscripts. For example, rXY indicates the correlation of X and Y.

Descriptive Statistics

Descriptive statistics describe aspects of a data set in a single number. Many descriptive statistics are also used in the computation of inferential statistics. In this section, we will be covering four classes of descriptive statistics. Those classes and their definitions are listed below.

Measures of Central Tendency - measures that indicate the typical or average score.

Measures of Variability - measures that indicate the spread of scores about the measure of central tendency.

Relative Scores - a score for an individual that tells how that individual performed relative to the other individuals on the measure.

Measures of Relationships - measures that indicate the strength and direction of a relationship between two or more variables.

In addition to these widely used descriptive statistics, we will also introduce some other, less frequently used statistics that you will occasionally run into, especially if you use computers to do your statistical analyses.


Measures of Central Tendency

Measures of central tendency indicate the typical or average score in a distribution of scores. This section covers three measures of central tendency: the mean, median, and mode.

The Mean

The mean is the arithmetic average of the scores. It is computed by adding all the scores and dividing by the total number of scores. Remember from our section on notation that we use summation notation to indicate that we should add all the scores, and we use the uppercase letter N to indicate the total number of scores. The notation for the mean of the X scores is an uppercase X with a bar across the top. Therefore, the formula for computing the mean is written as follows:
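    X̄ = ΣX / N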


If you have several groups in your research study, it is traditional to compute the mean for each group. In such a situation, you would use a subscript notation to indicate the groups. In formulas, the groups are numbered from 1 to k. Remember that k is the letter that we use to indicate the number of groups. So using this notation, the mean for Group 1 would use the following formula. 
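    X̄1 = ΣX1 / n1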

Note that we use a subscript 1 to indicate that we are computing the mean for Group 1. We are adding all the scores in Group 1 (the X1s) and dividing by the number of scores in Group 1 (n1). We use a lowercase n here because we are NOT talking about the total number of scores in the study, but rather the number of scores in just one group of the study.

These notation rules can be a pain to learn initially, but once you get them down, you can quickly translate almost any formula into the computational steps that are required.

Although it is convenient to number groups in formulas, and we will generally number groups in all of the formulas that we use, in your own computations it is often easier to use subscripts that are not a numeric code. For example, if you are studying gender differences on a variable, you might compute the mean of that variable for men and women separately. Instead of using the subscripts 1 and 2 and having to remember which refers to males and which refers to females, you might as well use a descriptive subscript. For example, the mean for the females might be written as follows:
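    X̄females = ΣXfemales / nfemales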

Elsewhere on this website are instructions on how to compute the mean either by hand or using SPSS for Windows. To see those instructions, click on one of the buttons below. To return to this page after viewing that material, use the back button of the web browser that you are using to view this website.

The mean is the most widely used measure of central tendency, because it is the measure of central tendency that is most often used in inferential statistics. However, as you will see at the end of this section, the mean does not always provide the best indication of the typical score in a distribution.


The Median

The median is the middle score in a distribution. It is also the score at the 50th percentile, which means that 50% of the scores are lower and 50% are higher. In the textbook, we showed how to compute the median with a small number of scores. In such a case, you:

1. Order the scores from lowest to highest and count the number of scores (N).

2. If the number of scores is odd, you add 1 to N and divide by 2 to get the middle score [(N+1)/2]. For example, if you have 15 scores, the middle score is the 8th score [(15+1)/2=8]. There will be seven scores above the 8th score and seven below it.

3. If the number of scores is even, there will be no middle score. Instead there will be two scores that straddle the middle. For example, if there are 14 scores, the 7th and 8th scores in your ordered list will straddle the middle. You can figure out which scores to focus on by dividing N by 2 and taking that score from the bottom and the one above it [e.g., 14/2=7, so you take the 7th and 8th scores from the bottom]. The median is between these two scores, so you average them. If the 7th score is 36 and the 8th score is 39, you sum the two and divide by two to get the average [e.g., (36+39)/2=37.5].

This procedure works fine when there are a small number of scores and little duplication of scores, but it is not considered accurate enough when there are a large number of scores and many scores are duplicated. In such a situation, a more complicated formula is used. The formula is listed below. Don't panic; it is easier to use than it looks.
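Written out in plain notation, that formula is:

    Median = LRL + i × (nmedian − nLRL) / fi

Here LRL is the lower real limit of the interval that contains the median, i is the interval width, nmedian is the position of the middle score (N/2), nLRL is the number of scores below the lower real limit, and fi is the number of scores within the interval. Each of these terms is worked out step by step below.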

To compute a median using this formula, you must first create a frequency or grouped frequency distribution. We will use the frequency distribution that we created in the section on organizing data. That distribution is listed below.

    Score   Frequency   Cumulative Frequency
     17          8              394
     16         20              386
     15         33              366
     14         48              333
     13         71              285
     12         85              214
     11         58              129
     10         39               71
      9         21               32
      8         11               11

Now we need to define each of the terms in the formula for the median. We must do this in steps. We start by finding the middle score.

1. The position of the middle score (nmedian) is the total number of scores divided by 2. In a frequency distribution that also includes a cumulative frequency column, you can read the total number of scores (N) as the number at the top of the cumulative frequency column. In this case, N is 394, so nmedian is 394/2 = 197.

2. Next you find the interval that includes the 197th score from the bottom. To do this, start at the bottom of the cumulative frequency column and move up until you find the first number that is either equal to 197 or greater than 197. In this case, it is the interval for the score of 12. You may be surprised that we are calling that an interval, because we have only one score, but for the purposes of this computation it is an interval from 11.5 to 12.5. This is illustrated by the figure below.

3. Now we can identify all of the numbers that will go into the formula for the median. LRL stands for Lower Real Limit of the interval that contains the median. In this case, it is 11.5. The interval width (i in the formula) is 1. We compute it by subtracting the lower real limit from the upper real limit [e.g., 12.5-11.5=1.0]. We had previously computed nmedian as 197 [394/2]. The term nLRL refers to the number of people with scores below the lower real limit of the interval. You can read this off of the frequency distribution by noting the number in the cumulative frequency column for the interval below the one that contains the median. In this case, it is 129. In other words, 129 people in our example score below a 12. Finally, fi is the frequency of scores within the interval that contains the median. We can read that number from the frequency column of our distribution. In this case, it is 85.

4. Now we plug all of those items into the formula to get the following.
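    Median = 11.5 + 1 × (197 − 129) / 85 = 11.5 + 68/85 = 11.5 + 0.80 = 12.3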

This is the most complicated formula that you have had to deal with so far, but the logic behind it is not as complicated as the formula makes it appear. We have determined that the middle score (197th) appears in the interval of 12, which has real limits from 11.5 to 12.5. Furthermore, we have determined that 85 people are in that interval and 129 score below that interval. This formula makes the assumption that the 85 people scoring in the interval that contains the median are evenly distributed. That is, the first person takes the bottom 1/85th of that interval, the next person takes the next 1/85th of that interval, up to the last person, who takes the top 1/85th of that space. The value "nmedian - nLRL" computes how far we have to count up from the bottom of the interval. In this case, we must count up 68 people [197-129], which is 80% of the way from the bottom of the interval [68/85]. The figure below illustrates this process, which is built into the formula.

Although you can create a frequency distribution and do the computations that we just walked you through, with large data sets, it is much more likely that you will use a computer package like SPSS for Windows to do this computation. Computer packages will use this formula to make the computation. To see how you would request the computation of the median using SPSS for Windows, click on the button below. When you want to return to this page, use the back arrow key on your browser to return.
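If you would like to see the logic in code form, here is a minimal Python sketch of the interpolation described above (our own illustration, not the SPSS procedure); the function name and the assumption that each score occupies an interval one unit wide are ours.

    # Minimal sketch of the interpolated median described above.
    # freq_table maps each score to its frequency; width is the interval width,
    # so the real limits of a score run from score - width/2 to score + width/2.
    def interpolated_median(freq_table, width=1.0):
        n_total = sum(freq_table.values())          # N
        n_median = n_total / 2                      # position of the middle score
        cumulative = 0                              # number of scores below the current interval
        for score in sorted(freq_table):            # move up from the bottom
            f_i = freq_table[score]
            if cumulative + f_i >= n_median:        # interval containing the median
                lrl = score - width / 2             # lower real limit of that interval
                return lrl + width * (n_median - cumulative) / f_i
            cumulative += f_i

    # The frequency distribution from the example above:
    freqs = {8: 11, 9: 21, 10: 39, 11: 58, 12: 85, 13: 71, 14: 48, 15: 33, 16: 20, 17: 8}
    print(interpolated_median(freqs))               # 12.3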

The Mode


The mode is the most frequently occurring score. In a frequency distribution, you compute the mode by looking for the largest number in the frequency column. The score associated with that number is the mode. Using the frequency distribution previously used for computing the median, the largest frequency is 85, and it is associated with a score of 12. Therefore, 12 is the mode.

If you have a grouped frequency distribution, the mode is the midpoint of the interval that contains the largest number of scores. That can create a bit of instability, which we can illustrate with an example. 

Suppose that we use the data from the frequency distribution above to create a grouped frequency distribution with an interval width of 2 scores. If we start by grouping 8 and 9 together, we will produce the following grouped frequency distribution.

    Interval   Frequency   Cumulative Frequency
     16-17          28             394
     14-15          81             366
     12-13         156             285
     10-11          97             129
      8-9           32              32

The interval with the largest frequency is 12-13, and the midpoint of that interval is 12.5. Therefore, 12.5 is the mode. 

However, suppose that we create a similar grouped frequency distribution, again with an interval width of 2, but this time starting with an interval of 7-8. If we do, we will get the following grouped frequency distribution.

    Interval   Frequency   Cumulative Frequency
     17-18           8             394
     15-16          53             386
     13-14         119             333
     11-12         143             214
      9-10          60              71
      7-8           11              11


Now the interval with the largest frequency is 11-12, with a midpoint of 11.5. Therefore, the mode is 11.5. So the mode shifts depending on how we set up the intervals. The effect is rather small in this example, because the sample size is large and the distribution is close to symmetric.

With small sample sizes and less symmetric distributions, you can get huge shifts in the mode. This is one reason the mode is considered unstable. Another reason is that a shift of just a few scores can make a different score the mode.

Comparing the Measures

In the textbook, we provide an example in one of the Cost of Neglect boxes of how the mean can misrepresent the typical score when there are a few deviant scores. In our example, there were five employees in the company, four of whom made $40,000 and one of whom made $340,000. The mean was $100,000, which clearly does not reflect the typical salary. The median ($40,000) was a much better estimate of the typical salary.

This is an extreme example of a general principle. When a distribution is symmetric, like in the top panel of the figure below, the mean, median, and mode will all be the same. However, as the curve becomes more skewed, these three measures of central tendency diverge. The mode will always be at the peak of the curve, because the highest point indicates the most frequent score. The mean will be pulled the most toward the tail of the skew, with the median in between.


These graphs may help you to understand what each of these measures of central tendency actually measures.

The mode is always the score at which the curve reaches its highest point (i.e., the most frequent score).

The median is the score that cuts the curve into two equal areas. In other words, the area above the median line is equal to the area below the median line. The area under a frequency curve is proportional to the number of people or objects represented by that curve. Remember, the median is the 50th percentile, so there should be an equal number of scores above and below the median, and the areas of the curve above and below the median should be equal to reflect this.


The mean is the balance point for the curve. What that means is that if we cut a block of wood in the exact shape of the curve, the mean would be the point at which that block of wood could be perfectly balanced on your finger. It is the point at which the deviations of the scores above the mean exactly balance the deviations of the scores below the mean.

Measures of Variability

Measures of variability indicate the degree to which the scores in a distribution are spread out. Larger numbers indicate greater variability of scores. Sometimes the word dispersion is substituted for variability, and you will find that term used in some statistics texts. 

We will divide our discussion of measures of variability into four categories: range measures, the average deviation, the variance, and the standard deviation.

Range Measures

In Chapter 5, we introduced only one range measure, which was called the range. The range is the distance from the lowest score to the highest score. We noted that the range is very unstable, because it depends on only two scores. If one of those scores moves further from the distribution, the range will increase even though the typical variability among the scores has changed little. 

This instability of the range has led to the development of two other range measures, neither of which relies on only the lowest and highest scores. The interquartile range is the distance from the 25th percentile to the 75th percentile. The 25th percentile is also called the first quartile, because it divides the lowest quarter of the distribution from the rest of the distribution. The 75th percentile is also called the third quartile, because it divides the lowest three quarters of the distribution from the rest of the distribution. Typically, the quartiles are indicated by uppercase Qs, with the subscript indicating which quartile we are talking about (Q1 is the first quartile and Q3 is the third quartile). So the interquartile range can be computed by subtracting Q1 from Q3 [i.e., Q3 - Q1].

There is a variation on the interquartile range, called the semi-interquartile range or quartile deviation. This value is equal to half of the interquartile range.

Using this notation, the median is the second quartile (the 50th percentile). That means that we can use a variation of the formula for the median to compute both the first and third quartiles. Starting from the equation for the median, we would make the following changes to compute these quartiles.

To compute Q1, nmedian becomes nQ1, which is equal to .25*N. We then identify the interval that contains the nQ1 score. All of the other values are obtained in the same way as for the median.

To compute Q3, nmedian becomes nQ3, which is equal to .75*N. We then identify the interval that contains the nQ3 score. All of the other values are obtained in the same way as for the median.


To compute the interquartile range, subtract Q1 from Q3. To compute the quartile deviation, divide the interquartile range by 2.
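As a worked illustration of our own, using the frequency distribution from the median example (N = 394): for Q1, nQ1 = .25 × 394 = 98.5, which falls in the interval for the score of 11 (lower real limit 10.5, 71 scores below it, 58 scores in it), so Q1 = 10.5 + (98.5 − 71)/58 ≈ 10.97. For Q3, nQ3 = .75 × 394 = 295.5, which falls in the interval for the score of 14 (lower real limit 13.5, 285 scores below it, 48 scores in it), so Q3 = 13.5 + (295.5 − 285)/48 ≈ 13.72. The interquartile range is therefore about 2.74, and the quartile deviation is about 1.37.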

It is common to report the range, and many computer programs routinely provide the minimum score, maximum score, and the range as part of their descriptive statistics output. Nevertheless, these are not widely used measures of variability. The same computer programs that give a range will also provide both a standard deviation and a variance. We will discuss these measures of variability shortly, after we have introduced the concept of the average deviation.

The Average Deviation

The average deviation is not a measure of variability that anyone uses, but it provides an understandable introduction to the variance. The variance is not an intuitive statistic, but it is very useful in other statistical procedures. In contrast, the average deviation is intuitive, although generally worthless for other statistical procedures. So we will use the average deviation to introduce the concept of the variance.

The average deviation is, as the name implies, the average deviation (distance) from the mean. To compute it, you start by computing the mean, then you subtract the mean from each score, ignoring the sign of the difference, and sum those differences. You then divide by the number of scores (N). The formula is shown below. The vertical lines on either side of the numerator indicate that you should take the absolute value, which converts all the differences to positive quantities. Therefore, you are computing deviations (distances) from the mean.
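In symbols:

    Average Deviation = Σ|X − X̄| / N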

Chapter 5 in the textbook walked you through the computation of the average deviation. The reason we take the absolute value of these distances from the mean is that the sum of the differences from the mean, some positive and some negative, will always equal zero. We can prove that fact with a little algebra, but you can take our word for it.

As we mentioned earlier, the average deviation is easy to understand, but it has little value for inferential statistics. In contrast, the next two measures (variance and standard deviation) are useful in other statistical procedures. So we now turn our attention to them.


The Variance

The variance takes a different approach to making all of the distances from the mean positive so that they will not sum to zero. Instead of taking the absolute value of the difference from the mean, the variance squares all of those differences. 

The notation that is used for the variance is a lowercase s². The formula for the variance is shown below. If you compare it with the formula for the average deviation, you will see two differences instead of one between these formulas. The first is that the differences are squared instead of converted to absolute values. The numerator of this formula is called the sum of squares, which is short for sum of squared differences from the mean. See if you can spot the second difference.
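    s² = Σ(X − X̄)² / (N − 1)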

Did you recognize that the variance formula does not divide by N, but instead divides by N-1? The denominator (N-1) in this equation is called the degrees of freedom. It is a concept that you will hear about again and again in statistics. If you would like to know more about degrees of freedom, you can click on this link for a conceptual explanation.

The reason that the variance formula divides the sum of squared differences from the mean by N-1 is that dividing by N would produce a biased estimate of the population variance, and that bias is removed by dividing by N-1. You can learn more about the concept of biased versus unbiased estimates of population parameters by clicking on this link.

The Standard Deviation

The variance has some excellent statistical properties, but it is hard for most students to conceptualize. To start with, the unit of measurement for the mean is the same as the unit of measurement for the scores. For example, if we compute the mean age of our sample and find that it is 28.7 years, that mean is on the same scale as the individual ages of our participants. But the variance is in squared units. For example, we might find that the variance is 100 years².

Can you even imagine what the unit of years squared represents? Most people can't. But there is a measure of variability that is in the same units as the mean. It is called the standard deviation, and it is the square root of the variance (see the formula below). So if the variance were 100 years², the standard deviation would be 10 years. Since we used the symbol s² to indicate the variance, you might not be surprised that we use the lowercase letter s to indicate the standard deviation. You will see in our discussion of relative scores how valuable the standard deviation can be.
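In symbols:

    s = √s² = √[ Σ(X − X̄)² / (N − 1) ]

If you would like to check these formulas numerically, Python's standard library draws exactly this distinction between the N − 1 (sample) and N (population) versions; the scores below are made up purely for illustration.

    # Sample versus population versions of the variance and standard deviation.
    import statistics

    scores = [2, 4, 4, 4, 5, 5, 7, 9]        # made-up example scores; the mean is 5
    print(statistics.variance(scores))       # sample variance, divides by N - 1 (about 4.57)
    print(statistics.stdev(scores))          # sample standard deviation (about 2.14)
    print(statistics.pvariance(scores))      # population variance, divides by N (4)
    print(statistics.pstdev(scores))         # population standard deviation (2.0)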


At this point, many students assume that the variance is just a step in computing the standard deviation, because the standard deviation seems like it is much more useful and understandable. In fact, you will use the standard deviation for description purposes only and will use the variance for all your other statistical tasks. If you are wondering why that is, click on this link to find out.

Relative Scores

In this section, we will use some of the statistical information that you have already learned to solve the practical problem of how to indicate the relative standing of a person on a measure. 

As part of this section, you will learn about the standard normal distribution, which is a distribution that is defined by a mathematical equation. Such mathematical distributions are a critical part of the inferential statistical process that we will be covering later.

Percentile Ranks

Relative scores indicate where a person stands within a specified normative sample. In general, scores have little meaning unless you know how other people scored. For example, on most of the exams that you have taken in school, you probably needed to answer 70% or more of the questions correctly just to pass the course, yet the standardized tests that you took as part of the college admission process are designed so that fewer than half of the people taking them get 70% or more of the questions correct.

To know how good a score is, you need to know what other people got. For example, a professional baseball player who got a hit only half the times he went up to bat would be by far the greatest hitter of all time. Most professional baseball players only get a hit about every fourth time at bat. In contrast, a driver who only reached his or her destination without an accident half the time would be considered so bad that no insurance company would cover the person. Most people arrive safely at their destination 99+% of the time. 

We are constantly seeking information about relative scores, sometimes even before the scores have been computed. How many times have you walked out of a test and asked other students whether it seemed hard or easy? If you thought it was hard, and therefore are worried that you did not do well, you are likely to feel a little better after other students tell you that it seemed very hard to them as well.

The most basic relative score is the percentile rank, which specifies the percentage of people in the normative group who score lower on the measure than you do. So if you scored at the 25th percentile, it means that 25% of the people score lower than you and 75% score higher than you. Percentile ranks can range from 0 (for the person with the lowest score) to 100 (for the person with the highest score).

Most often, percentile ranks are computed from a frequency distribution. Let's again use the frequency distribution that we have used before; it is reproduced below. Suppose that we want to compute the percentile rank for a score of 15. From the table, we can see that there are 333 people with a score below 15, but what do we do with the 33 people who have exactly 15? Do we count them as scoring above or below our person with a score of 15? The tradition is to assume that half of the people with the same score are below and half are above. That means that we treat 33/2 = 16.5 of the people with a score of 15 as scoring below us and the other 16.5 as scoring above us. We add those 16.5 people to the 333 people with scores of 14 or lower to get the number of people with scores lower than ours.

    Score   Frequency   Cumulative Frequency
     17          8              394
     16         20              386
     15         33              366
     14         48              333
     13         71              285
     12         85              214
     11         58              129
     10         39               71
      9         21               32
      8         11               11

There are a total of 394 people. To get the percentile rank, we divide the number of people below our score by the total number of people and multiply by 100 (to convert the proportion to a percent). In this case, the percentile rank is 89 [(349.5/394)*100]. 

We traditionally round percentile ranks to two significant digits. So we rounded 88.705584% to 89%.
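If you would like to see this convention spelled out step by step, here is a minimal Python sketch of the calculation just described; the function name and table layout are our own illustration, not part of any statistics package.

    # Percentile rank from a frequency table: count everyone below the score,
    # plus half of the people who obtained exactly that score.
    def percentile_rank(freq_table, score):
        below = sum(f for s, f in freq_table.items() if s < score)
        at_score = freq_table.get(score, 0)
        n_total = sum(freq_table.values())
        return (below + at_score / 2) / n_total * 100

    freqs = {8: 11, 9: 21, 10: 39, 11: 58, 12: 85, 13: 71, 14: 48, 15: 33, 16: 20, 17: 8}
    print(round(percentile_rank(freqs, 15)))     # 89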

Standard Normal Distribution

Many variables in psychology tend to show a distinctive shape when graphed using a histogram or frequency polygon. The shape resembles a bell shaped curve like the one shown below. This classic bell shaped curve is called a normal curve or normal distribution. 

The normal curve is perfectly symmetric: the right half and the left half are mirror images of one another. The curve never quite reaches zero, although it gets very close. The shape of the normal curve is actually determined by a complex equation, which dictates the height of the curve at every point. You need not know the details of this equation, but you should know that it includes two quantities: the mean and the standard deviation. The mean dictates where the middle of the distribution is, which is the highest point of the curve and the point that separates the area under the curve into two equal segments. The standard deviation determines how spread out the curve is.

Because the normal curve is based on an equation, it is possible to know exactly how high the curve is at every point and how much area is under the curve between any two scores on the X-axis. The figure below marks off 1 and 2 standard deviations both above and below the mean. The area under the curve between the mean and one standard deviation below the mean is approximately 34%, as shown in the figure. More precisely, it is 34.13%. We will show you where that number comes from shortly. 

Because the curve is symmetric, the area between the mean and one standard deviation above the mean is also 34%. Similarly, the area between 1 and 2 standard deviations, either above or below the mean, is approximately 14%, and the area beyond 2 standard deviations is 2%  on either side of the distribution. All of these areas are determined by the equation for the normal curve, but you do not need to use this equation, because the values are computed for you and available in a table called the Area under the Standard Normal Curve Table. If you click on the link to this table, you can see what it looks like.

To use the Standard Normal Table, you need to know a little more about the normal curve and you need to learn about the standard score, also known as the Z-score. 

If you look at the two normal curves above, you might notice that there are no scores listed on the X-axis. Remember that the location on the X-axis is determined by the mean of the distribution, and the spread of the curve is determined by the standard deviation of the distribution. But you can convert a normal curve with any mean and variance into a standard normal distribution, which is a normal curve with a mean of zero and a standard deviation of 1.


If the curve above were a standard normal distribution, the labels on the X-axis at the lines that divide the curve into sections would be -2, -1, 0, +1, +2, read from left to right. The table we showed you earlier gives the areas under the curve of such a standard normal distribution. 

Shown below is the equation that converts any score to a standard score using the values of the score and the mean and standard deviation of the distribution. A standard score shows where the person scores in a standard normal distribution. It tells you instantly whether the score is above or below the mean by the sign of the Z-score. If the Z-score is positive, the person scored above the mean; if it is negative, the person scored below the mean. The size of the Z-score indicates how far away from the mean the person scored.
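    Z = (X − X̄) / s

(For scores from an entire population, the mean and standard deviation in this equation would be written μ and σ.)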

If you want to see exactly how the standard normal distribution and the Z-score can be used to compute a percentile rank, you can click on this link. Besides walking you through the process, this link provides exercises to help you master this concept and procedure.

Other Relative Scores

The score on any measure can be converted to a Z-score, which tells you at a glance how a person scored relative to the reference group. For example, if someone tells you that her Z-score on the exam was +1.55, you know immediately that she scored above the mean, and enough above the mean that she is near the top of the class. Remember, most of the normal distribution is contained between the boundaries of -2.0 and +2.0 standard deviations; there is only about 2% of the area under the curve in each of the tails. If another student tells you that his exam score was a Z of -.36, you know that he scored a bit below the mean.

With the standard normal table, you could compute the percentile rank for each of these students in a few minutes. Of course, this procedure is legitimate only if the shape of the distribution of scores is normal or very close to normal. If the shape is not normal, the Standard Normal Table will not give accurate information about the proportion of people who score above and below a given score.

Although Z-scores are very useful and allow people to judge the relative performance of an individual quickly, many people get easily confused by the negative numbers that are possible with Z-scores. Consequently, many tests compute Z-scores, but then translate them mathematically to avoid negative numbers.

For example, the IQ test produces a distribution of scores that is very close to normal. However, the IQ test does not give a person's score as a Z-score; instead, it reports the score as an IQ score. The IQ score is simply the Z-score multiplied by 15 and then added to 100, as shown in the equation below. The values of 15 and 100 are arbitrary, but the effect of this transformation is to produce a normal distribution with a mean of 100 and a standard deviation of 15. So the IQ distribution looks like the figure below. Note that this figure is identical in shape to all the other figures in this section. The only difference is that the scores on the X-axis are IQ scores. So just over 95% of people have IQ scores between 70 and 130, and no one has a negative IQ.
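In equation form:

    IQ = 100 + 15 × Z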

Standardized tests often perform a similar transformation to avoid negative scores. For example, the Scholastic Aptitude Test (SAT) used for college admission and the Graduate Record Exam (GRE) used for admission to graduate school are both standardized so that the mean of the subtests is 500, with a standard deviation of 100. So if you score 450 on the verbal section of the SAT, you are scoring .5 standard deviations below the mean, which puts you at the 31st percentile. (See if you can do the computations and use the standard normal table to verify this percentile rank. This link shows you the method to make this computation.)
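If you would rather check that value with software than with the printed table, Python's standard library includes the cumulative standard normal distribution; this is simply a check of the SAT example above.

    # Verifying the SAT example: a score of 450 when the mean is 500 and the
    # standard deviation is 100 corresponds to Z = -0.5.
    from statistics import NormalDist

    z = (450 - 500) / 100
    percentile = NormalDist().cdf(z) * 100    # area below Z under the standard normal curve
    print(round(percentile))                  # 31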

The Normative Sample

Z-scores and transformed Z-scores, such as SAT scores, are very handy and are used extensively in reporting test scores. But it is critical to understand that the score is meaningful ONLY if you take into account the normative sample. 

A quick example will illustrate this point. Let's assume that Dan took the SAT as a High School senior and scored 650 on each subtest. That is 150 points above the mean (1.5 standard deviations) and would place him at approximately the 93rd percentile. 

Four years later, after doing well in college, he decides to go to graduate school, and so he takes the GRE. This time he only obtained a 550 on each of the subtests. What happened? Why did his performance decrease despite the fact that he worked hard in college and did very well?

You may already have guessed the answer to that question. In effect, we are trying to compare apples and oranges. The scores on the SAT and the GRE mean entirely different things, because they are based on entirely different normative samples. 


The SAT is taken by people who expect to complete high school and are considering going on to college. In contrast, the GRE is taken by people who expect to graduate from college and plan to go on to graduate school. Anyone who drops out of college or does poorly in college is unlikely to take the GRE. In other words, the normative sample for the GRE is much more exclusive than the normative sample for the SAT. Dan's GRE score would place him at the 70th percentile of the people applying for graduate school, who are a pretty elite group academically. Most of the people who took the SAT did not take the GRE, and most of the people who take the GRE did very well on the SAT. The competition (i.e., the normative group) was tougher on the GRE than on the SAT.

Whenever you are given a normative score, such as a Z-score, percentile rank, or score on a standardized test, you should always consider the nature of the normative sample. A person making $500,000 per year may be one of the best paid people in the country (the normative sample including all workers), but one of the lowest paid CEOs for a Fortune 500 company (a different normative sample).

Measures of Relationship

Chapter 5 of the textbook introduced you to the three most widely used measures of relationship: the Pearson product-moment correlation, the Spearman rank-order correlation, and the Phi correlation. We will be covering these statistics in this section, as well as other measures of relationship among variables.

What is a Relationship?

Correlation coefficients are measures of the degree of relationship between two or more variables. When we talk about a relationship, we are talking about the manner in which the variables tend to vary together. For example, if one variable tends to increase at the same time that another variable increases, we would say there is a positive relationship between the two variables. If one variable tends to decrease as another variable increases, we would say that there is a negative relationship between the two variables. It is also possible that the variables might be unrelated to one another, so that you cannot predict one variable by knowing the level of the other variable.

As a child grows from an infant into a toddler into a young child, both the child's height and weight tend to change. Those changes are not always tightly locked to one another, but they do tend to occur together. So if we took a sample of children from a few weeks old to 3 years old and measured the height and weight of each child, we would likely see a positive relationship between the two.

A relationship between two variables does not necessarily mean that one variable causes the other. When we see a relationship, there are three possible causal interpretations. If we label the variables A and B, A could cause B, B could cause A, or some third variable (we will call it C) could cause both A and B. 

With the relationship between height and weight in children, it is likely that the general growth of children, which increases both height and weight, accounts for the observed correlation. It is very foolish to assume that the presence of a correlation implies a causal relationship between the two variables. There is an extended discussion of this issue in Chapter 7 of the text.

Scatter Plots and Linear Relationships

A helpful way to visualize a relationship between two variables is to construct a scatter plot, which you were briefly introduced to in our discussion of graphical techniques. A scatter plot represents each set of paired scores on a two dimensional graph, in which the dimensions are defined by the variables. 

For example, if we wanted to create a scatter plot of our sample of 100 children for the variables of height and weight, we would start by drawing the X and Y axes, labeling one height and the other weight, and marking off the scales so that the range on these axes is sufficient to handle the range of scores in our sample. Let's suppose that our first child is 27 inches tall and 21 pounds. We would find the point on the weight axis that represents 21 pounds and the point on the height axis that represents 27 inches. Where these two points cross, we would put a dot that represents the combination of height and weight for that child, as shown in the figure below.

We then continue the process for all of the other children in our sample, which might produce the scatter plot illustrated below.


It is always a good idea to produce scatter plots for the correlations that you compute as part of your research. Most will look like the scatter plot above, suggesting a linear relationship. Others will show a distribution that is less organized and more scattered, suggesting a weak relationship between the variables. But on rare occasions, a scatter plot will indicate a relationship that is not a simple linear relationship, but rather shows a complex relationship that changes at different points in the scatter plot. 

The scatter plot below illustrates a nonlinear relationship, in which Y increases as X increases, but only up to a point; after that point, the relationship reverses direction. Using a simple correlation coefficient for such a situation would be a mistake, because the correlation cannot capture accurately the nature of a nonlinear relationship.


Pearson Product-Moment Correlation

The Pearson product-moment correlation was devised by Karl Pearson in 1895, and it is still the most widely used correlation coefficient. The history behind the mathematical development of this index is fascinating. Those interested in that history can click on the link. But you need not know that history to understand how the Pearson correlation works.

The Pearson product-moment correlation is an index of the degree of linear relationship between two variables that are both measured on at least an ordinal scale of measurement. The index is structured so that a correlation of 0.00 means that there is no linear relationship, a correlation of +1.00 means that there is a perfect positive relationship, and a correlation of -1.00 means that there is a perfect negative relationship.

As you move from zero to either end of this scale, the strength of the relationship increases. You can think of the strength of a linear relationship as how tightly the data points in a scatter plot cluster around a straight line. In a perfect relationship, either negative or positive, the points all fall on a single straight line. We will see examples of that later. 

The symbol for the Pearson correlation is a lowercase r, which is often subscripted with the two variables. For example, rxy would stand for the correlation between the variables X and Y.


The Pearson product-moment correlation was originally defined in terms of Z-scores. In fact, you can compute the product-moment correlation as the average cross-product of the paired Z-scores, as shown in the first equation below. But that equation is difficult to use for computations. The more commonly used equation now is the second equation below.

Although this equation looks much more complicated and looks like it would be much more difficult to compute, in fact, this second equation is by far the easier of the two to use if you are doing the computations with nothing but a calculator.
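Written out in plain notation, the two equations take their standard forms: the Z-score (definitional) form first, then the raw-score (computational) form. Note that some texts divide the cross-products by N − 1 rather than N, depending on how the Z-scores were computed.

    rXY = Σ(ZX × ZY) / N

    rXY = [NΣXY − (ΣX)(ΣY)] / √{ [NΣX² − (ΣX)²] × [NΣY² − (ΣY)²] }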

You can learn how to compute the Pearson product-moment correlation either by hand or using SPSS for Windows by clicking on one of the buttons below. Use the browser's return arrow key to return to this page.

Spearman Rank-Order Correlation

The Spearman rank-order correlation provides an index of the degree of linear relationship between two variables that are both measured on at least an ordinal scale of measurement. If one of the variables is on an ordinal scale and the other is on an interval or ratio scale, it is always possible to convert the interval or ratio scale to an ordinal scale. That process is discussed in the section showing you how to compute this correlation by hand.

The Spearman correlation has the same range as the Pearson correlation, and the numbers mean the same thing. A zero correlation means that there is no relationship, whereas correlations of +1.00 and -1.00 mean that there are perfect positive and negative relationships, respectively. 

The formula for computing this correlation is shown below. Traditionally, the lowercase r with a subscript s is used to designate the Spearman correlation (i.e., rs). The one term in the formula that is not familiar to you is d, which is equal to the difference in the ranks for the two variables. This is explained in more detail in the section that covers the manual computation of the Spearman rank-order correlation.
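Written out, the standard form of that formula is:

    rs = 1 − (6Σd²) / [N(N² − 1)]

where d is the difference between the two ranks for each pair of scores and N is the number of pairs.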


The Phi Coefficient

The Phi coefficient is an index of the degree of relationship between two variables that are measured on a nominal scale. Because variables measured on a nominal scale are simply classified by type, rather than measured in the more general sense, there is no such thing as a linear relationship. Nevertheless,  it is possible to see if there is a relationship. 

For example, suppose you want to study the relationship between religious background and occupation. You have a classification system for religion that includes Catholic, Protestant, Muslim, Other, and Agnostic/Atheist. You have also developed a classification for occupations that includes Unskilled Laborer, Skilled Laborer, Clerical, Middle Manager, Small Business Owner, and Professional/Upper Management. You want to see if the distribution of religious preferences differs by occupation, which is just another way of saying that there is a relationship between these two variables.

The Phi Coefficient is not used nearly as often as the Pearson and Spearman correlations. Therefore, we will not be devoting space here to the computational procedures.

Advanced Correlational Techniques

Correlational techniques are immensely flexible and can be extended dramatically to solve various kinds of statistical problems. Covering the details of these advanced correlational techniques is beyond the scope of this text and website. However, we have included brief discussions of several advanced correlational techniques on this Student Resource Website, including multidimensional scaling, path analysis, taxonomic search techniques, and statistical analysis of neuroimages.

Nonlinear Correlational Procedures

The vast majority of correlational techniques used in psychology are linear correlations. However, there are times when we can expect to find nonlinear relationships and we would like to apply statistical procedures to capture such complex relationships. This topic is far too complex to cover here. The interested student will want to consult advanced statistical textbooks that specialize in regression analyses. 

There are two words of caution that we want to state about using such nonlinear correlational procedures. Although it is relatively easy to do the computations using modern statistical software, you should not use these procedures unless you actually understand them and their pitfalls. It is easy to misuse the techniques and to be fooled into believing things that are not true from a naive analysis of the output of computer programs. 


The second word of caution is that there should be a strong theoretical reason to expect a nonlinear relationship if you are going to use nonlinear correlational procedures. Many psychophysiological processes are by their nature nonlinear, so using nonlinear correlations in studying those processes makes complete sense. But for most psychological processes, there is no good theoretical reason to expect a nonlinear relationship.

Linear Regression

As you learned in Chapters 5 and 7 of the text, the value of correlations is that they can be used to predict one variable from another variable. This process is called linear regression, or simply regression. It involves mathematically fitting a straight line to the data in a scatter plot.

Below is a scatter plot from our discussion of correlations. We have added a regression line to that scatter plot to illustrate how regression works. We compute the regression line with formulas that we will present to you shortly. The regression line is based on our data. Once we have the regression line, we can then use it to predict Y from knowing X. 

The scatter plot below shows the relationship of height and weight in young children (birth to three years old). The line that runs through the data points is called the regression line. It is determined by an equation, which we will discuss shortly. If we know the value of X (in this case, weight) and we want to predict Y from X, we draw a line straight up from our value of X until it intersects the regression line, and then we draw a line that is parallel to the X-axis over to the Y-axis. We then read from the Y-axis our predicted value for Y (in this case, height).


In order to fit a line mathematically, there must be some stated mathematical criterion for what constitutes a good fit. In the case of linear regression, that criterion is called the least squares criterion, which is shorthand for positioning the line so that the sum of the squared distances from each score to its predicted score is as small as it can be.

If you are predicting Y, you will compute a regression line that minimizes the sum of the (Y - Y')² values. Traditionally, a predicted score is referred to by using the letter of the score with a prime symbol after it (Y' is read "Y prime" or "Y predicted").

To illustrate this concept, we removed most of the clutter of data points from the above scatter plot and showed the distances that are involved in the least squares criterion. Note that it is the vertical distance from the point to the prediction line--that is, the difference between the predicted Y (along the regression line) and the actual Y (represented by the data point). A common misconception is that you measure the shortest distance to the line, which would be the distance measured at right angles to the regression line.


It may not be immediately obvious, but if you were trying to predict X from Y, you would be minimizing the sum of the squared distances (X - X')². That means that the regression line for predicting Y from X may not be the same as the regression line for predicting X from Y. In fact, it is rare that they are exactly the same.

The first equation below is the basic form of the regression line. It is simply the equation for a straight line, which you probably learned in high school math. The two new notational items are bYX and aYX, which are the slope and the intercept of the regression line for predicting Y from X. The slope is how much the Y scores increase per unit of increase in the X scores. The slope in the figure above is approximately .80: for every 10 units of movement along the X axis, the line rises about 8 units on the Y axis. The intercept is the point at which the line crosses the Y axis (i.e., the point at which X is equal to zero).

The equations for computing the slope and intercept of the line are listed as the second and third equations, respectively. If you want to predict X from Y, simply replace all the Xs with Ys and the Ys with Xs in the equations below. 
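Written out in plain notation, the standard forms of these three equations are:

    Y' = bYX(X) + aYX                                              (the regression line)

    bYX = rXY(σY / σX) = [NΣXY − (ΣX)(ΣY)] / [NΣX² − (ΣX)²]        (the slope)

    aYX = Ȳ − bYX(X̄)                                               (the intercept)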


A careful inspection of these equations will reveal a couple of important ideas. First, if you look at the first version of the equation for the slope (the one using the correlation and the population standard deviations), you will see that the slope is equal to the correlation if the population variances are equal. That would be true either for predicting X from Y or for predicting Y from X. What is less clear, but is also true, is that the slope for predicting X from Y and the slope for predicting Y from X will be identical if the population variances are equal; that is the ONLY situation in which the two slopes are the same.

Second, if the correlation is zero (i.e., no relationship between X and Y), then the slope will be zero (look at the first part of the second equation). If you are predicting Y from X, your regression line will be horizontal, and if you are predicting X from Y, your regression line will be vertical. Furthermore, if you look at the third equation, you will see that the horizontal line for predicting Y will be at the mean of Y and the vertical line for predicting X will be at the mean of X. 

Think about that for a minute. If X and Y are uncorrelated and you are trying to predict Y, the best prediction that you can make is the mean of Y. If you have no useful information about a variable and are asked to predict the score of a given individual, your best bet is to predict the mean. To the extent that the variables are correlated, you can make a better prediction by using the information from the correlated variable and the regression equation.
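To make the whole chain concrete, here is a minimal Python sketch that computes the correlation, the slope, the intercept, and a prediction from a handful of made-up weight and height values; the data and function names are purely illustrative.

    # Correlation and Y-on-X regression from raw scores, using the
    # computational formulas shown above (illustrative data only).
    from math import sqrt

    def pearson_r(x, y):
        n = len(x)
        sx, sy = sum(x), sum(y)
        sxy = sum(a * b for a, b in zip(x, y))
        sxx = sum(a * a for a in x)
        syy = sum(b * b for b in y)
        return (n * sxy - sx * sy) / sqrt((n * sxx - sx ** 2) * (n * syy - sy ** 2))

    def regression_y_on_x(x, y):
        n = len(x)
        sx, sy = sum(x), sum(y)
        sxy = sum(a * b for a, b in zip(x, y))
        sxx = sum(a * a for a in x)
        slope = (n * sxy - sx * sy) / (n * sxx - sx ** 2)    # bYX
        intercept = sy / n - slope * (sx / n)                # aYX = mean(Y) - bYX * mean(X)
        return slope, intercept

    weights = [18, 21, 23, 25, 27, 30]        # X: made-up weights in pounds
    heights = [25, 27, 29, 31, 32, 34]        # Y: made-up heights in inches

    b, a = regression_y_on_x(weights, heights)
    print(pearson_r(weights, heights))        # correlation between weight and height
    print(b, a)                               # slope and intercept of the regression line
    print(b * 24 + a)                         # predicted height for a 24-pound child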

Other Descriptive Statistics

The most commonly used descriptive statistics for a distribution of scores are the mean and the variance (or standard deviation), but there are other descriptive statistics available, and they are often computed by data analysis programs. We will discuss two of them in this section: skewness and kurtosis.

Skewness and kurtosis each describe an aspect of the shape of a distribution, much as the mean and the variance describe a distribution's center and spread. But these indices of the shape of the distribution are used far less frequently than the mean and variance.


Skewness was discussed in the section on graphing data with histograms and frequency polygons. There is an index of the degree and direction of skewness that many statistical programs produce as part of their descriptive statistics package. When the skewness is near zero, the distribution is close to symmetric. A negative number indicates a negative skew, and a positive number indicates a positive skew.

Kurtosis indicates the degree of peakedness or flatness of the distribution relative to a normal curve. Larger values indicate a more peaked distribution with heavier tails; smaller values indicate a flatter distribution.

Statisticians refer to a concept called the moments of a distribution. The mean is based on the first moment of the distribution, and the variance is based on the second moment. Skewness and kurtosis are based on the third and fourth moments, respectively. These concepts make more sense to people who do theoretical work in statistics, but for the purposes of this course, the only thing you need to remember is the definition of these terms in case you run across them in your reading.

If you want to compute them, most statistical analysis packages allow you to include these statistics in their descriptive statistics output.