Getting started with SPSS (MASH SPSS sessions, Maths and Statistics Help Centre)


Post on 04-Feb-2021


  • Getting started with SPSS Maths and Statistics Help Centre


    community project encouraging academics to share statistics support resources

    All stcp resources are released under a Creative Commons licence

    MASH SPSS sessions

    Getting started with SPSS


    Data sets used in this booklet
    Statistical Analysis Cycle
    Introduction to data
    Data types
    What is SPSS?
    Opening an Excel file in SPSS
    Titanic data
    Exercise 1: Were wealthy people more likely to survive on the Titanic?
    Labelling values
    Summarising categorical data
    Output in SPSS
    Exercise 2: Who are the most dangerous drivers?
    Research question 1: Were wealthy people more likely to survive on the Titanic?
    Bar Charts
    Tidying up a bar chart
    Adjusting variables
    Reducing the number of categories
    Changing continuous to categorical variables
    Exercise 3
    Summary statistics and graphs: Continuous data
    Averages
    Measures of spread
    Which summary statistics should be used
    Ex 4: Comparison of continuous data by group
    Exercise 5
    Research question 2: Which of three diets was best?
    Calculations using variables
    Summary statistics for groups in tables
    Scatterplots
    Summary of descriptive and graphical statistics
    Research question 3: Which variables are strongly related to birthweight?
    Exercise 6
    Exercise 7
    Getting SPSS on your home computer
    MASH contact details
    Solutions to exercises


    Data sets used in this booklet

    All the data needed for this booklet is contained in the Excel file ‘all_data_for_MASH_workshops’. You will need to download this file from the MASH workshops web page and save it on your computer in order to use it. Once saved, close the file.

    Save the file somewhere:

    Datasets:

    Dataset: Description

    Titanic: List of 1309 passengers on board the Titanic when it sank and details about them such as gender, whether they survived, class etc.

    Diet: 78 people were put on one of three diets, with the goal of determining which diet was best.

    Birthweight: Details for a number of babies and their parents, such as weight and length of babies at birth and weight and height of mother.

    www.sheffield.ac.uk/mash/workshops


    Statistical Analysis Cycle

    Introduction to data

    SECONDARY data is data collected by someone else, e.g. data from the National Student Survey.

    PRIMARY data is data collected by the researcher, e.g. by producing a questionnaire. If you are producing a questionnaire, think very carefully about the questions.

    QUANTITATIVE DATA is numeric and a variety of statistical techniques can be used to summarise and analyse the data.

    QUALITATIVE data is collected using open ended questions such as ‘What do you like best about your course?’.

    For all types of quantitative data, it is likely that it will end up in a spreadsheet with individuals/ subjects on rows and each column representing a variable e.g. answer to Q1 from a questionnaire or heart beat after running for 5 mins.

    A variable is just a measurement which varies between subjects e.g. height or the answer to a question.

    One variable per column

    One subject per row


    Data types

    In order to choose suitable summary statistics and analysis for the data, it is also important to distinguish between continuous (numerical) measurements and categorical variables. The choice of variable necessary to answer the main research questions should be considered at the planning rather than the analysis stage.

    NOMINAL data is categorical data with no order. The labels just name the category.

    Examples: Department; Marital status; What is your favourite animal? (Dog, Cat, Horse, Hamster, Fish, Other)

    ORDINAL data has a recognisable order, e.g. 1st, 2nd, 3rd.

    Likert scales are ordinal, e.g. Strongly disagree – strongly agree. The categories can be numbered but the numbers are no different to names. The gap between 1st and 2nd may be different to the gap between 2nd and 3rd.

    DISCRETE data can only take whole numbers.

    Examples: number of children, how many times you have been on holiday this year.

    CONTINUOUS data can be measured on any scale.

    Examples: height, anything that can have decimals.

    Discrete data is usually treated as continuous in analysis.

    In most situations, the key distinction is between continuous/scale/ measurement data and categorical variables. Different summary statistics, charts and statistical tests are needed for the two types of variables. If discrete variables have a fairly large range of numbers, they can be treated as continuous for analysis purposes.

    Data variables

    Measurements/ scale (appear as meaningful numbers)

    Continuous: takes any value, e.g. height

    Discrete/ count: takes whole numbers, e.g. number of children in a family

    Categorical (appear as categories)

    Ordinal: meaningfully ordered, e.g. ‘agree strongly’ to ‘disagree strongly’ questions

    Nominal: no meaningful order, e.g. eye colour


    What is SPSS?

    SPSS is similar to Excel but it’s easier to produce charts and carry out analysis. To open SPSS, select IBM SPSS Statistics from ‘All programs’. Before opening, an additional screen appears. You can open a dataset from this screen but it’s easiest to just select ‘Type in data’ every time. Data can be opened after SPSS is opened.

    Version 21 and below:

    In version 22, select ‘New Dataset’ and ‘OK’.


    Example of data sheet in SPSS

    Opening an Excel file in SPSS

    Important note: There must be only one row with headings in for SPSS to open an Excel file correctly.

    If SPSS is not open, open SPSS. When prompted to open a file, select ‘Type in data’.

    Variable headings can only appear at the top in the blue boxes.

    Unlike Excel, you can only have one dataset on each page of SPSS. A new file must be created for each individual data set.


    To open any file in SPSS, select File → Open → Data. Here we are opening the ‘Titanic’ data which is currently in Excel. Note: The Excel file must not be open on your computer.

    SPSS only opens one sheet of data at a time so select the required sheet containing the Titanic data.

    Once the data is in SPSS, save the SPSS data file using File → Save as. Save again after making changes to the data.

    Select ‘Excel’ as ‘Type of file’


    Titanic data

    The ship ‘The Titanic’ sank in 1912 along with most of its passengers and crew. The data set that we have contains information on 1309 passengers.

    Exercise 1: Were wealthy people more likely to survive on the Titanic?

    Once the data set is open on your computer, give the following variables suitable labels, label the values for categorical variables and select the correct data type.

    Variable name   Variable label                         Value labels                           Data type
    pclass          Class                                  1 = 1st, 2 = 2nd, 3 = 3rd
    survived                                               0 = Died, 1 = Survived
    Residence       Country of Residence                   0 = American, 1 = British, 2 = Other
    age
    sibsp           Number of siblings/ spouses
    parch           Number of parents/ children on board
    fare            Price of ticket
    Gender          Gender                                 0 = Male, 1 = Female

    a) Which variables would you use to investigate the research question ‘Were wealthy people more likely to survive the sinking of the Titanic?’


    There are two sheets for each dataset. The ‘Data View’ sheet is where the numbers are entered and the ‘Variable View’ sheet is where the variables are named and defined. The option to choose between Data and Variable View is in the bottom left hand corner. For data in categories, type numbers in the Data View sheet and then label the numbers in ‘Variable View’.

    Select Variable View to label the variables/ values.

    There should be one row per person, not one row per group.

    Variable view: Label the variables

    The variable name has restrictions. It cannot contain spaces or certain special characters. Use the ‘Label’ column to give sensible variable descriptions which will appear in all output. If the label is blank, the variable name will appear in output.

    For example, sibsp is ‘Number of siblings/ spouses on board’, parch is ‘Number of parents/ children on board’ and fare is ‘Price of ticket’.


    Labelling values

    It is best to have your categories coded as numbers for analysis in SPSS but for your output, people need to know what the numbers mean. Go to the ‘Values’ column in Variable View and let the mouse hover until you see a blue square. Clicking the square gives the ‘Value labels’ box. Put the number in the value box and the label for that number in the label box. Click on ‘Add’ after each label and ‘OK’ when finished.

    Also, when using secondary data, watch for odd values, such as -99 indicating a missing value. These can be declared in the ‘Missing’ column of Variable View so that they are not taken into account in any analysis.

    Label the categories by selecting the blue box.

    0 = Died and 1 = Survived. Click on ‘Add’ after each one.
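The idea of value labels and declared missing codes can be sketched outside SPSS as well. The Python below is only an illustration, not part of the booklet's SPSS workflow: the function name `label_value` and the sample codes are hypothetical, and -99 is used as the missing code simply because it is the example mentioned above.

```python
# Value labels for the 'survived' variable, as given in the exercise table.
SURVIVED_LABELS = {0: "Died", 1: "Survived"}

# Codes treated as missing (assumption: -99, the example code mentioned in the text).
MISSING_CODES = {-99}


def label_value(code, labels, missing_codes=MISSING_CODES):
    """Return the label for a code, or None when the code marks a missing value."""
    if code in missing_codes:
        return None
    # Codes without a label are shown as the number itself.
    return labels.get(code, str(code))


# Made-up codes for illustration only; these are not rows from the Titanic file.
raw = [0, 1, 1, -99, 0]
labelled = [label_value(code, SURVIVED_LABELS) for code in raw]
print(labelled)
```

The numbers stay in the data and the labels are applied only when output is produced, which mirrors the split between Data View and Variable View.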


    Note: There are two variables for gender. ‘Sex’ is a string variable (words) whereas ‘Gender’ has 0 for males and 1 for females so should be used during analysis.

    Variable Type: SPSS only analyses Numeric variables. String means it’s a word. The width is the number of numbers/ letters allowed for that variable.

    Decimals: When typing in data, the default number of decimals is 2. Change this to 0 for categorical and discrete data.

    The Measure column is where the data type is entered. Continuous/ discrete are called Scale in SPSS. SPSS won’t allow certain analysis for the wrong type of variable.


    Summarising categorical data

    The simplest way to summarise a single categorical variable is by using frequencies or percentages.

    Analyse → Descriptive statistics → Frequencies

    Output in SPSS

    Charts, tables and analysis appear in a separate ‘output’ window in SPSS. The output window is brought to the front of the screen when analysis/ charts etc are requested. The left hand column shows all of the output produced in that session. The output file has to be saved separately to the data file.

    To go back to the data file, select it on the bottom toolbar.

    Use the Valid Percent column as it does not include missing values.

    Move the variable for the number of parents/ children on board and survival to the right hand side and click ‘OK’ to run the analysis.

    Move the variables to be summarised from the list on the left hand side to the right using the arrow in the middle.
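The difference between the Percent and Valid Percent columns can be shown with a small calculation. This is a minimal Python sketch rather than SPSS output: the function `frequency_table` is hypothetical, `None` stands in for a missing value, and the five responses are made up for illustration.

```python
from collections import Counter


def frequency_table(values):
    """Return {value: (count, percent, valid_percent)}; None marks a missing value."""
    valid = [v for v in values if v is not None]
    counts = Counter(valid)
    return {
        value: (n, 100 * n / len(values), 100 * n / len(valid))
        for value, n in counts.items()
    }


# Made-up responses for illustration; one of the five is missing.
responses = ["Died", "Survived", "Died", None, "Died"]
table = frequency_table(responses)

for value, (n, percent, valid_percent) in table.items():
    print(f"{value:<10}{n:>3}{percent:>8.1f}{valid_percent:>8.1f}")
```

Percent divides by all rows, including the missing one, whereas Valid Percent divides only by the rows that have a value.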


    As SPSS produces a lot of output for analysis and you may produce several charts before you decide which one is best, copying the output you require for your project and pasting into a Word document is preferable.

    Quick question: What percentage of people survived the sinking of the Titanic?

    Exercise 2: Who are the most dangerous drivers?

    Often we are interested in looking at the relationship between two variables. We start by investigating how age and gender relate to the number of car accidents in the UK. Stacked or multiple bar charts can summarise this type of information. The following multiple bar chart is taken from an article in the Guardian.

    http://www.theguardian.com/politics/reality-check/2013/oct/11/dangerous-drivers-how-old-uk-age-18

    a) Which gender is most likely to have an accident?

    b) Which age group is most likely to have an accident?

    c) The point of the chart should have been to look at how likely people were to have an accident by age and gender. What is wrong with the chart regarding addressing this research question?


    Research question 1: Were wealthy people more likely to survive on the Titanic?

    In general, using percentages to summarise categorical data is preferable although in the case of small numbers, percentages can be misleading e.g. ‘100% of people agree that mascara A is better than mascara B’ when only 2 people have been asked!

    Suitable charts for categorical data are bar charts and pie charts.

    A contingency table is a way of summarising two categorical variables. However, care needs to be taken when comparing groups of different sizes.

    If class had an effect on survival, a higher percentage of people in one class would have survived. If class had no effect, roughly the same percentage would have survived in each class.

    To break down survival by class, a crosstabulation or contingency table is needed. Percentages are usually preferable to frequencies but remember to include counts for small sample sizes. Choose either row or column percentages carefully.

    Analyse → Descriptive statistics → Crosstabs

    1) Select the variable class and move it to the ‘Row’ box. Move survival to the ‘Column’ box.

    2) Move selected variables using the arrow.

    3) Select ‘Cells’ to get the % options. Choose row %’s.

    4) Select ‘OK’ when finished and the table appears in the output window.
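To make the meaning of row percentages concrete, here is a minimal Python sketch of a crosstabulation with class in the rows and survival in the columns. It is an illustration under stated assumptions, not the SPSS procedure: `crosstab_row_percent` is a hypothetical helper and the eight records are made up, so the percentages are not the Titanic results.

```python
from collections import Counter


def crosstab_row_percent(rows, cols):
    """Return {row value: {column value: row percentage}} for two categorical variables."""
    counts = Counter(zip(rows, cols))   # cell counts
    row_totals = Counter(rows)          # number of cases in each row category
    col_values = sorted(set(cols))
    return {
        r: {c: 100 * counts[(r, c)] / row_totals[r] for c in col_values}
        for r in sorted(row_totals)
    }


# Made-up records for illustration: class (1 or 3) and survived (0 = Died, 1 = Survived).
pclass = [1, 1, 1, 1, 3, 3, 3, 3]
survived = [1, 1, 1, 0, 0, 0, 0, 1]

table = crosstab_row_percent(pclass, survived)
for cls, percentages in table.items():
    print(cls, percentages)
```

Each row sums to 100%, so the groups can be compared even when the classes contain different numbers of passengers.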


    Bar Charts

    Plotting graphs in SPSS is much easier than in Excel. All graphs can be accessed through

    Graphs → Legacy Dialogs

    There is a chart builder option but the legacy dialogs options are more user friendly. To display the information from the cross-tabulation graphically, use either a stacked or clustered bar chart. Both of these can be accessed through

    Graphs → Legacy Dialogs → Bar

    Tidying up a bar chart

    Double click on the chart to open an editing window.

    Selecting this turns the bars into 100% for each class.

    Variable across the x-axis

    Variable to split the bars


    The font in graphs is usually small so adjust the axes titles etc. Select each axis and change the font size to 12. The axis titles and percentages displayed on the bars can also be changed in this way.

    Select this to add labels.

    % is more useful so move it to the displayed box and remove count. Use Number Format to reduce to 0 decimal places.


    Finally, give the chart a title and change the label on the y axis from ‘Count’ to ‘Percentage’.

    When finished, close the chart editor to return to the main output window. Right click on the chart in the output window, copy it and paste it into Word. Sometimes you may need to select ‘Copy Special’ to move charts.

    Pasting as a picture enables easy resizing of graphs/ output in Word.

    It is clear from the bar chart that the percentage of those dying increased as class lowered. 38% of passengers in 1st class died compared to 74% in 3rd class. Is this a significant difference? To answer this, hypothesis testing is needed.


    Adjusting variables

    Reducing the number of categories

    Sometimes categories can be merged if not all the information is needed. For example, a common summary is to calculate the percentage who agreed from a Likert scale i.e. % agree or strongly agree compared to everything else.

    Use ‘Recode into different variables’ rather than ‘Recode into same variables’ so that the re-coding can be checked.

    If there are numerous variables to be recoded in the same way, transfer several variables at the same time. Each variable needs an individual name though. Click change after each new name.

    Here a new variable is created where 0 = 3rd class and 1 = 1st or 2nd class.

    Transform → Recode into different variables

    Select ‘Continue’ and then ‘OK’ to produce the new variable. Then label 0 = 3rd class and 1 = 1st or 2nd class in the value label box in variable view. Finally do a cross-tabulation of the old and new variables to check the re-coding is correct.

    All 1st and 2nd class passengers have been correctly recoded as ‘1st or 2nd class’.

    Move ‘class’ across.

    Give the new variable a name, then click ‘Change’.

    Old value, New value

    You must click ‘Add’ after each change to add to the ‘Old → New’ box.

    Old variable, New variable
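The recode-and-check routine described in this section can be sketched as follows. This is a Python illustration, not SPSS syntax: the names `RECODE` and `class_group` are hypothetical, and the class codes are made up. The point is that the original variable is left untouched and the new one is checked against it.

```python
from collections import Counter

# Old value -> new value, as in the text: 1 = 1st or 2nd class, 0 = 3rd class.
RECODE = {1: 1, 2: 1, 3: 0}

# Made-up class codes for illustration.
pclass = [1, 2, 3, 3, 2, 1, 3]

# Recode into a *different* variable so the original is kept for checking.
class_group = [RECODE[value] for value in pclass]

# Cross-tabulate old against new: every old value should map to exactly one new value.
check = Counter(zip(pclass, class_group))
for (old, new), n in sorted(check.items()):
    print(f"old={old} new={new} count={n}")
```

If a pair such as (2, 0) appeared in the check, it would signal a mistake in the recoding rules.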


    Changing continuous to categorical variables

    Although it is not recommended as information is lost, continuous (scale) variables can be categorised. Here we will create a new variable identifying children of 12 and under within the Titanic data set.

    Go to variable view and label 0 as ‘Adult’ and 1 as ‘Child’.

    Use ‘Crosstabs’ for the old and new variable to check the re-coding is correct i.e. age vs Child to see all those of 12 and under are classified as a child.
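The same rule can be written as a small function. This is a hedged Python sketch rather than the SPSS dialog: `to_child` is a hypothetical name, the ages are made up, and keeping a missing age as missing (`None`) is an assumption, since the text does not say how missing ages should be handled.

```python
def to_child(age):
    """Return 1 for a child (12 and under), 0 for an adult, None if age is missing."""
    if age is None:
        return None  # assumption: a missing age stays missing
    return 1 if age <= 12 else 0


# Made-up ages for illustration, including the boundary value 12 and a missing age.
ages = [0.5, 12, 12.5, 30, None]
child = [to_child(age) for age in ages]
print(list(zip(ages, child)))
```

Checking the boundary value explicitly (12 is a child, 12.5 is not) is the code equivalent of the crosstab check of age against the new variable.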

    Exercise 3

    Were Americans more likely to survive than the British? Produce suitable summary statistics/ charts to investigate this.

    1. Move ‘age’ across.

    2. Give the new variable a name, then click ‘Change’.

    3. Old values of age up to 12 are now going to be 1.

    4. New value.

    5. You must click ‘Add’ after each change to add to the ‘Old → New’ box.


    Summary statistics and graphs: Continuous data

    Continuous variables can be summarised using statistics such as the mean, median, standard deviation, minimum and maximum values. For continuous data, plotting a histogram gives an idea of the shape and spread of the distribution and helps assess whether the variable is normally distributed. Box-plots can also be used and are particularly useful when comparing groups. The minimum and maximum help check for outliers and possible data entry errors.

    Averages

    Mode: The value which occurs most often

    Mean: Sum of the values/ number of values

    Median: The middle value of ordered data

    Measures of spread

    Example data (ordered): 7 7 8 8 9 10 13 13 13 14 30

    Range = maximum value – minimum value = 30 – 7 = 23

    Quartiles: These divide the data into 4 parts. 25% of values are below the lower quartile and 25% are above the upper quartile. The median is the 2nd quartile.

    Interquartile range = Upper quartile – lower quartile = 13 – 8 = 5
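The averages and measures of spread for the eleven example values can be reproduced with Python's standard `statistics` module. This is a sketch for checking the arithmetic, not SPSS output. Quartile rules differ between packages; the default "exclusive" method of `statistics.quantiles` happens to give the quartiles quoted here (8 and 13), and other software may give slightly different values.

```python
import statistics

# The eleven ordered values from the example.
data = [7, 7, 8, 8, 9, 10, 13, 13, 13, 14, 30]

data_range = max(data) - min(data)

# Quartiles using the default "exclusive" method (Python 3.8 or later).
lower_q, median, upper_q = statistics.quantiles(data, n=4)
iqr = upper_q - lower_q

mode = statistics.mode(data)
mean = statistics.mean(data)

print("Range:", data_range)
print("Quartiles:", lower_q, median, upper_q)
print("Interquartile range:", iqr)
print("Mode:", mode, "Mean:", mean, "Median:", median)
```

The mean (12) is larger than the median (10) because the single large value of 30 pulls the mean upwards.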

    Quick question: 2 out of 3 people earn less than the average income

    1. True

    2. False

    Figure labels: lower quartile, median and upper quartile marked on the data; 50% of subjects are below the median and 25% of subjects are above the upper quartile.


    Variance: Average of the squared deviations from the mean. A deviation is the difference between a single value and the mean.

    $$\text{Standard deviation} = \sqrt{\frac{\text{sum of squared differences}}{\text{no. observations} - 1}}$$

    Calculating means and standard deviation

    Example of calculating the mean and standard deviation:

    X = exam score

    Both histograms on the left show approximately the same mean but the second has a much smaller standard deviation as it is less spread out.

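    Table columns: Subject ID, mean, deviations from the mean. The outlier contributes most of the deviation.

    $$\text{Mean} = \frac{\text{sum of scores}}{\text{number of observations}} = \frac{132}{11} = 12$$

    $$\text{SD} = \sqrt{\frac{\text{sum of squared deviations from the mean}}{\text{no. observations} - 1}} = \sqrt{\frac{426}{10}} = \sqrt{42.6} \approx 6.5$$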
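The worked example can be checked step by step in Python. The sketch below assumes that the exam scores are the eleven values used in the measures of spread example; that assumption reproduces the sum of scores (132) and the sum of squared deviations (426) that appear in the calculation, but the score table itself is not shown in this text.

```python
import math

# Assumed exam scores: the eleven values from the measures of spread example.
scores = [7, 7, 8, 8, 9, 10, 13, 13, 13, 14, 30]

n = len(scores)
mean = sum(scores) / n                        # sum of scores / number of observations

# A deviation is the difference between a single value and the mean.
squared_deviations = [(x - mean) ** 2 for x in scores]
sum_of_squares = sum(squared_deviations)

variance = sum_of_squares / (n - 1)           # divide by (no. observations - 1)
sd = math.sqrt(variance)

print("Mean:", mean)
print("Sum of squared deviations:", sum_of_squares)
print("Standard deviation:", round(sd, 1))
print("Largest single contribution:", max(squared_deviations))
```

Under this assumption the score of 30 alone contributes 324 of the 426, which is why one outlier can inflate the standard deviation so much.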


    Which summary statistics should be used

    Means and standard deviations are commonly used to summarise continuous data although for skewed data, the median and quartiles are more appropriate. Skewed data can be assessed by plotting a histogram of continuous data. For large samples, we would expect a histogram to peak roughly in the middle. If the histogram peaks at one end or the other, the data is skewed. The histogram below shows male height which is normally distributed. This means that most people are in the middle and the spread is fairly symmetrical about the mean. For normally distributed data, the mean and the median are similar.

    Positively skewed distribution: Mean > median

    Negatively skewed distribution: Mean < median

    Quick question solution:

    TRUE if you assume average is the mean: Two thirds of people earn less than the MEAN wage. As the chart below shows, the data is very skewed. There are a lot of people earning a low wage and a few very high earners pulling the mean up. In this situation, the median better represents the population as a whole.

    Chart from ‘How does your wage compare with an MP’s’ http://news.bbc.co.uk/1/hi/8072031.stm

    Figure labels: wage chart marked with the mean, the median and ‘2 out of 3 people’; height histogram captioned ‘Normally distributed data’.


    Ex 4: Comparison of continuous data by group

    Did the cost of a ticket affect chances of survival?

    a) Is there a big difference in average ticket price by group?

    b) Which group has data which is more spread out?

    c) Is the data skewed?

    d) Is the mean or median a better summary measure?

    Cost of ticket by ‘Survived?’

                           Died      Survived
    Mean                   23.35     49.36
    Median                 10.50     26.00
    Standard Deviation     34.15     68.65
    Interquartile range    18.15     46.56
    Minimum                 0.00      0.00
    Maximum               263.00    512.33


    DATA: The data set ‘diet’ contains information on 78 people who undertook 1 of three diets. There is background information such as age and gender as well as weights before and after the diet.

    Open the data set from Excel. Go into the Variable View and make sure that each variable is correctly categorised e.g. nominal. Note: continuous is called ‘Scale’ in SPSS. It is important that variables are correctly categorised as SPSS will only carry out some analysis on certain variable types.

    There are several ways to produce summary statistics and charts. This option uses ‘Explore’, which contains the most summary statistics, to compare weight before the diet for males and females.

    Analyse → Descriptive statistics → Explore

    Exercise 5:

    a) Fill in the following table using the summary statistics table in the output.

                           Female = 0    Male = 1
    Minimum                -70
    Maximum                 82
    Mean                    64
    Median                  66
    Standard Deviation      21.6

    b) Interpret the summary statistics by gender. Which group has the higher mean and which group is more spread out?

    Put ‘Pre-weight’ as the dependent variable and ‘Gender’ in the factor list. The summary statistics will be produced for each gender separately.


    A box-plot shows the spread of a distribution of values. The box contains the middle 50% of values.

    c) How could the chart be improved and is there anything odd?

    Box-plot labels: median (central line), upper quartile, lower quartile, outlier.


    Research question 2: Which of three diets was best?

    Before the next section, change the error of -70 to 70. Outliers should not normally be changed unless they are clearly data entry errors, as in this case.

    Give the variables sensible labels and label gender with 0 = Female and 1 = Male.

    Re-run explore to see how the change has affected the summary statistics. Which summary statistics have changed the most?

                           Female with outlier    Female after changing outlier
    Minimum                -70
    Maximum                 82
    Mean                    64
    Median                  66
    Standard Deviation      21.6

    Change -70 to 70kg.
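The effect of a single data entry error on the summary statistics can be illustrated with a few made-up weights. The Python below is a sketch with hypothetical values, not the diet data set, so the numbers will not match the table above; the helper `summarise` is also hypothetical.

```python
import statistics


def summarise(values):
    """Return the summary statistics requested in the exercise table."""
    return {
        "min": min(values),
        "max": max(values),
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "sd": statistics.stdev(values),  # sample standard deviation (n - 1)
    }


# Made-up weights in kg for illustration; -70 is a typing error for 70.
weights = [58, 60, 62, 64, 66, 68, -70]
corrected = [70 if w == -70 else w for w in weights]

before = summarise(weights)
after = summarise(corrected)
for name in before:
    print(f"{name:<7}{before[name]:>10.2f}{after[name]:>10.2f}")
```

In this toy example the mean and standard deviation change a great deal while the median barely moves, which is the pattern to look for when Explore is re-run.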


    Calculations using variables

    Producing the charts for gender and weight before the diet was useful for demonstrating SPSS but the main question of interest is ‘Which diet led to greater weight loss?’. How could this be assessed? To answer this, a new variable ‘weight lost’ (weight before – weight after) would be useful. As spaces are not allowed in variable names, use weightLOST as a name and give a better name in the label section in variable view.

    To do this use Transform → Compute variable.

    After putting the calculation into the ‘Numeric Expression’ box, select ‘OK’ and the new variable will appear last in the Data and variable view sheets. Before carrying out the official test of a difference, use summary statistics and charts to look at the differences.

    Move ‘Preweight’ into the box, select ‘-’ and then move ‘Weight6week’ across.

    Selecting ‘All’ gives you a lot of options for calculations e.g. mean of several variables.
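Computing a new variable from two existing ones amounts to a row-by-row subtraction. The Python sketch below uses the variable names from the text (Preweight, Weight6week, weightLOST) but the three records are made up for illustration and are not from the diet file.

```python
# Made-up records for illustration; weights in kg.
people = [
    {"Preweight": 60, "Weight6week": 57.5},
    {"Preweight": 72, "Weight6week": 70},
    {"Preweight": 80, "Weight6week": 81},
]

# weightLOST = weight before - weight after, computed for every row.
for person in people:
    person["weightLOST"] = person["Preweight"] - person["Weight6week"]

for person in people:
    print(person)
```

A negative value of weightLOST means the person gained weight over the six weeks.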


    Summary statistics for groups in tables

    SPSS has a table function which can produce more complicated tables although it is a little temperamental and frustrating at times!

    To open the table window: Analyse → Tables → Custom Tables

    Drag variables to either the row or column bars to include them in the table. If you want to create sub categories, drag the categorical variable to the front of the variable already in the table. By default, SPSS will choose means to summarise continuous (scale) variables and counts to summarise categorical variables. It is vital that variables are correctly defined as scale or categorical.

    1) Move ‘WeightLOST’ to the row section and ‘Diet’ to the Columns section.

    2) Select the summary statistics you require.

    3) Choose ‘Columns’ in the ‘Position’ options for a better display.

    Which diet seems the best and which diet has the most variation in weight loss?

    Selecting the ‘Summary Statistics’ button opens a window where options for the statistics displayed can be chosen.

    The Summary Statistics button will only be highlighted when a variable is selected in the main window. Here, make sure weightLOST is highlighted in yellow in the central window.

    To change the summary statistics to appear down the side, select ‘Rows’ instead of ‘Columns’ from the Position box.

    Select ‘Standard deviation’ and ‘Count’ from the options and click ‘Apply to all’.
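As a rough cross-check of what the custom table reports, the mean, standard deviation and count per diet can be sketched in a few lines of Python. The (diet, weight lost) pairs below are invented for illustration and are not the booklet's data. The sketch uses the sample standard deviation (dividing by n - 1), on the assumption that this matches what SPSS displays.

```python
from collections import defaultdict
from statistics import mean, stdev

# Hypothetical (diet, weightLOST) pairs: invented values, not the booklet's data.
records = [
    (1, 3.0), (1, 4.0), (1, 2.0),
    (2, 1.0), (2, 3.0), (2, 5.0),
    (3, 5.0), (3, 6.0), (3, 7.0),
]

# Collect the weight lost for each diet group.
groups = defaultdict(list)
for diet, lost in records:
    groups[diet].append(lost)

# Mean, sample standard deviation and count per diet, as in the custom table.
summary = {
    diet: {"mean": mean(values), "sd": stdev(values), "count": len(values)}
    for diet, values in groups.items()
}

for diet in sorted(summary):
    row = summary[diet]
    print(f"Diet {diet}: mean={row['mean']:.2f}, sd={row['sd']:.2f}, n={row['count']}")
```

In this made-up example diet 3 has the highest mean loss, while diet 2 has the largest standard deviation and so the most variation.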

    Scatterplots:

    A scatterplot helps assess a relationship between two continuous (scale) variables by plotting a different point for each individual based on their scores on two variables. The closer the points fit a diagonal line, the stronger the relationship.

    The scatterplot below shows a negative relationship between a person’s weight and the number of kilometres they run per week, i.e. the more they run, the lighter they generally are. There is one clear outlier who runs a lot but also weighs a lot.

    Things to look for in a scatterplot:

    How strong is the relationship? The closer the points form a line, the stronger the relationship.

    Is there a negative or positive relationship?

    Is the relationship linear? Do the points form a straight line?

    Are there any outliers that could be data entry errors?

    [Figure: scatterplot of weight against kilometres run per week, annotated with the outlier and the general downward linear trend]
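The strength and direction of a linear relationship are usually summarised by a correlation coefficient. The sketch below computes Pearson's r by hand in Python. The running and weight figures are invented for illustration, and the function name pearson_r is just a label chosen here.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of cross-products and sums of squared deviations from the means.
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


# Hypothetical data: kilometres run per week and weight in kg (invented values).
km_per_week = [5, 10, 15, 20, 25, 30]
weight_kg = [85, 80, 78, 72, 70, 65]

r = pearson_r(km_per_week, weight_kg)
print(round(r, 2))
```

A value close to -1 indicates a strong negative relationship, a value close to +1 a strong positive one, and a value near 0 little or no linear relationship.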

    A scatterplot can be colour coded by a third categorical variable using the ‘Set marker by’ option within the Graphs → Legacy Dialogs → Scatter/Dot menu.

    Here, we will look at the relationship between weight before and weight after the diet, with different marker shapes for males and females.

    Double click on the chart to open the edit window. To change the marker shape, click on the points, then click again on just one of the female points to open the properties window. Change the marker type and size.

    It is clear from the scatterplot that there is a strong positive relationship between a person’s weight before and after the diet. A positive relationship (uphill scatter) means that as the x (horizontal) variable (weight before diet) increases so does the y (vertical) variable. In a negative relationship, y decreases as x increases.

    Summary of descriptive and graphical statistics

    Chart | Variable type | Purpose | Summary statistics
    Pie chart or bar chart | One categorical variable | Shows frequencies/proportions/percentages | Class percentages
    Stacked/multiple bar chart | Two categorical variables | Compares proportions within groups | Compare percentages within groups
    Histogram | One continuous variable | Shows distribution of results | Mean and standard deviation
    Scatter graph | Two continuous variables | Shows relationship between two variables and helps detect outliers | Correlation coefficient
    Line chart | Continuous over time | Displays changes over time | Frequencies
    Line chart | Continuous by group | Comparison of group means | Means
    Confidence interval plot | Continuous dependent / categorical independent | Comparison of group means | Means and confidence intervals

    Research question 3: Which variables are strongly related to birthweight?

    Exercise 6:

    a) Open the data set ‘birthweight’ from Excel. Label the variables with the labels in the table below.

    b) What is the average birthweight? Is birthweight normally distributed?

    c) Recode the variable mnocig (cigarettes smoked by the mother per day) into the following four categories: 1 = Non-smoker, 2 = Light smoker (1 – 10 a day), 3 = Moderate smoker (11 – 20 a day) and 4 = Heavy smoker (21+ a day).

    d) Summarise birthweight by smoking category using suitable statistics and a graph.

    e) Produce a scatterplot of birthweight and gestational age by smoking category. What is the relationship between the variables?

    Variable | Label | Variable type
    id | Baby ID |
    headcir | Head circumference (cm) |
    leng | Length of baby (inches) |
    weight | Baby's weight |
    gest | Gestational age |
    mage | Maternal age |
    mnocig | No. cigarettes smoked per day by mother |
    mheight | Maternal height |
    mppwt | Mother's pre-pregnancy weight |
    fage | Father's age |
    fedyrs | Years father was in education |
    fnocig | No. cigarettes smoked per day by father |
    fheight | Father's height |
    lowbwt | Low birth weight baby (1 = under 5lbs) |
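The recoding rule in part c) can be written out explicitly to make the category boundaries clear. The Python sketch below is one way to express it, assuming mnocig holds non-negative whole numbers. The function name smoking_category and the sample values are hypothetical, chosen only to hit each boundary.

```python
def smoking_category(cigs_per_day):
    """Recode cigarettes per day into the four categories from part c)."""
    if cigs_per_day == 0:
        return 1  # non-smoker
    if cigs_per_day <= 10:
        return 2  # light smoker (1-10 a day)
    if cigs_per_day <= 20:
        return 3  # moderate smoker (11-20 a day)
    return 4      # heavy smoker (21+ a day)


# Value labels for the new categorical variable.
labels = {1: "Non-smoker", 2: "Light smoker", 3: "Moderate smoker", 4: "Heavy smoker"}

# Hypothetical mnocig values chosen to sit on each category boundary.
mnocig = [0, 1, 10, 11, 20, 21, 50]
recoded = [smoking_category(c) for c in mnocig]
print([labels[code] for code in recoded])
```

Checking the boundary values (10, 11, 20 and 21) is a quick way to confirm that a recode, in any software, has put people in the intended groups.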

    Exercise 7: Enter the following data into SPSS:

    Women

    Age | Housework (hrs per week) | Marital status | Hours worked per week
    46 | 6 | Married | 35
    62 | 8 | Married | 7
    42 | 30 | Married | 7
    36 | 25 | Married | 18
    58 | 30 | Married | 23
    36 | 21 | Married | 22
    32 | 10 | Married | 24
    35 | 14 | Married | 32
    33 | 3 | Married | 36
    41 | 12 | Married | 36
    31 | 14 | Separated | 22
    50 | 25 | Divorced | 10
    31 | 15 | Widowed | 15

    Men

    Age | Housework (hrs per week) | Marital status | Hours worked per week
    55 | 10 | Married | 28
    61 | 0 | Married | 39
    39 | 2 | Married | 49
    38 | 3 | Married | 40
    58 | 4 | Married | 40
    31 | 6 | Married | 41
    54 | 7 | Married | 42
    33 | 4 | Separated | 45
    62 | 6 | Divorced | 38
    62 | 6 | Widowed | 37
    31 | 2 | Never married | 35
    32 | 18 | Never married | 25
    42 | 20 | Never married | 35

    a) Investigate the relationship between the amount of housework someone carries out per week and each of the other variables using suitable charts. For scatterplots, use different markers for males and females.

    b) Create a new binary variable from ‘Hours worked per week’ to indicate whether someone is full time or part time. Classify part time as under 30 hours.

    c) Summarise the amount of housework carried out per week by working full/part time using a table and a plot, and interpret.

    Getting SPSS on your home computer

    Go to the software download page and enter your university login and password:

    https://cics.dept.shef.ac.uk/software/

    To download software or renew licence codes, click the ‘SPSS Statistics 19-22’ button on the page that comes up. You will receive an email containing a download link, a licence code, installation instructions and legal information. The download sometimes takes a long time!

    MASH contact details

    Book an appointment or access help sheets via our webpage: https://www.shef.ac.uk/mash

    Statistics appointments are 10am – 1pm every day in term time with an additional session 4-7pm Wednesdays. For appointments outside of term time see our website or email [email protected].

    Solutions to exercises

    Exercise 1: Identify the type of variables and key questions of interest for the Titanic dataset

    Variable name | Variable label | Value label | Data type
    pclass | Class | 1 = 1st, 2 = 2nd, 3 = 3rd | Ordinal
    survived | | 0 = Died, 1 = Survived | Nominal
    Residence | Country of residence | 0 = American, 1 = British, 2 = Other | Nominal
    age | | | Scale
    sibsp | Number of siblings/spouses | | Scale
    parch | Number of parents/children on board | | Scale
    fare | Price of ticket | | Scale
    Gender | Gender | 0 = male, 1 = female | Nominal (binary)

    Were wealthy people more likely to survive? Which variables would you use to investigate this question?

    Survival is the outcome. Wealthy could be measured using either class or price of ticket.

    Exercise 2: Who are the most dangerous drivers?

    Males and middle-aged people have more accidents. This may simply be because there are more male and middle-aged drivers on the road.

    Percentages are better than frequencies here. Given that there are different numbers of drivers in each category and the age categories are different widths, the best way to summarise is to compare the proportion of drivers within each category having accidents. On that basis it is clear that male drivers consistently have more accidents and that younger drivers are more likely to have accidents.
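To see why proportions can tell a different story from raw frequencies, consider the small Python sketch below. The counts are entirely made up for illustration and are not taken from the accident data used in the exercise.

```python
# Hypothetical counts, invented for illustration:
# (number of drivers on the road, number of drivers having an accident).
groups = {
    "Young": (1000, 150),
    "Middle aged": (5000, 400),
}

# Percentage of drivers in each group who had an accident.
rates = {
    name: 100 * accidents / drivers
    for name, (drivers, accidents) in groups.items()
}

for name, (drivers, accidents) in groups.items():
    print(f"{name}: {accidents} accidents, {rates[name]:.1f}% of {drivers} drivers")
```

In this invented example the middle-aged group has more accidents in total, but a young driver is almost twice as likely to have one.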

    Exercise 3: Investigate whether nationality and survival were related

    56% of Americans survived compared to 32% of British passengers and 32% of other nationalities.

    Exercise 4: Comparison of continuous data by group

    Did the cost of a ticket affect chances of survival?

    a) Is there a big difference in average ticket price by group? Yes. The mean and median ticket prices are much higher in the group who survived.

    b) Which group has data which is more spread out? The standard deviation in the group who survived is double that of the group who died, so there is much more variation in that group.

    c) Is the data skewed? Yes, it’s very positively skewed. There are a lot of people with cheap tickets and not so many with expensive tickets.

    d) Is the mean or median a better summary measure? The median, as the data is very skewed.

    Cost of ticket | Died | Survived
    Mean | 23.35 | 49.36
    Median | 10.50 | 26.00
    Standard deviation | 34.15 | 68.65
    Interquartile range | 18.15 | 46.56
    Minimum | 0.00 | 0.00
    Maximum | 263.00 | 512.33

    Exercise 5:

    a) Fill in the following table using the summary statistics table in the output.

    Statistic | Female = 0 | Male = 1
    Minimum | -70 | 71
    Maximum | 82 | 88
    Mean | 64 | 79
    Median | 66 | 79
    Standard deviation | 21.6 | 5

    b) Interpret the summary statistics by gender. Which group has the higher mean and which group is more spread out?

    Standard deviation: The standard deviation for men (5) is much smaller than the standard deviation for women (21.6), so the weights for women are more spread out. However, the data entry error needs to be corrected and the statistics run again.

    Averages: Females had a mean weight of 64kg and a median of 66kg before the diet. The gap between the two measures suggests that the data may be skewed. Males had a mean and median pre-weight of 79kg, suggesting that the data is symmetric and may be normally distributed.

    Minimum/maximum: Are there any extreme outliers? Someone weighed -70kg before the diet, which is clearly an error. Outliers cannot always be removed or changed, but here the real weight is clearly 70kg, so make that adjustment and re-run the analysis. What effect has this had on the summary statistics?

    c) How could the chart be improved and is there anything odd? The variables could be better labelled. Someone weighed -70kg, which is clearly wrong.

    Before the next section, change the error of -70 to 70. Outliers should not normally be changed unless they are clearly data entry errors, as in this case. Give the variables sensible labels and label gender with 0 = Female and 1 = Male.

    Re-run Explore to see how the change has affected the summary statistics. Which summary statistics have changed the most?

    Statistic | Female with outlier | Female after changing outlier
    Minimum | -70 | 58
    Maximum | 82 | 82
    Mean | 64 | 67
    Median | 66 | 67
    Standard deviation | 21.6 | 5.6

    The mean, standard deviation, minimum and maximum are more influenced by outliers than the median and interquartile range.
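The effect of a single data entry error on different summary statistics can be reproduced with a small sketch. The nine weights below are invented for illustration (they are not the diet data set), with one value mistyped as -70.

```python
from statistics import mean, median, stdev

# Hypothetical weights in kg, invented for illustration; one value was typed as -70.
with_error = [58, 62, 65, 67, 68, -70, 72, 75, 82]

# Correct the data entry error by replacing -70 with 70.
corrected = [70 if w == -70 else w for w in with_error]

for label, data in (("with error", with_error), ("corrected", corrected)):
    print(f"{label}: mean={mean(data):.1f}, median={median(data)}, sd={stdev(data):.1f}")
```

In this made-up sample the median moves by only 1kg when the error is fixed, whereas the mean shifts by more than 15kg and the standard deviation shrinks to a fraction of its original size.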

    Exercise 6: Open the data set ‘birthweight’ from Excel. Label the variables with the labels in the table below.

    All the variables are continuous/discrete apart from ‘Low birth weight’, which is binary.

    a) What is the average birthweight? Is birthweight normally distributed?

    The smallest baby in the data set was 3.3 pounds and the largest 11.4 pounds. The mean birthweight is 7.52 pounds and the median 7.6 pounds. The histogram shows that birthweight is approximately normally distributed.

    b) Recode the variable mnocig (cigarettes smoked by the mother per day) into the following four categories: 1 = Non-smoker, 2 = Light smoker (1 – 10 per day), 3 = Moderate smoker (11 – 20 per day) and 4 = Heavy smoker (21+ per day).

    c) Summarise birthweight by smoking category using suitable statistics and a graph.

    The group means are similar, ranging from 6.97 pounds for moderate smokers to 7.73 pounds for non-smokers. The standard deviations are also similar, suggesting a similar spread of birthweights within each category.

    For the plots, either a confidence interval plot or a boxplot would be a useful representation of the differences between the groups.

    The boxplots show that the medians for the four groups are fairly similar and the interquartile range (middle 50% of the values) is of a similar width. Each boxplot is fairly symmetrical about the median, suggesting the values are approximately normally distributed within each group.

    Produce a scatterplot of birthweight and gestational age. What is the relationship between the two?

    There is a moderate positive relationship between gestational age and birthweight: as gestational age increases, birthweight tends to increase. There is no clear relationship between smoking and either weight or gestational age. There is one oddity though. A standard pregnancy is 40 weeks and most women are induced by 42 weeks, but there seem to be quite a few pregnancies above 42 weeks. It’s likely that this is old data, perhaps from a time when gestational age estimation was less accurate.

    Exercise 7: Enter the following data into SPSS:

    The data should have been entered like this and the categorical numbers labelled.

    a) Investigate the relationship between the amount of housework someone carries out per week and each of the other variables using suitable charts. For scatterplots, use different markers for males and females.

    The graph suggests a strong negative relationship between weekly hours of work and hours of housework: the more hours someone works, the less housework they do. For males, both the amount of housework they do and the hours they work are less spread out.

    There doesn’t appear to be a relationship between age and the amount of housework someone does.

    The highest medians are for those never married and those who are divorced. The data for those never married is very skewed. However, the sample size is small so not much can be concluded. How many are in each category?

    The summary statistics show that there are only 2 or 3 people in most of the categories so using summary statistics could be misleading. Merging suitable categories would be advisable.

    Produce a plot comparing those working full/ part time for hours of housework and interpret.

    Hours per week on housework by working status

    Statistic | Part time | Full time
    Mean | 18.73 | 6.33
    Median | 18.00 | 6.00
    Standard deviation | 8.01 | 5.27

    There is clearly a difference in the amount of housework carried out per week between those working full and part time. Those working part time carry out 19 hours of housework a week on average compared to 6 hours a week by those working full time. The amount of housework is more spread out for part time people (SD = 8 compared to SD = 5 for full time workers).
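The figures in this table can be checked outside SPSS. The Python sketch below uses the hours worked and hours of housework for the 26 people listed in Exercise 7, creates the binary full/part time variable (part time = under 30 hours) and summarises housework by group. The function and variable names are chosen here for illustration, and the standard deviation is the sample version (dividing by n - 1).

```python
from statistics import mean, median, stdev

# (hours worked per week, hours of housework per week) for the 26 people in Exercise 7.
people = [
    (35, 6), (28, 10), (7, 8), (39, 0), (7, 30), (49, 2), (18, 25), (40, 3),
    (23, 30), (40, 4), (22, 21), (41, 6), (24, 10), (42, 7), (32, 14), (45, 4),
    (36, 3), (38, 6), (36, 12), (37, 6), (22, 14), (35, 2), (10, 25), (25, 18),
    (15, 15), (35, 20),
]


def working_status(hours_worked):
    """Binary recode: under 30 hours a week counts as part time."""
    return "Part time" if hours_worked < 30 else "Full time"


# Group the housework hours by the new binary variable.
housework = {"Part time": [], "Full time": []}
for hours_worked, housework_hours in people:
    housework[working_status(hours_worked)].append(housework_hours)

for status, values in housework.items():
    print(f"{status}: n={len(values)}, mean={mean(values):.2f}, "
          f"median={median(values)}, sd={stdev(values):.2f}")
```

The sketch also shows the group sizes: 11 people work part time and 15 work full time.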