04/19/23 H.S. 1
Short overview of statistical methods
Hein Stigum
Presentation, data and programs at:
http://folk.uio.no/heins/
courses
04/19/23 H.S. 2
Agenda
• Concepts
• Bivariate analysis
– Continuous symmetrical
– Continuous skewed
– Categorical
• Multivariable analysis
– Linear regression
– Logistic regression
Outcome variable decides analysis
04/19/23 H.S. 3
CONCEPTS
04/19/23 H.S. 4
Precision and bias
• Measures of populations
– Precision: random error (statistics)
– Bias: systematic error (epidemiology)
[Figure: true value and estimate, with precision and bias marked]
04/19/23 H.S. 5
Precision: Estimation
[Figure: a population with its true value and a sample with its estimate; the estimate is drawn with an interval around it, ( | )]
Estimate with confidence interval
95% confidence interval: 95% of repeated intervals will contain the true value
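The coverage statement above can be checked with a small simulation. This is only a sketch under assumed values: the true mean (3500), standard deviation (500), sample size (50), number of repeats and the seed are all hypothetical, and the interval uses the deck's rule of thumb, mean ± 2*se(mean).

```python
import random
import statistics


def coverage(true_mean=3500.0, sd=500.0, n=50, repeats=2000, seed=1):
    """Share of repeated intervals (mean ± 2*se) that contain the true mean."""
    rng = random.Random(seed)  # fixed seed so the run is reproducible
    hits = 0
    for _ in range(repeats):
        sample = [rng.gauss(true_mean, sd) for _ in range(n)]
        mean = statistics.fmean(sample)
        se = statistics.stdev(sample) / n ** 0.5
        lower, upper = mean - 2 * se, mean + 2 * se
        hits += lower <= true_mean <= upper
    return hits / repeats


share = coverage()
print(f"share of intervals containing the true value: {share:.3f}")
```

The printed share should land close to 0.95; it will not be exactly 0.95 because the number of repeats is finite.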
04/19/23 H.S. 6
Precision: Testing
[Figure: a population with the true values for group 1 and group 2, and a sample with the estimates for group 1 and group 2, each drawn with its interval]
p-value=P(observing this difference or more, when the true difference is zero)
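One way to make this definition concrete is a permutation test. It is a swapped-in illustration, not a method named in the deck: if the true difference is zero, the group labels are arbitrary, so the labels are shuffled many times and the share of shuffles giving a difference at least as large as the observed one is counted. The birth weights, the number of draws and the seed below are hypothetical.

```python
import random
import statistics


def permutation_p_value(group1, group2, draws=5000, seed=1):
    """Two-sided p-value for a difference in means, by shuffling group labels."""
    rng = random.Random(seed)
    observed = abs(statistics.fmean(group1) - statistics.fmean(group2))
    pooled = list(group1) + list(group2)
    n1 = len(group1)
    extreme = 0
    for _ in range(draws):
        rng.shuffle(pooled)
        diff = abs(statistics.fmean(pooled[:n1]) - statistics.fmean(pooled[n1:]))
        extreme += diff >= observed  # this difference or more
    return extreme / draws


# Hypothetical birth weights in grams.
boys = [3650, 3700, 3580, 3720, 3690, 3610]
girls = [3400, 3450, 3380, 3500, 3420, 3470]
p_value = permutation_p_value(boys, girls)
print(f"p-value: {p_value:.4f}")
```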
04/19/23 H.S. 7
Precision: Significance level
Birth weight, 500 newborns, observed difference between boys and girls
H0: boys = girls
Ha: boys ≠ girls
Observed difference   p-value
10 g                  0.90
50 g                  0.40
100 g                 0.10
130 g                 0.04
150 g                 0.02
Significance level: p<0.05
04/19/23 H.S. 8
Precision: Test situations
• 1 sample test: Weight = 10
• 2 independent samples: Weight by sex
• K independent samples: Weight by age groups
• 2 dependent samples: Weight last year = Weight today
04/19/23 H.S. 9
Bias: DAGs
[Figure: DAG with exposure E (gest age), outcome D (birth weight), and covariates C1 (sex) and C2 (parity)]
Associations: bivariate (unadjusted)
Causal effects: multivariable (adjusted)
Draw your assumptions before your conclusions
04/19/23 H.S. 10
WHY USE GRAPHS?
04/19/23 H.S. 11
Problem example
• Lunch meals per week
– Table of means (around 5 per week)
– Linear regression
[Figure: distribution of lunch meals per week (1 to 7); y-axis: percent, 0 to 50]
04/19/23 H.S. 12
Problem example 2
• Iron level by sex
– Both linear and logistic regression
– Opposite results
[Figure: density of iron level in blood, with the mean for girls and the mean for boys marked; x-axis: iron level in blood]
04/19/23 H.S. 13
Data types
• Categorical data
– Nominal: married/ single/ divorced
– Ordinal: small/ medium/ large
• Numerical data
– Discrete: number of children
– Continuous: weight
04/19/23 H.S. 14
Data type
• Categorical: freq table, cross table, chi-square, logistic regression
• Numerical: normal data?
– Yes: means, T-test, linear regression
– No: medians, nonparametric tests
Outcome data type dictates type of analysis
04/19/23 H.S. 15
BIVARIATE ANALYSIS 1
Continuous symmetric outcome: Birth weight
04/19/23 H.S. 16
Distribution
[Figure: two kernel density plots of weight; the first over all observations (x-axis 0 to 6000), the second after dropping weights below 2000 (x-axis 2000 to 6000)]
kdensity weight
drop if weight<2000
kdensity weight
04/19/23 H.S. 17
Central tendency and dispersion
Mean and standard deviation:
Mean with confidence interval:
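The Stata output for this slide did not survive in the transcript, so here is a minimal sketch of the same quantities. The weights are hypothetical, and the interval uses the deck's rule of thumb, mean ± 2*se(mean), rather than an exact t quantile.

```python
import statistics


def mean_with_ci(values, multiplier=2.0):
    """Return mean, standard deviation and the interval mean ± multiplier*se."""
    n = len(values)
    mean = statistics.fmean(values)
    sd = statistics.stdev(values)   # sample standard deviation (n - 1)
    se = sd / n ** 0.5              # standard error of the mean
    return mean, sd, (mean - multiplier * se, mean + multiplier * se)


# Hypothetical birth weights in grams.
weights = [3200, 3450, 3600, 3900, 3350, 3700, 3550, 3800]
mean, sd, (lower, upper) = mean_with_ci(weights)
print(f"mean={mean:.1f} sd={sd:.1f} CI=({lower:.1f}, {upper:.1f})")
```

The standard deviation describes the spread of individual birth weights; the confidence interval describes the uncertainty of the mean and narrows as n grows.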
04/19/23 H.S. 18
Compare groups, equal variance?
• Equal
• Not equal
[Figure: two panels of density curves for two groups, one with equal and one with unequal variance]
04/19/23 H.S. 19
2 independent samples
Are birth weights the same for boys and girls?
[Figure: scatterplot of birth weight (2000 to 6000) by sex (boys, girls), and density plot of birth weight by sex]
04/19/23 H.S. 20
2 independent samples test
ttest weight, by(sex) unequal    // unequal variances
ttest var1==var2                 // paired test
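As a rough illustration of what the unequal-variance test computes, the sketch below derives the Welch t statistic and its approximate degrees of freedom by hand. It does not reproduce Stata's output and stops before the p-value, which would come from a t distribution with these degrees of freedom. The two samples are hypothetical.

```python
import statistics


def welch_t(sample1, sample2):
    """Welch t statistic and Welch-Satterthwaite degrees of freedom."""
    n1, n2 = len(sample1), len(sample2)
    v1, v2 = statistics.variance(sample1), statistics.variance(sample2)
    se = (v1 / n1 + v2 / n2) ** 0.5  # no pooled variance: variances may differ
    t = (statistics.fmean(sample1) - statistics.fmean(sample2)) / se
    df = (v1 / n1 + v2 / n2) ** 2 / (
        (v1 / n1) ** 2 / (n1 - 1) + (v2 / n2) ** 2 / (n2 - 1)
    )
    return t, df


# Hypothetical birth weights in grams.
boys = [3600, 3700, 3800, 3900, 4000]
girls = [3500, 3600, 3700, 3800, 3900]
t, df = welch_t(boys, girls)
print(f"t={t:.2f} df={df:.1f}")
```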
04/19/23 H.S. 21
K independent samples
• Is birth weight the same over parity?
[Figure: scatterplot of birth weight (2000 to 6000) by parity (0, 1, 2-7), and density plot of birth weight by parity (0, 1, 2-7)]
04/19/23 H.S. 22
K independent samples test
equal means?
Equal variances?
04/19/23 H.S. 23
Continuous by continuous
• Does birth weight depend on gestational age?
[Figure: scatterplot of birth weight (2000 to 6000) against gestational age (200 to 700), and the same scatterplot with the outlier dropped (gestational age 200 to 300)]
04/19/23 H.S. 24
Continuous by continuous tests
• Cut gestational age up in groups, then use T-test or ANOVA
or
• Use linear regression with 1 covariate
04/19/23 H.S. 25
Test situations
• 1 sample test: ttest weight == 10
• 2 independent samples: ttest weight, by(sex)
• K independent samples: oneway weight parity
• 2 dependent samples (paired): ttest weight_last_year == weight_today
04/19/23 H.S. 26
BIVARIATE ANALYSIS 2
Continuous skewed outcome: Number of sexual partners
04/19/23 H.S. 27
Distribution
kdensity partners if partners<=50
[Figure: distribution of number of lifetime partners, N=394; x-axis (partners) marked at 1, 4, 9, 20 and 50, with the 25%, 50%, 75% and 95% percentiles indicated]
04/19/23 H.S. 28
Central tendency and dispersion
Median and percentiles:
04/19/23 H.S. 29
2 independent samples
Do males and females have the same number of partners?
[Figure: scatterplot of partners (0 to 200) by gender (males, females), and density plot of partners (0 to 50) by gender]
04/19/23 H.S. 30
2 independent samples test
equal medians?
04/19/23 H.S. 31
K independent samples
Do partners vary with age?
[Figure: scatterplot of partners (0 to 200) by age group (18-29, 30-44, 45-60), and density plot of partners (0 to 50) by age group]
04/19/23 H.S. 32
K independent samples test
equal medians?
04/19/23 H.S. 33
Table of descriptives
                            Numerical data                        Proportions
                            Normal               Skewed
Descriptives
  Center                    Mean                 Median           p
  Dispersion                Standard deviation   Fractiles
Confidence intervals for center estimates
  Standard error            se(mean)                              se(p)
  95% confidence interval   mean ± 2*se(mean)                     p ± 2*se(p)
04/19/23 H.S. 34
Table of tests
                          Numerical data                                         Proportions
                          Normal                      Skewed
1 sample                  One sample T-test           Kolmogorov-Smirnov         Binomial
2 independent samples     Independent sample T-test   Mann-Whitney U             Chi-square
K independent samples     ANOVA                       Kruskal-Wallis             Chi-square
2 dependent samples       Paired sample T-test        Wilcoxon signed rank test  McNemar (2x2)
Remarks:
• Categorical ordered: use nonparametric tests
• If N is large: may use parametric tests
• If unequal variance in ANOVA: use linear regression with robust variance estimation
04/19/23 H.S. 35
BIVARIATE ANALYSIS 3
Categorical outcome: Being bullied
04/19/23 H.S. 36
Frequency and proportion
Frequency:
Proportion with CI:
04/19/23 H.S. 37
Proportion, confidence interval
proportion: $p = \frac{x}{n}$, where x = "disease" and n = total number
standard error: $se(p) = \sqrt{\frac{p(1-p)}{n}}$
confidence interval: $CI(p) = p \pm 2 \cdot se(p)$
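The three quantities on this slide (proportion x/n, its standard error, and the interval p ± 2*se(p)) can be computed directly. The counts below are hypothetical and chosen so the arithmetic is easy to follow by hand.

```python
def proportion_ci(x, n):
    """Proportion, standard error and the interval p ± 2*se(p)."""
    p = x / n                       # x = number with "disease", n = total number
    se = (p * (1 - p) / n) ** 0.5
    return p, se, (p - 2 * se, p + 2 * se)


# Hypothetical data: 20 of 100 report being bullied.
p, se, (lower, upper) = proportion_ci(x=20, n=100)
print(f"p={p:.2f} se={se:.2f} CI=({lower:.2f}, {upper:.2f})")
```

This simple interval is an approximation that works best when n is large and p is not close to 0 or 1.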
04/19/23 H.S. 38
Crosstables
equal proportions?
Are boys bullied as much as girls?
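The cross table for this slide is missing from the transcript. As a sketch of the test behind "equal proportions?", the code below computes the Pearson chi-square statistic for a 2x2 table without continuity correction. The counts are hypothetical. With 1 degree of freedom the p-value can be written with the complementary error function, so no statistics library is needed.

```python
import math


def chi_square_2x2(a, b, c, d):
    """Pearson chi-square and p-value for the table [[a, b], [c, d]]."""
    n = a + b + c + d
    rows = (a + b, c + d)
    cols = (a + c, b + d)
    observed = ((a, b), (c, d))
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = rows[i] * cols[j] / n  # expected count under equal proportions
            chi2 += (observed[i][j] - expected) ** 2 / expected
    p_value = math.erfc(math.sqrt(chi2 / 2))  # upper tail of chi-square, 1 df
    return chi2, p_value


# Hypothetical counts: rows are boys and girls, columns are bullied yes/no.
chi2, p_value = chi_square_2x2(30, 70, 10, 90)
print(f"chi2={chi2:.2f} p={p_value:.4f}")
```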
04/19/23 H.S. 39
Ordered categories, trend
Trend?
equal proportions?
04/19/23 H.S. 41
MULTIVARIABLE ANALYSIS 1
Continuous outcome: Linear regression, Birth weight
04/19/23 H.S. 42
Regression idea
model: $y = b_0 + b_1 x + e$
y = outcome
x = covariate
$b_1$ = coefficient, effect of x
e = residual error
model with many cofactors: $y = b_0 + b_1 x_1 + b_2 x_2 + e$
$x_1, x_2$ = covariates
[Figure: scatterplot of birth weight (gram, 2500 to 5000) against gestational age (days, 250 to 310)]
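For the one-covariate model, the least squares estimates have a closed form. The sketch below fits the line to a handful of hypothetical points and computes the residuals; it is meant to show the idea, not to replace a regression command.

```python
def fit_line(x, y):
    """Least squares estimates (b0, b1) for the model y = b0 + b1*x + e."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    b1 = sxy / sxx               # effect of x: change in y per unit of x
    b0 = mean_y - b1 * mean_x    # expected y at x = 0
    return b0, b1


# Hypothetical data: gestational age (days) and birth weight (gram).
gest = [260, 270, 280, 290, 300]
weight = [3100, 3300, 3500, 3600, 3900]
b0, b1 = fit_line(gest, weight)
residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(gest, weight)]
print(f"b0={b0:.0f} b1={b1:.1f}")
```

In this toy data set b1 is 19 gram per day, and b0 is negative because it is the expected weight at a gestational age of 0 days, far outside the data.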
04/19/23 H.S. 43
Model and assumptions
• Model: $y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \epsilon, \quad \epsilon \sim N(0, \sigma^2)$
• Association measure: $\beta_1$ = increase in y for one unit increase in x1
• Assumptions
– Independent errors
– Linear effects
– Constant error variance
• Robustness
– Influence
04/19/23 H.S. 44
Workflow
• DAG
• Scatterplots
• Bivariate analysis
• Regression
– Model estimation
– Test of assumptions
  • Independent errors
  • Linear effects
  • Constant error variance
– Robustness
  • Influence
[Figure: DAG with exposure E (gest age), outcome D (birth weight), and covariates C1 (sex) and C2 (parity)]
[Figure: scatterplot of birth weight (gram, 2000 to 6000) against gestational age (days, 200 to 700), with observation 539 labelled]
04/19/23 H.S. 45
Categorical covariates
• 2 categories
– OK
• 3+ categories
– Use “dummies”
• “Dummies” are 0/1 variables used to create contrasts
• Want 3 categories for parity: 0, 1 and 2-7 children
• Choose 0 as reference
• Make dummies for the two other categories
generate Parity1 = (parity==1) if parity<.
generate Parity2_7 = (parity>=2) if parity<.
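The same dummy coding can be sketched outside Stata. In the Stata lines, `if parity<.` keeps missing values missing; the sketch below mirrors that with `None`. The function name and the use of `None` for missing are assumptions made for illustration.

```python
def parity_dummies(parity):
    """Return (Parity1, Parity2_7) with parity 0 as the reference category."""
    if parity is None:           # missing stays missing, like `if parity<.`
        return None, None
    return int(parity == 1), int(parity >= 2)


for value in (0, 1, 2, 7, None):
    print(value, parity_dummies(value))
```

Parity 0 gets (0, 0), so each dummy's coefficient is a contrast against women with no previous children.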
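Create meaningful constant
Expected birth weight: $E(y) = \beta_0 + \beta_1\,gest + \beta_2\,sex + \beta_3\,Parity1 + \beta_4\,Parity2\_7$
Expected birth weight at:
gest=0, sex=0, parity=0: $\beta_0 = 1925$ g, not meaningful
gest=280, sex=1, parity=0: $\beta_0 + \beta_1 \cdot 280 + \beta_2 \cdot 1 = 3524$ g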
04/19/23 H.S. 47
Model estimation
                        coeff     95% conf. int.
Birth weight at ref     3524.3
Gestational age
  per day               6.0       (3.9, 8.2)
Sex
  Boy                   0
  Girl                  -139.2    (-228.9, -49.5)
Parity
  0                     0
  1                     232.0     (130.6, 333.5)
  2-7                   226.0     (106.9, 345)
04/19/23 H.S. 48
Test of assumptions
• Plot residuals versus predicted y
– Independent residuals?
– Linear effects?
– Constant variance?
[Figure: residuals (-1000 to 1500) against linear prediction (3200 to 4000); outlier not included]
04/19/23 H.S. 49
Violations of assumptions
• Dependent residuals: use mixed models or GEE
• Nonlinear effects: add square term
• Non-constant variance: use robust variance estimation
[Figure: two residual plots, one against gest (200 to 300) and one (res) against the predicted value p (3400 to 3800)]
04/19/23 H.S. 50
Influence
[Figure: scatterplot of birth weight (2000 to 6000) against gestational age (200 to 700), with the outlier marked and two fitted lines: regression with outlier and regression without outlier]
04/19/23 H.S. 51
Measures of influence
• Measure change in:
– Predicted outcome
– Deviance
– Coefficients (beta)
  • Delta beta
Remove obs 1, see change
Remove obs 2, see change
[Figure: influence (delta beta for gestational age, -0.6 to 0.2) plotted by observation id]
04/19/23 H.S. 52
[Figure: dfbeta for gestational age (-10 to 0) plotted against weight (2000 to 6000), with observation 539 labelled]
beta for gestational age= 6.04
If obs nr 539 is removed, beta will change from 6 to 16
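The "remove one observation, see the change" idea can be sketched as a leave-one-out loop. This is a simplified, unscaled delta beta for a one-covariate model. Stata's dfbeta is scaled by a standard error and uses its own sign convention, so the numbers are not comparable with the plot. The data are hypothetical, with one deliberately extreme gestational age.

```python
def slope(x, y):
    """Least squares slope for a one-covariate linear model."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    return sxy / sxx


def delta_betas(x, y):
    """Change in slope when each observation is removed in turn."""
    full = slope(x, y)
    changes = []
    for i in range(len(x)):
        x_rest = x[:i] + x[i + 1:]
        y_rest = y[:i] + y[i + 1:]
        changes.append(slope(x_rest, y_rest) - full)
    return changes


# Hypothetical data; the last observation has an implausible gestational age.
gest = [260, 270, 280, 290, 300, 700]
weight = [3100, 3300, 3500, 3700, 3900, 3600]
changes = delta_betas(gest, weight)
most_influential = max(range(len(changes)), key=lambda i: abs(changes[i]))
print(f"most influential observation: index {most_influential}")
```

Here the first five points lie exactly on a line with slope 20, so removing the last point restores that slope. This is the same pattern as observation 539 in the deck.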
04/19/23 H.S. 53
Removing outlier
                        Full model                  Outlier removed
                        coeff   95% conf. int.      coeff   95% conf. int.
Birth weight at ref     3524                        3531
Gestational age
  per day               6       (4, 8)              17      (13, 20)
Sex
  Boy                   0                           0
  Girl                  -139    (-229, -49)         -166    (-252, -80)
Parity
  0                     0                           0
  1                     232     (131, 333)          229     (132, 326)
  2-7                   226     (107, 345)          225     (112, 339)
One outlier affected two estimates
Final model
04/19/23 H.S. 54
MULTIVARIABLE ANALYSIS 2
Binary outcome: Logistic regression, Being bullied
04/19/23 H.S. 55
Ordered categories and model
Categories   Regression model
2            Logistic
3-7          Ordinal logistic
>7           Linear (treat as interval)
Interval versus ordered scale:
Interval scale: 1, 2, 3
Ordered scale: low, medium, high
04/19/23 H.S. 56
Logistic model and assumptions
• Association measure: odds ratio in y for 1 unit increase in x1, $OR_1 = e^{\beta_1}$
• Assumptions
– Independent errors
– Linear effects on the log odds scale
• Robustness
– Influence
04/19/23 H.S. 57
Being bullied
• We want the total effect of country on being bullied.
– The risk of being bullied depends on age and sex.
– The age and sex distribution may differ between countries.
• Should we adjust for age and sex?
[Figure: DAG with exposure E (country), outcome D (bullied), and C1 (age) and C2 (sex)]
No, age and sex are mediating variables
04/19/23 H.S. 58
Logistic: being bullied
            N      %      p-value   OR     95% conf. int.
Country                   <0.001
  Sweden    407    8.7              1
  Iceland   448    10.9             1.3    (0.8, 2)
  Norway    379    16.2             2.0    (1.3, 3.2)
  Finland   409    25.9             3.7    (2.4, 5.6)
  Denmark   436    23.4             3.2    (2.1, 4.9)
OR ≈ RR if the outcome is rare
OR > RR (further from 1) if the outcome is common
Prevalence of being bullied = 17%
Roughly:
Same risk of being bullied in Iceland as in Sweden.
2 times the risk in Norway as in Sweden.
3 times the risk in Finland as in Sweden.
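The gap between OR and RR can be checked from the percentages in the table, with Sweden as the reference. The sketch below computes crude (unadjusted) odds ratios and risk ratios from those percentages only; the variable names are illustrative.

```python
def odds(p):
    """Odds corresponding to a proportion p."""
    return p / (1 - p)


# Percent bullied per country, taken from the table; Sweden is the reference.
bullied_percent = {
    "Sweden": 8.7,
    "Iceland": 10.9,
    "Norway": 16.2,
    "Finland": 25.9,
    "Denmark": 23.4,
}

reference = bullied_percent["Sweden"] / 100
odds_ratio = {}
risk_ratio = {}
for country, percent in bullied_percent.items():
    p = percent / 100
    odds_ratio[country] = odds(p) / odds(reference)
    risk_ratio[country] = p / reference
    print(f"{country:8s} OR={odds_ratio[country]:.1f} RR={risk_ratio[country]:.1f}")
```

The crude odds ratios agree with the tabulated ones to one decimal, and each sits further from 1 than the matching risk ratio. For Finland the OR is about 3.7 while the RR is about 3.0, which is why the slide rounds to "3 times the risk".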
04/19/23 H.S. 59
Summing up
• DAGs
– State prior knowledge. Guide analysis
• Plots
– Linearity, variance, outliers
• Bivariate analysis
– Continuous symmetrical: mean, T-test, ANOVA
– Continuous skewed: median, nonparametric
– Categorical: freq, cross, chi-square
• Multivariable analysis
– Continuous: linear regression
– Binary: logistic regression