Page 1: Logistic Regression in Factor Identification of Covid-19 ...web.cortland.edu/matresearch/LogisticRegression.pdfLogistic Regression. We compare the two methods and verify how statistical

Logistic Regression in Factor Identification of Covid-19 Vaccine Clinical Trials

Jorge Luis Romeu, Ph.D.

https://www.researchgate.net/profile/Jorge_Romeu

http://web.cortland.edu/romeu/; Email: [email protected]

Copyright. December 11, 2020

1.0 Introduction

We compare Logistic Regression and Discriminant Analysis methods, using clinical trial data, to identify key factors that affect Covid-19 vaccine use. This work is part of our struggle against Covid-19:
https://www.researchgate.net/publication/341282217_A_Proposal_for_Fighting_Covid-19_and_its_Economic_Fallout

Previous work includes:

ICU and hospital staffing using the Negative Binomial distribution:
https://www.researchgate.net/publication/345914205_Covid-19_ICU_Staff_and_Equipment_Requirements_using_the_Negative_Binomial

screening DOEs:
https://www.researchgate.net/publication/344924536_Design_of_Experiments_DOE_in_Covid-19_Factor_Screening_and_Assessment

statistical methods to establish a new Vaccine Life:
https://www.researchgate.net/publication/344495955_Survival_Analysis_Methods_Applied_to_Establishing_Covid-19_Vaccine_Life

statistical methods to help accelerate vaccine testing:
https://www.researchgate.net/publication/344193195_Some_Statistical_Methods_to_Accelerate_Covid-19_Vaccine_Testing

a Markov model to study problems of reopening college:
https://www.researchgate.net/publication/343825461_A_Markov_Model_to_Study_College_Re-opening_Under_Covid-19

a Markov model to study the effects of Herd Immunization:
https://www.researchgate.net/publication/343345908_A_Markov_Model_to_Study_Covid-19_Herd_Immunization?channel=doi&linkId=5f244905458515b729f78487&showFulltext=true

a Markov Chain Model for Covid-19 Survival Analysis (general survival):
https://www.researchgate.net/publication/343021113_A_Markov_Chain_Model_for_Covid-19_Survival_Analysis

socio-economic and racial issues affected by Covid-19:
https://www.researchgate.net/publication/343700072_A_Digression_About_Race_Ethnicity_Class_and_Covid-19

An Example of Survival Analysis Applied to analyzing Covid-19 Data:
https://www.researchgate.net/publication/342583500_An_Example_of_Survival_Analysis_Data_Applied_to_Covid-19

Multivariate Statistics in the Analysis of Covid-19 Data, and More on Applying Multivariate Statistics to Covid-19 Data, both of which can also be found in:
https://www.researchgate.net/publication/341385856_Multivariate_Stats_PC_Discrimination_in_the_Analysis_of_Covid-19

the implementation of multivariate analysis methods:
https://www.researchgate.net/publication/342154667_More_on_Applying_Principal_Components_Discrimination_Analysis_to_Covid-19

Design of Experiments applied to the Assessment of Covid-19:
https://www.researchgate.net/publication/341532612_Example_of_a_DOE_Application_to_Coronavarius_Data_Analysis

Offshoring:
https://www.researchgate.net/publication/341685776_Off-Shoring_Taxpayers_and_the_Coronavarus_Pandemic

reliability methods in ICU assessment:
https://www.researchgate.net/publication/342449617_Example_of_the_Design_and_Operation_of_an_ICU_using_Reliability_Principles

Quality Control methods for monitoring Covid-19:
https://web.cortland.edu/matresearch/AplicatSPCtoCovid19MFE2020.pdf

a Numerical Example:
https://www.researchgate.net/publication/339936386_A_simple_numerical_example_that_illustrates_the_dangesrs_of_the_Coronavarus_epidemic


2.0 Problem Statement and Logistic Regression Analysis

This article starts by answering a question posed by some readers: why didn’t we use Logistic Regression in our Covid-19 data analyses? The short answer is that Logistic Regression and the Discriminant Function yield equivalent results, as will be shown here. Each analyst has their own preference; we are more familiar and experienced with Fisher's Discriminant Function.

Below we redo, using Logistic Regression, the Fisher Discriminant analysis of the data in Table #1, originally in Section 3.0 of:
https://www.researchgate.net/publication/344495955_Survival_Analysis_Methods_Applied_to_Establishing_Covid-19_Vaccine_Life?channel=doi&linkId=5f7c9ecba6fdccfd7b4c597d&showFulltext=true
We compare the two methods and verify that their statistical results are similar.

In the second part of this paper we analyze an illustrative, manufactured Clinical Trials dataset, using both Logistic Regression and Discriminant Analysis procedures. We verify again that the results of both procedures are equivalent. We start by briefly discussing Logistic Regression.

Logistic Regression (https://online.stat.psu.edu/stat504/node/150/) is related to the concept of odds, which this paper denotes OR (Odds Ratio). Assume we have four identical cups, one of which is ours. The odds of randomly selecting our own cup are one to three (1/3: one chance to win, and three to lose). The probability of correct selection is: p = OR/(1+OR) = [1/3]/[1+(1/3)] = 1/4 = 0.25. Moving in the opposite direction: OR = p / (1-p) = 0.25 / (1-0.25) = 0.25/0.75 = 1/3.
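The two conversions in the cup example can be checked numerically. The following is a minimal sketch; the function names are illustrative and do not come from the paper.

```python
def odds_to_probability(odds: float) -> float:
    """Convert odds in favor of an event into its probability: p = OR / (1 + OR)."""
    return odds / (1.0 + odds)


def probability_to_odds(p: float) -> float:
    """Convert a probability (0 <= p < 1) into odds in favor: OR = p / (1 - p)."""
    return p / (1.0 - p)


# Four identical cups, one of them ours: one way to win, three ways to lose.
cup_odds = 1 / 3
cup_probability = odds_to_probability(cup_odds)
print(cup_probability)                       # close to 0.25
print(probability_to_odds(cup_probability))  # close to 1/3
```

Applying one function after the other returns the starting value, which is the sense in which the text moves "in the opposite direction".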

Let’s define Y, a dichotomous (0, 1) random variable, such that P{Y=1} = p and P{Y=0} = 1 − p. The Event of Interest is {Y=1}, and we seek a vector of explanatory variables x from which p(Y=1|x) can be obtained. The Logistic Regression Model (Logit) is then obtained by regressing Log (OR) on the data:

Log (OR) = Log [ p(Y|x) / (1 − p(Y|x)) ] = β0 + x·β

The Logit is the logarithm of the odds (OR). It is modeled as a linear function of x. We obtain the regression coefficients β0 and vector β, and use them to obtain the classification probability p(Y|x; b) for each data point x. Solving for “p” in the above formula we get:

p(Y|x; b) = e^(β0 + x·β) / [1 + e^(β0 + x·β)] = 1 / [1 + e^(−(β0 + x·β))]
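The algebra behind "solving for p" is short. As a worked step, writing η for the linear predictor β0 + x·β (η is a symbol introduced here only for brevity; it is not used in the paper):

```latex
\begin{aligned}
\log\frac{p}{1-p} &= \eta, \qquad \eta = \beta_0 + x\cdot\beta \\
\frac{p}{1-p} &= e^{\eta} \\
p &= e^{\eta}\,(1-p) \;\Longrightarrow\; p\,\bigl(1+e^{\eta}\bigr) = e^{\eta} \\
p &= \frac{e^{\eta}}{1+e^{\eta}} = \frac{1}{1+e^{-\eta}}
\end{aligned}
```

The last equality follows by dividing numerator and denominator by e^η.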

Applying Logistic Regression to the dataset in Section 3.0 of the above-mentioned paper:

Link Function: Logit

Variable  Value  Count
DscGrps   1      51
          -1     22
          Total  73

Logistic Regression Table

                                              Odds     95% CI
Predictor     Coef    StDev      Z      P    Ratio   Lower    Upper
Constant    -27.18    10.25  -2.65  0.008
SocioEcon
 1           0.484    1.639   0.30  0.768     1.62    0.07    40.30
Age          0.480    0.193   2.49  0.013     1.62    1.11     2.36
Co-Morbid    2.327    1.465   1.59  0.112    10.25    0.58   181.07
Gender
 1           0.085    1.336   0.06  0.949     1.09    0.08    14.93

Log-Likelihood = -8.607

Test that all slopes are zero: G = 72.141, DF = 4, P-Value = 0.000

The table shows the Logistic Regression coefficients, their p-values, the estimated OR, and its 95% CI. If the OR is smaller than one, the association between Y and Xi is negative. If the OR is greater than one, the association is positive. If the OR equals one, there is no association (Y and Xi are independent). The p-value of the test that all slopes are zero (statistic G, computed from the Log-Likelihood) plays a role analogous to that of the F test p-value in Multiple Regression.
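The Z, P, Odds Ratio and 95% CI columns can be recovered from the Coef and StDev columns alone. The sketch below assumes Wald-type statistics: Z = Coef/StDev, a two-sided normal p-value, OR = exp(Coef), and CI = exp(Coef ± 1.96·StDev). That the software used exactly these formulas is an assumption; it does reproduce the printed Age row to the displayed precision. The function name is illustrative.

```python
import math


def wald_summary(coef: float, std_err: float, z_crit: float = 1.96):
    """Return (Z, two-sided P, odds ratio, CI lower, CI upper) for one coefficient."""
    z = coef / std_err
    # Two-sided normal tail probability: P(|N(0,1)| > |z|) = erfc(|z| / sqrt(2)).
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    odds_ratio = math.exp(coef)
    lower = math.exp(coef - z_crit * std_err)
    upper = math.exp(coef + z_crit * std_err)
    return z, p_value, odds_ratio, lower, upper


# Age row of the four-predictor table: Coef = 0.480, StDev = 0.193.
z, p_value, odds_ratio, lower, upper = wald_summary(0.480, 0.193)
print(f"Z={z:.2f} P={p_value:.3f} OR={odds_ratio:.2f} CI=({lower:.2f}, {upper:.2f})")
```

A confidence interval that contains one corresponds to a coefficient whose interval contains zero, which is why a wide interval covering one signals a weak or uncertain factor.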

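Our Logistic Regression Equation is:

Log [ p(Y|x)/(1 − p(Y|x)) ] = -27.18 + 0.484X1 + 0.480X2 + 2.327X3 + 0.085X4

where X1 = SocioEcon, X2 = Age, X3 = Co-Morbid and X4 = Gender, in the order of the table. The variables shown in red in the original table, Age (p = 0.013) and Co-Morbid (p = 0.112), are statistically significant or close to significant. The data are re-analyzed using only these two mildly significant variables, Age & Co-Morbidities: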

Link Function: Logit

Response Information

Variable  Value  Count
DscGrps   1      51
          -1     22
          Total  73

Logistic Regression Table

                                               Odds     95% CI
Predictor      Coef    StDev      Z      P    Ratio   Lower    Upper
Constant    -25.930    8.846  -2.93  0.003
Age           0.456    0.165   2.76  0.006     1.58    1.14     2.18
Co-Morbid     2.452    1.435   1.71  0.088    11.61    0.70   193.45

Log-Likelihood = -8.656

Test that all slopes are zero: G = 72.043, DF = 2, P-Value = 0.000

The p-values of variables Age & Co-Morbidities are significant (at α=0.1). The test that all slopes are zero is highly significant; hence at least one variable in the equation must have a non-zero coefficient. The Odds Ratio CI for Co-Morbidities is very wide and covers one. We obtain P{Y|xi} by using vector x in:

p(Y|x; b) = e^(β0 + x·β) / [1 + e^(β0 + x·β)]
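The G statistic printed under each Logistic Regression table can be reproduced from the Log-Likelihood and the group counts. The sketch below assumes that G is the usual likelihood-ratio statistic, twice the difference between the fitted model's log-likelihood and that of an intercept-only model; this assumption matches the printed values. Function names are illustrative.

```python
import math


def null_log_likelihood(events: int, non_events: int) -> float:
    """Log-likelihood of the intercept-only model, where every p equals events/n."""
    n = events + non_events
    return events * math.log(events / n) + non_events * math.log(non_events / n)


def g_statistic(model_log_likelihood: float, events: int, non_events: int) -> float:
    """Likelihood-ratio statistic G = 2 * (LL_model - LL_null)."""
    return 2.0 * (model_log_likelihood - null_log_likelihood(events, non_events))


# Section 2.0 data: 51 patients coded 1 and 22 coded -1.
g_four = g_statistic(-8.607, 51, 22)  # four-predictor model
g_two = g_statistic(-8.656, 51, 22)   # Age and Co-Morbid only
# With DF = 2 the chi-square upper tail has the closed form exp(-G / 2).
p_two = math.exp(-g_two / 2.0)
print(round(g_four, 2), round(g_two, 2), p_two)
```

The p-value for DF = 2 is far below any usual α, which is why the output prints it as 0.000.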


Fisher Discrimination Function for said data set (displayed at the end)

Below is the data re-analysis (DG v. Age, Co-Morb) using Fisher Discriminant function:

The regression equation is: DG = - 2.77 + 0.0459 Age + 0.286 Comorb

Predictor   Coef       SE Coef    T        P
Constant    -2.7651    0.1984     -13.94   0.000
Age         0.045914   0.003785   12.13    0.000
Co-Morb     0.28564    0.08278    3.45     0.001

S = 0.500681   R-Sq = 75.3%   R-Sq(adj) = 74.8%   (explains ¾ of the problem)

That these two results are similar can be verified by plotting them:

[Figure: Scatterplot of FisherClass vs Logistic. Vertical axis: FisherClass (-1.0 to 2.0); horizontal axis: Logistic (-10 to 20).]

The straight line formed in the scatter plot by the patient classification scores of these two statistical methods shows the equivalence of the two results: both classify patients in a similar way.

In addition, both statistical methods missed exactly the same four patients:

Group  Age  Co-Morb  Fisher    Logistic
-1     55   1        0.0455    1.59
 1     52   0        -0.3924   -2.27
 1     45   2        -0.1185   -0.53
 1     51   1        -0.1397   -0.25
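The agreement on the four misses can be checked by evaluating both two-variable functions printed above. This is a sketch under two assumptions that the paper does not state explicitly: a score above zero assigns the patient to group 1, and the rounded coefficients shown in the output are used, so scores may differ from the tabulated ones in the later decimals.

```python
def fisher_score(age: float, comorb: int) -> float:
    """Two-variable Fisher discriminant (regression) function from the text."""
    return -2.7651 + 0.045914 * age + 0.28564 * comorb


def logistic_score(age: float, comorb: int) -> float:
    """Two-variable logit from the text (log of the odds of Y = 1)."""
    return -25.930 + 0.456 * age + 2.452 * comorb


def classify(score: float) -> int:
    """Assumed rule: a positive score means group 1, otherwise group -1."""
    return 1 if score > 0 else -1


# (actual group, Age, Co-Morb) for the four patients listed as misses.
misses = [(-1, 55, 1), (1, 52, 0), (1, 45, 2), (1, 51, 1)]
for group, age, comorb in misses:
    f = classify(fisher_score(age, comorb))
    g = classify(logistic_score(age, comorb))
    print(group, age, comorb, f, g, f == g, f != group)
```

For each of the four patients both methods assign the same group, and that group differs from the actual one.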


This reinforces the evidence of how similar the results of both statistical procedures are. For completeness, we reprint the original data, their classification values using both procedures, and the probability of the Event {Y=1|x} (patients infected during the experiment):

Table 1. Data from Section 3.0 of the above-mentioned article.

No. SocioEcon Age Comorb Gender DG Discrim Logistic ProbEvent

1 1 45 1 0 0 -0.4175 -3.01 0.04974

2 0 50 1 1 0 -0.186 -0.71 0.3387

3 0 40 0 1 0 -0.948 -7.79 0.00046

4 0 43 0 0 0 -0.8091 -6.41 0.00181

5 1 47 1 1 0 -0.3249 -2.09 0.11531

6 0 48 1 0 0 -0.2786 -1.63 0.17059

7 1 49 1 0 0 -0.2323 -1.17 0.24503

8 0 42 0 0 0 -0.8554 -6.87 0.00115

9 0 36 0 1 0 -1.1332 -9.63 0.00007

10 0 39 0 0 0 -0.9943 -8.25 0.00029

11 0 46 1 0 0 -0.3712 -2.55 0.0763

12 1 44 1 1 0 -0.4638 -3.47 0.03211

13 0 42 1 1 0 -0.5564 -4.39 0.01315

14 0 51 1 0 0 -0.1397 -0.25 0.44696

15 0 49 0 0 0 -0.5313 -3.65 0.02719

16 0 55 1 0 0 0.0455 1.59 0.83365

17 0 45 1 0 0 -0.4175 -3.01 *

18 0 47 1 1 0 -0.3249 -2.09 *

19 0 42 0 1 0 -0.8554 -6.87 *

20 1 44 1 0 0 -0.4638 -3.47 *

21 0 47 1 1 0 -0.3249 -2.09 *

22 0 41 0 1 0 -0.9017 -7.33 0.00073

23 0 73 1 1 1 0.8789 9.87 0.99995

24 1 58 2 0 1 0.4834 5.45 0.99565

25 0 60 1 1 1 0.277 3.89 0.98001

26 0 52 0 0 1 -0.3924 -2.27 0.09895

27 0 65 2 0 1 0.8075 8.67 0.99982

28 0 72 1 0 1 0.8326 9.41 0.99991

29 0 66 1 0 1 0.5548 6.65 0.99868

30 0 61 1 1 1 0.3233 4.35 0.98724

31 1 55 2 0 1 0.3445 4.07 0.98311

32 0 63 2 0 1 0.7149 7.75 0.99955

33 0 78 1 1 1 1.1104 12.17 0.99999

34 0 73 2 1 1 1.1779 12.35 1

35 0 77 1 1 1 1.0641 11.71 0.99999


36 0 79 1 0 1 1.1567 12.63 1

37 0 82 1 0 1 1.2956 14.01 1

38 0 73 1 0 1 0.8789 9.87 *

39 0 78 2 0 1 1.4094 14.65 1

40 0 74 1 0 1 0.9252 10.33 0.99997

41 0 68 1 1 1 0.6474 7.57 0.99947

42 0 66 1 1 1 0.5548 6.65 *

43 0 69 2 0 1 0.9927 10.51 0.99997

44 0 77 0 1 1 0.7651 9.23 0.9999

45 0 85 2 0 1 1.7335 17.87 1

46 0 55 1 0 1 0.0455 1.59 *

47 1 45 2 1 1 -0.1185 -0.53 0.37805

48 0 49 2 0 1 0.0667 1.31 0.79032

49 0 57 1 1 1 0.1381 2.51 0.92581

50 0 51 1 0 1 -0.1397 -0.25 *

51 0 66 2 1 1 0.8538 9.13 0.99989

52 0 69 2 1 1 0.9927 10.51 *

53 0 59 1 1 1 0.2307 3.43 0.96882

54 1 55 2 1 1 0.3445 4.07 *

55 0 67 2 0 1 0.9001 9.59 0.99993

56 0 59 1 1 1 0.2307 3.43 *

57 0 68 2 0 1 0.9464 10.05 0.99995

58 0 72 1 1 1 0.8326 9.41 *

59 0 77 1 1 1 1.0641 11.71 *

60 0 73 1 1 1 0.8789 9.87 *

61 0 70 0 1 1 0.441 6.01 0.99753

62 0 79 1 0 1 1.1567 12.63 *

63 0 80 2 0 1 1.502 15.57 1

64 0 82 2 1 1 1.5946 16.49 1

65 0 81 0 0 1 0.9503 11.07 0.99998

66 0 84 1 1 1 1.3882 14.93 1

67 0 85 2 1 1 1.7335 17.87 *

68 0 72 1 1 1 0.8326 9.41 *

69 1 66 2 1 1 0.8538 9.13 *

70 0 69 2 1 1 0.9927 10.51 *

71 0 77 2 1 1 1.3631 14.19 1

72 0 79 0 0 1 0.8577 10.15 0.99996

73 0 84 0 0 1 1.0892 12.45 1

(the four lines highlighted in yellow in the original, Nos. 16, 26, 47 and 50, are misclassifications)

We provide below histograms representing the corresponding Age and Co-Morbidity patterns:


The distributions of patient ages greatly differ between the two groups (Infected or not).

Patients in the Infected group have more Co-Morbidities (the value two occurs only in that group).

The identification of statistically significant variables in the Logistic Regression and Fisher Discriminant Analysis models is supported by the differing distributions of the variables Age and Co-Morbidities in the two subgroups (Patients Infected and Not Infected).

[Figure: Co-Morbidities. Histograms of Profile and Profile_1, one panel per group; horizontal axis: number of co-morbidities (0, 1, 2); vertical axis: Frequency.]

[Figure: Age Distribution. Histograms of Age and Age_1, one panel per group; horizontal axis: age (36 to 84); vertical axis: Frequency.]


3.0 Clinical Trials Data Analysis using Logistic Regression and Discriminant Function

Again, we tried unsuccessfully to obtain Covid-19 patient data. Since it is important to show the use of Logistic Regression techniques with appropriate data, we created a data set (Table #2), built from the previous example and adapted to suit the present one. Modifications included changing several concomitant variables for each individual, using our judgment and experience.

Its intent, as before, is to show how these two statistical procedures can be used to identify key factors that differentiate the two patient groups analyzed (vaccinated and not).

Assume that Covid-19 vaccine clinical trials were implemented. But now only data from infected

participants were analyzed. Those infected participants who were given a placebo (denoted with

0 in column Infected) are numbered 1 to 31. Those infected participants who were given the real

vaccine (denoted with 1 in column Infected) are numbered 32 to 68. Our Event of Interest (Y=1)

is Infected Participants that received the real vaccine (as opposed to a placebo).

Description of the concomitant data recorded from each individual participant:

Co-morbid: 0 = None; 1 = Some
Gender: 0 = Male; 1 = Female
Infected: 0 = Placebo; 1 = Vaccine
Profile: Number

Participant Profile is numbered: Zero, if the participant seldom interacts with others; One, if there is some cautious interaction with the outside world; Two, if there is extensive interaction.

The columns Discrim and LogEval contain each participant's evaluation under the corresponding Discriminant and Logistic functions. These outcomes will be discussed later in this paper, when we compare again the results of these two similar statistical procedures.

Table 2: modified data matrix, from the original, created data

No. Co-Morb Age Profile Gender Infected Discrim LogEval EventProb

1 1 45 1 0 0 0.0999 -2.3611 0.0862

2 0 50 1 1 0 0.2161 -1.7109 0.1530

3 1 60 0 1 0 0.3037 -1.3028 0.2137

4 0 43 0 0 0 -0.091 -3.5135 0.0289

5 1 47 2 1 0 0.2912 -1.2088 0.2299

6 0 68 1 0 0 0.6345 0.6298 0.6524

7 1 49 1 0 0 0.1929 -1.8410 0.1369

8 0 42 2 0 0 0.1750 -1.8590 0.1348

9 0 56 0 1 0 0.2108 -1.8230 0.1391

10 0 39 1 0 0 -0.039 -3.1414 0.0414

11 0 46 1 0 0 0.1232 -2.2311 0.0970

12 1 74 1 1 0 0.7739 1.4100 0.8038

13 0 42 1 1 0 0.0302 -2.7512 0.0600


14 0 51 1 0 0 0.2394 -1.5809 0.1707

15 0 59 1 0 0 0.4253 -0.5406 0.3681

16 0 45 1 0 0 0.0999 -2.3611 *

17 0 47 2 1 0 0.2912 -1.2088 *

18 1 72 0 1 0 0.5826 0.2577 0.5641

19 1 44 1 0 0 0.0767 -2.4912 0.0765

20 0 47 1 1 0 0.1464 -2.1010 0.1090

21 0 41 0 1 0 -0.137 -3.7735 0.0225

22 0 73 1 1 0 0.7507 1.2800 0.7824

23 1 58 2 0 0 0.5468 0.2216 0.5552

24 0 60 1 1 0 0.4485 -0.4105 0.3988

25 0 65 2 0 0 0.7095 1.1319 0.7562

26 1 72 1 0 0 0.7274 1.1499 0.7595

27 0 66 2 0 0 0.7328 1.2620 0.7794

28 0 61 1 1 0 0.4718 -0.2805 0.4303

29 1 55 2 0 0 0.4771 -0.1685 0.4580

30 0 63 2 0 0 0.6630 0.8718 0.7051

31 2 78 0 1 0 0.7221 1.0379 0.7384

32 0 73 1 1 1 0.7507 1.2800 *

33 0 77 1 1 1 0.8436 1.8001 0.8582

34 1 79 1 0 1 0.8901 2.0602 0.8870

35 1 82 1 0 1 0.9598 2.4503 0.9206

36 0 33 2 0 1 -0.034 -3.0293 0.0461

37 1 78 1 0 1 0.8669 1.9302 0.8733

38 0 74 1 0 1 0.7739 1.4100 *

39 0 68 1 1 1 0.6345 0.6298 *

40 0 66 1 1 1 0.5880 0.3697 0.5914

41 0 69 1 0 1 0.6577 0.7598 0.6813

42 0 77 0 1 1 0.6988 0.9079 0.7126

43 1 85 0 0 1 0.8848 1.9482 0.8752

44 0 55 2 0 1 0.4771 -0.1685 *

45 0 49 2 0 1 0.3377 -0.9487 0.2791

46 0 57 2 1 1 0.5236 0.0916 0.5229

47 0 66 2 1 1 0.7328 1.2620 *

48 1 69 1 1 1 0.6577 0.7598 *

49 0 59 1 1 1 0.4253 -0.5406 *

50 1 55 2 1 1 0.4771 -0.1685 *

51 0 67 2 0 1 0.7560 1.3920 0.8009

52 0 59 1 1 1 0.4253 -0.5406 *

53 1 68 1 0 1 0.6345 0.6298 *

54 0 72 1 1 1 0.7274 1.1499 *

55 1 77 0 1 1 0.6988 0.9079 *

56 0 73 1 1 1 0.7507 1.2800 *


57 0 70 0 1 1 0.5361 -0.0024 0.4994

58 1 79 1 0 1 0.8901 2.0602 *

59 1 80 1 0 1 0.9133 2.1902 0.8994

60 0 82 1 1 1 0.9598 2.4503 *

61 1 81 0 0 1 0.7918 1.4280 0.8066

62 1 84 1 1 1 1.0063 2.7104 0.9376

63 1 85 0 1 1 0.8848 1.9482 *

64 1 66 2 1 1 0.7328 1.2620 *

65 0 69 2 1 1 0.8025 1.6521 0.8392

66 0 77 1 1 1 0.8436 1.8001 *

67 1 79 0 0 1 0.7453 1.1680 0.7628

68 1 84 0 0 1 0.8615 1.8182 0.8603

Discriminant Regression Analysis: Vaccine versus Co-morb, Age, Profile, Gender

The regression equation is:

Infected = - 1.14 - 0.145 Comorb + 0.0248 Age + 0.132 Profile + 0.042 Gender

Predictor   Coef       SE Coef    T       P
Constant    -1.1419    0.2952     -3.87   0.000
Comorb      -0.1447    0.1057     -1.37   0.176
Age         0.024849   0.004110   6.05    0.000
Profile     0.13183    0.08040    1.64    0.106
Gender      0.0421     0.1029     0.41    0.684

S = 0.408134   R-Sq = 37.8%   R-Sq(adj) = 33.8%

Analysis of Variance

Source           DF   SS        MS       F      P
Regression       4    6.3735    1.5934   9.57   0.000
Residual Error   63   10.4941   0.1666
Total            67   16.8676

Notice that only the factors Age and Profile stand out: Age is highly significant (p = 0.000) and Profile is borderline (p = 0.106, just above α=0.1). They have an impact on the difference between infected patients being either vaccinated or not. The other two factors considered (presence of co-morbidities and gender) are not significant, which means they do not seem to be associated with the vaccination status of the patient.

The model explains a little over 1/3 of the problem (R-Sq = 0.378). In the previous example, it explained about 3/4 of the problem (R-Sq = 0.753). Such a low, but still realistic, Index of Fit means that additional factors must be found in order to explain a larger portion of the differences between the two groups.
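The Index of Fit and the F statistic follow directly from the sums of squares in the Analysis of Variance table. A minimal sketch using the standard definitions (the function name is illustrative):

```python
def anova_summary(ss_regression: float, ss_error: float, df_regression: int, df_error: int):
    """Return (R-Sq, adjusted R-Sq, F) from the sums of squares of an ANOVA table."""
    ss_total = ss_regression + ss_error
    df_total = df_regression + df_error
    r_sq = ss_regression / ss_total
    r_sq_adj = 1.0 - (ss_error / df_error) / (ss_total / df_total)
    f_stat = (ss_regression / df_regression) / (ss_error / df_error)
    return r_sq, r_sq_adj, f_stat


# Four-predictor discriminant regression: SS and DF from its ANOVA table.
r_sq, r_sq_adj, f_stat = anova_summary(6.3735, 10.4941, 4, 63)
print(f"R-Sq={r_sq:.1%} R-Sq(adj)={r_sq_adj:.1%} F={f_stat:.2f}")
```

The adjusted value penalizes the number of predictors, so it is the fairer figure when comparing models with different numbers of factors.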

We present next the residual plots used to check the regression model assumptions, which are self-explanatory.

Finally, we redo the Discriminant Function using only the two above-mentioned factors: Age and Profile.


[Figure: Residual Plots for Vaccine. Panels: Normal Probability Plot of the Residuals; Residuals Versus the Fitted Values; Histogram of the Residuals; Residuals Versus the Order of the Data.]

Regression Analysis: Vaccine versus Age, Profile

The regression equation is:

Infected = - 1.09 + 0.0232 Age + 0.145 Profile

Predictor   Coef       SE Coef    T       P
Constant    -1.0907    0.2924     -3.73   0.000
Age         0.023241   0.003897   5.96    0.000
Profile     0.14479    0.07872    1.84    0.070

S = 0.409384   R-Sq = 35.4%   R-Sq(adj) = 33.4%

Analysis of Variance

Source           DF   SS        MS       F       P
Regression       2    5.9739    2.9870   17.82   0.000
Residual Error   65   10.8937   0.1676
Total            67   16.8676

Notice how the significance of Profile has improved (p = 0.070, so we can now use α=0.07). The Index of Fit remains practically the same (R-Sq = 35.4%). We still need to search for additional patient characteristics (i.e. additional factors) that help better explain the differences between the two groups.


[Figure: Residual Plots for Vaccine. Panels: Normal Probability Plot of the Residuals; Residuals Versus the Fitted Values; Histogram of the Residuals; Residuals Versus the Order of the Data.]

We now implement the equivalent Logistic Regression with the same data set.

Binary Logistic Regression: Vaccine versus Comorb, Age, Profile, Gender

Link Function: Logit

Response Information

Variable  Value  Count
Vaccine   1      37     (Event)
          0      31
          Total  68

Logistic Regression Table

                                                    Odds     95% CI
Predictor        Coef     SE Coef      Z      P    Ratio   Lower   Upper
Constant     -9.91009     2.68358  -3.69  0.000
Comorb
 1          -0.417620    0.704333  -0.59  0.553     0.66    0.17    2.62
 2           -22.9869     24542.6  -0.00  0.999     0.00    0.00       *
Age          0.142119   0.0350788   4.05  0.000     1.15    1.08    1.23
Profile      0.835803    0.545221   1.53  0.125     2.31    0.79    6.72
Gender
 1           0.595450    0.667785   0.89  0.373     1.81    0.49    6.71

Log-Likelihood = -30.820

Test that all slopes are zero: G = 32.099, DF = 5, P-Value = 0.000

Compare with the above Discriminant Function, and verify that the same two factors, Age (p = 0.000) and Profile (p = 0.125), have the smallest p-values, and that the same other two factors are not significant.

Redone using only these two variables (Age and Profile):

Link Function: Logit

Response Information

Variable  Value  Count
Vaccine   1      37     (Event)
          0      31
          Total  68

Logistic Regression Table

                                                    Odds     95% CI
Predictor        Coef     SE Coef      Z      P    Ratio   Lower   Upper
Constant     -9.10513     2.42337  -3.76  0.000
Age          0.130039   0.0318680   4.08  0.000     1.14    1.07    1.21
Profile      0.892254    0.512681   1.74  0.082     2.44    0.89    6.67

Log-Likelihood = -32.976

Test that all slopes are zero: G = 27.786, DF = 2, P-Value = 0.000

Again, the results are equivalent to the ones obtained with the Discriminant Function. To verify this, we plot the two responses, evaluated with their respective Logistic and Discriminant functions. The almost perfectly straight line shows how these two results are, in fact, equivalent:

[Figure: Regression Eval Fits (significant variables). Scatterplot of LogEval (vertical axis, -4 to 3) versus FITS2 (horizontal axis, 0.00 to 1.00).]
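How close to a straight line the two evaluations lie can be quantified with a correlation coefficient. The sketch below evaluates the two reduced functions of this section (coefficients taken from the printed outputs) on a handful of Table 2 participants and computes the Pearson correlation. The choice of participants and the helper names are illustrative.

```python
import math


def discrim(age: float, profile: int) -> float:
    """Two-variable discriminant (regression) function of Section 3.0 (column FITS2)."""
    return -1.0907 + 0.023241 * age + 0.14479 * profile


def log_eval(age: float, profile: int) -> float:
    """Two-variable logit of Section 3.0 (column LogEval)."""
    return -9.10513 + 0.130039 * age + 0.892254 * profile


def pearson(xs, ys) -> float:
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# (Age, Profile) of participants 1, 3, 5, 6, 36, 43 and 62 in Table 2.
sample = [(45, 1), (60, 0), (47, 2), (68, 1), (33, 2), (85, 0), (84, 1)]
fits = [discrim(a, p) for a, p in sample]
logits = [log_eval(a, p) for a, p in sample]
print(round(pearson(fits, logits), 4))
```

The correlation is very close to one but not exactly one, because the ratios of the Age and Profile coefficients differ slightly between the two functions.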


The probability (EventProb: last column of Table 2) that the ith patient belongs to the group of interest (Y=1), given its vector of characteristics Xi, is:

P{Y=1 | Xi} = p(Y=1 | Xi; b) = e^(β0 + Xi·β) / [1 + e^(β0 + Xi·β)]

For example, for patient number one in Table #2 we have β0 = -9.11; β1 = 0.13 (Age); β2 = 0.89 (Profile):

No.  Co-Morb  Age  Profile  Gender  Infected  FITS2    LogEval   EventProb
1    1        45   1        0       0         0.0999   -2.3611   0.0862

β0 + Xi·β = -9.10513 + 0.130039(45) + 0.892254(1) = -2.3611

P{Y=1 | Xi; b = (β0, β1, β2)} = e^(-2.3611) / [1 + e^(-2.3611)] = 0.086
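The same evaluation can be scripted for any participant. The sketch below uses the coefficients of the reduced Logistic Regression table and compares the result with the EventProb values printed in Table 2 for three participants. The function name is illustrative.

```python
import math


def event_probability(age: float, profile: int) -> float:
    """P{Y=1 | x} from the reduced (Age, Profile) logistic model of Section 3.0."""
    logit = -9.10513 + 0.130039 * age + 0.892254 * profile
    return 1.0 / (1.0 + math.exp(-logit))


# (No., Age, Profile, EventProb printed in Table 2)
checks = [(1, 45, 1, 0.0862), (6, 68, 1, 0.6524), (36, 33, 2, 0.0461)]
for number, age, profile, printed in checks:
    print(number, round(event_probability(age, profile), 4), printed)
```

Researchers substituting their own data would replace the three coefficients with those of their own fitted model.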

Below, find the distributions of the two Age groups analyzed:

[Figure: Age (Vaccinated and Not). Histograms of Age and Age_1, one panel per group; horizontal axis: age (36 to 84); vertical axis: Frequency.]

Notice how the age distributions of the two infected groups differ. This factor is highly significant (very small p-value).


Below, find the distribution of the Patient Profiles for the two groups analyzed:

[Figure: Profile (Vaccinated and Not). Histograms of Profile and Profile_1, one panel per group; horizontal axis: profile (0, 1, 2); vertical axis: Frequency.]

Notice how the distributions differ, although less than with Age. That is why the p-value (0.082) is higher, and why the OR 95% CI is wider and covers one. This factor is less reliable than the first one.

Below are the Descriptive Statistics: Age, Age_1, Profile, Profile_1

Variable    N    Mean    SE Mean  StDev   Min     Q1      Median  Q3      Maximum
Age         31   55.42   2.06     11.49   39.00   45.00   55.00   65.00   78.00
Age_1       37   70.89   1.86     11.31   33.00   66.00   73.00   79.00   85.00
Profile     31   1.065   0.122    0.680   0.000   1.000   1.000   2.000   2.000
Profile_1   37   1.027   0.113    0.687   0.000   1.000   1.000   1.500   2.000

Compare the five descriptive statistics (Min, Q1, Median, Q3, Maximum) for the two statistically significant factors analyzed, Profile and Age, and verify how they differ (as in the graphs above).


5.0 Discussion

Again, the data used in this analysis were not collected; they were created by this author for illustrative purposes. Hence, the results and discussion below are also only illustrative. We hope, with this exercise, to encourage researchers in the Public Health and medical environments to implement statistical procedures using their real Covid-19 data.

By implementing either Logistic Regression or Discriminant Analysis, we detect two factors that differentiate the two groups of infected patients (those who have been vaccinated, and those who have not). Since the data are assumed to be random samples from these two groups, we can infer that older patients, even when vaccinated, are still prone to become infected, and that patient profile (i.e. the level at which they interact with the rest of the world) also has an effect. The latter effect is weak; hence a larger sample should be drawn to confirm or reject it.

This approach can be reproduced by public health and medical researchers, using as responses different pairs of groups: placebo v. vaccinated, infected v. not infected, deceased v. surviving, Vaccine A v. Vaccine B, etc. Patient factors may include any characteristic of interest: weight, age, gender, occupation, number (or specific types) of co-morbidities, level of interaction, etc. Logistic Regression or Discriminant Analysis can then be implemented, and the statistically significant factors will identify the key elements on which to undertake further research.
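As a starting point for reproducing the approach with other data, the sketch below fits the reduced model of Section 3.0 (Vaccine versus Age and Profile) to the Table 2 columns by Newton-Raphson maximum likelihood, using only the Python standard library. It is not the software used for the outputs in this paper; the data lists are transcribed from Table 2, and the function names, the starting values (all zeros) and the fixed number of iterations are assumptions of this illustration. In practice a statistical package would be used instead.

```python
import math

# (Age, Profile) pairs transcribed from Table 2: participants 1-31 (placebo, Y = 0)
# and participants 32-68 (vaccine, Y = 1).
PLACEBO = [(45, 1), (50, 1), (60, 0), (43, 0), (47, 2), (68, 1), (49, 1), (42, 2),
           (56, 0), (39, 1), (46, 1), (74, 1), (42, 1), (51, 1), (59, 1), (45, 1),
           (47, 2), (72, 0), (44, 1), (47, 1), (41, 0), (73, 1), (58, 2), (60, 1),
           (65, 2), (72, 1), (66, 2), (61, 1), (55, 2), (63, 2), (78, 0)]
VACCINE = [(73, 1), (77, 1), (79, 1), (82, 1), (33, 2), (78, 1), (74, 1), (68, 1),
           (66, 1), (69, 1), (77, 0), (85, 0), (55, 2), (49, 2), (57, 2), (66, 2),
           (69, 1), (59, 1), (55, 2), (67, 2), (59, 1), (68, 1), (72, 1), (77, 0),
           (73, 1), (70, 0), (79, 1), (80, 1), (82, 1), (81, 0), (84, 1), (85, 0),
           (66, 2), (69, 2), (77, 1), (79, 0), (84, 0)]
ROWS = [(a, p, 0) for a, p in PLACEBO] + [(a, p, 1) for a, p in VACCINE]


def solve(matrix, rhs):
    """Solve matrix * x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [list(row) + [rhs[i]] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(aug[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (aug[r][n] - tail) / aug[r][r]
    return x


def fit_logistic(rows, iterations=25):
    """Newton-Raphson maximum-likelihood fit of logit(p) = b0 + b1*Age + b2*Profile."""
    design = [(1.0, float(age), float(profile)) for age, profile, _ in rows]
    y = [label for _, _, label in rows]
    beta = [0.0, 0.0, 0.0]
    for _ in range(iterations):
        grad = [0.0] * 3
        hess = [[0.0] * 3 for _ in range(3)]
        for x, label in zip(design, y):
            p = 1.0 / (1.0 + math.exp(-sum(b * v for b, v in zip(beta, x))))
            for i in range(3):
                grad[i] += (label - p) * x[i]
                for j in range(3):
                    hess[i][j] += p * (1.0 - p) * x[i] * x[j]
        # Newton step: beta <- beta + (X'WX)^(-1) X'(y - p)
        beta = [b + step for b, step in zip(beta, solve(hess, grad))]
    log_lik = 0.0
    for x, label in zip(design, y):
        p = 1.0 / (1.0 + math.exp(-sum(b * v for b, v in zip(beta, x))))
        log_lik += math.log(p) if label == 1 else math.log(1.0 - p)
    return beta, log_lik


coefficients, log_likelihood = fit_logistic(ROWS)
print([round(b, 3) for b in coefficients], round(log_likelihood, 3))
```

The printed coefficients (Constant, Age, Profile) and log-likelihood can be compared with the reduced Logistic Regression table of Section 3.0. To analyze other data, replace the two lists and, if more factors are needed, extend the design rows and the dimension of the coefficient vector accordingly.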

As we have said in our previous article, vaccine development, including clinical trials and release decisions, must be based on science, not politics. When a vaccine is released, it is because its risk analysis has shown that the vaccine yields more benefit than harm. The early release of Covid-19 vaccines is due to the urgency created by more than a million deaths world-wide, including 300 thousand in the USA, and counting.

6.0 Conclusions

This Covid-19 work stems from our proposal to the retired academic and research communities:

https://www.researchgate.net/publication/341282217_A_Proposal_for_Fighting_Covid-19_and_its_Economic_Fallout

which pursues one goal: to contribute to defeating Covid-19.

This paper is a tutorial on the uses of Logistic Regression to help identify key factors in Covid-

19 Clinical Trials. The data analyzed was created by this researcher, using his experience and

information. Hence, its numerical results have only illustrative value. However, public health

and medical researchers and practitioners can follow our Logistic Regression procedures, and

substitute their own data for ours, generating additional analyses, and including new factors, as

they become available.

We want to reach four audiences: (1) public health professionals and researchers, (2) medical

doctors, (3) statisticians and (4) the public in general.

We want to encourage public health and medical professionals to use more statistical procedures

and do more joint work with statisticians, not only after data have been collected, but also at the

time that experiments are being designed.

We want to encourage statisticians, especially those retired, who have the experience, financial

support (their pension), and the time to provide such assistance, to contribute in helping with the

planning, implementation and analysis of statistical procedures, or with writing about them.

We want to provide illustrative examples to doctors, public health researchers, and to the general

public, to help them better understand what the others do, fostering more efficient collaboration.

Finally, this series of papers on statistical analysis of Covid-19, listed in the initial section of this

article, could become part of a biostatistics course in a public health or medical curriculum, or an

applications course in a statistics department.

Bibliography

Beyer, W., Editor. Handbook of Tables for Probability and Statistics. The Chemical Rubber Co.

(CRC). Ohio. 1966.

Box, G., Hunter, W. G., and J. S. Hunter. Statistics for Experimenters. Wiley. New York. 1978.

Walpole, R. E. and R. H. Myers. Probability and Statistics for Engineers and Scientists. Prentice-

Hall. http://www.elcom-hu.com/Mshtrk/Statstics/9th%20txt%20book.pdf

Romeu, J. L. Operations Research and Statistics Techniques. Proceedings of Federal Conference

on Statistical Methodology. https://web.cortland.edu/matresearch/OR&StatsFCSMPaper.pdf

Romeu, J. L. Determining the Experimental Sample Size. Journal of Systems Reliability Center.

(SRC): 3rd Qtr. 2005 (pp. 11-21).

About the Author:

Jorge Luis Romeu retired Emeritus from the State University of New York (SUNY). He was, for

sixteen years, a Research Professor at Syracuse University, where he is currently an Adjunct

Professor of Statistics. Romeu worked for many years as a Senior Research Engineer with the

Reliability Analysis Center (RAC), an Air Force Information and Analysis Center operated by

IIT Research Institute (IITRI). Romeu received seven Fulbright assignments: in Mexico (3), the

Dominican Republic (2), Ecuador, and Colombia. He holds a doctorate in Statistics/O.R., is a C.

Stat. Fellow of the Royal Statistical Society, a Senior Member of the American Society for

Quality (ASQ), and a Member of the American Statistical Association. He is a Past ASQ Regional

Director (and currently a Deputy Regional Director), and holds Reliability and Quality ASQ

Professional Certifications. Romeu created and directs the Juarez Lincoln Marti International Ed.

Project (JLM, https://web.cortland.edu/matresearch/), which (i) supports higher education in

Ibero-America and (ii) maintains the Quality, Reliability and Continuous Improvement Institute

(QR&CII, https://web.cortland.edu/romeu/QR&CII.htm) applied statistics web site.