Logistic regression prediction

Author

Dr. Alexander Fisher

View libraries and data used in these notes

library(tidyverse)
library(tidymodels)
library(pROC)       # make ROC curves
library(knitr)
library(kableExtra)

## data 
heart_disease <- read_csv("https://sta221-fa25.github.io/data/framingham.csv") |>
  select(age, totChol, TenYearCHD, currentSmoker) |>
  drop_na() |>
  mutate(high_risk = as_factor(TenYearCHD), 
         currentSmoker = as_factor(currentSmoker))

# set default theme in ggplot2
ggplot2::theme_set(ggplot2::theme_bw())

Topics

Calculating predicted probabilities from the logistic regression model
Using predicted probabilities to classify observations
Making decisions and assessing model performance using
- Confusion matrix
- ROC curve

Data: Risk of coronary heart disease

This data set is from an ongoing cardiovascular study on residents of the town of Framingham, Massachusetts. We want to examine the relationship between various health characteristics and the risk of having heart disease.

high_risk: 1 = High risk of having heart disease in next 10 years, 0 = Not high risk of having heart disease in next 10 years
age: Age at exam time (in years)
totChol: Total cholesterol (in mg/dL)
currentSmoker: 0 = nonsmoker; 1 = smoker

Modeling risk of coronary heart disease

heart_disease_fit <- glm(high_risk ~ age + totChol + currentSmoker, 
              data = heart_disease, family = "binomial")

tidy(heart_disease_fit, conf.int = TRUE) |> 
  kable(digits = 3)

term	estimate	std.error	statistic	p.value	conf.low	conf.high
(Intercept)	-6.638	0.372	-17.860	0.000	-7.374	-5.917
age	0.082	0.006	14.430	0.000	0.071	0.093
totChol	0.002	0.001	2.001	0.045	0.000	0.004
currentSmoker1	0.457	0.092	4.951	0.000	0.277	0.639

Exercise

Interpret currentSmoker1 in terms of the odds of being high risk for heart disease.
Interpret totChol in terms of the odds of being high risk for heart disease.

Solution

The odds of being high risk for heart disease for smokers are expected to be \(\exp\{\hat{\beta}_{smoker}\} = \exp\{0.457\} =\) 1.579 times the odds for non-smokers, holding all else constant.
For each 1 mg/dL increase in total cholesterol, the odds of being high risk for heart disease are expected to multiply by a factor of \(\exp\{\hat{\beta}_{totChol}\} = \exp\{0.002\} =\) 1.002, holding all else constant.
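
As a quick check of the arithmetic in these interpretations, the sketch below exponentiates the rounded estimates copied from the coefficient table. The vector name `beta_hat` is only for illustration. With the fitted model object available, `tidy(heart_disease_fit, exponentiate = TRUE)` should report the same odds ratios computed from the unrounded estimates.

```r
# Rounded estimates copied from the coefficient table (illustrative vector name)
beta_hat <- c(age = 0.082, totChol = 0.002, currentSmoker1 = 0.457)

# Exponentiating a log-odds coefficient gives a multiplicative change in the odds
odds_ratios <- exp(beta_hat)
print(round(odds_ratios, 3))

# A 10 mg/dL increase in total cholesterol multiplies the odds by exp(10 * beta)
or_totchol_10 <- exp(10 * beta_hat[["totChol"]])
print(round(or_totchol_10, 3))
```

Because the effect is multiplicative, a small per-unit odds ratio such as 1.002 can still matter over a clinically meaningful change in cholesterol.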

Prediction and classification

We are often interested in using the model to classify observations, i.e., to predict whether a given observation will have a 1 or 0 response.
For each observation
- Use the logistic regression model to calculate the predicted log-odds that the response for the \(i^{th}\) observation is 1
- Use the log-odds to calculate the predicted probability that the response for the \(i^{th}\) observation is 1
- Then, use the predicted probability to classify the observation as having a 1 or 0 response using some predefined threshold

Augmented data frame

augment(heart_disease_fit)

# A tibble: 4,190 × 10
   high_risk   age totChol currentSmoker .fitted .resid     .hat .sigma  .cooksd
   <fct>     <dbl>   <dbl> <fct>           <dbl>  <dbl>    <dbl>  <dbl>    <dbl>
 1 0            39     195 0              -3.06  -0.302 0.000594  0.890  6.94e-6
 2 0            46     250 0              -2.38  -0.420 0.000543  0.890  1.25e-5
 3 0            48     245 1              -1.77  -0.560 0.000527  0.890  2.24e-5
 4 1            61     225 1              -0.751  1.51  0.00164   0.889  8.70e-4
 5 0            46     285 1              -1.86  -0.539 0.000830  0.890  3.25e-5
 6 0            43     228 0              -2.67  -0.366 0.000546  0.890  9.43e-6
 7 1            63     205 0              -1.08   1.66  0.00154   0.889  1.15e-3
 8 0            45     313 1              -1.88  -0.532 0.00127   0.890  4.86e-5
 9 0            52     260 0              -1.87  -0.535 0.000542  0.890  2.08e-5
10 0            43     225 1              -2.22  -0.454 0.000532  0.890  1.44e-5
# ℹ 4,180 more rows
# ℹ 1 more variable: .std.resid <dbl>

Predicted log-odds

heart_disease_aug <- augment(heart_disease_fit)

heart_disease_aug |> 
  select(.fitted) |> 
  slice(1:5)

# A tibble: 5 × 1
  .fitted
    <dbl>
1  -3.06 
2  -2.38 
3  -1.77 
4  -0.751
5  -1.86

\[ \text{predicted log-odds} = \log\Big(\frac{\hat{\pi}}{1- \hat{\pi}}\Big) = -3.06 \]
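
To see where this number comes from, the sketch below rebuilds the linear predictor for the first individual (age 39, total cholesterol 195, nonsmoker) from the rounded estimates in the coefficient table. The names `beta_hat` and `x_first` are illustrative. The result is about -3.05 rather than -3.06, presumably because the table rounds the estimates to three decimals while `.fitted` uses the unrounded ones.

```r
# Rounded estimates from the coefficient table:
# (Intercept), age, totChol, currentSmoker1
beta_hat <- c(-6.638, 0.082, 0.002, 0.457)

# First individual: age 39, totChol 195, nonsmoker (the leading 1 is for the intercept)
x_first <- c(1, 39, 195, 0)

# The linear predictor is the predicted log-odds
log_odds_first <- sum(beta_hat * x_first)
print(log_odds_first)
```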

Predicted odds

heart_disease_aug |> 
  select(.fitted) |> 
  slice(1:5) |>
  exp()

# A tibble: 5 × 1
  .fitted
    <dbl>
1  0.0467
2  0.0922
3  0.170 
4  0.472 
5  0.156

\[ \text{predicted odds} = \frac{\hat{\pi}}{1- \hat{\pi}} = \exp\{-3.06\} = 0.0469 \]

Predicted probability

heart_disease_aug |> 
  select(.fitted) |> 
  slice(1:5) |>
  mutate(odds = exp(.fitted)) |>
  mutate(prob = odds / (1 + odds)) |>
  select(prob)

# A tibble: 5 × 1
    prob
   <dbl>
1 0.0446
2 0.0845
3 0.145 
4 0.321 
5 0.135

\[ \text{predicted prob.} = \hat{\pi} = \frac{\hat{\text{odds}}}{1+\hat{\text{odds}}} = \frac{\exp\{-3.06\}}{1 + \exp\{-3.06\}}= 0.045 \]

Would you classify the first individual as high risk \((\hat{y} = 1)\) or not high risk \((\hat{y} = 0)\)?

Another individual

heart_disease_aug |> 
  select(.fitted) |> 
  slice(1:5) |>
  mutate(odds = exp(.fitted)) |>
  mutate(prob = odds / (1 + odds))

# A tibble: 5 × 3
  .fitted   odds   prob
    <dbl>  <dbl>  <dbl>
1  -3.06  0.0467 0.0446
2  -2.38  0.0922 0.0845
3  -1.77  0.170  0.145 
4  -0.751 0.472  0.321 
5  -1.86  0.156  0.135

\[ \text{predicted prob.} = \hat{\pi} = \frac{\hat{\text{odds}}}{1+\hat{\text{odds}}} = \frac{\exp\{-0.751\}}{1 + \exp\{-0.751\}}= 0.321 \]

Would you classify this individual (observation 4) as high risk \((\hat{y} = 1)\) or not high risk \((\hat{y} = 0)\)?
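
The odds-to-probability conversion used for both individuals is the inverse logit, which base R provides as `plogis()`. The sketch below, using the five rounded `.fitted` values shown above, checks that the two-step calculation and `plogis()` agree. Small differences from the printed tibble are expected because the inputs here are rounded.

```r
# Rounded predicted log-odds for observations 1-5, copied from the output above
log_odds <- c(-3.06, -2.38, -1.77, -0.751, -1.86)

# Two-step conversion: log-odds -> odds -> probability
odds <- exp(log_odds)
prob_manual <- odds / (1 + odds)

# One-step conversion with the inverse logit
prob_plogis <- plogis(log_odds)

print(round(prob_plogis, 3))
print(all.equal(prob_manual, prob_plogis))
```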

Predicted probabilities in R

We can calculate predicted probabilities using the predict.glm() function. Use type = "response" to get probabilities.¹

predict.glm(heart_disease_fit, type = "response")

Predicted probabilities for observations 1-5

         1          2          3          4          5 
0.04459439 0.08445209 0.14523257 0.32065849 0.13515474
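
The relationship between the two prediction scales can be checked on any fitted logistic model. The sketch below uses a small simulated data set (the data, the object names, and the true coefficients are all made up for illustration) and calls the generic `predict()`, which dispatches to `predict.glm()` for a `glm` fit. It also shows the `newdata` argument for predicting at values not in the original data.

```r
set.seed(221)

# Simulated (hypothetical) data: one predictor and a binary response
n <- 200
x <- rnorm(n)
y <- rbinom(n, size = 1, prob = plogis(-1 + 0.8 * x))
sim_data <- data.frame(x = x, y = y)

sim_fit <- glm(y ~ x, data = sim_data, family = "binomial")

# type = "link" (the default) returns log-odds; type = "response" returns probabilities
log_odds <- predict(sim_fit, type = "link")
prob <- predict(sim_fit, type = "response")

# The two scales are related by the inverse logit
print(all.equal(plogis(log_odds), prob))

# Predicted probability for a new observation with x = 1.5
new_obs <- data.frame(x = 1.5)
print(predict(sim_fit, newdata = new_obs, type = "response"))
```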

Predictions in R

pred_prob <- predict.glm(heart_disease_fit, type = "response")

heart_disease_aug <- heart_disease_aug |> 
  bind_cols(pred_prob = pred_prob)

# A tibble: 5 × 3
  high_risk .fitted pred_prob
  <fct>       <dbl>     <dbl>
1 0          -3.06     0.0446
2 0          -2.38     0.0845
3 0          -1.77     0.145 
4 1          -0.751    0.321 
5 0          -1.86     0.135

Classifying observations

You would like to determine a threshold for classifying individuals as high risk or not high risk.

What considerations would you make in determining the threshold?

Classify using 0.5 as threshold

We can use a threshold of 0.5 to classify observations.

If \(\hat{\pi} > 0.5\), classify as 1
If \(\hat{\pi} \leq 0.5\), classify as 0

heart_disease_aug <- heart_disease_aug |>
  mutate(pred_class = factor(if_else(pred_prob > 0.5, 1, 0), levels = c(0, 1)))

# A tibble: 5 × 4
  high_risk .fitted pred_prob pred_class
  <fct>       <dbl>     <dbl> <fct>     
1 0          -3.06     0.0446 0         
2 0          -2.38     0.0845 0         
3 0          -1.77     0.145  0         
4 1          -0.751    0.321  0         
5 0          -1.86     0.135  0
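
The predicted class depends on the threshold, not only on the model. As a small illustration, the sketch below classifies the same five observations at 0.5 and at a lower threshold of 0.3. The value 0.3 and the helper `classify()` are hypothetical choices for this example, not a recommendation.

```r
# Predicted probabilities and observed responses for observations 1-5 (from the output above)
pred_prob <- c(0.0446, 0.0845, 0.145, 0.321, 0.135)
high_risk <- c(0, 0, 0, 1, 0)

# Hypothetical helper: classify as 1 when the probability exceeds the threshold
classify <- function(prob, threshold) {
  as.integer(prob > threshold)
}

class_050 <- classify(pred_prob, 0.5)
class_030 <- classify(pred_prob, 0.3)

print(data.frame(high_risk, pred_prob, class_050, class_030))
```

At 0.5 all five observations are classified as 0, while at 0.3 the fourth observation (the only one with `high_risk` equal to 1) is classified as 1.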

Confusion matrix

A confusion matrix is a \(2 \times 2\) table that compares the predicted and actual classes. We can produce this matrix using the conf_mat() function in the yardstick package (part of tidymodels).

heart_disease_aug |>
  conf_mat(high_risk, pred_class)

          Truth
Prediction    0    1
         0 3553  635
         1    2    0

Visualize confusion matrix

heart_conf_mat <- heart_disease_aug |>
  conf_mat(high_risk, pred_class)

autoplot(heart_conf_mat, type = "heatmap")

Using the confusion matrix

          Truth
Prediction    0    1
         0 3553  635
         1    2    0

The accuracy of this model with a classification threshold of 0.5 is

\[ \text{accuracy} = \frac{3553 + 0}{3553 + 635 + 2 + 0} = 0.848 \]

The misclassification rate of this model with a threshold of 0.5 is

\[ \text{misclassification} = \frac{635 + 2}{3553 + 635 + 2 + 0} = 0.152 \]

Accuracy is 0.848 and the misclassification rate is 0.152.
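
Both numbers can be reproduced directly from the four cells of the confusion matrix. The sketch below enters the counts by hand into a base R matrix (the object name `conf` is illustrative). In a tidymodels workflow, yardstick's `accuracy()` takes the same truth and estimate columns as `conf_mat()`.

```r
# Counts copied from the confusion matrix above (filled column by column)
conf <- matrix(c(3553, 2, 635, 0), nrow = 2,
               dimnames = list(Prediction = c("0", "1"), Truth = c("0", "1")))
print(conf)

# Correct classifications sit on the diagonal
accuracy <- sum(diag(conf)) / sum(conf)
misclassification <- 1 - accuracy

print(round(c(accuracy = accuracy, misclassification = misclassification), 3))
```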

What is the limitation of solely relying on accuracy and misclassification to assess the model performance?
What is the limitation of using a single confusion matrix to assess the model performance?

Sensitivity and specificity

True/false positive/negative

	Not high risk \((y_i = 0)\)	High risk \((y_i = 1)\)
Classified not high risk \((\hat{\pi}_i \leq \text{threshold})\)	True negative (TN)	False negative (FN)
Classified high risk \((\hat{\pi}_i > \text{threshold})\)	False positive (FP)	True positive (TP)

\(\text{accuracy} = \frac{TN + TP}{TN + TP + FN + FP}\)
\(\text{misclassification} = \frac{FN + FP}{TN+ TP + FN + FP}\)
False negative rate: Proportion of actual positives that were classified as negatives
- P(classified not high risk | high risk) = \(\frac{FN}{TP + FN}\)
False positive rate: Proportion of actual negatives that were classified as positives
- P(classified high risk | not high risk) = \(\frac{FP}{TN + FP}\)
Sensitivity: Proportion of actual positives that were correctly classified as positive
- Also known as true positive rate (TPR) and recall
- P(classified high risk | high risk) = 1 − False negative rate
Specificity: Proportion of actual negatives that were correctly classified as negative
- P(classified not high risk | not high risk) = 1 − False positive rate
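
The definitions above translate directly into arithmetic on the four cell counts. The sketch below uses made-up counts, chosen only so the results are easy to verify by hand. They are not from the heart disease data.

```r
# Hypothetical counts, for illustration only
tn <- 80  # true negatives
fp <- 20  # false positives
fn <- 10  # false negatives
tp <- 40  # true positives

false_negative_rate <- fn / (tp + fn)
false_positive_rate <- fp / (tn + fp)
sensitivity <- 1 - false_negative_rate  # equivalently tp / (tp + fn)
specificity <- 1 - false_positive_rate  # equivalently tn / (tn + fp)

print(c(fnr = false_negative_rate, fpr = false_positive_rate,
        sensitivity = sensitivity, specificity = specificity))
```

Note that the false negative rate and sensitivity share a denominator (the actual positives), as do the false positive rate and specificity (the actual negatives).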

Practice

          Truth
Prediction    0    1
         0 3553  635
         1    2    0

Calculate the

False negative rate
False positive rate
Sensitivity
Specificity

Using metrics to select model and threshold

Metric	Guidance for use
Accuracy	For balanced data, use only in combination with other metrics. Avoid using for imbalanced data.
Sensitivity (true positive rate)	Use when false negatives are more “expensive” than false positives.
False positive rate	Use when false positives are more “expensive” than false negatives.
Precision = \(\frac{TP}{TP + FP}\)	Use when it’s important for positive predictions to be accurate.

This table is a modification of work created and shared by Google in the Google Machine Learning Crash Course.

Choosing a classification threshold

A doctor plans to use your model to determine which patients are high risk for heart disease. The doctor will recommend a treatment plan for high risk patients.

Would you want sensitivity to be high or low? What about specificity?
What are the trade-offs associated with each decision?

ROC curve

So far the model assessment has depended on the model and selected threshold. The receiver operating characteristic (ROC) curve allows us to assess the model performance across a range of thresholds.

x-axis: 1 - Specificity (False positive rate)
y-axis: Sensitivity (True positive rate)

Which corner of the plot indicates the best model performance?

ROC curve in R

# calculate sensitivity and specificity at each threshold
roc_curve_data <- heart_disease_aug |>
  roc_curve(high_risk, pred_prob, 
            event_level = "second") 
# event_level = "second": treat the second factor level ("1") as the event of interest

# plot roc curve
autoplot(roc_curve_data)

Sample from underlying data

# A tibble: 10 × 3
   .threshold specificity sensitivity
        <dbl>       <dbl>       <dbl>
 1     0.0545       0.103       0.980
 2     0.0660       0.181       0.959
 3     0.0832       0.305       0.909
 4     0.136        0.583       0.715
 5     0.193        0.754       0.501
 6     0.221        0.805       0.411
 7     0.221        0.805       0.411
 8     0.262        0.881       0.287
 9     0.270        0.901       0.254
10     0.279        0.915       0.227
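
Each row of this table is a confusion matrix summarized at one threshold. The sketch below shows the idea in base R on simulated data: for each threshold on a small grid, classify with that threshold and record specificity and sensitivity. The data, the threshold grid, and the helper `roc_point()` are assumptions made for illustration. `roc_curve()` evaluates many more thresholds and may handle ties at the threshold differently.

```r
set.seed(221)

# Simulated (hypothetical) responses and predicted probabilities;
# probabilities tend to be higher when the response is 1
n <- 500
y <- rbinom(n, size = 1, prob = 0.3)
pred_prob <- plogis(-1 + 1.2 * y + rnorm(n))

# Hypothetical helper: one point on the ROC curve for a given threshold
roc_point <- function(threshold, truth, prob) {
  pred <- as.integer(prob > threshold)
  c(threshold = threshold,
    specificity = mean(pred[truth == 0] == 0),
    sensitivity = mean(pred[truth == 1] == 1))
}

thresholds <- seq(0.1, 0.9, by = 0.2)
roc_points <- t(sapply(thresholds, roc_point, truth = y, prob = pred_prob))
print(round(roc_points, 3))
```

As the threshold rises, fewer observations are classified as 1, so specificity can only go up and sensitivity can only go down. This is the trade-off the ROC curve traces out.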

Area under the curve

The area under the ROC curve (AUC) summarizes how well the model's predicted probabilities separate the two classes across all thresholds.

AUC = 0.5: the model discriminates no better than a coin flip
AUC close to 1: the model discriminates well between the two classes

heart_disease_aug |>
  roc_auc(high_risk, pred_prob,
    event_level = "second"
  )

# A tibble: 1 × 3
  .metric .estimator .estimate
  <chr>   <chr>          <dbl>
1 roc_auc binary         0.695
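
One way to read this number: the AUC equals the probability that a randomly chosen observation with response 1 receives a higher predicted probability than a randomly chosen observation with response 0 (ties counted as one half). The sketch below checks this interpretation by brute force on a handful of made-up predicted probabilities. The values and object names are hypothetical.

```r
# Hypothetical predicted probabilities, for illustration only
prob_pos <- c(0.9, 0.6, 0.4)       # observations with response 1
prob_neg <- c(0.5, 0.3, 0.2, 0.1)  # observations with response 0

# Compare every (response 1, response 0) pair; ties count as one half
wins <- outer(prob_pos, prob_neg, ">")
ties <- outer(prob_pos, prob_neg, "==")
auc_pairs <- mean(wins + 0.5 * ties)

print(auc_pairs)
```

Here 11 of the 12 pairs are ordered correctly, so the pairwise AUC is 11/12. The only misordered pair is the positive at 0.4 against the negative at 0.5.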

Recap

Calculated predicted probabilities from the logistic regression model
Used predicted probabilities to classify observations
Made decisions and assessed model performance using
- Confusion matrix
- ROC curve

Acknowledgements

Thanks to Prof. Maria Tackett for this example.

Your turn

Build an email spam filter

This data set consists of 4601 emails that are classified as spam or non-spam. The data were collected at Hewlett-Packard Labs and contain 58 variables. The first 48 variables correspond to specific keywords; each value is the frequency of that word in the message, expressed as a percentage.

type \(= 1\) is spam
type \(= 0\) is non-spam

spam <- read_csv("https://sta221-fa25.github.io/data/spam.csv")

Evaluate performance using AUC ROC

(New) classification

Use your model to classify the email below.

newEmailText <- "CONGRATULATIONS!!! YOU have been selected as one of our lucky winners to receive a $0.5 Amazon Gift Card.

To claim your reward, simply click the link below and confirm your details.

Claim Your Gift Now"

Footnotes

¹ The default is type = "link", which produces the predicted log-odds.