Homework 04: logistic regression
Due Thursday November 20 at 5:00pm
Conceptual exercises
The conceptual exercises are focused on explaining concepts and showing results mathematically. Show your work for each question.
You may write the answers and associated work for conceptual exercises by hand or type them in a Quarto document.
Exercise 1
Logistic regression interpretation
Berry (2001) examined the effect of a player’s draft position, among the pool of potential players in a given year, on the probability of eventually being named an all star.
Let \(d\) be the draft position \((d = 1, 2, 3, \ldots)\) and \(\pi\) be the probability of eventually being named an all star. The researcher modeled the relationship between \(d\) and \(\pi\) using the following model:
\[ \log\Big(\frac{\pi_i}{1-\pi_i}\Big) = \beta_0 + \beta_1 \log d_i \]
Using this model, show that the odds of being named an all star are \(e^{\beta_0}d^{\beta_1}\). Then, show how to calculate \(\pi_i\) based on this model.
What are the odds of being named an all star for the first draft pick?
In the study, Berry reported that for professional basketball \(\hat{\beta}_0 = 2.3\) and \(\hat{\beta}_1 = -1.1\), and for professional baseball \(\hat{\beta}_0 = 0.7\) and \(\hat{\beta}_1 = -0.6\). Explain why this suggests that (1) being a first draft pick is more crucial for being an all star in basketball than in baseball and (2) players picked in high draft positions are relatively less likely to be all stars.
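A quick numerical sketch of how the fitted odds \(e^{\beta_0}d^{\beta_1}\) behave across draft positions, using the reported estimates above. This is an illustration of the algebra, not a substitute for the derivation; `odds_allstar` is a helper made up for this sketch.

```r
# Illustration only: evaluate the fitted odds exp(b0) * d^b1
# at several draft positions (odds_allstar is a hypothetical helper).
odds_allstar <- function(d, b0, b1) exp(b0) * d^b1

d <- c(1, 2, 5, 10)
basketball <- odds_allstar(d, 2.3, -1.1)   # Berry's basketball estimates
baseball   <- odds_allstar(d, 0.7, -0.6)   # Berry's baseball estimates

# Convert odds to probabilities: pi = odds / (1 + odds)
round(basketball / (1 + basketball), 3)
round(baseball / (1 + baseball), 3)
```

Note how both sets of odds decrease in \(d\) (both \(\hat{\beta}_1 < 0\)), but from very different starting points at \(d = 1\).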
Exercise 2
Logistic regression with logit link log-likelihood
Show that the log-likelihood in a standard logistic regression with a logistic link function can be written
\[ \ell(\beta) = \sum_{i=1}^n y_i x_i^T \beta - \sum_{i=1}^n \log (1 + \exp\{x_i^T\beta\}) \] Include a (brief) explanation of each step in your derivation.
Exercise 3
Score function
Show that the score function, i.e., the gradient \(\frac{\partial}{\partial{\beta}} \ell(\beta)\), can be written in matrix form
\[ \boldsymbol{X}^T \boldsymbol{y}- \boldsymbol{X}^T \boldsymbol{p}, \] where \(\boldsymbol{p}= [p_1, \ldots, p_n]^T\). Note: you can use the gradient of the log-likelihood from class as a starting point and just explain why the representation above makes sense.
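As a sanity check on this matrix form (not part of the assignment), here is a sketch with simulated data comparing the analytic score \(\boldsymbol{X}^T(\boldsymbol{y} - \boldsymbol{p})\) to a finite-difference gradient of the log-likelihood:

```r
# Simulated-data check: analytic score X'(y - p) vs. a central-difference
# gradient of the log-likelihood, evaluated at an arbitrary beta.
set.seed(1)
n <- 200
X <- cbind(1, rnorm(n))
beta_true <- c(-0.5, 1.2)
y <- rbinom(n, 1, 1 / (1 + exp(-X %*% beta_true)))

loglik <- function(beta) sum(y * (X %*% beta)) - sum(log(1 + exp(X %*% beta)))

beta <- c(0.1, 0.3)                       # arbitrary evaluation point
p <- as.vector(1 / (1 + exp(-X %*% beta)))
score_analytic <- as.vector(t(X) %*% (y - p))

eps <- 1e-6
score_numeric <- sapply(1:2, function(j) {
  e <- replace(numeric(2), j, eps)
  (loglik(beta + e) - loglik(beta - e)) / (2 * eps)
})

max(abs(score_analytic - score_numeric))  # should be ~0
```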
Exercise 4
Hessian
Show that the Hessian matrix, i.e., the matrix of second derivatives, \(\frac{\partial^2}{\partial{\beta} \partial{\beta}^T} \ell(\beta)\), can be written as
\[ -\boldsymbol{X}^T D \boldsymbol{X} \] where \(D\) is a diagonal matrix with entries \(d_{ii} = p_i(1-p_i)\).
Exercise 5
Maximum likelihood estimator
Show that the maximum likelihood estimator of the logistic regression model with logit link can be written as
\[ \hat{\beta}_{MLE} = \left( \boldsymbol{X}^T D \boldsymbol{X} \right)^{-1} \boldsymbol{X}^T D \boldsymbol{y}^* \] where
\[ \boldsymbol{y}^* = \boldsymbol{X}\hat{\beta}_{MLE} + D^{-1} (\boldsymbol{y}- \boldsymbol{p}). \] Hint: multiply out the equation.
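This fixed-point identity can be checked numerically. The sketch below (simulated data; variable names are illustrative) verifies that the estimate returned by `glm` satisfies the weighted least squares equation above:

```r
# Simulated-data check: at the MLE, beta_hat = (X'DX)^{-1} X'D y*
# with y* = X beta_hat + D^{-1} (y - p).
set.seed(2)
n <- 500
x <- rnorm(n)
y <- rbinom(n, 1, 1 / (1 + exp(-(0.5 + x))))

fit <- glm(y ~ x, family = binomial)
beta_hat <- coef(fit)

X <- model.matrix(fit)
p_hat <- fitted(fit)
D <- diag(p_hat * (1 - p_hat))
y_star <- X %*% beta_hat + solve(D) %*% (y - p_hat)

beta_check <- solve(t(X) %*% D %*% X) %*% t(X) %*% D %*% y_star
max(abs(beta_check - beta_hat))  # ~0 up to glm's convergence tolerance
```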
Exercise 6
How does glm find the MLE? Answer: iteratively re-weighted least squares (IRLS)
Newton-Raphson is a “root finding” algorithm. This means it finds where a function equals zero. We use it to find \(\hat{\beta}_{MLE}\). This works because the maximum likelihood estimator will be precisely where the score function equals zero.
The Newton-Raphson algorithm works to find the MLE as follows:
1. Choose a starting point, beta_old
2. Let beta_new = beta_old - H^-1(beta_old) * Score(beta_old)
3. Set beta_old = beta_new and repeat step 2 until Score(beta_new) is close to 0.
Above, H^-1(beta_old) denotes the inverse of the Hessian of the log-likelihood evaluated at beta_old and Score(beta_old) denotes the score function evaluated at beta_old.
In logistic regression with a logit link, the update can be written:
\[ \beta_{n+1} = \beta_n + (\boldsymbol{X}^TD\boldsymbol{X})^{-1} \left(\boldsymbol{X}^T \boldsymbol{y}- \boldsymbol{X}^T \boldsymbol{p}\right) \]
Note: \(D\) is a function of \(\boldsymbol{p}\) which is a function of \(\beta\).
(a). Multiply both sides by the negative of the Hessian matrix.
(b). Write the update in terms of \(\boldsymbol{y}^*\), where \(\boldsymbol{y}^* = \boldsymbol{X}\beta_{n} + D^{-1} (\boldsymbol{y}- \boldsymbol{p})\).
(c). Solve for \(\beta_{n+1}\).
(d). Explain (in words) how this is an iterative generalized (weighted) least squares problem.
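The Newton-Raphson steps above can be sketched in a few lines of R. This is an illustration with simulated data and no step-size safeguards, not a production implementation:

```r
# Sketch of the Newton-Raphson / IRLS updates, compared against glm().
set.seed(3)
n <- 400
X <- cbind(1, rnorm(n), rnorm(n))
y <- rbinom(n, 1, 1 / (1 + exp(-(X %*% c(-1, 0.8, -0.5)))))

beta <- rep(0, ncol(X))                    # step 1: starting point
for (iter in 1:25) {
  p <- as.vector(1 / (1 + exp(-X %*% beta)))
  score <- t(X) %*% (y - p)                # X'y - X'p
  H <- -t(X) %*% (X * (p * (1 - p)))       # -X'DX without forming D
  beta <- as.vector(beta - solve(H, score))  # step 2: Newton update
  if (max(abs(score)) < 1e-8) break        # step 3: stop when score ~ 0
}

# Both should sit at the same MLE
max(abs(beta - coef(glm(y ~ X[, -1], family = binomial))))
```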
Applied problems
The applied exercises are focused on applying the concepts to analyze data.
All work for the applied exercises must be typed in your Quarto document following a reproducible workflow.
Write all narrative using complete sentences and include informative axis labels / titles on visualizations.
The applied problems will use the data set below. The data were sourced from Kaggle, and the original data were collected by Robert Detrano, M.D., Ph.D. of the Cleveland Clinic Foundation. See here for protocol specifics.
Copy the code chunk below to load the data.
library(tidyverse)
heart = read_csv("https://sta221-fa25.github.io/data/cleveland_heart.csv")

The response variable is chd: presence of coronary heart disease (1 for presence, 0 for absence).
The other variables in the data set:
- age: age in years
- sex: 1 = male, 0 = female
- chol: serum cholesterol in mg/dl
- trestbps: resting blood pressure
- fbs: fasting blood sugar > 120 mg/dl (1 = true; 0 = false)
- thalach: maximum heart rate achieved
Exercise 7
Exploratory data analysis
Create a visualization showing how the prevalence of heart disease varies by sex.
Create a visualization showing how prevalence varies by the age categories “under 45”, “under 61 (but over 45)”, and “over 61”. Hint: you can use case_when in R to make each category within a mutate(). See ?case_when for an example.
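A toy sketch of the hinted case_when() recoding; with the real data, the same mutate() would act on the `age` column of `heart`:

```r
library(dplyr)

# Toy data standing in for the heart data's age column
ages <- tibble(age = c(30, 45, 52, 61, 70))
ages <- ages |>
  mutate(age_cat = case_when(
    age < 45 ~ "under 45",
    age < 61 ~ "under 61 (but over 45)",
    TRUE     ~ "over 61"          # everything else falls through here
  ))
ages
```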
Exercise 8
Fit the model in R
(a) Fit a model to predict coronary heart disease using the predictors age (numeric) and sex. Allow for an interaction effect between the variables.
(b) Interpret each variable’s effect on the odds of heart disease.
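A sketch of the model syntax for part (a). Here `heart` is a simulated stand-in with the same column names as the real data; with the real data you would skip the simulation and fit directly:

```r
# Simulated stand-in with the same column names as the real data
set.seed(4)
heart <- data.frame(
  age = sample(29:77, 300, replace = TRUE),
  sex = rbinom(300, 1, 0.6)
)
heart$chd <- rbinom(300, 1, plogis(-4 + 0.05 * heart$age + 0.5 * heart$sex))

# age * sex expands to age + sex + age:sex (main effects + interaction)
fit_interact <- glm(chd ~ age * sex, data = heart, family = binomial)

# Exponentiated coefficients give multiplicative effects on the odds
exp(coef(fit_interact))
```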
Exercise 9
Receiver operating characteristic
Now fit a model using all the predictors in the data set (no interaction effects) and compare it to the model from Exercise 8 using the area under the ROC curve (AUC). Plot the ROC curves together on the same graph. You can use the code below as a template.
library(pROC)

roc_curve1 <- roc(heart$chd, prediction_prob_model_1)
roc_curve2 <- roc(heart$chd, prediction_prob_model_2)

plot(roc_curve1, main = "ROC Curves Comparison", col = "blue")
lines(roc_curve2, col = "red")

Also compare the two models by computing AIC scores for each.
Which model do you prefer? Why?
Is AUC alone a good metric to use for model selection? Why or why not?
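A sketch of the AIC comparison with simulated stand-in data; with the real data you would pass your two fitted glm objects to AIC():

```r
# Simulated stand-in data and two nested logistic models
set.seed(6)
d <- data.frame(x1 = rnorm(200), x2 = rnorm(200))
d$y <- rbinom(200, 1, plogis(0.3 + d$x1))

fit_small <- glm(y ~ x1,      data = d, family = binomial)
fit_large <- glm(y ~ x1 + x2, data = d, family = binomial)

AIC(fit_small, fit_large)   # lower AIC = better penalized fit
```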
Exercise 10
Sensitivity and specificity
Using the “better” model from exercise 9, predict the probability \(\hat{p}_i\) that each individual in the data set has heart disease. Next you will choose a threshold \(p^*\), such that if \(\hat{p}_i \geq p^*\), you predict the individual has heart disease.
If you want \(p^*\) such that you have an empirical sensitivity of at least 0.95, what is the maximum number of false negatives you can afford for a data set of your size?
Report the \(p^*\) that maximizes your specificity while keeping sensitivity at least 0.95.
Use your \(p^*\) to create a confusion matrix.
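A sketch of the threshold search, using simulated stand-in probabilities; with the real data, `p_hat` would come from predict(your_model, type = "response"):

```r
# Simulated stand-in for fitted probabilities and outcomes
set.seed(5)
n <- 300
p_hat <- runif(n)
y <- rbinom(n, 1, p_hat)

# Sensitivity = TP / (TP + FN), so sensitivity >= 0.95 caps false
# negatives at floor(0.05 * number of positives).
n_pos <- sum(y == 1)
max_fn <- floor(0.05 * n_pos)

# Larger thresholds raise specificity but lower sensitivity, so take
# the largest threshold that still achieves sensitivity >= 0.95.
ths <- sort(unique(p_hat))
sens <- sapply(ths, function(t) mean(p_hat[y == 1] >= t))
p_star <- max(ths[sens >= 0.95])

# Confusion matrix at the chosen threshold
table(predicted = as.integer(p_hat >= p_star), actual = y)
```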
Submission
Remember – you must turn in a PDF file to the Gradescope page before the submission deadline for full credit.
If you write your responses to conceptual exercises by hand, you will need to combine your written work with the completed PDF for the applied exercises before submitting on Gradescope.
Instructions to combine PDFs:
Preview (Mac): support.apple.com/guide/preview/combine-pdfs-prvw43696/mac
Adobe (Mac or PC): helpx.adobe.com/acrobat/using/merging-files-single-pdf.html
- Get free access to Adobe Acrobat as a Duke student: oit.duke.edu/help/articles/kb0030141/
To submit your assignment:
Access Gradescope through the menu on the Canvas website.
Click on the assignment, and you’ll be prompted to submit it.
Mark the pages associated with each exercise. All of the pages of your lab should be associated with at least one question (i.e., should be “checked”).
Footnotes
Exercise adapted from an exercise in Categorical Data Analysis by Agresti.↩︎