Example: logistic ridge

Author

Dr. Alexander Fisher

View libraries and data used in these notes

library(tidyverse)
library(DT)

colon = read_csv("https://sta221-fa25.github.io/data/colon.csv")
annotations = read_csv("https://sta221-fa25.github.io/data/mapped_annotations.csv") %>%
  distinct(ProbeID, .keep_all = TRUE)

Data

The data below are microarray data from the colons of 62 individuals. The data were curated by Alon et al. (1999) and were analyzed in a regression setting by Liang et al. (2013). Microarrays detect the gene expression levels of thousands of biological interactions. Expression levels are detected at targeted DNA probes that adhere directly to the array.

dim(colon)

[1]   62 1912

colon %>%
  select(1:5) %>%
  glimpse()

Rows: 62
Columns: 5
$ y         <dbl> -1, 1, -2, 2, -3, 3, -4, 4, -5, 5, -6, 6, -7, 7, -8, 8, -9, …
$ Hsa.3004  <dbl> 8589.416, 9164.254, 3825.705, 6246.449, 3230.329, 2510.325, …
$ Hsa.13491 <dbl> 5468.241, 6719.529, 6970.361, 7823.534, 3694.450, 1960.655, …
$ Hsa.37254 <dbl> 4064.936, 3718.159, 4705.650, 3975.564, 3463.586, 3072.816, …
$ Hsa.541   <dbl> 1997.893, 2015.221, 1166.554, 2002.613, 2181.420, 1810.205, …

y: identity of the 62 tissues sampled. The numbers correspond to patients, a positive sign to a normal tissue, and a negative sign to a tumor tissue.

The rest of the columns contains the expression of the 1911 genes with highest minimal gene expression “intensity” across the 62 tissues. Columns are named based on genetic probe IDs. Partial matching information about some of the probe IDs can be found in the file mapped_annotations.csv.

annotations %>%
  glimpse()

Rows: 1,911
Columns: 8
$ ProbeID         <chr> "Hsa.3004", "Hsa.13491", "Hsa.37254", "Hsa.541", "Hsa.…
$ Accession       <chr> "H55933", "R39465", "R85482", "U14973", "R02593", "T51…
$ Region          <chr> "3' UTR", "3' UTR", "3' UTR", "gene", "3' UTR", "3' UT…
$ Other1          <chr> "1", "2a", "2a", "1", "2a", "1", "2a", "1", "1", "1", …
$ UniGeneID       <dbl> 203417, 23933, 180093, NA, 124094, 71488, 240814, 8163…
$ GeneDescription <chr> "H.sapiens mRNA for homologue to yeast ribosomal prote…
$ SYMBOL          <chr> NA, NA, NA, "RPS29", NA, NA, NA, NA, NA, NA, NA, "ACTB…
$ GENENAME        <chr> NA, NA, NA, "ribosomal protein S29", NA, NA, NA, NA, N…

Transform outcome

colon =
  colon %>%
  mutate(y = ifelse(y < 0, "tumor", "normal"))

colon %>%
  count(y)

# A tibble: 2 × 2
  y          n
  <chr>  <int>
1 normal    22
2 tumor     40

What model is appropriate here?

Exercise

Any concerns with simply fitting via glm here? Why?

EDA

colon[,-1] %>%
  colMeans() %>%
  range()

[1]   30.94248 7015.78671

Notice how different the scale of the predictors is. We need to standardize our predictors for ridge regression.

X = colon[,-1] %>%
  as.matrix() %>% # matrix important for glmnet later
  scale() # scaling is important for ridge regression!
A = cor(X) - diag(1, nrow = ncol(X))
max(A)

[1] 0.9865134

colon %>%
  ggplot(aes(x = y, y = Hsa.36689)) +
  geom_boxplot() +
  theme_bw() +
  labs(x = "Tissue", y = "Expression level", title = "Expression of Hsa.36689 by tissue")

Read the abstract of Cessie and Houwelingen (1992). Read further and identify the objective function that we optimize to find $\hat{\beta}_{logistic-ridge}$.

\[ \hat{\beta}_{logistic-ridge} = \underset{\beta}{\mathrm{argmin}} ~ - l(\beta) + \lambda \beta^T \beta \] where $l(\beta)$ is the log-likelihood under the logistic regression model.

Equivalently,

\[ \hat{\beta}_{logistic-ridge} = \underset{\beta}{\mathrm{argmax}} ~ l(\beta) - \lambda \beta^T \beta \]

Note, we can also equivalently write $\beta^T \beta = ||\beta||^2$.

Re-format outcome

We want $y$ to be formatted as 0 or 1. Which should be 0 and which should be 1 if we want to know the effect of a gene on presence of a tumor?

y = colon %>%
  select(y) %>%
  mutate(y = ifelse(y == "tumor", 1, 0)) %>%
  pull()

y

 [1] 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[39] 0 1 1 0 0 1 1 1 1 0 1 0 0 1 1 0 0 1 1 1 1 0 1 0

Fit with `glmnet`

Instead of hard-coding lambda, the efunction cv.glmnet fits the model for a range of lambdas using nfold cross-validation. See ?glmnet::cv.glmnet

What are important considerations in choosing the number of folds?

set.seed(221)
cvfit <- glmnet::cv.glmnet(X, y, alpha = 0,
                           intercept = FALSE,
                           family = "binomial",
                           nfold = 10)

cvfit


Call:  glmnet::cv.glmnet(x = X, y = y, nfolds = 10, alpha = 0, intercept = FALSE,      family = "binomial") 

Measure: Binomial Deviance 

    Lambda Index Measure      SE Nonzero
min  12.20    70   1.138 0.12406    1911
1se  90.16    27   1.260 0.04199    1911

Binomial deviance is $-2 \log\text{likelihood}$.

“Min lambda” is the lambda value that gives minimum cross-validated error.
“1se lambda” is the largest lambda that is within 1 standard error of the minimum.

cvfit contains

cvfit$lambda - all $\lambda$ values fitted
cvfit$cvm - mean CV error at each $\lambda$
cvfit$cvsd - standard error of CV error
cvfit$lambda.min - minimum lambda
cvfit$lambda.1se - 1se lambda

Plot
Quick plot

plot(cvfit)

Warning

If we did not standardize the predictors, we might expect some monotonic, pathological CV curve plot, such as

set.seed(221)
X2 = colon[,-1] %>%
  as.matrix()
cvfit2 <- glmnet::cv.glmnet(X2, y, alpha = 0,
                           intercept = FALSE,
                           family = "binomial",
                           nfold = 10)
plot(cvfit2)

Show coefficients
Code

Display top 10 coefficients, by magnitude.

beta_min = coef(cvfit, s = "lambda.min")
beta_min_mat = as.matrix(beta_min)
beta_min_mat %>%
  as.data.frame() %>%
  arrange((dplyr::desc(abs(lambda.min)))) %>%
  rename("betaHat" = "lambda.min") %>%
  rownames_to_column(var = "ProbeID") %>%
  mutate(num = 1:nrow(beta_min_mat)) %>%
  filter(num <= 10) %>%
  left_join(annotations, by = c("ProbeID" = "ProbeID")) %>%
  select(-c("Other1", "UniGeneID", "Region", "Accession")) %>%
datatable(rownames = FALSE, options = list(pageLength = 5),
           caption = "")

References

Alon, Uri, Naama Barkai, Daniel A Notterman, Kurt Gish, Suzanne Ybarra, Daniel Mack, and Arnold J Levine. 1999. “Broad Patterns of Gene Expression Revealed by Clustering Analysis of Tumor and Normal Colon Tissues Probed by Oligonucleotide Arrays.” Proceedings of the National Academy of Sciences 96 (12): 6745–50. https://www.pnas.org/doi/abs/10.1073/pnas.96.12.6745.

Cessie, S Le, and JC Van Houwelingen. 1992. “Ridge Estimators in Logistic Regression.” Journal of the Royal Statistical Society Series C: Applied Statistics 41 (1): 191–201. https://www.jstor.org/stable/2347628.

Liang, Yong, Cheng Liu, Xin-Ze Luan, Kwong-Sak Leung, Tak-Ming Chan, Zong-Ben Xu, and Hai Zhang. 2013. “Sparse Logistic Regression with a L1/2 Penalty for Gene Selection in Cancer Classification.” BMC Bioinformatics 14 (1): 198. https://bmcbioinformatics.biomedcentral.com/articles/10.1186/1471-2105-14-198.