We will use data from the 2009 British Social Attitudes Survey. You can download the data here. The codebook is available here.
Our goal is to predict the following output variable (the name in brackets corresponds to the variable name in the codebook):
We will use the following input variables:
Let’s have a look at the distribution of the output variable. The red dashed line shows the true percentage of non-Western immigrants (see www.migrationwatchuk.org).
The plot shows that 756 of 1,044 respondents (or 72.4%) overestimate the percentage of non-Western immigrants in the UK.
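As a minimal sketch of how such a share can be computed, the snippet below counts the values that lie above a reference value. The helper `share_above()`, the vector `estimates`, and the value `reference` are all hypothetical and are not taken from the survey data. The last line only re-checks the arithmetic of the figure quoted above.

```r
# Share of values strictly above a threshold, in percent
# (share_above is a hypothetical helper, not part of any package)
share_above <- function(x, threshold) {
  100 * mean(x > threshold, na.rm = TRUE)
}

# Hypothetical estimates (in percent) and a hypothetical reference value
estimates <- c(5, 10, 25, 40, 8, 30, 15, 50)
reference <- 12
toy_share <- share_above(estimates, reference)
print(toy_share)

# Arithmetic check of the figure quoted in the text: 756 of 1,044 respondents
quoted_share <- round(100 * 756 / 1044, 1)
print(quoted_share)
```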
The original data set contains 1,044 respondents.
library(foreign)
library(dplyr)
# Set working directory
# setwd(...)
# Load data set
load("bsas_short.RData")
# Declare factor variables
bsas_data <- bsas_data %>%
  dplyr::mutate(resp_urban_area = factor(resp_urban_area,
                                         levels = 1:4,
                                         labels = c("rural", "rather rural",
                                                    "rather urban", "urban")),
                resp_health = factor(resp_health,
                                     levels = 0:3,
                                     labels = c("bad", "fair", "fairly good", "good")))
We will use the glmnet() function in the glmnet package to perform ridge regression and the lasso. Before doing so, we use the model.matrix() function to create a matrix of input variables. This function automatically transforms any qualitative variables into dummy variables. This is important because glmnet() can only take quantitative inputs.
We remove the intercept from the matrix produced by model.matrix() because glmnet() will automatically include an intercept. We also exclude the input resp_party_cons, which will serve as the baseline in our model.
# Matrix of input variables (remove the intercept and resp_party_cons)
x <- model.matrix(imm_brit ~ . -1 -resp_party_cons, bsas_data)
# Output variable
y <- bsas_data$imm_brit
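To illustrate what model.matrix() does with a qualitative input, here is a small sketch on a toy data frame. The data frame `toy` and its columns `y`, `age`, and `area` are made up for illustration and are not part of the BSAS data.

```r
# Toy data (hypothetical, not the BSAS data): one numeric and one factor input
toy <- data.frame(
  y    = c(10, 25, 30, 15, 40, 20),
  age  = c(23, 45, 31, 60, 52, 38),
  area = factor(c("rural", "urban", "urban", "rural", "urban", "rural"))
)

# "- 1" drops the intercept column; the factor is expanded into 0/1 columns
x_toy <- model.matrix(y ~ . - 1, data = toy)

# Inspect which columns were created and confirm the matrix is numeric
print(colnames(x_toy))
print(is.numeric(x_toy))
```

When the intercept is dropped inside the formula, R typically keeps an indicator column for every level of the first factor instead of omitting a baseline level, so it is worth checking colnames(x) before fitting the model.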
The goal of the first exercise is to fit a ridge regression to our data.
Fit a ridge regression model using the glmnet() function in the glmnet package. The glmnet() function has an alpha argument that determines what type of model is fit. If alpha = 0, then a ridge regression model is fit, and if alpha = 1, then a lasso model is fit.
Use the plot() function to create a graph that shows the shrinkage of the coefficient estimates as a function of \(\lambda\).
Use cross-validation to choose \(\lambda\) with cv.glmnet() in the glmnet package. Use the plot() function to create a graph that shows the CV estimate of the expected test MSE associated with each value of \(\lambda\).
The goal of the second exercise is to fit a lasso model to our data.
Fit a lasso model using the glmnet() function, setting the argument alpha = 1. Use the plot() function to create a graph that shows the shrinkage of the estimated coefficients as a function of \(\lambda\).
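The workflow of both exercises can be sketched on simulated data as follows. This is only an illustration under stated assumptions, not a solution for the BSAS data: the objects `x_sim`, `y_sim`, and `beta_true` are invented here, the seed is arbitrary, and cv.glmnet() is left at its default number of folds.

```r
library(glmnet)

# Simulated data (hypothetical): 10 inputs, only the first 3 affect the output
set.seed(1)
n <- 200
p <- 10
x_sim <- matrix(rnorm(n * p), nrow = n, ncol = p)
colnames(x_sim) <- paste0("x", 1:p)
beta_true <- c(3, -2, 1.5, rep(0, p - 3))
y_sim <- drop(x_sim %*% beta_true) + rnorm(n)

# Ridge regression: alpha = 0
ridge_fit <- glmnet(x_sim, y_sim, alpha = 0)
# Coefficient paths as a function of log(lambda)
plot(ridge_fit, xvar = "lambda", label = TRUE)

# Cross-validation over the lambda sequence for ridge
ridge_cv <- cv.glmnet(x_sim, y_sim, alpha = 0)
# CV estimate of the test MSE for each value of lambda
plot(ridge_cv)
print(ridge_cv$lambda.min)

# Lasso: alpha = 1
lasso_fit <- glmnet(x_sim, y_sim, alpha = 1)
plot(lasso_fit, xvar = "lambda", label = TRUE)

# Cross-validation for the lasso and coefficients at the selected lambda
lasso_cv <- cv.glmnet(x_sim, y_sim, alpha = 1)
plot(lasso_cv)
lasso_coef <- coef(lasso_cv, s = "lambda.min")
print(lasso_coef)
```

For the survey data, x_sim and y_sim would be replaced by the matrix x and the vector y created earlier. Because cv.glmnet() assigns observations to folds at random, the selected value of \(\lambda\) can change from run to run unless a seed is set.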