Data

We will use data from the 2009 British Social Attitudes Survey. You can download the data here. The codebook is available here.

Our goal is to predict the following output variable (the name in brackets corresponds to the variable name in the codebook): the respondents' estimate of the percentage of non-Western immigrants in the UK (imm_brit).

We will use the remaining variables in the data set as inputs: resp_female, resp_age, resp_household_size, the party identification dummies (resp_party_*), resp_newspaper, resp_internet_hrs, resp_religious, resp_time_current_employment, resp_urban_area, resp_health, and resp_household_income.

Let’s have a look at the distribution of the output variable. The red dashed line shows the true percentage of non-Western immigrants (see www.migrationwatchuk.org).

The plot shows that 756 of 1,044 respondents (or 72.4%) overestimate the percentage of non-Western immigrants in the UK.
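
The plot and the figures reported above can be reproduced with a few lines of code. Below is a minimal sketch, assuming bsas_data (loaded in the next section) is available and that true_pct is a placeholder for the published percentage of non-Western immigrants, which we do not reproduce here:

# Distribution of the output variable (sketch; true_pct is a placeholder)
hist(bsas_data$imm_brit,
     breaks = 20,
     main = "Estimated percentage of non-Western immigrants",
     xlab = "Estimated percentage")
abline(v = true_pct, col = "red", lty = "dashed")

# Share of respondents who overestimate the true percentage
mean(bsas_data$imm_brit > true_pct)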

Data Preparation

The original data set contains 1,044 respondents.

library(foreign)
library(dplyr)

# Set working directory
# setwd(...)

# Load data set
load("bsas_short.RData")

# Declare factor variables
bsas_data <- bsas_data %>%
  dplyr::mutate(resp_urban_area = factor(resp_urban_area,
                                         levels = 1:4,
                                         labels = c("rural", "rather rural",
                                                    "rather urban", "urban")),
                resp_health = factor(resp_health,
                                     levels = 0:3,
                                     labels = c("bad", "fair", "fairly good", "good")))

We will use the glmnet() function in the glmnet package to perform ridge regression and the lasso. Before doing so, we use the model.matrix() function to create a matrix of input variables. This function automatically transforms any qualitative variables into dummy variables. This is important because glmnet() can only take quantitative inputs.

We remove the intercept from the matrix produced by model.matrix() because glmnet() will automatically include an intercept. We also exclude the input resp_party_cons, which will serve as the baseline in our model.

# Matrix of input variables (remove the intercept and resp_party_cons)
x <- model.matrix(imm_brit ~ . -1 -resp_party_cons, bsas_data)

# Output variable
y <- bsas_data$imm_brit
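
As a quick check (a sketch, not part of the original workflow), we can confirm that model.matrix() has expanded the factor variables into dummy variables by inspecting the dimensions and column names of x:

# One row per respondent, one column per (dummy-coded) input
dim(x)

# Column names show the dummy-coded levels of resp_urban_area and resp_health
colnames(x)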

Exercise 1: Ridge Regression

The goal of the first exercise is to fit a ridge regression to our data.

Problem 1.1

We first split the data into a training set and a test set.

# Set seed
set.seed(1234)

# Split data into a training and test set
train <- sample(1:nrow(x), nrow(x) / 2)
y_test <- y[-train]

We next define a vector of values for the tuning parameter \(\lambda\).

# Define grid of lambda values
grid <- 10^seq(10, -2, length = 100)

We then use the training set to fit a ridge regression for each \(\lambda\) value. We do this by using the glmnet() function in the glmnet package. The glmnet() function has an alpha argument that determines what type of model is fit. If alpha = 0, then a ridge regression model is fit, and if alpha = 1, then a lasso model is fit. Note that by default, the glmnet() function standardizes the variables so that they are on the same scale.

library(glmnet)

# Fit ridge regression models using the glmnet function
ridge_out <- glmnet(x[train, ], y[train], alpha = 0, lambda = grid)

Associated with each value of \(\lambda\) is a vector of ridge regression coefficients, stored in a matrix that can be accessed by coef(). In our case, this is a \(23 \times 100\) matrix, with one row per coefficient (the intercept plus the 22 input columns of x) and one column per value of \(\lambda\).

dim(coef(ridge_out))
## [1]  23 100
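
To get a feel for the shrinkage, we can inspect the coefficient estimates for a single value of \(\lambda\). A minimal sketch (the index 50 is an arbitrary choice, roughly in the middle of the grid):

# Lambda value in the middle of the grid and its associated coefficient estimates
ridge_out$lambda[50]
coef(ridge_out)[, 50]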

Finally, we use the plot() function to create a graph that shows the shrinkage of the coefficient estimates as a function of \(\lambda\).

# Plot shrinkage of coefficient estimates
plot(ridge_out, xvar = "lambda", label = TRUE)

Problem 1.2

We use cross-validation to choose the optimal tuning parameter \(\lambda\). We do this by using the cross-validation function cv.glmnet() in the glmnet package. By default, the function performs 10-fold cross-validation, though we can change this using the argument nfolds. Here, we do 5-fold cross-validation.

# Perform cross-validation
ridge_cv_out <- cv.glmnet(x[train, ], y[train], alpha = 0, lambda = grid, nfolds = 5)

We then use the plot() function to create a graph that shows the CV estimate of the expected test MSE associated with each value of \(\lambda\).

# Plot MSE associated with each lambda value
plot(ridge_cv_out)

Next, we determine the optimal value of \(\lambda\) and use the test set to estimate the test MSE associated with our optimal model.

# Determine optimal lambda value
(lambda_opt <- ridge_cv_out$lambda.min)
## [1] 4.641589
# Use test set to estimate test error
ridge_out_pred <- predict(ridge_out, s = lambda_opt, newx = x[-train, ])
mean((ridge_out_pred - y_test)^2)
## [1] 387.6958
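
Note that cv.glmnet() also stores lambda.1se, the largest \(\lambda\) whose CV error is within one standard error of the minimum. As a sketch, a more conservative (more strongly shrunken) model and its test error could be obtained as follows:

# One-standard-error rule: a more conservative choice of lambda
ridge_cv_out$lambda.1se
ridge_pred_1se <- predict(ridge_out, s = ridge_cv_out$lambda.1se, newx = x[-train, ])
mean((ridge_pred_1se - y_test)^2)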

Let’s see whether there is any benefit to performing ridge regression instead of least squares regression.

ridge_out_pred_ls <- predict(ridge_out, s = 0, x = x[train, ], y = y[train],
                             newx = x[-train, ], exact = TRUE)
mean((ridge_out_pred_ls - y_test)^2)
## [1] 386.4369

For this data set, ridge regression actually yields a slightly higher test error than least squares (387.70 vs. 386.44). In a case like this, we might simply use the least squares solution.
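
As a cross-check, the least squares fit can also be obtained directly with lm(). A minimal sketch, assuming x has full column rank; the resulting test MSE should essentially match the exact = TRUE computation above:

# Least squares on the training set; one coefficient per column of x plus the intercept
ls_fit <- lm(y[train] ~ x[train, ])

# Manual prediction on the test set and the resulting test MSE
ls_pred <- cbind(1, x[-train, ]) %*% coef(ls_fit)
mean((ls_pred - y_test)^2)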

Finally, we re-fit our ridge regression model on the full data set, using the optimal value of \(\lambda\).

ridge_final <- glmnet(x, y, alpha = 0)
predict(ridge_final, type = "coefficients", s = lambda_opt)[1:23, ]
##                  (Intercept)                  resp_female 
##                 37.031952483                  5.937727674 
##                     resp_age          resp_household_size 
##                 -0.041434850                  1.148784733 
##               resp_party_lab            resp_party_libdem 
##                 -2.577587782                 -3.895195596 
##               resp_party_snp             resp_party_green 
##                  3.818745888                 -3.339799903 
##              resp_party_ukip               resp_party_bnp 
##                 -4.152818311                  8.821819329 
##             resp_party_other               resp_newspaper 
##                  2.725760372                  1.612797275 
##            resp_internet_hrs               resp_religious 
##                 -0.025656308                  0.511561245 
## resp_time_current_employment         resp_urban_arearural 
##                 -0.002295805                 -1.223758130 
##  resp_urban_arearather rural  resp_urban_arearather urban 
##                 -0.860122923                  0.286077178 
##         resp_urban_areaurban              resp_healthfair 
##                  1.664619958                 -1.214776029 
##       resp_healthfairly good              resp_healthgood 
##                 -0.238974681                 -0.102325897 
##        resp_household_income 
##                 -1.200245890

Exercise 2: The Lasso

The goal of the second exercise is to fit a lasso model to our data.

Problem 2.1

In order to fit a lasso model, we again use the glmnet() function. However, we now set the argument alpha = 1. We then use the plot() function to create a graph that shows the shrinkage of the estimated coefficients.

# Fit Lasso regression models using the glmnet function
lasso_out <- glmnet(x[train, ], y[train], alpha = 1, lambda = grid)

# Plot shrinkage of coefficient estimates
plot(lasso_out, xvar = "lambda", label = TRUE)

We now perform cross-validation, plot the CV estimate of the expected test MSE for each \(\lambda\), and use the test set to compute the test error for the model with the optimal \(\lambda\).

set.seed(1234)

# Perform cross-validation
lasso_cv_out <- cv.glmnet(x[train, ], y[train], alpha = 1, lambda = grid, nfolds = 5)

# Plot estimated test MSE for each lambda value
plot(lasso_cv_out)

# Determine optimal lambda value
(lambda_opt <- lasso_cv_out$lambda.min)
## [1] 0.869749
# Compute test-set predictions using the optimal lambda value
lasso_out_pred <- predict(lasso_out, s = lambda_opt, newx = x[-train, ])

# Test error
mean((lasso_out_pred - y_test)^2)
## [1] 386.9932

This value is very similar to the test MSE of ridge regression with \(\lambda\) chosen by cross-validation.

Finally, we again re-fit our lasso model on the full data set, using the optimal value of \(\lambda\).

lasso_final <- glmnet(x, y, alpha = 1)
predict(lasso_final, type = "coefficients", s = lambda_opt)[1:23, ]
##                  (Intercept)                  resp_female 
##                36.7124923555                 5.3438555077 
##                     resp_age          resp_household_size 
##                -0.0121796559                 0.9336677630 
##               resp_party_lab            resp_party_libdem 
##                -0.5800356519                -1.4922918182 
##               resp_party_snp             resp_party_green 
##                 0.0000000000                 0.0000000000 
##              resp_party_ukip               resp_party_bnp 
##                -0.0002045927                 5.8155536429 
##             resp_party_other               resp_newspaper 
##                 2.5567523020                 0.0068577233 
##            resp_internet_hrs               resp_religious 
##                 0.0000000000                 0.0000000000 
## resp_time_current_employment         resp_urban_arearural 
##                 0.0000000000                 0.0000000000 
##  resp_urban_arearather rural  resp_urban_arearather urban 
##                 0.0000000000                 0.0000000000 
##         resp_urban_areaurban              resp_healthfair 
##                 0.1538081862                 0.0000000000 
##       resp_healthfairly good              resp_healthgood 
##                 0.0000000000                 0.0000000000 
##        resp_household_income 
##                -1.2840413798
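
Unlike ridge regression, the lasso sets a number of the coefficient estimates exactly to zero. As a quick sketch, we can count and name the inputs that are retained in the final model:

# Extract the coefficient estimates and count the non-zero entries
lasso_coef <- predict(lasso_final, type = "coefficients", s = lambda_opt)[1:23, ]
sum(lasso_coef != 0)

# Inputs with non-zero coefficients (excluding the intercept)
names(lasso_coef)[lasso_coef != 0 & names(lasso_coef) != "(Intercept)"]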