Here we apply bagging to the 2005 BES survey data, using the randomForest package in R. Recall that bagging is a special case of a random forest with \(m = p\). Therefore, the randomForest() function can be used to perform both random forests and bagging. We perform bagging as follows:
library(randomForest)
# Load data set
load("bes_2005.RData")
# Create training and test sets
set.seed(1234)
train <- sample(1:nrow(bes_2005), size = as.integer(nrow(bes_2005) / 2))
bes_2005_test <- bes_2005[-train, ]
vote_2005_test <- bes_2005$vote_2005[-train]
# Bagging
set.seed(1234)
bag_1 <- randomForest(household_income ~ ., data = bes_2005, subset = train,
                      mtry = 7, importance = TRUE)
bag_1
##
## Call:
## randomForest(formula = household_income ~ ., data = bes_2005, mtry = 7, importance = TRUE, subset = train)
## Type of random forest: regression
## Number of trees: 500
## No. of variables tried at each split: 7
##
## Mean of squared residuals: 8.813623
## % Var explained: 14.23
Note that the argument mtry = 7 indicates that all 7 predictors should be considered at each split of the tree; in other words, bagging should be performed. How well does this bagged model perform on the test set?
# Predict outcomes
bag_1_pred <- predict(bag_1, newdata = bes_2005_test)
# MSE
household_income_test <- bes_2005$household_income[-train]
mean((bag_1_pred - household_income_test)^2)
## [1] 11.3849
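The size of the bagged ensemble can be controlled with the ntree argument of randomForest(). As a sketch (25 trees is an illustrative choice, not a tuned value), we could refit with a smaller ensemble and evaluate it the same way:

```r
# Bagging with fewer trees; 25 is illustrative, not tuned
set.seed(1234)
bag_2 <- randomForest(household_income ~ ., data = bes_2005, subset = train,
                      mtry = 7, ntree = 25)
bag_2_pred <- predict(bag_2, newdata = bes_2005_test)
# Test MSE for the smaller ensemble
mean((bag_2_pred - household_income_test)^2)
```

With so few trees the variance reduction from averaging is weaker, so we would typically expect a somewhat higher test MSE than with the default 500 trees.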
Growing a random forest proceeds in exactly the same way, except that we use a smaller value of the mtry argument. By default, randomForest() uses \(p/3\) variables when building a random forest of regression trees, and \(\sqrt{p}\) variables when building a random forest of classification trees. Here we use mtry = 3.
# Random forest
set.seed(1234)
rf_1 <- randomForest(household_income ~ ., data = bes_2005, subset = train,
                     mtry = 3, importance = TRUE)
# Predict outcomes
rf_1_pred <- predict(rf_1, newdata = bes_2005_test)
# MSE
mean((rf_1_pred - household_income_test)^2)
## [1] 10.49424
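If we wanted to choose mtry more systematically rather than relying on the default, one option (a sketch, not part of the original analysis) is to loop over all candidate values from 1 to 7 and compare test MSEs:

```r
# Test MSE for every possible mtry value
# (mtry = 1: one random predictor per split; mtry = 7: bagging)
mse_by_mtry <- sapply(1:7, function(m) {
  set.seed(1234)
  fit <- randomForest(household_income ~ ., data = bes_2005, subset = train,
                      mtry = m)
  mean((predict(fit, newdata = bes_2005_test) - household_income_test)^2)
})
mse_by_mtry
```

Note that selecting mtry on the test set uses it for tuning; in practice the out-of-bag error (reported by randomForest itself) is a cleaner basis for this choice.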
The test set MSE is 10.49. This indicates that random forests yielded an improvement over bagging in this case. Using the importance() and varImpPlot() functions, we can view the importance of each variable.
# Variable importance
importance(rf_1)
## %IncMSE IncNodePurity
## vote_2005 11.665744 157.8210
## vote_2001 23.246025 186.5598
## female 5.225216 162.7271
## party_id 16.091845 190.4920
## phone 22.284473 125.1718
## education 66.508981 1164.5620
## duty 13.134319 375.1277
# Variable importance plot
varImpPlot(rf_1, main = "Variable Importance Plot")
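The table returned by importance() is not sorted. To rank the predictors by the permutation-based measure, we can order the rows ourselves (a small convenience step, not part of the original analysis):

```r
# Rank predictors by %IncMSE, the permutation-based importance measure
imp <- importance(rf_1)
imp[order(imp[, "%IncMSE"], decreasing = TRUE), ]
```

In these results, education stands out as by far the most important predictor on both measures.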
We use the gbm() function from the gbm package to fit boosted regression trees to the 2005 BES survey data. We run gbm() with the option distribution = "gaussian" since this is a regression problem; if it were a binary classification problem, we would use distribution = "bernoulli". The argument n.trees = 5000 indicates that we want 5000 trees, and the option interaction.depth = 4 limits the depth of each tree. Note that the learning rate, set by the shrinkage argument, is related to the number of trees: when the learning rate is smaller, we need more trees. The summary() function produces a relative influence plot and also outputs the relative influence statistics.
library(gbm)
# Boosting
set.seed(1234)
boost_1 <- gbm(household_income ~ ., data = bes_2005[train, ],
               distribution = "gaussian",
               n.trees = 5000,
               interaction.depth = 4,
               shrinkage = 0.001)
summary(boost_1)
## var rel.inf
## education education 51.094136
## duty duty 15.868715
## party_id party_id 8.338671
## vote_2001 vote_2001 7.632323
## female female 7.059228
## phone phone 5.389219
## vote_2005 vote_2005 4.617708
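As an aside (not part of the original analysis), the gbm package can also produce partial dependence plots via its plot() method, showing the marginal effect of a predictor after averaging out the others. For the most influential predictor here:

```r
# Partial dependence of predicted household income on education,
# averaging over the other predictors
plot(boost_1, i.var = "education", n.trees = 5000)
```

This complements the relative influence statistics: influence tells us how much a variable matters, while the partial dependence plot shows the direction and shape of its effect.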
We now use the boosted model to predict household income on the test set:
# Predict outcomes
boost_1_pred <- predict(boost_1, newdata = bes_2005_test, n.trees = 5000)
# MSE
mean((boost_1_pred - household_income_test)^2)
## [1] 9.837483
The test MSE obtained is 9.84, which is superior to both the test MSE for random forests and that for bagging.
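To illustrate the interplay between the learning rate and the number of trees, we could refit with a larger shrinkage value (0.01 here is an illustrative choice) while keeping the other settings fixed, and compare test MSEs:

```r
# Boosting with a larger learning rate; other settings unchanged
set.seed(1234)
boost_2 <- gbm(household_income ~ ., data = bes_2005[train, ],
               distribution = "gaussian",
               n.trees = 5000,
               interaction.depth = 4,
               shrinkage = 0.01)
boost_2_pred <- predict(boost_2, newdata = bes_2005_test, n.trees = 5000)
mean((boost_2_pred - household_income_test)^2)
```

A larger learning rate makes each tree contribute more, so fewer trees are needed before the fit stabilizes; whether this helps or hurts the test MSE depends on the data, which is why shrinkage and n.trees are usually tuned jointly, for example by cross-validation.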