Here we apply bagging to the 2005 BES survey data, using the randomForest package in R. Recall that bagging is a special case of a random forest with \(m = p\). Therefore, the randomForest() function can be used to perform both random forests and bagging. We perform bagging as follows:
library(randomForest)
# Load data set
load("bes_2005.RData")
# Create training and test sets
set.seed(1234)
train <- sample(1:nrow(bes_2005), size = nrow(bes_2005) / 2)
bes_2005_test <- bes_2005[-train, ]
# Bagging
bag_1 <- randomForest(household_income ~ ., data = bes_2005, subset = train,
mtry = 7, importance = TRUE)
bag_1
##
## Call:
## randomForest(formula = household_income ~ ., data = bes_2005, mtry = 7, importance = TRUE, subset = train)
## Type of random forest: regression
## Number of trees: 500
## No. of variables tried at each split: 7
##
## Mean of squared residuals: 11.48014
## % Var explained: 0.53
Note that the argument mtry = 7 indicates that all 7 predictors should be considered for each split of the tree; in other words, bagging should be performed. How well does this bagged model perform on the test set?
# Predict outcomes
bag_1_pred <- predict(bag_1, newdata = bes_2005_test)
# MSE
household_income_test <- bes_2005$household_income[-train]
mean((bag_1_pred - household_income_test)^2)
## [1] 9.262259
Growing a random forest proceeds in exactly the same way, except that we use a smaller value of the mtry argument. By default, randomForest() uses \(p/3\) variables when building a random forest of regression trees, and \(\sqrt{p}\) variables when building a random forest of classification trees. Here we use mtry = 3.
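As a quick check (a minimal sketch, assuming bes_2005 contains household_income plus the 7 predictors used above), we can compute what those defaults would be here:
# Number of predictors (all columns except household_income)
p <- ncol(bes_2005) - 1
# Default mtry for regression trees: floor(p / 3), i.e. 2 when p = 7
floor(p / 3)
# Default mtry for classification trees: floor(sqrt(p)), i.e. 2 when p = 7
floor(sqrt(p))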
# Random forest
rf_1 <- randomForest(household_income ~ ., data = bes_2005, subset = train,
mtry = 3, importance = TRUE)
# Predict outcomes
rf_1_pred <- predict(rf_1, newdata = bes_2005_test)
# MSE
mean((rf_1_pred - household_income_test)^2)
## [1] 8.66902
The test set MSE is 8.67, indicating that the random forest yielded an improvement over bagging in this case. Using the importance() and varImpPlot() functions, we can view the importance of each variable.
# Variable importance
importance(rf_1)
## %IncMSE IncNodePurity
## vote_2005 -1.2701606 139.7457
## vote_2001 11.2980485 157.8962
## female 1.9630625 179.4282
## party_id 0.8740649 163.6504
## phone 15.6507824 119.4684
## education 48.1921635 1190.5085
## duty 5.7742732 413.7412
# Variable importance plot
varImpPlot(rf_1, main = "Variable Importance Plot")
We use the gbm() function from the gbm package to fit boosted regression trees to the 2005 BES survey data. We run gbm() with the option distribution = "gaussian" since this is a regression problem; if it were a binary classification problem, we would use distribution = "bernoulli". The argument n.trees = 5000 indicates that we want 5000 trees, and the option interaction.depth = 4 limits the depth of each tree. The option shrinkage = 0.001 sets the learning rate; note that the learning rate is related to the number of trees: when the learning rate is smaller, more trees are needed. The summary() function produces a relative influence plot and also outputs the relative influence statistics.
library(gbm)
# Boosting
set.seed(1234)
boost_1 <- gbm(household_income ~ ., data = bes_2005[train, ],
distribution = "gaussian",
n.trees = 5000,
interaction.depth = 4,
shrinkage = 0.001)
summary(boost_1)
## var rel.inf
## education education 52.923929
## duty duty 19.926347
## party_id party_id 7.272264
## female female 6.586894
## vote_2001 vote_2001 6.308393
## vote_2005 vote_2005 3.491710
## phone phone 3.490462
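To illustrate the relationship between the learning rate and the number of trees noted above, one could refit the model with a larger learning rate and correspondingly fewer trees (a sketch only; boost_2 and the values shrinkage = 0.01 and n.trees = 500 are illustrative choices, not results reported here):
# Sketch: a larger learning rate typically requires fewer trees
boost_2 <- gbm(household_income ~ ., data = bes_2005[train, ],
               distribution = "gaussian",
               n.trees = 500,
               interaction.depth = 4,
               shrinkage = 0.01)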
We now use the original boosted model, boost_1, to predict household income on the test set:
# Predict outcomes
boost_1_pred <- predict(boost_1, newdata = bes_2005_test, n.trees = 5000)
# MSE
mean((boost_1_pred - household_income_test)^2)
## [1] 8.343382
The test MSE obtained is 8.34, which is lower than the test MSEs for both random forests and bagging.
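For a side-by-side comparison, we can collect the three test MSEs computed above (a minimal sketch that only reuses the prediction objects already defined):
# Compare test MSEs of bagging, random forest, and boosting
c(bagging = mean((bag_1_pred - household_income_test)^2),
  random_forest = mean((rf_1_pred - household_income_test)^2),
  boosting = mean((boost_1_pred - household_income_test)^2))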