Here we apply bagging to the 2005 BES survey data, using the `randomForest` package in `R`. Recall that bagging is a special case of a random forest with \(m = p\). Therefore, the `randomForest()` function can be used to perform both random forests and bagging. We perform bagging as follows:

```
library(randomForest)
# Load data set
load("bes_2005.RData")
# Create training and test sets
set.seed(1234)
train <- sample(1:nrow(bes_2005), size = nrow(bes_2005) / 2)
bes_2005_test <- bes_2005[-train, ]
# Bagging
bag_1 <- randomForest(household_income ~ ., data = bes_2005, subset = train,
                      mtry = 7, importance = TRUE)
bag_1
```

```
##
## Call:
## randomForest(formula = household_income ~ ., data = bes_2005, mtry = 7, importance = TRUE, subset = train)
## Type of random forest: regression
## Number of trees: 500
## No. of variables tried at each split: 7
##
## Mean of squared residuals: 11.48014
## % Var explained: 0.53
```
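
The printed summary reports the out-of-bag (OOB) error. For regression forests, `randomForest` also stores the OOB MSE per tree in the fitted object, which makes it easy to check whether 500 trees are enough; a minimal sketch, assuming the fitted `bag_1` above:

```
# OOB MSE as a function of the number of trees; the curve should flatten
# well before ntree = 500 if the ensemble has stabilized
plot(bag_1$mse, type = "l", xlab = "Number of trees", ylab = "OOB MSE")
```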

Note that the argument `mtry = 7` indicates that all 7 predictors should be considered for each split of the tree; in other words, that bagging should be done.
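
Rather than hard-coding the number of predictors, `mtry` can be derived from the data; a minimal sketch, assuming the data loaded above (the object name `bag_2` is purely illustrative):

```
# Every column except the outcome is a predictor, so p = ncol - 1
p <- ncol(bes_2005) - 1  # 7 predictors besides household_income
bag_2 <- randomForest(household_income ~ ., data = bes_2005, subset = train,
                      mtry = p, importance = TRUE)
```

How well does this bagged model perform on the test set?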

```
# Predict outcomes
bag_1_pred <- predict(bag_1, newdata = bes_2005_test)
# MSE
household_income_test <- bes_2005$household_income[-train]
mean((bag_1_pred - household_income_test)^2)
```

`## [1] 9.262259`
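
To put this number in context, it can be compared against a naive baseline that predicts the training-set mean for every test observation; a sketch, assuming the objects defined above:

```
# Baseline: predict the mean training outcome for all test cases
baseline_pred <- mean(bes_2005$household_income[train])
mean((baseline_pred - household_income_test)^2)
```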

Growing a random forest proceeds in exactly the same way, except that we use a smaller value of the `mtry` argument. By default, `randomForest()` uses \(p/3\) variables when building a random forest of regression trees, and \(\sqrt{p}\) variables when building a random forest of classification trees. Here we use `mtry = 3`.
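
For this data set with \(p = 7\), the defaults are easy to compute by hand (`randomForest()` takes the floor, with a minimum of 1):

```
# Default mtry values for p = 7 predictors
p <- 7
floor(p / 3)    # regression default: 2
floor(sqrt(p))  # classification default: 2
```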

```
# Random forest
rf_1 <- randomForest(household_income ~ ., data = bes_2005, subset = train,
                     mtry = 3, importance = TRUE)
# Predict outcomes
rf_1_pred <- predict(rf_1, newdata = bes_2005_test)
# MSE
mean((rf_1_pred - household_income_test)^2)
```

`## [1] 8.66902`

The test set MSE is 8.67, so random forests yielded an improvement over bagging in this case. Using the `importance()` and `varImpPlot()` functions, we can view the importance of each variable. For regression trees, `%IncMSE` reports the mean increase in out-of-bag MSE when the variable's values are permuted, and `IncNodePurity` reports the total decrease in node RSS from splits on that variable, averaged over all trees.

```
# Variable importance
importance(rf_1)
```

```
## %IncMSE IncNodePurity
## vote_2005 -1.2701606 139.7457
## vote_2001 11.2980485 157.8962
## female 1.9630625 179.4282
## party_id 0.8740649 163.6504
## phone 15.6507824 119.4684
## education 48.1921635 1190.5085
## duty 5.7742732 413.7412
```
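
The rows of `importance()` follow the order of the predictors rather than their importance; a small sketch for sorting by the permutation measure:

```
# Sort predictors by %IncMSE, largest first
imp <- importance(rf_1)
imp[order(imp[, "%IncMSE"], decreasing = TRUE), ]
```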

```
# Variable importance plot
varImpPlot(rf_1, main = "Variable Importance Plot")
```
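
The choice `mtry = 3` was made in advance rather than tuned. A sketch for comparing test MSE across all possible values of `mtry`, assuming the objects defined above (this refits the forest seven times, so it may be slow):

```
# Test MSE for each mtry from 1 (maximal decorrelation) to 7 (bagging)
set.seed(1234)
test_mse <- sapply(1:7, function(m) {
  rf_m <- randomForest(household_income ~ ., data = bes_2005, subset = train,
                       mtry = m)
  mean((predict(rf_m, newdata = bes_2005_test) - household_income_test)^2)
})
test_mse
```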

We use the `gbm()` function from the `gbm` package to fit boosted regression trees to the 2005 BES survey data. We run `gbm()` with the option `distribution = "gaussian"` since this is a regression problem; if it were a binary classification problem, we would use `distribution = "bernoulli"`. The argument `n.trees = 5000` indicates that we want 5000 trees, and the option `interaction.depth = 4` limits the depth of each tree. Note that the learning rate, set here via `shrinkage = 0.001`, is related to the number of trees: when the learning rate is smaller, we need more trees. The `summary()` function produces a relative influence plot and also outputs the relative influence statistics.

```
library(gbm)
# Boosting
set.seed(1234)
boost_1 <- gbm(household_income ~ ., data = bes_2005[train, ],
               distribution = "gaussian",
               n.trees = 5000,
               interaction.depth = 4,
               shrinkage = 0.001)
summary(boost_1)
```

```
## var rel.inf
## education education 52.923929
## duty duty 19.926347
## party_id party_id 7.272264
## female female 6.586894
## vote_2001 vote_2001 6.308393
## vote_2005 vote_2005 3.491710
## phone phone 3.490462
```
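
Partial dependence plots show the marginal effect of a predictor after averaging out the others; a sketch for the two most influential variables, assuming the fitted `boost_1`:

```
# Partial dependence of household income on the top two predictors
plot(boost_1, i.var = "education", n.trees = 5000)
plot(boost_1, i.var = "duty", n.trees = 5000)
```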

The relative influence statistics confirm that `education` and `duty` are by far the most influential predictors. We now use the boosted model to predict household income on the test set:

```
# Predict outcomes
boost_1_pred <- predict(boost_1, newdata = bes_2005_test, n.trees = 5000)
# MSE
mean((boost_1_pred - household_income_test)^2)
```

`## [1] 8.343382`

The test MSE obtained is 8.34, lower than both the test MSE for random forests and that for bagging; in this case, boosting performs best.
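
Both the learning rate and the number of trees were fixed in advance here. A sketch for trying a tenfold larger learning rate with the same settings (the object name `boost_2` is illustrative, and the resulting test MSE may be better or worse than above):

```
# Refit with shrinkage = 0.01 and recompute the test MSE
set.seed(1234)
boost_2 <- gbm(household_income ~ ., data = bes_2005[train, ],
               distribution = "gaussian", n.trees = 5000,
               interaction.depth = 4, shrinkage = 0.01)
boost_2_pred <- predict(boost_2, newdata = bes_2005_test, n.trees = 5000)
mean((boost_2_pred - household_income_test)^2)
```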