This exercise is based on James et al. 2013, chapter 10.4.
We load data on violent crimes by state in the US. The data set is included in the MASS
package.
# Load data on violent crimes by US state
library(MASS)
head(USArrests)
## Murder Assault UrbanPop Rape
## Alabama 13.2 236 58 21.2
## Alaska 10.0 263 48 44.5
## Arizona 8.1 294 80 31.0
## Arkansas 8.8 190 50 19.5
## California 9.0 276 91 40.6
## Colorado 7.9 204 78 38.7
# Create a variable for US states
states <- row.names(USArrests)
states
## [1] "Alabama" "Alaska" "Arizona" "Arkansas"
## [5] "California" "Colorado" "Connecticut" "Delaware"
## [9] "Florida" "Georgia" "Hawaii" "Idaho"
## [13] "Illinois" "Indiana" "Iowa" "Kansas"
## [17] "Kentucky" "Louisiana" "Maine" "Maryland"
## [21] "Massachusetts" "Michigan" "Minnesota" "Mississippi"
## [25] "Missouri" "Montana" "Nebraska" "Nevada"
## [29] "New Hampshire" "New Jersey" "New Mexico" "New York"
## [33] "North Carolina" "North Dakota" "Ohio" "Oklahoma"
## [37] "Oregon" "Pennsylvania" "Rhode Island" "South Carolina"
## [41] "South Dakota" "Tennessee" "Texas" "Utah"
## [45] "Vermont" "Virginia" "Washington" "West Virginia"
## [49] "Wisconsin" "Wyoming"
# Mean and variance of features in the data set
apply(USArrests, 2, mean)
## Murder Assault UrbanPop Rape
## 7.788 170.760 65.540 21.232
apply(USArrests, 2, var)
## Murder Assault UrbanPop Rape
## 18.97047 6945.16571 209.51878 87.72916
Principal Components Analysis (PCA) is a dimension reduction method. We run PCA on the data to capture some underlying measure of violence. To run PCA we use the prcomp()
function.
# Run PCA
pca_out <- prcomp(USArrests, center = TRUE, scale = TRUE)
# Check model object
names(pca_out)
## [1] "sdev" "rotation" "center" "scale" "x"
We can check the mean and standard deviation of the features before centering and scaling.
pca_out$center
## Murder Assault UrbanPop Rape
## 7.788 170.760 65.540 21.232
pca_out$scale
## Murder Assault UrbanPop Rape
## 4.355510 83.337661 14.474763 9.366385
We can access the loadings of the principal components as follows.
pca_out$rotation
## PC1 PC2 PC3 PC4
## Murder -0.5358995 0.4181809 -0.3412327 0.64922780
## Assault -0.5831836 0.1879856 -0.2681484 -0.74340748
## UrbanPop -0.2781909 -0.8728062 -0.3780158 0.13387773
## Rape -0.5434321 -0.1673186 0.8177779 0.08902432
Similary, we can access the scores of the principal components as follows.
dim(pca_out$x)
## [1] 50 4
head(pca_out$x)
## PC1 PC2 PC3 PC4
## Alabama -0.9756604 1.1220012 -0.43980366 0.154696581
## Alaska -1.9305379 1.0624269 2.01950027 -0.434175454
## Arizona -1.7454429 -0.7384595 0.05423025 -0.826264240
## Arkansas 0.1399989 1.1085423 0.11342217 -0.180973554
## California -2.4986128 -1.5274267 0.59254100 -0.338559240
## Colorado -1.4993407 -0.9776297 1.08400162 0.001450164
We can create a biplot showing the loading vectors and principal component scores (The scale = 0
argument to biplot()
ensures that the arrows are scaled to represent the loadings). We can also examine the effect of rotation.
biplot(pca_out, scale = 0, cex = .8)
# Turning things around
pca_out$rotation <- -pca_out$rotation
pca_out$x <- -pca_out$x
biplot(pca_out, scale = 0, cex = .8)
We are really interested in how much of the variance in the data our components capture. So we get the standard deviation on the new scales and compute the variance. We can then evaluate how much of the total variance is explained by each individual principal component.
# Standard deviation on new scales
pca_out$sdev
## [1] 1.5748783 0.9948694 0.5971291 0.4164494
# Compute variance
pca_var <- pca_out$sdev^2
pca_var
## [1] 2.4802416 0.9897652 0.3565632 0.1734301
# Proportion of variance explained by each principal component
pve <- pca_var / sum(pca_var)
pve
## [1] 0.62006039 0.24744129 0.08914080 0.04335752
We can plot the proportion of variance explained (PVE) by each component. The purpose is to determine how many components we need to account for most of the variance in the data set. It is common to look for an “elbow” (although this is not a disciplined approach).
# Finding the ellbow
plot(pve, xlab = "Principal Component",
ylab = "Proportion of Variance Explained ",
ylim = c(0, 1), type = "b")
We can also plot the cumulative proportion of variance explained. To do so, we use the cumsum()
function, which returns the cumulative sums of a vector.
# Function to compute cumulative sums
a <- c(1, 2, 8, -3)
cumsum(a)
## [1] 1 3 11 8
# What is the cumulative proportion of variance explained by principal components?
plot(cumsum(pve), xlab = "Principal Component ",
ylab = " Cumulative Proportion of Variance Explained ",
ylim = c(0, 1), type = "b")