This exercise is based on James et al. 2013, chapter 10.4.

Data

We load data on violent crimes by state in the US. The data set is included in the MASS package.

# Load data on violent crimes by US state
library(MASS)

head(USArrests)
##            Murder Assault UrbanPop Rape
## Alabama      13.2     236       58 21.2
## Alaska       10.0     263       48 44.5
## Arizona       8.1     294       80 31.0
## Arkansas      8.8     190       50 19.5
## California    9.0     276       91 40.6
## Colorado      7.9     204       78 38.7
# Create a variable for US states
states <- row.names(USArrests)
states
##  [1] "Alabama"        "Alaska"         "Arizona"        "Arkansas"      
##  [5] "California"     "Colorado"       "Connecticut"    "Delaware"      
##  [9] "Florida"        "Georgia"        "Hawaii"         "Idaho"         
## [13] "Illinois"       "Indiana"        "Iowa"           "Kansas"        
## [17] "Kentucky"       "Louisiana"      "Maine"          "Maryland"      
## [21] "Massachusetts"  "Michigan"       "Minnesota"      "Mississippi"   
## [25] "Missouri"       "Montana"        "Nebraska"       "Nevada"        
## [29] "New Hampshire"  "New Jersey"     "New Mexico"     "New York"      
## [33] "North Carolina" "North Dakota"   "Ohio"           "Oklahoma"      
## [37] "Oregon"         "Pennsylvania"   "Rhode Island"   "South Carolina"
## [41] "South Dakota"   "Tennessee"      "Texas"          "Utah"          
## [45] "Vermont"        "Virginia"       "Washington"     "West Virginia" 
## [49] "Wisconsin"      "Wyoming"
# Mean and variance of features in the data set
apply(USArrests, 2, mean)
##   Murder  Assault UrbanPop     Rape 
##    7.788  170.760   65.540   21.232
apply(USArrests, 2, var)
##     Murder    Assault   UrbanPop       Rape 
##   18.97047 6945.16571  209.51878   87.72916

PCA

Principal Components Analysis (PCA) is a dimension reduction method. We run PCA on the data to capture some underlying measure of violence. To run PCA we use the prcomp() function.

# Run PCA
pca_out <- prcomp(USArrests, center = TRUE, scale = TRUE)

# Check model object
names(pca_out)
## [1] "sdev"     "rotation" "center"   "scale"    "x"

We can check the mean and standard deviation of the features before centering and scaling.

pca_out$center
##   Murder  Assault UrbanPop     Rape 
##    7.788  170.760   65.540   21.232
pca_out$scale
##    Murder   Assault  UrbanPop      Rape 
##  4.355510 83.337661 14.474763  9.366385

We can access the loadings of the principal components as follows.

pca_out$rotation
##                 PC1        PC2        PC3         PC4
## Murder   -0.5358995  0.4181809 -0.3412327  0.64922780
## Assault  -0.5831836  0.1879856 -0.2681484 -0.74340748
## UrbanPop -0.2781909 -0.8728062 -0.3780158  0.13387773
## Rape     -0.5434321 -0.1673186  0.8177779  0.08902432

Similary, we can access the scores of the principal components as follows.

dim(pca_out$x)
## [1] 50  4
head(pca_out$x)
##                   PC1        PC2         PC3          PC4
## Alabama    -0.9756604  1.1220012 -0.43980366  0.154696581
## Alaska     -1.9305379  1.0624269  2.01950027 -0.434175454
## Arizona    -1.7454429 -0.7384595  0.05423025 -0.826264240
## Arkansas    0.1399989  1.1085423  0.11342217 -0.180973554
## California -2.4986128 -1.5274267  0.59254100 -0.338559240
## Colorado   -1.4993407 -0.9776297  1.08400162  0.001450164

We can create a biplot showing the loading vectors and principal component scores (The scale = 0 argument to biplot() ensures that the arrows are scaled to represent the loadings). We can also examine the effect of rotation.

biplot(pca_out, scale = 0, cex = .8)

# Turning things around
pca_out$rotation <- -pca_out$rotation
pca_out$x <- -pca_out$x
biplot(pca_out, scale = 0, cex = .8)

We are really interested in how much of the variance in the data our components capture. So we get the standard deviation on the new scales and compute the variance. We can then evaluate how much of the total variance is explained by each individual principal component.

# Standard deviation on new scales
pca_out$sdev
## [1] 1.5748783 0.9948694 0.5971291 0.4164494
# Compute variance
pca_var <- pca_out$sdev^2
pca_var
## [1] 2.4802416 0.9897652 0.3565632 0.1734301
# Proportion of variance explained by each principal component
pve <- pca_var / sum(pca_var)
pve
## [1] 0.62006039 0.24744129 0.08914080 0.04335752

We can plot the proportion of variance explained (PVE) by each component. The purpose is to determine how many components we need to account for most of the variance in the data set. It is common to look for an “elbow” (although this is not a disciplined approach).

# Finding the ellbow
plot(pve, xlab = "Principal Component", 
ylab = "Proportion of Variance Explained ", 
ylim = c(0, 1), type = "b")

We can also plot the cumulative proportion of variance explained. To do so, we use the cumsum() function, which returns the cumulative sums of a vector.

# Function to compute cumulative sums
a <- c(1, 2, 8, -3)
cumsum(a)
## [1]  1  3 11  8
# What is the cumulative proportion of variance explained by principal components?
plot(cumsum(pve), xlab = "Principal Component ",
ylab = " Cumulative Proportion of Variance Explained ", 
ylim = c(0, 1), type = "b")