This exercise is based on James et al. 2013, chapter 10.4.

Data

We load data on violent crimes by state in the US. The data set is included in the MASS package.

# Load data on violent crimes by US state
library(MASS)

head(USArrests)
##            Murder Assault UrbanPop Rape
## Alabama      13.2     236       58 21.2
## Alaska       10.0     263       48 44.5
## Arizona       8.1     294       80 31.0
## Arkansas      8.8     190       50 19.5
## California    9.0     276       91 40.6
## Colorado      7.9     204       78 38.7
# Create a variable for US states
states <- row.names(USArrests)
states
##  [1] "Alabama"        "Alaska"         "Arizona"        "Arkansas"      
##  [5] "California"     "Colorado"       "Connecticut"    "Delaware"      
##  [9] "Florida"        "Georgia"        "Hawaii"         "Idaho"         
## [13] "Illinois"       "Indiana"        "Iowa"           "Kansas"        
## [17] "Kentucky"       "Louisiana"      "Maine"          "Maryland"      
## [21] "Massachusetts"  "Michigan"       "Minnesota"      "Mississippi"   
## [25] "Missouri"       "Montana"        "Nebraska"       "Nevada"        
## [29] "New Hampshire"  "New Jersey"     "New Mexico"     "New York"      
## [33] "North Carolina" "North Dakota"   "Ohio"           "Oklahoma"      
## [37] "Oregon"         "Pennsylvania"   "Rhode Island"   "South Carolina"
## [41] "South Dakota"   "Tennessee"      "Texas"          "Utah"          
## [45] "Vermont"        "Virginia"       "Washington"     "West Virginia" 
## [49] "Wisconsin"      "Wyoming"
# Mean and variance of features in the data set
apply(USArrests, 2, mean)
##   Murder  Assault UrbanPop     Rape 
##    7.788  170.760   65.540   21.232
apply(USArrests, 2, var)
##     Murder    Assault   UrbanPop       Rape 
##   18.97047 6945.16571  209.51878   87.72916

PCA

Principal Components Analysis (PCA) is a dimension reduction method. We run PCA on the data to capture some underlying measure of violence. To run PCA we use the prcomp() function.

# Run PCA
pca_out <- prcomp(USArrests, center = TRUE, scale = TRUE)

# Check model object
names(pca_out)
## [1] "sdev"     "rotation" "center"   "scale"    "x"

We can check the mean and standard deviation of the features before centering and scaling.

pca_out$center
##   Murder  Assault UrbanPop     Rape 
##    7.788  170.760   65.540   21.232
pca_out$scale
##    Murder   Assault  UrbanPop      Rape 
##  4.355510 83.337661 14.474763  9.366385

We can access the loadings of the principal components as follows.

pca_out$rotation
##                 PC1        PC2        PC3         PC4
## Murder   -0.5358995  0.4181809 -0.3412327  0.64922780
## Assault  -0.5831836  0.1879856 -0.2681484 -0.74340748
## UrbanPop -0.2781909 -0.8728062 -0.3780158  0.13387773
## Rape     -0.5434321 -0.1673186  0.8177779  0.08902432

Similary, we can access the scores of the principal components as follows.

dim(pca_out$x)
## [1] 50  4
head(pca_out$x)
##                   PC1        PC2         PC3          PC4
## Alabama    -0.9756604  1.1220012 -0.43980366  0.154696581
## Alaska     -1.9305379  1.0624269  2.01950027 -0.434175454
## Arizona    -1.7454429 -0.7384595  0.05423025 -0.826264240
## Arkansas    0.1399989  1.1085423  0.11342217 -0.180973554
## California -2.4986128 -1.5274267  0.59254100 -0.338559240
## Colorado   -1.4993407 -0.9776297  1.08400162  0.001450164

We can create a biplot showing the loading vectors and principal component scores (The scale = 0 argument to biplot() ensures that the arrows are scaled to represent the loadings). We can also examine the effect of rotation.

biplot(pca_out, scale = 0, cex = .8)

# Turning things around
pca_out$rotation <- -pca_out$rotation
pca_out$x <- -pca_out$x
biplot(pca_out, scale = 0, cex = .8)