Analysis on Grocery price
1. Explore Data
data <- read.csv("/Users/keonhoseo/Documents/Q2/STAT 630/Week 3/food.csv")
data
rownames(data)=data[,1]
data=data[,-1]
unscaled_data <- data
data=scale(data)
data=as.data.frame(data)
#### Step 1 , Explore the data
dim(data)
var1=var(data$Bread)
var2=var(data$Hamburger)
var3=var(data$Butter)
var4=var(data$Apples)
var5=var(data$Tomato)
This dataset contains grocery price in U.S cities table. It has information about 24 U.S cities and 5 kinds of groceries. So I can consider it is 5 dimensions of data and it is difficult to identify comparable cities. The purpose of conducting PCA here is to explain the cities with a few variables, then we can compare the cities easily.
2. Define the problem in terms of PCs
sigma <- var(data)
sigma
vars=diag(sigma)
percentvars=vars/sum(vars)
The data is standardized above and compute covariances, because we want to know how
3. Compute all Eigenvalues/Eigenvectors
eigenvalues=eigen(sigma)$values
eigenvectors=eigen(sigma)$vectors
eigenvalues
eigenvectors
y=as.matrix(data)%*%eigenvectors
sum(vars)
sum(eigenvalues)
Decompose dataset with Eigenvectors and Eigenvalues. Eigenvectors mean directions of variables, so it indicates relations between variables and principal components. Eigenvalues reveal the amount of paired eigenvector can explain origin variances.
On here, all grocery has a negative relation wiht PC1, and only hamburger and tomato have positive relationship with PC2.
4. Check variance estimates of the pcs and all other properties
percentvars_pc = eigenvalues / sum(eigenvalues)
percentvars_pc
It reveals each eigenvalues’ proportion out of total eigenvalue. First two principal components possess about 67.37% of the value, which means the two components clarify 67.37% of variances in data.
ts.plot(cbind(percentvars,percentvars_pc),col=c("blue","red"),xlab="ith vector",ylab="percent variance")
Graph a scree plot. Red line shows eigenvalues and the slope is getting slower after first 2 components. We can decide the first two components as variables for PCA.
5. Check correlation between components
y1=as.matrix(data)%*%(eigenvectors[,1])
y2=as.matrix(data)%*%(eigenvectors[,2])
y3=as.matrix(data)%*%(eigenvectors[,3])
rho_pc <- cor(y)
rho_pc
This is correlations between principal components. Off-diagonal values are converging to 0, we don’t have to worry about multicollinearity on the components.
6. Regression
Let’s compare explanatory power for variances of origin variables and Principal components through regression analysis.
set.seed(1002)
dv=rowSums(unscaled_data)+rnorm(24,mean=0,sd=10)
summary(lm(dv~as.matrix(data)))
summary(lm(dv~as.matrix(y)))
If we put complete dataset of origin and principal components into regression analysis, R-squared value and residual standard error are same in both analysis. This means component analysis explains whole varainces in origin data and PCA only finds linear lines through linear combination, not manipulating origin data.
#cor(dv,data)
summary(lm(dv~y1+y2))
summary(lm(dv~data$Hamburger+data$Tomato))
Let’s put Hamberguer and Tomato into regression as input and compare with principal components regression result. Hambugrger and Tomato combination get 0.7652 R-squared score, while the first two principal components scored 0.9259. So, we can say two of the PCA components explain more variances than any of two origin variables.
7. Draw a plot
plot(y1,y2)
text(y1,y2, labels = rownames(data), pos = 3)
Observable data are scattered by following principal components value. X-axis(PC1) explains about 48.8% of variances, and Y-axis(PC2) accounts for 18.6% of variances.
Since every grocery variables have negative relations with PC1, positioning on left of X-axis means expensive in general. On the contrary, grocery price in the cities on right of X-axis should be cheaper. So livinging in Anchorage and Honolulu should cost a lot, and it makes sense.
On PC2(Y-axis), the most important variable on this is apple. Apple has a high negative relation with PC2. So we can expect the apple price in the cities on upper in Y-axis is cheaper than cities in bottom. Also, PC2 has positive relations with Hamburger and Tomato, which implies if Hamburger and Tomato are expensive the city should be on upper in Y-axis.
Through the PCA plot, it is easy to compare grocery prices in U.S cities and guesses living costs in the cities.
8. Unstandardized
data <- read.csv("/Users/keonhoseo/Documents/Q2/STAT 630/Week 3/food.csv")
rownames(data)=data[,1]
data=data[,-1]
rho=cor(data)
## Step 2 - Define the problem in terms of principal components
sigma=var(data)
vars=diag(sigma)
percentvars=vars/sum(vars)
## Step 3 - Compute all the eigenvalues and eigenvectors in R
eigenvalues=eigen(sigma)$values
eigenvectors=eigen(sigma)$vectors
# define principal componenets
y1=as.matrix(data)%*%(eigenvectors[,1])
y2=as.matrix(data)%*%(eigenvectors[,2])
y=as.matrix(data)%*%eigenvectors
set.seed(1002)
dv=rowSums(data)+rnorm(24,mean=0,sd=10)
summary(lm(dv~y1+y2))
summary(lm(dv~data$Hamburger+data$Tomato))
The PCA regression results above are little bit different from the one that we saw in #6.Regression, while origin data regression is same. This is because I put unstandardized data as an input into regression. PCA works by finding maximum variances. So if data is not scaled, results should be distorted.
plot(y1,y2)
text(y1,y2, labels = rownames(data), pos = 4)
Also a plot is quite similar with standardized one, but little different from the scaled one.