October 10, 2023 by Yohandi
In data science, Principal Component Analysis (PCA) is a popular technique for reducing the dimension of a dataset. Picture this: data can sometimes be overwhelming with its many dimensions, much like a city with many streets and alleys. In this analogy, PCA doesn't trace every winding alley; it points us to the main roads and iconic landmarks.
In more technical terms, PCA transforms the data into a set of new variables known as principal components. These components are ranked by significance: the most impactful ones capture the majority of the data's variance. As a result, patterns, trends, and outliers become more evident, or are even newly discovered, in the reduced data, so one can simply focus on these main components.
The primary objective of PCA is to reduce the complexity of data while preserving as much information as possible. So, instead of dealing with, let's say, 100 different factors, we can just look at the main 2 or 3 that capture most of the story. A simple analogy can be found when recognizing a face: "I know that all faces have similar structures, but this uniqueness of nose and eyebrow combo is John's, for sure!"
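To make the "100 factors down to 2 or 3" picture concrete, here is a small numpy-only sketch on synthetic data; the dimensions, noise level, and variable names are my own illustration, not taken from any particular application:

```python
import numpy as np

# 100 observed "factors" that are really driven by just 2 hidden ones,
# so 2 principal components capture most of the story.
rng = np.random.default_rng(0)

n, d, k_true = 500, 100, 2
hidden = rng.normal(size=(n, k_true))                 # 2 underlying factors
mixing = rng.normal(size=(k_true, d))                 # how factors appear in 100 features
X = hidden @ mixing + 0.05 * rng.normal(size=(n, d))  # plus a little noise

# Eigenvalues of the empirical covariance matrix = variance per component
Xc = X - X.mean(axis=0)
cov = Xc.T @ Xc / n
eigenvalues = np.sort(np.linalg.eigvalsh(cov))[::-1]

explained = eigenvalues[:2].sum() / eigenvalues.sum()
print(f"variance captured by top 2 of 100 components: {explained:.1%}")
```

With almost all of the variance concentrated in two components, the remaining 98 directions carry little more than noise.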
The concepts and derivations presented in the following are inspired by and adapted from a session on Principal Component Analysis that I attended in 2022, delivered by Prof. Baoyuan Wu. Any interpretations, extensions, or errors are my own.
There are two objectives that PCA pursues. First, when reducing the dimension of a dataset, we want the variance of the projected data to be as large as possible. The idea is to spread the data points as widely as possible while keeping their relative positions, so that their similarities are preserved after the reduction. Second, we want the reconstruction error to be as small as possible, making sure that we don't lose too much of the original information. The derivations of both ideas are shown below.
As a conceptual overview, let's first understand the main reason for wanting the variance to be maximal. What does it mean to maximize it?
When we reduce the dimensions of data using PCA, we are essentially trying to condense the information. By maximizing variance, we ensure that we are focusing on the most crucial patterns and structures in the data. By focusing on high variance, PCA gives us a faithful summary of our data.
Suppose we have a dataset $D = \{x_1, x_2, \ldots, x_n\}$, where $x_i \in \mathbb{R}^d$.

Denote the sample mean and the empirical covariance matrix by

$$\mu = \frac{1}{n} \sum_{i=1}^{n} x_i, \qquad S = \frac{1}{n} \sum_{i=1}^{n} (x_i - \mu)(x_i - \mu)^\top.$$

Projecting each centered point onto a unit direction $u \in \mathbb{R}^d$ (that is, $u^\top u = 1$) gives the scalars $u^\top (x_i - \mu)$, whose sample variance is

$$\frac{1}{n} \sum_{i=1}^{n} \left( u^\top (x_i - \mu) \right)^2 = u^\top S u.$$

This implies that maximizing the variance of the projected data amounts to solving:

$$\max_{u \in \mathbb{R}^d} \; u^\top S u \quad \text{subject to} \quad u^\top u = 1.$$
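The claim that the projected variance $u^\top S u$ is largest along the top eigenvector of the empirical covariance matrix can be checked numerically. The sketch below uses synthetic anisotropic data of my own choosing and compares the top eigenvector against a crude random search over unit vectors:

```python
import numpy as np

rng = np.random.default_rng(1)
# Synthetic toy data: most variance deliberately placed along axis 0
X = rng.normal(size=(200, 3)) @ np.diag([3.0, 1.0, 0.3])

Xc = X - X.mean(axis=0)
S = Xc.T @ Xc / len(X)                         # empirical covariance matrix

eigenvalues, eigenvectors = np.linalg.eigh(S)  # eigenvalues in ascending order
u_star = eigenvectors[:, -1]                   # direction of largest variance

# No random unit vector attains a larger projected variance u^T S u
best_random = 0.0
for v in rng.normal(size=(1000, 3)):
    u = v / np.linalg.norm(v)
    best_random = max(best_random, u @ S @ u)

print(u_star @ S @ u_star >= best_random)      # True: u_star is optimal
```

The projected variance at `u_star` equals the largest eigenvalue of $S$, which no other unit vector can exceed (the Rayleigh quotient is bounded by the top eigenvalue).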
Again, as a conceptual overview, let's first understand why we want the reconstruction error to be minimal.
When we project our data onto a lower-dimensional subspace, we are essentially compressing the data, resulting in some information loss. The reconstruction error measures this loss: the difference between the original data and the data reconstructed from its lower-dimensional representation. Ideally, we want this error to be as small as possible, implying our compressed representation is a good approximation of the original data.
By definition, our objective is

$$\min_{U} \; \frac{1}{n} \sum_{i=1}^{n} \left\| (x_i - \mu) - U U^\top (x_i - \mu) \right\|^2 \quad \text{subject to} \quad U^\top U = I,$$

where $U \in \mathbb{R}^{d \times K}$ has orthonormal columns and $\mu + U U^\top (x_i - \mu)$ is the reconstruction of $x_i$ from its $K$-dimensional representation $U^\top (x_i - \mu)$.

By the Pythagorean theorem, the projection $U U^\top (x_i - \mu)$ and the residual $(x_i - \mu) - U U^\top (x_i - \mu)$ are orthogonal, so

$$\left\| x_i - \mu \right\|^2 = \left\| U U^\top (x_i - \mu) \right\|^2 + \left\| (x_i - \mu) - U U^\top (x_i - \mu) \right\|^2.$$

Since $\frac{1}{n} \sum_{i=1}^{n} \| x_i - \mu \|^2$ is a constant that does not depend on $U$, minimizing the reconstruction error is equivalent to maximizing the projected variance $\frac{1}{n} \sum_{i=1}^{n} \| U^\top (x_i - \mu) \|^2$; both objectives lead to exactly the same solution.
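This Pythagorean split between projected variance and reconstruction error can be verified numerically. In the sketch below (synthetic data and a choice of $K = 2$ that are my own), the average squared reconstruction error plus the projected variance equals the total variance:

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 4))
Xc = X - X.mean(axis=0)                        # centered data
n = len(X)

S = Xc.T @ Xc / n                              # empirical covariance matrix
eigenvalues, eigenvectors = np.linalg.eigh(S)
U = eigenvectors[:, -2:]                       # top-2 principal directions (K = 2)

# Reconstruction of the centered data from its 2-dimensional representation
X_hat = Xc @ U @ U.T
reconstruction_error = np.mean(np.sum((Xc - X_hat) ** 2, axis=1))

# Pythagorean split: total variance = projected variance + reconstruction error
total_variance = np.mean(np.sum(Xc ** 2, axis=1))
projected_variance = np.mean(np.sum(X_hat ** 2, axis=1))
print(np.isclose(total_variance, projected_variance + reconstruction_error))
```

Because the total variance is fixed, driving the reconstruction error down is literally the same as driving the projected variance up.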
I touched upon Lagrange Relaxation in a previous blog post, which you can find here. If this topic interests you, I recommend starting there for some foundational understanding.
Applying it to the variance-maximization problem gives the Lagrangian

$$\mathcal{L}(u, \lambda) = u^\top S u - \lambda \left( u^\top u - 1 \right),$$

where $\lambda$ is the Lagrange multiplier for the unit-norm constraint. Setting $\nabla_u \mathcal{L} = 2 S u - 2 \lambda u = 0$ yields

$$S u = \lambda u,$$

where $u$ must be an eigenvector of $S$ and $\lambda$ its corresponding eigenvalue. At such a stationary point, the objective value is $u^\top S u = \lambda u^\top u = \lambda$, so the variance is maximized by choosing $u$ to be the eigenvector associated with the largest eigenvalue of $S$.
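The stationarity condition that falls out of the Lagrangian can be checked numerically. In the sketch below, a random symmetric positive semidefinite matrix (my own stand-in) plays the role of the covariance matrix:

```python
import numpy as np

rng = np.random.default_rng(3)
A = rng.normal(size=(4, 4))
S = A @ A.T / 4                     # symmetric PSD stand-in for the covariance

eigenvalues, eigenvectors = np.linalg.eigh(S)
u, lam = eigenvectors[:, -1], eigenvalues[-1]

print(np.allclose(S @ u, lam * u))  # stationarity condition S u = λ u
print(np.isclose(u @ S @ u, lam))   # objective value at the optimum is λ
```

Every eigenvector of $S$ is a stationary point of the Lagrangian; the top one is the global maximizer because its objective value is the largest eigenvalue.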
If you are unfamiliar with Singular Value Decomposition, I recommend referring to some resources, one of which is this very blog.
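For readers who prefer to see the connection rather than derive it: the singular values $\sigma_i$ of the centered data matrix and the eigenvalues $\lambda_i$ of the empirical covariance matrix are related by $\lambda_i = \sigma_i^2 / n$, which the following numpy sketch (with synthetic data of my own) confirms:

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(size=(50, 3))
Xc = X - X.mean(axis=0)
n = len(X)

# Route 1: eigendecomposition of the empirical covariance matrix
S = Xc.T @ Xc / n
eigenvalues = np.sort(np.linalg.eigvalsh(S))[::-1]

# Route 2: SVD of the centered data matrix; singular values satisfy λ = σ²/n
singular_values = np.linalg.svd(Xc, compute_uv=False)
print(np.allclose(singular_values ** 2 / n, eigenvalues))
```

In practice, computing the SVD of the centered data directly is often numerically preferable to forming the covariance matrix explicitly.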
The previous derivation suggests that the optimal solution is

$$U = \begin{bmatrix} u_1 & u_2 & \cdots & u_K \end{bmatrix},$$

where $u_1, u_2, \ldots, u_K$ are the eigenvectors of $S$ associated with its $K$ largest eigenvalues $\lambda_1 \geq \lambda_2 \geq \cdots \geq \lambda_K$. The implementation below follows exactly this recipe: center the data, form the empirical covariance matrix, take its top-$K$ eigenvectors, and project.
```python
import numpy as np
from numpy.linalg import eigh


def PCA(D, K):
    n = len(D)
    D = np.transpose(np.array(D, dtype=float))  # shape (d, n): features by samples
    mu = np.mean(D, axis=1)[:, None]            # per-feature mean
    D_mu = D - mu                               # centered data

    # Calculate empirical covariance matrix
    empirical_covariance = np.dot(D_mu, np.transpose(D_mu)) / n

    # Eigendecomposition of the symmetric empirical covariance matrix
    # (eigh is preferred over eig here: it guarantees real eigenvalues)
    eigenvalues, eigenvectors = eigh(empirical_covariance)

    # Take the top K eigenvectors, sorted by decreasing eigenvalue
    sorted_indices = np.argsort(eigenvalues)[::-1]
    U = eigenvectors[:, sorted_indices[:K]]

    # Project the centered data onto the top K principal directions
    return np.transpose(np.matmul(np.transpose(U), D_mu)).tolist()


D = [
    [-1, 2, -3], [2, 0, -1], [1, 1, 1], [-1, -2, 0], [2, 1, 3],
    [0, -1, 2], [-2, 1, -1], [1, -2, 2], [3, 0, -2], [0, 1, 1]
]
print(PCA(D, 1))
```
We have seen how Principal Component Analysis can simplify complex data. By focusing on what's essential and reducing unnecessary noise, PCA provides a clearer perspective on the underlying patterns in our data. It's a key tool in data science, but it's important to use it wisely, ensuring we don't overlook critical details in our quest for simplicity.
© 2023 Yohandi. All rights reserved.