yohandi
A recent graduate with a strong interest in algorithms and data structures.

[Tutorial] Principal Component Analysis

In data science, Principal Component Analysis (PCA) is a popular technique for reducing the dimension of a dataset. Picture this: data can sometimes be overwhelming with its many dimensions, much like a city with many streets and alleys. In this analogy, PCA doesn't just show us every winding alley; it points us to the main roads and iconic landmarks.

In more technical terms, PCA transforms the data into a set of new variables known as principal components. Each of these components is ranked by its significance, meaning the most impactful ones capture the majority of the data's variance. As a result, patterns, trends, and outliers become more evident in the reduced data, or are even discovered for the first time, so one can simply focus on these main components.

Principal Component Analysis

The primary objective of PCA is to reduce the complexity of data while preserving as much information as possible. So, instead of dealing with, let's say, 100 different factors, we can just look at the main 2 or 3 that capture most of the story. A simple analogy can be found in recognizing a face: "I know that all faces have a similar structure, but this unique combination of nose and eyebrows is John's, for sure!"

The concepts and derivations presented in the following are inspired by and adapted from a session on Principal Component Analysis that I attended in 2022, delivered by Prof. Baoyuan Wu. Any interpretations, extensions, or errors are my own.

There are two objectives that we are after with PCA. First, when reducing the dimension of a dataset, we want to make the variance as large as possible. The idea is to spread the data points as widely as possible while keeping their positions relative to each other, so that their similarity structure is preserved in the reduced data. Second, we want to make the reconstruction error as small as possible, making sure that we don't lose too much of the original information. The derivations of both ideas are shown below.

Derivation by Maximal Variance

As a conceptual overview, let's first understand the main reason for wanting the variance to be maximal. What does it actually mean to maximize it?

When we are reducing the dimensions of data using PCA, we are essentially trying to condense the information. By maximizing variance, we ensure that we are focusing on the most crucial patterns or structures in the data. By focusing on high variance, PCA gives us a summary of our data.

Suppose we have a dataset $\mathcal{D} = \{x_1, x_2, \ldots, x_n\}$, with each $x_i \in \mathbb{R}^d$, and we want to map it into a $K$-dimensional subspace ($K < d$) such that the variance of the reconstructions is maximal, i.e.,

$$\max \; \frac{1}{n} \sum_{i=1}^{n} \left\| z_i - \bar{z} \right\|^2,$$

where $\bar{z} = \frac{1}{n} \sum_{i=1}^{n} z_i$ and $\{z_1, z_2, \ldots, z_n\}$, $z_i \in \mathbb{R}^K$, is the new projected dataset. Of course, the projected $x_i$, denoted $\hat{x}_i$, is spanned by an orthonormal basis $\{u_1, u_2, \ldots, u_K\}$ where:

  • $u_j \in \mathbb{R}^d$,
  • $u_j^\top u_j = 1$ for every $j$, and
  • $u_j^\top u_k = 0$ for every $j \neq k$.

Denote the mean by $\mu = \frac{1}{n} \sum_{i=1}^{n} x_i$ and the empirical covariance matrix by $S = \frac{1}{n} \sum_{i=1}^{n} (x_i - \mu)(x_i - \mu)^\top$. Then, writing $z_i = [u_1, u_2, \ldots, u_K]^\top x_i$,

$$\frac{1}{n} \sum_{i=1}^{n} \left\| z_i - \bar{z} \right\|^2 = \frac{1}{n} \sum_{i=1}^{n} \sum_{j=1}^{K} \left( u_j^\top (x_i - \mu) \right)^2 = \sum_{j=1}^{K} u_j^\top S u_j.$$

This implies that:

$$\max_{u_1, \ldots, u_K} \; \sum_{j=1}^{K} u_j^\top S u_j \quad \text{subject to} \quad u_j^\top u_j = 1, \quad j = 1, \ldots, K.$$
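To see this identity in action, here is a minimal numerical sketch (using NumPy on a randomly generated toy dataset and an arbitrary orthonormal basis, both of which are assumptions for illustration only) checking that the variance of the projected points matches $\sum_{j} u_j^\top S u_j$:

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))               # toy dataset: 100 points in R^5
mu = X.mean(axis=0)
S = (X - mu).T @ (X - mu) / len(X)          # empirical covariance matrix S

# An arbitrary orthonormal basis of a K = 2 subspace, obtained via QR
Q, _ = np.linalg.qr(rng.normal(size=(5, 2)))

Z = (X - mu) @ Q                            # projected (centered) points z_i
projected_variance = (Z ** 2).sum() / len(X)
quadratic_form = sum(Q[:, j] @ S @ Q[:, j] for j in range(Q.shape[1]))

print(np.isclose(projected_variance, quadratic_form))   # True

The identity holds for any orthonormal basis; PCA is the specific choice of basis that makes this quantity as large as possible.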

Derivation by Minimal Reconstruction Error

Again, as a conceptual overview, let's first understand the main reason for wanting the reconstruction error to be minimal.

When we project our data onto a lower-dimensional subspace, we are essentially compressing the data, resulting in some information loss. The reconstruction error measures this loss: the difference between the original data and the data reconstructed from its lower-dimensional representation. Ideally, we want this error to be as small as possible, implying our compressed representation is a good approximation of the original data.

By definition, our objective is

$$\min_{u_1, \ldots, u_K} \; \frac{1}{n} \sum_{i=1}^{n} \left\| x_i - \hat{x}_i \right\|^2, \quad \text{where} \quad \hat{x}_i = \mu + \sum_{j=1}^{K} \left( u_j^\top (x_i - \mu) \right) u_j.$$
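For concreteness, here is a small sketch (the same hypothetical toy setup as above, repeated so the snippet runs on its own) that reconstructs a single point from its $K$-dimensional coordinates and measures its squared reconstruction error:

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))                 # toy dataset: 100 points in R^5
mu = X.mean(axis=0)
Q, _ = np.linalg.qr(rng.normal(size=(5, 2)))  # arbitrary orthonormal basis, K = 2

x = X[0]
x_hat = mu + Q @ (Q.T @ (x - mu))             # reconstruction from the K-dimensional coordinates
squared_error = np.linalg.norm(x - x_hat) ** 2

print(x_hat, squared_error)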

Equivalence of Both Derivations

By the Pythagorean theorem (the reconstruction $\hat{x}_i - \mu$ lies in the subspace, while the residual $x_i - \hat{x}_i$ is orthogonal to it),

$$\frac{1}{n} \sum_{i=1}^{n} \left\| x_i - \mu \right\|^2 = \frac{1}{n} \sum_{i=1}^{n} \left\| \hat{x}_i - \mu \right\|^2 + \frac{1}{n} \sum_{i=1}^{n} \left\| x_i - \hat{x}_i \right\|^2.$$

Since the left-hand side, $\frac{1}{n} \sum_{i=1}^{n} \left\| x_i - \mu \right\|^2$, is a constant that does not depend on the choice of basis, maximizing the projected variance (the first term on the right) is the same as minimizing the reconstruction error (the second term). We conclude that the objectives of both derivations are equivalent.
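This decomposition is easy to verify numerically. The following sketch (again on a random toy dataset with an arbitrary orthonormal basis, both assumptions for illustration) checks that the total variance splits exactly into the projected variance plus the reconstruction error:

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))                 # toy dataset: 100 points in R^5
mu = X.mean(axis=0)
Q, _ = np.linalg.qr(rng.normal(size=(5, 2)))  # arbitrary orthonormal basis, K = 2

X_hat = mu + (X - mu) @ Q @ Q.T               # reconstructions of all points

total_variance = ((X - mu) ** 2).sum() / len(X)
projected_variance = ((X_hat - mu) ** 2).sum() / len(X)
reconstruction_error = ((X - X_hat) ** 2).sum() / len(X)

print(np.isclose(total_variance, projected_variance + reconstruction_error))  # True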

Lagrange Function Formulation

I touched upon Lagrange Relaxation in a previous blog post, which you can find here. If this topic interests you, I recommend starting there for some foundational understanding.

Recall that, for each direction $u_j$, our objective is

$$\max_{u_j} \; u_j^\top S u_j \quad \text{subject to} \quad u_j^\top u_j = 1,$$

where $S = \frac{1}{n} \sum_{i=1}^{n} (x_i - \mu)(x_i - \mu)^\top$ is the empirical covariance matrix. The formulation of our optimization problem as a Lagrange function is as follows:

$$L(u_j, \lambda_j) = u_j^\top S u_j + \lambda_j \left( 1 - u_j^\top u_j \right),$$

where $\lambda_j$ is the Lagrange multiplier associated with the constraint $u_j^\top u_j = 1$. Then, the optimal solution satisfies:

$$\frac{\partial L}{\partial u_j} = 2 S u_j - 2 \lambda_j u_j = 0 \quad \Longrightarrow \quad S u_j = \lambda_j u_j.$$
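Left-multiplying the stationarity condition $S u_j = \lambda_j u_j$ by $u_j^\top$ and using the constraint $u_j^\top u_j = 1$ gives

$$u_j^\top S u_j = \lambda_j \, u_j^\top u_j = \lambda_j,$$

so the variance captured along $u_j$ is exactly the corresponding eigenvalue $\lambda_j$.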

Singular Value Decomposition

If you are unfamiliar with Singular Value Decomposition, I recommend referring to some resources, one of which is this very blog.

The previous result suggests that the optimal solution $u_j$ must be one of the eigenvectors of $S$. We can then perform Singular Value Decomposition on $S$, as demonstrated below:

$$S = U \Lambda U^\top = \sum_{j=1}^{d} \lambda_j u_j u_j^\top,$$

where $\lambda_j$ denotes the $j$-th largest value in $\Lambda = \operatorname{diag}(\lambda_1, \lambda_2, \ldots, \lambda_d)$, with $\lambda_1 \geq \lambda_2 \geq \cdots \geq \lambda_d \geq 0$. Taking the eigenvectors $u_1, u_2, \ldots, u_K$ associated with the $K$ largest eigenvalues therefore maximizes $\sum_{j=1}^{K} u_j^\top S u_j = \sum_{j=1}^{K} \lambda_j$, and these are exactly the principal components we keep.
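Since $S$ is symmetric and positive semi-definite, its singular values coincide with its eigenvalues, so either decomposition can be used in practice. Here is a quick sketch (on a randomly generated covariance matrix, purely an assumption for demonstration) showing that np.linalg.svd and np.linalg.eigh agree on such a matrix:

import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(100, 5))
S = A.T @ A / len(A)                      # a symmetric positive semi-definite matrix

_, singular_values, _ = np.linalg.svd(S)  # singular values, in descending order
eigenvalues, _ = np.linalg.eigh(S)        # eigenvalues, in ascending order

print(np.allclose(singular_values, eigenvalues[::-1]))  # True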

Implementation

import numpy as np
from numpy.linalg import eigh

def PCA(D, K):
    n = len(D)
    # Store samples as columns: D becomes a (d, n) matrix
    D = np.transpose(np.array(D, dtype=float))

    # Center the data around the empirical mean
    mu = np.mean(D, axis=1)[:, None]
    D_mu = D - mu

    # Calculate the empirical covariance matrix (d x d)
    empirical_covariance = np.dot(D_mu, np.transpose(D_mu)) / n

    # Eigendecomposition of the symmetric covariance matrix
    # (equivalent to its SVD, since the matrix is symmetric positive semi-definite)
    eigenvalues, eigenvectors = eigh(empirical_covariance)

    # Take the eigenvectors associated with the K largest eigenvalues
    sorted_indices = np.argsort(eigenvalues)[::-1]
    U = eigenvectors[:, sorted_indices[:K]]

    # Project the centered data onto the K principal directions
    return np.transpose(np.matmul(np.transpose(U), D_mu)).tolist()

D = [
    [-1, 2, -3],
    [2, 0, -1],
    [1, 1, 1],
    [-1, -2, 0],
    [2, 1, 3],
    [0, -1, 2],
    [-2, 1, -1],
    [1, -2, 2],
    [3, 0, -2],
    [0, 1, 1]
]
print(PCA(D, 1))
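As a sanity check, assuming scikit-learn is available (the post itself does not require it), the result can be compared against sklearn.decomposition.PCA. The two should agree up to the sign of each component, since an eigenvector's direction is only determined up to a factor of ±1. Continuing from the script above:

from sklearn.decomposition import PCA as SklearnPCA

reference = SklearnPCA(n_components=1).fit_transform(np.array(D))
ours = np.array(PCA(D, 1))

# The sign of a principal component is arbitrary, so compare absolute values
print(np.allclose(np.abs(reference), np.abs(ours)))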

Closing

We have seen how Principal Component Analysis can simplify complex data. By focusing on what's essential and reducing unnecessary noise, PCA provides a clearer perspective on the underlying patterns in our data. It's a key tool in data science, but it's important to use it wisely, ensuring we don't overlook critical details in our quest for simplicity.

References

  1. Baoyuan Wu, "Principal Component Analysis," The Chinese University of Hong Kong, Shenzhen, 2022.
  2. Hervé Abdi, Lynne J. Williams, "Principal component analysis," Wiley Interdisciplinary Reviews: Computational Statistics, 2010. https://doi.org/10.1002/wics.101.

© 2023 Yohandi. All rights reserved.