Principal Component Analysis, Part 2#
A comprehensive guide to understanding PCA from mathematical foundations to practical implementation
Principal Component Analysis (PCA) is a fundamental dimensionality reduction technique that identifies an optimal \(r\)-dimensional basis capturing maximum variance in the data. Its mathematical foundation derives from finding orthogonal directions (principal components) that sequentially maximize the projected variance.
Given a centered data matrix \(\mathbf{X}\), we seek unit vectors \(\mathbf{u}\) that maximize the variance of the projected data \(\mathbf{X}\mathbf{u}\):
\[ \max_{\|\mathbf{u}\|=1} \text{Var}(\mathbf{X}\mathbf{u}) = \max_{\|\mathbf{u}\|=1} \mathbf{u}^T \mathbf{C} \mathbf{u} \]
This optimization leads to the key eigenvalue problem:
\[ \mathbf{C}\mathbf{u} = \lambda\mathbf{u} \]
where \(\mathbf{C}\) is the covariance matrix of the data.
In the following sections, we will systematically derive the solution to this problem and demonstrate its geometric interpretation.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
# Generate synthetic data with elliptical shape
np.random.seed(42)
n_samples = 100
mean = [0, 0]
cov = [[1, 0.6], [0.6, 0.5]] # Covariance matrix for elliptical distribution
data = np.random.multivariate_normal(mean, cov, n_samples)
# Center the data
centered_data = data - np.mean(data, axis=0)
# Perform PCA
pca = PCA()
pca.fit(centered_data)
components = pca.components_
explained_variance = pca.explained_variance_ratio_
# Project data onto principal components
projected_data = pca.transform(centered_data)
# Create figure
plt.figure(figsize=(12, 8))
# Plot original data
plt.scatter(centered_data[:, 0], centered_data[:, 1], alpha=0.5, label='Original Data', color='blue')
# Plot principal components as elegant vectors
pc_scale = 2.1 # Scaling factor for PC vectors
plt.quiver(0, 0, components[0,0]*pc_scale, components[0,1]*pc_scale,
color='lightcoral', scale=1, scale_units='xy', angles='xy', width=0.002,
label=f'PC1 (Variance: {100*explained_variance[0]:.1f}%)')
plt.quiver(0, 0, components[1,0]*pc_scale, components[1,1]*pc_scale,
color='lightgreen', scale=1, scale_units='xy', angles='xy', width=0.002,
label=f'PC2 (Variance: {100*explained_variance[1]:.1f}%)')
# Plot projections onto PC1 (first principal component)
for point in centered_data:
    projection_pc1 = np.dot(point, components[0]) * components[0]
    plt.plot([point[0], projection_pc1[0]], [point[1], projection_pc1[1]],
             'r--', alpha=0.3, linewidth=0.7)
# Plot projections onto PC2 (second principal component)
for point in centered_data:
    projection_pc2 = np.dot(point, components[1]) * components[1]
    plt.plot([point[0], projection_pc2[0]], [point[1], projection_pc2[1]],
             'g--', alpha=0.2, linewidth=0.6)
# Plot the projected points on PC1
projected_pc1 = np.dot(projected_data[:, 0][:, np.newaxis], components[0:1, :])
plt.scatter(projected_pc1[:, 0], projected_pc1[:, 1], color='red', alpha=0.7,
label='Projections on PC1', s=30)
# Plot the projected points on PC2
projected_pc2 = np.dot(projected_data[:, 1][:, np.newaxis], components[1:2, :])
plt.scatter(projected_pc2[:, 0], projected_pc2[:, 1], color='green', alpha=0.7,
label='Projections on PC2', s=30)
# Add annotations and legend
plt.axhline(0, color='black', linestyle='--', alpha=0.2)
plt.axvline(0, color='black', linestyle='--', alpha=0.2)
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.title('PCA: Principal Components and Data Projections\n'
f'PC1 explains {100*explained_variance[0]:.1f}% of variance\n'
f'PC2 explains {100*explained_variance[1]:.1f}% of variance')
plt.legend(loc="lower right")
plt.grid(alpha=0.2)
plt.axis('equal')
plt.show()

Linear Algebra Foundations#
Data Matrix Representation#
A data matrix \(\mathbf{X}\) with \(n\) rows (samples) and \(d\) columns (features) can be written as:
\[ \mathbf{X} = \begin{bmatrix} x_{11} & x_{12} & \cdots & x_{1d} \\ x_{21} & x_{22} & \cdots & x_{2d} \\ \vdots & \vdots & \ddots & \vdots \\ x_{n1} & x_{n2} & \cdots & x_{nd} \end{bmatrix} = \begin{bmatrix} \mathbf{x}_1^T \\ \mathbf{x}_2^T \\ \vdots \\ \mathbf{x}_n^T \end{bmatrix} \]
Each row \(\mathbf{x}_i = (x_{i1}, x_{i2}, \ldots, x_{id})^T\) represents one sample in \(d\)-dimensional space.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris, make_blobs
from sklearn.decomposition import PCA
import seaborn as sns
# Set style for better plots
plt.style.use('seaborn-v0_8')
np.random.seed(42)
Vector Operations#
Dot Product#
\[ \mathbf{a}^T \mathbf{b} = \sum_{i=1}^d a_i b_i \]
Vector Length (Norm)#
\[ \|\mathbf{a}\| = \sqrt{\mathbf{a}^T \mathbf{a}} = \sqrt{\sum_{i=1}^d a_i^2} \]
Unit Vector#
\[ \hat{\mathbf{a}} = \frac{\mathbf{a}}{\|\mathbf{a}\|}, \qquad \|\hat{\mathbf{a}}\| = 1 \]
Distance between Vectors#
\[ d(\mathbf{a}, \mathbf{b}) = \|\mathbf{a} - \mathbf{b}\| \]
Angle between Vectors#
\[ \cos\theta = \frac{\mathbf{a}^T \mathbf{b}}{\|\mathbf{a}\| \|\mathbf{b}\|} \]
# Example: Vector operations
a = np.array([3, 4])
b = np.array([1, 2])
# Dot product
dot_product = np.dot(a, b)
print(f"Dot product: {dot_product}")
# Vector norms
norm_a = np.linalg.norm(a)
norm_b = np.linalg.norm(b)
print(f"Norm of a: {norm_a:.2f}")
print(f"Norm of b: {norm_b:.2f}")
# Angle between vectors
cos_theta = dot_product / (norm_a * norm_b)
theta_rad = np.arccos(cos_theta)
theta_deg = np.degrees(theta_rad)
print(f"Angle between vectors: {theta_deg:.2f} degrees")
Dot product: 11
Norm of a: 5.00
Norm of b: 2.24
Angle between vectors: 10.30 degrees
Vector Projections (Derivation & Geometric Intuition)#
Derivation of Projection Formula#
We begin with fundamental vector operations to derive the projection formula naturally:
Dot Product and Angle Relationship:
From our previous definitions, we know:
\[ \mathbf{a}^T \mathbf{b} = \|\mathbf{a}\| \|\mathbf{b}\| \cos \theta \]
This measures alignment between vectors.
Scalar Projection (Length of Shadow):
The length of \(\mathbf{b}\)’s shadow along \(\mathbf{a}\) is:
\[ \text{Scalar projection} = \|\mathbf{b}\| \cos \theta = \frac{\mathbf{a}^T \mathbf{b}}{\|\mathbf{a}\|} \]
For unit vectors, this simplifies to \(\mathbf{a}^T \mathbf{b}\).
Vector Projection:
To get the vector form, we multiply the scalar projection by \(\mathbf{a}\)’s unit vector:
\[ \mathbf{b}_{\parallel} = \left( \frac{\mathbf{a}^T \mathbf{b}}{\|\mathbf{a}\|} \right) \left( \frac{\mathbf{a}}{\|\mathbf{a}\|} \right) = \frac{\mathbf{a}^T \mathbf{b}}{\|\mathbf{a}\|^2} \mathbf{a} \]
Final Projection Formula#
\[ \mathbf{b}_{\parallel} = \frac{\mathbf{a}^T \mathbf{b}}{\|\mathbf{a}\|^2} \, \mathbf{a} \]
Key Insights:
Dot Product Interpretation:
The numerator \(\mathbf{a}^T \mathbf{b}\) measures alignment
The denominator \(\|\mathbf{a}\|^2\) normalizes for vector length
Special Case - Unit Vector: When \(\|\mathbf{a}\|=1\):
\[ \mathbf{b}_{\parallel} = (\mathbf{a}^T \mathbf{b}) \mathbf{a} \]
Geometric Meaning:
\(\mathbf{b}_{\parallel}\) is \(\mathbf{b}\)’s shadow on \(\mathbf{a}\)
The residual \(\mathbf{b}_{\perp} = \mathbf{b} - \mathbf{b}_{\parallel}\) is orthogonal to \(\mathbf{a}\)
Example
Let \(\mathbf{a} = \begin{bmatrix} 1 \\ 0 \end{bmatrix}\) and \(\mathbf{b} = \begin{bmatrix} 3 \\ 2 \end{bmatrix}\).
Step 1: Compute \(\mathbf{a}^T \mathbf{b} = 1 \times 3 + 0 \times 2 = 3\).
Step 2: Compute \(\|\mathbf{a}\|^2 = 1^2 + 0^2 = 1\).
Step 3: Projection \(\mathbf{b}_{\parallel} = 3\begin{bmatrix} 1 \\ 0 \end{bmatrix} = \begin{bmatrix} 3 \\ 0 \end{bmatrix}\).
Perpendicular Component: \(\mathbf{b}_{\perp} = \begin{bmatrix} 3 \\ 2 \end{bmatrix} - \begin{bmatrix} 3 \\ 0 \end{bmatrix} = \begin{bmatrix} 0 \\ 2 \end{bmatrix}\).
Visualization:
\(\mathbf{b}_{\parallel}\) retains the x-component of \(\mathbf{b}\) (since \(\mathbf{a}\) is along the x-axis).
\(\mathbf{b}_{\perp}\) captures the remaining y-component, which is orthogonal to \(\mathbf{a}\).
Next Steps:
We will now see how this projection concept extends to multiple dimensions and leads to the PCA solution via eigenvalue decomposition.
# Example: Vector projection
def project_vector(b, a):
    """Project vector b onto vector a."""
    projection = (np.dot(a, b) / np.dot(a, a)) * a
    perpendicular = b - projection
    return projection, perpendicular
# Example vectors
a = np.array([1, 0]) # Direction vector
b = np.array([3, 2]) # Vector to project
b_parallel, b_perp = project_vector(b, a)
print(f"Original vector b: {b}")
print(f"Projection onto a: {b_parallel}")
print(f"Perpendicular component: {b_perp}")
print(f"Verification (should equal b): {b_parallel + b_perp}")
Original vector b: [3 2]
Projection onto a: [3. 0.]
Perpendicular component: [0. 2.]
Verification (should equal b): [3. 2.]
plt.rcParams['font.family'] = 'DejaVu Sans'
# Visualization of vector projection
plt.figure(figsize=(10, 5))
# Plot vectors
plt.quiver(0, 0, a[0], a[1], angles='xy', scale_units='xy', scale=1, color='red', width=0.005, label='Direction vector a')
plt.quiver(0, 0, b[0], b[1], angles='xy', scale_units='xy', scale=1, color='blue', width=0.003, label='Original vector b')
plt.quiver(0, 0, b_parallel[0], b_parallel[1], angles='xy', scale_units='xy', scale=1, color='green', width=0.003, label='Projection b∥')
plt.quiver(b_parallel[0], b_parallel[1], b_perp[0], b_perp[1], angles='xy', scale_units='xy', scale=1, color='orange', width=0.003, label='Perpendicular b⊥')
# Add dotted line to show projection
plt.plot([b[0], b_parallel[0]], [b[1], b_parallel[1]], 'k--', alpha=0.5)
# Set axis limits with some padding
padding = 0.5
plt.xlim(min(0, b[0], b_parallel[0]) - padding, max(a[0], b[0], b_parallel[0]) + padding)
plt.ylim(min(0, b[1], b_parallel[1]) - padding, max(a[1], b[1], b_parallel[1]) + padding)
# plt.xlim(-0.5, 3.5)
# plt.ylim(-0.5, 2.5)
plt.grid(True, alpha=0.3)
plt.axis('equal')
plt.legend()
plt.title('Vector Projection Demonstration')
plt.xlabel('X')
plt.ylabel('Y')
plt.show()
# Print the vectors and their magnitudes
print(f"Vector a: {a}, magnitude: {np.linalg.norm(a):.2f}")
print(f"Vector b: {b}, magnitude: {np.linalg.norm(b):.2f}")
print(f"b_parallel: {b_parallel}, magnitude: {np.linalg.norm(b_parallel):.2f}")
print(f"b_perpendicular: {b_perp}, magnitude: {np.linalg.norm(b_perp):.2f}")

Vector a: [1 0], magnitude: 1.00
Vector b: [3 2], magnitude: 3.61
b_parallel: [3. 0.], magnitude: 3.00
b_perpendicular: [0. 2.], magnitude: 2.00
Basis Transformation#
Change of Basis#
Any vector \(\mathbf{x}\) can be expressed in terms of a new orthonormal basis \(\{\mathbf{u}_1, \mathbf{u}_2, \ldots, \mathbf{u}_d\}\):
\[ \mathbf{x} = \sum_{i=1}^d a_i \mathbf{u}_i = \mathbf{U}\mathbf{a} \]
where \(\mathbf{U}\) is the matrix with columns \(\mathbf{u}_i\) and \(\mathbf{a}\) contains the coordinates in the new basis:
\[ \mathbf{a} = \mathbf{U}^T \mathbf{x} \]
# Example: Basis transformation
# Original vector
x = np.array([0.5, 1, 2])
# Standard basis (i, j, k)
standard_basis = np.eye(3)
i = standard_basis[:, 0]
j = standard_basis[:, 1]
k = standard_basis[:, 2]
print(f"Original vector x: {x}")
print(f"In standard basis: {x[0]}i + {x[1]}j + {x[2]}k")
# Create a new orthonormal basis (rotated 45° in xy-plane)
u1 = np.array([1/np.sqrt(2), 1/np.sqrt(2), 0])
u2 = np.array([-1/np.sqrt(2), 1/np.sqrt(2), 0])
u3 = np.array([0, 0, 1])
# New basis matrix
U = np.column_stack([u1, u2, u3])
print(U)
Original vector x: [0.5 1. 2. ]
In standard basis: 0.5i + 1.0j + 2.0k
[[ 0.70710678 -0.70710678 0. ]
[ 0.70710678 0.70710678 0. ]
[ 0. 0. 1. ]]
# Create 3D plot
fig = plt.figure(figsize=(15, 5))
# Plot 1: Original vector and standard basis
ax1 = fig.add_subplot(121, projection='3d')
ax1.set_title('Original Vector and Standard Basis')
# Plot standard basis vectors
ax1.quiver(0, 0, 0, i[0], i[1], i[2], color='r', label='i')
ax1.quiver(0, 0, 0, j[0], j[1], j[2], color='g', label='j')
ax1.quiver(0, 0, 0, k[0], k[1], k[2], color='b', label='k')
# Plot original vector
ax1.quiver(0, 0, 0, x[0], x[1], x[2], color='purple', label='x')
# Plot xy-plane for reference
xx, yy = np.meshgrid(range(-2, 3), range(-2, 3))
zz = np.zeros_like(xx)
ax1.plot_surface(xx, yy, zz, alpha=0.1, color='gray')
# Plot 2: Rotated basis and transformed vector
ax2 = fig.add_subplot(122, projection='3d')
ax2.set_title('Rotated Basis and Transformed Vector')
# Plot rotated basis vectors
ax2.quiver(0, 0, 0, u1[0], u1[1], u1[2], color='r', label='u1')
ax2.quiver(0, 0, 0, u2[0], u2[1], u2[2], color='g', label='u2')
ax2.quiver(0, 0, 0, u3[0], u3[1], u3[2], color='b', label='u3')
# Plot original vector
ax2.quiver(0, 0, 0, x[0], x[1], x[2], color='purple', label='x')
# Plot rotated xy-plane
R = np.array([[1/np.sqrt(2), -1/np.sqrt(2), 0],
[1/np.sqrt(2), 1/np.sqrt(2), 0],
[0, 0, 1]])
rotated_points = R @ np.array([xx.flatten(), yy.flatten(), zz.flatten()])
rotated_xx = rotated_points[0].reshape(xx.shape)
rotated_yy = rotated_points[1].reshape(yy.shape)
rotated_zz = rotated_points[2].reshape(zz.shape)
ax2.plot_surface(rotated_xx, rotated_yy, rotated_zz, alpha=0.1, color='gray')
# Set equal aspects and labels
for ax in [ax1, ax2]:
    ax.set_xlabel('X')
    ax.set_ylabel('Y')
    ax.set_zlabel('Z')
    ax.set_xlim([-2, 2])
    ax.set_ylim([-2, 2])
    ax.set_zlim([-2, 2])
    # ax.set_xticks([])  # Disable x-axis numbers
    ax.set_yticks([])  # Disable y-axis numbers
    ax.set_zticks([])  # Disable z-axis numbers
    ax.legend()
plt.tight_layout()
plt.show()

Rotation Matrix Mathematical Foundation#
2D Rotation Fundamentals#
A rotation matrix transforms vectors by rotating them through a specified angle \(\theta\) while preserving their length. For a rotation in the xy-plane, the fundamental rotation matrix is:
\[ \mathbf{R}(\theta) = \begin{bmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{bmatrix} \]
For \(\theta = 45°\) (\(\pi/4\) radians):
\[ \cos 45° = \sin 45° = \frac{1}{\sqrt{2}} \approx 0.7071 \]
Thus, the 45° rotation matrix becomes:
\[ \mathbf{R}(45°) = \begin{bmatrix} \frac{1}{\sqrt{2}} & -\frac{1}{\sqrt{2}} \\ \frac{1}{\sqrt{2}} & \frac{1}{\sqrt{2}} \end{bmatrix} \approx \begin{bmatrix} 0.7071 & -0.7071 \\ 0.7071 & 0.7071 \end{bmatrix} \]
Basis Transformation#
In the code above, this rotation matrix defines the new basis vectors:
u1 = np.array([1/np.sqrt(2), 1/np.sqrt(2), 0]) # Rotated x-axis
u2 = np.array([-1/np.sqrt(2), 1/np.sqrt(2), 0]) # Rotated y-axis
u3 = np.array([0, 0, 1]) # Unchanged z-axis
These form an orthonormal basis:
\(\mathbf{u}_1\): Original x-axis rotated 45° counterclockwise
\(\mathbf{u}_2\): Original y-axis rotated 45° counterclockwise
\(\mathbf{u}_3\): Original z-axis (unchanged)
Key Properties#
Orthonormality:
\(\mathbf{u}_i^T\mathbf{u}_j = \begin{cases} 1 & \text{if } i=j \\ 0 & \text{otherwise} \end{cases}\)
Determinant: \(\det(\mathbf{R}) = 1\) (preserves orientation)
Inverse: \(\mathbf{R}^{-1} = \mathbf{R}^T\) (transpose reverses rotation)
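As a quick numerical check of these properties (a minimal sketch, not part of the original notebook), we can verify orthonormality, the determinant, and the inverse-equals-transpose rule for the 45° rotation used here:
import numpy as np
# Sanity check of rotation-matrix properties (illustrative sketch)
theta = np.pi / 4  # 45 degrees
R = np.array([[np.cos(theta), -np.sin(theta), 0],
              [np.sin(theta),  np.cos(theta), 0],
              [0,              0,             1]])
print(np.allclose(R.T @ R, np.eye(3)))      # orthonormality: R^T R = I
print(np.isclose(np.linalg.det(R), 1.0))    # determinant = +1 (orientation preserved)
print(np.allclose(np.linalg.inv(R), R.T))   # inverse equals transpose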
Vector Transformation#
For any vector \(\mathbf{x} = [x, y, z]^T\), the rotated version is:
\[ \mathbf{x}' = \mathbf{R}\mathbf{x} \]
This matches the basis transformation approach above, where the coordinates of \(\mathbf{x}\) in the rotated basis are:
\[ \mathbf{a} = \mathbf{U}^T \mathbf{x} \]
Why This Works#
The columns of \(\mathbf{U}\) are the transformed standard basis vectors. When we multiply by \(\mathbf{U}^T\), each coordinate \(a_i = \mathbf{u}_i^T \mathbf{x}\) is simply the projection of \(\mathbf{x}\) onto the corresponding basis vector; multiplying by \(\mathbf{U}\) recombines these coordinates, so \(\mathbf{U}\mathbf{a} = \mathbf{x}\).
# Transform to new basis
a = U.T @ x
print(f"\nIn new basis: {a[0]:.2f}u1 + {a[1]:.2f}u2 + {a[2]:.2f}u3")
# Verify: transform back to original
x_reconstructed = U @ a
print(f"\nReconstructed x: {x_reconstructed}")
print(f"Reconstruction error: {np.linalg.norm(x - x_reconstructed):.10f}")
In new basis: 1.06u1 + 0.35u2 + 2.00u3
Reconstructed x: [0.5 1. 2. ]
Reconstruction error: 0.0000000000
PCA Mathematical Foundation#
Variance Maximization Problem#
Principal Component Analysis (PCA) seeks an orthonormal basis \(\{\mathbf{u}_1, \dots, \mathbf{u}_r\}\) that captures maximal variance in the data through sequential optimization:
First Principal Component:
Find unit vector \(\mathbf{u}_1\) maximizing projected variance:
\[ \max_{\|\mathbf{u}_1\|=1} \text{Var}(\mathbf{X}\mathbf{u}_1) = \max_{\|\mathbf{u}_1\|=1} \frac{1}{n}\|\mathbf{X}\mathbf{u}_1\|^2 \]
Subsequent Components:
Each \(\mathbf{u}_k\) is found by maximizing variance orthogonal to previous components:
\[ \max_{\|\mathbf{u}_k\|=1, \, \mathbf{u}_k \perp \{\mathbf{u}_1, \dots, \mathbf{u}_{k-1}\}} \text{Var}(\mathbf{X}\mathbf{u}_k) \]
Simply put, the first principal component is the single direction along which the data varies the most. Subsequent principal components can be extracted, but each must be orthogonal to the previous ones so that it captures new, independent information.
Finding the first principal component#
We seek a unit vector \(\mathbf{u}\) that maximizes the variance of the projected data:
\[ \sigma^2_{\mathbf{u}} = \frac{1}{n} \sum_{i=1}^n \left( \mathbf{u}^T \mathbf{x}_i - \mu_{\mathbf{u}} \right)^2 \]
where:
\(\mu_{\mathbf{u}} = \frac{1}{n} \sum_{i=1}^n \mathbf{u}^T \mathbf{x}_i = \mathbf{u}^T \left( \frac{1}{n} \sum_{i=1}^n \mathbf{x}_i \right) = \mathbf{u}^T \bar{\mathbf{x}}\) is the mean of projected data
Expanding the variance expression:
Let \(a_i = \mathbf{u}^T \mathbf{x}_i\) be the projected value
The variance becomes:
\[ \sigma^2_{\mathbf{u}} = \frac{1}{n} \sum_{i=1}^n (a_i - \mu_{\mathbf{u}})^2 = \frac{1}{n} \sum_{i=1}^n \left( \mathbf{u}^T \mathbf{x}_i - \mathbf{u}^T \bar{\mathbf{x}} \right)^2 = \frac{1}{n} \sum_{i=1}^n \left[ \mathbf{u}^T (\mathbf{x}_i - \bar{\mathbf{x}}) \right]^2 \]
Let \(\mathbf{z}_i = \mathbf{x}_i - \bar{\mathbf{x}}\) be the centered data points. Then:
\[ \sigma^2_{\mathbf{u}} = \frac{1}{n} \sum_{i=1}^n (\mathbf{u}^T \mathbf{z}_i)^2 \]
Expand the square as a product of two factors:
\[ \sigma^2_{\mathbf{u}} = \frac{1}{n} \sum_{i=1}^n (\mathbf{u}^T \mathbf{z}_i)(\mathbf{u}^T \mathbf{z}_i) \]
Apply the identity \(A^TB = B^TA\) to the second factor:
\[ \sigma^2_{\mathbf{u}} = \frac{1}{n} \sum_{i=1}^n (\mathbf{u}^T \mathbf{z}_i)(\mathbf{z}_i^T \mathbf{u}) \]
Factor out \(\mathbf{u}^T\) and \(\mathbf{u}\):
\[ \sigma^2_{\mathbf{u}} = \mathbf{u}^T \left( \frac{1}{n} \sum_{i=1}^n \mathbf{z}_i \mathbf{z}_i^T \right) \mathbf{u} \]
Recognize the covariance matrix \(\mathbf{C} = \frac{1}{n} \sum_{i=1}^n \mathbf{z}_i \mathbf{z}_i^T\):
\[ \sigma^2_{\mathbf{u}} = \mathbf{u}^T \mathbf{C} \mathbf{u} \]
Express in terms of the centered data matrix \(\mathbf{Z} = [\mathbf{z}_1 \cdots \mathbf{z}_n]^T\):
\[ \mathbf{C} = \frac{1}{n} \mathbf{Z}^T \mathbf{Z} \]
Hence the variance of the projected data is:
\[ \sigma^2_{\mathbf{u}} = \mathbf{u}^T \mathbf{C} \mathbf{u} = \frac{1}{n} \mathbf{u}^T \mathbf{Z}^T \mathbf{Z} \mathbf{u} = \frac{1}{n} \|\mathbf{Z}\mathbf{u}\|^2 \]
Optimization Problem#
We want to maximize the projected variance \(\sigma^2_\mathbf{u} = \mathbf{u}^T \mathbf{C} \mathbf{u}\) subject to the constraint \(\mathbf{u}^T\mathbf{u}=1\). This constrained optimization problem can be solved using Lagrange multipliers:
\[ L(\mathbf{u}, \lambda) = \mathbf{u}^T \mathbf{C} \mathbf{u} - \lambda \left( \mathbf{u}^T \mathbf{u} - 1 \right) \]
Deriving the Solution#
Taking the derivative with respect to \(\mathbf{u}\) and setting it to zero:
\[ \frac{\partial L}{\partial \mathbf{u}} = 2\mathbf{C}\mathbf{u} - 2\lambda\mathbf{u} = \mathbf{0} \quad \Longrightarrow \quad \mathbf{C}\mathbf{u} = \lambda\mathbf{u} \]
This shows that \(\lambda\) must be an eigenvalue of the covariance matrix \(\mathbf{C}\) with \(\mathbf{u}\) as the corresponding eigenvector.
Maximizing the Variance#
The projected variance can be expressed as:
\[ \sigma^2_{\mathbf{u}} = \mathbf{u}^T \mathbf{C} \mathbf{u} = \mathbf{u}^T (\lambda \mathbf{u}) = \lambda \, \mathbf{u}^T \mathbf{u} = \lambda \]
Therefore, to maximize the projected variance, we must choose the largest eigenvalue \(\lambda_1\) of \(\mathbf{C}\). The corresponding eigenvector \(\mathbf{u}_1\) gives the direction of maximum variance, known as the first principal component.
import numpy as np
import matplotlib.pyplot as plt
# Generate synthetic data
np.random.seed(42)
mean = [0, 0]
cov = [[1, 0.6], [0.6, 0.5]]
X = np.random.multivariate_normal(mean, cov, 100)
# Center the data
X_mean = np.mean(X, axis=0)
Z = X - X_mean
# Compute covariance matrix
cov_matrix = np.dot(Z.T, Z) / (len(Z) - 1)
# Compute eigenvalues and eigenvectors
eigenvalues, eigenvectors = np.linalg.eigh(cov_matrix)
# Sort by descending eigenvalues
idx = eigenvalues.argsort()[::-1]
eigenvalues = eigenvalues[idx]
eigenvectors = eigenvectors[:, idx]
# Extract PC1 (first principal component)
u1 = eigenvectors[:, 0]
λ1 = eigenvalues[0]
print(f"u1 = {u1}, λ1 = {λ1:.2f}")
u1 = [-0.83818843 -0.54538075], λ1 = 1.03
# Create figure
plt.figure(figsize=(8, 6))
# Plot original centered data
plt.scatter(Z[:, 0], Z[:, 1], alpha=0.9, label='Original Data (Centered)')
# Plot PC1 direction (scaled by standard deviation)
plt.quiver(0, 0, u1[0]*2*np.sqrt(λ1), u1[1]*2*np.sqrt(λ1),
color='green', scale=1, scale_units='xy', angles='xy', width=0.007,
label=f'PC1 (λ={λ1:.2f})')
# Project data onto PC1
projections = np.outer(Z @ u1, u1) # Projection vectors
# Plot projections
plt.scatter(projections[:, 0], projections[:, 1], color='red', alpha=0.7,
s=30, label='Projections on PC1')
# Draw projection lines
for point, proj in zip(Z, projections):
    plt.plot([point[0], proj[0]], [point[1], proj[1]],
             'r--', alpha=0.3, linewidth=0.7)
# Formatting
plt.axhline(0, color='black', linestyle='--', alpha=0.2)
plt.axvline(0, color='black', linestyle='--', alpha=0.2)
plt.title('First Principal Component and Projections\n(Maximum Variance Direction)')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.legend()
plt.grid(alpha=0.2)
plt.axis('equal')
plt.tight_layout()
plt.show()

PCA Algorithm#
1. Center the data: \(\mathbf{Z} = \mathbf{X} - \mathbf{1}\cdot\mathbf{\mu}^T\)
2. Compute the covariance matrix: \(\mathbf{C} = \frac{1}{n}\mathbf{Z}^T\mathbf{Z}\) (the implementation below uses the sample covariance \(\frac{1}{n-1}\mathbf{Z}^T\mathbf{Z}\); the scaling does not change the eigenvectors)
3. Find the eigenvalues and eigenvectors of \(\mathbf{C}\)
4. Sort the eigenvalues in descending order
5. Select the top \(r\) eigenvectors as principal components
6. Project the data: \(\mathbf{A} = \mathbf{Z}\mathbf{U}_r\)
def pca_from_scratch(X, n_components):
    """
    Implement PCA from scratch.

    Parameters:
        X: data matrix (n_samples, n_features)
        n_components: number of principal components to keep

    Returns:
        X_transformed: transformed data
        components: principal components
        eigenvalues: eigenvalues
        mean: data mean
    """
    # Step 1: Center the data
    X_mean = np.mean(X, axis=0)
    Z = X - X_mean

    # Step 2: Compute covariance matrix
    n_samples = X.shape[0]
    cov_matrix = np.dot(Z.T, Z) / (n_samples - 1)

    # Step 3: Compute eigenvalues and eigenvectors
    eigenvalues, eigenvectors = np.linalg.eigh(cov_matrix)

    # Step 4: Sort eigenvalues and eigenvectors in descending order
    idx = eigenvalues.argsort()[::-1]
    eigenvalues = eigenvalues[idx]
    eigenvectors = eigenvectors[:, idx]

    # Step 5: Select top n_components
    components = eigenvectors[:, :n_components]

    # Step 6: Project the data
    X_transformed = np.dot(Z, components)

    return X_transformed, components, eigenvalues, X_mean


def reconstruct_data(X_transformed, components, mean):
    """Reconstruct original data from PCA representation."""
    X_reconstructed = np.dot(X_transformed, components.T) + mean
    return X_reconstructed
PCA Workflow: From Single Sample to Full Matrix#
Projection & Reconstruction (Single Data Point)#
Step 1: Mean-Centering#
Given a raw data vector: \( \mathbf{x} = [x_1, x_2, \dots, x_d]^T \quad \text{(original space)} \)
Subtract the mean vector: \( \mathbf{z} = \mathbf{x} - \mathbf{\mu} \quad \text{(centered)} \)
where \(\mathbf{\mu}\) is the mean of all samples.
Step 2: Projection (Dimensionality Reduction)#
Project \(\mathbf{z}\) onto the first \(r\) principal components (PCs) \(\mathbf{u}_1, \dots, \mathbf{u}_r\): \(a_i = \mathbf{u}_i^T \mathbf{z}\) (scalar, projection of \(\mathbf{z}\) onto \(\mathbf{u}_i\))
The reduced representation (in PCA space) is: \(\mathbf{a} = [a_1, a_2, \dots, a_r]^T \)
Step 3: Reconstruction (Approximation)#
Reconstruct the centered data using the PCs: \( \mathbf{z}' = \sum_{i=1}^r a_i \mathbf{u}_i = a_1 \mathbf{u}_1 + a_2 \mathbf{u}_2 + \dots + a_r \mathbf{u}_r\)
Add back the mean to return to original space: \( \mathbf{x}_{\text{recon}} = \mathbf{z}' + \mathbf{\mu}\)
Geometric Interpretation#
The reconstruction \(\mathbf{z}'\) is the closest approximation of \(\mathbf{z}\) using only \(r\) directions.
The error \(\|\mathbf{z} - \mathbf{z}'\|\) comes from discarding the remaining PCs.
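The sketch below traces these three steps for a single sample. It is illustrative only: the synthetic data and the names mu and U_r (the mean vector and the \(d \times r\) matrix of top-\(r\) components) are assumptions for this example, not variables defined elsewhere in the notebook.
import numpy as np

# Fit the PCA ingredients on assumed synthetic data
rng = np.random.default_rng(0)
X = rng.multivariate_normal([0, 0], [[1, 0.6], [0.6, 0.5]], size=100)
mu = X.mean(axis=0)                                   # mean vector
Z = X - mu                                            # centered data
eigvals, eigvecs = np.linalg.eigh(Z.T @ Z / len(Z))   # covariance eigendecomposition
U_r = eigvecs[:, np.argsort(eigvals)[::-1]][:, :1]    # keep r = 1 component (d x r)

# Steps 1-3 for one raw sample x
x = X[0]
z = x - mu            # Step 1: mean-centering
a = U_r.T @ z         # Step 2: projection -> r coefficients
z_prime = U_r @ a     # Step 3: reconstruction (still centered)
x_recon = z_prime + mu

print("reduced representation a:", a)
print("reconstruction error:", np.linalg.norm(z - z_prime))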
Generalization to Full Data Matrix#
Step 1: Mean-Centering (Matrix Form)#
Original data matrix:
\[ \mathbf{X} = [\mathbf{x}_1 | \mathbf{x}_2 | \dots | \mathbf{x}_n]^T \quad \text{($n \times d$)} \]
Mean-centered data:
\[ \mathbf{Z} = \mathbf{X} - \mathbf{1} \mathbf{\mu}^T \]
where \(\mathbf{1}\) is a column vector of ones.
Step 2: Projection (PCA Transformation)#
Project all samples onto the first \(r\) PCs (\(\mathbf{U}_r = [\mathbf{u}_1 | \dots | \mathbf{u}_r]\)):
\[ \mathbf{A} = \mathbf{Z} \mathbf{U}_r \quad \text{($n \times r$ matrix of PCA coordinates)} \]
Each row of \(\mathbf{A}\) contains the reduced representation of a sample.
Step 3: Reconstruction (Matrix Form)#
Reconstruct the centered data:
\[ \mathbf{Z}' = \mathbf{A} \mathbf{U}_r^T = \sum_{i=1}^r \mathbf{a}_i \mathbf{u}_i^T \]
where \(\mathbf{a}_i\) is the \(i\)-th column of \(\mathbf{A}\) (the scores for PC \(i\)).
Each term \(\mathbf{a}_i \mathbf{u}_i^T\) is a rank-1 matrix.
Return to original space:
\[ \mathbf{X}_{\text{recon}} = \mathbf{Z}' + \mathbf{1} \mathbf{\mu}^T \]
Connection to Python Implementation#
Projection (Dimensionality Reduction)#
X_transformed = Z @ U_r # ≡ Z U_r (n × r matrix)
Reconstruction#
Z_recon = X_transformed @ U_r.T # ≡ A U_r^T (n × d)
X_recon = Z_recon + mean # Add back mean
Key Takeaways#
Single Sample → Full Matrix:
Start with a single vector to understand projection/reconstruction.
Generalize to the full dataset using matrix operations.
Lossy Compression:
Keeping only \(r < d\) PCs means the reconstruction is approximate.
The error depends on the discarded eigenvalues.
PCA as a Rotation:
Projection = Rotate data to align with maximum variance directions.
Reconstruction = Rotate back, using only the most important directions.
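A small sketch can make the rotation picture concrete (assumed synthetic data; this is not the notebook's own code): projecting onto all \(d\) components is a pure rotation that preserves lengths and is perfectly invertible, while keeping only \(r < d\) components makes the reconstruction lossy.
import numpy as np

rng = np.random.default_rng(42)
Z = rng.normal(size=(200, 3))
Z -= Z.mean(axis=0)                          # centered data

eigvals, U = np.linalg.eigh(Z.T @ Z / len(Z))
U = U[:, np.argsort(eigvals)[::-1]]          # columns = PCs, sorted by variance

A_full = Z @ U                               # all d PCs: just a rotation
print(np.allclose(np.linalg.norm(Z, axis=1),
                  np.linalg.norm(A_full, axis=1)))   # lengths preserved
print(np.allclose(A_full @ U.T, Z))                  # exactly invertible

A_r = Z @ U[:, :2]                           # keep only r = 2 PCs
Z_recon = A_r @ U[:, :2].T
print(np.mean(np.sum((Z - Z_recon) ** 2, axis=1)))   # lossy reconstruction error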
Equivalence of PCA’s Two Optimization Perspectives and Error Analysis (Further Reading)#
PCA can be derived from two equivalent perspectives:
Variance Maximization: Find directions that maximize projected variance.
Reconstruction Error Minimization: Find directions that minimize the mean squared reconstruction error.
We’ll prove their equivalence mathematically and analyze the error.
1. Variance Maximization Formulation#
Goal: Find the unit vector \( \mathbf{u} \) that maximizes the variance of the projected data:
\[ \max_{\|\mathbf{u}\|=1} \frac{1}{n} \sum_{j=1}^n (\mathbf{u}^T \mathbf{z}_j)^2 \]
The solution is the eigenvector of \( \mathbf{Z}^T \mathbf{Z} \) (equivalently, of \( \mathbf{C} = \frac{1}{n}\mathbf{Z}^T \mathbf{Z} \)) with the largest eigenvalue.
2. Reconstruction Error Minimization#
Goal: Find \( \mathbf{u} \) that minimizes the MSE between the original data \( \mathbf{z}_j \) and its reconstruction \( \mathbf{z}_j' = (\mathbf{u}^T \mathbf{z}_j)\mathbf{u} \):
\[ \min_{\|\mathbf{u}\|=1} \frac{1}{n} \sum_{j=1}^n \|\mathbf{z}_j - (\mathbf{u}^T \mathbf{z}_j)\mathbf{u}\|^2 \]
3. Proof of Equivalence#
Expand the reconstruction error (using \(\|\mathbf{u}\| = 1\)):
\[ \|\mathbf{z}_j - (\mathbf{u}^T \mathbf{z}_j)\mathbf{u}\|^2 = \|\mathbf{z}_j\|^2 - 2(\mathbf{u}^T \mathbf{z}_j)^2 + (\mathbf{u}^T \mathbf{z}_j)^2 \|\mathbf{u}\|^2 = \|\mathbf{z}_j\|^2 - (\mathbf{u}^T \mathbf{z}_j)^2 \]
Thus, the MSE becomes:
\[ \text{MSE} = \frac{1}{n} \sum_{j=1}^n \|\mathbf{z}_j\|^2 - \frac{1}{n} \sum_{j=1}^n (\mathbf{u}^T \mathbf{z}_j)^2 \]
Key Observations:
The term \( \frac{1}{n} \sum \|\mathbf{z}_j\|^2 \) is constant (total variance in data).
Minimizing MSE is equivalent to maximizing \( \frac{1}{n} \sum (\mathbf{u}^T \mathbf{z}_j)^2 \) (projected variance).
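To make this complementarity explicit, the short check below (a sketch on assumed random data) confirms that for any unit vector \(\mathbf{u}\), the total variance splits exactly into projected variance plus reconstruction MSE:
import numpy as np

rng = np.random.default_rng(1)
Z = rng.normal(size=(500, 4))
Z -= Z.mean(axis=0)                              # centered data

u = rng.normal(size=4)
u /= np.linalg.norm(u)                           # arbitrary unit direction

proj_var = np.mean((Z @ u) ** 2)                 # (1/n) sum_j (u^T z_j)^2
recon = np.outer(Z @ u, u)                       # reconstructions (u^T z_j) u
mse = np.mean(np.sum((Z - recon) ** 2, axis=1))  # (1/n) sum_j ||z_j - z_j'||^2
total_var = np.mean(np.sum(Z ** 2, axis=1))      # (1/n) sum_j ||z_j||^2

print(np.isclose(total_var, proj_var + mse))     # True: min MSE <=> max variance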
4. Error Analysis in PCA#
Projection and Residual Components#
For a data point \( \mathbf{z}_j \) (centered) and its reconstruction \( \mathbf{z}_j' \) using \( r \) PCs:
\[ \mathbf{z}_j' = \sum_{i=1}^r (\mathbf{u}_i^T \mathbf{z}_j)\mathbf{u}_i \]
Error vector: \( \mathbf{\epsilon}_j = \mathbf{z}_j - \mathbf{z}_j' = \sum_{i=r+1}^d (\mathbf{u}_i^T \mathbf{z}_j)\mathbf{u}_i \).
Mean Squared Error (MSE) Derivation#
The total MSE across all samples is:
\[ \text{MSE} = \frac{1}{n} \sum_{j=1}^n \|\mathbf{\epsilon}_j\|^2 = \frac{1}{n} \sum_{j=1}^n \sum_{i=r+1}^d (\mathbf{u}_i^T \mathbf{z}_j)^2 \]
Link to Eigenvalues#
By definition, the variance along \( \mathbf{u}_i \) is:
\[ \lambda_i = \mathbf{u}_i^T \mathbf{C} \mathbf{u}_i = \frac{1}{n} \sum_{j=1}^n (\mathbf{u}_i^T \mathbf{z}_j)^2 \]
Thus:
\[ \text{MSE} = \sum_{i=r+1}^d \lambda_i \]
5. Conclusion and Summary#
Dual Formulations: Variance maximization and error minimization are equivalent in PCA.
Eigenvalue Interpretation: The MSE equals the sum of discarded eigenvalues because:
Each \( \lambda_i \) represents variance along PC \( \mathbf{u}_i \).
Discarding PCs \( r+1 \) to \( d \) loses exactly \( \sum_{i=r+1}^d \lambda_i \) variance.
Optimality: PCA’s solution simultaneously maximizes preserved variance and minimizes reconstruction error.
Key Relationships:
| Concept | Formula | Interpretation |
|---|---|---|
| Projection | \( \mathbf{z}_j' = \sum_{i=1}^r (\mathbf{u}_i^T \mathbf{z}_j)\mathbf{u}_i \) | Reconstruction using top \( r \) PCs |
| Error | \( \mathbf{\epsilon}_j = \sum_{i=r+1}^d (\mathbf{u}_i^T \mathbf{z}_j)\mathbf{u}_i \) | Residual from discarded PCs |
| MSE | \( \text{MSE} = \sum_{i=r+1}^d \lambda_i \) | Sum of variances of discarded PCs |
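The identity in the last row is easy to verify numerically. The sketch below (assumed synthetic data, using the \(1/n\) covariance convention) checks that the per-sample reconstruction MSE equals the sum of the discarded eigenvalues:
import numpy as np

rng = np.random.default_rng(7)
Z = rng.normal(size=(300, 5)) @ rng.normal(size=(5, 5))   # correlated data
Z -= Z.mean(axis=0)

eigvals, U = np.linalg.eigh(Z.T @ Z / len(Z))
order = np.argsort(eigvals)[::-1]
eigvals, U = eigvals[order], U[:, order]

r = 2
Z_recon = (Z @ U[:, :r]) @ U[:, :r].T                     # keep top r PCs
mse = np.mean(np.sum((Z - Z_recon) ** 2, axis=1))         # average ||z_j - z_j'||^2

print(mse, eigvals[r:].sum())                             # these two should match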
Examples and Applications#
Iris Dataset Example#
# Load the Iris dataset
iris = load_iris()
X_iris = iris.data
y_iris = iris.target
feature_names = iris.feature_names
print(f"Original data shape: {X_iris.shape}")
print(f"Features: {feature_names}")
# Apply PCA
X_pca, components, eigenvalues, mean = pca_from_scratch(X_iris, n_components=2)
# Calculate explained variance ratio
total_var = eigenvalues.sum()
explained_var_ratio = eigenvalues[:2] / total_var
print(f"\nExplained variance ratio:")
print(f"PC1: {explained_var_ratio[0]:.3f} ({explained_var_ratio[0]*100:.1f}%)")
print(f"PC2: {explained_var_ratio[1]:.3f} ({explained_var_ratio[1]*100:.1f}%)")
print(f"Total: {explained_var_ratio.sum():.3f} ({explained_var_ratio.sum()*100:.1f}%)")
Original data shape: (150, 4)
Features: ['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']
Explained variance ratio:
PC1: 0.925 (92.5%)
PC2: 0.053 (5.3%)
Total: 0.978 (97.8%)
# Visualize the results
plt.figure(figsize=(15, 5))
# Plot 1: Original data (first two features)
plt.subplot(1, 3, 1)
scatter = plt.scatter(X_iris[:, 0], X_iris[:, 1], c=y_iris, cmap='viridis')
plt.xlabel(feature_names[0])
plt.ylabel(feature_names[1])
plt.title('Original Data (First 2 Features)')
plt.colorbar(scatter)
# Plot 2: PCA transformed data
plt.subplot(1, 3, 2)
scatter = plt.scatter(X_pca[:, 0], X_pca[:, 1], c=y_iris, cmap='viridis')
plt.xlabel('First Principal Component')
plt.ylabel('Second Principal Component')
plt.title('PCA Transformed Data')
plt.colorbar(scatter)
# Plot 3: Explained variance
plt.subplot(1, 3, 3)
plt.bar(['PC1', 'PC2', 'PC3', 'PC4'], eigenvalues/total_var * 100)
plt.ylabel('Explained Variance Ratio (%)')
plt.title('Explained Variance by Components')
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()

Reconstruction Error Analysis#
# Analyze reconstruction error for different numbers of components
n_features = X_iris.shape[1]
reconstruction_errors = []
for n_comp in range(1, n_features + 1):
    X_pca_temp, components_temp, _, mean_temp = pca_from_scratch(X_iris, n_comp)
    X_reconstructed = reconstruct_data(X_pca_temp, components_temp, mean_temp)
    # Calculate mean squared error
    mse = np.mean((X_iris - X_reconstructed) ** 2)
    reconstruction_errors.append(mse)
plt.figure(figsize=(10, 6))
plt.plot(range(1, n_features + 1), reconstruction_errors, 'bo-')
plt.xlabel('Number of Principal Components')
plt.ylabel('Mean Squared Reconstruction Error')
plt.title('Reconstruction Error vs Number of Components')
plt.grid(True, alpha=0.3)
plt.xticks(range(1, n_features + 1))
for i, error in enumerate(reconstruction_errors):
    plt.annotate(f'{error:.4f}', (i+1, error), textcoords="offset points", xytext=(0,10), ha='center')
plt.show()
print("Reconstruction errors:")
for i, error in enumerate(reconstruction_errors):
    print(f"{i+1} components: MSE = {error:.6f}")

Reconstruction errors:
1 components: MSE = 0.085604
2 components: MSE = 0.025341
3 components: MSE = 0.005919
4 components: MSE = 0.000000
High-Dimensional Data Example#
# Generate high-dimensional data with intrinsic low-dimensional structure
np.random.seed(42)
n_samples = 200
n_features = 50
n_informative = 3 # True dimensionality
# Generate data with only 3 informative dimensions
X_high_dim = np.random.randn(n_samples, n_informative)
# Create a random mixing matrix to embed in high-dimensional space
mixing_matrix = np.random.randn(n_informative, n_features)
X_mixed = X_high_dim @ mixing_matrix
# Add some noise
noise_level = 0.1
X_noisy = X_mixed + noise_level * np.random.randn(n_samples, n_features)
print(f"Generated data shape: {X_noisy.shape}")
print(f"True intrinsic dimensionality: {n_informative}")
# Apply PCA
X_pca_high, components_high, eigenvalues_high, mean_high = pca_from_scratch(X_noisy, n_components=10)
# Calculate cumulative explained variance
total_var_high = eigenvalues_high.sum()
explained_var_ratio_high = eigenvalues_high / total_var_high
cumulative_var = np.cumsum(explained_var_ratio_high)
print(f"\nExplained variance by first 10 components:")
for i in range(10):
    print(f"PC{i+1}: {explained_var_ratio_high[i]:.3f} (cumulative: {cumulative_var[i]:.3f})")
Generated data shape: (200, 50)
True intrinsic dimensionality: 3
Explained variance by first 10 components:
PC1: 0.420 (cumulative: 0.420)
PC2: 0.342 (cumulative: 0.763)
PC3: 0.234 (cumulative: 0.997)
PC4: 0.000 (cumulative: 0.997)
PC5: 0.000 (cumulative: 0.997)
PC6: 0.000 (cumulative: 0.997)
PC7: 0.000 (cumulative: 0.997)
PC8: 0.000 (cumulative: 0.998)
PC9: 0.000 (cumulative: 0.998)
PC10: 0.000 (cumulative: 0.998)
# Visualize the explained variance
plt.figure(figsize=(12, 5))
# Plot 1: Individual explained variance
plt.subplot(1, 2, 1)
plt.bar(range(1, 11), explained_var_ratio_high[:10] * 100)
plt.xlabel('Principal Component')
plt.ylabel('Explained Variance Ratio (%)')
plt.title('Individual Explained Variance')
plt.xticks(range(1, 11))
# Plot 2: Cumulative explained variance
plt.subplot(1, 2, 2)
plt.plot(range(1, 11), cumulative_var[:10] * 100, 'bo-')
plt.axhline(y=95, color='r', linestyle='--', label='95% threshold')
plt.xlabel('Number of Components')
plt.ylabel('Cumulative Explained Variance (%)')
plt.title('Cumulative Explained Variance')
plt.legend()
plt.grid(True, alpha=0.3)
plt.xticks(range(1, 11))
plt.tight_layout()
plt.show()
# Find number of components for 95% variance
n_components_95 = np.argmax(cumulative_var >= 0.95) + 1
print(f"\nNumber of components needed for 95% variance: {n_components_95}")
print(f"This is close to the true intrinsic dimensionality of {n_informative}")

Number of components needed for 95% variance: 3
This is close to the true intrinsic dimensionality of 3
Comparison with Scikit-learn#
# Compare our implementation with scikit-learn
from sklearn.decomposition import PCA as SklearnPCA
# Our implementation
X_pca_ours, components_ours, eigenvalues_ours, mean_ours = pca_from_scratch(X_iris, n_components=2)
# Scikit-learn implementation
pca_sklearn = SklearnPCA(n_components=2)
X_pca_sklearn = pca_sklearn.fit_transform(X_iris)
print("Comparison of implementations:")
print(f"Our explained variance ratio: {eigenvalues_ours[:2]/eigenvalues_ours.sum()}")
print(f"Sklearn explained variance ratio: {pca_sklearn.explained_variance_ratio_}")
# The signs of eigenvectors might be different, but the results should be equivalent
print(f"\nMean absolute difference in transformed data: {np.mean(np.abs(np.abs(X_pca_ours) - np.abs(X_pca_sklearn))):.10f}")
# Visualize both results
plt.figure(figsize=(12, 5))
plt.subplot(1, 2, 1)
plt.scatter(X_pca_ours[:, 0], X_pca_ours[:, 1], c=y_iris, cmap='viridis')
plt.title('Our PCA Implementation')
plt.xlabel('PC1')
plt.ylabel('PC2')
plt.subplot(1, 2, 2)
plt.scatter(X_pca_sklearn[:, 0], X_pca_sklearn[:, 1], c=y_iris, cmap='viridis')
plt.title('Scikit-learn PCA')
plt.xlabel('PC1')
plt.ylabel('PC2')
plt.tight_layout()
plt.show()
Comparison of implementations:
Our explained variance ratio: [0.92461872 0.05306648]
Sklearn explained variance ratio: [0.92461872 0.05306648]
Mean absolute difference in transformed data: 0.0000000000

Some of my previous papers are about PCA, including [AFF07, AF22, AP15, FMA08].
Summary#
This tutorial covered:
Linear Algebra Foundations: Vector operations, norms, and angles
Vector Projections: Mathematical foundation for dimensionality reduction
Basis Transformation: How to represent data in different coordinate systems
PCA Theory: Variance maximization and eigenvalue decomposition
Implementation: Step-by-step PCA algorithm from scratch
Applications: Real examples with Iris dataset and high-dimensional data
Key Takeaways:#
PCA finds directions of maximum variance in the data
Principal components are eigenvectors of the covariance matrix
The eigenvalues represent the amount of variance captured by each component
PCA provides optimal linear dimensionality reduction in terms of reconstruction error
The number of components can be chosen based on desired explained variance threshold