Mahalanobis Distance#

Understanding Covariance Matrix and Its Role in Bayesian Decision Theory

In this tutorial, we will explore the covariance matrix, its geometric interpretation, and its role in Bayesian decision theory, particularly through the Mahalanobis distance. We will use resources from Vision Dummy, SAS Blogs, and Wikipedia to explain the concepts, along with Chapter 2 of Duda’s book [DHS00].


Why Do We Need Mahalanobis Distance?#

In many real-world applications, we deal with multivariate data whose features are often correlated and measured on different scales. The Euclidean distance, which measures the straight-line distance between two points, is a common default choice. However, it has a significant limitation: it treats every feature as uncorrelated and equally scaled, an assumption that often does not hold in practice and can lead to misleading results.

For example, consider a dataset in which one feature is measured in meters and another in kilometers. The Euclidean distance is dominated by whichever feature takes numerically larger values, even if that feature is less relevant to the problem. Similarly, if features are correlated, Euclidean distance fails to account for the underlying structure of the data.

This is where the Mahalanobis distance comes into play. Unlike Euclidean distance, the Mahalanobis distance takes into account the covariance structure of the data. It measures the distance between a point and a distribution, normalized by the variance of each feature and the correlations between them. This makes it a more robust and accurate measure for multivariate data.

Python Example: Euclidean vs. Mahalanobis Distance#

Let’s illustrate this with a Python example. We’ll generate a set of 2D Gaussian random points and highlight two points that have the same Euclidean distance to the center but different Mahalanobis distances.

import numpy as np
import matplotlib.pyplot as plt
from scipy.spatial.distance import mahalanobis
from scipy.stats import multivariate_normal

# Generate 2D Gaussian random points
np.random.seed(42)
mean = [0, 0]
cov = [[1, 0.8], [0.8, 1]]  # Covariance matrix
points = np.random.multivariate_normal(mean, cov, 100)

# Compute the center of the points
# center = np.mean(points, axis=0)
center = mean

# Select two points with equal Euclidean distance to the center
point1 = np.array([1.5, 1.5])
point2 = np.array([-1.5, 1.5])

# Compute Euclidean distances
euclidean_dist1 = np.linalg.norm(point1 - center)
euclidean_dist2 = np.linalg.norm(point2 - center)

# Compute Mahalanobis distances
inv_cov = np.linalg.inv(cov)
mahalanobis_dist1 = mahalanobis(point1, center, inv_cov)
mahalanobis_dist2 = mahalanobis(point2, center, inv_cov)

# Plot the points
plt.scatter(points[:, 0], points[:, 1], alpha=0.5, label='Random Points')
plt.scatter(center[0], center[1], color='red', label='Center')
plt.scatter(point1[0], point1[1], color='green', marker='*', s=150, label=f'Point 1 (Euclidean: {euclidean_dist1:.2f}, Mahalanobis: {mahalanobis_dist1:.2f})')
plt.scatter(point2[0], point2[1], color='blue', marker='*', s=150, label=f'Point 2 (Euclidean: {euclidean_dist2:.2f}, Mahalanobis: {mahalanobis_dist2:.2f})')

# Create a grid of points
x, y = np.mgrid[-3:3:.01, -3:3:.01]
pos = np.dstack((x, y))

# Compute the Gaussian distribution on the grid
rv = multivariate_normal(mean, cov)
plt.contour(x, y, rv.pdf(pos), levels=3, cmap='viridis')

plt.legend()
plt.title('Euclidean vs. Mahalanobis Distance')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.axis('equal') 
plt.show()
Figure: two points with the same Euclidean distance to the center of a correlated 2D Gaussian but very different Mahalanobis distances, shown against the density contours.

Introduction to Covariance Matrix#

The covariance matrix is a square matrix that summarizes the variances and covariances of a set of random variables. For a dataset with \(d\) features, the covariance matrix \(\mathbf{\Sigma}\) is a \(d \times d\) matrix where the diagonal elements represent the variances of each feature, and the off-diagonal elements represent the covariances between pairs of features.

Mathematically, the covariance matrix is defined as:

\[\begin{split} \mathbf{\Sigma} = \begin{bmatrix} \sigma_{11} & \sigma_{12} & \cdots & \sigma_{1d} \\ \sigma_{21} & \sigma_{22} & \cdots & \sigma_{2d} \\ \vdots & \vdots & \ddots & \vdots \\ \sigma_{d1} & \sigma_{d2} & \cdots & \sigma_{dd} \end{bmatrix} \end{split}\]

where:

  • \(\sigma_{ii}\) is the variance of the \(i\)-th feature.

  • \(\sigma_{ij}\) is the covariance between the \(i\)-th and \(j\)-th features.

The covariance matrix is symmetric (\(\sigma_{ij} = \sigma_{ji}\)) and positive semi-definite.
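
As a quick sanity check on these properties, the short sketch below (a minimal example with illustrative numbers) estimates a covariance matrix from sample data with np.cov and verifies that it is symmetric and positive semi-definite.

import numpy as np

# Draw correlated 2D samples (illustrative mean and covariance)
rng = np.random.default_rng(0)
samples = rng.multivariate_normal(mean=[0, 0], cov=[[1, 0.8], [0.8, 1]], size=1000)

# Estimate the covariance matrix; rowvar=False means each column is a feature
sigma_hat = np.cov(samples, rowvar=False)

print(sigma_hat)                                   # diagonal: variances, off-diagonal: covariances
print(np.allclose(sigma_hat, sigma_hat.T))         # symmetry
print(np.all(np.linalg.eigvalsh(sigma_hat) >= 0))  # positive semi-definite: all eigenvalues >= 0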


Geometric Interpretation of Covariance Matrix#

The covariance matrix can be interpreted geometrically as describing the shape and orientation of the data distribution in the feature space. Here’s how:

  1. Eigenvalues and Eigenvectors: The eigenvectors of the covariance matrix represent the directions (axes) of maximum variance in the data, while the eigenvalues represent the magnitude of variance along these directions.

  2. Ellipsoid Representation: The covariance matrix defines an ellipsoid in the feature space. The eigenvectors determine the orientation of the ellipsoid, and the eigenvalues determine the lengths of its axes. For example, in a 2D space, the covariance matrix defines an ellipse.

  3. Scaling and Rotation: The covariance matrix can be decomposed into scaling and rotation components. The eigenvalues represent the scaling factors, and the eigenvectors represent the rotation of the ellipsoid.

Figure: covariance and the shape of the data. Diagonal spread is captured by the covariance terms, while axis-aligned spread is captured by the variances.

Figure: geometric interpretation of the covariance matrix as an ellipse in 2D space. The eigenvectors determine the orientation, and the eigenvalues determine the lengths of the axes.
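
To make the eigenvalue/eigenvector picture concrete, the sketch below decomposes the same 2D covariance matrix used throughout this tutorial; the printed values are what NumPy returns for this particular matrix.

import numpy as np

cov = np.array([[1.0, 0.8], [0.8, 1.0]])

# Eigendecomposition of the symmetric covariance matrix (eigenvalues in ascending order)
eigenvals, eigenvecs = np.linalg.eigh(cov)

print(eigenvals)           # [0.2, 1.8]: variances along the principal axes
print(eigenvecs)           # columns are the principal directions (ellipse axes)
print(np.sqrt(eigenvals))  # standard deviations along the axes, i.e. the relative axis lengths

# Orientation of the major axis (eigenvector with the largest eigenvalue);
# an axis direction is only defined up to 180 degrees, hence the modulo
major = eigenvecs[:, np.argmax(eigenvals)]
angle_deg = np.degrees(np.arctan2(major[1], major[0])) % 180.0
print(angle_deg)           # 45 degrees for this covariance matrix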


Covariance Matrix in Bayesian Decision Theory#

In Bayesian decision theory, the covariance matrix plays a crucial role in defining the class-conditional probability density functions for multivariate normal distributions. The probability density function for a multivariate normal distribution is given by:

\[ p(\mathbf{x}|\omega_i) = \frac{1}{(2\pi)^{d/2}|\mathbf{\Sigma}_i|^{1/2}} \exp\left(-\frac{1}{2}(\mathbf{x} - \boldsymbol{\mu}_i)^T \mathbf{\Sigma}_i^{-1} (\mathbf{x} - \boldsymbol{\mu}_i)\right) \]

where:

  • \(\mathbf{x}\) is the feature vector.

  • \(\boldsymbol{\mu}_i\) is the mean vector for class \(\omega_i\).

  • \(\mathbf{\Sigma}_i\) is the covariance matrix for class \(\omega_i\).

The covariance matrix \(\mathbf{\Sigma}_i\) determines the shape and orientation of the distribution for class \(\omega_i\). If the covariance matrix is diagonal, the features are uncorrelated and the distribution is axis-aligned. If the covariance matrix is non-diagonal, the features are correlated and the elliptical contours of the distribution are tilted relative to the coordinate axes.
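
As a sanity check on this density formula, the sketch below evaluates it explicitly with NumPy and compares the result to scipy.stats.multivariate_normal; the mean, covariance, and test point are illustrative values, not parameters of any particular class.

import numpy as np
from scipy.stats import multivariate_normal

mu = np.array([0.0, 0.0])
sigma = np.array([[1.0, 0.8], [0.8, 1.0]])
x = np.array([1.5, 1.5])
d = len(mu)

# The density written out term by term
diff = x - mu
quad = diff @ np.linalg.inv(sigma) @ diff          # squared Mahalanobis distance
norm_const = (2 * np.pi) ** (d / 2) * np.sqrt(np.linalg.det(sigma))
pdf_manual = np.exp(-0.5 * quad) / norm_const

# The same density via SciPy
pdf_scipy = multivariate_normal(mean=mu, cov=sigma).pdf(x)

print(pdf_manual, pdf_scipy)              # the two values agree
print(np.isclose(pdf_manual, pdf_scipy))  # True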


Mahalanobis Distance and Its Role#

The Mahalanobis distance is a measure of the distance between a point and a distribution, taking into account the covariance structure of the distribution. It is defined as:

\[ D_M(\mathbf{x}) = \sqrt{(\mathbf{x} - \boldsymbol{\mu})^T \mathbf{\Sigma}^{-1} (\mathbf{x} - \boldsymbol{\mu})} \]

The Mahalanobis distance is used in Bayesian decision theory to classify data points based on their distance from the mean of a distribution, normalized by the covariance matrix. This distance is particularly useful when the features are correlated or have different scales.
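
As an illustration of this idea, here is a minimal sketch of a minimum-Mahalanobis-distance classifier. The class names, means, and covariances are made-up values, and the helper classify is hypothetical; note also that a full Bayes discriminant with unequal covariances would additionally include the class priors and the log-determinant of each covariance matrix, which this sketch ignores.

import numpy as np
from scipy.spatial.distance import mahalanobis

# Two hypothetical classes, each with its own mean and covariance
classes = {
    "omega_1": {"mean": np.array([0.0, 0.0]), "cov": np.array([[1.0, 0.8], [0.8, 1.0]])},
    "omega_2": {"mean": np.array([3.0, 0.0]), "cov": np.array([[1.0, -0.5], [-0.5, 1.5]])},
}

def classify(x):
    """Assign x to the class whose mean is closest in Mahalanobis distance."""
    dists = {
        name: mahalanobis(x, params["mean"], np.linalg.inv(params["cov"]))
        for name, params in classes.items()
    }
    return min(dists, key=dists.get), dists

label, dists = classify(np.array([1.0, 1.0]))
print(label, dists)  # for this point, omega_1 is closer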


Relationship to Normal Random Variables#

In general, given a normal (Gaussian) random variable \(X\) with variance \(S=1\) and mean \(\mu = 0\), any other normal random variable \(R\) (with mean \(\mu_1\) and variance \(S_1\)) can be defined in terms of \(X\) by the equation:

\[ R = \mu_1 + \sqrt{S_1}X. \]

Conversely, to recover the standardized random variable from any normal random variable, one can solve for:

\[ X = \frac{R - \mu_1}{\sqrt{S_1}}. \]

Squaring both sides and then taking the square root (which yields the absolute value of \(X\)) gives a quantity that looks a lot like the one-dimensional Mahalanobis distance:

\[ D = \sqrt{X^2} = \sqrt{\frac{(R - \mu_1)^2}{S_1}} = \sqrt{(R - \mu_1) S_1^{-1} (R - \mu_1)}. \]

The resulting magnitude is always non-negative and varies with the distance of the data from the mean, attributes that are convenient when trying to define a model for the data.
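
In one dimension this is simply the familiar z-score. The tiny sketch below (with illustrative numbers) checks that the expression above, written with a 1×1 "covariance", reduces to \(|R - \mu_1| / \sqrt{S_1}\).

import numpy as np

mu_1, S_1 = 2.0, 4.0   # mean and variance of R (illustrative values)
r = 7.0                # a single observation of R

# The general expression with S_1^{-1} in the middle
d_general = np.sqrt((r - mu_1) * (1.0 / S_1) * (r - mu_1))

# The z-score form |r - mu_1| / sqrt(S_1)
d_zscore = abs(r - mu_1) / np.sqrt(S_1)

print(d_general, d_zscore)  # both equal 2.5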


Definition of Mahalanobis Distance#

Given a probability distribution \(Q\) on \(\mathbb{R}^N\), with mean \(\vec{\mu} = (\mu_1, \mu_2, \mu_3, \dots , \mu_N)^\mathsf{T}\) and positive semi-definite covariance matrix \(\mathbf{\Sigma}\), the Mahalanobis distance of a point \(\vec{x} = (x_1, x_2, x_3, \dots, x_N )^\mathsf{T}\) from \(Q\) is:

\[ d_M(\vec{x}, Q) = \sqrt{(\vec{x} - \vec{\mu})^\mathsf{T} \mathbf{\Sigma}^{-1} (\vec{x} - \vec{\mu})}. \]

Given two points \(\vec{x}\) and \(\vec{y}\) in \(\mathbb{R}^N\), the Mahalanobis distance between them with respect to \(Q\) is:

\[ d_M(\vec{x}, \vec{y}; Q) = \sqrt{(\vec{x} - \vec{y})^\mathsf{T} \mathbf{\Sigma}^{-1} (\vec{x} - \vec{y})}. \]

This means that \(d_M(\vec{x}, Q) = d_M(\vec{x}, \vec{\mu}; Q)\).

For the inverse \(\mathbf{\Sigma}^{-1}\) to exist, \(\mathbf{\Sigma}\) must in fact be positive definite; in that case \(\mathbf{\Sigma}^{-1}\) is positive definite as well, so the quantity under the square root is non-negative and the distance is always defined.
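
These definitions translate directly into NumPy. The sketch below is a minimal hand-written implementation (the helper d_M is introduced here for illustration) compared against scipy.spatial.distance.mahalanobis, using the same covariance matrix as the earlier examples.

import numpy as np
from scipy.spatial.distance import mahalanobis

mu = np.array([0.0, 0.0])
sigma = np.array([[1.0, 0.8], [0.8, 1.0]])
sigma_inv = np.linalg.inv(sigma)

def d_M(x, y, sigma_inv):
    """Mahalanobis distance between two points, given the inverse covariance of Q."""
    diff = np.asarray(x, dtype=float) - np.asarray(y, dtype=float)
    return np.sqrt(diff @ sigma_inv @ diff)

x = np.array([1.5, 1.5])
y = np.array([-1.5, 1.5])

print(d_M(x, mu, sigma_inv))         # distance of x from the distribution (i.e. from its mean)
print(d_M(x, y, sigma_inv))          # distance between the two points with respect to Q
print(mahalanobis(x, y, sigma_inv))  # SciPy gives the same value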


Intuitive Explanation of Mahalanobis Distance#

Consider the problem of estimating the probability that a test point in \(N\)-dimensional Euclidean space belongs to a set, where we are given sample points that definitely belong to that set. Our first step would be to find the centroid or center of mass of the sample points. Intuitively, the closer the point in question is to this center of mass, the more likely it is to belong to the set.

However, we also need to know if the set is spread out over a large range or a small range, so that we can decide whether a given distance from the center is noteworthy or not. The simplistic approach is to estimate the standard deviation of the distances of the sample points from the center of mass. If the distance between the test point and the center of mass is less than one standard deviation, then we might conclude that it is highly probable that the test point belongs to the set. The further away it is, the more likely that the test point should not be classified as belonging to the set.

This intuitive approach can be made quantitative by defining the normalized distance between the test point and the set to be:

\[ \frac{\lVert x - \mu\rVert_2}{\sigma}, \]

which reads:

\[ \frac{\text{testpoint} - \text{sample mean}}{\text{standard deviation}}. \]

By plugging this into the normal distribution, we can derive the probability of the test point belonging to the set.

The drawback of the above approach is that we assumed that the sample points are distributed about the center of mass in a spherical manner. Were the distribution to be decidedly non-spherical, for instance ellipsoidal, then we would expect the probability of the test point belonging to the set to depend not only on the distance from the center of mass, but also on the direction. In those directions where the ellipsoid has a short axis, the test point must be closer, while in those where the axis is long, the test point can be further away from the center.

Putting this on a mathematical basis, the ellipsoid that best represents the set’s probability distribution can be estimated by building the covariance matrix of the samples. The Mahalanobis distance is the distance of the test point from the center of mass divided by the width of the ellipsoid in the direction of the test point.

Whitening Transformation#

Whitening is a process that transforms data so that its covariance matrix becomes the identity matrix. This transformation removes correlations between features and scales the features to have unit variance. Whitening is particularly useful in machine learning and signal processing because it simplifies the structure of the data, making it easier to apply algorithms that assume uncorrelated features.

Figure 2.8 [DHS00]: The action of a linear transformation on the feature space will convert an arbitrary normal distribution into another normal distribution.

Mathematical Formulation#

Given a dataset with mean \(\boldsymbol{\mu}\) and covariance matrix \(\mathbf{\Sigma}\), the whitening transformation is defined as:

\[ \mathbf{x}_{\text{whitened}} = W^T (\mathbf{x} - \boldsymbol{\mu}), \]

where \(W\) is the whitening matrix. The whitening matrix is constructed using the eigenvectors \(\mathbf{\Phi}\) and eigenvalues \(\mathbf{\Lambda}\) of the covariance matrix \(\mathbf{\Sigma}\):

\[ W = \mathbf{\Phi} \mathbf{\Lambda}^{-1/2}. \]

Here:

  • \(\mathbf{\Phi}\) is the matrix of eigenvectors of \(\mathbf{\Sigma}\).

  • \(\mathbf{\Lambda}\) is the diagonal matrix of eigenvalues of \(\mathbf{\Sigma}\).

  • \(\mathbf{\Lambda}^{-1/2}\) is the diagonal matrix of the inverse square roots of the eigenvalues.

After applying the whitening transformation, the covariance matrix of the transformed data becomes the identity matrix \(\mathbf{I}\):

\[ \text{Cov}(\mathbf{x}_{\text{whitened}}) = \mathbf{I}. \]
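
This property is easy to verify numerically. The sketch below builds W exactly as defined above for the covariance matrix used in this tutorial and checks that the sample covariance of the whitened data is close to the identity (it is not exactly \(\mathbf{I}\) because it is estimated from a finite sample).

import numpy as np

rng = np.random.default_rng(0)
mean = np.array([0.0, 0.0])
cov = np.array([[1.0, 0.8], [0.8, 1.0]])
points = rng.multivariate_normal(mean, cov, size=5000)

# Whitening matrix W = Phi * Lambda^{-1/2}
eigenvals, eigenvecs = np.linalg.eigh(cov)
W = eigenvecs @ np.diag(1.0 / np.sqrt(eigenvals))

# Whitened data: each row is W^T (x - mu)
whitened = (points - mean) @ W

print(np.cov(whitened, rowvar=False).round(2))  # approximately the 2x2 identity matrix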

Geometric Interpretation#

Whitening can be thought of as a two-step process:

  1. Rotation: The data is rotated using the eigenvectors \(\mathbf{\Phi}\) to align the axes of the ellipsoid (defined by the covariance matrix) with the coordinate axes.

  2. Scaling: The data is scaled along each axis by the inverse square root of the eigenvalues \(\mathbf{\Lambda}^{-1/2}\) to normalize the variances.

In the whitened space, the features are uncorrelated and have unit variance, so the ordinary Euclidean distance between points already accounts for the covariance structure of the original data.

Relationship to Mahalanobis Distance#

The Mahalanobis distance in the original space is equivalent to the Euclidean distance in the whitened space. This is because whitening normalizes the data so that the Euclidean distance accounts for the covariance structure. Specifically:

\[ D_M(\mathbf{x}) = \sqrt{(\mathbf{x} - \boldsymbol{\mu})^T \mathbf{\Sigma}^{-1} (\mathbf{x} - \boldsymbol{\mu})} = \|\mathbf{x}_{\text{whitened}}\|_2, \]

where \(\|\cdot\|_2\) is the Euclidean norm.
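
The identity above can be checked numerically as well. The short sketch below compares the Mahalanobis distance of a point computed in the original space with the Euclidean norm of the same point after whitening, using the whitening matrix defined in the previous section.

import numpy as np
from scipy.spatial.distance import mahalanobis

mean = np.array([0.0, 0.0])
cov = np.array([[1.0, 0.8], [0.8, 1.0]])

# Whitening matrix W = Phi * Lambda^{-1/2}
eigenvals, eigenvecs = np.linalg.eigh(cov)
W = eigenvecs @ np.diag(1.0 / np.sqrt(eigenvals))

x = np.array([1.5, 1.5])

d_mahalanobis = mahalanobis(x, mean, np.linalg.inv(cov))
d_euclid_whitened = np.linalg.norm((x - mean) @ W)

print(d_mahalanobis, d_euclid_whitened)  # both are about 1.58 for this point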

Example: Whitening in Python#

The following Python code demonstrates how to whiten a dataset:

import numpy as np

# Generate 2D Gaussian random points
mean = [0, 0]
cov = [[1, 0.8], [0.8, 1]]  # Covariance matrix
points = np.random.multivariate_normal(mean, cov, 100)

# Compute the whitening matrix
eigenvals, eigenvecs = np.linalg.eigh(cov)
D = np.diag(1.0 / np.sqrt(eigenvals))  # Diagonal matrix of inverse square roots of eigenvalues
W = np.dot(eigenvecs, D)  # Whitening matrix

# Whiten the data
whitened_points = np.dot(points - mean, W)

The code cell below repeats the distance comparison from the first example and plots the data before and after whitening:

import numpy as np
import matplotlib.pyplot as plt
from scipy.spatial.distance import mahalanobis
from scipy.stats import multivariate_normal

# Generate 2D Gaussian random points
np.random.seed(42)
mean = [0, 0]
cov = [[1, 0.8], [0.8, 1]]  # Covariance matrix
points = np.random.multivariate_normal(mean, cov, 100)

# The center
center = mean

# Select two points with equal Euclidean distance to the center
point1 = np.array([1.5, 1.5])
point2 = np.array([-1.5, 1.5])

# Compute Euclidean distances
euclidean_dist1 = np.linalg.norm(point1 - center)
euclidean_dist2 = np.linalg.norm(point2 - center)

# Compute Mahalanobis distances
inv_cov = np.linalg.inv(cov)
mahalanobis_dist1 = mahalanobis(point1, center, inv_cov)
mahalanobis_dist2 = mahalanobis(point2, center, inv_cov)

# Whitening the data based on Duda's book
# Step 1: Compute the eigenvalues and eigenvectors of the covariance matrix
eigenvals, eigenvecs = np.linalg.eigh(cov)

# Step 2: Construct the whitening matrix
D = np.diag(1.0 / np.sqrt(eigenvals))  # Diagonal matrix of inverse square roots of eigenvalues
W = np.dot(eigenvecs, D)  # Whitening matrix

# Step 3: Whiten the data
whitened_points = np.dot(points - center, W)  # Subtract mean and apply whitening transformation
whitened_point1 = np.dot(point1 - center, W)
whitened_point2 = np.dot(point2 - center, W)

# Compute Euclidean distances in the whitened space
whitened_euclidean_dist1 = np.linalg.norm(whitened_point1)
whitened_euclidean_dist2 = np.linalg.norm(whitened_point2)

# Plot the original and whitened data
plt.figure(figsize=(12, 6))

# Plot original data
plt.subplot(1, 2, 1)
plt.scatter(points[:, 0], points[:, 1], alpha=0.5, label='Random Points')
plt.scatter(center[0], center[1], color='red', label='Center')
plt.scatter(point1[0], point1[1], color='green', marker='*', s=150, label=f'Point 1 (Euclidean: {euclidean_dist1:.2f}, Mahalanobis: {mahalanobis_dist1:.2f})')
plt.scatter(point2[0], point2[1], color='blue', marker='*', s=150, label=f'Point 2 (Euclidean: {euclidean_dist2:.2f}, Mahalanobis: {mahalanobis_dist2:.2f})')

# Create a grid of points for the original data
x, y = np.mgrid[-3:3:.01, -3:3:.01]
pos = np.dstack((x, y))

# Compute the Gaussian distribution on the grid
rv = multivariate_normal(mean, cov)
plt.contour(x, y, rv.pdf(pos), levels=3, cmap='viridis')

plt.legend()
plt.title('Original Data')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.axis('equal')

# Plot whitened data
plt.subplot(1, 2, 2)
plt.scatter(whitened_points[:, 0], whitened_points[:, 1], alpha=0.5, label='Whitened Points')
plt.scatter(0, 0, color='red', label='Center')
plt.scatter(whitened_point1[0], whitened_point1[1], color='green', marker='*', s=150, label=f'Whitened Point 1 (Euclidean: {whitened_euclidean_dist1:.2f})')
plt.scatter(whitened_point2[0], whitened_point2[1], color='blue', marker='*', s=150, label=f'Whitened Point 2 (Euclidean: {whitened_euclidean_dist2:.2f})')

# Create a grid of points for the whitened data
x_whitened, y_whitened = np.mgrid[-3:3:.01, -3:3:.01]
pos_whitened = np.dstack((x_whitened, y_whitened))

# Compute the Gaussian distribution on the grid for whitened data
rv_whitened = multivariate_normal([0, 0], np.eye(2))  # Covariance matrix is identity after whitening
plt.contour(x_whitened, y_whitened, rv_whitened.pdf(pos_whitened), levels=3, cmap='viridis')

plt.legend()
plt.title('Whitened Data')
plt.xlabel('Whitened Feature 1')
plt.ylabel('Whitened Feature 2')
plt.axis('equal')

plt.tight_layout()
plt.show()
Figure: comparison of the original and whitened data.
  • Left: The original data with points and their Euclidean/Mahalanobis distances.

  • Right: The whitened data with points and their Euclidean distances.

Applications of Whitening#

  1. Data Preprocessing: Whitening is often used as a preprocessing step in machine learning algorithms, such as Principal Component Analysis (PCA) and Independent Component Analysis (ICA), to improve performance.

  2. Signal Processing: In signal processing, whitening is used to decorrelate signals and remove redundancy.

  3. Visualization: Whitening can simplify the visualization of high-dimensional data by removing correlations and scaling features.


Conclusion#

The covariance matrix is a fundamental concept in multivariate statistics and Bayesian decision theory. It describes the shape, orientation, and scale of data distributions and plays a key role in defining class-conditional probability densities. The Mahalanobis distance, which incorporates the covariance matrix, is a powerful tool for classification tasks, especially when dealing with correlated features.

By understanding the geometric interpretation of the covariance matrix and its role in Bayesian decision theory, you can better analyze and classify multivariate data.