Bayesian Decision Theory#
Chapter 2 of Pattern Classification [DHS00]
Introduction#
Bayesian Decision Theory is a statistical approach to pattern classification.
It quantifies tradeoffs between classification decisions using probabilities and costs.
It assumes that all relevant probabilities are known.
Example: Classifying fish (sea bass vs. salmon) based on features like lightness.
Key Concepts#
- State of Nature ($\omega$): Represents the true category (e.g., sea bass or salmon).
- Prior Probability ($P(\omega_j)$): Probability of a category before observing any data.
- Class-Conditional Probability Density ($p(x \mid \omega_j)$): Probability density of observing feature value $x$ given category $\omega_j$.
- Posterior Probability ($P(\omega_j \mid x)$): Probability of category $\omega_j$ after observing feature value $x$.
Bayes’ Formula#
Bayes’ Theorem:

$$P(\omega_j \mid x) = \frac{p(x \mid \omega_j)\, P(\omega_j)}{p(x)}$$

Posterior = (Likelihood × Prior) / Evidence

Evidence ($p(x)$): Normalizing constant, often ignored when making a classification decision. For two categories,

$$p(x) = \sum_{j=1}^{2} p(x \mid \omega_j)\, P(\omega_j).$$

Notice that it is the product of the likelihood and the prior probability that is most important in determining the posterior probability; the evidence factor, $p(x)$, is merely a scale factor that guarantees the posterior probabilities sum to one.
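As a quick numeric illustration, a minimal sketch of Bayes’ formula for two classes (the likelihood and prior values below are made up for illustration, not taken from the book):

import numpy as np

# Hypothetical likelihoods p(x | w_j) at one observed feature value x,
# and priors P(w_j); these numbers are illustrative only.
likelihoods = np.array([0.6, 0.2])   # p(x | sea bass), p(x | salmon)
priors = np.array([2/3, 1/3])        # P(sea bass), P(salmon)

evidence = np.sum(likelihoods * priors)          # p(x), the scale factor
posteriors = likelihoods * priors / evidence     # P(w_j | x)

print("posteriors:", posteriors)                 # sums to 1
print("decide class:", np.argmax(posteriors))    # minimum-error-rate decision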
Figure 2.1: Hypothetical class-conditional probability density functions show the probability density of measuring a particular feature value $x$ given that the pattern is in category $\omega_i$.
Decision Rule#
Minimum Error Rate Classification:

- Decide $\omega_1$ if $P(\omega_1 \mid x) > P(\omega_2 \mid x)$; otherwise decide $\omega_2$.
- Equivalent to: decide $\omega_1$ if $p(x \mid \omega_1)\,P(\omega_1) > p(x \mid \omega_2)\,P(\omega_2)$; otherwise decide $\omega_2$.
- Probability of error: $P(\text{error} \mid x) = \min\big[P(\omega_1 \mid x),\, P(\omega_2 \mid x)\big]$.
Figure 2.2: Posterior probabilities for the particular priors $P(\omega_1) = 2/3$ and $P(\omega_2) = 1/3$, given the class-conditional densities shown in Figure 2.1.
Some additional insight can be obtained by considering a few special cases:
- Case 1: If for some $x$ we have $p(x \mid \omega_1) = p(x \mid \omega_2)$, then that particular observation gives us no information about the state of nature. In this case, the decision hinges entirely on the prior probabilities $P(\omega_1)$ and $P(\omega_2)$.
- Case 2: If $P(\omega_1) = P(\omega_2)$, then the states of nature are equally probable. In this case, the decision is based entirely on the likelihoods $p(x \mid \omega_j)$.
- General Case: In general, both the prior probabilities and the likelihoods are important in making a decision. The Bayes decision rule combines these factors to achieve the minimum probability of error.
Generalization to More Than Two Classes#
The Bayes decision rule to minimize risk calls for selecting the action that minimizes the conditional risk. To minimize the average probability of error, we should select the class $\omega_i$ that maximizes the posterior probability $P(\omega_i \mid \mathbf{x})$:

$$\text{Decide } \omega_i \text{ if } P(\omega_i \mid \mathbf{x}) > P(\omega_j \mid \mathbf{x}) \text{ for all } j \neq i.$$

This rule generalizes naturally to multiple classes ($c > 2$).
Discriminant Functions#
- Discriminant Function ($g_i(\mathbf{x})$): Used to assign a feature vector $\mathbf{x}$ to class $\omega_i$; the classifier decides $\omega_i$ if $g_i(\mathbf{x}) > g_j(\mathbf{x})$ for all $j \neq i$.
- For minimum error rate: $g_i(\mathbf{x}) = P(\omega_i \mid \mathbf{x})$.
- Can also be expressed as: $g_i(\mathbf{x}) = p(\mathbf{x} \mid \omega_i)\,P(\omega_i)$.
- Or in log form: $g_i(\mathbf{x}) = \ln p(\mathbf{x} \mid \omega_i) + \ln P(\omega_i)$.
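A minimal sketch of these log-form discriminant functions for normal class-conditional densities, using scipy.stats.multivariate_normal; the means, covariances, and priors below are invented purely for illustration:

import numpy as np
from scipy.stats import multivariate_normal

# Illustrative two-class setup (all numbers are assumptions, not from the text)
means = [np.array([0.0, 0.0]), np.array([2.0, 2.0])]
covs = [np.eye(2), np.eye(2)]
priors = [0.7, 0.3]

def g(x, i):
    # g_i(x) = ln p(x | w_i) + ln P(w_i)
    return multivariate_normal(means[i], covs[i]).logpdf(x) + np.log(priors[i])

x = np.array([1.0, 0.5])
scores = [g(x, i) for i in range(2)]
print("discriminants:", scores, "-> decide class", int(np.argmax(scores)))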
Figure 2.5: The functional structure of a general statistical pattern classifier, which includes $d$ inputs and $c$ discriminant functions $g_i(\mathbf{x})$; a subsequent step selects the maximum discriminant value and categorizes the input accordingly.
Figure 2.6: In this two-dimensional two-category classifier, the probability densities are Gaussian (with unequal covariances); the decision boundary consists of two hyperbolas, and hence the decision region $\mathcal{R}_2$ is not simply connected.
Normal Density#
Univariate Normal Density:

$$p(x) = \frac{1}{\sqrt{2\pi}\,\sigma}\exp\!\left[-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^{2}\right],$$

for which the expected value of $x$ (the mean) is

$$\mu = \mathcal{E}[x] = \int_{-\infty}^{\infty} x\, p(x)\, dx,$$

and where the expected squared deviation, or variance, is

$$\sigma^2 = \mathcal{E}\big[(x-\mu)^2\big] = \int_{-\infty}^{\infty} (x-\mu)^2\, p(x)\, dx.$$

The univariate normal density is completely specified by two parameters: its mean $\mu$ and its variance $\sigma^2$.
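For reference, a short sketch evaluating this density directly and checking it against scipy.stats.norm (the specific $\mu$ and $\sigma$ values are arbitrary):

import numpy as np
from scipy.stats import norm

mu, sigma = 1.0, 2.0          # arbitrary example parameters
x = np.linspace(mu - 4 * sigma, mu + 4 * sigma, 5)

# Direct evaluation of p(x) = (1 / (sqrt(2*pi)*sigma)) * exp(-0.5*((x-mu)/sigma)**2)
p = np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (np.sqrt(2 * np.pi) * sigma)

print(np.allclose(p, norm.pdf(x, loc=mu, scale=sigma)))  # True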
Figure 2.7: A univariate normal distribution has roughly 95% of its area in the range $|x - \mu| \le 2\sigma$.
Multivariate Normal Density#
The mean vector ($\boldsymbol{\mu}$) and the covariance matrix ($\boldsymbol{\Sigma}$) describe the distribution.
Formal Definitions#
Formally, we have:

$$\boldsymbol{\mu} = \mathcal{E}[\mathbf{x}] = \int \mathbf{x}\, p(\mathbf{x})\, d\mathbf{x}$$

and

$$\boldsymbol{\Sigma} = \mathcal{E}\big[(\mathbf{x}-\boldsymbol{\mu})(\mathbf{x}-\boldsymbol{\mu})^t\big] = \int (\mathbf{x}-\boldsymbol{\mu})(\mathbf{x}-\boldsymbol{\mu})^t\, p(\mathbf{x})\, d\mathbf{x},$$

where the expected value of a vector or a matrix is found by taking the expected values of its components. In other words, if $x_i$ is the $i$th component of $\mathbf{x}$, $\mu_i$ the $i$th component of $\boldsymbol{\mu}$, and $\sigma_{ij}$ the $ij$th component of $\boldsymbol{\Sigma}$, then

$$\mu_i = \mathcal{E}[x_i]$$

and

$$\sigma_{ij} = \mathcal{E}\big[(x_i - \mu_i)(x_j - \mu_j)\big].$$
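In practice these expectations are estimated from samples; a minimal numpy sketch (the "true" mean and covariance below are arbitrary example values):

import numpy as np

rng = np.random.default_rng(0)
true_mu = np.array([1.0, -2.0])                  # arbitrary example values
true_sigma = np.array([[2.0, 0.6], [0.6, 1.0]])

X = rng.multivariate_normal(true_mu, true_sigma, size=10_000)  # samples, shape (n, d)

mu_hat = X.mean(axis=0)                 # sample estimate of the mean vector
sigma_hat = np.cov(X, rowvar=False)     # sample estimate of the covariance matrix

print(mu_hat, sigma_hat, sep="\n")      # close to the true parameters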
Properties of the Covariance Matrix#
- The covariance matrix $\boldsymbol{\Sigma}$ is always symmetric and positive semidefinite.
- We restrict our attention to the case where $\boldsymbol{\Sigma}$ is positive definite, so that the determinant of $\boldsymbol{\Sigma}$ is strictly positive.
- The diagonal elements $\sigma_{ii}$ are the variances of the respective $x_i$ (i.e., $\sigma_i^2$).
- The off-diagonal elements $\sigma_{ij}$ are the covariances of $x_i$ and $x_j$. Example: for the length and weight features of a population of fish, we would expect a positive covariance.
- If $x_i$ and $x_j$ are statistically independent, then $\sigma_{ij} = 0$.
- If all off-diagonal elements are zero, $p(\mathbf{x})$ reduces to the product of the univariate normal densities for the components of $\mathbf{x}$.
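The last property can be checked numerically: with a diagonal covariance matrix, the joint density equals the product of the univariate densities of the components (a small sketch; the parameter values are assumptions chosen for illustration):

import numpy as np
from scipy.stats import multivariate_normal, norm

mu = np.array([0.0, 1.0])
sigma = np.diag([1.0, 4.0])     # diagonal covariance: independent components

x = np.array([0.5, -1.0])
joint = multivariate_normal(mu, sigma).pdf(x)
product = norm.pdf(x[0], mu[0], np.sqrt(1.0)) * norm.pdf(x[1], mu[1], np.sqrt(4.0))

print(np.isclose(joint, product))   # True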
Multivariate Normal Density#
The multivariate normal density is completely specified by $d + d(d+1)/2$ parameters:

- The $d$ elements of the mean vector $\boldsymbol{\mu}$.
- The $d(d+1)/2$ independent elements of the covariance matrix $\boldsymbol{\Sigma}$.
Samples drawn from a normal population tend to fall in a single cluster (see Figure 2.9):
- The center of the cluster is determined by the mean vector $\boldsymbol{\mu}$.
- The shape of the cluster is determined by the covariance matrix $\boldsymbol{\Sigma}$.
The loci of points of constant density are hyperellipsoids defined by the quadratic form:
$$(\mathbf{x}-\boldsymbol{\mu})^t\,\boldsymbol{\Sigma}^{-1}\,(\mathbf{x}-\boldsymbol{\mu}) = \text{const.}$$

- The principal axes of these hyperellipsoids are given by the eigenvectors of $\boldsymbol{\Sigma}$ (denoted by $\mathbf{e}_i$).
- The eigenvalues (denoted by $\lambda_i$) determine the lengths of these axes.
The quantity $r^2 = (\mathbf{x}-\boldsymbol{\mu})^t\,\boldsymbol{\Sigma}^{-1}\,(\mathbf{x}-\boldsymbol{\mu})$ is called the squared Mahalanobis distance from $\mathbf{x}$ to $\boldsymbol{\mu}$.

- The contours of constant density are hyperellipsoids of constant Mahalanobis distance to $\boldsymbol{\mu}$.
- The volume of these hyperellipsoids measures the scatter of the samples about the mean.
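A small sketch computing the squared Mahalanobis distance directly from its definition (the mean and covariance values are arbitrary):

import numpy as np

mu = np.array([0.0, 0.0])
sigma = np.array([[2.0, 0.5], [0.5, 1.0]])   # arbitrary positive-definite covariance
sigma_inv = np.linalg.inv(sigma)

def mahalanobis_sq(x):
    d = x - mu
    return float(d @ sigma_inv @ d)   # r^2 = (x - mu)^t Sigma^{-1} (x - mu)

# Points with the same r^2 lie on the same constant-density hyperellipsoid.
print(mahalanobis_sq(np.array([1.0, 1.0])), mahalanobis_sq(np.array([-1.0, -1.0])))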
Figure 2.9: Samples drawn from a two-dimensional Gaussian lie in a cloud centered on the mean $\boldsymbol{\mu}$; the ellipses show lines of equal probability density of the Gaussian.
Discriminant Functions for Normal Density#
Case 1: Equal Covariance Matrices ($\boldsymbol{\Sigma}_i = \sigma^2\mathbf{I}$)#
The discriminant function is linear:

$$g_i(\mathbf{x}) = \mathbf{w}_i^t\mathbf{x} + w_{i0}.$$

How?

Let’s derive the simplified form of the log-likelihood discriminant $g_i(\mathbf{x})$ step by step.
Given:#
The probability density function (PDF) for class $\omega_i$ is

$$p(\mathbf{x} \mid \omega_i) = \frac{1}{(2\pi)^{d/2}\,|\boldsymbol{\Sigma}_i|^{1/2}}\exp\!\left[-\frac{1}{2}(\mathbf{x}-\boldsymbol{\mu}_i)^t\,\boldsymbol{\Sigma}_i^{-1}\,(\mathbf{x}-\boldsymbol{\mu}_i)\right],$$

where:

- $\boldsymbol{\Sigma}_i = \sigma^2\mathbf{I}$ (isotropic covariance, equal across classes).
- $|\boldsymbol{\Sigma}_i| = \sigma^{2d}$ (determinant of $\boldsymbol{\Sigma}_i$).
- $\boldsymbol{\Sigma}_i^{-1} = \frac{1}{\sigma^2}\mathbf{I}$ (inverse of the diagonal matrix).
Substitute into the PDF#
Simplify the exponent:

$$-\frac{1}{2}(\mathbf{x}-\boldsymbol{\mu}_i)^t\,\boldsymbol{\Sigma}_i^{-1}\,(\mathbf{x}-\boldsymbol{\mu}_i) = -\frac{\|\mathbf{x}-\boldsymbol{\mu}_i\|^2}{2\sigma^2},$$

where $\|\mathbf{x}-\boldsymbol{\mu}_i\|^2 = (\mathbf{x}-\boldsymbol{\mu}_i)^t(\mathbf{x}-\boldsymbol{\mu}_i)$ is the squared Euclidean distance.

Thus, the PDF becomes:

$$p(\mathbf{x} \mid \omega_i) = \frac{1}{(2\pi\sigma^2)^{d/2}}\exp\!\left[-\frac{\|\mathbf{x}-\boldsymbol{\mu}_i\|^2}{2\sigma^2}\right].$$
Apply Bayes’ Rule for Posterior Probability#
The posterior probability is

$$P(\omega_i \mid \mathbf{x}) = \frac{p(\mathbf{x} \mid \omega_i)\,P(\omega_i)}{p(\mathbf{x})},$$

where $p(\mathbf{x}) = \sum_j p(\mathbf{x} \mid \omega_j)\,P(\omega_j)$ is the evidence, which is the same for all classes.

The log-posterior (discriminant function $g_i(\mathbf{x})$) can therefore be taken as

$$g_i(\mathbf{x}) = \ln p(\mathbf{x} \mid \omega_i) + \ln P(\omega_i).$$
Take the Logarithm of the PDF#
Compute $\ln p(\mathbf{x} \mid \omega_i)$:

$$\ln p(\mathbf{x} \mid \omega_i) = \ln\!\left[\frac{1}{(2\pi\sigma^2)^{d/2}}\right] - \frac{\|\mathbf{x}-\boldsymbol{\mu}_i\|^2}{2\sigma^2}.$$

Simplify the first term:

$$\ln\!\left[\frac{1}{(2\pi\sigma^2)^{d/2}}\right] = -\frac{d}{2}\ln(2\pi\sigma^2).$$

Thus:

$$\ln p(\mathbf{x} \mid \omega_i) = -\frac{d}{2}\ln(2\pi\sigma^2) - \frac{\|\mathbf{x}-\boldsymbol{\mu}_i\|^2}{2\sigma^2}.$$
Combine with Prior and Ignore Constants#
The discriminant function becomes:

$$g_i(\mathbf{x}) = -\frac{d}{2}\ln(2\pi\sigma^2) - \frac{\|\mathbf{x}-\boldsymbol{\mu}_i\|^2}{2\sigma^2} + \ln P(\omega_i).$$

Since the term $-\frac{d}{2}\ln(2\pi\sigma^2)$ is the same for every class, it can be dropped:

$$g_i(\mathbf{x}) = -\frac{\|\mathbf{x}-\boldsymbol{\mu}_i\|^2}{2\sigma^2} + \ln P(\omega_i).$$

Decision boundaries are hyperplanes.
If the prior probabilities are not equal, the squared distance $\|\mathbf{x}-\boldsymbol{\mu}_i\|^2$ is normalized by the variance $\sigma^2$ and offset by adding $\ln P(\omega_i)$. This means that if $\mathbf{x}$ is equally near two different mean vectors, the optimal decision will favor the a priori more likely category.

It is not necessary to compute distances explicitly. Expanding the quadratic form $(\mathbf{x}-\boldsymbol{\mu}_i)^t(\mathbf{x}-\boldsymbol{\mu}_i)$ yields:

$$g_i(\mathbf{x}) = -\frac{1}{2\sigma^2}\left[\mathbf{x}^t\mathbf{x} - 2\boldsymbol{\mu}_i^t\mathbf{x} + \boldsymbol{\mu}_i^t\boldsymbol{\mu}_i\right] + \ln P(\omega_i).$$

The quadratic term $\mathbf{x}^t\mathbf{x}$ is the same for all $i$, so it can be ignored as an additive constant. This simplifies the discriminant function to:

$$g_i(\mathbf{x}) = \mathbf{w}_i^t\mathbf{x} + w_{i0},$$

where:

$$\mathbf{w}_i = \frac{1}{\sigma^2}\boldsymbol{\mu}_i$$

and

$$w_{i0} = -\frac{1}{2\sigma^2}\boldsymbol{\mu}_i^t\boldsymbol{\mu}_i + \ln P(\omega_i).$$

Here, $w_{i0}$ is called the threshold or bias in the $i$th direction. A classifier that uses linear discriminant functions is called a linear machine. The decision surfaces for a linear machine are pieces of hyperplanes defined by the linear equations $g_i(\mathbf{x}) = g_j(\mathbf{x})$ for the two categories with the highest posterior probabilities. For our case, this equation can be written as:

$$\mathbf{w}^t(\mathbf{x} - \mathbf{x}_0) = 0,$$

where:

$$\mathbf{w} = \boldsymbol{\mu}_i - \boldsymbol{\mu}_j$$

and

$$\mathbf{x}_0 = \frac{1}{2}(\boldsymbol{\mu}_i + \boldsymbol{\mu}_j) - \frac{\sigma^2}{\|\boldsymbol{\mu}_i - \boldsymbol{\mu}_j\|^2}\ln\frac{P(\omega_i)}{P(\omega_j)}\,(\boldsymbol{\mu}_i - \boldsymbol{\mu}_j).$$

This defines a hyperplane through the point $\mathbf{x}_0$ and orthogonal to the vector $\mathbf{w}$. Since $\mathbf{w} = \boldsymbol{\mu}_i - \boldsymbol{\mu}_j$, the hyperplane separating $\mathcal{R}_i$ and $\mathcal{R}_j$ is orthogonal to the line linking the means. If $P(\omega_i) = P(\omega_j)$, the hyperplane is the perpendicular bisector of the line between the means. If the prior probabilities are unequal, the point $\mathbf{x}_0$ shifts away from the more likely mean.
Figure 2.10: If the covariances of two distributions are equal and proportional to the identity matrix, the distributions are spherical in $d$ dimensions, and the boundary is a generalized hyperplane of $d-1$ dimensions, perpendicular to the line separating the means.
The equation $\mathbf{w}^t(\mathbf{x} - \mathbf{x}_0) = 0$ describes the decision hyperplane.

Interpretation

- $\mathbf{w}$: A normal vector to the hyperplane (defines the orientation of the hyperplane).
- $\mathbf{x}_0$: A fixed point on the hyperplane.
- $\mathbf{x}$: A variable point on the hyperplane.

The equation states that the vector $\mathbf{x} - \mathbf{x}_0$ is orthogonal to $\mathbf{w}$ for every point $\mathbf{x}$ on the hyperplane.

Visualization in 2D

In 2D space ($\mathbf{x} = (x, y)^t$):

- $\mathbf{w}$: The normal vector to the line.
- $\mathbf{x}_0$: A fixed point on the line.
- $\mathbf{x}$: A variable point on the line.

The equation becomes:

$$w_1(x - x_{0,1}) + w_2(y - x_{0,2}) = 0.$$

This is the equation of a line in 2D.

Drawing the Hyperplane (Line in 2D)

Here’s how to draw it:

1. Plot the point $\mathbf{x}_0$: This is a fixed point on the line.
2. Draw the normal vector $\mathbf{w}$: This vector is perpendicular to the line.
3. Draw the line: The line passes through $\mathbf{x}_0$ and is perpendicular to $\mathbf{w}$.

Example

Let’s use the following values:

- $\mathbf{w} = (2, 1)^t$ (normal vector).
- $\mathbf{x}_0 = (1, 1)^t$ (a point on the line).

The equation becomes:

$$2(x - 1) + 1(y - 1) = 0 \quad\Longrightarrow\quad 2x + y - 3 = 0.$$

The code below plots this line together with $\mathbf{x}_0$ and the normal vector $\mathbf{w}$.
import matplotlib.pyplot as plt
import numpy as np
# Define the normal vector w and point x0
w = np.array([2, 1]) # Normal vector
x0 = np.array([1, 1]) # Point on the line
# Define the line equation: 2x + y - 3 = 0 => y = -2x + 3
x_values = np.linspace(-5, 5, 100) # Range of x values
y_values = -2 * x_values + 3 # Corresponding y values
# Plot the line
plt.plot(x_values, y_values, label="Line: $2x + y - 3 = 0$")
# Plot the point x0
plt.scatter(x0[0], x0[1], color="red", label=r"Point $\mathbf{x}_0 = (1, 1)$")
# Plot the normal vector w starting from x0
plt.quiver(x0[0], x0[1], w[0], w[1], angles='xy', scale_units='xy', scale=1, color="green", label=r"Normal vector $\mathbf{w} = (2, 1)$")
# Add labels and legend
plt.xlabel("x")
plt.ylabel("y")
plt.axhline(0, color="black", linewidth=0.5) # x-axis
plt.axvline(0, color="black", linewidth=0.5) # y-axis
plt.grid(True)
plt.legend()
plt.title("Line and Normal Vector Visualization")
plt.xlim(-5, 5) # Adjust x-axis limits
plt.ylim(-5, 5) # Adjust y-axis limits
plt.gca().set_aspect("equal", adjustable="box") # Equal aspect ratio
plt.show()

Hyperplane and Decision Boundary
The equation:

$$\mathbf{w}^t(\mathbf{x} - \mathbf{x}_0) = 0$$

defines a hyperplane through the point $\mathbf{x}_0$, orthogonal to $\mathbf{w}$.

- If $P(\omega_i) = P(\omega_j)$, the second term in the expression for $\mathbf{x}_0$ (Eq. 56) vanishes, and the point $\mathbf{x}_0$ is halfway between the means. In this case, the hyperplane is the perpendicular bisector of the line between the means (see Figure 2.11).
- If the prior probabilities are unequal, the point $\mathbf{x}_0$ shifts away from the more likely mean, as illustrated in the sketch below.
- Note that if the variance $\sigma^2$ is small relative to the squared distance $\|\boldsymbol{\mu}_i - \boldsymbol{\mu}_j\|^2$, the position of the decision boundary is relatively insensitive to the exact values of the prior probabilities.
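A minimal numeric sketch of this shift for Case 1 (the means, variance, and priors below are illustrative assumptions): compute $\mathbf{x}_0$ for equal and for unequal priors and observe that it moves away from the more likely mean.

import numpy as np

mu1, mu2 = np.array([0.0, 0.0]), np.array([4.0, 0.0])   # example means
sigma2 = 1.0                                             # shared variance (Sigma_i = sigma^2 I)

def x0(p1, p2):
    diff = mu1 - mu2
    # x0 = 0.5 (mu1 + mu2) - (sigma^2 / ||mu1 - mu2||^2) ln(P1/P2) (mu1 - mu2)
    return 0.5 * (mu1 + mu2) - (sigma2 / np.dot(diff, diff)) * np.log(p1 / p2) * diff

print(x0(0.5, 0.5))   # halfway between the means
print(x0(0.9, 0.1))   # shifted away from mu1, the more likely class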
Minimum Distance Classifier
- If the prior probabilities $P(\omega_i)$ are the same for all classes, the $\ln P(\omega_i)$ term becomes an unimportant additive constant and can be ignored.
- In this case, the optimum decision rule simplifies to:
  1. Measure the Euclidean distance $\|\mathbf{x} - \boldsymbol{\mu}_i\|$ from $\mathbf{x}$ to each of the mean vectors.
  2. Assign $\mathbf{x}$ to the category of the nearest mean.
Such a classifier is called a minimum distance classifier.
If each mean vector is considered an ideal prototype or template for patterns in its class, this is essentially a template-matching procedure (see Figure 2.10). This technique is similar to the nearest-neighbor algorithm visited in the previous chapter.
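A minimal sketch of such a minimum-distance (nearest-mean) classifier; the class means and query point are arbitrary examples:

import numpy as np

means = np.array([[0.0, 0.0],    # prototype for class 0
                  [3.0, 3.0],    # prototype for class 1
                  [0.0, 4.0]])   # prototype for class 2 (all values illustrative)

def nearest_mean(x):
    # Euclidean distance from x to each class mean; pick the closest one.
    dists = np.linalg.norm(means - x, axis=1)
    return int(np.argmin(dists))

print(nearest_mean(np.array([2.5, 2.0])))   # -> 1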
Figure 2.11: As the priors are changed, the decision boundary shifts; for sufficiently disparate priors the boundary will not lie between the means of these 1-, 2- and 3-dimensional spherical Gaussian distributions
Case 2: Equal but Arbitrary Covariance Matrices ($\boldsymbol{\Sigma}_i = \boldsymbol{\Sigma}$ for all $i$)#
When the covariance matrices for all classes are identical but arbitrary ($\boldsymbol{\Sigma}_i = \boldsymbol{\Sigma}$), the discriminant function reduces to

$$g_i(\mathbf{x}) = -\frac{1}{2}(\mathbf{x}-\boldsymbol{\mu}_i)^t\,\boldsymbol{\Sigma}^{-1}\,(\mathbf{x}-\boldsymbol{\mu}_i) + \ln P(\omega_i),$$

after dropping the class-independent terms $-\frac{d}{2}\ln 2\pi$ and $-\frac{1}{2}\ln|\boldsymbol{\Sigma}|$.

Now, let’s expand the quadratic term:

$$(\mathbf{x}-\boldsymbol{\mu}_i)^t\,\boldsymbol{\Sigma}^{-1}\,(\mathbf{x}-\boldsymbol{\mu}_i) = \mathbf{x}^t\boldsymbol{\Sigma}^{-1}\mathbf{x} - 2\boldsymbol{\mu}_i^t\boldsymbol{\Sigma}^{-1}\mathbf{x} + \boldsymbol{\mu}_i^t\boldsymbol{\Sigma}^{-1}\boldsymbol{\mu}_i.$$

Since $\boldsymbol{\Sigma}^{-1}$ is symmetric, $\mathbf{x}^t\boldsymbol{\Sigma}^{-1}\boldsymbol{\mu}_i = \boldsymbol{\mu}_i^t\boldsymbol{\Sigma}^{-1}\mathbf{x}$.

Substituting this back into the expression for $g_i(\mathbf{x})$:

$$g_i(\mathbf{x}) = -\frac{1}{2}\mathbf{x}^t\boldsymbol{\Sigma}^{-1}\mathbf{x} + \boldsymbol{\mu}_i^t\boldsymbol{\Sigma}^{-1}\mathbf{x} - \frac{1}{2}\boldsymbol{\mu}_i^t\boldsymbol{\Sigma}^{-1}\boldsymbol{\mu}_i + \ln P(\omega_i).$$

Notice that the term $-\frac{1}{2}\mathbf{x}^t\boldsymbol{\Sigma}^{-1}\mathbf{x}$ is the same for all classes and can be dropped.

This discriminant function is linear in $\mathbf{x}$:

$$g_i(\mathbf{x}) = \mathbf{w}_i^t\mathbf{x} + w_{i0},$$

where the weight vector and threshold are

$$\mathbf{w}_i = \boldsymbol{\Sigma}^{-1}\boldsymbol{\mu}_i, \qquad w_{i0} = -\frac{1}{2}\boldsymbol{\mu}_i^t\boldsymbol{\Sigma}^{-1}\boldsymbol{\mu}_i + \ln P(\omega_i).$$

Since the discriminant function is linear, the decision boundary between classes $\omega_i$ and $\omega_j$ is given by $g_i(\mathbf{x}) = g_j(\mathbf{x})$.

Substituting the expressions for $\mathbf{w}_i$ and $w_{i0}$, this equation defines a hyperplane. Let

$$\mathbf{w}^t(\mathbf{x} - \mathbf{x}_0) = 0,$$

where

$$\mathbf{w} = \boldsymbol{\Sigma}^{-1}(\boldsymbol{\mu}_i - \boldsymbol{\mu}_j)$$

and

$$\mathbf{x}_0 = \frac{1}{2}(\boldsymbol{\mu}_i + \boldsymbol{\mu}_j) - \frac{\ln\big[P(\omega_i)/P(\omega_j)\big]}{(\boldsymbol{\mu}_i - \boldsymbol{\mu}_j)^t\,\boldsymbol{\Sigma}^{-1}\,(\boldsymbol{\mu}_i - \boldsymbol{\mu}_j)}\,(\boldsymbol{\mu}_i - \boldsymbol{\mu}_j).$$

The vector $\mathbf{w} = \boldsymbol{\Sigma}^{-1}(\boldsymbol{\mu}_i - \boldsymbol{\mu}_j)$ is generally not in the direction of $\boldsymbol{\mu}_i - \boldsymbol{\mu}_j$, so the hyperplane is generally not orthogonal to the line between the means; it does, however, intersect that line at $\mathbf{x}_0$.

The location of the hyperplane is influenced by the prior probabilities $P(\omega_i)$: if they are equal, $\mathbf{x}_0$ is halfway between the means; otherwise it shifts away from the more likely mean.

Geometric Interpretation:

The decision boundaries are hyperplanes. Unlike Case 1, where the hyperplanes are orthogonal to the line connecting the means, here the orientation of the hyperplanes depends on the shared covariance matrix $\boldsymbol{\Sigma}$.

Figure 2.12 (from the DHS book) illustrates this scenario for Gaussian distributions. It shows how equal covariance matrices lead to linear decision boundaries that are not necessarily perpendicular to the line connecting the class means, unlike the simpler Case 1. The shape and orientation of the ellipses (contours of equal probability density) are the same for both classes, reflecting the shared covariance matrix $\boldsymbol{\Sigma}$.
Exercise: Validating the Decision Boundary in Case 2 and Comparing with GaussianNB
Objective In this exercise, you will:
Generate two groups of data with Gaussian distributions.
Compute the decision boundary for Case 2 (equal but arbitrary covariance matrices) using the theoretical approach.
Compare the results with the decision boundary generated by Scikit-Learn’s GaussianNB.
Visualize the results in a single plot or side-by-side plots.
Steps to Follow
Generate Data:
Create two classes of data points, each following a multivariate Gaussian distribution with the same covariance matrix but different mean vectors.
Use the following parameters for the two classes:
Class 1: mean vector $\boldsymbol{\mu}_1$, covariance matrix $\boldsymbol{\Sigma}$.
Class 2: mean vector $\boldsymbol{\mu}_2$, covariance matrix $\boldsymbol{\Sigma}$ (the same covariance matrix for both classes).
Generate 200 data points for each class.
Compute the Decision Boundary (Theoretical Approach):
Use the linear discriminant function for Case 2: $g_i(\mathbf{x}) = \mathbf{w}_i^t\mathbf{x} + w_{i0}$ with $\mathbf{w}_i = \boldsymbol{\Sigma}^{-1}\boldsymbol{\mu}_i$ and $w_{i0} = -\frac{1}{2}\boldsymbol{\mu}_i^t\boldsymbol{\Sigma}^{-1}\boldsymbol{\mu}_i + \ln P(\omega_i)$.
Assume equal prior probabilities ($P(\omega_1) = P(\omega_2) = 0.5$).
The decision boundary is the set of points where $g_1(\mathbf{x}) = g_2(\mathbf{x})$.
Compute the Decision Boundary (GaussianNB):
Use Scikit-Learn’s GaussianNB to fit the data and predict the labels for test points.
Generate Test Points:
Create a grid of test points covering the entire plot area (e.g., using np.meshgrid).
Predict the class labels for these test points using both the theoretical decision boundary and GaussianNB.
Visualize the Results:
Plot the original data points for both classes.
Plot the test points with their predicted labels (use transparency to avoid obscuring the original data).
Draw the decision boundary for both the theoretical approach and GaussianNB.
Draw the line connecting the two class centers and the perpendicular bisector.
Expected Output
A figure with two subplots:
Theoretical Decision Boundary:
The original data points for both classes.
The test points with their predicted labels (shaded regions).
The theoretical decision boundary (black line).
The line connecting the two class centers (dashed black line).
The perpendicular bisector (dotted green line).
GaussianNB Decision Boundary:
The original data points for both classes.
The test points with their predicted labels (shaded regions).
The decision boundary generated by GaussianNB (black line).
The line connecting the two class centers (dashed black line).
The perpendicular bisector (dotted green line).
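One possible starting point for this exercise is sketched below. The means, shared covariance, and grid range are placeholder assumptions (the original parameter values are not shown above), and only the boundary computation and the GaussianNB fit are included, not the full plotting. Note that GaussianNB assumes per-class diagonal covariances, so it only approximates the Case 2 boundary.

import numpy as np
from sklearn.naive_bayes import GaussianNB

rng = np.random.default_rng(0)

# Placeholder parameters: two classes with different means, same covariance.
mu1, mu2 = np.array([0.0, 0.0]), np.array([3.0, 2.0])
sigma = np.array([[2.0, 0.8], [0.8, 1.0]])

X1 = rng.multivariate_normal(mu1, sigma, size=200)
X2 = rng.multivariate_normal(mu2, sigma, size=200)
X = np.vstack([X1, X2])
y = np.array([0] * 200 + [1] * 200)

# Theoretical Case 2 boundary (equal priors): w^t (x - x0) = 0
sigma_inv = np.linalg.inv(sigma)
w = sigma_inv @ (mu1 - mu2)
x0 = 0.5 * (mu1 + mu2)

def theory_predict(points):
    # g1 - g2 > 0  <=>  w^t (x - x0) > 0  -> class 0; otherwise class 1
    return (points - x0) @ w < 0

clf = GaussianNB().fit(X, y)

grid = rng.uniform(-4, 7, size=(1000, 2))
agreement = np.mean(theory_predict(grid).astype(int) == clf.predict(grid))
print(f"agreement between theoretical boundary and GaussianNB: {agreement:.2%}")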
Case 3: Arbitrary Covariance Matrices ($\boldsymbol{\Sigma}_i$ arbitrary)#
This is the most general case for Gaussian distributions. Here, each class $\omega_i$ has its own mean vector $\boldsymbol{\mu}_i$ and its own covariance matrix $\boldsymbol{\Sigma}_i$.

We start again with the general discriminant function, dropping only the constant term $-\frac{d}{2}\ln 2\pi$:

$$g_i(\mathbf{x}) = -\frac{1}{2}(\mathbf{x}-\boldsymbol{\mu}_i)^t\,\boldsymbol{\Sigma}_i^{-1}\,(\mathbf{x}-\boldsymbol{\mu}_i) - \frac{1}{2}\ln|\boldsymbol{\Sigma}_i| + \ln P(\omega_i).$$

Now, we expand the quadratic term:

$$(\mathbf{x}-\boldsymbol{\mu}_i)^t\,\boldsymbol{\Sigma}_i^{-1}\,(\mathbf{x}-\boldsymbol{\mu}_i) = \mathbf{x}^t\boldsymbol{\Sigma}_i^{-1}\mathbf{x} - 2\boldsymbol{\mu}_i^t\boldsymbol{\Sigma}_i^{-1}\mathbf{x} + \boldsymbol{\mu}_i^t\boldsymbol{\Sigma}_i^{-1}\boldsymbol{\mu}_i.$$

Since $\boldsymbol{\Sigma}_i^{-1}$ is symmetric, $\mathbf{x}^t\boldsymbol{\Sigma}_i^{-1}\boldsymbol{\mu}_i = \boldsymbol{\mu}_i^t\boldsymbol{\Sigma}_i^{-1}\mathbf{x}$.

Substituting this back into the expression for $g_i(\mathbf{x})$ and rearranging the terms to group by powers of $\mathbf{x}$:

$$g_i(\mathbf{x}) = -\frac{1}{2}\mathbf{x}^t\boldsymbol{\Sigma}_i^{-1}\mathbf{x} + \boldsymbol{\mu}_i^t\boldsymbol{\Sigma}_i^{-1}\mathbf{x} - \frac{1}{2}\boldsymbol{\mu}_i^t\boldsymbol{\Sigma}_i^{-1}\boldsymbol{\mu}_i - \frac{1}{2}\ln|\boldsymbol{\Sigma}_i| + \ln P(\omega_i).$$

We can see this is a quadratic function of $\mathbf{x}$:

$$g_i(\mathbf{x}) = \mathbf{x}^t\mathbf{W}_i\mathbf{x} + \mathbf{w}_i^t\mathbf{x} + w_{i0},$$

where the components are identified by comparing the terms:

- Matrix $\mathbf{W}_i = -\frac{1}{2}\boldsymbol{\Sigma}_i^{-1}$: Determines the quadratic term.
- Vector $\mathbf{w}_i = \boldsymbol{\Sigma}_i^{-1}\boldsymbol{\mu}_i$: Determines the linear term. Note that $\mathbf{w}_i^t\mathbf{x} = \boldsymbol{\mu}_i^t\boldsymbol{\Sigma}_i^{-1}\mathbf{x}$.
- Scalar $w_{i0} = -\frac{1}{2}\boldsymbol{\mu}_i^t\boldsymbol{\Sigma}_i^{-1}\boldsymbol{\mu}_i - \frac{1}{2}\ln|\boldsymbol{\Sigma}_i| + \ln P(\omega_i)$: Represents the constant or threshold term.
Since the discriminant function $g_i(\mathbf{x})$ is quadratic in $\mathbf{x}$, the resulting decision surfaces are quadratic as well.

The decision boundary between two classes $\omega_i$ and $\omega_j$ is the set of points where $g_i(\mathbf{x}) = g_j(\mathbf{x})$.

Because the covariance matrices differ, the quadratic terms generally do not cancel, and the decision surfaces are hyperquadrics.
These hyperquadric decision boundaries can take various forms, including:
Hyperplanes (only if $\boldsymbol{\Sigma}_i = \boldsymbol{\Sigma}_j$, which reduces to Case 2)
Pairs of hyperplanes
Hyperspheres
Hyperellipsoids
Hyperparaboloids
Hyperhyperboloids
In 2D space, these correspond to familiar shapes like lines, pairs of lines, circles, ellipses, parabolas, and hyperbolas. The specific shape depends on the quadratic part of the boundary, i.e., on the eigenvalues of the matrix $\mathbf{W}_i - \mathbf{W}_j$.
This general case allows for much more complex decision boundaries compared to the linear boundaries of Cases 1 and 2, reflecting the greater flexibility afforded by allowing each class to have its own unique covariance matrix.
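A brief sketch of evaluating these quadratic discriminants for two classes with different covariance matrices (all parameter values are illustrative assumptions):

import numpy as np

# Illustrative per-class parameters (Case 3: each class has its own covariance).
params = [
    (np.array([0.0, 0.0]), np.array([[1.0, 0.0], [0.0, 1.0]]), 0.5),
    (np.array([2.0, 2.0]), np.array([[3.0, 1.0], [1.0, 2.0]]), 0.5),
]

def quadratic_discriminant(x, mu, sigma, prior):
    sigma_inv = np.linalg.inv(sigma)
    W = -0.5 * sigma_inv                     # quadratic term
    w = sigma_inv @ mu                       # linear term
    w0 = (-0.5 * mu @ sigma_inv @ mu         # threshold term
          - 0.5 * np.log(np.linalg.det(sigma))
          + np.log(prior))
    return x @ W @ x + w @ x + w0

x = np.array([1.0, 0.5])
scores = [quadratic_discriminant(x, *p) for p in params]
print("decide class", int(np.argmax(scores)))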
The figure referenced illustrates this complexity:
Figure 2.14: Arbitrary Gaussian distributions lead to Bayes decision boundaries that are general hyperquadrics (in 2D: lines, pairs of lines, circles, ellipses, parabolas, and hyperbolas). Conversely, given any hyperquadric, one can find two Gaussian distributions whose Bayes decision boundary is that hyperquadric.
This quadratic classifier is the most general form for Gaussian distributions under the Bayesian framework.
The extension of these results to more than two categories is straightforward, though we need to keep clear which two of the total $c$ categories are responsible for any boundary segment. Figure 2.16 shows the decision surfaces for a four-category case made up of Gaussian distributions. Of course, if the distributions are more complicated, the decision regions can be even more complex, though the same underlying theory holds there too.
Figure 2.16: The decision regions for four normal distributions. Even with such a low number of categories, the shapes of the boundary regions can be rather complex.
Summary#
Bayesian Decision Theory provides a framework for optimal classification under uncertainty.
Key components:
Prior probabilities.
Likelihoods (class-conditional densities).
Posterior probabilities.
Decision rules minimize error rates or expected loss.
Normal distributions are commonly used due to their mathematical tractability.