Bayesian Decision Theory#

Chapter 2 of Pattern Classification [DHS00]

Introduction#

  • Bayesian Decision Theory is a statistical approach to pattern classification.

  • It quantifies tradeoffs between classification decisions using probabilities and costs.

  • Assumes all relevant probabilities are known.

  • Example: Classifying fish (sea bass vs. salmon) based on features like lightness.


Key Concepts#

  • State of Nature (ω): Represents the true category (e.g., sea bass or salmon).

  • Prior Probability (P(ωj)): Probability of a category before observing any data.

  • Class-Conditional Probability Density (p(x|ωj)): Probability of observing feature x given category ωj.

  • Posterior Probability (P(ωj|x)): Probability of category ωj after observing feature x.


Bayes’ Formula#

  • Bayes’ Theorem:

    $$P(\omega_j|x) = \frac{p(x|\omega_j)\,P(\omega_j)}{p(x)}$$
    • Posterior = (Likelihood × Prior) / Evidence

  • Evidence (p(x)): Normalizing constant, often ignored in classification.

    $$p(x) = \sum_{j=1}^{c} p(x|\omega_j)\,P(\omega_j)$$

By observing the feature value x, we convert the prior probability P(ωj) to the a posteriori probability (or posterior probability) P(ωj|x) — the probability of the state of nature being ωj given that feature value x has been measured. We call p(x|ωj) the likelihood of ωj with respect to x (a term chosen to indicate that, other things being equal, the category ωj for which p(x|ωj) is large is more “likely” to be the true category).

Notice that it is the product of the likelihood and the prior probability that is most important in determining the posterior probability; the evidence factor, p(x), can be viewed as merely a scale factor that guarantees that the posterior probabilities sum to one, as all good probabilities must. The variation of P(ωj|x) with x is illustrated in Figure 2.2 for the case P(ω1) = 2/3 and P(ω2) = 1/3.
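To make the formula concrete, here is a minimal sketch that evaluates the posteriors for two classes. The Gaussian class-conditional densities are illustrative assumptions (not the exact curves of Figure 2.1), while the priors match the 2/3 and 1/3 used above.

```python
# Minimal sketch: evaluating Bayes' formula for two classes.
# The Gaussian class-conditional densities are assumed for illustration.
import numpy as np
from scipy.stats import norm

priors = np.array([2/3, 1/3])                        # P(omega_1), P(omega_2)

def likelihoods(x):
    return np.array([norm.pdf(x, loc=10, scale=2),   # p(x | omega_1), assumed
                     norm.pdf(x, loc=14, scale=2)])  # p(x | omega_2), assumed

x = 12.0
joint = likelihoods(x) * priors                      # likelihood times prior
evidence = joint.sum()                               # p(x): just a scale factor
posteriors = joint / evidence                        # P(omega_j | x)
print(posteriors, posteriors.sum())                  # posteriors sum to 1.0
```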

Figure 2.1

Figure 2.1: Hypothetical class-conditional probability density functions show the probability density of measuring a particular feature value x given the pattern is in category ωi. If x represents the length of a fish, the two curves might describe the difference in length of populations of two types of fish. Density functions are normalized, and thus the area under each curve is 1.0.

Decision Rule#

  • Minimum Error Rate Classification:

    • Decide ω1 if P(ω1|x)>P(ω2|x), otherwise decide ω2.

    • Equivalent to: Decide ω1 if p(x|ω1)P(ω1)>p(x|ω2)P(ω2)

    • Probability of error: P(error|x)=min[P(ω1|x),P(ω2|x)]
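A minimal sketch of this rule, reusing the hypothetical likelihoods and priors assumed in the sketch above:

```python
# Sketch: two-class minimum error rate decision at a single feature value x.
import numpy as np
from scipy.stats import norm

priors = np.array([2/3, 1/3])
x = 12.0
likes = np.array([norm.pdf(x, 10, 2), norm.pdf(x, 14, 2)])   # assumed densities
posteriors = likes * priors / np.sum(likes * priors)

decision = np.argmax(posteriors) + 1     # decide omega_1 if P(w1|x) > P(w2|x)
p_error = posteriors.min()               # P(error|x) = min[P(w1|x), P(w2|x)]
print(f"decide omega_{decision}, P(error|x) = {p_error:.3f}")
```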

Figure 2.2

Figure 2.2: Posterior probabilities for the particular priors P(ω1) = 2/3 and P(ω2) = 1/3 for the class-conditional probability densities shown in Figure 2.1. Thus, in this case, given that a pattern is measured to have feature value x = 14, the probability it is in category ω2 is roughly 0.08, and that it is in ω1 is 0.92. At every x, the posteriors sum to 1.0.

Some additional insight can be obtained by considering a few special cases:

  1. Case 1: If for some x we have p(x|ω1)=p(x|ω2), then that particular observation gives us no information about the state of nature. In this case, the decision hinges entirely on the prior probabilities P(ω1) and P(ω2).

  2. Case 2: If P(ω1)=P(ω2), then the states of nature are equally probable. In this case, the decision is based entirely on the likelihoods p(x|ωj).

  3. General Case: In general, both the prior probabilities and the likelihoods are important in making a decision. The Bayes decision rule combines these factors to achieve the minimum probability of error.


Generalization to More Than Two Classes#

The Bayes decision rule to minimize risk calls for selecting the action that minimizes the conditional risk. To minimize the average probability of error, we should select the class i that maximizes the posterior probability P(ωi|x). In other words, for minimum error rate:

Decide ωi if P(ωi|x) > P(ωj|x) for all j ≠ i.

This rule generalizes naturally to multiple classes (c>2). For each class ωi, we compute the posterior probability P(ωi|x) and assign the feature vector x to the class with the highest posterior probability. This ensures that the probability of error is minimized.
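A minimal sketch of the multi-class rule; the class-conditional parameters and priors are illustrative assumptions:

```python
# Sketch: c-class minimum error rate rule -- pick the class with the largest
# posterior, or equivalently the largest likelihood times prior.
import numpy as np
from scipy.stats import norm

means = np.array([8.0, 11.0, 14.0])     # assumed class-conditional means
stds = np.array([1.5, 2.0, 1.0])        # assumed standard deviations
priors = np.array([0.5, 0.3, 0.2])      # assumed priors (sum to 1)

def classify(x):
    scores = norm.pdf(x, means, stds) * priors   # p(x|omega_i) P(omega_i)
    return np.argmax(scores) + 1                 # class with the highest posterior

print([classify(x) for x in (7.0, 10.5, 13.5)])
```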


Discriminant Functions#

  • Discriminant Function (gi(x)): Used to assign a feature vector x to class ωi.

    • For minimum error rate: gi(x) = P(ωi|x)

    • Can also be expressed as: gi(x) = p(x|ωi)P(ωi)

    • Or in log form: gi(x) = ln p(x|ωi) + ln P(ωi)
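The three forms above differ only by a monotonic transformation or a class-independent scale factor, so they produce the same decision. A minimal sketch with assumed Gaussian class-conditionals:

```python
# Sketch: the three equivalent discriminant functions give the same decision.
import numpy as np
from scipy.stats import norm

means, stds = np.array([10.0, 14.0]), np.array([2.0, 2.0])     # assumed
priors = np.array([2/3, 1/3])

x = 12.0
lik = norm.pdf(x, means, stds)
g_posterior = lik * priors / np.sum(lik * priors)              # g_i = P(w_i|x)
g_product = lik * priors                                       # g_i = p(x|w_i) P(w_i)
g_log = norm.logpdf(x, means, stds) + np.log(priors)           # g_i = ln p + ln P

# All three orderings agree, so the chosen class is identical.
print(np.argmax(g_posterior), np.argmax(g_product), np.argmax(g_log))
```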

Figure 2.5: The functional structure of a general statistical pattern classifier which includes d inputs and c discriminant functions gi(x). A subsequent step determines which of the discriminant values is the maximum, and categorizes the input pattern accordingly. The arrows show the direction of the flow of information, though frequently the arrows are omitted when the direction of flow is self-evident.

Figure 2.6

Figure 2.6: In this two-dimensional two-category classifier, the probability densities are Gaussian (with 1/e ellipses shown), the decision boundary consists of two hyperbolas, and thus the decision region R2 is not simply connected.


Normal Density#

  • Univariate Normal Density:

    $$p(x) = \frac{1}{\sqrt{2\pi}\,\sigma}\exp\left[-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2\right]$$

For which the expected value of x (an average, here taken over the feature space) is:

$$\mu = E[x] = \int_{-\infty}^{\infty} x\,p(x)\,dx$$

and where the expected squared deviation or variance is:

$$\sigma^2 = E[(x-\mu)^2] = \int_{-\infty}^{\infty} (x-\mu)^2\,p(x)\,dx$$

The univariate normal density is completely specified by two parameters: its mean μ and variance σ². For simplicity, we often abbreviate this by writing $p(x) \sim N(\mu, \sigma^2)$ to say that x is distributed normally with mean μ and variance σ². Samples from normal distributions tend to cluster about the mean, with a spread related to the standard deviation σ (see Figure 2.7).

Figure 2.7

Figure 2.7: A univariate normal distribution has roughly 95% of its area in the range $|x-\mu| \le 2\sigma$, as shown. The peak of the distribution has value $p(\mu) = \frac{1}{\sqrt{2\pi}\,\sigma}$.
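As a quick numerical check of the claims in Figure 2.7 (a sketch using scipy; the mean and standard deviation are arbitrary example values):

```python
# Sketch: ~95% of the mass of N(mu, sigma^2) lies within 2 sigma of the mean,
# and the peak density is 1 / (sqrt(2 pi) sigma).
import numpy as np
from scipy.stats import norm

mu, sigma = 0.0, 1.5
mass = norm.cdf(mu + 2*sigma, mu, sigma) - norm.cdf(mu - 2*sigma, mu, sigma)
peak = norm.pdf(mu, mu, sigma)

print(f"P(|x - mu| <= 2 sigma) = {mass:.4f}")   # about 0.9545
print(f"p(mu) = {peak:.4f}  vs  1/(sqrt(2 pi) sigma) = {1/(np.sqrt(2*np.pi)*sigma):.4f}")
```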

Multivariate Normal Density#

$$p(\mathbf{x}) = \frac{1}{(2\pi)^{d/2}|\Sigma|^{1/2}}\exp\left[-\frac{1}{2}(\mathbf{x}-\boldsymbol{\mu})^T\Sigma^{-1}(\mathbf{x}-\boldsymbol{\mu})\right]$$

  • Mean vector (μ) and covariance matrix (Σ) describe the distribution.
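A minimal sketch that evaluates this density directly from the formula and checks the result against scipy (the mean vector and covariance matrix are arbitrary example values):

```python
# Sketch: evaluate the multivariate normal density from its formula and
# compare with scipy's implementation.
import numpy as np
from scipy.stats import multivariate_normal

mu = np.array([2.0, 3.0])
Sigma = np.array([[2.0, 1.0], [1.0, 2.0]])
x = np.array([1.0, 4.0])

d = len(mu)
diff = x - mu
quad = diff @ np.linalg.inv(Sigma) @ diff           # (x - mu)^T Sigma^-1 (x - mu)
p_manual = np.exp(-0.5 * quad) / ((2*np.pi)**(d/2) * np.sqrt(np.linalg.det(Sigma)))

print(p_manual, multivariate_normal(mu, Sigma).pdf(x))   # the two values agree
```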


Formal Definitions#

Formally, we have:

$$\boldsymbol{\mu} = E[\mathbf{x}] = \int \mathbf{x}\,p(\mathbf{x})\,d\mathbf{x}$$

and

$$\Sigma = E\left[(\mathbf{x}-\boldsymbol{\mu})(\mathbf{x}-\boldsymbol{\mu})^T\right]$$

where the expected value of a vector or a matrix is found by taking the expected values of its components. In other words, if xi is the ith component of x, μi the ith component of μ, and σij the ijth component of Σ, then:

$$\mu_i = E[x_i]$$

and

$$\sigma_{ij} = E[(x_i-\mu_i)(x_j-\mu_j)]. \qquad (42)$$

Properties of the Covariance Matrix#

  • The covariance matrix Σ is always symmetric and positive semidefinite.

  • We restrict our attention to the case where Σ is positive definite, so that the determinant of Σ is strictly positive.

  • The diagonal elements σii are the variances of the respective xi (i.e., $\sigma_i^2$).

  • The off-diagonal elements σij are the covariances of xi and xj.

    • Example: For the length and weight features of a population of fish, we would expect a positive covariance.

  • If xi and xj are statistically independent, then σij=0.

  • If all off-diagonal elements are zero, p(x) reduces to the product of the univariate normal densities for the components of x.
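A short sketch illustrating these properties on simulated data (the mean vector and covariance matrix are arbitrary example values):

```python
# Sketch: the sample covariance is symmetric and positive semidefinite, and a
# diagonal Sigma makes the joint density a product of univariate normals.
import numpy as np
from scipy.stats import multivariate_normal, norm

rng = np.random.default_rng(0)
mu = np.array([1.0, -1.0])
Sigma = np.diag([2.0, 0.5])               # zero off-diagonals: independent components
X = rng.multivariate_normal(mu, Sigma, size=5000)

S = np.cov(X.T)                           # sample covariance matrix
print(np.allclose(S, S.T))                # symmetric
print(np.all(np.linalg.eigvalsh(S) >= 0)) # positive semidefinite

x = np.array([0.5, 0.0])
joint = multivariate_normal(mu, Sigma).pdf(x)
product = norm.pdf(x[0], mu[0], np.sqrt(Sigma[0, 0])) * norm.pdf(x[1], mu[1], np.sqrt(Sigma[1, 1]))
print(np.isclose(joint, product))         # diagonal Sigma => product of univariate densities
```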

Parameters of the Multivariate Normal Density#

The multivariate normal density is completely specified by d + d(d+1)/2 parameters:

  • The d elements of the mean vector μ.

  • The d(d+1)/2 independent elements of the covariance matrix Σ.

Samples drawn from a normal population tend to fall in a single cluster (see Figure 2.9):

  • The center of the cluster is determined by the mean vector μ.

  • The shape of the cluster is determined by the covariance matrix Σ.

The loci of points of constant density are hyperellipsoids defined by the quadratic form: $(\mathbf{x}-\boldsymbol{\mu})^T\Sigma^{-1}(\mathbf{x}-\boldsymbol{\mu}) = \text{constant}.$

  • The principal axes of these hyperellipsoids are given by the eigenvectors of Σ (denoted by Φ).

  • The eigenvalues (denoted by Λ) determine the lengths of these axes.

The quantity $r^2 = (\mathbf{x}-\boldsymbol{\mu})^T\Sigma^{-1}(\mathbf{x}-\boldsymbol{\mu})$ is called the squared Mahalanobis distance from x to μ. Thus:

  • The contours of constant density are hyperellipsoids of constant Mahalanobis distance to μ.

  • The volume of these hyperellipsoids measures the scatter of the samples about the mean.
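A minimal sketch that computes the squared Mahalanobis distance and recovers the principal axes of the constant-density ellipses from the eigendecomposition of Σ (the parameters are arbitrary example values):

```python
# Sketch: squared Mahalanobis distance and the principal axes of the
# constant-density hyperellipsoids for an example covariance matrix.
import numpy as np

mu = np.array([0.0, 0.0])
Sigma = np.array([[3.0, 1.0], [1.0, 2.0]])
x = np.array([2.0, 1.0])

r2 = (x - mu) @ np.linalg.inv(Sigma) @ (x - mu)   # squared Mahalanobis distance
eigvals, eigvecs = np.linalg.eigh(Sigma)          # Lambda (axis lengths^2), Phi (axis directions)

print(f"r^2 = {r2:.3f}")
print("principal axes (columns):\n", eigvecs)
print("axis half-lengths for r = 1:", np.sqrt(eigvals))
```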

Figure 2.9

Figure 2.9: Samples drawn from a two-dimensional Gaussian lie in a cloud centered on the mean μ. The red ellipses show lines of equal probability density of the Gaussian.


Discriminant Functions for Normal Density#

Case 1: Equal Covariance Matrices (Σi = σ²I)#

  • The discriminant function is linear:

$$g_i(\mathbf{x}) = \mathbf{w}_i^T\mathbf{x} + w_{i0},$$

How?

Let’s derive the simplified form of the discriminant function gi(x) for the case where the covariance matrices are equal and isotropic (Σi = σ²I). We’ll start with the given multivariate normal density and show how it reduces to the stated form.


Given:#

The probability density function (PDF) for class ωi is:

$$p(\mathbf{x}\mid\omega_i) = \frac{1}{(2\pi)^{d/2}|\Sigma_i|^{1/2}}\exp\left[-\frac{1}{2}(\mathbf{x}-\boldsymbol{\mu}_i)^T\Sigma_i^{-1}(\mathbf{x}-\boldsymbol{\mu}_i)\right],$$

where:

  • $\Sigma_i = \sigma^2 I$ (isotropic covariance, equal across classes).

  • $|\Sigma_i| = (\sigma^2)^d$ (determinant of $\sigma^2 I$).

  • $\Sigma_i^{-1} = \frac{1}{\sigma^2} I$ (inverse of a scaled identity matrix).


Substitute Σi = σ²I into the PDF#

$$p(\mathbf{x}\mid\omega_i) = \frac{1}{(2\pi)^{d/2}(\sigma^2)^{d/2}}\exp\left[-\frac{1}{2}(\mathbf{x}-\boldsymbol{\mu}_i)^T\left(\frac{1}{\sigma^2}I\right)(\mathbf{x}-\boldsymbol{\mu}_i)\right].$$

Simplify the exponent:

$$(\mathbf{x}-\boldsymbol{\mu}_i)^T\left(\frac{1}{\sigma^2}I\right)(\mathbf{x}-\boldsymbol{\mu}_i) = \frac{1}{\sigma^2}(\mathbf{x}-\boldsymbol{\mu}_i)^T(\mathbf{x}-\boldsymbol{\mu}_i) = \frac{\|\mathbf{x}-\boldsymbol{\mu}_i\|^2}{\sigma^2},$$

where $\|\mathbf{x}-\boldsymbol{\mu}_i\|^2$ is the squared Euclidean distance.

Thus, the PDF becomes:

$$p(\mathbf{x}\mid\omega_i) = \frac{1}{(2\pi\sigma^2)^{d/2}}\exp\left[-\frac{\|\mathbf{x}-\boldsymbol{\mu}_i\|^2}{2\sigma^2}\right].$$

Apply Bayes’ Rule for Posterior Probability#

The posterior probability P(ωix) is:

$$P(\omega_i\mid\mathbf{x}) = \frac{p(\mathbf{x}\mid\omega_i)\,P(\omega_i)}{p(\mathbf{x})},$$

where $p(\mathbf{x}) = \sum_j p(\mathbf{x}\mid\omega_j)\,P(\omega_j)$ is the evidence (ignored in discriminant functions).

The log-posterior (discriminant function gi(x)) is:

$$g_i(\mathbf{x}) = \ln p(\mathbf{x}\mid\omega_i) + \ln P(\omega_i).$$

Take the Logarithm of the PDF#

Compute lnp(xωi):

$$\ln p(\mathbf{x}\mid\omega_i) = \ln\left(\frac{1}{(2\pi\sigma^2)^{d/2}}\right) - \frac{\|\mathbf{x}-\boldsymbol{\mu}_i\|^2}{2\sigma^2}.$$

Simplify the first term:

$$\ln\left(\frac{1}{(2\pi\sigma^2)^{d/2}}\right) = -\frac{d}{2}\ln(2\pi\sigma^2).$$

Thus:

$$\ln p(\mathbf{x}\mid\omega_i) = -\frac{d}{2}\ln(2\pi\sigma^2) - \frac{\|\mathbf{x}-\boldsymbol{\mu}_i\|^2}{2\sigma^2}.$$

Combine with Prior and Ignore Constants#

The discriminant function becomes:

$$g_i(\mathbf{x}) = \ln p(\mathbf{x}\mid\omega_i) + \ln P(\omega_i) = -\frac{\|\mathbf{x}-\boldsymbol{\mu}_i\|^2}{2\sigma^2} - \frac{d}{2}\ln(2\pi\sigma^2) + \ln P(\omega_i).$$

Since $-\frac{d}{2}\ln(2\pi\sigma^2)$ is constant across all classes ωi, it does not affect the classification decision (we compare gi(x) across i, and the constant cancels out). Therefore, we drop it and write:

$$g_i(\mathbf{x}) = -\frac{\|\mathbf{x}-\boldsymbol{\mu}_i\|^2}{2\sigma^2} + \ln P(\omega_i)$$

  • Decision boundaries are hyperplanes.

  • If the prior probabilities are not equal, the squared distance $\|\mathbf{x}-\boldsymbol{\mu}_i\|^2$ is normalized by the variance σ² and offset by adding ln P(ωi). This means that if x is equally near two different mean vectors, the optimal decision will favor the a priori more likely category.

  • It is not necessary to compute distances explicitly. Expanding the quadratic form $(\mathbf{x}-\boldsymbol{\mu}_i)^T(\mathbf{x}-\boldsymbol{\mu}_i)$ yields:

    $$g_i(\mathbf{x}) = -\frac{1}{2\sigma^2}\left[\mathbf{x}^T\mathbf{x} - 2\boldsymbol{\mu}_i^T\mathbf{x} + \boldsymbol{\mu}_i^T\boldsymbol{\mu}_i\right] + \ln P(\omega_i)$$

    The quadratic term $\mathbf{x}^T\mathbf{x}$ is the same for all i, so it can be ignored as an additive constant. This simplifies the discriminant function to:

    $$g_i(\mathbf{x}) = \mathbf{w}_i^T\mathbf{x} + w_{i0}$$

    where:

    $$\mathbf{w}_i = \frac{\boldsymbol{\mu}_i}{\sigma^2} \qquad (52)$$

    and

    $$w_{i0} = -\frac{\boldsymbol{\mu}_i^T\boldsymbol{\mu}_i}{2\sigma^2} + \ln P(\omega_i)$$

    Here, wi0 is called the threshold or bias in the ith direction.

  • A classifier that uses linear discriminant functions is called a linear machine. The decision surfaces for a linear machine are pieces of hyperplanes defined by the linear equations gi(x)=gj(x) for the two categories with the highest posterior probabilities. For our case, this equation can be written as:

    $$\mathbf{w}^T(\mathbf{x}-\mathbf{x}_0) = 0$$

    where:

    $$\mathbf{w} = \boldsymbol{\mu}_i - \boldsymbol{\mu}_j$$

    and

    $$\mathbf{x}_0 = \frac{1}{2}(\boldsymbol{\mu}_i + \boldsymbol{\mu}_j) - \frac{\sigma^2}{\|\boldsymbol{\mu}_i - \boldsymbol{\mu}_j\|^2}\,\ln\frac{P(\omega_i)}{P(\omega_j)}\,(\boldsymbol{\mu}_i - \boldsymbol{\mu}_j) \qquad (56)$$

    This defines a hyperplane through the point x0 and orthogonal to the vector w. Since w = μi − μj, the hyperplane separating Ri and Rj is orthogonal to the line linking the means. If P(ωi) = P(ωj), the hyperplane is the perpendicular bisector of the line between the means. If the prior probabilities are unequal, the point x0 shifts away from the more likely mean (a numerical check follows this list).
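The following minimal sketch (with assumed means, variance, and priors) checks this numerically: the linear form built from Eq. 52 reproduces the decisions of the full discriminant, and the point x0 from Eq. 56 satisfies gi(x0) = gj(x0).

```python
# Sketch: verify the Case 1 (Sigma_i = sigma^2 I) linear discriminant.
# Means, variance, and priors are illustrative assumptions.
import numpy as np

mu = [np.array([0.0, 0.0]), np.array([4.0, 2.0])]
sigma2 = 1.5
priors = [0.7, 0.3]

def g_full(x, i):
    # g_i(x) = -||x - mu_i||^2 / (2 sigma^2) + ln P(omega_i)
    return -np.sum((x - mu[i])**2) / (2 * sigma2) + np.log(priors[i])

def g_linear(x, i):
    # Eq. 52: w_i = mu_i / sigma^2,  w_i0 = -mu_i^T mu_i / (2 sigma^2) + ln P(omega_i)
    w = mu[i] / sigma2
    w0 = -mu[i] @ mu[i] / (2 * sigma2) + np.log(priors[i])
    return w @ x + w0

# Eq. 56: the point the separating hyperplane passes through
diff = mu[0] - mu[1]
x0 = 0.5 * (mu[0] + mu[1]) - sigma2 / (diff @ diff) * np.log(priors[0] / priors[1]) * diff

rng = np.random.default_rng(1)
pts = rng.normal(scale=3.0, size=(100, 2))
same = all(np.argmax([g_full(x, 0), g_full(x, 1)]) == np.argmax([g_linear(x, 0), g_linear(x, 1)])
           for x in pts)
print(same)                                        # True: identical decisions
print(np.isclose(g_full(x0, 0), g_full(x0, 1)))    # True: x0 lies on the boundary
```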


Figure 2.10

Figure 2.10: If the covariances of two distributions are equal and proportional to the identity matrix, the distributions are spherical in d dimensions, and the boundary is a generalized hyperplane of d − 1 dimensions, perpendicular to the line separating the means.

The equation $\mathbf{w}^T(\mathbf{x}-\mathbf{x}_0) = 0$ represents a hyperplane in d-dimensional space. Here’s how to interpret and visualize it:


Interpretation

  1. w: A normal vector to the hyperplane (defines the orientation of the hyperplane).

  2. x0: A fixed point on the hyperplane.

  3. x: A variable point on the hyperplane.

The equation states that the vector xx0 is perpendicular to w, meaning all points x on the hyperplane satisfy this condition.


Visualization in 2D

In 2D space (d = 2), the equation reduces to a line. Let’s break it down:

  1. $\mathbf{w} = [w_1, w_2]^T$: The normal vector to the line.

  2. $\mathbf{x}_0 = [x_0, y_0]^T$: A fixed point on the line.

  3. $\mathbf{x} = [x, y]^T$: A variable point on the line.

The equation becomes: $w_1(x - x_0) + w_2(y - y_0) = 0$

This is the equation of a line in 2D.


Drawing the Hyperplane (Line in 2D)

Here’s how to draw it:

  1. Plot the point x0: This is a fixed point on the line.

  2. Draw the normal vector w: This vector is perpendicular to the line.

  3. Draw the line: The line passes through x0 and is perpendicular to w.


Example

Let’s use the following values:

  • $\mathbf{w} = [2, 1]^T$ (normal vector).

  • $\mathbf{x}_0 = [1, 1]^T$ (a point on the line).

The equation becomes: $2(x - 1) + 1(y - 1) = 0$. Simplify: $2x - 2 + y - 1 = 0 \;\Rightarrow\; 2x + y - 3 = 0$

import matplotlib.pyplot as plt
import numpy as np

# Define the normal vector w and point x0
w = np.array([2, 1])  # Normal vector
x0 = np.array([1, 1])  # Point on the line

# Define the line equation: 2x + y - 3 = 0 => y = -2x + 3
x_values = np.linspace(-5, 5, 100)  # Range of x values
y_values = -2 * x_values + 3  # Corresponding y values

# Plot the line
plt.plot(x_values, y_values, label="Line: $2x + y - 3 = 0$")

# Plot the point x0
plt.scatter(x0[0], x0[1], color="red", label=r"Point $\mathbf{x}_0 = (1, 1)$")

# Plot the normal vector w starting from x0
plt.quiver(x0[0], x0[1], w[0], w[1], angles='xy', scale_units='xy', scale=1, color="green", label=r"Normal vector $\mathbf{w} = (2, 1)$")

# Add labels and legend
plt.xlabel("x")
plt.ylabel("y")
plt.axhline(0, color="black", linewidth=0.5)  # x-axis
plt.axvline(0, color="black", linewidth=0.5)  # y-axis
plt.grid(True)
plt.legend()
plt.title("Line and Normal Vector Visualization")
plt.xlim(-5, 5)  # Adjust x-axis limits
plt.ylim(-5, 5)  # Adjust y-axis limits
plt.gca().set_aspect("equal", adjustable="box")  # Equal aspect ratio
plt.show()

Hyperplane and Decision Boundary

The equation:

$$\mathbf{w}^T(\mathbf{x}-\mathbf{x}_0) = 0$$

defines a hyperplane through the point x0 and orthogonal to the vector w. Since w = μi − μj, the hyperplane separating Ri and Rj is orthogonal to the line linking the means.

  • If P(ωi)=P(ωj), the second term in Eq. 56 vanishes, and the point x0 is halfway between the means. In this case, the hyperplane is the perpendicular bisector of the line between the means (see Figure 2.11).

  • If the prior probabilities are unequal, the point x0 shifts away from the more likely mean.

  • Note that if the variance σ² is small relative to the squared distance $\|\boldsymbol{\mu}_i - \boldsymbol{\mu}_j\|^2$, the position of the decision boundary is relatively insensitive to the exact values of the prior probabilities.


Minimum Distance Classifier

  • If the prior probabilities P(ωi) are the same for all c classes, the term lnP(ωi) becomes an unimportant additive constant and can be ignored.

  • In this case, the optimum decision rule simplifies to:

    • Measure the Euclidean distance $\|\mathbf{x}-\boldsymbol{\mu}_i\|$ from x to each of the c mean vectors.

    • Assign x to the category of the nearest mean.

  • Such a classifier is called a minimum distance classifier.

  • If each mean vector is considered an ideal prototype or template for patterns in its class, this is essentially a template-matching procedure (see Figure 2.10). This technique is similar to the nearest-neighbor algorithm visited in the previous chapter; a minimal sketch follows this list.
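A minimal sketch of the minimum distance classifier (the class means are illustrative assumptions):

```python
# Sketch: minimum distance (nearest-mean) classifier, valid for equal priors
# and Sigma_i = sigma^2 I. The class means are assumed for illustration.
import numpy as np

means = np.array([[0.0, 0.0],
                  [4.0, 2.0],
                  [1.0, 5.0]])                      # one prototype/template per class

def classify(x):
    dists = np.linalg.norm(means - x, axis=1)       # Euclidean distance to each mean
    return int(np.argmin(dists))                    # assign to the nearest mean

print(classify(np.array([3.5, 2.5])))               # -> 1 (closest to [4, 2])
```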

Figure 2.11

Figure 2.11: As the priors are changed, the decision boundary shifts; for sufficiently disparate priors the boundary will not lie between the means of these 1-, 2- and 3-dimensional spherical Gaussian distributions.

Case 2: Equal but Arbitrary Covariance Matrices (Σi=Σ for all i)#

When the covariance matrices for all classes are identical but arbitrary (Σi = Σ), the discriminant function defined in Eq. 51 simplifies. First, the term ln|Σi| becomes ln|Σ|, which is the same constant for all classes i. Since it doesn’t affect the decision, we can ignore it. Similarly, the term (d/2) ln(2π) is also a constant independent of i and can be dropped. Thus, we have:

$$g_i(\mathbf{x}) = -\frac{1}{2}(\mathbf{x}-\boldsymbol{\mu}_i)^T\Sigma^{-1}(\mathbf{x}-\boldsymbol{\mu}_i) + \ln P(\omega_i)$$

Now, let’s expand the quadratic term:

$$(\mathbf{x}-\boldsymbol{\mu}_i)^T\Sigma^{-1}(\mathbf{x}-\boldsymbol{\mu}_i) = (\mathbf{x}^T-\boldsymbol{\mu}_i^T)\,\Sigma^{-1}(\mathbf{x}-\boldsymbol{\mu}_i) = \mathbf{x}^T\Sigma^{-1}\mathbf{x} - \mathbf{x}^T\Sigma^{-1}\boldsymbol{\mu}_i - \boldsymbol{\mu}_i^T\Sigma^{-1}\mathbf{x} + \boldsymbol{\mu}_i^T\Sigma^{-1}\boldsymbol{\mu}_i$$

Since Σ is symmetric, its inverse $\Sigma^{-1}$ is also symmetric. The term $\mathbf{x}^T\Sigma^{-1}\boldsymbol{\mu}_i$ is a scalar, so it is equal to its transpose: $\mathbf{x}^T\Sigma^{-1}\boldsymbol{\mu}_i = (\mathbf{x}^T\Sigma^{-1}\boldsymbol{\mu}_i)^T = \boldsymbol{\mu}_i^T(\Sigma^{-1})^T\mathbf{x} = \boldsymbol{\mu}_i^T\Sigma^{-1}\mathbf{x}$. Therefore:

$$(\mathbf{x}-\boldsymbol{\mu}_i)^T\Sigma^{-1}(\mathbf{x}-\boldsymbol{\mu}_i) = \mathbf{x}^T\Sigma^{-1}\mathbf{x} - 2\,\boldsymbol{\mu}_i^T\Sigma^{-1}\mathbf{x} + \boldsymbol{\mu}_i^T\Sigma^{-1}\boldsymbol{\mu}_i$$

Substituting this back into the expression for gi(x):

$$g_i(\mathbf{x}) = -\frac{1}{2}\mathbf{x}^T\Sigma^{-1}\mathbf{x} + \boldsymbol{\mu}_i^T\Sigma^{-1}\mathbf{x} - \frac{1}{2}\boldsymbol{\mu}_i^T\Sigma^{-1}\boldsymbol{\mu}_i + \ln P(\omega_i)$$

Notice that the term $-\frac{1}{2}\mathbf{x}^T\Sigma^{-1}\mathbf{x}$ is independent of the class i. Since we are interested in comparing gi(x) and gj(x) (e.g., finding the maximum), any term that does not depend on i can be dropped without affecting the decision rule or the location of the decision boundaries. Thus, we can simplify the discriminant function to:

$$g_i(\mathbf{x}) = \boldsymbol{\mu}_i^T\Sigma^{-1}\mathbf{x} - \frac{1}{2}\boldsymbol{\mu}_i^T\Sigma^{-1}\boldsymbol{\mu}_i + \ln P(\omega_i)$$

This discriminant function is linear in x. We can rewrite it in the familiar linear form:

$$g_i(\mathbf{x}) = \mathbf{w}_i^T\mathbf{x} + w_{i0}$$

where the weight vector wi and the bias term wi0 are defined as:

$$\mathbf{w}_i = \Sigma^{-1}\boldsymbol{\mu}_i$$

$$w_{i0} = -\frac{1}{2}\boldsymbol{\mu}_i^T\Sigma^{-1}\boldsymbol{\mu}_i + \ln P(\omega_i)$$

Since the discriminant function gi(x) is linear, the decision boundaries between any two classes ωi and ωj, defined by gi(x)=gj(x), are hyperplanes. Let’s examine the equation for the decision boundary:

$$\mathbf{w}_i^T\mathbf{x} + w_{i0} = \mathbf{w}_j^T\mathbf{x} + w_{j0}$$

$$(\mathbf{w}_i - \mathbf{w}_j)^T\mathbf{x} + (w_{i0} - w_{j0}) = 0$$

Substituting the expressions for wi, wj, wi0, and wj0:

$$(\Sigma^{-1}\boldsymbol{\mu}_i - \Sigma^{-1}\boldsymbol{\mu}_j)^T\mathbf{x} + \left(-\frac{1}{2}\boldsymbol{\mu}_i^T\Sigma^{-1}\boldsymbol{\mu}_i + \ln P(\omega_i)\right) - \left(-\frac{1}{2}\boldsymbol{\mu}_j^T\Sigma^{-1}\boldsymbol{\mu}_j + \ln P(\omega_j)\right) = 0$$

$$(\boldsymbol{\mu}_i - \boldsymbol{\mu}_j)^T\Sigma^{-1}\mathbf{x} - \frac{1}{2}\left(\boldsymbol{\mu}_i^T\Sigma^{-1}\boldsymbol{\mu}_i - \boldsymbol{\mu}_j^T\Sigma^{-1}\boldsymbol{\mu}_j\right) + \ln\frac{P(\omega_i)}{P(\omega_j)} = 0$$

This equation defines a hyperplane. Let $\mathbf{w} = \Sigma^{-1}(\boldsymbol{\mu}_i - \boldsymbol{\mu}_j)$. The equation can be written as:

$$\mathbf{w}^T\mathbf{x} + w_0 = 0$$

where

$$w_0 = -\frac{1}{2}\left(\boldsymbol{\mu}_i^T\Sigma^{-1}\boldsymbol{\mu}_i - \boldsymbol{\mu}_j^T\Sigma^{-1}\boldsymbol{\mu}_j\right) + \ln\frac{P(\omega_i)}{P(\omega_j)}$$

The vector $\mathbf{w} = \Sigma^{-1}(\boldsymbol{\mu}_i - \boldsymbol{\mu}_j)$ is normal to the decision hyperplane. Note that this normal vector is generally not in the direction of the difference between the means $(\boldsymbol{\mu}_i - \boldsymbol{\mu}_j)$ unless Σ is proportional to the identity matrix (Case 1).

The location of the hyperplane is influenced by the prior probabilities P(ωi) and P(ωj). If the priors are equal (P(ωi) = P(ωj)), then $\ln\frac{P(\omega_i)}{P(\omega_j)} = \ln 1 = 0$. The equation simplifies further, and the threshold w0 is determined solely by the means and the covariance matrix. Specifically, the boundary passes through the point $\mathbf{x}_0 = \frac{1}{2}(\boldsymbol{\mu}_i + \boldsymbol{\mu}_j)$ if P(ωi) = P(ωj).

Geometric Interpretation: The decision boundaries are hyperplanes. Unlike Case 1, where the hyperplanes are orthogonal to the line connecting the means, here the orientation of the hyperplanes depends on the shared covariance matrix Σ. The surfaces of constant Mahalanobis distance from the means are hyperellipsoids, all having the same shape and orientation determined by Σ. The decision boundaries are linear because the quadratic terms $\mathbf{x}^T\Sigma^{-1}\mathbf{x}$ cancelled out.

Figure 2.12 (from DHS book) illustrates this scenario for Gaussian distributions. It shows how the equal covariance matrices lead to linear decision boundaries, but these boundaries are not necessarily perpendicular to the line connecting the class means, unlike the simpler Case 1. The shape and orientation of the ellipses (representing contours of equal probability density) are the same for both classes, reflecting the shared covariance matrix Σ. The decision boundary shifts depending on the prior probabilities.
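As a quick numerical sanity check of these statements (a sketch with assumed means and an assumed shared covariance matrix), we can verify that w = Σ⁻¹(μi − μj) is generally not parallel to μi − μj, and that with equal priors the boundary passes through the midpoint of the means:

```python
# Sketch: Case 2 boundary checks with an assumed shared covariance matrix.
import numpy as np

mu_i, mu_j = np.array([2.0, 3.0]), np.array([6.0, 5.0])
Sigma = np.array([[2.0, 1.0], [1.0, 2.0]])
Sigma_inv = np.linalg.inv(Sigma)

def g(x, mu, prior):
    # g_i(x) = mu_i^T Sigma^-1 x - (1/2) mu_i^T Sigma^-1 mu_i + ln P(omega_i)
    return mu @ Sigma_inv @ x - 0.5 * mu @ Sigma_inv @ mu + np.log(prior)

w = Sigma_inv @ (mu_i - mu_j)            # normal to the decision hyperplane
diff = mu_i - mu_j
cosine = w @ diff / (np.linalg.norm(w) * np.linalg.norm(diff))
print(f"cos(angle between w and mu_i - mu_j) = {cosine:.3f}")   # not 1 in general

x0 = 0.5 * (mu_i + mu_j)                 # midpoint of the means
print(np.isclose(g(x0, mu_i, 0.5), g(x0, mu_j, 0.5)))           # True for equal priors
```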


Exercise: Validating the Decision Boundary in Case 2 and Comparing with GaussianNB

Objective

In this exercise, you will:

  1. Generate two groups of data with Gaussian distributions.

  2. Compute the decision boundary for Case 2 (equal but arbitrary covariance matrices) using the theoretical approach.

  3. Compare the results with the decision boundary generated by Scikit-Learn’s GaussianNB.

  4. Visualize the results in a single plot or side-by-side plots.

Steps to Follow

  1. Generate Data:

    • Create two classes of data points, each following a multivariate Gaussian distribution with the same covariance matrix but different mean vectors.

    • Use the following parameters for the two classes:

      • Class 1: Mean vector $\boldsymbol{\mu}_1 = [2, 3]^T$, Covariance matrix $\Sigma = \begin{bmatrix} 2 & 1 \\ 1 & 2 \end{bmatrix}$.

      • Class 2: Mean vector $\boldsymbol{\mu}_2 = [6, 5]^T$, Covariance matrix $\Sigma = \begin{bmatrix} 2 & 1 \\ 1 & 2 \end{bmatrix}$.

    • Generate 200 data points for each class.

  2. Compute the Decision Boundary (Theoretical Approach):

    • Use the linear discriminant function for Case 2:

      $$g_i(\mathbf{x}) = -\frac{1}{2}(\mathbf{x}-\boldsymbol{\mu}_i)^T\Sigma^{-1}(\mathbf{x}-\boldsymbol{\mu}_i) + \ln P(\omega_i).$$
    • Assume equal prior probabilities (P(ω1)=P(ω2)=0.5).

    • The decision boundary is the set of points where g1(x)=g2(x).

  3. Compute the Decision Boundary (GaussianNB):

    • Use Scikit-Learn’s GaussianNB to fit the data and predict the labels for test points.

  4. Generate Test Points:

    • Create a grid of test points covering the entire plot area (e.g., using np.meshgrid).

    • Predict the class labels for these test points using both the theoretical decision boundary and GaussianNB.

  5. Visualize the Results:

    • Plot the original data points for both classes.

    • Plot the test points with their predicted labels (use transparency to avoid obscuring the original data).

    • Draw the decision boundary for both the theoretical approach and GaussianNB.

    • Draw the line connecting the two class centers and the perpendicular bisector.

Expected Output

  • A figure with two subplots:

    1. Theoretical Decision Boundary:

      • The original data points for both classes.

      • The test points with their predicted labels (shaded regions).

      • The theoretical decision boundary (black line).

      • The line connecting the two class centers (dashed black line).

      • The perpendicular bisector (dotted green line).

    2. GaussianNB Decision Boundary:

      • The original data points for both classes.

      • The test points with their predicted labels (shaded regions).

      • The decision boundary generated by GaussianNB (black line).

      • The line connecting the two class centers (dashed black line).

      • The perpendicular bisector (dotted green line).
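A possible starting point for this exercise is sketched below. It is not a reference solution: it assumes the parameters above and equal priors, and it omits the perpendicular bisector. Note also that GaussianNB fits a separate diagonal covariance per class, so its boundary will generally differ from the theoretical linear one.

```python
# Sketch: Case 2 decision boundary (theory) vs. sklearn's GaussianNB.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.naive_bayes import GaussianNB

rng = np.random.default_rng(0)
mu1, mu2 = np.array([2.0, 3.0]), np.array([6.0, 5.0])
Sigma = np.array([[2.0, 1.0], [1.0, 2.0]])

X1 = rng.multivariate_normal(mu1, Sigma, 200)
X2 = rng.multivariate_normal(mu2, Sigma, 200)
X = np.vstack([X1, X2])
y = np.hstack([np.zeros(200), np.ones(200)])

# Theoretical Case 2 discriminant: g_i(x) = -(1/2)(x-mu_i)^T Sigma^-1 (x-mu_i) + ln P(omega_i)
Sigma_inv = np.linalg.inv(Sigma)
def g(x, mu, prior=0.5):
    d = x - mu
    return -0.5 * np.einsum("...i,ij,...j->...", d, Sigma_inv, d) + np.log(prior)

# Grid of test points covering the plot area
xx, yy = np.meshgrid(np.linspace(-3, 11, 300), np.linspace(-3, 11, 300))
grid = np.stack([xx.ravel(), yy.ravel()], axis=1)
theo_labels = (g(grid, mu2) > g(grid, mu1)).astype(float).reshape(xx.shape)

# GaussianNB fits per-class *diagonal* covariances, so its boundary can differ
gnb = GaussianNB().fit(X, y)
gnb_labels = gnb.predict(grid).reshape(xx.shape)

fig, axes = plt.subplots(1, 2, figsize=(12, 5), sharex=True, sharey=True)
for ax, labels, title in zip(axes, [theo_labels, gnb_labels],
                             ["Theoretical (Case 2)", "GaussianNB"]):
    ax.contourf(xx, yy, labels, levels=[-0.5, 0.5, 1.5], alpha=0.2)   # predicted regions
    ax.contour(xx, yy, labels, levels=[0.5], colors="black")          # decision boundary
    ax.scatter(*X1.T, s=10, label="class 1")
    ax.scatter(*X2.T, s=10, label="class 2")
    ax.plot([mu1[0], mu2[0]], [mu1[1], mu2[1]], "k--", label="line between means")
    ax.set_title(title)
    ax.legend()
plt.show()
```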


Case 3: Arbitrary Covariance Matrices (Σi)#

This is the most general case for Gaussian distributions. Here, each class ωi is modeled by a multivariate normal distribution N(μi,Σi), where both the mean vector μi and the covariance matrix Σi can be different for each class.

We start again with the general discriminant function, dropping the constant term (d/2) ln(2π) as it doesn’t affect the decision:

$$g_i(\mathbf{x}) = \ln p(\mathbf{x}|\omega_i) + \ln P(\omega_i)$$

$$g_i(\mathbf{x}) = -\frac{1}{2}(\mathbf{x}-\boldsymbol{\mu}_i)^T\Sigma_i^{-1}(\mathbf{x}-\boldsymbol{\mu}_i) - \frac{1}{2}\ln|\Sigma_i| + \ln P(\omega_i) \qquad \text{(Eq. 62)}$$

Now, we expand the quadratic term $(\mathbf{x}-\boldsymbol{\mu}_i)^T\Sigma_i^{-1}(\mathbf{x}-\boldsymbol{\mu}_i)$:

$$(\mathbf{x}-\boldsymbol{\mu}_i)^T\Sigma_i^{-1}(\mathbf{x}-\boldsymbol{\mu}_i) = \mathbf{x}^T\Sigma_i^{-1}\mathbf{x} - \mathbf{x}^T\Sigma_i^{-1}\boldsymbol{\mu}_i - \boldsymbol{\mu}_i^T\Sigma_i^{-1}\mathbf{x} + \boldsymbol{\mu}_i^T\Sigma_i^{-1}\boldsymbol{\mu}_i$$

Since $\Sigma_i^{-1}$ is symmetric, the two middle terms are equal scalar values: $\mathbf{x}^T\Sigma_i^{-1}\boldsymbol{\mu}_i = (\boldsymbol{\mu}_i^T\Sigma_i^{-1}\mathbf{x})^T = \boldsymbol{\mu}_i^T\Sigma_i^{-1}\mathbf{x}$. So the expansion becomes:

$$\mathbf{x}^T\Sigma_i^{-1}\mathbf{x} - 2\,\boldsymbol{\mu}_i^T\Sigma_i^{-1}\mathbf{x} + \boldsymbol{\mu}_i^T\Sigma_i^{-1}\boldsymbol{\mu}_i$$

Substituting this back into the expression for gi(x):

$$g_i(\mathbf{x}) = -\frac{1}{2}\left(\mathbf{x}^T\Sigma_i^{-1}\mathbf{x} - 2\,\boldsymbol{\mu}_i^T\Sigma_i^{-1}\mathbf{x} + \boldsymbol{\mu}_i^T\Sigma_i^{-1}\boldsymbol{\mu}_i\right) - \frac{1}{2}\ln|\Sigma_i| + \ln P(\omega_i)$$

Rearranging the terms to group by powers of x:

$$g_i(\mathbf{x}) = \left(-\frac{1}{2}\mathbf{x}^T\Sigma_i^{-1}\mathbf{x}\right) + \left(\boldsymbol{\mu}_i^T\Sigma_i^{-1}\mathbf{x}\right) + \left(-\frac{1}{2}\boldsymbol{\mu}_i^T\Sigma_i^{-1}\boldsymbol{\mu}_i - \frac{1}{2}\ln|\Sigma_i| + \ln P(\omega_i)\right)$$

We can see this is a quadratic function of x. It can be written in the general quadratic form:

$$g_i(\mathbf{x}) = \mathbf{x}^T W_i\,\mathbf{x} + \mathbf{w}_i^T\mathbf{x} + w_{i0} \qquad \text{(Eq. 63)}$$

where the components are identified by comparing the terms:

  • Matrix Wi: Determines the quadratic term.

    $$W_i = -\frac{1}{2}\Sigma_i^{-1}$$
  • Vector wi: Determines the linear term. Note that $\boldsymbol{\mu}_i^T\Sigma_i^{-1}\mathbf{x} = (\Sigma_i^{-1}\boldsymbol{\mu}_i)^T\mathbf{x}$.

    $$\mathbf{w}_i = \Sigma_i^{-1}\boldsymbol{\mu}_i$$
  • Scalar wi0: Represents the constant or threshold term.

    $$w_{i0} = -\frac{1}{2}\boldsymbol{\mu}_i^T\Sigma_i^{-1}\boldsymbol{\mu}_i - \frac{1}{2}\ln|\Sigma_i| + \ln P(\omega_i)$$

Since the discriminant function gi(x) contains the quadratic term $\mathbf{x}^T W_i\,\mathbf{x}$, where $W_i = -\frac{1}{2}\Sigma_i^{-1}$ depends on the class i, the resulting function is quadratic in x.

The decision boundary between two classes ωi and ωj is defined by gi(x)=gj(x), which leads to:

$$g_i(\mathbf{x}) - g_j(\mathbf{x}) = \left(\mathbf{x}^T W_i\,\mathbf{x} + \mathbf{w}_i^T\mathbf{x} + w_{i0}\right) - \left(\mathbf{x}^T W_j\,\mathbf{x} + \mathbf{w}_j^T\mathbf{x} + w_{j0}\right) = 0$$

$$\mathbf{x}^T(W_i - W_j)\,\mathbf{x} + (\mathbf{w}_i - \mathbf{w}_j)^T\mathbf{x} + (w_{i0} - w_{j0}) = 0$$

Because $W_i \neq W_j$ in general (since $\Sigma_i \neq \Sigma_j$), the quadratic term $\mathbf{x}^T(W_i - W_j)\,\mathbf{x}$ does not cancel out. This equation describes a hyperquadric surface in the feature space $\mathbb{R}^d$.

These hyperquadric decision boundaries can take various forms, including:

  • Hyperplanes (only if Σi=Σj, which reduces to Case 2)

  • Pairs of hyperplanes

  • Hyperspheres

  • Hyperellipsoids

  • Hyperparaboloids

  • Hyperhyperboloids

In 2D space, these correspond to familiar shapes like lines, pairs of lines, circles, ellipses, parabolas, and hyperbolas. The specific shape depends on the eigenvalues of the matrix $(W_i - W_j)$ and the other terms in the equation.

This general case allows for much more complex decision boundaries compared to the linear boundaries of Cases 1 and 2, reflecting the greater flexibility afforded by allowing each class to have its own unique covariance matrix.
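A minimal sketch (with assumed per-class parameters) that builds the quadratic discriminant from Wi, wi, and wi0 and checks it against the Gaussian log-density plus log-prior:

```python
# Sketch: Case 3 quadratic discriminant for two classes with different
# (assumed) covariance matrices, checked against scipy's log-density.
import numpy as np
from scipy.stats import multivariate_normal

params = [
    (np.array([0.0, 0.0]), np.array([[2.0, 0.5], [0.5, 1.0]]), 0.6),
    (np.array([3.0, 3.0]), np.array([[1.0, -0.4], [-0.4, 2.0]]), 0.4),
]  # (mu_i, Sigma_i, P(omega_i)) -- illustrative assumptions

def g_quadratic(x, mu, Sigma, prior):
    Sinv = np.linalg.inv(Sigma)
    W = -0.5 * Sinv                         # quadratic term (Eq. 63)
    w = Sinv @ mu                           # linear term
    w0 = -0.5 * mu @ Sinv @ mu - 0.5 * np.log(np.linalg.det(Sigma)) + np.log(prior)
    return x @ W @ x + w @ x + w0

def g_reference(x, mu, Sigma, prior):
    # ln p(x|omega_i) + ln P(omega_i); differs from g_quadratic only by the
    # class-independent constant -(d/2) ln(2 pi).
    return multivariate_normal(mu, Sigma).logpdf(x) + np.log(prior)

x = np.array([1.5, 1.0])
gq = [g_quadratic(x, *p) for p in params]
gr = [g_reference(x, *p) for p in params]
print(np.argmax(gq) == np.argmax(gr))         # same decision
print(np.allclose(np.diff(gq), np.diff(gr)))  # same pairwise difference
```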

The figures referenced illustrate this complexity:

  • Figure 2.14: Shows various quadratic decision boundaries (ellipses, hyperbolas, parabolas, lines) that can arise in 2D when covariance matrices are arbitrary.

Figure 2.14

Figure 2.14: Arbitrary Gaussian distributions lead to Bayes decision boundaries that are general hyperquadrics. Conversely, given any hyperquadric, one can find two Gaussian distributions whose Bayes decision boundary is that hyperquadric.

This quadratic classifier is the most general form for Gaussian distributions under the Bayesian framework.

The extension of these results to more than two categories is straightforward, though we need to keep clear which two of the total c categories are responsible for any boundary segment. Figure 2.16 shows the decision surfaces for a four-category case made up of Gaussian distributions. Of course, if the distributions are more complicated, the decision regions can be even more complex, though the same underlying theory holds there too.

Figure 2.16

Figure 2.16: The decision regions for four normal distributions. Even with such a low number of categories, the shapes of the boundary regions can be rather complex.

Summary#

  • Bayesian Decision Theory provides a framework for optimal classification under uncertainty.

  • Key components:

    • Prior probabilities.

    • Likelihoods (class-conditional densities).

    • Posterior probabilities.

  • Decision rules minimize error rates or expected loss.

  • Normal distributions are commonly used due to their mathematical tractability.