Lecture 7. Bayesian Learning#

Learning in an uncertain world

Mahmood Amintoosi, Spring 2024

Computer Science Dept, Ferdowsi University of Mashhad

I should mention that the original material of this course was from Open Machine Learning Course, by Joaquin Vanschoren and others.

# Auto-setup when running on Google Colab
import os
if 'google.colab' in str(get_ipython()) and not os.path.exists('/content/machine-learning'):
    !git clone -q https://github.com/fum-cs/machine-learning.git /content/machine-learning
    !pip --quiet install -r /content/machine-learning/requirements_colab.txt
    %cd machine-learning/notebooks

# Global imports and settings
%matplotlib inline
from preamble import *
interactive = False # Set to True for interactive plots
if interactive:
    fig_scale = 0.61
    print_config['font.size'] = 7
    print_config['xtick.labelsize'] = 5
    print_config['ytick.labelsize'] = 5
    plt.rcParams.update(print_config)
else: # For printing
    fig_scale = 0.28
    plt.rcParams.update(print_config)
(figure: XKCD, by Randall Munroe)

Bayes’ rule#

Rule for updating the probability of a hypothesis \(c\) given data \(x\)

\[P(c|x) = \frac{P(x|c)P(c)}{P(x)}\]

\(P(c|x)\) is the posterior probability of class \(c\) given data \(x\).
\(P(c)\) is the prior probability of class \(c\): what you believed before you saw the data \(x\)
\(P(x|c)\) is the likelihood of data point \(x\) given that the class is \(c\) (computed from your data)
\(P(x)\) is the prior probability of the data (marginal likelihood): the likelihood of the data \(x\) under any circumstance (no matter what the class is)
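
As a small worked example (all numbers are made up for illustration): suppose a message has a prior probability of 0.3 of being spam, and the observed words are much more likely under the spam class than under the non-spam class. Bayes' rule combines prior and likelihood into a posterior:

import numpy as np

# Hypothetical numbers: P(c) and P(x|c) for c in {spam, not spam}
prior = np.array([0.3, 0.7])
likelihood = np.array([0.8, 0.1])
evidence = np.sum(likelihood * prior)      # P(x) = sum over classes of P(x|c) P(c)
posterior = likelihood * prior / evidence  # P(c|x)
print(posterior)                           # approximately [0.774, 0.226]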

Bayesian models#

  • Learn the joint distribution \(P(x,y)=P(x|y)P(y)\).

    • Assumes that the data is Gaussian distributed (!)

    • With every input \(x\) you get \(P(y|x)\), hence a mean and standard deviation for \(y\) (blue)

    • For every desired output \(y\) you get \(P(x|y)\), hence you can sample new points \(x\) (red)

  • Easily updatable with new data using Bayes’ rule (‘turning the crank’)

    • Previous posterior \(P(y|x)\) becomes new prior \(P(y)\)
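
A minimal sketch of this 'crank turning', assuming a Gaussian likelihood with known noise variance and a conjugate Gaussian prior on the mean, so that each posterior is again Gaussian and can be reused as the prior for the next batch of data (the batches below are made up):

import numpy as np

def update_gaussian(prior_mu, prior_var, data, noise_var=1.0):
    # Conjugate update of a Gaussian prior over the mean, given new observations
    n = len(data)
    post_var = 1.0 / (1.0 / prior_var + n / noise_var)
    post_mu = post_var * (prior_mu / prior_var + np.sum(data) / noise_var)
    return post_mu, post_var

mu, var = 0.0, 10.0  # broad prior belief
for batch in [np.array([1.2, 0.8]), np.array([1.1, 0.9, 1.0])]:
    mu, var = update_gaussian(mu, var, batch)  # posterior becomes the new prior
    print(mu, var)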

import scipy.stats as stats
from matplotlib import cm
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import axes3d
import ipywidgets as widgets
from ipywidgets import interact, interact_manual

def plot_joint_distribution(mean,covariance_matrix,x,y,contour=False,labels=['x','y'], plot_intersect=False, ax=None):
    
    delta = 0.05
    xr, yr = np.arange(-3.0, 3.0, delta), np.arange(-3.0, 3.0, delta)
    X, Y = np.meshgrid(xr,yr)
    xy = np.hstack((X.flatten()[:, None], Y.flatten()[:, None]))
    p = stats.multivariate_normal.pdf(xy, mean=mean, cov=covariance_matrix)
    Z = p.reshape(len(X), len(X))
    
    if not ax:
        fig = plt.figure(figsize=plt.figaspect(1)*fig_scale*1.1)
        ax = fig.add_subplot(projection='3d')
        
    if contour:
        ax.plot_surface(X, Y, Z*0.001, alpha=0.9, cmap='jet')
    else:
        ax.plot_surface(X, Y, Z, alpha=0.9, cmap='jet')

    cset = ax.contour(X, Y, Z, zdir='y', offset=3, cmap='Reds')
    cset = ax.contour(X, Y, Z, zdir='x', offset=-3, cmap='Blues')
    Zys = np.zeros_like(Z)
    Zys[60,:] = Z[int((y+3)/delta)]
    cset = ax.contour(X, Y, Zys, zdir='y', offset=3, linewidths=1.5, cmap='Reds')
    Zys = np.zeros_like(Z)
    Zys[:,60] = Z[int((x+3)/delta)]
    cset = ax.contour(X, Y, Zys, zdir='x', offset=-3, linewidths=1.5, cmap='Blues')
    
    if plot_intersect:
        ax.plot([x]*len(xr), yr, 0.001, color='k', alpha=1, ls='-', lw=1)
        ax.plot(xr, [y]*len(yr), 0.001, color='k', alpha=1, ls='-', lw=1)
    
    
    ax.set_xlabel(labels[0], labelpad=-2/fig_scale)
    ax.set_ylabel(labels[1], labelpad=-2/fig_scale)
    ax.set_zlabel('Probability density', labelpad=-3/fig_scale)
    ax.set_xticks([-3,0,3])
    ax.set_yticks([-3,0,3])
    ax.set_xlim(xmin=-3, xmax=3)
    ax.set_ylim(ymin=-3, ymax=3)
    #ax.tick_params(axis='both')
    ax.set_zticks([])
    ax.tick_params(axis='both', width=0, labelsize=10*fig_scale, pad=-3)
    
@interact
def interact_joint_distribution(x=(-3,3,0.5),y=(-3,3,0.5),contour=False):
    # Shape of the joint distribution
    mean = [0, 0]
    covariance_matrix = np.array([[1,-0.8],[-0.8,0.8]])
    plot_joint_distribution(mean,covariance_matrix,x,y,contour, plot_intersect=True)
if not interactive:
    mean = [0, 0]
    covariance_matrix = np.array([[1,-0.8],[-0.8,1]])
    
    fig = plt.figure(figsize=plt.figaspect(0.3)*fig_scale*2)
    ax1 = fig.add_subplot(1, 2, 1, projection='3d')
    plot_joint_distribution(mean,covariance_matrix,x=1,y=1,contour=False, plot_intersect=True, ax=ax1)
    ax2 = fig.add_subplot(1, 2, 2, projection='3d')
    plot_joint_distribution(mean,covariance_matrix,x=-1,y=1,contour=False, plot_intersect=True, ax=ax2)
_images/34bf7318b140bf051d28ea03fbfd831bd59340ebd693fb173491be0361ceb88c.png

Generative models#

  • The joint distribution represents the training data for a particular output (e.g. a class)

  • You can sample a new point \(\textbf{x}\) with high predicted likelihood \(P(x,c)\): that new point will be very similar to the training points

  • Generate new (likely) points according to the same distribution: generative model

    • Generate examples that are fake but correspond to a desired output

    • Generative neural networks (e.g. GANs) can do this very accurately for text, images, …


Naive Bayes#

  • Predict the probability that a point belongs to a certain class, using Bayes’ Theorem

\[P(c|\textbf{x}) = \frac{P(\textbf{x}|c)P(c)}{P(\textbf{x})}\]
  • Problem: since \(\textbf{x}\) is a vector, computing \(P(\textbf{x}|c)\) can be very complex

  • Naively assume that all features are conditionally independent of each other given the class, in which case:
    \(P(\mathbf{x}|c) = P(x_1|c) \times P(x_2|c) \times ... \times P(x_n|c)\)

  • Very fast: only needs to extract statistics from each feature.

On categorical data#

What’s the probability that your friend will play golf if the weather is sunny?

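A minimal sketch of this computation on a made-up weather/play dataset (the observations below are illustrative, not the classic table):

import pandas as pd

df = pd.DataFrame({
    'weather': ['sunny', 'sunny', 'overcast', 'rainy', 'sunny', 'rainy', 'overcast', 'sunny'],
    'play':    ['no',    'yes',   'yes',      'no',    'yes',   'yes',   'yes',      'no']
})

prior = df['play'].value_counts(normalize=True)                          # P(play)
likelihood = pd.crosstab(df['play'], df['weather'], normalize='index')   # P(weather | play)

# Bayes' rule: P(play=yes | weather=sunny)
numerator = likelihood.loc['yes', 'sunny'] * prior['yes']
evidence = sum(likelihood.loc[c, 'sunny'] * prior[c] for c in ['yes', 'no'])
print(numerator / evidence)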

On numeric data#

  • We need to fit a distribution (e.g. Gaussian) over the data points

  • GaussianNB: Computes mean \(\mu_c\) and standard deviation \(\sigma_c\) of the feature values per class: \(p(x=v \mid c)=\frac{1}{\sqrt{2\pi\sigma^2_c}}\,e^{ -\frac{(v-\mu_c)^2}{2\sigma^2_c} }\)

  • We can now make predictions using Bayes’ theorem: \(p(c \mid \mathbf{x}) = \frac{p(\mathbf{x} \mid c) \ p(c)}{p(\mathbf{x})}\)

  • What do the predictions of Gaussian Naive Bayes look like?

from sklearn.naive_bayes import GaussianNB

names = ["Naive Bayes"]
classifiers = [GaussianNB()]

mglearn.plots.plot_classifiers(names, classifiers, figuresize=(8*fig_scale,6*fig_scale))
_images/c00f616705299b2a8fb54f6b510f9e8d41adcabd894efed9d938fc6ca265c392.png

Other Naive Bayes classifiers:

  • BernoulliNB

    • Assumes binary data

    • Feature statistics: Number of non-zero entries per class

  • MultinomialNB

    • Assumes count data

    • Feature statistics: Average value per class

    • Mostly used for text classification (bag-of-words data)
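
A small sketch of MultinomialNB on bag-of-words counts; the toy documents and labels below are made up for illustration:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

docs = ["win money now", "meeting at noon", "win a free prize now"]
labels = ["spam", "ham", "spam"]

vec = CountVectorizer()
X_counts = vec.fit_transform(docs)            # bag-of-words count matrix
clf = MultinomialNB().fit(X_counts, labels)
print(clf.predict(vec.transform(["free money"])))  # most likely 'spam' on this toy data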

Bayesian Networks#

  • What if we know that some variables are not independent?

  • A Bayesian Network is a directed acyclic graph representing variables as nodes and conditional dependencies as edges.

  • If an edge \((A, B)\) connects random variables A and B, then \(P(B|A)\) is a factor in the joint probability distribution. We must know \(P(B|A)\) for all values of \(B\) and \(A\)

  • The graph structure can be designed manually or learned (hard!)

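As a minimal sketch of how such a factorization is used, consider a made-up two-node network Rain → GrassWet; the probability tables are invented for illustration, and libraries such as pgmpy automate this kind of bookkeeping:

import numpy as np

p_rain = np.array([0.8, 0.2])              # P(Rain = no), P(Rain = yes)
p_wet_given_rain = np.array([[0.9, 0.1],   # P(Wet | Rain = no)
                             [0.2, 0.8]])  # P(Wet | Rain = yes)

# Chain rule: P(Rain, Wet) = P(Rain) * P(Wet | Rain)
joint = p_rain[:, None] * p_wet_given_rain

# Condition on the evidence Wet = yes to infer P(Rain = yes | Wet = yes)
p_rain_given_wet = joint[:, 1] / joint[:, 1].sum()
print(p_rain_given_wet[1])  # = 2/3 with these numbers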
### GP implementation (based on example by Neil Lawrence http://inverseprobability.com/mlai2015/)

# Compute covariances
def compute_kernel(X, X2, kernel, **kwargs):
    K = np.zeros((X.shape[0], X2.shape[0]))
    for i in np.arange(X.shape[0]):
        for j in np.arange(X2.shape[0]):
            K[i, j] = kernel(X[i, :], X2[j, :], **kwargs)
    return K

# Exponentiated quadratic kernel (RBF)
def exponentiated_quadratic(x, x_prime, variance, lengthscale):
    squared_distance = ((x-x_prime)**2).sum()
    return variance*np.exp((-0.5*squared_distance)/lengthscale**2)


class GP():
    def __init__(self, X, y, sigma2, kernel, **kwargs):
        self.K = compute_kernel(X, X, kernel, **kwargs)
        self.X = X
        self.y = y
        self.sigma2 = sigma2
        self.kernel = kernel
        self.kernel_args = kwargs
        self.update_inverse()
    
    def update_inverse(self):
        # Precompute the inverse covariance and some quantities of interest
        ## NOTE: Not the correct *numerical* way to compute this! For ease of use.
        self.Kinv = np.linalg.inv(self.K+self.sigma2*np.eye(self.K.shape[0]))
        # The log-determinant of the (noisy) covariance matrix
        self.logdetK = np.linalg.slogdet(self.K+self.sigma2*np.eye(self.K.shape[0]))[1]
        # The matrix inner product of the inverse covariance
        self.Kinvy = np.dot(self.Kinv, self.y)  
        self.yKinvy = (self.y*self.Kinvy).sum()

    def log_likelihood(self):
        # use the pre-computes to return the likelihood
        return -0.5*(self.K.shape[0]*np.log(2*np.pi) + self.logdetK + self.yKinvy)
    
    def objective(self):
        # use the pre-computes to return the objective function 
        return -self.log_likelihood()  
    
    def posterior_f(self, X_test, y):
        K_star = compute_kernel(self.X, X_test, self.kernel, **self.kernel_args)
        K_starstar = compute_kernel(X_test, X_test, self.kernel, **self.kernel_args)
        A = np.dot(self.Kinv, K_star)
        mu_f = np.dot(A.T, y)
        C_f = K_starstar - np.dot(A.T, K_star)
        return mu_f, C_f

# set covariance function parameters
variance = 16.0
lengthscale = 32
sigma2 = 0.05 # noise variance

# Plotting how GPs sample
def plot_gp(X, Y, X_test, nr_points=10, variance=16, lengthscale=32, add_mean=True, nr_samples=25, show_covariance=True, show_stdev=False, ylim=None):
    # Compute GP
    xs, ys = X[:nr_points], Y[:nr_points]
    model = GP(xs, ys, sigma2, exponentiated_quadratic, variance=variance, lengthscale=lengthscale)
    mu_f, C_f = model.posterior_f(X_test,ys)

    # Plot GP with or without covariance matrix
    if show_covariance:
        fig, axs = plt.subplots(1, 2, figsize=(10*fig_scale, 4*fig_scale))
        gp_axs = axs[0]
    else:
        fig, axs = plt.subplots(1, 1, figsize=(10*fig_scale, 4*fig_scale))
        gp_axs = axs

    # Plot GP
    if add_mean:
        samples = np.random.multivariate_normal(mu_f.T[0], C_f, nr_samples).T;
    else:
        samples = np.random.multivariate_normal(np.zeros(C_f.shape[0]), C_f, nr_samples).T;
    
    if not show_stdev:
        for i in range(samples.shape[1]):
            gp_axs.plot(X_test, samples[:,i], c='b', ls='-', alpha=0.1);
    gp_axs.plot(X_test, mu_f, lw=1, c='k', ls='-', alpha=0.9); 
        
    gp_axs.plot(xs,ys,'ro', markersize=4*fig_scale)
    gp_axs.set_title("Gaussian Process")
    #gp_axs.set_xlabel("x")
    #gp_axs.set_ylabel("y")
    gp_axs.tick_params(axis='both')
    gp_axs.set_ylim(ylim)
    
    # Plot Covariance matrix
    if show_covariance:
        im = axs[1].imshow(C_f, interpolation='none')
        axs[1].set_title("Covariance matrix")
        fig.colorbar(im);
    
    # Stdev
    if show_stdev:
        var_f = np.diag(C_f)[:, None]
        std_f = np.sqrt(var_f)
        gp_axs.fill(np.concatenate([X_test, X_test[::-1]]),
             np.concatenate([mu_f+2*std_f,
                            (mu_f-2*std_f)[::-1]]),
             alpha=.5, fc='b', ec='None', label='95% confidence interval')

Gaussian processes#

  • Model the data as a Gaussian distribution, conditioned on the training points

@interact
def plot_sin(nr_points=(0,20,1)):
    X_sin = np.random.RandomState(seed=3).uniform(0,100,(100,1))
    Y_sin = X_sin/100* np.sin(X_sin/5)*3 + np.random.RandomState(seed=0).randn(100,1)*0.5
    X_sin_test = np.linspace(0, 100, 100)[:, None] 
    plot_gp(X_sin,Y_sin,X_sin_test,lengthscale=12, nr_points=nr_points, nr_samples=100, show_covariance=False, ylim=(-7,7))
    
if not interactive:
    plot_sin(nr_points=10)
_images/107d9a47d604834f64436af63ffefc542097e6c4f6f3fe5074cbec1bdbd8def8.png

Probabilistic interpretation of regression#

Linear regression (recap):

\[y = f(\mathbf{x}_i) = \mathbf{x}_i\mathbf{w} + b \]

For one input feature:

\[y = w_1 \cdot x_1 + b \cdot 1\]

We can solve this via linear algebra (closed form solution): \(w^{*} = (X^{T}X)^{-1} X^T Y\)

w = np.linalg.solve(np.dot(X.T, X), np.dot(X.T, y))

\(\mathbf{X}\) is our data matrix with an \(x_0=1\) column to represent the bias \(b\):

\[\begin{split}\mathbf{X} = \begin{bmatrix} \mathbf{x}_1^\top \\\ \mathbf{x}_2^\top \\\ \vdots \\\ \mathbf{x}_N^\top \end{bmatrix} = \begin{bmatrix} 1 & x_1 \\\ 1 & x_2 \\\ \vdots & \vdots \\\ 1 & x_N \end{bmatrix}\end{split}\]
import pods
data = pods.datasets.olympic_marathon_men()
x = data['X']
y = data['Y']

X = np.hstack((np.ones_like(x), x)) # [ones(size(x)) x]
w = np.linalg.solve(np.dot(X.T, X), np.dot(X.T, y))

#print("X: ",X[:5])
#print("xTx: ",np.dot(X.T, X))
#print("xTy: ",np.dot(X.T, y))

Example: Olympic marathon data#

We learned: \( y= w_1 x + w_0 = -0.013 x + 28.895\)

x_test = np.linspace(1890, 2020, 130)[:, None]

f_test = w[1]*x_test + w[0]

fig = plt.subplots(figsize=(10*fig_scale, 4*fig_scale))
plt.plot(x_test, f_test, 'b-', lw=1)
plt.plot(x, y, 'ro', markersize=4*fig_scale);
plt.title("Linear regression (w: {})".format(w.T))
plt.tick_params(axis='both')
_images/aa5d30a4c6bfb731a3743da2dbea6c4faaafd817f804450d9af29feeb6886577.png

Polynomial regression (recap)#

We can fit a 2nd degree polynomial by using a basis expansion (adding more basis functions):

\[\mathbf{\Phi} = \left[ \mathbf{1} \quad \mathbf{x} \quad \mathbf{x}^2\right]\]
Phi = np.hstack([np.ones(x.shape), x, x**2])
w = np.linalg.solve(np.dot(Phi.T, Phi), np.dot(Phi.T, y))

f_test = w[2]*x_test**2 + w[1]*x_test + w[0]
fig = plt.subplots(figsize=(10*fig_scale, 4*fig_scale))
plt.plot(x_test, f_test, 'b-', lw=1)
plt.plot(x, y, 'ro', markersize=4*fig_scale);
plt.title("Polynomial regression (w: {})".format(w.T))
plt.tick_params(axis='both')
_images/e4952b36bfafe83ac63a175f9ff56c07c04dd7244eafa05a4ed491961c35f58d.png

Kernelized regression (recap)#

We can also kernelize the model and learn a dual coefficient per data point

from sklearn.kernel_ridge import KernelRidge

@interact
def plot_kernel_ridge(poly_degree=(1,8,1), poly_gamma=(-10,-2,1), rbf_gamma=(-10,-2,1), rbf_alpha=(-10,-5,1)):
    data = pods.datasets.olympic_marathon_men()
    x = data['X']
    y = data['Y']
    x_test = np.linspace(1890, 2020, 130)[:, None]

    plt.rcParams['figure.figsize'] = [6*fig_scale, 4*fig_scale]

    reg = KernelRidge(kernel='poly', degree=poly_degree, gamma=np.exp(poly_gamma)).fit(x, y)
    plt.plot(x_test, reg.predict(x_test), label="Kernel Ridge (Polynomial)", lw=1, c='b')

    reg2 = KernelRidge(kernel='rbf', alpha=np.exp(rbf_alpha), gamma=np.exp(rbf_gamma)).fit(x, y)
    plt.plot(x_test, reg2.predict(x_test), label="Kernel Ridge (RBF)", lw=1, c='g')
    
    plt.plot(x, y, 'ro', markersize=4*fig_scale)
    plt.title("Kernel Ridge")
    plt.tick_params(axis='both')
    plt.legend(loc="best");

if not interactive:
    plot_kernel_ridge(poly_degree=4, poly_gamma=-6, rbf_gamma=-6, rbf_alpha=-8)
_images/2910a45e1c20b174067d15dd8464f7d7ced862f1d6778a05abd1c35278e12272.png

Probabilistic interpretation#

  • These models do not give us any indication of the (un)certainty of the predictions

  • Assume that the data is inherently uncertain. This can be modeled explicitly by introducing a noise term \(\epsilon_i\).

\[y_i = w_1 x_i + w_0 + \epsilon_i.\]
  • Assume that the noise is distributed according to a Gaussian distribution with zero mean and variance \(\sigma^2\).

\[\epsilon_i \sim \mathcal{N}(0, \sigma^2)\]
  • That means that \(y(x)\) is now a Gaussian distribution with mean \(\mathbf{wx}\) and variance \(\sigma^2\)

\[y \sim \mathcal{N}(\mathbf{wx}, \sigma^2)\]

We now have an uncertainty estimate, but it is the same for every prediction

  • You would expect to be more certain near your training points

from sklearn.linear_model import BayesianRidge

@interact
def plot_regression(sigma=(0.001,0.01,0.001)): 
    data = pods.datasets.olympic_marathon_men()
    x = data['X']
    y = data['Y']
    x_test = np.linspace(1890, 2020, 130)[:, None]
    
    ridge = BayesianRidge(alpha_1=1/sigma).fit(x, y)
    plt.figure(figsize=(6*fig_scale, 4*fig_scale))
    plt.plot(x, y, 'o', c='b', markersize=4*fig_scale)
    # We have the fitted sigma but we're ignoring it for now.
    y_pred, sigma = ridge.predict(x_test, return_std=True)
    plt.fill(np.concatenate([x_test, x_test[::-1]]),
         np.concatenate([y_pred - 1.9600 * sigma,
                        (y_pred + 1.9600 * sigma)[::-1]]),
         alpha=.5, fc='b', ec='None', label='95% confidence interval')
        
    plt.plot(x_test, y_pred, 'b-', label='Bayesian Ridge')
    plt.tick_params(axis='both')
    plt.legend()
if not interactive:
    plot_regression(sigma=0.001)
_images/d0b8e35ead16724cfbccffdad9d51d6ec9e8d15757a8fc763a7b393815331b94.png

How to learn probabilities?#

  • Maximum Likelihood Estimation (MLE): Maximize \(P(\textbf{X}|\textbf{w})\)

    • Corresponds to optimizing \(\mathbf{w}\), using (log) likelihood as the loss function

    • Every prediction has a mean defined by \(\textbf{w}\) and Gaussian noise

    \[P(\textbf{X}|\textbf{w}) = \prod_{i=1}^{n} P(y_i|\mathbf{x}_i;\mathbf{w}) = \prod_{i=1}^{n} \mathcal{N}(y_i \mid \mathbf{w}\mathbf{x}_i,\sigma^2)\]
  • Maximum A Posteriori estimation (MAP): Maximize the posterior \(P(\textbf{w}|\textbf{X})\)

    • This can be done using Bayes’ rule after we choose a (Gaussian) prior \(P(\textbf{w})\):

    \[P(\textbf{w}|\textbf{X}) = \frac{P(\textbf{X}|\textbf{w})P(\textbf{w})}{P(\textbf{X})}\]
  • Bayesian approach: model the prediction \(P(y|x_{test},X)\) directly

    • Marginalize \(w\) out: consider all possible models (some are more likely)

    • If prior \(P(\textbf{w})\) is Gaussian, then \(P(y|x_{test},\textbf{X})\) is also Gaussian!

      • A multivariate Gaussian with mean \(\mu\) and covariance matrix \(\Sigma\)

    \[P(y|x_{test},\textbf{X}) = \int_w P(y|x_{test},\textbf{w}) P(\textbf{w}|\textbf{X}) dw = \mathcal{N}(\mathbf{\mu,\Sigma})\]
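
To make the contrast concrete: for linear regression with Gaussian noise, MLE recovers ordinary least squares, while adding a Gaussian prior \(\mathbf{w} \sim \mathcal{N}(\mathbf{0}, \alpha \mathbf{I})\) turns the MAP estimate into ridge regression, \(\mathbf{w}_{MAP} = (\mathbf{X}^\top\mathbf{X} + \lambda \mathbf{I})^{-1}\mathbf{X}^\top \mathbf{y}\) with \(\lambda = \sigma^2/\alpha\). A short sketch on synthetic data (the data, \(\sigma\), and \(\alpha\) are made up for illustration):

import numpy as np

rng = np.random.RandomState(0)
X = np.hstack([np.ones((20, 1)), rng.uniform(-3, 3, (20, 1))])  # bias column + one feature
w_true = np.array([1.0, 2.0])
y = X @ w_true + rng.randn(20) * 0.5

# MLE = ordinary least squares
w_mle = np.linalg.solve(X.T @ X, X.T @ y)

# MAP with prior w ~ N(0, alpha*I) = ridge regression with lambda = sigma^2 / alpha
lam = 0.5**2 / 1.0
w_map = np.linalg.solve(X.T @ X + lam * np.eye(2), X.T @ y)
print(w_mle, w_map)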

Gaussian prior \(P(w)\)#

In the Bayesian approach, we assume a prior (Gaussian) distribution for the parameters, \(\mathbf{w} \sim \mathcal{N}(\mathbf{0}, \alpha \mathbf{I})\):

  • With zero mean (\(\mu\)=0) and covariance matrix \(\alpha \mathbf{I}\). For 2D: \( \alpha \mathbf{I} = \begin{bmatrix} \alpha & 0 \\ 0 & \alpha \end{bmatrix}\)

I.e., each \(w_i\) is drawn from a Gaussian density with variance \(\alpha\): \(w_i \sim \mathcal{N}(0,\alpha)\)

@interact
def interact_prior(alpha=(0.1,1,0.1),contour=False):
    # Shape of the joint distribution
    mean = [0, 0]
    covariance_matrix = np.array([[alpha,0],[0,alpha]])
    plot_joint_distribution(mean,covariance_matrix,0,0,contour,labels=['w1','w2'])
if not interactive:
    interact_prior(0.5,contour=False)
_images/8d5c526bf96c9bcb38a5d5d7645c350d4da5b05fdb512708789cfed8efb1e94f.png

Sampling from the prior (weight space)#

We can sample from the prior distribution to see what form we are imposing on the functions a priori (before seeing any data).

  • Draw \(w\) (left) independently from a Gaussian density \(\mathbf{w} \sim \mathcal{N}(\mathbf{0}, \alpha\mathbf{I})\)

    • Use any Gaussian sampling technique, e.g. the Box-Muller transform

  • Every sample yields a polynomial function \(f(\mathbf{x})\) (right): \(f(\mathbf{x}) = \mathbf{w} \boldsymbol{\phi}(\mathbf{x}).\)

    • For example, with \(\boldsymbol{\phi}(\mathbf{x})\) being a polynomial:

alpha = 4. # set prior variance on w
degree = 5 # set the order of the polynomial basis set
sigma2 = 0.01 # set the noise variance
num_pred_data = 100 # how many points to use for plotting predictions
x_pred = np.linspace(1880, 2030, num_pred_data)[:, None] # input locations for predictions
cmap = plt.get_cmap('jet')
    
# Build the basis matrices (on Olympics data)
def polynomial(x, degree, loc, scale):
    degrees = np.arange(degree+1)
    return ((x-loc)/scale)**degrees

@interact
def plot_function_space(alpha=(0.1,5,0.5),degree=(1,10,1),random=(0,10,1)):
    data = pods.datasets.olympic_marathon_men()
    x = data['X']
    y = data['Y']
    
    scale = np.max(x) - np.min(x)
    loc = np.min(x) + 0.5*scale
    x_pred = np.linspace(1880, 2030, num_pred_data)[:, None] # input locations for predictions

    Phi_pred = polynomial(x_pred, degree=degree, loc=loc, scale=scale)
    Phi = polynomial(x, degree=degree, loc=loc, scale=scale)

    fig, axs = plt.subplots(1, 2, figsize=(10*fig_scale, 4*fig_scale))
    num_samples = 10
    K = degree+1
    colors = [cmap(i) for i in np.linspace(0, 1, num_samples)]
    for i in range(num_samples):
        z_vec = np.random.normal(size=(K, 1))
        w_sample = z_vec*np.sqrt(alpha)
        f_sample = np.dot(Phi_pred,w_sample)
        axs[0].scatter(w_sample[0], w_sample[1], color=colors[i])
        axs[0].set(xlabel="w_1", ylabel="w_2")
        axs[1].plot(x_pred, f_sample, ls='-', c=colors[i]) 
        axs[0].set_xlim(xmin=-5, xmax=5)
        axs[0].set_ylim(ymin=-5, ymax=5)
        axs[1].set_ylim(ymin=-5, ymax=5)
        axs[0].tick_params(axis='both')
        axs[1].tick_params(axis='both')

        
    x1 = np.linspace(-5, 5, 100)
    X1, X2 = np.meshgrid(x1, x1)
    xy = np.hstack((X1.flatten()[:, None], X2.flatten()[:, None]))
    p = stats.multivariate_normal.pdf(xy, mean=[0,0], cov=[[alpha,0],[0,alpha]])
    P = p.reshape(len(X1), len(X2))
    axs[0].contour(X1, X2, P)
if not interactive:
    plot_function_space(alpha=4, degree=5, random=0)
_images/354fea737adef6fe6e0cb88405eaf138411a160daf64d4aef3e0dfcf5a365855.png

Learning Gaussian distributions#

  • We assume that our data is Gaussian distributed: \(P(y|x_{test},\textbf{X}) = \mathcal{N}(\mathbf{\mu,\Sigma})\)

  • Example with learned mean \([m,m]\) and covariance \(\begin{bmatrix} \alpha & \beta \\ \beta & \alpha \end{bmatrix}\)

    • The blue curve is the predicted \(P(y|x_{test},\textbf{X})\)

@interact
def interact_prior(x_test=(-3,3,0.5),m=(-0.5,0.5,0.1),alpha=(0.5,1,0.1),beta=(-0.5,0.5,0.1),contour=False):
    # Shape of the joint distribution
    mean_matrix = [m, m]
    covariance_matrix = np.array([[alpha,beta],[beta,alpha]])
    plot_joint_distribution(mean_matrix,covariance_matrix,x_test,0,contour,plot_intersect=True,labels=['x_test','y'])
if not interactive:
    fig = plt.figure(figsize=plt.figaspect(0.3)*fig_scale*2)
    ax1 = fig.add_subplot(1, 2, 1, projection='3d')
    m,alpha,beta = 0,1,0
    mean_matrix = [m, m]
    covariance_matrix = np.array([[alpha,beta],[beta,alpha]])
    ax1.set_title("m=0, alpha=1, beta=0", pad=0)
    plot_joint_distribution(mean_matrix,covariance_matrix,0,0,contour=False,labels=['x_test','y'], ax=ax1)
    
    ax2 = fig.add_subplot(1, 2, 2, projection='3d')
    m,alpha,beta = 1,0.5,0.3
    mean_matrix = [m, m]
    covariance_matrix = np.array([[alpha,beta],[beta,alpha]])
    ax2.set_title("m=1, alpha=0.5, beta=0.3", pad=0)
    plot_joint_distribution(mean_matrix,covariance_matrix,0,0,contour=False, labels=['x_test','y'], ax=ax2)
_images/938994fc22c1566861dce4fea092964c9e89ecd6ef554018ac9991d761ee1804.png

Understanding covariances#

  • If two variables \(x_1\) and \(x_2\) covary strongly, knowing \(x_1\) tells us a lot about \(x_2\)

  • If covariance is 0, knowing \(x_1\) tells us nothing about \(x_2\) (the conditional and marginal distributions are the same)

  • For covariance matrix \(\begin{bmatrix} 1 & \beta \\ \beta & 1 \end{bmatrix}\):

@interact
def interact_covariance(x1=(-3,3,0.5),x2=(-3,3,0.5),beta=(-0.9,0.9,0.1)):
    # Shape of the joint distribution
    mean = [0, 0]
    covariance_matrix = np.array([[1,beta],[beta,1]])
    plot_joint_distribution(mean,covariance_matrix,x1,x2,contour=True, plot_intersect=True, labels=['x1','x2'])
if not interactive:
    fig = plt.figure(figsize=plt.figaspect(0.3)*fig_scale*2)
    ax1 = fig.add_subplot(1, 2, 1, projection='3d')
    mean = [0, 0]
    covariance_matrix = np.array([[1,0],[0,1]])
    ax1.set_title("beta=0",fontsize=12*fig_scale, pad=0)
    plot_joint_distribution(mean,covariance_matrix,1,1,contour=True, plot_intersect=True, labels=['x1','x2'], ax=ax1)
 
    ax2 = fig.add_subplot(1, 2, 2, projection='3d')
    covariance_matrix = np.array([[1,0.9],[0.9,1]])
    ax2.set_title("beta=0.9",fontsize=12*fig_scale, pad=0)
    plot_joint_distribution(mean,covariance_matrix,1,1,contour=True, plot_intersect=True, labels=['x1','x2'], ax=ax2)
_images/ee6bb713b7506a0a849660c3f59a0a7a2f082e256ee9a120a941beb424885005.png
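
Concretely, for the standardized bivariate Gaussian used above (unit variances, so \(\beta\) is the correlation), conditioning has a simple closed form, a standard property of multivariate Gaussians:

\[x_2 \mid x_1 \sim \mathcal{N}\left(\beta x_1,\; 1 - \beta^2\right)\]

For \(\beta = 0\) the conditional equals the marginal \(\mathcal{N}(0, 1)\), while as \(|\beta| \rightarrow 1\) observing \(x_1\) determines \(x_2\) almost exactly.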

Stochastic processes & sampling from higher-dimensional distributions#

  • Instead of sampling \(\mathbf{w}\) and then multiplying by \(\boldsymbol{\Phi}\), we can also generate examples of \(f(x)\) directly.

  • \(\mathbf{f}\) with \(n\) values can be sampled from an \(n\)-dimensional Gaussian distribution with zero mean and covariance matrix \(\mathbf{K} = \alpha \boldsymbol{\Phi}\boldsymbol{\Phi}^\top\):

    • \(\mathbf{f}\) is a stochastic process: series of normally distributed variables (interpolated in the plot)

\[\mathbf{f} \sim \mathcal{N}(\mathbf{0},\mathbf{K})\]
from matplotlib.ticker import MaxNLocator

# Exponentiated quadratic is another name for RBF
def exponentiated_quadratic(x, x_prime, variance, lengthscale):
    squared_distance = ((x-x_prime)**2).sum()
    return variance*np.exp((-0.5*squared_distance)/lengthscale**2)

# Compute covariances directly
def compute_kernel(X, X2, kernel, **kwargs):
    K = np.zeros((X.shape[0], X2.shape[0]))
    for i in np.arange(X.shape[0]):
        for j in np.arange(X2.shape[0]):
            K[i, j] = kernel(X[i, :], X2[j, :], **kwargs)
    return K

def plot_process(alpha=1, dimensions=100, num_samples=5, phi="Polynomial", sigma2=0.01, noise=False, length_scale=10.):
    x_pred = np.linspace(0, dimensions-1, dimensions)[:, None] # input locations for predictions
    if phi == "Identity":
        K = alpha*np.eye(dimensions)
    elif phi == "RBF":
        K = compute_kernel(x_pred, x_pred, exponentiated_quadratic, variance=alpha, lengthscale=length_scale)
    else:
        if phi == "Polynomial":
            degree = 5  
        elif phi == "Linear":
            degree = 1
        scale = np.max(x_pred) - np.min(x_pred)
        loc = np.min(x_pred) + 0.5*scale
        Phi_pred = polynomial(x_pred, degree=degree, loc=loc, scale=scale)
        K = alpha*np.dot(Phi_pred, Phi_pred.T)
    if noise:
        K += sigma2*np.eye(x_pred.size)
    
    if dimensions==2:
        fig, axs = plt.subplots(1, 3, figsize=(10*fig_scale, 4*fig_scale))
        ax_d, ax_s, ax_c = axs[0], axs[1], axs[2]
    else:
        fig, axs = plt.subplots(1, 2, figsize=(10*fig_scale, 4*fig_scale))
        ax_s, ax_c = axs[0], axs[1]

    im = ax_c.imshow(K, interpolation='none')
    ax_c.set_title("Covariance matrix")
    ax_c.tick_params(axis='both')
    ax_c.xaxis.set_major_locator(MaxNLocator(integer=True))
    ax_c.yaxis.set_major_locator(MaxNLocator(integer=True))
    ax_c.tick_params(axis='both', pad=0)
    fig.colorbar(im);
    
    samples = np.random.RandomState(seed=4).multivariate_normal(np.zeros(dimensions), K, num_samples).T;
    colors = [cmap(i) for i in np.linspace(0, 1, num_samples)]
    for i in range(samples.shape[1]):
        ax_s.plot(x_pred, samples[:,i], color=colors[i], ls='-');
    ax_s.xaxis.set_major_locator(MaxNLocator(integer=True))
    ax_s.set_title("Interpolated samples \n from Gaussian")
    ax_s.set_xlabel("X")
    ax_s.tick_params(axis='both')
    ax_s.set_aspect(1./ax_s.get_data_ratio())
    ax_s.tick_params(axis='both', pad=0)

    
    if dimensions==2:
        ax_d.scatter(samples[0],samples[1], color=colors)    
        x1 = np.linspace(-2.5, 2.5, 100)
        X1, X2 = np.meshgrid(x1, x1)
        xy = np.hstack((X1.flatten()[:, None], X2.flatten()[:, None]))
        p = stats.multivariate_normal.pdf(xy, mean=[0,0], cov=K)
        P = p.reshape(len(X1), len(X2))
        ax_d.contour(X1, X2, P)
        ax_d.tick_params(axis='both')
        ax_d.set_aspect(1./ax_d.get_data_ratio())
        ax_d.set_xlabel("x0")
        ax_d.set_ylabel("x1")
        ax_d.xaxis.set_major_locator(MaxNLocator(integer=True))
        ax_d.yaxis.set_major_locator(MaxNLocator(integer=True))
        ax_d.set_title("Samples from 2D Gaussian")
        ax_d.tick_params(axis='both', pad=0)

    
@interact
def plot_process_noiseless(alpha=(0.1,2,0.1), dimensions=(2,100,1), phi=["Identity","Linear","Polynomial"]):
    plot_process(alpha=alpha, dimensions=dimensions, phi=phi, noise=False)
if not interactive:
    plot_process(alpha=1, dimensions=2, phi="Polynomial")
_images/4ea716a665e52382e67a00a1945236e3f27f010249fa35c100cd565d72c03c33.png

Repeat for 40 dimensions, with \(\boldsymbol{\Phi}\) the polynomial transform:

plot_process(alpha=1, dimensions=40, phi="Polynomial")
_images/e7a6661eb9b54855cff8a806be761123b0ea29a88e62537b58546e09685b4ddc.png

More examples of covariances

Noisy functions#

We normally add Gaussian noise to obtain our observations:

\[ \mathbf{y} = \mathbf{f} + \boldsymbol{\epsilon} \]
@interact
def plot_covm_noise(sigma=(0.01,0.1,0.01)):
    plot_process(alpha=1, dimensions=100, phi="Polynomial", sigma2=sigma, noise=True)
if not interactive:
    plot_process(alpha=1, dimensions=100, phi="Polynomial", sigma2=0.02, noise=True)
_images/4516ee539edbc8814cf7f573bfdff2095635a6775c66cc79a8528337260c09e7.png

Gaussian Process#

  • Usually, we want our functions to be smooth: if two points are similar/nearby, the predictions should be similar.

    • Hence, we need a similarity measure (a kernel)

  • In a Gaussian process we can do this by specifying the covariance function directly (not as \(\mathbf{K} = \alpha \boldsymbol{\Phi}\boldsymbol{\Phi}^\top\))

    • The covariance matrix is simply the kernel matrix: \(\mathbf{f} \sim \mathcal{N}(\mathbf{0},\mathbf{K})\)

  • The RBF (Gaussian) covariance function (or kernel) is specified by

\[ k(\mathbf{x}, \mathbf{x}^\prime) = \alpha \exp\left( -\frac{\left\Vert \mathbf{x}-\mathbf{x}^\prime\right\Vert^2}{2\ell^2}\right). \]

where \(\left\Vert\mathbf{x} - \mathbf{x}^\prime\right\Vert^2\) is the squared distance between the two input vectors

\[ \left\Vert\mathbf{x} - \mathbf{x}^\prime\right\Vert^2 = (\mathbf{x} - \mathbf{x}^\prime)^\top (\mathbf{x} - \mathbf{x}^\prime) \]

and the lengthscale parameter \(\ell\) controls the smoothness of the function, while \(\alpha\) controls the vertical variation.

Now the influence of a point decreases smoothly but exponentially

  • These are our priors \(P(y) = \mathcal{N}(\mathbf{0,\mathbf{K}})\), with mean 0

  • We now want to condition it on our training data: \(P(y|x_{test},\textbf{X}) = \mathcal{N}(\mathbf{\mu,\Sigma})\)

@interact
def plot_gprocess(alpha=(0.1,2,0.1), lengthscale=(1,20,1), dimensions=(2,100,1), nr_samples=(1,21,5)):
    plot_process(alpha=alpha, dimensions=dimensions, phi="RBF", length_scale=lengthscale, num_samples=nr_samples)
    
if not interactive:
    plot_gprocess(alpha=1, lengthscale=10, dimensions=100, nr_samples=12)
_images/0a5740a86a450817a2d7389689b3e0e03e5fd4d9e19a6d82bbd7c8b69ede6240.png

Computing the posterior \(P(\mathbf{y}|\mathbf{X})\)#

  • Assuming that the observations \(\mathbf{y}\) are Gaussian with covariance given by the kernel matrix \(\mathbf{K}\) plus noise \(\sigma^2\mathbf{I}\), the model likelihood becomes:

\[ P(\mathbf{y}|\mathbf{X}) = \frac{1}{(2\pi)^{\frac{n}{2}}|\mathbf{K}+\sigma^2 \mathbf{I}|^{\frac{1}{2}}} \exp\left(-\frac{1}{2}\mathbf{y}^\top \left(\mathbf{K}+\sigma^2 \mathbf{I}\right)^{-1}\mathbf{y}\right) \]
  • Hence, the negative log likelihood (the objective function) is given by:

\[ E(\boldsymbol{\theta}) = \frac{1}{2} \log |\mathbf{K}+\sigma^2\mathbf{I}| + \frac{1}{2} \mathbf{y}^\top \left(\mathbf{K} + \sigma^2\mathbf{I}\right)^{-1}\mathbf{y} \]
  • The model parameters (e.g. noise variance \(\sigma^2\)) and the kernel parameters (e.g. lengthscale, variance) can be embedded in the covariance function and learned from data.

  • Good news: This loss function can be optimized using linear algebra (Cholesky Decomposition)

  • Bad news: This is cubic in the number of data points, \(\mathcal{O}(n^3)\) (and computing the kernel matrix itself also scales with the number of features)

  • Read more on the derivation

class GP():
    def __init__(self, X, y, sigma2, kernel, **kwargs):
        self.K = compute_kernel(X, X, kernel, **kwargs)
        self.X = X
        self.y = y
        self.sigma2 = sigma2
        self.kernel = kernel
        self.kernel_args = kwargs
        self.update_inverse()
    
    def update_inverse(self):
        # Precompute the inverse covariance and some quantities of interest
        ## NOTE: Not the correct *numerical* way to compute this! For ease of use.
        self.Kinv = np.linalg.inv(self.K+self.sigma2*np.eye(self.K.shape[0]))
        # The log-determinant of the (noisy) covariance matrix
        self.logdetK = np.linalg.slogdet(self.K+self.sigma2*np.eye(self.K.shape[0]))[1]
        # The matrix inner product of the inverse covariance
        self.Kinvy = np.dot(self.Kinv, self.y)  
        self.yKinvy = (self.y*self.Kinvy).sum()

        
    def log_likelihood(self):
        # use the pre-computes to return the likelihood
        return -0.5*(self.K.shape[0]*np.log(2*np.pi) + self.logdetK + self.yKinvy)
    
    def objective(self):
        # use the pre-computes to return the objective function 
        return -self.log_likelihood()  
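
For reference, a numerically sounder variant of the same computation uses a Cholesky factorization instead of an explicit inverse and determinant; this is a sketch under the same assumptions as the class above, not part of the course code:

import numpy as np
from scipy.linalg import cho_factor, cho_solve

def gp_log_likelihood(K, y, sigma2):
    # GP log marginal likelihood via Cholesky decomposition of K + sigma^2*I
    n = K.shape[0]
    L, lower = cho_factor(K + sigma2 * np.eye(n), lower=True)
    alpha = cho_solve((L, lower), y)         # solves (K + sigma^2*I) alpha = y
    logdet = 2 * np.sum(np.log(np.diag(L)))  # log|K + sigma^2*I| from the Cholesky diagonal
    return -0.5 * (n * np.log(2 * np.pi) + logdet + np.sum(y * alpha))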

Making predictions#

The model makes predictions for \(\mathbf{f}\) that are unaffected by future values of \(\mathbf{f}^*\).
If we think of \(\mathbf{f}^*\) as test points, we can still write down a joint probability density over the training observations, \(\mathbf{f}\) and the test observations, \(\mathbf{f}^*\).

This joint probability density will be Gaussian, with a covariance matrix given by our kernel function, \(k(\mathbf{x}_i, \mathbf{x}_j)\):

\[ \begin{bmatrix}\mathbf{f} \\ \mathbf{f}^*\end{bmatrix} \sim \mathcal{N}\left(\mathbf{0}, \begin{bmatrix} \mathbf{K} & \mathbf{K}_\ast \\ \mathbf{K}_\ast^\top & \mathbf{K}_{\ast,\ast}\end{bmatrix}\right) \]

where \(\mathbf{K}\) is the kernel matrix computed between all the training points,
\(\mathbf{K}_\ast\) is the kernel matrix computed between the training points and the test points,
\(\mathbf{K}_{\ast,\ast}\) is the kernel matrix computed between all the test points and themselves.

Analogy

The joint probability of two variables \(x_1\) and \(x_2\) following a multivariate Gaussian is:

\[ \begin{bmatrix}x_1 \\ x_2 \end{bmatrix} \sim \mathcal{N}\left(\mathbf{0}, \begin{bmatrix} \mathbf{K} & \mathbf{K}_\ast \\ \mathbf{K}_\ast^\top & \mathbf{K}_{\ast,\ast}\end{bmatrix}\right) \]

Conditional Density \(P(\mathbf{y}|x_{test} , \mathbf{X})\)#

Finally, we need to define conditional distributions to answer particular questions of interest

We will need the conditional density for making predictions.

\[ \mathbf{f}^* | \mathbf{y} \sim \mathcal{N}(\boldsymbol{\mu}_f,\mathbf{C}_f) \]

with a mean given by \( \boldsymbol{\mu}_f = \mathbf{K}_*^\top \left[\mathbf{K} + \sigma^2 \mathbf{I}\right]^{-1} \mathbf{y} \)

and a covariance given by \( \mathbf{C}_f = \mathbf{K}_{*,*} - \mathbf{K}_*^\top \left[\mathbf{K} + \sigma^2 \mathbf{I}\right]^{-1} \mathbf{K}_\ast. \)

model = GP(x, y, sigma2, exponentiated_quadratic, variance=variance, lengthscale=lengthscale)
mu_f, C_f = model.posterior_f(x_pred, y)

Gaussian process example#

We can now get the mean and covariance learned on our dataset, and sample from this distribution!

def shuffled_olympics():
    data = pods.datasets.olympic_marathon_men()
    X = data['X']
    Y = data['Y']
    perm = np.random.RandomState(seed=0).permutation(len(X))
    x_shuffle, y_shuffle = X[perm], Y[perm]
    X_test = np.linspace(1890, 2020, 100)[:, None]
    return x_shuffle, y_shuffle, X_test

@interact
def plot_gp_olympics(nr_points=(0,27,1)):
    x, y, xt = shuffled_olympics()
    plot_gp(x, y, xt, lengthscale=12, nr_points=nr_points, nr_samples=100, show_covariance=True, ylim=(-4,8))
    
if not interactive:
    plot_gp_olympics(nr_points=13)
_images/17873653ff073f46098017e523e2051ff02634059ba4e01226713f9eb8e35501.png

Remember that our prediction is a full distribution, with a mean and a covariance: \(P(\mathbf{y}|x_{test} , \mathbf{X}) = \mathcal{N}(\mathbf{\mu,\Sigma})\)

  • The mean is the same as the one computed with kernel ridge (if given the same kernel and hyperparameters)

  • The Gaussian process learned the covariance and the hyperparameters

@interact
def plot_gp_olympics_mean(nr_points=(0,27,1), add_mean=False):
    x, y, xt = shuffled_olympics()
    plot_gp(x, y, xt, lengthscale=12, nr_points=nr_points, add_mean=add_mean, nr_samples=100, show_covariance=True, ylim=(-4,8))
    
if not interactive:
    plot_gp_olympics_mean(nr_points=13)
_images/7adf93665b96f9444c21baebd58f0ddd18c46c2af26aafc594625c56a08555d3.png

The values on the diagonal of the covariance matrix give us the variance, so we can simply plot the mean and 95% confidence interval

@interact
def plot_gp_olympics_stdev(nr_points=(0,27,1), lengthscale=(4,20,1)):
    x, y, xt = shuffled_olympics()
    plot_gp(x, y, xt, lengthscale=lengthscale, nr_points=nr_points, nr_samples=100, show_covariance=True, show_stdev=True, ylim=(-4,8))
    
if not interactive:
    plot_gp_olympics_stdev(nr_points=13,  lengthscale=12)
_images/6fbba97b41ec7bcbd5e549304bb22cb681a14e7621527ba7b33c9689c5c2c0f9.png

Gaussian Processes in practice (with GPy)#

  • GPyRegression

  • Generate a kernel first

    • State the dimensionality of your input data

    • Variance and lengthscale are optional, default = 1

kernel = GPy.kern.RBF(input_dim=1, variance=1., lengthscale=1.)
  • Other kernels:

GPy.kern.BasisFuncKernel?
  • Build model:

m = GPy.models.GPRegression(X,Y,kernel)
import GPy
custom_kernel = GPy.kern.RBF(1) + GPy.kern.Linear(1) * GPy.kern.RatQuad(1.0)

figure, axes = plt.subplots(3,4, figsize=(10*fig_scale,7*fig_scale), tight_layout=True)
kerns = [GPy.kern.RBF(1), GPy.kern.Exponential(1), GPy.kern.Matern32(1), 
         GPy.kern.Matern52(1), GPy.kern.Brownian(1),GPy.kern.RatQuad(1), 
         GPy.kern.Linear(1), GPy.kern.PeriodicExponential(1), GPy.kern.White(1), 
         GPy.kern.Poly(1), GPy.kern.MLP(1), custom_kernel]
 
for k,a in zip(kerns, axes.flatten()):
    k.plot(ax=a, x=0.01)
    a.set_title(k.name.replace('_', ' '))
    a.tick_params(axis='both')
    a.set_ylabel("")
    a.set_xlabel("")
_images/7557784ac05f605ec4256fda6b04abfc18caf50bafbd28f6bd5cbdf22f6b4698.png

Matérn is a generalization of the RBF kernel that can interpolate between the RBF and Exponential kernels. The 'sum' panel shows the custom kernel RBF + Linear * RatQuad.

Effect of different kernels#

#!pip install "numpy<=1.23.5"
# Needed because of an unfixed deprecation bug in GPy
import copy
np.random.seed(5)
Xr = np.random.uniform(-5.,5.,(10,1))
Yr = Xr + np.sin(Xr*2) + np.random.randn(10,1)*0.05

# I need to make a deep copy since I will be training the kernels
kerneldict = {k.name.replace('_', ' ') : k for k in copy.deepcopy(kerns)}
height_ratio=1

@interact
def plot_kernels(kernel=kerneldict.keys()):
    
    plt.rcParams.update({"lines.markersize": 4})
    figure, axes = plt.subplots(1,2, figsize=(8*fig_scale,5*fig_scale*height_ratio), tight_layout=True, gridspec_kw={'width_ratios': [1, 3]})
    kerneldict[kernel].plot(ax=axes[0], x=0.01)
    
    m = GPy.models.GPRegression(Xr,Yr,kerneldict[kernel])
    m.optimize_restarts(num_restarts = 3, verbose=0)
    fig = m.plot(ax=axes[1])
    
    axes[0].set_title(kernel)
    axes[0].get_xaxis().set_visible(False)
    axes[0].get_yaxis().set_visible(False)
    
if not interactive:
    height_ratio=0.6
    plot_kernels("rbf")
    plot_kernels("sum")
_images/b91ae1b507e562d78c64f7ab228033fb45ea46bcdb76904b40b0df9aa30b1d04.png _images/685507bb69a0193f371233a0800081c53ce577b9e1983fc11e58bdbf10b9d51a.png

Build the untrained GP. The shaded region corresponds to ~95% confidence intervals (i.e. +/- 2 standard deviations).

# Generate noisy sine data
X = np.random.uniform(-5.,5.,(10,1))
Y = np.sin(X) + np.random.randn(10,1)*0.05

# Build untrained model
kernel = GPy.kern.RBF(input_dim=1, variance=1., lengthscale=1.) 
m = GPy.models.GPRegression(X,Y,kernel)
fig = m.plot()
_images/3cfd56e36365cd3e97ee05b4928789fdde627ff1cc8d7a0b483e15a1b8552172.png

Train the model (optimize the kernel parameters): maximize the likelihood of the data.
It is best to optimize with a few restarts: the optimizer may converge to a high-noise solution, so it is restarted with a few random initializations of the parameter values.

m.optimize_restarts(num_restarts = 3, verbose=0)
fig = m.plot()
_images/3eaa58a612cec22f416d8ec443c231127a12e3528b8d9ff9772427c2ebac9ec5.png

You can also show results in 2D

# sample inputs and outputs
X = np.random.uniform(-3.,3.,(50,2))
Y = np.sin(X[:,0:1]) * np.sin(X[:,1:2])+np.random.randn(50,1)*0.05 
ker = GPy.kern.Matern52(2,ARD=True) + GPy.kern.White(2)

# create simple GP model
m = GPy.models.GPRegression(X,Y,ker)

# optimize and plot
m.optimize(max_f_eval = 1000)
fig = m.plot()
_images/6ac3e7189b0e2bc609a004da3824496f758854d2d4b9fcc4a50869496cc0a762.png

We can plot 2D slices using the fixed_inputs argument to the plot function.
fixed_inputs is a list of tuples containing which of the inputs to fix, and to which value.

slices = [-1, 0, 1.5]
figure = GPy.plotting.plotting_library().figure(3, 1)
figure.set_size_inches(8*fig_scale,5*fig_scale)
for i, y in zip(range(3), slices):
    # Use fixed_inputs=[(0,y)] for vertical slices
    canvas = m.plot(figure=figure, fixed_inputs=[(1,y)], row=(i+1), plot_data=False)
_images/d302195ca68dc0a0d3d31cc0f42429a44ecf36a32e85f1add1ac9075c8e57a5c.png

Gaussian Processes with scikit-learn#

  • GaussianProcessRegressor

  • Hyperparameters:

    • kernel: kernel specifying the covariance function of the GP

      • Default: “1.0 * RBF(1.0)”

      • Typically leave at default. Will be optimized during fitting

    • alpha: regularization parameter

      • Tikhonov regularization of covariance between the training points.

      • Adds a (small) value to diagonal of the kernel matrix during fitting.

      • Larger values:

        • correspond to increased noise level in the observations

        • also reduce potential numerical issues during fitting

      • Default: 1e-10

    • n_restarts_optimizer: number of restarts of the optimizer

      • Default: 0. Best to do at least a few restarts.

      • Optimizer finds kernel parameters maximizing log-marginal likelihood

  • Retrieve predictions and confidence interval after fitting:

y_pred, sigma = gp.predict(x, return_std=True)

Example

from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, ConstantKernel as C

def f(x):
    """The function to predict."""
    return x * np.sin(x)

X = np.atleast_2d([1., 3., 5., 6., 7., 8.]).T

# Observations
y = f(X).ravel()

# Mesh the input space for evaluations of the real function, the prediction and
# its MSE
x = np.atleast_2d(np.linspace(0, 10, 1000)).T

# Instantiate a Gaussian Process model
kernel = C(1.0, (1e-3, 1e3)) * RBF(10, (1e-2, 1e2))
gp = GaussianProcessRegressor(kernel=kernel, n_restarts_optimizer=9)

# Fit to data using Maximum Likelihood Estimation of the parameters
gp.fit(X, y)

# Make the prediction on the meshed x-axis (ask for MSE as well)
y_pred, sigma = gp.predict(x, return_std=True)

# Plot the function, the prediction and the 95% confidence interval based on
# the MSE
fig = plt.figure()
plt.plot(x, f(x), 'r:', label=u'$f(x) = x\,\sin(x)$')
plt.plot(X, y, 'r.', markersize=2, label=u'Observations')
plt.plot(x, y_pred, 'b-', label=u'Prediction')
plt.fill(np.concatenate([x, x[::-1]]),
         np.concatenate([y_pred - 1.9600 * sigma,
                        (y_pred + 1.9600 * sigma)[::-1]]),
         alpha=.5, fc='b', ec='None', label='95% confidence interval')
plt.xlabel('$x$')
plt.ylabel('$f(x)$')
plt.ylim(-10, 20)
plt.legend(loc='upper left'); 
_images/9c137923145999d78d6882e1c31a2133f642ae7434b0e459c88797c867998070.png

Example with noisy data

X = np.linspace(0.1, 9.9, 20)
X = np.atleast_2d(X).T

# Observations and noise
y = f(X).ravel()
dy = 0.5 + 1.0 * np.random.random(y.shape)
noise = np.random.normal(0, dy)
y += noise

# Instantiate a Gaussian Process model
gp = GaussianProcessRegressor(kernel=kernel, alpha=(dy / y) ** 2,
                              n_restarts_optimizer=10)

# Fit to data using Maximum Likelihood Estimation of the parameters
gp.fit(X, y)

# Make the prediction on the meshed x-axis (ask for MSE as well)
y_pred, sigma = gp.predict(x, return_std=True)

# Plot the function, the prediction and the 95% confidence interval based on
# the MSE
fig = plt.figure()
plt.plot(x, f(x), 'r:', label=u'$f(x) = x\,\sin(x)$')
plt.errorbar(X.ravel(), y, dy, fmt='r.', markersize=2, label=u'Observations')
plt.plot(x, y_pred, 'b-', label=u'Prediction')
plt.fill(np.concatenate([x, x[::-1]]),
         np.concatenate([y_pred - 1.9600 * sigma,
                        (y_pred + 1.9600 * sigma)[::-1]]),
         alpha=.5, fc='b', ec='None', label='95% confidence interval')
plt.xlabel('$x$')
plt.ylabel('$f(x)$')
plt.ylim(-10, 20)
plt.legend(loc='upper left')

plt.show() 
_images/00509565abc8506e7e542442a84c7742d4df1c9e56777fcb8139d2c67dc94aed.png

Gaussian processes: Conclusions#

Advantages:

  • The prediction is probabilistic (Gaussian) so that one can compute empirical confidence intervals.

  • The prediction interpolates the observations (at least for regular kernels).

  • Versatile: different kernels can be specified.

Disadvantages:

  • They are typically not sparse, i.e., they use the whole sample/feature information to perform the prediction.

    • Sparse GPs also exist: they remember only the most important points

  • They lose efficiency in high-dimensional spaces, i.e. when the number of features exceeds a few dozen.

Gaussian processes and neural networks#

  • You can prove that a Gaussian process is equivalent to a neural network with a single, infinitely wide hidden layer

  • You can build deep Gaussian Processes by constructing layers of GPs


Bayesian optimization#

  • The incremental updates you can do with Bayesian models allow a more effective way to optimize functions

    • E.g. to optimize the hyperparameter settings of a machine learning algorithm/pipeline

  • After a number of random search iterations we know more about the performance of hyperparameter settings on the given dataset

  • We can use this data to train a model, and predict which other hyperparameter values might be useful

    • More generally, this is called model-based optimization

    • This model is called a surrogate model

  • This is often a probabilistic (e.g. Bayesian) model that predicts confidence intervals for all hyperparameter settings

  • We use the predictions of this model to choose the next point to evaluate

  • With every new evaluation, we update the surrogate model and repeat
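
A minimal sketch of this loop, using a scikit-learn GaussianProcessRegressor as the surrogate and an upper-confidence-bound acquisition over a 1D grid; the objective function and all constants below are made up for illustration:

import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor

def objective(x):  # stand-in for an expensive cross-validation run
    return -(x - 2.0)**2 + np.sin(5 * x)

grid = np.linspace(0, 4, 200).reshape(-1, 1)              # candidate hyperparameter values
X_obs = np.random.RandomState(0).uniform(0, 4, (3, 1))    # a few random initial evaluations
y_obs = objective(X_obs).ravel()

for it in range(10):
    surrogate = GaussianProcessRegressor(normalize_y=True).fit(X_obs, y_obs)
    mu, std = surrogate.predict(grid, return_std=True)
    acq = mu + 1.5 * std                                   # upper confidence bound: exploit + explore
    x_next = grid[np.argmax(acq)].reshape(1, 1)            # next setting to evaluate
    X_obs = np.vstack([X_obs, x_next])
    y_obs = np.append(y_obs, objective(x_next).ravel())

print(X_obs[np.argmax(y_obs)], y_obs.max())                # best setting found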

Example (see figure):#

  • Consider only 1 continuous hyperparameter (X-axis)

    • You can also do this for many more hyperparameters

  • Y-axis shows cross-validation performance

  • Evaluate a number of random hyperparameter settings (black dots)

    • Sometimes an initialization design is used

  • Train a model, and predict the expected performance of other (unseen) hyperparameter values

    • Mean value (black line) and distribution (blue band)

  • An acquisition function (green line) trades off maximal expected performance against maximal uncertainty

    • Exploitation vs exploration

  • The optimum of the acquisition function gives the next hyperparameter setting to evaluate

  • Repeat a fixed number of times, or until time budget runs out

(figure: an iteration of Bayesian optimization)

Shahriari et al. Taking the Human Out of the Loop: A Review of Bayesian Optimization

In 2 dimensions:

(figure: Bayesian optimization in two dimensions)

Surrogate models#

  • Surrogate model can be anything as long as it can do regression and is probabilistic

  • Gaussian Processes are commonly used

    • Smooth, good extrapolation, but don’t scale well to many hyperparameters (cubic)

    • Sparse GPs: select ‘inducing points’ that minimize info loss, more scalable

    • Multi-task GPs: transfer surrogate models from other tasks

  • Random Forests

    • A lot more scalable, but don’t extrapolate well

    • Often an interpolation between predictions is used instead of the raw (step-wise) predictions

  • Bayesian Neural Networks:

    • Expensive, sensitive to hyperparameters

Acquisition Functions#

  • When we have trained the surrogate model, we use it to make predictions for a number of candidate points

    • Can be simply random sampling

    • Better: Thompson sampling

      • fit a Gaussian distribution (a mixture of Gaussians) over the sampled points

      • sample new points close to the means of the fitted Gaussians

  • Typical acquisition function: Expected Improvement

    • Models the predicted performance as a Gaussian distribution with the predicted mean and standard deviation

    • Computes the expected performance improvement over the previous best configuration \(\mathbf{X^+}\):

    \[EI(\mathbf{X}) := \mathbb{E}\left[ \max\{0, f(\mathbf{X^+}) - f_{t+1}(\mathbf{X}) \} \right]\]

    • Computing the expected improvement requires an integration over the posterior distribution, but it has a closed-form solution (sketched below).
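
A sketch of that closed form for a maximization problem, given the surrogate's predictive mean and standard deviation at candidate points (the exploration margin xi is an extra knob added here for illustration):

import numpy as np
from scipy.stats import norm

def expected_improvement(mu, std, best_f, xi=0.01):
    # Closed-form EI for maximization, given predictive mean/std of the surrogate
    std = np.maximum(std, 1e-12)  # avoid division by zero
    z = (mu - best_f - xi) / std
    return (mu - best_f - xi) * norm.cdf(z) + std * norm.pdf(z)

# Example: EI at three candidate points when the best observed value so far is 0.8
print(expected_improvement(np.array([0.7, 0.9, 1.1]), np.array([0.2, 0.1, 0.3]), best_f=0.8))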

Bayesian Optimization: conclusions#

  • More efficient way to optimize hyperparameters

  • More similar to what humans would do

  • Harder to parallelize

  • Choice of surrogate model depends on your search space

    • Very active research area

    • For very high-dimensional search spaces, random forests are popular