Lecture 3: Kernel Trick#

Making linear models non-linear

Mahmood Amintoosi, Spring 2024

Computer Science Dept, Ferdowsi University of Mashhad

The original material of this course comes from the Open Machine Learning Course by Joaquin Vanschoren and others.

Note: The kernel trick underlies kernel methods (also known as kernel machines), a class of algorithms for pattern analysis that use linear classifiers to solve non-linear problems. These methods are distinct from kernelization, a technique for designing efficient algorithms that achieve their efficiency through a preprocessing stage in which the input to the algorithm is replaced by a smaller input, called a “kernel”.

# Auto-setup when running on Google Colab
import os
if 'google.colab' in str(get_ipython()) and not os.path.exists('/content/machine-learning'):
    !git clone -q https://github.com/fum-cs/machine-learning.git /content/machine-learning
    !pip --quiet install -r /content/machine-learning/requirements_colab.txt
    %cd machine-learning/notebooks

# Global imports and settings
%matplotlib inline
from preamble import *
interactive = True # Set to True for interactive plots
if interactive:
    fig_scale = 0.5
    plt.rcParams.update(print_config)
else: # For printing
    fig_scale = 1.3
    plt.rcParams.update(print_config)

Feature Maps#

  • Linear models: \(\hat{y} = \mathbf{w}^T\mathbf{x} + w_0 = \sum_{i=1}^{p} w_i x_i + w_0 = w_0 + w_1 x_1 + ... + w_p x_p \)

  • When we cannot fit the data well, we can add non-linear transformations of the features

  • Feature map (or basis expansion) \(\phi: X \rightarrow \mathbb{R}^d\)

\[y=\textbf{w}^T\textbf{x} \rightarrow y=\textbf{w}^T\phi(\textbf{x})\]
  • E.g. Polynomial feature map: all polynomials up to degree \(d\) and all products

\[[1, x_1, ..., x_p] \xrightarrow{\phi} [1, x_1, ..., x_p, x_1^2, ..., x_p^2, ..., x_p^d, x_1 x_2, ..., x_{p-1} x_p]\]
  • Example with \(p=1, d=3\) (a code sketch follows below):

\[y=w_0 + w_1 x_1 \xrightarrow{\phi} y=w_0 + w_1 x_1 + w_2 x_1^2 + w_3 x_1^3\]
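
A minimal sketch of this map in code (a toy example, not from the course notebooks): expand a single feature into \([x, x^2, x^3]\) and fit an ordinary linear model, which then recovers the cubic's coefficients.

import numpy as np
from sklearn.linear_model import LinearRegression

# Toy data from a known cubic, plus noise
rng = np.random.RandomState(0)
x = rng.uniform(-2, 2, size=50)
y = 1 - 2 * x + 0.5 * x ** 3 + 0.2 * rng.randn(50)

# Feature map phi(x) = [x, x^2, x^3]; the model remains linear in w
X_phi = np.column_stack([x, x ** 2, x ** 3])
reg = LinearRegression().fit(X_phi, y)
print(reg.intercept_, reg.coef_)  # approximately w0=1, w1=-2, w2=0, w3=0.5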

Ridge regression example#

from sklearn.linear_model import Ridge

X, y = mglearn.datasets.make_wave(n_samples=100)
line = np.linspace(-3, 3, 1000, endpoint=False).reshape(-1, 1)
reg = Ridge().fit(X, y)
print("Weights:",reg.coef_)
fig = plt.figure(figsize=(8*fig_scale,4*fig_scale))
plt.plot(line, reg.predict(line), label="linear regression", lw=2*fig_scale)

plt.plot(X[:, 0], y, 'o', c='k')
plt.ylabel("Regression output")
plt.xlabel("Input feature")
plt.legend(loc="best");
Weights: [0.418]
(Figure: linear regression fit to the wave data; regression output vs. input feature)
  • Add all polynomials \(x^d\) up to degree 10 and fit again:

    • e.g. use sklearn PolynomialFeatures

from sklearn.preprocessing import PolynomialFeatures

# include polynomials up to x ** 10:
# the default "include_bias=True" adds a feature that's constantly 1
poly = PolynomialFeatures(degree=10, include_bias=False)
poly.fit(X)
X_poly = poly.transform(X)
styles = [dict(selector="td", props=[("font-size", "150%")]),dict(selector="th", props=[("font-size", "150%")])]
pd.DataFrame(X_poly, columns=poly.get_feature_names_out()).head().style.set_table_styles(styles)
  x0 x0^2 x0^3 x0^4 x0^5 x0^6 x0^7 x0^8 x0^9 x0^10
0 -0.752759 0.566647 -0.426548 0.321088 -0.241702 0.181944 -0.136960 0.103098 -0.077608 0.058420
1 2.704286 7.313162 19.776880 53.482337 144.631526 391.124988 1057.713767 2860.360362 7735.232021 20918.278410
2 1.391964 1.937563 2.697017 3.754150 5.225640 7.273901 10.125005 14.093639 19.617834 27.307312
3 0.591951 0.350406 0.207423 0.122784 0.072682 0.043024 0.025468 0.015076 0.008924 0.005283
4 -2.063888 4.259634 -8.791409 18.144485 -37.448187 77.288869 -159.515582 329.222321 -679.478050 1402.366700
reg = Ridge().fit(X_poly, y)
print("Weights:",reg.coef_)
line_poly = poly.transform(line)
fig = plt.figure(figsize=(8*fig_scale,4*fig_scale))
plt.plot(line, reg.predict(line_poly), label='polynomial linear regression', lw=2*fig_scale)
plt.plot(X[:, 0], y, 'o', c='k')
plt.ylabel("Regression output")
plt.xlabel("Input feature", labelpad=0.1)
plt.tick_params(axis='both', pad=0.1)
plt.legend(loc="best");
Weights: [ 0.643  0.297 -0.69  -0.264  0.41   0.096 -0.076 -0.014  0.004  0.001]
(Figure: degree-10 polynomial ridge regression fit to the wave data)

How expensive is this?#

  • You may need MANY dimensions to fit the data

    • Memory and computational cost

    • More weights to learn, more likely overfitting

  • Ridge has a closed-form solution which we can compute with linear algebra (checked numerically below): \(w^{*} = (X^{T}X + \alpha I)^{-1} X^{T} Y\)

  • Since \(X\) has \(n\) rows (examples) and \(d\) columns (features), \(X^{T}X\) has dimensionality \(d \times d\)

  • Hence Ridge is quadratic in the number of features, \(\mathcal{O}(d^2n)\)

  • After the feature map \(\Phi\), we get \(w^{*} = (\Phi(X)^{T}\Phi(X) + \alpha I)^{-1} \Phi(X)^{T} Y\)

  • Since \(\Phi\) increases \(d\) a lot, \(\Phi(X)^{T}\Phi(X)\) becomes huge
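
A quick numeric check of the closed form (a sketch; \(\alpha\) and the data here are arbitrary choices):

import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.RandomState(0)
X = rng.randn(100, 3)
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.randn(100)

alpha = 1.0
# Closed form: w* = (X^T X + alpha I)^{-1} X^T y
w_closed = np.linalg.solve(X.T @ X + alpha * np.eye(X.shape[1]), X.T @ y)

# sklearn's Ridge (without intercept, to match the formula) agrees
w_ridge = Ridge(alpha=alpha, fit_intercept=False).fit(X, y).coef_
print(np.allclose(w_closed, w_ridge))  # True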

Linear SVM example (classification)#

from sklearn.datasets import make_blobs
from sklearn.svm import LinearSVC
X, y = make_blobs(centers=4, random_state=8)
y = y % 2
linear_svm = LinearSVC().fit(X, y)

fig = plt.figure(figsize=(8*fig_scale,4*fig_scale))
mglearn.plots.plot_2d_separator(linear_svm, X)
mglearn.discrete_scatter(X[:, 0], X[:, 1], y, s=10*fig_scale)
plt.xlabel("Feature 1")
plt.ylabel("Feature 2");
(Figure: linear SVM decision boundary on the two-class blobs data; feature 1 vs. feature 2)

We can add a new feature by taking the square of the second feature's values

# add the square of the second feature
X_new = np.hstack([X, X[:, 1:] ** 2])

# visualize in 3D
fig = plt.figure(figsize=(8*fig_scale,4*fig_scale))
ax = fig.add_subplot(111, projection='3d')

# plot first all the points with y==0, then all with y == 1
mask = y == 0
ax.scatter(X_new[mask, 0], X_new[mask, 1], X_new[mask, 2], c='b',
           cmap=mglearn.cm2, s=10*fig_scale)
ax.scatter(X_new[~mask, 0], X_new[~mask, 1], X_new[~mask, 2], c='r', marker='^',
           cmap=mglearn.cm2, s=10*fig_scale)
ax.set_xlabel("feature1", labelpad=-9)
ax.set_ylabel("feature2", labelpad=-9)
ax.set_zlabel("feature2 ** 2", labelpad=-9);
ax.tick_params(axis='both', width=0, labelsize=5*fig_scale, pad=-3)
(Figure: 3D scatter plot of the expanded dataset: feature1, feature2, feature2 ** 2)

Now we can separate the two classes with a linear model (a plane in the 3D space)

linear_svm_3d = LinearSVC().fit(X_new, y)
coef, intercept = linear_svm_3d.coef_.ravel(), linear_svm_3d.intercept_

# show linear decision boundary
fig = plt.figure(figsize=(8*fig_scale,4*fig_scale))
ax = fig.add_subplot(111, projection='3d')
xx = np.linspace(X_new[:, 0].min() - 2, X_new[:, 0].max() + 2, 50)
yy = np.linspace(X_new[:, 1].min() - 2, X_new[:, 1].max() + 2, 50)

XX, YY = np.meshgrid(xx, yy)
ZZ = (coef[0] * XX + coef[1] * YY + intercept) / -coef[2]
ax.plot_surface(XX, YY, ZZ, rstride=8, cstride=8, alpha=0.3)
ax.scatter(X_new[mask, 0], X_new[mask, 1], X_new[mask, 2], c='b',
           cmap=mglearn.cm2, s=10*fig_scale)
ax.scatter(X_new[~mask, 0], X_new[~mask, 1], X_new[~mask, 2], c='r', marker='^',
           cmap=mglearn.cm2, s=10*fig_scale)

ax.set_xlabel("feature1", labelpad=-9)
ax.set_ylabel("feature2", labelpad=-9)
ax.set_zlabel("feature2 ** 2", labelpad=-9);
ax.tick_params(axis='both', width=0, labelsize=10*fig_scale, pad=-6)
(Figure: linear decision boundary, a plane, separating the classes in the expanded 3D space)

As a function of the original features, the decision boundary is now polynomial as well: predict class 1 when \(w_0 + w_1 x_1 + w_2 x_2 + w_3 x_2^2 > 0\)

ZZ = YY ** 2
dec = linear_svm_3d.decision_function(np.c_[XX.ravel(), YY.ravel(), ZZ.ravel()])
figure = plt.figure(figsize=(8*fig_scale,4*fig_scale))
plt.contourf(XX, YY, dec.reshape(XX.shape), levels=[dec.min(), 0, dec.max()],
             cmap=mglearn.cm2, alpha=0.5)
mglearn.discrete_scatter(X[:, 0], X[:, 1], y, s=10*fig_scale)
plt.xlabel("Feature 1")
plt.ylabel("Feature 2"); 
(Figure: the corresponding decision boundary in the original 2D space is non-linear: a parabolic curve)

The kernel trick#

  • Computations in explicit, high-dimensional feature maps are expensive

  • For some feature maps, however, we can compute dot products (and hence distances) between points cheaply

    • Without explicitly constructing the high-dimensional space at all

  • Example: quadratic feature map for \(\mathbf{x} = (x_1,..., x_p )\):

\[ \Phi(\mathbf{x}) = (x_1,..., x_p , x_1^2,..., x_p^2 , \sqrt{2} x_1 x_2 , ..., \sqrt{2} x_{p-1} x_{p}) \]
  • A kernel function exists for this feature map to compute dot products

\[ k_{quad}(\mathbf{x_i},\mathbf{x_j}) = \Phi(\mathbf{x_i}) \cdot \Phi(\mathbf{x_j}) = \mathbf{x_i} \cdot \mathbf{x_j} + (\mathbf{x_i} \cdot \mathbf{x_j})^2\]
  • Skip the computation of \(\Phi(x_i)\) and \(\Phi(x_j)\) and compute \(k(x_i,x_j)\) directly (see the check below)
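
A small numeric check of this identity (a sketch; the vectors are arbitrary):

import numpy as np

def phi_quad(x):
    # Explicit quadratic feature map: (x_1,...,x_p, x_1^2,...,x_p^2, sqrt(2) x_i x_j for i < j)
    p = len(x)
    cross = [np.sqrt(2) * x[i] * x[j] for i in range(p) for j in range(i + 1, p)]
    return np.concatenate([x, x ** 2, cross])

def k_quad(xi, xj):
    # Same dot product computed directly, without building the feature space
    return xi @ xj + (xi @ xj) ** 2

xi = np.array([1.0, 2.0, 3.0])
xj = np.array([0.5, -1.0, 2.0])
print(phi_quad(xi) @ phi_quad(xj), k_quad(xi, xj))  # both print 24.75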

Kernel Method#

  • Kernel \(k\) corresponding to a feature map \(\Phi\): \( k(\mathbf{x_i},\mathbf{x_j}) = \Phi(\mathbf{x_i}) \cdot \Phi(\mathbf{x_j})\)

  • Computes the dot product between \(\Phi(\mathbf{x_i})\) and \(\Phi(\mathbf{x_j})\) in a high-dimensional space \(\mathcal{H}\)

    • Kernels are sometimes called generalized dot products

    • \(\mathcal{H}\) is called the reproducing kernel Hilbert space (RKHS)

  • The dot product is a measure of the similarity between \(x_i,x_j\)

    • Hence, a kernel can be seen as a similarity measure for high-dimensional spaces

  • If we have a loss function based on dot products \(\mathbf{x_i}\cdot\mathbf{x_j}\) it can be kernelized

    • Simply replace the dot products with \(k(\mathbf{x_i},\mathbf{x_j})\), as in the example below
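
For example, scikit-learn's SVC accepts a precomputed kernel (Gram) matrix, so we can train on kernel values instead of raw features (a sketch; the degree-2 polynomial kernel and its parameters are illustrative choices):

from sklearn.datasets import make_blobs
from sklearn.metrics.pairwise import polynomial_kernel
from sklearn.svm import SVC

X, y = make_blobs(centers=4, random_state=8)
y = y % 2

# Gram matrix K[i, j] = k(x_i, x_j) for a degree-2 polynomial kernel
K_train = polynomial_kernel(X, X, degree=2, gamma=1.0, coef0=1.0)
svm = SVC(kernel='precomputed').fit(K_train, y)

# Prediction needs kernel values between new points and the training points;
# here we simply evaluate on the training set itself
print("Training accuracy:", svm.score(polynomial_kernel(X, X, degree=2, gamma=1.0, coef0=1.0), y))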


Chapter 5 of Zaki's book#

Kernel K-Means#
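
The idea: k-means only needs squared distances to cluster means, and these can be written entirely in terms of dot products, hence in terms of kernel values: \(||\Phi(x_i) - \mu_c||^2 = K_{ii} - \frac{2}{|c|}\sum_{j \in c} K_{ij} + \frac{1}{|c|^2}\sum_{j,l \in c} K_{jl}\). A minimal illustrative implementation (a sketch, not the lecture's or the book's code; the dataset and \(\gamma\) are arbitrary choices):

import numpy as np
from sklearn.datasets import make_circles
from sklearn.metrics.pairwise import rbf_kernel

def kernel_kmeans(K, n_clusters, n_iter=100, seed=0):
    # Lloyd's algorithm with distances computed from the Gram matrix K:
    # ||phi(x_i) - mu_c||^2 = K_ii - (2/|c|) sum_{j in c} K_ij + (1/|c|^2) sum_{j,l in c} K_jl
    n = K.shape[0]
    labels = np.random.RandomState(seed).randint(n_clusters, size=n)
    for _ in range(n_iter):
        dist = np.full((n, n_clusters), np.inf)
        for c in range(n_clusters):
            members = labels == c
            size = members.sum()
            if size == 0:
                continue  # empty clusters stay at infinite distance
            dist[:, c] = (np.diag(K)
                          - 2 * K[:, members].sum(axis=1) / size
                          + K[np.ix_(members, members)].sum() / size ** 2)
        new_labels = dist.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
    return labels

# Two concentric circles: plain k-means fails here, while an RBF kernel
# can make the two rings separable
X, _ = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)
labels = kernel_kmeans(rbf_kernel(X, gamma=5.0), n_clusters=2)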

Other examples#

Summary#

  • Feature maps \(\Phi(x)\) transform features to create a higher-dimensional space

    • Allows learning non-linear functions or boundaries, but very expensive/slow

  • For some \(\Phi(x)\), we can compute dot products without constructing this space

    • Kernel trick: \(k(\mathbf{x_i},\mathbf{x_j}) = \Phi(\mathbf{x_i}) \cdot \Phi(\mathbf{x_j})\)

    • Kernel \(k\) (generalized dot product) is a measure of similarity between \(\mathbf{x_i}\) and \(\mathbf{x_j}\)

  • There are many such kernels (the first two below are checked numerically after this summary)

    • Polynomial kernel: \(k_{poly}(\mathbf{x_1},\mathbf{x_2}) = (\gamma (\mathbf{x_1} \cdot \mathbf{x_2}) + c_0)^d\)

    • RBF (Gaussian) kernel: \(k_{RBF}(\mathbf{x_1},\mathbf{x_2}) = \exp(-\gamma ||\mathbf{x_1} - \mathbf{x_2}||^2)\)

    • A kernel matrix can be precomputed using any similarity measure (e.g. for text, graphs,…)

  • Any loss function where inputs appear only as dot products can be kernelized

    • E.g. Linear SVMs: simply replace the dot product with a kernel of choice

  • The Representer theorem states which other loss functions can also be kernelized and how

    • Ridge regression, Logistic regression, Perceptrons,…
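
A quick numeric check of the two kernel formulas above against scikit-learn (parameter values are arbitrary):

import numpy as np
from sklearn.metrics.pairwise import polynomial_kernel, rbf_kernel

x1 = np.array([[1.0, 2.0]])
x2 = np.array([[0.5, -1.0]])
gamma, c0, d = 0.1, 1.0, 3

# Polynomial kernel by hand vs. sklearn
print((gamma * (x1 @ x2.T)[0, 0] + c0) ** d,
      polynomial_kernel(x1, x2, degree=d, gamma=gamma, coef0=c0)[0, 0])

# RBF kernel by hand vs. sklearn
print(np.exp(-gamma * np.sum((x1 - x2) ** 2)),
      rbf_kernel(x1, x2, gamma=gamma)[0, 0])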