Feature Maps: Bridging to Kernel Methods#

1. Introduction to Feature Maps#

A feature map \(\phi\) transforms input data into a higher-dimensional space:

\[ \phi: \mathbb{R}^p \rightarrow \mathbb{R}^d \quad \text{where } d \gg p \]

Key Motivation: Enable linear models to solve non-linear problems (a minimal sketch of such a map follows below) by:

  • Explicit mapping for classification/clustering

  • Basis expansion for regression

For advanced feature creation techniques, see Feature-Engine’s MathFeatures.
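As a minimal illustration, the sketch below defines an explicit map from 2D inputs to 3D by appending the squared radius as a third coordinate (the helper name phi_ring is ours, not from any library); this is exactly the map used in the two-ring example that follows.

import numpy as np

def phi_ring(X):
    """Explicit feature map phi(x, y) = [x, y, x^2 + y^2]."""
    X = np.asarray(X)
    return np.column_stack((X[:, 0], X[:, 1], X[:, 0]**2 + X[:, 1]**2))

# A point on a small circle and a point on a large circle
print(phi_ring(np.array([[0.3, 0.0], [1.0, 0.0]])))
# The third coordinate (0.09 vs. 1.0) is what lets a plane separate the two rings.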

2. Classification: Two-Ring Problem#

2.1 Original 2D Space (Linear Failure)#

import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_circles
from sklearn.linear_model import Perceptron
from sklearn.cluster import KMeans
from mpl_toolkits.mplot3d import Axes3D

# 1. Generate Two-Ring Dataset
X, y = make_circles(n_samples=800, noise=0.07, factor=0.4, random_state=42)

# Create a figure with two subplots side-by-side
fig, axes = plt.subplots(1, 2, figsize=(15, 6)) # Adjusted figsize for better side-by-side view
# 2. Visualize Original 2D Data (on the first subplot)
axes[0].scatter(X[y == 0, 0], X[y == 0, 1], color='red', alpha=0.6, label='Class 0')
axes[0].scatter(X[y == 1, 0], X[y == 1, 1], color='blue', alpha=0.6, label='Class 1')
axes[0].set_title("Two-Ring Dataset in 2D Space")
axes[0].set_xlabel("Feature 1")
axes[0].set_ylabel("Feature 2")
axes[0].legend()
axes[0].grid(True)
# 3. Apply Perceptron (Linear Classifier)
perceptron = Perceptron(max_iter=1000, random_state=42)
perceptron.fit(X, y)

# Create decision boundary grid
xx, yy = np.meshgrid(np.linspace(-1.5, 1.5, 200), np.linspace(-1.5, 1.5, 200))
Z = perceptron.decision_function(np.c_[xx.ravel(), yy.ravel()])
Z = Z.reshape(xx.shape)

# Plot Perceptron Classification (on the second subplot)
axes[1].contourf(xx, yy, Z, levels=[-1, 0, 1], alpha=0.4, cmap='RdBu')
axes[1].scatter(X[y == 0, 0], X[y == 0, 1], color='red', alpha=0.6, label='Class 0')
axes[1].scatter(X[y == 1, 0], X[y == 1, 1], color='blue', alpha=0.6, label='Class 1')
axes[1].set_title("Perceptron Classification in 2D (Fails to Separate)")
axes[1].set_xlabel("Feature 1")
axes[1].set_ylabel("Feature 2")
axes[1].legend()

# Adjust layout and display the combined plot
plt.tight_layout()
plt.show()

The map \(\phi(x,y) = [x, y, x^2+y^2]\) makes the classes linearly separable by turning the squared radial distance \(x^2+y^2\) into an explicit linear feature.

# Transform to 3D (z = x² + y²)
X_3d = np.column_stack((X[:, 0], X[:, 1], X[:, 0]**2 + X[:, 1]**2))

# Perceptron in 3D
perceptron.fit(X_3d, y)
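To quantify the improvement, here is a quick check that reuses X, y, and X_3d from the cells above (exact scores depend on the noise level and random seed):

# Training accuracy of a linear classifier before and after the feature map
acc_2d = Perceptron(max_iter=1000, random_state=42).fit(X, y).score(X, y)
acc_3d = Perceptron(max_iter=1000, random_state=42).fit(X_3d, y).score(X_3d, y)
print(f"Perceptron accuracy in 2D: {acc_2d:.2f}")  # typically near chance level
print(f"Perceptron accuracy in 3D: {acc_3d:.2f}")  # typically close to 1.0 at this noise level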
import matplotlib.pyplot as plt
# %matplotlib notebook
# Plot in 3D
fig = plt.figure(figsize=(10, 7))
ax = fig.add_subplot(111, projection='3d')
ax.scatter(X_3d[y==0, 0], X_3d[y==0, 1], X_3d[y==0, 2], 
           color='red', label='Class 0', alpha=0.6, s=20)
ax.scatter(X_3d[y==1, 0], X_3d[y==1, 1], X_3d[y==1, 2], 
           color='blue', label='Class 1', alpha=0.6, s=20)

# Add optimal separating plane
xx, yy = np.meshgrid(np.linspace(-1.5, 1.5, 20), np.linspace(-1.5, 1.5, 20))
zz = np.full(xx.shape, 0.5)  # Illustrative separating plane at z = 0.5, between the rings' squared radii
ax.plot_surface(xx, yy, zz, alpha=0.3, color='green')

ax.set_title("Linearly Separable in 3D Space")
ax.set_xlabel("Feature 1 (x)")
ax.set_ylabel("Feature 2 (y)")
ax.set_zlabel("Feature 3 (x² + y²)")
ax.legend()
plt.show()
import plotly.graph_objects as go

fig = go.Figure(data=[
    go.Scatter3d(
        x=X_3d[y==0, 0], y=X_3d[y==0, 1], z=X_3d[y==0, 2],
        mode='markers', marker=dict(size=3, color='red'), name='Class 0'
    ),
    go.Scatter3d(
        x=X_3d[y==1, 0], y=X_3d[y==1, 1], z=X_3d[y==1, 2],
        mode='markers', marker=dict(size=3, color='blue'), name='Class 1'
    ),
    go.Surface(
        x=xx, y=yy, z=zz, opacity=0.3, colorscale='Greens', showscale=False
    )
])

fig.update_layout(
    title="Linearly Separable in 3D Space (Interactive)",
    scene=dict(
        xaxis_title='Feature 1 (x)',
        yaxis_title='Feature 2 (y)',
        zaxis_title='Feature 3 (x² + y²)'
    )
)
fig.show()

3. Clustering: Two-Ring Problem#

# Apply K-Means Clustering in 2D
kmeans = KMeans(n_clusters=2, random_state=42)
kmeans.fit(X)
cluster_labels = kmeans.labels_

fig = plt.figure(figsize=(12, 6))

# K-Means results in 2D (left subplot): a straight boundary splits the rings incorrectly
ax1 = fig.add_subplot(121)
ax1.scatter(X[cluster_labels == 0, 0], X[cluster_labels == 0, 1], color='red', alpha=0.6, label='Cluster 0')
ax1.scatter(X[cluster_labels == 1, 0], X[cluster_labels == 1, 1], color='blue', alpha=0.6, label='Cluster 1')
ax1.set_title("K-Means Clustering in 2D (Incorrect Separation)")
ax1.set_xlabel("Feature 1")
ax1.set_ylabel("Feature 2")
ax1.legend()

# Transform to 3D (z = x² + y²)
X_3d = np.column_stack((X[:, 0], X[:, 1], X[:, 0]**2 + X[:, 1]**2))

# Re-fit K-Means in the 3D feature space, where the rings are well separated
kmeans.fit(X_3d)
cluster_labels_3d = kmeans.labels_

# K-Means results in 3D
ax2 = fig.add_subplot(122, projection='3d')
ax2.scatter(X_3d[cluster_labels_3d == 0, 0], 
            X_3d[cluster_labels_3d == 0, 1], 
            X_3d[cluster_labels_3d == 0, 2], color='red', alpha=0.6, label='Cluster 0')
ax2.scatter(X_3d[cluster_labels_3d == 1, 0], 
            X_3d[cluster_labels_3d == 1, 1], 
            X_3d[cluster_labels_3d == 1, 2], color='blue', alpha=0.6, label='Cluster 1')
ax2.set_title("K-Means Clustering in 3D (Correct Separation)")
ax2.set_xlabel("Feature 1")
ax2.set_ylabel("Feature 2")
ax2.set_zlabel("x² + y²")
ax2.legend()

plt.tight_layout()
plt.show()

4. Regression Example: Using Feature Maps for Non-Linear Data#

Goal: To demonstrate how mapping features to a higher-dimensional space can allow a linear model to fit non-linear data.

Motivation: Standard linear regression models assume a linear relationship between the features (independent variables) and the target (dependent variable). What happens when the underlying relationship is non-linear? A simple linear model will perform poorly.

Consider a scenario where the data follows a curve, for example, a quadratic relationship. A straight line (from linear regression) won’t capture this curve effectively.

Solution Idea: We can transform the original features into a higher-dimensional space where the relationship becomes linear. This transformation is called a feature map, denoted by Φ.

Let’s illustrate this with an example.

1. Generating Non-Linear Data#

We’ll create synthetic data where y depends quadratically on x, plus some random noise to make it more realistic.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

np.random.seed(42)

# Generate data points
m = 100 # number of samples
X = 6 * np.random.rand(m, 1) - 3 # Feature values between -3 and 3
y = 0.5 * X**2 + X + 2 + np.random.randn(m, 1) # Quadratic relationship + noise

# Visualize the generated data
plt.figure(figsize=(8, 6))
plt.scatter(X, y, alpha=0.7)
plt.title("Generated Non-Linear Data (Quadratic Relationship)")
plt.xlabel("Feature (x)")
plt.ylabel("Target (y)")
plt.grid(True)
plt.show()

As you can see, the data clearly follows a curve, not a straight line.

2. Attempting Simple Linear Regression (Original Space)#

Let’s see how a standard linear regression model performs on this data without any feature transformation.

# Train a simple linear regression model
lin_reg = LinearRegression()
lin_reg.fit(X, y)

# Make predictions
X_new = np.linspace(-3, 3, 100).reshape(100, 1) # Points for plotting the line
y_pred_linear = lin_reg.predict(X_new)

# Visualize the fit
plt.figure(figsize=(8, 6))
plt.scatter(X, y, alpha=0.7, label="Data Points")
plt.plot(X_new, y_pred_linear, "r-", linewidth=2, label="Linear Regression Fit")
plt.title("Simple Linear Regression on Non-Linear Data")
plt.xlabel("Feature (x)")
plt.ylabel("Target (y)")
plt.legend()
plt.grid(True)
plt.show()

print(f"Linear Regression Score (R^2): {lin_reg.score(X, y):.4f}")
Linear Regression Score (R^2): 0.4260

The linear model fits the best straight line it can through the curved data, but the result is clearly a poor fit. The R² score of about 0.43 confirms that the model explains less than half of the variance in the data.

3. Applying a Feature Map (Polynomial Features)#

Now, let’s apply a feature map. Since we know the underlying relationship is quadratic (y ≈ ax² + bx + c), a suitable feature map Φ would transform our single feature x into two features: x and x². So, Φ(x) = [x, x²].

We can achieve this using Scikit-Learn’s PolynomialFeatures.

# Create polynomial features (degree=2 includes x and x^2)
# include_bias=False because LinearRegression handles the intercept (bias) term
poly_features = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly_features.fit_transform(X)

# Display the original feature and the transformed features for the first few samples
print("Original X (first 5 samples):\n", X[:5])
print("\nTransformed X_poly (first 5 samples) [x, x^2]:\n", X_poly[:5])
Original X (first 5 samples):
 [[-0.75275929]
 [ 2.70428584]
 [ 1.39196365]
 [ 0.59195091]
 [-2.06388816]]

Transformed X_poly (first 5 samples) [x, x^2]:
 [[-0.75275929  0.56664654]
 [ 2.70428584  7.3131619 ]
 [ 1.39196365  1.93756281]
 [ 0.59195091  0.35040587]
 [-2.06388816  4.25963433]]

Notice that our data is now represented in a 2-dimensional feature space [x, x²].

4. Linear Regression in the Higher-Dimensional Feature Space#

Now, we train a linear regression model, but using the transformed features (X_poly). The model will learn weights for both x and x², effectively fitting a model of the form y = w₁·x + w₂·x² + b, where w₁ and w₂ are the weights (coefficients) and b is the intercept (bias). This is a linear model with respect to the new features x and x².

Original linear model:

\[ \hat{y} = \mathbf{w}^T\mathbf{x} = \sum_{i=1}^{p} w_i x_i \]

A transformation \(\phi\) that maps features to higher dimensions:

\[ \phi: \mathbb{R}^p \rightarrow \mathbb{R}^d \quad (d \gg p) \]

New model becomes:

\[ \hat{y} = \mathbf{w}^T\phi(\mathbf{x}) \]

For p=1, degree=2 (showing the bias column explicitly; the code above passes include_bias=False and drops it because LinearRegression fits the intercept separately):

\[ [1, x_1] \xrightarrow{\phi} [1, x_1, x_1^2] \]
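To make the “linear in the new features” point concrete, we can solve the same least-squares problem directly with NumPy (a sketch reusing X_poly and y from the cells above; it should agree with the LinearRegression fit in the next cell up to numerical precision):

# Ordinary least squares on the design matrix [1, x, x^2]
Phi = np.hstack([np.ones((len(X_poly), 1)), X_poly])  # prepend the bias column
w_hat, *_ = np.linalg.lstsq(Phi, y, rcond=None)
print("Intercept and weights [b, w1, w2]:", w_hat.ravel())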
# Train a linear regression model on the polynomial features
poly_lin_reg = LinearRegression()
poly_lin_reg.fit(X_poly, y)

# Make predictions using the polynomial model
X_new_poly = poly_features.transform(X_new) # Transform the plotting points
y_pred_poly = poly_lin_reg.predict(X_new_poly)

# Visualize the fit
plt.figure(figsize=(8, 6))
plt.scatter(X, y, alpha=0.7, label="Data Points")
plt.plot(X_new, y_pred_poly, "r-", linewidth=2, label="Polynomial Regression Fit (Linear in Feature Space)")
plt.title("Linear Regression on Transformed (Polynomial) Features")
plt.xlabel("Feature (x)")
plt.ylabel("Target (y)")
plt.legend()
plt.grid(True)
plt.show()

print(f"Polynomial Regression Score (R^2): {poly_lin_reg.score(X_poly, y):.4f}")
# Display the learned coefficients (w1 for x, w2 for x^2) and intercept (b)
print(f"Coefficients (w1, w2): {poly_lin_reg.coef_}")
print(f"Intercept (b): {poly_lin_reg.intercept_}")
Polynomial Regression Score (R^2): 0.8525
Coefficients (w1, w2): [[0.93366893 0.56456263]]
Intercept (b): [1.78134581]

5. Conclusion#

By mapping the original single feature x to the higher-dimensional space [x, x²], we enabled a standard linear regression model to capture the non-linear (quadratic) relationship in the data. The resulting fit is much better, both visually and in terms of the R² score, which rises from about 0.43 to about 0.85.

This example illustrates the power of feature maps: transforming data into a space where linear models can become effective, even if the original relationship was non-linear.

Transition to Kernel Trick: While this explicit feature mapping works well for low-dimensional data and simple transformations, calculating and storing these higher-dimensional features can become computationally expensive or even infeasible if the original data or the target feature space is very high-dimensional (e.g., using polynomial features of a very high degree or other complex maps). The Kernel Trick provides a mathematical shortcut to achieve the same result as working in the high-dimensional feature space without explicitly computing the coordinates of the data in that space. It allows algorithms that only depend on dot products between data points (like Support Vector Machines) to operate implicitly in the high-dimensional feature space, making complex mappings computationally tractable. This regression example helps build the intuition for why such implicit mappings are desirable.
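As a small numerical illustration of the idea (a standalone sketch; the helper name phi_poly2 is ours): for the degree-2 polynomial kernel \(k(x, z) = (x^\top z + 1)^2\) on 2D inputs, there is an explicit 6-dimensional map \(\phi\) whose ordinary dot product reproduces the kernel exactly, so evaluating the kernel gives the high-dimensional inner product without ever constructing \(\phi(x)\).

import numpy as np

def phi_poly2(v):
    """Explicit map matching the degree-2 polynomial kernel (x.z + 1)^2 in 2D."""
    x1, x2 = v
    s = np.sqrt(2.0)
    return np.array([1.0, s * x1, s * x2, x1**2, x2**2, s * x1 * x2])

rng = np.random.default_rng(0)
x, z = rng.normal(size=2), rng.normal(size=2)

kernel_value = (x @ z + 1.0) ** 2             # implicit: stays in 2D
explicit_value = phi_poly2(x) @ phi_poly2(z)  # explicit: builds the 6D vectors

print(np.isclose(kernel_value, explicit_value))  # True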

For modern approaches, see arXiv:2406.03505.


To Do#

Limitations of Explicit Maps:#

  1. Cost: storing \(\phi(X)\) takes \(\mathcal{O}(nd)\) memory, and \(d\) itself can grow very quickly with the map (see the sketch after this list)

  2. Overfitting: Risk when \(d \gg n\)
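A quick back-of-the-envelope check of how fast \(d\) can grow: the number of polynomial features of total degree at most \(k\) in \(p\) variables (including the constant term) is \(\binom{p+k}{k}\), which the short sketch below evaluates for a few cases.

from math import comb

# d = C(p + k, k): number of monomials of degree <= k in p variables (incl. constant)
for p in (2, 10, 100):
    for k in (2, 3, 5):
        print(f"p={p:4d}, degree={k}: d = {comb(p + k, k):,}")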

Kernel Trick Solution:

  • Implicit computation via \(k(x_i,x_j) = \langle \phi(x_i), \phi(x_j) \rangle\)

  • Example: RBF kernel \(k(x,y) = \exp(-\gamma \|x-y\|^2)\) corresponds to infinite-dimensional \(\phi\)

For Ridge regression, the explicit-feature-space solution is

\[ \mathbf{w}^* = (\Phi(X)^T\Phi(X) + \alpha I)^{-1} \Phi(X)^T \mathbf{y} \]

  • Original features (p=1): \(\Phi(X)^T\Phi(X)\) is a \(1 \times 1\) matrix

  • After polynomial expansion to \(d=10\) features: it is a \(10 \times 10\) matrix (100 entries), so the cost of forming and solving the system grows roughly as \(d^2\) and \(d^3\) (see the sketch after this list)
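The sketch below makes the point numerically, using randomly generated placeholder features in place of \(\Phi(X)\): the primal ridge solution above, which solves a \(d \times d\) system in feature space, gives exactly the same predictions as the dual form, which only ever touches the \(n \times n\) matrix of inner products \(K = \Phi\Phi^T\); and those inner products are precisely what a kernel supplies.

import numpy as np

rng = np.random.default_rng(0)
n, d, alpha = 50, 6, 0.1
Phi = rng.normal(size=(n, d))  # stand-in for the mapped training data Phi(X)
y = rng.normal(size=n)

# Primal ridge: solve a d x d system in feature space
w = np.linalg.solve(Phi.T @ Phi + alpha * np.eye(d), Phi.T @ y)

# Dual ridge: solve an n x n system built only from inner products (the Gram matrix)
K = Phi @ Phi.T  # K[i, j] = <phi(x_i), phi(x_j)>
dual_coef = np.linalg.solve(K + alpha * np.eye(n), y)

# Both forms give identical predictions on new points
Phi_new = rng.normal(size=(5, d))
pred_primal = Phi_new @ w
pred_dual = (Phi_new @ Phi.T) @ dual_coef  # again, only inner products are needed
print(np.allclose(pred_primal, pred_dual))  # True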

5. Key Takeaways#

| Approach | Pros | Cons |
|---|---|---|
| Original Features | Simple, fast | Limited expressiveness |
| Feature Maps | Enables non-linearity; works with linear models | Computational cost; risk of overfitting |

Next Step: Kernel methods avoid explicit feature map computation!