2.5 Norms
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
# Plot parameters
sns.set()
%matplotlib inline
plt.rcParams['figure.figsize'] = (4, 4)
plt.rcParams['xtick.major.size'] = 0
plt.rcParams['ytick.major.size'] = 0
# Avoid displaying inaccurate floating-point values (for inverse matrices in dot products, for instance)
# See https://stackoverflow.com/questions/24537791/numpy-matrix-inversion-rounding-errors
np.set_printoptions(suppress=True)
The previous chapter was heavy, but this one is lighter. We will however see an important concept for machine learning and deep learning: the norm. The norm is what is generally used to evaluate the error of a model. For instance, it is used to calculate the error between the output of a neural network and what is expected (the actual label or value). You can think of the norm as the length of a vector. It is a function that maps a vector to a non-negative value. Different functions can be used, and we will see a few examples.
Norms are functions characterized by the following properties:
1- Norms are non-negative values. If you think of the norm as a length, you can easily see why it can't be negative.
2- Norms are \(0\) if and only if the vector is the zero vector.
3- Norms respect the triangle inequality. See below.
4- \(\norm{\bs{k}\cdot \bs{u}}=|\bs{k}|\cdot \norm{\bs{u}}\). The norm of a vector multiplied by a scalar is equal to the absolute value of this scalar multiplied by the norm of the vector.
A norm is usually written with two vertical bars: \(\norm{\bs{x}}\).
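As a quick sanity check, here is a minimal sketch (assuming the Euclidean norm computed by np.linalg.norm, which is introduced later in this section) that verifies the four properties above on concrete vectors:

# Minimal sketch: check the four norm properties on example vectors
u = np.array([1, 6])
v = np.array([4, 2])
k = -3

print(np.linalg.norm(u) >= 0)                     # 1- non-negativity
print(np.linalg.norm(np.zeros(2)) == 0)           # 2- zero for the zero vector
print(np.linalg.norm(u + v)
      <= np.linalg.norm(u) + np.linalg.norm(v))   # 3- triangle inequality
print(np.isclose(np.linalg.norm(k * u),
                 abs(k) * np.linalg.norm(u)))     # 4- absolute homogeneity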
The triangle inequality
The norm of the sum of some vectors is less than or equal to the sum of the norms of these vectors:
\(\norm{\bs{u}+\bs{v}} \leq \norm{\bs{u}}+\norm{\bs{v}}\)
Example 1.
Take \(\bs{u}=\begin{bmatrix}1\\6\end{bmatrix}\) and \(\bs{v}=\begin{bmatrix}4\\2\end{bmatrix}\).
Let’s check these results:
u = np.array([1, 6])
u
array([1, 6])
v = np.array([4, 2])
v
array([4, 2])
u+v
array([5, 8])
np.linalg.norm(u+v)
9.433981132056603
np.linalg.norm(u)+np.linalg.norm(v)
10.554898485297798
u = [0, 0, 1, 6]
v = [0, 0, 4, 2]
# v drawn from the tip of u, to illustrate the sum
u_bis = [1, 6, v[2], v[3]]
# w = u + v drawn from the origin
w = [0, 0, 5, 8]
plt.quiver([u[0], u_bis[0], w[0]],
           [u[1], u_bis[1], w[1]],
           [u[2], u_bis[2], w[2]],
           [u[3], u_bis[3], w[3]],
           angles='xy', scale_units='xy', scale=1, color=sns.color_palette())
plt.xlim(-2, 6)
plt.ylim(-2, 9)
plt.axvline(x=0, color='grey')
plt.axhline(y=0, color='grey')
plt.text(-1, 3.5, r'$||\vec{u}||$', color=sns.color_palette()[0], size=20)
plt.text(2.5, 7.5, r'$||\vec{v}||$', color=sns.color_palette()[1], size=20)
plt.text(2, 2, r'$||\vec{u}+\vec{v}||$', color=sns.color_palette()[2], size=20)
plt.show()
plt.close()

P-norms: general rules
Here is the recipe to get the \(p\)-norm of a vector:
Calculate the absolute value of each element
Raise these absolute values to the power \(p\)
Sum all these powered absolute values
Raise this result to the power \(\frac{1}{p}\)
This is more concisely expressed with the formula:
\(\norm{\bs{x}}_p=\left(\sum_i|\bs{x}_i|^p\right)^{\frac{1}{p}}\)
This will become clear with examples using these widely used \(p\)-norms.
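To make the recipe concrete, here is a minimal sketch of it as a function (p_norm is a hypothetical helper, not a NumPy function), checked against np.linalg.norm, which takes the order \(p\) as its second argument:

def p_norm(x, p):
    # Recipe: absolute values, raise to power p, sum, raise to power 1/p
    return np.sum(np.abs(x)**p)**(1 / p)

x = np.array([3, -4])
for p in [1, 2, 3]:
    print(p, p_norm(x, p), np.linalg.norm(x, p))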
The \(L^1\) norm
\(p=1\), so this norm is simply the sum of the absolute values:
\(\norm{\bs{x}}_1=\sum_i|\bs{x}_i|\)
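For example (a small check, not from the original text), with NumPy the \(L^1\) norm is either the explicit sum of absolute values or np.linalg.norm with 1 as the order:

x = np.array([3, -4])
# |3| + |-4| = 7
print(np.sum(np.abs(x)))
print(np.linalg.norm(x, 1))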
The Euclidean norm (\(L^2\) norm)
The Euclidean norm is the \(p\)-norm with \(p=2\). It is probably the most widely used norm:
\(\norm{\bs{x}}_2=\left(\sum_i \bs{x}_i^2\right)^{\frac{1}{2}}=\sqrt{\sum_i \bs{x}_i^2}\)
Let's see an example of this norm:
Example 2.
Take the vector \(\bs{u}=\begin{bmatrix}3\\4\end{bmatrix}\). Its Euclidean norm is
\(\norm{\bs{u}}_2=\sqrt{3^2+4^2}=\sqrt{25}=5\)
Graphically, the Euclidean norm corresponds to the length of the vector from the origin to the point obtained by linear combination (applying the Pythagorean theorem).
So the \(L^2\) norm of \(\bs{u}\) is \(5\).
The \(L^2\) norm can be calculated with the linalg.norm
function from numpy. We can check the result:
np.linalg.norm([3, 4])
5.0
Here is the graphical representation of the vectors:
u = [0, 0, 3, 4]
plt.quiver([u[0]],
           [u[1]],
           [u[2]],
           [u[3]],
           angles='xy', scale_units='xy', scale=1)
plt.xlim(-2, 4)
plt.ylim(-2, 5)
plt.axvline(x=0, color='grey')
plt.axhline(y=0, color='grey')
# Double-headed arrows marking the y and x components of u
plt.annotate('', xy=(3.2, 0), xytext=(3.2, 4),
             arrowprops=dict(edgecolor='black', arrowstyle='<->'))
plt.annotate('', xy=(0, -0.2), xytext=(3, -0.2),
             arrowprops=dict(edgecolor='black', arrowstyle='<->'))
plt.text(1, 2.5, r'$\vec{u}$', size=18)
plt.text(3.3, 2, r'$\vec{u}_y$', size=18)
plt.text(1.5, -1, r'$\vec{u}_x$', size=18)
plt.show()
plt.close()

In this case, the vector is in a 2-dimensional space, but this also holds in higher dimensions.
The squared Euclidean norm (squared \(L^2\) norm)
The squared \(L^2\) norm is convenient because it removes the square root: we end up with the simple sum of every squared value of the vector:
\(\norm{\bs{x}}_2^2=\sum_i \bs{x}_i^2\)
The squared Euclidean norm is widely used in machine learning, partly because it can be calculated with the vector operation \(\bs{x}^\text{T}\bs{x}\). There can be a performance gain due to this optimization.
Example 3.
x = np.array([[2], [5], [3], [3]])
x
array([[2],
[5],
[3],
[3]])
# x.T.dot(x) gives the squared L2 norm as a 1x1 array
squared_norm = x.T.dot(x)
squared_norm
array([[47]])
np.linalg.norm(x)**2
47.0
It works!
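To get a feel for the performance argument, here is a micro-benchmark sketch using Python's standard timeit module; the exact timings depend on your machine and on the array size, so treat the comparison as indicative only:

import timeit

big_x = np.random.randn(10000, 1)
# Squared L2 norm via the dot product x^T x
t_dot = timeit.timeit(lambda: big_x.T.dot(big_x), number=1000)
# Squared L2 norm via np.linalg.norm followed by squaring
t_norm = timeit.timeit(lambda: np.linalg.norm(big_x)**2, number=1000)
print(t_dot, t_norm)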
Derivative of the squared \(L^2\) norm
Another advantage of the squared \(L^2\) norm is that its partial derivatives are easily computed. For a vector \(\bs{u}\) with components \(u_1, u_2, \cdots, u_n\):
\(\frac{\partial \norm{\bs{u}}_2^2}{\partial u_i}=2u_i\)
Derivative of the \(L^2\) norm
In the case of the \(L^2\) norm, the derivative is more complicated and takes every element of the vector into account:
\(\frac{\partial \norm{\bs{u}}_2}{\partial u_i}=\frac{u_i}{\sqrt{u_1^2+u_2^2+\cdots+u_n^2}}=\frac{u_i}{\norm{\bs{u}}_2}\)
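As a sanity check (a small sketch, not from the original text), we can compare both formulas with finite-difference approximations of the derivative with respect to the first element:

u = np.array([1.0, 6.0])
eps = 1e-6
u_plus = u.copy()
u_plus[0] += eps

# Finite-difference derivative of the squared L2 norm w.r.t. u[0]: close to 2*u[0] = 2
print((np.linalg.norm(u_plus)**2 - np.linalg.norm(u)**2) / eps, 2 * u[0])

# Finite-difference derivative of the L2 norm w.r.t. u[0]: close to u[0]/norm(u)
print((np.linalg.norm(u_plus) - np.linalg.norm(u)) / eps, u[0] / np.linalg.norm(u))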
One problem with the squared \(L^2\) norm is that it hardly discriminates between 0 and small values, because the function increases slowly near 0.
We can see this by graphically comparing the squared \(L^2\) norm with the \(L^2\) norm. The \(z\)-axis corresponds to the norm and the \(x\)- and \(y\)-axes correspond to two parameters. The same thing is true with more than 2 dimensions, but it would be hard to visualize.
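Before looking at the plots, a quick numeric illustration (a small example, not from the original text) of how slowly the squared norm grows near 0:

for value in [0.001, 0.01, 0.1, 1.0]:
    x = np.array([value, value])
    # For small values, the squared norm is much smaller than the norm
    print(value, np.linalg.norm(x), np.linalg.norm(x)**2)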
\(L^2\) norm:
[Figure: surface plot of the \(L^2\) norm]
Squared \(L^2\) norm:
[Figure: surface plot of the squared \(L^2\) norm]
\(L^1\) norm:
[Figure: surface plot of the \(L^1\) norm]
These plots were made with an online 3D function grapher. Plotting these norms yourself and rotating the surfaces is a good way to grasp their shapes.
The max norm
It is the \(L^\infty\) norm and corresponds to the absolute value of the greatest element of the vector:
\(\norm{\bs{x}}_\infty=\max_i|\bs{x}_i|\)
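A quick check with NumPy (a small example, not from the original text): passing np.inf as the order gives the max norm:

x = np.array([1, -8, 3])
# Largest absolute value: |-8| = 8
print(np.max(np.abs(x)))
print(np.linalg.norm(x, np.inf))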
Matrix norms: the Frobenius norm
\(\norm{\bs{A}}_F=\sqrt{\sum_{i,j}A_{i,j}^2}\)
This is equivalent to taking the \(L^2\) norm of the matrix after flattening it into a vector.
The same NumPy function can be used:
A = np.array([[1, 2], [6, 4], [3, 2]])
A
array([[1, 2],
[6, 4],
[3, 2]])
np.linalg.norm(A)
8.366600265340756
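We can also verify the flattening equivalence directly (a small check, not from the original text):

# The Frobenius norm equals the L2 norm of the flattened matrix
print(np.linalg.norm(A.flatten()))
print(np.sqrt(np.sum(A**2)))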
Expression of the dot product with norms
The dot product between two vectors can be expressed with their norms and the cosine of the angle \(\theta\) between them:
\(\bs{x}^\text{T}\bs{y}=\norm{\bs{x}}_2\cdot\norm{\bs{y}}_2\cos\theta\)
Example 4.
Take \(\bs{x}=\begin{bmatrix}0\\2\end{bmatrix}\) and \(\bs{y}=\begin{bmatrix}2\\2\end{bmatrix}\).
x = [0, 0, 0, 2]
y = [0, 0, 2, 2]
plt.xlim(-2, 4)
plt.ylim(-2, 5)
plt.axvline(x=0, color='grey', zorder=0)
plt.axhline(y=0, color='grey', zorder=0)
plt.quiver([x[0], y[0]],
           [x[1], y[1]],
           [x[2], y[2]],
           [x[3], y[3]],
           angles='xy', scale_units='xy', scale=1)
plt.text(-0.5, 1, r'$\vec{x}$', size=18)
plt.text(1.5, 0.5, r'$\vec{y}$', size=18)
plt.show()
plt.close()

We took this example for its simplicity. As we can see, the angle \(\theta\) is equal to 45°,
\(\norm{\bs{x}}_2=\sqrt{0^2+2^2}=2\) and \(\norm{\bs{y}}_2=\sqrt{2^2+2^2}=\sqrt{8}\)
so the dot product is
\(\bs{x}^\text{T}\bs{y}=\norm{\bs{x}}_2\cdot\norm{\bs{y}}_2\cos(45°)=2\cdot\sqrt{8}\cdot\frac{\sqrt{2}}{2}=4\)
Here are the operations using numpy:
# Note: np.cos takes the angle in radians
np.cos(np.deg2rad(45))*2*np.sqrt(8)
4.000000000000001
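We can confirm that this matches the dot product computed directly (a small check, not from the original text):

# Direct dot product: 0*2 + 2*2 = 4
np.dot([0, 2], [2, 2])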
Papers
Using Block Norm in Sparse Optimization for Super-Resolution, by R. Hamedi, M. Amintoosi and M. Zaferanieh