Programmers build up technical debt, such as poorly-documented code or code that’s not compartmentalized and is therefore more difficult to modify, when they need to implement features quickly. Eventually, that debt has to be paid back through writing documentation, refactoring, or implementing tests. Likewise, as researchers, we build up knowledge debt when we use concepts without deeply understanding them. Eventually, we have to pay some of it back so that our shallow understanding doesn’t hinder our future efforts. This is the first in an occasional series documenting my efforts to pay back some of mine.

Introduction

Reproducing kernel Hilbert spaces often come up when reading about Gaussian processes, especially in the context of Bayesian optimization. For example, in Gaussian Process Optimization in the Bandit Setting: No Regret and Experimental Design:

we also consider the more agnostic case where $f$ has low “complexity” as measured under an RKHS norm…The notion of reproducing kernel Hilbert Spaces (RKHS, Wahba 1990) is intimately related to GPs and their covariance functions $k(\boldsymbol{x}, \boldsymbol{x'})$. The RKHS $\mathcal{H}_k(D)$ is a complete subspace of $L_2(D)$ of nicely behaved functions, with an inner product $\langle\cdot , \cdot\rangle_k$ obeying the reproducing property: $\langle f, k(\boldsymbol{x}, \cdot)\rangle_k = f(\boldsymbol{x})$ for all $f \in \mathcal{H}_k(D)$… The induced RKHS norm $\|f\|_k = \sqrt{\langle f, f\rangle_k}$ measures smoothness of $f$ w.r.t. $k$…

Previous understanding

A kernel $k: \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ maps pairs of inputs $\mathbf{x}, \mathbf{x'} \in \mathcal{X}$ to a real number and is analogous to a similarity measure between inputs. Kernels are symmetric $\left(k(\mathbf{x}, \mathbf{x'}) = k(\mathbf{x'}, \mathbf{x})\right)$ and positive definite $\left(\sum_{i=1}^n\sum_{j=1}^n c_i c_j k(\mathbf{x}_i, \mathbf{x}_j) \geq 0 \;\;\forall n \in \mathbb{N},\; \mathbf{x}_1, \ldots, \mathbf{x}_n \in \mathcal{X},\; c_1, \ldots, c_n \in \mathbb{R}\right)$.
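
As a quick numerical illustration (this is my own sketch, using NumPy and the squared-exponential/RBF kernel as a concrete choice, nothing specific to the paper quoted above), we can check both properties on a handful of random inputs:

```python
# A minimal sketch of the two defining properties of a kernel: symmetry, and
# positive definiteness of the Gram matrix built from any finite set of inputs.
import numpy as np

def rbf_kernel(x, y, lengthscale=1.0):
    """k(x, y) = exp(-||x - y||^2 / (2 * lengthscale^2))."""
    return np.exp(-np.sum((x - y) ** 2) / (2 * lengthscale ** 2))

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))  # five inputs in R^3

# Symmetry: k(x, x') == k(x', x)
assert np.isclose(rbf_kernel(X[0], X[1]), rbf_kernel(X[1], X[0]))

# Positive definiteness: the Gram matrix K_ij = k(x_i, x_j) has no negative
# eigenvalues, so sum_i sum_j c_i c_j k(x_i, x_j) >= 0 for any coefficients c.
K = np.array([[rbf_kernel(xi, xj) for xj in X] for xi in X])
assert np.all(np.linalg.eigvalsh(K) >= -1e-10)
```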

An RKHS is a set of “nicely-behaved” functions somehow associated with a specific kernel. The set of functions drawn from a Gaussian process is one example of an RKHS.

Deeper understanding

We’ll first try to understand the “Hilbert space” part of “reproducing kernel Hilbert space,” and then investigate the “reproducing kernel” part. Finally, we’ll briefly discuss why RKHSs are convenient.

Hilbert spaces

The meat of the quote is the sentence beginning with “The RKHS $\mathcal{H}_k(D)$ is a complete subspace of $L_2(D)$…” It’s also a fairly intimidating sentence packed with mathy terms and notation. To understand it, we’ll begin with a quick refresher on vectors.

From high school/early college math, we recall that an $n$-dimensional vector $\mathbf{x} \in \mathbb{R}^n$ is an ordered collection of $n$ real numbers. For example, $(-5, 2.5, \pi)$ is a vector in $\mathbb{R}^3$. More abstractly, we can write $\mathbf{a} = (a_1, a_2, \ldots, a_n) \in \mathbb{R}^n$. There are two basic operations on vectors: element-wise addition

\[\mathbf{a} + \mathbf{b} = (a_1 + b_1, a_2 + b_2, ..., a_n + b_n)\]

and scalar multiplication

\[c\mathbf{a} = (ca_1, ca_2, ..., ca_n)\]

The key insight that you may not have been taught (I certainly wasn’t until partway through grad school) is that functions can also be vectors. Specifically, functions that map from any metric space $\mathcal{X}$ to $\mathbb{R}$ form a vector space: we add functions pointwise, $(f + g)(x) = f(x) + g(x)$, and scale them pointwise, $(cf)(x) = cf(x)$. A metric space is a set of objects for which we can define a distance metric. For example, vectors of real numbers $\mathbb{R}^n$ are members of metric spaces (with the Euclidean distance being one possible metric). As long as we define a valid distance metric, so are discrete, structured things like documents, chemical compounds, and protein sequences. This means that the set of functions that map, for example, a protein sequence to a number is a vector space.
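
Here is a toy sketch of that idea (my own illustration, nothing RKHS-specific yet), with addition and scalar multiplication of functions defined pointwise:

```python
# Functions as vectors: "addition" and "scalar multiplication" are defined
# pointwise, mirroring the element-wise operations on tuples of numbers above.
def add(f, g):
    return lambda x: f(x) + g(x)   # (f + g)(x) = f(x) + g(x)

def scale(c, f):
    return lambda x: c * f(x)      # (c f)(x) = c * f(x)

f = lambda x: x ** 2
g = lambda x: 3 * x
h = add(scale(2.0, f), g)          # the "vector" h = 2f + g

assert h(1.5) == 2.0 * f(1.5) + g(1.5)
```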

Now, let’s consider some special types of vector spaces. A Banach space is a vector space whose elements have a “size,” or norm $\|\cdot\|$ (and which is complete with respect to that norm). For example, the Euclidean norm for $\mathbb{R}^n$ is $\|\mathbf{x}\|_2 = \sqrt{\sum_{i=1}^n x_i^2}$ (it’s the Euclidean distance from the 0-vector). Functions can also have norms: the p-norm of a function $f: \mathbb{R} \to \mathbb{R}$ is

\[\|f\|_p = \left(\int_{-\infty}^{\infty}|f(x)|^p\,dx\right)^{\frac{1}{p}}\]

The set of functions with a finite p-norm forms a Banach space called an $L_p$-space. So “$L_2(D)$” just means “the set of functions $f: D \to \mathbb{R}$ with a finite 2-norm.”
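
For a rough numerical check (assuming SciPy’s quadrature routines; this is just an illustration), we can estimate 2-norms directly: a Gaussian bump has a finite 2-norm and so lives in $L_2(\mathbb{R})$, while the constant function does not:

```python
# A numerical check of membership in L_2(R).
import numpy as np
from scipy.integrate import quad

def l2_norm(f, lo=-np.inf, hi=np.inf):
    value, _ = quad(lambda x: f(x) ** 2, lo, hi)
    return np.sqrt(value)

print(l2_norm(lambda x: np.exp(-x ** 2)))    # finite, ≈ 1.12

# The constant function's 2-norm grows without bound as the domain grows,
# so it is not in L_2(R).
for L in (10, 100, 1000):
    print(l2_norm(lambda x: 1.0, -L, L))
```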

Now we’re ready for Hilbert spaces. A Hilbert space is a (complete) vector space for which a dot product (sometimes called an inner product) is defined. For example, $\mathbb{R}^n$ is a Hilbert space, and its dot product is

\[\langle\mathbf{a}, \mathbf{b}\rangle = \sum_{i=1}^na_ib_i\]

Conveniently, $\sqrt{\langle\mathbf{a}, \mathbf{a}\rangle}$ is the Euclidean norm if $\mathbf{a} \in \mathbb{R}^n$, and in fact the square root of a vector’s dot product with itself is always a valid norm. This means that every Hilbert space is also a Banach space (but not every Banach space is a Hilbert space).

We can also define dot products over functions. For example

\[\langle f, g\rangle = \int_{-\infty}^{\infty}f(x)g(x)dx\]

is a dot product that induces the $L_2$ norm. However, the functions in an RKHS $\mathcal{H}_k(D)$ have another dot product, one associated with a reproducing kernel.
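
Continuing the quadrature sketch from above (again just an illustration using SciPy), we can compute this dot product for a pair of functions and confirm that $\sqrt{\langle f, f\rangle}$ matches the 2-norm defined earlier:

```python
# The L_2 dot product between two functions, computed numerically.
import numpy as np
from scipy.integrate import quad

def l2_inner(f, g, lo=-np.inf, hi=np.inf):
    value, _ = quad(lambda x: f(x) * g(x), lo, hi)
    return value

f = lambda x: np.exp(-x ** 2)           # an even function
g = lambda x: x * np.exp(-x ** 2)       # an odd function

print(l2_inner(f, g))                   # ~0: the two functions are "orthogonal"
print(np.sqrt(l2_inner(f, f)))          # ≈ 1.12, the 2-norm of f from before
```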

Reproducing kernels

A kernel $K$ is a symmetric, positive definite function that maps a pair of inputs to a real number: $K: \mathcal{X} \times\mathcal{X} \to \mathbb{R}$. By positive definite, we mean that for any non-zero $L_2$ function $f$,

\[\int\int f(\mathbf{x})K(\mathbf{x}, \mathbf{x'})f(\mathbf{x'})d\mathbf{x}d\mathbf{x'} > 0\]

This implies that for any finite set of distinct inputs $X$, where $\mathbf{x}_i$ is the $i^{th}$ row of $X$, the matrix $\mathcal{K}$ defined by $\mathcal{K}_{i,j} = K(\mathbf{x}_i, \mathbf{x}_j)$ is positive definite. For Gaussian processes, this is important because we need the kernel function to generate valid covariances between any finite subset of inputs, and all positive definite matrices are valid covariance matrices. It is often helpful to think of $K$ as a measure of similarity between objects. A much more detailed discussion of kernel functions can be found in Chapter 4 of Gaussian Processes for Machine Learning.
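
To make the GP connection concrete, here is a sketch (with the RBF kernel again, plus a small jitter term for numerical stability; both are my choices) that uses a Gram matrix directly as the covariance of a finite-dimensional Gaussian, i.e. a draw from $\mathcal{GP}(0, K)$ evaluated at a finite set of inputs:

```python
# The Gram matrix of any finite set of inputs is a valid covariance matrix,
# so it can be plugged straight into a multivariate normal.
import numpy as np

def rbf_kernel(x, y, lengthscale=1.0):
    return np.exp(-np.sum((x - y) ** 2) / (2 * lengthscale ** 2))

rng = np.random.default_rng(0)
X = np.linspace(-3, 3, 50).reshape(-1, 1)
K = np.array([[rbf_kernel(xi, xj) for xj in X] for xi in X])
K += 1e-8 * np.eye(len(X))  # small jitter to absorb floating-point error

# One draw of f(X) from a zero-mean GP with covariance function k,
# i.e. a sample from the finite-dimensional marginal N(0, K).
f_X = rng.multivariate_normal(mean=np.zeros(len(X)), cov=K)
```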

And so we finally arrive at the reproducing kernel part of RKHSs.

Given a kernel function $K$ and a fixed input $\mathbf{x}$, we can define a function $k_{\mathbf{x}}: \mathcal{X} \to \mathbb{R}$ by $k_{\mathbf{x}}(\mathbf{x'}) = K(\mathbf{x}, \mathbf{x'})$. This function is evaluated at a point $\mathbf{x'}$ by computing the kernel between $\mathbf{x}$ and $\mathbf{x'}$. The RKHS $\mathcal{H}_K$ defined by this kernel has the following properties:

  1. For every $\mathbf{x}$, $k_{\mathbf{x}}(\mathbf{x'})$, as a function of $\mathbf{x'}$, belongs to $\mathcal{H}_K$.
  2. For every $f \in \mathcal{H}_K$, the dot product $\langle f(\mathbf{x'}), k_{\mathbf{x}}(\mathbf{x'})\rangle_{\mathcal{H}}$ reproduces, or is equal to, $f(\mathbf{x})$.
  3. $\langle k_{\mathbf{x'}}, k_{\mathbf{x}}\rangle_{\mathcal{H}} = K(\mathbf{x}, \mathbf{x'})$.

Every positive definite kernel corresponds to exactly one RKHS, and every RKHS has exactly one reproducing kernel; this one-to-one correspondence is the Moore-Aronszajn theorem.
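
To see the reproducing property (item 2 above) in action, here is a sketch restricted to functions that are finite combinations of kernel sections, $f = \sum_i \alpha_i k_{\mathbf{x}_i}$ (such combinations are dense in the RKHS). For these functions the RKHS inner product reduces to sums over the kernel, so the check below is really just bilinearity plus property 3, but it shows the mechanics:

```python
# The reproducing property for f = sum_i alpha_i k_{x_i}:
#   <f, k_x>_H = sum_i alpha_i K(x_i, x) = f(x).
import numpy as np

def rbf_kernel(x, y, lengthscale=1.0):
    return np.exp(-np.sum((x - y) ** 2) / (2 * lengthscale ** 2))

rng = np.random.default_rng(1)
centers = rng.normal(size=(4, 2))   # x_1, ..., x_4
alpha = rng.normal(size=4)          # coefficients

def f(x):
    """f = sum_i alpha_i k_{x_i}, an element of H_K."""
    return sum(a * rbf_kernel(c, x) for a, c in zip(alpha, centers))

# <f, k_x>_H, computed from the kernel alone
x = rng.normal(size=2)
inner_f_kx = sum(a * rbf_kernel(c, x) for a, c in zip(alpha, centers))
assert np.isclose(inner_f_kx, f(x))   # the inner product reproduces f(x)

# The induced RKHS norm of f: ||f||_H^2 = alpha^T K alpha
K = np.array([[rbf_kernel(ci, cj) for cj in centers] for ci in centers])
rkhs_norm = np.sqrt(alpha @ K @ alpha)
```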

Why we care

RKHSs are convenient for functional analysis because they are smooth spaces of smooth functions: moving a small “distance” in function space yields a new function that is pointwise close to the starting function. Interestingly, given a kernel $K$, functions drawn from the Gaussian process $\mathcal{GP}(0, K)$ are (almost surely) not members of the RKHS associated with $K$, but the posterior mean after observing some data does lie in that RKHS.
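
As a final sketch (my own setup: standard GP regression with Gaussian observation noise and the RBF kernel once more), the posterior mean can be written as a finite weighted sum of kernel sections $k_{\mathbf{x}_i}$, which is exactly the kind of function we just saw belongs to $\mathcal{H}_K$:

```python
# The GP posterior mean after observing data (X, y) is
#     mu(x) = sum_i alpha_i K(x_i, x),  with  alpha = (K + sigma^2 I)^{-1} y,
# a finite weighted sum of kernel sections, and hence a member of H_K.
import numpy as np

def rbf_kernel(x, y, lengthscale=1.0):
    return np.exp(-np.sum((x - y) ** 2) / (2 * lengthscale ** 2))

rng = np.random.default_rng(2)
X = rng.uniform(-3, 3, size=(8, 1))                 # training inputs
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=8)      # noisy observations
sigma2 = 0.1 ** 2

K = np.array([[rbf_kernel(xi, xj) for xj in X] for xi in X])
alpha = np.linalg.solve(K + sigma2 * np.eye(len(X)), y)

def posterior_mean(x):
    # A weighted sum of kernel sections k_{x_i}: an element of the RKHS.
    return sum(a * rbf_kernel(xi, x) for a, xi in zip(alpha, X))

print(posterior_mean(np.array([0.5])))   # an estimate of sin(0.5) ≈ 0.48
```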

Further reading