Kernel Methods

In the following, we will discuss the use of kernels to compare time series. A kernel \(k(\cdot, \cdot)\) is such that there exists an unknown map \(\Phi\) such that:

\[k(\mathbf{x}, \mathbf{y}) = \left\langle \Phi(\mathbf{x}), \Phi(\mathbf{y}) \right\rangle_{\mathcal{H}}\]

i.e. \(k(\cdot, \cdot)\) is the inner product between \(\mathbf{x}\) and \(\mathbf{y}\) in some (unknown) embedding space \(\mathcal{H}\). In practice, \(k(\mathbf{x}, \mathbf{y})\) will be large when \(\mathbf{x}\) and \(\mathbf{y}\) are similar and close to 0 when they are very dissimilar.

A large number of kernel methods from the machine learning literature rely on the so-called kernel trick, that consists in performing computations in the embedding space \(\mathcal{H}\) without ever actually performing any embedding. As an example, one can compute distance between \(\mathbf{x}\) and \(\mathbf{y}\) in \(\mathcal{H}\) via:

\[\begin{split}\| \Phi(\mathbf{x}) - \Phi(\mathbf{y})\|_\mathcal{H}^2 &= \left\langle \Phi(\mathbf{x}) - \Phi(\mathbf{y}), \Phi(\mathbf{x}) - \Phi(\mathbf{y}) \right\rangle_{\mathcal{H}} \\ &= \left\langle \Phi(\mathbf{x}), \Phi(\mathbf{x}) \right\rangle_{\mathcal{H}} + \left\langle \Phi(\mathbf{y}), \Phi(\mathbf{y}) \right\rangle_{\mathcal{H}} - 2 \left\langle \Phi(\mathbf{x}), \Phi(\mathbf{y}) \right\rangle_{\mathcal{H}} \\ &= k(\mathbf{x}, \mathbf{x}) + k(\mathbf{y}, \mathbf{y}) - 2 k(\mathbf{x}, \mathbf{y})\end{split}\]

Such computations are used, for example, in the kernel-\(k\)-means algorithm (see below).

Global Alignment Kernel

The Global Alignment Kernel (GAK) is a kernel that operates on time series.

It is defined, for a given bandwidth \(\sigma\), as:

\[k(\mathbf{x}, \mathbf{y}) = \sum_{\pi \in \mathcal{A}(\mathbf{x}, \mathbf{y})} \prod_{i=1}^{ | \pi | } \exp \left( - \frac{ \left\| x_{\pi_1(i)} - y_{\pi_2{j}} \right\|^2}{2 \sigma^2} \right)\]

where \(\mathcal{A}(\mathbf{x}, \mathbf{y})\) is the set of all possible alignments between series \(\mathbf{x}\) and \(\mathbf{y}\).

It is advised in [1] to set the bandwidth \(\sigma\) as a multiple of a simple estimate of the median distance of different points observed in different time-series of your training set, scaled by the square root of the median length of time-series in the set. This estimate is made available in tslearn through tslearn.metrics.sigma_gak:

from tslearn.metrics import gak, sigma_gak

sigma = sigma_gak(X)
k_01 = gak(X[0], X[1], sigma=sigma)

Note however that, on long time series, this estimate can lead to numerical overflows, which smaller values can avoid.

Finally, GAK is related to softDTW [3] through the following formula:

\[k(\mathbf{x}, \mathbf{y}) = \exp \left(- \frac{\text{softDTW}_\gamma(\mathbf{x}, \mathbf{y})}{\gamma} \right)\]

where \(\gamma\) is the hyper-parameter controlling softDTw smoothness, which is related to the bandwidth parameter of GAK through \(\gamma = 2 \sigma^2\).

Clustering and Classification

Kernel \(k\)-means [2] is a method that uses the kernel trick to implicitly perform \(k\)-means clustering in the embedding space associated to a kernel. This method is discussed in our User Guide section dedicated to clustering.

Kernels can also be used in classification settings. tslearn.svm offers implementations of Support Vector Machines (SVM) that accept GAK as a kernel. This implementation heavily relies on scikit-learn and libsvm. One implication is that predict_proba and predict_log_proba methods are computed based on cross-validation probability estimates, which has two main implications, as discussed in more details in scikit-learn’s user guide:

1. setting the constructor option probability to True makes the fit step longer since it then relies on an expensive five-fold cross-validation;

2. the probability estimates obtained through predict_proba may be inconsistent with the scores provided by decision_function and the predicted class output by predict.

Examples Using Kernel Methods


  1. Cuturi. “Fast Global Alignment Kernels,” ICML 2011.
[2]I. S. Dhillon, Y. Guan & B. Kulis. “Kernel k-means, Spectral Clustering and Normalized Cuts,” KDD 2004.
[3]M. Cuturi, M. Blondel “Soft-DTW: a Differentiable Loss Function for Time-Series,” ICML 2017.