mogpe documentation!

This package implements a Mixtures of Gaussian Process Experts (MoGPE) model with a GP-based gating network. Inference exploits factorisation through sparse GPs and trains a variational lower bound stochastically. It also provides the building blocks for implementing other Mixtures of Gaussian Process Experts models. mogpe uses GPflow 2.2/TensorFlow 2.4+ for running computations, which allows fast execution on GPUs, and uses Python ≥ 3.8. It was originally created by Aidan Scannell.

Getting Started

To get started please see the Install instructions. Notes on using mogpe can be found in Usage and the examples directory and notebooks show how the model can be configured and trained. Details on the implementation can be found in What’s going on with this code?! and the mogpe API.

Install

This is a Python package that should be installed into a virtual environment. Start by cloning the repo from Github:

git clone https://github.com/aidanscannell/mogpe.git

The package can then be installed into a virtual environment by adding it as a local dependency.

Install with Poetry

mogpe’s dependencies and packaging are managed with Poetry (rather than other tools such as Pipenv). To install mogpe into an existing Poetry environment, add it as a dependency under [tool.poetry.dependencies] in the pyproject.toml configuration file with the following line:

mogpe = {path = "/path/to/mogpe"}

If you want to develop the mogpe codebase then set develop=true:

mogpe = {path = "/path/to/mogpe", develop=true}

The dependencies in a pyproject.toml file are resolved and installed with:

poetry install

If you do not require the development packages then you can opt to install without them:

poetry install --no-dev

Running Python scripts inside Poetry Environments

There are multiple ways to run code with Poetry and I advise checking out the documentation. My favourite option is to spawn a shell within the virtual environment:

poetry shell

and then Python scripts can simply be run with:

python codey_mc_code_face.py

Alternatively, you can run scripts without spawning a shell in the virtual environment with the following command:

poetry run python codey_mc_code_face.py

I much prefer using Poetry; however, it does feel quite slow at some tasks and annoyingly doesn’t integrate that well with Read the Docs. A setup.py file is still needed for building the docs on Read the Docs, so I use Dephell to generate the requirements.txt and setup.py files from pyproject.toml.

Install with Pip

Create a new virtual environment and activate it, for example:

mkvirtualenv --python=python3 mogpe-env
workon mogpe-env

cd into the root of this package and install it and its dependencies with:

pip install .

Usage

The model (and training with optional logging and checkpointing) can be configured using a TOML file. Please see the examples directory, which shows how to configure and train MixtureOfSVGPExperts on multiple data sets. See the notebooks (two experts and three experts) for how to define and train an instance of MixtureOfSVGPExperts without configuration files.

Training

The training directory contains methods for three different training loops, for saving and loading the model, and for initialising the model (and training) from TOML config files.

Training Loops

mogpe.training.training_loops contains three different training loops,

  1. A simple TensorFlow training loop (sketched after this list),

  2. A monitoring tf training loop - a TensorFlow training loop with monitoring inside tf.function(). This loop only monitors the model parameters and the loss (ELBO) and does not generate images.

  3. A monitoring training loop - this loop generates images during training. The matplotlib functions cannot be wrapped in tf.function, so this loop is slower but provides more insight.
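
As a rough sketch of the simple loop (assuming a GPflow-style training_loss closure and an Adam optimizer; training_tf_loop’s actual signature may differ):

import tensorflow as tf

def simple_tf_training_loop(model, training_loss, num_epochs, num_batches_per_epoch,
                            logging_epoch_freq=5, learning_rate=0.01):
    """Sketch of a plain TensorFlow training loop (option 1 above)."""
    optimizer = tf.optimizers.Adam(learning_rate=learning_rate)

    @tf.function
    def optimization_step():
        optimizer.minimize(training_loss, model.trainable_variables)

    for epoch in range(num_epochs):
        for _ in range(num_batches_per_epoch):
            optimization_step()
        if (epoch + 1) % logging_epoch_freq == 0:
            # training_loss is the negative ELBO, so lower is better
            tf.print(f"Epoch {epoch + 1}: loss (negative ELBO) {training_loss()}")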

To use TensorBoard, cd to the logs directory and start TensorBoard:

cd /path-to-log-dir
tensorboard --logdir . --reload_multifile=true

TensorBoard can then be viewed by visiting http://localhost:6006/ in your browser.

Saving/Loading

mogpe.training.utils contains methods for loading and saving the model. See the examples for how to use them.
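
For orientation, a GPflow/TensorFlow model such as MixtureOfSVGPExperts can be checkpointed with standard TensorFlow tooling; the sketch below is illustrative only and may differ from the helpers that mogpe.training.utils actually provides.

import tensorflow as tf

def save_checkpoint(model, ckpt_dir, max_to_keep=5):
    # Track the model's variables and write a checkpoint to ckpt_dir.
    ckpt = tf.train.Checkpoint(model=model)
    manager = tf.train.CheckpointManager(ckpt, ckpt_dir, max_to_keep=max_to_keep)
    return manager.save()

def restore_latest(model, ckpt_dir):
    # Restore the most recent checkpoint in ckpt_dir into the model's variables.
    ckpt = tf.train.Checkpoint(model=model)
    ckpt.restore(tf.train.latest_checkpoint(ckpt_dir))
    return model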

TOML Config Parsers

mogpe.training.toml_config_parsers contains methods for 1) initialising the MixtureOfSVGPExperts class and 2) training it from a TOML config file. See the examples for how to use the TOML config parsers.

mogpe.helpers

The helpers directory contains classes to aid plotting models with 1D and 2D inputs. These are used by the monitoring training loops.

Training MixtureOfSVGPExperts on the Motorcycle Data Set (with two experts)

This notebook is a basic example of configuring and training a Mixture of Gaussian Process Experts (using MixtureOfSVGPExperts) on the motorcycle dataset with two experts. Instantiating the model with two experts is a special case because only a single gating function is needed (not two!) and the gating network can be calculated in closed form, which is not the case when using more than two experts.

[1]:
import numpy as np
import gpflow as gpf
import tensorflow as tf
import matplotlib.pyplot as plt
import pandas as pd

from IPython.display import clear_output

from gpflow import default_float
from gpflow.utilities import print_summary
from gpflow.likelihoods import Bernoulli

from mogpe.experts import SVGPExperts, SVGPExpert
from mogpe.gating_networks import SVGPGatingNetwork
from mogpe.mixture_of_experts import MixtureOfSVGPExperts
from mogpe.training import training_tf_loop
from mogpe.helpers.plotter import Plotter1D

Let’s start by loading the motorcycle dataset and plotting it to see what we’re dealing with.

[2]:
def load_mcycle_dataset(filename='../data/mcycle.csv'):
    df = pd.read_csv(filename, sep=',')
    X = pd.to_numeric(df['times']).to_numpy().reshape(-1, 1)
    Y = pd.to_numeric(df['accel']).to_numpy().reshape(-1, 1)

    X = tf.convert_to_tensor(X, dtype=default_float())
    Y = tf.convert_to_tensor(Y, dtype=default_float())
    print("Input data shape: ", X.shape)
    print("Output data shape: ", Y.shape)

    # standardise inputs and outputs
    mean_x, var_x = tf.nn.moments(X, axes=[0])
    mean_y, var_y = tf.nn.moments(Y, axes=[0])
    X = (X - mean_x) / tf.sqrt(var_x)
    Y = (Y - mean_y) / tf.sqrt(var_y)
    data = (X, Y)
    return data
[3]:
data_file = '../data/mcycle.csv'
dataset = load_mcycle_dataset(filename=data_file)
X, Y = dataset
num_data, input_dim = X.shape
output_dim = Y.shape[1]
plt.scatter(X, Y)
Input data shape:  (133, 1)
Output data shape:  (133, 1)
[3]:
<matplotlib.collections.PathCollection at 0x19f6aad60>
_images/notebooks_train_mcycle_with_2_experts_5_3.png

Given this data set, let’s specify some of the model and training parameters. It is clear that there is a low-noise, long-lengthscale function at \(x<-1\), and that at \(x>-1\) the noise increases and the lengthscale shortens. With this knowledge, let’s initialise expert one with a short lengthscale and expert two with a longer lengthscale. We give each expert 6 inducing points and the gating network 7 inducing points.

[4]:
num_experts = 2
experts_lengthscales = [1.0, 10.0]  # lengthscales for experts 1 and 2
num_inducing_expert = 6  # number of inducing points for each expert
num_inducing_gating = 7  # number of inducing points for gating network
num_samples = 1  # number of samples to draw from variational posterior in ELBO
batch_size = 16
learning_rate = 0.01

In order to initialise the MixtureOfSVGPExperts class for two experts we must pass it an instance of SVGPExperts and an instance of SVGPGatingNetwork with a Bernoulli likelihood. Let’s start by creating an instance of SVGPExperts. To do this we must first create two SVGPExpert instances and pass them as a list to SVGPExperts. Let’s define a helper function that initialises a single expert.

[5]:
def init_expert(lengthscales=1.0, kernel_variance=1.0, noise_variance=1.0):
    idx = np.random.choice(range(num_data), size=num_inducing_expert, replace=False)
    inducing_variable = X.numpy()[idx, ...].reshape(-1, input_dim)
    inducing_variable = gpf.inducing_variables.InducingPoints(inducing_variable)

    mean_function = gpf.mean_functions.Constant()
    likelihood = gpf.likelihoods.Gaussian(noise_variance)
    kernel = gpf.kernels.RBF(lengthscales=lengthscales, variance=kernel_variance)

    return SVGPExpert(kernel,
                      likelihood,
                      mean_function=mean_function,
                      inducing_variable=inducing_variable)
[6]:
experts_list = [init_expert(lengthscales=experts_lengthscales[k]) for k in range(num_experts)]

We can now create an instance of SVGPExperts by passing our two experts as a list.

[7]:
experts = SVGPExperts(experts_list)
print_summary(experts, fmt="notebook")
name | class | transform | prior | trainable | shape | dtype | value
SVGPExperts.experts_list[0].mean_function.c | Parameter | Identity |  | True | (1,) | float64 | [0.]
SVGPExperts.experts_list[0].kernel.variance | Parameter | Softplus |  | True | () | float64 | 1.0
SVGPExperts.experts_list[0].kernel.lengthscales | Parameter | Softplus |  | True | () | float64 | 1.0
SVGPExperts.experts_list[0].likelihood.variance | Parameter | Softplus + Shift |  | True | () | float64 | 1.0
SVGPExperts.experts_list[0].inducing_variable.Z | Parameter | Identity |  | True | (6, 1) | float64 | [[-1.74116353...
SVGPExperts.experts_list[0].q_mu | Parameter | Identity |  | True | (6, 1) | float64 | [[0....
SVGPExperts.experts_list[0].q_sqrt | Parameter | FillTriangular |  | True | (1, 6, 6) | float64 | [[[1., 0., 0....
SVGPExperts.experts_list[1].mean_function.c | Parameter | Identity |  | True | (1,) | float64 | [0.]
SVGPExperts.experts_list[1].kernel.variance | Parameter | Softplus |  | True | () | float64 | 1.0
SVGPExperts.experts_list[1].kernel.lengthscales | Parameter | Softplus |  | True | () | float64 | 10.0
SVGPExperts.experts_list[1].likelihood.variance | Parameter | Softplus + Shift |  | True | () | float64 | 1.0
SVGPExperts.experts_list[1].inducing_variable.Z | Parameter | Identity |  | True | (6, 1) | float64 | [[-0.71690236...
SVGPExperts.experts_list[1].q_mu | Parameter | Identity |  | True | (6, 1) | float64 | [[0....
SVGPExperts.experts_list[1].q_sqrt | Parameter | FillTriangular |  | True | (1, 6, 6) | float64 | [[[1., 0., 0....

Lovely stuff. We now need to create an instance of SVGPGatingNetwork with a Bernoulli likelihood. Remember that we only need a single gating function for the two expert case. Let’s go ahead and create our gating function and use it to construct our gating network.

[8]:
def init_gating_network():
    idx = np.random.choice(range(num_data), size=num_inducing_gating, replace=False)
    inducing_variable = X.numpy()[idx, ...].reshape(-1, input_dim)
    inducing_variable = gpf.inducing_variables.InducingPoints(inducing_variable)

    mean_function = gpf.mean_functions.Zero()
    kernel = gpf.kernels.RBF()

    return SVGPGatingNetwork(kernel,
                             likelihood=Bernoulli(),
                             inducing_variable=inducing_variable,
                             mean_function=mean_function)
[9]:
gating_network = init_gating_network()
print_summary(gating_network, fmt="notebook")
name | class | transform | prior | trainable | shape | dtype | value
SVGPGatingNetwork.kernel.variance | Parameter | Softplus |  | True | () | float64 | 1.0
SVGPGatingNetwork.kernel.lengthscales | Parameter | Softplus |  | True | () | float64 | 1.0
SVGPGatingNetwork.inducing_variable.Z | Parameter | Identity |  | True | (7, 1) | float64 | [[-0.56402756...
SVGPGatingNetwork.q_mu | Parameter | Identity |  | True | (7, 1) | float64 | [[0....
SVGPGatingNetwork.q_sqrt | Parameter | FillTriangular |  | True | (1, 7, 7) | float64 | [[[1., 0., 0....

We now have all the components to construct our MixtureOfSVGPExperts model so let’s go ahead and do it.

[10]:
model = MixtureOfSVGPExperts(gating_network=gating_network,
                             experts=experts,
                             num_samples=num_samples,
                             num_data=num_data)
print_summary(model, fmt="notebook")
name | class | transform | prior | trainable | shape | dtype | value
MixtureOfSVGPExperts.gating_network.kernel.variance | Parameter | Softplus |  | True | () | float64 | 1.0
MixtureOfSVGPExperts.gating_network.kernel.lengthscales | Parameter | Softplus |  | True | () | float64 | 1.0
MixtureOfSVGPExperts.gating_network.inducing_variable.Z | Parameter | Identity |  | True | (7, 1) | float64 | [[-0.56402756...
MixtureOfSVGPExperts.gating_network.q_mu | Parameter | Identity |  | True | (7, 1) | float64 | [[0....
MixtureOfSVGPExperts.gating_network.q_sqrt | Parameter | FillTriangular |  | True | (1, 7, 7) | float64 | [[[1., 0., 0....
MixtureOfSVGPExperts.experts.experts_list[0].mean_function.c | Parameter | Identity |  | True | (1,) | float64 | [0.]
MixtureOfSVGPExperts.experts.experts_list[0].kernel.variance | Parameter | Softplus |  | True | () | float64 | 1.0
MixtureOfSVGPExperts.experts.experts_list[0].kernel.lengthscales | Parameter | Softplus |  | True | () | float64 | 1.0
MixtureOfSVGPExperts.experts.experts_list[0].likelihood.variance | Parameter | Softplus + Shift |  | True | () | float64 | 1.0
MixtureOfSVGPExperts.experts.experts_list[0].inducing_variable.Z | Parameter | Identity |  | True | (6, 1) | float64 | [[-1.74116353...
MixtureOfSVGPExperts.experts.experts_list[0].q_mu | Parameter | Identity |  | True | (6, 1) | float64 | [[0....
MixtureOfSVGPExperts.experts.experts_list[0].q_sqrt | Parameter | FillTriangular |  | True | (1, 6, 6) | float64 | [[[1., 0., 0....
MixtureOfSVGPExperts.experts.experts_list[1].mean_function.c | Parameter | Identity |  | True | (1,) | float64 | [0.]
MixtureOfSVGPExperts.experts.experts_list[1].kernel.variance | Parameter | Softplus |  | True | () | float64 | 1.0
MixtureOfSVGPExperts.experts.experts_list[1].kernel.lengthscales | Parameter | Softplus |  | True | () | float64 | 10.0
MixtureOfSVGPExperts.experts.experts_list[1].likelihood.variance | Parameter | Softplus + Shift |  | True | () | float64 | 1.0
MixtureOfSVGPExperts.experts.experts_list[1].inducing_variable.Z | Parameter | Identity |  | True | (6, 1) | float64 | [[-0.71690236...
MixtureOfSVGPExperts.experts.experts_list[1].q_mu | Parameter | Identity |  | True | (6, 1) | float64 | [[0....
MixtureOfSVGPExperts.experts.experts_list[1].q_sqrt | Parameter | FillTriangular |  | True | (1, 6, 6) | float64 | [[[1., 0., 0....
We can use the Plotter1D class from mogpe.helpers.plotter to plot our model before training.

  • The top plot shows the mixing probability for each expert,
  • The middle plots show each expert’s latent GP,
  • The bottom plot shows the model’s posterior with the mean (black line) and samples (green dots).
[11]:
plotter = Plotter1D(model, X, Y)
plotter.plot_model()
_images/notebooks_train_mcycle_with_2_experts_19_0.png
_images/notebooks_train_mcycle_with_2_experts_19_1.png
_images/notebooks_train_mcycle_with_2_experts_19_2.png

We must now convert our numpy data set into a TensorFlow data set and set it up for stochastic optimisation by setting the batch size. We set drop_remainder=True so that every batch contains exactly batch_size points.

[12]:
prefetch_size = tf.data.experimental.AUTOTUNE
shuffle_buffer_size = num_data // 2
num_batches_per_epoch = num_data // batch_size
train_dataset = tf.data.Dataset.from_tensor_slices(dataset)
train_dataset = (train_dataset.repeat().prefetch(prefetch_size).shuffle(
    buffer_size=shuffle_buffer_size).batch(batch_size, drop_remainder=True))

We then use GPflow’s training_loss_closure method to get our training loss.

[13]:
training_loss = model.training_loss_closure(iter(train_dataset))

In mogpe.training.training_loops several training loops are defined. Here we write out a simple TensorFlow training loop (the same pattern as training_tf_loop), which runs the Adam optimizer on the model with training_loss as the objective function. The loop does not use any TensorBoard monitoring. We first configure the training/logging parameters.

[14]:
logging_epoch_freq = 5
plotting_epoch_freq = 500
num_epochs = 2500
[15]:
def plot_elbo(elbo):
    plt.subplot(111)
    plt.scatter(np.arange(len(elbo))*logging_epoch_freq, elbo)
    plt.xlabel("Epoch")
    plt.ylabel("ELBO")
[16]:
optimizer = tf.optimizers.Adam(learning_rate=learning_rate)

@tf.function
def tf_optimization_step():
    optimizer.minimize(training_loss, model.trainable_variables)

elbo_log = []
for epoch in range(num_epochs):
    for _ in range(num_batches_per_epoch):
        tf_optimization_step()
    epoch_id = epoch + 1
    if epoch_id % logging_epoch_freq == 0:
        elbo_log.append(training_loss()*-1.0)
    if epoch_id % plotting_epoch_freq == 0:
        clear_output(True)
        tf.print(f"Epoch {epoch_id}: ELBO (train) {training_loss()}")
        plot_elbo(elbo_log)
        plt.show()
Epoch 2500: ELBO (train) -12.587674046751173
_images/notebooks_train_mcycle_with_2_experts_27_1.png

Now that we have trained the model we can use our plotter again to visualise what we have learned.

[17]:
plotter.plot_model()
_images/notebooks_train_mcycle_with_2_experts_29_0.png
_images/notebooks_train_mcycle_with_2_experts_29_1.png
_images/notebooks_train_mcycle_with_2_experts_29_2.png
[18]:
print_summary(model, fmt="notebook")
name | class | transform | prior | trainable | shape | dtype | value
MixtureOfSVGPExperts.gating_network.kernel.variance | Parameter | Softplus |  | True | () | float64 | 10.044302011299953
MixtureOfSVGPExperts.gating_network.kernel.lengthscales | Parameter | Softplus |  | True | () | float64 | 0.9191759962241284
MixtureOfSVGPExperts.gating_network.inducing_variable.Z | Parameter | Identity |  | True | (7, 1) | float64 | [[-0.33679763...
MixtureOfSVGPExperts.gating_network.q_mu | Parameter | Identity |  | True | (7, 1) | float64 | [[1.23060906...
MixtureOfSVGPExperts.gating_network.q_sqrt | Parameter | FillTriangular |  | True | (1, 7, 7) | float64 | [[[0.24373665, 0., 0....
MixtureOfSVGPExperts.experts.experts_list[0].mean_function.c | Parameter | Identity |  | True | (1,) | float64 | [0.2537413]
MixtureOfSVGPExperts.experts.experts_list[0].kernel.variance | Parameter | Softplus |  | True | () | float64 | 0.7245331900215413
MixtureOfSVGPExperts.experts.experts_list[0].kernel.lengthscales | Parameter | Softplus |  | True | () | float64 | 0.266099322492376
MixtureOfSVGPExperts.experts.experts_list[0].likelihood.variance | Parameter | Softplus + Shift |  | True | () | float64 | 0.10439134246880631
MixtureOfSVGPExperts.experts.experts_list[0].inducing_variable.Z | Parameter | Identity |  | True | (6, 1) | float64 | [[-0.82094873...
MixtureOfSVGPExperts.experts.experts_list[0].q_mu | Parameter | Identity |  | True | (6, 1) | float64 | [[-0.22453415...
MixtureOfSVGPExperts.experts.experts_list[0].q_sqrt | Parameter | FillTriangular |  | True | (1, 6, 6) | float64 | [[[0.14292316, 0., 0....
MixtureOfSVGPExperts.experts.experts_list[1].mean_function.c | Parameter | Identity |  | True | (1,) | float64 | [0.47905155]
MixtureOfSVGPExperts.experts.experts_list[1].kernel.variance | Parameter | Softplus |  | True | () | float64 | 1.875849923049286e-07
MixtureOfSVGPExperts.experts.experts_list[1].kernel.lengthscales | Parameter | Softplus |  | True | () | float64 | 14.49029624682839
MixtureOfSVGPExperts.experts.experts_list[1].likelihood.variance | Parameter | Softplus + Shift |  | True | () | float64 | 0.001153125667128228
MixtureOfSVGPExperts.experts.experts_list[1].inducing_variable.Z | Parameter | Identity |  | True | (6, 1) | float64 | [[-1.55605991...
MixtureOfSVGPExperts.experts.experts_list[1].q_mu | Parameter | Identity |  | True | (6, 1) | float64 | [[-0.00318104...
MixtureOfSVGPExperts.experts.experts_list[1].q_sqrt | Parameter | FillTriangular |  | True | (1, 6, 6) | float64 | [[[-1.01888223e+00, 0.00000000e+00, 0.00000000e+00...

Training MixtureOfSVGPExperts on the Motorcycle Data Set (with three experts)

This notebook is a basic example of configuring and training a Mixture of Gaussian Process Experts (using MixtureOfSVGPExperts) in the general case, i.e. with more than two experts. This notebook instantiates the model with three experts and trains it on the motorcycle dataset. It’s worth noting that this approach is applicable to any number of experts.

[ ]:
import numpy as np
import gpflow as gpf
import tensorflow as tf
import matplotlib.pyplot as plt
import pandas as pd

from IPython.display import clear_output

from gpflow import default_float
from gpflow.inducing_variables import InducingPoints, SharedIndependentInducingVariables
from gpflow.likelihoods import Softmax
from gpflow.utilities import print_summary

from mogpe.experts import SVGPExperts, SVGPExpert
from mogpe.gating_networks import SVGPGatingNetwork
from mogpe.mixture_of_experts import MixtureOfSVGPExperts
from mogpe.training import training_tf_loop
from mogpe.helpers.plotter import Plotter1D

Let’s start by loading the motorcycle dataset and plotting it to see what we’re dealing with.

[ ]:
def load_mcycle_dataset(filename='../data/mcycle.csv'):
    df = pd.read_csv(filename, sep=',')
    X = pd.to_numeric(df['times']).to_numpy().reshape(-1, 1)
    Y = pd.to_numeric(df['accel']).to_numpy().reshape(-1, 1)

    X = tf.convert_to_tensor(X, dtype=default_float())
    Y = tf.convert_to_tensor(Y, dtype=default_float())
    print("Input data shape: ", X.shape)
    print("Output data shape: ", Y.shape)

    # standardise inputs and outputs
    mean_x, var_x = tf.nn.moments(X, axes=[0])
    mean_y, var_y = tf.nn.moments(Y, axes=[0])
    X = (X - mean_x) / tf.sqrt(var_x)
    Y = (Y - mean_y) / tf.sqrt(var_y)
    data = (X, Y)
    return data
[3]:
data_file = '../data/mcycle.csv'
dataset = load_mcycle_dataset(filename=data_file)
X, Y = dataset
num_data, input_dim = X.shape
output_dim = Y.shape[1]
plt.scatter(X, Y)
Input data shape:  (133, 1)
Output data shape:  (133, 1)
[3]:
<matplotlib.collections.PathCollection at 0x1942c85b0>
_images/notebooks_train_mcycle_with_3_experts_5_3.png

Given this data set, let’s specify some of the model and training parameters. It is clear that there is a low-noise, long-lengthscale function at \(x<-1\), and that at \(x>-1\) the noise increases and the lengthscale shortens. When fitting MixtureOfSVGPExperts with two experts, the gating network tends towards a uniform distribution at \(x>1\). It is therefore interesting to see whether the model will fit a third expert in this region. With this knowledge, let’s initialise experts one and three with long lengthscales and expert two with a shorter lengthscale. We give each expert 4 inducing points and the gating network 7 inducing points.

[4]:
num_experts = 3
experts_lengthscales = [10.0, 1.0, 10.0]  # lengthscales for experts 1, 2 and 3
# experts_lengthscales = [1.0, 1.0, 1.0]  # alternative: equal lengthscales for all experts
num_inducing_expert = 4  # number of inducing points for each expert
num_inducing_gating = 7  # number of inducing points for the gating network
num_samples = 1  # number of samples to draw from variational posterior in ELBO
batch_size = 16
learning_rate = 0.01

In order to initialise the MixtureOfSVGPExperts class for three experts we must pass it an instance of SVGPExperts and an instance of SVGPGatingNetwork with a Softmax likelihood. Let’s start by creating an instance of SVGPExperts. To do this we must first create three SVGPExpert instances and pass them as a list to SVGPExperts.

[5]:
def init_expert(lengthscales=1.0, kernel_variance=1.0, noise_variance=1.0):
    idx = np.random.choice(range(num_data), size=num_inducing_expert, replace=False)
    inducing_variable = X.numpy()[idx, ...].reshape(-1, input_dim)
    inducing_variable = gpf.inducing_variables.InducingPoints(inducing_variable)

    mean_function = gpf.mean_functions.Constant()
    likelihood = gpf.likelihoods.Gaussian(noise_variance)
    kernel = gpf.kernels.RBF(lengthscales=lengthscales, variance=kernel_variance)

    return SVGPExpert(kernel,
                      likelihood,
                      mean_function=mean_function,
                      inducing_variable=inducing_variable)
[6]:
experts_list = [init_expert(lengthscales=experts_lengthscales[k]) for k in range(num_experts)]

We can now create an instance of SVGPExperts by passing our three experts to its constructor as a list.

[7]:
experts = SVGPExperts(experts_list)
print_summary(experts, fmt="notebook")
name | class | transform | prior | trainable | shape | dtype | value
SVGPExperts.experts_list[0].mean_function.c | Parameter | Identity |  | True | (1,) | float64 | [0.]
SVGPExperts.experts_list[0].kernel.variance | Parameter | Softplus |  | True | () | float64 | 1.0
SVGPExperts.experts_list[0].kernel.lengthscales | Parameter | Softplus |  | True | () | float64 | 10.0
SVGPExperts.experts_list[0].likelihood.variance | Parameter | Softplus + Shift |  | True | () | float64 | 1.0
SVGPExperts.experts_list[0].inducing_variable.Z | Parameter | Identity |  | True | (4, 1) | float64 | [[1.34690746...
SVGPExperts.experts_list[0].q_mu | Parameter | Identity |  | True | (4, 1) | float64 | [[0....
SVGPExperts.experts_list[0].q_sqrt | Parameter | FillTriangular |  | True | (1, 4, 4) | float64 | [[[1., 0., 0....
SVGPExperts.experts_list[1].mean_function.c | Parameter | Identity |  | True | (1,) | float64 | [0.]
SVGPExperts.experts_list[1].kernel.variance | Parameter | Softplus |  | True | () | float64 | 1.0
SVGPExperts.experts_list[1].kernel.lengthscales | Parameter | Softplus |  | True | () | float64 | 1.0
SVGPExperts.experts_list[1].likelihood.variance | Parameter | Softplus + Shift |  | True | () | float64 | 1.0
SVGPExperts.experts_list[1].inducing_variable.Z | Parameter | Identity |  | True | (4, 1) | float64 | [[0.70483329...
SVGPExperts.experts_list[1].q_mu | Parameter | Identity |  | True | (4, 1) | float64 | [[0....
SVGPExperts.experts_list[1].q_sqrt | Parameter | FillTriangular |  | True | (1, 4, 4) | float64 | [[[1., 0., 0....
SVGPExperts.experts_list[2].mean_function.c | Parameter | Identity |  | True | (1,) | float64 | [0.]
SVGPExperts.experts_list[2].kernel.variance | Parameter | Softplus |  | True | () | float64 | 1.0
SVGPExperts.experts_list[2].kernel.lengthscales | Parameter | Softplus |  | True | () | float64 | 10.0
SVGPExperts.experts_list[2].likelihood.variance | Parameter | Softplus + Shift |  | True | () | float64 | 1.0
SVGPExperts.experts_list[2].inducing_variable.Z | Parameter | Identity |  | True | (4, 1) | float64 | [[0.15448401...
SVGPExperts.experts_list[2].q_mu | Parameter | Identity |  | True | (4, 1) | float64 | [[0....
SVGPExperts.experts_list[2].q_sqrt | Parameter | FillTriangular |  | True | (1, 4, 4) | float64 | [[[1., 0., 0....

We now need to create an instance of SVGPGatingNetwork with a Softmax likelihood. In contrast to the two-expert case (where a single gating function suffices), the general case requires a gating function for each expert. The SVGPGatingNetwork inherits GPflow’s multioutput SVGP and uses SharedIndependentInducingVariables for the inducing inputs and SeparateIndependent kernels. The gating functions are independent but share the same inducing inputs, unlike the experts, where the separate inducing points loosely partition the data set.

[8]:
def init_gating_network(num_experts):
    idx = np.random.choice(range(num_data), size=num_inducing_gating, replace=False)
    inducing_variable = X.numpy()[idx, ...].reshape(-1, input_dim)
    inducing_variable = SharedIndependentInducingVariables(InducingPoints(inducing_variable))

    mean_function = gpf.mean_functions.Zero()
    kernel_list = [gpf.kernels.RBF() for _ in range(num_experts)]
    kernel = gpf.kernels.SeparateIndependent(kernel_list)

    return SVGPGatingNetwork(kernel,
                             likelihood=Softmax(num_experts),
                             inducing_variable=inducing_variable,
                             num_gating_functions=num_experts,
                             mean_function=mean_function)
[9]:
gating_network = init_gating_network(num_experts=num_experts)
print_summary(gating_network, fmt="notebook")
name | class | transform | prior | trainable | shape | dtype | value
SVGPGatingNetwork.kernel.kernels[0].variance | Parameter | Softplus |  | True | () | float64 | 1.0
SVGPGatingNetwork.kernel.kernels[0].lengthscales | Parameter | Softplus |  | True | () | float64 | 1.0
SVGPGatingNetwork.kernel.kernels[1].variance | Parameter | Softplus |  | True | () | float64 | 1.0
SVGPGatingNetwork.kernel.kernels[1].lengthscales | Parameter | Softplus |  | True | () | float64 | 1.0
SVGPGatingNetwork.kernel.kernels[2].variance | Parameter | Softplus |  | True | () | float64 | 1.0
SVGPGatingNetwork.kernel.kernels[2].lengthscales | Parameter | Softplus |  | True | () | float64 | 1.0
SVGPGatingNetwork.inducing_variable.inducing_variable.Z | Parameter | Identity |  | True | (7, 1) | float64 | [[1.72909446...
SVGPGatingNetwork.q_mu | Parameter | Identity |  | True | (7, 3) | float64 | [[0., 0., 0....
SVGPGatingNetwork.q_sqrt | Parameter | FillTriangular |  | True | (3, 7, 7) | float64 | [[[1., 0., 0....

We now have all the components to construct our MixtureOfSVGPExperts model so let’s go ahead and do it.

[10]:
model = MixtureOfSVGPExperts(gating_network=gating_network,
                             experts=experts,
                             num_samples=num_samples,
                             num_data=num_data)
print_summary(model, fmt="notebook")
name | class | transform | prior | trainable | shape | dtype | value
MixtureOfSVGPExperts.gating_network.kernel.kernels[0].variance | Parameter | Softplus |  | True | () | float64 | 1.0
MixtureOfSVGPExperts.gating_network.kernel.kernels[0].lengthscales | Parameter | Softplus |  | True | () | float64 | 1.0
MixtureOfSVGPExperts.gating_network.kernel.kernels[1].variance | Parameter | Softplus |  | True | () | float64 | 1.0
MixtureOfSVGPExperts.gating_network.kernel.kernels[1].lengthscales | Parameter | Softplus |  | True | () | float64 | 1.0
MixtureOfSVGPExperts.gating_network.kernel.kernels[2].variance | Parameter | Softplus |  | True | () | float64 | 1.0
MixtureOfSVGPExperts.gating_network.kernel.kernels[2].lengthscales | Parameter | Softplus |  | True | () | float64 | 1.0
MixtureOfSVGPExperts.gating_network.inducing_variable.inducing_variable.Z | Parameter | Identity |  | True | (7, 1) | float64 | [[1.72909446...
MixtureOfSVGPExperts.gating_network.q_mu | Parameter | Identity |  | True | (7, 3) | float64 | [[0., 0., 0....
MixtureOfSVGPExperts.gating_network.q_sqrt | Parameter | FillTriangular |  | True | (3, 7, 7) | float64 | [[[1., 0., 0....
MixtureOfSVGPExperts.experts.experts_list[0].mean_function.c | Parameter | Identity |  | True | (1,) | float64 | [0.]
MixtureOfSVGPExperts.experts.experts_list[0].kernel.variance | Parameter | Softplus |  | True | () | float64 | 1.0
MixtureOfSVGPExperts.experts.experts_list[0].kernel.lengthscales | Parameter | Softplus |  | True | () | float64 | 10.0
MixtureOfSVGPExperts.experts.experts_list[0].likelihood.variance | Parameter | Softplus + Shift |  | True | () | float64 | 1.0
MixtureOfSVGPExperts.experts.experts_list[0].inducing_variable.Z | Parameter | Identity |  | True | (4, 1) | float64 | [[1.34690746...
MixtureOfSVGPExperts.experts.experts_list[0].q_mu | Parameter | Identity |  | True | (4, 1) | float64 | [[0....
MixtureOfSVGPExperts.experts.experts_list[0].q_sqrt | Parameter | FillTriangular |  | True | (1, 4, 4) | float64 | [[[1., 0., 0....
MixtureOfSVGPExperts.experts.experts_list[1].mean_function.c | Parameter | Identity |  | True | (1,) | float64 | [0.]
MixtureOfSVGPExperts.experts.experts_list[1].kernel.variance | Parameter | Softplus |  | True | () | float64 | 1.0
MixtureOfSVGPExperts.experts.experts_list[1].kernel.lengthscales | Parameter | Softplus |  | True | () | float64 | 1.0
MixtureOfSVGPExperts.experts.experts_list[1].likelihood.variance | Parameter | Softplus + Shift |  | True | () | float64 | 1.0
MixtureOfSVGPExperts.experts.experts_list[1].inducing_variable.Z | Parameter | Identity |  | True | (4, 1) | float64 | [[0.70483329...
MixtureOfSVGPExperts.experts.experts_list[1].q_mu | Parameter | Identity |  | True | (4, 1) | float64 | [[0....
MixtureOfSVGPExperts.experts.experts_list[1].q_sqrt | Parameter | FillTriangular |  | True | (1, 4, 4) | float64 | [[[1., 0., 0....
MixtureOfSVGPExperts.experts.experts_list[2].mean_function.c | Parameter | Identity |  | True | (1,) | float64 | [0.]
MixtureOfSVGPExperts.experts.experts_list[2].kernel.variance | Parameter | Softplus |  | True | () | float64 | 1.0
MixtureOfSVGPExperts.experts.experts_list[2].kernel.lengthscales | Parameter | Softplus |  | True | () | float64 | 10.0
MixtureOfSVGPExperts.experts.experts_list[2].likelihood.variance | Parameter | Softplus + Shift |  | True | () | float64 | 1.0
MixtureOfSVGPExperts.experts.experts_list[2].inducing_variable.Z | Parameter | Identity |  | True | (4, 1) | float64 | [[0.15448401...
MixtureOfSVGPExperts.experts.experts_list[2].q_mu | Parameter | Identity |  | True | (4, 1) | float64 | [[0....
MixtureOfSVGPExperts.experts.experts_list[2].q_sqrt | Parameter | FillTriangular |  | True | (1, 4, 4) | float64 | [[[1., 0., 0....
We can use the Plotter1D class from mogpe.helpers.plotter to plot our model before training.

  • The top plot shows the mixing probability for each expert,
  • The middle plots show each expert’s latent GP,
  • The bottom plot shows the model’s posterior with the mean (black line) and samples (green dots).
[11]:
plotter = Plotter1D(model, X, Y)
plotter.plot_model()
_images/notebooks_train_mcycle_with_3_experts_19_0.png
_images/notebooks_train_mcycle_with_3_experts_19_1.png
_images/notebooks_train_mcycle_with_3_experts_19_2.png

We must now convert our numpy data set into a TensorFlow data set and set it up for stochastic optimisation by setting the batch size. We set drop_remainder=True so that every batch contains exactly batch_size points.

[12]:
prefetch_size = tf.data.experimental.AUTOTUNE
shuffle_buffer_size = num_data // 2
num_batches_per_epoch = num_data // batch_size
train_dataset = tf.data.Dataset.from_tensor_slices(dataset)
train_dataset = (train_dataset.repeat().prefetch(prefetch_size).shuffle(
    buffer_size=shuffle_buffer_size).batch(batch_size, drop_remainder=True))

We then use GPflow’s training_loss_closure method to get our training loss.

[13]:
training_loss = model.training_loss_closure(iter(train_dataset))

In mogpe.training.training_loops several training loops are defined. Here we write out a simple TensorFlow training loop (the same pattern as training_tf_loop), which runs the Adam optimizer on the model with training_loss as the objective function. The loop does not use any TensorBoard monitoring. We first configure the training/logging parameters.

[14]:
logging_epoch_freq = 5
plotting_epoch_freq = 500
num_epochs = 2000
[15]:
def plot_elbo(elbo):
    plt.subplot(111)
    plt.scatter(np.arange(len(elbo))*logging_epoch_freq, elbo)
    plt.xlabel("Epoch")
    plt.ylabel("ELBO")
[16]:
optimizer = tf.optimizers.Adam(learning_rate=learning_rate)

@tf.function
def tf_optimization_step():
    optimizer.minimize(training_loss, model.trainable_variables)

elbo_log = []
for epoch in range(num_epochs):
    for _ in range(num_batches_per_epoch):
        tf_optimization_step()
    epoch_id = epoch + 1
    if epoch_id % logging_epoch_freq == 0:
        elbo_log.append(training_loss()*-1.0)
    if epoch_id % plotting_epoch_freq == 0:
        clear_output(True)
        tf.print(f"Epoch {epoch_id}: ELBO (train) {training_loss()}")
        plot_elbo(elbo_log)
        plt.show()
Epoch 2000: ELBO (train) 49.81311591329514
_images/notebooks_train_mcycle_with_3_experts_27_1.png

Now that we have trained the model we can use our plotter again to visualise what we have learned.

[17]:
plotter.plot_model()
_images/notebooks_train_mcycle_with_3_experts_29_0.png
_images/notebooks_train_mcycle_with_3_experts_29_1.png
_images/notebooks_train_mcycle_with_3_experts_29_2.png
[18]:
print_summary(model, fmt="notebook")
name | class | transform | prior | trainable | shape | dtype | value
MixtureOfSVGPExperts.gating_network.kernel.kernels[0].variance | Parameter | Softplus |  | True | () | float64 | 7.135789778871159
MixtureOfSVGPExperts.gating_network.kernel.kernels[0].lengthscales | Parameter | Softplus |  | True | () | float64 | 1.3016590612834866
MixtureOfSVGPExperts.gating_network.kernel.kernels[1].variance | Parameter | Softplus |  | True | () | float64 | 23.512699977881155
MixtureOfSVGPExperts.gating_network.kernel.kernels[1].lengthscales | Parameter | Softplus |  | True | () | float64 | 0.8741896699556088
MixtureOfSVGPExperts.gating_network.kernel.kernels[2].variance | Parameter | Softplus |  | True | () | float64 | 0.1437596400755563
MixtureOfSVGPExperts.gating_network.kernel.kernels[2].lengthscales | Parameter | Softplus |  | True | () | float64 | 0.8437475112514834
MixtureOfSVGPExperts.gating_network.inducing_variable.inducing_variable.Z | Parameter | Identity |  | True | (7, 1) | float64 | [[1.26555844...
MixtureOfSVGPExperts.gating_network.q_mu | Parameter | Identity |  | True | (7, 3) | float64 | [[-0.89398975, -0.55701677, 0.18885994...
MixtureOfSVGPExperts.gating_network.q_sqrt | Parameter | FillTriangular |  | True | (3, 7, 7) | float64 | [[[2.84199781e-01, 0.00000000e+00, 0.00000000e+00...
MixtureOfSVGPExperts.experts.experts_list[0].mean_function.c | Parameter | Identity |  | True | (1,) | float64 | [0.47284263]
MixtureOfSVGPExperts.experts.experts_list[0].kernel.variance | Parameter | Softplus |  | True | () | float64 | 6.118328089014628e-07
MixtureOfSVGPExperts.experts.experts_list[0].kernel.lengthscales | Parameter | Softplus |  | True | () | float64 | 13.425503933562686
MixtureOfSVGPExperts.experts.experts_list[0].likelihood.variance | Parameter | Softplus + Shift |  | True | () | float64 | 0.001097764117017966
MixtureOfSVGPExperts.experts.experts_list[0].inducing_variable.Z | Parameter | Identity |  | True | (4, 1) | float64 | [[1.20314838...
MixtureOfSVGPExperts.experts.experts_list[0].q_mu | Parameter | Identity |  | True | (4, 1) | float64 | [[-0.00874161...
MixtureOfSVGPExperts.experts.experts_list[0].q_sqrt | Parameter | FillTriangular |  | True | (1, 4, 4) | float64 | [[[1.00055666e+00, 0.00000000e+00, 0.00000000e+00...
MixtureOfSVGPExperts.experts.experts_list[1].mean_function.c | Parameter | Identity |  | True | (1,) | float64 | [0.25559829]
MixtureOfSVGPExperts.experts.experts_list[1].kernel.variance | Parameter | Softplus |  | True | () | float64 | 0.8519896116531239
MixtureOfSVGPExperts.experts.experts_list[1].kernel.lengthscales | Parameter | Softplus |  | True | () | float64 | 0.27061685655515777
MixtureOfSVGPExperts.experts.experts_list[1].likelihood.variance | Parameter | Softplus + Shift |  | True | () | float64 | 0.061669447665278404
MixtureOfSVGPExperts.experts.experts_list[1].inducing_variable.Z | Parameter | Identity |  | True | (4, 1) | float64 | [[0.51534479...
MixtureOfSVGPExperts.experts.experts_list[1].q_mu | Parameter | Identity |  | True | (4, 1) | float64 | [[1.32988718...
MixtureOfSVGPExperts.experts.experts_list[1].q_sqrt | Parameter | FillTriangular |  | True | (1, 4, 4) | float64 | [[[0.13110926, 0., 0....
MixtureOfSVGPExperts.experts.experts_list[2].mean_function.c | Parameter | Identity |  | True | (1,) | float64 | [0.55334962]
MixtureOfSVGPExperts.experts.experts_list[2].kernel.variance | Parameter | Softplus |  | True | () | float64 | 1.084895730768368e-05
MixtureOfSVGPExperts.experts.experts_list[2].kernel.lengthscales | Parameter | Softplus |  | True | () | float64 | 10.279219461615842
MixtureOfSVGPExperts.experts.experts_list[2].likelihood.variance | Parameter | Softplus + Shift |  | True | () | float64 | 0.0958555963107556
MixtureOfSVGPExperts.experts.experts_list[2].inducing_variable.Z | Parameter | Identity |  | True | (4, 1) | float64 | [[0.35725987...
MixtureOfSVGPExperts.experts.experts_list[2].q_mu | Parameter | Identity |  | True | (4, 1) | float64 | [[0.0038293...
MixtureOfSVGPExperts.experts.experts_list[2].q_sqrt | Parameter | FillTriangular |  | True | (1, 4, 4) | float64 | [[[1.00016938, 0., 0....

What’s going on with this code?!

In this section we provide details on the Mixtures of Gaussian Process Experts code (mogpe). The implementation is designed to make it easy to implement different Mixtures of Gaussian Process Experts models and inference algorithms. It exploits both inheritance and composition (the building blocks of OOP), making it easier to evolve as new features are added or requirements change.

Class Inheritance and Composition

Let’s detail the basic building blocks and how they are related. There are three main components,

  1. The mixture of experts model (mogpe.mixture_of_experts),

  2. The set of experts (mogpe.experts),

    • And individual experts,

  3. The gating network (mogpe.gating_networks),

    • And individual gating functions.

Mixture of Experts Base

At the heart of this package is the MixtureOfExperts base class, which extends GPflow’s BayesianModel class (any subclass must implement the maximum_log_likelihood_objective() method). It defines the basic methods of a mixture of experts model, namely,

  1. A method to predict the mixing probabilities at a set of input locations MixtureOfExperts.predict_mixing_probs(),

  2. A method to predict the set of expert predictions at a set of input locations MixtureOfExperts.predict_experts_dists(),

  3. A method to predict the mixture distribution at a set of input locations MixtureOfExperts.predict_y().

The constructor requires an instance of a subclass of ExpertsBase to represent the set of experts and an instance of a subclass of GatingNetworkBase to represent the gating network.
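
Concretely, using the SVGP-based subclasses described below and mirroring the notebook examples, composing and querying a model looks roughly like the following sketch (keyword arguments follow the notebooks; shapes and return types are indicative only):

from mogpe.experts import SVGPExperts
from mogpe.mixture_of_experts import MixtureOfSVGPExperts

def build_and_query_moe(experts_list, gating_network, num_data, Xnew, num_samples=1):
    """Compose a mixture of experts from its building blocks and query it."""
    experts = SVGPExperts(experts_list)  # the set of experts (an ExpertsBase subclass)
    model = MixtureOfSVGPExperts(gating_network=gating_network,
                                 experts=experts,
                                 num_samples=num_samples,
                                 num_data=num_data)
    mixing_probs = model.predict_mixing_probs(Xnew)    # method 1 above
    experts_dists = model.predict_experts_dists(Xnew)  # method 2 above
    y_dist = model.predict_y(Xnew)                     # method 3 above
    return model, mixing_probs, experts_dists, y_dist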

MixtureOfSVGPExperts

The main model class in this package is MixtureOfSVGPExperts, which implements a lower bound maximum_log_likelihood_objective() given that both the experts and the gating functions are modelled as sparse variational Gaussian processes (SVGPs). The implementation extends the ExpertsBase class, creating SVGPExperts, which implements the required abstract methods as well as extra methods used in the lower bound. It also extends the GatingNetworkBase class, creating the SVGPGatingNetwork class. This class implements a gating network based on SVGPs for both the special two-expert case and the general K-expert case. Let’s now detail the base classes for the experts and gating network.

Expert(s) Base

Before detailing the ExpertsBase class we first need to introduce the base class for an individual expert. Any class representing an individual expert must inherit the ExpertBase class and implement the predict_dist() method, returning the expert’s prediction at Xnew. For example, the SVGPExpert class inherits the ExpertBase class to implement an expert as a sparse variational Gaussian process.

Any class representing the set of all experts must inherit the ExpertsBase class and implement the predict_dists() method, returning a batched TensorFlow Probability distribution. The constructor requires a list of instances of a subclass of ExpertBase. For example, the SVGPExperts class represents a set of SVGPExpert experts and adds a method for returning the set of inducing point KL divergences required in the MixtureOfSVGPExperts lower bound.
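
As a rough illustration of this pattern, a hypothetical (non-GP) expert might look like the sketch below. Only ExpertBase and the predict_dist() signature come from mogpe; the class itself is invented for illustration, and it returns a (mean, variance) tuple in the style of SVGPExpert (check the API docs below for the return convention the rest of the code expects).

import tensorflow as tf

from mogpe.experts import ExpertBase

class ConstantGaussianExpert(ExpertBase):
    """Hypothetical expert that predicts the same Gaussian at every input."""

    def __init__(self, mean=0.0, variance=1.0):
        super().__init__()
        self.mean = mean
        self.variance = variance

    def predict_dist(self, Xnew, **kwargs):
        # Broadcast the fixed mean and variance over the [num_test, input_dim]
        # inputs, giving predictions of shape [num_test, 1] (one output dimension).
        mean = tf.zeros_like(Xnew[:, :1]) + self.mean
        var = tf.zeros_like(Xnew[:, :1]) + self.variance
        return mean, var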

Gating Network Base

All gating networks should inherit the GatingNetworkBase class and implement the predict_mixing_probs() and predict_fs() methods. This package is mainly interested in gating networks based on Gaussian processes, in particular sparse variational Gaussian processes. The SVGPGatingNetwork class implements a gating network as a sparse variational Gaussian process. Similarly to GPflow’s SVGP, its constructor requires a likelihood. This likelihood governs the behaviour of the gating network. If a Bernoulli likelihood is passed then the gating network uses a single gating function, as \(\Pr(\alpha=2 | x) = 1 - \Pr(\alpha=1 | x)\). As such, the kernel and inducing variables should correspond to a single-output SVGP. In the general case, i.e. with more than two experts, the gating network adopts a Softmax likelihood which depends on a gating function for each expert. In this setting, the kernel and inducing variables should be of multiple-output types, i.e. SeparateIndependent and SharedIndependentInducingVariables respectively.
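
The two notebooks above construct exactly these two configurations. As a schematic helper (mirroring the notebook code; treat it as a sketch rather than part of the mogpe API):

import gpflow as gpf
from gpflow.inducing_variables import InducingPoints, SharedIndependentInducingVariables
from gpflow.likelihoods import Bernoulli, Softmax

from mogpe.gating_networks import SVGPGatingNetwork

def make_gating_network(Z, num_experts):
    # Z is an array of inducing inputs with shape [num_inducing, input_dim].
    mean_function = gpf.mean_functions.Zero()
    if num_experts == 2:
        # Special case: a single gating function, so a single-output kernel,
        # plain inducing points and a Bernoulli likelihood.
        return SVGPGatingNetwork(gpf.kernels.RBF(),
                                 likelihood=Bernoulli(),
                                 inducing_variable=InducingPoints(Z),
                                 mean_function=mean_function)
    # General case: one gating function per expert, so a multi-output kernel,
    # shared inducing inputs and a Softmax likelihood.
    kernel = gpf.kernels.SeparateIndependent([gpf.kernels.RBF() for _ in range(num_experts)])
    inducing_variable = SharedIndependentInducingVariables(InducingPoints(Z))
    return SVGPGatingNetwork(kernel,
                             likelihood=Softmax(num_experts),
                             inducing_variable=inducing_variable,
                             num_gating_functions=num_experts,
                             mean_function=mean_function)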

mogpe API

Base Classes

Mixture of Experts
Experts
class mogpe.experts.ExpertBase(*args, **kwargs)

Abstract base class for an individual expert.

Each subclass that inherits ExpertBase should implement the predict_dist() method that returns the individual expert’s prediction at an input.

Parameters
  • args (Any) –

  • kwargs (Any) –

abstract predict_dist(Xnew, **kwargs)

Returns the individual expert’s prediction at Xnew.

TODO: this does not return a tfd.Distribution

Parameters

Xnew (Tensor) – inputs with shape [num_test, input_dim]

Returns

an instance of a TensorFlow Distribution

class mogpe.experts.ExpertsBase(experts_list=None, name='Experts')

Abstract base class for a set of experts.

Provides an interface between ExpertBase and MixtureOfExperts. Each subclass that inherits ExpertsBase should implement the predict_dists() method that returns the set of experts predictions at an input (as a batched TensorFlow distribution).

Parameters

experts_list (Optional[List[ExpertBase]]) –

abstract predict_dists(Xnew, **kwargs)

Returns the set of experts’ predicted distributions at Xnew.

Parameters

Xnew (Tensor) – inputs with shape [num_test, input_dim]

Return type

Distribution

Returns

a batched tfd.Distribution with batch_shape […, num_test, output_dim, num_experts]

Gating Networks
class mogpe.gating_networks.GatingNetworkBase(*args, **kwargs)

Abstract base class for the gating network.

Parameters
  • args (Any) –

  • kwargs (Any) –

abstract predict_fs(Xnew, **kwargs)

Calculates the set of gating function posteriors at Xnew

Parameters

Xnew (Tensor) – inputs with shape [num_test, input_dim]

TODO: correct dimensions

Return type

Tuple[Tensor, Tensor]

Returns

mean and var batched Tensors with shape […, num_test, 1, num_experts]

abstract predict_mixing_probs(Xnew, **kwargs)

Calculates the set of experts mixing probabilities at Xnew \(\{\Pr(\alpha=k | x)\}^K_{k=1}\)

Parameters

Xnew (Tensor) – inputs with shape [num_test, input_dim]

Return type

Tensor

Returns

a batched Tensor with shape […, num_test, 1, num_experts]

SVGP Classes

Mixture of SVGP Experts
class mogpe.mixture_of_experts.MixtureOfSVGPExperts(gating_network, experts, num_data, num_samples=1, bound='further_gating')

Mixture of SVGP experts using stochastic variational inference.

Implementation of a mixture of Gaussian process (GP) experts method where the gating network is also implemented using GPs. The model is trained with stochastic variational inference by exploiting the factorisation achieved by sparse GPs.

Parameters
  • gating_network (SVGPGatingNetwork) – an instance of the GatingNetworkBase class with the predict_mixing_probs(Xnew) method implemented.

  • experts (SVGPExperts) – an instance of the SVGPExperts class with the predict_dists(Xnew) method implemented.

  • num_inducing_samples – the number of samples to draw from the inducing point distributions during training.

  • num_data (int) – the number of data points.

  • num_samples (int) –

  • bound (str) –

elbo(data)

Returns the evidence lower bound (ELBO) of the log marginal likelihood.

Parameters

data (Tuple[Tensor, Tensor]) – data tuple (X, Y) with inputs [num_data, input_dim] and outputs [num_data, output_dim]

Return type

Tensor

lower_bound_analytic(data)

Lower bound to the log-marginal likelihood (ELBO).

This bound assumes each output dimension is independent and takes the product over them within the logarithm (and before the expert indicator variable is marginalised).

Parameters

data (Tuple[Tensor, Tensor]) – data tuple (X, Y) with inputs [num_data, input_dim] and outputs [num_data, output_dim]

Return type

Tensor

Returns

loss - a Tensor with shape ()

lower_bound_dagp(data)

Lower bound used in Data Association with GPs (DAGP).

This bound doesn’t marginalise the expert indicator variable.

TODO: check I’ve implemented this correctly. It’s definitely slower than it should be.

Parameters

data (Tuple[Tensor, Tensor]) – data tuple (X, Y) with inputs [num_data, input_dim] and outputs [num_data, output_dim]

Return type

Tensor

Returns

loss - a Tensor with shape ()

lower_bound_further(data)

Lower bound to the log-marginal likelihood (ELBO).

Looser bound than lower_bound_tight as it marginalises both the experts’ and the gating network’s inducing variables $q(\hat{f}, \hat{h})$ in closed form. Replaces M-dimensional approximate integrals with 1-dimensional approximate integrals.

This bound is equivalent to a different likelihood approximation that only mixes the noise models (as opposed to the full GPs).

This bound assumes each output dimension is independent.

Parameters

data (Tuple[Tensor, Tensor]) – data tuple (X, Y) with inputs [num_data, input_dim] and outputs [num_data, output_dim]

Return type

Tensor

Returns

loss - a Tensor with shape ()

lower_bound_further_2(data)

Lower bound to the log-marginal likelihood (ELBO).

Looser bound than lower_bound_tight but marginalises the inducing variables $q(\hat{f}, \hat{h})$ in closed form. Replaces M-dimensional approximate integrals with 1-dimensional approximate integrals.

This bound assumes each output dimension is independent.

Parameters

data (Tuple[Tensor, Tensor]) – data tuple (X, Y) with inputs [num_data, input_dim] and outputs [num_data, output_dim]

Return type

Tensor

Returns

loss - a Tensor with shape ()

lower_bound_further_experts(data)

Lower bound to the log-marginal likelihood (ELBO).

Similar to lower_bound_tight but with a further bound on the experts. The bound replaces the M-dimensional integral over each expert’s inducing variables $q(\hat{\mathbf{U}})$ with 1-dimensional integrals over the gating network variational posterior $q(\mathbf{h}_n)$.

This bound is equivalent to a different likelihood approximation that only mixes the noise models (as opposed to the full GPs).

Parameters

data (Tuple[Tensor, Tensor]) – data tuple (X, Y) with inputs [num_data, input_dim] and outputs [num_data, output_dim]

Return type

Tensor

Returns

loss - a Tensor with shape ()

lower_bound_further_gating(data)

Lower bound to the log-marginal likelihood (ELBO).

Similar to lower_bound_tight but with a further bound on the gating network. The bound replaces the M-dimensional integral over the gating network’s inducing variables $q(\hat{\mathbf{U}})$ with 1-dimensional integrals over the gating network variational posterior $q(\mathbf{h}_n)$.

Parameters

data (Tuple[Tensor, Tensor]) – data tuple (X, Y) with inputs [num_data, input_dim] and outputs [num_data, output_dim]

Return type

Tensor

Returns

loss - a Tensor with shape ()

lower_bound_tight(data)

Lower bound to the log-marginal likelihood (ELBO).

Tighter bound than lower_bound_further but requires an M-dimensional expectation over the inducing variables $q(\hat{f}, \hat{h})$ to be approximated (with Gibbs sampling).

Parameters

data (Tuple[Tensor, Tensor]) – data tuple (X, Y) with inputs [num_data, input_dim] and outputs [num_data, output_dim]

Return type

Tensor

Returns

loss - a Tensor with shape ()

lower_bound_tight_2(data)

Lower bound to the log-marginal likelihood (ELBO).

Tighter bound than lower_bound_further but requires an M-dimensional expectation over the inducing variables $q(\hat{f}, \hat{h})$ to be approximated (with Gibbs sampling).

Parameters

data (Tuple[Tensor, Tensor]) – data tuple (X, Y) with inputs [num_data, input_dim] and outputs [num_data, output_dim]

Return type

Tensor

Returns

loss - a Tensor with shape ()

marginal_likelihood(data)

Marginal likelihood (ML).

Parameters

data (Tuple[Tensor, Tensor]) – data tuple (X, Y) with inputs [num_data, input_dim] and outputs [num_data, output_dim]

Return type

Tensor

Returns

marginal likelihood - a Tensor with shape ()

marginal_likelihood_new(data)

Marginal likelihood (ML).

Parameters

data (Tuple[Tensor, Tensor]) – data tuple (X, Y) with inputs [num_data, input_dim] and outputs [num_data, output_dim]

Return type

Tensor

Returns

marginal likelihood - a Tensor with shape ()

maximum_log_likelihood_objective(data)

Objective for maximum likelihood estimation. Should be maximized. E.g. log-marginal likelihood (hyperparameter likelihood) for GPR, or lower bound to the log-marginal likelihood (ELBO) for sparse and variational GPs.

Parameters

data (Tuple[Tensor, Tensor]) –

Return type

Tensor

predict_experts_fs(Xnew, num_inducing_samples=None, full_cov=False, full_output_cov=False)

Compute the mean and (co)variance of the experts’ latent functions at Xnew.

If num_inducing_samples is not None then sample inducing points instead of analytically integrating them. This is required in the mixture of experts lower bound.

Parameters
  • Xnew (Tensor) – inputs with shape [num_test, input_dim]

  • num_inducing_samples (Optional[int]) – the number of samples to draw from the inducing point distributions during training.

Return type

Tuple[Tensor, Tensor]

Returns

a tuple of (mean, (co)var) each with shape […, num_test, output_dim, num_experts]

SVGP Experts
class mogpe.experts.SVGPExpert(kernel, likelihood, inducing_variable, mean_function=None, num_latent_gps=1, q_diag=False, q_mu=None, q_sqrt=None, whiten=True, num_data=None)

Sparse Variational Gaussian Process Expert.

This class inherits the prior_kl() method from the SVGPModel class and implements the predict_dist() method using SVGPModel’s predict_y method.

Parameters
  • kernel (Kernel) –

  • likelihood (Likelihood) –

  • mean_function (Optional[MeanFunction]) –

  • num_latent_gps (int) –

  • q_diag (bool) –

  • whiten (bool) –

predict_dist(Xnew, num_inducing_samples=None, full_cov=False, full_output_cov=False)

Returns the mean and (co)variance of the expert’s prediction at Xnew.

Parameters
  • Xnew (Tensor) – inputs with shape [num_test, input_dim]

  • num_inducing_samples (Optional[int]) – the number of samples to draw from the inducing points joint distribution.

  • full_cov (bool) – If True, draw correlated samples over the inputs. Computes the Cholesky over the dense covariance matrix of size [num_data, num_data]. If False, draw samples that are uncorrelated over the inputs.

  • full_output_cov (bool) – If True, draw correlated samples over the outputs. If False, draw samples that are uncorrelated over the outputs.

Return type

Tuple[Tensor, Tensor]

Returns

tuple of Tensors (mean, variance); the mean’s shape is [num_inducing_samples, num_test, output_dim]; if full_cov=False the variance tensor has shape [num_inducing_samples, num_test, output_dim] and if full_cov=True it has shape [num_inducing_samples, output_dim, num_test, num_test]

class mogpe.experts.SVGPExperts(experts_list=None, name='Experts')

Extension of ExpertsBase for a set of SVGPExpert experts.

Provides an interface between a set of SVGPExpert instances and the MixtureOfSVGPExperts class.

Parameters

experts_list (Optional[List[SVGPExpert]]) –

predict_dists(Xnew, **kwargs)

Returns the set of experts’ predicted distributions at Xnew.

Parameters

Xnew (Tensor) – inputs with shape [num_test, input_dim]

Return type

Distribution

Returns

a batched tfd.Distribution with batch_shape […, num_test, output_dim, num_experts]

predict_fs(Xnew, num_inducing_samples=None, full_cov=False, full_output_cov=False)

Returns the set of experts’ latent function means and (co)variances at Xnew.

Parameters
  • Xnew (Tensor) – inputs with shape [num_test, input_dim]

  • num_inducing_samples (Optional[int]) –

Return type

Tuple[Tensor, Tensor]

Returns

a tuple of (mean, (co)var) each with shape […, num_test, output_dim, num_experts]

predict_ys(Xnew, num_inducing_samples=None, full_cov=False, full_output_cov=False)

Returns the means and (co)variances of the experts’ predictions at Xnew.

Parameters
  • Xnew (Tensor) – inputs with shape [num_test, input_dim]

  • num_inducing_samples (Optional[int]) –

Return type

Tuple[Tensor, Tensor]

Returns

a tuple of (mean, (co)var) each with shape […, num_test, output_dim, num_experts]

prior_kls()

Returns the set of experts’ KL divergences as a batched tensor.

Return type

Tensor

Returns

a Tensor with shape [num_experts,]

SVGP Gating Networks