This is the extended material for Expressions are Pragmatic Model Visualizations.

Followup on Example 1

Let’s further loosen the prior on the parameters.

Model Visualization

Gaussian Process with the following mean and covariance functions.

Mean: constant


Covariance: Use distance between points as follows:

    * sum([
        # Kernel: Factorized scalar vs choice parameters
        * sum([
            # Scalar parameters
            * matern_25(
                norm_l2([
                    compare('log_epochs') / ,
                    compare('log_batch_size') / ,
                    compare('log_conv1_weight_decay') / ,
                    compare('log_conv2_weight_decay') / ,
                    compare('log_conv3_weight_decay') / ,
                    compare('log_dense1_weight_decay') / ,
                    compare('log_dense2_weight_decay') / ,
                    compare('log_1cycle_initial_lr_pct') / ,
                    compare('log_1cycle_final_lr_pct') / ,
                    compare('log_1cycle_pct_warmup') / ,
                    compare('log_1cycle_max_lr') / ,
                    compare('log_1cycle_momentum_max_damping_factor') / ,
                    compare('log_1cycle_momentum_min_damping_factor_pct') / ,
                    compare('log_1cycle_beta1_max_damping_factor') / ,
                    compare('log_1cycle_beta1_min_damping_factor_pct') / ,
                    compare('log_beta2_damping_factor') / ,
                    compare('log_conv1_channels') / ,
                    compare('log_conv2_channels') / ,
                    compare('log_conv3_channels') / ,
                    compare('log_dense1_units') / ])),
            # Choice parameters
            * exp(
                -norm_l1([
                    compare('choice_nhot0') / ,
                    compare('choice_nhot1') / ,
                    compare('choice_nhot2') / ,
                    compare('choice_nhot3') / ]))]),
        # Kernel: Joint scalar and choice parameters
        * prod([
            matern_25(
                norm_l2([
                    compare('log_epochs') / ,
                    compare('log_batch_size') / ,
                    compare('log_conv1_weight_decay') / ,
                    compare('log_conv2_weight_decay') / ,
                    compare('log_conv3_weight_decay') / ,
                    compare('log_dense1_weight_decay') / ,
                    compare('log_dense2_weight_decay') / ,
                    compare('log_1cycle_initial_lr_pct') / ,
                    compare('log_1cycle_final_lr_pct') / ,
                    compare('log_1cycle_pct_warmup') / ,
                    compare('log_1cycle_max_lr') / ,
                    compare('log_1cycle_momentum_max_damping_factor') / ,
                    compare('log_1cycle_momentum_min_damping_factor_pct') / ,
                    compare('log_1cycle_beta1_max_damping_factor') / ,
                    compare('log_1cycle_beta1_min_damping_factor_pct') / ,
                    compare('log_beta2_damping_factor') / ,
                    compare('log_conv1_channels') / ,
                    compare('log_conv2_channels') / ,
                    compare('log_conv3_channels') / ,
                    compare('log_dense1_units') / ])),
            exp(
                -norm_l1([
                    compare('choice_nhot0') / ,
                    compare('choice_nhot1') / ,
                    compare('choice_nhot2') / ,
                    compare('choice_nhot3') / ]))])])

When comparing a point to itself, add noise value: (log scale)
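To make this expression concrete, here is a minimal NumPy sketch of the kernel it describes. It is a reading aid, not the model's actual implementation: the function names (matern_25, norm_l2, norm_l1, compare) follow the expression, while the weights (the values in front of each "*") and lengthscales (the values after each "/") are placeholders for the fitted values, the joint kernel is assumed to reuse the same lengthscales as the factorized one for brevity, and the self-comparison noise term is omitted.

```python
import numpy as np

def matern_25(d):
    # Matern 5/2 kernel as a function of the (already scaled) distance d
    s = np.sqrt(5.0) * d
    return (1.0 + s + s**2 / 3.0) * np.exp(-s)

def kernel(x1, x2, scalar_lengthscales, choice_lengthscales,
           w_outer, w_factorized, w_scalar, w_choice, w_joint):
    # compare('name') / lengthscale -> per-parameter scaled difference
    d_scalar = (x1["scalars"] - x2["scalars"]) / scalar_lengthscales
    d_choice = (x1["choices"] - x2["choices"]) / choice_lengthscales

    k_scalar = matern_25(np.linalg.norm(d_scalar, ord=2))  # matern_25(norm_l2([...]))
    k_choice = np.exp(-np.linalg.norm(d_choice, ord=1))    # exp(-norm_l1([...]))

    return w_outer * (
        # Kernel: Factorized scalar vs choice parameters
        w_factorized * (w_scalar * k_scalar + w_choice * k_choice)
        # Kernel: Joint scalar and choice parameters
        + w_joint * k_scalar * k_choice
    )

# Example with placeholder values; the real model uses the 20 scalar parameters
# and 4 choice n-hot dimensions listed above, with its fitted weights and
# lengthscales in place of these made-up numbers.
x1 = {"scalars": np.zeros(20), "choices": np.array([1.0, 0.0, 0.0, 0.0])}
x2 = {"scalars": np.full(20, 0.5), "choices": np.array([0.0, 1.0, 0.0, 0.0])}
k = kernel(x1, x2,
           scalar_lengthscales=np.ones(20), choice_lengthscales=np.ones(4),
           w_outer=1.0, w_factorized=0.5, w_scalar=0.5, w_choice=0.5, w_joint=0.5)
```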


Visualization 3: Model parameters now that the priors on the "lengthscales" have been loosened even further.

Compared to Visualization 2 in the original post, the parameters are less constrained. The discrete change in behavior as the dataset grows has been further reduced, and the parameters now reach higher absolute values than before.

How were results impacted?

Cross-validation results with looser priors

Chart 1: Hold-one-out cross-validation results, now adding a third configuration.

There is a significant benefit for very small datasets, but for most dataset sizes the looser priors lead to worse results. It seems that keeping reasonably tight priors is important, especially as the dataset gets larger.

Followup on Example 2

Here’s a plot of the training run, using batch training again. This time I overlay the mean of all of the models’ losses.

Schematic Chart 2: Each of the 60 models’ negative losses after each step of training.

We find something even more interesting. Yes, the vast majority of the models converge hundreds of steps before the final model converges. But we also see that individual models often get worse on single training steps; only the average of all scores improves monotonically. This may seem normal if you are accustomed to optimizing neural networks, but BoTorch's optimizer (scipy.optimize with L-BFGS-B) is only ever supposed to take a step if it improves the loss.

BoTorch trains multiple models in parallel by adding all of their loss functions to create a single scalar loss. This type of composability is a valid strategy with optimizers like SGD or Adam, since those follow the gradient wherever it leads. But other optimizers, even gradient-based ones, decide whether to accept a parameter update based on the loss at the destination point. For these optimizers, you can't simply add loss functions without changing the optimizer's behavior: only the sum of the losses is guaranteed to improve monotonically, and an accepted update can harm the individual losses. For some model types, like neural networks, we might describe this as "regularization" and treat it as a good thing, but in those cases we should just use a different optimizer.
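To see the mechanism concretely, here is a toy sketch (not BoTorch's actual code) of handing two independent losses to scipy's L-BFGS-B as a single sum. The losses and the callback are made up; the structural point is that the line search only ever tests the summed objective, so nothing forces the individual losses to improve at an accepted step.

```python
import numpy as np
from scipy.optimize import minimize

# Two independent toy "models", each with its own parameter.
def loss_a(p):
    return 100.0 * (p - 3.0) ** 2

def loss_b(p):
    return (p + 2.0) ** 4

def total_loss(params):
    # This sum is the only quantity L-BFGS-B's line search ever checks.
    return loss_a(params[0]) + loss_b(params[1])

accepted = []  # per-model losses at each accepted iterate

def record(xk):
    accepted.append((loss_a(xk[0]), loss_b(xk[1])))

result = minimize(total_loss, x0=np.array([10.0, 10.0]),
                  method="L-BFGS-B", callback=record)

# L-BFGS-B only requires the summed loss to decrease at accepted iterates;
# nothing forces loss_a or loss_b to decrease individually. Inspect `accepted`
# to see how each one actually moved.
for la, lb in accepted:
    print(f"loss_a={la:10.4f}  loss_b={lb:10.4f}  total={la + lb:10.4f}")
```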

So, not only are we evaluating converged models unnecessarily; we are also taking indirect paths to the optimum. When I count model evaluations, 18,817 total evaluations happen when training sequentially, while 92,820 happen when training in parallel, so we are doing approximately 5 times too many operations. In my experiments, on GPUs it is still worth training in batch rather than sequentially, but on CPUs it is better to simply loop over models and optimize them independently.
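For the sequential path, a loop like the following sketch is all that is needed. It assumes the current BoTorch API (fit_gpytorch_mll; older releases call this fit_gpytorch_model) and hypothetical fold data.

```python
from botorch.models import SingleTaskGP
from botorch.fit import fit_gpytorch_mll
from gpytorch.mlls import ExactMarginalLogLikelihood

def fit_folds_sequentially(folds):
    """Fit one GP per cross-validation fold, each with its own optimizer run.

    `folds` is a hypothetical list of (train_X, train_Y) tensor pairs.
    """
    models = []
    for train_X, train_Y in folds:
        model = SingleTaskGP(train_X, train_Y)
        mll = ExactMarginalLogLikelihood(model.likelihood, model)
        # Each call stops as soon as *this* model converges, so a single
        # slow-to-converge fold no longer forces extra evaluations of the rest.
        fit_gpytorch_mll(mll)
        models.append(model)
    return models
```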

The effect of all of this is that cross-validation in BoTorch is much slower than it needs to be. As you increase the size of the dataset, the following slowdowns occur:

  1. More models are trained.
  2. Each of those models is more expensive, since it has a larger dataset.
  3. The optimization trajectory becomes longer and longer as more models are trained in parallel. (Unnecessary)
  4. More pointless evaluations of converged models occur, because it becomes more likely that some model is randomly slow to converge. (Unnecessary)

These four factors multiply to create a slow experience.

I am inclined to implement batch training differently, maybe by writing a single training run and then batching it with something like JAX's vmap. This would eliminate factor 3, and maybe it could be combined with JAX's while_loop to also address factor 4.
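Here is a rough sketch of that idea, with plain gradient descent on a made-up loss standing in for the real GP marginal likelihood. Each model gets its own convergence test inside lax.while_loop, and a single vmap produces the batched version; under vmap the combined loop still runs until the slowest model converges, but the mask keeps already-converged models from being updated further.

```python
import jax
import jax.numpy as jnp

# Hypothetical per-model loss; a stand-in for the GP's negative marginal
# log likelihood, which would take that model's hyperparameters and fold data.
def loss_fn(params, data):
    return jnp.sum((params - data) ** 2)

def fit_one(params, data, lr=0.01, tol=1e-6, max_steps=10_000):
    """Gradient descent on a single model, stopping at its own convergence."""

    def cond(state):
        _, step, delta = state
        return jnp.logical_and(step < max_steps, delta > tol)

    def body(state):
        params, step, delta = state
        proposed = params - lr * jax.grad(loss_fn)(params, data)
        new_delta = jnp.max(jnp.abs(proposed - params))
        # Under vmap the loop runs until the slowest model converges, so freeze
        # any model whose own update has already dropped below tolerance.
        converged = delta <= tol
        params = jnp.where(converged, params, proposed)
        delta = jnp.where(converged, delta, new_delta)
        return params, step + 1, delta

    init = (params, 0, jnp.asarray(jnp.inf, dtype=params.dtype))
    params, _, _ = jax.lax.while_loop(cond, body, init)
    return params

# One vmap turns the single-model fit into batch training: each model follows
# its own trajectory instead of sharing one summed loss with the others.
fit_all = jax.vmap(fit_one, in_axes=(0, 0))
```

This keeps each model's optimization trajectory independent, which is the fix for factor 3. The mask stops converged models from drifting, but their losses are still evaluated until the slowest model finishes, so factor 4 is only partly addressed by this particular sketch.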

(That’s it for the appendix! Here’s a link back to the main post.)