Inference confidence intervals

An inference is not very meaningful without a confidence interval that indicates its precision, so we need to develop confidence intervals to give meaning to the network inferences.

These confidence intervals can be based either on the spread of inferences circulating within the network, or on historical performance (or a combination of both). The solution should be informed by whether the errors are (non-)Gaussian, as well as by a certain level of initial pragmatism, so that we can launch a feasible solution that can be refined or expanded later.


To kick this off I have (i) investigated different ways of determining an uncertainty on the weighted average combined_prediction within the get_combinator_output function, and (ii) performed a trade-off study between the accuracy and ease (and hence speed) of implementing different solutions. The hope is to find a solution which is simple (and hence quick) to implement, computationally cheap, and intuitive to understand. I outline the trade-off decisions below.


Trade-off 1 – single-epoch vs multi-epoch approaches: The current version of the get_combinator_output function expects regrets and predictions from a single epoch. This limits the accuracy of calculated uncertainties. Including information from prior epochs would enable a more empirical and data-driven basis for estimating uncertainties (e.g. folding in historical variance, adjusting weights based on historical performance, and accounting for correlations between different predictors). However, solutions using data from more than one epoch are significantly more complex and would take considerably longer to implement in the code. Therefore, it makes sense to first see whether an acceptable solution can be found using data from a single epoch.


Trade-off 2 – relative merits of different single-epoch approaches: I considered four different approaches to determining the uncertainty using data from a single epoch: (1) weighted standard deviation, (2) variance of the weighted mean, (3) sensitivity analysis and (4) bootstrapping. I discuss each method and its relative pros/cons in turn below.

Weighted standard deviation: This provides a measure of the spread of the predictions, weighted by their respective weights.

  • Pros: simplest to implement, intuitive to understand
  • Cons: assumes that the uncertainties on the predictions are equal
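A minimal NumPy sketch of this option (the function name and signature are illustrative, not taken from get_combinator_output):

```python
import numpy as np

def weighted_std(predictions, weights):
    """Spread of the predictions about their weighted mean.

    Assumes every prediction has the same intrinsic uncertainty; the
    weights only control how much each one contributes to the mean.
    """
    p = np.asarray(predictions, dtype=float)
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()                          # normalise so sum(w) = 1
    mean = np.sum(w * p)                     # the combined prediction
    return np.sqrt(np.sum(w * (p - mean) ** 2))
```

With equal weights this reduces to the ordinary standard deviation, which is a handy sanity check.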

Variance of the weighted mean: If we assume (or know) the proportionality between the weights and the variance of each prediction (i.e. a higher weight means lower variance and more reliability) which can be encapsulated in a variances function, we can calculate an uncertainty using the variance of the weighted mean.

  • Pros: potentially more accurate than method 1 if we have a robust variances function.
  • Cons: It is not clear (to me at least) what functional form of variances is appropriate. The obvious choice, variances = 1 / weights, cannot be used because std_weighted_mean then becomes mathematically equal to 1 whenever the weights are properly normalised (i.e. np.sum(weights) == 1). Switching to a power law (e.g. variances proportional to weight^alpha) solves this, but then it is not immediately obvious how to choose the exponent and normalisation.
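To make the power-law idea concrete, here is a hedged sketch (alpha and the normalisation k are free parameters I have assumed; zero weights would need to be excluded first):

```python
import numpy as np

def std_weighted_mean(weights, alpha=1.0, k=1.0):
    """Std of the weighted mean under an assumed variance model
    variances = k * weights**(-alpha).

    alpha = 1, k = 1 reproduces the degenerate variances = 1 / weights
    case, which returns exactly 1 for properly normalised weights.
    Zero weights must be removed before calling (the variance diverges).
    """
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()                     # normalise so sum(w) = 1
    variances = k * w ** (-alpha)       # assumed variance model
    # Var(weighted mean) = sum_i w_i^2 * var_i
    return np.sqrt(np.sum(w ** 2 * variances))
```

This makes the degeneracy easy to see: for alpha = 1 the weights cancel entirely, whatever their values.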

Sensitivity analysis: In principle, if we know the typical variation between predictions (e.g. +/-5%), it is straightforward to recompute the weighted mean multiple times, varying each prediction across this range, and quantify the resulting spread in weighted-mean values. I did not like this approach for two reasons. Firstly, to keep the inputs of get_combinator_output the same, the typical variation value (e.g. +/-5%) would need to be hard-coded in the function, and there is no reason to expect a one-size-fits-all value to work from time step to time step, let alone between different topics. Secondly, while it might be possible to calculate an appropriate value outside the function (which is then passed in), I suspect there are much better approaches if we decide to change the input to get_combinator_output.
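For completeness, the approach could be sketched like this (the +/-5% default and the Monte Carlo perturbation scheme are assumptions, which is exactly the hard-coding problem described above):

```python
import numpy as np

def sensitivity_std(predictions, weights, rel_variation=0.05,
                    n_trials=1000, seed=0):
    """Spread of the weighted mean when each prediction is perturbed
    uniformly within +/- rel_variation of its value."""
    rng = np.random.default_rng(seed)
    p = np.asarray(predictions, dtype=float)
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()
    # one row of perturbed predictions per trial
    noise = rng.uniform(-rel_variation, rel_variation,
                        size=(n_trials, p.size))
    means = (p * (1.0 + noise)) @ w      # weighted mean per trial
    return means.std()
```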

Bootstrapping or resampling techniques: By repeatedly sampling from the data (with replacement) and recalculating the weighted average, we can get a more empirical estimate of its distribution and thus its uncertainty.

  • Pros: no assumptions need to be made about the uncertainty of individual predictions.
  • Cons: might not be robust for small n, and is computationally expensive for large n.
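A basic resampling sketch (resampling the (prediction, weight) pairs together is my assumption about how bootstrapping would apply here):

```python
import numpy as np

def bootstrap_std(predictions, weights, n_boot=2000, seed=0):
    """Bootstrap uncertainty on the weighted mean: resample the
    (prediction, weight) pairs with replacement and take the spread
    of the recomputed weighted means."""
    rng = np.random.default_rng(seed)
    p = np.asarray(predictions, dtype=float)
    w = np.asarray(weights, dtype=float)
    means = np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.integers(0, p.size, size=p.size)  # with replacement
        means[b] = np.sum(w[idx] * p[idx]) / np.sum(w[idx])
    return means.std()
```

For the low-n concern: with only a handful of workers there are few distinct resamples, so the estimate is coarse; for large n the loop cost is the obvious expense.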

Given its simplicity, ease of implementation, low computational cost and the straightforward way to determine when its underlying assumption breaks down, the weighted standard deviation appears to be the optimal approach with which to begin testing.


Here is a first attempt at adding error bars, derived from the weighted standard deviation, to the “observed” vs “predicted” returns for forecasters (called aggregators in the plot).

My initial thoughts are that this doesn’t look too bad. The magnitude of the error bars appears proportional to the loss, i.e. the error bars for forecaster 2 (which has the lowest loss, and hence the best overall predictions) are generally smaller than those for forecasters 1 and 3.


Yes, let’s go with this for now. Of course, as I think you imply above, this type of confidence interval is a “network confidence interval”, i.e. it only captures variance coming from worker-to-worker variation. It doesn’t capture any innate error corresponding to each individual inference contributed by these workers. We might want to expand this in the future to include that, maybe using the forecasted losses – now that I think of it I’m quite interested in that possibility.


I agree, that sounds like an interesting avenue to explore.

I thought it would be useful to see how the magnitude of the error bars in the above plots derived from the weighted standard deviation [x-axis] compared to the absolute error (abs(ground_truth - prediction)) [y-axis] for the three different forecasters (aggregators).

There is possibly a correlation here, but the most noticeable feature is that the dynamic range of the error bars [x-axis] is much larger than the absolute error, i.e. there is a long tail to very low x-values that is not matched in the y-values.


Right – I think if you were to overplot a 1:1 line, it’d actually follow the cloud of points on the right reasonably closely.



I’ve investigated the low x-values and these are caused by the weight distribution. For the fiducial model, around 60% of the weights used in the calculation are <1E-5 and around 40% equal zero. Although this is only for a small test network, if this is representative of more realistic models in production with larger numbers of participants, it suggests we should be careful when using weights for statistical analysis.
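A quick diagnostic along these lines, for any weight vector (the function name is mine; the 1E-5 threshold default mirrors the numbers above but is otherwise an assumption):

```python
import numpy as np

def weight_diagnostics(weights, tiny=1e-5):
    """Fraction of weights that are exactly zero or below `tiny`,
    as a sanity check before trusting weighted statistics."""
    w = np.asarray(weights, dtype=float)
    return {"frac_zero": float(np.mean(w == 0.0)),
            "frac_tiny": float(np.mean(w < tiny))}
```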


Here are the same plots remade using the regular (non-weighted) std:

And here are the z-scores of the same data in green, showing the spread of data points offset from the mean [0 on the x-axis] in units of the std, with the expected theoretical distribution (Gaussian with mean of 0 and standard deviation of 1) in red. The data distribution looks symmetric with mean close to zero, but the data are more strongly peaked toward small z-scores than expected.
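For reference, the z-scores in these plots amount to something like the following (a sketch; the actual plotting code may differ):

```python
import numpy as np

def z_scores_about_mean(values):
    """Offset of each value from the sample mean, in units of the
    (unweighted) sample std."""
    v = np.asarray(values, dtype=float)
    return (v - v.mean()) / v.std()
```

By construction these z-scores have mean 0 and std 1; the question is whether their shape matches a standard Gaussian.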


From the above initial tests:

  1. As many of the weights are very small or zero, we should not use weights in calculating confidence intervals.
  2. Using the unweighted std of the network predictions alone to calculate the uncertainty will underestimate the network uncertainty.

To try to solve the narrow z-score range issue, I folded in the “one out” predictions, as these are a simple implementation of the bootstrapping/resampling method described in the trade-off study above.
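For a single epoch, the leave-one-out (“one out”) predictions can be built directly from running totals (the variable names here are mine, not from the codebase):

```python
import numpy as np

def oneout_predictions(predictions, weights):
    """Leave-one-out combined predictions: for each worker i, the
    weighted mean recomputed with worker i excluded."""
    p = np.asarray(predictions, dtype=float)
    w = np.asarray(weights, dtype=float)
    total_w, total_pw = w.sum(), np.sum(w * p)
    # removing worker i leaves (total_pw - w_i * p_i) / (total_w - w_i)
    return (total_pw - w * p) / (total_w - w)
```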

Just to confirm point 1 above, I first tried calculating the uncertainty on combinator_prediction_oneout using the weighted standard deviation, i.e.,

combinator_prediction_oneout_std[i] = np.sqrt(np.average(
    (combinator_prediction_oneout[:, i] - combinator_prediction[i])**2,
    weights=combinator_weights[:, i]))

As expected, the weight distribution for combinator_weights contains (even more) zero weight values, leading to the same problem as above. This confirms it is prudent to avoid using weights for statistical calculations until we have had the chance to test for larger/production-ready networks.

I therefore switched to using the non-weighted standard deviation to determine combinator_prediction_oneout_std and calculated the z-score for each worker, with and without normalising by the square root of the number of workers.

Here are plots of the z-score distribution (green) of the total network output, calculated as z_scores = (returns - combinator_prediction) / combinator_prediction_oneout_std, overlaid with a Gaussian fit to the z_scores (black dashed line, fit results in top left) and the expected distribution (red line; Gaussian, mean=0, std=1). In the top plot, combinator_weights have been used to derive combinator_prediction_oneout_std. In the middle plot, no weights have been applied in calculating combinator_prediction_oneout_std. In the bottom plot, no weights are used and the z-scores are additionally normalised by z_scores /= np.sqrt(n_workers).
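In code form, the middle- and bottom-plot variants are roughly the following (the workers × epochs array shape and the helper name are my assumptions):

```python
import numpy as np

def oneout_z_scores(returns, combinator_prediction,
                    combinator_prediction_oneout, n_workers=None):
    """Z-scores of the returns against the combined prediction, scaled
    by the unweighted std of the leave-one-out predictions per epoch.
    If n_workers is given, also divide by sqrt(n_workers), as in the
    bottom plot."""
    oneout_std = np.std(combinator_prediction_oneout, axis=0)
    z = (np.asarray(returns, dtype=float)
         - np.asarray(combinator_prediction, dtype=float)) / oneout_std
    if n_workers is not None:
        z = z / np.sqrt(n_workers)
    return z
```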

Because many of the combinator_weights are so small, many of the combinator_prediction_oneout_std values are very small, so the spread in z-scores is huge in the top plot; I needed to truncate large values in the array before I could plot. I think this explains the larger std in the top plot.

I note that there is an offset of the mean toward negative values in all plots. It makes sense to play with the bias to see if we understand that.

Here are repeats of the bottom two plots above, but this time multiplying the bias in initialise_predictors and initialise_aggregators by 0.005, i.e. setting the bias close to zero. The mean of both distributions is now very close to zero, suggesting it is the bias that is causing the negative offset in the fiducial model. Note how similar the fit to the normalised plot is to the expected theoretical distribution.

Thanks much for all the updates, this is shaping up really nicely. I’m a bit confused by the (rather absolute) statement that we shouldn’t use weights when calculating confidence intervals.

Surely that only applies for performance-based confidence intervals? For a network-based confidence interval (which it looks like is the v0 version that we will actually implement), using the weights is necessary. Correct?

Absolutely! Thanks for catching that. I should have provided more context with that statement.

I found the reason for the difference between the z-scores distribution and the Gaussians above. The histogram plot was automatically re-normalising so the area under the curve was equal to one, but the Gaussian fit was to the un-normalised z-scores. Having ensured both are normalised in the same way we get the following plot from the fiducial run with no bias, without needing to normalise by sqrt(n_workers).
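The fix boils down to comparing like with like: draw the histogram as a density and express the fit as a unit-area pdf. A sketch (for a Gaussian the maximum-likelihood fit is simply the sample mean and population std):

```python
import numpy as np

def gaussian_fit_pdf(z):
    """Maximum-likelihood Gaussian fit to the z-scores, returned as a
    unit-area pdf so it can be overlaid directly on a histogram drawn
    with density=True (both then share the same normalisation)."""
    z = np.asarray(z, dtype=float)
    mu, sigma = z.mean(), z.std()
    def pdf(x):
        return (np.exp(-0.5 * ((x - mu) / sigma) ** 2)
                / (sigma * np.sqrt(2.0 * np.pi)))
    return mu, sigma, pdf
```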


The z-score distribution is still more centrally peaked, with broader wings, than expected for a Gaussian. This struck me as Lorentzian-like, so I fitted and overlaid a Lorentzian profile (blue dashed line). It does a better job of fitting the z-scores, capturing the narrower peak and wider tails, and it is striking that the peak and gamma are close to 0 and 1, respectively (although this could be a fluke).
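To make this comparison reproducible, here is a hedged sketch of a unit-area Lorentzian and a crude grid-search maximum-likelihood fit (the actual plots may have used a different optimiser; the grids are my assumptions):

```python
import numpy as np

def lorentzian_pdf(x, x0, gamma):
    """Unit-area Lorentzian (Cauchy) profile with peak x0 and
    half-width-at-half-maximum gamma."""
    x = np.asarray(x, dtype=float)
    return gamma / (np.pi * (gamma ** 2 + (x - x0) ** 2))

def fit_lorentzian(z):
    """Maximum-likelihood (x0, gamma) by brute-force grid search,
    centred on the median (a Lorentzian has no defined mean)."""
    z = np.asarray(z, dtype=float)
    x0_grid = np.median(z) + np.linspace(-1.0, 1.0, 41)
    gamma_grid = np.linspace(0.2, 3.0, 57)
    best_x0, best_gamma, best_ll = x0_grid[0], gamma_grid[0], -np.inf
    for x0 in x0_grid:
        for gamma in gamma_grid:
            ll = np.log(lorentzian_pdf(z, x0, gamma)).sum()
            if ll > best_ll:
                best_x0, best_gamma, best_ll = x0, gamma, ll
    return best_x0, best_gamma
```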

As we would usually expect the z-score distribution to be Gaussian due to the central limit theorem, I spent a bit of time thinking about why the distribution might be naturally more Lorentzian than Gaussian. Since our system aggregates predictions from various sources (predictors, aggregators, etc.), and the individual prediction errors may not be independent or identically distributed (a key assumption of the theorem), the aggregate need not be Gaussian.

In order to understand whether the above Gaussian vs Lorentzian result on the fiducial network settings is representative, I ran a quick parameter space study for a small number of different network sizes and numbers of epochs. In the titles of the individual plots below, “nb” means biases have been set to zero, and the numbers before the letters “p”, “a”, “r” and “e” give the number of predictors (inferers), aggregators (forecasters), reputers and epochs, respectively. For example, “Z-score histogram: nb 5p 3a 3r 100e” means there are 5 predictors (inferers), 3 aggregators (forecasters), 3 reputers and 100 epochs.

This quick study shows several interesting things:

  1. The Lorentzian does a better job of approximating the z-score distribution than the Gaussian.
  2. Both the Lorentzian and Gaussian fit results remain similar between 1000 and 10000 epochs.
  3. The larger the network becomes, the less well the normalised Gaussian (red curve) represents the z-score distribution.

Based on this, I would suggest we explore how the uncertainties can be modelled as a Lorentzian and reported as confidence intervals.

We know that during the first few epochs, the network behaviour changes until it has had time to “settle down”. I thought it would be interesting to see how the above plots change if we ignore the settling-in period.

Here are the first three plots above reproduced after removing the initial 200 epochs (“200rm” in the title). This improves the repeatability as the epoch number increases: the first plot below is now much more similar to the other two, compared with the plots above where the initial 200 epochs are included. Also, the overall gamma is now smaller in each case, suggesting some large outliers have been removed.