In equations 6 and 7 of the whitepaper, we specify the potential function that ends up setting the weights of inferences contributed to the network. This potential function features a free parameter p, which has been chosen to reasonably reflect positive contributions to the network. (The other free parameter, c, was found to have a smaller impact on network performance and was introduced for economic reasons.)

To help inform how topic creators set p (this is pnorm in the simulator and PInferenceSynthesis in the live network), we need a systematic study of network performance (measured through the network loss) as a function of p. I suggest considering different network configurations (n_inferers, n_forecasters) and making the p array fine-grained enough to distinguish trends, if there are any, from noise.

Investigating the network loss distribution: in order to search for statistically meaningful trends in the network loss as a function of network parameters, we must first identify an appropriate metric to describe the loss distribution. I first define two “fiducial” sets of parameters, “f1” and “f2”, intended to represent a small and a medium-sized network, respectively.

Choice of metric for identifying trends in network loss: The box and whisker plots above show that the distribution of log10(combinator_loss) is not Gaussian. The distribution is not symmetric, and there is a tail to very small values. This tail will skew the mean and standard deviation, so it is not advisable to use the mean and the standard error of the mean (SEM, sigma/sqrt(n)) to investigate trends in combinator_loss with different network parameters.
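To illustrate the point about the tail, here is a minimal sketch on synthetic data. The numbers are invented for illustration (a Gaussian core plus a 10% tail to small values) and are not simulator output; only the qualitative behaviour matters.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for log10(combinator_loss): a narrow Gaussian core plus
# a tail to very small values. These parameters are assumptions for the demo.
core = rng.normal(loc=-2.0, scale=0.2, size=900)
tail = rng.normal(loc=-5.0, scale=1.0, size=100)
log_loss = np.concatenate([core, tail])

# Mean-based summary: pulled towards the tail.
mean = log_loss.mean()
sem = log_loss.std(ddof=1) / np.sqrt(log_loss.size)

# Median-based summary: dominated by the core of the distribution.
median = np.median(log_loss)
q25, q75 = np.percentile(log_loss, [25, 75])
iqr = q75 - q25

print(f"mean   = {mean:.2f} +/- {sem:.2f} (SEM)")
print(f"median = {median:.2f}, IQR = {iqr:.2f}")
```

The mean lands well below the bulk of the sample and the standard deviation is inflated by the tail, while the median and interquartile range stay close to the core.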

The median is a better measure of central tendency in this case, and the interquartile range is a better measure of spread. As the distribution is not Gaussian, it does not make sense to use the standard error of the median (~1.253 median(∣x_i−median(x)∣)/sqrt(n)). A better approach to determine the uncertainty on the median is to use bootstrapping:

Create a large number of bootstrap samples (~10,000) from the original data by resampling with replacement.

Compute the median for each of these bootstrap samples.

Compute the standard deviation of the bootstrap medians; this is the uncertainty on the median.
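The three steps above can be sketched as follows. This is a minimal NumPy sketch, not the simulator's code: the function name bootstrap_median_error and its arguments are my own, and the input is assumed to be a 1-D array of log10(combinator_loss) values without NaNs.

```python
import numpy as np

def bootstrap_median_error(values, n_boot=10_000, seed=0):
    """Return (median, bootstrap standard deviation of the median).

    values : 1-D array-like of finite numbers (assumed NaN-free).
    n_boot : number of bootstrap resamples.
    seed   : seed for the resampling RNG (hypothetical convenience argument).
    """
    values = np.asarray(values, dtype=float)
    rng = np.random.default_rng(seed)
    # Step 1: resample with replacement; each row is one bootstrap sample
    # of the same size as the original data.
    idx = rng.integers(0, values.size, size=(n_boot, values.size))
    # Step 2: median of each bootstrap sample.
    boot_medians = np.median(values[idx], axis=1)
    # Step 3: standard deviation of the bootstrap medians.
    return np.median(values), boot_medians.std(ddof=1)

# Usage on synthetic data (illustrative only, not simulator output).
demo = np.random.default_rng(1).normal(loc=-2.0, scale=0.3, size=200)
med, err = bootstrap_median_error(demo)
print(f"median = {med:.3f} +/- {err:.3f}")
```

The index array has n_boot x n entries, so for very long loss series it may be worth resampling in chunks instead of all at once.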

The plot below shows the same data as above, but this time the dots are the median of the distribution and the error bars show the standard deviation of the median, estimated with the above bootstrap method using 10,000 resamples.

This looks great – I think we can start changing the network composition at this point and look for trends there. There is a clear difference between these two sets and, most importantly, the loss appears to have its minimum at a different pnorm in each case. We should try to find out whether this is indicative of a systematic trend or is mostly noise.

The network performance is degraded when there are no forecasters.

The network performance is improved by ~1.5 orders of magnitude when increasing n_inferers and n_reputers from 3 to 12.

There is consistently an increase in performance going from n_forecasters = 1 to n_forecasters > 1, and in general, increasing n_forecasters improves network performance.

For the larger network, ‘f2’, there is an interesting trend in network loss with pnorm when comparing n_forecasters > 0. The network loss decreases with increasing pnorm until a minimum and then increases. The minimum shifts to higher pnorm for increasing n_forecasters.

For the small network, ‘f1’, there are no values on the plot at larger n_forecasters and pnorm. I investigated this, and it turns out the combinator_loss values are NaN for this set of network parameters.
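A quick way to see which parameter combinations are affected is to tabulate the NaN fraction per configuration. This is a sketch under assumptions: the results dictionary keyed by (n_forecasters, pnorm) is a hypothetical container, not the simulator's actual output format.

```python
import numpy as np

def nan_report(results):
    """Fraction of NaN losses for each configuration.

    results : dict mapping (n_forecasters, pnorm) -> 1-D array-like of
              combinator_loss values (hypothetical layout).
    An empty series is reported as fully missing (1.0).
    """
    report = {}
    for key, losses in results.items():
        losses = np.asarray(losses, dtype=float)
        report[key] = float(np.isnan(losses).mean()) if losses.size else 1.0
    return report

# Toy input: the last configuration returns only NaN, as seen for f1.
toy = {
    (1, 2.0): [0.12, 0.10, 0.11],
    (3, 3.0): [0.08, np.nan, 0.09, 0.07],
    (6, 3.5): [np.nan, np.nan, np.nan],
}
for key, frac in nan_report(toy).items():
    print(key, f"{frac:.2f}")
```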

I am re-running the tests using a different seed to see if that makes a difference.

I ran the larger (f2) network for seeds 0 to 10. The ~1.5-2.0 orders of magnitude improvement in network loss when going from n_forecasters = 0 to n_forecasters > 0 was universal. So in the plots below I ignore the n_forecasters = 0 result and show the f2 results for n_forecasters = 1, 3 and 6.

Having at least one forecaster improves the network loss by a factor of several up to an order of magnitude.

In general, increasing the number of forecasters improves network loss. (Seed 5 is the obvious exception.)

In general, having a lower pnorm improves network performance (lower combinator_loss). However, for a given n_forecasters (n_aggregators), the variation of network loss with pnorm is typically small (~0.2 dex) and comparable in magnitude to the uncertainty in the median loss.

Intermediate summary: if the above plots are representative of small-N networks, they suggest that (i) a low value of pnorm is likely favourable, and (ii) fine-tuning of pnorm is unlikely to give large improvements in network performance (<1.5x max).

These plots show the pnorm values corresponding to the minimum network loss (combinator_loss) for each n_forecaster (called n_aggregator in the plots). Random seeds 0 to 10 are used to determine the median and bootstrapped (10,000 samples) error on the median of the minimum pnorm values. The top and bottom plots are for the small (“f1”) and large (“f2”) networks, respectively.
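For reference, picking out the pnorm of minimum loss per seed can be sketched like this. The array layout (one row per seed, one column per pnorm value) and the toy numbers are assumptions for illustration; NaN entries are skipped rather than treated as minima.

```python
import numpy as np

def optimal_pnorm(pnorm_grid, median_loss):
    """pnorm at which the median loss is smallest, ignoring NaN entries.

    Returns NaN if every loss value is NaN.
    """
    pnorm_grid = np.asarray(pnorm_grid, dtype=float)
    median_loss = np.asarray(median_loss, dtype=float)
    if np.all(np.isnan(median_loss)):
        return np.nan
    return pnorm_grid[np.nanargmin(median_loss)]

# Toy loss curves: rows are seeds, columns are pnorm values (made-up numbers).
pnorm_grid = np.array([2.0, 2.5, 3.0, 3.5])
loss_by_seed = np.array([
    [0.30, 0.20, 0.25, 0.40],
    [0.35, 0.28, 0.22, 0.30],
    [0.50, 0.31, 0.33, np.nan],
])

best = np.array([optimal_pnorm(pnorm_grid, row) for row in loss_by_seed])
print("per-seed optimal pnorm:", best)
print("median across seeds:", np.median(best))
```

The per-seed values in best are what then gets resampled to put an error bar on the median.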

OK so it’s significant and the trend inverts between f1 and f2. So then we need to find out why. I suppose it’s not the reputers, so it must be the inferers?

(I’d say it can’t be forecasters either, because these we vary independently… but maybe it’s some ratio between inferers and forecasters.)

OR (option 2) it’s just noise! But I’m afraid we’ll need to fill in n_inferers more finely and then figure out a way of assessing whether it’s n_inferers or n_inferers/n_forecasters.

Agreed. To investigate this, I ran a more finely sampled grid in steps of pnorm for different {n_inferers, n_forecasters} combinations. I limited the grid to a maximum of 10 inferers and 10 forecasters, as the run time increases non-linearly with the network size and becomes prohibitively long for parameter space studies with more than several tens of workers.
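As a rough sketch, a grid of this kind could be enumerated as below. The pnorm range (2 to 3.5 in steps of 0.1), seeds 0 to 10 and n_reputers = 5 are the values quoted for the heat map later in this thread; the lower bound of 1 for both worker counts and the dictionary keys are my assumptions, and no simulator call is shown.

```python
import itertools
import numpy as np

# pnorm from 2.0 to 3.5 in steps of 0.1 (16 values), rounded to avoid
# floating-point artefacts such as 2.3000000000000003.
pnorm_values = [float(p) for p in np.round(np.linspace(2.0, 3.5, 16), 1)]
n_inferers_values = range(1, 11)     # assumed lower bound of 1, maximum of 10
n_forecasters_values = range(1, 11)  # assumed lower bound of 1, maximum of 10
seeds = range(0, 11)                 # seeds 0 to 10
N_REPUTERS = 5

configs = [
    {
        "n_inferers": n_inf,
        "n_forecasters": n_fc,
        "n_reputers": N_REPUTERS,
        "pnorm": p,
        "seed": s,
    }
    for n_inf, n_fc, p, s in itertools.product(
        n_inferers_values, n_forecasters_values, pnorm_values, seeds
    )
]

print(len(configs), "runs in total")
```

Writing the grid out explicitly makes the cost visible: every extra pnorm step or seed multiplies the number of runs, which is why the maximum network size has to stay modest.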

Initial results from this grid showed that:
(i) the results do not depend strongly on the number of reputers
(ii) pnorm values of < 3.5 are preferred

The following heat map shows the median pnorm corresponding to the minimum combinator_loss across seeds 0 to 10 using steps in pnorm of 0.1 between 2 and 3.5, and for n_reputers = 5.

To calculate the uncertainty on the median optimal pnorm for each {n_predictors, n_aggregators} pairing, I resampled the 11 per-seed values of the pnorm at minimum loss with replacement 10,000 times, computed the median of each resample, and then took the standard deviation of those 10,000 medians.
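One reading of this procedure, applied to every cell of the heat map at once, is sketched below. The array layout (seed, n_inferers, n_forecasters), the function name and the synthetic input are all assumptions; the same resampled seed indices are used for every cell, which is a simplification rather than a requirement.

```python
import numpy as np

def heatmap_median_and_error(best_pnorm, n_boot=10_000, seed=0):
    """Median optimal pnorm per grid cell and its bootstrap uncertainty.

    best_pnorm : array of shape (n_seeds, n_inferers, n_forecasters) holding
                 the pnorm of minimum loss for each seed (hypothetical layout).
    Returns two arrays of shape (n_inferers, n_forecasters).
    """
    best_pnorm = np.asarray(best_pnorm, dtype=float)
    n_seeds = best_pnorm.shape[0]
    rng = np.random.default_rng(seed)
    # Resample the seed axis with replacement, n_boot times.
    idx = rng.integers(0, n_seeds, size=(n_boot, n_seeds))
    # Shape (n_boot, n_seeds, n_inferers, n_forecasters).
    resampled = best_pnorm[idx]
    # Median over the resampled seeds, then spread over the resamples.
    boot_medians = np.median(resampled, axis=1)
    return np.median(best_pnorm, axis=0), boot_medians.std(axis=0, ddof=1)

# Synthetic example: 11 seeds on a 3 x 3 grid (values are random, not results).
demo_rng = np.random.default_rng(3)
demo = demo_rng.choice(np.round(np.linspace(2.0, 3.5, 16), 1), size=(11, 3, 3))
median_map, error_map = heatmap_median_and_error(demo)
print(median_map)
print(error_map)
```

With only 11 seeds per cell and pnorm sampled in steps of 0.1, the bootstrap medians take few distinct values, so the resulting error bars are themselves fairly coarse.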

Some observations from the above heat map analysis:

The heat maps for individual seeds show variation across the full range of pnorm.

There are no obvious trends in optimal values of pnorm in the (n_inferers, n_forecasters) plane from the individual seeds.

The median plot over all seeds suggests a “standard” value of pnorm close to 2.5, with that value increasing by ~10% when n_inferers > n_forecasters and decreasing by a similar amount when n_inferers < n_forecasters.

The uncertainty is significantly larger when there are a small number of inferers and n_forecasters > n_inferers.

Thanks! This is very clear. So I think the guideline for topic creators is that the default value of p_norm = 3 is fine. We restrict the permitted values to p_norm > 2.5 anyway.

The other lesson is that the default value works better for large n_inferers and n_forecasters < n_inferers. Topic creators could potentially fine-tune this to p_norm = 2.7-2.8 if they expect significantly improved performance.

We should extend the tests to very large numbers of inferers, like n_inferers > 30 (obviously less finely sampled). I expect some topics will have that many quite soon. A log-stretch in N will be the natural way to go. Might spin up a new forum thread for that…
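A log-stretched set of network sizes could be generated along these lines. This is only a sketch: the helper name and the example range of 3 to 100 inferers are placeholders, not an agreed test plan.

```python
import numpy as np

def log_spaced_counts(n_min, n_max, n_points):
    """Integer counts spaced roughly evenly in log(N) between n_min and n_max.

    Rounding to integers can create duplicates at small N, so the result
    may contain fewer than n_points entries.
    """
    raw = np.geomspace(n_min, n_max, n_points)
    counts = np.unique(np.round(raw).astype(int))
    return counts.tolist()

# Example range only; the upper limit of 100 inferers is an assumption.
n_inferers_values = log_spaced_counts(3, 100, 8)
print(n_inferers_values)
```

Each step multiplies N by a roughly constant factor, so the expensive large-N runs are sampled sparsely while the small-N regime already covered in this thread stays well resolved.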

This illustrates slightly differently what you already found @steve, i.e. the pnorm where the network loss is minimised is ~3 for regression, independently of the number of participants, but ~5 for classification. For classification, the improvement of the network loss with the number of participants stagnates completely above 10 inferers, which means the active set of workers can be limited to smaller numbers for classification (@kenny will find this interesting!).