The topic variable, epsilon (called f_tolerance in the simulator), controls the numerical precision that the network can achieve. Different problems will require different numbers of significant digits. If epsilon is too large, it limits the network’s ability to distinguish between losses: for problems where the network needs to detect very small log-loss differences, a large value of epsilon would result in the same regrets, weights, scores, and rewards for all workers and reputers. Therefore the topic creator must set the value for epsilon. Once set, it should be immutable, or at least only be changed through some form of voting consensus by the topic participants. The goal here is to quantify the mapping between epsilon and the numerical precision (~number of significant digits) that can be meaningfully handled by the network.
My preliminary investigations working directly with the network loss (combinator_loss) and various precision metrics as a function of network participants did not prove useful in defining the network precision. After helpful discussions with @Apollo11 (thanks!), it became clear the best way to quantify the epsilon-to-precision mapping was directly through the get_weights function.
The strategy to determine the epsilon → numerical precision mapping is as follows:
- Understand coarsely how the output weight distribution depends on the input regret distribution and epsilon values in the get_weights function.
- Identify regimes in the regret-epsilon space in which the spread of weights is either large enough to distinguish between different workers, or too small to distinguish between different workers.
- Run a finely sampled Monte-Carlo parameter space study to determine the transition point(s) in [mean(regrets), std(regrets), epsilon] space at which the network is (un)able to distinguish between different workers.
The following plots demonstrate important regimes of behaviour in the mapping between regrets and weights when simulating a simple network with a small number of participants.
Negative regrets: The y-axis gives the resulting weight from the get_weights function for the epsilon (f_tolerance variable in the simulator) value on the x-axis and for a linearly spaced regret distribution with the maximum and minimum specified in the title. The left-hand plot shows the case where the gradient of the potential is used to map regrets to weights (relevant for Eq 8 and 11 in the white paper). The right-hand plot shows the case where the potential is used to map regrets to weights (relevant for Eq 28 in the white paper). The solid (dotted) lines show where epsilon has (not) been multiplied by np.abs(np.median(regret)) in the denominator of Eq 8, 11 and 28. The vertical dashed line shows std(regrets)/np.abs(np.median(regret)).
These plots summarise the general behaviour when the regrets are negative.
When the regrets are negative, both the potential and gradient of the potential are ~power-laws. To the left of the dashed vertical line, at low epsilon (f_tolerance) values, the std(regrets) term dominates the denominator of Eq 8, 11 and 28. In this regime, epsilon (f_tolerance) is not important, and the ~power-law regrets-weights mapping means the weights are well spread. In this regime the network can distinguish between the different workers.
As epsilon (f_tolerance) increases past the dashed vertical line, it eventually dominates the denominator of Eq 8, 11 and 28. The large values of epsilon (f_tolerance) cause all the normalised regrets to shrink closer to zero, and the range of normalised regrets to decrease. At sufficiently large epsilon (f_tolerance) values, the resulting weights are essentially indistinguishable.
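To make the two regimes concrete, here is a minimal sketch of the normalisation step only. It assumes the denominator has the additive form std(regrets) + epsilon * |median(regrets)| (or std(regrets) + epsilon for the unscaled case), which is my reading of the description above rather than a copy of the simulator code. The function name normalise_regrets and the regret values are illustrative.

```python
import numpy as np

def normalise_regrets(regrets, epsilon, scale_by_median=True):
    """Divide regrets by a denominator containing std(regrets) and epsilon.

    ASSUMPTION: denominator = std(regrets) + epsilon * |median(regrets)|,
    or std(regrets) + epsilon when scale_by_median is False.
    """
    regrets = np.asarray(regrets, dtype=float)
    eps_term = epsilon * np.abs(np.median(regrets)) if scale_by_median else epsilon
    return regrets / (np.std(regrets) + eps_term)

# Illustrative, linearly spaced, all-negative regrets.
regrets = np.linspace(-2.0, -1.0, 5)

ranges = {}
for eps in (1e-6, 1.0, 1e3):
    r_norm = normalise_regrets(regrets, eps)
    ranges[eps] = r_norm.max() - r_norm.min()
    print(f"epsilon={eps:g}  range of normalised regrets={ranges[eps]:.3g}")
```

At small epsilon the range is set by std(regrets) alone; once epsilon dominates the denominator the range shrinks roughly as 1/epsilon, which is what squeezes the weights together.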
The bottom plot illustrates what happens when the spread in regrets is smaller (1E-3). The qualitative behaviour is the same, but the point at which the weights become indistinguishable moves to much lower epsilon (f_tolerance) values.
This comparison demonstrates why epsilon (f_tolerance) should vary with topic, and is important for setting the network precision.
Positive regrets: The following plots show the general behaviour when all the regrets are positive.
Focus first on the top left-hand plot, where the gradient of the potential is used to map normalised regrets to weights. For normalised regrets >~2, the gradient of the potential is flat. Therefore, when all normalised regrets are >~2, the get_weights function returns identical weights for all workers, and it is not possible to distinguish between them. This explains the very small spread in weights for low epsilon (f_tolerance) values.
As epsilon (f_tolerance) increases and dominates the denominator in Eq 8 and 11, the normalised regrets decrease and become <~2. At this point, the gradient of the potential transitions from flat to a power-law, leading to the spread in the weights of different workers seen at epsilon (f_tolerance) ~1-10. As epsilon (f_tolerance) increases further, the normalised regrets are pushed towards zero and the spread in the normalised regrets decreases, leading to all workers having similar weights again at large epsilon (f_tolerance).
The right-hand plot, where the potential (rather than the gradient of the potential) is used to map normalised regrets to weights, shows qualitatively similar behaviour.
The next plots show how the spread in weights between workers varies as the maximum regret and spread in regrets changes. The qualitative trend is the same, but the shift between scaled and not-scaled epsilon (f_tolerance) values changes. This again highlights why epsilon (f_tolerance) cannot be fixed for all topics.
Regrets close to zero: Having covered the two extremes in regret space (all negative and all positive >2), here are some examples where the regrets are closer to zero.
In this first example, the left-hand plot (mapping function = gradient of the potential) shows it is not possible to distinguish any of the workers at low epsilon (f_tolerance), whereas the right-hand plot (mapping function = potential) shows the network can distinguish the poorest performing participant (lowest regret) from the other workers at low epsilon (f_tolerance).
In the second and third examples, where the input regret lies between 0 & 1 and 0 & 2, respectively, the behaviour is intermediate between the negative and positive extremes.
Loving the systematic approach here! It looks like this will result in a pretty deterministic f_tolerance-precision mapping.
Thanks! I hope so.
Perhaps at this stage it’s worth reflecting on what I think are some pertinent points from the exploration thus far:
- For all input regret distributions, at large enough epsilon (f_tolerance), the weights of all workers become identical.
- The epsilon (f_tolerance) at which this happens varies with the input regret distribution.
- Unless all the regrets are positive, it is not desirable to have all workers with the same weight. Therefore, care must be taken to choose an epsilon (f_tolerance) value that is small enough that it avoids the network being in this part of parameter space.
- When the regrets are all negative, the spread in weights is constant at low epsilon (f_tolerance) and then shrinks as epsilon (f_tolerance) approaches, and then exceeds, std(regrets).
- When the regrets lie in the transition region between the steep and shallow/flat mapping functions (i.e. regret<~2), there exists a maximum in the spread of weights where epsilon (f_tolerance) ~ np.abs(np.median(regret)).
Having understood this qualitative behaviour, my suggested next steps are to (i) run a finely sampled grid of regret distributions over a range of epsilon (f_tolerance) values, (ii) measure the resulting spread in weights, (iii) determine the parts of parameter space where the spread in weights is sufficient to distinguish between different workers. This will need to be done for both the mapping functions (potential and gradient of the potential).
To generate synthetic regret distributions, I selected the number of workers, N, and randomly drew N values from Gaussian distributions with a given mean and std. I repeated this over a range of mean and std values to create a grid of synthetic regret distributions. I then ran get_weights for each of these synthetic regret distributions for a range of epsilon (f_tolerance) values, using both the potential and gradient of the potential to map normalised regrets to weights. For each (mean, std, epsilon) combination this produced N weights. I took the std of these weights to quantify and compare the weight spread for different (mean, std, epsilon) combinations.
I generated the std(weights) for the following grid:
sampling = 10 # Number of samples in the range
num_workers = 10 # Number of workers in the network
mean_range = np.linspace(-20, 20, sampling) # Range of means
std_range = np.logspace(-6, 6, sampling) # Range of standard deviations
f_tolerance_values = np.logspace(-10, 10, sampling) # Range of f_tolerance values
For each (mean, std) combination, I then determined (i) the epsilon (f_tolerance) value for which the std(weights) was largest (max_f_tolerance); (ii) the std(weights) at the max_f_tolerance value.
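For anyone wanting to reproduce the structure of this study, the loop can be sketched roughly as follows. Note that toy_get_weights is a hypothetical stand-in I wrote for illustration (an exponential mapping that saturates above a normalised regret of 2, with an assumed additive denominator); it is not the simulator's get_weights and will not reproduce the heat maps. Swap in the real function to get meaningful numbers. The helper name weight_spread_grid and the seed are also illustrative.

```python
import numpy as np

def toy_get_weights(regrets, epsilon):
    """HYPOTHETICAL stand-in for the simulator's get_weights (not the real one)."""
    # ASSUMPTION: additive denominator std + epsilon * |median|.
    denom = np.std(regrets) + epsilon * np.abs(np.median(regrets))
    # Clip to avoid underflow at very negative values; flat above 2.
    x = np.clip(regrets / denom, -50.0, 2.0)
    w = np.exp(x)
    return w / w.sum()

def weight_spread_grid(get_weights, mean_range, std_range, eps_values,
                       num_workers=10, seed=0):
    """Return std(weights) as an array of shape (n_mean, n_std, n_eps)."""
    rng = np.random.default_rng(seed)
    spread = np.empty((len(mean_range), len(std_range), len(eps_values)))
    for i, mu in enumerate(mean_range):
        for j, sigma in enumerate(std_range):
            # One synthetic network: N Gaussian regrets with this mean and std.
            regrets = rng.normal(mu, sigma, num_workers)
            for k, eps in enumerate(eps_values):
                spread[i, j, k] = np.std(get_weights(regrets, eps))
    return spread

sampling = 10
mean_range = np.linspace(-20, 20, sampling)
std_range = np.logspace(-6, 6, sampling)
f_tolerance_values = np.logspace(-10, 10, sampling)

spread = weight_spread_grid(toy_get_weights, mean_range, std_range,
                            f_tolerance_values)

# For each (mean, std) cell: epsilon giving the largest spread, and that spread.
best = np.argmax(spread, axis=2)
max_f_tolerance = f_tolerance_values[best]
max_spread = np.max(spread, axis=2)
```

Passing get_weights in as an argument keeps the grid loop independent of the mapping function, so the same loop can be run once with the potential and once with its gradient.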
The following heat maps show the max_f_tolerance value [left] and the std(weights) at the max_f_tolerance value [right], for the gradient of the potential [top] and potential [bottom] mapping from regrets to weights.
The grid is deliberately coarsely sampled over a large range of input values to start by analysing the general trends.
In this experimental setup, the input regret distribution is Gaussian by design. So for the above experiments I took std(weights) to measure the variation in weights. However, in a typical network, it is plausible (even likely) that the weights will be skewed and/or vary over several orders of magnitude. Indeed, machine learning networks often have a heavy-tailed weight distribution with outliers. In this case, std(log10(weights)) will likely provide a clearer and more interpretable measure of spread.
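A quick illustration of why the log-based metric is more informative for skewed weights. The two weight vectors below are made-up values, chosen only so that the small weights differ by orders of magnitude while the dominant weight is the same.

```python
import numpy as np

# Made-up weight vectors that each sum to 1.
weights_a = np.array([0.9, 0.05, 0.05])      # small weights are equal
weights_b = np.array([0.9, 0.0999, 0.0001])  # small weights differ by ~3 dex

lin_a, lin_b = np.std(weights_a), np.std(weights_b)
log_a, log_b = np.std(np.log10(weights_a)), np.std(np.log10(weights_b))

print(f"std(weights):        a={lin_a:.4f}  b={lin_b:.4f}")
print(f"std(log10(weights)): a={log_a:.4f}  b={log_b:.4f}")
```

std(weights) is dominated by the largest weight and barely changes between the two cases, whereas std(log10(weights)) responds to the spread among the small weights.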
Here are the same heat maps as above, this time using std(log10(weights)) as the metric to maximise.
These heat maps confirm several of the qualitative trends seen before and reveal several other interesting points:
- When the regrets are all negative (bottom-left of each panel), the max_f_tolerance value takes the minimum allowed f_tolerance value, and the std(weights) is large.
- The spread in weights is smallest when the regrets are all positive (top-left of each panel).
- There is little variation in the max_f_tolerance value (left-hand plots) when the regrets are all positive (top-left of each panel). Having said that, the large dynamic range and coarse sampling will mask any variations at the level of factors of a few.
- When at least some of the regrets are negative (all portions of the plots outside the top-left) the std(weights) values at max_f_tolerance are all >~ 1E-2.
In terms of next steps, it is clear we need to separate out different sections of the (mean, std) parameter space before sampling the parameter space more finely. Otherwise the large dynamic range of max_f_tolerance and std(weights) will make it impossible to visualise any trends.
Turning back to the original aim, we need to link these results to network precision. A conceptually simple way of thinking about network precision is in terms of regrets and weights. The power of Allora is its ability to combine inferences from different workers in a way that systematically outperforms any individual worker. This requires being able to distinguish the performance of different workers, i.e., the weights of workers must be “meaningfully different”. Although “meaningfully different” is subjective, we can take a sensible threshold value (e.g. a factor of 3) and assess which parts of regret (mean, std) parameter space fulfil the criterion that the weights are separated by at least this threshold factor.
The heat maps below show the same coarse, broad parameter study using std(log10(weights)):
sampling = 10 # Number of samples in the range
num_workers = 10 # Number of workers in the network
mean_range = np.linspace(-20, 20, sampling) # Range of means
std_range = np.logspace(-6, 6, sampling) # Range of standard deviations
f_tolerance_values = np.logspace(-10, 10, sampling) # Range of epsilon
The third column now shows the maximum epsilon (f_tolerance) value for which the weights are separated by at least 10^0.5 ~ 3. If the cell is blank, there exists no epsilon (f_tolerance) value for which the weights are separated by the threshold factor.
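The third-column quantity can be extracted from a single std(log10(weights)) curve along these lines. This is a sketch: the helper name max_eps_above_threshold and the example curve are illustrative, and NaN stands in for a blank cell.

```python
import numpy as np

def max_eps_above_threshold(eps_values, log_spread, threshold=0.5):
    """Largest epsilon whose std(log10(weights)) is >= threshold, else NaN.

    A threshold of 0.5 dex corresponds to a factor of 10**0.5 ~ 3 in weight.
    """
    eps_values = np.asarray(eps_values, dtype=float)
    ok = np.asarray(log_spread, dtype=float) >= threshold
    return eps_values[ok].max() if ok.any() else np.nan

# Illustrative curve: spread is flat at low epsilon, then collapses.
eps = np.logspace(-4, 4, 9)
log_spread = np.array([1.2, 1.2, 1.1, 0.9, 0.6, 0.3, 0.1, 0.0, 0.0])

result = max_eps_above_threshold(eps, log_spread)
blank = max_eps_above_threshold(eps, np.zeros(9))
print(result, blank)
```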
Here are similar plots but for a different initial random seed:
When the regrets are all positive (top-left corner), there exists no epsilon (f_tolerance) value for which the weights are spread by more than the threshold factor. This is by design through the choice of potential function, and makes sense: if the regrets are all positive, all workers are outperforming the total network, so the network essentially becomes an averaging machine. The network should not be operating in this portion of parameter space often.
The third column of the above figures shows that when at least some regrets are negative, the maximum epsilon (f_tolerance) at which the std(log10(weights)) is larger than the threshold value does not depend on the mean regret. Again, this is by design through the choice of potential slope at negative regrets. In terms of linking epsilon (f_tolerance) to network precision, this is a very useful property, as it enables a unique epsilon (f_tolerance) to be mapped to an expected spread in regrets, which in turn can be linked to network precision.
Comparison of the plots with two different random seeds shows that the reason why some portions of the bottom-left corner (low mean(regrets), low std(regrets)) are blank and others aren’t is simply the stochastic sampling of the regrets when they are initialised.
Having thought some more about the above tests, I think a sensible approach to try and achieve our goals would be to:
- Define a threshold factor for which the network can “meaningfully distinguish” weights between different workers.
- Run a more finely grained parameter space study to determine the maximum epsilon (f_tolerance) value at which this threshold is reached (max_f_tolerance_thresh) for different spreads in input regret.
- Determine an analytical mapping function between std(regrets) and max_f_tolerance_thresh, so that a topic leader can input an expected spread in log-loss and get an epsilon (f_tolerance) value to use for that topic.
So here are plots with vertical lines showing the epsilon (f_tolerance) value corresponding to 0.5x and 0.3x the maximum standard deviation of log10(weight) for different regret distributions with means transitioning from negative to positive through zero. In cases where there are multiple epsilon (f_tolerance) values at which the std(log10(weights)) is equal to 0.5x and 0.3x the maximum std(log10(weights)), the largest f_tolerance is chosen. I ran and eyeballed a large parameter space and found the results are consistent across this range, so I only include representative plots at points where the parameters transition through key regimes:
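One way to locate the vertical lines on a sampled curve is sketched below. Because the curve is only known on a grid, it takes the last grid point at or above the target and interpolates linearly in log10(epsilon) to the next point; the interpolation is my own choice for the sketch, not necessarily what was used for the plots. The helper name eps_at_fraction_of_max and the example curve are illustrative.

```python
import numpy as np

def eps_at_fraction_of_max(eps_values, log_spread, fraction=0.5):
    """Largest epsilon at which std(log10(weights)) = fraction * its maximum."""
    eps = np.asarray(eps_values, dtype=float)
    s = np.asarray(log_spread, dtype=float)
    target = fraction * s.max()
    # Last grid point at/above the target picks the largest-epsilon crossing.
    k = np.nonzero(s >= target)[0][-1]
    if k == len(s) - 1:
        return eps[k]  # curve never drops below the target on this grid
    # Interpolate in log10(epsilon) between grid points k and k + 1.
    x0, x1 = np.log10(eps[k]), np.log10(eps[k + 1])
    t = (s[k] - target) / (s[k] - s[k + 1])
    return 10 ** (x0 + t * (x1 - x0))

# Illustrative curve on a coarse grid.
eps = np.logspace(-2, 2, 5)                  # 0.01, 0.1, 1, 10, 100
log_spread = np.array([1.0, 1.0, 0.8, 0.2, 0.0])

eps_half = eps_at_fraction_of_max(eps, log_spread, fraction=0.5)
print(eps_half)
```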
Awesome! This is what we needed I think
Yeah, I think we’re close now!
My impression is that a factor of 0.5x consistently does a good job. So here are heat maps showing log10(f_tolerance) corresponding to 0.5x the maximum standard deviation of log10(weight). In cases where there are multiple epsilon (f_tolerance) values at which the std(log10(weights)) is equal to 0.5x the maximum std(log10(weights)), the largest epsilon (f_tolerance) is chosen.
First, I ran a low resolution parameter study across a wide range of input mean and std:
sampling_mean = 10 # Number of samples in the range
sampling_std = 10
sampling_precision = 100
num_workers = 10 # Number of workers in the network
mean_range = np.linspace(-20, 20, sampling_mean) # Range of means
std_range = np.logspace(-6, 6, sampling_std) # Range of stdev
f_tolerance_values = np.logspace(-10, 10, sampling_precision)
which resulted in the following heat maps:
Here are comparable plots for a low res zoom in of negative regrets:
sampling_mean = 10 # Number of samples in the range
sampling_std = 10
sampling_precision = 1000
num_workers = 10 # Number of workers in the network
mean_range = np.linspace(-20, -1, sampling_mean) # Range of means
std_range = np.logspace(-6, -1, sampling_std) # Range of stdev
f_tolerance_values = np.logspace(-10, 10, sampling_precision)
The general trend is the same as before: the epsilon (f_tolerance) value at 0.5x max std(log10(weights)) depends strongly on the std(regrets) and hardly at all on mean(regrets).
On closer inspection, there appears to be an almost linear relationship between log10(std(regrets)) and the log10(f_tolerance) value at 0.5x max std(log10(weights)).
To investigate this, in the plots below I have:
- Taken the (mean, std, f_tolerance) array above and calculated log10 of f_tolerance and std(regrets).
- For each std(regrets) value, calculated the median and standard deviation of log10(f_tolerance) over all mean(regrets) values.
- Plotted an error bar scatter plot of median ± std log10(f_tolerance) vs. log10(std(regrets)).
- Fit and overlaid a linear model to the data with the fit results in the legend.
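The steps above can be sketched as follows. The eps_half array here is PLACEHOLDER data generated from an arbitrary power law with scatter, purely so the snippet runs; it is not the result of the parameter study, and the fitted slope and intercept it prints should not be used for a real topic. The unweighted np.polyfit fit and the helper name epsilon_for are also my own choices for the sketch.

```python
import numpy as np

rng = np.random.default_rng(0)
mean_range = np.linspace(-20, -1, 10)
std_range = np.logspace(-6, -1, 10)

# PLACEHOLDER (mean, std) grid of f_tolerance at 0.5x max std(log10(weights)).
# Arbitrary power law with 0.05 dex scatter; replace with the measured array.
scatter = rng.normal(0.0, 0.05, (len(mean_range), len(std_range)))
eps_half = 2.0 * std_range[None, :] * 10 ** scatter

# Step 1: work in log10 space.
log_eps = np.log10(eps_half)
log_std = np.log10(std_range)

# Step 2: median and std over all mean(regrets) values, per std(regrets).
med = np.median(log_eps, axis=0)
err = np.std(log_eps, axis=0)

# Step 4: unweighted straight-line fit, log10(eps) = slope * log10(std) + c.
slope, intercept = np.polyfit(log_std, med, 1)
print(f"slope={slope:.3f}  intercept={intercept:.3f}")

def epsilon_for(std_regrets):
    """Invert the fit: epsilon (f_tolerance) for an expected std(regrets)."""
    return 10 ** (intercept + slope * np.log10(std_regrets))

print(epsilon_for(1e-3))
```

The med and err arrays are what would go into the error bar plot (step 3); plotting is left out to keep the sketch short.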
I think this is pretty cool – these fits provide a straightforward way to determine epsilon (f_tolerance) for a given expected precision.
Great! Can you just clarify: log10(std) == numerical precision here?
Apologies, I should have made that clearer. Yes, that is correct. The topic creator would pick their desired numerical precision as the value on the x-axis and use the above relation to set epsilon (f_tolerance) to achieve that numerical precision.
All clear! And great that it’s such a nice power law relation in the end