Extend Allora's Inference Synthesis to classification tasks

In the whitepaper, we describe how Allora achieves Inference Synthesis using clearly defined losses, regrets, and ground truths, together with weighted averages of scalar target variables – all characteristic of supervised regression tasks. We need a strategy for extending this framework to classification tasks, where the ground truth is a label and the target variable may need adjustment.

To start with, we should go through the whitepaper with a fine-tooth comb to comprehensively identify all places where changes are needed. Then we need to determine how to support classification tasks in the most general form possible, add it to the simulator, and do some preliminary exploration. I’ll get going with this.

Our goal is to come up with the most general possible support for classification tasks (i.e. not restricting ourselves to binary classification or ordinal labels).

The current Inference Synthesis mechanism relies on scalar inferences (e.g. through using weighted averages, which are incompatible with labels). This means that the network is naturally more suited to handle classification using label probabilities. The difference between a numeric probability vector and a scalar inference is smaller than between a categorical and a scalar inference.

Many models typically predict probabilities for each class anyway. The inferred label is the one with the highest predicted probability.

The probability vector paradigm best supports the generalisation goal too. It aligns with how most classification models work and allows the use of weighted averages directly. The only requirement is that the loss functions chosen by the topic creator are continuous, satisfy lower=better, and are capable of comparing real-valued probability vectors to categorical ground truths (e.g. through one-hot encoding).
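As a minimal sketch of a loss that meets these requirements, the snippet below compares a probability vector to a one-hot encoded label using categorical cross-entropy. The function names, the `eps` clipping value, and the example numbers are illustrative assumptions, not part of the whitepaper or the simulator.

```python
import numpy as np

def one_hot(label, n_labels):
    """Encode an integer class label as a one-hot vector."""
    encoded = np.zeros(n_labels)
    encoded[label] = 1.0
    return encoded

def cross_entropy_loss(probabilities, label, eps=1e-12):
    """Categorical cross-entropy between a probability vector and a label.

    Continuous in the probabilities and lower = better.
    eps is an illustrative choice that avoids log(0).
    """
    truth = one_hot(label, len(probabilities))
    clipped = np.clip(probabilities, eps, 1.0)
    return float(-np.sum(truth * np.log(clipped)))

# Two hypothetical worker inferences for a three-class problem with true label 2.
confident_worker = np.array([0.1, 0.1, 0.8])
unsure_worker = np.array([0.3, 0.3, 0.4])
loss_confident = cross_entropy_loss(confident_worker, label=2)
loss_unsure = cross_entropy_loss(unsure_worker, label=2)

# The inferred label is the class with the highest predicted probability.
inferred_label = int(np.argmax(confident_worker))
```

Both workers infer the correct label here, but the loss still distinguishes them, which is the property the regret and weight calculations rely on.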

If the above conditions are satisfied, the changes to the current inference synthesis mechanism can be minor.

Sifting through the whitepaper then implies the following for each of the whitepaper elements.

Inference Synthesis (Section 3.1)

  • Eq. 1 (inference definition): the inference is a vector of probabilities, with one element per class.
  • Eq. 2 (forecasted loss definition): no change — the forecasted loss is a single number, where lower = better.
  • Eq. 3 (forecast-implied inference): the forecast-implied inference becomes a probability vector, which itself is a weighted average of the worker probability vectors.
  • Eq. 4 (forecasted regret): no change — the network loss and the forecasted loss are both single numbers, where lower = better, so the regret definition does not change.
  • Eq. 5 (weight): no change — the regret definition does not change, so this does not either.
  • Eqs. 6 & 7 (potential function and its gradient): this will require tuning to optimize the network performance.
  • Eq. 8 (regret normalization): no change — the regret definition does not change, so this does not either.
  • Eq. 9 (network inference): no change in form, but the network inference becomes a probability vector, which itself is a weighted average of the worker probability vectors.
  • Eq. 10 (weight): no change — the regret definition does not change, so this does not either.
  • Eq. 11 (regret normalization): no change — the regret definition does not change, so this does not either.
  • Eq. 12 (inference vector): given that each inference is a probability vector, this is now a probability matrix.
  • Eq. 13 (reputer-reported losses): no change — reputers will obviously need to adjust their loss functions, but a probability vector is reduced to a single loss.
  • Eq. 14 (stake-weighted losses): no change — the losses are single numbers, where lower = better, so the stake-weighting does not change.
  • Eq. 15 (regret): no change — the losses are single numbers, where lower = better, so the regret definition does not change.
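The weighted-average step (Eqs. 3 and 9) carries over to probability vectors without modification, as the sketch below illustrates. The function name, the array layout (one row per worker), and the weights are assumptions made for illustration; in the network the weights follow from the regrets.

```python
import numpy as np

def network_inference(inference_matrix, weights):
    """Weighted average of worker probability vectors.

    inference_matrix: shape (n_workers, n_labels), each row sums to one.
    weights: shape (n_workers,), non-negative and not all zero.
    """
    inference_matrix = np.asarray(inference_matrix, dtype=float)
    weights = np.asarray(weights, dtype=float)
    return weights @ inference_matrix / weights.sum()

# Hypothetical inferences from three workers for a three-class problem.
inferences = np.array([
    [0.7, 0.2, 0.1],
    [0.5, 0.3, 0.2],
    [0.2, 0.2, 0.6],
])
weights = np.array([0.5, 0.3, 0.2])
combined = network_inference(inferences, weights)  # [0.54, 0.23, 0.23]
```

Because the result is a convex combination of vectors that each sum to one, it is itself a valid probability vector and needs no renormalisation.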

Confidence Intervals (Section 3.2)

  • Eq. 16, 17, 18 (confidence intervals): these need careful evaluation and potentially rethinking. There is a naive solution where we apply these equations to each of the probabilities, but the probabilities are not independent variables so that requires more thought. An alternative could be to look at highest-p label frequencies across all inferences. Either way, this will certainly require more research.
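The two options mentioned above can be sketched as follows. This is only an illustration: the function names and the percentile levels are assumptions and are not taken from Eqs. 16 to 18.

```python
import numpy as np

def per_class_percentiles(inference_matrix, percentiles=(16, 50, 84)):
    """Naive option: percentiles of each class probability across inferences.

    Treats the classes as independent, which they are not.
    """
    return np.percentile(inference_matrix, percentiles, axis=0)

def label_frequencies(inference_matrix):
    """Alternative: fraction of inferences whose highest-p label is each class."""
    n_labels = inference_matrix.shape[1]
    top_labels = np.argmax(inference_matrix, axis=1)
    return np.bincount(top_labels, minlength=n_labels) / len(top_labels)

# Hypothetical inferences from four workers for a three-class problem.
inferences = np.array([
    [0.6, 0.3, 0.1],
    [0.5, 0.4, 0.1],
    [0.2, 0.7, 0.1],
    [0.4, 0.5, 0.1],
])
bands = per_class_percentiles(inferences)    # shape (3 percentiles, 3 classes)
frequencies = label_frequencies(inferences)  # [0.5, 0.5, 0.0]
```

Note that a row of per-class percentiles is not guaranteed to sum to one, which is one symptom of the dependence between the probabilities.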

Incentive Structure (Sections 4.1 and 4.2)

Because the losses are single numbers, nothing in these sections needs to change in principle. However, with quite fundamentally new loss functions or objective metrics, the score-to-reward mapping functions (Eqs. 28 and 35) will certainly require (re)evaluation and/or tuning.


For the changes needed for the inference synthesis mechanism, I assume those changes apply uniformly, not just in classification-focused topics, right? Do you foresee any of those changes affecting any other research that’s been done with the existing instantiation of the mechanism?


I’m afraid we’ll need to have a topic-level flag that says if it’s a classification problem or a regression problem. The reason is that the loss function behaviour may (!) be such that a re-tuning of the potential function and score-to-reward mapping is needed to optimise network performance.

We will need to test this.

The only case in which we would not need such a flag is if it turned out that our current design is sufficiently general out of the box and requires no re-tuning. Seems unlikely, but we won’t know until we’ve tried.

As a first step in extending the network simulator, we need it to generate mock categorical data. I’ve been going back and forth for a while to decide on how to best do this, but in the end I reminded myself that we do not want to model what is happening in a classification task, but we want to get realistic behaviour.

For this reason, I decided not to generate class labels, but label probabilities as the ground truth. The mock inferences can then be perturbations of these probabilities, just as with the regression problem.

The probabilities are drawn from a Dirichlet distribution, shown below for the example of three classes.


The nice thing about a Dirichlet distribution is that the probabilities sum to unity, and the alpha vector quantifies both how much the probabilities vary (through its normalisation; compare the top two panels) and how biased they are (through the differences between its elements).

This means that the normalisation of the alpha vector tells us how difficult the classification task is (less variation implies more similar probabilities, which indicates a higher difficulty), whereas the dissimilarity of its elements encodes how balanced the ground truth is (where a perfectly balanced ground truth implies that all labels have the same probability of occurring).

In our tests, we will want to consider data sets of varying difficulty and balance. I have therefore introduced a mapping from the parameters 0<=variation<=1 and 0<=balance<=1 to the alpha vector. (I consciously chose the nomenclature “variation” as this is not technically the variance.) This mapping is somewhat arbitrary (there are many ways in which one can do this), but I think it fulfils the above goals:

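import numpy as np

def f_dirichlet(n_epochs, n_labels, variation, balance):
    # Lower balance steepens the rank gradient of alpha (balance=1 gives a uniform alpha)
    alpha_base = np.arange(1, n_labels+1)**(2*(1-balance**0.5))
    alpha_base = alpha_base/np.median(alpha_base)
    # Higher variation lowers the normalisation of alpha (factor 100 at variation=0, 0.01 at variation=1)
    adjust_alpha = 10.**(4*(0.5-variation))
    alpha = alpha_base*adjust_alpha
    # Draw one probability vector per epoch; output shape is (n_epochs, n_labels)
    probabilities = np.random.default_rng().dirichlet(alpha, n_epochs)
    return probabilities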

The resulting standard deviation of the probabilities looks like this (generated over 1000 epochs):


An all-unity alpha vector is indicated by the dots. We see that the variation parameter controls the standard deviation of the probabilities. Only if the variation is zero (i.e. probabilities are as similar as possible) does the balance parameter affect the standard deviation.
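To make the mapping concrete, the sketch below isolates the alpha computation from f_dirichlet and evaluates it at the corners of the parameter space for five labels. The helper name `alpha_from_parameters` and the fixed seed are assumptions introduced for illustration only.

```python
import numpy as np

def alpha_from_parameters(n_labels, variation, balance):
    """Alpha vector implied by the variation and balance parameters."""
    alpha_base = np.arange(1, n_labels + 1) ** (2 * (1 - balance ** 0.5))
    alpha_base = alpha_base / np.median(alpha_base)
    return alpha_base * 10.0 ** (4 * (0.5 - variation))

alpha_default = alpha_from_parameters(5, variation=0.5, balance=1)   # all 1
alpha_similar = alpha_from_parameters(5, variation=0.0, balance=1)   # all 100
alpha_contrast = alpha_from_parameters(5, variation=1.0, balance=1)  # all 0.01
alpha_skewed = alpha_from_parameters(5, variation=0.5, balance=0)    # [1, 4, 9, 16, 25] / 9

# Empirical spread of the drawn probabilities (seed chosen for reproducibility).
rng = np.random.default_rng(0)
std_similar = rng.dirichlet(alpha_similar, 1000).std()
std_contrast = rng.dirichlet(alpha_contrast, 1000).std()
```

A large normalisation concentrates the draws around the mean probabilities, whereas a small one pushes them towards near one-hot vectors, so `std_similar` comes out well below `std_contrast`.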

Here are some examples of the probabilities drawn for default parameters.

{variation, balance} = {0.5, 1}:

array([[0.00366, 0.12113, 0.44394, 0.12758, 0.30369],
       [0.19183, 0.2053 , 0.07842, 0.13638, 0.38807],
       [0.18014, 0.40854, 0.16454, 0.11453, 0.13226],
       [0.24297, 0.21941, 0.38331, 0.00817, 0.14614],
       [0.03104, 0.30407, 0.45378, 0.12222, 0.08889]])

Here are those same examples when changing variation.

{variation, balance} = {0, 1}:

array([[0.19664, 0.20742, 0.19512, 0.19521, 0.20561],
       [0.18457, 0.22529, 0.17311, 0.21244, 0.20459],
       [0.20198, 0.19799, 0.19253, 0.1906 , 0.21689],
       [0.2085 , 0.17609, 0.20515, 0.19277, 0.21748],
       [0.17749, 0.20983, 0.21661, 0.21357, 0.1825 ]])

{variation, balance} = {1, 1}:

array([[0.     , 0.     , 0.     , 0.     , 1.     ],
       [1.     , 0.     , 0.     , 0.     , 0.     ],
       [0.     , 0.     , 1.     , 0.     , 0.     ],
       [0.71816, 0.     , 0.     , 0.     , 0.28184],
       [0.     , 1.     , 0.     , 0.     , 0.     ]])

This clearly affects the contrast (standard deviation) of the probabilities.

Here are those same examples when changing balance.

{variation, balance} = {0.5, 0.5}:

array([[0.15591, 0.1891 , 0.03249, 0.20507, 0.41743],
       [0.25992, 0.10724, 0.02145, 0.23007, 0.38133],
       [0.25912, 0.02841, 0.46691, 0.05982, 0.18575],
       [0.00858, 0.35486, 0.22506, 0.07287, 0.33863],
       [0.05241, 0.15501, 0.17667, 0.11402, 0.50189]])

{variation, balance} = {0.5, 0}:

array([[0.0005 , 0.27699, 0.36949, 0.03341, 0.31961],
       [0.00327, 0.36342, 0.05233, 0.13267, 0.44831],
       [0.00032, 0.01882, 0.06006, 0.58172, 0.33909],
       [0.     , 0.04659, 0.06193, 0.25559, 0.63588],
       [0.00044, 0.05993, 0.11238, 0.70294, 0.12432]])

This clearly affects the ordering of the probabilities, with low balance inducing a gradient of increasing probability with rank.

The interaction between the parameters is complicated, but logical. The standard deviation in the above figure is not the right parameter to probe this.

Here are the same examples when changing balance at high variation.

{variation, balance} = {1, 0.5}:

array([[0.     , 0.     , 0.     , 0.     , 1.     ],
       [0.     , 0.00001, 0.     , 0.99999, 0.     ],
       [0.     , 1.     , 0.     , 0.     , 0.     ],
       [0.     , 0.90876, 0.     , 0.     , 0.09124],
       [0.     , 0.     , 0.93811, 0.06189, 0.     ]])

{variation, balance} = {1, 0}:

array([[0.     , 0.     , 0.99664, 0.     , 0.00336],
       [0.     , 1.     , 0.     , 0.     , 0.     ],
       [0.     , 0.     , 0.     , 0.00009, 0.99991],
       [0.     , 0.     , 0.     , 0.02258, 0.97742],
       [0.     , 0.     , 0.00001, 0.     , 0.99999]])

High contrast, but increasingly skewed to the higher ranks for lower balance.

Here are the same examples when changing balance at low variation.

{variation, balance} = {0, 0.5}:

array([[0.10048, 0.13935, 0.22021, 0.25165, 0.28831],
       [0.12065, 0.15535, 0.17098, 0.28649, 0.26653],
       [0.09535, 0.18697, 0.2128 , 0.20971, 0.29518],
       [0.09062, 0.16101, 0.23497, 0.21562, 0.29779],
       [0.11131, 0.1559 , 0.21263, 0.23472, 0.28544]])

{variation, balance} = {0, 0}:

array([[0.01953, 0.08559, 0.17119, 0.25132, 0.47238],
       [0.01626, 0.06522, 0.1777 , 0.29622, 0.44459],
       [0.01388, 0.06034, 0.15471, 0.30948, 0.46159],
       [0.01842, 0.06403, 0.18416, 0.27824, 0.45515],
       [0.01976, 0.09072, 0.14122, 0.30183, 0.44647]])

Low contrast, and increasingly skewed to the higher ranks for lower balance, to the point that it can almost guarantee that the highest-rank element always has the highest probability.

I think these two parameters will give us a great way of testing network performance for a wide set of problem properties, as parameterised by the variation and balance parameters. I will go ahead and implement this. The run_network function will take four new arguments:

  • problem_type is a string with possible values 'regression' and 'classification'. This variable technically is not needed (it could be implicitly encoded by setting the next variable to zero), but I prefer a version where this distinction is made explicitly and the next variable simply has no functionality if problem_type = 'regression'.
  • n_labels is an integer setting the number of labels used in the problem. This parameter only does something when we set problem_type = 'classification'.
  • variation is a real number in [0, 1] that indicates the degree of random variation between the probabilities, with higher contrasts reached for higher variation. Default is variation=0.5. This parameter only does something when we set problem_type = 'classification'.
  • balance is a real number in [0, 1] that indicates how balanced the probabilities are, with more pronounced systematic ordering occurring for lower balance. Default is balance=1. This parameter only does something when we set problem_type = 'classification'.

The actual “ground truth” generation (with labels being the ground truth) then becomes as simple as this:

import numpy as np
import pandas as pd

def get_class_history(num_periods=1000, n_labels=5, variation=0.5, balance=1, start_date='2020-12-01'):
    probabilities = f_dirichlet(num_periods, n_labels, variation, balance)
    # One-hot encode the highest-probability label of each period
    labels = (probabilities == np.max(probabilities, axis=1, keepdims=True)).astype(int)
    time = pd.date_range(start=start_date, periods=num_periods, freq='D') # Generate the time axis
    return time, labels, probabilities

@nick @steve @Renata


We can keep the same infrastructure for generating the raw inferences, i.e. each worker has its own error and bias. We take logit(p) to map the probabilities from [0,1] to [-∞,∞], perturb them in that space, and take sigmoid(logit(p)+perturbation) to map the perturbed probabilities back from [-∞,∞] to [0,1]. Finally, we normalise their sum to unity.

The logit/sigmoid transformation dampens the perturbations by about a factor of four. This is because near x=0 (i.e. p=0.5), the slope of sigmoid(a*x) is a/4, which equals unity (the slope of x) only for a=4. To retain a roughly linear impact in real probability space near p=0.5, we multiply the perturbation by four for classification tasks.
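The factor of four can be checked numerically. The derivative of sigmoid(x) at x=0, which corresponds to p=0.5, is 1/4, so a small perturbation in logit space moves the probability by only a quarter as much. The helper names below are illustrative.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def slope_at_zero(a, h=1e-6):
    """Central-difference slope of sigmoid(a*x) at x=0 (i.e. at p=0.5)."""
    return (sigmoid(a * h) - sigmoid(-a * h)) / (2 * h)

slope_plain = slope_at_zero(1.0)   # approximately 0.25
slope_scaled = slope_at_zero(4.0)  # approximately 1.0
```

This only holds near p=0.5; towards p=0 or p=1 the sigmoid flattens and the same perturbation has a smaller effect in probability space.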

The resulting predictor now looks like this:

def get_predictor_output(x_obs, y_obs, error, bias, age, outperform_flag, problem_type):
    factor = outperform_factor(outperform_flag)
    xp = experience_factor(age) # predictor experience
    error = factor*xp*error # get predictor error
    bias = factor*xp*bias # get predictor bias
    difference = np.random.normal(bias, error, np.shape(y_obs)) # Draw the random perturbations
    if problem_type == 'regression':
        y_pred = y_obs + difference # Perturb the observations directly
    elif problem_type == 'classification':
        y_pred = sigmoid(logit(y_obs) + 4*difference) # Perturb the probabilities in logit space
        y_pred = y_pred / y_pred.sum(axis=-1, keepdims=True) # Normalize each probability vector to unit sum
    return y_pred

I played with the factor of 4 for a while to get reasonable similarity between the calibration of the perturbations in regression and classification tasks. In the end, I decided to add a boost factor of 3 for the bias and error in classification tasks, so that the parameters used to describe mock inference accuracy in regression tasks have a qualitatively similar impact for classification tasks. So then we get this:

def get_predictor_output(x_obs, y_obs, error, bias, age, outperform_flag, problem_type):
    factor = outperform_factor(outperform_flag)
    xp = experience_factor(age) # predictor experience
    error = factor*xp*error # get predictor error
    bias = factor*xp*bias # get predictor bias
    if problem_type == 'regression':
        difference = np.random.normal(bias, error, np.shape(y_obs)) # Draw the random perturbations
        y_pred = y_obs + difference # Perturb the observations directly
    elif problem_type == 'classification':
        boost = 3 # Boost so that error and bias have a similar impact as in regression tasks
        difference = np.random.normal(boost*bias, boost*error, np.shape(y_obs)) # Draw the boosted random perturbations
        y_pred = sigmoid(logit(y_obs) + 4*difference) # Perturb the probabilities in logit space
        y_pred = y_pred / y_pred.sum(axis=-1, keepdims=True) # Normalize each probability vector to unit sum
    return y_pred

This puts in place everything that is needed for the mock data generation. The rest of the changes will involve infrastructure for supporting inference vectors and all other points from the to-do list at the top of this thread.
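For readers who want to try the perturbation step in isolation, here is a self-contained sketch. It is not the simulator code: `perturb_probabilities` is a hypothetical helper, `logit` and `sigmoid` are defined locally, the experience and outperformance factors are left out, and the normalisation is applied per row on the assumption that each row is one probability vector.

```python
import numpy as np

def logit(p):
    return np.log(p / (1.0 - p))

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def perturb_probabilities(probabilities, bias, error, rng, boost=3):
    """Perturb probability vectors in logit space and renormalise each row."""
    difference = rng.normal(boost * bias, boost * error, np.shape(probabilities))
    perturbed = sigmoid(logit(probabilities) + 4 * difference)
    return perturbed / perturbed.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(42)  # fixed seed, illustrative only
truth = np.array([[0.2, 0.3, 0.5],
                  [0.6, 0.3, 0.1]])
mock_inference = perturb_probabilities(truth, bias=0.0, error=0.05, rng=rng)
```

Each row of `mock_inference` is again a valid probability vector, so it can be passed to the same loss functions as the ground-truth probabilities.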


Looks great to me! Using balance and variation parameters to set the task difficulty is very neat :smiley:


Now all we need is to add probability vector support throughout the simulator. This entails adding a problem_type flag (which can take regression or classification values), and inserting if statements wherever we need to increase the dimension of an array by one. It also requires a new loss function that sums the losses of the individual vector elements. None of this is necessarily exciting or challenging, but after putting all of it in I can say it was just quite some work… :slight_smile:

So anyway, the simulator now supports classification throughout the entire pipeline. I’ve run a first test with a simple L2 loss and three labels. Here are the figures that we know from the earlier regression tests.

Inferred vs “observed” probabilities (not really observed, obviously; see the post above):

I also added a new figure to the simulator with all kinds of performance metrics that are relevant to classification tasks only:


Don’t take the actual numbers too seriously. However: the network outperforms all models. Nice.

Forecasters perform as usual, i.e. only the ones with high context sensitivity correctly predict contextual outperformance:

The reputers report normal losses for these forecasters (note these are the L2 loss of the forecasted loss in comparison to the true loss – this meta-loss is not reported within the actual network, where forecasters are judged and rewarded based on the forecast-implied inferences), with the context-sensitive Forecaster 2 clearly being better:

The network doesn’t really outperform the best forecaster yet (probably because in this example the models are all too close, see next figure), but it does outperform any individual inferer (compare to the first figure in this post):

And here are the losses of all inferences, both from the reputers and the full network. I ran this before setting boost=3 in the inference output (see previous post), which is why the losses are so close to one another. I don’t believe this produces a realistic spread in worker performance, but that’s just a simulator setting – nothing related to the actual network design for classification.


(Also, I’ve fixed the cold start issue that is visible for Inferer 5 since making this figure; that was related to reputers applying an EMA too strictly, i.e. with super low alpha, even when the network is still young.)

Weights behave as normal, and the most context-aware forecaster receives the highest weight:

Also error bars are good; each probability gets its own prediction, and the network inferences still fall within the percentiles:


Obviously the definition of error bars/confidence intervals needs work. It’s not obvious to me that this is how we’d want to be quantifying them. Maybe something like the PDF (incidence fraction) of inferred labels across all inferences within the network is more useful.

Scores are fine, but here we run into the problem that we calibrated the forecaster task score on an absolute scale. This is clearly wrong, as demonstrated by the figure that shows how the amplitude is much smaller for the same loss function in a classification problem. We kind of knew this, but hoped it wouldn’t matter. This will need fixing.

Listening coefficients are fine:

Smoothed reward fractions are fine:

Actual rewards are fine, just the forecasters are low due to the problem with the low forecaster task score that was identified above and will be fixed. Also, one inferer wins them all. Scrolling back up to the first figure, it bothers me a little bit that this doesn’t seem to be the best inferer. Needs a closer look.

Either way, this means that we now have classification running in the simulator.


To follow up on the above, I also put in a broader suite of loss functions. Specifically, the simulator now supports:

  • mean squared error (MSE; L2)
  • mean squared error (MSE; L2) for classification (summed over labels)
  • mean absolute error (MAE; L1)
  • mean absolute error (MAE; L1) for classification (summed over labels)
  • categorical cross-entropy (CCE) loss
  • multi-class hinge loss
  • focal loss
  • Kullback-Leibler divergence

The full list of valid loss function choices is now:
{mse, mae, cce, hinge_mc, focal, kl_div, mse_class, mae_class}.
The first two of these work for regression problems, all others for classification problems.
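As a rough illustration of what three of the classification losses might look like, consider the sketch below. The function names mirror the option names above, but the implementations, the `eps` clipping, and the averaging over epochs are assumptions and may differ from the simulator.

```python
import numpy as np

def mse_class(y_true, y_pred):
    """Squared error summed over labels, averaged over epochs."""
    return float(np.mean(np.sum((y_true - y_pred) ** 2, axis=-1)))

def cce(y_true, y_pred, eps=1e-12):
    """Categorical cross-entropy, averaged over epochs."""
    log_q = np.log(np.clip(y_pred, eps, 1.0))
    return float(np.mean(-np.sum(y_true * log_q, axis=-1)))

def kl_div(y_true, y_pred, eps=1e-12):
    """Kullback-Leibler divergence D(y_true || y_pred), averaged over epochs."""
    p = np.clip(y_true, eps, 1.0)
    q = np.clip(y_pred, eps, 1.0)
    return float(np.mean(np.sum(y_true * np.log(p / q), axis=-1)))

# Hypothetical one-hot ground truths and two sets of inferences (two epochs, three labels).
y_true = np.array([[0.0, 0.0, 1.0], [1.0, 0.0, 0.0]])
y_good = np.array([[0.1, 0.1, 0.8], [0.7, 0.2, 0.1]])
y_poor = np.array([[0.4, 0.3, 0.3], [0.3, 0.4, 0.3]])

losses_good = [f(y_true, y_good) for f in (mse_class, cce, kl_div)]
losses_poor = [f(y_true, y_poor) for f in (mse_class, cce, kl_div)]
```

All three satisfy lower = better. For one-hot ground truths the cross-entropy and the KL divergence coincide, because the entropy of the ground truth is zero.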


I’ll address this later today.

On the point above about generalising the forecaster task score – I looked at this figure:


and realised that the forecaster utility equation (Eq. 45 in the whitepaper) will never work when used like this, because it relies on the absolute forecaster task score. Depending on the units of the target variable and the corresponding losses, forecasters will receive too little reward.

This is solved by changing the forecaster utility equation from an absolute forecaster task score dependence to a relative one. The question is what the denominator should be. I ran a bunch of different experiments, and found that the sum of all raw inference scores provides a good point of comparison. Strong forecaster sets will outscore the inferers, whereas weak ones will score 30-70% of the inferers’ sum. Illustrative example for a classification task:


The grey dotted line shows the sum. In this case, the forecasters outperform, so we go from this original reward distribution (BAD):

To this one (GOOD):

Now the best forecasters receive rewards that are competitive with the best inferers.

This is one of the updates that we have been developing for generalising the network design and becoming unit-insensitive (another example is making changes to the use of epsilon in eqs. 8, 11, 28, and 32 of the whitepaper).

In the end, the code change is trivially small. This:

def get_aggregators_utility(aggregators_score):
    aggregators_reward_fraction = np.maximum(0.1, np.minimum(0.4*aggregators_score+0.1, 0.5))
    return aggregators_reward_fraction

becomes:

def get_aggregators_utility(aggregators_score, predictors_scores):
    predictors_scores_sum = np.sum(predictors_scores)
    # Express the forecaster task score relative to the summed inferer scores (undefined if that sum is zero)
    score_ratio = (aggregators_score - np.min([0, predictors_scores_sum])) / np.abs(predictors_scores_sum)
    aggregators_reward_fraction = np.maximum(0.1, np.minimum(0.4*score_ratio+0.1, 0.5))
    return aggregators_reward_fraction

and in the whitepaper we now get:

This should generalise the forecaster task utility score to any absolute loss scale.
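A small worked example shows the effect of the change. The scores below are made up, and the example assumes for illustration that a change of units rescales all scores by a common factor; the function names are hypothetical stand-ins for the two versions of get_aggregators_utility.

```python
import numpy as np

def utility_absolute(aggregators_score):
    """Original mapping: depends on the absolute forecaster task score."""
    return np.maximum(0.1, np.minimum(0.4 * aggregators_score + 0.1, 0.5))

def utility_relative(aggregators_score, predictors_scores):
    """Updated mapping: score measured relative to the summed inferer scores."""
    predictors_scores_sum = np.sum(predictors_scores)
    score_ratio = (aggregators_score - np.min([0, predictors_scores_sum])) / np.abs(predictors_scores_sum)
    return np.maximum(0.1, np.minimum(0.4 * score_ratio + 0.1, 0.5))

# Hypothetical scores, and the same scores rescaled by a factor of 100.
forecaster_score = 0.02
inferer_scores = np.array([0.01, 0.015, 0.025])
scale = 100.0

absolute_small = utility_absolute(forecaster_score)          # 0.108
absolute_large = utility_absolute(scale * forecaster_score)  # 0.5 (capped)
relative_small = utility_relative(forecaster_score, inferer_scores)                  # 0.26
relative_large = utility_relative(scale * forecaster_score, scale * inferer_scores)  # 0.26
```

The absolute version jumps from close to the floor to the cap under rescaling, whereas the relative version is unchanged. The relative version is undefined when the inferer scores sum to exactly zero, so a production implementation would need to guard against that case.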