Outlier-resistant network inferences

Inferers who consistently produce good inferences get a high score over time, and therefore get a large weight assigned to them when computing the network inference. But what happens if an inferer who used to give good results suddenly produces a totally wrong result (say, off by an order of magnitude)? Then the network inference could be very wrong. Eventually, the network would correct for this by decreasing the inferer’s score, but that takes a couple of epochs.

There are many reasons such a thing could happen. For example, imagine the inferer (which is most likely a program running without human supervision) loses access to some important data source. Or the data source changes its format and the program suddenly reads the wrong column, which makes its results completely wrong as well.

So it would be nice to have some kind of outlier-resistant network inference which filters out extreme outliers. How could we achieve that? A solution should have the following properties:

  • It is relatively conservative, i.e. only filters out data points which are very clearly outliers
  • The outlier detection has to react immediately to an outlier. As I wrote above, the network itself can take care of a bad inferer eventually, but this is about the first few epochs before the network reacts.
  • The outlier-resistant network inference should not feed back into the network, i.e. it is not used for scoring participants, distributing rewards, or anything like that. It’s just meant to be a convenience function for topic consumers.
2 Likes

Yes – great problem definition. I agree we need this, and I also agree with the boundary conditions that you identified!

A standard method for outlier detection is the interquartile range (IQR) method. That is, you remove everything that lies more than p * IQR below the 25th percentile or more than p * IQR above the 75th percentile, where p is some parameter (a common choice is p = 1.5, but because we want to be very conservative we’ll probably want a larger value).
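
In code, the plain (unsmoothed) rule is just a few lines. This is a rough sketch; the function name and numpy’s percentile defaults are my own choices here:

import numpy as np

def iqr_outliers(predictions, p=1.5):
    # plain per-epoch IQR rule: flag anything more than p * IQR outside [Q1, Q3]
    q1, q3 = np.percentile(predictions, [25, 75])
    iqr = q3 - q1
    return [x < q1 - p * iqr or x > q3 + p * iqr for x in predictions]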

I studied the effect of this parameter p in a simulated network (with 5 inferers, one of which is “poisoned” in that it suddenly gives completely wrong results starting in epoch 100), and got the following plot:

The colors are different values for p (from 0 to 14), the x axis is the magnitude of the poisoning, the dashed line is the outlier fraction for “good” inferers (i.e. false positives), and the solid line is for the “bad” inferer (true outliers). The fraction is taken over 20 runs for each poisoning magnitude, using epochs 100-200 of the simulation for the bad inferer and epochs 50-100 for the good inferers (to avoid warm-up and the effect of the poisoned inferer, which tends to make the false positive rate smaller).

I would imagine that the false positives come from the IQR being unnaturally low in some epochs - that is, half the inferers reporting essentially the same value. That would imply we could get better results by using a smoothed IQR of some sort (say an EMA over epochs). One could see this as the outlier detector “learning” over time what a reasonable range of values is.
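
Concretely, the smoothed variant I have in mind looks something like this sketch (names and the alpha value are illustrative; note that only the IQR is smoothed, the quartiles are still computed per epoch, so the check stays reactive):

import numpy as np

def smoothed_iqr_outliers(predictions, iqr_smooth, p=6.0, alpha=0.2):
    # same IQR rule as before, but the width of the tolerance band uses an
    # EMA of the IQR across epochs instead of the raw per-epoch IQR
    q1, q3 = np.percentile(predictions, [25, 75])
    iqr = q3 - q1
    iqr_smooth = iqr if iqr_smooth is None else alpha * iqr + (1 - alpha) * iqr_smooth
    outlier = [x < q1 - p * iqr_smooth or x > q3 + p * iqr_smooth for x in predictions]
    return outlier, iqr_smooth  # feed iqr_smooth back in for the next epoch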

Here’s the same plot as before, but with the “smoothed IQR” method I described above. That actually seems to work! With p = 6 we get only about 0.1% false positives, but almost 100% suppression of outliers with magnitude >= 40. Note that although there is an EMA involved, it’s only over the IQR, so this is no less reactive: it still suppresses outliers in the first epoch they appear!

Another outlier detection method is the median absolute deviation (MAD) method. It is easy to define: let m be the median of the predictions x_1,…,x_n and let MAD be the median of the absolute deviations |x_i - m|. Then x_i is an outlier if |x_i - m| > p * MAD, where p is again a parameter. As with the IQR method above, the simulated results get much better when the MAD is smoothed by an EMA. Here’s a plot of the MAD and IQR results next to each other:

They are pretty similar, but notice that the two methods need different parameter values to give the same results. If we want a false positive rate of ~0.1%, I suggest an IQR parameter of about 6 or a MAD parameter of about 11.

1 Like

I was wondering, how does detecting and removing outliers in this way affect the loss of the combined network inference? Does the removal of outliers lead to a noticeable improvement in the combined inference’s accuracy or loss metric? And are there any scenarios where removing outliers might degrade performance?

1 Like

Lots of good questions! It’s actually a little subtle to simulate, because we modeled the forecasters by perturbing the real loss of the inferers. When we “poison” an inferer, do we base the forecasters on the loss of its poisoned inference, or the unpoisoned one? In other words, do we let the forecasters see that the inferer is bad?

Here’s what we get if one of the inferers suddenly goes bad at epoch 100, and the forecasters don’t see it at all. The values are averaged over 200 runs of the simulation to cancel out some noise:

And here’s what the losses look like when the forecaster does see that the inferer goes bad:

The reality is probably somewhere in between. The forecasters will eventually notice that an inferer produces bad results, but not immediately. As a simple model for this, I switched from “forecasters don’t see poisoning” to “forecasters do see poisoning” at epoch 150. Then here’s the effect of outlier rejection on the network loss:

The most important thing is that the initial spike in the network loss disappears with outlier rejection. So this works!

We also have plots of the inferer/forecaster balance, and the effective weights, both direct and through forecasters (log scale):

Here’s a theory for why the network loss is still higher between epochs 100-150 in the plot above, even after removing outliers: when everything works well, the forecasters report much better predictions than the inferers. That’s why the network assigns very little weight to the inferers directly and almost 100% to the forecasters. But after poisoning (and with dumb forecasters who don’t react), the forecast-implied inference is suddenly really bad. So the network assigns a low weight to the forecasters. On the other hand, the outlier-filtered forecast-implied inference is almost as good as it was before, much better than the raw inferers. But it still just gets a small weight, even in the “no outliers” inference.

So far, this was only for regression topics. Can we do a similar kind of outlier detection for classification topics?

The IQR method doesn’t have a straightforward generalization to multidimensional data. But the MAD method does: all we have to do is use the “geometric median”, i.e. the point x which minimizes the sum of the distances |x_i - x|, and then take the MAD to be the median of those distances.
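
For reference, here’s a rough sketch of that idea, computing the geometric median with a standard Weiszfeld iteration (function names, iteration counts and the p=11 default are my own choices, and the smoothing over epochs is left out):

import numpy as np

def geometric_median(points, n_iter=100, eps=1e-9):
    # Weiszfeld iteration: find the point minimizing the sum of Euclidean
    # distances to the rows of `points` (shape: n_inferers x n_classes)
    points = np.asarray(points, dtype=float)
    x = points.mean(axis=0)                      # start at the coordinate-wise mean
    for _ in range(n_iter):
        dists = np.maximum(np.linalg.norm(points - x, axis=1), eps)
        weights = 1.0 / dists
        x_new = (weights[:, None] * points).sum(axis=0) / weights.sum()
        if np.linalg.norm(x_new - x) < eps:
            break
        x = x_new
    return x

def mad_outliers_multidim(points, p=11.0):
    # multidimensional MAD: distances to the geometric median replace |x_i - x|
    points = np.asarray(points, dtype=float)
    med = geometric_median(points)
    dists = np.linalg.norm(points - med, axis=1)
    mad = np.median(dists)
    return dists > p * mad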

Let’s say we use the Euclidean distance on probability vectors. As in the simulations above, I let one inferer suddenly report bad results starting from epoch 100. Averaging over 200 individual runs to cancel out noise, we do get some separation:

Unfortunately there’s a lot of noise in this, which is hidden a bit above by averaging over 200 seeds. If we plot the true & false positive rates of detected outliers, we get this very underwhelming result (for the “(log)” variant I tried applying the same to the logarithms of the probability vectors, which made it even worse):

This might be caused by the limited dynamic range (the elements of the probability vector are all values between 0 and 1). On the other hand, that means the problem of having an outlier might be less significant than in the regression case. The worrying regression scenario is that an inferer with only 10% weight or so suddenly reports something 1000x the usual value and overwhelms the other inferers. That shouldn’t happen for classification anyway; the only way an inferer can really significantly throw off the network inference is if it has a high weight assigned to it.

The root of this is that we do not perform the inference synthesis in logit space. If we did, then it’d be the same problem. Maybe we’ll have the option to do inference synthesis in logit space in the future, then we would have to revisit this.

If we dial the volatility way down in the simulator, outlier detection works much better. Here are similar plots to the ones above, but with volatility=0.01 instead of the default volatility=0.1 (“false positives” is the fraction of good values detected as outliers, “false negatives” the fraction of outliers not detected; in the second plot false positives are dashed and false negatives are solid, and the color is the “threshold” parameter, as before).


And finally, here are some example plots of 20 predictor outputs, where one of them (the orange one) is “poisoned”, i.e. purposely reports wrong results. They show the MAD levels (degrees of outlierness, if you will) of the plain probability (first row), log(probability) (second row), and logit(probability) (third row).

First with volatility = 0.1:



And also with volatility = 0.02:



I got to test this using data from the live network. This is what we got over ~10 days, using the MAD method:

There were some problems with the chain around epochs 130-300, so let’s ignore that part. The most important graph is the middle one, which shows the combined inference without outlier removal in blue, and with outlier removal in orange.

Here’s another zoomed-in version which shows all the individual inferers as green dots.

There are a couple of epochs where the outlier detection actually works and removes a spike in the blue line. But there are also three bad epochs (419, 445, 446) where the outlier removal makes the inference worse! We see in the second picture that at all of them the tolerance interval (in light orange) suddenly shifted down. That’s because it’s centered around the median, and the median makes a sudden jump.

Let’s look at the histograms of inferences at some critical epochs:

Epochs 162, 419, 445, and 466 are all examples where the outlier removal via MAD makes the network inference worse. For comparison, 123 is a random normal epoch, and 480 is an outlier that is correctly detected and removed.

So the answer is, MAD has a problem with bimodally distributed inferences. I also tried IQR. It doesn’t seem to have this problem, but also detects far fewer outliers than MAD. For example in epoch 480, MAD removes the spike in the network inference, but IQR doesn’t.

These bimodal inferences seem to be the main reason for outliers in the live data (at least in topic 1 in the last week), and it’s not something I considered in the simulator. I’ll try to come up with a fix for this.

1 Like

Nice! This is great progress, and I think there are some other issues being addressed that will actually help take care of the bimodality. But it will be useful to test regardless :+1:

That’s right! Nevertheless, I tested a few options:

  1. The original smoothed MAD method with a threshold parameter of 11
  2. The original smoothed IQR method with a threshold parameter of 6 (which was roughly equivalent to MAD with parameter 11 in the simulator)
  3. The smoothed IQR method with a threshold of 0, showing how it struggles with bimodal distributions
  4. The MAD method, but instead of the median using the “weighted median”, meaning the inference I such that 50% of the total weight of inferers has an inference less than or equal to I and 50% has an inference greater than or equal to I (see the sketch after this list).
  5. Using both the regular median and the weighted median, in the sense that an inference is only an outlier if it’s outside the tolerance interval around the median as well as the weighted median (but the MAD and therefore the size of the interval is computed relative to the regular median)
  6. Similar to #5, but instead of the weighted median it uses the median of the last epoch. That is, x is an outlier if x < min(median[i], median[i-1]) - 11 * mad_smooth or x > max(median[i], median[i-1]) + 11 * mad_smooth. I call this the “old and new” variant.
  7. The regular MAD method, except when the median jumped too much (by more than 300), in which case we use the last epoch’s median - unless we already did the same in the last epoch, or there are fewer than 5 inferers within the interval. This is very ad hoc, but I just wanted to see if it helps. It’s a bit disappointing (except when also lowering the threshold).
  8. Same as #6 but with threshold=4
  9. Same as #7 but with threshold=4
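
For reference, here’s a minimal sketch of the weighted median used in #4 (it returns the lower weighted median; other tie-breaking conventions would work too):

def weighted_median(values, weights):
    # smallest inference such that at least half of the total inferer weight
    # lies at or below it
    pairs = sorted(zip(values, weights))   # sort inferences, carry weights along
    total = sum(weights)
    cumulative = 0.0
    for value, weight in pairs:
        cumulative += weight
        if cumulative >= 0.5 * total:
            return value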

In summary, #9 looks best on this data, but it’s 1) very ad hoc and 2) has a low threshold value, i.e. a high risk of messing up perfectly good values. So I actually prefer #6. But maybe it would be a good idea to revisit this once the root cause of these bimodal inferences is taken care of.

1 Like

With this, here’s a description of the final method I recommend. I’ll first describe the “vanilla” smooth MAD method, as it makes it easier to understand the actual recommended method, which follows in sections 2 & 3.

1. Detecting outliers (original smooth MAD method)

Short description: In every epoch, compute the median of the inferer predictions. Then subtract that from all predictions, take the absolute value of the results, and take the median of this list of values. That’s the median absolute deviation (MAD). Smooth the MAD using an EMA. Reject a prediction as an outlier if it differs from the median by at least threshold times the smoothed MAD. I found threshold=11 to give good results in the simulator.

Now we describe this in more detail with pseudocode. The main ingredient is the median, which could e.g. be implemented like this, for an array x:

def median(x):
  x_sorted = sorted(x)
  j = (len(x) - 1) // 2   # integer division (rounding down)
  if len(x) % 2 == 1:  # odd length
    return x_sorted[j]
  else:                # even length
    return 0.5 * x_sorted[j] + 0.5 * x_sorted[j+1]
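
As a quick sanity check, this gives for example:

median([3, 1, 2])     # -> 2
median([4, 1, 3, 2])  # -> 2.5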

Using this function, we can detect outliers with the following algorithm:

# per epoch, the input is an array `predictions` with one entry per inferer;
# the output is a boolean array `outlier` of the same length
threshold = 11.0
alpha = 0.2
mad_smooth = None  # this has to be preserved across epochs

for predictions in predictions_per_epoch:  # one array of inferer predictions per epoch
  # (it's not a big deal if we skip some epochs, just use the last value of mad_smooth available)
  median_prediction = median(predictions)
  deviations = [abs(x - median_prediction) for x in predictions]
  mad = median(deviations)
  if mad_smooth is None:
    mad_smooth = mad
  else:
    mad_smooth = alpha * mad + (1 - alpha) * mad_smooth

  outlier = [False] * len(predictions)
  for i in range(len(predictions)):
    if abs(predictions[i] - median_prediction) > threshold * mad_smooth:
      outlier[i] = True   # predictions[i] is an outlier!

2. Detecting outliers (“old and new” variation)

To mitigate problems with suddenly changing medians we can use this variation, which effectively does the outlier detection above with the previous epoch’s median in addition to the current one, and only declares an inference an outlier if it is one by both measures.

It’s really just a few lines changed relative to the above, but we have to preserve an additional value last_median across epochs.

# per epoch, the input is an array `predictions` with one entry per inferer;
# the output is a boolean array `outlier` of the same length
threshold = 11.0
alpha = 0.2
mad_smooth = None   # this has to be preserved across epochs
last_median = None  # this also needs to be preserved

for predictions in predictions_per_epoch:  # one array of inferer predictions per epoch
  # (it's not a big deal if we skip some epochs, just use the last values of mad_smooth and last_median available)
  median_prediction = median(predictions)
  deviations = [abs(x - median_prediction) for x in predictions]
  mad = median(deviations)
  if mad_smooth is None:
    mad_smooth = mad
  else:
    mad_smooth = alpha * mad + (1 - alpha) * mad_smooth

  outlier = [False] * len(predictions)
  for i in range(len(predictions)):
    if (abs(predictions[i] - median_prediction) > threshold * mad_smooth and
        (last_median is None or abs(predictions[i] - last_median) > threshold * mad_smooth)):
      outlier[i] = True   # predictions[i] is an outlier by both measures!

  last_median = median_prediction

3. Computing an outlier-resistant network inference using the outlier array (in both cases)

To get the outlier-resistant network inference we need to modify equations (3) and (9) from the whitepaper. Writing $I_{ij}$ for the inference of inferer $j$ in epoch $i$, $I_{ki}$ for the forecast-implied inference of forecaster $k$, and $w$ for the corresponding weights, they are (in slightly simplified notation)

$$I_{ki} = \frac{\sum_j w_{ijk}\, I_{ij}}{\sum_j w_{ijk}}$$

for computing the forecast-implied inferences and

$$I_i = \frac{\sum_m w_{im}\, I_{im}}{\sum_m w_{im}}$$

(with $m$ running over both the inferers and the forecast-implied inferences) for computing the combined network inference.

All we have to do is remove the outliers from each of the four sums, i.e.

$$I_{ki}^{\mathrm{OR}} = \frac{\sum_{j \notin \text{outliers}} w_{ijk}\, I_{ij}}{\sum_{j \notin \text{outliers}} w_{ijk}}$$

$$I_i^{\mathrm{OR}} = \frac{\sum_{j \notin \text{outliers}} w_{ij}\, I_{ij} + \sum_k w_{ik}\, I_{ki}^{\mathrm{OR}}}{\sum_{j \notin \text{outliers}} w_{ij} + \sum_k w_{ik}}$$

Here OR stands for “outlier-resistant”. The second equation just looks more complicated because I chose to separate the sum into two sums, one over inferers (j) and one over forecasters (k).

In other words, we first need to compute outlier-resistant versions of all the forecast-implied inferences (we don’t do outlier detection on the forecasters themselves, but we do need to modify the forecast-implied inferences so they don’t include outlier inferers). Then we can use these to compute the outlier-resistant network inference.
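
To make that concrete, here’s a rough sketch of how the outlier array could be plugged into the two weighted averages above (variable names are illustrative; the weights w are just inputs here and come from the network’s scoring):

import numpy as np

def forecast_implied_or(inferer_preds, w_ijk, outlier):
    # outlier-resistant forecast-implied inference of one forecaster:
    # the weighted average of eq. (3), restricted to non-outlier inferers
    keep = ~np.asarray(outlier)
    v, w = np.asarray(inferer_preds)[keep], np.asarray(w_ijk)[keep]
    return np.sum(w * v) / np.sum(w)

def combined_inference_or(inferer_preds, outlier, w_inferers, w_forecast_per_k, w_forecasters):
    # outlier-resistant combined inference: eq. (9) with the inferer sums
    # restricted to non-outliers and the OR forecast-implied inferences plugged in
    keep = ~np.asarray(outlier)
    implied = [forecast_implied_or(inferer_preds, w_k, outlier) for w_k in w_forecast_per_k]
    num = (np.sum(np.asarray(w_inferers)[keep] * np.asarray(inferer_preds)[keep])
           + np.sum(np.asarray(w_forecasters) * np.asarray(implied)))
    den = np.sum(np.asarray(w_inferers)[keep]) + np.sum(w_forecasters)
    return num / den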

1 Like

Awesome, let’s get this on the chain!