Outlier-resistant network inferences

Inferers who consistently produce good inferences get a high score over time, and therefore get a large weight assigned to them when computing the network inference. But what happens if an inferer who used to give good results suddenly produces a totally wrong result (say, one that is off by an order of magnitude)? Then the network inference could be very wrong. Eventually, the network would correct for this by decreasing the inferer’s score, but that takes a couple of epochs.

There are many reasons such a thing could happen. For example, imagine the inferer (which is most likely a program running without human supervision) loses access to some important data source. Or the data source changes its format and suddenly the program is reading the wrong column, which makes its results completely wrong as well.

So it would be nice to have some kind of outlier-resistant network inference which filters out extreme outliers. How could we achieve that? A solution should have the following properties:

  • It is relatively conservative, i.e. only filters out data points which are very clearly outliers
  • The outlier detection has to react immediately to an outlier. As I wrote above, the network itself can take care of a bad inferer eventually, but this is about the first few epochs before the network reacts.
  • The outlier-resistant network inference should not feed back into the network, i.e. it is not used for scoring participants, distributing rewards, or anything like that. It’s just meant to be a convenience function for topic consumers.

Yes – great problem definition. I agree we need this, and I also agree with the boundary conditions that you identified!

A standard method for outlier detection is the interquartile range (IQR) method. That is, you remove everything which is more than p * IQR below the 25th percentile or more than p * IQR above the 75th percentile, where p is some parameter (a common choice is p = 1.5, but because we want to be very conservative we’ll probably want a larger value).
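For concreteness, here’s a minimal sketch of the plain per-epoch IQR filter in Python (the function name and the default p = 6 are placeholders for illustration, not anything taken from an actual implementation):

```python
import numpy as np

def iqr_outlier_mask(values, p=6.0):
    """Flag values more than p * IQR outside the interquartile range.

    values: 1-D array with the inferences for a single epoch.
    Returns a boolean mask, True where a value is considered an outlier.
    p = 1.5 is the textbook choice; larger p is more conservative.
    """
    values = np.asarray(values, dtype=float)
    q25, q75 = np.percentile(values, [25, 75])
    iqr = q75 - q25
    return (values < q25 - p * iqr) | (values > q75 + p * iqr)

# Example: the last inferer is wildly off.
print(iqr_outlier_mask([10.1, 9.8, 10.4, 9.9, 500.0], p=6.0))
# -> [False False False False  True]
```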

I studied the effect of this parameter p in a simulated network (with 5 inferers, one of which is “poisoned” in that it suddenly gives completely wrong results starting in epoch 100), and got the following plot:

The colors are different values of p (from 0 to 14), the x axis is the magnitude of the poisoning, the dashed lines are the outlier fraction for the “good” inferers (i.e. false positives), and the solid lines are for the “bad” inferer (true outliers). The fractions are taken over 20 runs for each poisoning magnitude, using epochs 100-200 of the simulation for the bad inferer and epochs 50-100 for the good inferers (to avoid warm-up and the effect of the poisoned inferer, which tends to make the false positive rate smaller).
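(In case it helps to make the setup concrete, here is a rough, hypothetical sketch of the kind of poisoning scenario I have in mind. The real simulator models inferers, forecasters and losses in much more detail, so treat this purely as an illustration.)

```python
import numpy as np

rng = np.random.default_rng(0)

N_INFERERS, N_EPOCHS, POISON_EPOCH = 5, 200, 100
POISON_MAGNITUDE = 40.0  # size of the shift applied to the bad inferer

# Made-up ground truth plus per-inferer noise; these choices are illustrative only.
truth = np.cumsum(rng.normal(size=N_EPOCHS))
inferences = truth[:, None] + rng.normal(scale=1.0, size=(N_EPOCHS, N_INFERERS))

# Inferer 0 "goes bad" at POISON_EPOCH and reports values shifted far off the truth.
inferences[POISON_EPOCH:, 0] += POISON_MAGNITUDE
```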

I would imagine that the false positives come from the IQR being unnaturally low in some epochs - that is, half the inferers report essentially the same value. That would imply we could get better results by using a smoothed IQR of some sort (say, an EMA over epochs). One could see this as the outlier detector “learning” over time what a reasonable range of values is.

Here’s the same plot as before, but with the “smoothed IQR” method I described above. That actually seems to work! For p = 6 we get only about 0.1% false positives, but almost 100% suppression of outliers with magnitude >= 40. Note that although there is an EMA involved, it’s only over the IQR, so this is no less reactive. It still suppresses outliers in the first epoch they appear!
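In code, the smoothed-IQR variant could look roughly like this (again a sketch only; the class name, the default p = 6 and the EMA factor alpha = 0.1 are assumptions, not values from the actual simulation):

```python
import numpy as np

class SmoothedIQRDetector:
    """IQR outlier test where the IQR itself is smoothed over epochs with an EMA.

    The quartiles are still taken from the current epoch's values, so the
    detector reacts in the very first epoch an outlier appears.
    """

    def __init__(self, p=6.0, alpha=0.1):
        self.p = p             # rejection threshold in (smoothed) IQR units
        self.alpha = alpha     # EMA smoothing factor for the IQR
        self.smoothed_iqr = None

    def update(self, values):
        values = np.asarray(values, dtype=float)
        q25, q75 = np.percentile(values, [25, 75])
        iqr = q75 - q25
        if self.smoothed_iqr is None:
            self.smoothed_iqr = iqr
        else:
            self.smoothed_iqr = self.alpha * iqr + (1 - self.alpha) * self.smoothed_iqr
        lower = q25 - self.p * self.smoothed_iqr
        upper = q75 + self.p * self.smoothed_iqr
        return (values < lower) | (values > upper)
```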

Another outlier detection method is the median absolute deviation (MAD) method. It is easy to define: let x be the median of the predictions x_1,…,x_n and let MAD be the median of the absolute values |x_i - x|. Then x_i is an outlier if |x_i - x| > p * MAD, where p is again a parameter. In the same way as with the IQR method above, the simulated results of this get much better when the MAD is smoothed by an EMA. Here’s a plot of the MAD and IQR results next to each other:

They are pretty similar, but notice that comparable results correspond to different parameter values. If we want a false positive rate of ~0.1%, I suggest an IQR parameter of about 6 or a MAD parameter of about 11.
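For reference, here’s what the MAD test with the same kind of EMA smoothing on the MAD could look like (the function name and the smoothing factor alpha = 0.1 are assumptions; p = 11 is just the value suggested above):

```python
import numpy as np

def mad_outlier_mask(values, smoothed_mad=None, p=11.0, alpha=0.1):
    """MAD outlier test with an EMA-smoothed MAD (sketch).

    Returns (mask, new_smoothed_mad). Feed new_smoothed_mad back in on the
    next epoch for the smoothed variant, or pass None for the plain MAD test.
    """
    values = np.asarray(values, dtype=float)
    med = np.median(values)
    mad = np.median(np.abs(values - med))
    if smoothed_mad is None:
        smoothed_mad = mad
    else:
        smoothed_mad = alpha * mad + (1 - alpha) * smoothed_mad
    mask = np.abs(values - med) > p * smoothed_mad
    return mask, smoothed_mad
```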


I was wondering, how does detecting and removing outliers in this way affect the loss of the combined network inference? Does the removal of outliers lead to a noticeable improvement in the combined inference’s accuracy or loss metric? And are there any scenarios where removing outliers might degrade performance?


Lots of good questions! It’s actually a little subtle to simulate, because we modeled forecasters by perturbing the real loss of the inferers. When we “poison” an inferer, do we base the forecasters on the loss of its poisoned inference, or the unpoisoned one? In other words, do we let the forecasters see that the inferer is bad?

Here’s what we get if one of the inferers suddenly goes bad at epoch 100, and the forecasters don’t see it at all. The values are averaged over 200 runs of the simulation to cancel out some noise:

And here’s what the losses look like when the forecaster does see that the inferer goes bad:

The reality is probably somewhere in between. The forecasters will eventually notice that an inferer produces bad results, but not immediately. As a simple model for this, I switched from “forecasters don’t see poisoning” to “forecasters do see poisoning” at epoch 150. Then here’s the effect of outlier rejection on the network loss:

The most important thing is that the initial spike in the network loss disappears with outlier rejection. So this works!

We also have plots of the inferer/forecaster balance, and the effective weights, both direct and through forecasters (log scale):

Here’s a theory for why the network loss is still higher between epochs 100 and 150 in the plot above, even after removing outliers: when everything works well, the forecasters report much better predictions than the inferers. That’s why the network assigns very little weight to the inferers directly and almost 100% to the forecasters. But after poisoning (and with dumb forecasters who don’t react), the forecaster-implied inference is suddenly really bad, so the network assigns a low weight to the forecasters. On the other hand, the outlier-filtered forecaster-implied inference is almost as good as it was before, much better than the raw inferers. But it still just gets a small weight, even in the “no outliers” inference.

So far, this was only for regression topics. Can we do a similar kind of outlier detection for classification topics?

The IQR method doesn’t have a straightforward generalization to multidimensional data. But the MAD method does: all we have to do is to use the “geometric median”, which is the point x which minimizes the sum of the distances |x_i - x|.
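Here’s a sketch of how that could look, using Weiszfeld iterations to approximate the geometric median (function names and the p = 11 threshold are placeholders, not from an actual implementation):

```python
import numpy as np

def geometric_median(points, n_iter=100, eps=1e-9):
    """Approximate the geometric median of points (shape (n, d)) via Weiszfeld iterations."""
    points = np.asarray(points, dtype=float)
    center = points.mean(axis=0)
    for _ in range(n_iter):
        dist = np.maximum(np.linalg.norm(points - center, axis=1), eps)  # avoid division by zero
        weights = 1.0 / dist
        new_center = (weights[:, None] * points).sum(axis=0) / weights.sum()
        if np.linalg.norm(new_center - center) < eps:
            break
        center = new_center
    return center

def multivariate_mad_outliers(points, p=11.0):
    """Flag points whose distance to the geometric median exceeds p * MAD of those distances."""
    points = np.asarray(points, dtype=float)
    dist = np.linalg.norm(points - geometric_median(points), axis=1)
    return dist > p * np.median(dist)
```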

Let’s say we use the Euclidean distance on probability vectors. As in the simulations above, I let one inferer suddenly report bad results starting from epoch 100. Averaging over 200 individual runs to cancel out noise, we do get some separation:

Unfortunately there’s a lot of noise in this, which is hidden a bit above by averaging over 200 seeds. If we plot the true & false positive rates of detected outliers, we get this very underwhelming result (for the “(log)” variant I applied the same method to the logarithms of the probability vectors, which made it even worse):

This might be caused by the limited dynamic range (the elements of the probability vector are all values between 0 and 1). On the other hand, that means the problem of having an outlier might be less significant than in the regression case. The regression-style scenario, where an inferer with only 10% weight or so suddenly reports something 1000x the usual value and overwhelms the other inferers, shouldn’t happen for classification anyway; the only way an inferer can really significantly throw off the network inference is if it has a high weight assigned to it.

The root of this is that we do not perform the inference synthesis in logit space. If we did, then it’d be the same problem. Maybe we’ll have the option to do inference synthesis in logit space in the future, then we would have to revisit this.

If we dial the volatility way down in the simulator, outlier detection works much better. Here are similar plots to those above, but with volatility=0.01 instead of the default volatility=0.1 (“false positives” is the fraction of good values detected as outliers and “false negatives” is the fraction of outliers not detected; in the second plot, false positives are dashed, false negatives are solid, and the color is the “threshold” parameter, as before).


And finally, here are some plots showing some examples of 20 predictor outputs, where one of them (the orange one) is “poisoned”, i.e. purposefully reports wrong results. They show the MAD levels (degrees of outlierness, if you will) of the plain probability (first row), log(probability) (second row), and logit(probability) (third row).

First with volatility = 0.1:



And also with volatility = 0.02: