This analysis of health is done on a per-topic basis. I will assume that for each topic we want to return a single scalar metric at every epoch. The focus is on evaluating the diversity of models used by the inferers (or predictors) and the accuracy of the network's aggregated inference. I will refer to the metric we are computing as the 'health metric'.
To facilitate testing, I modified the sim to accept a new argument, n_identical_predictors, which specifies the number of identical predictors within the network; I simply replace the first n_identical_predictors predictors with copies of the first one. This allows us to simulate varying levels of model redundancy and analyze its impact on the health metrics. @steve suggested this, ty Steve!
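As a rough sketch of what this looks like (the names here are illustrative, not the actual sim code):

```python
import copy

def build_predictors(n_predictors, n_identical_predictors, make_predictor):
    """Build the predictor pool, overwriting the first
    n_identical_predictors slots with copies of predictor 0."""
    predictors = [make_predictor(i) for i in range(n_predictors)]
    for i in range(1, n_identical_predictors):
        predictors[i] = copy.deepcopy(predictors[0])
    return predictors
```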
The other parameter I vary to see how the health metric changes is volatility. The errors and biases in the sim appear to scale with it, so I interpret increasing volatility as decreasing worker performance.
@Apollo11's and my original idea was to reduce the covariance of the raw inferences to a single number, so that is what I tried first. To calculate the covariance I use a windowed approach: define a window size w, then compute the covariance matrix using the last w inferences from each inferer. Windowing ensures we can catch changes; for example, if two workers used heterogeneous models for 10^3 epochs and then suddenly started using the same model, computing the covariance over the full history would take a long time to register the change.
To reduce the covariance matrix to a single number, I take the mean of the absolute values of its entries. This is a naive approach and can be iterated on. We shouldn't use the trace, since we want the off-diagonal information; I also want to look at the mean of just the off-diagonal entries, the Frobenius norm, and potentially other norms.
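For concreteness, here is a minimal sketch of the windowed covariance and the candidate scalar reductions, assuming the raw inferences are stored as an (epochs × inferers) numpy array:

```python
import numpy as np

def windowed_covariance(history, w):
    """history: (n_epochs, n_inferers) array of raw inferences.
    Covariance matrix computed over the last w epochs only."""
    return np.cov(history[-w:], rowvar=False)

def mean_abs_entries(cov):
    """Naive reduction: mean of the absolute values of all entries."""
    return np.abs(cov).mean()

def mean_abs_offdiag(cov):
    """Alternative: mean |entry| over the off-diagonal entries only."""
    mask = ~np.eye(cov.shape[0], dtype=bool)
    return np.abs(cov[mask]).mean()

def frobenius(cov):
    """Alternative: Frobenius norm of the covariance matrix."""
    return np.linalg.norm(cov)
```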
Here I spin up a network with 50 inferers and run with different levels of volatility and numbers of duplicate inferers, using a window size of 50. Below is a plot of the mean of the absolute entries of the covariance matrix; each colored cell corresponds to one experiment. I run for 1000 epochs and average the metric over all time.
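The sweep itself is roughly this shape (run_sim and the parameter grids are stand-ins for the actual sim interface, not the real code):

```python
import numpy as np

volatilities = np.linspace(0.1, 1.0, 10)   # illustrative grid
duplicate_counts = range(0, 51, 5)
W, N_EPOCHS = 50, 1000

heatmap = np.zeros((len(volatilities), len(duplicate_counts)))
for i, vol in enumerate(volatilities):
    for j, n_dup in enumerate(duplicate_counts):
        # run_sim is assumed to return an (N_EPOCHS, 50) array of raw inferences
        history = run_sim(n_inferers=50, volatility=vol,
                          n_identical_predictors=n_dup, n_epochs=N_EPOCHS)
        vals = [mean_abs_entries(windowed_covariance(history[:t], W))
                for t in range(W, N_EPOCHS + 1)]
        heatmap[i, j] = np.mean(vals)  # average over all epochs with a full window
```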
This has the desired behavior: for a fixed volatility, the metric is monotonically increasing as a function of the number of duplicate inferers. One thing to note is the top left: when the inferers themselves have low volatility, this number does not jump dramatically even when there are lots of duplicates.
Now, as suggested, I will normalise by the absolute difference between the network inference and the ground truth, so pointwise in time our health metric is (average absolute covariance)/(absolute error in network inference). The results are below. The leftmost plot is the log10 of the normalised metric, the middle plot is the covariance I showed above, and the rightmost plot is the absolute error of the network inference.
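Pointwise, using the helpers above (network_inference and ground_truth are the per-epoch scalars), this is simply:

```python
def normalized_metric(cov_scalar, network_inference, ground_truth):
    # (average absolute covariance) / (absolute error in network inference)
    return cov_scalar / abs(network_inference - ground_truth)
```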
Looking at the leftmost plot, the maximum occurs at the bottom right (low duplicates, high volatility). This happens because the covariance is relatively high due to the high volatility (even without duplicates), but the network inference is still very accurate, so we are dividing by a small number and the metric blows up. I don't think this is desirable behaviour.
Overall, the metric (average absolute covariance)/(absolute error in network inference) being high should alert us that either the covariance between inferers is high or the loss is small. It seems that the loss being small is dominating this metric.
What if we looked at the product instead, i.e. (average absolute covariance)*(absolute error in network inference)? The results look like this.
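For clarity, the product variant is just the ratio flipped to a multiply:

```python
def product_metric(cov_scalar, network_inference, ground_truth):
    # (average absolute covariance) * (absolute error in network inference)
    return cov_scalar * abs(network_inference - ground_truth)
```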
Here I feel the leftmost plot has the desired behaviour: the maximum of the metric is at the top right, and it is monotonic in all the right directions.
I have a concern about the covariance I would like to express: it can miss degenerate cases. For example, consider the case where we have three inferers. One is well behaved, but two are using crappy models that output a constant. Shown here:
When we calculate the covariance we get:
No matter how we reduce this to one number, it will not reflect how unhealthy this degenerate case is. Overall, we need to be careful with covariance: the window size must be short enough to detect changes but long enough that the data within it has variance. In general, if two signals are identical, their covariance is just the variance of the original signal; if that signal has almost no variance, we won't be able to detect the duplication well.
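A quick toy example (made-up numbers) of why this happens: constant signals have zero variance, so the duplicated columns contribute nothing to the covariance matrix.

```python
import numpy as np

rng = np.random.default_rng(0)
w = 50
good = rng.normal(0.0, 1.0, w)   # well-behaved inferer with real variance
const = np.full(w, 3.0)          # two inferers outputting the same constant
history = np.stack([good, const, const], axis=1)

cov = np.cov(history, rowvar=False)
print(cov)  # rows/columns 1 and 2 are (numerically) zero: no duplication signal
```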
To combat this, I thought about replacing the covariance with a different measure of dispersion. We want the health metric to get large when the estimates are concentrated. To measure concentration, I look only at the current inferences (the inferences at epoch i), compute their mean, and then compute the average absolute distance from each inference to that mean. This measures how spread out the current inferences are but does not consider average behaviour over time.
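As a sketch, for the inferences at a single epoch:

```python
import numpy as np

def current_spread(inferences_t):
    """Mean absolute distance of this epoch's inferences from their mean."""
    m = np.mean(inferences_t)
    return np.mean(np.abs(inferences_t - m))
```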
Then I chose the health metric to be (absolute error in network inference)/(average absolute distance to current average inference). The results are below (no log on the metric here):
This appears to blow up too fast, but it is definitely picking up when we have duplicates. Maybe this can be tuned by considering (absolute error in network inference)/max((average absolute distance to current average inference), epsilon).
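A sketch of that guarded version (the epsilon value here is a tunable guess, not something I have settled on):

```python
def health_metric(network_inference, ground_truth, inferences_t, eps=1e-6):
    # Floor the spread at eps so the metric can't blow up when all
    # inferences collapse onto (nearly) the same value.
    spread = max(current_spread(inferences_t), eps)
    return abs(network_inference - ground_truth) / spread
```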