Monitoring inferer health

This post is part of a broader effort within the Allora Labs research team to design a monitoring suite. The goal is a suite of metrics that tracks the health of various components in the Allora network, allowing us to evaluate performance and identify potential issues before they become critical.

In this post, we focus on the health of inferers. Specifically, my goal is to quantify how diverse the models used by inferers are and how accurate the network’s inference is. To start, I’ll explore a covariance-based metric for capturing model diversity, as it appears well suited for detecting similarities or redundancies among inferers. I will also explore various ways of reducing the calculated covariance matrix to a single number representing inferer health, as well as other metrics that are not based on covariance.

Overall, the goal is to design a scalar metric that quantifies the health, or diversity, of the inferer pool for a given topic.

This analysis of health is done on a per-topic basis. I will assume that for each topic we want to return a single scalar metric at every epoch. The focus is on evaluating the diversity of models used by the inferers (or predictors) and the accuracy of the network’s aggregated inference. I will refer to the metric we are computing as the ‘health metric’.

To facilitate testing, I modified the sim to accept a new argument, n_identical_predictors, which specifies the number of identical predictors within the network; these are simply copies of the first predictor. This allows us to simulate varying levels of model redundancy and analyze its impact on the health metrics. @steve suggested this, ty Steve!
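
For reference, a minimal sketch of how the duplication could look. The function name and the list-of-predictors structure are assumptions on my part, not the sim’s actual internals:

```python
import copy

def add_identical_predictors(predictors, n_identical_predictors):
    """Replace the last n_identical_predictors entries with copies of the first predictor.

    Assumes `predictors` is a list of predictor objects and that
    n_identical_predictors < len(predictors).
    """
    duplicated = list(predictors)
    for i in range(n_identical_predictors):
        duplicated[-(i + 1)] = copy.deepcopy(predictors[0])
    return duplicated
```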

The other parameter I vary to see how the health metric changes is volatility. The predictor errors and biases appear to scale with it, so I am interpreting increasing volatility as decreasing worker performance.

The original idea @Apollo11 and I had was to reduce the covariance of the raw inferences to a single number, so that is what I tried first. To calculate the covariance I use a windowed approach: define a window size w, then calculate the covariance matrix using the last w inferences from each inferer. Windowing ensures we can catch changes; if we instead used the full history, and two workers that had used heterogeneous models for 10^3 epochs suddenly started using the same model, it would take a long time to catch this change.

For reducing the covariance to a single number, I take the mean of the absolute values of the entries. This is a naive approach and can be iterated on. We shouldn’t use the trace, since we want the off-diagonal information; I also want to look at the mean of just the off-diagonal entries, the Frobenius norm, and potentially other norms.
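
A rough numpy sketch of the windowed covariance and the reductions mentioned above; the (epochs × inferers) array layout is an assumption:

```python
import numpy as np

def windowed_covariance(inferences, w):
    """Covariance matrix of the last w inferences.

    `inferences` is assumed to have shape (epochs, n_inferers);
    np.cov expects variables in rows, hence the transpose.
    """
    window = inferences[-w:, :]
    return np.cov(window.T)

def mean_abs_entry(cov):
    # Naive reduction: mean of the absolute values of all entries.
    return np.mean(np.abs(cov))

def mean_abs_offdiag(cov):
    # Alternative: mean of only the off-diagonal entries.
    mask = ~np.eye(cov.shape[0], dtype=bool)
    return np.mean(np.abs(cov[mask]))

def frobenius(cov):
    # Frobenius norm of the covariance matrix.
    return np.linalg.norm(cov, ord="fro")
```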

Here I spin up a network with 50 inferers and run with different levels of volatility and different numbers of duplicate inferers, using a window size of 50. Below is a plot of the mean of the absolute entries of the covariance matrix. Each colored cell corresponds to an experiment. I run for 1000 epochs and take the average value over all epochs.

This has the desired behaviour. For a fixed volatility, it is monotonically increasing as a function of the number of duplicate inferers. One thing to note is the top left: when the inferers have low volatility themselves, this number does jump dramatically if there are lots of duplicates.

Now, as suggested, I will normalise by the absolute difference between the network inference and the ground truth. So, point-wise in time, our health metric is (average absolute covariance matrix)/(absolute error in network inference). The results are below. The leftmost plot is the log10 of the normalised metric. The middle plot is just the covariance I showed above. The rightmost plot is the absolute error of the network inference.

Looking at the leftmost plot, the maximum occurs at the bottom right (low duplicates, high volatility). This occurs because in this case the covariance is relatively high due to the high volatility (even without any duplicates), but the network inference is still very accurate, so we are dividing by a small number and our metric blows up. I don’t think this is desirable behaviour.

Overall, the metric (average absolute covariance matrix)/(absolute error in network inference) being high should alert us that the covariance between inferers is high or that the loss is small. It seems that the loss being small is dominating this metric.

What if we looked at the product instead, i.e. (average absolute covariance matrix)*(absolute error in network inference)? The results look like this.

Here I feel the leftmost plot has the desired behaviour: the maximum of the metric is at the top right, and it is monotonic in all the right directions.
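
In code, the two combinations at a given epoch are simply the following (function names are just for illustration; cov is the windowed covariance from the earlier sketch):

```python
import numpy as np

def ratio_metric(cov, network_inference, ground_truth):
    # (average absolute covariance matrix) / (absolute error in network inference)
    return np.mean(np.abs(cov)) / abs(network_inference - ground_truth)

def product_metric(cov, network_inference, ground_truth):
    # (average absolute covariance matrix) * (absolute error in network inference)
    return np.mean(np.abs(cov)) * abs(network_inference - ground_truth)
```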

I have a concern about the covariance I would like to express: this covariance metric can miss degenerate cases. For example, consider the case where we have three inferers. One is well behaved, but two are using crappy models that output a constant, as shown here:

When we calculate the covariance we get:


No matter how we get this down to one number, it does not reflect how unhealthy this degenerate case is. Overall, we need to be careful with covariance: we need a window size that allows us to detect changes but is also long enough to ensure the data in it has variance. In general, if two signals are identical, their covariance is just the variance of the original signal. If that original signal has almost no variance, we won’t be able to detect the duplication well.
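
To make the concern concrete, a toy numerical version of that three-inferer example (all values made up):

```python
import numpy as np

rng = np.random.default_rng(0)
w = 50

good = rng.normal(loc=10.0, scale=1.0, size=w)  # well-behaved inferer
bad1 = np.full(w, 3.0)                          # constant-output model
bad2 = np.full(w, 3.0)                          # identical constant-output model

cov = np.cov(np.vstack([good, bad1, bad2]))
print(cov)
# The rows/columns for the constant inferers are exactly zero, so the
# mean absolute entry barely registers the duplicated, useless models.
print(np.mean(np.abs(cov)))
```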

To combat this I thought about replacing the covariance with a different measure of distance. We want a metric that gets large when the estimates are concentrated. To measure concentration I only look at the current inferences (the inferences at epoch i), compute their mean, and then compute the average absolute distance to that mean over all inferences. This measures how spread out the current inferences are, but does not consider average behaviour over time.

Then I chose the health metric to be (absolute error in network inference)/(average absolute distance to current average inference). The results are below (no log on the metric here):

This appears to blow up too quickly, but it is definitely picking up when we have duplicates. Maybe this can be tuned by considering (absolute error in network inference)/max(average absolute distance to the current average inference, epsilon).
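
A sketch of this spread-based metric with the epsilon clamp; the epsilon value and function names are placeholders:

```python
import numpy as np

def mdm(current_inferences):
    """Mean absolute distance of the current inferences to their mean."""
    x = np.asarray(current_inferences, dtype=float)
    return np.mean(np.abs(x - x.mean()))

def mdm_health(current_inferences, network_inference, ground_truth, eps=1e-8):
    # (absolute error in network inference) / max(MDM, eps)
    abs_error = abs(network_inference - ground_truth)
    return abs_error / max(mdm(current_inferences), eps)
```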


Great stuff @cal-hawk! Adding some brief feedback here:

  • “I will assume for each topic we want to return a single scalar metric at every epoch. The focus is on evaluating the diversity of models used by the inferers (or predictors) and the accuracy of the network’s aggregated inference.” — I could imagine using two metrics (e.g. accuracy and model uniqueness)
  • “so I am interpreting increasing volatility as decreasing worker performance” — I think that’s only partially right. Volatility is the stdev of the target variable. It makes sense to me that the accuracy worsens if the scatter of the target increases; it’s as if the difficulty increases. The idea was that this maintains a roughly constant ratio of the MAE to the stdev (volatility) of the target, but that doesn’t seem to happen in your plots (otherwise the first metric wouldn’t increase towards the right). This requires some attention.
  • “For reducing the covariance to a single number, I take the mean of the absolute values of the entries.” — would the determinant work?
  • I like your suggestion to use the mean distance to the mean (MDM) and calculate (something like) MDM/MAE. Maybe log helps stop it from blowing up quickly?

Thanks! Ahh I see, that makes sense. It appears the volatility also controls the predictor errors and biases, so I will modify the sim so I can separately modulate the scatter of the target and the worker errors and biases. I will look closer at this in general. I will also try the determinant; IIRC that has some theory behind it: if the data is Gaussian, the log of the covariance determinant corresponds (up to constants) to the differential entropy, the continuous analogue of entropy.

First, I modified the simulator to separately modulate the volatility of the target and the volatility of the predictors: predictor_volatility sets the errors and biases of the predictors, and target_volatility sets the scatter (stdev) of the target variable.

Then I tried the determinant. Results:


It seems that if there are any duplicates, the determinant is zero, so this doesn’t work as intended. However, it could serve as a binary test of health: if the covariance matrix over the last X timesteps is singular, we know we have duplicates.
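
A quick check of why: duplicating an inferer duplicates a row and column of the covariance matrix, which makes it singular, so the determinant is zero up to floating-point error:

```python
import numpy as np

rng = np.random.default_rng(1)
w = 50

original = rng.normal(size=(w, 5))                            # 5 independent inferers
with_duplicate = np.column_stack([original, original[:, 0]])  # append a copy of inferer 0

print(np.linalg.det(np.cov(original.T)))        # generically non-zero
print(np.linalg.det(np.cov(with_duplicate.T)))  # ~0: duplicated column makes it singular
```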

In light of this I tried several ways to get the covariance down to a single number (a sketch of these reductions follows the list):

  • Largest eigenvalue
  • Smallest eigenvalue
  • Condition number
  • Frobenius norm
  • Mean absolute entry
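
Here is a sketch of these reductions, assuming the covariance matrix is computed as before:

```python
import numpy as np

def covariance_reductions(cov):
    eigvals = np.linalg.eigvalsh(cov)  # real eigenvalues of the symmetric matrix, ascending
    return {
        "largest_eigenvalue": eigvals[-1],
        "smallest_eigenvalue": eigvals[0],
        "condition_number": np.linalg.cond(cov),
        "frobenius_norm": np.linalg.norm(cov, ord="fro"),
        "mean_abs_entry": np.mean(np.abs(cov)),
    }
```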

Here are the results. In all of these I am normalising by the MAE between the network inference and the ground truth. Logs are pointed out in the titles:



These all have the same issue as the results in my last update: the metrics are increasing towards the right.

I tried log10(MDM/MAE), as Diederik had mentioned in his feedback. This is my favorite metric so far.

Here are the results:



It is decreasing in the number of duplicates no matter the volatility, so if it is small we know there is a problem, and the ratios are roughly constant across volatility levels. Overall, I really like this.


MDM/MAE: nice!

Question: what is “small”?

As for the determinant: I expect copies to introduce tiny noise terms on their inferences. Can you maybe model with that? Then the determinant won’t vanish (but might depend on noise amplitude).

“Question: what is ‘small’?”

I had the same thought. Short answer: I don’t know right now. Long answer: I think we need to test on real data to determine what ‘small’ is. I am curious whether different topics will have drastically different results, which would require a definition of small for each topic. In general, let’s say that if this number is less than X, it is small and unhealthy. This would imply that MDM ≤ MAE·10^X. Can we calibrate based on this?

One thought I had was to look at 1/log(MDM/MAE), since it is bounded between 0 and 1, but we still have the same problem of deciding how close to 0 or 1 counts as unhealthy.

“Can you maybe model with that?”

Yes: I just made copies and then perturbed them with zero-mean Gaussian noise of varying stdev.
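
The perturbation itself is just the following (sigma being the stdev I vary; names are illustrative):

```python
import numpy as np

def noisy_copies(base_inference, n_copies, sigma, rng=None):
    """Copies of a single inference perturbed by zero-mean Gaussian noise."""
    rng = rng or np.random.default_rng()
    return base_inference + rng.normal(0.0, sigma, size=n_copies)
```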

Result for stdev of .1


Result for stdev of 1

On “what is small”: I think we can quantify that statistically in terms of expected stdev/covariance etc. Even a simple MC experiment would probably tell us what is right.
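
For example, a toy MC along those lines could give a baseline distribution for log10(MDM/MAE) under a “healthy” model; every distributional choice below (independent Gaussian workers, the noise scales, the mean as the network inference) is an assumption:

```python
import numpy as np

rng = np.random.default_rng(2)
n_inferers, n_trials = 50, 10_000
target_sigma, worker_sigma = 1.0, 0.5  # assumed scales, purely illustrative

samples = []
for _ in range(n_trials):
    truth = rng.normal(0.0, target_sigma)
    inferences = truth + rng.normal(0.0, worker_sigma, size=n_inferers)
    network = inferences.mean()  # stand-in for the network inference
    mae = abs(network - truth)
    mdm = np.mean(np.abs(inferences - inferences.mean()))
    samples.append(np.log10(mdm / max(mae, 1e-12)))

# e.g. flag "small" as anything below the 1st percentile of the healthy baseline
print(np.percentile(samples, 1))
```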

But maybe we actually want to observe typical values we get in practice, and learn from/decide based on that.

Totally agree. My thought is that any thresholds we develop now will be a function of the assumptions we make (statistical properties, first principles, etc.). I fear that, given how many assumptions we have to make, these thresholds won’t be meaningful in practice. I like the idea of having binary healthy/unhealthy cases, but I think we need more data to define those cases in a meaningful way.
