Monitoring forecaster health

In this post, I’ll extend the work we’ve done on inferer health to focus on the health of forecasters. My goal is to take the strategy we developed for inferers—assessing model diversity—and apply it directly to forecasters.

Specifically, I will look at the forecast-implied inferences (FIIs), measure the diversity among them, and normalize by the error in the network’s combined inference. This will help us understand whether forecasters are producing distinct and valuable predictions or converging in ways that might reduce the network’s overall performance.

The goal is to replicate the metric we used for the inferers on the forecast-implied inferences. I modified the topic simulator to include a configurable number of duplicate forecasters and to take an input controlling the forecaster volatility. I compute the MDM of the FIIs and normalize by the MAE of the network combined inference. The simulations use 100 inferers and 30 forecasters. I vary the number of duplicate forecasters along with three volatility parameters: the inferer volatility, the forecaster volatility, and the target volatility.
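For concreteness, here is roughly what the per-topic computation looks like. This is a minimal sketch rather than the simulator code: the function names are mine, and I am assuming the MDM is a mean pairwise absolute distance among the FIIs; the normalization by the MAE of the combined inference follows the description above.

```python
import numpy as np

def mdm(values: np.ndarray) -> float:
    """Diversity of one epoch's predictions, taken here as the mean pairwise
    absolute distance between them (my assumed form of the MDM)."""
    n = len(values)
    diffs = np.abs(values[:, None] - values[None, :])
    return diffs.sum() / (n * (n - 1))  # average over off-diagonal pairs only

def fii_health_metric(fiis: np.ndarray, combined: np.ndarray, target: np.ndarray) -> float:
    """MDM of the forecast-implied inferences, normalized by the MAE of the
    network combined inference against the target.

    fiis:     (n_epochs, n_forecasters) forecast-implied inferences
    combined: (n_epochs,) network combined inference
    target:   (n_epochs,) ground truth
    """
    mdm_per_epoch = np.array([mdm(epoch) for epoch in fiis])
    mae = np.mean(np.abs(combined - target))
    return float(np.mean(mdm_per_epoch) / mae)
```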

Here are the results. Modulating inferer volatility:

Modulating forecaster volatility:

Modulating target volatility:

In general, the MDMs are much smaller for the FIIs than they were for the raw inferences. In the last plot, the MDM for 5 duplicate forecasters is smaller than the MDM for 7 duplicate forecasters, which seems odd; as a result (left-most plot), the metric does not decrease monotonically as the number of duplicate forecasters increases. The second set of plots is a mess. Those plots do look more monotonic, but MDM/MAE does not maintain a constant ratio the way it did for the raw inferences.

It is not immediately clear to me what is happening here. I will think about this more.

Yeah… forecasters sometimes behave unexpectedly. There actually is a subtle difference here relative to inferer health, in that we mind redundancy a bit less, because the network greatly benefits from the presence of forecasters. They may produce the same output most of the time, but whenever the rare context comes up that they foresee, we want them to be there because then they might differ.

Yeah, I was thinking about that; I guess duplicate forecasters aren’t necessarily bad, especially if the forecasters are able to foresee said context. So I am not really sure whether a metric based on FIIs makes sense, or what to test.

Let’s say we want to focus on whether the forecasters are actually improving the network’s performance. Here are some general ideas:

  1. Measure Regret Over Time:
  • Track the average forecasted regret across all forecasters in a topic. Sustained positive regret indicates that forecasters expect the network to outperform its past performance, suggesting they foresee an improvement in conditions.
  2. Compare Naive vs. Full Network Performance:
  • Calculate the naive network loss (excluding forecast-implied inferences) and compare it with the full network loss (including forecast-implied inferences). If the full network consistently outperforms the naive network, the forecasters are contributing positively. This is Equation 44 in the white paper.
  3. Track One-Out Metrics:
  • Use one-out network inferences to see how excluding an individual forecaster’s contribution impacts overall performance. If removing a forecaster consistently results in worse performance, that forecaster is adding value. This is Equation 22 in the white paper. A sketch of how (2) and (3) could be computed follows this list.
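As a rough illustration of (2) and (3), here is one way the comparisons could be computed from logged inferences. The variable names, the MAE loss, and the array layout are illustrative assumptions; this does not reproduce the white paper’s exact loss and weighting scheme.

```python
import numpy as np

def mae_loss(pred: np.ndarray, target: np.ndarray) -> float:
    return float(np.mean(np.abs(pred - target)))

def forecaster_contributions(full_inference, naive_inference, one_out_inferences, target):
    """(2) Macro check: does including the FIIs beat the naive network?
    (3) Per-forecaster check: does removing forecaster j's FII hurt?

    full_inference:     (n_epochs,) combined inference including FIIs
    naive_inference:    (n_epochs,) combined inference excluding FIIs
    one_out_inferences: (n_epochs, n_forecasters) combined inference with
                        forecaster j's FII removed, one column per forecaster
    target:             (n_epochs,) ground truth
    """
    full_loss = mae_loss(full_inference, target)
    naive_loss = mae_loss(naive_inference, target)
    macro_gain = naive_loss - full_loss  # > 0: forecasters help at the macro level

    per_forecaster_gain = np.array([
        mae_loss(one_out_inferences[:, j], target) - full_loss
        for j in range(one_out_inferences.shape[1])
    ])  # > 0: removing forecaster j makes the network worse, i.e. j adds value

    return macro_gain, per_forecaster_gain
```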

All of these tell us whether forecasters are helping network performance. I don’t know if this fully aligns with “They may produce the same output most of the time, but whenever the rare context comes up that they foresee, we want them to be there,” but it aligns with forecaster health in general.

I really like (2) and (3). I think the point is they should collectively contribute, and these macroscopic indicators tell us that. So (2) is macroscopic, and (3) is individual. Something like that?

Yes, that makes sense to me. For the individuals in (3), it may be worth exploring how to get those scores down to one number.
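One rough way to do that, as a sketch only (the recency weighting, the half-life, and the use of one-out loss deltas rather than raw losses are my assumptions):

```python
import numpy as np

def one_out_score(one_out_deltas: np.ndarray, half_life: int = 20) -> float:
    """Collapse a forecaster's per-epoch one-out deltas (loss without the
    forecaster minus loss with it) into a single number. Positive means the
    forecaster has, on a recency-weighted basis, been helping the network.

    one_out_deltas: (n_epochs,) oldest epoch first
    half_life:      number of epochs after which a delta's weight halves
    """
    n = len(one_out_deltas)
    ages = np.arange(n - 1, -1, -1)        # age 0 = most recent epoch
    weights = 0.5 ** (ages / half_life)    # exponential decay with age
    return float(np.sum(weights * one_out_deltas) / np.sum(weights))
```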

But in general I like this. So we monitor the scores in eq. 44 and eq. 22: eq. 44 gives us a sense of whether forecasters are helping at the macro level, and eq. 22 gives us finer-grained information on which forecasters are helping or not. I guess if we only care about the macro level, then eq. 44 alone would be sufficient.
