We can use this experiment to optimise the forecaster and identify the best combination of parameters. We ran forecasters on the synthetic network with parameters drawn from the following sets: model=(combined/global, per-inferer), target=(regret, loss, z-score), EMA set=([3], [7], [3,7], [7,14,30]), autocorrelation=(True, False). For this benchmarking exercise, we used 200 testing epochs.
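As an illustration of the scale of this sweep, below is a minimal sketch of the parameter grid; `run_forecaster` is a hypothetical stand-in for the actual test harness, not part of it:

```python
from itertools import product

def run_forecaster(model, target, ema_set, autocorr, n_epochs=200):
    """Hypothetical benchmark entry point: returns the mean log loss of
    the forecaster-weighted network inference over the testing epochs."""
    raise NotImplementedError("placeholder for the real test harness")

MODELS = ("global", "per-inferer")
TARGETS = ("regret", "loss", "z-score")
EMA_SETS = ([3], [7], [3, 7], [7, 14, 30])
AUTOCORRELATION = (True, False)

grid = list(product(MODELS, TARGETS, EMA_SETS, AUTOCORRELATION))
# 2 models x 3 targets x 4 EMA sets x 2 autocorrelation settings = 48 runs
```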
This figure summarises the results of all tests by showing the mean log loss for each model combination (smaller values indicate better performance), comparing them to the naive network inference (i.e. no forecaster input; black dashed line) and the best worker in the network (grey dashed line). In these tests all forecaster models significantly outperformed both the naive network inference and the best worker, i.e. every model would improve the network inference. The main takeaways from the figure are:
- Per-inferer models (coloured dashed lines; training a separate forecasting model for each inferer) outperform a global/combined model (coloured solid lines; a single model with inferer ID as a feature variable). This allows the forecaster to tailor models to each inferer and prevents it from simply predicting the mean of all inferers.
- The regret z-score models (the loss z-score models are identical, since the shared network-loss term cancels when normalising across inferers) outperform models predicting raw regret or raw loss.
- The models are comparatively insensitive to the EMA lengths (indicated by line colour), though short lengths ([3], [7] or [3, 7]) generally perform best because shorter EMAs react more quickly to recent changes (see the sketch after this list).
- Autocorrelation (the left and right panels show results with and without autocorrelation features) does not significantly change the results, which is unsurprising as no periodic variables were built into the experiment.
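To make the EMA feature sets concrete, below is a minimal sketch of how a per-inferer history could be reduced to EMA features. The `ema_features` helper and the use of pandas' `ewm(span=...)` smoothing are illustrative assumptions; the experiment's exact feature construction may differ.

```python
import pandas as pd

def ema_features(history: pd.Series, spans=(3, 7)) -> dict:
    """Exponential moving averages of a per-inferer time series,
    one feature per span; shorter spans react faster to recent changes."""
    return {f"ema_{s}": history.ewm(span=s, adjust=False).mean().iloc[-1]
            for s in spans}

# Toy loss history over ten epochs for a single inferer
losses = pd.Series([0.9, 0.8, 0.85, 0.7, 0.75, 0.6, 0.65, 0.5, 0.55, 0.4])
print(ema_features(losses, spans=(3, 7)))       # short EMAs track the downtrend
print(ema_features(losses, spans=(7, 14, 30)))  # longer EMAs lag behind
```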
To gain some insight into why the per-inferer z-score models perform best, we compare the predicted and true values for each target variable, using the forecaster models with EMA=[3,7] and autocorrelation as an example. In the figures, smaller loss, larger regret and larger z-score indicate better performance. The three outperforming workers (downtrending, uptrending, crabbing) are indicated in the legends (blue, orange and green points, respectively). As before, unfilled squares show the medians and dashed lines show linear fits for each worker.
Global/combined models:
Per-inferer models:
In general, the per-inferer models show better differentiation between workers. The combined models tend to put all ‘bad’ (random) inferers on similar relations, whereas the per-inferer models can distinguish some differences between them. Similarly, the linear fits for the per-inferer models tend to lie closer to the ideal 1:1 line; i.e. the per-inferer models are more context-aware, being better able to predict out- or underperformance for each individual worker.
For the target variable, the loss forecaster has the most difficult task: it must predict the absolute performance of each worker, which can vary dramatically from epoch to epoch. The predicted losses then need to be converted to regrets (the difference from the full network loss, i.e. a measure of the expected outperformance relative to the network inference) for the weighting calculation. This can be simplified by instead predicting regrets directly, which provides a more stable target by removing systematic epoch-to-epoch variations in losses, and indeed the regret forecasters tend to outperform the loss forecasters.
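A minimal sketch of the loss-to-regret conversion described above, with the sign convention chosen so that larger regret indicates outperformance (as in the figures); the array values are illustrative:

```python
import numpy as np

# Per-epoch log losses (rows: epochs, columns: workers) and the loss
# of the combined network inference for the same epochs.
worker_losses = np.array([[0.90, 0.60, 0.75],
                          [1.20, 0.95, 1.10]])
network_loss = np.array([0.80, 1.05])

# Regret: how much better each worker did than the network inference;
# positive regret means the worker outperformed the network that epoch.
regrets = network_loss[:, None] - worker_losses
```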
One potential issue with regrets, which can be seen above for the crabbing worker, is that if the network inference becomes close to the worker inference (i.e. the network has identified it as an outperforming worker), its regret will tend to zero and the worker will no longer clearly be recognised as outperforming. For this reason we considered the regret z-score (the difference from the mean regret of all inferers, divided by the standard deviation) as an alternative prediction target, as it identifies outperformance relative to the other workers. Dividing by the standard deviation normalises performance across different epochs, which can be seen in the more consistent minimum and maximum true values between different workers. These properties allow the z-score forecasters to significantly outperform both the regret and loss forecasters in this test.
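For completeness, a self-contained sketch of the regret z-score computation; the epsilon guard against a zero standard deviation is an added assumption, not taken from the experiment:

```python
import numpy as np

# Per-epoch regrets (rows: epochs, columns: workers), e.g. computed
# as network_loss - worker_loss per the sketch above.
regrets = np.array([[-0.10, 0.20, 0.05],
                    [-0.15, 0.10, -0.05]])

eps = 1e-12  # guard against a degenerate epoch where all regrets agree
mean = regrets.mean(axis=1, keepdims=True)
std = regrets.std(axis=1, keepdims=True)
z_scores = (regrets - mean) / (std + eps)
# Each row now has mean ~0 and unit spread, so outperformance is
# comparable across epochs regardless of the overall loss level.
```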