Standard forecaster model

We will need to develop an “off-the-shelf” forecaster model that uses a combination of:

  • network data (e.g. historical inference losses, raw inferences, network inferences, worker scores, worker rewards);
  • private data (critical context specific to the target variable that the workers are providing inferences for, which can be used to characterise under which conditions inferences are expected to be more or less accurate);

to forecast the loss of each raw inference under the current conditions. This model should be as general and problem-agnostic as possible, so that it can be used with minor changes in any Allora topic. As part of the pipeline, we need some logic that merges the network data and private data, which will probably involve some block height ↔ timestamp mapping.
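
As a rough illustration of that merge logic, here is a minimal sketch using pandas, assuming we approximate block timestamps from a reference block and an average block time, and align each network row with the latest private row at or before it (the column names, reference height, and block time are hypothetical):

```python
import pandas as pd

# Hypothetical reference values: a block whose timestamp is known, plus an
# approximate average block time, used to map block heights to timestamps.
REFERENCE_HEIGHT = 1_000_000
REFERENCE_TIME = pd.Timestamp("2024-11-01 00:00:00")
AVG_BLOCK_SECONDS = 5

def height_to_timestamp(height: int) -> pd.Timestamp:
    """Approximate the wall-clock time of a block from its height."""
    return REFERENCE_TIME + pd.Timedelta(seconds=(height - REFERENCE_HEIGHT) * AVG_BLOCK_SECONDS)

def merge_network_and_private(network_df: pd.DataFrame, private_df: pd.DataFrame) -> pd.DataFrame:
    """Merge network rows (keyed by 'block_height') with timestamped private data."""
    network_df = network_df.copy()
    network_df["timestamp"] = network_df["block_height"].apply(height_to_timestamp)
    network_df = network_df.sort_values("timestamp")
    private_df = private_df.sort_values("timestamp")
    # For each network row, keep the latest private row at or before its timestamp,
    # so no future private data leaks into the features.
    return pd.merge_asof(network_df, private_df, on="timestamp", direction="backward")
```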

Given that the forecasters are critical to Allora’s context awareness, we should make the model as accurate as possible out of the box. Obviously, workers running forecaster models are expected to differentiate themselves by adding their private data and developing these models further.

We should start by defining the network data, i.e. the on-chain features we collect.

Raw Data

The network operates with a large number of workers. To ensure efficiency and relevance, we focus exclusively on the active set: the subset of workers actively engaged in the network at any given epoch. Forecasts are generated only for the active set, allowing the model to operate quickly while maintaining accuracy. The active set is selected via Merit-Based Sortition; details can be found here.

For the active set, we collect historical data encompassing several key features. Each of these features plays a vital role in the forecaster’s ability to generate accurate predictions:

  1. Inferences: The raw inferences themselves; these are the inferences whose losses we are forecasting. This is arguably the most important feature!
  2. Losses: Actual inference losses. This will provide the target data for training the forecaster.
  3. Rewards: Reflect each worker’s unique contribution and are correlated to their inference losses and scores.
  4. Scores: Quantify the impact of individual predictors on the network’s overall performance, helping interpret losses and rewards.
  5. Addresses: Identify individual workers, providing context awareness.
  6. Timestamps and block heights: Essential for synchronizing with private data.

Each of these features contributes uniquely to the forecaster’s ability to generate context-aware and accurate forecasts.

Feature Engineering

To enhance predictive power, we apply a suite of off-the-shelf feature engineering transformations to the raw network data. For the inferences, rewards, and scores we apply the following transformations:

  • Momentum Features: Capture directional trends in data.
  • Temporal Features: Reflect the time of day or day of the week.
  • Simple Differencing: Isolate changes between consecutive values to highlight deviations.
  • Rolling Mean and Standard Deviation: Provide smoothed averages and variability measures over fixed windows, helping the model adapt to changing patterns.
  • Many more!

We have found these typical time series transformations improve the overall accuracy of forecasts. The infrastructure is designed to be extensible, allowing us and users to incorporate additional transformations as needed.
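
As a concrete illustration, here is a minimal sketch of these transformations using pandas; the column names, group key, and window sizes are illustrative rather than the exact ones used in the pipeline:

```python
import pandas as pd

def add_time_series_features(df: pd.DataFrame, col: str, windows=(5, 10)) -> pd.DataFrame:
    """Add momentum, temporal, differencing, and rolling features for one column.

    Assumes df has a datetime 'timestamp' column and a worker identifier in 'address'.
    """
    df = df.sort_values(["address", "timestamp"]).copy()
    grouped = df.groupby("address")[col]

    # Simple differencing: change between consecutive values per worker
    df[f"{col}_diff"] = grouped.diff()

    # Momentum: value relative to its value a few steps ago
    df[f"{col}_momentum"] = grouped.transform(lambda s: s - s.shift(5))

    # Rolling mean and standard deviation over fixed windows
    for w in windows:
        df[f"{col}_roll_mean_{w}"] = grouped.transform(lambda s, w=w: s.rolling(w).mean())
        df[f"{col}_roll_std_{w}"] = grouped.transform(lambda s, w=w: s.rolling(w).std())

    # Temporal features: time of day and day of the week
    df["hour"] = df["timestamp"].dt.hour
    df["day_of_week"] = df["timestamp"].dt.dayofweek
    return df
```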

By focusing on the active set, leveraging key features, and applying meaningful transformations, the forecaster model extracts insights from network data efficiently and effectively. This infrastructure forms the basis of the model’s predictive capabilities, enabling it to deliver accurate forecasts for the network.

I like the comprehensiveness of the feature data generation. Of course here the forecaster is forecasting losses, but am I interpreting this correctly, in that such a feature set would work for pretty much any time series?

Yes! The goal was to construct a problem-agnostic feature set that could be used for pretty much any time series.

The only feature that is not typical for time series is the addresses, but I have a plan for those. The raw addresses are alphanumeric strings such as allo1za8r9v0st4ntfyeka23qs5wvd7mvsnzhztupk0, so in general this is a categorical feature. Some machine learning models, such as XGBoost and LightGBM, can natively handle categorical features; for these models, we should leave the addresses as categorical. However, other models, such as neural networks, require all features to be numeric; for these models, we should apply label encoding. One-hot encoding is also viable here, but in my testing I have observed that the dimensionality explosion caused by one-hot encoding degrades performance.
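
A minimal sketch of this address handling, assuming the addresses live in an 'address' column and the model family is known up front (the function and column names are illustrative):

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

def encode_addresses(df: pd.DataFrame, model_family: str) -> pd.DataFrame:
    """Encode the 'address' column depending on what the downstream model supports."""
    df = df.copy()
    if model_family in {"lightgbm", "xgboost"}:
        # These libraries can handle pandas categorical columns natively
        # (e.g. LGBMRegressor, or XGBRegressor with enable_categorical=True).
        df["address"] = df["address"].astype("category")
    else:
        # Models that need purely numeric inputs (e.g. neural networks):
        # label encoding keeps dimensionality low, unlike one-hot encoding.
        df["address"] = LabelEncoder().fit_transform(df["address"])
    return df
```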

With this handling of the addresses, we have a problem-agnostic feature set that could be used for pretty much any time series.

Thanks much for explaining, this sounds excellent. Looking forward to your continued work on this!

No problem and thank you for asking!

For private data, the goal is to aggregate features that provide “critical context specific to the target variable that the workers are providing inferences for.” To achieve this we need to be more problem-specific. As of writing, most of our topics are performing price prediction for various cryptocurrencies, so I will focus on the private data for those topics.

At a high level, the idea is to re-use the features used for the inference task.

When you run a worker node on the Allora network there are three main roles you can fill: inferer, forecaster, or reputer. It’s important to note that a worker can perform combinations of these tasks simultaneously; for example, we can have workers performing both the inference and forecasting tasks. For workers like this, the private data for the forecasting task should be the features used to generate the inferences. In general, if these features provide enough information to generate accurate inferences, they should also provide plenty of context to the forecaster. More details about the off-the-shelf inference workers can be found here. For forecasters who are not participating in the inference tasks, this data will not be pre-computed during the inference task. Thankfully it’s straightforward to reconstruct, and I’ll outline the process below.

These price prediction models start by querying the Binance API for the historical open, high, low, close, and volume of an asset. Then we apply typical Technical Analysis (TA) features to capture key market dynamics:

  • Trend Indicators: Moving averages (SMA, EMA) and momentum metrics (RSI, momentum, rate of change).
  • Volatility Measures: Bollinger Bands, rolling volatility, and ATR.
  • Volume-Based Insights: On-Balance Volume (OBV) and VWAP.
  • Price Behavior: Ratios such as high/low and close/open.
  • Lagged and Rolling Features: Lagged prices and rolling minimums.
  • Derived Metrics: SMA-EMA crossovers for trend identification.
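
As a rough sketch, a few of these indicators can be computed directly from the OHLCV data with pandas; the window lengths and column names below are illustrative:

```python
import pandas as pd

def add_ta_features(ohlcv: pd.DataFrame) -> pd.DataFrame:
    """Add a few of the technical indicators listed above to an OHLCV frame.

    Assumes columns: 'open', 'high', 'low', 'close', 'volume'.
    """
    df = ohlcv.copy()

    # Trend: simple and exponential moving averages, and their crossover
    df["sma_20"] = df["close"].rolling(20).mean()
    df["ema_20"] = df["close"].ewm(span=20, adjust=False).mean()
    df["sma_ema_cross"] = df["sma_20"] - df["ema_20"]

    # Momentum: RSI over 14 periods
    delta = df["close"].diff()
    gain = delta.clip(lower=0).rolling(14).mean()
    loss = (-delta.clip(upper=0)).rolling(14).mean()
    df["rsi_14"] = 100 - 100 / (1 + gain / loss)

    # Volatility: Bollinger Bands around the rolling mean
    rolling_std = df["close"].rolling(20).std()
    df["bb_upper"] = df["sma_20"] + 2 * rolling_std
    df["bb_lower"] = df["sma_20"] - 2 * rolling_std

    # Volume: on-balance volume and a cumulative VWAP
    direction = df["close"].diff().apply(lambda x: 1 if x > 0 else -1 if x < 0 else 0)
    df["obv"] = (df["volume"] * direction).cumsum()
    typical_price = (df["high"] + df["low"] + df["close"]) / 3
    df["vwap"] = (typical_price * df["volume"]).cumsum() / df["volume"].cumsum()

    # Price behaviour and lags
    df["high_low_ratio"] = df["high"] / df["low"]
    df["close_lag_1"] = df["close"].shift(1)
    return df
```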

This results in a private dataset that will provide “critical context specific to the target variable that the workers are providing inferences for.” This private data is then merged with network data using block height ↔ timestamp to ensure temporal consistency. The resulting dataset integrates:

  • Network features: Worker inferences, addresses, losses, scores, rewards, and transformed versions of those.
  • Private features: Contextual signals such as market trends and volatility.

I believe this combination creates a comprehensive dataset that will support accurate and context-aware forecasting.

I’ve been working on getting a suite of models implemented and trained.

The data, as introduced in my previous posts, has been engineered to be compatible with most off-the-shelf machine learning models. Here I want to offer a variety of initial models and engineer a training pipeline that allows any of these models to perform well out of the box. This ensures flexibility and allows us to provide a strong starting point for users to build upon.

Model Suite

Generally, the network will benefit from a diverse pool of forecaster models. We offer several models:

  • Tree-Based Models:
    • LightGBM and XGBoost: Efficient, scalable, and excellent for structured data with categorical features.
    • RandomForest and ExtraTreesRegressor: Robust ensemble methods for capturing complex interactions.
    • GradientBoostingRegressor: A slower but highly accurate boosting algorithm.
  • Neural Networks:
    • Flexible architectures for capturing non-linear relationships.

This diverse pool accommodates different user preferences and also serves as a foundation for us to test how diverse forecasts influence the network.
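
One way to express that pool in code is sketched below; the hyperparameters are illustrative, and sklearn’s MLPRegressor stands in for a more flexible neural network architecture:

```python
from sklearn.ensemble import (
    RandomForestRegressor, ExtraTreesRegressor, GradientBoostingRegressor,
)
from sklearn.neural_network import MLPRegressor
from lightgbm import LGBMRegressor
from xgboost import XGBRegressor

# A diverse pool of candidate forecaster models, keyed by name.
MODEL_POOL = {
    "LightGBM": LGBMRegressor(),
    "XGBoost": XGBRegressor(),
    "RandomForest": RandomForestRegressor(),
    "ExtraTrees": ExtraTreesRegressor(),
    "GradientBoosting": GradientBoostingRegressor(),
    "NeuralNetwork": MLPRegressor(hidden_layer_sizes=(64, 32), max_iter=1000),
}
```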

Training Workflow

We focus on making the models as accurate as possible out of the box through a systematic training process:

  1. Grid Search Cross-Validation:
  • Hyperparameter tuning is performed using grid search with cross-validation to find the optimal settings for each model. Each model is configured with an initial set of parameters to search over.
  • Cross-validation ensures robust performance across different data splits, minimizing overfitting and improving generalizability.
  2. Full Data Training:
  • After determining the best parameters, the model is trained on the full dataset to maximize its exposure to historical patterns and its ability to capture the most recent data.
  • The trained model and its parameters are saved for use during forecast generation.

This process ensures that the forecaster is both robust and ready to handle new data with minimal adjustments.
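
A minimal sketch of that two-step workflow, using ExtraTrees and a placeholder parameter grid (the real grids are defined per model):

```python
import joblib
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.model_selection import GridSearchCV

def train_forecaster(X, y):
    """Step 1: grid search with cross-validation; step 2: train on the full dataset."""
    param_grid = {                      # illustrative grid, not the real one
        "n_estimators": [200, 500],
        "max_depth": [None, 10, 20],
    }
    search = GridSearchCV(
        ExtraTreesRegressor(random_state=0),
        param_grid,
        cv=5,
        scoring="neg_mean_squared_error",
        n_jobs=-1,
    )
    # With refit=True (the default), the best configuration is retrained on all of X, y
    search.fit(X, y)

    # Persist the trained model and its parameters for use at forecast time
    joblib.dump({"model": search.best_estimator_, "params": search.best_params_}, "forecaster.joblib")
    return search.best_estimator_
```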

Validation

I’ve implemented the process outlined above using data from the test network. Here I will focus on topics 3 and 5, which are the BTC and SOL 10-minute prediction topics, respectively. To perform these tests I first find the best parameters using grid search cross-validation. Then I iterate over the most recent active inferences, remove each inference from the dataset to use as a test point, and train on the remaining data. I then test the model on the removed data point and compute error metrics. This lets us directly test how well the models are forecasting the most recent losses.
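
A sketch of that evaluation loop, assuming X and y are NumPy arrays whose last rows correspond to the most recent active inferences (the function name and the number of held-out points are illustrative):

```python
import numpy as np
from sklearn.metrics import mean_squared_error, mean_absolute_error

def evaluate_recent(model, X, y, n_recent=20):
    """Hold out each of the n_recent most recent rows in turn, train on the rest,
    and score how well the model forecasts the held-out loss."""
    preds, truths = [], []
    for i in range(len(X) - n_recent, len(X)):
        train_mask = np.ones(len(X), dtype=bool)
        train_mask[i] = False
        model.fit(X[train_mask], y[train_mask])
        preds.append(model.predict(X[i:i + 1])[0])
        truths.append(y[i])
    preds, truths = np.array(preds), np.array(truths)
    return {
        "mse": mean_squared_error(truths, preds),
        "mae": mean_absolute_error(truths, preds),
        "mape": float(np.mean(np.abs((truths - preds) / truths)) * 100),
    }
```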

Topic 3: BTC 10min prediction

Extra Trees

  • Mean Squared Error: 0.2778
  • Mean Absolute Error: 0.3829
  • Mean Absolute Percentage Error: 6.79%

Gradient Boosting

  • Mean Squared Error: 0.3687
  • Mean Absolute Error: 0.5391
  • Mean Absolute Percentage Error: 8.60%

Random Forest

  • Mean Squared Error: 0.4050
  • Mean Absolute Error: 0.4834
  • Mean Absolute Percentage Error: 8.19%

LightGBM

  • Mean Squared Error: 0.4341
  • Mean Absolute Error: 0.5260
  • Mean Absolute Percentage Error: 8.73%

Neural Network

  • Mean Squared Error: 0.5356
  • Mean Absolute Error: 0.5850
  • Mean Absolute Percentage Error: 9.57%

XGBoost

  • Mean Squared Error: 1.0785
  • Mean Absolute Error: 0.8792
  • Mean Absolute Percentage Error: 13.49%

Topic 5: SOL 10 min prediction


LightGBM

  • Mean Squared Error: 0.1398
  • Mean Absolute Error: 0.2853
  • Mean Absolute Percentage Error: 3.46%

Random Forest

  • Mean Squared Error: 0.1600
  • Mean Absolute Error: 0.2863
  • Mean Absolute Percentage Error: 4.02%

Gradient Boosting

  • Mean Squared Error: 0.2376
  • Mean Absolute Error: 0.3786
  • Mean Absolute Percentage Error: 5.94%

Extra Trees

  • Mean Squared Error: 0.2378
  • Mean Absolute Error: 0.4117
  • Mean Absolute Percentage Error: 6.63%

XGBoost

  • Mean Squared Error: 0.3154
  • Mean Absolute Error: 0.4058
  • Mean Absolute Percentage Error: 4.95%

Neural Network

  • Mean Squared Error: 0.4380
  • Mean Absolute Error: 0.5333
  • Mean Absolute Percentage Error: 7.75%

The scatter plots show the true reported losses from the test network on the x axis and the forecasted losses on the y axis, colored by model type. The values reported in the bulleted lists are computed by averaging over each test and are sorted based on MSE. The main thing to note is that every model is demonstrating the capability to provide accurate forecasts out of sample. The two topics correspond to two different learning tasks, and the results above show that the models have varying performance on the different topics. This highlights the need for a diverse pool of forecasters.

These results demonstrate that the forecaster models perform well across different topics, generalize effectively outside the training data, and provide accurate forecasts with minimal tuning.

In this pipeline I strategically avoided doing train/test splits based on timestamps, as is typically done for time series. Our setup presents unique challenges: workers have different amounts of historical data, making typical time series splits impractical. Custom solutions are required, and above I opted to use all data with off-the-shelf cross-validation, but other custom approaches could be applicable.

By combining a flexible model suite with robust training methods we ensure the forecaster is accurate and generalizable. The infrastructure we’ve built enables quick experimentation with different models, making the system adaptable to evolving network scenarios.

Oh wow, that was quick! Very promising. So… amusingly neural nets and XGBoost look like they’re the worst? Could be worth aggregating this information over all topics where we might run them and making an aggregated model architecture ranking.

It’s still very likely that different architectures win on different topics (underlining the need for multiple forecasters!), but it’d still be good to have that – maybe an aggregated ranking as well as the mean and stdev of 1/rank across all tests. The latter could be used as a score to distinguish models that win at least sometimes from models that never win.

Thanks for the analysis! I’ve gone ahead and aggregated the results across topics 1-14, across three specific days (11/22, 11/23, and 11/25) to get a more robust picture.
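
For reference, the inverse-rank scores can be computed along these lines, assuming a results frame with one MSE value per model per topic per day (column names are illustrative):

```python
import pandas as pd

def inverse_rank_scores(results: pd.DataFrame) -> pd.DataFrame:
    """Compute mean and std of 1/rank per model, where rank 1 is the lowest MSE.

    Expects columns: 'day', 'topic', 'model', 'mse'.
    """
    results = results.copy()
    # Rank models within each (day, topic) group; lower MSE -> better rank
    results["rank"] = results.groupby(["day", "topic"])["mse"].rank(method="min")
    results["inv_rank"] = 1.0 / results["rank"]
    # Aggregate across all topics and days: higher mean inverse rank is better
    return results.groupby("model")["inv_rank"].agg(["mean", "std"]).sort_values("mean", ascending=False)
```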

Here are the average inverse ranks for all models across all topics on each day (higher is better):

[Charts of per-model average inverse rank for 11/22, 11/23, and 11/25]

Then, looking at the mean inverse rank across all topics over all days:

[Chart of per-model mean inverse rank across all topics and days]

You’re right about Neural Networks and XGBoost consistently underperforming in these tests. However, this heatmap over topics (averaged over the three days) reveals some interesting patterns:

Different models do indeed excel at different topics:

  • LightGBM achieved the highest score (0.833) on Topic 6
  • NeuralNetwork achieved the highest score (0.722) on Topic 2
  • GradientBoostingRegressor performed best on Topic 8 (0.750)
  • RandomForest dominated Topics 13 and 14 (0.750-0.778)
  • ExtraTreesRegressor showed strong performance on Topic 3 (0.778)

The standard deviations are shown below:

This shows considerable variation in performance across topics for all models, supporting your point about the importance of maintaining multiple forecasters.

ExtraTrees, GradientBoosting, and RandomForest generally performed better than the other approaches, but the relative performance varies significantly by topic.

This analysis definitely supports the multiple-forecaster approach: while some models perform better on average, each has moments of superior performance on specific topics. The variation in performance suggests that no single model consistently dominates across all scenarios.

This is great. Thanks for doing the analysis!