Price/returns topic feature engineering

Looking at participant performance in Allora’s Forge programme, I have the impression that feature engineering is the bottleneck in several of the current price/return prediction topics.

Markets are like physical systems, with action and reaction. If you think about them that way, you can identify which variables might carry signal. As a data scientist you can then test that hypothesis. But just being a data scientist isn’t enough – you need that physics perspective to first understand where the signal may be.

Good ideas don’t come by staring at a screen on your own. They come from collaboration and discussion. I think it could help if participants got together more to work on the feature engineering, and discuss what may or may not work.

There is a clear incentive to collaborate: on mainnet, the rewards paid out in a topic will be set by the topic weight, which is calculated using the stake in a topic and the revenue that it generates. Obviously, performant topics generate more revenue. So while you will be competing for rewards, you will also collectively be competing against other topics.

This thread is aimed at carrying out research on feature engineering for financial price forecasting (i.e. log-returns topics). I would suggest we first make an inventory of commonly-used features and carry out importance analysis for the main ones. Additionally, we should reason where we expect to see the strongest signal based on our experience with the markets, engineer new features, and test their predictive power.

From the Allora research team, there may be involvement from myself, @florian, @joel, and @steve. Others might join later too.


The following analysis pertains to my model development for the network’s 24-hour PAXG/USD Log-Return Prediction topic (Topic 60):

  1. Base Data: To construct my dataset, I first retrieved historical price data (OHLCV — Open, High, Low, Close, Volume) from Tiingo, covering the past 250 days. This spans from October 25, 2024, to July 2, 2025. Since the topic updates every 5 minutes, I collected data at a 5-minute interval to ensure the model remains responsive to short-term fluctuations.
    However, this raw data required modification, as our prediction horizon is 1 day (1440 minutes), not 5 minutes. (Fig.1)

Thus, I modified the dataset so that each datapoint represents the trailing 1-day window (a short pandas sketch of this aggregation follows the list below). For example, for the datapoint at 00:05 AM on July 2 (Fig.2):

  • Open: The price at 00:05 AM on July 1 (exactly one day prior).
  • High: The highest price between 00:05 AM on July 1 and 00:05 AM on July 2.
  • Low: The lowest price between 00:05 AM on July 1 and 00:05 AM on July 2.
  • Close: The price at 00:05 AM on July 2.
  • Volume: The sum of trading volumes from all 5-minute intervals between 00:05 AM on July 1 and 00:05 AM on July 2.
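
For reference, this rolling 1-day aggregation can be sketched in pandas roughly as follows (assuming a hypothetical DataFrame `df` of 5-minute OHLCV bars with a DatetimeIndex, sorted oldest-first; the exact off-by-one conventions in my pipeline may differ):

```python
import pandas as pd

WINDOW = 288  # 288 x 5-minute bars = 1 day

# df: hypothetical 5-minute OHLCV bars, DatetimeIndex, oldest-first
daily = pd.DataFrame(index=df.index)
daily["open"] = df["open"].shift(WINDOW)              # price exactly one day earlier
daily["high"] = df["high"].rolling(WINDOW).max()      # highest price over the trailing day
daily["low"] = df["low"].rolling(WINDOW).min()        # lowest price over the trailing day
daily["close"] = df["close"]                          # close of the current bar
daily["volume"] = df["volume"].rolling(WINDOW).sum()  # total volume over the trailing day
daily = daily.dropna()
```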

The target variable was defined as: Log-Return(t) = ln(close(t) / close(t + 288)),
where t represents the current datapoint. The close price at t + 288 is used because one day consists of 288 intervals at 5-minute resolution (1 day = 288 × 5 minutes).
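
In code, the target can be expressed along these lines (a sketch using the same hypothetical 5-minute close series; shift(-288) looks exactly one day ahead):

```python
import numpy as np

# Log-Return(t) = ln(close(t) / close(t + 288)), as defined above
target = np.log(df["close"] / df["close"].shift(-288))
target = target.dropna()  # the final day of data has no 1-day-ahead close
```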

2. Feature Engineering:
To enhance the predictive power of the model, I engineered a variety of features capturing traditional technical indicators and statistical properties of the time series. The numbers next to the feature names are window lengths in datapoints. Below is a breakdown of the feature categories; a short code sketch of a few representative features follows the list:

  • OHLCV Basics:
    • open, high, low, close, volume, volumeNotional, tradesDone
      These capture standard daily trading metrics and activity.
  • Technical Indicators:
    • Volatility and Momentum:
      • Bollinger_High, Bollinger_Low: Bollinger Bands to measure price volatility relative to a moving average.
      • RSI_10, RSI_100: Relative Strength Index at short and long windows to measure overbought/oversold conditions.
      • MACD, KST: Momentum indicators that highlight trend shifts.
      • OBV: On-Balance Volume, combining price movement and volume to detect accumulation or distribution.
    • Moving Averages:
      • SMA_20, SMA_100, SMA_200, SMA_500, SMA_1000: Simple moving averages over different timeframes to capture medium- and long-term trends.
      • EMA_20, EMA_100, EMA_200, EMA_500, EMA_1000: Exponential moving averages which give more weight to recent prices.
      • Difference-based indicators:
        • EMA_100-10: Difference between EMA_100 and EMA_10 for short-vs-long-term momentum.
        • EMA_200-100: Difference between EMA_200 and EMA_100, a proxy for the slope of the longer-term trend.
        • EMA_100-SMA_100: To contrast exponential vs simple moving average trends.
  • Volatility Metrics:
    • std_0.05, std_0.1: Exponentially weighted standard deviations to assess micro-level volatility. The numbers (0.05 and 0.1) are the alpha parameter of pandas' DataFrame.ewm (exponentially weighted moving) method.
  • Price Relations (Candle Dynamics):
    • diff_trend, high-low, high-open, low-open, close-open: Capture intraday price range, spread, and behavior.
  • Statistical Features:
    • mean, log_volume: Basic statistics and transformed versions of core metrics to normalize scale.
  • Lag Features (Autoregressive Memory):
    • return_open, return, open-close_return, 2_lag_return through 10_lag_return:
      These include raw return, return relative to open or close, and up to 10-day lagged returns to capture autocorrelation and past signal memory.
  • Seasonality & Time Encoding:
    • seasonal_decomposition: Additive seasonal decomposition components of returns over fixed lags.
    • second_of_day_sin, second_of_day_cos: Cyclical encoding of time of day to capture intra-day periodicity.
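
To make a few of these concrete, here is a rough pandas sketch of some representative features (illustrative only; the window parameterisation and the exact lag-feature definitions in my pipeline may differ slightly):

```python
import numpy as np
import pandas as pd

def add_features(df: pd.DataFrame) -> pd.DataFrame:
    """Illustrative subset of the feature set described above (df: OHLCV with DatetimeIndex)."""
    out = df.copy()

    # Moving averages (window lengths in datapoints)
    out["SMA_100"] = out["close"].rolling(100).mean()
    out["EMA_100"] = out["close"].ewm(span=100, adjust=False).mean()
    out["EMA_1000"] = out["close"].ewm(span=1000, adjust=False).mean()
    out["EMA_100-SMA_100"] = out["EMA_100"] - out["SMA_100"]

    # Exponentially weighted volatility (numbers are the alpha of DataFrame.ewm)
    out["std_0.05"] = out["close"].ewm(alpha=0.05, adjust=False).std()
    out["std_0.1"] = out["close"].ewm(alpha=0.1, adjust=False).std()

    # Candle dynamics
    out["close-open"] = out["close"] - out["open"]
    out["high-low"] = out["high"] - out["low"]

    # Lagged daily log-returns (288 datapoints = 1 day); the lag interpretation is illustrative
    out["return"] = np.log(out["close"] / out["close"].shift(288))
    for lag in range(2, 11):
        out[f"{lag}_lag_return"] = out["return"].shift((lag - 1) * 288)

    # Cyclical time-of-day encoding
    second_of_day = out.index.hour * 3600 + out.index.minute * 60 + out.index.second
    out["second_of_day_sin"] = np.sin(2 * np.pi * second_of_day / 86400)
    out["second_of_day_cos"] = np.cos(2 * np.pi * second_of_day / 86400)

    return out
```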

3. Feature Importance Analysis:
Mutual information (MI) between each feature and the target was calculated; the top-scoring features are listed below:

Feature         MI Score
close-open      2.015588
EMA_20          1.419642
SMA_20          1.293194
EMA_200-100     0.864370
high-low        0.662957
low             0.570482
SMA_1000        0.567136
high            0.564218
EMA_1000        0.549707
OBV             0.534179
This analysis suggests that a mix of momentum indicators, lagged returns, and seasonality components contributes significantly to predictive power, although the top of the MI ranking is dominated by candle dynamics (close-open, high-low) and moving averages. Interestingly, seasonal decomposition also appeared among the top features here.
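
For reference, this ranking can be reproduced with a few lines of scikit-learn (a sketch assuming a feature DataFrame X and an aligned target Series y with NaNs already dropped):

```python
import pandas as pd
from sklearn.feature_selection import mutual_info_regression

# X: engineered feature DataFrame, y: 1-day log-return target (aligned, no NaNs)
mi = mutual_info_regression(X, y, random_state=42)
mi_scores = pd.Series(mi, index=X.columns).sort_values(ascending=False)
print(mi_scores.head(10))
```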

4. Model Training:

Next, I trained three models on the engineered dataset: 1) Linear Regression, 2) XGBoost, and 3) a Transformer. I used scikit-learn's TimeSeriesSplit to split the data into training and testing sets, with the gap parameter set to 288 (one full day) so that the 1-day-ahead targets of the training rows cannot leak into the test window. Interestingly, the Linear Regression model achieved the best performance, with a directional accuracy of about 56% (Fig.3):

Number of samples: 39998
Directional Accuracy: 0.5570
p-value: 0.000000
95% Confidence Interval for accuracy: 0.5547-0.5594
Correlation t-Test:
Pearson corr. coeff.: 0.1381 (must be >0.05)
p-value: 0.000000 (must be <0.05)
Relative absolute return-weighted RMSE improvement: 3.69% (must be >10%)

The directional accuracy and correlation tests are statistically significant, although the return-weighted RMSE improvement falls short of the 10% threshold.
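
The cross-validation setup is roughly the following (a sketch for the Linear Regression case; X and y are as above and the number of splits is illustrative):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import TimeSeriesSplit

# gap=288 leaves one full day between train and test indices, so the
# 1-day-ahead target of the last training rows cannot overlap the test window.
tscv = TimeSeriesSplit(n_splits=5, gap=288)

dir_acc = []
for train_idx, test_idx in tscv.split(X):
    model = LinearRegression().fit(X.iloc[train_idx], y.iloc[train_idx])
    pred = model.predict(X.iloc[test_idx])
    dir_acc.append(np.mean(np.sign(pred) == np.sign(y.iloc[test_idx])))

print(f"Directional accuracy: {np.mean(dir_acc):.4f}")
```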

5. Feature Importance in Linear Regression:
After fitting the Linear Regression model, feature importance was calculated based on the absolute values of the standardized coefficients:

Feature     Importance
EMA_1000    0.0316
SMA_1000    0.0074
EMA_500     0.0074
EMA_200     0.0071
EMA_100     0.0071
SMA_100     0.0071
This indicates that long-term moving averages, especially EMA_1000, are the most predictive of the 1-day forward log-return in the Linear Regression model. These features likely capture long-term price direction or trend momentum influencing the return over the next day.

On the other hand, raw price components and volume features such as open, close, low, and volume had relatively low importance (e.g., 0.0014 or less), suggesting that absolute values are less informative than derived trend-based features.

Indicators like OBV, EMA_100-10, and diff_trend showed negligible or near-zero importance, potentially due to high correlation with stronger signals or a lack of linear relationship with the target.

Seasonality features had essentially zero importance in this experiment.
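
The importances above were obtained roughly along these lines (a sketch: standardize the features, fit, and rank by absolute coefficient):

```python
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

# Standardize so that coefficient magnitudes are comparable across feature scales
X_std = pd.DataFrame(StandardScaler().fit_transform(X), columns=X.columns, index=X.index)

lr = LinearRegression().fit(X_std, y)
importance = pd.Series(lr.coef_, index=X.columns).abs().sort_values(ascending=False)
print(importance.head(6))
```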

Final Notes:
I also tried adding principal components (PCA) as features, but they did not improve model performance at all.

I’d love to know everyone’s thoughts on this! Any idea on how to improve my dataset and create better features?


This is great input @t-hossein, thank you for sharing your approach! Really appreciate the systematic nature of what you’re doing.

I wondered whether critical momentum information is lost by compressing the 5-minute data into 24h candles. On 5-minute timescales, I could imagine that momentum (close-open), its derivative (Δ[close-open]/Δt, which is analogous to a force), and its square ([close-open]**2, which is analogous to energy and a proxy for volatility) would contribute more meaningfully. For these types of features (and EMAs thereof), the fine-grained 5-minute data would probably be important; if they are calculated from 24h pseudo-candles, the information driving the price action within that timeframe is lost.
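
As a rough illustration of what I mean on the raw 5-minute bars (hypothetical DataFrame df5; the EMA spans are arbitrary examples):

```python
import pandas as pd

# df5: hypothetical raw 5-minute OHLCV bars, before any 24h aggregation
mom = df5["close"] - df5["open"]   # momentum over one 5-minute bar
force = mom.diff()                 # change in momentum per bar (analogous to a force)
energy = mom ** 2                  # squared momentum (analogous to energy, a volatility proxy)

# Smoothed versions over a few hours of 5-minute bars (spans are arbitrary)
features = pd.DataFrame({
    "mom_ema_36": mom.ewm(span=36, adjust=False).mean(),      # ~3 hours
    "force_ema_36": force.ewm(span=36, adjust=False).mean(),
    "energy_ema_36": energy.ewm(span=36, adjust=False).mean(),
})
```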

I also wondered if it’s useful to integrate external gold price drivers (both contemporaneous and lagged), such as:

  • Spot XAUUSD London PM fix;
  • DXY (USD index);
  • 1-day change in real 10-yr Treasury yield;
  • GLD ETF net flows.

Along these lines, gold (like other assets) has time-of-day seasonality that isn’t sinusoidal. In particular, the US day/night split could matter quite a bit, so it might be worth subdividing the day into session labels, e.g. Asia (00-08 UTC), Europe (08-13 UTC), US (13-20 UTC).
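
For example, the sessions could be encoded as categorical dummies rather than sinusoids (a sketch using the indicative UTC boundaries above; labelling the residual 20-00 UTC block separately is my own assumption):

```python
import pandas as pd

def session_label(ts: pd.Timestamp) -> str:
    # Indicative trading-session boundaries in UTC
    if 0 <= ts.hour < 8:
        return "asia"
    if 8 <= ts.hour < 13:
        return "europe"
    if 13 <= ts.hour < 20:
        return "us"
    return "late_us"  # assumption: remaining 20-00 UTC hours get their own label

# df: the feature DataFrame with a (UTC) DatetimeIndex
sessions = pd.Series(df.index.map(session_label), index=df.index, name="session")
session_dummies = pd.get_dummies(sessions, prefix="session")
```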

Have you experimented with some of these features yet? I would also imagine that integrating these types of higher-order features would make it worth revisiting other model architectures (e.g. XGBoost).
