Price/returns topic feature engineering

Looking at participant performance in Allora’s Forge programme, I have the impression that feature engineering is the bottleneck in several of the current price/return prediction topics.

Markets are like physical systems, with action and reaction. If you think about them that way, you can identify which variables might carry signal. As a data scientist you can then test that hypothesis. But just being a data scientist isn’t enough – you need that physics perspective to first understand where the signal may be.

Good ideas don’t come by staring at a screen on your own. They come from collaboration and discussion. I think it could help if participants got together more to work on the feature engineering, and discuss what may or may not work.

There is a clear incentive to collaborate: on mainnet, the rewards paid out in a topic will be set by the topic weight, which is calculated using the stake in a topic and the revenue that it generates. Obviously, performant topics generate more revenue. So while you will be competing for rewards, you will also collectively be competing against other topics.

This thread is aimed at carrying out research on feature engineering for financial price forecasting (i.e. log-returns topics). I would suggest we first make an inventory of commonly used features and carry out importance analysis for the main ones. Additionally, we should reason about where we expect to see the strongest signal based on our experience with the markets, engineer new features, and test their predictive power.

From the Allora research team, there may be involvement from myself, @florian, @joel, and @steve. Others might join later too.

2 Likes

The following analysis pertains to my model development for the network’s 24-hour PAXG/USD Log-Return Prediction topic (Topic 60):

  1. Base Data: To construct my dataset, I first retrieved historical price data (OHLCV — Open, High, Low, Close, Volume) from Tiingo, covering the past 250 days. This spans from October 25, 2024, to July 2, 2025. Since the topic updates every 5 minutes, I collected data at a 5-minute interval to ensure the model remains responsive to short-term fluctuations.
    However, this raw data required modification, as our prediction horizon is 1 day (1440 minutes), not 5 minutes. (Fig.1)

Thus, I modified the dataset so that each datapoint represents the past 1-day window. For example, for the datapoint at 00:05 AM on July 2 (Fig.2):

  • Open: The price at 00:05 AM on July 1 (exactly one day prior).
  • High: The highest price between 00:05 AM on July 1 and 00:05 AM on July 2.
  • Low: The lowest price between 00:05 AM on July 1 and 00:05 AM on July 2.
  • Close: The price at 00:05 AM on July 2.
  • Volume: The sum of trading volumes from all 5-minute intervals between 00:05 AM on July 1 and 00:05 AM on July 2.

The target variable was defined as: Log-Return(t) = ln(close(t) / close(t + 288)),
where t represents the current datapoint. The close price at t + 288 is used because one day consists of 288 intervals at 5-minute resolution (1 day = 288 × 5 minutes).
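
For concreteness, here is a minimal pandas sketch of this pseudo-candle construction, assuming df holds the raw 5-minute OHLCV bars with a DatetimeIndex (the column names are assumptions):

import numpy as np
import pandas as pd

W = 288  # one day = 288 five-minute intervals

candles = pd.DataFrame({
    'open':   df['open'].shift(W),            # price exactly one day ago
    'high':   df['high'].rolling(W).max(),    # highest price over the trailing 24h
    'low':    df['low'].rolling(W).min(),     # lowest price over the trailing 24h
    'close':  df['close'],                    # current price
    'volume': df['volume'].rolling(W).sum(),  # summed 5-minute volumes over 24h
})

# Target as written above (the sign convention is revisited later in this thread)
candles['log_return'] = np.log(candles['close'] / candles['close'].shift(-W))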

2. Feature Engineering:
To enhance the predictive power of the model, I engineered a variety of features capturing traditional technical indicators and statistical properties of the time series. The numbers next to the feature names represent the window length in terms of datapoints. Below is a breakdown of the feature categories:

  • OHLCV Basics:
    • open, high, low, close, volume, volumeNotional, tradesDone
      These capture standard daily trading metrics and activity.
  • Technical Indicators:
    • Volatility and Momentum:
      • Bollinger_High, Bollinger_Low: Bollinger Bands to measure price volatility relative to a moving average.
      • RSI_10, RSI_100: Relative Strength Index at short and long windows to measure overbought/oversold conditions.
      • MACD, KST: Momentum indicators that highlight trend shifts.
      • OBV: On-Balance Volume, combining price movement and volume to detect accumulation or distribution.
    • Moving Averages:
      • SMA_20, SMA_100, SMA_200, SMA_500, SMA_1000: Simple moving averages over different timeframes to capture medium- and long-term trends.
      • EMA_20, EMA_100, EMA_200, EMA_500, EMA_1000: Exponential moving averages which give more weight to recent prices.
      • Difference-based indicators:
        • EMA_100-10: Difference between EMA_100 and EMA_10 for short-vs-long-term momentum.
        • EMA_200-100: Difference between EMA_200 and EMA_100, a proxy for the longer-term trend slope.
        • EMA_100-SMA_100: To contrast exponential vs simple moving average trends.
  • Volatility Metrics:
    • std_0.05, std_0.1: Exponentially weighted standard deviations to assess micro-level volatility. Here, the numbers (0.05 and 0.1) are the alpha parameter of Pandas’ DataFrame.ewm (exponentially weighted moving) function; see the sketch after this list.
  • Price Relations (Candle Dynamics):
    • diff_trend, high-low, high-open, low-open, close-open: Capture intraday price range, spread, and behavior.
  • Statistical Features:
    • mean, log_volume: Basic statistics and transformed versions of core metrics to normalize scale.
  • Lag Features (Autoregressive Memory):
    • return_open, return, open-close_return, 2_lag_return through 10_lag_return:
      These include raw return, return relative to open or close, and up to 10-day lagged returns to capture autocorrelation and past signal memory.
  • Seasonality & Time Encoding:
    • seasonal_decomposition: Additive seasonal decomposition components of returns over fixed lags.
    • second_of_day_sin, second_of_day_cos: Cyclical encoding of time of day to capture intra-day periodicity.
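
To make a few of these definitions concrete, here is a rough pandas sketch (the 288-interval day and the column names are carried over from above; df is assumed to be the pseudo-candle DataFrame with a DatetimeIndex):

import numpy as np
import pandas as pd

# Exponentially weighted volatility (0.05 and 0.1 are the ewm alpha parameter)
df['std_0.05'] = df['close'].ewm(alpha=0.05).std()
df['std_0.1']  = df['close'].ewm(alpha=0.1).std()

# Moving averages and a difference-based indicator
df['EMA_100'] = df['close'].ewm(span=100).mean()
df['SMA_100'] = df['close'].rolling(100).mean()
df['EMA_100-SMA_100'] = df['EMA_100'] - df['SMA_100']

# Cyclical intra-day time encoding
seconds = df.index.hour * 3600 + df.index.minute * 60 + df.index.second
df['second_of_day_sin'] = np.sin(2 * np.pi * seconds / 86400)
df['second_of_day_cos'] = np.cos(2 * np.pi * seconds / 86400)

# Daily log-return and its lags (288 intervals = 1 day)
ret = np.log(df['close'] / df['close'].shift(288))
for lag in range(2, 11):
    df[f'{lag}_lag_return'] = ret.shift(288 * (lag - 1))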

3. Feature Importance Analysis:
Mutual information (MI) between each feature and the target was calculated; the highest-scoring features were:

Feature       MI Score
------------  --------
close-open    2.015588
EMA_20        1.419642
SMA_20        1.293194
EMA_200-100   0.864370
high-low      0.662957
low           0.570482
SMA_1000      0.567136
high          0.564218
EMA_1000      0.549707
OBV           0.534179

This analysis suggests that candle dynamics (close-open, high-low), moving averages across several timescales, and the volume-based OBV contribute most significantly to predictive performance. Interestingly, seasonal decomposition components also appeared among the higher-scoring features in this analysis.
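
For reference, a rough sketch of how such an MI ranking can be produced with scikit-learn (X and y are hypothetical names for the feature matrix and target series):

import pandas as pd
from sklearn.feature_selection import mutual_info_regression

mask = X.notna().all(axis=1) & y.notna()  # drop rows with missing values
mi = mutual_info_regression(X[mask], y[mask], random_state=0)
mi_scores = pd.Series(mi, index=X.columns).sort_values(ascending=False)
print(mi_scores.head(10))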

4. Model Training:

Next, I trained three models on the engineered dataset: 1) Linear Regression, 2) XGBoost, and 3) Transformer. I used scikit-learn’s TimeSeriesSplit function to split the data into training and testing sets. The gap parameter was set to 288 to prevent data leakage during training. Interestingly, the Linear Regression model achieved the best performance, with a directional accuracy of 56% (Fig.3):

Number of samples: 39998
Directional Accuracy: 0.5570
p-value: 0.000000
95% Confidence Interval for accuracy: 0.5547-0.5594
Correlation t-Test:
Pearson corr. coeff.: 0.1381 (must be >0.05)
p-value: 0.000000 (must be <0.05)
Relative absolute return-weighted RMSE improvement: 3.69% (must be >10%)

The tests show that the directional accuracy and the correlation are statistically significant, though the return-weighted RMSE improvement (3.69%) falls short of the 10% threshold.
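
A minimal sketch of the leakage-safe evaluation described above (gap=288 is taken from the post; X, y, and the number of splits are assumptions):

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import TimeSeriesSplit

tscv = TimeSeriesSplit(n_splits=5, gap=288)  # gap = one full prediction horizon
accs = []
for train_idx, test_idx in tscv.split(X):
    model = LinearRegression().fit(X.iloc[train_idx], y.iloc[train_idx])
    preds = model.predict(X.iloc[test_idx])
    accs.append(np.mean(np.sign(preds) == np.sign(y.iloc[test_idx])))
print(f'Mean directional accuracy: {np.mean(accs):.4f}')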

5. Feature Importance in Linear Regression:
After fitting the Linear Regression model, feature importance was calculated based on the absolute values of the standardized coefficients:

Feature   Importance
--------  ----------
EMA_1000  0.0316
SMA_1000  0.0074
EMA_500   0.0074
EMA_200   0.0071
EMA_100   0.0071
SMA_100   0.0071

This indicates that long-term moving averages, especially EMA_1000, are the most predictive of the 1-day forward log-return in the Linear Regression model. These features likely capture long-term price direction or trend momentum influencing the return over the next day.

On the other hand, raw price components and volume features such as open, close, low, and volume had relatively low importance (e.g., 0.0014 or less), suggesting that absolute values are less informative than derived trend-based features.

Indicators like OBV, EMA_100-10, and diff_trend showed negligible or near-zero importance, potentially due to high correlation with stronger signals or a lack of linear relationship with the target.

Seasonality features had essentially zero importance in this experiment.

Final Notes:
I also tried using PCA components as features, but they didn’t improve the model performance at all.

I’d love to know everyone’s thoughts on this! Any idea on how to improve my dataset and create better features?

3 Likes

This is great input @t-hossein, thank you for sharing your approach! Really appreciate the systematic nature of what you’re doing.

I wondered whether compressing the 5-minute data into 24h candles loses critical momentum information. On 5-minute timescales, I could imagine that momentum (close-open), its derivative (Δ[close-open]/Δt, which is analogous to a force), and its square ([close-open]**2, which is analogous to energy and a proxy for volatility) could contribute more meaningfully. For these types of features (and EMAs thereof), the fine-grained 5-minute data would probably be important. If these are calculated using 24h pseudo-candles, then the information driving the price action over that timeframe is lost.
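
To illustrate what I mean, a rough sketch on the raw 5-minute candles (column names assumed):

import pandas as pd

dt = 5.0                                    # minutes per bar
mom = df['close'] - df['open']              # momentum proxy per bar
df['force']  = mom.diff() / dt              # Δ[close-open]/Δt, analogous to a force
df['energy'] = mom ** 2                     # squared momentum, a volatility proxy
df['mom_ema_1h'] = mom.ewm(span=12).mean()  # EMA thereof (12 bars = 1 hour)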

I also wondered if it’s useful to integrate external gold price drivers (both contemporaneous and lagged), such as:

  • Spot XAUUSD London PM fix;
  • DXY (USD index);
  • 1-day change in real 10-yr Treasury yield;
  • GLD ETF net flows.

Along these lines, gold (like other assets) has time-of-day seasonality that isn’t sinusoidal. Specifically, the US day/night divide could matter quite a bit. So it might be worth subdividing the day with labels, e.g. Asia (00-08 UTC), Europe (08-13 UTC), US (13-20 UTC).
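
A quick sketch of such labels (UTC timestamps assumed; the 20-24 UTC remainder gets its own label for completeness, and the label names are hypothetical):

import pandas as pd

hour = df.index.hour
df['session'] = pd.cut(hour, bins=[0, 8, 13, 20, 24], right=False,
                       labels=['asia', 'europe', 'us', 'late_us'])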

Have you experimented with some of these features yet? I would also imagine that integrating these types of higher-order features would make it worth revisiting other model architectures (e.g. XGBoost).

1 Like

@t-hossein this is great! I also love your detailed explanation.

So do I understand it right that all your features are derived from your modified (aggregated over 288 intervals) OHLCV data? Don’t you lose a lot of your data by smoothing it out like this? Like, if the gold price makes sudden jumps within a 5-minute interval, that probably means something? I would have even tried to include the 1-minute intervals from Tiingo, just to have more data to work with. But I don’t know much about this, maybe there’s a reason you don’t do that?

Also I’m noticing you define log returns as Log-Return(t) = ln(close(t) / close(t + 1day)) but I believe the Allora topic (the reputers, to be precise) defines it as Log-Return(t) = ln(close(t + 1day) / close(t)), which would give exactly the negative result. I could be wrong, but maybe you want to double check that.

2 Likes

Thanks for sharing @t-hossein!

In addition to the features you already use, I’ve found gradients (of a linear fit over some window), acceleration/force, and difference from moving average to be other very useful features that are often among the most important.
For time encoding, perhaps time in a week (e.g. in hours) could be useful to capture weekly cycles (like due to weekends).
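
For illustration, a rough sketch of these gradient features (the window lengths and column names are assumptions):

import numpy as np
import pandas as pd

def rolling_gradient(s, window):
    # Slope of a least-squares linear fit over each trailing window
    x = np.arange(window)
    return s.rolling(window).apply(lambda y: np.polyfit(x, y, 1)[0], raw=True)

df['grad_24h']  = rolling_gradient(df['close'], 288)
df['accel_24h'] = df['grad_24h'].diff()                           # change in slope
df['dist_ema']  = df['close'] - df['close'].ewm(span=100).mean()  # difference from MA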

Do you do any feature reduction? That’s quite a lot of features for 250 independent data points so you could reduce pairs of very highly correlated features, remove features that are consistently of least importance, etc.

For the ML model, LightGBM and CatBoost may also be worth testing. I tend to find LightGBM a bit better than XGBoost most of the time (though don’t have much experience with CatBoost).

You could also consider modifying the evaluation metric to give larger true returns more weight in the minimisation. The “Z-transformed Power-Tanh Absolute Error” (ZPTAE) used in returns topics has this behaviour. Let me know if you want any more info on that.

2 Likes

Hey Apollo, thanks a lot for the feedback!

Regarding the loss of momentum — in hindsight, I agree that using 24-hour candles may result in the loss of important short-term dynamics like momentum. That said, I think relying entirely on 5-minute data points could introduce excessive noise. A good compromise might be to derive features over intermediate timeframes — for example, 4-hour windows. This could help retain meaningful trends while minimizing noise.

I really like the idea of incorporating physics-inspired concepts like momentum and force into feature engineering. I’m especially interested in how derivative-based features (e.g., Δ[close - open]/Δt) might influence model performance, since they capture the rate of change over time. I believe this approach could be central to designing a set of informative features.

Your point about time-of-day seasonality is also very compelling. Handling seasonality differently across key time zones seems like a smart direction and could improve predictive accuracy.

I’ll also experiment with incorporating external price drivers, as you suggested. One challenge, though, is the inconsistency in data granularity and completeness across sources. For instance, Yahoo Finance offers a broad range of indices and assets, but often lacks data for off-market hours and holidays. Additionally, Yahoo’s data is typically at 1-hour granularity, which doesn’t align well with our Tiingo OHLCV data, which is available at 5-minute intervals and has far fewer missing values (and those can be interpolated or forward/back-filled if needed). So data alignment and completeness are things to keep in mind when integrating sources beyond Tiingo.

Thanks again — these are great suggestions and give me plenty to explore further!

2 Likes

Hey @florian!

Not exactly — many of my features, like EMA_100 and SMA_100, are moving averages computed over the past 100 5-minute intervals. In the final dataset, the open value corresponds to the price exactly 24 hours ago, while the close value reflects the current price. So any significant change in close is still captured — both directly and indirectly through indicators like EMAs and SMAs.

The only notable information loss comes from local highs and lows that aren’t the absolute min/max of a 24-hour window — those finer movements within the window do get missed.

As for the 1-minute granularity: I chose 5-minute intervals because the epoch length in this specific competition (topic 60) is 5 minutes. I structured the data accordingly so the model could learn with the temporal resolution it will be evaluated on.

You’re absolutely right about the log-return definition — that was a typo in my original post. The correct formulation used during training was:
Log-Return(t) = ln(close(t + 1 day) / close(t))
just as you mentioned. I don’t think I can update the original post at this point, but I’ll see if there’s a way to fix it.

Also, the dataset includes a return feature representing the return from yesterday to now:
Log-Return(t) = ln(close(t) / close(t − 1 day))

I’ve uploaded a sample dataset here, which includes 3 days of data.

I’ve also created a repo called Allora_Data_Fetcher, where I’ve uploaded the code used to generate this dataset. It’s still a work in progress, so apologies in advance for any issues :slightly_smiling_face:.

2 Likes

Hey @joel — thank you so much for the fantastic tips!

I’ll definitely try implementing the additional features you and @Apollo11 mentioned — they sound very promising.

You’re right — I haven’t applied any feature reduction in this model so far, but given the relatively small number of independent data points (around 250), I agree it’s something I should incorporate.

I also plan to expand my evaluation to include more models like LightGBM, CatBoost, and others. I’ve mostly worked with XGBoost, but I’ll make sure to include a broader comparison in my next update.

Thanks as well for pointing out ZPTAE — I’ve come across it in the forums and did a preliminary read, but I’ll go over it more carefully and try using it for evaluation in my next iteration. If I run into any confusion, I’ll definitely reach out!

Thanks again — this is super helpful.

2 Likes

These are the features I use for Topic 60 (24-hour PAXG/USD Log-Return Prediction):

  • Price Features

    • Close (1, 5, 10, 20 hours)
    • Log returns
    • High/low ratios, close/open ratios
  • Technical Indicators

    • Moving Averages (SMA, EMA)
    • RSI with multiple periods
    • Bollinger Bands with position and width
    • MACD with signal and histogram
    • Stochastic Oscillator
    • ATR (Average True Range)
    • ADX (Average Directional Index)
    • CCI (Commodity Channel Index)
  • Time-Based Features

    • Hour, day of week, month extraction
    • Cyclical encoding for time components
    • Weekend flags
    • Market open/close proximity
  • Statistical Features

    • Rolling moments (mean, std, skew, kurtosis)
    • Percentiles and ranges
    • Price statistics across multiple windows
3 Likes

Thank you for joining the discussion @phamhung3589, and thanks much for sharing your further thoughts @t-hossein!

In addition to what has been said, I was also thinking that many of the feature classes (price data, technical indicators, statistics) shouldn’t only be calculated for price, but also for the returns themselves. Given that the target variable is typically log-returns in price prediction topics, it is probably important to calculate RSI, MACD, OBV, Bollinger, stochastic oscillator, ATR, ADX, CCI, and all kinds of MAs for log-returns in addition to those for price. This isn’t a very hard engineering step and might yield stronger signal.
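
As a concrete illustration, the same indicator code can simply be pointed at log-returns instead of price. A hand-rolled RSI is used in this sketch so as not to assume any particular TA library (df and its close column are assumptions):

import numpy as np
import pandas as pd

log_ret = np.log(df['close'] / df['close'].shift(1))

def rsi(series, window=14):
    delta = series.diff()
    gain = delta.clip(lower=0).rolling(window).mean()
    loss = (-delta.clip(upper=0)).rolling(window).mean()
    return 100 - 100 / (1 + gain / loss)

df['rsi_price']   = rsi(df['close'])   # conventional, price-based
df['rsi_returns'] = rsi(log_ret)       # same indicator on log-returns
df['ema_returns'] = log_ret.ewm(span=20).mean()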

Based on the many ideas that are floating around in this thread now, would it make sense to prioritise and/or split off tasks for quantitative testing?

If I summarise the above, I see the following initiatives:

  • Time granularity (5m vs 24h)
  • Including force and energy features (i.e. multi-timeframe Δ[close-open]/Δt, [close-open]**2, difference from MA, multi-timeframe linear gradients)
  • Including external price drivers (e.g. spot XAUUSD, DXY, 10y yield, GLD ETF flows – obviously these are specific to gold and not always generalisable to other topics, except maybe BTC?)
  • Including labelled time-of-day and real-valued time-of-week
  • Performing feature reduction
  • Modifying the training evaluation metric to match the ZPTAE loss function (Losses in returns prediction topics - #8 by joel)
  • Add returns-focused feature set (all quantities you can calculate for price, but for log-returns)

For ease of prioritisation, let’s do a poll on what we think are the high ROI things to test first (max 3 votes/person):

  • Time granularity
  • Force & energy features
  • External price drivers
  • Improved time-of-day & time-of-week
  • Feature reduction
  • ZPTAE evaluation metric
  • Returns-focused feature set

Let’s make it run for 24h after this post so that we don’t slow down too much here. Great stuff everyone!

1 Like

I built a pipeline to predict the 24-hour log return of PAXGUSD using resampled hourly data. In addition to technical features from the PAXGUSD time series, I also experimented with incorporating external features from BTCUSD price movement, since gold and Bitcoin sometimes show inverse or lagged correlations.

1. Data & Target

Data source: OHLC 1-hour API

df['target_close'] = df['close'].shift(-24)
df['log_return'] = np.log(df['target_close'] / df['close'])

2. Feature Engineering

The following features were created from raw data:

Momentum & Oscillators

  • RSI (Relative Strength Index): rsi_14, rsi_24, rsi_48
  • ROC (Rate of Change): roc_12, roc_24, roc_48
  • MACD: macd_line, signal_line, macd_histogram
  • Williams %R: measures the closing price relative to the high-low range over the past 24 hours

Lag & Change
Created lag-based features:

for lag in [1, 2, 3, 4, 5, 12, 24]:
    df[f'close_lag_{lag}'] = df['close'].shift(lag)
    df[f'close_delta_{lag}'] = df['close'] - df[f'close_lag_{lag}']
    df[f'close_ratio_{lag}'] = df['close'] / df[f'close_lag_{lag}']

Trend & Volatility

  • EMA: ema_12, ema_26, ema_50, ema_100
  • Volatility: rolling standard deviation (normalized)
  • ATR (Average True Range)
  • Bollinger Band Width: captures expansion/contraction of price

Time Features

  • hour, day_of_week, is_weekend
    => Designed to capture seasonal/time-based patterns

Cross-Market Features from BTCUSD
I fetched hourly BTCUSD data from a separate API and resampled it to align with the PAXGUSD timestamps. Then I created parallel technical indicators:

btc_df['log_return'] = np.log(btc_df['close'] / btc_df['close'].shift(1))
btc_df['rsi_14'] = compute_rsi(btc_df['close'], window=14)
btc_df['ema_12'] = btc_df['close'].ewm(span=12).mean()
btc_df['roc_12'] = btc_df['close'].pct_change(periods=12) * 100

These features were merged into the main dataframe:

df = df.merge(btc_df[['timestamp', 'log_return', 'rsi_14', 'ema_12', 'roc_12']], on='timestamp', suffixes=('', '_btc'))

The idea is to let the model learn from recent Bitcoin volatility or momentum and whether that leads or lags gold token behavior (PAXGUSD).

3. Modeling

  • Model: XGBRegressor with TimeSeriesSplit
  • Scaler: StandardScaler applied to all features
  • Loss: RMSE

Incorporating BTCUSD signals added modest performance gains in backtesting. It’s worth exploring other macro or crypto-related cross-asset signals (like ETH, DXY, or VIX) as feature inputs.

3 Likes

Thanks for joining the discussion @its_theday! I like the systematic approach to feature construction and model setup you’ve applied when building out the pipeline. The inclusion of BTCUSD-derived signals is particularly interesting, and while the marginal gains were modest, that’s perhaps not unexpected given the mixed correlation dynamics between Bitcoin and gold-backed assets like PAXG. Have you thought about exploring how these correlations behave under different market regimes (e.g. high volatility, macro-driven risk-off periods)? I could imagine that incorporating regime classifiers or volatility clustering might help surface more conditional relationships where BTC features become more predictive.

As you noted, expanding to additional macro and crypto signals (ETHUSD, DXY, VIX) is a natural next step. I’d be very keen to see how this affects model performance! From an Allora-inspired perspective, this opens the door to a more context-aware forecasting framework — not just ensembling models, but assigning weights to signals or model outputs based on their expected contribution in a given context. That mirrors how Allora combines “workers” by forecasting their error, and it could be approximated here by tracking model performance across different volatility/time segments and learning dynamic weights accordingly. Looking forward to seeing where you take this — the structure you’ve built seems well-suited to supporting a richer meta-learning layer.

You mentioned using time-based features—hour, day_of_week, is_weekend—which is great. In earlier threads there was interest in exploring whether returns differ by time of day, month, or season. Have you noticed any meaningful signal in those temporal features? For instance, do certain hours consistently offer stronger predictability, or do weekend returns behave differently? Identifying such patterns could guide a regime-focused approach—e.g., applying context-aware weights when you detect statistically significant time-based effects.

3 Likes

Thanks a lot for the thoughtful response @steve
I haven’t formally explored market regimes yet, but I plan to segment by volatility (e.g. high-vol vs low-vol periods) and test whether BTC features are more predictive under certain conditions; HMMs or simple thresholds might be a good starting point. I’m also starting to log residuals by time-of-day and volatility, which could help with context-aware weighting - a concept that really aligns with Allora’s meta-learning approach. Your point about assigning dynamic weights to signals or models based on performance is spot on.
On the feature side, I’ve begun adding macro signals from the US and AU (like DXY and AUD/USD), given their potential influence on gold. It’s early stage, but I’m excited to see what patterns emerge.
Thanks again - will keep sharing as I dig deeper

2 Likes

Thanks all for voting in the poll! Looks like we have a clear list of three priorities, so let’s get working on these:

  • Including force and energy features (i.e. multi-timeframe Δ[close-open]/Δt, [close-open]**2, difference from MA, multi-timeframe linear gradients)
  • Add returns-focused feature set (all quantities you can calculate for price, but for log-returns)
  • Modifying the training evaluation metric to match the ZPTAE loss function

The way we should go about these is to perform an A/B test, i.e.:

  • use your own default model;
  • record its performance across a set of (sufficiently long) time intervals (more than one to achieve statistical significance);
  • develop one of the above modifications;
  • add this to your own model;
  • record the performance of the modified model across the same set of (sufficiently long) time intervals;
  • quantify any differences and compare statistical significance.
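
For that final step, a paired test over the same intervals is a simple way to quantify significance. A hypothetical sketch with placeholder scores:

import numpy as np
from scipy import stats

# One score per interval (e.g. directional accuracy), same intervals for both models
baseline = np.array([0.54, 0.55, 0.53, 0.56, 0.55])  # placeholder values
modified = np.array([0.56, 0.55, 0.54, 0.58, 0.56])  # placeholder values

t_stat, p_value = stats.ttest_rel(modified, baseline)
print(f'mean improvement: {np.mean(modified - baseline):.4f}, p-value: {p_value:.4f}')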

We can collectively define some of the unknowns in the above plan (e.g. which time intervals, how long, which metrics) and I suggest you just propose what you’d like to use.

It’s great that we have three model builders involved in the discussion already (@t-hossein @phamhung3589 @its_theday). Given that each of your models is quite different (and uses a different feature set), can I maybe suggest that we work through the above ideas simultaneously? So then we pick one, all do the A/B test for that, and compare results. That way, we also test the robustness of these ideas under differing modelling approaches and I think that could be very useful. Given that we’re looking at historical data for these tests, we can continue to use the PAXG target, but if any of you would like to switch to the target of one of the new Forge topics (e.g. BTC), please let us know. Of course, more model builders are welcome to join at any time!

I then would like to suggest we start with Add returns-focused feature set (all quantities you can calculate for price, but for log-returns). My reasoning is that this is a relatively small amount of work to try (applying the transformations you are already using to another variable). Just be sure they’re sensible in this context – a log-return is a two-point quantity (expressing a difference between two moments), whereas a price is a one-point quantity (exists at any given moment in time). For instance, it makes sense to apply moving averages, RSI, MACD, Bollinger (and many other TA indicators) to log-returns, but maybe some other indicators relying on e.g. volume information or open-close data do not.

If you think this is a good plan and you’ll participate, just like this post and let’s get going!

2 Likes

We will continue to use this thread for organising and coordinating all feature engineering experiments. We’ll spin off a new thread for each feature engineering experiment to make sure the discussions don’t become intractable.

I have created a new thread for ongoing work and results on the returns-focused feature variables.

2 Likes