The following analysis pertains to my model development for the network’s 24-hour PAXG/USD Log-Return Prediction topic (Topic 60).
1. Dataset Construction:
- Base Data: To construct my dataset, I first retrieved historical price data (OHLCV — Open, High, Low, Close, Volume) from Tiingo, covering the past 250 days. This spans from October 25, 2024, to July 2, 2025. Since the topic updates every 5 minutes, I collected data at a 5-minute interval to ensure the model remains responsive to short-term fluctuations.
However, this raw data required modification, as our prediction horizon is 1 day (1440 minutes), not 5 minutes. (Fig.1)
Thus, I modified the dataset so that each datapoint represents the past 1-day window. For example, for the datapoint at 00:05 AM on July 2 (Fig.2):
- Open: The price at 00:05 AM on July 1 (exactly one day prior).
- High: The highest price between 00:05 AM on July 1 and 00:05 AM on July 2.
- Low: The lowest price between 00:05 AM on July 1 and 00:05 AM on July 2.
- Close: The price at 00:05 AM on July 2.
- Volume: The sum of trading volumes from all 5-minute intervals between 00:05 AM on July 1 and 00:05 AM on July 2.
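The trailing 1-day windowing described above can be sketched with pandas rolling aggregations. This is a minimal illustration on synthetic data (the column names and the exact window-boundary convention are assumptions; the real data comes from Tiingo):

```python
import numpy as np
import pandas as pd

# Synthetic 5-minute OHLCV frame (two days of bars) standing in for the Tiingo data.
idx = pd.date_range("2025-07-01", periods=576, freq="5min")
df = pd.DataFrame({
    "open":   np.linspace(100.0, 101.0, 576),
    "high":   np.linspace(100.5, 101.5, 576),
    "low":    np.linspace(99.5, 100.5, 576),
    "close":  np.linspace(100.2, 101.2, 576),
    "volume": np.ones(576),
}, index=idx)

WINDOW = 288  # 288 five-minute bars = 1 day

# Each row now summarizes the trailing 1-day window ending at that timestamp.
daily = pd.DataFrame({
    "open":   df["open"].shift(WINDOW),           # price one day prior
    "high":   df["high"].rolling(WINDOW).max(),   # highest price in the window
    "low":    df["low"].rolling(WINDOW).min(),    # lowest price in the window
    "close":  df["close"],                        # price at the current timestamp
    "volume": df["volume"].rolling(WINDOW).sum(), # total volume over the window
}).dropna()
```

The `dropna()` discards the first day of rows, for which no full trailing window exists yet.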
The target variable was defined as the 1-day forward log-return: Log-Return(t) = ln(close(t + 288) / close(t)),
where t represents the current datapoint. The close price at t + 288 is used because one day consists of 288 intervals at 5-minute resolution (288 × 5 minutes = 1440 minutes).
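Under this convention, the target is a single shift on the close series. A sketch with a synthetic close series (constant log-growth, so the expected value is easy to verify by hand):

```python
import numpy as np
import pandas as pd

HORIZON = 288  # 288 five-minute bars = 1 day

# Synthetic close series with a constant log-growth rate, for illustration only.
close = pd.Series(np.exp(np.linspace(0.0, 0.1, 600)))

# 1-day forward log-return: the quantity the model is asked to predict at time t.
# The final HORIZON rows have no future close yet and come out as NaN.
target = np.log(close.shift(-HORIZON) / close)
```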
2. Feature Engineering:
To enhance the predictive power of the model, I engineered a variety of features capturing traditional technical indicators and statistical properties of the time series. The numbers next to the feature names represent the window length in terms of datapoints. Below is a breakdown of the feature categories:
- OHLCV Basics:
- open, high, low, close, volume, volumeNotional, tradesDone
These capture standard daily trading metrics and activity.
- Technical Indicators:
- Volatility and Momentum:
- Bollinger_High, Bollinger_Low: Bollinger Bands to measure price volatility relative to a moving average.
- RSI_10, RSI_100: Relative Strength Index at short and long windows to measure overbought/oversold conditions.
- MACD, KST: Momentum indicators that highlight trend shifts.
- OBV: On-Balance Volume, combining price movement and volume to detect accumulation or distribution.
- Moving Averages:
- SMA_20, SMA_100, SMA_200, SMA_500, SMA_1000: Simple moving averages over different timeframes to capture medium- and long-term trends.
- EMA_20, EMA_100, EMA_200, EMA_500, EMA_1000: Exponential moving averages which give more weight to recent prices.
- Difference-based indicators:
- EMA_100-10: Difference between EMA_100 and EMA_10 for short-vs-long-term momentum.
- EMA_200-100: Difference between EMA_200 and EMA_100, a proxy for the medium-term trend slope.
- EMA_100-SMA_100: To contrast exponential vs simple moving average trends.
- Volatility Metrics:
- std_0.05, std_0.1: Exponentially weighted standard deviations to assess micro-level volatility. Here, the numbers (0.05 and 0.1) are the alpha parameter of pandas’ DataFrame.ewm (exponentially weighted moving) method.
- Price Relations (Candle Dynamics):
- diff_trend, high-low, high-open, low-open, close-open: Capture intraday price range, spread, and behavior.
- Statistical Features:
- mean, log_volume: Basic statistics and transformed versions of core metrics to normalize scale.
- Lag Features (Autoregressive Memory):
- return_open, return, open-close_return, 2_lag_return through 10_lag_return:
These include the raw return, returns relative to the open or close, and returns at lags of 2 through 10 datapoints to capture autocorrelation and past signal memory.
- Seasonality & Time Encoding:
- seasonal_decomposition: Additive seasonal decomposition components of returns over fixed lags.
- second_of_day_sin, second_of_day_cos: Cyclical encoding of time of day to capture intra-day periodicity.
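Several of these feature families reduce to a few lines of pandas. A sketch on synthetic data (the feature names follow the list above; the exact windows and smoothing parameters are assumptions, e.g. interpreting EMA_100 as `ewm(span=100)`):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
idx = pd.date_range("2025-07-01", periods=1000, freq="5min")
df = pd.DataFrame({
    "open":  100 + rng.normal(size=1000).cumsum() * 0.1,
    "close": 100 + rng.normal(size=1000).cumsum() * 0.1,
}, index=idx)

feats = pd.DataFrame(index=df.index)

# Candle dynamics
feats["close-open"] = df["close"] - df["open"]

# Exponentially weighted volatility (the suffix is the ewm alpha parameter)
feats["std_0.05"] = df["close"].ewm(alpha=0.05).std()
feats["std_0.1"] = df["close"].ewm(alpha=0.1).std()

# Moving averages and a difference-based indicator
feats["EMA_100"] = df["close"].ewm(span=100).mean()
feats["SMA_100"] = df["close"].rolling(100).mean()
feats["EMA_100-SMA_100"] = feats["EMA_100"] - feats["SMA_100"]

# Lagged returns (lags counted in datapoints)
ret = np.log(df["close"] / df["close"].shift(1))
for lag in range(2, 11):
    feats[f"{lag}_lag_return"] = ret.shift(lag)

# Cyclical time-of-day encoding
sec = idx.hour * 3600 + idx.minute * 60 + idx.second
feats["second_of_day_sin"] = np.sin(2 * np.pi * sec / 86400)
feats["second_of_day_cos"] = np.cos(2 * np.pi * sec / 86400)
```

The sin/cos pair maps midnight and 23:55 to nearby points on the unit circle, so the model sees time of day as continuous rather than as a sawtooth.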
3. Feature Importance Analysis:
Mutual information (MI) between each feature and the target was calculated. The top 10 features by MI score:

| Feature | MI Score |
| --- | --- |
| close-open | 2.015588 |
| EMA_20 | 1.419642 |
| SMA_20 | 1.293194 |
| EMA_200-100 | 0.864370 |
| high-low | 0.662957 |
| low | 0.570482 |
| SMA_1000 | 0.567136 |
| high | 0.564218 |
| EMA_1000 | 0.549707 |
| OBV | 0.534179 |
This analysis suggests that candle dynamics (close-open, high-low), short- and long-window moving averages, and volume-based indicators (OBV) contribute most significantly to predictive performance. Interestingly, seasonal decomposition components also appeared among the top-ranked features.
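One common way to compute such scores is scikit-learn’s `mutual_info_regression`. A sketch on synthetic data (the estimator, not the actual dataset, is what is being illustrated here):

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import mutual_info_regression

rng = np.random.default_rng(0)
X = pd.DataFrame({
    "informative": rng.normal(size=2000),  # feature that drives the target
    "noise":       rng.normal(size=2000),  # unrelated feature
})
y = 0.8 * X["informative"] + rng.normal(scale=0.1, size=2000)

# Nonparametric MI estimate between each column of X and the target y
mi = mutual_info_regression(X, y, random_state=0)
scores = pd.Series(mi, index=X.columns).sort_values(ascending=False)
```

Unlike Pearson correlation, MI also picks up nonlinear dependence, which is why its ranking can differ from the linear-model ranking later in this post.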
4. Model Training:
Next, I trained three models on the engineered dataset: 1) Linear Regression, 2) XGBoost, and 3) Transformer. I used scikit-learn’s TimeSeriesSplit to split the data into training and testing sets, with the gap parameter set to 288 (one full prediction horizon) so that no training target overlaps the test window. Interestingly, the Linear Regression model achieved the best performance, with a directional accuracy of about 56% (Fig.3):
- Number of samples: 39998
- Directional accuracy: 0.5570 (p-value: 0.000000; 95% confidence interval: 0.5547–0.5594)
- Correlation t-test: Pearson corr. coeff. 0.1381 (must be >0.05), p-value 0.000000 (must be <0.05)
- Relative absolute return-weighted RMSE improvement: 3.69% (must be >10%)
The directional-accuracy and correlation tests are both highly significant, although the return-weighted RMSE improvement (3.69%) does not yet meet the 10% threshold.
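The training and evaluation loop above can be sketched as follows. This is a minimal illustration on synthetic data (the feature matrix, coefficients, and fold count are assumptions; only the TimeSeriesSplit-with-gap pattern and the directional-accuracy metric are taken from the text):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import TimeSeriesSplit

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 5))                       # synthetic features
y = X @ np.array([0.5, -0.2, 0.1, 0.0, 0.3]) \
    + rng.normal(scale=0.5, size=2000)               # synthetic target

# gap=288 discards one full prediction horizon between each train and test
# split, so no training target is computed from prices inside the test window.
tscv = TimeSeriesSplit(n_splits=5, gap=288)

accs = []
for train_idx, test_idx in tscv.split(X):
    model = LinearRegression().fit(X[train_idx], y[train_idx])
    pred = model.predict(X[test_idx])
    # Directional accuracy: fraction of predictions with the correct sign
    accs.append(np.mean(np.sign(pred) == np.sign(y[test_idx])))
```

Because TimeSeriesSplit only ever trains on the past and tests on the future, the per-fold accuracies are an honest estimate of out-of-sample performance.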
5. Feature Importance in Linear Regression:
After fitting the Linear Regression model, feature importance was calculated based on the absolute values of the standardized coefficients:
| Feature | Importance |
| --- | --- |
| EMA_1000 | 0.0316 |
| SMA_1000 | 0.0074 |
| EMA_500 | 0.0074 |
| EMA_200 | 0.0071 |
| EMA_100 | 0.0071 |
| SMA_100 | 0.0071 |
This indicates that long-term moving averages, especially EMA_1000, are the most predictive of the 1-day forward log-return under the Linear Regression model. These features likely capture the long-term price direction or trend momentum influencing the return over the next day.
On the other hand, raw price components and volume features such as open, close, low, and volume had relatively low importance (e.g., 0.0014 or less), suggesting that absolute price levels are less informative than derived trend-based features.
Indicators like OBV, EMA_100-10, and diff_trend showed negligible or near-zero importance, potentially due to high correlation with stronger signals or a lack of linear relationship with the target.
Seasonality features had essentially zero importance in this experiment.
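The standardized-coefficient importance used above can be reproduced as follows. A sketch with synthetic features (the column names, coefficients, and sample size are illustrative, not the actual dataset):

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(1000, 3)),
                 columns=["EMA_1000", "SMA_1000", "volume"])
y = 0.03 * X["EMA_1000"] + 0.005 * X["volume"] \
    + rng.normal(scale=0.01, size=1000)

# Standardizing puts all features on the same scale, so the absolute
# coefficients become comparable importance measures.
X_std = StandardScaler().fit_transform(X)
model = LinearRegression().fit(X_std, y)
importance = pd.Series(np.abs(model.coef_),
                       index=X.columns).sort_values(ascending=False)
```

Without the standardization step, coefficient magnitudes would mostly reflect feature scale (e.g., volume vs. a moving average) rather than predictive strength.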
Final Notes:
I also tried using PCA components as features, but they didn’t improve model performance at all.
I’d love to know everyone’s thoughts on this! Any idea on how to improve my dataset and create better features?