I’ve noticed some strange behaviour of the losses in topics predicting log-returns. Some workers occasionally provide extremely large inferences (up to ~10^12) compared to typical return values (<0.1), but still have reasonable values for their losses:
So at some point the losses flatten out, and inferences very far from the true value can have a lower loss than those close to the true value. That seems like unintended behaviour to me. Could it be an issue with the loss function?
Yeah we originally took the loss function for log-returns topics from the OpenGradient page here. It could indeed be related to the functional form of the (M)ZTAE loss function, because it flattens at large absolute deviations. It uses a tanh, which is symmetric and saturates at ±1.
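For concreteness, here’s a minimal sketch of how I understand that functional form (the name ztae_loss and the exact signature are just illustrative, and the actual OpenGradient implementation may differ in detail): both the inference and the ground truth are z-scored against a reference mean and standard deviation, squashed with tanh, and the loss is the absolute difference.

import numpy as np

def ztae_loss(y_pred, y_true, mu=0.0, sigma=1.0):
    # Sketch of a ZTAE-style loss: z-score both values, squash with tanh,
    # and take the absolute difference. tanh maps both z-scores into (-1, 1),
    # so the loss saturates once |z| is much larger than 1.
    z_pred = (y_pred - mu) / sigma
    z_true = (y_true - mu) / sigma
    return np.abs(np.tanh(z_pred) - np.tanh(z_true))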
Probably we don’t want it to saturate, but instead to keep growing with a shallower power-law behaviour. The knee should then stay at the same sigma as it is now, and the power-law slope should be a parameter we control. But all of this depends on whether this hypothesis is correct. What would the above figure look like for a standard MSE loss?
I see. I made a quick figure of the ZTAE loss function for different true values, with mean=0 and standard deviation=1. So the function becomes more asymmetric as the true value increases away from the mean. But because it flattens off, “infinite” values in the correct direction (relative to the mean) can receive quite low losses, which is what we’re seeing above.
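Roughly how such a figure can be generated, using the ztae_loss sketch above (the true values here are just illustrative examples):

import matplotlib.pyplot as plt

x_pred = np.linspace(-5, 5, 1000)
for x_true in [0.0, 0.5, 1.0, 2.0]:
    # Loss as a function of the inference, for a few example true values, with mean=0 and std=1
    plt.plot(x_pred, ztae_loss(x_pred, x_true, mu=0.0, sigma=1.0), label='x_true = ' + str(x_true))
plt.xlabel('inference')
plt.ylabel('ZTAE loss')
plt.legend()
plt.show()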
This is how the losses look for a topic with an MSE loss function. The scatter around the expected loss function is, I think, related to uncertainty in how the ground truth is defined (see here), plus differences between data providers. So the “true” values I’ve used (from Tiingo, rounded to the nearest minute) might be slightly different from what the reputers used, which is why the scatter increases as the difference from the ground truth decreases. Still, inferences whose differences are much larger than the ground-truth uncertainty should be largely unaffected, and indeed they have the largest losses.
Oh wow, that’s a great visualisation: so if x_true = 1, then x = +infinity receives a better loss than x < 0.5 in this case. That’s the somewhat ridiculous consequence of the tanh saturation.
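A quick numerical check with the ztae_loss sketch above (again assuming that functional form, with mean=0 and std=1) reproduces this:

x_true = 1.0
print(ztae_loss(1e12, x_true))  # ~0.24: an effectively 'infinite' inference, yet a fairly low loss
print(ztae_loss(0.5, x_true))   # ~0.30: much closer to the true value, but a higher loss
print(ztae_loss(0.0, x_true))   # ~0.76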
Sounds like a power-law (PL) modification of the tanh could do wonders. For instance, if we use this:
import numpy as np
import matplotlib.pyplot as plt

def smooth_power_tanh_general(x, alpha=0.5, beta=2.0, x0=1.0):
    # Linear (like tanh) for |x| << x0; beyond the knee at x0 it follows a power law
    # with log-log slope alpha when beta=2: alpha=0 saturates like tanh, alpha=1 stays linear.
    # np.abs keeps the function odd-symmetric and well-defined for any beta.
    y = x / (1 + np.abs(x / x0)**beta)**((1 - alpha) / 2)
    return y

xtest = np.linspace(-5, 5, 1000)
plt.plot(xtest, np.tanh(xtest), ':k', lw=2, label='tanh')
plt.ylim((-5, 5))
for alpha_test in np.array([0.25, 0.5, 1]):
    ytest = smooth_power_tanh_general(xtest, alpha=alpha_test)
    plt.plot(xtest, ytest, label='alpha = ' + str(alpha_test))
plt.legend()
plt.show()
Here are some tests replacing tanh in the ZTAE loss function with a power-tanh function.
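Concretely, the replacement looks something like the sketch below (power_ztae_loss is just an illustrative name, and the real topic loss may have additional pieces around it): the z-scoring stays the same and only the squashing function changes, so the knee still sits at one sigma while the loss keeps growing beyond it.

def power_ztae_loss(y_pred, y_true, mu=0.0, sigma=1.0, alpha=0.25, beta=2.0, x0=1.0):
    # Same structure as the ztae_loss sketch above, with tanh swapped for the power-tanh,
    # so the loss keeps rising (log-log slope ~alpha for beta=2) instead of saturating.
    z_pred = (y_pred - mu) / sigma
    z_true = (y_true - mu) / sigma
    return np.abs(smooth_power_tanh_general(z_pred, alpha=alpha, beta=beta, x0=x0)
                  - smooth_power_tanh_general(z_true, alpha=alpha, beta=beta, x0=x0))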
Regarding the best value for alpha, it seems there’s a trade-off between larger loss values for inferences far from the ground truth (higher alpha) and keeping the asymmetric behaviour of the tanh function (lower alpha).
Another way to see this is to look at a version of the above figure that’s ‘folded’ about the true value, so that for each function the lower line corresponds to inferences in the same direction as the true value.
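A sketch of what I mean by ‘folded’, building on the snippets above: plot the loss against the distance |x - x_true|, with one line for inferences on the same side as the true value (relative to the mean) and one for the opposite side.

x_true, mu, sigma = 1.0, 0.0, 1.0
delta = np.linspace(0, 5, 500)  # distance of the inference from the true value
for alpha in [0.1, 0.25, 0.5]:
    same_side = power_ztae_loss(x_true + delta, x_true, mu, sigma, alpha=alpha)
    opposite = power_ztae_loss(x_true - delta, x_true, mu, sigma, alpha=alpha)
    plt.plot(delta, same_side, label='alpha = ' + str(alpha) + ' (same direction as true value)')
    plt.plot(delta, opposite, '--', label='alpha = ' + str(alpha) + ' (opposite direction)')
plt.xlabel('|inference - true value|')
plt.ylabel('loss')
plt.legend()
plt.show()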
Based on this, I think alpha=0.1 is too close to the initial tanh function, so it will still have many of the same issues because it remains quite flat past the ‘knee’ of the function. alpha=0.5 might be too close to a pure power law: there’s not much difference between the positive and negative directions, so it may not sufficiently reward predictions in the right direction.
That leaves alpha ~= 0.25 as a nice middle ground?
If we apply this power-tanh loss function to the initial returns-topic data, we get the following median losses, which I think is a good improvement over the current tanh function!
Perfect! We can keep alpha (and plausibly some other moving parts) as free parameters, set alpha = 0.25 as the default, and keep a close eye on how this improves the network inference. Glad we got this sorted so quickly!