CZAR loss function for returns prediction topics

The CZAR (Composite Zero-Agnostic Return) loss function is designed to address limitations observed in previous loss formulations used for inference synthesis in returns prediction topics. This post documents the motivation, functional form, and recommended parameter settings for CZAR loss. This function will replace ZPTAE in returns prediction topics.

Full loss function, including gradient and hessian for model training

import numpy as np

def derivative(x):
    # d/dx arctan(x)
    return 1.0 / (1.0 + x**2)

def antiderivative(x):
    return np.arctan(x)

def double_derivative(x):
    # Magnitude of the second derivative of arctan(x)
    return 2.0 * np.abs(x) / (1.0 + x**2)**2

def eps_effective(eps, delta):
    # Rescale epsilon so that 1 - loss(z_true, 0) / loss(0, epsilon) crosses zero at epsilon
    if delta == 0:
        return np.arctan(eps)

    A = (1 + delta**2) * (antiderivative(eps + delta) - antiderivative(delta))
    beta = delta / (1 + delta**2)  # coefficient on eps_eff^2 in loss(0, eps_eff, 1)

    # Solve beta*x^2 + x - A = 0 for positive x
    return (-1 + np.sqrt(1 + 4 * beta * A)) / (2 * beta)

def softplus(x):
    return np.maximum(x, 0.0) + np.log1p(np.exp(-np.abs(x)))

def norm_smooth(z_true, eps, delta, tau):
    # Minimum value of the normalisation at z_true, set by the limit that loss(z_true,0)
    # does not decrease as z_true increases.
    # Simplified from: 1 - loss(z_true, 0) / loss(0, epsilon)
    a = np.abs(z_true)
    d2p1 = delta**2 + 1
    num = d2p1 * (antiderivative(a + delta) - antiderivative(delta))
    denom = eps + delta / d2p1 * eps**2
    norm_min = 1.0 - num / denom

    if tau <= 0:
        # Hard transition
        return np.maximum(norm_min, 0.0)

    # Smooth transition when norm drops below zero
    # Scale tau_eff by |norm_inf| so the asymptote is invariant across eps, delta
    # Asymptotic value of norm_min as |z_true| -> inf
    num_inf  = d2p1 * (0.5*np.pi - antiderivative(delta))
    norm_inf = 1.0 - num_inf / denom
    tau_eff = np.abs(tau) * np.abs(norm_inf)
    return softplus(norm_min / tau_eff) / softplus(1 / tau_eff)

def czar_loss(y_true, y_pred, std, mean=0, alpha=1, epsilon=1, tau=0.05):
    """
    Composite Zero-Agnostic Return Loss

    Asymmetric, piecewise function that is
        * Linear (alpha=0) or quadratic (alpha>0) when y_pred has opposite sign to y_true
        * Linear (alpha=0) or quadratic (alpha>0) when |y_pred| > |y_true|, with a decreasing gradient as |z_true| increases
        * Arctangent transition from 0 < |y_pred| < |y_true|

    Args:
        y_true: True returns
        y_pred: Predicted returns
        std: Standard deviation of true returns
        mean: Mean of true returns
        alpha: MSE term constant (alpha=0 is linear only, alpha=1 is maximum gradient)
        epsilon: Loss softening scale, in units of standard deviation. Optimum is eps~1
        tau: Scaling for softening hinge function
    Returns:
        Value of loss
    """

    if alpha < 0 or alpha > 1:
        raise ValueError(f'alpha must be between 0 and 1, got {alpha}')

    z_true = (y_true - mean) / std
    z_pred = (y_pred - mean) / std

    s = np.where(z_true == 0, 1, np.sign(z_true))
    s_pred = np.where(z_pred == 0, 1, np.sign(z_pred))
    a = np.abs(z_true)
    u = s * z_pred

    # Apply horizontal shift to function for smooth change in gradient
    # Alpha should be between 0 and 1. 1/sqrt(3) shifts to the peak of the hessian function
    delta = alpha / np.sqrt(3)
    d2p1 = delta**2 + 1

    d_true = z_true + s * delta
    d_pred = z_pred + s_pred * delta

    h1 = d2p1 * double_derivative(delta)
    h3 = d2p1 * double_derivative(d_true)

    # Region 1: opposite sign (u <= 0): grad = -s + MSE term
    # Constant so that the middle branch hits zero at z_pred = z_obs
    C = s * d2p1 * (antiderivative(d_true) - antiderivative(s * delta))
    L1 = 0.5 * h1 * z_pred**2 - s * z_pred + C

    # Region 2: same sign, before threshold (0 < u <= a): grad = -s * antiderivative(z_pred)
    # antiderivative(d_true) term so that the middle branch hits zero at z_pred = z_obs
    L2 = s * d2p1 * (antiderivative(d_true) - antiderivative(d_pred))

    # Region 3: past threshold (u > a): grad = s * derivative(z_obs) + MSE term
    dz = z_pred - z_true
    L3 = 0.5 * np.minimum(h3, h1) * dz**2 + s * d2p1 * derivative(d_true) * dz

    # Softening term
    if epsilon > 0:
        eps_eff = eps_effective(epsilon, delta)
        softening_0 = czar_loss(0, eps_eff, 1., epsilon=0, alpha=alpha)
        norm = norm_smooth(z_true, eps_eff, delta, tau)
        Lsoft = norm * softening_0
    else:
        Lsoft = 0

    return np.where(u <= 0, L1, np.where(u <= a, L2, L3)) + Lsoft

def czar_gradient(y_true, y_pred, std, mean=0, alpha=1):
    z_true = (y_true - mean) / std
    z_pred = (y_pred - mean) / std

    s = np.where(z_true == 0, 1, np.sign(z_true))
    s_pred = np.where(z_pred == 0, 1, np.sign(z_pred))
    a = np.abs(z_true)
    u = s * z_pred

    # Apply horizontal shift to function for smooth change in gradient
    # Alpha should be between 0 and 1. 1/sqrt(3) shifts to the peak of the hessian function
    delta = alpha / np.sqrt(3)
    d2p1 = delta**2 + 1

    d_true = z_true + s * delta
    d_pred = z_pred + s_pred * delta

    h1 = d2p1 * double_derivative(delta)
    h3 = d2p1 * double_derivative(d_true)

    # Region 1: opposite sign (u <= 0): grad = -s + MSE term
    # Actual gradient:
    #   G1 = h1 * z_pred - s
    # Pseudo gradient for numerical stability:
    G1 = h1 * z_pred - np.sign(z_true)

    # Region 2: same sign, before threshold (0 < u <= a): grad = -s * derivative(z_pred)
    G2 = -s * d2p1 * derivative(d_pred)

    # Region 3: past threshold (u > a): grad = s * derivative(z_true) + MSE term
    # Actual gradient:
    #   G3 = np.minimum(h3, h1) * (z_pred - z_true) + s * d2p1 * derivative(d_true)
    # Pseudo gradient for numerical stability:
    G3 = np.minimum(h3, h1) * (z_pred - z_true)

    return np.where(u <= 0, G1, np.where(u <= a, G2, G3)) / std

def czar_hessian(y_true, y_pred, std, mean=0, alpha=1):
    z_true = (y_true - mean) / std
    z_pred = (y_pred - mean) / std

    s = np.where(z_true == 0, 1.0, np.sign(z_true))
    s_pred = np.where(z_pred == 0, 1.0, np.sign(z_pred))
    a = np.abs(z_true)
    u = s * z_pred

    # Alpha should be between 0 and 1. 1/sqrt(3) shifts to the peak of the hessian function
    delta = alpha / np.sqrt(3)
    d2p1 = delta**2 + 1

    d_true = s * (np.abs(z_true) + delta)
    d_pred = s_pred * (np.abs(z_pred) + delta)

    # Region 1: opposite sign (u <= 0): grad = -s + MSE term
    h1 = d2p1 * double_derivative(delta)
    H1 = np.full_like(d_pred, h1)

    # Region 2: same sign, before threshold (0 < u <= a): grad = -s * derivative(z_pred)
    # Actual hessian:
    #   H2 = double_derivative(d_pred) * d2p1
    # Pseudo hessian for numerical stability:
    H2 = (1.0 + d_pred**2) * double_derivative(d_pred)

    # Region 3: past threshold (u > a): grad = s * derivative(z_true) + MSE term
    # Actual hessian:
    #   h3 = double_derivative(d_true) * d2p1
    # Consistent with H2 pseudo hessian:
    h3 = (1.0 + d_true**2) * double_derivative(d_true)
    H3 = np.full_like(d_pred, np.minimum(h1, h3))

    return np.where(u <= 0, H1, np.where(u <= a, H2, H3)) / std**2
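As a sanity check on the formulas above, the Region 2 branch (0 < u <= a, ignoring softening) can be verified in isolation by comparing the analytic gradient against a central finite difference. The sketch below is a minimal standalone re-statement of that branch for scalar inputs with std = 1 (not part of the original code), so it runs without the full implementation:

```python
import numpy as np

def region2_loss(z_true, z_pred, alpha=1.0):
    # Region 2 branch of CZAR (0 < s*z_pred <= |z_true|), std = 1, no softening
    delta = alpha / np.sqrt(3)
    d2p1 = delta**2 + 1
    s = np.sign(z_true)
    return s * d2p1 * (np.arctan(z_true + s * delta)
                       - np.arctan(z_pred + np.sign(z_pred) * delta))

def region2_grad(z_true, z_pred, alpha=1.0):
    # Analytic gradient of the Region 2 branch w.r.t. z_pred
    delta = alpha / np.sqrt(3)
    d2p1 = delta**2 + 1
    return -np.sign(z_true) * d2p1 / (1 + (z_pred + np.sign(z_pred) * delta)**2)

# Central finite difference at a point inside Region 2 (z_true=1, z_pred=0.5)
h = 1e-6
fd = (region2_loss(1.0, 0.5 + h) - region2_loss(1.0, 0.5 - h)) / (2 * h)
analytic = region2_grad(1.0, 0.5)
```

The branch is also zero at z_pred = z_true by construction, which is the continuity condition the constant C in Region 1 enforces.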

As detailed in previous discussions about losses in returns prediction topics, we identified issues with the ZTAE (Z-score Tanh Absolute Error) loss function where losses saturate at extreme inference values. The ZPTAE loss (Z-score Power-Tanh Absolute Error) was introduced to address this by replacing the hyperbolic tangent with a modified version that transitions to a power law at large values, preventing saturation for outliers. However, further testing revealed additional considerations for inference synthesis that motivated the development of CZAR loss.

In returns prediction topics, the network often puts too much weight on constant predictions with values near the mean (and similarly in machine learning model training). The most obvious illustration of this is that predicting zero returns is technically better than drawing predictions from the same PDF as the ground truth (a direct consequence of increasing the dimensionality of the problem):
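This claim is easy to verify numerically for a symmetric loss: under MSE, a constant zero prediction scores E[y^2] = sigma^2, while independent draws from the true PDF score E[(y' - y)^2] = 2 sigma^2. A quick Monte Carlo sketch with synthetic Gaussian returns (illustrative only, not network data):

```python
import numpy as np

rng = np.random.default_rng(42)
y_true = rng.standard_normal(200_000)      # "true" returns, sigma = 1
y_same_pdf = rng.standard_normal(200_000)  # independent draws from the same PDF

mse_zero = np.mean(y_true**2)                 # constant-zero model, ~1 (= sigma^2)
mse_same = np.mean((y_same_pdf - y_true)**2)  # same-PDF model, ~2 (= 2*sigma^2)
```

So under MSE the zero-returns model wins by roughly a factor of two, despite carrying no information.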

Another way to visualise this is by comparing the integrated expected log loss (since inference synthesis uses log loss) for a Gaussian “true” returns distribution as a function of constant predicted returns values for various loss functions. Common loss functions (e.g. MSE, MAE) have a clear minimum in the expected log loss at predicted returns = 0 (solid lines). Circle points indicate the integrated loss for a model that randomly draws predictions from the same PDF as the true returns distribution. In general, most loss functions clearly favour predicting zero over drawing from the same PDF.

ZTAE/ZPTAE are better loss functions in this regard because they are designed to better reward close predictions when the true value is far from the mean, and indeed, their expected log loss is flatter near zero. This is due to the way the asymmetry of the loss functions shifts depending on the true return value. However, both loss functions have issues:

  • ZTAE flattens for large values, meaning extreme outliers can receive relatively low losses if they are in the right direction.
  • ZPTAE was created to address the issue with ZTAE, but both it and ZTAE are largely concave functions (hessian < 0), so they cannot be used for model training.

The zero-returns issue can be somewhat alleviated by adding a constant ‘smoothing’ term to the loss function (dashed lines in the above figure). It increases the integrated log loss most at predicted returns = 0, but doesn’t sufficiently flatten the integrated expected losses to disfavour predicting the mean.

We can take this idea further by introducing smoothing that scales inversely with the absolute true returns value, i.e. the smoothing is maximal at the mean true value and decreases as the absolute value of the true return increases. Applying this to the ZPTAE loss function results in a non-zero floor in the loss for true returns near zero:

Adding this to the integrated log loss test (zptae_scaled_smooth), we see it is so effective that there is now a local peak at predicted returns = 0, and the curve is otherwise very flat between +/- 1. So with adaptive smoothing we can level the playing field between a zero-returns model and a ‘same PDF’ model (dots) in the network.

To summarise, we want a loss function for returns topics that:

  1. Is asymmetric and rewards predictions when the true value is far from the mean (ZTAE/ZPTAE-like)

  2. Trends to ~infinity for large differences in predicted and true values to adequately handle large outliers

  3. Has adaptive softening to down-weight constant returns ~= 0 models

  4. Is convex, so can be used for training models

We tested versions of asymmetric linear and quadratic functions (by modifying the gradients for predictions on opposite sides of the true value) and a sigmoid gradient function (that shifts vertically depending on the true value), both with adaptive smoothing, but found they did not outperform the ZPTAE function. We attributed this to insufficient steepness of the loss function about the true value (i.e. the loss functions were too wide), so to the above points we can add:

  5. Has a sharp change in the gradient function about the true value (ZTAE/ZPTAE-like)

The CZAR (Composite Zero-Agnostic Return) loss function addresses all of these points.
The gradient function 1 / (1 + x^2) was chosen to approximately match the ZPTAE function, but the concave parts of ZPTAE are replaced with quadratic functions of varying gradients, so the loss remains convex (hessian > 0) and can be used for model training. For predictions in the correct direction, the gradients decrease with increasing true values, so that overestimates are punished less than underestimates. A single parameter alpha controls the value of the constant hessian, and has no effect on inference synthesis results so long as it is small (<~0.1; the default is 0.01, which helps suppress the influence of very large outliers). alpha applies a horizontal shift to the arctangent function so that we get a smooth transition in the gradient and hessian functions across y=0. alpha=0 gives linear behaviour for large errors, and alpha=1 gives the maximum constant for the MSE term. This simultaneously addresses Points 1, 2, 4, and 5.
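The 1/sqrt(3) shift can be checked directly: the magnitude of the hessian kernel, |2x / (1 + x^2)^2|, peaks exactly at x = 1/sqrt(3), so shifting by delta = alpha/sqrt(3) places alpha=1 at the peak of the hessian function. A quick standalone check on a fine grid:

```python
import numpy as np

def double_derivative(x):
    # Magnitude of d^2/dx^2 arctan(x), as used in the CZAR code
    return 2.0 * np.abs(x) / (1.0 + x**2)**2

x = np.linspace(0.0, 3.0, 300_001)
peak_x = x[np.argmax(double_derivative(x))]
# peak_x should sit at ~1/sqrt(3) ~ 0.5774
```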

The softening normalisation factor is constrained by the requirement that the loss for a predicted return = 0 does not decrease as the absolute true value increases. The solid black line shows the function that keeps the loss at predicted return=0 constant, but this becomes negative when true returns > epsilon (the softening scale at true returns = 0). Therefore we pass the ‘ideal’ normalisation through a hinge function (softplus) to prevent negative normalisation, with the sharpness of the hinge parameterised by tau. This addresses Point 3.
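The softplus used for the hinge is the numerically stable form of log(1 + e^x): writing it as max(x, 0) + log1p(e^{-|x|}) avoids overflowing exp for large positive arguments while matching the naive formula elsewhere. A standalone illustration:

```python
import numpy as np

def softplus(x):
    # Stable log(1 + exp(x)): never evaluates exp of a large positive number
    return np.maximum(x, 0.0) + np.log1p(np.exp(-np.abs(x)))

naive = np.log1p(np.exp(2.0))    # fine for small arguments
stable_small = softplus(2.0)     # matches the naive form
stable_large = softplus(1000.0)  # naive form would overflow; this returns 1000.0
```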

This leaves us with two free parameters, epsilon (softening scale) and tau (hinge scale), which we can optimise by comparing the mean log losses for the ‘same PDF’ model (random sampling from true returns PDF) and a model predicting constant returns=0. The difference between the models is minimal for epsilon ~ 0.75-1 and tau <~ 0.1. For simplicity and to avoid sharp transitions in the normalisation, we therefore chose default values of epsilon=1 and tau=0.05.

We can visualise the different behaviour of the ZPTAE and CZAR loss functions by comparing colourmaps of the loss (left panels) and log loss (right panels) as a function of the true and predicted returns (for standard deviation=1). Both are asymmetric functions where the region of low losses increases as |true_returns| increases and predictions in the wrong direction receive the highest losses. However, CZAR has higher losses than ZPTAE when true returns are close to the mean (= 0), so provides less benefit to predictions that are easier to make.

Alternatively, comparing loss versus the predicted return for different true values highlights the steeper behaviour of CZAR (solid lines) compared to ZPTAE (dotted lines) for predictions with large errors from the true values, depending on the chosen value of alpha.


This is great @joel! So for model training we set alpha=1? I suppose there might be a difference between training and validation steps?

Yes indeed. Very small gradients and hessians can lead to some rather unwanted behaviour in gradient boosting models (due to Newton descent during boosting iterations), so alpha=1 is more stable during training. Small alpha values result in a very large change in the gradient/hessian function between positive and negative directions.

This is a bit different from inference synthesis or model validation/evaluation, where large gradients can overly punish predictions in the wrong direction (relative to the reward for getting the direction right). So in these cases we want alpha ~ 0. It’s also preferable to use the logarithm of the loss in these cases because it provides better contrast for accurate predictions (see the true vs predicted returns figures above), but it cannot be used for training as log(loss) is not a convex function.
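On the last point, the logarithm of a convex function need not be convex, which is easy to illustrate: x^2 is convex, but log(x^2) violates the midpoint convexity condition. A minimal sketch:

```python
import numpy as np

def log_sq(x):
    # log of the convex function x^2
    return np.log(x**2)

# Convexity requires f((a+b)/2) <= (f(a) + f(b)) / 2; log(x^2) violates this
a, b = 1.0, 3.0
mid_value = log_sq((a + b) / 2)            # log(4) ~ 1.386
chord_value = (log_sq(a) + log_sq(b)) / 2  # log(9)/2 ~ 1.099
```

Here the midpoint value exceeds the chord value, so the function bulges above its chords and gradient boosting's convexity assumption fails.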

I’ll also add here that we have slightly modified the gradient and hessian functions (mainly in “Region 2” between zero and the true value) to achieve a smooth function for gradient boosting. This doesn’t affect cases where we only use the loss function (inference synthesis, model validation/evaluation).

Without this, boosting steps would increase when approaching the true value from zero: