Monitoring topic and validator reward health

In this post, I will explore using the Gini coefficient to assess the health of topic and validator rewards. The goal is to ensure that rewards are distributed fairly and that no single validator or topic is dominating.

I will generate topic and validator reward distributions using both a simulation of the network as well as a power-law distribution. These values will then be normalized to create a probability mass function (pmf), and from there, we will compute the Gini coefficient. The Gini coefficient will give us a measure of inequality within the reward distribution, helping us determine whether rewards are being fairly distributed across topics and validators. The goal is to calibrate the use of the Gini coefficient to gain a clearer understanding of how well-balanced the rewards are.

For the simulations, I will focus on topics, but the same ideas can be applied to validators.

I’m computing the Gini coefficient as follows:

import numpy as np

def gini_coefficient(x):
    # Mean absolute difference over all pairs of entries
    mad = np.abs(np.subtract.outer(x, x)).mean()
    # Relative mean absolute difference
    rmad = mad / np.mean(x)
    # The Gini coefficient is half the relative mean absolute difference
    g = 0.5 * rmad
    return g
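
As a quick sanity check of this function (illustrative values only): a perfectly uniform pmf gives 0, while a pmf concentrated on a single entry gives (N-1)/N, which approaches 1 as N grows.

print(gini_coefficient(np.ones(4) / 4))                  # 0.0: perfectly equal fractions
print(gini_coefficient(np.array([1.0, 0.0, 0.0, 0.0])))  # 0.75, i.e. (N-1)/N for N = 4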

I am applying the Gini coefficient to the reward fractions for topics and validators, i.e., I take the rewards given to all topics and normalise by the total reward given to topics at that epoch. Per Wikipedia, a Gini coefficient of 0 corresponds to perfect equality, and 1 represents the situation where one individual gets all of the income. I started by looking at it on simulated data.

This doesn’t provide much insight. So instead I simulate the network for 1000 epochs and store the reward fractions over time. For each reward-fraction pmf, I compute the Gini coefficient, the number of topics or validators, the minimum entry of the reward fractions, and the maximum entry.
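
A minimal sketch of that bookkeeping, assuming a hypothetical simulate_epoch() stand-in for the network simulator that returns the per-topic rewards at one epoch:

records = []
for epoch in range(1000):
    rewards = simulate_epoch()            # hypothetical: per-topic rewards for this epoch
    fractions = rewards / rewards.sum()   # normalise into a pmf of reward fractions
    records.append({
        "gini": gini_coefficient(fractions),
        "n": len(fractions),
        "min_frac": fractions.min(),
        "max_frac": fractions.max(),
    })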

In these simulations, we vary the number of topics or validators and calculate the Gini coefficient:


We see that the Gini coefficient scales roughly with N. I don’t think that’s good for our metric: a Gini of 1 is supposed to signal an unhealthy distribution, but having more topics/validators is not, in general, unhealthy.

For the following scatter plots, I am computing the standard deviation of the log of the reward fractions and the log of the ratio of the maximum to the minimum reward fraction, as well as our metric:


We see the ~N behavior in these plots as well.

I played around with several methods to try to normalise out this dependency on the number of topics or validators, to no avail. After some offline discussions with the team, I have decided to look into entropy:

def entropy(pmf):
    # Shannon entropy (in nats) of a probability mass function
    return -np.sum(pmf * np.log(pmf))



It has the same issue and grows with N. To mitigate this, we can normalise by ln(N), the entropy of a uniform pmf of the same length. This normalised entropy is bounded on [0, 1]. Here are the results:

This is better: there is much less dependence on N, and it only slightly decreases as N increases. This comes with the drawback of being much more compressed (all values within about 0.5 of each other).
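
For reference, a minimal sketch of that normalisation, reusing the entropy function above (the helper name is mine):

def normalised_entropy(pmf):
    # Divide by ln(N), the maximum possible entropy for a pmf of this length,
    # so the result lies in [0, 1] regardless of the number of entries
    return entropy(pmf) / np.log(len(pmf))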

All simulations above were done by generating reward fractions using the network simulator. But “we expect power law distributions of wealth and thus topic or validator weight.” So I will also perform Monte Carlo (MC) simulations with a power law (PL).
I generated reward fractions from a PL with:

def generate_power_law(alpha, x_min, x_max, n_topics):
    # Generate n_topics values from a power-law distribution using inverse transform sampling
    r = np.random.uniform(0, 1, n_topics)
    x = (x_min**(1 - alpha) + r * (x_max**(1 - alpha) - x_min**(1 - alpha)))**(1 / (1 - alpha))
    # Normalise so the samples sum to 1, i.e. return a pmf of reward fractions
    return x / sum(x)

With the parameters:

    x_min = 1
    log_xmax_xmin_range = np.arange(0.5, 3.5, 0.5)
    alphas = [-1.01, -1.51, -2.01, -2.51, -3]

For each combination of parameters, I generate 1,000 samples, as sketched below. In the results that follow, I’m colouring by alpha and log(x_max/x_min).
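
A sketch of that sweep, reusing generate_power_law and the hypothetical normalised_entropy helper above. One assumption here: I treat the log_xmax_xmin_range values as log10(x_max/x_min), consistent with the base-10 ratios used later.

def sweep_power_law(n_topics, n_samples=1000):
    x_min = 1
    results = []
    for alpha in [-1.01, -1.51, -2.01, -2.51, -3]:
        for log_ratio in np.arange(0.5, 3.5, 0.5):
            x_max = x_min * 10**log_ratio            # assumes log10(x_max/x_min)
            for _ in range(n_samples):
                fracs = generate_power_law(alpha, x_min, x_max, n_topics)
                results.append({
                    "alpha": alpha,
                    "log_xmax_xmin": log_ratio,
                    "norm_entropy": normalised_entropy(fracs),
                    "std_log_frac": np.std(np.log10(fracs)),
                    "log_max_min": np.log10(fracs.max() / fracs.min()),
                })
    return results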




To me this is behaving nicely. The normalised entropy clearly decreases as the max/min ratio and stdev increase.

Now, how do we develop a decision rule? With binary classification. Say we want to detect whether the log max/min ratio is bigger than 1 (i.e., the maximum reward fraction is 10 times larger than the minimum). Our decision rule is an entropy threshold, i.e., if the entropy is smaller than X, then I conclude that the log(max/min) ratio exceeds 1. For a given decision rule we can compute typical classification metrics; here I am optimising the F1 score, which depends on precision and recall. So I sweep over a range of decision boundaries and compute their F1 scores to find the best decision boundary.
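
A sketch of that search, with hypothetical names: entropy_values and criterion_values hold one entry per simulated pmf (e.g. the normalised entropy and log10(max/min) from the sweep above), and target is the cutoff being detected.

def best_threshold(entropy_values, criterion_values, target):
    # Label a sample "unhealthy" when its criterion exceeds the target, then find
    # the entropy cutoff whose "entropy < cutoff" prediction maximises the F1 score.
    y_true = criterion_values > target
    best_f1, best_cut = 0.0, None
    for cut in np.linspace(entropy_values.min(), entropy_values.max(), 200):
        y_pred = entropy_values < cut
        tp = np.sum(y_pred & y_true)
        fp = np.sum(y_pred & ~y_true)
        fn = np.sum(~y_pred & y_true)
        if tp == 0:
            continue
        precision = tp / (tp + fp)
        recall = tp / (tp + fn)
        f1 = 2 * precision * recall / (precision + recall)
        if f1 > best_f1:
            best_f1, best_cut = f1, cut
    return best_cut, best_f1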

Results for detecting whether the log max/min ratio is larger than 1:

n_topics = 11:



n_topics = 500:

Above, the resulting decision rules vary significantly depending on N.

We can also develop a decision rule based on the stdev of log(fractions); maybe that is less sensitive to N.
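
With the hypothetical best_threshold sketch above, this is just a different criterion and target, e.g. (assuming norm_entropy and std_log_frac are arrays collected during the sweep):

cutoff, f1 = best_threshold(norm_entropy, std_log_frac, target=0.5)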

Results for detecting if the stdev is larger than 0.5:

n_topics = 11:


n_topics = 500:

This is much less sensitive than the decision rule based on the max/min ratio.

Is it possible to determine a stdev of log fractions that we actually want to detect? Maybe I need to explore normalisation more; maybe performance would be better if our x-axis criterion were normalised based on n as well…? I could also perform the classification training process on data from different numbers of topics.

This looks great (in principle). My only concern is that the normalised entropy has a tiny dynamic range, i.e. it’s pretty much always close to 1. I suppose that’s controlled by alpha, but it requires alpha<1 before it ever gets low… so the full range [0,1] never gets used. We should probably rescale one way or another.

Yeah I agree, it is very compressed. The entropy itself has higher dynamic range but still has the ~N behaviour. I feel like we need something between the two. I’m not sure how to achieve that immediately.

Rescaling ‘naively’ seems to work fairly well. By naively, I mean I compute the normalised entropy as entropy/ln(n) and then rescale it so the minimum is 0.9 and the maximum is 1. These are decent boundaries based on a naive analysis: with 10 topics/validators the normalised entropy is above 0.9, and as n increases the normalised entropies get closer and closer to 1, so we really just need a lower bound.
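
One reading of this rescaling, as a sketch (the anchors and helper name are my assumptions): affinely map the minimum and maximum normalised entropy observed in the simulated data for a given n onto 0.9 and 1.

def rescale_entropy(norm_entropy, h_min, h_max, lo=0.9, hi=1.0):
    # Affine map sending [h_min, h_max] (e.g. the min/max normalised entropy seen
    # in the simulations for this n) onto [lo, hi] = [0.9, 1]. Live values that fall
    # below h_min land below 0.9.
    return lo + (hi - lo) * (norm_entropy - h_min) / (h_max - h_min)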

Here is the data for n_topics = 10:



and n_topics = 1000:


The dynamic range gets smaller as n increases, but not as badly, and performing classification with this to detect whether log(max/min) > 1 now seems to work much better:
For 10 topics:


For 1,000 topics:


For 10,000 topics:


The decision cutoff is barely moving above! That’s what we want.

Now say we want to detect whether std(log(fractions)) > 0.5.
For 10 topics:



For 10,000 topics:


The decision cutoff here is moving drastically.

Overall, I am proposing to define unhealthy as the case where log(max/min) > 1. To detect this we calculate the entropy, normalise by ln(n), rescale to [0.9, 1], and then, if this number is less than about 0.95, we deem the network unhealthy. This should become more accurate as the number of participants increases, which is desirable.
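
Putting the proposal together as a sketch (the threshold and helper names are illustrative, and it follows the reading of the rescaling given earlier):

def is_unhealthy(fractions, h_min, h_max, threshold=0.95):
    # fractions: pmf of reward fractions for the current epoch;
    # h_min, h_max: calibration bounds for the normalised entropy (see rescale_entropy above)
    h = -np.sum(fractions * np.log(fractions)) / np.log(len(fractions))
    rescaled = rescale_entropy(h, h_min, h_max)
    # Below ~0.95, including anything that falls under the 0.9 lower bound, counts as unhealthy
    return rescaled < threshold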

Potential issue: there is a non-zero probability that the normalised, rescaled entropy is less than 0.9. In this case we can deem the network unhealthy automatically. This is really only a problem when there are few participants; we may need to be careful with the topic reward distribution since we currently only have 11 topics.

Potential improvement: make the lower bound (0.9) a function of n. I don’t know what this function should be; we could probably learn it fairly well from this data. However, I don’t know if that is necessary.

If we define unhealthy as log(max/min)>X, what should X be?

Great stuff! I also have some questions…

Not sure I agree with unhealthy as a factor of >10 difference between max and min. Max and min are the most stochastic you can get, and we don’t want noise to make us oscillate… Better to say something based on the stdev, but even then it’s not obvious why a factor of 10 would be critical. You clearly demonstrate something happens there, but to what extent are you not making an assumption about the decision boundary that implies this as a critical point? If I look at the scatter plots, there is no feature at the cutoff. So just looking at the data: where does it come from? Why is a system with a factor of 100 between min/max unhealthy?

Maybe naive, but these are the questions I have going through the results :slight_smile:

Thanks!

“Not sure I agree with unhealthy as a factor of >10 difference between max and min”:
I guess my thought was that if the max value was much larger than the min value, this would indicate unhealthiness. But now I realise this is really just detecting how small the smallest entry is. For example, if log10(max/min) > X we have max > min * 10^X, but since we are looking at fractions, max < 1, which implies min < 1/10^X. With what I did, X = 1, we are just looking for min < 0.1, which doesn’t really have anything to do with health. So I agree. This would make more sense with a factor of 100 between min/max, i.e. X = 2, but we are not seeing log10(max/min) > 2 in the simulations.

“You clearly demonstrate something happens there, but to what extent are you not making an assumption about the decision boundary that implies this as a critical point?”
“So just looking at the data: where does it come from?”

In my script, I am basically fixing an X at the beginning and then developing a decision rule to detect whether either std(log10(fracs)) or log10(max/min) is > X. I am deciding X; I picked the values 0.5 for the stdev and 1 for the ratio somewhat arbitrarily, based on the plots and their typical ranges. So these values for X aren’t really coming from the data. I’m not sure what the best way to approach that would be.

I think this goes back to what we discussed here: it may be useful to monitor these values over time once deployed and decide thresholds based on the experience we get in practice.

Yes I agree, let’s monitor over time instead of trying to draw lines in the sand :+1: