As a first step in extending the network simulator, we need it to generate mock categorical data. I went back and forth for a while on how best to do this, but in the end I reminded myself that we do not want to model what happens inside a classification task; we want realistic behaviour.
For this reason, I decided not to generate class labels, but label probabilities as the ground truth. The mock inferences can then be perturbations of these probabilities, just as with the regression problem.
The probabilities are drawn from a Dirichlet distribution, shown below for the example of three classes.
The nice thing about a Dirichlet distribution is that the probabilities sum to unity, and the alpha vector quantifies both the variation of the probabilities (through its normalisation; compare the top two panels) and how biased they are (through the differences between its elements).
This means that the normalisation of the alpha vector tells us how difficult the classification task is (less variation implies more similar probabilities, which indicates a higher difficulty), whereas the dissimilarity of its elements encodes how balanced the ground truth is (where a perfectly balanced ground truth implies that all labels have the same probability of occurring).
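To make this concrete, here is a minimal sketch (plain numpy, not simulator code) of how the normalisation and the element differences of the alpha vector shape the draws for three classes:

import numpy as np

rng = np.random.default_rng(seed=0)  # fixed seed so the sketch is reproducible

# Large, equal alphas: draws stay close to (1/3, 1/3, 1/3), i.e. similar
# probabilities, which corresponds to a difficult, balanced task.
print(rng.dirichlet([50., 50., 50.], 3))

# Small, equal alphas: draws sit near the corners of the simplex (close to
# one-hot), which corresponds to an easy, balanced task.
print(rng.dirichlet([0.1, 0.1, 0.1], 3))

# Unequal alphas: the mean probability differs per class, giving an
# imbalanced ground truth (the means are proportional to the alpha elements).
print(rng.dirichlet([1., 2., 6.], 3))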
In our tests, we will want to consider data sets of varying difficulty and balance. I have therefore introduced a mapping from the parameters 0 <= variation <= 1 and 0 <= balance <= 1 to the alpha vector. (I consciously chose the nomenclature “variation” as this is not technically the variance.) This mapping is somewhat arbitrary (there are many ways in which one can do this), but I think it fulfils the above goals:
import numpy as np

def f_dirichlet(n_epochs, n_labels, variation, balance):
    # balance=1 gives flat alphas; balance=0 makes them grow as rank**2,
    # which systematically favours the higher ranks
    alpha_base = np.arange(1, n_labels+1)**(2*(1-balance**0.5))
    alpha_base = alpha_base/np.median(alpha_base)  # normalise to median 1
    # variation=0 scales alpha by 100 (low spread); variation=1 by 0.01 (near one-hot)
    adjust_alpha = 10.**(4*(0.5-variation))
    alpha = alpha_base*adjust_alpha
    # one probability vector per epoch, each summing to unity
    probabilities = np.random.default_rng().dirichlet(alpha, n_epochs)
    return probabilities
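For reference, the five-epoch, five-label examples shown below can be produced with a call like probabilities = f_dirichlet(5, 5, 0.5, 1); the exact values will of course differ between runs, since the generator is not seeded.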
The resulting standard deviation of the probabilities looks like this (generated over 1000 epochs):
An all-unity alpha vector is indicated by the dots. We see that the variation parameter controls the standard deviation of the probabilities. Only if the variation is zero (i.e. probabilities are as similar as possible) does the balance parameter affect the standard deviation.
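For anyone wanting to reproduce the curve, here is a minimal sketch of the underlying statistic, under my reading of “standard deviation of the probabilities” as the spread of all drawn entries; the figure itself may have been computed slightly differently:

import numpy as np

# Spread of the drawn probabilities over 1000 epochs, for a small
# grid of (variation, balance) values.
for variation in (0.0, 0.5, 1.0):
    for balance in (0.0, 0.5, 1.0):
        probabilities = f_dirichlet(1000, 5, variation, balance)
        print(f'variation={variation}, balance={balance}: '
              f'std={probabilities.std():.3f}')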
Here are some examples of the probabilities drawn for default parameters.
{variation, balance} = {0.5, 1}:
array([[0.00366, 0.12113, 0.44394, 0.12758, 0.30369],
[0.19183, 0.2053 , 0.07842, 0.13638, 0.38807],
[0.18014, 0.40854, 0.16454, 0.11453, 0.13226],
[0.24297, 0.21941, 0.38331, 0.00817, 0.14614],
[0.03104, 0.30407, 0.45378, 0.12222, 0.08889]])
Here are those same examples when changing variation.
{variation, balance} = {0, 1}:
array([[0.19664, 0.20742, 0.19512, 0.19521, 0.20561],
[0.18457, 0.22529, 0.17311, 0.21244, 0.20459],
[0.20198, 0.19799, 0.19253, 0.1906 , 0.21689],
[0.2085 , 0.17609, 0.20515, 0.19277, 0.21748],
[0.17749, 0.20983, 0.21661, 0.21357, 0.1825 ]])
{variation, balance} = {1, 1}:
array([[0. , 0. , 0. , 0. , 1. ],
[1. , 0. , 0. , 0. , 0. ],
[0. , 0. , 1. , 0. , 0. ],
[0.71816, 0. , 0. , 0. , 0.28184],
[0. , 1. , 0. , 0. , 0. ]])
This clearly affects the contrast (standard deviation) of the probabilities.
Here are those same examples when changing balance.
{variation, balance} = {0.5, 0.5}:
array([[0.15591, 0.1891 , 0.03249, 0.20507, 0.41743],
[0.25992, 0.10724, 0.02145, 0.23007, 0.38133],
[0.25912, 0.02841, 0.46691, 0.05982, 0.18575],
[0.00858, 0.35486, 0.22506, 0.07287, 0.33863],
[0.05241, 0.15501, 0.17667, 0.11402, 0.50189]])
{variation, balance} = {0.5, 0}:
array([[0.0005 , 0.27699, 0.36949, 0.03341, 0.31961],
[0.00327, 0.36342, 0.05233, 0.13267, 0.44831],
[0.00032, 0.01882, 0.06006, 0.58172, 0.33909],
[0. , 0.04659, 0.06193, 0.25559, 0.63588],
[0.00044, 0.05993, 0.11238, 0.70294, 0.12432]])
This clearly affects the ordering of the probabilities, with low balance inducing a gradient of increasing probability with rank.
The interaction between the parameters is complicated, but logical. The standard deviation in the above figure is not the right parameter to probe this.
Here are the same examples when changing balance at high variation.
{variation, balance} = {1, 0.5}:
array([[0. , 0. , 0. , 0. , 1. ],
[0. , 0.00001, 0. , 0.99999, 0. ],
[0. , 1. , 0. , 0. , 0. ],
[0. , 0.90876, 0. , 0. , 0.09124],
[0. , 0. , 0.93811, 0.06189, 0. ]])
{variation, balance} = {1, 0}:
array([[0. , 0. , 0.99664, 0. , 0.00336],
[0. , 1. , 0. , 0. , 0. ],
[0. , 0. , 0. , 0.00009, 0.99991],
[0. , 0. , 0. , 0.02258, 0.97742],
[0. , 0. , 0.00001, 0. , 0.99999]])
High contrast, but increasingly skewed to the higher ranks for lower balance.
Here are the same examples when changing balance at low variation.
{variation, balance} = {0, 0.5}:
array([[0.10048, 0.13935, 0.22021, 0.25165, 0.28831],
[0.12065, 0.15535, 0.17098, 0.28649, 0.26653],
[0.09535, 0.18697, 0.2128 , 0.20971, 0.29518],
[0.09062, 0.16101, 0.23497, 0.21562, 0.29779],
[0.11131, 0.1559 , 0.21263, 0.23472, 0.28544]])
{variation, balance} = {0, 0}:
array([[0.01953, 0.08559, 0.17119, 0.25132, 0.47238],
[0.01626, 0.06522, 0.1777 , 0.29622, 0.44459],
[0.01388, 0.06034, 0.15471, 0.30948, 0.46159],
[0.01842, 0.06403, 0.18416, 0.27824, 0.45515],
[0.01976, 0.09072, 0.14122, 0.30183, 0.44647]])
Low contrast, and increasingly skewed to the higher ranks for lower balance, to the point that it can almost guarantee that the highest-rank element always has the highest probability.
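That last claim is easy to check numerically; here is a quick sketch using f_dirichlet from above, counting how often the highest-rank label wins at zero variation:

import numpy as np

# Fraction of epochs in which the highest-rank (last) label has the
# largest probability, at variation=0 and decreasing balance.
for balance in (1.0, 0.5, 0.0):
    probabilities = f_dirichlet(1000, 5, 0.0, balance)
    frac = np.mean(np.argmax(probabilities, axis=1) == 4)
    print(f'balance={balance}: last label wins in {frac:.1%} of epochs')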
I think these two parameters will give us a great way of testing network performance across a wide set of problem properties, as parameterised by variation and balance. I will go ahead and implement this. The run_network function will take four new arguments (a hypothetical sketch of the combined signature follows these descriptions):
problem_type is a string with possible values 'regression' and 'classification'. This variable is technically not needed (it could be implicitly encoded by setting the next variable to zero), but I prefer a version where this distinction is made explicitly and the next variable simply has no functionality if problem_type = 'regression'.
n_labels is an integer setting the number of labels used in the problem. This parameter only does something when we set problem_type = 'classification'.
variation is a real number in [0, 1] that indicates the degree of random variation between the probabilities, with higher contrasts reached for higher variation. Default is variation=0.5. This parameter only does something when we set problem_type = 'classification'.
balance is a real number in [0, 1] that indicates how balanced the probabilities are, with more pronounced systematic ordering occurring for lower balance. Default is balance=1. This parameter only does something when we set problem_type = 'classification'.
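Since the existing run_network signature is not shown in this post, the following is only a hypothetical sketch of how the four new arguments could hang together; the **kwargs stand in for the unchanged existing arguments, and the defaults for problem_type and n_labels are my assumptions (only variation=0.5 and balance=1 are specified above). It calls the get_class_history helper defined just below.

def run_network(problem_type='regression', n_labels=5,
                variation=0.5, balance=1, **kwargs):
    if problem_type == 'classification':
        # generate the mock classification ground truth
        time, labels, probabilities = get_class_history(
            n_labels=n_labels, variation=variation, balance=balance)
        # ... classification-specific simulation ...
    elif problem_type == 'regression':
        # n_labels, variation and balance intentionally have no effect here
        # ... existing regression path ...
        pass
    else:
        raise ValueError(f'unknown problem_type: {problem_type!r}')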
The actual “ground truth” generation (with labels being the ground truth) then becomes as simple as this:
import numpy as np
import pandas as pd

def get_class_history(num_periods=1000, n_labels=5, variation=0.5, balance=1, start_date='2020-12-01'):
    probabilities = f_dirichlet(num_periods, n_labels, variation, balance)
    # one-hot labels: flag the most probable class in each epoch
    labels = (probabilities == np.max(probabilities, axis=1, keepdims=True)).astype(int)
    time = pd.date_range(start=start_date, periods=num_periods, freq='D')  # generate the time axis
    return time, labels, probabilities
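For example, a shorter mock history with the default difficulty and balance (time has length 100; labels and probabilities have shape (100, 5)):

time, labels, probabilities = get_class_history(num_periods=100)
print(time[0])                      # first timestamp, 2020-12-01
print(labels[0], probabilities[0])  # one-hot label and its probabilities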
@nick @steve @Renata