The following people also contributed to this article: Adrian Lazarev, Lyubov Kislinskaya, Alexander Kosov.
Authors of the Kolmogorov ABacus library: Vadim Glukhov, Yegor Shishkovets, Dmitry Zabavin.
Are you familiar with this situation: you developed a machine learning model and tried to evaluate its effect, but the experiment showed that the model produced no economic effect at all?
Does this mean the model is really ineffective? Or is there some other reason for the poor results? And if so, what is it?
Something similar happened to the AlfaBank team, and we, GlowByte Advanced Analytics, came to their aid, bringing along the Kolmogorov ABacus A/B testing library we developed (named after the abacus, the classic counting device)!
By the way, even before you get acquainted with the advantages of our library, we are glad to inform you that at the time of publication we have already released the library's code as open source: github.com/kolmogorovlab/abacus. Just run "pip install kolmogorovabacus" and ABacus is at your disposal!
The case of AlfaBank
Colleagues from AlfaBank have developed a model for personalizing advertising banners: depending on the client’s attributes, the model ranks the banners of various bank products in descending order of their predicted attractiveness for the client. At the training stage, the accuracy of the model turned out to be quite high: the model managed to select the banners that the client was most likely to click on quite accurately. But when a pilot was launched to assess the economic effect of the implementation of this model, it turned out that it was negative for most banners (!), and almost no effect for the rest.
Let us once again draw attention to this symptom, which points to a possible error in the design of the experiment: the model shows acceptable accuracy at the training stage but fails to confirm it at the effect-evaluation stage.
Causes of Erroneous Results in Statistical Experiments
1) Small group sizes (low test power)
First, remember that a statistical test does not guarantee a correct answer to the question of whether the compared groups differ. Moreover, any statistical test is designed to be wrong at a SPECIFIED rate: in some cases it will notice an effect that is definitely not there (the proportion of such cases is called the type I error rate), and in some cases it will fail to notice an effect that is definitely there (the type II error rate).
Moreover, the smaller the groups participating in the experiment, the more likely we are to miss a real effect (in other words, the higher the type II error rate).
But here a question arises: how can we fail to notice an achieved effect? After all, if we can say that an effect was achieved, we have already noticed it; we either notice the effect or we don't. So how can we estimate the proportion of cases where we "should have noticed the effect" but didn't? Can we know in advance whether an experiment will detect a statistically significant effect? That will be the subject of a separate article, but the short answer is yes: we can simulate the experiment. We select two groups of customers, add an artificial effect to the metric values in one of them, and then check whether the test detects it. Repeating this simulation many, many times, we calculate the fraction of runs in which we failed to notice an effect that was definitely present in the data.
In the ABacus library, such a repeated simulation of the experiment is run with a single command (apart from describing the config).
But more on this, as I promised, in a separate article.
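To make the idea concrete, here is a minimal sketch of such a power simulation in plain Python with NumPy and SciPy. This is not the ABacus API, just an illustration of the procedure: the synthetic log-normal "spending" metric, the group size, and the injected effect are all assumptions chosen for the example.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

def estimate_power(metric, group_size, effect, alpha=0.05, n_sim=2_000):
    """Estimate test power: repeatedly draw two groups, inject a known
    effect into one of them, and count how often a t-test detects it."""
    detected = 0
    for _ in range(n_sim):
        sample = rng.choice(metric, size=2 * group_size, replace=False)
        a, b = sample[:group_size], sample[group_size:]
        _, p = stats.ttest_ind(a, b + effect)
        if p < alpha:
            detected += 1
    return detected / n_sim  # power = 1 - type II error rate

# synthetic log-normal "spending" metric of 10,000 clients
metric = rng.lognormal(mean=3.0, sigma=1.0, size=10_000)
power = estimate_power(metric, group_size=500, effect=8.0)
print(f"estimated power: {power:.2f}")
```

Running this with a smaller `group_size` shows the power dropping, which is exactly the "small groups, high type II error" problem described above.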
2) Incomparability of groups
The second reason for the erroneous results of the experiment is the incomparability of the groups.
If the groups compared in the experiment are not comparable to each other, then the results of the experiment can be anything. For example, if the control group (which receives no exposures) is dominated by customers with higher average spending or more frequent purchases, then no matter how effective our impact on the customers of the test group, we are unlikely to be statistically significantly superior to the results of the control group. That is, the groups to be compared must be comparable to each other.
However, even if the compared groups are comparable to each other but not to the overall client population, an impact that showed an effect in the experiment may fail to show it when rolled out to all clients. Worse, based on the results of such an incorrectly designed experiment, an ineffective marketing action may even be made a regular practice, that is, it will bring losses on a regular basis. Therefore, the compared groups must be comparable both to each other and to the general population of customers.
So, an incorrectly set up experiment can, on the one hand, deprive us of truly effective ways of influencing the client, and on the other hand, it can encourage us to regularly perform unprofitable actions, as well as spend time and effort to maintain their regularity.
As we managed to find out, in the case of AlfaBank, the problem was precisely in the incomparability of the compared groups. Now let’s talk about how exactly we found this out and how we managed to fix it.
How to ensure the comparability of groups? Approaches to the Formation of Comparable Groups
1) Formation of groups randomly
One of the most common misconceptions is the belief that when we randomly select groups, we are guaranteed to get comparable groups.
Let's test this hypothesis with an A/A test! We repeat the random selection of groups many times, each time assessing the significance of the difference between the formed groups on the target metric, and calculate the proportion of cases in which we observe a statistically significant difference. This proportion is the type I error rate (the share of cases where we notice an effect that is not there). Ideally, it should not exceed the permissible type I error level (for market research, the industry standard is alpha = 0.05). Otherwise, random group formation is unacceptable: it produces dissimilar groups, which is why we would too often see an effect where there is none.
Illustration: table of type I error rates under random splitting
As the table above shows, under random group formation the proportion of cases in which we notice an effect that is not actually there behaves extremely unpredictably and almost always exceeds the maximum permissible 5%. In other words, random splitting does not produce comparable groups for an experiment.
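The same simulation idea lets you measure the type I error rate of random splitting yourself. Below is a minimal sketch (again, not the ABacus API): on a homogeneous synthetic population the rate hovers near the nominal alpha, while on real, heterogeneous client data it can be much higher.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

def aa_type_i_error(metric, group_size, alpha=0.05, n_sim=2_000):
    """Repeat A/A splits with NO injected effect and count how often
    a t-test still reports a significant difference."""
    false_alarms = 0
    for _ in range(n_sim):
        sample = rng.choice(metric, size=2 * group_size, replace=False)
        _, p = stats.ttest_ind(sample[:group_size], sample[group_size:])
        if p < alpha:
            false_alarms += 1
    return false_alarms / n_sim

# a homogeneous synthetic population; real client bases are heterogeneous,
# which is where the error rate starts to exceed alpha
metric = rng.lognormal(mean=3.0, sigma=1.0, size=10_000)
rate = aa_type_i_error(metric, group_size=500)
print(f"type I error rate: {rate:.3f}")
```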
2) Stratification
The most reliable method of forming groups for an experiment is stratification:

1. Cluster the customer base, deliberately allocating an excessive number of clusters. Why excessive? To obtain "dense", homogeneous clusters of customers with similar buying behavior. To find truly "dense" clusters, the ABacus library uses HDBSCAN, a clustering method based on estimating the density of the clustered data. The resulting clusters are also called strata, hence the name of the method: stratification.

2. From each cluster, randomly select a FIXED SHARE of customers. Random selection within a cluster is safe here, because clustering has already grouped together clients that are known to be comparable to each other. That is, whichever customers we pick from the same cluster, they will be comparable, so random selection does not violate group comparability, as long as we take exactly the same SHARE of clients from every cluster.

3. Combine the clients selected from the different clusters into one group.

4. Form the second group in the same way (without intersecting the first one).
This approach allows you to get groups that are comparable both to each other and to the general population of customers: the numerical ratio of representatives of different clusters in both groups will be the same – and the same as in the total population of customers.
In the ABacus library, stratification is also performed by a single command (again, except for the config description).
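As a rough sketch of the stratified split described above (not the actual ABacus implementation, which clusters with HDBSCAN; here plain KMeans on synthetic features stands in for brevity):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)

def stratified_split(features, n_clusters=20, share=0.1):
    """Form two non-overlapping groups that take the SAME share
    of every cluster (stratum) of the client base."""
    labels = KMeans(n_clusters=n_clusters, n_init=10,
                    random_state=1).fit_predict(features)
    group_a, group_b = [], []
    for c in np.unique(labels):
        idx = rng.permutation(np.flatnonzero(labels == c))
        k = int(len(idx) * share)
        group_a.extend(idx[:k])        # first `share` of the cluster
        group_b.extend(idx[k:2 * k])   # next `share`, no overlap
    return np.array(group_a), np.array(group_b)

# synthetic client features: 5,000 clients x 4 attributes
features = rng.normal(size=(5_000, 4))
a, b = stratified_split(features)
print(len(a), len(b))  # equal group sizes, disjoint by construction
```

Because both groups take the same share of every cluster, they mirror each other and the overall population in strata composition.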
Looking ahead: today all of AlfaBank's experiments evaluating the effectiveness of personalization models are launched using stratification from the ABacus library. But let's finally turn to the question of how our library managed to earn such trust from our colleagues at AlfaBank.
So, in the case of AlfaBank, the difficulty was that it was necessary not to prepare a new experiment, but to correctly summarize the results of the one that had already been completed, when we could no longer influence the composition of the groups that took part in the experiment.
3) Poststratification
We have figured out how to avoid the incomparability of groups when preparing for the experiment. But what if the experiment used disparate groups of customers and the experiment has already ended?
For this there is the poststratification method: in the results of the experiment, we exclude just enough observations from each stratum (each client cluster) that the stratum's share in the group matches its share in the original client population.
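A minimal sketch of this downsampling step (an illustration of the idea, not the ABacus implementation; the toy strata and shares are invented for the example):

```python
import numpy as np

rng = np.random.default_rng(2)

def poststratify(group_strata, population_shares):
    """Return indices of observations to KEEP so that each stratum's
    share in the group matches its share in the full client base."""
    counts = {s: int(np.sum(group_strata == s)) for s in population_shares}
    # the most under-represented stratum caps the achievable group size
    max_total = min(counts[s] / share for s, share in population_shares.items())
    keep = []
    for s, share in population_shares.items():
        target = int(max_total * share)
        idx = np.flatnonzero(group_strata == s)
        keep.extend(rng.choice(idx, size=target, replace=False))
    return np.sort(np.array(keep))

# toy example: stratum 1 is over-represented in the experimental group,
# while both strata make up 50% of the full client base
group = np.array([0] * 300 + [1] * 700)
keep = poststratify(group, {0: 0.5, 1: 0.5})
shares = np.bincount(group[keep]) / len(keep)
print(shares)  # both strata now at 0.5
```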
It was this method that we used in the case of AlfaBank. But before showing how exactly we managed to ensure the comparability of groups in the already completed AlfaBank experiment, let’s talk about how we managed to understand that the problem of this experiment was precisely the incomparability of the compared groups.
How do you assess whether groups are comparable or not?
1) Evaluation of the Group Formation Algorithm: Multiple A/A Test
As a rule, this issue is resolved NOT at the level of specific groups formed, but at the level of trust in the algorithm used for the formation of groups as a whole.
Just as we estimated the type I error rate of random group formation with an A/A test, we can estimate it for any other group-formation algorithm. In particular, when choosing which group-formation method to build the ABacus library on, we settled on stratification via HDBSCAN clustering, because this method proved itself in repeated A/A tests: it consistently kept the type I error rate below the permissible 5% limit.
But what if the groups of participants in the experiment have already been formed? How do you assess whether these groups are comparable or not?
2) Assessment of the comparability of already formed groups: A/A test
The simplest way is to apply the same A/A test method to the formed groups and compare the values of the target metric in these groups in the period before the experiment (based on historical data).
If we observe a statistically significant difference between the compared groups before the experiment begins, then these groups are not comparable to each other.
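Such a pre-period check is a single significance test on historical data. A minimal sketch (the synthetic "spending" values stand in for real pre-experiment metrics):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)

# pre-experiment values of the target metric in the two formed groups
# (synthetic log-normal "spending" here; in practice, historical data)
control_pre = rng.lognormal(3.0, 1.0, size=1_000)
test_pre = rng.lognormal(3.0, 1.0, size=1_000)

t, p = stats.ttest_ind(control_pre, test_pre)
if p < 0.05:
    print("groups differ BEFORE the experiment -> not comparable")
else:
    print("no significant pre-period difference detected")
```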
The ABacus library forms groups by clustering-based stratification, so we have no doubts about the quality of its output; nevertheless, every time groups are formed in ABacus, an A/A test is run as well, to be completely sure the formed groups are comparable.
An A/A test that compares the formed groups on the target metric over the pre-experiment period usually catches initial incomparability. However, even if the A/A test reveals no statistically significant difference between the formed groups, this does not yet mean the groups are comparable in their strata composition.
3) Comparison of the shares of different strata in the formed groups
A more reliable method is based on comparing the shares of the different strata (clusters) within the formed groups:

1. For each stratum, calculate what fraction of it ended up in the group.

2. Compare the calculated shares across strata: if different strata entered the group in different proportions, the formed group is no longer comparable with the initial customer population, and the results of the experiment cannot be summarized correctly without poststratification.
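This check is a simple share comparison. A minimal sketch with invented stratum counts (not real AlfaBank data):

```python
import numpy as np

def strata_shares(labels):
    """Share of each stratum inside a set of clients."""
    counts = np.bincount(labels)
    return counts / counts.sum()

# stratum labels of the whole client base and of one formed group (toy data)
population = np.array([0] * 5_000 + [1] * 3_000 + [2] * 2_000)
group = np.array([0] * 200 + [1] * 400 + [2] * 400)  # skewed group

pop_shares = strata_shares(population)  # [0.5, 0.3, 0.2]
grp_shares = strata_shares(group)       # [0.2, 0.4, 0.4]
max_gap = np.max(np.abs(pop_shares - grp_shares))
print(f"largest share gap: {max_gap:.2f}")  # large gap -> poststratify first
```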
These are the symptoms we found in the analysis of the groups that took part in AlfaBank’s experiments. And it was this circumstance that immediately made it clear to us that the conclusions made by our colleagues about the effectiveness (or rather ineffectiveness) of the banner personalization models they developed needed to be adjusted. Yes, it was not yet clear whether the effect of these models would be detected after the application of poststratification, but it was already clear that without poststratification it was too early to recognize these models as ineffective.
So, here is what the strata fractions looked like in the groups that took part in the AlfaBank experiment:
And here is what the strata shares in the same groups looked like after applying poststratification:
Another very indicative characteristic is the average score of the model within each of the groups:
You can see how much poststratification equalized the average score between the groups:
And here, finally, are the results of the experiment before and after the application of poststratification:
As a result: before poststratification the effect of the models looked statistically insignificant, but poststratification restored the validity of the evaluation, and the models proved their effectiveness for all the banners they recommended.
P.S. Other Methods for Assessing the Comparability of Formed Groups
Another method for assessing the comparability of formed groups is pseudo-labeling. It involves building a binary classification model trained to distinguish control-group members from test-group members. The intuition is as follows: if the model achieves sufficient accuracy (i.e., it can reliably tell control-group members from test-group members), then the formed groups differ from each other, that is, they are not comparable.
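A minimal sketch of pseudo-labeling with scikit-learn (an illustration, not the method as taught in the course; the classifier choice and synthetic features are assumptions):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(4)

# client features of the control (label 0) and test (label 1) groups;
# here both are drawn from the same distribution, i.e. comparable
control = rng.normal(size=(1_000, 5))
test = rng.normal(size=(1_000, 5))

X = np.vstack([control, test])
y = np.array([0] * len(control) + [1] * len(test))

# cross-validated ROC-AUC near 0.5 means the classifier cannot tell
# the groups apart (comparable); AUC well above 0.5 signals a difference
auc = cross_val_score(GradientBoostingClassifier(), X, y,
                      cv=5, scoring="roc_auc").mean()
print(f"cross-validated ROC-AUC: {auc:.2f}")
```

Cross-validation matters here: an overfit in-sample score would "distinguish" even identical groups.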
We cover the pseudo-labeling method in more detail in our course "Immersion in A/B Testing" on the A2NCED platform, which I was lucky enough to co-author and co-host. We are happy to invite everyone who, like us, enjoys taking the methodology of statistical experiments apart down to the atoms.
We also often give talks at NoML Community meetups, where, alongside various approaches to testing statistical hypotheses, topics of applied artificial intelligence, advanced statistics, optimization methods, and complex architectures built on them are actively discussed. Recordings of previous meetups are available on the NoML YouTube channel, including a video overview of the ABacus library's features.