Tired of working out which of the countless statistical tests fits each hypothesis, and what constraints the available data impose? Then the bootstrap is your tool. It requires no parametric assumptions about the data and no nontrivial mathematics, yet it can be applied to a wide range of statistical estimates.
Prelude
The bootstrap lets you build the distribution of almost any sample statistic (metric) by resampling observations from a sample obtained from a database or from an A/B test, with no prior restrictions or requirements on the data. With it, you can:

Build a confidence interval, for example, for the 60th percentile, the sum, or even the variance;

Evaluate experiment results for the median;

Find the p-value for ratio metrics with dependent observations, such as CTR or average check;

Run a power analysis for different statistical criteria in an A/B test;

Try to find in which part of the distribution the effect occurred.
❗️Leitmotif❗️
The bootstrap pretends to be the population. Ideologically, you can think of it this way: you get to run as many “experiments” as you like to test a single hypothesis, because you have access to the entire «general population», which is your original sample. This illusion is maintained by sampling observations with replacement.
However, the bootstrap does not generate new information and, accordingly, does not increase the representativeness of the initial data.
In addition, bootstrapped samples are drawn at the size of the original sample. This is necessary to get a reasonably accurate estimate of the variability (variance) of the statistic of interest, which depends on the original sample size, just as the standard error of the mean difference does in the t-test.
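This claim can be sanity-checked numerically. Below is a small sketch on synthetic data (the exponential scale and sample sizes are arbitrary, not from the article): the standard deviation of bootstrapped means should land close to the analytic standard error s / sqrt(n).

```python
import numpy as np

# Sketch on synthetic data (arbitrary scale and sample size, not the
# article's dataset): the spread of bootstrapped means should be close
# to the analytic standard error s / sqrt(n).
rng = np.random.default_rng(42)
sample = rng.exponential(scale=100, size=1_000)  # e.g. user spendings

boot_means = [
    rng.choice(sample, size=len(sample), replace=True).mean()
    for _ in range(5_000)
]

analytic_se = sample.std(ddof=1) / np.sqrt(len(sample))
boot_se = np.std(boot_means, ddof=1)
print(analytic_se, boot_se)  # the two estimates nearly coincide
```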
A generalized implementation of the bootstrap function looks simple enough.

1. Take the original sample;

2. Conduct an “experiment” by drawing observations with replacement into a bootstrap sample;

3. Compute the statistic of interest on the resulting sample and append it to an array;

4. Repeat steps 2 and 3 many, many times to get an empirical distribution of the statistic;

5. Extract the information you need from this distribution;

6. Visualize it.
import numpy as np
import pandas as pd
from scipy import stats
import seaborn as sns
import matplotlib.pyplot as plt


def bootstrap(sample, n_trials=5_000, statistic=np.median):
    rng = np.random.default_rng()
    stat_distrib = []
    # draw bootstrap samples and compute the statistic on each
    for _ in range(n_trials):
        boot_sample = rng.choice(sample, size=len(sample), replace=True)
        stat_distrib.append(statistic(boot_sample))
    result = do_some_math(stat_distrib)
    do_some_viz(result, stat_distrib)
    return result


def do_some_math(data):
    pass


def do_some_viz(result, data):
    pass
Practice
Let’s take a look at these bootstrap features with examples. Let’s take a set of continuous values from an exponential distribution, such as user spending, as a basis. This distribution might look something like this.
1. Constructing confidence intervals
Let’s estimate the 60th percentile of our sample with an interval estimate at a confidence level of 95%.
CONF_LVL = 0.95


def bootstrap_ci(sample, n_trials=5_000, statistic=np.median):
    rng = np.random.default_rng()
    stat_distrib = []
    # draw bootstrap samples and compute the statistic on each
    for _ in range(n_trials):
        boot_sample = rng.choice(sample, len(sample), replace=True)
        stat_distrib.append(statistic(boot_sample))
    result = do_some_math(stat_distrib)
    do_some_viz(result, stat_distrib)
    return result


def do_some_math(data):
    # confidence interval
    left_q = (1 - CONF_LVL) / 2
    right_q = 1 - left_q
    ci = np.quantile(data, [left_q, right_q])
    return ci


def do_some_viz(res, data):
    hist = plt.hist(data, bins=32, color="lightsalmon")
    ymax = hist[0].max()
    plt.vlines(np.mean(data), ymin=0, ymax=ymax, colors="black",
               label=f'Statistic mean: {np.mean(data).round(3)}')
    plt.vlines(res, ymin=0, ymax=ymax // 2.5, linestyle="--", colors="black",
               label=f'CI: {res[0].round(3)} and {res[1].round(3)}')
    plt.xlabel('statistic value', fontsize=14)
    plt.ylabel('frequency', fontsize=14)
    plt.legend(loc=0)
    return None


def quant_60(data):
    return np.quantile(data, 0.6)


ci = bootstrap_ci(spendings, statistic=quant_60)
As a result, we get approximately the following graph of the empirical distribution of the chosen statistic. The mean of this distribution will be very close to the 60th-percentile estimate in our original sample.
In practice, this can be used to build daily charts of the metrics of interest with their CI boundaries, or to present an important business metric obtained in the test group of an experiment, which usually concerns money.
2. Results of Experiments on Arbitrary Statistics
Sometimes there is a product need to evaluate the results of the experiment not for the average, but for the median. Here’s how to do it:
CONF_LVL = 0.95


def bootstrap_ab(s1, s2, n_trials=5_000, statistic=np.median):
    rng = np.random.default_rng()
    stat_distrib = []
    for _ in range(n_trials):
        boot_s1 = rng.choice(s1, len(s1))
        boot_s2 = rng.choice(s2, len(s2))
        stat_distrib.append(statistic(boot_s1) - statistic(boot_s2))
    result = do_some_math(stat_distrib)
    do_some_viz(result[1], stat_distrib)
    return result


def do_some_math(data):
    # confidence interval
    left_q = (1 - CONF_LVL) / 2
    right_q = 1 - left_q
    ci = np.quantile(data, [left_q, right_q])
    # p_value via a normal approximation of the bootstrap distribution
    quant = stats.norm.cdf(x=0, loc=np.mean(data), scale=np.std(data, ddof=1))
    p_value = quant * 2 if 0 < np.mean(data) else (1 - quant) * 2
    return p_value, ci
p_value, ci = bootstrap_ab(spendings_t, spendings_c)
In this case, bootstrap samples must be drawn from two “general populations”: the test and the control. Once we obtain the empirical distribution of differences in the statistic, we can treat it just as in the CLT-based approach.
To find the p-value for a two-sided hypothesis, we need the proportion of cases where the difference equals 0 or is even more extreme. To do this, we simply find which quantile of our empirical distribution the value 0 falls into and count both tails.
The CI can also be used as a criterion: if 0 lies beyond the boundaries of the CI, we reject the null hypothesis H0.
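The tail-counting and CI-as-criterion rules described above can be sketched as two small helpers (hypothetical names; an alternative to the normal approximation used in do_some_math above):

```python
import numpy as np

def empirical_p_value(stat_distrib, null_value=0.0):
    # Two-sided p-value: the fraction of bootstrapped differences lying
    # in the smaller tail relative to the null value, doubled.
    data = np.asarray(stat_distrib)
    tail = min(np.mean(data <= null_value), np.mean(data >= null_value))
    return min(2 * tail, 1.0)

def ci_rejects_h0(ci, null_value=0.0):
    # CI as a criterion: reject H0 when the null value lies outside it.
    return not (ci[0] <= null_value <= ci[1])
```

For a bootstrap distribution centered far from 0, `empirical_p_value` returns a value near 0; for one centered on 0 it returns a value near 1.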
To test a hypothesis about the difference in means, there is a well-developed statistical apparatus, namely the beloved t-test. In it, by the CLT, the difference in means has a normal distribution, from which the measure of noise (standard error) and the confidence interval estimates are derived.
So, when comparing means in an experiment, the bootstrap gives approximately the same results as the t-test. The accuracy of these results grows with the size of the original samples and the number of iterations. One can say that the bootstrap on means is a computationally costly approximation of the t-test. The two also have approximately the same power, but more on that below.
The example of means also makes it clearer why bootstrap samples are formed at the size of the source data. If you inflate the bootstrap samples by a factor of 25, the empirical distribution of mean differences narrows, and its standard deviation becomes about 5 times smaller than the standard error calculated for the t-test (assuming the group sizes are roughly equal).
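A quick numeric check of this effect (synthetic data, arbitrary parameters): inflating the bootstrap sample size 25-fold shrinks the spread of bootstrapped means by roughly sqrt(25) = 5 times.

```python
import numpy as np

# Sketch on synthetic data (arbitrary parameters): inflating the
# bootstrap sample size 25-fold shrinks the spread of bootstrapped
# means by roughly sqrt(25) = 5 times.
rng = np.random.default_rng(7)
sample = rng.exponential(scale=100, size=500)

def boot_mean_sd(size, n_trials=3_000):
    means = [rng.choice(sample, size=size, replace=True).mean()
             for _ in range(n_trials)]
    return np.std(means, ddof=1)

sd_original = boot_mean_sd(len(sample))        # bootstrap at original size
sd_inflated = boot_mean_sd(25 * len(sample))   # bootstrap at 25x size
print(sd_original / sd_inflated)  # close to 5
```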
3. Working with Ratio Metrics
If we divide each user’s spending by their number of orders, we get a distribution of per-user averages. The mean of such a distribution is the average check per user, i.e., an average of averages.
Usually, though, the average check in a product is defined as the sum of all user spending divided by the total number of orders. Such indicators are called ratio metrics; CTR is another example.
The picture below shows the discrepancy between the values of the average check per user and the average check.
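A toy example (made-up numbers, not the article's data) of why the two definitions diverge: a user with many cheap orders pulls the per-user average one way, while the ratio metric weights every order equally.

```python
import numpy as np

# Made-up numbers: one user with many cheap orders, one with a single
# expensive order.
spend = np.array([100.0, 50.0])  # total spending per user
orders = np.array([10, 1])       # number of orders per user

avg_check_per_user = np.mean(spend / orders)  # mean of per-user means
ratio_avg_check = spend.sum() / orders.sum()  # ratio metric

print(avg_check_per_user)  # 30.0
print(ratio_avg_check)     # ~13.64
```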
The essence of the problem with ratio metrics lies in evaluating them with statistical criteria, which usually require independence of the input values. That is, if you build a distribution where each observation is a single check value, it will always contain dependent quantities, since several orders may have been placed by the same user.
But even here there are no obstacles for the bootstrap.
def bootstrap_ratio(s1_num, s1_denom, s2_num, s2_denom, n_trials=5_000):
    rng = np.random.default_rng()
    stat_distrib = []
    for _ in range(n_trials):
        boot1_num = rng.choice(s1_num, len(s1_num))
        boot1_denom = rng.choice(s1_denom, len(s1_denom))
        ratio1 = boot1_num.sum() / boot1_denom.sum()
        boot2_num = rng.choice(s2_num, len(s2_num))
        boot2_denom = rng.choice(s2_denom, len(s2_denom))
        ratio2 = boot2_num.sum() / boot2_denom.sum()
        stat_distrib.append(ratio1 - ratio2)
    result = do_some_math(stat_distrib)
    do_some_viz(result[1], stat_distrib)
    return result


p_value, ci = bootstrap_ratio(spending_t, n_check_t, spending_c, n_check_c)
In this implementation, you need to decide which samples go into the numerator and which into the denominator. Building the empirical distribution and extracting answers from it is by now a familiar step.
From the picture below, you can see that the mean of this distribution is much closer to the observed difference in average checks.
4. Power Analysis
The power of a test usually refers to the probability of accepting the alternative hypothesis H1 when it is actually true. Power typically depends on the size of the available data, on the model of the effect produced by the experimental exposure, and on the ability of the statistical test to detect this effect.
Since we can run as many “experiments” as we like, we can literally count the fraction of cases in which we accepted H1 at a given level alpha across different statistical tests. Let’s take the t-test, the Mann–Whitney test, and the bootstrap on means.
ALPHA = 0.05


def power_compute(s1, s2, n_trials=200):
    rng = np.random.default_rng()
    # dict for the stat tests' p_values
    test = {'t': [], 'mw': [], 'btsp': []}
    for _ in range(n_trials):
        boot_s1 = rng.choice(s1, len(s1))
        boot_s2 = rng.choice(s2, len(s2))
        test['t'].append(stats.ttest_ind(boot_s1, boot_s2)[1])
        test['mw'].append(stats.mannwhitneyu(boot_s1, boot_s2)[1])
        # bootstrap_ab returns (p_value, ci); keep only the p_value
        test['btsp'].append(bootstrap_ab(boot_s1, boot_s2,
                                         n_trials=500, statistic=np.mean)[0])
    result = do_some_math(test)
    do_some_viz(test)
    return result


def do_some_math(test_dict):
    power_dict = {}
    for test in test_dict:
        np_arr = np.array(test_dict[test])
        power = len(np_arr[np_arr <= ALPHA]) / len(np_arr)
        power_dict[f'{test}_power'] = power
    return power_dict


def do_some_viz(test_dict):
    df = pd.DataFrame(test_dict)
    df = df.melt(var_name="test", value_name="p_val")
    sns.ecdfplot(data=df, x='p_val', hue="test")
    plt.vlines(x=ALPHA, ymin=0, ymax=1, colors="black")
    plt.xlabel('Alpha')
    plt.ylabel('Power')
    return None


power_dict = power_compute(spendings_c, spendings_t)
The output is a dictionary with the power values for the selected tests. The picture shows an empirical cumulative distribution function (ECDF) graph. You can read it as a regular histogram whose bins are stacked from left to right on top of one another. The ECDF is more informative here, however, because we want not frequencies but the total fraction of cases where the p-value fell below alpha. The proportion of interest lies at the intersection of the vertical line with the curve of the corresponding test.
From the graphs, you can see that the power of the t-test matches that of the bootstrap on means and sits at about 80%.
Bootstrap power analysis can be useful for summarizing the results of an experiment.
For example, an A/B test may have been stopped early because it showed a significant positive result, but power analysis then reveals that H1 can be correctly accepted only 50% of the time, which is comparable to flipping a coin.
5. Percentile by percentile
Sometimes A/B tests produce surprising results that do not meet our expectations. Suppose an experiment has been run, and the target metric statistically increased on mobile, while on desktop the result is negative but not significant; the t-test was used as the criterion. Such a drop is hard to ignore, however, and for desktop it may turn out that a nonparametric test flags the result as significant with a power of 90%.
Since the Mann–Whitney test works with ranks, you can use the bootstrap to “walk” through the deciles of the test and control groups and find where the shift of ranks (positions) occurred. In the variant below, it happened in the first four deciles.
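A minimal sketch of such a decile-by-decile check (hypothetical function name, not the author's exact code): bootstrap a confidence interval for the test-minus-control difference at each decile; deciles whose CI excludes 0 are where the ranks shifted.

```python
import numpy as np

def bootstrap_deciles(s1, s2, n_trials=5_000, conf_lvl=0.95):
    # For each decile (10th..90th percentile), bootstrap a CI for the
    # s1-minus-s2 difference; a CI that excludes 0 marks the part of
    # the distribution where the effect lives.
    rng = np.random.default_rng()
    deciles = np.arange(10, 100, 10)
    diffs = np.empty((n_trials, len(deciles)))
    for i in range(n_trials):
        boot_s1 = rng.choice(s1, len(s1), replace=True)
        boot_s2 = rng.choice(s2, len(s2), replace=True)
        diffs[i] = (np.percentile(boot_s1, deciles)
                    - np.percentile(boot_s2, deciles))
    left_q = (1 - conf_lvl) / 2
    cis = np.quantile(diffs, [left_q, 1 - left_q], axis=0)
    return {d: (cis[0, j], cis[1, j]) for j, d in enumerate(deciles)}
```

The deciles whose interval excludes 0, e.g. `{d: ci for d, ci in result.items() if ci[0] > 0 or ci[1] < 0}`, are the candidates for where the effect is concentrated.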
Summary
Clearly, the bootstrap does not scale well because of its computational demands, especially if the company runs a stream of A/B tests where dozens or hundreds of product metrics are computed at a time.
However, for specific problems, or when computing the statistical significance of the most unexpected functions of the data in experiments, you can always try to “knock” on your data with this blunt computational stick so that it gives up a little more information. Once you grasp the main idea behind the bootstrap’s implementation, everything else is limited only by the imagination of the analyst using it.
———
Acknowledgement and Usage Notice
The editorial team at TechBurst Magazine acknowledges the invaluable contribution of Павел Мосин, the author of the original article that forms the foundation of our publication. We sincerely appreciate the author’s work. All images in this publication are sourced directly from the original article, where a reference to the author’s profile is provided as well. This publication respects the author’s rights and enhances the visibility of their original work. If there are any concerns or the author wishes to discuss this matter further, we welcome an open dialogue to address potential issues and find an amicable resolution. Feel free to contact us through the ‘Contact Us’ section; the link is available in the website footer.