Posted by

Tawsif Khan

on July 27, 2018

Bootstrapping Data for Significance Testing

This is a question I came across on Quora –

Why does it seem as if there is a disproportionate number of socially inept mathematicians/mathematics students compared to the average population?

To validate whether math graduates really suffer from low social skills, we can set up an experiment to compare the differences in the social skills of people who studied math and those who didn’t.

Technically, this is an example of a test for statistical significance.

In the retail industry, this concept is also used to calculate incremental sales by comparing a test and control group during a promotional event. At Rubikloud, for example, we help retailers optimize promotions by choosing optimal segments of customers and products using machine learning algorithms. The validation of the recommender systems requires continuous measurements of the effectiveness of these promotions and that requires testing for significance all the time.

But this process comes with a challenge that is often overlooked.

To demonstrate this issue and how we can solve it, we will construct a scenario to compare the social skills of math graduates and other graduates.

Note: Data presented in this scenario is hypothetical. Any resemblance to reality is pure coincidence.

We randomly picked 100 math graduates and another 100 who graduated from a different program. We measured their social skills based on a survey among their peers. Business graduates showed an unreasonable bias in their social skills and thus were excluded from this experiment.

We would now like to answer two questions:

  1. Is there a significant difference in social skills of math graduates vs others?
  2. How big is this difference?

The distribution of the social skills score between the two groups is found to be skewed as shown below:

The majority of the participants have very low social skills, while the portion with a decent score is small. This distribution is not Gaussian (normal). A normal distribution would be desirable, though, because it is much easier to work with than what we have now.
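As a rough illustration of how such skew could be checked numerically, the sketch below uses synthetic, exponentially distributed scores (an assumption, not the data from this scenario) together with SciPy's `skew` and `shapiro` functions.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Hypothetical right-skewed scores standing in for one group's survey results
scores = rng.exponential(scale=0.8, size=100)

# Skewness is close to 0 for symmetric data and positive for a long right tail
sample_skewness = stats.skew(scores)
# Shapiro-Wilk test: a small p-value is evidence against normality
shapiro_stat, shapiro_p = stats.shapiro(scores)

print(f"skewness = {sample_skewness:.2f}, Shapiro-Wilk p = {shapiro_p:.4f}")
```

A clearly positive skewness combined with a small Shapiro-Wilk p-value would suggest that a plain t-test on the raw scores is not a safe choice.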

Sure, we can see a difference in the graph above, but determining whether it is statistically significant requires either:

  • a transformation to a normal distribution and then performing t-tests or,
  • tests that don’t have a normality assumption (called nonparametric tests)

The problem, in either case, is that the results are not intuitive to interpret and are often challenging to communicate. For example, in a more practical scenario, transforming the data might strip away its context. And a nonparametric test, like the Wilcoxon/Mann-Whitney test, checks for a shift in the location of the distributions. But how does one quantify this difference in a meaningful way?
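For reference, here is a minimal sketch of what the two options might look like with SciPy. The lognormal scores and their parameters are made up for illustration, and the log transform is only one possible choice of transformation.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
# Hypothetical skewed scores for the two groups (not the data from this scenario)
math_scores = rng.lognormal(mean=-0.6, sigma=0.8, size=100)
other_scores = rng.lognormal(mean=-0.4, sigma=0.8, size=100)

# Option 1: log-transform, then run Welch's t-test on the transformed values
t_stat, t_p = stats.ttest_ind(np.log(math_scores), np.log(other_scores), equal_var=False)

# Option 2: Mann-Whitney U test on the raw values (no normality assumption)
u_stat, u_p = stats.mannwhitneyu(math_scores, other_scores, alternative="two-sided")

print(f"t-test on log scores: p = {t_p:.3f}")
print(f"Mann-Whitney U test:  p = {u_p:.3f}")
```

Both calls return a p-value, but the t-test now describes a difference in log units and the U statistic is based on ranks, so neither reports the size of the difference in the original score units.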


Bootstrapping is a method in which a data set is repeatedly resampled with replacement to build a distribution of the metric of interest, here the mean.

What this means is that we pick random individuals from a group in our experiment and calculate the average score of their social skills. We then repeat this process many times.

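This is useful because, according to the central limit theorem (together with the law of large numbers):

  • The averages of randomly drawn samples will approach the actual average of the population
  • The averages of the samples will approximately follow a normal distribution, even when the underlying data does not

import numpy as np

def create_bootstrap_mean(df, input_col, iterations=500):
    # Resample the column with replacement (same size as the data) and keep each mean
    values = df[input_col].to_numpy()
    return [np.random.choice(values, size=len(values)).mean() for _ in range(iterations)]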
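A possible way to try the function out is sketched below. The data frame, the column name `social_score`, and the exponential scores are all hypothetical. The function is restated so that the snippet runs on its own.

```python
import numpy as np
import pandas as pd
from scipy import stats

def create_bootstrap_mean(df, input_col, iterations=500):
    # Same idea as the function above: resample with replacement, keep each mean
    values = df[input_col].to_numpy()
    return [np.random.choice(values, size=len(values)).mean() for _ in range(iterations)]

np.random.seed(42)
# Hypothetical skewed scores for one group
group = pd.DataFrame({"social_score": np.random.exponential(scale=0.8, size=100)})

boot_means = np.array(create_bootstrap_mean(group, "social_score", iterations=2000))

raw_skew = stats.skew(group["social_score"])
boot_skew = stats.skew(boot_means)
print("skewness of raw scores:     ", raw_skew)
print("skewness of bootstrap means:", boot_skew)
print("observed mean:", group["social_score"].mean())
print("mean of bootstrap means:", boot_means.mean())
```

The bootstrapped means should be far less skewed than the raw scores, and their average should sit very close to the group's observed mean.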


To demonstrate this, one of the groups was resampled with the above function, and the average social skills score was calculated for different numbers of iterations. As the number of iterations increases, the distribution of these averages looks increasingly normal, and their mean settles on the group's observed mean, which is our estimate of the true mean. Bootstrapping the other group returns another approximately normal distribution.

Since the distributions of these averages from the two groups are approximately normal, the distribution of the difference of the averages is also approximately normal. We now take the distribution of the difference between math graduates and other graduates and check whether zero is a plausible value for it. This is done by calculating a confidence interval.

In this example, the average social skills score was 0.72 for math graduates and 0.87 for others. The 95% confidence interval of the average difference is [0.11, 0.19]. This range does not include 0, which implies that the two groups differ significantly in their social skills.
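One common way to obtain such an interval is the percentile method, sketched below on synthetic data. The helper `bootstrap_means`, the exponential scores, and the choice of the percentile method are assumptions made for illustration, and the numbers it prints will not match the ones quoted above.

```python
import numpy as np

rng = np.random.default_rng(7)
# Hypothetical skewed scores; the scales are assumptions, not the scenario's data
math_scores = rng.exponential(scale=0.72, size=100)
other_scores = rng.exponential(scale=0.87, size=100)

def bootstrap_means(values, iterations, rng):
    # Resample with replacement and keep the mean of each resample
    n = len(values)
    return np.array([rng.choice(values, size=n, replace=True).mean()
                     for _ in range(iterations)])

iterations = 5000
# Difference of bootstrapped means: others minus math graduates
diff = (bootstrap_means(other_scores, iterations, rng)
        - bootstrap_means(math_scores, iterations, rng))

# Percentile interval: the middle 95% of the bootstrapped differences
ci_low, ci_high = np.percentile(diff, [2.5, 97.5])
significant = not (ci_low <= 0 <= ci_high)

print(f"mean difference = {diff.mean():.3f}, 95% CI = [{ci_low:.3f}, {ci_high:.3f}]")
print("interval excludes zero:", significant)
```

Whether the interval excludes zero here depends on the synthetic draw. The point is only that the interval is reported in the same units as the score itself.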

Hence from our scenario, the conclusion would be –

Math graduates score 0.15 units lower, or about 17% lower (0.15 / 0.87), than other graduates in their social skills, and this difference is statistically significant.

We have answered both the questions –

  1. Is there a difference? Yes, and this difference is significant.
  2. How big is the difference? It’s 0.15 units of social skills on average.

Similar to this example, data with non-normal distributions are very common in the industry. For example, the transaction amounts of customers would often have a skewed distribution. The majority of customers spend very little, while the portion of large spenders is small. For comparing such distributions, although transformations and nonparametric tests are appropriate methods, these are often difficult to interpret. Bootstrapping, on the other hand, not only answers the question of significant difference but also quantifies this difference in a simple and intuitive way.