Everything a front end engineer should know about A/B testing

What is A/B testing?

A/B testing is a data-driven way to test how users react to your user interface in the real world and to change the user experience accordingly. It is one way to build highly effective user interfaces that users come to love and that also help your core business. A front end engineer should understand the underlying statistical properties involved in designing and implementing effective, bug-free A/B tests.

A/B testing in simple words

Though A/B testing is a popular technology term, the concept itself is very old. You divide your users into two groups: group A, which gets the default behaviour, and group B, which gets the experiment behaviour.

You give each of these groups a different user experience and ask yourself which of them performed better. Once you find that a particular group did better, you know you should change your default behaviour to the better performing one.

Take an example. Say you have a shopping website with the "add to cart" button displayed as a big white button. Can you make that button better? You think making it red will make it easier to spot, so more people will click it. However, there is also a possibility that red is associated with danger, so users might click it less often.

The best way to find out is to test this behaviour by letting, say, 5% of your users see the red button and checking whether they click it more often than the remaining 95%.

If you do see an improvement in click-through rate (CTR) on the red button, you can then roll out the change to 100% of your users.

Important factors to consider when designing A/B tests

The user selection should be 100% random

The decision to put a user into group A or B should be 100% random. For example, say you have a website which sells both men's and women's clothes, and you want to decide which clothes to show on your homepage by default. If you put all your male users in group B and then show them men's clothes on the home page, you will see better engagement on the homepage.

However, this engagement is caused by the gender bias in the user selection. That is why any experiment group should always be 100% randomly selected: it should be representative of the overall user base.

One way to do this is to generate a random number for every user that visits your site and then put the user into group A or group B based on whether the number is even or odd.


// Reuse the persisted ID if we have one; otherwise assign a random ID (0-99) and persist it.
const userId = getUserIdFromLocalStorage() ?? Math.floor(Math.random() * 100);
setUserIdInLocalStorage(userId);

if (userId % 2 === 0) {
  // The user is in bucket B: show the experiment behaviour.
} else {
  // The user is in bucket A: keep the default behaviour.
}

Note that you need not always divide users 50-50. You can also run experiments on a much smaller percentage of traffic, but see the caveats in the next section.
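
For example, here is a minimal sketch of putting roughly 5% of traffic into the experiment, reusing the persistent userId (0-99) from the snippet above:

// Put roughly 5% of users into the experiment instead of a 50-50 split.
// userId is the persistent 0-99 ID from the earlier snippet.
const inExperiment = userId % 100 < 5;

if (inExperiment) {
  // Show the red button to this ~5% slice of users.
}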

User selection should be stable

When you put a particular user into a particular bucket, you should keep that user in the same bucket for the duration of the experiment.

This is what we mean by stability of the user selection.

The reason for this is that if a user sees different experiment buckets on different visits, they might be surprised by the change and take actions they normally would not.

For example, if a user sees the "Add to Cart" button as white and it becomes red when they refresh the page, they might click the button just to figure out why.

Now, you might ask: what if the user saw the white button two days ago and sees it as red now because the experiment was only rolled out today?

This is a good question. To mitigate it, you should run your experiment long enough for this surprise effect to wear off.
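
One simple way to keep the selection stable is to derive the bucket deterministically from the persistent user ID and the experiment name, instead of re-rolling a random number on every visit. Below is a sketch with a hypothetical getBucket helper; any reasonably uniform hash would do:

// Deterministically map a stable user ID to a bucket for a given experiment.
function getBucket(userId, experimentName) {
  const key = `${experimentName}:${userId}`;
  let hash = 0;
  for (let i = 0; i < key.length; i++) {
    hash = (hash * 31 + key.charCodeAt(i)) >>> 0; // simple 32-bit rolling hash
  }
  return hash % 2 === 0 ? 'A' : 'B';
}

// The same user always lands in the same bucket for this experiment,
// no matter how many times they reload the page.
const bucket = getBucket(42, 'red-add-to-cart-button'); // 42 = the persistent userId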

Experiment data should be statistically significant

Remember that the laws of probability only help when you are dealing with large samples. For example, you cannot say with 100% certainty whether a single coin toss will come up heads or tails. However, you can very confidently claim that if you toss a coin 1000 times, it will come up heads roughly 500 times.

Similarly, if you are doing an A/B test, you should always make sure that both groups see enough traffic to be representative of your user base.

For example, for the "Add to Cart" button experiment, you would conclude the experiment by calculating the following numbers.

Group A conversion = (number of clicks on white button) / (number of times white button was shown)
Group B conversion = (number of clicks on red button) / (number of times red button was shown)

If your Group A conversion number is something like 5/15, that is far too small a sample to draw any conclusions from; you should have thousands of data points. If you do not have a lot of traffic, run the experiment for many days to reach the desired numbers.

I will not go into things like statistical deviation in this post, but there is a fair bit of maths involved here. There are online statistical significance calculators like these to compute whether your experiment results are statistically significant.
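
As a rough illustration of that maths, here is a sketch of a two-proportion z-test with made-up numbers (use a proper calculator or statistics library for real decisions):

// Rough sketch of a two-proportion z-test for comparing conversion rates.
function zScore(clicksA, viewsA, clicksB, viewsB) {
  const pA = clicksA / viewsA;
  const pB = clicksB / viewsB;
  const pooled = (clicksA + clicksB) / (viewsA + viewsB);
  const standardError = Math.sqrt(pooled * (1 - pooled) * (1 / viewsA + 1 / viewsB));
  return (pB - pA) / standardError;
}

// |z| above roughly 1.96 corresponds to ~95% confidence that the groups really differ.
console.log(zScore(480, 10000, 540, 10000));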

The only difference between experiment groups should be the change under test

Imagine you decide to run an experiment where the "Add To Cart" button is made red, but the developer responsible accidentally adds a very large CSS file to achieve this color change. This adds several MB to the payload and increases the page load time.

Now not only is the button color different between the groups, but so is the load time.

Looking at the experiment data, you might conclude that the red button was ineffective, when the real cause was the slower page load.

So always make sure the change under test is as small as possible, code-wise.
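
For instance, here is a sketch of keeping the change minimal: toggle one small CSS class for bucket B rather than shipping extra assets (the bucket value and class names here are hypothetical):

// Toggle a single class instead of loading a separate stylesheet, so the only
// difference users experience is the button color.
const addToCartButton = document.querySelector('.add-to-cart');
if (bucket === 'B') {
  addToCartButton.classList.add('add-to-cart--red');
}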

Beware of multiple experiments running at the same time

For large companies there will be many experiments running at the same time. This is totally fine as long as the experiments are independent and users are chosen using purely random criteria.

For example, you might have an experiment on the homepage where each user gets classified into A or B.

At the same time, you have another experiment running on the cart icon. Let's call this split C and D.

So if you classify users purely randomly, the split would look like this:

  • Group A: 50% of users; by the laws of probability, 50% of these will be in group C and the rest in group D.
  • Group B: 50% of users; by the laws of probability, 50% of these will be in group C and the rest in group D.

Similarly, if you just look at C and D, you will find that each of them is made up of 50% group A and 50% group B users.

That way your experiment groups stay effectively identical to each other.
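
Reusing the hypothetical getBucket sketch from earlier, including the experiment name in the hash key gives each experiment its own independent split:

// Each experiment hashes the user ID with its own name, so knowing a user's
// homepage bucket tells you nothing about their cart-icon bucket.
const homepageBucket = getBucket(42, 'homepage-default-category'); // the A/B split
const cartIconBucket = getBucket(42, 'cart-icon-style');           // the C/D split (labelled 'A'/'B' by the helper)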

However, there is one big caveat. Web applications often have multiple screens: for example, a home page where you search for products, then a product details page, then a cart page, and so on.

It may happen that one of your home page experiments performs so much better that more users from one of its buckets end up on the product details page. Then every experiment running on the product details page will have a skewed representation of one of the buckets from the previous step.

Ideally, such skews should be avoided.

Libraries for A/B tests

Collecting data is essential for running A/B tests. That is where analytics tools come into the picture.

There are multiple tools on the market to collect this data.

  • Google Analytics
  • Mixpanel
  • Amplitude
  • Firebase
  • Adobe Target

I also recommend this A/B-Test calculator.

Firebase can also do the user split into buckets for you, and it does so in line with the principles described in the earlier sections.
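
As a rough sketch of how that looks on the client with the Firebase JS SDK's Remote Config (the parameter name add_to_cart_button_color is made up; the experiment and its variants are configured in the Firebase console, and Firebase assigns the bucket for you):

import { initializeApp } from 'firebase/app';
import { getRemoteConfig, fetchAndActivate, getValue } from 'firebase/remote-config';

// The client just reads whichever variant Firebase assigned to this user.
const app = initializeApp({ /* your Firebase config */ });
const remoteConfig = getRemoteConfig(app);
remoteConfig.defaultConfig = { add_to_cart_button_color: 'white' };

await fetchAndActivate(remoteConfig);
const buttonColor = getValue(remoteConfig, 'add_to_cart_button_color').asString();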

Cost of A/B tests

Remember that A/B tests have an associated cost. If your experiment performs very poorly, you are losing revenue and pissing off users while it runs. While this is somewhat inevitable, monitor the test results closely so that you do not keep a bad experiment running for a long time.

Also document all the experiment hypotheses that turned out to be ineffective, because you do not want future engineers to run the same experiments again.

Summary

A/B tests are very important in modern front end engineering, and every front end engineer should know their basic principles. Without getting these principles right, you might introduce serious bugs that make the test data invalid or, worse, misleading.