Hi, this is Jing, a data scientist with a great passion for data science and big data technology. In this blog post I want to share how I usually analyze the results of a web experiment, a.k.a. A/B test, while working in a product team. I will cover which metrics to measure for a test, what to pay attention to, and how to conclude which variant is the winner.
Metrics
For each A/B test, we need a primary metric that acts as the decision indicator for the test winner. However, there is a good chance that we will not see a significant difference on the primary metric we chose, so we also need supportive metrics to guide us on which variant to choose.
The Primary Metric
The selected primary metric should be aligned with the product team’s objective, so that the potential change supports the business goal. Ideally there should be only one, and you need to decide before the test what you are actually trying to optimize with it. For example, for most tests run within teams on an e-commerce platform, the primary metric is the online conversion rate: how many customers purchased after they landed on the website.
What we need to remember is that not everyone who visits the website is ready to pay right away, especially for something expensive. What we are trying to optimize through testing is that the whole conversion (shopping) experience is smooth and seamless enough for visitors to convert easily once they are finally ready. So we can use the conversion rate to the step right before payment, for example the order-review step, to estimate the potential impact on the primary metric. As a result, there will be a set of metrics under the primary metric that indicate whether the change brings us closer to the business goal. In short, the primary metric can have sub-metrics.
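As a concrete illustration, here is a minimal sketch of how the primary metric and one such sub-metric could be computed per variant from event-level data; the column names (variant, converted, reached_review) and the numbers are hypothetical, not from a real test.

```python
import pandas as pd

# Minimal sketch: primary metric (conversion rate) and a sub-metric
# (reached the order-review step) per variant. All data is made up.
events = pd.DataFrame({
    "variant":        ["A", "A", "A", "B", "B", "B"],
    "converted":      [0, 1, 0, 1, 1, 0],
    "reached_review": [1, 1, 0, 1, 1, 1],
})

summary = events.groupby("variant").agg(
    visitors=("converted", "size"),
    conversion_rate=("converted", "mean"),
    review_rate=("reached_review", "mean"),
)
print(summary)
```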
Supportive Metrics
If you are working on the upper funnel or customer acquisition, it is normal to see no statistically significant difference in any of the sub-metrics of the primary one. That is why we also need supportive metrics. Usually, these are the engagement metrics you are interested in.
For example, say you added a new module, or changed an existing one, on your website with the intention of onboarding visitors better. It is extremely hard to move the primary metric if you are selling a high-unit-price product such as a car or a luxury trip. In this case, you need to focus more on engagement metrics as supportive metrics to decide whether the change is good for the business and the customers. Common engagement metrics are (a sketch for computing them follows the list):
- Bounce and Exit rate on the tested page
- CTR to the following steps
- Interaction Rate with the module
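Here is a minimal sketch of how these engagement metrics could be computed per variant from session-level data; all column names and values are hypothetical.

```python
import pandas as pd

# Minimal sketch: engagement metrics per variant. All data is made up.
sessions = pd.DataFrame({
    "variant":                ["A", "A", "B", "B", "B"],
    "bounced":                [1, 0, 0, 0, 1],
    "clicked_next_step":      [0, 1, 1, 1, 0],
    "interacted_with_module": [0, 1, 1, 0, 0],
})

engagement = sessions.groupby("variant").agg(
    bounce_rate=("bounced", "mean"),
    ctr_next_step=("clicked_next_step", "mean"),
    interaction_rate=("interacted_with_module", "mean"),
)
print(engagement)
```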
Segmentation
The analysis of a web experiment should not only look at changes in certain metrics to decide whether we have a winner or not; it should also capture learnings about user behavior patterns. Even if we don’t have a winner for a test, we can still gain insights, and the insights gathered through previous tests will eventually lead us to a test with a significant winner. To gain more insights, segmenting the tested users within the analysis is highly recommended and needed for developing better hypotheses in later test iterations. From my previous experience of running tests on an e-commerce platform, always segment the metrics by (if applicable, and as sketched after the list):
- Device level
- Market level (if you have more than one market)
- Users who engaged with tested module vs. not engaged
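A minimal sketch of such a segmented breakdown, assuming a session-level table with hypothetical columns for device, market, and module engagement:

```python
import pandas as pd

# Minimal sketch: segment the primary metric by device, market, and
# whether the user engaged with the tested module. All data is made up.
sessions = pd.DataFrame({
    "variant":   ["A", "A", "B", "B", "A", "B"],
    "device":    ["mobile", "desktop", "mobile", "desktop", "mobile", "mobile"],
    "market":    ["SE", "SE", "DE", "DE", "DE", "SE"],
    "engaged":   [1, 0, 1, 1, 0, 0],
    "converted": [0, 1, 1, 1, 0, 0],
})

for segment in ["device", "market", "engaged"]:
    print(
        sessions.groupby([segment, "variant"])["converted"]
        .agg(visitors="size", conversion_rate="mean")
    )
```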
Decision
Once we have tracked the primary metric and the supportive metrics, I usually follow a few rules to decide whether we have a winner and which variant it is.
If the primary metric is:
- positive for one variant with statistical significance (declare it as the winner)
- indecisive among variants (look into your supportive metrics to determine a winner)
- negative for test variants with statistical significance (do not implement, adjust design and test again if you still believe in the hypothesis)
If both the primary metric and the supportive metrics are inconclusive, use other data sources to evaluate whether the hypothesis can be negated, e.g. qualitative data (user research) or best practices. If nothing is found, implement the new version, since you don’t want to waste the effort spent on running the test and, most importantly, the test variant is usually designed as an improved version.
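To make the rules above concrete, here is a minimal sketch of that decision logic; the inputs, the 0.10 threshold (90% confidence), and the example values are illustrative assumptions, not a standard API.

```python
# Minimal sketch of the decision rules above; all inputs are illustrative.
# p_value and uplift refer to the primary metric of the test variant vs. control;
# supportive_winner is the variant (if any) favoured by the supportive metrics.
def decide(p_value, uplift, supportive_winner=None, alpha=0.10):
    if p_value < alpha and uplift > 0:
        return "declare the test variant the winner"
    if p_value < alpha and uplift < 0:
        return "do not implement; adjust the design and retest the hypothesis"
    if supportive_winner is not None:
        return f"primary metric indecisive; follow the supportive metrics ({supportive_winner})"
    return "inconclusive; check qualitative data and best practices, otherwise implement"

print(decide(p_value=0.03, uplift=0.14))  # -> declare the test variant the winner
```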
Statistical Models
I talked about statistical significance above, which means I am analyzing the metrics with a frequentist model. Good practice with this model is to set a minimum confidence level of 90%. The other popular model is the Bayesian statistical model, which is used by Google Optimise, VWO, Launch Darkly Experiment and most other A/B-testing tools. For more on the differences between these two models, see Pre-test Analysis before turning on the A/B Test, since which model you use should be decided before the test starts.
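For intuition on the Bayesian side, here is a minimal sketch of a Beta-Binomial comparison that estimates the probability that variant B beats variant A; the conversion counts and the flat Beta(1, 1) prior are made-up assumptions, not output from any of the tools above.

```python
import numpy as np

# Minimal Bayesian sketch: Beta-Binomial posteriors for two variants.
# Conversion counts are made up; Beta(1, 1) is a flat prior.
rng = np.random.default_rng(42)
post_a = rng.beta(1 + 320, 1 + 10_000 - 320, size=100_000)  # variant A: 320 / 10,000
post_b = rng.beta(1 + 370, 1 + 10_000 - 370, size=100_000)  # variant B: 370 / 10,000
print("P(B beats A) ≈", (post_b > post_a).mean())
```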
There are also several online frequentist A/B-test calculators that I find pretty useful.
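If you would rather compute the test yourself than use an online calculator, here is a minimal frequentist sketch using statsmodels’ two-proportion z-test; the counts are made-up examples.

```python
from statsmodels.stats.proportion import proportions_ztest

# Minimal frequentist sketch: two-proportion z-test on conversion rate.
# Conversion counts and visitor numbers are made-up examples.
conversions = [320, 370]        # variant A, variant B
visitors = [10_000, 10_000]

stat, p_value = proportions_ztest(count=conversions, nobs=visitors)
print(f"z = {stat:.2f}, p-value = {p_value:.3f}")
# At a 90% confidence level, the difference is significant when p-value < 0.10.
```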
Result Sharing Tips
When the analysis is done, you have the numbers for your metrics and have decided which variant is the winner. It is time to share the results! Here are some tips for sharing them with a large audience:
- Do not share results in numbers that did not reach statistical significance.
- Always include confidence level (90%, 95% or 99%) together with uplift (e.g. +14%).
- Always include screenshots of A and B for an easy overview.
- To ensure ease of interpretation and sharing within the organization, use a consistent template to communicate results.
End
I hope you found this post useful. I am currently writing more on this topic, A/B testing within product teams, and trying to keep each post short. Thanks for reading. I am Jing, a data scientist aiming to get better and better.