Statistical Validity Is Leading You Astray

Sometimes, the idea of testing in PPC accounts can get a little dicey. What can you test? What makes for a valid test? Is a significant test always valid? What “untestable” factors might influence performance? To get the best return from your tests, it’s important both to plan them carefully and to understand how factors beyond your control interact with your test data.

There are many ways to increase the validity of any test you run in PPC, and making changes based on statistical significance is a primary step in ensuring real-life predictive accuracy. Still, I think anyone who has managed PPC accounts for a while has probably had more than one “WHAT?!” moment when something worked, or didn’t work, as well as anticipated, even when the decision was based on statistically significant data. I have been trying to understand: why does that happen? Can we control it? Maybe, maybe not, but considering the factors outside of an account that influence performance can help us respond more usefully when performance doesn’t follow our plan, and can help prevent those surprises in the first place. The quest to better control the outcomes of PPC tests has dragged me into the terrifying world where statisticians dwell, so I’m going to do my best to explain what I think I understand and distill the complexity into something useful for the everyday PPC advertiser.

So what is statistical validity? The real answer is more complex than we usually treat it in PPC, which hints at why “valid” tests don’t always have the impact we expect. To simplify for PPC purposes, when we call a result statistically valid we’re generally talking about statistical conclusion validity: whether the conclusion we draw from the data is a reasonable one. However, as referenced in the article, there are several threats to this type of validity (factors which decrease its power), and this is why we can’t assume it’s the only thing we need to consider in crafting predictions and making plans based on those predictions.

You can find general suggestions to improve the predictive ability of your tests floating around, like “an ad should have at least 1000 impressions before you make a decision about it”, but to determine whether an observed difference is actually statistically significant, you should use an analysis tool. It can be surprisingly tricky to just look at something and guesstimate whether there is a significant performance difference, especially as tests get more complicated. Luckily the internet is here to help us with that too, and there are a variety of tools at your disposal which can assist in making that determination.
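To make the “don’t just eyeball it” point concrete, here is a minimal sketch of one common way to check a click-through-rate difference between two ads: a two-proportion z-test. This is my own illustration, not the method used by any of the tools mentioned here, and the function name and the click and impression counts are made up for the example.

```python
import math


def two_proportion_z_test(clicks_a, impressions_a, clicks_b, impressions_b):
    """Return (z, two-sided p-value) for the CTR difference between two ads.

    Assumes each impression is an independent trial, which real PPC
    traffic only approximates.
    """
    rate_a = clicks_a / impressions_a
    rate_b = clicks_b / impressions_b
    # Pooled CTR under the null hypothesis that both ads perform the same.
    pooled = (clicks_a + clicks_b) / (impressions_a + impressions_b)
    std_err = math.sqrt(
        pooled * (1 - pooled) * (1 / impressions_a + 1 / impressions_b)
    )
    if std_err == 0:
        # No clicks at all (or all clicks): nothing to distinguish.
        return 0.0, 1.0
    z = (rate_a - rate_b) / std_err
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothetical test: both ads have 1000 impressions; B has 50% more clicks.
z_small, p_small = two_proportion_z_test(30, 1000, 45, 1000)
# Same click-through rates, ten times the traffic.
z_large, p_large = two_proportion_z_test(300, 10000, 450, 10000)
print(f"1,000 impressions each:  z={z_small:.2f}, p={p_small:.4f}")
print(f"10,000 impressions each: z={z_large:.2f}, p={p_large:.6f}")
```

With these made-up numbers, the 1,000-impression test falls short of the usual 5% threshold even though one ad has half again as many clicks, while the same rates at ten times the volume clear it easily. That is the sort of thing that is hard to guesstimate by looking at the numbers.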

For landing page testing, there is of course Google Website Optimizer, which can assist you in performing simple or complicated tests on changes to your landing pages. Chad Summerhill has released a PPC ad text testing validity tool (which can also help you determine significance when you run two identical ad texts against different landing pages, a different sort of landing page test), and the folks at MarketingExperiments will explain how to determine statistical validity of your data samples. With the introduction of AdWords Campaign Experiments, Google will also nicely split test a lot of elements of your PPC account for you and report on the significance of the tests without making you do complicated math.

There’s a reason we’re not all statisticians: this stuff is complicated. That’s why you hear a lot of generalizations about PPC testing. In any case, using these types of tools to verify validity (or at least significance) rather than relying on assumption can greatly increase the likelihood that your testing will positively influence ROI.

As referenced above, there are some requirements a test must fulfill to reach significance. In PPC, we generally consider these to be adequate traffic numbers and proper setup for A/B or multivariate testing, and when considering these factors, if you have a sufficiently high-volume account you can reach statistically significant conclusions fairly quickly. But what about validity? This is where it gets more complicated, and I think there’s sometimes a tendency to oversimplify and assume significance is enough.

For example, say you have an account that can give significant data after one day of testing. Great! You can test ad text messages, bid modifications, whatever you want pretty quickly. You could perform seven ad text tests a week! The problem is, as anyone who has an account that performs differently by day of the week may realize, that the results of Monday’s significant test aren’t necessarily valid for Saturday traffic. If you make decisions based on too short a time range, even with statistically significant data, you’ll still increase your chances for error. Consider another factor beyond your control: say your main competitor runs out of budget and their ads are off for a week. Any conclusions you draw from testing at that time might be affected by the lack of competition, and may not be valid for the same audience when the competition re-enters the scene. The same goes for seasonal trends, and for any other influences on your account that vary over time.
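One rough way to spot the Monday-versus-Saturday problem is to break a test’s results out by day and see whether the same ad wins in each segment. The sketch below is only an illustration under my own assumptions: the function names, the data layout, and the numbers are all hypothetical, and a flip in a low-traffic segment can itself be noise, so treat a disagreement as a prompt to keep testing rather than as a conclusion.

```python
def winner_by_segment(results):
    """Map each segment (e.g. a weekday) to the ad with the higher CTR.

    `results` maps a segment name to a tuple of
    (clicks_a, impressions_a, clicks_b, impressions_b).
    """
    winners = {}
    for segment, (clicks_a, imps_a, clicks_b, imps_b) in results.items():
        ctr_a = clicks_a / imps_a
        ctr_b = clicks_b / imps_b
        if ctr_a > ctr_b:
            winners[segment] = "A"
        elif ctr_b > ctr_a:
            winners[segment] = "B"
        else:
            winners[segment] = "tie"
    return winners


def is_consistent(winners):
    """True if no two segments disagree about which ad is ahead."""
    decided = {w for w in winners.values() if w != "tie"}
    return len(decided) <= 1


# Made-up data: ad A looks great on Monday but trails on Saturday.
daily_results = {
    "Mon": (60, 1000, 40, 1000),
    "Sat": (20, 800, 32, 800),
}
winners = winner_by_segment(daily_results)
print(winners)
print("Same winner every day?", is_consistent(winners))
```

If the winner changes from one day to another, a one-day test, however significant, probably isn’t telling you about the whole week.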

These aren’t necessarily things that we can prevent, but we can definitely think about factors outside our immediate control and either ensure that our tests span an adequate range, or plan to re-test to verify the result in a different time frame. Some accounts are more prone to variation than others. Ever have a keyword that works beautifully one month and the next, with the same bid, position, ad texts, and landing page, spends at the same level for ¼ the leads? There’s going to be variation; the best we can hope for is to understand the level of variation our own accounts experience, and to set up tests and apply their results wisely enough to account for as much of it as needed to get valid results.
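If you want a number for “the level of variation our own accounts experience”, one simple option is the coefficient of variation (standard deviation divided by the mean) of a daily metric such as conversion rate. This is just one possible yardstick, chosen by me for illustration; the figures below are invented, and what counts as “too much” variation is a judgment call for each account.

```python
import statistics


def coefficient_of_variation(values):
    """Sample standard deviation divided by the mean (unitless).

    Higher values mean the metric swings more relative to its average.
    """
    mean = statistics.mean(values)
    if mean == 0:
        raise ValueError("mean is zero; coefficient of variation is undefined")
    return statistics.stdev(values) / mean


# Hypothetical daily conversion rates for two accounts.
steady_account = [0.050, 0.052, 0.048, 0.051, 0.049]
jumpy_account = [0.050, 0.020, 0.080, 0.010, 0.090]

print(f"steady: {coefficient_of_variation(steady_account):.2f}")
print(f"jumpy:  {coefficient_of_variation(jumpy_account):.2f}")
```

An account that looks like the second list probably needs longer tests, or a re-test in a different time frame, before you trust a result.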

Next time you’re getting ready to, say, apply changes fully in AdWords Campaign Experiments or change your ad messaging based on statistically significant data, think twice and make sure you’ve got enough regularity in the data set to increase your chances of validity as well.

A little more info about validity and testing in marketing, if you’re so inclined: