Statistical Validity Is Leading You Astray



Sometimes, the idea of testing in PPC accounts can get a little dicey. What can you test? What makes for a valid test? Is a significant test always valid? What “untestable” factors might influence performance? To get the most return from your tests, it’s important both to plan them carefully and to understand how factors beyond your control interact with your test data.

There are many ways to increase the validity of any test you run in PPC, and making changes based on statistical significance is a primary step in ensuring real-life predictive accuracy. Still, I think anyone who has managed PPC accounts for a while has probably had more than one “WHAT?!” moment, where something worked (or didn’t work) as well as anticipated even though the decision was based on statistically significant data. I have been trying to understand: why does that happen? Can we control it? Maybe, maybe not, but considering the factors outside of an account that influence performance can help us respond more usefully when performance doesn’t follow our plan, and can help prevent those unexpected surprises in the first place. The quest to better control the outcomes of PPC tests has dragged me into the terrifying world where statisticians dwell, so I’m going to do my best to explain what I think I understand and distill the complexity into something useful for the everyday PPC advertiser.

So what is statistical validity? The real answer is more complex than we usually act like it is in PPC, and that gets at some of the reasons why “valid” tests don’t always have the impact we expect. To simplify for PPC purposes, when we talk about something being statistically valid, we’re generally using that data to attempt to increase our statistical conclusion validity. However, as the resources linked at the end of this post explain, there are several threats to this type of validity (factors which decrease its power), and this is why we can’t assume it’s the only thing we need to consider in crafting predictions and making plans based on those predictions.

You can find general suggestions to improve the predictive ability of your tests floating around, like “an ad should have at least 1000 impressions before you make a decision about it”, but to determine whether an observed difference is actually statistically significant, you should use an analysis tool. It can be surprisingly tricky to just look at something and guesstimate whether there is a significant performance difference, especially as tests get more complicated. Luckily the internet is here to help us with that too, and there are a variety of tools at your disposal which can assist in making that determination.
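As a rough illustration of what those tools are doing under the hood, here is a minimal sketch (in Python) of a two-proportion z-test on the click-through rates of two ads. All of the impression and click counts are made up for the example.

from math import sqrt
from statistics import NormalDist

# Hypothetical data: impressions and clicks for two ad variations.
impressions_a, clicks_a = 5200, 156   # Ad A: 3.0% CTR
impressions_b, clicks_b = 5100, 127   # Ad B: ~2.5% CTR

ctr_a = clicks_a / impressions_a
ctr_b = clicks_b / impressions_b

# Pooled CTR under the null hypothesis that both ads perform the same.
pooled = (clicks_a + clicks_b) / (impressions_a + impressions_b)
se = sqrt(pooled * (1 - pooled) * (1 / impressions_a + 1 / impressions_b))

z = (ctr_a - ctr_b) / se
p_value = 2 * (1 - NormalDist().cdf(abs(z)))  # two-tailed

print(f"CTR A = {ctr_a:.2%}, CTR B = {ctr_b:.2%}")
print(f"z = {z:.2f}, p = {p_value:.3f}")
print("significant at 95%" if p_value < 0.05 else "not significant at 95%")

In this made-up case, 3.0% vs. 2.5% looks like a clear winner at a glance, but at roughly 5,000 impressions per ad the difference is not significant at the 95% level, which is exactly why eyeballing is risky.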

For landing page testing, there is of course Google Website Optimizer, which can assist you in performing simple or complicated tests on changes to your landing pages. Chad Summerhill has released a PPC ad text testing validity tool (which can also determine significance when you run two identical ad texts against different landing pages, a different sort of landing page test), and the folks at MarketingExperiments will explain how to determine the statistical validity of your data samples as well. With the introduction of AdWords Campaign Experiments, Google will nicely split test a lot of elements of your PPC account for you and report on the significance of the tests without making you do the complicated math.

There’s a reason we’re not all statisticians: this stuff is complicated. That’s why you hear a lot of generalizations about PPC testing. In any case, using these types of tools to verify validity (or at least significance) rather than relying on assumption can greatly increase the likelihood that your testing will positively influence ROI.

As referenced above, there are some requirements a test must fulfill to reach significance. In PPC, we generally consider these to be adequate traffic numbers and a proper setup for A/B or multivariate testing; with a sufficiently high-volume account, you can reach statistically significant conclusions fairly quickly. But what about validity? This is where it gets more complicated, and I think there’s sometimes a tendency to oversimplify and assume significance is enough.
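To make “adequate traffic numbers” a bit more concrete, here is a back-of-the-envelope sketch using the standard two-proportion sample-size formula; the 2% baseline CTR and the 20% relative lift are hypothetical, and real accounts will vary.

from math import ceil, sqrt
from statistics import NormalDist

def impressions_per_ad(baseline_ctr, relative_lift, alpha=0.05, power=0.80):
    """Approximate impressions needed per ad to detect a relative CTR lift."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # ~1.96 for 95% confidence
    z_beta = NormalDist().inv_cdf(power)           # ~0.84 for 80% power
    p1 = baseline_ctr
    p2 = baseline_ctr * (1 + relative_lift)
    p_bar = (p1 + p2) / 2
    n = ((z_alpha * sqrt(2 * p_bar * (1 - p_bar))
          + z_beta * sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2) / (p1 - p2) ** 2
    return ceil(n)

# Hypothetical: 2% baseline CTR, hoping to detect a 20% relative improvement.
print(impressions_per_ad(0.02, 0.20))  # roughly 21,000 impressions per ad

That is a long way from the “1000 impressions” rule of thumb, and it shows why high-volume accounts can reach conclusions so much faster than low-volume ones.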

For example, say you have an account that can give significant data after one day of testing. Great! You can test ad text messages, bid modifications, whatever you want pretty quickly. You could perform seven ad text tests a week! The problem is, as anyone who has an account that performs differently by day of the week may realize, that the results of Monday’s significant test aren’t necessarily valid for Saturday traffic. If you make decisions based on too short a time range, even with statistically significant data, you’ll still increase your chances for error. Consider another factor beyond your control: say your main competitor runs out of budget and their ads are off for a week. Any conclusions you draw from testing at that time might be affected by the lack of competition, and may not be valid for the same audience when the competition re-enters the scene. The same goes for seasonal trends, and for any other influences on your account that vary over time.
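One simple sanity check for that kind of time-dependence is to segment the test by day of week and see whether the same ad wins in every segment. A minimal sketch, with hypothetical per-day numbers:

# Hypothetical per-day results for two ads: (impressions, clicks).
daily = {
    "Monday":   {"A": (1200, 42), "B": (1150, 31)},
    "Saturday": {"A": (400, 7),   "B": (380, 12)},
}

for day, ads in daily.items():
    ctrs = {ad: clicks / imps for ad, (imps, clicks) in ads.items()}
    winner = max(ctrs, key=ctrs.get)
    rates = ", ".join(f"{ad} {rate:.2%}" for ad, rate in ctrs.items())
    print(f"{day}: {rates} -> winner: {winner}")

# If the winner flips between weekday and weekend segments, a "significant"
# one-day test probably isn't valid for the whole week.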

These aren’t necessarily things that we can prevent, but we can definitely think about factors outside our immediate control and either ensure that our tests span an adequate range, or plan to re-test to verify the result in a different time frame. Some accounts are more prone to variation than others: ever have a keyword that works beautifully one month, and the next month, with the same bid, position, ad texts, and landing page, spends at the same level for a quarter of the leads? There’s going to be variation; the best we can hope for is to understand the level of variation our own accounts experience and to set up and apply the results of tests wisely, accounting for as much of it as is needed to get valid results.
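To get a feel for how much of that variation your own account experiences, it can help to look at the spread of a metric over past periods before deciding how long a test needs to run. A quick sketch with hypothetical weekly cost-per-lead figures for a single keyword:

from statistics import mean, stdev

# Hypothetical weekly cost-per-lead for one keyword over ten weeks, with no
# changes to bid, position, ad text, or landing page in that time.
weekly_cpl = [42, 38, 45, 41, 39, 44, 61, 40, 43, 58]

avg = mean(weekly_cpl)
cv = stdev(weekly_cpl) / avg  # coefficient of variation

print(f"average CPL = {avg:.0f}, week-to-week variation = {cv:.0%}")
# The higher this natural variation, the longer (or bigger) a test needs to be
# before a "significant" result is worth acting on.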

Next time you’re getting ready to, say, apply changes fully in AdWords Campaign Experiments or change your ad messaging based on statistically significant data, think twice and make sure you’ve got enough regularity in the data set to increase your chances of validity as well.

A little more info about validity and testing in marketing, if you’re so inclined:

http://www.socialresearchmethods.net/tutorial/Colosi/lcolosi2.htm

http://www.marketingexperiments.com/blog/practical-application/top-14-free-marketing-tools-and-resources.html

  • http://www.chadsummerhill.com Chad Summerhill

    Great post, Jessica! One thing that you can do is to run redundant ad variations. Have two copies of the control and the experiment.

    Does the performance converge? Or are the copies performing very differently from each other? If they do perform very differently, perhaps there are variables outside our control that are affecting the test.

    Doesn’t necessarily tell you why, but if the two copies of the ad you are about to push are performing identically (or close enough) and they pass the statistical significance test you may feel more confident in your decision.

    I wrote a little about it here: http://blog.performable.com/the-optimizer%e2%80%99s-guide-to-google-adwords-copy-testing/

    Thanks for the mention!

    Chad
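A minimal sketch of the redundant-copy check Chad describes above, with hypothetical impression and click counts; “A1” and “A2” are two identical copies of the control and “B” is the experimental ad:

# Hypothetical results for an A/A/B setup: two identical copies of the
# control ("A1", "A2") plus the experimental ad ("B").
results = {
    "A1": {"impressions": 4800, "clicks": 120},
    "A2": {"impressions": 4750, "clicks": 114},
    "B":  {"impressions": 4900, "clicks": 152},
}

ctr = {ad: d["clicks"] / d["impressions"] for ad, d in results.items()}
for ad, rate in ctr.items():
    print(f"{ad}: {rate:.2%}")

# If the two identical copies diverge by much more than random noise, something
# outside the test is probably interfering; if they converge and B still beats
# both, the result is easier to trust.
spread = abs(ctr["A1"] - ctr["A2"]) / ((ctr["A1"] + ctr["A2"]) / 2)
print(f"relative spread between identical copies: {spread:.0%}")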

  • http://www.ikom-consult.com Tihomir Petrov

    Very good advice. I am using these methods and I can assure the readers that they work very well.

  • http://www.epiphanysolutions.co.uk Steve Baker

    Hi Jessica,

    The problem is that even if ‘Advert A’ is more effective than ‘Advert B’ every day, and irrespective of who else is advertising, if the performance of your adverts varies for any reason other than random variation, it will inflate the standard deviations that you are using in your significance test, making it less reliable.

    One example of this happens if you have to change your bids in the middle of a test. Suddenly your adverts appear in different positions and have different click-through rates. Since the average click-through rate has just moved for both adverts, the variation from these averages will be artificially high, and the significance of the difference between the averages will be reduced.

    Then again, it could be argued that if your test is taking a while to run (more than a week or two, maybe) then waiting for statistical significance is unnecessary. If you can determine that ‘Advert A’ has an 85% likelihood of being better after 4 days, should you wait another 8 days for a 95% significance? You could run three tests with an 85% significance in the time it takes to run one with a 95% level – and the variations described above mean that it may NEVER reach 95%. If ‘Advert A’ is winning after 4 days, even if it’s due to random variation, it’s probably not going to be much worse than ‘Advert B’, and the benefits of running three tests instead of one are likely to outweigh this risk many times over.

    I would suggest that waiting for a statistically significant result may be the right thing to do from a purist’s standpoint, but if you can come up with enough variations to test, then it may be worth pulling the plug early, particularly if the existing version is ahead. Losing days or weeks of testing time to make certain (ish) that the new version isn’t actually very slightly better is not in the best interests of the account…

    Thanks,

    Steve Baker
    Chief Analyst
    Epiphany Solutions
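A minimal sketch of the variance-inflation effect Steve describes, using hypothetical daily click-through rates and a Welch-style t statistic computed on the daily figures; both ads drop in CTR after a mid-test bid change:

from math import sqrt
from statistics import mean, stdev

# Hypothetical daily CTRs (%) for two ads over a 14-day test; bids change
# after day 7, so both ads drop in position and CTR at the same time.
ad_a = [3.1, 3.0, 3.2, 2.9, 3.1, 3.0, 3.2, 2.1, 2.0, 2.2, 1.9, 2.1, 2.0, 2.2]
ad_b = [2.8, 2.7, 2.9, 2.6, 2.8, 2.7, 2.9, 1.8, 1.7, 1.9, 1.6, 1.8, 1.7, 1.9]

def t_stat(x, y):
    # Welch-style t statistic on the daily CTRs.
    return (mean(x) - mean(y)) / sqrt(stdev(x) ** 2 / len(x) + stdev(y) ** 2 / len(y))

print("whole test:  t =", round(t_stat(ad_a, ad_b), 2))
print("first week:  t =", round(t_stat(ad_a[:7], ad_b[:7]), 2))
print("second week: t =", round(t_stat(ad_a[7:], ad_b[7:]), 2))

# Ad A beats Ad B every single day, but the mid-test shift inflates the
# day-to-day standard deviation, so the whole-test t statistic comes out far
# weaker than either week analysed on its own.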