Feb 24, 2010

# Detecting Significant Changes In Your Data

For statisticians, significance is an essential but often routine concept. For those who don’t remember the details of college statistics courses, significance is a nebulous concept that lends magical credence to whatever data it describes. Sometimes you make a change in your paid search program, watch the data come in, and want to claim that numbers are improving because of your initiative.

How can you support this claim?  Can you discredit the possibility that the apparent improvement is just noise? How can you apply that authoritative label of “significant”?

Here I’d like to walk you through a basic test of significance that you can use to de-mystify changes in your paid search data.

1. First, you need to know what value your metric is potentially improving from. (As a running example, suppose you’ve enabled Google Site Links and want to know whether brand CTR has improved.) Let’s call this value mu (pronounced “myoo”), and you can choose it in a variety of ways: the average or median CTR over the past month, the average or median CTR from this time of year last year, etc. It should be whatever value you believe CTR to truly center around.
2. Next, you need data points. That is, you need several days of CTR data since the Site Links have been running. How many days is up to you; generally, more is better, but I’ll touch on that later. The number of days you have is n. Take the average of the CTRs from those days; this is called xbar. Lastly, take the standard deviation (Excel’s STDEV function) of these CTRs and call it s.
3. Now we can compute a t-score, and with it, the probability that the change in CTR you’re seeing is or isn’t attributable to chance. Set t = |xbar − mu| / (s/sqrt(n)). Then use Excel’s TDIST function, plugging in t, n−1 (the degrees of freedom), and 1 (for a one-tailed test). The number this function returns is the probability that the change in CTR is simply due to chance, aka noise. If this probability is very small, then we say CTR has changed significantly.
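For readers who prefer code to spreadsheets, the three steps above can be sketched in Python (my addition here, not part of the spreadsheet). The one-tailed p-value is obtained by numerically integrating the t-distribution’s density, which plays the role of Excel’s TDIST(t, n−1, 1):

```python
import math

def t_pdf(x, df):
    """Density of Student's t-distribution with df degrees of freedom."""
    c = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return c * (1 + x * x / df) ** (-(df + 1) / 2)

def one_tailed_p(t, df, upper=1000.0, steps=200_000):
    """P(T > t), i.e. Excel's TDIST(t, df, 1), by trapezoidal integration
    of the t density from t out to a large cutoff."""
    h = (upper - t) / steps
    area = 0.5 * (t_pdf(t, df) + t_pdf(upper, df))
    for i in range(1, steps):
        area += t_pdf(t + i * h, df)
    return area * h

def significance_test(data, mu):
    """Steps 1-3 from the post: is the mean of `data` significantly
    different from the baseline value mu?"""
    n = len(data)
    xbar = sum(data) / n                                          # step 2: average
    s = math.sqrt(sum((x - xbar) ** 2 for x in data) / (n - 1))   # Excel's STDEV
    t = abs(xbar - mu) / (s / math.sqrt(n))                       # step 3: t-score
    return t, one_tailed_p(t, n - 1)
```

Calling `significance_test([4.3, 5.2, 5.0], 4.4)` reproduces the worked example later in the post: t ≈ 1.59 and p ≈ 12.66%.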

## Enough Math! Is The Change In My Data Significant?

I’ve prepared an Excel spreadsheet that handles the arithmetic. In this model, change the gray shaded cells to reflect your data. Enter the data that you think has fundamentally changed in column C, including only data points since the change began. Then, in cell G2, enter the value from which you believe the data to have changed; that is, the average value of the data before the change.

The value p, produced in cell G7, is the probability that the change you’re seeing is only due to chance, and thus meaningless. Typically, a p-value must be below 5% to be considered significant. (If you want to be extra sure, you can use 1% or 0.1% instead.) In other words, if your p-value is 5% or less, you can say with reasonable confidence that the change in your data is real and due to something other than statistical noise. It’s then a good bet that whatever initiative you took – whether it was switching landing pages, altering ad copy, or refining your bidding – was the catalyst for the improvement.

Allow me to fill in the spreadsheet with an example. For an imaginary online retailer, brand CTR hovers around 4.4%, so I fill in cell G2 with the value 4.4. The retailer enables Google Site Links, and CTRs for the 3 days afterward are 4.3, 5.2, and 5. So I enter those three data points into column C. And voila… the p-level comes back as 12.66%. This says that there is a 12.66% chance that the rise in CTR was due only to noise.
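The spreadsheet’s arithmetic can be checked by hand. With n = 3 there are n − 1 = 2 degrees of freedom, and for 2 degrees of freedom the t-distribution’s one-tailed probability has a simple closed form, 0.5·(1 − t/√(2 + t²)). A quick Python sketch of the example (my notation, standing in for the spreadsheet):

```python
import math

data = [4.3, 5.2, 5.0]   # daily brand CTRs (%) after enabling Site Links
mu = 4.4                 # baseline CTR (%), entered in cell G2

n = len(data)
xbar = sum(data) / n                                          # ~4.833
s = math.sqrt(sum((x - xbar) ** 2 for x in data) / (n - 1))   # ~0.473
t = abs(xbar - mu) / (s / math.sqrt(n))                       # ~1.588

# One-tailed p for 2 degrees of freedom (closed form; TDIST(t, 2, 1)):
p = 0.5 * (1 - t / math.sqrt(2 + t * t))
print(round(100 * p, 2))  # -> 12.66
```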

Not significant. Sorry: click-through rates haven’t really increased, or at least, we can’t be very confident that the observed change is anything more than random noise.

But… three days is not much data. As smart analysts, we are cautious when examining trends over only a few days, and this significance test incorporates such wisdom. As the number of data points (n) you use increases, p-levels fall. For example, if all the numbers in the above example were the same except that you used 7 days instead of 3 (so n=7), the corresponding probability drops to 2.6%. In this instance, it’s very unlikely (2.6% unlikely) that the increase in CTR was due to noise, so here you can rather confidently say, “Yes, CTR has increased, and it wasn’t due to chance. It was probably due to the site links.”
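To see the effect of sample size concretely, here is a sketch of that hypothetical: the same sample mean and standard deviation as before, but treated as if they came from seven days of data, so the degrees of freedom rise to 6. The one-tailed probability is computed by numerically integrating the t density, standing in for TDIST(t, 6, 1):

```python
import math

xbar, s, mu = 4.8333, 0.4726, 4.4   # same sample mean and std dev as before...
n = 7                                # ...but pretend they came from seven days
t = abs(xbar - mu) / (s / math.sqrt(n))   # ~2.43

# One-tailed p = P(T > t) for df = n - 1, by trapezoidal integration
# of the t density (equivalent to TDIST(t, n - 1, 1)):
df = n - 1
c = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
pdf = lambda x: c * (1 + x * x / df) ** (-(df + 1) / 2)
steps, upper = 100_000, 100.0
h = (upper - t) / steps
p = h * (0.5 * (pdf(t) + pdf(upper)) + sum(pdf(t + i * h) for i in range(1, steps)))
print(round(100 * p, 1))  # -> 2.6
```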

Stephen says:
I think a lot of people make assumptions or form opinions without considering the stats behind them. By doing so, they come to conclusions that may be nothing more than normal variations that have no significance. I'd like to see more publishing of the raw data, so that statistically insignificant conclusions can be called out by the community.
Ankur Mody says:
Brilliant. Advanced PPC tactics at its best. I recently incorporated Sitelinks in 2 sites so this is actually very relevant and I am sure I will use your spreadsheet for my analysis.
Stephen, I was on a panel not long ago when one of the presenters did exactly that. Claimed that moving from last touch attribution to first touch sometimes moved results 300%, but his slide showed the raw data: moving from 1 order to 3 on 300 clicks or so for a particular term. I thought about calling him out on the fact that it's random noise, and that going from 1 order to 3 is actually a 200% increase, not 300%...but I didn't. Enough people in the industry are mad at me as it is :-)
Brian Senf says:
Also keep in mind that the T-test (the formula in the excel) can get a little 'iffy' if the sample sizes are small, like, under a hundred or so. Everybody's got a different opinion as to what's a small sample size, but you shouldn't get into too much trouble if you're looking at 100 or bigger.
Brian - Are you sure you aren't thinking of the Z-distribution instead? The t-distribution is like the Z-distribution but meant for a) small sample sizes and b) cases where the population standard deviation is estimated (rather than already known). The t-distribution is suited to smaller sample sizes because sample size itself is a parameter of the distribution (via the degrees of freedom)... that is, the smaller the sample size, the more spread out the distribution is, and the higher your t-statistic must be to get a statistically significant result.
Ken Truman says:
Typically when testing for changes in CTR (or CVR for that matter), I consider the sample size to be the number of impressions during a given time frame. This is independent of the number of days, and I model CTR with a Bernoulli distribution to ascertain statistical significance. Do you guys have any thoughts on this? It seems to me that using the number of days as a sample size is rather immaterial. Why not hours? Or weeks? It's an arbitrary selection.
Ken Truman says:
Worth noting that if I were to have been more careful in my previous post - I would have said that in the case of CVR the number of clicks is my sample size.
Ken - I don't immediately see any problems with your method, and I think it sounds like a great way to attack the problem. In that case, you'd use p for the proportion you'd expect to see, phat for the observed proportion, and the test statistic z = (phat - p)/sqrt(p*(1-p)/n). I might just use your method next time!
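A quick Python sketch of that one-proportion z-test, with made-up traffic numbers (530 clicks on 10,000 impressions, against an expected CTR of 5%); the normal tail probability comes from the complementary error function:

```python
import math

def prop_z_test(successes, n, p0):
    """One-sample z-test for a proportion: is the observed rate phat
    significantly different from the expected rate p0?"""
    phat = successes / n
    z = (phat - p0) / math.sqrt(p0 * (1 - p0) / n)
    p_one_tailed = 0.5 * math.erfc(abs(z) / math.sqrt(2))  # P(Z > |z|)
    return z, p_one_tailed

# Hypothetical numbers for illustration: 530 clicks on 10,000 impressions
# (phat = 5.3%) against an expected CTR of p0 = 5%:
z, p = prop_z_test(530, 10_000, 0.05)
```

Note that here n is the number of impressions, exactly as Ken suggests, rather than the number of days.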
awhinston says:
Thanks for the post! We've been running analysis to measure (at various confidence intervals) HOW MUCH improvement can be attributed to campaign changes versus metric noise. To do this, we calculate standard deviation of mu, in addition to mean. We then look at CTR following the test, knowing we can attribute (with 68% confidence) any delta that falls outside the mean + or - one standard deviation. For example, say pre-test CTR was 5% and standard deviation over that period was 0.5%. If, after a few days of testing, xbar settles around 5.7% we would say (with 68% confidence) that .2% of the increase was due to the change. Any feedback on our methodology? I'd also love to combine these calculations with those in your post. As more days go by, we'll be more confident in our calculations as (n) increases. Is there a way to do this using a formula? Thanks.
Shay OReilly says:
Thanks for a fantastic post, Jen. An articulate refresher on basic stats that I for one was a bit fuzzy on from my classroom days. Also really like Ken's point. Do you think the same is true for day of week analysis? Looking at individual days leaves you with few data points, but that seems to underestimate the significance, since there are a large number of data points behind each day's average.
Ken, I like your approach, too. Running true A/B tests on web pages and looking for conversion rate differentials can be a very long and frustrating process. The number of conversions needed to detect small differences in conversion rates can make evaluating the results akin to the Bataan Death March. Sometimes, you can get a pretty valid read with much less data by simply saying: starting today, if version A beats version B five days in a row, the odds of that happening by random chance are 1/32, or ~3%. Sometimes that helps you identify a winner much sooner. The problem of course is that 1) you have to be disciplined: you can't ask "over the last 20 days of the test, were there any five-day runs by one or the other?" That won't produce the right answer; and 2) if version A is only 2% better than version B, this 5-day test will usually fail to show that, too.
says:
Shay - I definitely think the same logic applies to day of week analysis.

George - That's an extremely interesting way of approaching the problem. Have you done any sort of simulations to get a better read on what sort of confidence levels you're dealing with when taking that approach? What I mean is (oversimplifying here), assume CVR for page A is lower than CVR for page B. What percent of times can page A outperform page B 5 days in a row? I realize this is a computational nightmare, but a Monte Carlo simulation could shed some light on the issue. I really, really wish I had the time to explore that question. :-) I bring it up because I'm hesitant to take an approach where it's not clear to me what my true confidence level is. This is perhaps not the best solution, but when dealing with CVR, I typically use a higher alpha than I would otherwise. This is a subjective judgement, but I think the potential dangers associated with making a type I error are low - after all, it's only advertising!
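A rough version of that Monte Carlo simulation might look like this sketch, with made-up traffic numbers (100 clicks per day per page) and a seeded random generator:

```python
import random

def run_table_prob(cvr_a, cvr_b, clicks_per_day=100, days=5, trials=5_000, seed=42):
    """Estimate the probability that page A's daily conversion count strictly
    beats page B's on every one of `days` consecutive days."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(trials):
        # Simulate each day's conversions as clicks_per_day Bernoulli draws.
        if all(
            sum(rng.random() < cvr_a for _ in range(clicks_per_day))
            > sum(rng.random() < cvr_b for _ in range(clicks_per_day))
            for _ in range(days)
        ):
            wins += 1
    return wins / trials
```

With identical true conversion rates (say 5% vs 5%), A "runs the table" only a couple percent of the time; give A a genuinely higher rate and the probability climbs, which is exactly the trade-off George describes.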
Ken, You're absolutely right if the CR difference between A and B is small (2 or 3%) the odds of A running the table aren't much worse than the odds of B running the table, hence you may end up picking the wrong winner. The stats tricks for dealing with sparse data are tricks, and as such the results are generally less certain than some folks would have us believe. The assumptions under the hood of MVT analysis, for example, are large and often mean that confidence levels are overstated. There's no substitute for rich data.