Six months ago we launched an A/B testing framework for our site, SpanishDict.com, that has allowed us to make ad ops decisions smarter, faster, and with measurable improvements to revenue. Our first test led to an increase of more than 20% on our mobile CPMs. Here’s how we did it.
For the past 10 years, we at SpanishDict have chronically made sub-par decisions about what ads to show on our site. Sure, we “optimized” like everyone else. We tested dozens of networks. We tried different ad placements on our site. We explored creative new ad units. We'd run the experimental setup for a week or two, then analyze the results against the previous week or two, carefully inspecting changes in our CPMs, our revenue per pageview, and our overall revenue.
But something always nagged us about these tests. We could never actually tell whether the results we saw were due to the changes we made or to some exogenous factor. CPMs are constantly in flux--they change daily, sometimes by as much as 30%, even when we do nothing. How were we to know if our tests actually made a difference? In truth, we just made our best guess. We tried to use the information we had to make a decision, but that information was limited. And even the best guess is still a guess.
Because there were so many confounding variables, we were making decisions in the dark. And these decisions were critical for our site. Very few things a publisher does have as much impact on the bottom line as deciding what ads to show on the site.
Late last year, we decided to take a lot of the guesswork out of our ad ops decision making process and turn it into more of a science. We had seen the power of A/B testing in nearly every other aspect of our business: marketing, user engagement, subscriptions. Ad ops was the last major holdout.
Running A/B tests was the answer to the problem of the confounding variables. If we could randomly assign our users to two different ad setups, pit the ad setups against each other head-to-head, and analyze the results, we could know with far more certainty that the change we made in our ad setup was the cause of a change in CPM. No more worries about seasonality, campaigns ending, or traffic fluctuations. All those variables would be the same across the test groups. With a testing framework in place, we could make better decisions with greater confidence.
There is a reason very few publishers run A/B tests on ads: the fragmentation of ad providers makes it very difficult for a publisher to assemble a unified performance report, let alone one segmented by A/B channel. We realized that we would have to do a major overhaul of our ad setup and reporting structure if we were going to be able to run tests.
The principal requirement for our A/B testing framework was the ability to run two independent ad setups simultaneously with full visibility into the performance of each. We wanted to report both the revenue performance and the traffic performance of a test--some ads pay well, but at too great a cost to the user experience, and we wanted to measure that tradeoff.
Our A/B tests started in the user’s browser, where we assigned every user a “test_group” property from 1 to 100, stored in a persistent cookie. We sent this property to DFP like this: googletag.pubads().setTargeting('test_group', '80');. We also sent it to our traffic analytics system. Our structure has an “A channel,” our default setup, which carries 90% of our traffic via test groups 11-100, and a “B channel,” our experimental channel, which carries 10% of our traffic via test groups 1-10. Our A/B testing setup looks something like this:
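In code, the group assignment boils down to something like this. This is a simplified sketch--the cookie store is abstracted as a plain object, and the function names are ours for illustration:

```javascript
// Sketch: assign each user a persistent test group from 1 to 100.
// The cookie store is abstracted as a plain object for clarity.
const TEST_GROUP_COOKIE = 'test_group';

function getOrAssignTestGroup(cookies) {
  // Reuse the stored group so a user stays in the same channel across visits.
  let group = parseInt(cookies[TEST_GROUP_COOKIE], 10);
  if (!(group >= 1 && group <= 100)) {
    group = Math.floor(Math.random() * 100) + 1; // uniform over 1-100
    cookies[TEST_GROUP_COOKIE] = String(group);
  }
  return group;
}

// Groups 1-10 form the experimental B channel; 11-100 the default A channel.
function channelFor(group) {
  return group <= 10 ? 'B' : 'A';
}

// On the page, the group is then passed to DFP and analytics, e.g.:
// googletag.pubads().setTargeting('test_group', String(group));
```

Because the cookie persists, a returning user always lands in the same channel, which keeps the two populations stable for the duration of a test.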
We then developed a line item system in DFP that enabled detailed reporting. We use a standardized line item structure throughout our setup that follows the format CHANNEL_ADUNIT_GEO_ADVERTISER. So the line item A_D160X600SIDERAIL_USA_TRIBAL targets the A channel, the desktop 160x600 siderail ad unit, US traffic, and Tribal Fusion, while B_M300X250CONTENT_INT_RUBICON targets the B channel, the mobile 300x250 content unit, international traffic, and Rubicon Project. This line item setup gave us granular performance data that we could segment by channel, ad unit, geography, and partner. It also allowed us to run tests on particular ad units in particular regions.
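Because the naming convention is strict, reporting code can recover every dimension from the line item name alone. A sketch of what that parsing might look like (the field names here are illustrative, not our exact schema):

```javascript
// Sketch: parse a CHANNEL_ADUNIT_GEO_ADVERTISER line item name so reports
// can be segmented by channel, ad unit, geography, and partner.
function parseLineItem(name) {
  const [channel, adUnit, geo, advertiser] = name.split('_');
  return {
    channel,                                             // 'A' (default) or 'B' (experimental)
    platform: adUnit[0] === 'M' ? 'mobile' : 'desktop',  // M... vs D... ad unit prefix
    adUnit,
    geo,                                                 // e.g. 'USA' or 'INT'
    advertiser,
  };
}
```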
We used targeting presets in DFP to make it faster to set up new line items. When setting up desktop advertisers, we asked our partners for as many as 24 individually trackable creatives (2 channels x 6 ad placements x 2 geos = 24 creatives), which we added to DFP. If we needed to set up passbacks for a partner, we'd set up 24 matching creatives with the passback partner, using the same naming convention. For the Rubicon tag above, if we wanted to pass back to AdSense, we'd set up a creative in AdSense named B_M300X250CONTENT_INT_RUBICON and use that for the passback. This let us track the performance of a top-level advertising partner all the way through the passbacks, giving us a full picture of advertiser performance that accounts for passbacks and impression discrepancies.
At this point, we had line items that enabled fine-grained reporting, but we needed a unified database to get the full picture. In the past, we tried this with Excel, but the volume of data proved too much to handle. So we moved over to Redshift, a fast, SSD-based data warehouse from Amazon that was easy to get started with and scales effortlessly.
We set up one table to handle data coming from our ad servers--DFP on the web, MoPub on apps--with fields for the date, line item, impressions, advertiser, region, ad unit, platform, channel, and ad server. This data was used to understand how many impressions were sent to each advertiser. We created another table for the data from the advertiser reporting systems that included the date, line item, impressions received, impressions filled, clicks, and revenue. We joined these tables using our line item structure described above, which resulted in a complete picture of ad performance.
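The actual join runs as SQL in Redshift, but expressed in JavaScript the logic looks roughly like this (a sketch--the field names are illustrative, not our exact schema):

```javascript
// Sketch: join ad server rows (impressions sent) with advertiser rows
// (impressions received/filled, revenue) on date + line item.
function joinAdData(adServerRows, advertiserRows) {
  const byKey = new Map(
    advertiserRows.map(r => [`${r.date}|${r.line_item}`, r])
  );
  return adServerRows.map(sent => {
    const recv = byKey.get(`${sent.date}|${sent.line_item}`) || {};
    const filled = recv.impressions_filled || 0;
    const revenue = recv.revenue || 0;
    return {
      date: sent.date,
      line_item: sent.line_item,
      impressions_sent: sent.impressions,
      impressions_filled: filled,
      // Discrepancy: impressions we sent that the partner never recorded.
      discrepancy: sent.impressions - (recv.impressions_received || 0),
      revenue,
      // CPM = revenue per thousand filled impressions.
      cpm: filled ? (revenue / filled) * 1000 : 0,
    };
  });
}
```

With the line item name as the join key, one query can roll these rows up by channel, ad unit, geography, or partner.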
Loading the data manually each day would have taken far too long, so we wrote scripts to automate the process. A couple of partners had APIs that we could use. Some sent daily emails that we could parse and load. And for others, we used a tool called CloudScrape to automatically log into the advertiser dashboard, scrape the data, and make it available for us to load into the database. We set up a system of checks to make sure our data had loaded correctly, and we started doing monthly audits to verify that our data matched the data from the ad server and the advertisers' sites.
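One of those load checks, sketched in simplified form: flag any day where a partner's reported impressions diverge too far from what the ad server sent (the 10% threshold here is illustrative):

```javascript
// Sketch: flag rows where the partner's reported impressions diverge from
// the ad server's count by more than a tolerance, or have no match at all.
function findLoadAnomalies(adServerRows, advertiserRows, tolerance = 0.10) {
  const sentByKey = new Map(
    adServerRows.map(r => [`${r.date}|${r.line_item}`, r.impressions])
  );
  return advertiserRows.filter(r => {
    const sent = sentByKey.get(`${r.date}|${r.line_item}`);
    // Missing ad server data, or discrepancy beyond tolerance, needs review.
    return sent === undefined ||
      Math.abs(sent - r.impressions_received) / sent > tolerance;
  });
}
```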
Once all the data was loaded, the fun part began. For reporting, we adopted the business intelligence tool Looker, where we could set up automated dashboards and drill down into the data on the fly. The data junkies on our team were salivating over the reports.
We could clearly compare the performance of ad units, regions, and partners:
And of course, we could analyze the A and B channels, giving us a way to run tests:
For us, this system marked a new era in ad ops. We had finally shone a light on the information we needed to make sound decisions.
Identifying tests that improve performance can be surprisingly difficult. We source ideas for improvements from:
We then keep a log of our test proposals in a spreadsheet, where we are able to track the test description, the revenue at stake, the priority, and timeframe. Each quarter we can then set goals for how many tests we want to run, and prioritize the tests that are most critical. In the same spreadsheet, we also keep a log of all our tests and how they performed. Here's a snapshot of a few of the tests we have run:
We generally think about our ad experiments in terms of the following test categories:
In our planning each quarter, we can think about the tests that we want to run in each category.
We have a checklist of sorts that we go through when analyzing the results of an A/B test to make sure that we remain rigorous in our decision making. We pull data on the impact on CPMs, our primary outcome variable. But we also pull data on user engagement, to see if there’s been a discernible impact on the user experience, and on ad complaints from users, to ensure that ad quality is high.
User engagement comparison from MixPanel, our user analytics system.
Ad complaint analytics from PubNation, our ad quality management platform.
It’s important to analyze the impact of new ad partners on the user experience. Time and again, we’ve seen new networks start with a high CPM and appalling ad quality problems.
We then plug these results into our test log, where the statistical significance is calculated. We aim for tests that show an improvement at a 95% confidence level. The numbers don’t quite make the decision for us, though. We always take some time to step back and consider whether the change has any strategic impact on the site, any unmeasured impact on the user experience, and whether the change will be sustainable over time. We’ve seen ad partners perform well in an initial test but fade shortly thereafter. The results from a test may not last forever, so from time to time we re-test significant changes to make sure the results still hold.
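As a rough illustration of the significance check, here is a simple two-sample z-test on daily CPMs, one sample per channel. This is a sketch, not our exact calculation--it treats each day's CPM as one observation, where a more careful analysis would weight days by impressions:

```javascript
// Sketch: two-sample z-test on daily CPMs for the A and B channels.
function mean(xs) {
  return xs.reduce((a, b) => a + b, 0) / xs.length;
}

function variance(xs) {
  const m = mean(xs);
  return xs.reduce((a, x) => a + (x - m) ** 2, 0) / (xs.length - 1);
}

// z statistic for the difference in mean CPM between the two channels.
function cpmZScore(cpmsA, cpmsB) {
  const se = Math.sqrt(
    variance(cpmsA) / cpmsA.length + variance(cpmsB) / cpmsB.length
  );
  return (mean(cpmsB) - mean(cpmsA)) / se;
}

// At a 95% confidence level (two-sided), |z| > 1.96 counts as significant.
function isSignificant(cpmsA, cpmsB) {
  return Math.abs(cpmZScore(cpmsA, cpmsB)) > 1.96;
}
```

With only a handful of days per channel, a t-test would be the stricter choice; the z-test keeps the sketch short.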
With all the pieces in place, we were ready to run our first test. We decided to test sending more of our mobile traffic from Rubicon to Tribal Fusion in the US. We placed Tribal Fusion tags in the B channel and let it run for five days. The results were amazing. We immediately saw a 20-40% increase in the overall CPM for the B channel. It’s also evident from the chart that had we not been able to A/B test the change, we could have been misled by the natural fluctuations in the CPM.
Header bidding, which is fast becoming the industry standard for managing programmatic inventory, negates the need for certain types of A/B tests. The A/B testing approach outlined above plays out over days and weeks; with header bidding, every ad on every page is a test, and the highest bidder wins. As header bidding expands its reach, targeting tests will become less necessary. No longer will we need to tweak a price floor and see if performance improved. But core questions remain: advertiser tests (should I add this new header bidder?) and placement tests (should I use this new ad unit?) are still high-impact tests that our framework makes easy to conduct. We also anticipate that there will inevitably be a few vertical ad networks for which we will continue to need targeting tests.
We’ve committed to A/B testing all changes to our ad server. After seeing so many tests reveal surprising insights, we can’t imagine going back. We now have a path forward to better decisions, more revenue, and a better user experience. But we are still learning about the world of ad A/B testing. We want to improve our process, and we want to share our insights with others in the industry. If you have any questions or suggestions, please let us know in the comments below. Hopefully this article spurs the discussion onward.