Russian Affiliate Congress and Expo

Mozscape Correlation Analysis of Recent Google Algorithm Changes

Tuesday, 2 October, 2012 - 09:33

At SEOmoz, we compute and track correlations against Google search results with each Mozscape index release. Recently, we've noticed some interesting changes in the page level vs domain level link correlations and decided to investigate. We uncovered some striking differences between the new 7-result SERPs and the standard 10-result SERPs.
Tracking Link Metric Correlations in Mozscape

Before I dive into the data, I want to provide some background information on our data set and methodology. We use correlations against Google search result pages (SERPs) to track algorithm changes and the quality of our Mozscape index. We have published these results many times in the past, including in the Search Engine Ranking Factors post and in the blog post announcing each monthly index update (see the September update). To summarize the process, we first take a set of keywords and run them through Google to collect the top 10 or 50 results. We then pull the link metrics from Mozscape for each URL in the SERPs (Page Authority, number of linking domains to the domain, sub-domain mozTrust, etc.). Then, we compute the Spearman correlation between search position and each metric for each keyword. Finally, the correlations are averaged across all the keywords to produce one number for each metric, the mean Spearman correlation.
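To make the last two steps concrete, here is a minimal Python sketch of a per-keyword Spearman correlation followed by the average across keywords. It is an illustration under stated assumptions, not the actual Mozscape pipeline: the function names and the input layout (a dict from keyword to metric values ordered by search position) are hypothetical, and the sign convention (positive when higher metric values sit nearer the top of the SERP) is assumed rather than taken from the post.

```python
from statistics import mean


def average_ranks(values):
    """Rank values from 1..n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at sorted position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of the 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman correlation, computed as the Pearson correlation of the ranks."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        return None  # undefined when one side is constant
    return cov / (var_x * var_y) ** 0.5


def mean_spearman(serps):
    """serps (hypothetical layout): keyword -> metric values ordered by position."""
    per_keyword = []
    for metric_values in serps.values():
        positions = range(1, len(metric_values) + 1)
        # Assumption: negate positions so "higher metric ranks higher" is positive.
        rho = spearman([-p for p in positions], metric_values)
        if rho is not None:
            per_keyword.append(rho)
    return mean(per_keyword)
```

Running `mean_spearman` once per link metric would give one averaged number per metric, which is the quantity tracked in the charts discussed below.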

Since Mozscape includes some 40+ link metrics for each URL, this process results in 40+ correlations. In practice, many of these correlations are similar, since the link metrics themselves are similar. For example, we'd expect the correlation with the number of links to a page to be similar to the correlation with the number of followed links to the page. A conceptually useful way to summarize the data is to group them into page level and domain/sub-domain level metrics. Page level metrics are associated with the actual page itself: Page Authority, number of links to the page, number of domains linking to the page, and mozRank. Domain and sub-domain level metrics measure the link authority of the entire domain, for example the Domain Authority and number of domains linking to the domain. As a concrete example, imagine an unpopular page buried on Wikipedia without a lot of direct links. It will have low page level metrics, but we might expect it to rank simply because the Wikipedia domain has so many links to other pages.
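One simple way to picture this grouping is to label each metric with its level and average the correlations within each label. The sketch below assumes that a plain mean is used to summarize each group, which the post does not state; the metric keys are hypothetical labels, not Mozscape API field names.

```python
from statistics import mean

# Hypothetical mapping from metric label to level (illustrative names only).
METRIC_LEVEL = {
    "page_authority": "page",
    "links_to_page": "page",
    "linking_domains_to_page": "page",
    "moz_rank": "page",
    "domain_authority": "domain",
    "linking_domains_to_domain": "domain",
}


def summarize_by_level(correlations):
    """Average per-metric correlations within each level (assumed summary)."""
    grouped = {}
    for metric, rho in correlations.items():
        grouped.setdefault(METRIC_LEVEL[metric], []).append(rho)
    return {level: mean(values) for level, values in grouped.items()}
```

Given a dict of per-metric mean Spearman correlations, `summarize_by_level` returns one number for the page level and one for the domain/sub-domain level.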

With these preliminaries out of the way, we can dive into the data.

This chart shows a time series of Mozscape correlations on the page and domain/sub-domain levels for all the index updates since November 2011. Focus first on the solid green (page) and blue (domain/sub-domain) lines. Each index update is marked with an "X." Except for a smaller, lower quality index (36 billion URLs) and the larger, experimental 150+ billion URL indices, the two values have been more or less constant over time. The 10,000+ keyword, top 50 result SERP set was updated every two months or so during this time, so both Google algorithm updates and Mozscape releases are represented.

Now focus your attention on the September Mozscape update at the far right. It includes two sets of correlations: the solid lines and "X" markers represent values from SERPs fetched in late June. The dashed lines with filled circles indicate values from a SERP set fetched in mid-September. Everything else remained constant: the keywords did not change, and both sets of link metrics were pulled from the September Mozscape index. However, the correlations jumped in an interesting way. Every page level correlation increased and every domain/sub-domain correlation decreased. I haven't seen this type of behavior since I started tracking these values a year and a half ago, and it was the motivation for the following analysis.
Enter Mozcast data, stage left

I had a suspicion that this jump in values was due to an algorithm change at Google, and wanted to see if I could tease it out of the data. Dr. Pete was kind enough to provide a data dump of the Mozcast SERP history from July 1 to September 15 to do some more analysis. Even though the Mozcast data only includes the top 10 results for 1000 keywords, it provides a daily time series to pinpoint the change. More data FTW!

This chart shows the time series of page and domain/sub-domain correlations from the Mozcast 1000 keywords. The solid blue line is a smoothed version of the raw data (the noisy light dashed line). There are a few things to note here. First, the magnitudes of the correlations are different from the first plot but the overall trends are the same. The differences are due to the different data sets (1000 vs 10,000+ keywords and top 10 vs top 50). Second, the page level metrics do indeed increase over the time period, with a noticeable increase centered around August 12-14, the days when Google started displaying the 7-result first page (see these two posts for more information about the new 7-result SERP). Finally, the domain/sub-domain metrics decreased during the last two weeks of July, at the same time domain diversity decreased (see the 90 day history of diversity over at Mozcast).
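The post does not say how the smoothed line was produced. As one plausible illustration, a centered moving average over the daily correlations would turn a noisy series into a curve like the one described. The function name and the 7-day default window are assumptions made for this sketch.

```python
def moving_average(series, window=7):
    """Centered moving average; the window shrinks near the ends of the series."""
    half = window // 2
    smoothed = []
    for i in range(len(series)):
        lo = max(0, i - half)
        hi = min(len(series), i + half + 1)
        smoothed.append(sum(series[lo:hi]) / (hi - lo))
    return smoothed
```

A wider window gives a smoother line but blurs the exact date of a step change, which matters when trying to line a jump up with a specific rollout.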
So, how about those 7-result SERPs?

I was intrigued by the idea that the new 7-result SERP might be associated with an algorithm change, so I decided to probe further.

The 7-result SERP was fully rolled out by August 15, leaving a month of data after the change to analyze. This is a histogram of the percent of days from August 15 to September 15 (31 days) that each keyword had 7 results. The important thing to notice here is that it has two spikes at 0% and 100% and not much in between. Put another way, most keywords have either 10 results or 7 results on all days, and only a small portion alternate between the two cases. With this data, I created two cohorts, one with keywords that had 7 results for 30 or 31 of the days and a similar cohort of keywords that had 10 results for 30 or 31 of the days. All told, there are 144 7-result keywords, 808 10-result keywords and 48 flip-floppers.
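The cohort rule described above can be sketched in a few lines of Python. The input layout (a dict from keyword to a list of daily first-page result counts) and the function names are hypothetical; the threshold of 30 days follows the "30 or 31 of the days" rule from the text.

```python
def assign_cohort(result_counts, min_days=30):
    """Classify one keyword from its daily first-page result counts."""
    days_with_7 = sum(1 for n in result_counts if n == 7)
    days_with_10 = sum(1 for n in result_counts if n == 10)
    if days_with_7 >= min_days:
        return "7-result"
    if days_with_10 >= min_days:
        return "10-result"
    return "flip-flopper"


def build_cohorts(daily_counts):
    """daily_counts (hypothetical layout): keyword -> list of daily result counts."""
    cohorts = {"7-result": [], "10-result": [], "flip-flopper": []}
    for keyword, counts in daily_counts.items():
        cohorts[assign_cohort(counts)].append(keyword)
    return cohorts
```

The link metric correlations can then be computed separately over the keywords in each cohort.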

With these groups, it is possible to compute link metric correlations for the 7-result and 10-result keywords separately. The results are striking: the 7-result keywords (red) have near zero domain/sub-domain link correlation, but have a huge page level correlation! On the other hand, the 10-result keywords (green) are much more balanced between page and domain link signals.

Now, we all know that correlation is not causation, and these results are only averaged over a small sample of keywords that may not be representative of the entire universe of keywords. In addition, any individual keyword may exhibit different behavior than the average. That being said, if we indulge ourselves and ignore these caveats for a thought experiment, we can revisit our example of the unpopular Wikipedia page without many direct links. This page has amazing domain/sub-domain link metrics but poor page metrics. If this page is competing for a 7-result keyword, all the Wikipedia link authority wouldn't help it rank. On the other hand, if it is competing for a 10-result keyword the Wikipedia link authority will help it rank.
Got any more data for us?

Just a bit more. We can bring in some Adwords data to see if there are any other systematic differences between the 7-result and 10-result keywords.

Here, I've plotted histograms of the Adwords "Competition", the log of the US monthly search volume and the log of the cost per click. As in the prior chart, red lines represent the 7-result keywords and green the 10-result ones. We can see that there are (statistically significant) differences in the competition and CPC, but they have the same search volume. The new 7-result keywords have lower competition and CPC.
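The post does not name the statistical test behind "statistically significant". As one hedged illustration of how such a comparison could be made, the sketch below runs a two-sided permutation test on the difference in means between two groups, for example log CPC values for the two cohorts. The function name, the permutation count, and the choice of test are all assumptions.

```python
import random


def permutation_test(group_a, group_b, n_permutations=2000, seed=0):
    """Two-sided permutation test on the difference of group means.

    Returns an estimated p-value: the share of random relabelings whose
    mean difference is at least as large as the observed one.
    """
    rng = random.Random(seed)
    n_a = len(group_a)

    def mean_diff(values):
        a, b = values[:n_a], values[n_a:]
        return abs(sum(a) / len(a) - sum(b) / len(b))

    pooled = list(group_a) + list(group_b)
    observed = mean_diff(pooled)
    extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)
        if mean_diff(pooled) >= observed:
            extreme += 1
    # Add one to numerator and denominator so the estimate is never exactly zero.
    return (extreme + 1) / (n_permutations + 1)
```

To mirror the histograms, the inputs would be the log-transformed values (for instance `math.log(cpc)` for each keyword) from the 7-result and 10-result cohorts. A small p-value suggests the two cohorts differ on that measure, while a large one is consistent with no difference.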
OK, so what does all this mean?

I'm not certain, so I'll offer a few ideas. I'd like to hear your interpretations and experiences in the comments below.
It doesn't mean anything. Just like the cat chasing its tail around, you are chasing phantom signals around in a noisy data set. This is possible, but I don't think so.
Those correlations are just so different, Google must be using a different algorithm for these 7-result and 10-result keywords. Ohhh, now that is tantalizing, isn't it? I suppose this too is possible, but not likely either. If they are, then they have been using these two different algorithms since long before they rolled out the 7-result SERP, because the split in the correlations has existed since at least the beginning of July.
These 7-result keywords are systematically different in some way than the 10-result ones, and we are seeing symptoms of that in the correlations and the Adwords data. Imagine the process that takes a search query and returns the SERP. The first step very well might be a new classifier that decides whether to return 7 or 10 results before passing the query on to the rest of the ranking model. This classifier takes some inputs (perhaps some information about the link metrics as well as some additional information) and makes a decision. In the process it preferentially selects queries from a part of the keyword space that includes low domain correlations, high page correlations, and low CPC.
A final shout out to Dr. Pete and Jerry Feng

This post wouldn't be complete without acknowledging Dr. Pete and Jerry Feng. Pete graciously provided the Mozcast data used in this analysis as well as encouragement and insight. He also kept my crazy ideas in check. Jerry is SEOmoz's newest data scientist and helped with the initial analysis. He's currently thinking about how to best improve the Page Authority and Domain Authority models.