Getting Your Head Around Google's RankBrain

Recently, the Google RankBrain system has started to garner quite the buzz within the SEO community, but to a large degree it isn’t entirely understood. We’ve even had some somewhat link-baity post titles that didn’t help things either. And of course, Google didn’t do itself any favors by including the word “rank” in the name.

So, let’s start with a statement by Google’s Gary Illyes:

“Lemme try one last time: Rankbrain lets us understand queries better. No affect on crawling nor indexing or replace anything in ranking” – via Twitter

The core, from what we understand, is more about better assessment of queries and the classifications therein. Add to this the potential use of similar concepts to better understand the words, phrases and concepts on a given page, and the outcome should be better results. But for now, let’s take a step back and look at another Google project: Word2Vec.

Google’s Word2Vec Project

Essentially, without getting too crazy with it, Word2Vec is not a single algorithm but is based on two models, skip-gram and CBOW (continuous bag of words), which are considered “shallow neural models.” Skip-gram predicts the neighbouring words of a given word, while CBOW predicts the current word from its neighbouring words. They work in opposite directions, but they have the potential to complement each other.
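To make the difference a bit more concrete, here is a rough sketch of the kind of training examples each model would be fed. The function names and the window size are my own assumptions for illustration; this is not the actual Word2Vec code, and it leaves out the neural network entirely.

```python
def skipgram_pairs(tokens, window=2):
    """Skip-gram view: each center word is paired with every neighbour it should predict."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs


def cbow_examples(tokens, window=2):
    """CBOW view: the neighbouring words together are used to predict the center word."""
    examples = []
    for i, target in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        context = [tokens[j] for j in range(lo, hi) if j != i]
        examples.append((context, target))
    return examples


tokens = "what is low rank in the army".split()
sg = skipgram_pairs(tokens, window=1)
cb = cbow_examples(tokens, window=1)
print(sg[:3])  # (center, neighbour) pairs
print(cb[1])   # (neighbours, center) example
```

Same sentence, same window, but the prediction runs in opposite directions.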

In fact, much of the conceptual approach was to trade complexity (in the computing models) for efficiency. But don’t be fooled by that simplistic explanation. There’s nothing simple in the various neural network research I’ve read on this. I just wanted to get the basics out of the way before we move on to the meat of the moment.

Another related concept I came across a lot was distributional semantics, which essentially boils down to this: words that regularly occur in the same contexts tend to share aspects of semantic meaning. There’s also a patent, awarded last year, that is seemingly for W2V, since the authors are mostly the same people who worked on the original project.
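If you want to see the distributional idea in action, here is a toy sketch. It simply counts which words share a sentence and compares the resulting count vectors with cosine similarity. The tiny corpus and the helper names are made up for illustration, and Word2Vec itself learns dense vectors rather than counting like this.

```python
from collections import Counter, defaultdict
from math import sqrt


def cooccurrence_vectors(sentences):
    """Map each word to a Counter of the words it shares a sentence with."""
    vectors = defaultdict(Counter)
    for sentence in sentences:
        words = sentence.split()
        for i, word in enumerate(words):
            for j, other in enumerate(words):
                if i != j:
                    vectors[word][other] += 1
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Made-up corpus: "cat" and "dog" appear in similar contexts, "car" does not.
corpus = [
    "the cat drinks milk",
    "the dog drinks water",
    "the cat chases mice",
    "the dog chases cats",
    "the car needs fuel",
]
vecs = cooccurrence_vectors(corpus)
cat_dog = cosine(vecs["cat"], vecs["dog"])
cat_car = cosine(vecs["cat"], vecs["car"])
print(cat_dog, cat_car)
```

In this toy corpus, “cat” ends up closer to “dog” than to “car” purely because of the company the words keep.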

Hopefully that helps you better understand where we’re headed with this.

Probable patent: Computing numeric representations of words in a high-dimensional space

Is Word2Vec really RankBrain?

As with all things Google, it’s hard to say. But again, many of the folks working on RankBrain also worked on the Word2Vec project, and even some of the descriptions of RankBrain are nearly identical to those for the W2V project such as:

The word2vec tool takes a text corpus as input and produces the word vectors as output. It first constructs a vocabulary from the training text data and then learns vector representation of words. The resulting word vector file can be used as features in many natural language processing and machine learning applications.

We also know that RankBrain is “one of the ‘hundreds’ of signals that go into an algorithm that determines what results appear on a Google search page, and where they are ranked,” as stated by Google senior research scientist Greg Corrado in a statement to Bloomberg. The term “signal” is telling to me — it implies it is entirely possible that RankBrain does no direct scoring whatsoever.

Furthermore, most of this machine learning and artificial intelligence (AI) work seems to be geared towards understanding queries, not really web pages per se. As mentioned in a Daily Mail article:

For example, a user searching for ‘What is low in the army’ is a hard query for the AI to resolve, but RankBrain would interpret it to mean ‘What is low rank in the army’.
Another example shows the search engine may have previous struggles with ‘Why are PDFs so weak?’, taking the search literally and returning results of PDFs containing the word ‘weak’.
Whereas now, RankBrain might interpret it as a question about the security of PDF files, and so returns a better first result.

And so I do have to believe that RankBrain and Word2Vec are related to some degree. For those of us who have studied patents, papers and research projects from Google over the years, the theme is familiar: both RankBrain and Word2Vec deal with the never-ending struggle of semantic analysis, and both seek a better understanding of words, phrases and how they relate to each other.

Word Vector Concepts

In the past, semantic analysis approaches would use a simpler word relationship model based on singular relations. The belief with the vector approach is that when trained on larger data sets with larger dimensionality, it should be more effective. Utilizing 10 examples to form a relationship vector, instead of one, increases semantic accuracy.

Again, in the past, semantic analysis was somewhat more simplistic. The belief now is that numerical representations of words make it possible to better predict the relationships between words.

In another example, consider that the phrase “Company executive” isn’t necessarily intuitive, but the word vector model would find closest vector relations for:

Steve Ballmer – Microsoft
Samuel J. Palmisano – IBM
Larry Page – Google
Werner Vogels – Amazon
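Here is a small sketch of how a relationship vector built from several examples might be applied, using the company and executive pairs above. The four-dimensional vectors are numbers I invented so the arithmetic is easy to follow. Real word vectors are learned from text and have hundreds of dimensions, so treat this as an illustration of the idea only.

```python
from math import sqrt

# Hand-made toy vectors (assumption: NOT learned from any real data).
vectors = {
    "Microsoft": [1.0, 0.0, 0.0, 0.0],
    "Steve Ballmer": [1.0, 0.1, 0.0, 1.0],
    "IBM": [0.0, 1.0, 0.0, 0.0],
    "Samuel J. Palmisano": [0.0, 1.0, 0.1, 0.9],
    "Google": [0.0, 0.0, 1.0, 0.0],
    "Larry Page": [0.1, 0.0, 1.0, 1.1],
    "Amazon": [1.0, 1.0, 0.0, 0.0],
    "Werner Vogels": [1.0, 0.9, 0.0, 1.0],
}


def subtract(a, b):
    return [x - y for x, y in zip(a, b)]


def add(a, b):
    return [x + y for x, y in zip(a, b)]


def mean(vs):
    return [sum(col) / len(vs) for col in zip(*vs)]


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b)))


# Build one "company -> executive" relationship vector from several known pairs.
known_pairs = [
    ("Microsoft", "Steve Ballmer"),
    ("IBM", "Samuel J. Palmisano"),
    ("Google", "Larry Page"),
]
relation = mean([subtract(vectors[e], vectors[c]) for c, e in known_pairs])


def predict_executive(company):
    """Add the averaged relation to the company and return the nearest other entry."""
    target = add(vectors[company], relation)
    candidates = [w for w in vectors if w != company]
    return max(candidates, key=lambda w: cosine(vectors[w], target))


predicted = predict_executive("Amazon")
print(predicted)
```

Averaging the offset over three pairs smooths out the quirks of any single example, which is the intuition behind using several examples rather than one.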

The main thing to consider is that RankBrain probably isn’t directly ranking anything. It’s trying to better understand concepts related to the words in a query (and presumably on pages it might return). It is also about improving the efficiency of the algorithms therein.

For more, please see my coverage of the Word2vec patent here.

The Artificial Intelligence Angle

Another piece to this puzzle that I came across was the 2011 Google Brain project, which also had some common players in the form of Jeff Dean, Geoffrey Hinton and Greg Corrado.

Google Brain was a deep learning / artificial intelligence research project that may indeed also be part of the evolution of what we know as RankBrain today. As always, it’s hard to say.

Google later acquired DeepMind, an artificial intelligence company, in 2014. This was considered a talent acquisition at the time, but in retrospect it might have been much more of an AI tech and research play. It may also have played a role in the evolution of things.

The Hummingbird connection

I also think it bears noting that this move is somewhat in line with Google’s 2013 initiative dubbed “Hummingbird.” At the time, we knew that this new search platform was, among other things, more focused on dealing with queries and natural language processing. This is seemingly being addressed and expanded upon with RankBrain.

As my good friend Bill Slawski penned at the time:

[Hummingbird is] being presented as a query expansion or broadening approach which can better understand longer natural language queries, like the ones that people might speak instead of shorter keyword matching queries which someone might type into a search box.

And as Danny Sullivan reported last fall, with regard to where RankBrain fits in with Hummingbird:

Hummingbird is the overall search algorithm, just like a car has an overall engine in it. The engine itself may be made up of various parts, such as an oil filter, a fuel pump, a radiator and so on. In the same way, Hummingbird encompasses various parts, with RankBrain being one of the newest.

They also might be using it to further improve the Knowledge Graph, as noted in this paper (PDF):

Our ongoing work shows that the word vectors can be successfully applied to automatic extension of facts in Knowledge Bases, and also for verification of correctness of existing facts.

From what we understand, RankBrain is actually part of the overall Hummingbird algorithm.

I would also venture a guess that both of these are also in play with an eye towards the future, which is “conversational search” for mobile. Surely they want to be the one that gets that right and leads the way forward.

Does it rank?

This aspect of RankBrain has been more than a bit confusing as well. Obviously, with the word “rank” in the name, it certainly implies such. Also, in talking with Danny Sullivan while crafting this piece, he assured me that “Google has specifically said it is also a ranking factor in addition to query analysis.”

But some statements from Google have seemingly contradicted this claim, including the one from Gary Illyes at the top of this post. So, let’s dig in…

Greg Corrado said in the original Bloomberg article that “RankBrain is one of the ‘hundreds’ of signals that go into an algorithm that determines what results appear on a Google search page and where they are ranked,” which of course further implies that it’s used in rankings (and/or re-ranking).

Certainly, if it’s not a scoring mechanism directly, the signals that it feeds could be used in re-ranking of results (boosting and dampening).
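To picture what boosting and dampening could look like, here is a purely hypothetical sketch. The scores, the concept-match signal and the weighting formula are all my own inventions for illustration. Nothing here reflects how Google actually scores or re-ranks results.

```python
def rerank(results, weight=0.5):
    """Re-order results using a hypothetical concept-match signal.

    results: list of (url, base_score, concept_match) tuples, where
    concept_match is an assumed value between 0 and 1. A match above 0.5
    boosts the base score and a match below 0.5 dampens it.
    """
    adjusted = []
    for url, base_score, concept_match in results:
        multiplier = 1.0 + weight * (concept_match - 0.5)
        adjusted.append((url, base_score * multiplier))
    return sorted(adjusted, key=lambda item: item[1], reverse=True)


# Made-up results: page-a scores best initially but matches the query concept poorly.
results = [
    ("page-a", 0.80, 0.2),
    ("page-b", 0.75, 0.9),
    ("page-c", 0.60, 0.5),
]
reranked = rerank(results)
order = [url for url, _ in reranked]
print(order)
```

The point is simply that a signal doesn’t have to compute the original score to change the final order.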

Corrado also mentioned that “RankBrain has become the third-most important signal contributing to the result of a search query.” I am not entirely sure whether that is in terms of an actual scoring mechanism, or whether its importance is related to the value of better understanding a query and the pages returned in the results. Hard to say.

In short, yes. It would seem, directly or indirectly, that RankBrain is affecting rankings.

Let’s get this clear

So hopefully by now you’re starting to get the idea that RankBrain:

Is for dealing with unknown or unclear queries
Is for understanding and predicting related concepts of words
Is an important signal, but in the sense of serving better results to a query
MIGHT be helping to better classify pages
Is re-ranking based on new classifications
May NOT be a direct ranking factor
Does NOT affect crawling nor indexing
Is NOT affecting other traditional ranking/scoring processes
Does NOT use machine learning on-the-fly (it’s a push/retraining process)

I have done some exhaustive research into this, and there are still a TON of varied opinions floating around the SEO space, many of them well off base. When Google said it is the “third most important signal,” they seem to mean that it’s a massive part of improving query classification/refinement, which in turn vastly improves search quality. Not all “signals” are scoring mechanisms (ranking factors, for you SEO-types). How it is actually affecting rankings is still unknown. Different Googlers have said different things about that.

For the record, content and links are the other two.

What should I do to leverage RankBrain?

That’s the $64,000 question now, isn’t it? Oddly enough, I have been telling people “if it ain’t broke, don’t fix it.” What I mean by that is if you’re still growing traffic month-over-month with your SEO and marketing efforts, then I wouldn’t get worked up. Apparently, Amit Singhal gave the green light for RankBrain to be rolled out back in early 2015, and subsequent statements in the fall of 2015 said it had been active in the wild for a few months.

In short… this isn’t actually something ‘new’ that we’re dealing with.

And of course, all that being said, this is still more about better classification and handling of queries on Google’s end. If anything, it can potentially lessen the need to be overly concerned about what terms you’ve seeded on a page. They can, in theory, better understand the concepts on that page, regardless of whether that exact word or term is in the query that the user puts into Google.

I’ve long encouraged SEOs and content creators to be more focused on solidifying concepts and entities than on exact match targeting approaches. So, my advice really won’t be changing in light of RankBrain.

David Harry

SEO Consultant at Verve Developments

David is an SEO consultant with Verve Developments with more than 18 years of experience. He is also one of the founders of the SEO Training Dojo, a community for professionals.

Getting Your Head Around Google’s RankBrain - March 28, 2016

Comments

Tony Dimmock says
March 29, 2016 at 2:50 am
Superb study, commentary and resources David. Thanks for setting the record straight – let’s hope there’s an end to click-bait RB posts now.
I’ll be referencing this post with fellow SEOs and clients in the future.
Another research paper here: https://www.cs.ubc.ca/~murphyk/Papers/kv-kdd14.pdf “Knowledge Vault: A Web-Scale Approach to Probabilistic Knowledge Fusion” includes Thomas Strohmann and mentions the following:
at point 4.2 (Neural Network Model (MLP)):
“To illustrate that the neural network model learns a meaningful “semantic” representation of the entities and predicates, we can compute the nearest neighbors of various items in the a K-dimensional space. It is known from previous work (e.g., [27]) that related entities cluster together in the space, so here we focus on predicates. The results are shown in Table 4. We see that the model learns to put semantically related (but not necessarily similar) predicates near each other. For example, we see that the closest predicates (in the ~w embedding space) to the ’children’ predicate are ’parents’, ’spouse’ and ’birth-place’.”
and point 4.3 (Fusing the Priors):
“We can combine the different priors together using the fusion method described in Section 3.2. The only difference is the features that we use, since we no longer have any extractions. Instead, the feature vector contains the vector of confidence values from each prior system, plus indicator values specifying if the prior was able to predict or not. (This lets us distinguish a missing prediction from a prediction score of 0.0.) We train a boosted classifier using these signals, and calibrate it with Platt Scaling, as before.”
Both comments are possibly key considerations (or a front-runner) to the development of RankBrain (note this paper was published in 2014).
Again, thanks for being as detailed as needed 🙂

Paulo says
March 29, 2016 at 3:57 am
Congratulations! Great post and very complete argumentation.
How do you think that RankBrain information is served right now? Humanly, with some team inserting info in a database, or algorithmically, with AI being tested?

Nikolay Stoyanov says
April 21, 2016 at 5:02 am
Awesome article David! I read it like three times already. However, as it is with RankBrain, there are still a lot more questions to be asked. What are your thoughts on its impact on on-page SEO? I mean, they told us that it is one of the most important ranking signals. But, will it change the way articles are written?
