Search ranking factors in Google


Let's discuss search ranking factors. To start: search ranking factors are the same thing as search ranking signals, and the two terms are used interchangeably below.

Types of search ranking factors

Hand-crafted (manually adjustable) vs. LLM-based.

Search ranking signals are made out of data

Google takes relevant data and performs regression to arrive at signals.


Neither Google nor any other search engine discloses how its ranking system works, under the pretext of safeguarding it from manipulation. What we know about search ranking signals has therefore become evident from other sources.

Sources of knowledge on search ranking factors:

leaks of internal documents (the Google leaks),
analysis of SEO practice, and
recent court proceedings in which Google officials were compelled to give testimony.

Importantly, Google's ranking signals are nowhere disclosed in explicit form.

[Image: Google antitrust proceeding documents]

Google ranking signals may be divided into 'hand-crafted' (manually adjustable) and LLM-based.

Manually adjustable signals can be analyzed and adjusted by engineers, whereas large language model (LLM) based signals revolve around natural language processing and machine-learned models. Almost every signal, aside from RankBrain and DeepRank (which are LLM-based), is hand-crafted and thus can be analyzed and adjusted by engineers.

Analyzed

If anything breaks, Google knows what to fix, which factors can be ignored, and how the factors influence each other. It also means the signals can be influenced by site owners.

Adjusted by engineers

In the extreme, hand-crafting means that Google looks at the relevant data and picks the midpoint manually.

Data and signals are two major terms

Search ranking engineers operate with two major variables: data and signals. Data comes first: Google takes relevant data and applies regression to it to arrive at a signal.

Take the function and figure out a threshold to use

To develop a signal, engineers look at the function and figure out what threshold of sensitivity to use. The function is a rule describing a relationship between sets of data; for example, Google uses sigmoid and other functions. The threshold is the midpoint at which the relation becomes statistically significant, and this midpoint can be picked manually or arrived at by regression, as in most cases at Google.
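To make this concrete, here is a minimal sketch of the idea, with made-up data and parameter names (nothing here comes from Google's actual systems): fit a sigmoid to observed outcomes and read off its midpoint as the threshold.

```python
# A minimal sketch: fit a sigmoid to (feature, outcome) data and read
# off its midpoint as the decision threshold. Data and parameters are
# illustrative only, not from Google's systems.
import numpy as np
from scipy.optimize import curve_fit

def sigmoid(x, x0, k):
    """Logistic function with midpoint x0 and steepness k."""
    return 1.0 / (1.0 + np.exp(-k * (x - x0)))

# Hypothetical data: a raw feature value vs. whether the page turned
# out to be relevant (1) or not (0).
feature = np.array([0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5, 5.0])
relevant = np.array([0,   0,   0,   1,   0,   1,   1,   1,   1,   1])

# Regression step: find the sigmoid parameters that best fit the data.
(x0, k), _ = curve_fit(sigmoid, feature, relevant, p0=[2.0, 1.0])

# The fitted midpoint x0 is the threshold where the relation "flips".
print(f"threshold (sigmoid midpoint): {x0:.2f}")
```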

How are ranking signals crafted?

[Image: Google antitrust proceeding documents]

01

The "hand crafting" of signals means that Google takes all those sigmoids (and other functions) and figures out the thresholds

02

Google takes the relevant data and performs regression to determine which factors matter most.

03

Google engineers plot ranking signal curves.

04

Curve fitting happens at every single level of signals. The purpose of curve fitting is to find the function that best explains the mathematical relationship between parameters, i.e. the one that leaves the smallest residual.
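As an illustration of step 04 (the candidate functions and data are invented for the example), a sketch like the following fits two candidate functions to the same data and keeps whichever leaves the smaller residual:

```python
# A hypothetical curve-fitting sketch: fit two candidate functions to
# the same data and keep the one leaving the smallest residual.
import numpy as np
from scipy.optimize import curve_fit

def linear(x, a, b):
    return a * x + b

def sigmoid(x, x0, k):
    return 1.0 / (1.0 + np.exp(-k * (x - x0)))

# Illustrative data with a clearly S-shaped relationship plus noise.
x = np.linspace(0, 6, 13)
y = sigmoid(x, 3.0, 2.0) + np.random.default_rng(0).normal(0, 0.02, x.size)

candidates = {"linear": (linear, [1.0, 0.0]), "sigmoid": (sigmoid, [3.0, 1.0])}
for name, (fn, p0) in candidates.items():
    params, _ = curve_fit(fn, x, y, p0=p0)
    residual = np.sum((y - fn(x, *params)) ** 2)  # sum of squared residuals
    print(f"{name}: residual = {residual:.4f}")
# The sigmoid leaves the smaller residual, so it explains the data better.
```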

What type of data does Google use to arrive at the signals?

Web page content

Structure of the webpage

User clicks

Label data from raters

Data comes from three sources: content, users, and raters


Website owners are responsible for content and structure; users are responsible for clicks; and human raters are Google's agents who evaluate websites against the publicly available quality guidelines (a complementary source of data).

What are the most important ranking signals?

NavBoost

NavBoost is a re-ranking module that follows a "pair of dice" metaphor. As inferred from the leaked documents, the module treats clicks and impressions (and their proportions) as a "winning" dice combination per specific position in the SERP: if a document gets a better combination for its position than another document, it gets a boost. People who navigate the search results and choose a specific document are called "voters", the whole process is "voting", and voters' data is tokenized and stored. This Twiddler, or re-ranking algorithm, works to boost (promote) or demote sites.
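The exact formulas are unknown, but the "winning combination per position" logic can be sketched roughly like this (the baseline CTR values and the scoring are assumptions made for illustration):

```python
# A rough, assumed sketch of the "winning combo per position" idea:
# compare a document's observed click-through rate at its SERP position
# against an expected baseline for that position. Above baseline means
# boost, below means demote. All numbers here are invented.

# Hypothetical expected CTR per SERP position (1-indexed).
EXPECTED_CTR = {1: 0.30, 2: 0.18, 3: 0.12, 4: 0.08, 5: 0.06}

def navboost_adjustment(position: int, clicks: int, impressions: int) -> float:
    """Positive return value = boost, negative = demote."""
    if impressions == 0:
        return 0.0
    observed_ctr = clicks / impressions
    expected_ctr = EXPECTED_CTR.get(position, 0.03)
    return observed_ctr - expected_ctr  # the "vote" margin

# A document at position 3 that out-performs the baseline gets boosted.
print(navboost_adjustment(position=3, clicks=90, impressions=500))  # ~ +0.06
print(navboost_adjustment(position=1, clicks=60, impressions=500))  # ~ -0.18
```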


Overall, Twiddlers are responsible for re-ranking results from a single corpus. They act on a ranked sequence of results rather than on individual results. Twiddlers may work on a per-device, per-location, or per-topic basis, and so on. Google has boost (or demote) functions that are part of the Twiddler framework; for example, the boost functions identified in the leaked docs include NavBoost, QualityBoost, RealTimeBoost, WebImageBoost, and more.
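As a sketch of the framework idea (the adjustment logic below is an assumption; only the Twiddler names come from the leaked docs), a Twiddler can be thought of as a function that sees the whole ranked list and returns a per-result adjustment, after which the list is re-sorted:

```python
# A sketch of the Twiddler idea under stated assumptions: each twiddler
# sees the whole ranked list and returns one score adjustment per
# result; the list is then re-sorted. The boost logic is made up.
from typing import Callable

Result = dict  # e.g. {"url": ..., "score": ..., "is_fresh": ...}
Twiddler = Callable[[list[Result]], list[float]]

def real_time_boost(results: list[Result]) -> list[float]:
    # Hypothetical stand-in for RealTimeBoost: promote fresh results.
    return [0.2 if r.get("is_fresh") else 0.0 for r in results]

def apply_twiddlers(results: list[Result], twiddlers: list[Twiddler]) -> list[Result]:
    for twiddler in twiddlers:
        for r, delta in zip(results, twiddler(results)):
            r["score"] += delta
    # Re-rank the whole sequence after all boosts/demotions are applied.
    return sorted(results, key=lambda r: r["score"], reverse=True)

ranked = [{"url": "a.com", "score": 1.0, "is_fresh": False},
          {"url": "b.com", "score": 0.9, "is_fresh": True}]
print(apply_twiddlers(ranked, [real_time_boost]))  # b.com moves to the top
```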


ABC (Anchors, Body and Clicks)

Anchors

This is the oldest and probably the most basic ranking signal. An anchor is a source page pointing to a target page via a link. If we take the number of anchors and analyze the text used in them, we can tell whether or not a page covers a certain topic. For example, if there are 10 links pointing to your page (internal or external) and they use anchor texts like "apple", "red apple", "green apple", and so on, then the page probably has the topic of apples, and the document is relevant to such queries.
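Continuing the apple example, a naive version of this analysis is just a tally of anchor texts (a sketch, not Google's actual method):

```python
# A naive sketch of anchor-text analysis: tally the terms used in the
# anchors pointing at a page to infer its topic. Illustrative only.
from collections import Counter

anchors = ["apple", "red apple", "green apple", "apple", "buy apples",
           "apple", "fruit", "apple varieties", "apple", "apples"]

term_counts = Counter(word for anchor in anchors for word in anchor.split())
print(term_counts.most_common(2))
# [('apple', 7), ('apples', 2)] -> the page's topic is clearly "apple"
```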

Body

These are the terms in the document. This ranking signal analyzes the relevance of the terms used in the document.

Clicks

Clicks measure how long a user stayed on the page before bouncing back to search, which determines whether this vote, in the form of a click, should be counted towards relevance and topicality.

ABC ranking signals are the key components of the topicality of the page.

Topicality expresses how relevant the document is to the query: it answers the question of how relevant the page is, based on the query term, to be shown in the search results.

These ABC signals (anchors, body, and clicks) are the key components of topicality, so they allow Google to decide whether to rank a page high or low against the search term. A sketch of how they might combine follows.
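The weights and sub-scores here are purely hypothetical; the point is only to show the three components feeding one topicality score:

```python
# A purely hypothetical sketch of combining the ABC components into a
# single topicality score. Weights are assumptions, not leaked values.

def topicality(anchors_score: float, body_score: float, clicks_score: float,
               weights: tuple[float, float, float] = (0.3, 0.4, 0.3)) -> float:
    """Weighted mix of anchor, body-term, and click relevance, each in [0, 1]."""
    wa, wb, wc = weights
    return wa * anchors_score + wb * body_score + wc * clicks_score

# A page with strong anchor relevance, decent on-page terms, and good
# click behavior scores high against the query.
print(topicality(anchors_score=0.9, body_score=0.7, clicks_score=0.8))  # ~0.79
```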

Quality

Quality is the notion of trustworthiness, and it is an important signal. It has to do with the authority of the links pointing to the website, the age of the domain, and so on. In other words, Google wants to know whether or not users can actually trust the page and its content.

PageRank

PageRank arguably exists on several layers, including one that implies a "distance" from gold-standard "seed" websites.


Google arguably has a collection of trusted articles on all topics, the gold standard of trust. All selected links form a link graph. The rank of each link is calculated by its distance from the trusted documents, which is a standard graph algorithm. This is called the "NearestSeeds" method.


For example, if a trusted article from The New York Times links to an article on site X, an article from site X links to an article on site Y, and an article on site Y links to wlw, the distance will be 3. Distance in graphs is counted not by nodes but by links, or edges. The smaller the distance, the better for this indicator.
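A minimal sketch of that computation, assuming a plain breadth-first search that counts edges from the seed set (the real method certainly weights links more subtly):

```python
# A minimal sketch of seed distance via breadth-first search: count link
# hops (edges) from a trusted seed set. The graph encodes the example
# above; the real seed-distance logic is far more elaborate.
from collections import deque

link_graph = {
    "nytimes.com": ["siteX.com"],
    "siteX.com": ["siteY.com"],
    "siteY.com": ["wlw"],
}
seeds = {"nytimes.com"}

def seed_distance(target: str) -> float:
    queue = deque((seed, 0) for seed in seeds)
    visited = set(seeds)
    while queue:
        node, dist = queue.popleft()
        if node == target:
            return dist
        for neighbor in link_graph.get(node, []):
            if neighbor not in visited:
                visited.add(neighbor)
                queue.append((neighbor, dist + 1))
    return float("inf")  # unreachable from any trusted seed

print(seed_distance("wlw"))  # 3 edges away from the gold-standard set
```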

Summary

This is how Google uses data (webpage content, page structure, user clicks, and other sources) to arrive at signals by the method of regression. The signals, grouped into modules such as NavBoost, ABC, Quality, and PageRank, convey how strong a document's potential is to rank against a search query.


This is how search rankings work, as inferred from analysis of the leaked documentation and the court testimony.