A Technical SEO Guide to Crawling, Indexing and Ranking

One of the most common questions that I get is ‘how often does Google crawl?’

The first thing you need to do is understand what website crawling actually is. But rather than explain it myself, hear what Google has to say in the video below:

[Video: Google explains how website crawling works]

What is an SEO Spider?

If you watched the video and you’re new to SEO, the first term that probably confused you is ‘spider’.

Simply put, this is the name developers have given to the crawling bots that explore the internet. They move from page to page by following links. It’s the world wide web, so it makes sense to call them spiders.

There are directives for spiders, and classifications for them too. Some spiders are considered good bots, whilst others are bad bots. Some listen to your robots.txt directives, and others choose to ignore them.

But the important thing is simply to know that these crawling bots are not necessarily harmful. However, if they’re crawling too much of your website too quickly, they may cause your server to slow down. This brings us to the next step:

What is the Robots.txt Directive?

According to Wikipedia, the robots exclusion standard began in 1994, when Martijn Koster created a robot to crawl the internet. This is backed up by robotstxt.org, which hosts the original web standard document.

The standard was prompted when a badly behaved site crawler inadvertently caused a distributed denial of service on the website Koster was working on, which led to the need for robots directives.

Whilst robots.txt is not recognised as an internet standard by any standards body, it is recognised by the webmaster community. More importantly, it’s something that Google supports and recommends.

The basic format of a robots.txt for most of you guys will look something like this:

User-agent: *
Disallow: /wp-admin/
Allow: /wp-admin/admin-ajax.php

 

If this is familiar to you, then you’ve probably seen a robots.txt before, or looked it up whilst reading this article. But in the next section, I’m going to break down each of the directives and how they work:

A List of Directives and What They Actually Do

I know a lot of people who started in SEO knowing they needed a robots.txt file, but with no idea why they needed to include those magical lines of code.

Here’s a list of directives and what they do, as well as a handful of interesting and less commonly known facts. A combined example follows the list.

User-Agent: This tells the robots whether the rules apply to them, and by default most people will use an asterisk to represent all robots.

Disallow: This tells the robots that this section of the website should not be crawled. They can still ping to see if a page is there, but if they obey the directives then they won’t visit it.

Allow: This tells the robots that this page or section can be reached. So if you want to block off a section of your website but allow a single file within it, you can do so.

Crawl-Delay: This tells the robots that they should wait a certain amount of time before crawling another page. Some search engines interpret this as a time delay between page requests; others interpret it as the length of time before revisiting the website. Google ignores the directive entirely.

Sitemap: This lets you specify where your XML sitemap is located. This helps web crawlers find your sitemap and crawl the website.
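To see how these directives fit together, here’s the combined example promised above. Everything in it is illustrative: the /private/ folder, the delay value and the sitemap URL are placeholders, not recommendations, and remember that only the engines supporting Crawl-delay will honour it.

User-agent: *
Crawl-delay: 10
Disallow: /wp-admin/
Disallow: /private/
Allow: /wp-admin/admin-ajax.php

Sitemap: https://www.example.com/sitemap.xml

The first five lines form a single group that applies to all robots, while the Sitemap line stands alone and applies to the whole file.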

 

Disallow: /wp-admin/: This is a default for Yoast SEO on WordPress. It’s blocking off the wp-admin dashboard so that robots do not crawl your back end.

Allow: /wp-admin/admin-ajax.php: This is the default for Yoast because an important WordPress file called admin-ajax.php lives inside the /wp-admin/ folder. If you’ve ever password-protected your /wp-admin/ section, you may have noticed 401 errors caused by the browser being unable to reach the admin-ajax.php file.

<meta name="robots" content="nofollow">: This is not something that is included in your robots.txt file, but it is still part of the robots exclusion protocol. It tells robots not to follow any of the links on the page, but it can be ignored and is not useful for protecting secret files or documents.
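For context, here’s where that tag sits. This is a minimal, hypothetical page head; the title is made up, and I’ve paired nofollow with noindex since the two are commonly used together:

<head>
  <title>A Page You Want Kept Out of Search</title>
  <!-- Ask compliant robots not to index this page or follow its links -->
  <meta name="robots" content="noindex, nofollow">
</head>

Because the tag lives in the HTML itself, the page has to be crawlable for robots to see it, so don’t disallow the page in robots.txt if you want the noindex to be picked up.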

Google Search Console

The next questions that I receive a lot are ‘how often does Google crawl’ and ‘how to get Google to crawl my website’. Along with these, I am sometimes asked ‘when did Google last crawl my website?’

A lot of this information can be found in Google Search Console. So I’ll walk you through how to check this information for yourself.

How often does Google crawl?

The truth is that Google crawls every website differently, and the rate changes daily. However, you can use Google Search Console to help you here. It will tell you how often Google is crawling your website, which is what matters most.

To do this, open up Google Search Console and select your property. Then select Crawl > Crawl Stats.

When you do this, you’ll be presented with the graphs below. The one you’re interested in is Pages crawled per day.

You’re looking for the average number of pages crawled per day, not the high or low values. There’s normally a lot of variance between these values, but with a small website such as mine, you can expect numbers like these:

[Screenshot: Crawl Stats graphs showing pages crawled per day]

How to get Google to crawl my website?

If you’re looking to get Google to crawl your website, then the most important thing is to make sure that it’s well connected and providing value.

First and foremost, if the website is full of thin and duplicate content, that is going to discourage Google from crawling it. There’s no substitute for a good website design.

However, situations arise from time to time that require a little more encouragement. Perhaps you’ve created a new page you want to rank. Maybe there are old pages in the index that you want removed, with their value attributed to the new page.

To do this, you’ve got a few options.

You can create a list of the pages you want Google to crawl, then create a fresh XML sitemap file. This can then be submitted on the Sitemaps page, which can be found under Crawl > Sitemaps.
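If you’re building that sitemap file by hand, a minimal sketch looks like the below. The domain and dates are placeholders; the only required tag inside each <url> entry is <loc>:

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://www.example.com/new-page/</loc>
    <lastmod>2018-01-15</lastmod>
  </url>
</urlset>

Upload it to your site’s root, then submit its URL on the Sitemaps page.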

The other option for getting Google to crawl your website is to fetch and request indexing. This can be done by visiting Crawl > Fetch as Google.

How this page works is fairly self-explanatory. Simply submit the URL and select either Fetch or Fetch and Render. A button then appears saying Request Indexing, and when you press it, a pop-up will ask how you want to proceed.

The first option will simply request that Google crawl this page. So if you’re looking to submit a new article and get it indexed quickly, this is a great option. However, if you’re looking to recrawl an entire section of a website, select Crawl this URL and its direct links.

If your pages are full of good-quality content and you’ve got lots of links, then this isn’t going to be a problem. People who are struggling to get pages indexed need to address navigational and content problems.

When did Google last crawl my website?

It should be clear from the section on how often Google crawls your website that it also answers ‘when did Google last crawl my website?’. If you can see 90 days of crawling data, then problem solved, right?

Well, not really…

Sometimes people have set up their Google Search Console incorrectly, and this means the crawl data they’re looking at won’t reflect what Google is actually crawling.

Your Pages Crawled might look something more like this:

This isn’t something to be alarmed by, even though it looks alarming. In this instance, I’ve selected my http://www. property instead of https://.

Since I have a redirect set up and build links towards my https:// version, it’s clear that Google will not crawl the http version very often. So it’s really important to make sure that you’re collecting the right data.
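As an aside, if you need to set up that kind of redirect yourself, here’s a sketch of a site-wide http-to-https rule. It assumes an Apache server with mod_rewrite enabled and goes in your .htaccess file; your host may offer a simpler toggle instead:

RewriteEngine On
# Redirect any request arriving over plain http to https, preserving the path
RewriteCond %{HTTPS} off
RewriteRule ^(.*)$ https://%{HTTP_HOST}/$1 [L,R=301]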

However, there’s another way that you can check Google crawling habits on your website. If you’ve never done it before, you’re about to find out how to check your server logs.

Your web server keeps an access log of every request it receives, including requests from Googlebot. If your host lets you download these logs (usually via cPanel, SFTP or SSH), you can see exactly which URLs Googlebot requested and when, which is the most direct record there is of Google’s crawling habits. One caveat: some scrapers spoof the Googlebot user-agent, so if the numbers look odd, verify the requesting IPs with a reverse DNS lookup.
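If you want to tally those Googlebot visits yourself, here’s a minimal Python sketch. It assumes your server writes the standard Apache/Nginx combined log format and that the log is saved as access.log; both the filename and the format are assumptions, so adjust them to your setup:

import re
from collections import Counter
from datetime import datetime

# Matches the [day/month/year:time] timestamp and the final quoted
# field of a combined-format log line, which is the user-agent string.
LINE = re.compile(r'\[(\d{2}/\w{3}/\d{4}):[^\]]+\].*"([^"]*)"\s*$')

hits = Counter()
with open("access.log") as log:  # placeholder path, point this at your log
    for line in log:
        match = LINE.search(line)
        # Count only requests whose user-agent claims to be Googlebot
        if match and "Googlebot" in match.group(2):
            hits[match.group(1)] += 1

# Print a per-day tally, sorted chronologically rather than alphabetically
for day, count in sorted(hits.items(), key=lambda item: datetime.strptime(item[0], "%d/%b/%Y")):
    print(day, count)

Remember the caveat above: this counts anything claiming to be Googlebot, so verify the source IPs before treating the numbers as gospel.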

What are crawl errors in Google Search Console?

Something that is often overlooked is the Crawl Errors report in Google Search Console. Found under Crawl > Crawl Errors, it lists the URLs that Googlebot tried to reach but couldn’t, such as pages that no longer exist.

List of Website Crawlers

[Screenshot: Google crawl errors for pages that don’t exist]

How to perform a Site Crawl

Lots of text

How to crawl a list of URLs

Lots of text
