What is HTML Crawling?

Websites across the internet are built with a markup language called HTML. Every site uses this language to display relevant information to users.

When a search engine such as Google, Bing or Yandex crawls a website, it requests a copy of the HTML file. Each search engine then uses a form of HTML parsing to extract information, such as the page title, h1 tags, h2 tags and other headings.

Importantly, all of these search engines download the full HTML file, even if they only parse certain parts of it. It is therefore essential to validate your HTML and improve every aspect of your code.
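
To make this concrete, here’s a minimal Python sketch of the kind of parsing described above, using only the standard library. It downloads a page’s HTML and pulls out the title, h1 and h2 headings; the example.com address is just a placeholder.

# A minimal parsing sketch using only the Python standard library.
# It fetches a page and collects the <title>, <h1> and <h2> text,
# similar to the headings a search engine extracts.
from html.parser import HTMLParser
from urllib.request import urlopen

class HeadingParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.current_tag = None
        self.headings = []

    def handle_starttag(self, tag, attrs):
        if tag in ("title", "h1", "h2"):
            self.current_tag = tag

    def handle_endtag(self, tag):
        if tag == self.current_tag:
            self.current_tag = None

    def handle_data(self, data):
        if self.current_tag and data.strip():
            self.headings.append((self.current_tag, data.strip()))

html = urlopen("https://example.com/").read().decode("utf-8", errors="ignore")
parser = HeadingParser()
parser.feed(html)
for tag, text in parser.headings:
    print(tag, text)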

How can I check my site is crawlable?

There are many ways to check whether your website responds to robot requests. A simple one is to use the tools built into Windows, macOS or Linux.

For example, here’s how you can use the Command Prompt to check that a website is accessible:

Ping Test with Microsoft Command Prompt

Start by opening the Windows search bar and searching for ‘cmd’ or ‘command prompt’. Clicking the result should open the terminal.

Windows 10 Command Prompt

Once the Command Prompt is open, you will want to type the following command to check your website:

ping yoursite.com

Using CMD to ping rowanseo.com

Microsoft’s ping utility always sends four packets to the website to test that it exists. However, if you type in the wrong website, or the site is not available, you will get back a message that reads:

“Ping request could not find host fakewebsitebeingtested.com. Please check the name and try again.”

This technique is excellent for quickly checking that your website is accessible. While it’s a useful trick to know, it’s not as sophisticated as software tools such as Screaming Frog.

Ping Test with Apple Terminal

On a Mac, open Spotlight with Cmd + Space (or go to Applications > Utilities) and launch the Terminal app.

Once the Terminal is open, you will want to type the following command to check your website:

ping -c 4 yoursite.com

Unlike Windows, the macOS ping command keeps sending packets until you stop it with Ctrl + C, so the -c 4 flag limits the test to four packets. If you put in the wrong website, or the site is not available, you will get back a message similar to:

“ping: cannot resolve fakewebsitebeingtested.com: Unknown host”

As with the Windows version, this is a quick way to confirm a site is reachable, but it’s not as sophisticated as software tools such as Screaming Frog.
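
Ping only tells you that the domain resolves and the server answers; it doesn’t tell you whether a crawler can actually fetch your pages over HTTP. If you’re comfortable with a little Python, here’s a rough sketch (an illustration, not part of any tool) that requests the homepage and robots.txt and prints the status codes a robot would see. The yoursite.com URLs and user-agent string are placeholders.

# Check the HTTP status codes that a crawler would receive.
from urllib.error import HTTPError, URLError
from urllib.request import Request, urlopen

for url in ("https://yoursite.com/", "https://yoursite.com/robots.txt"):
    try:
        request = Request(url, headers={"User-Agent": "crawl-check/1.0"})
        with urlopen(request, timeout=10) as response:
            print(url, "returned status", response.status)
    except HTTPError as error:
        print(url, "returned HTTP error", error.code)
    except URLError as error:
        print(url, "is unreachable:", error.reason)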

How can I improve HTML crawling?

Since Google works by crawling your website, it is crucial to optimise crawl efficiency for your web pages. Complete the following optimisations to improve your HTML crawling.

  • Reduce the size of your site to save on download times.
  • Improve your site infrastructure and networking stack.
  • Remove irrelevant content that wastes the robots’ time.
  • Structure your content to funnel search engines to core pages.

Let’s cover the above tips in more detail:

Reducing File Sizes to Improve Crawl Efficiency

File size has an indirect effect on how efficiently a robot crawls your website: the faster a page loads, the more quickly robots can retrieve information from your HTML.

That said, a small file size is not always a sign of a good-quality page. The best websites often have the largest file sizes, but only because the page is full of interactive content.

There are also instances where a large file size is not down to interactive content. Excellent examples are uncompressed images, unnecessary CSS files and unused JavaScript.

Reducing this surplus code should lead to faster load times and therefore improve HTML crawling.
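
If you want a rough idea of how much data a robot has to download, the short Python sketch below (purely illustrative) requests a page with and without gzip compression and prints the transfer size; urllib does not decompress automatically, so the length reflects the bytes actually sent over the wire. The yoursite.com URL is a placeholder.

# Compare the transfer size of a page with and without gzip compression.
from urllib.request import Request, urlopen

url = "https://yoursite.com/"
for encoding in ("identity", "gzip"):
    request = Request(url, headers={"Accept-Encoding": encoding})
    with urlopen(request, timeout=10) as response:
        size = len(response.read())  # compressed bytes when the server honours gzip
        print(f"{encoding}: {size / 1024:.1f} KB")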

Checking Crawl Stats in Google Search Console

Improving the Site Infrastructure

Some of the best site speed optimisations come from improving your infrastructure. There are three main areas that you will want to check for improvements to crawl efficiency:

  1. Check your hosting service specifications.
  2. Set up a Content Distribution Network.
  3. Update the PHP Version if necessary.

When reviewing the specifications from your hosting provider, check the same data points you would when buying a computer. A fast processor and an abundance of RAM are essential. It’s also good to check whether it uses solid-state storage (SSD), as this is generally faster.

Another great way to improve crawl efficiency, as well as user experience, is to use a Content Distribution Network (CDN). These services place copies of your content on servers around the world, helping users and robots load it with lower latency.

Lastly, if you’re using a service such as WordPress, you may want to update your PHP version. Older websites will often use outdated versions of PHP that severely slow down your site.

Here’s a video benchmark that shows how updating your PHP version can almost halve the time it takes per request:
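
If you’d like to run a rough before-and-after benchmark of your own, the Python sketch below (purely illustrative) times a handful of requests to your homepage and prints the average response time. The yoursite.com URL is a placeholder.

# Time a few requests and report the average, to compare before and after
# a hosting, CDN or PHP change.
import time
from urllib.request import urlopen

url = "https://yoursite.com/"
timings = []
for _ in range(5):
    start = time.perf_counter()
    with urlopen(url, timeout=10) as response:
        response.read()
    timings.append(time.perf_counter() - start)

print(f"average response time: {sum(timings) / len(timings):.3f}s")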

Remove Irrelevant Content from Search Engines

Since search engines such as Google, Bing and Yandex crawl all of your pages, it’s usually a good idea to be selective about where they spend their time. Helping them find your best content is an excellent way to improve rankings.

The most common issues I discover slowing down crawl efficiency are these:

  • WordPress /tag/, /category/, and /author/ pages.
  • Shopify /tagged/ pages and duplicate product pages.
  • Magento duplicate product pages.

Depending on your site’s infrastructure, you may wish to noindex these pages, nofollow links towards them, or canonicalise them.

A noindex tag tells Google to remove the page from the index after it has been crawled a couple of times. You will usually want to use the Google URL Removal Tool alongside this strategy.

If you’ve chosen to nofollow links toward the page, this can help internal crawl efficiency. However, if the pages have external links, they will still be discovered, crawled, and indexed.

Lastly, if you would like to add a canonical tag, make sure that the page is an almost exact duplicate. Two pages that are similar but different should not reference each other.
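
To double-check which of these directives a page is actually serving, here’s a small Python sketch (an illustrative audit script, not from the article) that reports a page’s robots meta tag and canonical link. The yoursite.com URL is a placeholder.

# Report the robots meta tag and canonical link of a page.
from html.parser import HTMLParser
from urllib.request import urlopen

class RobotsCanonicalParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.robots = None
        self.canonical = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta" and (attrs.get("name") or "").lower() == "robots":
            self.robots = attrs.get("content")  # e.g. "noindex, nofollow"
        if tag == "link" and (attrs.get("rel") or "").lower() == "canonical":
            self.canonical = attrs.get("href")

html = urlopen("https://yoursite.com/some-page/").read().decode("utf-8", errors="ignore")
parser = RobotsCanonicalParser()
parser.feed(html)
print("robots meta:", parser.robots)
print("canonical:", parser.canonical)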

Index Status in Google Search Console

Funnel Robots to your Core Pages

The final way that you may wish to improve the HTML crawling of your page is through funnelling. This practice is often called Link Sculpting.

To improve the rate at which your core pages get crawled, they should receive at least one link from 50 different pages (or as many as your site allows). Including multiple links on the same page does not improve crawl efficiency.

It’s important to note that hash fragments do not count as different URLs. So using a table of contents will not allow you to add extra internal links to the count. For example, the following URLs would appear to Google as the same:

https://rowanseo.com/google-ranking-factors/

https://rowanseo.com/google-ranking-factors/#DomainFactors

Since they’re both seen as the same URL, the link is only counted once. However, they deliver different user experiences, so you may still wish to use both even though the link only counts once.
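
If you want to count how many distinct pages link to a core page, the Python sketch below (a rough illustration, with a placeholder page list) resolves relative links, strips hash fragments with urldefrag so that both versions of the URL above are treated as one, and reports how many pages contain at least one link to the target.

# Count how many distinct pages link to a target URL, ignoring #fragments.
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin
from urllib.request import urlopen

class LinkParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

target = urldefrag("https://rowanseo.com/google-ranking-factors/").url
pages = ["https://rowanseo.com/", "https://rowanseo.com/blog/"]  # placeholder list of pages to check

linking_pages = set()
for page in pages:
    html = urlopen(page).read().decode("utf-8", errors="ignore")
    parser = LinkParser()
    parser.feed(html)
    for href in parser.links:
        # Resolve relative links and drop any #fragment before comparing.
        if urldefrag(urljoin(page, href)).url == target:
            linking_pages.add(page)

print(f"{len(linking_pages)} distinct pages link to {target}")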

Screaming Frog Site Structure

Why can't Screaming Frog crawl my HTML?

If you’re using a tool such as Screaming Frog to crawl your website, you might find it fails to discover URLs. This problem often occurs when the main navigation is rendered with JavaScript.

Most modern search engines will render the JavaScript, which helps them to uncover links and understand page structure.

If you are struggling to crawl the HTML of your website, you may wish to check your settings and enable JavaScript rendering. In Screaming Frog, this can be found under Configuration > Spider > Rendering.

You can find more tips in my Screaming Frog Tutorial.

Javascript Rendering in Screaming Frog