What is HTML Parsing?

HTML parsing is the process of reading an HTML document and extracting data from it. It is widely used in information retrieval and data science, most often by search engines and other bots that mine data.

While these web crawlers usually download the full HTML document returned by the server, they do not necessarily parse all of its code.

Search engines are most interested in code that provides useful information for users. For example, the <h1>, <h2> and <p> tags often include useful headings or content.
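As a rough illustration of this kind of selective parsing, the sketch below uses Python's standard html.parser module to keep only the text inside <h1>, <h2> and <p> tags. The class name, function name and sample markup are hypothetical, and this is not how any particular search engine is implemented.

```python
from html.parser import HTMLParser


class ContentExtractor(HTMLParser):
    """Collects the text found inside <h1>, <h2> and <p> elements."""

    TARGET_TAGS = {"h1", "h2", "p"}

    def __init__(self):
        super().__init__()
        self.results = []      # (tag, text) pairs in document order
        self._current = None   # the target tag that is currently open, if any
        self._buffer = []      # text fragments collected for that tag

    def handle_starttag(self, tag, attrs):
        # Start collecting only when a target tag opens and none is open yet.
        if tag in self.TARGET_TAGS and self._current is None:
            self._current = tag
            self._buffer = []

    def handle_data(self, data):
        if self._current is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == self._current:
            # Collapse runs of whitespace into single spaces.
            text = " ".join("".join(self._buffer).split())
            self.results.append((tag, text))
            self._current = None


def extract_content(html):
    """Returns a list of (tag, text) pairs for headings and paragraphs."""
    parser = ContentExtractor()
    parser.feed(html)
    parser.close()
    return parser.results


# Hypothetical sample page, used only for illustration.
SAMPLE = """
<html><body>
  <h1>HTML Parsing</h1>
  <div class="ad">Ignored banner</div>
  <p>Parsing extracts <em>useful</em> data.</p>
  <h2>Why it matters</h2>
</body></html>
"""

for tag, text in extract_content(SAMPLE):
    print(tag, "->", text)
```

Text inside the <div> is skipped because it is not one of the target tags, while inline tags such as <em> inside a paragraph are folded into the paragraph's text.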

Other elements such as <meta> and <link> are not parsed for content. Instead, search engines extract the essential attributes from these tags. For example, the href attribute references a related resource, such as another page or website. Similarly, the hreflang attribute specifies the language and optional country of the linked document.
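Attribute extraction of this kind can be sketched with Python's standard html.parser module. The class name, function name, dictionary keys and example URLs below are assumptions chosen for illustration only.

```python
from html.parser import HTMLParser


class HeadAttributeExtractor(HTMLParser):
    """Records selected attributes from <link> and <meta> tags."""

    def __init__(self):
        super().__init__()
        self.links = []   # one dict per <link> that has an href
        self.meta = {}    # maps a <meta> name to its content

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "link" and "href" in attributes:
            self.links.append({
                "rel": attributes.get("rel"),
                "href": attributes["href"],
                "hreflang": attributes.get("hreflang"),
            })
        elif tag == "meta" and "name" in attributes:
            self.meta[attributes["name"]] = attributes.get("content")


def extract_head_attributes(html):
    """Returns (links, meta) for the given HTML string."""
    parser = HeadAttributeExtractor()
    parser.feed(html)
    parser.close()
    return parser.links, parser.meta


# Hypothetical head section, used only for illustration.
HEAD_SAMPLE = """
<head>
  <meta name="description" content="A short guide to HTML parsing.">
  <link rel="canonical" href="https://example.com/guide">
  <link rel="alternate" hreflang="en-gb" href="https://example.com/uk/guide">
</head>
"""

links, meta = extract_head_attributes(HEAD_SAMPLE)
print(meta)
for link in links:
    print(link)
```

Only the attribute values are kept; nothing between the tags is read, which mirrors the idea that these elements matter for their attributes rather than their content.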

How can I check for Parsing Issues?

Parsing errors are primarily a concern for search engines, not search engine analysts. However, you can reduce the chances of HTML parsing issues by checking your pages with the W3C Markup Validation Service.

This tool reports any issues with your HTML code according to the latest specifications. Unfortunately, it may also flag features that were deprecated in HTML5 but are still needed by earlier versions. Remaining backwards compatible can be great for users and is fine for search engines, so not every warning needs to be fixed.

Aside from checking validation issues, try to remain consistent across your website. Avoid using capital letters, commas and underscores in URLs, as well as components that crawlers generally ignore, such as hash fragments.
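A simple consistency check along these lines could look like the sketch below, which uses urllib.parse from Python's standard library. The function name, the issue labels and the choice to inspect only the path are assumptions made for this example.

```python
from urllib.parse import urlsplit


def find_url_issues(url):
    """Returns a list of labels describing style issues found in a URL."""
    parts = urlsplit(url)
    issues = []
    # Only the path is inspected here; host names are case-insensitive anyway.
    if any(ch.isupper() for ch in parts.path):
        issues.append("uppercase letters in path")
    if "," in parts.path:
        issues.append("comma in path")
    if "_" in parts.path:
        issues.append("underscore in path")
    if parts.fragment:
        issues.append("hash fragment")
    return issues


# Hypothetical URLs, used only for illustration.
print(find_url_issues("https://example.com/Blog/html_parsing#intro"))
print(find_url_issues("https://example.com/blog/html-parsing"))
```

Running a check like this over a sitemap or crawl export is one way to spot inconsistent URLs before they are linked internally.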

What about Structured Data?

In the early days of HTML, the markup language was designed to be purely functional. More recently, the focus has shifted towards structured data: markup that carries semantic meaning as well as a practical purpose.

Examples of structured data in HTML5 include the <article> and <section> tags. These tags behave the same as <div> tags but convey added meaning about the content they contain.

Other types of structured data include Microdata, JSON-LD (JavaScript Object Notation for Linked Data) and RDFa (Resource Description Framework in Attributes). Adding this markup can give semantic meaning to otherwise meaningless tags such as list elements and span tags.
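To show how one of these formats can be read by a parser, the sketch below collects JSON-LD blocks from <script type="application/ld+json"> tags using only Python's standard library. The class name, function name and sample markup are hypothetical, and malformed blocks are simply skipped as a design choice for this example.

```python
import json
from html.parser import HTMLParser


class JsonLdExtractor(HTMLParser):
    """Collects parsed JSON-LD objects from script tags."""

    def __init__(self):
        super().__init__()
        self.items = []           # decoded JSON-LD objects
        self._in_json_ld = False  # True while inside a JSON-LD script tag
        self._buffer = []         # raw text collected from that tag

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_json_ld = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_json_ld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_json_ld:
            self._in_json_ld = False
            try:
                self.items.append(json.loads("".join(self._buffer)))
            except json.JSONDecodeError:
                pass  # skip malformed blocks rather than abort the whole page


def extract_json_ld(html):
    """Returns a list of decoded JSON-LD objects found in the HTML."""
    parser = JsonLdExtractor()
    parser.feed(html)
    parser.close()
    return parser.items


# Hypothetical page fragment, used only for illustration.
JSON_LD_SAMPLE = """
<script type="application/ld+json">
{"@context": "https://schema.org", "@type": "Article", "headline": "What is HTML Parsing?"}
</script>
<script>console.log("not structured data");</script>
"""

for item in extract_json_ld(JSON_LD_SAMPLE):
    print(item["@type"], "-", item["headline"])
```

Ordinary scripts are ignored because their type attribute does not match, so only the structured data block is decoded.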

List of Elements & Attributes Confirmed as Parsed

To help you with your journey, I have provided a list of HTML elements that are confirmed as parsed. I’ve also broken these down into whether they’re in the head or body, and how strong they are as ranking factors.

Head Elements

| HTML Element | Tag Type | Ranking Strength |
| --- | --- | --- |
| <title> | Meta Element | Strong |
| <meta> | Meta Element | None |
| <link> | Meta Element | None |

Body Elements

| HTML Element | Tag Type | Ranking Strength |
| --- | --- | --- |
| <h1> | Block Element | Strong |
| <h2> | Block Element | Medium |
| <h3> | Block Element | Weak |
| <h4> | Block Element | Weak |
| <h5> | Block Element | Weak |
| <h6> | Block Element | Weak |
| <p> | Block Element | Medium |
| <em> | Inline Element | Weak |
| <i> | Inline Element | Weak |
| <strong> | Inline Element | Weak |
| <b> | Inline Element | Weak |
| <article> | Block Element | None |
| <section> | Block Element | None |
| <nav> | Block Element | None |
| <footer> | Block Element | None |
| <div> | Block Element | None |
| <span> | Inline Element | None |

All Attributes

| HTML Attribute | Purpose |
| --- | --- |
| lang="" | Specifies the language of the document. Usually set on the <html> tag. |
| rel="" | Specifies the relationship between the current document and a linked resource. |
| hreflang="" | Specifies the language code and optional country code of the linked document. Typically found on a <link> element. |
| alt="" | Specifies alternative text to show if an image fails to load. Used with <img> tags. |
| src="" | Specifies the URL of an embedded resource, such as an image. |
| srcset="" | Specifies a set of candidate image sources for different screen sizes or pixel densities. |
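As a small example of reading these attributes in practice, the sketch below records the document's lang value and lists images that have no alt attribute at all. It relies on Python's standard html.parser module; the class name, function name and sample markup are assumptions made for illustration.

```python
from html.parser import HTMLParser


class AttributeAudit(HTMLParser):
    """Records the page language and images that lack an alt attribute."""

    def __init__(self):
        super().__init__()
        self.lang = None               # value of lang on the <html> tag
        self.images_missing_alt = []   # src values of <img> tags without alt

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "html":
            self.lang = attributes.get("lang")
        elif tag == "img" and "alt" not in attributes:
            # An empty alt="" is left alone: it is valid for decorative images.
            self.images_missing_alt.append(attributes.get("src"))


def audit_attributes(html):
    """Returns (lang, images_missing_alt) for the given HTML string."""
    parser = AttributeAudit()
    parser.feed(html)
    parser.close()
    return parser.lang, parser.images_missing_alt


# Hypothetical page, used only for illustration.
AUDIT_SAMPLE = """
<html lang="en">
<body>
  <img src="/chart.png" alt="Traffic chart">
  <img src="/photo.png" srcset="/photo-2x.png 2x">
  <img src="/divider.png" alt="">
</body>
</html>
"""

lang, missing = audit_attributes(AUDIT_SAMPLE)
print("lang:", lang)
print("images missing alt:", missing)
```

Only the second image is reported, since it is the only one without any alt attribute.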