Interpreting Sitemaps: Harder Than You Think

Sitemaps are an integral part of how search engines discover which pages your website has to offer. Reading your sitemap, alongside normal crawling activity, gives a search engine the list of URLs it will use to extract information from your pages.

For each page, basic information is usually pulled from its <meta> tags, but there's much more to the story than that.
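To make that concrete, here's a minimal sketch, assuming the requests and BeautifulSoup libraries and a hypothetical example URL, of the kind of basic information a crawler pulls from a page's <meta> tags:

```python
import requests
from bs4 import BeautifulSoup

def extract_basic_meta(url: str) -> dict:
    """Fetch a page and pull the basic <meta> information a crawler cares about."""
    html = requests.get(url, timeout=10).text
    soup = BeautifulSoup(html, "html.parser")

    def meta(name=None, prop=None):
        # Look up a <meta> tag by its name= or property= attribute.
        tag = soup.find("meta", attrs={"name": name} if name else {"property": prop})
        return tag["content"] if tag and tag.has_attr("content") else None

    return {
        "title": soup.title.string if soup.title else None,
        "description": meta(name="description"),
        "og:title": meta(prop="og:title"),
        "og:image": meta(prop="og:image"),
    }

# Hypothetical example URL -- substitute a real article page.
print(extract_basic_meta("https://example.com/some-article"))
```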

Not Just for Search Engines

While search engines are the primary audience for your sitemaps, they're not the only tech that finds them useful. They're also a great tool for managing and optimizing your own content.

Here are a few ways they come in handy:
 

  • Tracking New Content: Whenever you add content to your site, the sitemap will start advertising it. This is useful for anyone involved in content marketing or SEO who wants to monitor their customers' publishing activity.
     

  • Disappearing Content: Just as important as tracking new content is keeping tabs on what's no longer available. By comparing a historical snapshot of a sitemap with a more recent one, you can discover content that has since disappeared (see the sketch after this list). SEO experts usually recommend that these URLs aren't simply dropped, but instead redirect to other relevant content.
     

  • Monitoring Content Updates: Sitemaps also record when content was last modified (via the <lastmod> field), giving search engines a heads-up on changes. SEO experts can hook into this to understand how content changes over time when the platform their customer uses doesn't offer that kind of reporting out of the box.
     

  • Comparing Sitemaps for Migrations: When you're migrating content from an old site to a new one, sitemaps can be your best friend. By comparing the old and new sitemaps, you can make sure every last drop of SEO “juice” survives the move, verifying that each old URL has an equivalent (or a redirect) on the new site.
     
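To illustrate the first two points, here's a minimal sketch of diffing two sitemap snapshots. It assumes plain, uncompressed, non-index sitemaps and uses a hypothetical saved file and site URL:

```python
import requests
import xml.etree.ElementTree as ET

SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

def sitemap_urls(xml_text: str) -> set[str]:
    """Extract the set of <loc> values from a plain (non-index) sitemap."""
    root = ET.fromstring(xml_text)
    return {loc.text.strip() for loc in root.findall("sm:url/sm:loc", SITEMAP_NS)}

# Hypothetical snapshots: yesterday's saved copy versus today's live sitemap.
old_urls = sitemap_urls(open("sitemap-yesterday.xml").read())
new_urls = sitemap_urls(requests.get("https://example.com/sitemap.xml", timeout=10).text)

print("New content:", new_urls - old_urls)
print("Disappeared content:", old_urls - new_urls)
```

URLs that only appear in the new set are fresh content; URLs that only appear in the old set are candidates for redirects.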

Why Some Sitemaps Are Harder to Parse Than Others

Parsing sitemaps can be straightforward, or it can be anything but, depending on the size and complexity of the site. Here's why:

  • Large Sites and Sitemap Indexes: If you're dealing with a massive site, a single sitemap isn't enough: the standard caps each sitemap file at 50,000 URLs (and 50 MB uncompressed).

    That's where sitemap indexes come in. They break your sitemap down into smaller, more digestible chunks. But it also means more work for anyone trying to parse them (the sketch after this list walks an index recursively).
     

  • Compressed Sitemaps: Sitemaps can be compressed with gzip to save space and reduce load times. While this is great for efficiency, it adds an extra step to the parsing process: you'll need to decompress them before you can get to their contents (the same sketch handles this too).
     
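Here's a rough sketch of how both wrinkles might be handled together: walking a sitemap index recursively and decompressing any gzipped child sitemaps along the way. The entry-point URL is a placeholder, not a real index:

```python
import gzip
import requests
import xml.etree.ElementTree as ET

NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

def fetch_xml(url: str) -> ET.Element:
    """Fetch a sitemap, decompressing it first if it's gzipped."""
    body = requests.get(url, timeout=10).content
    if url.endswith(".gz") or body[:2] == b"\x1f\x8b":  # gzip magic bytes
        body = gzip.decompress(body)
    return ET.fromstring(body)

def all_page_urls(sitemap_url: str) -> list[str]:
    """Return every page URL, recursing through sitemap indexes as needed."""
    root = fetch_xml(sitemap_url)
    if root.tag.endswith("sitemapindex"):
        urls = []
        for loc in root.findall("sm:sitemap/sm:loc", NS):
            urls.extend(all_page_urls(loc.text.strip()))
        return urls
    return [loc.text.strip() for loc in root.findall("sm:url/sm:loc", NS)]

# Hypothetical entry point -- most sites advertise theirs in robots.txt.
print(len(all_page_urls("https://example.com/sitemap_index.xml")))
```

Checking the gzip magic bytes as well as the file extension also covers sitemaps that are served compressed without a .gz suffix.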

As if that wasn't enough, extracting metadata from pages discovered through the sitemap can get even trickier. Modern web applications, especially Single Page Apps (SPAs), don't always output meta tags from the server. Instead, they might only render them on the client side, making it harder to get the information you need at a glance. In that case you'd have to arrange for a browser to render the page before you can read its tags.
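For those client-side-rendered pages, something along these lines, a sketch using Playwright's Python API with a hypothetical URL, lets a real browser build the DOM before you read the tags:

```python
# Requires: pip install playwright && playwright install chromium
from playwright.sync_api import sync_playwright

def rendered_description(url: str) -> str | None:
    """Load the page in a headless browser so client-side meta tags exist in the DOM."""
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        page.goto(url, wait_until="networkidle")
        tag = page.query_selector('meta[name="description"]')
        content = tag.get_attribute("content") if tag else None
        browser.close()
        return content

# Hypothetical SPA article page.
print(rendered_description("https://example.com/spa-article"))
```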
 

Generating Sitemaps Without Killing Your Server

Sitemaps can be difficult to generate from an implementor's point of view as well. Imagine querying every visitable piece of content from your data sources, turning each into an addressable URL, and extracting useful images from them, all at the whim of someone's sitemap request.

This can be quite a processing-intensive operation and requires some thought about how it is implemented. We will see below that the sites I'm extracting data from take different approaches.

Nefarious actors could quite easily use ad-hoc sitemap generation as an attack vector to overload your systems. Caching helps, of course, but if it isn't configured properly there may be ways around it.
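One defensive approach, and the one RNZ's nightly regeneration (discussed below) hints at, is to build the sitemap on a schedule and serve it as a static file, so incoming requests never trigger the expensive queries. A rough sketch, with a hypothetical data-source function and output path:

```python
import gzip
from xml.etree.ElementTree import Element, SubElement, tostring

def fetch_all_published_pages():
    """Hypothetical data-source query -- replace with your own lookup of articles, pages, etc."""
    return [("https://example.com/articles/hello-world", "2024-08-27")]

def build_sitemap(pages) -> bytes:
    """Render an iterable of (url, lastmod) pairs into sitemap XML."""
    urlset = Element("urlset", xmlns="http://www.sitemaps.org/schemas/sitemap/0.9")
    for url, lastmod in pages:
        entry = SubElement(urlset, "url")
        SubElement(entry, "loc").text = url
        SubElement(entry, "lastmod").text = lastmod
    return tostring(urlset, encoding="utf-8", xml_declaration=True)

def nightly_job():
    xml = build_sitemap(fetch_all_published_pages())
    # Write a gzipped static file (hypothetical path); the web server serves it
    # directly, so a flood of sitemap requests never touches the database.
    with open("/var/www/static/sitemap.xml.gz", "wb") as f:
        f.write(gzip.compress(xml))
```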

Experimenting with NZ News Sites

To put theory into practice, I’ve implemented a sitemap scraper for some New Zealand news sites. These websites are content-heavy, making them perfect candidates for interesting sitemap implementations.

RNZ

RNZ offers gzipped sitemap URL sets in their indexes, which are updated nightly. It’s clear they’ve got a system in place that generates these sitemaps daily, ensuring that the site doesn’t get bogged down with every request. Smart, right?


Stuff

Stuff limits their sitemap to the last 500 articles. This could be a technical choice or a decision influenced by SEO advice. Either way, it’s an interesting approach that keeps things streamlined.
 

NZ Herald

NZ Herald’s sitemaps are more traditional, with regular updates. What’s unique is how they separate sitemaps for the latest articles from those responsible for historical content. This organization helps manage their vast archive more efficiently.


Exercise in Sitemap Parsing

As a fun exercise, I've written some code to parse sitemaps from the news sites mentioned above. Every few hours, this code reads the sitemaps, interprets the meta tags of any new pages (whether they're rendered server-side or client-side), and presents the results in a simple, filtered list.

Curious to see it in action? You can check it out here: https://newsfeeds.co.nz
