Sitemaps are an integral part of how search engines identify which pages your website has to offer. Reading your sitemap, alongside normal crawling activity, gives the search engine a list of URLs from which it will extract information about your pages.
For each page, basic information is usually pulled from its <meta> tags, but there's much more to the story than just that.
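To make that concrete, here's a rough sketch of how you might pull those tags out of a page using nothing but Python's standard library (Python is just my choice of illustration here, and the URL is a placeholder):

```python
from html.parser import HTMLParser
from urllib.request import urlopen


class MetaTagParser(HTMLParser):
    """Collects name/property -> content pairs from <meta> tags."""

    def __init__(self):
        super().__init__()
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attrs = dict(attrs)
        key = attrs.get("name") or attrs.get("property")
        if key and attrs.get("content"):
            self.meta[key] = attrs["content"]


# Placeholder URL; in practice this would be a page discovered via a sitemap.
html = urlopen("https://example.com/some-article").read().decode("utf-8", "replace")
parser = MetaTagParser()
parser.feed(html)
print(parser.meta.get("description"), parser.meta.get("og:title"))
```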
Sitemaps come in handy in a number of ways, but working with them is not always simple. Parsing them can be straightforward, or it can be anything but; it depends on the size and complexity of the site. Here's why:
Large Sites and Sitemap Indexes: If you're dealing with a massive site, a single sitemap isn't enough, because the standard caps each sitemap at 50,000 URLs.
That's where sitemap indexes come in. They break your sitemap down into smaller, more digestible chunks, but this also means more work for anyone trying to parse them: you first fetch the index, then every child sitemap it points to.
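To give a feel for what that extra work looks like, here's a rough sketch of reading a sitemap index with Python's standard library; the index URL is a placeholder, and a real crawler would go on to fetch each child sitemap it finds:

```python
import xml.etree.ElementTree as ET
from urllib.request import urlopen

# The sitemap protocol puts everything under this namespace.
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

# Placeholder URL; real sites often advertise their index in robots.txt.
index_url = "https://example.com/sitemap_index.xml"

root = ET.fromstring(urlopen(index_url).read())

# A sitemap index lists its child sitemaps as <sitemap><loc>...</loc></sitemap>.
child_sitemaps = [loc.text for loc in root.findall("sm:sitemap/sm:loc", NS)]
print(f"{len(child_sitemaps)} child sitemaps to fetch and parse")
```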
Compressed Sitemaps: Sitemaps can be compressed with gzip to save space and reduce load times. While this is great for efficiency, it adds an extra step to the parsing process: you'll need to decompress them before you can get at their contents.
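The decompression itself is only a couple of lines; here's a sketch that checks for the gzip magic bytes so it copes with either compressed or plain child sitemaps (again, the URL is a placeholder):

```python
import gzip
import xml.etree.ElementTree as ET
from urllib.request import urlopen

NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

# Placeholder URL for a gzipped child sitemap.
url = "https://example.com/sitemap-articles-1.xml.gz"

raw = urlopen(url).read()
# Decompress only if the payload actually starts with the gzip magic bytes.
xml_bytes = gzip.decompress(raw) if raw[:2] == b"\x1f\x8b" else raw

root = ET.fromstring(xml_bytes)
page_urls = [loc.text for loc in root.findall("sm:url/sm:loc", NS)]
print(f"{len(page_urls)} page URLs discovered")
```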
As if that weren't enough, extracting metadata from pages discovered through the sitemap can get even trickier. Modern web applications, especially Single Page Apps (SPAs), don't always output meta tags from the server. Instead, they might only render them on the client side, making it harder to get the information you need at a glance. In that case you have to run the page through a browser, typically a headless one, so the client-side code can render the tags before you read them.
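Purely as an illustration (not a claim about how any particular site below needs to be handled), here's one way to do that with a headless browser via Playwright; the URL is a placeholder and og:title is just an example tag:

```python
# Requires: pip install playwright && playwright install chromium
from playwright.sync_api import sync_playwright

# Placeholder URL for a client-rendered page.
url = "https://example.com/spa-article"

with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    # Wait for the network to settle so client-side code has a chance
    # to inject its <meta> tags before we read them.
    page.goto(url, wait_until="networkidle")
    title = page.evaluate(
        "() => document.querySelector('meta[property=\"og:title\"]')?.content"
    )
    browser.close()

print(title)
```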
Sitemaps can be difficult to generate from an implementer's point of view as well. Imagine querying every visitable item from your data sources, turning each one into an addressable URL and extracting useful images from it, all on demand whenever someone requests the sitemap.
That can be quite a resource-intensive operation and needs some thought about how it's implemented. We'll see below that the sites I'm extracting data from take different approaches.
A nefarious actor could quite easily use ad-hoc sitemap generation as an attack vector to overload the system. Caching helps, of course, but if it isn't configured properly there may be ways around it.
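To sketch the idea (and to be clear, this is not how any of the sites below actually implement it), here's a toy on-demand generator fronted by a simple time-based cache; fetch_article_urls is a hypothetical stand-in for whatever database or CMS query a real site would run:

```python
import time
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
CACHE_TTL = 60 * 60  # rebuild at most once an hour
_cache = {"xml": None, "built_at": 0.0}


def fetch_article_urls():
    """Hypothetical data-source query; a real site might hit a database or CMS API."""
    return [
        ("https://example.com/articles/1", "2024-05-01"),
        ("https://example.com/articles/2", "2024-05-02"),
    ]


def build_sitemap():
    """Render a <urlset> document from whatever the data source returns."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for loc, lastmod in fetch_article_urls():
        url = ET.SubElement(urlset, "url")
        ET.SubElement(url, "loc").text = loc
        ET.SubElement(url, "lastmod").text = lastmod
    return ET.tostring(urlset, encoding="unicode")


def sitemap_response():
    """Serve from the cache unless it has gone stale."""
    now = time.time()
    if _cache["xml"] is None or now - _cache["built_at"] > CACHE_TTL:
        _cache["xml"] = build_sitemap()
        _cache["built_at"] = now
    return _cache["xml"]


print(sitemap_response())
```

Pre-generating the files on a schedule, as RNZ appears to do below, sidesteps the problem entirely at the cost of slightly staler data.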
To put theory into practice, I’ve implemented a sitemap scraper for some New Zealand news sites. These websites are content-heavy, making them perfect candidates for interesting sitemap implementations.
RNZ offers gzipped sitemap URL sets in their indexes, updated nightly. It's clear they've got a system in place that generates these sitemaps daily, ensuring the site doesn't get bogged down regenerating them on every request. Smart, right?
Stuff limits their sitemap to the last 500 articles. This could be a technical choice or a decision influenced by SEO advice. Either way, it’s an interesting approach that keeps things streamlined.
NZ Herald's sitemaps are more traditional, with regular updates. What's unique is how they separate the sitemaps for the latest articles from those covering historical content. This organization helps them manage their vast archive more efficiently.
As a fun exercise, I've written some code to parse sitemaps from the news sites mentioned above. Every few hours, this code reads the sitemaps, interprets the meta tags for new pages (whether rendered on the server or on the client), and presents them in a simple, filtered list.
Curious to see it in action? You can check it out here: https://newsfeeds.co.nz