Today we are talking about a file everyone knows: the Sitemap. Most projects on the Internet have one, and very few use it correctly.
I have wanted to talk about this for a while, and the time has come. Most of the time this file goes unnoticed: it is generated automatically by a plugin or module and never touched again. The truth is that it has many uses, which I want to walk you through in this article.
In this article I want to cover some of the situations we run into day to day with Sitemaps, and the solutions we choose depending on the project, its volume and its current state.
What is the Sitemap?
The Sitemap is a file that tells search engines (in this case Google) which URLs a web project contains, so that their robots can crawl the project more efficiently. The sitemap also helps ensure that the robots pick up that information sooner; what they do with it afterwards depends on other factors that I will cover throughout this article.
Formats in which we can generate a Sitemap:
- XML: This is the most widely used format and the one I recommend. Most plugins, modules and extensions for content managers such as WordPress, PrestaShop or Magento use this format.
- RSS: If you have a feed that automatically publishes new content, you can submit it as a sitemap, but beware: most feeds leave out many older pages.
- Text Document: You can also use a .txt file as your Sitemap. It must contain exactly one URL per line.
- Google Sites: Another way to create your Sitemap. I do not recommend it, but Google allows it. Here is all the info: https://support.google.com/webmasters/answer/183668?hl=en&ref_topic=4581190#sitemapformat
- It should be noted that you can also create sitemaps for images, videos or mobile pages, and this is something I see implemented more and more in the projects I work on.
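To make the two simplest formats concrete, here is a minimal sketch in Python of an XML Sitemap and its plain-text equivalent. The function names and the example.com URLs are placeholders of mine, not the output of any particular plugin; the only fixed piece is the sitemaps.org namespace that the XML format requires.

```python
import xml.etree.ElementTree as ET

# Namespace required by the sitemaps.org XML format.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_xml_sitemap(urls):
    """Return a minimal XML Sitemap (as a string) with one <url><loc> per URL."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for url in urls:
        entry = ET.SubElement(urlset, "url")
        ET.SubElement(entry, "loc").text = url  # special characters are escaped for us
    body = ET.tostring(urlset, encoding="unicode")
    return '<?xml version="1.0" encoding="UTF-8"?>\n' + body


def build_text_sitemap(urls):
    """Return the .txt flavour: one URL per line, nothing else."""
    return "\n".join(urls) + "\n"


# Placeholder URLs, for illustration only.
example_urls = ["https://example.com/", "https://example.com/blog/"]
print(build_xml_sitemap(example_urls))
print(build_text_sitemap(example_urls))
```

In a real project the URL list would come from your CMS or database rather than a hard-coded list.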
IMPORTANT: So far we have only reviewed what a sitemap is and which formats you can use. Nothing you cannot find in the Google guidelines, as I always say (everything is there, and nowhere else 😉). In case you did not know all of its uses, I have left several links to the official Google documentation. Ahhhh! Remember that a single Sitemap cannot contain more than 50,000 URLs, although from experience I recommend not exceeding 40,000. When you have many thousands of URLs, you should create a Sitemap index.
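As a rough sketch of how the split works, the Python below cuts a long URL list into files of at most 40,000 URLs and builds the Sitemap index that points at them. The file naming (sitemap-1.xml, sitemap-2.xml...) and the example.com URLs are assumptions for illustration; use whatever naming your project already has.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
MAX_URLS_PER_SITEMAP = 40_000  # conservative limit from the text; the hard cap is 50,000


def split_urls(urls, max_per_file=MAX_URLS_PER_SITEMAP):
    """Split a list of URLs into consecutive chunks of at most max_per_file."""
    return [urls[i:i + max_per_file] for i in range(0, len(urls), max_per_file)]


def build_sitemap_index(sitemap_urls):
    """Return a Sitemap index (as a string) pointing at each child Sitemap file."""
    index = ET.Element("sitemapindex", xmlns=SITEMAP_NS)
    for sitemap_url in sitemap_urls:
        entry = ET.SubElement(index, "sitemap")
        ET.SubElement(entry, "loc").text = sitemap_url
    body = ET.tostring(index, encoding="unicode")
    return '<?xml version="1.0" encoding="UTF-8"?>\n' + body


# 100,001 placeholder URLs end up in three files: 40,000 + 40,000 + 20,001.
urls = [f"https://example.com/product/{n}" for n in range(100_001)]
chunks = split_urls(urls)
child_files = [f"https://example.com/sitemap-{n}.xml" for n in range(1, len(chunks) + 1)]
print([len(chunk) for chunk in chunks])
print(build_sitemap_index(child_files))
```

Each chunk would then be written out as its own XML Sitemap, and only the index is submitted to Search Console.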
Keep the following points in mind, since in most cases they are overlooked:
- Do not include URLs with noindex in the Sitemap
- Do not include URLs in the Sitemap that do not respond with a 200 status code
- Do not include canonicalized URLs (URLs whose canonical points to a different URL) in the Sitemap
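These rules are easy to express as a filter. The sketch below assumes you already have crawl data for each page; the `Page` record and its fields are hypothetical names of mine, not the output format of any specific tool. The canonical rule is applied as "the canonical points to a different URL".

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Page:
    url: str
    status: int                      # HTTP status code returned for the URL
    noindex: bool = False            # True if the page carries a noindex directive
    canonical: Optional[str] = None  # canonical URL declared by the page, if any


def belongs_in_sitemap(page):
    """Return True only if the page passes all three rules."""
    if page.noindex:
        return False
    if page.status != 200:
        return False
    if page.canonical is not None and page.canonical != page.url:
        return False  # canonicalized to a different URL
    return True


pages = [
    Page("https://example.com/", 200, canonical="https://example.com/"),
    Page("https://example.com/old", 301),
    Page("https://example.com/tag/x", 200, noindex=True),
    Page("https://example.com/shoes?color=red", 200, canonical="https://example.com/shoes"),
]
print([p.url for p in pages if belongs_in_sitemap(p)])
# prints ['https://example.com/']
```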
IMPORTANT: The sitemap is not a tool for getting URLs indexed. It is true that a properly generated Sitemap helps crawling, but it does not get URLs indexed by itself, and it is certainly not a mandatory file. A small or medium website does not need a sitemap for Google to index its URLs; good internal linking is enough.
Errors when generating Sitemaps
Over the last few years of auditing websites and working on different projects I have seen everything when it comes to Sitemaps, but what stands out above anything else is the following:
- Including URLs that respond with a 301.
- Including URLs that respond with a 404.
- Including URLs whose canonical points to another URL.
- Including URLs blocked by robots.txt (this is the best one xD).
Checking that a project is not doing any of this is very simple: you only need the project's Sitemap and Screaming Frog (in case you do not have this tool yet, here is a complete guide to Screaming Frog). I explain the process in several steps:
STEP 1: Download the Sitemap file so you can work with it.
STEP 2: Start Screaming Frog >> Mode >> List >> Upload List >> from a file >> select Sitemap xml. This loads your Sitemap so you can analyze it in depth and find any errors it contains.
STEP 3: Identify the errors and generate a correct Sitemap. This gives you a significant crawling improvement. Depending on the state of your Sitemap, this improvement can even help your project start to gain positions.
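If you want to see what this check boils down to, here is a rough Python sketch of the same idea. It is not how Screaming Frog works internally, just a small script equivalent: it reads the URLs from the Sitemap, then flags the four errors listed earlier. The `crawl` lookup is hypothetical and hard-coded here so the example runs without network access; in practice you would fill it from your own crawler.

```python
import xml.etree.ElementTree as ET
from urllib.robotparser import RobotFileParser

LOC_TAG = "{http://www.sitemaps.org/schemas/sitemap/0.9}loc"


def urls_from_sitemap(xml_text):
    """Extract every <loc> value from a Sitemap XML string."""
    root = ET.fromstring(xml_text.encode("utf-8"))
    return [el.text.strip() for el in root.iter(LOC_TAG) if el.text]


def audit_sitemap(xml_text, robots_txt, crawl):
    """Return {url: [problems]} for the URLs that should not be in the Sitemap.

    `crawl` is a hypothetical lookup {url: (status, canonical)}.
    """
    robots = RobotFileParser()
    robots.parse(robots_txt.splitlines())
    report = {}
    for url in urls_from_sitemap(xml_text):
        status, canonical = crawl[url]
        problems = []
        if not robots.can_fetch("Googlebot", url):
            problems.append("blocked by robots.txt")
        if status != 200:
            problems.append(f"responds {status}")
        if canonical and canonical != url:
            problems.append(f"canonical points to {canonical}")
        if problems:
            report[url] = problems
    return report


# Placeholder data, for illustration only.
SITEMAP = """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/</loc></url>
  <url><loc>https://example.com/old</loc></url>
  <url><loc>https://example.com/gone</loc></url>
  <url><loc>https://example.com/private/page</loc></url>
  <url><loc>https://example.com/shoes?color=red</loc></url>
</urlset>"""
ROBOTS = "User-agent: *\nDisallow: /private/\n"
CRAWL = {
    "https://example.com/": (200, "https://example.com/"),
    "https://example.com/old": (301, None),
    "https://example.com/gone": (404, None),
    "https://example.com/private/page": (200, None),
    "https://example.com/shoes?color=red": (200, "https://example.com/shoes"),
}

for url, problems in audit_sitemap(SITEMAP, ROBOTS, CRAWL).items():
    print(url, "->", ", ".join(problems))
```

Every URL that shows up in the report is a candidate for removal from the Sitemap; the URLs that do not appear form the corrected Sitemap.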
When should you perform this check?
Here are some situations where this check is especially important, along with generating a new sitemap so that Googlebot crawls your site more intelligently:
- If you have moved your site to the famous HTTPS, this is a crucial moment to check your Sitemap; you will see how many 3xx responses turn up.
- If you have recently done a migration or changed URLs, you will find surprises such as 301s and 404s xD.
- If you like to play with noindex or you use a plugin to generate your Sitemap, you will almost certainly find noindex URLs included in the Sitemap.
- If you make heavy use of “canonical”, you will almost certainly find unpleasant surprises in your Sitemap.
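For the HTTPS case in particular, a quick first pass does not even need a crawl: any Sitemap URL that still starts with http:// will very likely answer with a 3xx after the move. A minimal sketch, assuming the HTTPS version lives at the same host and path (the usual case, but an assumption):

```python
from urllib.parse import urlsplit


def stale_http_urls(sitemap_urls):
    """Return (old_url, suggested_https_url) pairs for URLs still on plain http."""
    pairs = []
    for url in sitemap_urls:
        parts = urlsplit(url)
        if parts.scheme == "http":
            # Same host, path and query; only the scheme changes.
            pairs.append((url, parts._replace(scheme="https").geturl()))
    return pairs


# Placeholder URLs, for illustration only.
print(stale_http_urls([
    "http://example.com/blog/",
    "https://example.com/contact",
]))
# prints [('http://example.com/blog/', 'https://example.com/blog/')]
```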
2 Advanced uses of the Sitemap
The sitemap has different uses. Here I will explain in which situations I use it and the reason behind each of these actions:
- Accelerate the de-indexing of a large number of URLs thanks to the Sitemap
We start with the first common scenario! We have a number of unnecessary URLs that we want to de-index for whatever reason (I do not want to go into details or this would take forever; in later posts we will address why we often need to de-index URLs). Imagine there are hundreds or thousands of them. You cannot wait for Google to go through each and every one at its usual crawl frequency.
- To accelerate the de-indexing of a large number of URLs, we simply need to generate a Sitemap that includes all the URLs that already carry noindex and upload it to Search Console. For this I have asked my partner Julio to publish a free tool to generate Sitemaps, which you can find here, since Screaming Frog and other tools have problems with this type of URL.
- Once a reasonable amount of time has passed, we take all those URLs and verify that they have been de-indexed using URL Profiler (this tool will be explained later). Simply paste in all the URLs and select the “Google Indexation” option.
- Once they are de-indexed we remove the Sitemap from Search Console.
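The fiddly part of this process is collecting the URLs that already carry noindex. As a rough sketch, and assuming you already have each page's HTML and response headers from your own crawl, the Python below checks both the meta robots tag and the X-Robots-Tag header. The `pages` mapping and the function names are hypothetical, and this is not the tool mentioned above.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect the content of <meta name="robots"> / <meta name="googlebot"> tags."""

    def __init__(self):
        super().__init__()
        self.directives = []

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = {name: (value or "") for name, value in attrs}
        if attr.get("name", "").lower() in ("robots", "googlebot"):
            self.directives.append(attr.get("content", "").lower())


def has_noindex(html, headers=None):
    """Return True if the page declares noindex in its HTML or its headers."""
    for name, value in (headers or {}).items():
        if name.lower() == "x-robots-tag" and "noindex" in value.lower():
            return True
    parser = RobotsMetaParser()
    parser.feed(html)
    return any("noindex" in directive for directive in parser.directives)


def deindex_candidates(pages):
    """pages: hypothetical mapping {url: (html, headers)} from your own crawl."""
    return [url for url, (html, headers) in pages.items() if has_noindex(html, headers)]


# Placeholder pages, for illustration only.
pages = {
    "https://example.com/": ('<meta name="robots" content="index, follow">', {}),
    "https://example.com/tag/x": ('<meta name="robots" content="noindex, follow">', {}),
    "https://example.com/file.pdf": ("", {"X-Robots-Tag": "noindex"}),
}
print(deindex_candidates(pages))
# prints ['https://example.com/tag/x', 'https://example.com/file.pdf']
```

The resulting list is what goes into the temporary Sitemap that you upload to Search Console and remove once the URLs are gone from the index.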
This scenario comes up in many ecommerce sites! Imagine you run an ecommerce that works with seasonal products and you suddenly have to remove categories and products for various reasons. Be careful: when this happens there are several things to check first:
- Check that none of the URLs has external links passing authority to it.
- Check the organic traffic of those URLs, because if some URLs have traffic I would not remove them under any circumstances.
- Check whether similar products exist, because if they do and the URLs being removed have traffic, we could set up a 301 to the similar product.
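One possible way to encode these three checks is a small decision helper. This is only my reading of them, not a fixed rule: the `RetiredUrl` record, the "any value above zero counts" thresholds, the order of the checks, and the "keep" and "review" outcomes are all assumptions that you should adapt to your own project.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RetiredUrl:
    url: str
    external_links: int = 0            # external links pointing at the URL
    organic_visits: int = 0            # organic visits in the period you analyze
    similar_url: Optional[str] = None  # URL of a similar product, if one exists


def decide(page):
    """Suggest what to do with a URL that is about to be retired."""
    if page.organic_visits > 0 and page.similar_url:
        return f"301 to {page.similar_url}"
    if page.organic_visits > 0:
        return "keep"    # it has traffic and nowhere sensible to redirect
    if page.external_links > 0:
        return "review"  # it has authority; decide by hand
    return "remove"


# Placeholder data, for illustration only.
candidates = [
    RetiredUrl("https://example.com/summer-hat", organic_visits=120,
               similar_url="https://example.com/straw-hat"),
    RetiredUrl("https://example.com/summer-scarf", organic_visits=40),
    RetiredUrl("https://example.com/old-promo", external_links=3),
    RetiredUrl("https://example.com/empty-category"),
]
for page in candidates:
    print(page.url, "->", decide(page))
```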