What is an XML Sitemap?
An XML Sitemap is a document that helps search engines such as Google and Bing understand a website’s content. It allows webmasters to add additional information such as a page’s priority, when its content was last updated, and how often it is updated.
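To make this concrete, here is a minimal sketch of what such a document contains, built with Python’s standard library. The tag names (loc, lastmod, changefreq, priority) and the namespace come from the sitemaps.org protocol; the example.com URLs, the date and the build_sitemap helper are made up for illustration.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(pages):
    """Return sitemap XML text for a list of page dicts (hypothetical helper)."""
    # Serialise with a default namespace instead of an "ns0:" prefix.
    ET.register_namespace("", SITEMAP_NS)
    urlset = ET.Element(f"{{{SITEMAP_NS}}}urlset")
    for page in pages:
        url = ET.SubElement(urlset, f"{{{SITEMAP_NS}}}url")
        # Only <loc> is required; the other tags are optional hints.
        for tag in ("loc", "lastmod", "changefreq", "priority"):
            if tag in page:
                ET.SubElement(url, f"{{{SITEMAP_NS}}}{tag}").text = str(page[tag])
    return ET.tostring(urlset, encoding="unicode")


# Illustrative pages only: the URLs and date are assumptions.
example_pages = [
    {"loc": "https://www.example.com/", "lastmod": "2024-01-15",
     "changefreq": "daily", "priority": 1.0},
    {"loc": "https://www.example.com/about", "changefreq": "yearly",
     "priority": 0.3},
]

sitemap_xml = build_sitemap(example_pages)
print(sitemap_xml)
```

In practice your CMS or a plugin usually generates this file for you, so treat the sketch as a way to read a sitemap rather than a recommended way to produce one.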
Why do I need an XML Sitemap?
Search engines do not have the capacity to index your entire website on each visit. Typically, pages with high importance are visited more often than those with low importance. If you have a website with 100,000 pages, then chances are a certain proportion of those pages are updated more often than others.
As a theoretical example, let’s assume only 1% of your content is refreshed by search engines on each visit. Without an XML sitemap, you are leaving it up to search engines to decide which pages to update in their index. A search engine may revisit 1,000 unimportant pages on your site, working top-down through your site hierarchy, and never reach important pages that are several layers deep.
With a properly implemented sitemap, you can tell search engines which pages on your site are updated more often and which matter most, so that the theoretical 1% is spent on the pages you care about.
How do I check my Sitemap is an accurate reflection of my site?
There are a number of different tools and methods to generate an XML sitemap. We assume you already have one implemented and want to check it.
Step 1 – Scrape your website
You will need a tool to scrape your website for a comprehensive list of pages. Again, there are a number of tools out there, such as DeepCrawl or Screaming Frog, which can perform this for you.
If you need help crawling your website, contact us and we can run a crawl for you.
Step 2 – Download XML
Download your XML sitemaps. This can be done using a browser download manager such as Turbo Download Manager for Firefox. Or, for the HTML savvy, you can create a simple HTML page and save it locally.
Once you’ve opened the page you can download XML sitemap(s) locally.
In some cases there is a single XML file; however, there are typically multiple XML sitemaps. You will need to download each XML sitemap locally and import each one.
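If you would rather script this step, here is a rough sketch of how to tell a sitemap index (a file that lists other sitemaps) from a regular sitemap and list what it points to. The fetch and list_locations helpers and the example.com URLs are assumptions for illustration; fetch is defined but not called, so the example runs without network access.

```python
import urllib.request
import xml.etree.ElementTree as ET

NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def fetch(url):
    """Download a sitemap and return its raw bytes (not called below)."""
    with urllib.request.urlopen(url, timeout=30) as response:
        return response.read()


def list_locations(xml_data):
    """Return (kind, locations); kind is 'index' or 'urlset'."""
    root = ET.fromstring(xml_data)
    # The root tag looks like '{namespace}sitemapindex' or '{namespace}urlset'.
    kind = "index" if root.tag.endswith("sitemapindex") else "urlset"
    path = "sm:sitemap/sm:loc" if kind == "index" else "sm:url/sm:loc"
    return kind, [(loc.text or "").strip() for loc in root.findall(path, NS)]


# Hypothetical sitemap index, standing in for a downloaded file.
sample_index = """<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap><loc>https://www.example.com/post-sitemap.xml</loc></sitemap>
  <sitemap><loc>https://www.example.com/page-sitemap.xml</loc></sitemap>
</sitemapindex>"""

kind, locations = list_locations(sample_index)
print(kind, locations)
```

When the result is an index, each listed location is a separate sitemap that you would download and import in turn, for example by passing it to fetch.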
Tip: WordPress Users:
If you are using a plugin such as Yoast SEO to generate your XML sitemap, you can open the URL in a browser and download it directly thanks to its XSLT (eXtensible Stylesheet Language Transformations) stylesheet.
Step 3 – Importing XML sitemap into Excel
Once you have saved the document locally, fire up Excel and start a blank worksheet.
- Select DATA on the Excel Ribbon
- Select “From Other Sources“
- Select “From XML Data Import”
- Select the location of your XML sitemap.
- Click OK
- Rename the sheet to something like “XML Sitemap”
- Create a new column and call it “Source“. Paste “XML Sitemap” values down, this helps with comparison later.
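If Excel struggles with a large file, the same import can be sketched in Python. This is a minimal illustration, not a replacement for the steps above: it reads the URLs out of a sitemap and tags each row with a Source value, mirroring the sheet layout. The sitemap_rows helper and the sample data are hypothetical.

```python
import xml.etree.ElementTree as ET

NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def sitemap_rows(xml_data, source="XML Sitemap"):
    """Turn a sitemap into a list of row dicts with Source, URL and lastmod."""
    root = ET.fromstring(xml_data)
    rows = []
    for url in root.findall("sm:url", NS):
        rows.append({
            "Source": source,
            "URL": url.findtext("sm:loc", default="", namespaces=NS).strip(),
            # lastmod is optional in a sitemap, so it may be empty.
            "lastmod": url.findtext("sm:lastmod", default="", namespaces=NS).strip(),
        })
    return rows


# Hypothetical sitemap content, standing in for your downloaded file.
sample_sitemap = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://www.example.com/</loc><lastmod>2024-01-15</lastmod></url>
  <url><loc>https://www.example.com/about</loc></url>
</urlset>"""

rows = sitemap_rows(sample_sitemap)
for row in rows:
    print(row)
```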
Step 4 – Importing Website crawl into Excel
- Create a new sheet and call it “Website crawl”
- Import data from your favourite crawling tool into this sheet (Please note: Import Internal HTML URLs only. Including other filetypes will cause issues later)
- Create a new column called “Source”, use “Website Crawl” as the value and paste it down
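The “Internal HTML URLs only” filter can also be applied in code if your crawler exports a CSV. The sketch below assumes the export has a URL column called “Address” and a MIME type column called “Content Type”; these names are assumptions and vary between tools and versions, so check your own export and pass the right names in.

```python
import csv
import io


def crawl_rows(csv_file, url_column="Address", type_column="Content Type",
               source="Website Crawl"):
    """Keep only HTML pages from a crawl export and tag them with a Source."""
    rows = []
    for record in csv.DictReader(csv_file):
        content_type = (record.get(type_column) or "").lower()
        # Skip images, scripts, PDFs and anything else that is not HTML.
        if "text/html" not in content_type:
            continue
        rows.append({"Source": source, "URL": record[url_column].strip()})
    return rows


# Hypothetical export; with a real file, use open("crawl.csv", newline="").
sample_export = io.StringIO(
    "Address,Content Type\n"
    "https://www.example.com/,text/html; charset=UTF-8\n"
    "https://www.example.com/logo.png,image/png\n"
    "https://www.example.com/about,text/html\n"
)

crawl = crawl_rows(sample_export)
for row in crawl:
    print(row)
```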
Tip – Keyboard shortcut to quickly populate cells:
- Press CTRL + C in the first cell with the value. The cell will have marching ants around it.
- Press the down arrow key to select the cell below
- Press and hold the Shift key, tap END and then press the down arrow key. This selects all cells in the column to the end of the table.
- Press CTRL + V and all the values will be pasted.
Step 5 – Comparing XML Sitemap vs Website crawl
- Create a new sheet and call it “Comparison”
- Add two column headings, “Source” and “URL”
- Copy both tables to the Comparison table, excluding the header row. To quickly select all the rows in a table, use the keyboard shortcut from the tip in Step 4.
Step 6 – Analysing the difference between the XML sitemap and Website crawl
Now that you have collated the data, you can use Excel’s built-in tools to do some basic comparison. First, we need to highlight duplicate values so we can exclude them. A duplicate is a URL that appears in both the XML sitemap and the website crawl. Those we aren’t really interested in; we’re interested in the URLs missing from the XML sitemap, or vice versa.
Important: Save your document. The next set of instructions can be processor intensive, so it is worth saving now in case Excel crashes or becomes unresponsive.
- Select the column with the URLs
- Click HOME in the Excel Ribbon
- Click on Conditional Formatting
- Click Duplicate Values
- Press OK – the default settings for Excel are fine.
Now we will need to filter out the duplicate values as these exist in both the XML sitemap and Website crawl.
Step 7 – Hide duplicate entries from the XML sitemap and Website crawl
- Select your two columns, “Source” and “URL” (Tip: if you click on the column headings, e.g. A & B, it’ll select all rows.)
In order to see what remains we want to hide any values that have been highlighted in Excel as duplicates.
If you have a lot of URLs this may take a few seconds to process, so please be patient.
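For very large sites where Excel becomes unresponsive, the highlight-and-hide steps can be approximated in a few lines of Python. This is a sketch under the assumption that both tables are lists of dicts with “Source” and “URL” keys; the unique_rows helper and the sample rows are hypothetical.

```python
from collections import Counter


def unique_rows(sitemap, crawl):
    """Stack both tables and keep only rows whose URL appears exactly once."""
    combined = sitemap + crawl
    counts = Counter(row["URL"] for row in combined)
    # A count above 1 is what Excel would highlight as a duplicate value.
    return [row for row in combined if counts[row["URL"]] == 1]


# Hypothetical rows standing in for the two imported sheets.
sitemap = [
    {"Source": "XML Sitemap", "URL": "https://www.example.com/"},
    {"Source": "XML Sitemap", "URL": "https://www.example.com/orphan"},
]
crawl = [
    {"Source": "Website Crawl", "URL": "https://www.example.com/"},
    {"Source": "Website Crawl", "URL": "https://www.example.com/new-page"},
]

remaining = unique_rows(sitemap, crawl)
for row in remaining:
    print(row["Source"], row["URL"])
```

As in the spreadsheet, a URL listed twice within the same source is also treated as a duplicate and hidden. Note that this comparison is an exact string match, so URLs that differ only by a trailing slash or letter case count as different.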
Step 8 – Analysis
Now the remaining cells can be described as follows:
- URLs categorised as “Website Crawl” – These URLs exist on the website, but do not exist in the XML sitemap. These can indicate an issue with your XML sitemap generation.
- URLs categorised as “XML Sitemap” – These are URLs in the XML sitemap that were not found by your website crawler. This can indicate an overall problem with the crawlability of your website, or the existence of “Orphaned Pages” which have no internal links. Pages without internal links are unlikely to rank well.
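The two categories above amount to set differences, which can be sketched directly. The compare helper, its result keys and the sample URLs are made up for illustration; the inputs are assumed to be plain lists of URL strings.

```python
def compare(sitemap_urls, crawl_urls):
    """Split URLs into the categories described in the analysis step."""
    sitemap_set, crawl_set = set(sitemap_urls), set(crawl_urls)
    return {
        # Found by the crawler but absent from the sitemap.
        "missing_from_sitemap": sorted(crawl_set - sitemap_set),
        # Listed in the sitemap but never reached by the crawler.
        "not_found_by_crawl": sorted(sitemap_set - crawl_set),
        # Present in both sources, so nothing to investigate.
        "in_both": sorted(sitemap_set & crawl_set),
    }


# Hypothetical URL lists standing in for your two sources.
sitemap_urls = ["https://www.example.com/", "https://www.example.com/orphan"]
crawl_urls = ["https://www.example.com/", "https://www.example.com/new-page"]

report = compare(sitemap_urls, crawl_urls)
for category, urls in report.items():
    print(category, urls)
```

Because sets ignore repeats, a URL that appears twice within one source is still categorised correctly here, which is one small difference from the duplicate-highlighting approach in Excel.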
This data provides basic information about how your website looks to a search engine crawler versus the XML sitemap (handrail). It can help identify issues with XML sitemap generation, crawlability, URL canonicalization and duplicate content.
(The sales bit) If you need any help with SEO on your website, please feel free to get in touch