Robots.txt is a standard text file placed in the public root directory of a website to tell web robots (typically search engine bots) which parts of the site they may crawl.

Robots.txt is easily overlooked when building a website. It has no visual effect on the site; it works by instructing web crawlers, which are ‘invisible to the eye’. It is nonetheless an important part of building a website, because it helps focus the attention of web crawlers on the pages that matter most to your website or business.

How do Robots.txt and crawlers work?

A robots.txt file is a plain text file containing a list of statements (often called directives). Crawlers read these statements to learn which pages you would like them to visit and which to avoid. Steering them this way helps web crawlers discover new content on your website, enriching search engine results.

Once the search engine crawlers have found the pages they are allowed to gather content from, they analyse each page and add it to the index of the search engine they work for. This process is commonly known as spidering, and the crawlers as spiders, because they move across the world wide web from link to link. Web crawlers also follow your internal links to get from one page to another, which is why internal linking matters: you want visiting crawlers to reach and index as many of your important pages as possible.

Examples of Robots.txt Statements

There are many statements which you can use within a robots.txt file, all of which have a unique function. Below are some of the most commonly used statements within a robots.txt file.

Basic Robots.txt format

A robots.txt file is complete with just the following two statements:

User-agent: Googlebot
Disallow: /example-url-folder/example-page.php

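In the statement above, we are telling Googlebot not to crawl (or visit) the example page. Googlebot will read that statement and skip example-page.php. Note that blocking crawling is not quite the same as blocking indexing: a disallowed URL can still appear in search results if other pages link to it, although the search engine will not have read its content.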
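One way to sanity-check a rule like this is with Python's standard-library urllib.robotparser module. The sketch below is only an illustration: it feeds the two statements above to the parser and asks whether a given crawler may fetch a URL. The www.example.com addresses and the "SomeOtherBot" name are placeholders, not part of any real site.

```python
from urllib.robotparser import RobotFileParser

# The two-line file from the example above, one statement per line.
rules = """\
User-agent: Googlebot
Disallow: /example-url-folder/example-page.php
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

# Placeholder URLs used purely for illustration.
blocked_page = "https://www.example.com/example-url-folder/example-page.php"
other_page = "https://www.example.com/example-url-folder/another-page.php"

print(parser.can_fetch("Googlebot", blocked_page))     # Googlebot is told to stay away
print(parser.can_fetch("Googlebot", other_page))       # no rule covers this page
print(parser.can_fetch("SomeOtherBot", blocked_page))  # the rule names Googlebot only
```

Because the file only names Googlebot, any other crawler is free to fetch the page, which is why most sites also include a User-agent: * group.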

Advanced use of Robots.txt

You can do a lot more than declaring pages for a web crawler to look at. A robots.txt file is great for declaring other parts of your website, including:

  • Sitemap: You can declare where your sitemap is located. This helps web crawlers find pages that may be deep down in your directories.
  • Crawl-delay: You can set a crawl delay, which tells a crawler how long to wait before moving from one page to the next. Some crawlers ignore it (Googlebot, for example, does not support this statement), but its primary use is to bring down server load, keeping things as stable as possible for your website.
  • Allow: This statement declares pages a crawler may visit even though the directory they sit in is otherwise off limits. It is supported by Googlebot and the other major search engine crawlers.
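As a rough illustration of these three statements, the sketch below parses a small made-up file with Python's urllib.robotparser and reads the values back. The paths, the sitemap URL, the 10-second delay and the "AnyBot" name are all assumptions for the example. Two caveats: site_maps() needs Python 3.8 or later, and Python's parser applies the first rule that matches, whereas Google applies the most specific (longest) one, so the Allow line is placed first here to get the same answer from both.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical file for illustration only.
rules = """\
Sitemap: https://www.example.com/sitemap.xml

User-agent: *
Crawl-delay: 10
Allow: /private/public-page.html
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

# Sitemap locations declared in the file (Python 3.8+).
print(parser.site_maps())

# Delay, in seconds, requested from crawlers that honour Crawl-delay.
print(parser.crawl_delay("AnyBot"))

# The Allow line opens one page inside an otherwise blocked directory.
print(parser.can_fetch("AnyBot", "/private/public-page.html"))
print(parser.can_fetch("AnyBot", "/private/other-page.html"))
```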

You can use wildcards to shorten the amount you type, for example:

  • The asterisk (*) is a widely supported wildcard that represents any sequence of characters. For example, Disallow: /*/*/example.php matches example.php nested at least two directories deep, so it can be used to declare patterns within directories.
  • The dollar sign ($) marks the end of a URL, which makes it useful for matching a file type. This is particularly handy if you want to hide a specific set of files, for example: Disallow: /*.php$
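To make the wildcard behaviour concrete, here is a minimal sketch of how such patterns can be matched against a URL path. The rule_matches helper is hypothetical and written for this article; it is not taken from any crawler. Python's built-in urllib.robotparser does not interpret * or $, which is why the sketch translates the pattern into a regular expression by hand.

```python
import re


def rule_matches(pattern: str, path: str) -> bool:
    """Return True if a robots.txt path pattern matches the given URL path.

    '*' stands for any run of characters; a trailing '$' pins the
    pattern to the end of the URL. Matching always starts at the
    beginning of the path.
    """
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape the literal pieces and join them with '.*' for each asterisk.
    regex = ".*".join(re.escape(piece) for piece in pattern.split("*"))
    if anchored:
        regex += "$"
    return re.match(regex, path) is not None


print(rule_matches("/*/*/example.php", "/a/b/example.php"))
print(rule_matches("/*/*/example.php", "/a/example.php"))
print(rule_matches("/*.php$", "/index.php"))
print(rule_matches("/*.php$", "/index.php?page=2"))
```

The last two calls show what the dollar sign adds: without it, /*.php would also match /index.php?page=2, because a pattern only has to match the start of the URL.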

To find out a bit more about how these could be useful for your robots.txt file, check out the robots.txt documentation on Google.

A full example of a robots.txt file

Seeing a correct robots.txt file can sometimes make it a bit easier to implement. Find below a working example of a robots.txt file. This was taken from our very own robots.txt file here at Bravr Digital Marketing:

Sitemap: http://www.bravr.com/sitemap_index.xml

User-agent: *
Disallow: /lp/*
Disallow: /lp
Disallow: /cat/
Disallow: /c/
Disallow: /tag/

#Allow Googlebot
User-agent: Googlebot
Allow: /*/*.css
Allow: /*/*.*.css
Allow: /*/*.ttf
Allow: /*/*.woff

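You can see that we have allowed Googlebot to see our design elements, such as CSS and font files. This is so that Google can understand our template when the bot renders the website. Bear in mind that a crawler obeys only the most specific group that names it, so Googlebot follows its own group here and does not also inherit the Disallow lines listed under User-agent: *.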
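How a crawler picks its group can be checked with a short sketch. The file below is a shortened version of the example above, and "SomeOtherBot" is a made-up crawler name. Python's urllib.robotparser does not interpret the wildcards in the Allow line, but that does not matter here: the point is only which group each crawler ends up using.

```python
from urllib.robotparser import RobotFileParser

# Shortened version of the example file, for illustration only.
robots_txt = """\
User-agent: *
Disallow: /lp
Disallow: /tag/

User-agent: Googlebot
Allow: /*/*.css
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# Googlebot has a group of its own, which contains no Disallow lines.
googlebot_allowed = parser.can_fetch("Googlebot", "/tag/seo/")

# A crawler without its own group falls back to the User-agent: * group.
other_allowed = parser.can_fetch("SomeOtherBot", "/tag/seo/")

print(googlebot_allowed, other_allowed)
```

If a rule should apply to every crawler, including one that has its own group, it needs to be repeated inside that group.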

How do I know if I have a robots.txt file?

A robots.txt file is accessible to anyone and lives in the same place on every website: the root of the domain, for example www.example.com/robots.txt. On the server this is your public root directory (often named public_html), which is where all of your public website files can be found.

If you were to place the file in a different directory, such as www.example.com/directory/robots.txt, the bots would not find it and it would have no effect.
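Because the location is fixed, the robots.txt address for any page can be worked out from the page's own URL. A minimal sketch, using a hypothetical robots_url helper and placeholder example.com addresses:

```python
from urllib.parse import urlsplit, urlunsplit


def robots_url(page_url: str) -> str:
    """Return the robots.txt address that governs the given page URL."""
    parts = urlsplit(page_url)
    # Keep the scheme and host, replace the path, drop query and fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


print(robots_url("https://www.example.com/directory/page.html?ref=home"))
```

Opening the resulting address in a browser is the quickest way to see whether a site has a robots.txt file.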

If you can’t find your robots.txt file, chances are you don’t have one. We recommend contacting your website developers to get one set up. Alternatively, drop us an email here at Bravr; we will be happy to make sure your robots.txt is present and working!

Robots.txt best practices for Search Engine Optimisation

Be sure to check the following before going live with your robots.txt. Declaring the wrong statement can have a negative effect on your website, so some things to look out for include:

  • Do not use robots.txt to keep web crawlers away from sensitive information, such as quotes for clients, personal information and half-built pages. The file is public and not every crawler obeys it, so a different blocking method such as password-protected pages is far better and safer.
  • Be aware of the multiple types of web crawlers. There are hundreds of them. Cater to the needs of the top crawlers such as Googlebot and bingbot (which replaced the older msnbot). Remember that some of the major search engines also run several variants, such as Google’s image bot.
  • If you need a fast turnaround on your robots.txt file, remember that you can ask Google to re-read it through Google Search Console (formerly Google Webmaster Tools). This speeds up how quickly the search engine picks up your changes.
  • As mentioned throughout this article, be sure that you are disallowing and allowing the correct parts of your website. This is the single most important thing when creating a robots.txt.
  • Remember that a robots.txt file serves as a site-wide declaration. If you want specific pages to be noindexed, it is worth looking into meta robots tags, which let you set rules page by page. A crawler has to be able to fetch a page to see its meta robots tag, so do not also disallow that page in robots.txt.

Robots.txt is just one part of Search Engine Optimisation. To find out a bit more about SEO, check out our dedicated SEO pages.

How to create and upload a robots.txt file

To create a robots.txt file you will need to open a text editor; Notepad is probably the easiest to use. Once you have created your document, write each of your statements on its own line, so that web crawlers can read them correctly. If you would like to leave a note within your robots.txt, you can use a # to turn the rest of that line into a comment that crawlers ignore, for example:

# The text on this line is ignored by crawlers because of the hash
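If you would rather generate the file than type it by hand, the layout described above (an optional comment, then one statement per line) can be sketched in a few lines of Python. The build_robots_txt helper and the values passed to it are assumptions made up for this example.

```python
def build_robots_txt(groups, sitemap=None, comment=None):
    """Build robots.txt text with one statement per line.

    groups maps a user-agent name to the list of paths to disallow.
    """
    lines = []
    if comment:
        lines.append(f"# {comment}")
    if sitemap:
        lines.append(f"Sitemap: {sitemap}")
        lines.append("")
    for user_agent, disallowed_paths in groups.items():
        lines.append(f"User-agent: {user_agent}")
        lines.extend(f"Disallow: {path}" for path in disallowed_paths)
        lines.append("")  # blank line between groups
    # End the file with exactly one newline.
    return "\n".join(lines).rstrip("\n") + "\n"


text = build_robots_txt(
    {"*": ["/cat/", "/tag/"]},
    sitemap="https://www.example.com/sitemap.xml",
    comment="Draft rules, check before uploading",
)
print(text)
```

The returned text can then be saved as robots.txt and uploaded as described below.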

Once you have written your statements, check them and make sure that you are allowing/disallowing the correct directories and files. It would be disastrous if you disallowed your important pages to Googlebot.
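This check can be partly automated. The sketch below is one possible approach, not a complete validator: the find_blocked helper, the draft rules and the list of important pages are all hypothetical. It uses Python's urllib.robotparser, which does not understand the * and $ wildcards, so rules that rely on them still need checking by hand or with a search engine's own testing tool.

```python
from urllib.robotparser import RobotFileParser


def find_blocked(robots_txt, user_agents, important_paths):
    """Return (user_agent, path) pairs that the draft file would block."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [
        (agent, path)
        for agent in user_agents
        for path in important_paths
        if not parser.can_fetch(agent, path)
    ]


# Hypothetical draft with a mistake: /services is an important section.
draft = """\
User-agent: *
Disallow: /cat/
Disallow: /services
"""

problems = find_blocked(
    draft,
    user_agents=["Googlebot", "bingbot"],
    important_paths=["/", "/services/seo/", "/blog/"],
)
for agent, path in problems:
    print(f"{agent} would be blocked from {path}")
```

An empty result means none of the listed pages is blocked for the listed crawlers under the plain prefix rules in the file.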

Once you are happy with your robots.txt file, save it locally so you have a backup. Then upload it to your website via FTP to your root directory (typically public_html).

Once you have done that you are all set! Sit back, safe in the knowledge that well-behaved web crawlers visiting your website will be going to the correct pages.