Guide to SEO: Robots.txt
This post is part of the series Guide to SEO. To dive into more topics in this series, check out the posts related to keywords and meta descriptions.

What is the Robot Exclusion Protocol (REP)?
Since it was introduced in the early ‘90s, REP has become the de facto standard by which web publishers specify which parts of their site they want public and which parts they want to keep private. I use robots.txt on this site to keep the search engines away from parts of the site that contain content I feel does not need to be indexed. Sites often exclude private data that they don’t want indexed, like an 'admin' page, or certain filetypes, like JavaScript or PDF files. This is where the Robot Exclusion Protocol, or robots.txt, comes in.
Wikipedia’s definition of REP:
The robot exclusion standard, also known as the Robots Exclusion Protocol or robots.txt protocol, is a convention to prevent cooperating web spiders and other web robots from accessing all or part of a website which is otherwise publicly viewable. Robots are often used by search engines to categorize and archive web sites, or by webmasters to proofread source code. The standard complements Sitemaps, a robot inclusion standard for websites.
The REP is still evolving based on the needs of the entire internet community; however, there isn’t a true standard that is followed by all of the major search engines. Although Google has worked with Microsoft and Yahoo, they each have their own implementation of the protocol. As of September 2020, Google has shared some of its latest open source robots.txt projects on its webmasters page. Google's open sourced robots.txt parser and matcher library can be a great place to do some digging to understand their implementation a bit further. It is important to understand that there are differences between each major search engine's implementation.
This article will explore Google’s implementation thoroughly and briefly touch on the differences between Microsoft’s and Yahoo’s implementations. My explanations are based on the detailed documentation Google has released on how it implements REP.
Robots.txt Directives
According to Google’s documentation these directives are implemented by all three major search engines: Google, Microsoft, and Yahoo.
- Disallow
- Allow
- Wildcard Support
- Sitemaps Location
Disallow
Google: Tells a crawler not to index your site – your site’s robots.txt file still needs to be crawled to find this directive, however disallowed pages will not be crawled.
Use Cases: ‘No Crawl’ page from a site. This directive in the default syntax prevents specific path(s) of a site from being crawled.
Note: This is probably the most commonly used directive. Since Googlebot penalizes duplicate content, I use this extensively in my robots.txt to hide duplicate content. This directive is also useful if you want to hide private or subscription data from bots. Although subscription pages will probably be password protected, you should also add these pages to the disallow just in case.
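As a quick sketch, a Disallow rule looks like the following; the /admin/ and /subscribers/ paths are placeholders for illustration, not paths from this site:

```
# Hypothetical example: keep all crawlers out of an admin area and a subscriber-only area
User-agent: *
Disallow: /admin/
Disallow: /subscribers/
```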
Allow
Google: Tells a crawler the specific pages on your site you want indexed, so you can use this in combination with Disallow.
Use Cases: This is useful in particular in conjunction with Disallow clauses, where a large section of a site is disallowed except for a small section within it.
Note: When Allow and Disallow rules conflict, the more specific (longer) matching rule wins in Google's implementation, so an Allow rule can override a broader Disallow. This is helpful if you want to allow a specific page in a directory that would normally be disallowed.
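For example, here is a sketch of Allow and Disallow working together; the directory and file names are made up for illustration:

```
# Block an entire directory except for one public page within it
User-agent: *
Disallow: /members/
Allow: /members/signup.html
```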
Wildcard Support
Google: Tells a crawler to match a sequence of characters
Use Cases: ‘No Crawl’ URLs with certain patterns, for example, disallow URLs with session ids or other extraneous parameters
Note: This is probably the second most used directive, in conjunction with the disallow. It allows you to match multiple directories at once. One word of caution here is to test your robots.txt with Google’s webmaster tools.
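A sketch of the * wildcard in use; the sessionid parameter name is just an illustrative example of an extraneous parameter:

```
# Block any URL containing a session id, regardless of the directory it appears in
User-agent: *
Disallow: /*sessionid=
```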
$ Wildcard Support
Google: Tells a crawler to match everything from the end of a URL – large number of directories without specifying specific pages
Use Cases: ‘No Crawl’ files with specific patterns, for example, files with certain filetypes that always have a certain extension, say pdf
Note: If you have an upload folder for your blog or website and you want to restrict a specific filetype but allow images to be indexed, you could use this directive.
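A sketch of the $ wildcard for that scenario; the /uploads/ path is a stand-in for your own upload folder:

```
# Block PDF files inside the uploads folder while leaving images crawlable
User-agent: *
Disallow: /uploads/*.pdf$
```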
Sitemaps Location
Google: Tells a crawler where it can find your Sitemaps
Use Cases: Point to other locations where feeds exist to help crawlers find URLs on a site
Note: It is a good idea to create a sitemap for your website and include it in your robots.txt. You can also tell Google where your sitemap is through Google’s webmaster tools.
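A minimal sketch of the directive; example.com is a placeholder domain:

```
# Point crawlers at the sitemap using a fully qualified URL
Sitemap: https://www.example.com/sitemap.xml
```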
HTML META Directives
Not only can you provide rules that search engine bots must follow through robots.txt, you can also specify rules per HTML page. This is often required for sites that want the search spider to follow links through to other pages but to refrain from indexing a specific page. For example, a webmaster may want the search spiders to follow the links from category and archive pages but to exclude the category and archive listings themselves, since they contain duplicate content.
The following HTML META directives are implemented by all three major search engines: Google, Microsoft, and Yahoo.
NOINDEX META Tag
NOFOLLOW META Tag
NOSNIPPET META Tag
NOARCHIVE META Tag
I will first give you the exact description given by Google’s documentation and then give you my own explanation along with examples when necessary.
NOINDEX META Tag
Google: Tells a crawler not to index a given page.
Use Cases: Don’t index the page. This allows pages that are crawled to be kept out of the index.
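A minimal sketch of what this looks like in a page’s head element:

```
<!-- Keep this page out of the index; links on it may still be followed -->
<meta name="robots" content="noindex">
```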
NOFOLLOW META Tag
Google: Tells a crawler not to follow a link to other content on a given page.
Use Cases: Prevent publicly writable areas from being abused by spammers looking for link credit. By using NOFOLLOW, you let the robot know that you are discounting all outgoing links from this page.
Note: A good place to put this tag is on outgoing links in comment areas. Wikipedia uses this method on all external links placed on wiki pages. It should also be noted that this can be applied not only to entire pages but also to individual links:
Example: <a href="http://Somespamlink.com" rel="nofollow">Some comment spam</a>
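The page-level version, for comparison, goes in the head element and discounts every outgoing link on the page:

```
<!-- Discount all outgoing links on this page -->
<meta name="robots" content="nofollow">
```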
NOSNIPPET META Tag
Google: Tells a crawler not to display snippets in the search results for a given page.
Use Cases: Present no snippet for the page on Search Results.
NOARCHIVE META Tag
Google: Tells a search engine not to show a “cached” link for a given page.
Use Cases: Do not make available to users a copy of the page from the Search Engine Cache.
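NOSNIPPET and NOARCHIVE are often set together; a sketch combining them in a single meta tag:

```
<!-- Show no snippet and no cached link for this page in search results -->
<meta name="robots" content="nosnippet, noarchive">
```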
Targeting Specific Search Spiders (User-Agents)
Each visitor to a website identifies itself with a user-agent string. This is the same string we can use in our robots.txt to specify different rules for different search spiders. For example, you could deny access to your archive articles because of Google’s duplicate content penalty while allowing the same archives to Yahoo’s bot.
Not only can Googlebot be identified with its user-agent string, it can also be verified using reverse DNS based authentication. This provides an alternative way to confirm the identity of the crawler.
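A sketch of what those per-bot rules might look like; the /archives/ path is a placeholder, and Yahoo’s crawler is assumed to identify itself as Slurp:

```
# Keep Googlebot out of the archives
User-agent: Googlebot
Disallow: /archives/

# Let Yahoo's crawler (Slurp) crawl everything
User-agent: Slurp
Disallow:
```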
Here are some common user-agents for search engines:
Googlebot - Google
Googlebot-Image - Google Image
msnbot-Products - Windows Live Search
Mediapartners-Google - Google Adsense
DuckDuckBot - Duck Duck Go
Baiduspider - Baidu
ia_archiver - Alexa
Bingbot - Bing
msnbot-media/ - MSN Media
W3C_*Validator - W3C Validator
teoma - Ask Jeeves
msnbot-NewsBlogs/ - MSN News Blogs
You can check out a full list of user agents for search engine crawlers.
You may ask, so where can I put these rules?
You can put these rules in all forms of HTML and non-HTML documents. The most common place for them is robots.txt. This file is checked by cooperating bots before they crawl a page, and they will adhere to the rules in the file for the entire domain. All you have to do is create a text file named ‘robots.txt’ in the root directory of your domain.
These robot exclusions can also be applied to non-HTML files, like PDF and video files, using the X-Robots-Tag. You place these directives in the file’s HTTP response header.
Google also has a webmaster tool that will simulate its bots visiting your site. It will show you which URLs were excluded due to your robots.txt, and you can even modify your robots.txt for testing purposes through the tool. This will help make sure you don’t accidentally exclude the wrong directories or pages.
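A sketch of what that header might look like in the response for a PDF; how you configure your server to send it will vary:

```
HTTP/1.1 200 OK
Content-Type: application/pdf
X-Robots-Tag: noindex, noarchive
```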
Robots.txt for WordPress
Since the default installation of WordPress is full of duplicate content, I have created a robots.txt file to focus the Google bot’s attention on indexing the actual articles. Before I implemented my robots.txt file, my RSS feed pages and archive listings were indexed over the actual articles. Here is an example of my robots.txt that removes most of the duplicate content in WordPress. If you want to know more about WordPress’s duplicate content problems, check out Duplicate Content Causes SEO Problems in Wordpress.
```
User-agent: *
Disallow: /wp-
Disallow: /search
Disallow: /feed
Disallow: /comments/feed
Disallow: /feed/$
Disallow: /*/feed/$
Disallow: /*/feed/rss/$
Disallow: /*/trackback/$
Disallow: /*/*/feed/$
Disallow: /*/*/feed/rss/$
Disallow: /*/*/trackback/$
Disallow: /*/*/*/feed/$
Disallow: /*/*/*/feed/rss/$
Disallow: /*/*/*/trackback/$
```
Conclusion
There was a lot to cover and, as with all things in the technology space, you must keep up with the changes as things evolve over time. If this post whetted your appetite to learn more about the Robots Exclusion Protocol, then I suggest reading [Improving on Robots Exclusion Protocol](https://webmasters.googleblog.com/2008/06/improving-on-robots-exclusion-protocol.html) on the webmasters section of the Google Blog.
Thanks for your time!