Robots.txt: Guide to the Robots Exclusion Standard

by Prashant Kumar Sharma

robots.txt, also known as the Robots Exclusion Protocol or Robots Exclusion Standard, is a convention to prevent cooperating web spiders and other web robots from accessing all or part of a website that is otherwise publicly viewable. Robots are often used by search engines to categorize and archive websites, or by webmasters to proofread source code. The standard is unrelated to, but can be used in conjunction with, Sitemaps, a robot inclusion standard for websites.

Here are a few facts I have discovered while working with robots.txt that are difficult to find anywhere else:

  • The robots.txt patterns are matched by simple substring comparisons against the start of the URL path — see the sketch after this list.
  • The standard was developed in 1994.
  • The protocol is purely advisory: it relies on the cooperation of the web robot, so marking an area of a site out of bounds with robots.txt does not guarantee privacy.
  • robots.txt must reside in the top-level directory (the root, /) of the website.
  • The following search engines use robots.txt:
  1. Lycos
  2. AltaVista
  3. Yahoo
  4. Bing
  5. Google
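
To see the matching in action, here is a minimal Python sketch using the standard library's urllib.robotparser module; the example.com URLs are placeholders:

import urllib.robotparser

rp = urllib.robotparser.RobotFileParser()
# Feed the rules in directly rather than fetching them over the network
rp.parse([
    "User-agent: *",
    "Disallow: /private/",
])

# Matching is a plain prefix comparison on the URL path
print(rp.can_fetch("*", "http://www.example.com/private/page.html"))  # False
print(rp.can_fetch("*", "http://www.example.com/public/page.html"))   # True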

I have tried to cover all the available directives in the form of a tutorial, with comprehensive examples and tips.

List of Available Directives:

1. Allow

This example allows all robots to visit all files because the wildcard “*” specifies all robots:

User-agent: *
Allow: /

2. Disallow
This example keeps all robots out:

User-agent: *
Disallow: /


The next is an example that tells all crawlers not to enter four directories of a website:

User-agent: *
Disallow: /cgi-bin/
Disallow: /images/
Disallow: /tmp/
Disallow: /private/

3. User-agent
Example that tells a specific crawler not to enter one specific directory:

User-agent: BadBot # replace the ‘BadBot’ with the actual user-agent of the bot
Disallow: /private/

Example that tells all crawlers not to enter one specific file:

User-agent: *
Disallow: /directory/file.html

Note that all other files in the specified directory will be processed.
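
To illustrate how a parser applies a group only to the named bot, here is a small sketch with urllib.robotparser; "BadBot" and "GoodBot" are placeholder user-agent names:

import urllib.robotparser

rp = urllib.robotparser.RobotFileParser()
rp.parse([
    "User-agent: BadBot",   # placeholder user-agent name
    "Disallow: /private/",
])

# Only the bot named in the matching User-agent group is restricted
print(rp.can_fetch("BadBot", "http://www.example.com/private/x.html"))   # False
print(rp.can_fetch("GoodBot", "http://www.example.com/private/x.html"))  # True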

4. Comments #
Example demonstrating how comments can be used:
# Comments appear after the “#” symbol at the start of a line, or after a directive

User-agent: * # match all bots
Disallow: / # keep them out
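
Parsers discard everything from the "#" to the end of the line before matching, so the rules still apply as written; a quick sketch with urllib.robotparser:

import urllib.robotparser

rp = urllib.robotparser.RobotFileParser()
rp.parse([
    "User-agent: * # match all bots",
    "Disallow: /   # keep them out",
])

# The inline comments are stripped, so the effective rule is simply "Disallow: /"
print(rp.can_fetch("*", "http://www.example.com/index.html"))  # False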

Nonstandard extensions

5. Crawl-delay directive
Several major crawlers support a Crawl-delay parameter, set to the number of seconds to wait between successive requests to the same server:

User-agent: *
Crawl-delay: 10
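
Python's urllib.robotparser (3.6 and later) exposes this value through crawl_delay(); a minimal sketch:

import urllib.robotparser

rp = urllib.robotparser.RobotFileParser()
rp.parse([
    "User-agent: *",
    "Crawl-delay: 10",
])

delay = rp.crawl_delay("*")  # 10 here; None when no Crawl-delay is given
print(delay)
# A polite crawler would time.sleep(delay) between requests to the same host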

6. Allow directive
Some major crawlers support an Allow directive, which can counteract a following Disallow directive. This is useful when you disallow an entire directory but still want some HTML documents in that directory crawled and indexed. By the standard implementation the first matching robots.txt pattern always wins; Google’s implementation differs in that it first evaluates all Allow patterns and only then all Disallow patterns, while Bing uses whichever Allow or Disallow directive is the most specific.

To be compatible with all robots, if you want to allow single files inside an otherwise disallowed directory, place the Allow directive(s) first, followed by the Disallow, for example:

Allow: /folder1/myfile.html
Disallow: /folder1/

This example will Disallow anything in /folder1/ except /folder1/myfile.html, since the latter will match first. In the case of Google, though, the order is not important.
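
Because implementations differ, the "first match wins" behaviour is easiest to see in a toy evaluator. The sketch below is not any particular crawler's algorithm, just a simple first-match prefix matcher of the kind the original draft describes:

def first_match_allowed(path, rules):
    """rules is an ordered list of ("Allow" | "Disallow", prefix) pairs;
    the first rule whose prefix matches the path decides the outcome."""
    for kind, prefix in rules:
        if path.startswith(prefix):
            return kind == "Allow"
    return True  # no rule matched: crawling is allowed by default

rules = [
    ("Allow", "/folder1/myfile.html"),
    ("Disallow", "/folder1/"),
]
print(first_match_allowed("/folder1/myfile.html", rules))  # True
print(first_match_allowed("/folder1/other.html", rules))   # False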

7. Sitemap
Some crawlers support a Sitemap directive, allowing multiple Sitemaps in the same robots.txt in the form:

Sitemap: http://www.gstatic.com/s2/sitemaps/profiles-sitemap.xml
Sitemap: http://www.google.com/hostednews/sitemap_index.xml
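
Python's urllib.robotparser (3.8 and later) returns these entries via site_maps(); a minimal sketch using the same URLs as above:

import urllib.robotparser

rp = urllib.robotparser.RobotFileParser()
rp.parse([
    "Sitemap: http://www.gstatic.com/s2/sitemaps/profiles-sitemap.xml",
    "Sitemap: http://www.google.com/hostednews/sitemap_index.xml",
])

print(rp.site_maps())  # a list with both URLs, or None if no Sitemap lines exist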

8. Extended standard
An Extended Standard for Robot Exclusion has been proposed, which adds several new directives, such as Visit-time and Request-rate. For example:

User-agent: *
Disallow: /downloads/
Request-rate: 1/5 # maximum rate is one page every 5 seconds
Visit-time: 0600-0845 # only visit between 06:00 and 08:45 UTC (GMT)
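
Visit-time never gained wide parser support, but Request-rate in this form is recognised by Python's urllib.robotparser (3.6 and later); a rough sketch based on the example above:

import urllib.robotparser

rp = urllib.robotparser.RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /downloads/",
    "Request-rate: 1/5",
])

rate = rp.request_rate("*")          # a (requests, seconds) named tuple, or None
print(rate.requests, rate.seconds)   # 1 5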
