Gonzalez

How to Set Up a Robots.txt File

Writing a robots.txt file is extremely easy. It's just an ASCII text file that you place at the root of your domain. For example, if your domain is www.example.com, place the file at www.example.com/robots.txt. For those who don't know what an ASCII text file is, it's just a plain text file that you create with a type of program called an ASCII text editor. If you use Windows, you already have an ASCII text editor on your system, called Notepad. (Note: of the editors that ship with Windows, only Notepad produces plain ASCII text; do not use WordPad, Write, or Word.)

The file basically lists the name of a spider on one line, followed by the directories or files that spider is not allowed to access on subsequent lines, with each directory or file on a separate line. You can use the wildcard character "*" (just the asterisk, without the quotes) instead of naming a specific spider; when you do so, the rules apply to all spiders. Note that the robots.txt file is a robots exclusion file (with the emphasis on "exclusion"): there is no universal way to tell spiders to include any file or directory.

Take the following robots.txt file for example:

User-agent: *
Disallow: /cgi-bin/

The above two lines, when inserted into a robots.txt file, inform all robots (since the wildcard asterisk "*" character was used) that they are not allowed to access anything in the /cgi-bin/ directory and its descendants. That is, they are not allowed to access /cgi-bin/whatever.cgi, or even a file or script in a subdirectory of /cgi-bin/, such as /cgi-bin/anything/whichever.cgi.
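You can check how such rules behave with Python's standard urllib.robotparser module; a quick sketch (the example.com URLs are illustrative):

```python
from urllib import robotparser

rp = robotparser.RobotFileParser()
# parse() accepts the file's contents as a list of lines,
# so the rules can be tested without a live web server
rp.parse([
    "User-agent: *",
    "Disallow: /cgi-bin/",
])

# Anything under /cgi-bin/ is blocked; everything else is allowed
print(rp.can_fetch("*", "http://www.example.com/cgi-bin/whatever.cgi"))  # False
print(rp.can_fetch("*", "http://www.example.com/index.html"))            # True
```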

If you have a particular robot in mind, such as the Google image search robot, which collects images on your site for the Google Image search engine, you may include lines like the following:

User-agent: Googlebot-Image
Disallow: /

This means that the Google image search robot, "Googlebot-Image", should not try to access any file in the root directory "/" or any of its subdirectories. This effectively bans it from getting any file from your entire website.

You can have multiple Disallow lines for each user agent (i.e., for each spider). Here is an example of a longer robots.txt file:

User-agent: *
Disallow: /images/
Disallow: /cgi-bin/

User-agent: Googlebot-Image
Disallow: /

The first block of text disallows all spiders from the images directory and the cgi-bin directory. The second block of code disallows the Googlebot-Image spider from every directory.
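Again, urllib.robotparser can confirm that each spider picks up its own block of rules; a sketch where the made-up name "SomeOtherBot" stands in for any spider not listed by name:

```python
from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /images/",
    "Disallow: /cgi-bin/",
    "",
    "User-agent: Googlebot-Image",
    "Disallow: /",
])

# Googlebot-Image matches its own block and is shut out everywhere
print(rp.can_fetch("Googlebot-Image", "http://www.example.com/page.html"))  # False
# Any other spider falls back to the "*" block
print(rp.can_fetch("SomeOtherBot", "http://www.example.com/page.html"))     # True
print(rp.can_fetch("SomeOtherBot", "http://www.example.com/images/a.jpg"))  # False
```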

It is possible to exclude a spider from indexing a particular file. For example, if you don't want Google's image search robot to index a particular picture, say, mymugshot.jpg, you can add the following:

User-agent: Googlebot-Image
Disallow: /images/mymugshot.jpg

Remember to add the trailing slash ("/") if you are indicating a directory. If you simply add

User-agent: *
Disallow: /privatedata

the robots will be disallowed from accessing privatedata.html and privatedataandstuff.html, as well as the entire directory tree beginning at /privatedata/. In other words, the Disallow line is a prefix match: there is an implied wildcard character following whatever you list.
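This prefix-matching behaviour is easy to verify with urllib.robotparser as well; a minimal sketch (the file names are illustrative):

```python
from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /privatedata",
])

# The rule is a simple prefix match, so all of these are blocked
for url in ("http://www.example.com/privatedata.html",
            "http://www.example.com/privatedataandstuff.html",
            "http://www.example.com/privatedata/report.html"):
    print(url, rp.can_fetch("*", url))  # False each time
```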

Where Do You Get the Name of the Robots?

If you have a particular spider in mind which you want to block, you first have to find out its name. The best way to do this is to check the website of the search engine concerned. Respectable engines usually have a page somewhere that explains how to prevent their spiders from accessing certain files or directories.

Common Mistakes in Robots.txt

Here are some mistakes commonly made by those new to writing robots.txt rules.

  1. It's Not Guaranteed to Work

     As mentioned earlier, although the robots.txt format is described in a document called "A Standard for Robot Exclusion", not all spiders and robots bother to heed it. Listing something in your robots.txt is no guarantee that it will be excluded. If you really need to block a particular spider ("bot"), you should use a .htaccess file to block it. Alternatively, you can password-protect the directory (also with a .htaccess file).

  2. Don't List Your Secret Directories

     Anyone can access your robots.txt file, not just robots. For example, typing http://www.google.com/robots.txt into your browser will get you Google's own robots.txt file. Some new webmasters seem to think that they can list their secret directories in their robots.txt file to prevent those directories from being accessed. Far from it: listing a directory in a robots.txt file often attracts attention to it. In fact, some spiders (like certain spammers' email harvesting robots) make it a point to check the robots.txt for excluded directories to spider.

  3. Only One Directory/File per Disallow Line

     Don't try to be smart and put multiple directories on one Disallow line. This will probably not work the way you think, since the Robots Exclusion Standard only provides for one directory per Disallow statement.
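If you do need to block a bot outright with .htaccess, a minimal sketch looks like the following. (This assumes an Apache server with mod_rewrite enabled, and "BadBot" is a placeholder for the actual User-Agent string of the bot you want to block.)

```
RewriteEngine On
# Match the bot's User-Agent string, case-insensitively
RewriteCond %{HTTP_USER_AGENT} BadBot [NC]
# Refuse every request from that bot with 403 Forbidden
RewriteRule .* - [F,L]
```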

How to Specify All the Files on Your Website

An update to the robots.txt format allows you to link to a Sitemaps protocol file that gives search engines a list of all the pages on your website.
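A Sitemap line can be added anywhere in the robots.txt file and is independent of any User-agent block; the URL below is illustrative and should point at your actual sitemap file:

```
Sitemap: http://www.example.com/sitemap.xml
```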
