Creating a robots.txt File
Teacher: Sumantra Roy
Some people
believe that they should create different pages for
different search engines, each page optimized for
one keyword and for one search engine. Now, while I
don't recommend that people create different pages
for different search engines, if you do decide to
create such pages, there is one issue that you need
to be aware of.
These pages,
although optimized for different search engines,
often turn out to be pretty similar to each other.
The search engines can now detect
when a site has created such similar-looking pages,
and they are penalizing or even banning such sites. To
prevent your site from being penalized for
spamming, you need to stop each search engine's
spider from indexing pages that are not meant for
it, i.e. you need to prevent AltaVista from
indexing pages meant for Excite and vice versa. The
best way to do that is to use a robots.txt
file.
You should create
a robots.txt file using a text editor like Windows
Notepad. Don't use your word processor to create
such a file.
Here is the basic
syntax of the robots.txt file:
User-Agent: [Spider Name]
Disallow: [File Name]
For instance, to
tell AltaVista's spider, Scooter, not to spider the
file named myfile1.html residing in the root
directory of the server, you would write
User-Agent: Scooter
Disallow: /myfile1.html
To tell Excite's
spider, called ArchitextSpider, not to spider the
files myfile2.html and myfile3.html, you would
write
User-Agent: ArchitextSpider
Disallow: /myfile2.html
Disallow: /myfile3.html
You can, of
course, put multiple User-Agent statements in the
same robots.txt file. Hence, to tell AltaVista not
to spider the file named myfile1.html, and to tell
Excite not to spider the files myfile2.html and
myfile3.html, you would write
User-Agent: Scooter
Disallow: /myfile1.html

User-Agent: ArchitextSpider
Disallow: /myfile2.html
Disallow: /myfile3.html
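As a rough illustration of how a spider might read the combined file above, the following Python sketch feeds the same lines to the standard library's urllib.robotparser module and asks which files each spider may fetch. The spider names and file names come from the text; the host www.example.com and the helper allowed() are placeholders assumed for illustration. Real spiders may interpret the file somewhat differently from this parser.

```python
from urllib.robotparser import RobotFileParser

# The combined rules from the text; records are separated by a blank line.
ROBOTS_TXT = """\
User-Agent: Scooter
Disallow: /myfile1.html

User-Agent: ArchitextSpider
Disallow: /myfile2.html
Disallow: /myfile3.html
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# www.example.com is a placeholder host, not a site from the article.
BASE = "http://www.example.com"


def allowed(spider: str, path: str) -> bool:
    """Return True if this parser would let `spider` fetch `path`."""
    return parser.can_fetch(spider, BASE + path)


if __name__ == "__main__":
    for spider in ("Scooter", "ArchitextSpider"):
        for path in ("/myfile1.html", "/myfile2.html", "/myfile3.html"):
            verdict = "allowed" if allowed(spider, path) else "blocked"
            print(spider, path, verdict)
```

Each spider is blocked only from the files listed in its own record and remains free to fetch everything else.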
If you want to
prevent all robots from spidering the file named
myfile4.html, you can use the * wildcard character
in the User-Agent line, i.e. you would
write
User-Agent: *
Disallow: /myfile4.html
However, under the
original robots.txt standard you cannot use the
wildcard character in the Disallow line.
Once you have
created the robots.txt file, you should upload it
to the root directory of your domain. Uploading it
to any sub-directory won't work - the robots.txt
file needs to be in the root directory.
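To make the root-directory rule concrete, here is a small Python sketch that works out where a spider would look for robots.txt, given the address of any page on a site. The function name robots_txt_url and the example.com address are assumptions made for illustration only.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_txt_url(page_url: str) -> str:
    """Return the root-level robots.txt URL for the site serving `page_url`."""
    parts = urlsplit(page_url)
    # Keep the scheme and host; replace the path and drop query and fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


if __name__ == "__main__":
    # Placeholder address; any page on the site maps to the same location.
    print(robots_txt_url("http://www.example.com/travel/pages/index.html"))
```

However deep the page sits in the directory tree, the result always points at the root of the domain, which is why a copy uploaded to a sub-directory is never consulted.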
I won't discuss
the syntax and structure of the robots.txt file any
further - you can get the complete specifications
from http://www.robotstxt.org/wc/norobots.html
Now we come to
how the robots.txt file can be used to prevent your
site from being penalized for spamming in case you
are creating different pages for different search
engines. What you need to do is to prevent each
search engine from spidering pages which are not
meant for it.
For simplicity,
let's assume that you are targeting only two
keywords: "tourism in Australia" and "travel to
Australia". Also, let's assume that you are
targeting only four of the major search engines:
AltaVista, Excite, HotBot and Northern Light.
Now, suppose you
have adopted the following naming convention for
the files: each page is named by joining the
individual words of the keyword for which the page
is being optimized with hyphens, and then appending
the first two letters of the name of the search
engine for which the page is being optimized.
Hence, the files
for AltaVista are
tourism-in-australia-al.html
travel-to-australia-al.html
The files for
Excite are
tourism-in-australia-ex.html
travel-to-australia-ex.html
The files for
HotBot are
tourism-in-australia-ho.html
travel-to-australia-ho.html
The files for
Northern Light are
tourism-in-australia-no.html
travel-to-australia-no.html
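The naming convention can be sketched in a few lines of Python. The keywords and search engine names are taken from the text; the function name page_name is hypothetical, and lower-casing the result is an assumption chosen to match the file names listed above.

```python
# Keywords and search engines named in the text.
KEYWORDS = ["tourism in Australia", "travel to Australia"]
ENGINES = ["AltaVista", "Excite", "HotBot", "Northern Light"]


def page_name(keyword: str, engine: str) -> str:
    """Build a file name: hyphenated keyword plus a two-letter engine suffix."""
    slug = "-".join(keyword.lower().split())
    suffix = engine[:2].lower()
    return f"{slug}-{suffix}.html"


if __name__ == "__main__":
    for engine in ENGINES:
        for keyword in KEYWORDS:
            print(page_name(keyword, engine))
```

Note that the convention only works while no two search engines share the same first two letters.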
As I noted
earlier, AltaVista's spider is called Scooter and
Excite's spider is called
ArchitextSpider.
A list of spiders
for the major search engines can be found at
http://www.searchenginewatch.com/webmasters/spiderchart.html
From this list,
we find that the spider for Northern Light is
called Gulliver. HotBot uses Inktomi
and Inktomi's spider is called Slurp. Using this
knowledge, here's what the robots.txt file should
contain:
User-Agent: Scooter
Disallow: /tourism-in-australia-ex.html
Disallow: /travel-to-australia-ex.html
Disallow: /tourism-in-australia-ho.html
Disallow: /travel-to-australia-ho.html
Disallow: /tourism-in-australia-no.html
Disallow: /travel-to-australia-no.html

User-Agent: ArchitextSpider
Disallow: /tourism-in-australia-al.html
Disallow: /travel-to-australia-al.html
Disallow: /tourism-in-australia-ho.html
Disallow: /travel-to-australia-ho.html
Disallow: /tourism-in-australia-no.html
Disallow: /travel-to-australia-no.html

User-Agent: Slurp
Disallow: /tourism-in-australia-al.html
Disallow: /travel-to-australia-al.html
Disallow: /tourism-in-australia-ex.html
Disallow: /travel-to-australia-ex.html
Disallow: /tourism-in-australia-no.html
Disallow: /travel-to-australia-no.html

User-Agent: Gulliver
Disallow: /tourism-in-australia-al.html
Disallow: /travel-to-australia-al.html
Disallow: /tourism-in-australia-ex.html
Disallow: /travel-to-australia-ex.html
Disallow: /tourism-in-australia-ho.html
Disallow: /travel-to-australia-ho.html
When you put the
above lines in the robots.txt file, you instruct
each search engine not to spider the files meant
for the other search engines.
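Typing out a file like this by hand is error-prone, so you may prefer to generate it. The Python sketch below is one possible way to do so, assuming the file naming convention described earlier. The spider names come from the text; the names SPIDERS, SLUGS and build_robots_txt are hypothetical and exist only for this illustration.

```python
# Spider names as given in the text, keyed by the two-letter file suffix.
SPIDERS = {
    "al": "Scooter",
    "ex": "ArchitextSpider",
    "ho": "Slurp",
    "no": "Gulliver",
}
SLUGS = ["tourism-in-australia", "travel-to-australia"]


def build_robots_txt(spiders: dict[str, str], slugs: list[str]) -> str:
    """Return robots.txt text hiding every other engine's pages from each spider."""
    records = []
    for own_suffix, spider in spiders.items():
        lines = [f"User-Agent: {spider}"]
        for suffix in spiders:
            if suffix == own_suffix:
                continue  # a spider may fetch the pages meant for it
            for slug in slugs:
                lines.append(f"Disallow: /{slug}-{suffix}.html")
        records.append("\n".join(lines))
    # Records are separated by a blank line.
    return "\n\n".join(records) + "\n"


if __name__ == "__main__":
    print(build_robots_txt(SPIDERS, SLUGS), end="")
```

Adding a new keyword or search engine then means changing one entry in SLUGS or SPIDERS rather than editing every record.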
When you have
finished creating the robots.txt file, double-check
it to make sure it contains no errors. A small error
can have disastrous consequences: a search engine may
spider files which are not meant for it, in which case
it can penalize your site for spamming, or it may not
spider any files at all, in which case you won't
get top rankings in that search engine.
A useful tool to
check the syntax of your robots.txt file can be
found at http://www.tardis.ed.ac.uk/~sxw/robots/check/.
While it will help you correct syntactical errors
in the robots.txt file, it won't help you correct
any logical errors, for which you will still need
to go through the robots.txt file thoroughly, as
mentioned above.
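Part of the hunt for logical errors can be automated. The following Python sketch, which uses the standard library's urllib.robotparser module, reports every case where a spider can reach a page meant for another search engine or is blocked from its own page. The function name audit, the sample data FLAWED and PAGES, and the host www.example.com are all assumptions made for illustration, and a real spider may not read the file exactly as this parser does.

```python
from urllib.robotparser import RobotFileParser


def audit(robots_lines, pages_by_spider, base="http://www.example.com"):
    """Return a list of problems: each spider should reach only its own pages."""
    parser = RobotFileParser()
    parser.parse(robots_lines)
    problems = []
    for spider in pages_by_spider:
        for owner, pages in pages_by_spider.items():
            for page in pages:
                can_fetch = parser.can_fetch(spider, f"{base}/{page}")
                if owner == spider and not can_fetch:
                    problems.append(f"{spider} is blocked from its own page {page}")
                elif owner != spider and can_fetch:
                    problems.append(f"{spider} can fetch {page}, meant for {owner}")
    return problems


# A deliberately flawed two-spider file: the Slurp record omits one line.
FLAWED = [
    "User-Agent: Scooter",
    "Disallow: /tourism-in-australia-ho.html",
    "Disallow: /travel-to-australia-ho.html",
    "",
    "User-Agent: Slurp",
    "Disallow: /tourism-in-australia-al.html",
]
PAGES = {
    "Scooter": ["tourism-in-australia-al.html", "travel-to-australia-al.html"],
    "Slurp": ["tourism-in-australia-ho.html", "travel-to-australia-ho.html"],
}

if __name__ == "__main__":
    for problem in audit(FLAWED, PAGES):
        print(problem)
```

An empty result does not prove the file is right, but any message it prints points at a line worth re-reading.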
About the teacher:
Sumantra
is one of the most respected search engine
positioning specialists on the Internet. To have
Sumantra's company place your site at the top of
the search engines, go to http://www.1stSearchRanking.com/
For more advice on how you can take your web site
to the top of the search engines, subscribe to his
FREE newsletter by going to http://www.1stSearchRanking.com/newsletter.htm