Robots.txt
Robots Exclusion Standard. It sounds like something straight out of a science fiction book, but it is really nothing more than a tool to prevent web spiders and robots from accessing a particular section of your website, or even your entire website, that you don't want indexed.
The standard goes by many names, like the Robot Exclusion Protocol, but you most likely have heard of it as the robots.txt protocol. No matter what you may call it, it is a handy tool that, when used properly, can help your site's ranking in search results.
The standard was created in June of 1994 to handle robots that were accessing deep virtual trees, attacking servers with a succession of rapid requests, and downloading certain files over and over again.
Despite its name, the Robots Exclusion Standard is not backed by any acting body or organization. Nor is it enforced by anyone, and there are no guarantees that any present or future robots will comply with it. There is a movement involving what is known as ACAP, or Automated Content Access Protocol, that is seeking to update the Robots Standard, and perhaps govern it.
In order to stop web spiders and web robots from accessing and indexing every inch of your website, you use a file known as robots.txt.
As the filename suggests, robots.txt is a text file.
It contains data that tells a robot whether or not it can access certain areas of your site.
You store the file in the top-level (root) directory of your site.
If you have sub-domains, then each one will require its own robots.txt file. If a sub-domain has no file of its own, then your rules will apply to yoursite.com but not to, say, sample.yoursite.com.
Some examples of top-level directories are:
www(dot)sample(dot)com/robots.txt
www(dot)blog(dot)com/robots.txt
www(dot)url(dot)com/robots.txt
Examples of sub-domain directories where you would store the robots.txt file:
www(dot)your(dot)sample(dot)com/robots.txt
www(dot)some(dot)sample(dot)com/robots.txt
www(dot)bad(dot)sample(dot)net/robots.txt
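Because the file always sits at the root of a host, its address can be worked out from any page on that host. Here is a minimal Python sketch of that rule. The function name robots_url and the page addresses are made up for illustration, and blog.sample.com is a hypothetical sub-domain.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_url(page_url: str) -> str:
    """Return the robots.txt address for the host that serves page_url."""
    parts = urlsplit(page_url)
    # Keep the scheme and host, replace the path with /robots.txt,
    # and drop any query string or fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


# The main site and a sub-domain each get their own file.
print(robots_url("http://www.sample.com/images/photo.jpg"))
# http://www.sample.com/robots.txt
print(robots_url("http://blog.sample.com/2010/05/post.html"))
# http://blog.sample.com/robots.txt
```

Note that the two hosts produce two different addresses, which is why each sub-domain needs its own file.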
Your First Steps
Your first step is to create a new text file. You can use Notepad, but Word and OpenOffice work just as well, so long as you save the file as plain text (.txt). The robots.txt file uses two basic lines, User-agent and Disallow. User-agent names the spider or bot that the rules apply to. Disallow lists the directory or file that you do not want that bot or spider to crawl.
If you do not want any part of your site indexed (your site will be hidden),
you would type the following into your text file:
User-agent: *
Disallow: /
In this example, the “*” is known as a wildcard and says that the rule applies to all bots. A wildcard is a special character that could stand for anything. In typical usage, if you write d*ng, a computer can interpret this as being: “ding”, “dang”, “dong”, “dung”, “dzing” and so forth. Simply put, the “*” could be anything.
The Disallow part says that no directory or file should be scanned. It's important to note how this works. The patterns in the Disallow line are matched by comparing them against the beginning of the path.
The robot sees what is written there and says, “Does the path to this directory or file begin with this?” For instance, let's say our site is www(dot)sample(dot)com. If I have a directory called “images,” it would be listed as www(dot)sample(dot)com/images/.
In this instance, the bot sees that the path /images/ begins with “/” and will ignore it. Since every path on the site begins with “/”, the rule covers the whole site.
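If you want to check a rule like this before you upload it, Python's standard library includes a robots.txt parser, urllib.robotparser. The sketch below feeds it the two lines above and asks whether a bot may fetch a couple of addresses. The bot name ExampleBot and the addresses are made up for illustration, and this shows how Python's parser reads the rules, not necessarily how every bot does.

```python
from urllib.robotparser import RobotFileParser

rules = """\
User-agent: *
Disallow: /
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

# Every path starts with "/", so nothing on the site may be fetched.
print(parser.can_fetch("ExampleBot", "http://www.sample.com/images/"))  # False
print(parser.can_fetch("ExampleBot", "http://www.sample.com/"))         # False
```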
To allow search engines to visit and view every file and directory,
you would write this in your file:
User-agent: *
Disallow:
Again, User-Agent uses the wildcard to say that whatever is in the Disallow line applies to all bots. Since the Disallow is blank, there is nothing to match, and so all files and directories are available.
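The same kind of check works for the allow-everything file. This is a small sketch using Python's urllib.robotparser, with a made-up bot name (ExampleBot) and made-up addresses.

```python
from urllib.robotparser import RobotFileParser

rules = """\
User-agent: *
Disallow:
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

# A blank Disallow excludes nothing, so every address may be fetched.
print(parser.can_fetch("ExampleBot", "http://www.sample.com/images/"))      # True
print(parser.can_fetch("ExampleBot", "http://www.sample.com/secret.html"))  # True
```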
If you want every bot to ignore one directory, you would write:
User-agent: *
Disallow: /images/
Again, the wildcard says all bots should follow the Disallow. The Disallow asks the bots to stay away from /images/. If the bots are compliant, they won't scan this directory or the files therein. Note that I wrote “/images/” and not “/images”. You always want to include that final forward slash (/); without it, the rule would also match any other path that begins with /images, such as /imagestwo/.
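To see why the final slash matters, here is a small sketch that compares the two spellings using Python's urllib.robotparser. The helper name load, the bot name ExampleBot, and the photo address are made up for illustration.

```python
from urllib.robotparser import RobotFileParser


def load(lines):
    """Build a parser from a list of robots.txt lines."""
    parser = RobotFileParser()
    parser.parse(lines)
    return parser


with_slash = load(["User-agent: *", "Disallow: /images/"])
without_slash = load(["User-agent: *", "Disallow: /images"])

other_dir = "http://www.sample.com/imagestwo/photo.jpg"

# "/imagestwo/photo.jpg" does not begin with "/images/", so it stays open.
print(with_slash.can_fetch("ExampleBot", other_dir))     # True
# It does begin with "/images", so the shorter rule blocks it as well.
print(without_slash.can_fetch("ExampleBot", other_dir))  # False
```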
To tell all bots not to scan a specific file, we use this code:
User-agent: *
Disallow: /images/biggorillaonatricycle.jpg
Now all bots should scan everything except the biggorillaonatricycle image. When a bot finds that picture in the “images” directory, it looks away, even though, let's face it, who wouldn't want to see that? An important thing to note here is that if we had, say, a secondary directory (named "imagestwo" perhaps) that held some photos and included the same picture, the bots would still scan that one, unless you told them otherwise.
Here is how you could make it so that neither of the pictures of our buddy the gorilla riding on his tricycle gets scanned:
User-agent: *
Disallow: /images/biggorillaonatricycle.jpg
Disallow: /imagestwo/biggorillaonatricycle.jpg
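The difference between the one-line and two-line versions can be checked with Python's urllib.robotparser. This is only a sketch: the helper name load and the bot name ExampleBot are made up, and the site address is the sample one used earlier.

```python
from urllib.robotparser import RobotFileParser


def load(lines):
    """Build a parser from a list of robots.txt lines."""
    parser = RobotFileParser()
    parser.parse(lines)
    return parser


one_rule = load([
    "User-agent: *",
    "Disallow: /images/biggorillaonatricycle.jpg",
])
two_rules = load([
    "User-agent: *",
    "Disallow: /images/biggorillaonatricycle.jpg",
    "Disallow: /imagestwo/biggorillaonatricycle.jpg",
])

second_copy = "http://www.sample.com/imagestwo/biggorillaonatricycle.jpg"

# With one rule, the copy in /imagestwo/ is still open to bots.
print(one_rule.can_fetch("ExampleBot", second_copy))   # True
# With both rules, neither copy may be fetched.
print(two_rules.can_fetch("ExampleBot", second_copy))  # False
```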
This rule applies to directories as well:
User-agent: *
Disallow: /images/
Disallow: /imagestwo/
Disallow: /aboutus/
The above tells all bots to ignore the three directories. Note that we can also mix our directories and files together:
User-agent: *
Disallow: /images/
Disallow: /imagestwo/
Disallow: /aboutus/wearereally.html
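A quick way to confirm that a mixed list behaves as intended is to test a few addresses against it. The sketch below uses Python's urllib.robotparser. The bot name ExampleBot and the files logo.png and index.html are made up for illustration.

```python
from urllib.robotparser import RobotFileParser

rules = """\
User-agent: *
Disallow: /images/
Disallow: /imagestwo/
Disallow: /aboutus/wearereally.html
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

site = "http://www.sample.com"

# Anything inside a disallowed directory is blocked.
print(parser.can_fetch("ExampleBot", site + "/images/logo.png"))           # False
# The single disallowed file is blocked.
print(parser.can_fetch("ExampleBot", site + "/aboutus/wearereally.html"))  # False
# Other files in the same directory remain open.
print(parser.can_fetch("ExampleBot", site + "/aboutus/index.html"))        # True
```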
So far we have focused primarily on how to limit files. Now we will work with limiting specific bots from accessing our files.
If we want to tell one specific bot to stay out of all of our directories, we input the following code into our robots.txt file:
User-agent: Googlebot
Disallow: /
Now Google should stay away from all of our directories.
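A rule aimed at one bot leaves every other bot untouched. The sketch below checks this with Python's urllib.robotparser, assuming the user-agent token Googlebot for Google's crawler. ExampleBot is a made-up name standing in for any other bot.

```python
from urllib.robotparser import RobotFileParser

rules = """\
User-agent: Googlebot
Disallow: /
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

page = "http://www.sample.com/images/"

# The named bot is shut out of the whole site.
print(parser.can_fetch("Googlebot", page))   # False
# A bot with no matching record is not restricted at all.
print(parser.can_fetch("ExampleBot", page))  # True
```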
We can also tell a specific bot to ignore one or more directories or files, like so:
User-agent: WebZIP
Disallow: /images/
Disallow: /secrets/globaldomination.html
And finally, if we want to specify that several bots are not allowed to access a directory or file, we can do so in this manner:
User-agent: WebZIP
Disallow: /images/
Disallow: /secrets/globaldomination.html
User-agent: Fetch
Disallow: /images/
Disallow: /secrets/globaldomination.html
Disallow: /tmp/
User-agent: MSIECrawler
Disallow: /images/
Disallow: /secrets/globaldomination.html
User-agent: WebCopier
Disallow: /images/
Disallow: /secrets/globaldomination.html
Disallow: /cgi-bin/
Whenever you add a bot, you must leave a blank line before its User-agent line, which tells the interpreter that this is a new record.
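The sketch below writes the four records with a blank line between each one and checks that every bot receives only its own rules. It uses Python's urllib.robotparser. The site address is the sample one used earlier, and ExampleBot is a made-up name for a bot that has no record.

```python
from urllib.robotparser import RobotFileParser

rules = """\
User-agent: WebZIP
Disallow: /images/
Disallow: /secrets/globaldomination.html

User-agent: Fetch
Disallow: /images/
Disallow: /secrets/globaldomination.html
Disallow: /tmp/

User-agent: MSIECrawler
Disallow: /images/
Disallow: /secrets/globaldomination.html

User-agent: WebCopier
Disallow: /images/
Disallow: /secrets/globaldomination.html
Disallow: /cgi-bin/
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

site = "http://www.sample.com"

# /tmp/ is listed only in the Fetch record.
print(parser.can_fetch("Fetch", site + "/tmp/"))           # False
print(parser.can_fetch("WebZIP", site + "/tmp/"))          # True
# /cgi-bin/ is listed only in the WebCopier record.
print(parser.can_fetch("WebCopier", site + "/cgi-bin/"))   # False
print(parser.can_fetch("MSIECrawler", site + "/cgi-bin/")) # True
# A bot with no record is not restricted.
print(parser.can_fetch("ExampleBot", site + "/images/"))   # True
```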