The basic syntax of robots.txt is fairly simple. You specify a robot name, such as “googlebot”, and then you specify an action. The robot is identified by user agent, and then the actions are specified on the lines that follow. Here are the major actions you can specify:
- Disallow: the pages you want to block the bot from accessing (as many Disallow lines as needed)

Some other restrictions apply:
- Each User-Agent/Disallow group should be separated by a blank line; however, no blank lines should exist within a group (between the User-Agent line and the last Disallow).
- The hash symbol (#) may be used for comments within a robots.txt file, where everything after # on that line will be ignored. This may be used either for whole lines or for the end of lines.
- Directories and filenames are case-sensitive: “private”, “Private”, and “PRIVATE” are all distinct paths to search engines.
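The case-sensitivity rule is easy to verify with Python's standard-library robots.txt parser. Here is a minimal sketch, using a hypothetical rule set and hostname, that shows only the exact-case path being blocked:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules that block only the lowercase "private" directory.
rules = """\
User-agent: *
Disallow: /private/
"""

rp = RobotFileParser()
rp.parse(rules.splitlines())

# Path matching is case-sensitive, so only the exact-case URL is blocked.
print(rp.can_fetch("*", "http://www.example.com/private/page.html"))  # False
print(rp.can_fetch("*", "http://www.example.com/Private/page.html"))  # True
```

Any real crawler that honors robots.txt applies the same literal, case-sensitive comparison to the URL path.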
Here is an example of a robots.txt file:
User-agent: Googlebot
Disallow:

User-agent: BingBot
Disallow: /

# Block all robots from tmp and logs directories
User-agent: *
Disallow: /tmp/
Disallow: /logs  # for directories and files called logs
The preceding example will do the following:
- Allow “Googlebot” to go anywhere.
- Prevent “BingBot” from crawling any part of the site.
- Block all other robots (those without their own group, i.e., everything except Googlebot and BingBot) from visiting the /tmp/ directory or any directory or file whose path begins with /logs (e.g., /logs/ or /logs.php).
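The behavior described above can be checked with Python's standard-library parser. This sketch feeds it the example file (with a hypothetical hostname and bot name, "OtherBot", standing in for any robot without its own group):

```python
from urllib.robotparser import RobotFileParser

robots_txt = """\
User-agent: Googlebot
Disallow:

User-agent: BingBot
Disallow: /

# Block all robots from tmp and logs directories
User-agent: *
Disallow: /tmp/
Disallow: /logs
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

base = "http://www.example.com"
print(rp.can_fetch("Googlebot", base + "/tmp/file.html"))  # True: its own group is empty
print(rp.can_fetch("BingBot", base + "/index.html"))       # False: blocked from everything
print(rp.can_fetch("OtherBot", base + "/logs.php"))        # False: wildcard group applies
print(rp.can_fetch("OtherBot", base + "/index.html"))      # True: not covered by any rule
```

Note how Googlebot may fetch /tmp/file.html even though the wildcard group disallows /tmp/: a robot obeys only the group that matches its user agent most specifically.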
Notice that the behavior of Googlebot is not affected by the directives in the wildcard group, such as Disallow: /tmp/. Because the robots.txt file contains a group of instructions specifically for Googlebot, it will ignore the directives addressed to all robots (i.e., the group whose User-agent line uses an asterisk).
One common problem that novice webmasters run into occurs when they have SSL installed, so that their pages may be served via both HTTP and HTTPS. A robots.txt file at http://www.yourdomain.com/robots.txt will not be interpreted by search engines as guiding their crawl behavior on https://www.yourdomain.com. To control crawling of the HTTPS site, you need to create an additional robots.txt file at https://www.yourdomain.com/robots.txt. So, if you want to allow crawling of all pages served from your HTTP server and prevent crawling of all pages from your HTTPS server, you would need to implement the following:
For http://www.yourdomain.com/robots.txt:

User-agent: *
Disallow:

For https://www.yourdomain.com/robots.txt:

User-agent: *
Disallow: /