Preventing selected subdirectories from being crawled

Expertise level: Medium

Web crawlers can be prevented from accessing certain directories of your website by using the Disallow directive in your robots.txt file.

Website owners use the /robots.txt file to give instructions about their site to web robots. It works like this: a robot wants to visit a website URL (example: http://www.example.com/welcome.html). Before it does, it checks for http://www.example.com/robots.txt, and finds:

User-agent: *
Disallow: /

The "User-agent: *" line means this section applies to all robots. The "Disallow: /" line tells the robot that it should not visit any pages on the site. If you only need to prevent robots from accessing the cgi-bin directory, use the following lines in your robots.txt file:

User-agent: *
Disallow: /cgi-bin/

The paths given in Disallow and Allow directives are case-sensitive. Use the capitalization that matches the URLs on your website.
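
The check described above can be tried locally with Python's standard urllib.robotparser module. This is a minimal sketch, not a description of how any particular crawler works: the robot name "ExampleBot" and the search.pl paths are made up for illustration, and the rules are parsed from a string rather than fetched from a live site.

```python
from urllib.parse import urljoin
from urllib.robotparser import RobotFileParser

# Page a robot wants to visit (URL taken from the example in this article).
page_url = "http://www.example.com/welcome.html"

# A robot looks for robots.txt at the root of the same host.
robots_url = urljoin(page_url, "/robots.txt")

# Parse the cgi-bin rules from a string instead of fetching them.
rules = """\
User-agent: *
Disallow: /cgi-bin/
"""
parser = RobotFileParser()
parser.parse(rules.splitlines())

# "ExampleBot" is a hypothetical robot name; it falls under "User-agent: *".
allowed_page = parser.can_fetch("ExampleBot", page_url)
blocked_cgi = parser.can_fetch(
    "ExampleBot", "http://www.example.com/cgi-bin/search.pl")
# Same path in upper case: the rule does not match, so it is not blocked.
upper_cgi = parser.can_fetch(
    "ExampleBot", "http://www.example.com/CGI-BIN/search.pl")

print(robots_url, allowed_page, blocked_cgi, upper_cgi)
```

In this sketch the welcome page is allowed, the lowercase /cgi-bin/ path is blocked, and the uppercase /CGI-BIN/ path is not covered by the rule, which is why the capitalization in your directives has to match your URLs.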

Additional symbols allowed in the robots.txt directives include:

  • '*' - matches any sequence of characters

    Example of '*':

    User-agent: Slurp
    Allow: /public*/
    Disallow: /*_print*.html
    Disallow: /*?sessionid

    The robots directives above:

    1. Allow all directories that begin with "public" to be crawled.
    Example: /public_html/ or /public_graphs/
    2. Prevent URLs that contain "_print" followed later by ".html" from being crawled.
    Example: /card_print.html or /store_print/product.html
    3. Prevent URLs that contain "?sessionid" from being crawled.
    Example: /cart.php?sessionid=342bca31

  • '$' - anchors the match at the end of the URL string

    Example of '$':

    User-agent: Slurp
    Disallow: /*.gif$
    Allow: /*?$

    The robots directives above:

    1. Disallow all files ending in '.gif' across your entire site.
    Note: Omitting the '$' would disallow all files containing '.gif' in their file path.
    2. Allow all URLs ending in '?' to be crawled. This does not allow URLs that merely contain '?' somewhere else in the URL string.
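
To see how the '*' and '$' symbols behave, the patterns above can be translated into regular expressions. The sketch below is only an illustration of the matching rules as this article describes them; the function name rule_matches is hypothetical, and a real crawler may differ in details such as how it resolves conflicts between Allow and Disallow.

```python
import re

def rule_matches(pattern: str, path: str) -> bool:
    """Hypothetical helper: does a robots.txt path pattern match a URL path?"""
    # A trailing '$' means the pattern must match up to the end of the URL.
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape each literal piece, then join the pieces with '.*' so that
    # every '*' matches any sequence of characters (including '/').
    regex = ".*".join(re.escape(piece) for piece in pattern.split("*"))
    if anchored:
        return re.fullmatch(regex, path) is not None
    # Without '$', the pattern only has to match the start of the path.
    return re.match(regex, path) is not None

# '*' examples from the article
print(rule_matches("/public*/", "/public_html/"))
print(rule_matches("/*_print*.html", "/store_print/product.html"))
print(rule_matches("/*?sessionid", "/cart.php?sessionid=342bca31"))

# '$' examples: with and without the end anchor (logo.gif paths are made up)
print(rule_matches("/*.gif$", "/images/logo.gif"))
print(rule_matches("/*.gif$", "/images/logo.gif?size=2"))
print(rule_matches("/*.gif", "/images/logo.gif?size=2"))
```

The last two calls show the effect described in the note: with '$' the pattern only matches URLs that end in '.gif', while without it any URL containing '.gif' after the leading slash is matched.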

There are two important considerations when using /robots.txt:

  • Robots can ignore your /robots.txt. In particular, malware robots that scan the web for security vulnerabilities, and email address harvesters used by spammers, will pay no attention to it.
  • The /robots.txt file is publicly available. Anyone can see which sections of your server you don't want robots to visit.