**Ray** · Mar-20-2008, 12:33 AM

We could not reproduce the problem you describe. A robots.txt disallow of "search.php" will exclude "search.php?f=8" or similar pages.
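To illustrate the prefix matching described above, here is a minimal sketch using Python's standard `urllib.robotparser` module. It is not Zoom's own code. The robots.txt content, including the leading slash in the rule, and the `mysite.com` URLs are assumptions made up for the example.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content (an assumption, not the poster's actual file).
ROBOTS_TXT = """\
User-agent: *
Disallow: /search.php
"""

parser = RobotFileParser()
# parse() accepts the file's lines directly, so no network access is needed.
parser.parse(ROBOTS_TXT.splitlines())

# A disallow rule is a path prefix, so the query-string variant is excluded too.
blocked = parser.can_fetch("*", "http://www.mysite.com/search.php?f=8")

# A URL that does not start with the disallowed prefix remains fetchable.
allowed = parser.can_fetch("*", "http://www.mysite.com/index.html")
```

Here `blocked` should come out as `False` and `allowed` as `True`, which is the behaviour we would expect from any spider that honours the rule.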

However, in looking at this issue, we did discover a bug which might be the actual cause of your problem. Zoom is currently looking for the "robots.txt" file at the base URL, as opposed to the root of the domain.

While there is no formal specification for the "robots.txt" file, the general consensus seems to be that it should be located in the root folder of the domain, that is:
http://www.mysite.com/robots.txt

If Zoom is given a start point at the root of the domain (e.g. you start spidering from http://www.mysite.com/index.html), then this is not a problem, and it uses the above-mentioned robots.txt file. But if Zoom is given a start point that is one or more folders deep, it will mistakenly look for the robots.txt file in that folder.

So for example, given a start point of http://www.mysite.com/forums/index.php

It will currently look for (and use) the robots.txt file at:
http://www.mysite.com/forums/robots.txt

This is incorrect according to robotstxt.org (the original specs are somewhat more vague and ambiguous). It also means that Zoom is potentially not finding the "correct" robots.txt file, especially if you have multiple (invalid) robots.txt files located in your other folders.
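The difference between the two lookups can be sketched in a few lines of Python. This is only an illustration of the URL arithmetic, not Zoom's actual implementation, and the function names `robots_url_at_base` and `robots_url_at_root` are hypothetical.

```python
from urllib.parse import urljoin, urlsplit, urlunsplit


def robots_url_at_base(start_url: str) -> str:
    """Resolve robots.txt relative to the start URL's folder (the buggy lookup)."""
    return urljoin(start_url, "robots.txt")


def robots_url_at_root(start_url: str) -> str:
    """Resolve robots.txt at the root of the domain, ignoring the start path."""
    parts = urlsplit(start_url)
    # Keep only scheme and host; drop the path, query string, and fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


start = "http://www.mysite.com/forums/index.php"
base_lookup = robots_url_at_base(start)   # http://www.mysite.com/forums/robots.txt
root_lookup = robots_url_at_root(start)   # http://www.mysite.com/robots.txt
```

For a start point directly under the root, such as http://www.mysite.com/index.html, both functions return the same URL, which is why the problem only shows up when spidering starts inside a subfolder.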

You mentioned that you have confirmed from the log that Zoom found a robots.txt file, but I wonder whether you actually have more than one robots.txt file, one of which is invalid and sits in a folder other than the root, and Zoom is using that one instead. This might explain why it is not behaving as you expect.

We will address this issue in our next build (5.1.1014) and change it so that Zoom will only pick up the robots.txt file at the root level of the domain.

If you are sure that it is using the correct robots.txt file in your scenario, and still believe that it is failing to skip the disallowed pages, can you provide us with the actual URL of your website so that we can investigate further?

Partly ignores robots.txt?
