  • Google indexes search pages

    I use V6 enterprise (php) on several sites and noticed a strange effect.

    My site www.anothersite.nl for example:

    - Google indexes search result pages. See:
    www.google.nl/search?q=site:anothersite.nl&start=70 and look at the URLs of the results. They all look like: www.anothersite.nl/index.php?zoom_query=... (with all kinds of different search words).
    I think this can be nice: more exposure maybe?

    - In my search word log files I find a lot of lines with Google IP addresses (66.249.66.143 and 66.249.66.242, for example). In the logs of my own (small) site it can be 60 lines a day, but on another, bigger site, it can be as many as 500 lines a day!
    I think that is not nice: it slows the site down. I think I will have to do some PHP scripting to stop these IP addresses from being logged.

    I am curious what is happening here. If I click on such a Google link to a search result page myself, then my own IP address shows up in the log file.
    So is this just Google indexing? Does anybody know?

  • #2
    Originally posted by Rob F
    - Google indexes search result pages. See:
    www.google.nl/search?q=site:anothersite.nl&start=70 and look at the URLs of the results. They all look like: www.anothersite.nl/index.php?zoom_query=... (with all kinds of different search words).
    I think this can be nice: more exposure maybe?
    While nobody knows exactly how Google indexes (and since it can change from day to day, anyone who claims to know all the rules to SEO is just plain lying), I believe this should only happen if you have text links to your search page somewhere. In other words, somewhere else on your site (or even on someone else's site), there may be links to this search page in the form of "www.anothersite.nl/index.php?zoom_query=etc"

    Once there is just one link into a set of legitimate results, it is possible that Googlebot will then find the other links to subsequent pages of results and index them as well.

    So I suspect this is the case here. If you do not have links like this on your website somewhere, and you do not think anybody is linking to you this way, note that third-party spam websites might have done this. Spam sites (which try to create a mass of legitimate-looking links and manipulate them to their advantage) have their own spiders which automatically fill in forms that they find. Such a spider may well have entered some words into the search form on your site, found the URLs of the search results, and created links to them on spam sites. Googlebot then finds those links and follows them.

    Originally posted by Rob F
    - In my search word log files I find a lot of lines with Google IP addresses (66.249.66.143 and 66.249.66.242, for example). In the logs of my own (small) site it can be 60 lines a day, but on another, bigger site, it can be as many as 500 lines a day!
    I think that is not nice: it slows the site down. I think I will have to do some PHP scripting to stop these IP addresses from being logged.
    There's a better way to do this. You can tell Googlebot what you want it to do with a "robots.txt" file. Simply tell it to not index the search results, or anything with "zoom_query" in it on your website if this is what you want.

    General information on robots.txt file:
    http://www.robotstxt.org/

    Google's own help page on robots.txt:
    http://www.google.com/support/webmas...n&answer=40360

    Note that Zoom's spider also obeys robots.txt so you need to be careful with your settings. You can specify rules for Googlebot (or ZoomSpider) specifically.
    --Ray
    Wrensoft Web Software
    Sydney, Australia
    Zoom Search Engine
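
As a rough sketch of the robots.txt approach described above: the rules below ask Googlebot alone to stay away from search result URLs, so ZoomSpider and other crawlers are not affected. The path /index.php?zoom_query= is taken from the example URLs in this thread and is only an assumption about where the search page lives. The Python part is one possible way to sanity-check the rules offline with the standard library's urllib.robotparser; it does simple prefix matching and may not behave exactly like Googlebot's own parser.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for the site discussed in this thread.
# Only the Googlebot group is restricted, so ZoomSpider is not affected.
ROBOTS_TXT = """\
User-agent: Googlebot
Disallow: /index.php?zoom_query=
"""


def build_parser(robots_txt):
    """Parse robots.txt text without fetching anything over the network."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)
SEARCH_URL = "http://www.anothersite.nl/index.php?zoom_query=test"
HOME_URL = "http://www.anothersite.nl/index.php"

if __name__ == "__main__":
    # Show which crawler may fetch which URL under these rules.
    for agent in ("Googlebot", "ZoomSpider"):
        for url in (SEARCH_URL, HOME_URL):
            print(agent, url, parser.can_fetch(agent, url))
```

If the search page has a different name on your site (search.php, for example), the Disallow line would need to be adjusted to match.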


    • #3
      In other words, somewhere else on your site (or even on someone else's site), there may be links to this search page in the form of "www.anothersite.nl/index.php?zoom_query=etc"
      Not on my site (about 20 pages). Including the ?zoom_query URLs, Google indexed 1090 pages!

      I just Googled ?zoom_query, which gives 2,780,000 results!
      www.google.nl/search?q=%3Fzoom_query
      So there are many more domains where this is happening.
      It's also the case with your own site, I just found out:
      www.google.com/search?q=zoom_query+site:wrensoft.com&hl=nl&start=180&sa=N&filter=0

      note that it might be third-party spam websites that have done this. Spam sites (which try to just create a mass of legitimate links and manipulate them to their advantage) have their own spiders which automatically fill in forms that they find.
      Yes, this could be the case, but wouldn't I also find these links with Google? I tried, but couldn't find any. Also, if it was from spam bots, I think the search phrases would be different. More like the terms that they want to spam?

      There's a better way to do this. You can tell Googlebot what you want it to do with a "robots.txt" file. Simply tell it to not index the search results, or anything with "zoom_query" in it on your website if this is what you want.
      Yes, I know I can do that. But maybe search engine marketing-wise it's better to let Google index the site, but just not log it.
      More important: I am very curious what is happening here.


      • #4
        Hmmm. As David informed me, it turns out Google has actually been experimenting with the idea of searching the "Deep Web" and is, in fact, inserting selected text into form fields and indexing the submitted results.

        More information about it here:
        http://googlewebmastercentral.blogsp...tml-forms.html

        This really goes against most expectations on the web. Randomly filling form fields with values (which it guesses might be appropriate) is a real can of worms, and a more common method amongst spammers. They say that they only do this on some websites, presumably, websites of a certain pagerank and of a certain reputation they can trust to be not spammy. But their inability to recognize search results as non-meaningful index data is indicative of the problem with this method.

        However, it is simple enough to avoid. As mentioned in the above article and in my post above, you can use robots.txt to tell Googlebot to stay away from the search page. Or you could just add a "noindex" robots meta tag on the search template page.

        As for logging, I'm not sure it's really meaningful to block them from being logged. First, if they had been blocked in the first place, you wouldn't have noticed this was happening. Second, Googlebot has a wide range of IP addresses, and there are always other bots on the web, such as Yahoo's crawler.

        You might want to exclude the IP addresses when you analyse the statistics, and you can do that if you import your searchwords.log file into Excel and filter the IP address column accordingly.
        --Ray
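
For anyone who would rather script the filtering than use Excel, here is a minimal sketch of the same idea. The log layout (tab-separated, IP address in the second column), the sample lines, and the network 66.249.64.0/19 are all assumptions made for illustration: the network simply covers the two addresses quoted in this thread, and the real searchwords.log format should be checked before using anything like this.

```python
import ipaddress

# Assumed crawler range covering the addresses quoted in this thread.
# Check the crawler's published ranges before relying on it.
BOT_NETWORKS = [ipaddress.ip_network("66.249.64.0/19")]


def is_bot_ip(ip_text, networks=BOT_NETWORKS):
    """Return True if ip_text parses as an address inside any bot network."""
    try:
        address = ipaddress.ip_address(ip_text.strip())
    except ValueError:
        return False  # not an IP address (header line, blank field, ...)
    return any(address in network for network in networks)


def drop_bot_lines(lines, ip_column=1, delimiter="\t"):
    """Keep only log lines whose IP field is not in a bot network.

    ip_column and delimiter are hypothetical; match them to your log file.
    """
    kept = []
    for line in lines:
        fields = line.rstrip("\n").split(delimiter)
        ip_text = fields[ip_column] if len(fields) > ip_column else ""
        if not is_bot_ip(ip_text):
            kept.append(line)
    return kept


# Made-up sample lines in the assumed layout: date, IP address, search words.
SAMPLE_LOG = [
    "2008-05-01\t66.249.66.143\tholiday\n",
    "2008-05-01\t203.0.113.7\tprice list\n",
    "2008-05-02\t66.249.66.242\tcontact\n",
]

if __name__ == "__main__":
    for line in drop_bot_lines(SAMPLE_LOG):
        print(line, end="")
```

Filtering at analysis time leaves the raw log intact, so the bot traffic can still be inspected later if needed.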

        • #5
          Ah, so this is what is going on!
          I am really glad to know this. Thank you.
          I would prefer to let Google go on this way.

          The problem with logging all this Google activity is that I think it will slow down the site (server)? The log file of one of the bigger, more visited sites, with one year of logs, is 17.4 MB! And most of the searches really are from Google. I presume that opening and writing to such a large file will cost a lot of time/memory?


          • #6
            Logging is not significantly expensive in this context. It doesn't really make much of a difference if you're appending to a large file or appending to a small file on most file systems.

            The fact is, the load of doing the actual search and serving the results would be greater than the load caused by logging. So there is not much to gain here. If you are worried about Googlebot putting a strain on your server, the factor to consider is how frequently Googlebot is hitting it. If it really is hitting it too hard, you would need to block it from hitting your search pages and executing searches, rather than simply disabling logging.

            You might be able to use "Crawl-delay" or "Request-rate" instructions in robots.txt. But it's not clear if Googlebot supports these commands (it didn't in the past).
            --Ray
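
To illustrate what these directives look like, here is a small sketch. "ExampleBot" and the values 10 and 1/5 are made up for illustration. Both directives are non-standard extensions, and a crawler that does not support them will simply ignore them, so this says nothing about what Googlebot itself does. Python's urllib.robotparser is used only to show how a parser that does understand them reads the values.

```python
from urllib.robotparser import RobotFileParser

# "ExampleBot" is a made-up crawler name used only for illustration.
ROBOTS_TXT = """\
User-agent: ExampleBot
Crawl-delay: 10
Request-rate: 1/5
"""


def read_limits(robots_txt, agent):
    """Return (crawl delay in seconds, request rate) declared for agent."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.crawl_delay(agent), parser.request_rate(agent)


delay, rate = read_limits(ROBOTS_TXT, "ExampleBot")

if __name__ == "__main__":
    print("Seconds between requests:", delay)
    print("Requests allowed:", rate.requests, "per", rate.seconds, "seconds")
```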

            • #7
              Raymond,

              I will 'robot' Google out of the sites which it crawls too many times a day.

              Thanks for all the very to-the-point, clarifying replies. Zoom Search is a really superb program. And it's a nice extra that you give this great service on the forum!


              • #8
                Still problems with Google filling out the zoom search forms

                I found out that Google was still filling out the search forms. Then I realised that in the robots.txt I told Google not to index zoom_search result pages. But this doesn't prevent Google from visiting the form, which is embedded in the site (in a footer.php that is included in all pages).
                I am afraid I am forced to use IP addresses to keep Google searches from being logged.


                • #9
                  It would surprise me if Google proceeds to index the results page (presumably you mean search.php) when you have told it not to in the robots.txt file. Even if it was filling in forms on other pages to get to it, it would ultimately be failing to comply with the robots.txt file if it did this. You should check if your robots.txt file is valid.

                  Also, placing a meta robots noindex tag on the search_template.html page should prevent Google from indexing the results (regardless of how it got there). See this Google support page for more information.

                  Note also that your changes will not immediately take effect. It may take Google some time before they re-index your site.
                  --Ray
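
As a sketch of the meta tag suggestion above: the sample HTML below is a made-up, minimal stand-in for a search template (not the real search_template.html), showing where a robots noindex tag would go in the head. The small checker around it is just one possible way to confirm that a template actually contains the tag, using Python's standard html.parser.

```python
from html.parser import HTMLParser

# Hypothetical, minimal stand-in for a search template page.
TEMPLATE_HTML = """\
<html>
<head>
<title>Search results</title>
<meta name="robots" content="noindex">
</head>
<body></body>
</html>
"""


class RobotsMetaFinder(HTMLParser):
    """Collect the directives of every <meta name="robots"> tag."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        if (attributes.get("name") or "").lower() == "robots":
            content = attributes.get("content") or ""
            for part in content.split(","):
                if part.strip():
                    self.directives.add(part.strip().lower())


def has_noindex(html_text):
    """Return True if the page carries a robots meta tag with noindex."""
    finder = RobotsMetaFinder()
    finder.feed(html_text)
    return "noindex" in finder.directives


if __name__ == "__main__":
    print("noindex present:", has_noindex(TEMPLATE_HTML))
```

One thing to keep in mind: a crawler can only see this tag if it is allowed to fetch the page, so a robots.txt rule that blocks the same page would hide the tag from it.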