Timing out

  • Timing out

    On occasion I have some pages that time out (which I know is a problem with my server not being able to handle the indexer, and which I am looking into fixing). So when the indexing is done, it leaves out the subset of pages that didn't get indexed - I don't want to upload the index to my server without these.

    Question: Do I need to do a complete re-index to pick up the pages that didn't get included, or will an incremental index find them and index just the ones I missed? A full-site index takes 10+ hours, so I would prefer to do an incremental one.

  • #2
    How many pages are on your site?
    10 hours is a long indexing session.
    Can you index a local copy of your site instead? It might save you heaps of time.

    You can incrementally add single pages or a list of pages (by entering the URLs by hand). So this could work if you know all the pages that had a problem.

    But if you do an automatic incremental update, it might or might not work. It depends on how your pages are linked. As the missing pages are not currently in the index, the indexer will not check to see whether a new version of each page is available. However, if you had a site map page with links to all your other pages, and the site map page was updated, then this would work. The indexer would notice a newer version of the sitemap, look for links on the page, then find pages that are not currently in the index and add them (see the sketch below).
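
    For example, a minimal sitemap page along these lines (the file name and URLs here are hypothetical) would give an incremental update a path to the missing pages:

    <!-- sitemap.html: hypothetical example. When this page is updated, an
         incremental update re-crawls it, follows the links, and adds any
         linked pages that are missing from the index. -->
    <html>
    <body>
      <h1>Site Map</h1>
      <a href="/products.html">Products</a>
      <a href="/support.html">Support</a>
      <a href="/contact.html">Contact</a>
    </body>
    </html>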

    Of course the better solution is to fix your server so that it doesn't time out.

    • #3
      Here are the stats for my site that takes 11 hours to index:
      68k pages indexed
      70k pages skipped
      33M words found
      2.6G of data downloaded


      I have it set to a 0.5 second throttle with 10 threads. (With no throttling, it consistently puts the website into a slow state.)

      The machine that runs the indexer often gives a warning message that there is not enough RAM (requires 1.1GB, but only 1GB available). That may be part of the root cause of it taking so long as well.

      Any other ideas on how I could optimize?

      Thanks

      • #4
        Most of the 11 hours the software will be doing nothing, just waiting around, because of the 0.5 sec delay per page you have set (see the rough calculation below).
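
        To put a rough number on that (a back-of-envelope sketch, assuming the throttle acts as a single global 0.5 sec gap between consecutive requests, and that only the ~68k indexed pages were actually downloaded):

        # Back-of-envelope estimate of time lost to throttling.
        # Assumptions (not confirmed): the 0.5 s delay is a global gap between
        # consecutive requests, and only the ~68,000 indexed pages were fetched.
        pages_downloaded = 68_000
        throttle_s = 0.5
        waiting_hours = pages_downloaded * throttle_s / 3600
        print(f"Pure waiting time: ~{waiting_hours:.1f} hours")  # ~9.4 hours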

        If you indexed a local copy of your site (and removed the delay) you could probably cut the indexing time down to 1-2 hours. I assume Zoom itself isn't overloading your site (it shouldn't be) and that the real problem is that the live site is already under heavy load and isn't very quick to start with.

        RAM probably isn't the main problem. But additional RAM is super cheap now (around $14/GB), so putting in an extra 1 or 2GB should be a no-brainer.

        • #5
          So, I've updated the site and it is now responding faster. However, I am seeing something I had not seen before with Zoom.

          I changed the setting back to no delay between requests, and still have 10 threads for downloading, but it seems that downloading is still happening one page at a time, and the entire process is taking even longer than before. Most of the time the download threads just sit idle, with only one in use at any given moment. Any thoughts on what might be causing this? Previously, with this same configuration, all 10 threads would be continuously downloading.

          I don't think it's my server, as it responds quickly while pages are being downloaded and isn't having any CPU issues at all.

          I tried the Zoom configuration on a different PC and got the same behavior, so I'm not sure what setting I may have changed to create this problem. I can email you my zcfg file for you to test.

          Thanks in advance for guidance - it has reached the point where the indexer is not even usable, since it would take days to index my site.

          Also, as far as creating a local copy goes, what do most people do for this? Our site is so completely database-driven that it would take a lot of work to keep a local copy in sync for indexing.

          • #6
            It could be your server limiting the number of connections from a single IP address. (This is not typical, but we have seen it before.)

            What is the URL for your site? We can test it from here.
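
            If you want a rough self-test first, a sketch like the one below (the URL is a placeholder) compares one request against ten in parallel; if the parallel batch takes close to ten times as long as a single request, something is serializing connections from your IP:

            # Rough per-IP connection-limit check. The URL is a placeholder;
            # point it at a real page on the site being indexed.
            import time
            import urllib.request
            from concurrent.futures import ThreadPoolExecutor

            URL = "http://www.example.com/somepage.html"

            def fetch(_):
                with urllib.request.urlopen(URL) as resp:
                    resp.read()

            start = time.time()
            fetch(0)
            single = time.time() - start

            start = time.time()
            with ThreadPoolExecutor(max_workers=10) as pool:
                list(pool.map(fetch, range(10)))
            parallel = time.time() - start

            print(f"1 request: {single:.2f}s, 10 in parallel: {parallel:.2f}s")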

            How do you back up your database? And how do you test new versions of your site before going live? If you are just doing an SQL dump, you could restore the dump on a local machine and then index from there. But I agree that this is not as easy for a database-driven site as for a static one.

            The other option (if you are running a Windows server, and control the machine) is to install the indexer directly on the server itself. This will give a dramatic speed improvement as you remove all the Internet latency.

            • #7
              Just to add our experience to this discussion, we have chosen to maintain a mirror of our site on a local Windows box, and all the indexing is carried out there, thereby eliminating all the internet traffic and latency.

              Stats:
              Total words found: 8,872,355
              Elapsed index time: 00:17:30
              Total bytes scanned/downloaded: 2,390,483,737 (about 2.4GB)

              • #8
                I checked, and Apache is not limiting connections per IP. I don't know what else it might be. I've sent you my configuration file via email. Please take a look and see if you can find anything out of place. Thanks.

                • #9
                  We've replied via e-mail already, but for anybody else following this thread...

                  The reason Zoom was still delaying between pages was that there was a "robots.txt" file on the website being indexed which specified "Crawl-delay:10" for all user agents.

                  As Zoom was configured to obey the robots.txt file ("Configure"->"Spider options"->"Enable robots.txt support"), Zoom was forced to wait between page downloads.

                  Disabling this option, or changing the robots.txt file, is the suggested solution.

                  Alternatively, you can change the robots.txt file to specify a different crawl-delay value for Zoom. You can find information on how to do this here: http://www.wrensoft.com/zoom/support/useragent.html
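
                  For example, a robots.txt along these lines (assuming Zoom is crawling with a user agent name containing "ZoomSpider", per the page above) would keep the 10 second delay for other crawlers while leaving Zoom unthrottled:

                  # Hypothetical sketch: a crawler uses the most specific matching
                  # group, so one matching "ZoomSpider" ignores the Crawl-delay
                  # set for all other user agents below.
                  User-agent: ZoomSpider
                  Disallow:

                  User-agent: *
                  Crawl-delay: 10
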
                  --Ray
                  Wrensoft Web Software
                  Sydney, Australia
                  Zoom Search Engine
