Indexer starts to index site then stops and uses 100% CPU

  • Indexer starts to index site then stops and uses 100% CPU

    I am having a problem trying to index my site. I ran Zoom Indexer yesterday and created a rather large index of the site. (The wordmap and pagetext files are 105 MB each.)

    I noticed that it was finding a lot of pages multiple times and listing them separately because the name is different but the page is the same. I wasn't sure if there was a way for Zoom to know that the page was the same, so I decided to add the special Zoom no-follow tag around many of the links so that when someone uses the search they won't get a page of identical pages with different URLs. The other change I made since it ran yesterday was that I increased the list of recommended links that I imported from ~56,000 to ~120,000.

    The problem I have today when I tried to index the site (hoping that it would remove a lot of the duplicate pages and reduce the file sizes) is that it gets to one of two pages and stops. CPU usage is 100% and it doesn't try to go to any more pages.

    In the status window it says:
    DL Thread #2, got URL (http://www.bnbfinder.com/?action=myinns&l=1) off queue
    Downloading file http://www.bnbfinder.com/?action=myinns&l=1

    It never reports getting the ready buffer, as it did for the pages before it. Sometimes it gets past this file, but then it stops at a state page:
    http://www.bnbfinder.com/Alabama-Bed-and-Breakfast

    I haven't been able to run it successfully today. Of the five times I tried, it stopped at one of those two pages every time.

    I am writing this as I am trying to run the index, and after about 5 minutes of sitting on the same page it just jumped to another. So it does seem to be working, but it is running extremely slowly. It took a total of 2 hours to index the entire site yesterday, and it should be faster today since I am considerably reducing the number of links it sees. However, it has taken 10 minutes so far and has only indexed 24 pages. (I have about 25,000 pages.)

    Have I found or exceeded some sort of limit, perhaps with the recommended links?

  • #2
    Initially I thought it was this problem related to System Mechanic 7. But the symptoms are not the same.

    I think it is more likely that it is the FAQ problem related to a bug in HTTP 1.1 Windows networking.
    Q. Spider mode indexing stops responding at a particular URL and CPU utilization is at 100%

    The third possibility is that it is something related to your truly massive list of 120,000 recommended links. We never tested the software with this many recommended links. You might want to e-mail us the recommended links import file if you still have a problem.

    Regarding the duplicate pages: if the pages are really 100% the same, you can use the CRC function in Zoom to filter them. Otherwise you can probably use the skip list. Can you post some example URLs that point to 'duplicate' pages? Of course, there is also the solution of designing your site so that each page has a unique URL.
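
    To illustrate the idea behind CRC-based duplicate filtering, here is a minimal Python sketch. It is not Zoom's implementation, and the URLs and page bodies are made up for the example. It groups URLs whose downloaded content has the same CRC-32 checksum. A single differing byte produces a different checksum, which is why this approach only catches pages that are 100% identical.

```python
import zlib
from collections import defaultdict


def group_by_crc(pages):
    """Return lists of URLs whose page bodies share a CRC-32 checksum.

    `pages` maps a URL to the raw bytes downloaded from it.
    """
    groups = defaultdict(list)
    for url, body in pages.items():
        groups[zlib.crc32(body)].append(url)
    # Only groups with more than one URL are duplicates.
    return [urls for urls in groups.values() if len(urls) > 1]


# Hypothetical URLs and bodies, for illustration only.
pages = {
    "http://example.com/inn?id=1": b"<html>Same page</html>",
    "http://example.com/inn/1": b"<html>Same page</html>",
    "http://example.com/about": b"<html>About us</html>",
}

duplicates = group_by_crc(pages)
print(duplicates)
```

    Pages that differ only slightly (a timestamp, a session ID in a link) would not be grouped by a check like this, which is where the skip list comes in.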

    • #3
      I did read the posting about System Mechanic, but I do not run that software, and I checked which services were running and did not see it listed (just in case someone else had previously installed it on this machine).

      I took your advice about the HTTP 1.1 bug and disabled HTTP 1.1, but I had the same problem. I read Microsoft's bug report, and it seems that this only affects IE 4 and 5, so I guess they fixed it in more recent versions of the browser. (I am using XP Pro with IE7.)

      I removed the recommended list and still had the problem. The only thing left that I have changed is the <!--zoomstop--> and <!--zoomstart--> tags, which I am searching through my files to remove now.

      I ran a test on another website I have, which is only a handful of pages at the moment, and it worked, so I don't think there is a problem with the install itself.

      I found the CRC check and think I will use that. I would love to rewrite the site so that it does not have multiple addresses for the same page, but I didn't write the site initially, and although I want to rewrite it for many reasons, I have to answer to a boss as well, and she won't let me rewrite the entire site.

      The reason I have so many recommended links is that I want people who search by a city, state, or city/state combination to be given the most likely page they want at the top, as opposed to just a list of the pages that match the search. (The city/state pages themselves don't have enough weight to outrank some of our other pages, which may list the city or state name 100 times due to being a list.)

      • #4
        I'm a wanker.

        It looks like the problem was a misspelling in the <!--ZOOMRESTART--> tag.

        I don't know what it tries to do if it can't match up the tags, but I now know from first-hand experience that it's not good.

        Thank you for your help.

        • #5
          This is still not normal. The software should never get stuck in a loop like you are suggesting. It is hard to imagine how this could happen.

          Are you sure it is not a coincidence? I would think switching to HTTP 1.0 was more likely to have been the solution. We have seen this HTTP 1.1 problem in IE6 and IE7 with some servers, despite what Microsoft has said about fixing it.

          • #6
            I fixed the tag and am currently running the index. It has visited 13,000 URLs with about 10,000 to go. I switched HTTP 1.1 back on after switching it off didn't seem to solve the problem, and I also have my list of 120,000 recommended links included.

            The only thing that has changed is that I fixed the tag. If you would like, I can send you exactly what I did and you can test it. The stop tag was correct but the restart tag was wrong. I would have expected it to simply find nothing to go to and end the search, but it seems to go into a loop, maybe searching for the restart tag.

            If there is anything you need from me that can help you narrow it down let me know and I will be happy to help you. For the time being I am happy that it is working again and I can test it then integrate it into our live website.
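
            For anyone wanting to catch this kind of mistake before indexing, here is a rough Python sketch that scans a page for Zoom-style comment tags and reports a stop tag with no matching restart tag, or a tag whose name is not recognised. It only knows the two tag names mentioned in this thread, and the case-insensitive matching is an assumption made for the sketch rather than a statement about how Zoom parses tags, so extend KNOWN_TAGS for any other Zoom tags you use.

```python
import re

# Tag names taken from this thread; add any other Zoom tags you rely on.
KNOWN_TAGS = {"ZOOMSTOP", "ZOOMRESTART"}

# Matches HTML comments whose content is a single word starting with "zoom".
ZOOM_COMMENT = re.compile(r"<!--\s*(zoom\w*)\s*-->", re.IGNORECASE)


def check_zoom_tags(html):
    """Return a list of human-readable problems found in `html`."""
    problems = []
    skipping = False  # True while inside a ZOOMSTOP ... ZOOMRESTART region
    for match in ZOOM_COMMENT.finditer(html):
        name = match.group(1).upper()
        if name not in KNOWN_TAGS:
            problems.append(
                f"unrecognised tag {match.group(0)} at offset {match.start()}"
            )
        elif name == "ZOOMSTOP":
            if skipping:
                problems.append(f"second ZOOMSTOP at offset {match.start()}")
            skipping = True
        else:  # ZOOMRESTART
            if not skipping:
                problems.append(
                    f"ZOOMRESTART without ZOOMSTOP at offset {match.start()}"
                )
            skipping = False
    if skipping:
        problems.append("ZOOMSTOP without a matching ZOOMRESTART")
    return problems


good_page = "<p>a</p><!--ZOOMSTOP--><a href='x'>x</a><!--ZOOMRESTART--><p>b</p>"
bad_page = "<p>a</p><!--ZOOMSTOP--><a href='x'>x</a><!--ZOOMSTART--><p>b</p>"

print(check_zoom_tags(good_page))
for problem in check_zoom_tags(bad_page):
    print(problem)
```

            Running this over every file on the site before starting the indexer would flag a misspelled restart tag as both an unrecognised tag and an unclosed stop tag.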

            • #7
              Note that if you do not have the "Reload all files (do not use cache)" option checked in the Configuration window (on the "General" tab), then it is possible that you are indexing from the cache. As such, it may be that switching off the HTTP 1.1 option did not take effect until the cache was cleared and Zoom had another attempt at downloading the files.

              But if this is not the case, and if it is possible, then yes, we would like to see the files with the incorrect tags that cause Zoom to get stuck. You can zip up the files and e-mail them to us, and we can look into whether we can replicate the problem. Perhaps you should include your ZCFG file as well, so that we use the same settings and include that long list of recommended links.
              --Ray
              Wrensoft Web Software
              Sydney, Australia
              Zoom Search Engine
