How does incremental indexing work


  • How does incremental indexing work

    Hi folks - recently I put together a large ZOOM project, and for all intents and purposes it ran extremely well, giving me a well-working end product. It took about 7 hours to run. So I thought I'd test an incremental index, as that's going to be the most common method of updating. As required, there have been no changes to the config for the job, and this job is the only job the machine is running. I've made sure the job is well resourced, as it's looking for about 5GB of RAM.

    The first thing I noticed was that there is not a lot of feedback when an incremental is running. Unlike the original index, very little if anything is written to the log screen, so it's unclear what's happening. At this stage the incremental has been running three times longer than the original indexing job and I don't know if it's working or not. It's a definite maybe.

    Having said that there are a few log entries that are of considerable interest.

    After the job had been running for maybe a couple of hours (there's no date and time stamp on the log entries, so it's a bit hard to tell), about 40 lines like this appeared, even though there was no such error in the original indexing job:

    Skipping http://www.classiccmp.org/pipermail/...ay/035129.html (External site - does not match base URL)

    The relevant start point is http://www.classiccmp.org/pipermail/cctalk/ so it's unclear why ZOOM suddenly treated this URL as an external site.

    Overnight, following the above log entries, about 100 log entries were generated like this:

    Warning: Core engine is running slow while indexing

    but there's no other info that might help me understand what's happened here. As detailed before, the machine only had one job.

    Then I see about another 40 lines in the log screen:

    Skipping http://ana-3.lcs.mit.edu/~jnc/cctalk...pril/0661.html (External site - does not match base URL)

    Like the first example, the start point is http://ana-3.lcs.mit.edu/~jnc/cctalk/ so it's unclear why ZOOM suddenly sees a bunch of URLs as an external site when they match the start point.

    Any info on what might be happening here and/or how incrementals work would be very useful. Thank you.

  • #2
    There is a collection of different options for incremental indexing. The two main ones are "update an existing index" and "Add list of new or updated pages to existing index".

    If you can, the best option, by a big margin, is to add a list of new or updated pages, i.e. you tell Zoom which pages have changed. This gives optimal performance, as there is no need to check each and every page on the site to see if an update has occurred. So if you can, add some mechanism on your site to maintain a list of modified pages.
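
    As a rough illustration of such a mechanism, the sketch below walks a local copy of the site and collects a URL for every file modified after a given time. It assumes the pages are served from a directory you can read directly; `modified_pages`, `site_root`, `base_url` and `since_epoch` are hypothetical names chosen for this example and are not part of Zoom. The exact list format Zoom expects should be checked against its documentation.

```python
import os
from urllib.parse import quote


def modified_pages(site_root, base_url, since_epoch):
    """Return sorted URLs for files under site_root modified after since_epoch.

    site_root   -- local directory the site is served from (assumption)
    base_url    -- URL that maps onto site_root (assumption)
    since_epoch -- time of the previous index run, in seconds since the epoch
    """
    urls = []
    for dirpath, _dirnames, filenames in os.walk(site_root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            # Only keep files touched after the previous indexing run.
            if os.path.getmtime(path) > since_epoch:
                rel = os.path.relpath(path, site_root).replace(os.sep, "/")
                urls.append(base_url.rstrip("/") + "/" + quote(rel))
    return sorted(urls)
```

    The returned list could then be written to a text file, one URL per line, and handed to the indexer in whatever form it accepts.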

    If you can't do this, then every page on the site needs to be checked for possible modification, which can involve a lot of internet traffic. Within this scenario there are a couple of sub-scenarios. Some web sites return accurate HTTP response header information for the last-modified date and file size. Others don't. So if, for example, your web site always returns a last-modified date of today, then incremental indexing is effectively futile, and counterproductive, as Zoom will not be able to accurately find the files which have been changed.
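
    To make the header point concrete, here is a minimal sketch of a change check based on the `Last-Modified` and `Content-Length` response headers. It is not Zoom's actual algorithm, only an illustration of why inaccurate headers defeat this kind of check; `page_changed` and the dictionary layout are assumptions made for the example.

```python
from email.utils import parsedate_to_datetime


def page_changed(stored, current):
    """Guess whether a page changed, given header values from two visits.

    stored / current -- dicts holding the 'Last-Modified' and
    'Content-Length' header strings (or None if the header was absent).
    """
    for key in ("Last-Modified", "Content-Length"):
        if stored.get(key) is None or current.get(key) is None:
            return True  # header missing: cannot tell, so treat as changed
    if stored["Content-Length"].strip() != current["Content-Length"].strip():
        return True  # size differs
    try:
        old_time = parsedate_to_datetime(stored["Last-Modified"])
        new_time = parsedate_to_datetime(current["Last-Modified"])
        return new_time > old_time
    except (TypeError, ValueError):
        return True  # unparseable date: treat as changed
```

    A server that stamps every response with the current time makes `new_time > old_time` true on every visit, so each page looks modified and has to be fetched and indexed again.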

    Not sure about the external site thing. We would need to have a look at the Zoom configuration file.



    • #3
      Thank you!!! Happy to send the config file. Where can I send it?



      • #4
        Originally posted by kpa:
        Thank you!!! Happy to send the config file. Where can I send it?
        You can find our contact information on the Contact Us page.



        • #5
          Hi guys - I have sent all the details to your info mailbox.
