Incremental indexing & last modified property

  • Incremental indexing & last modified property

    Hello,

    I’ve got a question regarding incremental indexing.
    I’d like to do an incremental index of our website, but that process takes much more time than a full re-index. ZoomSearch checks all files for "changes made to pages in the existing index", adds the files to a queue, and downloads all of them again. Of course, I did not change the .zcfg. After studying the FAQs and the forums, I found out that the root of the problem might be the “last modified” property of the HTML files. But if that is the case, why does it take so much more time than the regular re-index?


    Thanks in advance!

    Marc

  • #2
    Originally posted by mwe:
    ZoomSearch checks all files for "changes made to pages in the existing index", adds the files to a queue, and downloads all of them again.
    No, it asks the web server for the last-modified date and time of each page. This should be much faster than re-downloading every web page. Zoom only proceeds to download and re-index the pages that report themselves as updated.
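The check described above can be sketched roughly as follows. This is only an illustration in Python, not Zoom's actual implementation: the function name `needs_reindex` is hypothetical, and treating a missing or unparsable header as "changed" is an assumption made for the example.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime
from typing import Optional


def needs_reindex(last_modified: Optional[str], indexed_at: datetime) -> bool:
    """Decide whether a page must be downloaded again (hypothetical helper).

    last_modified: raw value of the HTTP Last-Modified header, or None if absent.
    indexed_at: timezone-aware time at which the page was last indexed.
    """
    if not last_modified:
        # Assumption: without a header we cannot prove the page is unchanged.
        return True
    try:
        modified = parsedate_to_datetime(last_modified)
    except (TypeError, ValueError):
        # Unparsable date: treat the page as changed, to be safe.
        return True
    if modified.tzinfo is None:
        # Naive result (e.g. a "-0000" zone): interpret it as UTC.
        modified = modified.replace(tzinfo=timezone.utc)
    return modified > indexed_at
```

A crawler would typically obtain the header value with a lightweight HEAD request, so no page body is transferred unless this function returns `True`.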

    Originally posted by mwe:
    After studying the FAQs and the forums, I found out that the root of the problem might be the “last modified” property of the HTML files. But if that is the case, why does it take so much more time than the regular re-index?
    It doesn't take more time unless your web server reports every page as updated. This can happen on dynamically driven websites that were not coded to provide a last-modified date. For example, PHP pages that are not specifically coded to send a Last-Modified header will usually come back with the current date and time, so they are always treated as new pages.
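As a rough sketch of what "coded to provide a last-modified date" can look like, the snippet below derives a Last-Modified header from the modification time of the underlying content. The thread concerns PHP; Python is used here only for illustration, and both function names are hypothetical.

```python
import os
from email.utils import formatdate


def last_modified_header(mtime: float) -> str:
    """Format a Unix timestamp as an HTTP date, which is always given in GMT."""
    return formatdate(mtime, usegmt=True)


def headers_for(path: str) -> dict:
    """Build response headers from the mtime of the content source (hypothetical)."""
    return {"Last-Modified": last_modified_header(os.path.getmtime(path))}
```

The important design choice is to use the time the content really changed (a file's mtime, a database row's update timestamp) rather than the time the response was generated.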

    If this is the case, you shouldn't use Incremental Indexing. The Indexer needs an accurate Last-Modified date and time for any of this to be meaningful. There's no other magic way of finding out whether a page has changed, and we haven't quite worked out how to make it telepathic...

    So if you imagine that every page reports itself to be new/changed, then Zoom ends up having to re-download and re-index every page, just like a full re-index. But in addition to doing this, Zoom needs to check the date/time for each file first to determine this, and also manipulate the existing index files to insert/update data as needed. In this scenario, it would certainly be slower.
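The arithmetic behind this can be made concrete with a toy cost model. The per-page costs below are invented numbers in milliseconds, chosen only to show the shape of the trade-off; they are not measurements of Zoom.

```python
def full_reindex_cost(pages: int, download: int) -> int:
    """Every page is downloaded and indexed once."""
    return pages * download


def incremental_cost(pages: int, changed: int, check: int,
                     download: int, update: int) -> int:
    """Every page is date-checked; changed pages are also downloaded
    and merged into the existing index."""
    return pages * check + changed * (download + update)


# Hypothetical per-page costs in milliseconds.
PAGES, CHECK, DOWNLOAD, UPDATE = 1000, 50, 500, 100

full = full_reindex_cost(PAGES, DOWNLOAD)
few_changed = incremental_cost(PAGES, 10, CHECK, DOWNLOAD, UPDATE)
all_changed = incremental_cost(PAGES, PAGES, CHECK, DOWNLOAD, UPDATE)
```

With these assumed numbers, an incremental run with only 10 changed pages is far cheaper than a full re-index, while one where every page reports itself as changed costs more than the full re-index because the checking and index-update overhead is added on top.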

    Please see chapter 2.9 in the Users Guide for more information on Incremental Indexing. This is explained in section 2.9.1.

    If you only need to add new pages to the index, you can specifically use "Add list of new pages" or "Add start points to existing index", without relying on last-modified dates. Only "Update existing index" depends on the last-modified dates in this way.

    There is also chapter 7.7 which tells you how to specify a last-modified date.
    --Ray
    Wrensoft Web Software
    Sydney, Australia
    Zoom Search Engine



    • #3
      Okay, that makes total sense, and it is the kind of explanation I had suspected in the first place.

      Thank you, Ray for the very fast and detailed feedback.
