Technical query - incremental indexing

  • Technical query - incremental indexing

    Are my assumptions about incremental indexing right?
    • For a file (e.g. PDF file) ZOOM looks at the file date on the server to see if that has changed.
    • For HTML files it looks for metadata in the head specifying a LastModified date to see if that has changed.
      • If there is no last modified date then it looks at the file date on the server.

  • #2
    Actually neither of these options is exactly what happens.

    It is also important not to confuse HTTP with HTML.
    HTTP has header fields and HTML has a head section, so the two are easy to mix up.

    For incremental indexing, Zoom asks the web server for the last-modified date and time. The server's answer appears as a field in the HTTP response header. It looks like this:
    Last-Modified: Tue, 15 Nov 1994 12:45:26 GMT
    More details are here.

    If the file was not modified, then the HTML file itself isn't downloaded. So the HTML metadata is never downloaded, and not relevant. Looking at just the HTTP header is much faster than re-downloading every web page.
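
    A minimal Python sketch of that kind of check, for illustration only. It is not Zoom's actual code: the function names, the use of a HEAD request, and the idea of storing a "last indexed" timestamp are all assumptions made for this example.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime
from typing import Optional
from urllib.request import Request, urlopen


def fetch_last_modified(url: str) -> Optional[str]:
    """Ask the server for headers only (HEAD), so no page body is downloaded."""
    with urlopen(Request(url, method="HEAD")) as response:
        return response.headers.get("Last-Modified")


def needs_reindex(last_modified: Optional[str], last_indexed: datetime) -> bool:
    """Return True if the page should be downloaded and indexed again.

    last_indexed is a hypothetical timezone-aware timestamp kept by the indexer.
    """
    if not last_modified:
        # No Last-Modified header: nothing to compare, so assume it changed.
        return True
    try:
        modified = parsedate_to_datetime(last_modified)
    except (TypeError, ValueError):
        # Unparseable date: play safe and re-index.
        return True
    if modified.tzinfo is None:
        # HTTP dates are GMT; treat a zone-less value as UTC.
        modified = modified.replace(tzinfo=timezone.utc)
    return modified > last_indexed
```

    With a last-indexed time of 1 October 2019, the header from the example above (1994) would be skipped, while a page reporting the current time on every request would always be re-indexed.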

    Sometimes, however, a web server reports every page as having been updated (the last-modified time is always the current time). This sometimes happens with dynamic websites that were not coded to provide a last-modified date. For example, PHP pages that are not specifically coded to send a Last-Modified header usually come back with the current date and time, so the page is always treated as new.

    For sites that aren't dynamic (scripted), it is up to the web server to decide what date to report in the HTTP header. For a direct download link to a file (PDF, HTML or any other file), that date is likely to be the file system date and time, but it can also be affected by caching settings.
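
    As a rough illustration of the static-file case, this is how a file system timestamp can be turned into an HTTP date in Python. The helper name is made up for this sketch, and a real web server may apply its own rules (caching, for instance) on top.

```python
import os
from email.utils import formatdate


def file_last_modified(path: str) -> str:
    """Format a file's modification time as an HTTP date (always in GMT)."""
    return formatdate(os.stat(path).st_mtime, usegmt=True)
```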

    Quoting from the HTTP/1.1 specification (RFC 2616, section 14.29):
    "The exact meaning of this header field depends on the implementation of the origin server and the nature of the original resource. For files, it may be just the file system last-modified time. For entities with dynamically included parts, it may be the most recent of the set of last-modify times for its component parts. For database gateways, it may be the last-update time stamp of the record. For virtual objects, it may be the last time the internal state changed.

    An origin server MUST NOT send a Last-Modified date which is later than the server's time of message origination. In such cases, where the resource's last modification would indicate some time in the future, the server MUST replace that date with the message origination date.

    An origin server SHOULD obtain the Last-Modified value of the entity as close as possible to the time that it generates the Date value of its response. This allows a recipient to make an accurate assessment of the entity's modification time, especially if the entity changes near the time that the response is generated.

    HTTP/1.1 servers SHOULD send Last-Modified whenever feasible."
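
    The "MUST NOT send a date in the future" rule quoted above can be sketched in a few lines of Python. The function name is hypothetical, and both arguments are assumed to be timezone-aware datetimes.

```python
from datetime import datetime, timezone
from email.utils import format_datetime


def clamped_last_modified(modified: datetime, now: datetime) -> str:
    """Return a Last-Modified value that is never later than 'now'."""
    # A modification time in the future is replaced by the current time.
    effective = min(modified, now)
    return format_datetime(effective.astimezone(timezone.utc), usegmt=True)
```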

    It is also possible to set a last-modified date and time in HTML using metadata like this:
    <meta http-equiv="Last-Modified" content="Sat, 07 Apr 2001 00:58:08 GMT">
    This is useful for setting the date of a document when sorting search results by date, but it is not much use for incremental indexing.
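
    To make the difference concrete, here is a small Python sketch that reads this meta tag. Note that it needs the full HTML text as input, which is exactly why the tag cannot help decide whether to download the page in the first place. The class and function names are invented for this example and say nothing about how Zoom parses pages.

```python
from html.parser import HTMLParser
from typing import Optional


class MetaLastModified(HTMLParser):
    """Collects the content of the first <meta http-equiv="Last-Modified"> tag."""

    def __init__(self) -> None:
        super().__init__()
        self.value: Optional[str] = None

    def handle_starttag(self, tag, attrs):
        if tag != "meta" or self.value is not None:
            return
        attributes = dict(attrs)
        # Attribute names arrive lowercased; compare the value case-insensitively.
        if (attributes.get("http-equiv") or "").lower() == "last-modified":
            self.value = attributes.get("content")


def meta_last_modified(html: str) -> Optional[str]:
    """Return the meta Last-Modified value from an already downloaded page."""
    parser = MetaLastModified()
    parser.feed(html)
    return parser.value
```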


    • #3
      Thank you, David, for your reply. Dynamically generated pages are the main target of my query. I'm working on a large document-based web site (about 4,000 documents, ranging from 2 to 600 pages each) and doing whatever I can to minimize server load and indexing turnaround times. (I was originally building it with the search engine bundled with the application server; however, that has turned out to be a show stopper due to some undocumented gotchas. We shouldn't be surprised, but ZOOM is actually doing a truckload better. I should have known better.)

      What I'm doing on the dynamically generated pages is getting the last-modified date from the database record that gives rise to that page, converting it to the equivalent GMT date and time, and then formatting it as a date and time stamp to add between the head tags, as in the example below, taken from an actual page that I called just now (you'll note that the date and time is in the past even though the page was dynamically generated now):

      <meta http-equiv="Last-Modified" content="Wed, 02 Oct 2019 04:31:30 GMT">

      so that even though the page is actually generated when you call it, the last date modified recorded in the page will be the date the database record was last modified.

      But I think what you're saying is that whether ZOOM re-indexes the page really depends on the HTTP header.


      • #4
        Originally posted by kpa
        What I'm doing on the dynamically generated pages is getting the last-modified date from the database record that gives rise to that page, converting it to the equivalent GMT date and time, and then formatting it as a date and time stamp to add between the head tags, as in the example below, taken from an actual page that I called just now (you'll note that the date and time is in the past even though the page was dynamically generated now):

        <meta http-equiv="Last-Modified" content="Wed, 02 Oct 2019 04:31:30 GMT">

        so that even though the page is actually generated when you call it, the last date modified recorded in the page will be the date the database record was last modified.

        But I think what you're saying is that whether ZOOM re-indexes the page really depends on the HTTP header.
        Yes, I can confirm your last sentence. A meta Last-Modified date will not help incremental indexing, because the spider can only see this date AFTER it has already downloaded the page. So if you want to make sure an unchanged page is not re-indexed during an incremental index, the last-modified date needs to be specified in the HTTP header.

        As David suggested above, the meta last-modified date is only useful (in the context of Zoom) for date sorting.
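
        For a dynamically generated page, that means sending the database record's date as a real HTTP header. Below is a minimal sketch as a Python WSGI application, for illustration only: the thread does not say what stack the site runs on, RECORD_MODIFIED is a hypothetical stand-in for the database lookup, and the 304 branch is standard HTTP conditional-request handling rather than something the thread says Zoom relies on.

```python
from datetime import datetime, timezone
from email.utils import format_datetime, parsedate_to_datetime

# Hypothetical stand-in for the record's last-modified time from the database.
RECORD_MODIFIED = datetime(2019, 10, 2, 4, 31, 30, tzinfo=timezone.utc)


def app(environ, start_response):
    """Serve a generated page with a Last-Modified HTTP header."""
    last_modified = format_datetime(RECORD_MODIFIED, usegmt=True)

    # If the client already holds a copy at least this new, skip the body.
    since = environ.get("HTTP_IF_MODIFIED_SINCE")
    if since:
        try:
            if RECORD_MODIFIED <= parsedate_to_datetime(since):
                start_response("304 Not Modified",
                               [("Last-Modified", last_modified)])
                return []
        except (TypeError, ValueError):
            pass  # Unparseable date: fall through and send the full page.

    body = b"<html><body>Document body</body></html>"
    start_response("200 OK", [
        ("Content-Type", "text/html; charset=utf-8"),
        ("Content-Length", str(len(body))),
        ("Last-Modified", last_modified),
    ])
    return [body]
```

        The page is still generated on every request, but the header now reports the record's date instead of the current time.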

        --Ray
        Wrensoft Web Software
        Sydney, Australia
        Zoom Search Engine
