PDF Files, Indexing Technique Offline


  • PDF Files, Indexing Technique Offline

    First up - new user alert.

    I am currently taking on a task for a large organisation that will not permit Zoom on its servers or network. For a number of years the tool has been run offline on a laptop, with the index then loaded onto the server, which works very well.

    Given the success of this approach, another business entity has decided to follow suit, but they have numerous linked PDF files (thousands).

    The Problem
    Downloading the 15,000-16,000 HTML files takes a few hours, indexing takes a few minutes, and uploading the index files takes about 20 minutes. Currently we do not download the PDF files because the time is prohibitive, but I would like to work out a way to get them indexed too. Ideally they would be downloaded closer to the server location and scanned there, or perhaps a mirror site would be kept on a laptop for just this purpose (I may have just answered my own question here).

    Is it possible to merge two index results? Are there any other options experienced users have come up with?

    The search engine is distributed on CDROM as well as the website.

  • #2
    Short answer is there's no easy way to merge two indexes. Longer answer is any attempt at doing so would be quite complicated and not worth it for the purpose here. What you can do however is incrementally add to an index.

    If download time is prohibitive, I would recommend you get an offline mirror copy of the site and all the files (HTML, PDF, etc.) and use Offline Mode to index. Evidently the site fits on a CDROM, so it should be no bigger than about 600 MB in total.
    --Ray
    Wrensoft Web Software
    Sydney, Australia
    Zoom Search Engine


    • #3
      Thought I'd post a follow-up in case anyone else is in a similar situation. Since I couldn't avoid downloading the HTML, I looked a bit closer at the PDFs and how often they changed. As it turned out, the churn was minimal, so an offsite mirror was created and a process put in place to capture changes to the main site. This means 20 or so files may need to be added or deleted every 3 months or so. Transferring the files is still a painful experience, but it is manageable, and the end user experience is greatly improved, which is more important.
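
      For anyone wanting to automate the "capture changes" step, here is a minimal sketch of one way it could be done. It is not the actual process used here and has nothing to do with Zoom itself: it simply hashes every PDF under a mirror folder and compares two such snapshots to list what was added, deleted or changed. The function names `snapshot` and `diff_snapshots` and the `*.pdf` pattern are my own assumptions for illustration.

```python
import hashlib
from pathlib import Path


def snapshot(root, pattern="*.pdf"):
    """Map each matching file's path (relative to root) to a SHA-256 digest.

    Note: the pattern match is case-sensitive on most non-Windows systems,
    so files named *.PDF would need a second pattern.
    """
    root = Path(root)
    result = {}
    for path in sorted(root.rglob(pattern)):
        if path.is_file():
            rel = path.relative_to(root).as_posix()
            result[rel] = hashlib.sha256(path.read_bytes()).hexdigest()
    return result


def diff_snapshots(old, new):
    """Return (added, deleted, changed) as sorted lists of relative paths."""
    added = sorted(set(new) - set(old))
    deleted = sorted(set(old) - set(new))
    changed = sorted(p for p in set(old) & set(new) if old[p] != new[p])
    return added, deleted, changed
```

      In use, you would call `snapshot()` on the mirror folder before and after refreshing it from the main site, then pass both results to `diff_snapshots()` to see which files need to be transferred before re-indexing in Offline Mode.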
