PDF indexing strategy

  • PDF indexing strategy

    I am looking to use Zoom Search to index a medium-sized set of PDFs (approx. 3000). All the PDFs are in one directory (I don't want to get into moving files around as they are made active or inactive; see below). My question is this: the PDFs have metadata that is stored in a SQL database. Without getting into too much business detail, there are "active" PDFs and "inactive" PDFs, and I would like to index only the "active" ones. This list changes on about a weekly basis, as new PDFs are added and some "active" PDFs become "inactive". Can anyone suggest a strategy for keeping Zoom up to date with the list of PDFs it should index?

    Thank you for your help!
    - Jen

  • #2
    Presuming that you have server-side scripts which access the SQL database and report each PDF's active/inactive status, and presuming that the pages they generate will be the only places where links to the PDF files are offered, then you can consider the following:

    When a page displays an HTML link to a PDF file which is inactive, add some scripting to the page so that it surrounds the link with the <!--ZOOMSTOPFOLLOW--> and <!--ZOOMRESTARTFOLLOW--> tags. This will prevent Zoom's spider mode from following the link and downloading (and thus indexing) the file.
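As a rough illustration of that scripting, here is a minimal Python sketch. The function name `render_pdf_link` and its parameters are made up for this example, and the `active` flag is assumed to come from your own database lookup. Only the two tag names are taken from the post above.

```python
from html import escape

# Tag names as given in the post above.
STOP_FOLLOW = "<!--ZOOMSTOPFOLLOW-->"
RESTART_FOLLOW = "<!--ZOOMRESTARTFOLLOW-->"


def render_pdf_link(href: str, title: str, active: bool) -> str:
    """Return an HTML link; wrap it in the stop/restart tags if inactive.

    Hypothetical helper: `active` is assumed to come from the SQL database.
    """
    link = f'<a href="{escape(href, quote=True)}">{escape(title)}</a>'
    if active:
        return link
    # Inactive file: the spider should not follow this link.
    return f"{STOP_FOLLOW}{link}{RESTART_FOLLOW}"
```

Every page that links to the PDFs would need to route its links through a helper like this for the approach to hold.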

    More information on the ZOOMSTOP and ZOOMRESTART tags can be found in the Users Guide and Help files.

    Note that when crawling the site in spider mode, you will need to block off every link to these inactive PDF files on your website. So if you have other pages with direct links to the PDF files, you will need to do the same on those pages too.
    --Ray
    Wrensoft Web Software
    Sydney, Australia
    Zoom Search Engine

    • #3
      Actually, I'm not using Zoom to index the website itself, which will contain active/inactive links based on a user's input to a database search.

      Zoom is meant to offer a full-text search of the active PDFs, which of course the database can't offer. I use Zoom just to search the directory of PDFs itself, using the directory listing as a guide to the PDFs. I was thinking that there may be some way to use the Skip Links configuration, but I would want the database to generate the Skip Links rather than having to manually go through the GUI config tool to add all those files... I figured there might be a better way to do it. Any other thoughts?

      Thank you for your very quick response.

      Jen

      • #4
        If you are an experienced developer, we provide an SDK for Zoom which includes documentation on the ZCFG file format. This means that you can create a script that generates a new ZCFG file each time (with a skip list reflecting the current inactive statuses) and calls Zoom via the command line to re-index the files accordingly. More information on the SDK here:
        http://www.wrensoft.com/zoomsdk/index.html
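The database half of such a script might look something like the sketch below. It only collects the inactive filenames that would go into the skip list. Writing them into a ZCFG file and invoking Zoom are left out, since the file format and command-line options are covered by the SDK documentation and are not reproduced here. The table name `pdfs` and the columns `filename` and `active` are assumptions, and `sqlite3` stands in for whatever SQL database you actually use.

```python
import sqlite3


def inactive_pdf_names(conn: sqlite3.Connection) -> list[str]:
    """Return the filenames currently marked inactive, sorted by name.

    Assumes a hypothetical table `pdfs(filename TEXT, active INTEGER)`.
    """
    rows = conn.execute(
        "SELECT filename FROM pdfs WHERE active = 0 ORDER BY filename"
    )
    return [name for (name,) in rows]
```

Run weekly (or whenever statuses change), the returned list is what the generated configuration's skip list would be built from.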

        Alternatively, you could use Spider Mode and point it to a specially created script/page which generates HTML links to only the active PDF files. This way Zoom will never find the inactive PDF files.
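A minimal sketch of such a page generator, in Python, could look like this. The table `pdfs` with columns `filename` and `active`, the function name `active_pdf_index`, and the `base_url` parameter are all assumptions for illustration, and `sqlite3` is only a stand-in for your own SQL database.

```python
import sqlite3
from html import escape
from urllib.parse import quote


def active_pdf_index(conn: sqlite3.Connection, base_url: str) -> str:
    """Build an HTML page that links only to the active PDFs.

    Assumes a hypothetical table `pdfs(filename TEXT, active INTEGER)`.
    """
    rows = conn.execute(
        "SELECT filename FROM pdfs WHERE active = 1 ORDER BY filename"
    ).fetchall()
    items = "\n".join(
        # Percent-encode the filename for the URL, HTML-escape the label.
        f'<li><a href="{escape(base_url + quote(name), quote=True)}">'
        f"{escape(name)}</a></li>"
        for (name,) in rows
    )
    return f"<html><body>\n<ul>\n{items}\n</ul>\n</body></html>"
```

Serving this output at a fixed URL and using that URL as the spider's start point means inactive files are simply never linked, so nothing has to be moved or skipped by hand.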
        --Ray

        • #5
          Thank you. Those are excellent suggestions. I didn't realize that you had an SDK. I will look at it.

          Best,
          Jen
