Suppressing scan of .pdf files


  • Suppressing scan of .pdf files

    I have a slew of large (approx. 2 MB) .pdf files that contain mostly imagery. Each .pdf file has an associated .desc file that I use to index the content of the files.

    I would like the Zoom spider to skip scanning the .pdf files themselves. This would greatly speed up index generation.

    It would be great if you could develop a Meta Tag for use in the .desc file that would tell the spider to not bother downloading or scanning the .pdf when the Meta Tag so directs.

    e.g.: <meta name="IGNORE" content="1">
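    To illustrate the idea, here is a minimal Python sketch of the kind of check I have in mind. It is not how Zoom works today; the `IGNORE` tag is only my proposal, and the names `IgnoreMetaParser` and `should_skip_pdf` are made up for this example.

```python
from html.parser import HTMLParser


class IgnoreMetaParser(HTMLParser):
    """Records whether the proposed <meta name="IGNORE" content="1"> tag appears."""

    def __init__(self):
        super().__init__()
        self.ignore = False

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        if tag != "meta":
            return
        attr_map = {name: (value or "") for name, value in attrs}
        name = attr_map.get("name", "").upper()
        content = attr_map.get("content", "").strip()
        if name == "IGNORE" and content == "1":
            self.ignore = True


def should_skip_pdf(desc_text: str) -> bool:
    """Return True if the .desc text asks the spider to skip the associated PDF.

    Hypothetical helper: assumes the .desc file has already been read as text.
    """
    parser = IgnoreMetaParser()
    parser.feed(desc_text)
    parser.close()
    return parser.ignore
```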

  • #2
    We agree that this would be useful. We plan to add support for meta robots tags in V5.0, which would allow you to specify the ignore parameter.

    However, for PDF and .desc files, this would also require a change to the way Zoom currently behaves, since it retrieves the PDF file before the .desc file. For what you suggest to work, we would have to retrieve the .desc file first, and the changed behaviour would raise some issues (e.g., do we discard the .desc file contents if the eventual PDF file fails to scan?). Nonetheless, we'll look into it and keep it on our list of things to consider for V5.0.
    --Ray
    Wrensoft Web Software
    Sydney, Australia
    Zoom Search Engine


    • #3
      Ignore PDFs

      Could you put all the PDFs in one folder, then add that folder name to the skip list?

      Alternatively, could you just remove the .PDF extension from the list of file types to be indexed?
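      As a rough illustration of how the two options differ, here is a generic Python sketch of a crawler-side filter. It is not Zoom's code or configuration format; the `/pdfs/` folder name, the extension list, substring matching for the skip list, and the `should_index` function are all assumptions made up for the example.

```python
import posixpath
from urllib.parse import urlparse

# Hypothetical settings, not taken from any real Zoom configuration.
SKIP_SUBSTRINGS = ["/pdfs/"]                      # option 1: skip a whole folder
INDEXED_EXTENSIONS = {".html", ".htm", ".desc"}   # option 2: .pdf left out


def should_index(url: str) -> bool:
    """Return True if the URL passes both the skip list and the extension list."""
    path = urlparse(url).path
    # Skip list: reject any URL whose path contains a listed substring.
    if any(fragment in path for fragment in SKIP_SUBSTRINGS):
        return False
    # Extension list: only index file types that are explicitly enabled.
    extension = posixpath.splitext(path)[1].lower()
    return extension in INDEXED_EXTENSIONS
```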
