Strategy for treatment of 7z files

  • Strategy for treatment of 7z files

    I am indexing several pages that contain links to quite a number of .7z files (why someone would want to compress a PDF file escapes me, but that's another story). I've told ZOOM to treat these as binary files and not to extract content from them. The links to the documents contain quite rich text about each document, so I'm keen to harvest that but not the actual file. At this stage I have added a file extension entry for 7z so I can display an icon in search results. However, even though I've told ZOOM not to interrogate the content of the file, ZOOM is spending a very long time looking at each file. Is my strategy right here? Thank you!

  • #2
    Select the file type in your extensions list and click 'Configure', and make sure that the option "Extract recognizable text from binary file" is UNCHECKED. If this is checked, it will spend a lot more time digging through the binary file.

    Otherwise it should only use the filename and should not take much time, although there may be a few cases where it could look at the file in greater detail.

    Are you using Spider Mode or Offline Mode? If spider mode, is the file being served via a download script which specifies a different Content-Type header?

    If possible, email us a copy of the .zcfg file and we can take a closer look at your configuration.
    --Ray
    Wrensoft Web Software
    Sydney, Australia
    Zoom Search Engine
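
As a rough illustration of the Content-Type question above: a download script can serve a file under a media type that does not match the extension in the link. The sketch below compares a response's Content-Type header with the types one might expect for the URL's extension. The `EXPECTED_TYPES` table and the function name `served_type_matches` are assumptions made up for this example and do not reflect how Zoom classifies files.

```python
import posixpath
from urllib.parse import urlparse

# Hypothetical mapping for illustration only; not Zoom's internal table.
EXPECTED_TYPES = {
    ".7z": {"application/x-7z-compressed", "application/octet-stream"},
    ".pdf": {"application/pdf"},
}


def served_type_matches(url, content_type_header):
    """Return True if the Content-Type header is plausible for the URL's extension."""
    # Take the extension from the path only, ignoring any query string.
    ext = posixpath.splitext(urlparse(url).path)[1].lower()
    # Drop parameters such as "; charset=utf-8" and normalise case.
    media_type = content_type_header.split(";", 1)[0].strip().lower()
    expected = EXPECTED_TYPES.get(ext)
    if expected is None:
        return True  # unknown extension: nothing to compare against
    return media_type in expected
```

A link ending in `.7z` that comes back as `text/html` would be flagged as a mismatch, which is the kind of case where a crawler may need to look at the file more closely.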

    • #3
      Thanks Ray! I'm using spider mode. I made sure the "Extract recognizable text from binary file" box was unchecked because, as best I know, ZOOM can pull apart a zip file but not a 7z file. The file sits behind a link. I will check how it's being downloaded, but I suspect it's a plain old a href. Where can I send the config file if I can't resolve this from your advice?

      • #4
        You can email the file to us (with reference to this thread) via zoom [at] wrensoft (dot) com. Contact details also found here.
        --Ray
        Wrensoft Web Software
        Sydney, Australia
        Zoom Search Engine

        • #5
          Thanks Ray - I will get that organised as soon as I can. If it's any help in the meantime, here is an example of a page I'm trying to index with 7z files: http://www.classiccmp.org/cpmarchives/trs80/Library/Books/

          • #6
            We've confirmed that the time is being spent downloading the file in Spider Mode. Although the file is treated as just a binary file, the current behaviour is to download it regardless. We agree that this is not ideal, but the binary file / filename-only option is more commonly used for Offline Mode indexing. Downloading also handles a few other scenarios, such as when the file doesn't exist (i.e. a broken link), or when the downloaded file turns out to be a different format than the URL suggests (possible with a web server).

            Having said that, we agree that it would be better if binary files (filename only, no text extraction) were not downloaded and were simply indexed after checking that the file exists. We've added this functionality to the next release (V7 build 1013).
            --Ray
            Wrensoft Web Software
            Sydney, Australia
            Zoom Search Engine
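
For readers curious what "checking that the file exists" without downloading it can look like over HTTP, here is a minimal sketch using a HEAD request, which asks the server for the response headers only. It is not Zoom's implementation; the function name `link_exists` and the choice of HEAD are assumptions for illustration. The sketch does not follow redirects, and some servers do not answer HEAD requests properly, so a real crawler would likely need a fallback.

```python
import http.client
from urllib.parse import urlsplit


def link_exists(url, timeout=10):
    """Return True if a HEAD request for the URL gets a 2xx response."""
    parts = urlsplit(url)
    if parts.scheme == "https":
        conn = http.client.HTTPSConnection(parts.netloc, timeout=timeout)
    else:
        conn = http.client.HTTPConnection(parts.netloc, timeout=timeout)
    try:
        # Rebuild the request target from path and query string.
        target = parts.path or "/"
        if parts.query:
            target += "?" + parts.query
        # HEAD returns status and headers only, so the file body is never transferred.
        conn.request("HEAD", target)
        response = conn.getresponse()
        return 200 <= response.status < 300
    except (OSError, http.client.HTTPException):
        # Connection failures and malformed responses are treated as "not found".
        return False
    finally:
        conn.close()
```

With a check like this, a large .7z file costs one small round trip rather than a full download, while a broken link is still detected.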

            • #7
              Thanks Ray - that's an excellent outcome, and it's very refreshing to see a software vendor that is this responsive to the questions and needs of its customers. Unless I hear differently, I'll take it that you no longer need the config file.
