Not detecting PDF files downloaded

  • Not detecting PDF files downloaded

    I’m running version 6.0 build 1028, and am indexing a site written in PHP. Most of the documents available for download from the site are retrieved like this:

    getfile.php?name=ABCD

    which opens a new window that downloads the file "HELLO.PDF". What's strange is that the first time I indexed the site with Zoom, it correctly detected all 50 PDF files and indexed them. On every run since then, however, it has treated the files as PHP pages.

    Is this a known bug? Something I need to set?

  • #2
    Maybe something changed on the server?
    When indexing in spider mode, if the file extension in the URL doesn't match the document type, the document type is identified from the MIME type in the HTTP Content-Type header.

    So the first thing to check is what information is in the HTTP header that comes back from your server.

    If you don't know how to do this, post the URL in question and we can have a look at the header.
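A minimal sketch of that check in Python, using only the standard library. The two lookup tables are hypothetical and only illustrate the rule described above (header first, URL extension as a fallback); they are not Zoom's actual tables, and the `example.com` URL in the usage note is a placeholder.

```python
import os
import urllib.request
from urllib.parse import urlsplit

# Hypothetical lookup tables, for illustration only.
MEDIA_TYPE_TO_KIND = {"application/pdf": "pdf", "text/html": "html"}
EXTENSION_TO_KIND = {".pdf": "pdf", ".htm": "html", ".html": "html", ".php": "html"}


def media_type(content_type_header):
    """Return the bare media type, lowercased, without parameters like charset."""
    if not content_type_header:
        return ""
    return content_type_header.split(";", 1)[0].strip().lower()


def document_kind(url, content_type_header):
    """Prefer the Content-Type header; fall back to the URL's file extension."""
    kind = MEDIA_TYPE_TO_KIND.get(media_type(content_type_header))
    if kind:
        return kind
    extension = os.path.splitext(urlsplit(url).path)[1].lower()
    return EXTENSION_TO_KIND.get(extension, "unknown")


def fetch_content_type(url, timeout=30):
    """Request the URL and return (status code, raw Content-Type header value)."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.status, response.headers.get("Content-Type")
```

For example, `fetch_content_type("http://example.com/getfile.php?name=ABCD")` would return the status and header for a download link of that shape. If the header comes back as `text/html` rather than `application/pdf`, a spider following this rule would treat the document as a web page.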



    • #3
      I can't send a URL since the site prevents deep linking, but if you go to

      karlog.ca

      Then browse to COMPANY, then PUBLICATIONS, and click any of the publication links. You will see that a PDF opens, so I assume the MIME type is right.



      • #4
        If you can't link to the PDF with a URL, then it doesn't seem reasonable to expect the spider to find the (non-existent) link. It also seems to contradict your initial post.



        • #5
          I did some TCP/IP traces on your site. Turns out you do have direct URLs that can be linked to. They are just not contained in the HTML source.

          Here is one,

          http://www.karlog.ca/Desktop/Company/Publications.php/Include/PHP/FileDownload.php?source=100&id=7532F3F86E8040E5A5255C20A7B65668

          This URL returns these headers,
          HTTP/1.1 200 OK
          Date: Wed, 18 Apr 2012 23:28:00 GMT
          Server: Apache
          Pragma: public
          Expires: 0
          Cache-Control: public
          Content-Description: File Transfer
          Content-Disposition: attachment; filename="Publication_1apr2004.pdf"
          Content-Transfer-Encoding: binary
          Content-Length: 74885
          Keep-Alive: timeout=15, max=100
          Connection: Keep-Alive
          Content-Type: application/pdf

          Which looks OK.

          *BUT* when I do the same thing outside of a browser your site throws an error.

          Check out this text dump,

          HTTP/1.1 200 OK
          Date: Wed, 18 Apr 2012 23:39:31 GMT
          Server: Apache
          Expires: Thu, 19 Nov 1981 08:52:00 GMT
          Cache-Control: no-store, no-cache, must-revalidate, post-check=0, pre-check=0
          Pragma: no-cache
          Set-Cookie: PHPSESSID=ir4ub9tnkvr0qig8ktqbm1atn4; path=/
          Set-Cookie: UREF=BF005EF73F934BC0A074D5828064446E; expires=Sat, 16-Apr-2022 23:39:31 GMT
          Vary: Accept-Encoding
          Connection: close
          Content-Type: text/html

          <h1><font color="#FF0000">File Download Error</font></h1><hr><br><table border="0" cellspacing="0" cellpadding="5"> <tr> <td bgcolor="#CCCCCC"><div align="right">Description:</div></td> <td>Internal error</td> </tr> <tr> <td bgcolor="#CCCCCC"><div align="right">Code:</div></td> <td>7006</td> </tr></table><p><em><font size="-2">OCG technical support has been automatically notified of this error. </font></em></p>

          It looks like your site is browser sniffing, or changing its behavior based on session data. In short, it is being too smart for its own good.
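One way to confirm this kind of behavior is to request the same URL with different User-Agent strings and compare what comes back. The following is a rough sketch in Python, standard library only. The function names and the User-Agent strings in the usage note are placeholders, not values any particular client is known to send. It relies on the fact that a PDF file starts with the bytes `%PDF-`.

```python
import urllib.request

PDF_MAGIC = b"%PDF-"  # every PDF file begins with this marker


def classify(content_type_header, body_start):
    """Describe a response from its Content-Type header and first bytes."""
    media = (content_type_header or "").split(";", 1)[0].strip().lower()
    if body_start.lstrip().startswith(PDF_MAGIC):
        return "pdf"
    if media == "application/pdf":
        # The header promises a PDF but the body does not look like one.
        return "claims-pdf-but-body-is-not"
    return media or "unknown"


def probe(url, user_agents, timeout=30):
    """Fetch url once per User-Agent and return {user_agent: classification}.

    Note: urlopen raises HTTPError on 4xx/5xx responses. The error page in
    the dump above arrived with a 200 status, so it would be classified here.
    """
    results = {}
    for user_agent in user_agents:
        request = urllib.request.Request(url, headers={"User-Agent": user_agent})
        with urllib.request.urlopen(request, timeout=timeout) as response:
            header = response.headers.get("Content-Type")
            results[user_agent] = classify(header, response.read(1024))
    return results
```

Calling `probe(url, ["Mozilla/5.0", "my-test-client"])` with the download URL would show whether the answer depends on the User-Agent. If every variant returns the same result, the difference is more likely tied to cookies or session state.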



          • #6
            That's correct - the content protection system will not allow retrieval of pages from outside the site. That's not a bug; it works as intended.

            However, when the user agent matches ZSEBOT, it allows Zoom to retrieve the file. I can't explain it, but the behavior is inconsistent.

            Perhaps unrelated, but zoom writes its default config file to c:\documents and settings\all users\etc...

            This folder is usually (as of Sp3 on xp?) read only for all except administrator...might be useful if zoom wrote config files to the user's doc & settings folder. (For now I have changed permissions on this folder)..



            • #7
              I still think the problem is with your site, and not with the indexer. You should be able to debug it with Wireshark.



              • #8
                Originally posted by ocgltd
                Perhaps unrelated, but zoom writes its default config file to c:\documents and settings\all users\etc...

                This folder is usually (as of Sp3 on xp?) read only for all except administrator...might be useful if zoom wrote config files to the user's doc & settings folder. (For now I have changed permissions on this folder)..
                You can save the config file to another folder. Zoom will always open the last used config file by default so it doesn't matter where it is.

                Having said that, in my testing, Zoom writes to the following by default:
                C:\Users\...

                So I'm not sure if maybe you had an older version of Zoom previously? And the new Zoom is just using the one that you're loading from? What version of Windows are you using?

                Regarding the PHP/PDF download issue: make sure you have "Reload all files (do not use cache)" enabled under "Configure"->"Spider options". Beyond that, we can't really investigate further to verify what your site is doing. Given that it changes behavior based on different factors, and involves your custom site code, we can't make any guesses as to what's really happening. We do know, though, that Zoom is responding to the Content-Type value in the HTTP header.
                --Ray
                Wrensoft Web Software
                Sydney, Australia
                Zoom Search Engine
