[PDF plugin error] Failed to read or parse PDF file. File may require a password.

  • [PDF plugin error] Failed to read or parse PDF file. File may require a password.

    Since build 1015 I am getting errors like the above for PDF files which are *not* password-protected and are not even in a login-protected section of our website.
    These files can be easily downloaded via Firefox and opened with PDF-XChange Editor. Like the previously reported 406 errors, these errors generally occur for different files on successive runs of Zoom Search Indexer. In other words, a file which triggers this error on one run can be indexed without problems on a subsequent run. Since I am dependent on JavaScript, I am also dependent on a complete, error-free index run. This can take many repetitions, which take hours because I am reduced to using single-threading and the longest possible delay between pages in order to minimize these errors.

  • #2
    I tried running the indexer offline on a backup copy of the website and got similar errors on PDF files which can easily be opened by simply copying the path from the ZSE log and pasting it into a Windows "Run" dialog. There was one case where the file really couldn't be found because ZSE had modified the filename, replacing a "ü" (u with umlaut) by "u?", but the two other errors were for files which can be opened without a password.

    P.S. The index generated offline was unusable in any case. When added to the website, the search results referenced pages which didn't contain the search term.
    Last edited by imcz; 06-28-2021, 11:44 AM.



    • #3
      There are several different ways PDF files can be protected. They can be encrypted with a password, but they can also be flagged to prevent printing and text extraction (but still allow viewing).
      See example below from Adobe Acrobat.


      [Image: PDF-Page-Extraction.png]


      Other PDF files might have no text in them (i.e. no OCR layer) and just be a photograph or a scan.

      If you think you have a different case, can you post a link to an example file?
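
As a rough illustration of the "flagged to prevent text extraction" case: in the PDF specification (ISO 32000-1, Table 22), these restrictions are stored as bits of a signed 32-bit integer, the /P entry of the file's encryption dictionary. The sketch below only decodes such a value. The function names and the sample values -3904 and -44 are assumptions chosen for illustration, obtaining /P from a real file needs a PDF library and is not shown, and nothing here reflects how Zoom itself reads PDFs.

```python
# Permission bits from the PDF specification (ISO 32000-1, Table 22).
# Bit positions are 1-based, counted from the least significant bit.
PERMISSION_BITS = {
    "print": 3,
    "modify": 4,
    "extract_text": 5,
    "annotate": 6,
    "fill_forms": 9,
    "extract_for_accessibility": 10,
    "assemble": 11,
    "print_high_quality": 12,
}


def decode_permissions(p_value: int) -> dict:
    """Decode a signed 32-bit /P integer into named permission flags."""
    unsigned = p_value & 0xFFFFFFFF  # view the signed value as 32 raw bits
    return {
        name: bool(unsigned & (1 << (bit - 1)))
        for name, bit in PERMISSION_BITS.items()
    }


def blocks_text_extraction(p_value: int) -> bool:
    """Return True if the flags forbid copying or extracting text (bit 5 cleared)."""
    return not decode_permissions(p_value)["extract_text"]


# Hypothetical sample values: -3904 clears every permission bit,
# -44 leaves printing and text extraction allowed.
print(blocks_text_extraction(-3904))  # True
print(blocks_text_extraction(-44))    # False
```

A file with text extraction blocked can still open in a viewer without any password prompt, which is consistent with a viewer reporting the document as readable while an indexer refuses to extract its text.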



      • #4
        Dear David,
        Thanks for responding. Unfortunately, I once again failed to receive (even in my spam folder) an e-mail notification of your response, even though I am subscribed to this thread.
        Here is a link to a document which has "no security" (at least according to PDF Xchange Editor) and still fails to index:
        PDF Security properties of non-indexed file
        It is a regular authored PDF with embedded text, not a scan.
        I have attached the log file, which also shows the DLL load error which I reported in another thread.
        To answer your question there, the error occurs consistently (every time).
        Thanks for your support.

        P.S. Here is another link to a PDF with no security which provokes the same error.
        Last edited by imcz; 07-16-2021, 12:07 PM.



        • #5
          Adobe says something different for the same document.

          [Image: No-page-extraction-pdf.png]



          • #6
            Sorry, my bad. I misread the filename, so the link I provided points to a different document (202107.pdf) from the one causing the error (202102.pdf). I can confirm your analysis of 202102.pdf. The error caused by the file in the second (P.S.) link seems to be due to a discrepancy between the filename in the (wget) backup we are scanning (due to the notorious 406 errors) and the link ZSE is using.
            ZSE is looking for "Kunstfu?hrung, flyer 2020_english_Webversion.pdf", but the file is really named "Kunstführung, flyer 2020_english_Webversion.pdf".
            I'm not yet sure whether the problem lies with ZSE or with wget. If you have any ideas, please let me know.
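
One possible explanation, not confirmed anywhere in this thread, is a Unicode normalization mismatch: the name may be stored in decomposed form ("u" followed by the combining diaeresis U+0308) and the combining character is then lost when the name passes through a legacy code page, leaving "u?". The sketch below uses only Python's standard `unicodedata` module to show the effect. The helper names `same_filename` and `lossy_latin1` are made up for this example, and Latin-1 stands in for whatever encoding is actually involved.

```python
import unicodedata


def same_filename(a: str, b: str) -> bool:
    """Compare two filenames after normalizing both to NFC (composed form)."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


def lossy_latin1(name: str) -> str:
    """Round-trip a name through Latin-1, replacing unencodable characters with '?'."""
    return name.encode("latin-1", errors="replace").decode("latin-1")


# The filename from the post, written with a precomposed "ü" (U+00FC).
composed = "Kunstf\u00fchrung, flyer 2020_english_Webversion.pdf"
# The same name in decomposed form: "u" followed by U+0308.
decomposed = unicodedata.normalize("NFD", composed)

print(composed == decomposed)              # False: the code points differ
print(same_filename(composed, decomposed)) # True: equal after normalization
print(lossy_latin1(decomposed))            # Kunstfu?hrung, flyer 2020_english_Webversion.pdf
print(lossy_latin1(composed) == composed)  # True: the precomposed form survives
```

If this is what is happening, comparing the raw bytes of the filename on disk with the href in the page source would show which side holds the decomposed form.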



            • #7
              Hi, I have addressed this issue in your email, and we will follow up there so we don't end up discussing this in multiple places.
              --Ray
              Wrensoft Web Software
              Sydney, Australia
              Zoom Search Engine
