Indexing PDF files - Garbage characters at start of doc


  • Indexing PDF files - Garbage characters at start of doc

    Search results generated from a large library of Adobe Acrobat files, accessed via a web script, include a lot of surrounding text in the description, with garbage characters that are presumably sitting in the raw PDF file next to the words of interest.

    For example, a typical Zoom result for a search for "Technology Monitoring" looks like this:

    1. Technology Monitoring for Business Success
    ... ï ëí îí èí ïð Š Ú æ õíí øð ï ìé îð ðë íð Š ©© ©ò®³ ò ®¹ TECHNOLOGY MONITORING FOR BUSINESS SUCCESS Summary of EIRMA Working Group 55 Report Technology Monitoring (TM) is now recognised as a crucial activity for ...

    Is there any way to avoid or tidy up this type of presentation?

    thanks
    Last edited by AndrewD; Apr-24-2007, 06:16 PM.

  • #2
    It is hard to know whether this is the result of strange content in the particular PDF or some other issue. Can you post the URL of the PDF in question, or e-mail it to us?

    What version of Zoom do you have?

    Also, which search script option are you using (PHP, ASP or CGI)?



    • #3
      I believe we found the file in question here:
      http://www.eirma.asso.fr/pubs/abstract/abstract55.pdf

      The problem appears to be with the way the PDF file was created. In the document properties, it is reported as having been created by "pdfFactory Pro".

      The garbage characters are actually stored inside the PDF file, in place of the address and phone numbers in the letterhead of that document. If you open the PDF file in Adobe Acrobat Reader, select the text (either just the letterhead or all of the text in the document; the results are the same), and copy and paste it into a text editor, you will see that Acrobat Reader also retrieves what seem to be garbage characters in place of the letterhead text.
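The copy-and-paste check can be roughly automated once the text has been extracted by whatever tool you prefer. The following is only a minimal sketch: the function names and the 0.3 threshold are hypothetical, and the heuristic assumes mostly English content, since it treats any token without an ASCII letter or digit as suspicious and would misfire on legitimately non-Latin text.

```python
import re

# Matches any ASCII letter or digit; tokens without one are treated as suspicious.
ASCII_ALNUM = re.compile(r"[A-Za-z0-9]")


def garbled_ratio(text: str) -> float:
    """Return the fraction of whitespace-separated tokens with no ASCII letter or digit."""
    tokens = text.split()
    if not tokens:
        return 0.0
    suspicious = [t for t in tokens if not ASCII_ALNUM.search(t)]
    return len(suspicious) / len(tokens)


def looks_garbled(text: str, threshold: float = 0.3) -> bool:
    """Flag text whose share of suspicious tokens exceeds the (arbitrary) threshold."""
    return garbled_ratio(text) > threshold


# The leading garbage and the readable part of the result quoted in this thread.
garbage = "ï ëí îí èí ïð Š Ú æ õíí øð ï ìé îð ðë íð Š ©© ©ò®³ ò ®¹"
readable = "TECHNOLOGY MONITORING FOR BUSINESS SUCCESS Summary of EIRMA Working Group 55 Report"

print(looks_garbled(garbage))   # True
print(looks_garbled(readable))  # False
```

Running a check like this over the extracted text of each PDF before indexing would give a list of files worth opening by hand.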

      In case you are not already aware: PDF documents can contain a "searchable text" layer, which is supposed to hold the actual text content of the document. What you see on screen is drawn graphically, so the displayed text and the stored text can differ. It seems to me that there is a bug in the pdfFactory software used to create this document, which stored garbage characters in place of some of the text. This does not appear to be a problem with Zoom.
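To illustrate how displayed and stored text can diverge, here is a toy sketch of one common mechanism: a page stores character codes, the font decides which glyph each code draws, and a text extractor needs a separate code-to-character mapping to recover the words. The codes and mapping table below are invented for illustration, and whether this is what actually happened inside this particular file is an assumption.

```python
# Hypothetical embedded font: maps each stored code to the glyph drawn on screen.
FONT_GLYPHS = {0xC0: "T", 0xC1: "e", 0xC2: "l"}

# What the page stores for a piece of text: codes, not characters.
page_codes = bytes([0xC0, 0xC1, 0xC2])


def rendered(codes: bytes, glyphs: dict) -> str:
    """What a viewer shows: each code is drawn using the font's glyph for it."""
    return "".join(glyphs[c] for c in codes)


def extracted(codes: bytes, to_unicode: dict = None) -> str:
    """What copy-and-paste or an indexer sees.

    With a correct code-to-character mapping the text is recovered; without
    one, the codes are interpreted as Latin-1 here (an assumed fallback).
    """
    if to_unicode is None:
        return codes.decode("latin-1")
    return "".join(to_unicode[c] for c in codes)


print(rendered(page_codes, FONT_GLYPHS))   # Tel
print(extracted(page_codes))               # ÀÁÂ
print(extracted(page_codes, FONT_GLYPHS))  # Tel
```

The page looks right in a viewer in both cases; only the extracted text differs, which matches the symptom of readable pages producing unreadable search descriptions.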
      --Ray
      Wrensoft Web Software
      Sydney, Australia
      Zoom Search Engine



      • #4
        This is correct. Actually, the bug was probably in me (which is usually the case, I'm sure).

        By recreating our letterhead and then regenerating the documents with pdfFactory properly configured, we now have PDF files that are indexed correctly and meaningfully.
