

Indexing pdf files


  • Indexing pdf files

    Hi

    I am using Zoom to index hundreds of PDF files. I notice that most have no metadata, so no title is returned for them in the search results.

    I have seen your manual metadata solution using a text file, which is OK but time-consuming for 800 files, and would involve 800 small text files!

    I note that Google does manage to extract the title and other information from the same files and present it in its results, presumably via an automated process. Are there any plans to add this to the Zoom PDF plugin? In the interim, could you suggest a tool that can handle a batch of PDFs and set the metadata?

    Thanks

  • #2
    If there is no title in the document, then it is impossible to extract the title.

    But I guess the assumption you are making is that, a lot of the time, the document title is the very first line of text in a PDF file.

    I did a quick check on Google. Google seems to look first at the PDF metadata (as does Zoom) and then, if nothing satisfactory is found, grabs the first line of text. Sometimes this works for Google and sometimes it doesn't.

    For example there are lots of PDF files in Google with the title "press release" or "for immediate release". In this case using the file name (as Zoom does) would maybe have allowed better differentiation of the documents. You don't want every document to have the same title.

    The situation can be especially bad in the case where all documents are from the same company (as is often the case with Zoom customers). For example, if the content of all of your company documents starts with
    Your Company Name
    Document type
    Document title
    Then "Your Company Name" would be recorded as the title for every single document.

    But you are right, having this as an option might be a nice addition for some customers.
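
    As a rough illustration of the fallback order discussed above (metadata title, then first line of text, then file name), here is a minimal Python sketch. It is not how Zoom or Google actually work internally. The function name `choose_title`, the `GENERIC_TITLES` list and the `extra_generic` parameter are all hypothetical, and it assumes the metadata title and the text lines have already been extracted from the PDF by some other tool.

```python
import os
import re

# Hypothetical list of boilerplate first lines that make poor titles.
GENERIC_TITLES = {"press release", "for immediate release"}


def _normalize(text):
    """Collapse runs of whitespace and trim both ends."""
    return re.sub(r"\s+", " ", text or "").strip()


def choose_title(meta_title, text_lines, file_name, extra_generic=()):
    """Pick a display title for a document.

    Order of preference (an assumption based on the discussion above):
      1. the metadata title, if present and not generic
      2. the first non-empty line of text, if not generic
      3. the file name without its extension
    """
    generic = GENERIC_TITLES | {_normalize(g).lower() for g in extra_generic}

    # First non-empty line of the extracted text, or "" if there is none.
    first_line = next((_normalize(l) for l in text_lines if _normalize(l)), "")

    for candidate in (_normalize(meta_title), first_line):
        if candidate and candidate.lower() not in generic:
            return candidate

    # Fall back to the file name so every document gets a distinct title.
    return os.path.splitext(os.path.basename(file_name))[0]
```

    Passing the company name via `extra_generic` avoids the "Your Company Name" problem described above, because that first line is rejected and the file name is used instead. A smarter version might try the second or third line before giving up, but that is a design choice this sketch does not make.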
