Search results for PDF ignores columns

  • Search results for PDF ignores columns

    I am using Zoom to index PDFs of a newspaper. The files are a spread (so there is page1.pdf, pages2-3.pdf, pages4-5.pdf, etc.). The index seems to cross over columns on the page, so the result words that appear before the search term are from the previous page.

    Search term: raceway

    Example text from the PDF:
    Raceway rep addresses fire, rescue procedures (headline)
    BYERS — The Byers Fire Protection District Board met with a representative from High Plains Raceway at its Feb. 9 meeting.
    Joe Gilmore, regional executive of Colorado Region of the Sports Car Club of America, attended the meeting to discuss fire and rescue procedures at the race track, which opens in April.

    Results description:
    . 8 The I-70 Scout Tuesday, February 17, 2009 Tuesday, February 17, 2009 The I-70 Scout 9 Rural women focus Raceway rep addresses of MCC biz seminar fire, rescue procedures money go to the local citizens." Women for Rural Business, a one ...

    The article about the Rural Women's biz seminar appears on page 8, and the Raceway article appears on page 9. Here's a link to the search page and to the resulting PDF.

    This edition will only be up for a few more days, but you will see the same result with other pages. This has been going on a while - I'm just now getting some time to work on it.

    Any ideas?

  • #2
    More detail on where each part of the result description comes from:

    . 8 The I-70 Scout Tuesday, February 17, 2009 Tuesday, February 17, 2009 The I-70 Scout 9 Rural women focus Raceway rep addresses of MCC biz seminar fire, rescue procedures money go to the local citizens." Women for Rural Business, a one ...

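    The green text (". 8 The I-70 Scout Tuesday, February 17, 2009") is from the header of page 8.

    The pink text ("Tuesday, February 17, 2009 The I-70 Scout 9") is from the header of page 9.

    The turquoise text ("Rural women focus" and "of MCC biz seminar") is from the headline of the article at the top of page 8.

    The blue text ("Raceway rep addresses" and "fire, rescue procedures") is from the headline of the raceway story at the top of page 9.

    The purple text ("money go to the local citizens."") is from the top of column 2 of the raceway story on page 9.

    The grey text ("Women for Rural Business, a one ...") is the first line of the story on page 8.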



    • #3
      Originally posted by pststerrye
      I am using Zoom to index PDFs of a newspaper. The files are a spread (so there is page1.pdf, pages2-3.pdf, pages4-5.pdf, etc.). The index seems to cross over columns on the page, so the result words that appear before the search term are from the previous page.
      Double click on the ".pdf" extension on the "Scan Options" panel of the Configure Indexer tab. You can control the "Scan Method" here.

      From the Users Guide (chapter 2.17.5) and Help file:

      Scan Method (PDF only)
      This option allows you to utilize alternative methods of extracting the text content from PDF files. Due to
      the technical limitations of the PDF file format, the textual content stored within a PDF file can be
      ambiguous in its order of presentation. For example, text may be split up in several columns, but this
      may not be defined within the PDF file itself as to when a sentence ends and when it wraps around. It is
      only structured visually.


      For some PDF files (it depends on how they were created), the default scan method ("presentation
      layout") may not be the best at preserving the order of text as intended, and in such situations, you
      should try the other two methods available: "raw formatting order", and "text layer".
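
To illustrate the ordering problem the help text describes, here is a toy Python sketch. It is not Zoom's extraction code and does not claim to show how any of the three scan methods work internally. The coordinates are invented for illustration, and `Fragment`, `row_order`, and `block_order` are hypothetical names; only the text snippets come from this thread. It shows how reading positioned text straight across a two-page spread interleaves the two headlines, while reading one page block at a time keeps each headline together.

```python
from typing import NamedTuple


class Fragment(NamedTuple):
    x: float   # left edge of the text fragment (invented value)
    y: float   # distance from the top of the spread (invented value)
    text: str


# Hypothetical layout: page 8 on the left (x = 0), page 9 on the right (x = 500).
fragments = [
    Fragment(0, 0, "8 The I-70 Scout Tuesday, February 17, 2009"),
    Fragment(500, 0, "Tuesday, February 17, 2009 The I-70 Scout 9"),
    Fragment(0, 40, "Rural women focus"),
    Fragment(500, 40, "Raceway rep addresses"),
    Fragment(0, 60, "of MCC biz seminar"),
    Fragment(500, 60, "fire, rescue procedures"),
]


def row_order(frags):
    """Read straight across the spread: top to bottom, then left to right."""
    return [f.text for f in sorted(frags, key=lambda f: (f.y, f.x))]


def block_order(frags, split_x):
    """Read everything left of split_x first, then everything to the right."""
    left = sorted((f for f in frags if f.x < split_x), key=lambda f: f.y)
    right = sorted((f for f in frags if f.x >= split_x), key=lambda f: f.y)
    return [f.text for f in left + right]


if __name__ == "__main__":
    print(" ".join(row_order(fragments)))
    print(" ".join(block_order(fragments, split_x=500)))
```

The first line printed has the same interleaving as the result description in the original post; the second keeps "Raceway rep addresses" next to "fire, rescue procedures". Which scan method gives the better order for a given PDF still has to be found by trying each one, as the help text says.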
      --Ray
      Wrensoft Web Software
      Sydney, Australia
      Zoom Search Engine



      • #4
        That worked great! Thanks for the assistance and quick reply!
