Referencing searched data


  • Referencing searched data

    I am looking to buy software to add a search engine to my website (jjonz.us/RadioLogs).

    When finished, the site will have 45,000 PDF files. My problem is that these PDF files are scanned images. I have tried to OCR them with Acrobat 7, but Acrobat is not a great OCR program and the results have been unsatisfactory. Instead, I think I will use OmniPage, which will create a text file for each PDF file.

    Here is the problem: if I run a search over the OCR'd text files, the results will show up as hits in the text files. Using your software, is there any way to cross-reference the results back to the original PDF files?

    Thanks for your help.

    jj



    ps. I sent the above message as an email, but it was undeliverable with the following message:

    Delay reason: SMTP error from remote mail server after RCPT TO:<info [at] wrensoft.com>:
    host mailwash4.pair.com [66.39.2.4]: 450 <info [at] wrensoft.com>:
    Recipient address rejected: Service temporarily unavailable

  • #2
    If you wish to use OmniPage, it looks like you should use their OmniPage Search Indexer Plugin:
    http://www.nuance.com/omnipage/search/

    According to that site, this should add the OCR'ed text layer within the PDF file itself, just like Acrobat's Paper Capture/OCR features. This text layer within the PDF file will allow search engines such as Zoom and Google to index the text content directly from the PDF file. It will also allow you to use the "Find" (Ctrl-F) feature within Acrobat Reader with greater success.

    If you'd rather go ahead with the alternative approach of creating .txt files to index separately, you may be able to use the "Rewrite links" feature to change these links to point back to your original PDF files. (See "Rewrite links" on the "Indexing Options" tab of the Configuration window).
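As a rough illustration of the mapping that such a link rewrite has to perform, here is a minimal Python sketch. It is not Zoom's implementation or configuration syntax; it only assumes that each OCR'd text file sits next to its PDF with the same base name (for example `log0001.txt` beside `log0001.pdf`). The function name `txt_hit_to_pdf` and the example URL are hypothetical.

```python
from urllib.parse import urlsplit, urlunsplit


def txt_hit_to_pdf(url: str) -> str:
    """Map the URL of an OCR'd .txt file to its sibling .pdf.

    Assumption: the text file and the PDF share a directory and base name.
    URLs that do not end in .txt are returned unchanged.
    """
    parts = urlsplit(url)
    path = parts.path
    if path.lower().endswith(".txt"):
        # Swap only the extension; host, query and fragment are preserved.
        path = path[:-4] + ".pdf"
    return urlunsplit(parts._replace(path=path))


if __name__ == "__main__":
    # Hypothetical search hit pointing at a text file.
    print(txt_hit_to_pdf("http://example.com/RadioLogs/log0001.txt"))
```

If the text files live in a separate folder from the PDFs, the same idea applies, but the directory part of the path would need to be replaced as well.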

    Originally posted by jjonz
    ps. I sent the above message as an email, but it was undeliverable with the following message:

    Delay reason: SMTP error from remote mail server after RCPT TO:<info [at] wrensoft.com>:
    host mailwash4.pair.com [66.39.2.4]: 450 <info [at] wrensoft.com>:
    Recipient address rejected: Service temporarily unavailable
    This does not mean that the e-mail was rejected, only delayed (by an amount of time determined by your outgoing mail server). The 450 code is a temporary failure, and it is usually a sign of what is known as 'greylisting' on the receiving server, which is used to prevent spam. Most normal mail servers should automatically re-send the e-mail after a set period of time. If your server does not do this and reports an error instead, you are likely to be using a mail server that is not RFC-compliant. You may want to check with your ISP. Search for 'greylisting' on Google for more information. There are many useful sites which explain it in detail, such as this one:
    http://projects.puremagic.com/greylisting/
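To make the "delayed, not rejected" distinction concrete, here is a small Python sketch that classifies an SMTP reply by the first digit of its code. The function name `classify_smtp_reply` is made up for illustration, and this is not how any particular mail server is implemented; it only reflects the standard meaning of the reply-code classes.

```python
def classify_smtp_reply(code: int) -> str:
    """Classify an SMTP reply code by its first digit."""
    if 200 <= code < 300:
        return "success"
    if 300 <= code < 400:
        return "intermediate"       # server is waiting for more input
    if 400 <= code < 500:
        return "temporary failure"  # sender should queue the mail and retry
    if 500 <= code < 600:
        return "permanent failure"  # sender should bounce the mail
    raise ValueError(f"not an SMTP reply code: {code}")


if __name__ == "__main__":
    # The code from the delay notice quoted above.
    print(classify_smtp_reply(450))
```

A greylisting server answers the first delivery attempt with a 4xx code, so a compliant sending server keeps the message queued and tries again later.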
    Last edited by Ray; Mar-27-2007, 12:46 AM.
    --Ray
    Wrensoft Web Software
    Sydney, Australia
    Zoom Search Engine
