  • Limit of 500,000 unique words

    I am using Zoom Indexer Pro, v6.0 Build 1019, on Windows XP. I have successfully used it for several indexing projects on websites and CDs, including one of over 12,000 files (9.7 GB), which it handled fine.

    I would now like to index our "online library" of some 9,000 or so files, totalling 6.65 GB. Most of these files are in .pdf format, with some .doc files and a few other formats including .xls and .txt.

    Unfortunately the indexing stops after processing about 1100 files, having reached the maximum of 500,000 unique words.

    I find it very surprising that the indexing should have found 500,000 unique words in only 1100 files, when the average vocabulary is about a tenth of this. I concluded that maybe, since these files contain contributions by many different individuals of varying standards of literacy and technical competence, they contain a great many misprints and misspellings (and probably misreadings by OCR as well).

    So I set myself to reducing the number of unique words, and changed the "Skip" section of the configuration to skip words of fewer than 5 characters (previously set to the default of skipping words of fewer than 2 characters). To my surprise this made no difference at all; the result was identical.

    Is this right? Surely this should have resulted in a substantial reduction in the number of words indexed?

    I then had a look at the zoom_dictionary.zdat file generated, which I believe contains the words found and indexed. It contains lots of words of four characters or fewer (as well as some words which I find it difficult to believe were in the documents indexed). I did not understand the format of this file, as it is clearly not just a word list!

    Could you help, please?

    If necessary I am prepared to upgrade from the Pro version to the Enterprise version, but I am aware that if I do succeed in indexing so very many unique words then the size of my index and the speed of the search will suffer. Your advice would be most welcome.

  • #2
    Skip words are in fact still in the index, but you won't be able to search for them; they are needed for the context display. So adding skip words isn't the best way to reduce the word count. And yes, the dictionary is not just a simple list of words; it is really part of a database.

    The best way to investigate is to have a close look at the source documents and see where all the words are coming from. My guess from what you said is that you have 100,000s of OCR errors. Are these documents online anywhere where we can see them?
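
    For example (just a rough sketch, and not something built into Zoom), if you dump the PDFs to plain text first with a tool such as pdftotext, a small script along these lines will show which files are adding the most new unique words to the dictionary. The folder name and the simple word pattern are only placeholder assumptions:

```python
# Rough sketch, not part of Zoom: tally how many new unique words each file
# adds to the dictionary. Assumes the PDFs were already dumped to plain text
# (e.g. with pdftotext) into a folder of .txt files; the folder name and the
# simple word pattern below are placeholder assumptions.
import re
from pathlib import Path

WORD = re.compile(r"[A-Za-z]+")
seen = set()

for path in sorted(Path("extracted_text").glob("*.txt")):
    words = {w.lower() for w in WORD.findall(path.read_text(errors="ignore"))}
    new_words = words - seen  # words this file adds to the running dictionary
    seen |= words
    print(f"{path.name}: {len(new_words)} new unique words (running total {len(seen)})")
```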



    • #3
      Wow, that's a quick response! - thanks. You are very helpful.

      In the meantime I've gone on playing and indexed a small manageable bit of my library to make sure I could upload it ok and that it would work on my website. I'm sorry, I can't give you access to it as it's members only, but I've got a printout of the page I got from a simple search - and this amply demonstrates the problem I've got.

      Can't find how to attach files to these posts but I'll upload the pdf to www.one-name.org/librarysearch.pdf and you might like a giggle at the result.



      • #4
        OCR errors

        Sorry, forgot to say, you were dead right about the 100,000s of OCR errors.



        • #5
          Yes, it looks like a bad OCR job; about 1 in 4 words is wrong. The errors are typical of poor OCR: missing space characters and confusion between the i, l and 1 characters. So you get "Transferrlng the Family flles" instead of "Transferring the family files".

          Maybe you were using a poor quality OCR program, or maybe the source documents were not scanned at a high enough resolution to get a good conversion to text, or maybe the source documents were blurred or poor quality to start with. It is bad enough that I would consider redoing the OCR step, but with 9,000 documents I expect this will be very painful.
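
          As a rough illustration only (not a Zoom feature), a small filter like the one below can flag some of the more obvious OCR junk, such as digits glued inside words or very long run-together tokens. The patterns and thresholds are just guesses, and pure l-for-i swaps like "Transferrlng" are much harder to detect automatically:

```python
# Rough illustration only (not a Zoom feature): flag tokens that look like
# OCR junk, such as digits glued inside words or very long run-together words.
# The patterns and thresholds are guesses; pure l-for-i swaps like
# "Transferrlng" are much harder to detect automatically.
import re

SUSPECT = re.compile(
    r"[A-Za-z]*\d+[A-Za-z]+\w*"   # digit stuck inside a word, e.g. "f1les"
    r"|\w{25,}"                   # very long token, likely two words run together
)

def suspicious_tokens(text):
    return [t for t in text.split() if SUSPECT.fullmatch(t)]

sample = "Transferrlng the Family f1les to the new catalogue"
print(suspicious_tokens(sample))  # ['f1les']
```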



          • #6
            Thank you Wrensoft. I think it's a combination of all the factors you listed. But with the amount we already paid for digitisation I don't think there's a hope of being able to do it again. What a pity.

            The odd thing is that this library was previously indexed for searches when it was held on a single PC - the librarian's - and no mention was made of this problem then. I've just been brought in as the website wallah and am making unpleasant discoveries. Oh well.

            Thanks again for your help.



            • #7
              A quick tip:

              You can usually check the text layer of a PDF document by opening the file in Acrobat Reader, clicking "Edit"->"Select All" and then copying the selection. This should put all the text on the clipboard. You can then open up a text editor (such as Notepad) and paste it all into the window. Look through that to see if there are any significant problems with the OCR-produced text layer.

              Note that this may not work for all PDF files, if the file has security settings in place which prevent the text layer from being copied out in this manner. But if it prevents this, then it is possible (though not always the case) that a search engine like Zoom would also be prohibited from extracting the text.
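
              If you'd rather not do that by hand for thousands of files, one possible alternative (only a sketch, using the third-party pypdf package and a placeholder file name) is to dump the text layer with a short script and eyeball the output. Like the Acrobat method, it can fail on PDFs whose security settings block text extraction:

```python
# Sketch of an automated check, using the third-party pypdf package; the file
# name is only a placeholder. Like the manual Acrobat method, this can fail on
# PDFs whose security settings block text extraction.
from pypdf import PdfReader

reader = PdfReader("journal_1998_spring.pdf")  # placeholder file name
for i, page in enumerate(reader.pages):
    if i >= 3:                                 # the first few pages are usually enough
        break
    print(page.extract_text())
```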

              Another tip:

              If you are indexing a collection of files, and finding it swamped by a subset of these PDF files which contain a large amount of junk content -- thus making it difficult to index the full set of files -- you can consider truncating such files from the index.

              You can do this by using the "Limit words per file" setting (under "Configure"->"Limits"), which means that the Indexer will stop indexing a file once a certain number of words is found within it, and move on to the next file. This would give you a lot more space to index the rest of your collection, without having to increase your limits significantly.
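
              As a back-of-the-envelope check (outside of Zoom itself), you could also re-count unique words over the text dumped from your PDFs while only looking at the first N words of each file, to get a feel for what "Limit words per file" value would keep you under the 500,000 limit. The folder name and cap values below are only illustrative assumptions:

```python
# Back-of-the-envelope estimate, outside of Zoom itself: re-count unique words
# over text dumped from the PDFs, but only look at the first N words of each
# file, roughly mimicking what "Limit words per file" would keep. Folder name
# and cap values are illustrative assumptions.
import re
from pathlib import Path

WORD = re.compile(r"[A-Za-z]+")

def unique_words_with_cap(folder, cap):
    seen = set()
    for path in Path(folder).glob("*.txt"):
        words = WORD.findall(path.read_text(errors="ignore"))
        seen.update(w.lower() for w in words[:cap])  # truncate each file at `cap` words
    return len(seen)

for cap in (500, 1000, 5000):
    print(f"cap={cap}: {unique_words_with_cap('extracted_text', cap)} unique words")
```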
              --Ray
              Wrensoft Web Software
              Sydney, Australia
              Zoom Search Engine



              • #8
                Thank you, Ray, for those tips.

                With the second one, limiting the number of words, I'm not sure how this works. Is this limit applied to the total number of words read from a file, or to the number of unique words found?
                In other words, if I had a (silly) file of 500 occurrences of the word "zoom", followed by a single instance of the word "search", and I set the limit to 500 words, would it give up at the 500th "zoom" or would it keep looking for the 500th unique word and index "search" as well? If it did the latter, then I think you've solved my problem.



                • #9
                  The "Limit words per file" setting applies to the actual number of words indexed from the page, rather than the number of unique words. The reason for this is that the latter makes it too hard for a user to judge what is actually happening, and control what should or shouldn't be indexed.

                  The original idea was not to filter out junk, but to allow the user to only take the first portion of a document if their files are simply too big.

                  If you think about it further, applying a limit to the first x number of unique words would not solve your problem, because there would be just as many bad words at the top of your document as there are towards the bottom. Although, it is probably true that most of the legitimate words are already indexed (and so are not new unique words), and there are very few new unique words that are legitimate. Nonetheless, it wouldn't quite work - you'd need a high limit for your initial documents (the files at the start of your indexing process) to index completely, and then this same high limit would have to be applied to the files found later in your indexing process.

                  Having said that, I still think the "Limit words per file" setting, even when it applies to the actual number of words in a file, could help your situation if you are willing to compromise the ability to search for words found towards the end of some large files.

                  It may help to point out that even Google (at last review) truncates files for indexing, and you often can't search for words at the bottom of some large files.
                  --Ray
                  Wrensoft Web Software
                  Sydney, Australia
                  Zoom Search Engine

