  • Limit of 500,000 unique words

    I am using Zoom Indexer Pro, v6.0 Build 1019, on Windows XP. I have successfully used it for several indexing projects on websites and CDs, including one of over 12,000 files (9.7 GB), which it handled fine.

    I would now like to index our "online library" of some 9,000 or so files, totalling 6.65 GB. Most of these files are in .pdf format, with some .doc files and a few other formats including .xls and .txt.

    Unfortunately the indexing stops after processing about 1100 files, having reached the maximum of 500,000 unique words.

    I find it very surprising that the indexing should have found 500,000 unique words in only 1100 files, when the average vocabulary is about a tenth of this. I concluded that maybe, since these files contain contributions by many different individuals of varying standards of literacy and technical competence, they contain a great many misprints and misspellings (and probably misreadings by OCR as well).

    So I set myself to reducing the number of unique words, and changed the "Skip" section of the configuration to skip words of fewer than 5 characters (previously set to the default of skipping words of fewer than 2 characters). To my surprise this made no difference at all; the result was identical.

    Is this right? Surely this should have resulted in a substantial reduction in the number of words indexed?

    I then had a look at the zoom_dictionary.zdat file generated, which I believe contains the words found and indexed. It contains lots of words of four characters or fewer (as well as some words which I find it difficult to believe were in the documents indexed). I did not understand the format of this file, as it is clearly not just a word list!

    Could you help, please?

    If necessary I am prepared to upgrade from the Pro version to the Enterprise version, but I am aware that if I do succeed in indexing so very many unique words then the size of my index and the speed of the search will suffer. Your advice would be most welcome.

  • #2
    Skip words are in fact still in the index, but you won't be able to search for them; they are needed for the context display. So adding skip words isn't the best way to reduce the word count. And yes, the dictionary is not just a simple list of words; it is really part of a database.

    The best way to investigate is to have a close look at the source documents and see where all the words are coming from. My guess from what you said is that you have 100,000s of OCR errors. Are these documents online anywhere where we can see them?
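
    For example (just a rough sketch, and not something built into Zoom), if you dump the PDFs to plain text first with a tool such as pdftotext, a small script along these lines will show which files are adding the most new unique words to the dictionary. The folder name and the simple word pattern are only placeholder assumptions:

```python
# Rough sketch, not part of Zoom: tally how many new unique words each file
# adds to the dictionary. Assumes the PDFs were already dumped to plain text
# (e.g. with pdftotext) into a folder of .txt files; the folder name and the
# simple word pattern below are placeholder assumptions.
import re
from pathlib import Path

WORD = re.compile(r"[A-Za-z]+")
seen = set()

for path in sorted(Path("extracted_text").glob("*.txt")):
    words = {w.lower() for w in WORD.findall(path.read_text(errors="ignore"))}
    new_words = words - seen  # words this file adds to the running dictionary
    seen |= words
    print(f"{path.name}: {len(new_words)} new unique words (running total {len(seen)})")
```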



    • #3
      Wow, that's a quick response! - thanks. You are very helpful.

      In the meantime I've gone on playing and indexed a small manageable bit of my library to make sure I could upload it ok and that it would work on my website. I'm sorry, I can't give you access to it as it's members only, but I've got a printout of the page I got from a simple search - and this amply demonstrates the problem I've got.

      Can't find how to attach files to these posts but I'll upload the pdf to www.one-name.org/librarysearch.pdf and you might like a giggle at the result.



      • #4
        OCR errors

        Sorry, forgot to say, you were dead right about the 100,000s of OCR errors.



        • #5
          Yes, it looks like a bad OCR job; about 1 in 4 words is wrong. The errors are typical of poor OCR: missing space characters and confusion between the i, l and 1 characters. So you get "Transferrlng the Family flles" instead of "Transferring the family files".

          Maybe you were using a poor quality OCR program, or maybe the source documents were not scanned at a high enough resolution to get a good conversion to text, or maybe the source documents were blurred or poor quality to start with. It is bad enough that I would consider redoing the OCR step, but with 9,000 documents I expect this will be very painful.
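
          As a rough illustration only (not a Zoom feature), a small filter like the one below can flag some of the more obvious OCR junk, such as digits glued inside words or very long run-together tokens. The patterns and thresholds are just guesses, and pure l-for-i swaps like "Transferrlng" are much harder to detect automatically:

```python
# Rough illustration only (not a Zoom feature): flag tokens that look like
# OCR junk, such as digits glued inside words or very long run-together words.
# The patterns and thresholds are guesses; pure l-for-i swaps like
# "Transferrlng" are much harder to detect automatically.
import re

SUSPECT = re.compile(
    r"[A-Za-z]*\d+[A-Za-z]+\w*"   # digit stuck inside a word, e.g. "f1les"
    r"|\w{25,}"                   # very long token, likely two words run together
)

def suspicious_tokens(text):
    return [t for t in text.split() if SUSPECT.fullmatch(t)]

sample = "Transferrlng the Family f1les to the new catalogue"
print(suspicious_tokens(sample))  # ['f1les']
```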



          • #6
            Thank you Wrensoft. I think it's a combination of all the factors you listed. But with the amount we already paid for digitisation I don't think there's a hope of being able to do it again. What a pity.

            The odd thing is that this library was previously indexed for searches when it was held on a single PC - the librarian's - and no mention was made of this problem then. I've just been brought in as the website wallah and am making unpleasant discoveries. Oh well.

            Thanks again for your help.



            • #7
              A quick tip:

              You can usually check the text layer of a PDF document by opening the file in Acrobat Reader, clicking "Edit"->"Select All" and then copying the selection. This should put all the text on the clipboard. You can then open up a text editor (such as Notepad) and paste it all into the window. Look through that to see if there are any significant problems with the OCR-produced text layer.

              Note that this may not work for all PDF files, if the file has security settings in place which prevent the text layer from being copied out in this manner. But if it prevents this, then it is possible (though not always the case) that a search engine like Zoom would also be prohibited from extracting the text.
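
              If you'd rather not do that by hand for thousands of files, one possible alternative (only a sketch, using the third-party pypdf package and a placeholder file name) is to dump the text layer with a short script and eyeball the output. Like the Acrobat method, it can fail on PDFs whose security settings block text extraction:

```python
# Sketch of an automated check, using the third-party pypdf package; the file
# name is only a placeholder. Like the manual Acrobat method, this can fail on
# PDFs whose security settings block text extraction.
from pypdf import PdfReader

reader = PdfReader("journal_1998_spring.pdf")  # placeholder file name
for i, page in enumerate(reader.pages):
    if i >= 3:                                 # the first few pages are usually enough
        break
    print(page.extract_text())
```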

              Another tip:

              If you are indexing a collection of files, and finding it swamped by a subset of these PDF files which contain a large amount of junk content -- thus making it difficult to index the full set of files -- you can consider truncating such files from the index.

              You can do this by using the "Limit words per file" setting (under "Configure"->"Limits"), which means that the Indexer will stop indexing a file once a certain number of words is found within it, and move on to the next file. This would give you a lot more space to index the rest of your collection, without having to increase your limits significantly.
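
              As a back-of-the-envelope check (outside of Zoom itself), you could also re-count unique words over the text dumped from your PDFs while only looking at the first N words of each file, to get a feel for what "Limit words per file" value would keep you under the 500,000 limit. The folder name and cap values below are only illustrative assumptions:

```python
# Back-of-the-envelope estimate, outside of Zoom itself: re-count unique words
# over text dumped from the PDFs, but only look at the first N words of each
# file, roughly mimicking what "Limit words per file" would keep. Folder name
# and cap values are illustrative assumptions.
import re
from pathlib import Path

WORD = re.compile(r"[A-Za-z]+")

def unique_words_with_cap(folder, cap):
    seen = set()
    for path in Path(folder).glob("*.txt"):
        words = WORD.findall(path.read_text(errors="ignore"))
        seen.update(w.lower() for w in words[:cap])  # truncate each file at `cap` words
    return len(seen)

for cap in (500, 1000, 5000):
    print(f"cap={cap}: {unique_words_with_cap('extracted_text', cap)} unique words")
```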
              --Ray
              Wrensoft Web Software
              Sydney, Australia
              Zoom Search Engine



              • #8
                Thank you, Ray, for those tips.

                With the second one, limiting the number of words, I'm not sure how this works. Is this limit applied to the total number of words read from a file, or to the number of unique words found?
                In other words, if I had a (silly) file of 500 occurrences of the word "zoom", followed by a single instance of the word "search", and I set the limit to 500 words, would it give up at the 500th "zoom" or would it keep looking for the 500th unique word and index "search" as well? If it did the latter, then I think you've solved my problem.



                • #9
                  The "Limit words per file" setting applies to the actual number of words indexed from the page, rather than the number of unique words. The reason for this is that the latter makes it too hard for a user to judge what is actually happening, and control what should or shouldn't be indexed.

                  The original idea was not to filter out junk, but to allow the user to only take the first portion of a document if their files are simply too big.

                  If you think about it further, applying a limit to the first x number of unique words would not solve your problem, because there would be just as many bad words at the top of your document as there are towards the bottom. Although, it is probably true that most of the legitimate words are already indexed (and so are not new unique words), and there are very few new unique words that are legitimate. Nonetheless, it wouldn't quite work - you'd need a high limit for your initial documents (the files at the start of your indexing process) to index completely, and then this same high limit would have to be applied to the files found later in your indexing process.

                  Having said that, I still think the "Limit words per file" setting, even when it applies to the actual number of words in a file, could help your situation if you are willing to compromise the ability to search for words found towards the end of some large files.

                  It may help to point out that even Google (at last review) truncates files for indexing, and you often can't search for words at the bottom of some large files.
                  --Ray
                  Wrensoft Web Software
                  Sydney, Australia
                  Zoom Search Engine

