I am using Zoom Indexer Pro, v6.0 Build 1019 on Windows XP. I have successfully used it for several indexing projects on websites and CDs, including one of over 12,000 files (9.7 GB), which it handled fine.
I would now like to index our "online library" of some 9,000 or so files, totalling 6.65 GB. Most of these files are in .pdf format, with some .doc files and a few other formats including .xls and .txt.
Unfortunately the indexing stops after processing about 1100 files, having reached the maximum of 500,000 unique words.
I find it very surprising that the indexer should have found 500,000 unique words in only 1100 files, when an average English vocabulary is perhaps a tenth of that. I concluded that, since these files contain contributions by many different individuals of varying standards of literacy and technical competence, they probably contain a great many misprints and misspellings (and misreadings by OCR as well).
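As a rough sanity check outside the indexer, I could count the unique words myself with something like the sketch below. This assumes the .pdf and .doc files have first been converted to plain text (for example with pdftotext), and the C:\library_text folder is only a placeholder, not a real path in my setup:

```python
import re
from pathlib import Path

# Count unique word tokens across the extracted plain-text files.
# Assumes the PDFs/DOCs were already converted to .txt (e.g. pdftotext);
# the folder path below is just a placeholder.
words = set()
for txt in Path(r"C:\library_text").rglob("*.txt"):
    text = txt.read_text(encoding="utf-8", errors="ignore")
    words.update(re.findall(r"[a-z]+", text.lower()))

print("unique words found:", len(words))
```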
So I set about reducing the number of unique words, and changed the configuration's "Skip" section to skip words of fewer than 5 characters (it was previously set to the default of skipping words of fewer than 2 characters). To my surprise this made no difference at all; the result was identical.
Is this right? Surely this should have resulted in a substantial reduction in the number of words indexed?
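If the skip setting were being applied, I would have expected a quick check like the following (reusing the words set from the sketch above; the cut-off values are only examples) to show a large drop between a 2-character and a 5-character minimum:

```python
# How many unique words survive each minimum-length cut-off.
# Reuses the 'words' set built by the sketch above; cut-offs are examples.
for min_len in (2, 3, 4, 5):
    kept = sum(1 for w in words if len(w) >= min_len)
    print(f"minimum length {min_len}: {kept} unique words kept")
```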
I then had a look at the zoom_dictionary.zdat file that was generated, which I believe contains the words found and indexed. It contains many words of four characters or fewer (as well as some words that I find it hard to believe were in the documents indexed). I do not understand the format of this file, as it is clearly not just a plain word list!
Could you help, please?
If necessary I am prepared to upgrade from the Pro version to the Enterprise version, but I am aware that if I do succeed in indexing so many unique words, the size of my index and the speed of searches will suffer. Your advice would be most welcome.