Can anyone tell me how to skip words that contain numbers using the skip words feature? I tried adding *1, *2, ... to the skip words list, but it did not work.
skip words - Large number of files and words.
-
What version of the software are you using?
Adding
*1
*2
*3
...
*9
*0
etc..
to the word skip list (not the page skip list) should skip all numbers. The first line, *1, skips all words that contain at least one "1" character; the second line, *2, skips all words that contain at least one "2", and so on.
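As a quick sketch (plain Python, not Zoom's actual matcher), the "contains at least one digit" behaviour described above can be modelled with glob-style patterns. Note that in standard glob syntax a trailing wildcard is needed to express "contains", so the patterns below use `*1*` rather than `*1`:

```python
# Sketch of the "skip any word containing a digit" rule,
# modelled with glob patterns via the standard library.
from fnmatch import fnmatch

# One "contains digit d" pattern per digit, as in the skip list above.
patterns = [f"*{d}*" for d in "1234567890"]

def is_skipped(word: str) -> bool:
    """Return True if the word matches any skip pattern."""
    return any(fnmatch(word, p) for p in patterns)

print(is_skipped("abc123"))       # True  (contains digits)
print(is_skipped("thermofluids")) # False (no digits)
```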
What is the URL to the page that numbers are being indexed from?
Comment
-
I am using V5 build 1017 enterprise edition in offline mode. I found there is no option to directly enter words into the skip words list, so I edited the zoom.zcfg file using WordPad. Here is the part of the file that I edited:
#SKIPWORDS_START
*1
*2
*3
*4
*5
*6
*7
*8
*9
*0
and
or
the
it
is
an
on
we
us
to
of
has
be
all
for
in
as
so
are
that
can
you
at
its
by
have
with
into
#SKIPWORDS_END
Comment
-
Now I have recreated the configuration file from scratch, and I think it works. In the zoom_dictionary.zdat file, all words that contain numbers are followed by "-1", but why do they still have to be in the dictionary file, and why are they still counted as unique words? I am skipping them because I want to reduce the size of the unique words list so that I can increase the number of files to be indexed.
I also found that words like "thermofluids", "Thermofluids", and "ThermoFluids" are treated as three different unique words in the dictionary. This makes the list about 40% longer. Would it be possible for the indexer to ignore case?
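For illustration (plain Python, independent of Zoom), case folding would collapse the three variants above into one entry, which is the reduction being asked about:

```python
# Counting unique words with and without case folding, showing why
# case-preserving storage inflates the unique-word count.
words = ["thermofluids", "Thermofluids", "ThermoFluids", "heat", "Heat"]

exact = set(words)                       # case-sensitive storage
folded = {w.casefold() for w in words}   # case-insensitive storage

print(len(exact), len(folded))  # 5 2
```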
Comment
-
Skipped words are words for which we do not store any index data. This means they cannot be searched for, and it saves space in the index files (which makes searching for other words faster).
They are not entirely omitted from the index, however. We still need to keep track of them for other purposes, most significantly the context description, which recreates the paragraph of contextual text you see around the matched word. This is also why words with different upper and lower casing are stored separately. So skipping them will not reduce your unique words count.
Perhaps you can elaborate on what limits you are reaching and we can give more helpful advice. For example, if you're using the PHP/ASP version, you may want to consider switching over to the CGI option.
Comment
-
How big are each of these files? There's a big difference between 57,000 x small HTML pages, and 57,000 x PDF files (e.g. each PDF potentially containing several hundred pages in itself).
If many of these files are long lists, and you do not need the entirety of them indexed, you could consider using the "Limit words per file" option to restrict indexing to the first portion of each file.
If they are HTML or text files, and they are pre-generated in any way, you could consider adding ZOOMSTOP and ZOOMRESTART tags around content that you do not want indexed. This would also prevent your unique words count from increasing, as the enclosed content is completely omitted from the index.
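As a sketch of what this looks like in a page (assuming the HTML-comment form of the tags; check the Zoom documentation for the exact syntax your version expects):

```html
<h1>Thermofluids Lecture Notes</h1>  <!-- indexed as normal -->

<!--ZOOMSTOP-->
<div class="footer">
  <!-- Everything here is completely omitted from the index -->
  <a href="/home">Home</a> | <a href="/contact">Contact</a>
</div>
<!--ZOOMRESTART-->

<p>Indexed content resumes here.</p>
```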
Comment
-
I have both PDF and HTML files, and their sizes vary; some are very big. I will add the ZOOMSTOP tags to all the HTML files, but that will only remove the header and footer. All the other information still needs to be indexed. I wonder if Zoom could have this as a built-in feature, so that it automatically omits HTML headers and footers while indexing.
As for the upper and lower case differences, would it also be possible for Zoom to have an option to ignore them? I do not think it matters much to keep a word's original case in the context description; why couldn't it all be lower case, since it is only a portion of the sentence?
Comment
-
It's not at all simple to automatically determine what counts as a header or a footer. There is no HTML element that says so; it is purely a design concept, and different HTML can produce the same resulting layout. Doing this would require some AI and training (meaning several passes over the entire content to be indexed), which would slow down indexing significantly, and even at best would produce a flawed result, incorrectly skipping some things and failing to skip others.
For most people, upper and lower case differences are important in context descriptions. An acronym can mean something totally different from the word it resembles: for example, there's "wasp" the insect, and "WaSP", the Web Standards Project.
Comment
-
Another point to add: you should try increasing the "Max. Unique Words" limit from your current 1 million setting. Zoom is likely to issue a warning about the 32-bit address space for limits over 1 million, but you may be able to ignore it, given that you have a small number of large files rather than a large number of small files (the more common scenario, and what our pre-indexing estimates are based on). So try increasing this limit first before deciding you really need to cut down some parts.
Comment
-
By increasing the maximum number of unique words, I was able to index more files, and I noticed the program actually used much less memory than the indexer estimated. When I increase the limit even further, the program overestimates the memory required and refuses to start. Would it be possible to remove this limit in the enterprise edition and just show a warning message instead?
Comment