
skip words - Large number of files and words.


  • skip words - Large number of files and words.

    Can anyone tell me how to skip words that contain numbers using the skip words feature? I tried adding *1, *2, ... to the skip words list, but it did not work.

  • #2
    Also, how can I skip words that contain special characters such as _ - , . $ # *()}? I tried adding *-, *_, *$, ... to the skip list, and that did not work either. I also want to skip words that are longer than 20 characters.



    • #3
      What version of the software are you using?

      Adding
      *1
      *2
      *3
      ...
      *9
      *0
      etc.

      to the word skip list (not the page skip list) should skip all words containing numbers. The first line, *1, skips every word that contains at least one "1" character. The second line, *2, skips every word that contains at least one "2" character, and so on.
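
The matching rule described above can be sketched in a few lines of Python. This is only an illustration of the behaviour as stated in this post (a leading `*` followed by a character skips any word containing that character); it is not Zoom's actual implementation, and the names `DIGIT_PATTERNS` and `is_skipped` are made up for the example.

```python
# One skip entry per digit, as listed in the post: *1, *2, ..., *9, *0.
DIGIT_PATTERNS = [f"*{d}" for d in "1234567890"]


def is_skipped(word: str, patterns: list[str]) -> bool:
    """Return True if the word contains the literal text of any pattern.

    Assumption: a leading '*' means "contains", following the post's
    description. Real wildcard handling in the indexer may differ.
    """
    for pattern in patterns:
        needle = pattern.lstrip("*")
        if needle and needle in word:
            return True
    return False


words = ["thermofluids", "h2o", "rev10", "pump"]
kept = [w for w in words if not is_skipped(w, DIGIT_PATTERNS)]
print(kept)  # ['thermofluids', 'pump']
```

Running a word list through a check like this before indexing is one way to confirm which words you expect the skip list to catch.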

      What is the URL to the page that numbers are being indexed from?



      • #4
        I am using V5 build 1017, Enterprise Edition, in offline mode. I could not find an option to enter words directly in the skip words list, so I edited the zoom.zcfg file in WordPad. Here is the part of the file that I edited:
        #SKIPWORDS_START
        *1
        *2
        *3
        *4
        *5
        *6
        *7
        *8
        *9
        *0
        and
        or
        the
        it
        is
        an
        on
        we
        us
        to
        of
        has
        be
        all
        for
        in
        as
        so
        are
        that
        can
        you
        at
        its
        by
        have
        with
        into
        #SKIPWORDS_END



        • #5
          You can enter skip words directly from the "Skip options" tab in the Zoom configuration window. You risk corrupting the config file by editing it directly (for example, by entering ASCII characters where Unicode is expected).



          • #6
            I have now figured out how to enter the skip words directly in the list, but the indexer still did not skip them.



            • #7
              What version of the software are you using, and what makes you think they weren't skipped?



              • #8
                I have now recreated the configuration file from scratch, and I think it works. In the zoom_dictionary.zdat file, all words that contain numbers are followed by "-1". But why do they still have to be in the dictionary file, and why are they still counted as unique words? I am skipping them because I want to shrink the unique words list so that I can index more files.

                I also found that words like "thermofluids", "Thermofluids", and "ThermoFluids" are treated as three different unique words in the dictionary. This makes the list 40% longer. Would it be possible for the indexer to ignore case?
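
To gauge how much case variants inflate a word list, one could compare case-sensitive and case-folded counts on a sample of the text. The sketch below is a rough measurement aid only: the function name `unique_word_counts` is hypothetical, and the simple letters-only tokenizer is an assumption that will not match how Zoom splits words.

```python
import re


def unique_word_counts(text: str) -> tuple[int, int]:
    """Return (case-sensitive unique count, case-folded unique count)."""
    # Assumption: a "word" is a run of ASCII letters.
    words = re.findall(r"[A-Za-z]+", text)
    case_sensitive = len(set(words))
    case_folded = len({w.casefold() for w in words})
    return case_sensitive, case_folded


sample = "thermofluids Thermofluids ThermoFluids pump Pump"
print(unique_word_counts(sample))  # (5, 2)
```

The gap between the two numbers shows how many dictionary entries exist only because of differences in capitalization.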



                • #9
                  Skipped words are words which we do not store any index data for. This means they can not be searched for, and it saves space in the index files (which makes searching other words faster).

                  They are not entirely omitted from the index however. We still need to keep track of them for other purposes, most significantly the Context Description, which recreates the paragraph of contextual text you see around the matched word. This is also the reason why you have words with different upper and lower casing. So it will not reduce your unique words count.

                  Perhaps you can elaborate on what limits you are reaching and we can give more helpful advice. For example, if you're using the PHP/ASP version, you may want to consider switching over to the CGI option.
                  --Ray
                  Wrensoft Web Software
                  Sydney, Australia
                  Zoom Search Engine



                  • #10
                    The indexer found 1 million unique words after indexing only 57,000 files, and then it stopped. I am using the CGI version on Linux.



                    • #11
                      How big is each of these files? There's a big difference between 57,000 small HTML pages and 57,000 PDF files (each PDF potentially containing several hundred pages).

                      If many of these files are long lists, and you do not need the entirety of them indexed, you could consider using the "Limit words per file" option to restrict indexing to the first portion of each file.

                      If they are HTML or text files, and they are pre-generated in any way, you could consider adding ZOOMSTOP and ZOOMRESTART tags around content that you do not want indexed (and which would prevent your unique words count from increasing, as they would be completely omitted from the index).
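
As a minimal sketch of the tagging step, the Python helper below wraps a known, repeated fragment (such as a shared header or footer) in stop/restart markers. It assumes the tags take the HTML-comment form `<!--ZOOMSTOP-->` and `<!--ZOOMRESTART-->` (check the documentation for your Zoom version), and that the fragment appears verbatim in every page. The name `wrap_fragment` and the sample markup are hypothetical.

```python
# Assumed tag syntax; verify against the Zoom documentation for your version.
ZOOM_STOP = "<!--ZOOMSTOP-->"
ZOOM_RESTART = "<!--ZOOMRESTART-->"


def wrap_fragment(html: str, fragment: str) -> str:
    """Wrap every exact occurrence of `fragment` in stop/restart tags."""
    if not fragment or fragment not in html:
        return html
    wrapped = f"{ZOOM_STOP}{fragment}{ZOOM_RESTART}"
    if wrapped in html:
        # Already tagged; avoid wrapping the same fragment twice.
        return html
    return html.replace(fragment, wrapped)


page = "<body><div id='nav'>Home</div><p>Text</p></body>"
print(wrap_fragment(page, "<div id='nav'>Home</div>"))
```

Exact string matching keeps the example simple. Pages whose headers vary from file to file would need a more careful approach, such as markers emitted by the page generator itself.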
                      --Ray


                      • #12
                        I have both PDF and HTML files, and their sizes vary; some are very big. I will add the ZOOMSTOP tags to all the HTML files, but that will only remove the headers and footers. All the other information still needs to be indexed. I wonder whether Zoom could offer this as a feature, so that it automatically omits the header and footer of HTML files while indexing.

                        As for the upper and lower case differences, would it also be possible for Zoom to have an option to ignore them? I don't think it matters much whether a word keeps its original case in the context description. Why couldn't it be all lower case, since it is only a portion of the sentence?



                        • #13
                          It's not at all simple to determine automatically what counts as a header or a footer. There's no HTML which says this; it's only a design concept, and different HTML can produce the same resulting layout. Doing something like this would require a bit of AI and some training (which means several passes over the entire content to be indexed). That would slow down indexing significantly and, even at best, produce a flawed result: skipping some things incorrectly and not skipping others when expected.

                          For most people, upper and lower case differences are important in context descriptions, if you think about it. An acronym can mean something totally different from the word it resembles. For example, there's "wasp" the insect, and there's "WaSP", the Web Standards Project.
                          --Ray


                          • #14
                            Another point to add: try increasing the "Max. Unique Words" limit from your current setting of 1 million. With a limit above 1 million, Zoom will likely warn you about the 32-bit address space, but you may be able to ignore this warning, since you have a small number of large files rather than a large number of small files (the more common scenario, and the one on which we base our estimates prior to indexing). So try increasing this limit first, before deciding that you really need to cut some parts.
                            --Ray


                            • #15
                              By increasing the maximum number of unique words, I was able to index more files, and I noticed that the program actually used much less memory than the indexer had estimated. When I raise the limit even further, however, the program overestimates the memory required and will not start. Would it be possible to remove this limit in the Enterprise Edition and show only a warning message instead?
