Limit word length

  • Limit word length

    Hello,
    Is there a limit on the length of a word that can be indexed?
    I see in my zoom_dictionary.zdat that some words have been truncated.
    Is it possible to remove this limit?
    Thanks

  • #2
    There are several reasons a word might be truncated, including its length or the use of characters that break up a word, e.g. the dash, plus sign, etc.

    Can you give a few examples of the words being truncated and how these words appear in your source document?

    ----
    David



    • #3
      Example word

      This is the word as it appears in zoom_dictionary.zdat: pkq001-29-1.0_IO_Utilizzo
      The filename, however, is pkq001-29-1.0_IO_Utilizzo_Pegasus.doc
      Thanks



      • #4
        There is a maximum word length of 35 characters, as documented in the Technical Limitations chapter of the Users Guide (http://www.wrensoft.com/zoom/usersguide.html).

        However, that filename (without the ".doc" extension) is under 35 characters, so this should not be the cause of your problem. The position of the split is also not obvious: if it were splitting because of the "_" after "Utilizzo", then it should have split the word earlier than that.

        Is this filename indexed from the content of a page? The word may be broken up for another reason, such as the formatting of the page (for example, if it occurred in a PDF file, where it is broken up in a column layout). If possible, refer us to the original page or file where this word can be found. You could also email us a ZCFG file with your settings so that we can try to replicate the problem.
        --Ray
        Wrensoft Web Software
        Sydney, Australia
        Zoom Search Engine



        • #5
          We have tested your scenario using the ZCFG file you sent us, but we could not reproduce your problem. In our tests, the filename was indexed as "pkq001-29-1.0_IO_Utilizzo_Pegasus" and not "pkq001-29-1.0_IO_Utilizzo".

          The filename was split at the ".doc" part due to the maximum word length limit of 35 characters. Since appending ".doc" to the filename would exceed this limit, it was split into two words "pkq001-29-1.0_IO_Utilizzo_Pegasus" and "doc".
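
The arithmetic behind this split can be checked with a short Python sketch. It is only an illustration of the 35-character limit described above, not Zoom's actual indexing code: the function name `split_at_extension` and the rule of splitting at the last "." are assumptions made for this example.

```python
# Illustrative only: mimics the behaviour described in this post,
# not the real Zoom indexer.
MAX_WORD_LENGTH = 35  # limit quoted from the Users Guide


def split_at_extension(word, max_len=MAX_WORD_LENGTH):
    """Return the word unchanged if it fits, else split it at the last dot."""
    if len(word) <= max_len:
        return [word]
    stem, dot, ext = word.rpartition(".")
    if dot and stem and ext:
        # Assumed rule: the extension becomes a separate word.
        return [stem, ext]
    # No usable dot: fall back to a hard cut at the limit (also an assumption).
    return [word[:max_len]]


filename = "pkq001-29-1.0_IO_Utilizzo_Pegasus.doc"
print(len(filename))                 # 37, which exceeds the 35-character limit
print(split_at_extension(filename))  # ['pkq001-29-1.0_IO_Utilizzo_Pegasus', 'doc']
```

The full filename is 37 characters long and the part before ".doc" is 33, which is consistent with the two indexed words reported above.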

          Can you confirm whether your original bug report was accurate regarding where the filename was truncated?

          Also, can you check the following:

          - Whether you are using the latest version of Zoom (4.2.1013), available here:
          http://www.wrensoft.com/zoom/whatsnew.html

          - Whether there is another occurrence of this filename in your index data (for example, another page may simply refer to this filename) and this is where it is being truncated, rather than the actual filename. In such a case, you may find another occurrence of the full filename indexed in zoom_dictionary.zdat.

          - If the above does not help, can you zip up your index files (all .zdat files) and send them to us.
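
One way to look for other occurrences of the filename is to scan the dictionary file for the truncated word and print what follows each match. The sketch below is a generic byte search, not a Zoom tool: the helper name `find_occurrences` is hypothetical, and it assumes the words are stored in zoom_dictionary.zdat as plain ASCII/UTF-8 text, which the thread suggests but does not confirm.

```python
from pathlib import Path


def find_occurrences(path, needle, context=20):
    """Return (offset, bytes) for each match of needle, with trailing context."""
    data = Path(path).read_bytes()
    hits = []
    start = 0
    while True:
        idx = data.find(needle, start)
        if idx == -1:
            break
        # Keep a few bytes after the match to show whether the word continues.
        hits.append((idx, data[idx:idx + len(needle) + context]))
        start = idx + 1
    return hits


if __name__ == "__main__":
    dictionary = Path("zoom_dictionary.zdat")  # file named in this thread
    if dictionary.exists():
        for offset, snippet in find_occurrences(dictionary, b"pkq001-29-1.0_IO_Utilizzo"):
            print(offset, snippet)
```

If one match continues with "_Pegasus" and another stops after "Utilizzo", the truncated entry probably comes from a second reference to the filename rather than from the file itself.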

          We also noticed that you have set an extremely high unique words limit (10,000,000), which is most likely unnecessary. At this setting, the estimated memory use is over 2 GB. You might want to lower it to something closer to the unique words count (shown in the Status tab) at the end of indexing.
          --Ray
          Wrensoft Web Software
          Sydney, Australia
          Zoom Search Engine
