problem with dashes and indexing

  • problem with dashes and indexing

    I'm indexing a set of HTML pages for offline access. Included among the indexed terms are command line flags, which consist of a dash followed by a single letter (e.g., "-f"). For the most part, there are no problems with these. However, I did notice that I wasn't getting the correct "hits" when searching for -b or -v. I looked in the zoom_index.js file, and sure enough, I found something peculiar:

    "-v,6,4",
    "-b,6,4",
    "-b,6,8,14,15,35,15,269,15,271,16,272,15,281,75,282, 15,288,15,289,15,306,16",
    "-v,6,4,10,8,81,15,122,20,124,15,133,30,271,4,273,15 ",

    For some reason, -b and -v each have two array entries. When I perform a search, only the web pages listed in the first occurrence of a flag in the array are returned. If I swap the two array elements for either -b or -v, the search returns the longer list of results, as expected.

    So I'm wondering why this indexing anomaly is occurring, and if there's a possible workaround. It's not difficult for me to fix the problem manually, but the index file is over 400 KB and I haven't checked it thoroughly for other similar issues. Better to avoid the problem altogether.
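
    For anyone wanting to check the rest of an index for the same symptom, here is a minimal sketch. It assumes each entry is a string of the form "term,n,n,..." as in the excerpt above, and that the entries have already been loaded into a plain array; the function name findDuplicateTerms and the sample data are made up for illustration and are not part of Zoom.

    ```javascript
    // Hypothetical helper: report terms that occur in more than one index entry.
    // Assumes each entry looks like "term,n,n,..." (term = text before the first comma).
    function findDuplicateTerms(entries) {
      const positions = new Map(); // term -> array indexes where it occurs
      entries.forEach((entry, i) => {
        const comma = entry.indexOf(",");
        const term = comma === -1 ? entry : entry.slice(0, comma);
        if (!positions.has(term)) positions.set(term, []);
        positions.get(term).push(i);
      });
      const duplicates = {};
      for (const [term, indexes] of positions) {
        if (indexes.length > 1) duplicates[term] = indexes;
      }
      return duplicates;
    }

    // Shortened sample modelled on the entries quoted above.
    const sampleEntries = [
      "-v,6,4",
      "-b,6,4",
      "-b,6,8,14,15",
      "-v,6,4,10,8",
      "-f,2,4",
    ];
    console.log(findDuplicateTerms(sampleEntries));
    ```

    Any term listed in the result has more than one entry and is worth inspecting by hand.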

    I'm using Zoom v4.1 Professional, by the way.

    Thanks in advance.

  • #2
    Check whether the encoding/charset of the files you are indexing matches your encoding configuration in Zoom (click "Configure" -> "Languages"). A mismatch can sometimes cause characters or words to be mis-encoded when they are written out to file, producing what appears to be a duplicate entry.

    If this is not the problem, it would be best to e-mail us your ZCFG file and some of the HTML pages you are indexing (if the files are not online). Depending on the number of pages that contain these terms ("-b" and "-v"), send us enough pages to reproduce the problem (multiple entries of the same term in the zoom_index.js file). We can then take a closer look.

    You should also ensure that you are using the latest build (Version 4.1 build 1003) available at:
    http://www.wrensoft.com/zoom/whatsnew.html
    --Ray
    Wrensoft Web Software
    Sydney, Australia
    Zoom Search Engine


    • #3
      The charset of the files and Zoom configuration are the same, and I'm using the latest build.

      I'll e-mail the relevant files so you can investigate further...


      • #4
        We had a look at your files and have determined that this is a bug.

        Note that this only affects words beginning with a punctuation character (e.g., "-b", ".net") and only on the JavaScript platform. It occurs when these words appear in different upper/lowercase forms.

        This was triggered in your files because "-b" and "-v" appear in lower case on most pages, except for one ("errors_ndlm.html"), which mentions them in upper case as "-B" and "-V".

        We will have this bug fixed in the next public build - most likely Version 4.2.

        In the meantime, you may want to work around the issue by modifying that single file, replacing "-B" with "-b" and "-V" with "-v".
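
        To locate other pages that could trigger the same condition, something along these lines may help. It is only a sketch: it assumes the page text has already been extracted into strings, it only looks at words starting with "-" or ".", and the function name findMixedCaseTerms and the sample sentences are invented for illustration (only the file name errors_ndlm.html comes from this thread).

        ```javascript
        // Hypothetical helper: find words that start with "-" or "." and appear
        // in more than one upper/lowercase form across a set of page texts.
        function findMixedCaseTerms(pages) {
          const forms = new Map(); // lowercased term -> Map(exact form -> Set of page names)
          const pattern = /(?:^|\s)([-.][A-Za-z][A-Za-z0-9]*)/g;
          for (const [name, text] of Object.entries(pages)) {
            for (const match of text.matchAll(pattern)) {
              const form = match[1];
              const key = form.toLowerCase();
              if (!forms.has(key)) forms.set(key, new Map());
              const byForm = forms.get(key);
              if (!byForm.has(form)) byForm.set(form, new Set());
              byForm.get(form).add(name);
            }
          }
          const report = {};
          for (const [key, byForm] of forms) {
            if (byForm.size > 1) {
              report[key] = {};
              for (const [form, names] of byForm) report[key][form] = [...names];
            }
          }
          return report;
        }

        // Made-up page text; real input would come from the indexed HTML files.
        const samplePages = {
          "usage.html": "Use -b to set the block size and -v for verbose output.",
          "errors_ndlm.html": "The -B and -V flags are rejected here.",
        };
        console.log(findMixedCaseTerms(samplePages));
        ```

        Each term in the result lists its case variants and the pages where they occur, which shows which files would need editing for the workaround.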

        Let us know if you have any questions, or if you continue to have problems.
        --Ray
        Wrensoft Web Software
        Sydney, Australia
        Zoom Search Engine
