Surrounding words
  • Surrounding words

    Normally the results page shows the surrounding words both before and after the highlighted word. In some rare cases, however, the surrounding words appear only after the highlighted word, with nothing before the hit!
    I have around 15000 HTML files, each with a unique title of the form x:y.z, where x, y and z are numbers. As far as I can tell, this happens only in the files where x and y are both zero (0:0.z). Very strange.

  • #2
    Maybe there are no words before the highlighted word in the documents in question?


    • #3
      Well, I'm not that blind. Below is an example of two almost identical files that differ only in the HTML title, whose number also appears at the beginning of the paragraph.
      The entire HTML code is also copied below.

      But I'm no longer sure that the title number is what matters. There seem to be some other inconsistencies as well.

      ---------------------------------------------------

      [Image: SurroundingWords.GIF]

      ---------------------------------------------------

      <!doctype html>
      <html>
      <head>
      <meta charset='utf-8'>
      <title>0:0.7</title>
      </head>
      <body>
      <p><span class='refs'><span class='nRef'>0:0.7</span><span class='glue'> </span><span class='oRef'>(1.7)</span></span> Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua.</p>
      </body>
      </html>


      • #4
        Hi Tapio,

        We were able to reproduce the behaviour.

        The bug occurs only for the very first file in the index. It is not related to the title of the files.

        It should only occur for one file per index. If you are seeing this problem with multiple files in your index, let us know.

        This will be fixed in the V8 release.
        --Ray
        Wrensoft Web Software
        Sydney, Australia
        Zoom Search Engine


        • #5
          Further investigation shows that the affected area depends on the Context size setting under Context description. In my case the size is 1000 words, and my files are very small, only one paragraph long, because I want to make sure the entire content is shown. The first 6 files were affected by this behaviour, and together they contain slightly fewer than 500 words. The seventh and following files had no problem. Decreasing the Context size decreased the number of affected files at the beginning.

          It was pure coincidence that those six files had titles with a common pattern, which initially misled me about the cause.
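
For anyone who wants to estimate how many leading files are likely to be affected, here is a rough back-of-envelope sketch. It assumes the look-back distance is about half the Context size, which is only inferred from the numbers in this post (1000-word context, just under 500 words affected) and is not confirmed Zoom behaviour. The function name and the word counts are made up for illustration.

```python
from itertools import accumulate


def files_inside_lookback(word_counts, context_size):
    """Count the leading files that end inside the look-back window.

    Assumption (not confirmed by Wrensoft): the engine looks back
    roughly context_size // 2 words from the hit.
    """
    lookback = context_size // 2
    # Running total gives the end offset of each file in index order.
    ends = accumulate(word_counts)
    return sum(1 for end in ends if end <= lookback)


# Hypothetical word counts: the first six files total 495 words.
counts = [80, 90, 75, 85, 95, 70, 120, 60]

print(files_inside_lookback(counts, 1000))  # 6
print(files_inside_lookback(counts, 500))   # 3
```

With these made-up counts, halving the Context size reduces the number of affected files but does not bring it to zero, which is in line with what is reported in this thread.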


          • #6
            Doesn't make sense to want a context block of 1000 words when there are 20 words in a document.


            • #7
              With 20 words I fully agree. But my 15000 documents contain 5-500 words each (every paragraph is its own document, which I found to be the only way to do a paragraph-wise search in a very large offline text).
              Still, narrowing the context size, even to half, doesn't solve the problem.


              • #8
                Originally posted by Tapio
                Further investigation shows that the affected area depends on the Context size setting under Context description. In my case the size is 1000 words, and my files are very small, only one paragraph long, because I want to make sure the entire content is shown. The first 6 files were affected by this behaviour, and together they contain slightly fewer than 500 words. The seventh and following files had no problem. Decreasing the Context size decreased the number of affected files at the beginning.
                This behaviour matches our understanding of the bug, so it should be fixed in V8 as mentioned.

                When you increase the context size, you are asking Zoom to go back further to find the start of the context. The bug was such that when Zoom reaches back too far, beyond the very start of the index data, it aborts any attempt to produce text before the hit and just shows context beginning from where the hit occurred.
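
The behaviour described above can be illustrated with a toy model. This is only a sketch, not Zoom's actual code: the index data is modelled as a plain list of words, the look-back is assumed to be half the context size, and both function names are hypothetical.

```python
def context_buggy(words, hit, context_size):
    """Toy model of the bug: if the look-back would pass the start of
    the data, give up on pre-hit words and start at the hit itself."""
    half = context_size // 2  # assumed look-back distance
    start = hit - half
    if start < 0:
        start = hit  # abort the pre-hit context entirely
    return words[start:hit + half + 1]


def context_clamped(words, hit, context_size):
    """For comparison: clamp the look-back to the start of the data."""
    half = context_size // 2
    start = max(0, hit - half)
    return words[start:hit + half + 1]


words = "Lorem ipsum dolor sit amet consectetur adipiscing elit".split()
hit = words.index("dolor")

print(" ".join(context_buggy(words, hit, 10)))
# dolor sit amet consectetur adipiscing elit
print(" ".join(context_clamped(words, hit, 10)))
# Lorem ipsum dolor sit amet consectetur adipiscing elit
print(" ".join(context_buggy(words, hit, 4)))
# Lorem ipsum dolor sit amet
```

In this model a small enough context size never reaches past the start of the data, so the words before the hit are shown as expected.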

                So what you describe would make sense, and as mentioned, this has been fixed in V8. If you are a registered customer, you can e-mail us to be added to the list to be informed when the V8 beta is available.

                --Ray


                • #9
                  UPDATE: Unfortunately, contrary to our initial response, there isn't a way to improve this behaviour. It turns out we had underestimated the original problem.

                  In our test cases, we had not anticipated indexing multiple files that are each smaller than the context description size, to the extent that the context size is greater than the combined contents of 2 or 3 of these files.

                  Our initial attempt to fix this problem assumed it was just occurring for the very first file that was indexed, so that pointing to the start of the context data would allow us to retrieve the relevant "pre hit" context. But when it occurs for the 2nd or 3rd file (or further on), it becomes impossible to point to the start of these files. Such a pointer does not exist in the data.

                  I hope that explains the situation.

                  This has now been reverted to the original behaviour as described above. That is, in the (quite rare) case where a context description is required for a very small file (relative to the Context Size specified in your configuration), Zoom may not be able to locate the context (or "surrounding words") that exists before the hit word, and you will instead see a context description containing only the words that follow the hit word.
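
To illustrate why pointing to the start of the context data only helps for the first file, here is a toy sketch. It is not Zoom's data format: the context data is assumed to be one flat stream of words with no per-file start offsets, and `flatten`, `clamped_context` and `file_starts` are hypothetical names.

```python
def flatten(files):
    """Concatenate per-file word lists into one stream.
    File boundaries are lost, as assumed for this toy model."""
    return [word for words in files for word in words]


def clamped_context(stream, hit, context_size, floor=0):
    """Look back half the context size, but never before `floor`."""
    half = context_size // 2
    return stream[max(floor, hit - half):hit + half + 1]


files = [
    "alpha beta gamma".split(),    # first file in the index
    "delta epsilon zeta".split(),  # second file in the index
]
stream = flatten(files)
hit = stream.index("epsilon")  # a hit inside the second file

# Clamping to the start of the data pulls in words from the first file.
print(clamped_context(stream, hit, 10))
# ['alpha', 'beta', 'gamma', 'delta', 'epsilon', 'zeta']

# A correct result would need the start offset of the second file,
# which is exactly the pointer that is assumed not to exist.
file_starts = [0, 3]  # hypothetical, not available in the real data
print(clamped_context(stream, hit, 10, floor=file_starts[1]))
# ['delta', 'epsilon', 'zeta']
```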

                  I have followed up by email with a workaround for your particular case.
                  --Ray