Noise words in exact phrase searching


  • Noise words in exact phrase searching

    The removal of noise words ("a, an, then, of...") from indexing is a good move. But if you then include these words in an exact phrase search, it will return no results as the exact string is not in the index.

    So a book title "Tess of the d'Urbervilles" will be indexed as "Tess d'Urbervilles", but someone typing in the full title will never find it.

    I can, of course, remove the stripping of the noise words on indexing so I get the full title in the index. But stripping noise words IS a good idea. Does it not, then, make logical sense to strip the noise words from any search phrase submitted so that it matches the index?

    Thanks.
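
    The suggestion above (strip the same noise words from the search phrase that were stripped at indexing time) can be sketched roughly as follows. This is only an illustration of the idea, not a description of how Zoom handles phrases; the skip list and both function names are hypothetical.

```python
# Hypothetical skip list for illustration; not Zoom's actual list.
SKIP_WORDS = {"a", "an", "the", "of", "then"}

def strip_skip_words(text, skip_words=SKIP_WORDS):
    """Lowercase the text, split on whitespace, and drop skip words."""
    return [t for t in text.lower().split() if t not in skip_words]

def phrase_matches(indexed_title, query_phrase):
    """True if the stripped query occurs contiguously in the stripped title."""
    title = strip_skip_words(indexed_title)
    query = strip_skip_words(query_phrase)
    if not query:
        # A phrase made up only of skip words cannot match anything.
        return False
    n = len(query)
    return any(title[i:i + n] == query for i in range(len(title) - n + 1))
```

    Because both sides pass through the same filter, a query for "Tess of the d'Urbervilles" is reduced to the same two tokens as the indexed title and therefore matches.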

  • #2
    What you call "noise words" can be added to the Skip Words list in Zoom.

    When you do an exact phrase search, skip words within the phrase are taken into consideration. For example, if you search our website for the word "the" (without quotes), you will find that it is a skipped word and the search returns no results. However, if you search for "take the tour" (with double quotes), you will find that the phrase is matched correctly.

    So your example should be fine, and searching for a phrase like "Tess of the d'Urbervilles" should not be a problem. If it does not work, can you give us a URL to your search page so we can see the problem?

    Note that the one exception is that your exact phrase cannot BEGIN with a skipped word. This means that, in the above examples, you cannot search for phrases like "the tour" or "of the d'Urbervilles". But having skipped words in the middle or at the end of the phrase should not be a problem.
    --Ray
    Wrensoft Web Software
    Sydney, Australia
    Zoom Search Engine



    • #3
      Thanks for the info. I suspect your key bit, though, is "Note that the one difference is that your exact phrase can not BEGIN with a skipped word." This is exactly where skipped words such as The, A and An are very likely to appear. So this year's Man Booker prize winner "The White Tiger" becomes invisible on our bookshop sites if people search by exact phrase.

      Is this behaviour changeable?

      If not, I'll leave the skip list in place, but I'll have to remove any words from it that are likely to appear as the first word of an exact phrase.
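
      One way to find which skip words would need to come off the list is to check them against the first word of each title. The sketch below assumes the titles and the skip list are available as plain Python lists; the function name and the sample data are illustrative only, and how the list is then edited in Zoom is not covered here.

```python
def leading_skip_words(titles, skip_words):
    """Return the skip words that occur as the first word of any title."""
    skip = {w.lower() for w in skip_words}
    found = set()
    for title in titles:
        words = title.split()
        if words and words[0].lower() in skip:
            found.add(words[0].lower())
    return found

# Illustrative data, not taken from a real catalogue or from Zoom's defaults.
titles = ["The White Tiger", "Tess of the d'Urbervilles", "A Tale of Two Cities"]
skip_words = ["a", "an", "the", "of"]

to_remove = leading_skip_words(titles, skip_words)
remaining = [w for w in skip_words if w not in to_remove]
```

      With this sample data, "the" and "a" are flagged for removal, while "an" and "of" stay on the skip list because no title begins with them.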



      • #4
        If you are mostly doing exact phrase searching, then maybe you should remove all the skip words and just let Zoom index every word.
