Feature Request: Ability to Tweak Stemming

  • Feature Request: Ability to Tweak Stemming

    I'm really liking the stemming function in V6. But I've found a couple of circumstances where it's giving me undesirable results.

    For example, my website has both RA and RAS as abbreviations. It finds pages with either of them when either is searched for whereas I'd prefer if it only found pages with the one that was searched for.

    Another example, I have the word device and the proper name Devic in my site. Because of stemming (I assume), searching for "devic" (no quotes) gives me all the pages that have the word device too.

    If feasible, it would be nice if there could be a page akin to synonyms where one could enter/upload stemming exceptions. If I could add specific things to a list that would then not be interpreted as a stem, it would solve this problem.

    Thanks,

    Andrew

  • #2
    Related to this feature request -- to tweak stemming -- is my desire to tweak Synonyms. Misspellings and synonyms are entered in the same Zoom Search Engine entry box. For example, I have entered these variants in Zoom Search Engine:

    raisin = raison,rasian,rasain,rasan

    If I do a search for raisian, I get: "Did you mean: rasian?"
    If I do a search for rason, I get: "Did you mean: raisins or raison or rasain?"

    The suggested word must never be a misspelling! It should only suggest the correct spelling of raisin. The Synonyms module seems to think rasian, rasain and raison are properly spelled synonyms for raisin. We need to be able to make a distinction between misspellings and correctly spelled synonyms.

    • #3
      Good points above.

      Originally posted by aschecht
      For example, my website has both RA and RAS as abbreviations. It finds pages with either of them when either is searched for whereas I'd prefer if it only found pages with the one that was searched for.

      Another example, I have the word device and the proper name Devic in my site. Because of stemming (I assume), searching for "devic" (no quotes) gives me all the pages that have the word device too.
      Yes, that is a result of stemming, and one of the downsides to using it. A list of words to exempt from stemming is a good idea, and likely something we could add for a V6.1 release or similar. We're trying to find a balance between having an overwhelming number of features that the majority of users don't know what to do with, and having things "just work". Most people wouldn't understand that this behaviour was caused by stemming, so they wouldn't even look for a stemming exemption list to enter words in.

      It would be nice if we could automatically determine this based on upper/lower-casing, but that also makes a whole bunch of assumptions about how a user will type in names and abbreviations. We'll look into it in any case.
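To make the exemption-list idea concrete, here is a minimal sketch in Python. It is not Zoom's stemmer: the suffix rules, the `index_key` function, and the `=` marker for exempt words are all assumptions chosen for illustration. The sketch assumes an exempt word is indexed under its own exact-match key, so it can never collide with the stem of another word.

```python
# Hypothetical sketch: a naive suffix-stripping stemmer plus an exemption list.
# None of this reflects Zoom's actual implementation; all names are illustrative.

SUFFIXES = ("ing", "es", "e", "s")  # checked in this order


def naive_stem(word: str) -> str:
    """Strip the first matching suffix, keeping at least two characters."""
    w = word.lower()
    for suffix in SUFFIXES:
        if w.endswith(suffix) and len(w) - len(suffix) >= 2:
            return w[: -len(suffix)]
    return w


def index_key(word: str, exceptions: set[str]) -> str:
    """Return the key a word is indexed and searched under.

    Exempt words get an '=' prefix (an assumed convention) so that they
    only ever match themselves, never the stem of a different word.
    """
    w = word.lower()
    if w in exceptions:
        return "=" + w
    return naive_stem(w)


# The exemption list a user might enter, taken from the examples in this thread.
EXCEPTIONS = {"ras", "devic", "news"}

# Without exemptions the abbreviations collide; with them they stay apart.
collide = index_key("RAS", set()) == index_key("RA", set())
separate = index_key("RAS", EXCEPTIONS) != index_key("RA", EXCEPTIONS)
```

With this approach, "device" and "devices" still share a key, while a search for the exempt name "devic" only matches pages containing that exact word.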

      Originally posted by rschletty
      Related to this feature request -- to tweak stemming -- is my desire to tweak Synonyms. Misspellings and synonyms are entered in the same Zoom Search Engine entry box. For example, I have entered these variants in Zoom Search Engine:

      raisin = raison,rasian,rasain,rasan

      If I do a search for raisian, I get: "Did you mean: rasian?"
      If I do a search for rason, I get: "Did you mean: raisins or raison or rasain?"

      The suggested word must never be a misspelling! It should only suggest the correct spelling of raisin. The Synonyms module seems to think rasian, rasain and raison are properly spelled synonyms for raisin. We need to be able to make a distinction between misspellings and correctly spelled synonyms.
      Yes, that is right, and yes, there are also synonyms which are correctly spelled, so we don't want to remove them from the suggestions altogether.

      In your particular example, you probably shouldn't need to have "raison", "rasain", or "rasan" as synonyms, because they are all automatically determined as spelling mistakes of "raisin".

      Note that the Spelling Suggestions feature automatically associates similar words (based on the phonetic sound of the word), so you should not need to enter so many misspelt words as synonyms, except maybe for a smaller group of words which could not be automatically associated. I think this should minimize the problem at least.
      --Ray
      Wrensoft Web Software
      Sydney, Australia
      Zoom Search Engine

      • #4
        I think you need three columns in the Zoom Synonyms custom entry tool: correct word, synonyms and misspellings.

        Then, the module would pull suggestions only from correct words, synonyms, phonetic library and stemming library -- but never from misspellings.

        I understand that raison is French for reason, but I don't think that's why it presented that word as a suggested option. And I was pretty sure that I tested variants of raisin before entering my misspellings. I'll take out my entries and test again.

        Thanks.

        • #5
          Ray,

          Thanks for taking this feature request into consideration. I agree that most users would never touch it but it would be a nice piece of fine tuning for users who wanted to take the time to add finishing touches.

          I like rschletty's suggestion about the Synonym tweaking as well. Many of the terms on my synonym lists are misspellings of proper names and it's suboptimal when they show up on the Did you mean . . . list. I like the idea of three columns:
          key term | synonyms | misspellings
          where only the synonym matches would be eligible to appear as Did you mean . . . suggestions.
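The three-column idea could be sketched roughly as follows. This is only an illustration of the proposal, not how Zoom stores synonyms: `SynonymEntry`, `build_lookup`, and `suggest` are hypothetical names, and "sultana" is a made-up synonym added so the example has one. Here the key term and its synonyms are eligible as suggestions, while misspellings only route a query to the right entry.

```python
# Hypothetical sketch of a three-column synonym table:
#   key term | synonyms | misspellings
# All names and data here are illustrative assumptions, not Zoom's format.
from dataclasses import dataclass, field


@dataclass
class SynonymEntry:
    key_term: str
    synonyms: list[str] = field(default_factory=list)
    misspellings: list[str] = field(default_factory=list)


def build_lookup(entries: list[SynonymEntry]) -> dict[str, SynonymEntry]:
    """Map every variant (including misspellings) to its entry."""
    lookup = {}
    for entry in entries:
        for variant in [entry.key_term, *entry.synonyms, *entry.misspellings]:
            lookup[variant.lower()] = entry
    return lookup


def suggest(query: str, lookup: dict[str, SynonymEntry]) -> list[str]:
    """Return 'Did you mean' candidates, never drawn from misspellings."""
    entry = lookup.get(query.lower())
    if entry is None:
        return []
    candidates = [entry.key_term, *entry.synonyms]
    return [c for c in candidates if c.lower() != query.lower()]


entries = [
    SynonymEntry(
        key_term="raisin",
        synonyms=["sultana"],  # made-up synonym for the example
        misspellings=["raison", "rasian", "rasain", "rasan"],
    )
]
lookup = build_lookup(entries)
suggestions = suggest("rasian", lookup)
```

A misspelled query such as "rasian" still resolves to the raisin entry, but the other misspellings never appear among the suggestions.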

          I populate my Synonyms list with search terms that went unfound, as determined by reviewing my search logs. This means that I don't bother making synonyms for misspelled words unless they result in no hits.

          Andrew

          • #6
            +1 vote on improved support for stemming/synonyms.

            I was surprised to find that I can't use phrases for synonyms. How then can I create synonyms for acronyms? For example, how could I create a synonym for something like PTO=Paid Time Off?

            That doesn't appear to be possible.

            • #7
              I agree with Dan. I was disappointed to see I could not enter a phrase in Synonyms.

              • #8
                May I add my voice to a request for an ability to tweak stemming?
                I like the way it works most of the time, but occasionally English can be a real pain - "news" is not the plural of "new" (nor "new" the singular of "news").

                • #9
                  We're thinking of adding a user-specifiable list of words to exempt from stemming. We'll note your request. Thanks for the input.
                  --Ray
