search for square brackets in chemical names


  • search for square brackets in chemical names

    How are square brackets [] treated? The Zoom V6 user's guide states 'some characters (such as brackets and other punctuation characters) and spaces may be stripped or trimmed from your keyword or phrase.'

    I do not see brackets in the list of characters that can be specified to join words. We have brackets in our product names such as D-[1,2-13C2]xylose. I have enabled the comma and hyphen and am getting better results for that, but the brackets are still a problem. My search for D-[1,2-13C2]xylose responds with 'Search results for: D- 1,2-13C2 xylose'. The brackets are replaced with spaces and the highlighting in the search results skips the brackets. Also, D-[1,2-13C2]arabinose will be included in the results, because I have to match ANY words. If I select match ALL words, I get nothing found.

    Another problem is that L-[1,2-13C2]xylose is also found (I guess because I have to search for any words instead of all words), and in fact it may even be listed above D-[1,2-13C2]xylose in the results.

  • #2
    Square brackets are effectively treated like a space character, and break up words.

    So searching for DOG[CAT] is the same as searching for DOG CAT.

    However if you select match ALL words, then you should get a match. That is to say, searching for DOG[CAT] should match both the text DOG[CAT] and DOG CAT on a page.
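As a rough illustration of the behaviour described above, bracket handling can be modeled as a simple substitution before the query is split into words. This is only a sketch in Python, not Zoom's actual code; the function name `normalize_query` is hypothetical.

```python
import re


def normalize_query(query: str) -> list[str]:
    """Hypothetical model: replace each square bracket with a space,
    then split the result on whitespace."""
    return re.sub(r"[\[\]]", " ", query).split()


# Under this model the two queries produce the same word list.
print(normalize_query("DOG[CAT]"))
print(normalize_query("DOG CAT"))
print(normalize_query("D-[1,2-13C2]xylose"))
```

Under this assumption, D-[1,2-13C2]xylose splits into the pieces D-, 1,2-13C2 and xylose, which lines up with the 'Search results for: D- 1,2-13C2 xylose' message reported in the first post.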



    • #3
      brackets in title tag and description different from body text

      Originally posted by wrensoft
      Square brackets are effectively treated like a space character, and break up words.

      So searching for DOG[CAT] is the same as searching for DOG CAT.

      However if you select match ALL words, then you should get a match. That is to say, searching for DOG[CAT] should match both the text DOG[CAT] and DOG CAT on a page.
      I am not sure what is going on here. I have a page with D-[1-13C]glucose listed in the title, the description, the alt and title tags of an image, and also in a list element near the bottom of the page. This page also has this heading '<h2><span class="smcap">D</span>-[1-<sup>13</sup>C]glucose</h2>'. I understand that the superscript tag (sup) is a problem that is going to be addressed in the next patch (1026). I also understand that brackets [ ] are treated as spaces and there is currently no setting in Zoom 6 to control how brackets are treated. Since brackets are an integral part of our product names, I am trying to find an acceptable workaround.

      When I search for D-[1-13C]glucose with ANY search words selected, I get unacceptable results: my desired page is #8 out of 219 results. Ranking above #8, with the same score, are a page for L-[1-13C]glucose and a page for D-[1-13C;2-2H]glucose. Also with the same score is a page for D-[1-13C;1-2H]glucose. These 4 pages are essentially identical as far as the search terms go, each having the appropriate term in the title, description, image, and list element. The h2 element is ignored since it uses superscripting. Due to the brackets, I get NO results found when matching all search terms. The indexing was done with hyphen, apostrophe, comma and colon all set to join words, and with words of less than 1 character skipped. Unfortunately the semicolon is not available as a join character.

      If I substitute spaces for the brackets, which is what I understand Zoom is doing, and select to match ANY words, my target page moves up to the #3 spot out of 424. It is bested by D-[1-13C;1-2H]glucose and D-[1-13C;2-2H]glucose, which have the same score. Selecting match ALL words finally places my target in the #1 spot, out of 12, but it still has the same score as the other 2 products. Search terms are highlighted in the title, description, and in the body text, skipping the brackets. These results can be viewed at www.omicronbio.com/zoom1.html.

      Finally, I enclose the terms in quotation marks and search for "D- 1-13C glucose". My target page is #2 out of 3 results. I realize that I may need to work with the weightings (I have page title as 3+ boost, alt text as 1+, and page text and keywords as normal). A problem is that the desired item, D-[1-13C]glucose, is highlighted in the results ONLY in the body text of my target page. The same term is NOT highlighted in the title or in the description. So, was it not counted in the score? These results may be seen at omicronbio.com/zoom2.html. Are brackets turned into spaces only in keywords and in body text but something different in the title tag and description?



      • #4
        To be honest, I'm not really sure what the question is.

        There should be no difference with how square brackets are treated as spaces, in titles or descriptions.

        Some subtleties of note however:

        The minimum word length specified under "Configure"->"Skip options" ("Skip words less than ... characters") by default is 2. This would have affected the first letter in your search.

        Because essentially, you are searching for
        D- 1-13C glucose

        And the word join rules would not index the "-" (even when hyphens are enabled for word joining) because it is not followed by a valid character (it is instead, followed by a space). So what it would try to index are 3 words:

        D
        1-13C
        glucose

        But if you are skipping words less than 2 characters (as by default), then the "D" would be lost, and it only indexes the two remaining words.
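The word-splitting rules described above can be sketched in Python. This is a simplified model under stated assumptions, not Zoom's indexer: the function `index_words` and its parameters `join_chars` and `min_len` are hypothetical names, and the rule that a join character is kept only when it follows a word character and is followed by another one is inferred from the explanation in this post.

```python
def index_words(text: str, join_chars: str = "-", min_len: int = 2) -> list[str]:
    """Hypothetical model of word splitting with join characters.

    A join character stays inside a word only if a word is in progress
    and the next character is a letter or digit. Everything else
    (brackets, spaces, a dangling hyphen) ends the current word.
    """
    words = []
    current = ""
    for i, ch in enumerate(text):
        nxt = text[i + 1] if i + 1 < len(text) else ""
        if ch.isalnum():
            current += ch
        elif ch in join_chars and current and nxt.isalnum():
            current += ch
        else:
            if current:
                words.append(current)
            current = ""
    if current:
        words.append(current)
    # Model of "Skip words less than ... characters".
    return [w for w in words if len(w) >= min_len]


print(index_words("D-[1-13C]glucose", join_chars="-", min_len=2))
print(index_words("D-[1-13C]glucose", join_chars="-", min_len=1))
```

With `min_len=2` the single letter D is dropped and only 1-13C and glucose remain; with `min_len=1` all three words are kept.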

        These requirements for semi-colon to join words, and the use of square brackets, are a bit unusual; you are, at least, the first person to ask for it. While it would be nice to add options for everything requested, it simply isn't economical for us to do so when every user has slightly different requirements and it doesn't benefit the rest of the userbase.

        If these indexing requirements (I think you just need square brackets and semi-colons to be word join characters?) are commercially critical for you, we can give you a quote for custom development to add bespoke features into V7. However, as noted, when features are uniquely required, we would charge the cost of development (which is different from pricing for an off-the-shelf product). If you do wish to discuss this however, please contact us and we can work out the details and give a ballpark figure.
        --Ray
        Wrensoft Web Software
        Sydney, Australia
        Zoom Search Engine



        • #5
          Just had another idea which might be a decent solution for you. See this post I made in your other thread.
          --Ray



          • #6
            Originally posted by Ray
            To be honest, I'm not really sure what the question is.

            There should be no difference with how square brackets are treated as spaces, in titles or descriptions.
            The specific question is this. When I search for D- 1-13C glucose (my target text with the brackets replaced by spaces, with match all words selected, and with "Skip words less than ... characters" changed to 1 to include the D), I get highlighting of D-[1-13C]glucose, except for the brackets, in the title, description, and body text of the search results (all 3 places where it occurs). This I understand.

            But when I search for "D- 1-13C glucose", which is the exact same term surrounded by quotes, D-[1-13C]glucose is highlighted completely, even the brackets, in the body text string, but not at all, not even a portion, in the title or description, which previously had the parts highlighted.

            I do not understand why the body text received the highlighting in both searches, but the title and the description do not get the highlighting in the search enclosed by quotes. To me, it seems the highlighting is signaling that perhaps the search did not match the term in the title and description when enclosed in quotes. It is only highlighted in the body text. This may explain why the target page got a lower score than the non-target page. The target page has the term (with brackets) in the title, description, alt and title of an image, and one time in the body text. It gets one highlight in the body only, the title and description areas are not highlighted, and a score less than a non-target page. The non-target page has the term (with brackets) in the keywords and alt tag of an image. It also gets one highlight, but somehow gets a higher score, even though I have boosted the title (3+) and keywords are normal.

            But thanks for your other tip that the custom meta field of Text type is not subjected to the normal word indexing rules. I will check this out. I realize I can not expect you to customize for my needs, but I am just trying to understand how the matching works so I can rewrite my pages accordingly. One suggestion I would make, that might have global appeal, would be instead of having the check boxes for characters to use as joins, you just leave a blank area for users to specify any characters desired, similar to the listing area to list the skip words.



            • #7
              Originally posted by nmyers
              The specific question is this. When I search for D- 1-13C glucose (my target text with the brackets replaced by spaces, with match all words selected, and with "Skip words less than ... characters" changed to 1 to include the D), I get highlighting of D-[1-13C]glucose, except for the brackets, in the title, description, and body text of the search results (all 3 places where it occurs). This I understand.

              But when I search for "D- 1-13C glucose", which is the exact same term surrounded by quotes, D-[1-13C]glucose is highlighted completely, even the brackets, in the body text string, but not at all, not even a portion, in the title or description, which previously had the parts highlighted.
              I see, so you're actually talking about the highlighting here.

              Note that the highlighting doesn't necessarily reflect the match or scoring of the page. So it doesn't mean the title is omitted if it doesn't highlight there.

              The number of results and the scoring would vary between the non-exact phrase match and the exact phrase match, because the latter requires the keywords to be in the exact same sequence, while the former would allow any occurrence of "1-13C" on that page, or of the single letter "D", to contribute to the score.
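To illustrate the difference between the two match modes, here is a minimal Python sketch. It assumes pages and queries have already been split into word lists; the function names `matches_all_words` and `matches_exact_phrase` and the sample pages are hypothetical and do not come from Zoom.

```python
def matches_all_words(page_words: list[str], query_words: list[str]) -> bool:
    """Non-exact match: every query word occurs somewhere on the page."""
    return all(word in page_words for word in query_words)


def matches_exact_phrase(page_words: list[str], query_words: list[str]) -> bool:
    """Exact phrase match: the query words occur consecutively, in order."""
    n = len(query_words)
    return any(
        page_words[i:i + n] == query_words
        for i in range(len(page_words) - n + 1)
    )


query = ["D", "1-13C", "glucose"]
target_page = ["D", "1-13C", "glucose"]
other_page = ["L", "1-13C", "glucose", "D", "mannose"]

print(matches_all_words(target_page, query), matches_exact_phrase(target_page, query))
print(matches_all_words(other_page, query), matches_exact_phrase(other_page, query))
```

In this toy example both pages satisfy the all-words test, because a stray D anywhere on the page counts, but only the first page contains the words in the exact sequence.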

              There are other weighting/score adjustments that might have applied, such as "Word position" and "content density" and "URL length". These might have adjusted the score further. Having said that, "GLC-018.php" is a smaller page, so if you have adjustments for "Content density" ("Standard adjustment" is default) it should actually push it up. Do you have this set to "No adjustment"?

              But yes, I think, ultimately, it would serve your purpose much better to use the Custom Meta field rather than rely on exact phrase matching with single letters, and hyphen delimited numbers, etc.

              Originally posted by nmyers
              But thanks for your other tip that the custom meta field of Text type is not subjected to the normal word indexing rules. I will check this out. I realize I can not expect you to customize for my needs, but I am just trying to understand how the matching works so I can rewrite my pages accordingly. One suggestion I would make, that might have global appeal, would be instead of having the check boxes for characters to use as joins, you just leave a blank area for users to specify any characters desired, similar to the listing area to list the skip words.
              Thanks for the feedback. We've thought of this too, and we may well add this in the future.
              --Ray



              • #8
                non-exact phrase match vs the exact phrase match

                Originally posted by Ray
                The number of results and the scoring would vary between the non-exact phrase match and the exact phrase match, because the latter requires the keywords to be in the exact same sequence, while the former would allow any occurrence of "1-13C" on that page, or of the single letter "D", to contribute to the score.
                It occurred to me after I replied that, to use a simplification, if I search for "See Jane run" WITH the quotes, a page that says many other things about Jane (and so repeats her name many times) but has only 1 instance of the exact quote may still have its score incremented by the multiple occurrences of the word Jane throughout the page, even though I am searching for an exact match. Is this correct? I have not looked at the Content density setting yet; I am trying not to change too many parameters at one time.



                • #9
                  highlighting

                  Originally posted by Ray
                  Note that the highlighting doesn't necessarily reflect the match or scoring of the page. So it doesn't mean the title is omitted if it doesn't highlight there.
                  OK, I understand the highlighting doesn't show everything considered in the scoring, as some of the things you have mentioned cannot be highlighted, and I have seen other examples, like the alt tags of images, that are not reflected in the highlighting. But I had expected it to be consistent: if it highlighted the search terms in the title in one search, then in another search, if a match was found again in the same title, it would highlight it.



                  • #10
                    square brackets

                    Originally posted by wrensoft
                    Square brackets are effectively treated like a space character, and break up words.

                    So searching for DOG[CAT] is the same as searching for DOG CAT.

                    However if you select match ALL words, then you should get a match. That is to say, searching for DOG[CAT] should match both the text DOG[CAT] and DOG CAT on a page.
                    Because I need to distinguish between products such as d-[1,2-13C2]glucose and d-[1-13C]glucose, I have selected both the comma and hyphen to be join characters. I can't do anything about the brackets. When I search for d-[1,2-13C2]glucose with ALL search words selected, the Zoom engine replies
                    Search results for: d- 1,2-13C2 glucose
                    No results found.
                    If I then repeat the search, using the term d- 1,2-13C2 glucose, Zoom replies it searched for d 1,2-13C2 glucose and I have a good match.

                    I don't have DOG[CAT] on our pages, but it seems that searching for d-[1,2-13C2]glucose is not the same as searching for d- 1,2-13C2 glucose. My first bracket follows a join char (hyphen for us) but the second bracket is between two characters.

                    Is Zoom stripping out brackets during the indexing, but not soon enough during the query?

                    Are square brackets treated identically to parentheses?

                    I recently upgraded to 1028 and I had been using 1023. The site in question can be seen at www.omicronbio.com/search.php. Please note that I am entering these terms in the Search for box and leaving the product name box blank.

                    Thank You



                    • #11
                      Yes. If you are searching for "See Jane run" with quotes across two pages, then a page that mentions Jane and running a lot will be ranked higher than the page that doesn't. This is true even if the exact phrase appears only once on each of the two pages.

                      We designed Zoom to search for words and numbers. We didn't really design it to search for the strings of punctuation used in chemical names. You might find in the end that some custom development is required to get the result you want.
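A toy Python model of this ranking behaviour is sketched below. It is an assumption-laden illustration, not Zoom's scoring formula: the function `phrase_score` is hypothetical, and it simply requires the exact phrase to be present and then counts every occurrence of each query word on the page.

```python
def phrase_score(page_words: list[str], phrase_words: list[str]) -> int:
    """Hypothetical score: 0 unless the exact phrase is present, otherwise
    the total number of occurrences of the phrase's words on the page."""
    n = len(phrase_words)
    has_phrase = any(
        page_words[i:i + n] == phrase_words
        for i in range(len(page_words) - n + 1)
    )
    if not has_phrase:
        return 0
    return sum(page_words.count(word) for word in set(phrase_words))


phrase = "see jane run".split()
short_page = "see jane run".split()
busy_page = "jane likes to run and jane likes dogs so see jane run".split()

print(phrase_score(short_page, phrase))
print(phrase_score(busy_page, phrase))
```

Both pages contain the exact phrase once, but in this model the page that repeats Jane and run elsewhere ends up with the higher score.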
