Handling of character entities

  • Handling of character entities

    Because I use these a lot I have looked carefully at the way they are handled. This is not a big deal but interesting (to me anyhow).

    I have been unable to find any other reference to the subject elsewhere on the forum.

    I have set encoding = ISO-8859-1 because that is what my site declares itself as. Any extra characters required are included in the pages by using character entities. I have also enabled accent/... insensitivity because that makes it easier for the user to search for accented words.

    &amp;. During the index creation stage, &amp; gets converted to plain text "&". When the search results are displayed, it is not converted back to &amp;. I have been very impressed by the way Zoom conforms to XHTML standards when creating the search results page, but this breaks it (though in general it still works).

    &lt; and &gt;. According to chapter 6.1 of the user guide, the indexing process includes the following steps, in this order:

    7. Convert HTML character entities and numerical entities back to plain text.
    9. Remove all HTML tags ( <...> )

    By doing it in this order any &lt; and &gt; potentially get treated as HTML tag markers. I have observed that if they are used as brackets in a pair then the text enclosed gets stripped from the index as expected. I don't know what happens if singles occur.
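
The effect of that ordering can be reproduced outside Zoom. The following is only a minimal Python sketch, not Zoom's actual code: the regular expression and the function names `decode_then_strip` and `strip_then_decode` are assumptions made up for illustration.

```python
import html
import re

# Naive tag matcher, for illustration only (assumed, not Zoom's real parser).
TAG_RE = re.compile(r"<[^>]*>")

def decode_then_strip(text: str) -> str:
    """Order described in the user guide: decode entities, then remove tags."""
    return TAG_RE.sub("", html.unescape(text))

def strip_then_decode(text: str) -> str:
    """Alternative order: remove real tags first, then decode entities."""
    return html.unescape(TAG_RE.sub("", text))

sample = "<p>Press &lt;Enter&gt; to continue</p>"
print(decode_then_strip(sample))  # Press  to continue
print(strip_then_decode(sample))  # Press <Enter> to continue
```

With entities decoded first, the bracketed word is indistinguishable from a tag and is lost from the indexed text; stripping tags first keeps it.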

    When .TXT documents are indexed the raw characters make it into the index unmolested and the "<" gets converted back to &lt; before going onto the search results page but not the ">". This can potentially break the results page.
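
Both output-side observations (a bare "&" and an unescaped ">" reaching the results page) come down to escaping text just before it is written into the page. As a rough sketch in Python, assuming a hypothetical helper named `escape_for_results` (this is not how Zoom is implemented):

```python
import html

def escape_for_results(text: str) -> str:
    """Escape &, <, > and quotes so indexed text is safe inside (X)HTML."""
    return html.escape(text, quote=True)

context = 'Fish & chips <cheap> "today"'
print(escape_for_results(context))
# Fish &amp; chips &lt;cheap&gt; &quot;today&quot;
```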

    Other character entities. The ones I have observed seem to be handled very well: mdash/ndash convert to a hyphen, and the various Unicode smart quotes convert to '"' and "'". As expected, the ones covered by the accent/... insensitivity convert to their unaccented partners for the purposes of the index (but I have observed that some remain in their original form when reported in the search results, e.g. y umlaut).
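
For anyone wanting to reproduce this kind of folding, here is a small Python sketch. The mapping table and the name `fold_for_index` are assumptions for illustration; Zoom's own folding rules may well differ.

```python
import unicodedata

# Assumed punctuation mapping: dashes to hyphen, smart quotes to plain quotes.
PUNCTUATION = str.maketrans({
    "\u2013": "-", "\u2014": "-",    # ndash, mdash
    "\u201c": '"', "\u201d": '"',    # double smart quotes
    "\u2018": "'", "\u2019": "'",    # single smart quotes
})

def fold_for_index(text: str) -> str:
    """Map dashes/quotes to ASCII and drop combining accent marks."""
    decomposed = unicodedata.normalize("NFD", text.translate(PUNCTUATION))
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

print(fold_for_index("\u201cna\u00efve\u201d \u2014 \u00ff"))  # "naive" - y
```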

    What happens to more obscure ones (Greek, Hebrew etc.) which have no equivalent in the chosen character set encoding I have no idea. I can't see how stage 7. can convert them into plain text. Perhaps they are just omitted.
    Cheers,

    Rick Parsons, Bristol, England

  • #2
    By doing it in this order any &lt; and &gt; potentially get treated as HTML tag markers. I have observed that if they are used as brackets in a pair then the text enclosed gets stripped from the index as expected.
    You're right, that is a potential problem. We will address this bug in the next build (4.2.1003).

    When .TXT documents are indexed the raw characters make it into the index unmolested and the "<" gets converted back to &lt; before going onto the search results page but not the ">". This can potentially break the results page.
    It is true that the context description (in the search results) is not currently encoded as HTML entities. However, we can't think of any situation (besides a page with already broken HTML) where this could break the results page, because the "<" character is always encoded.

    Also note that the title and meta descriptions are currently encoded as HTML entities where necessary. We do plan to add HTML entity encoding to the context description too, in a future version.

    What happens to more obscure ones (Greek, Hebrew etc.) which have no equivalent in the chosen character set encoding I have no idea. I can't see how stage 7. can convert them into plain text. Perhaps they are just omitted.
    The character entities are converted to Unicode text before indexing. However, if the user has a character set selected which does not support the character in question, it will be omitted or mapped to a different character when the index data is being written out in the user-specified charset (as selected in the Languages tab of the Configuration window).
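
The "omitted or mapped to a different character" behaviour can be illustrated with Python's codec error handlers. This is only an analogy under stated assumptions: `to_charset` is a hypothetical helper, and Zoom's own conversion may behave differently.

```python
import html

def to_charset(source: str, charset: str, errors: str) -> bytes:
    """Decode entities to Unicode, then write the text out in a target charset."""
    unicode_text = html.unescape(source)
    return unicode_text.encode(charset, errors=errors)

source = "caf&eacute; &alpha;"
print(to_charset(source, "iso-8859-1", "ignore"))   # b'caf\xe9 '   (alpha omitted)
print(to_charset(source, "iso-8859-1", "replace"))  # b'caf\xe9 ?'  (alpha mapped to ?)
```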

    We expect that a user would have the appropriate charset / encoding selected if their website is to display certain foreign characters. In fact, we would think that the browser would not be able to display an HTML entity character if that character is not supported by the current charset. Are there examples where this is not the case?
    --Ray
    Wrensoft Web Software
    Sydney, Australia
    Zoom Search Engine


    • #3
      Thanks for your response. I think I agree with everything you have said (I trust you didn't miss the point about &amp;).

      In the last section one needs to be clear about the distinction between Character Set (the list of characters used by HTML and defined as UNICODE), Character Encoding (the binary codes associated with these characters) and the Font (used to display the characters).

      The important distinction is that the range of characters displayable by the Font can be much wider than the Character Encoding used in the source document, and that is what Character Entities are for: to pass a code for the required character, expressed in the allowable Character Encoding of the source document.

      Using this method, if I use a comprehensive Font, I can cause Greek, Hebrew & Cyrillic characters to be displayed, even though the declared Character Encoding of the page only supports "Latin-1".

      The W3C has a good document about this issue at http://www.w3.org/TR/REC-html40/charset.html

      As you are converting back to UNICODE internally this should be ok, SO LONG AS the ones that can't be expressed in the chosen Encoding are converted back to a character entity again before being put on the search results page (as part of a title, summary or context section which I see that you plan to do in a future release).
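
One way to get that behaviour, shown here as a Python sketch rather than anything Zoom is known to do, is the standard `xmlcharrefreplace` error handler, which writes a numeric character reference for any character the target encoding cannot represent. The name `encode_with_entities` is made up for this example.

```python
def encode_with_entities(text: str, charset: str = "iso-8859-1") -> bytes:
    """Encode text, falling back to numeric character references."""
    return text.encode(charset, errors="xmlcharrefreplace")

title = "caf\u00e9 \u03b1\u03b2 \u05d0"  # e-acute, Greek alpha/beta, Hebrew alef
print(encode_with_entities(title))  # b'caf\xe9 &#945;&#946; &#1488;'
```

Characters that Latin-1 supports pass through as single bytes; only the Greek and Hebrew ones become references. Markup characters such as "&" and "<" would still need escaping separately before this step.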

      I would not expect a user to be able to search for these, as the input text box character encoding would not support them and it would be unreasonable to expect them to use character entities. My requirement is only for the occasional non-standard character (otherwise I would change the Character Encoding used for the page).
      Cheers,

      Rick Parsons, Bristol, England
