Announcement

**Ray** · Oct-04-2005, 06:04 AM

By doing it in this order any < and > potentially get treated as HTML tag markers. I have observed that if they are used as brackets in a pair then the text enclosed gets stripped from the index as expected.

You're right, that is a potential problem. We will address this bug in the next build (4.2.1003).

When .TXT documents are indexed the raw characters make it into the index unmolested and the "<" gets converted back to &lt; before going onto the search results page but not the ">". This can potentially break the results page.

It is true that the context description (in the search results) is not currently encoded as HTML entities. However, we can't think of any situation (besides a page with already broken HTML) where this could break the results page, because the "<" character is always encoded.

Also note that the title and meta descriptions are currently encoded as HTML entities where necessary. We do plan to add HTML entity encoding to the context description too, in a future version.
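As a rough illustration of what entity-encoding the context description would involve, here is a minimal Python sketch. It is not the indexer's actual code; the function name `render_context` and the sample text are made up for the example, and only the standard library's `html.escape` is used.

```python
import html


def render_context(snippet: str) -> str:
    """Escape a plain-text context snippet before it is placed in an HTML page.

    Hypothetical helper for illustration only.
    """
    # html.escape replaces &, < and > with entities; quote=True also
    # escapes double and single quotes.
    return html.escape(snippet, quote=True)


raw = 'if a < b and b > c then x & y'
print(render_context(raw))
# prints: if a &lt; b and b &gt; c then x &amp; y
```

With both "<" and ">" (and "&") encoded, a stray bracket in indexed text cannot be read as a tag marker on the results page.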

What happens to more obscure ones (Greek, Hebrew etc.) which have no equivalent in the chosen character set encoding I have no idea. I can't see how stage 7. can convert them into plain text. Perhaps they are just omitted.

The character entities are converted to Unicode text before indexing. However, if the user has a character set selected which does not support the character in question, it will be omitted or mapped to a different character when the index data is being written out in the user-specified charset (as selected in the Languages tab of the Configuration window).
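The behaviour described above can be sketched in Python as follows. This is only an approximation under stated assumptions: the helper name `entities_to_charset` is hypothetical, and the `"replace"` and `"ignore"` error modes stand in for "mapped to a different character" and "omitted"; the real product may behave differently.

```python
import html


def entities_to_charset(source: str, charset: str, errors: str = "replace") -> bytes:
    """Decode character entities to Unicode, then write out in a target charset.

    Hypothetical helper for illustration only.
    """
    # Step 1: entities become Unicode text, e.g. "&alpha;" -> "α".
    text = html.unescape(source)
    # Step 2: characters the charset cannot represent are replaced with "?"
    # (errors="replace") or dropped (errors="ignore").
    return text.encode(charset, errors=errors)


sample = "caf&eacute; &alpha;&beta;"
print(entities_to_charset(sample, "utf-8"))               # b'caf\xc3\xa9 \xce\xb1\xce\xb2'
print(entities_to_charset(sample, "latin-1", "replace"))  # b'caf\xe9 ??'
print(entities_to_charset(sample, "latin-1", "ignore"))   # b'caf\xe9 '
```

The "é" survives in Latin-1 because that charset contains it, while the Greek letters are lost unless a wider encoding is selected.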

We expect that a user would have the appropriate charset / encoding selected if their website is to display certain foreign characters. In fact, we would think that the browser would not be able to display an HTML entity character if that character is not supported by the current charset. Are there examples where this is not the case?

**Rick Parsons** · Oct-04-2005, 08:25 AM

Thanks for your response. I think I agree with everything you have said (I trust you didn't miss the point about &amp

In the last section one needs to be clear about the distinction between Character Set (the list of characters used by HTML and defined as UNICODE), Character Encoding (the binary codes associated with these characters) and the Font (used to display the characters).

The important distinction is that the range of characters displayable by the Font can be much wider than the Character Encoding used in the source document. That is what Character Entities are for: they pass a code for the required character, expressed in the allowable Character Encoding of the source document.

Using this method, if I use a comprehensive Font, I can cause Greek, Hebrew & Cyrillic characters to be displayed, even though the declared Character Encoding of the page only supports "Latin-1".

The W3C has a good document about this issue at http://www.w3.org/TR/REC-html40/charset.html

As you are converting back to UNICODE internally this should be ok, SO LONG AS the ones that can't be expressed in the chosen Encoding are converted back to a character entity again before being put on the search results page (as part of a title, summary or context section which I see that you plan to do in a future release).
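To make the round trip concrete, here is a minimal Python sketch of what I mean. It is my own illustration, not how the product works: the function name `to_page_bytes` is invented, and it relies on the standard `xmlcharrefreplace` error handler to turn anything the encoding cannot express into a numeric character reference.

```python
import html


def to_page_bytes(text: str, charset: str) -> bytes:
    """Encode Unicode text for an HTML page in the given charset.

    Hypothetical helper for illustration only.
    """
    # Escape markup characters first, so the "&" of the numeric references
    # generated in the next step is not escaped a second time.
    escaped = html.escape(text, quote=False)
    # Characters outside the charset become references such as "&#937;".
    return escaped.encode(charset, errors="xmlcharrefreplace")


title = html.unescape("&Omega; < caf&eacute;")  # the internal Unicode form
print(to_page_bytes(title, "latin-1"))
# prints: b'&#937; &lt; caf\xe9'
```

The "é" is written directly because Latin-1 has it, while the omega goes out as a numeric entity that a browser with a suitable font can still display.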

I would not expect a user to be able to search for these, as the input text box character encoding would not support them and it would be unreasonable to expect them to use character entities. My requirement is only for the occasional non-standard character (otherwise I would change the Character Encoding used for the page).

Handling of character entities
