Because I use character entities a lot, I have looked carefully at the way they are handled. This is not a big deal, but it is interesting (to me, anyhow).
I have been unable to find any other reference to the subject elsewhere on the forum.
I have set encoding = ISO-8859-1 because that is what my site declares itself as. Any extra characters required are included in the pages by using character entities. I have also enabled accent/... insensitivity because that makes it easier for the user to search for accented words.
&amp;. During the index creation stage, &amp; gets converted to plain text "&". When the search results are displayed, it is not converted back to &amp;. I have been very impressed by the way Zoom conforms to XHTML standards when creating the search results page, but this breaks that conformance (though in general the page still works).
&lt; and &gt;. According to chapter 6.1 of the user guide, the indexing process includes these steps, in this order:
7. Convert HTML character entities and numerical entities back to plain text.
9. Remove all HTML tags ( <...> )
By doing it in this order, any &lt; and &gt; potentially get treated as HTML tag markers. I have observed that if they are used as brackets in a pair, the enclosed text gets stripped from the index, as expected. I don't know what happens if one occurs on its own.
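To illustrate why the order matters, here is a minimal Python sketch. It is not Zoom's actual code: the regular expression and both function names are my own assumptions, used only to model "decode entities" and "remove tags" as two separate steps.

```python
import re
from html import unescape

# Hypothetical stand-in for "Remove all HTML tags ( <...> )".
TAG_RE = re.compile(r"<[^>]*>")

def decode_then_strip(source: str) -> str:
    """Step 7 before step 9: entities are decoded first, then tags removed."""
    return TAG_RE.sub("", unescape(source))

def strip_then_decode(source: str) -> str:
    """The reverse order: real tags are removed first, then entities decoded."""
    return unescape(TAG_RE.sub("", source))

sample = "<p>Press &lt;Enter&gt; to continue</p>"
print(repr(decode_then_strip(sample)))  # the word "Enter" is lost
print(repr(strip_then_decode(sample)))  # the word "Enter" survives
```

In this sketch the first order drops the bracketed word, which matches what I observed in the index, while the second order keeps it.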
When .TXT documents are indexed, the raw characters make it into the index untouched. The "<" gets converted back to &lt; before going onto the search results page, but the ">" does not get converted to &gt;. This can potentially break the results page.
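For both this case and the &amp; case above, what I would expect at the output stage is something like the following Python sketch. The function name is hypothetical and this is not how Zoom is implemented; it only shows all three characters being re-escaped before indexed text is placed in an XHTML page.

```python
from html import escape

def render_result_text(indexed_text: str) -> str:
    """Re-escape plain text from the index before writing it into XHTML.

    html.escape converts &, < and > (and quotes, with quote=True) to entities.
    """
    return escape(indexed_text, quote=True)

print(render_result_text("Marks & Spencer"))
print(render_result_text("if a < b and b > c"))
```

Escaping "&" first (as html.escape does) matters, because otherwise the "&" in an already produced &lt; would be escaped a second time.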
Other character entities. The ones I have observed seem to be handled very well: mdash/ndash convert to a hyphen, and the various Unicode smart quotes convert to '"' and "'". As expected, the ones covered by the accent/... insensitivity convert to their unaccented partners for the purposes of the index (but I have observed that some remain in their original form when reported in the search results, e.g. y umlaut).
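As a rough model of the behaviour described above, here is a small Python sketch. The mapping table and function name are my own assumptions and not Zoom's actual rules; it folds dashes and smart quotes to ASCII and strips accents by Unicode decomposition.

```python
import unicodedata

# Hypothetical mapping: dashes to hyphen, smart quotes to plain quotes.
PUNCTUATION_MAP = {
    0x2013: "-",   # en dash
    0x2014: "-",   # em dash
    0x2018: "'",   # left single quote
    0x2019: "'",   # right single quote
    0x201C: '"',   # left double quote
    0x201D: '"',   # right double quote
}

def normalize_for_index(text: str) -> str:
    """Fold punctuation, then drop combining accents after NFD decomposition."""
    folded = text.translate(PUNCTUATION_MAP)
    decomposed = unicodedata.normalize("NFD", folded)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

print(normalize_for_index("\u201cna\u00efve\u201d \u2013 caf\u00e9"))
print(normalize_for_index("\u00ff"))  # y umlaut
```

If only the indexed copy is normalized like this and the displayed text is taken from elsewhere, that could explain an accented character such as y umlaut still appearing in the results, though that is a guess on my part.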
I have no idea what happens to more obscure ones (Greek, Hebrew etc.) which have no equivalent in the chosen character set encoding. I can't see how stage 7 can convert them into plain text. Perhaps they are just omitted.
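To show the options an indexer has here, this Python sketch decodes an entity and then tries to store it as ISO-8859-1 with different error policies. I do not know which of these (if any) Zoom uses; the function name is hypothetical.

```python
from html import unescape

def to_latin1(source: str, errors: str) -> bytes:
    """Decode entities, then encode to ISO-8859-1 with the given error policy."""
    return unescape(source).encode("iso-8859-1", errors=errors)

sample = "&alpha;&eacute;"  # Greek alpha has no ISO-8859-1 equivalent; e-acute does

print(to_latin1(sample, "ignore"))             # alpha is silently omitted
print(to_latin1(sample, "replace"))            # alpha becomes "?"
print(to_latin1(sample, "xmlcharrefreplace"))  # alpha becomes a numeric entity
```

The "ignore" policy corresponds to my guess that such characters are just omitted.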