**Ray** · Sep-09-2009, 01:42 AM

I just tried searching for both "TOKİ" and "toki" on your website and they returned the same results.

Searching for "TOKİ"
http://85.25.73.168/cgi-bin/search.cgi?zoom_query=TOKİ

Searching for "toki"
http://85.25.73.168/cgi-bin/search.cgi?zoom_query=toki

Where is the actual search page? If the page containing the search box is encoded in ISO-8859-9, it will submit the query in that encoding, instead of UTF-8. This would be a problem. Try using a search box on a page that is encoded in UTF-8.
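To make the encoding point concrete, here is a minimal Python sketch (not part of Zoom; the variable names are made up for illustration) showing how the same query is percent-encoded differently depending on the encoding of the page that submits it, and what a script expecting UTF-8 would make of the ISO-8859-9 form:

```python
from urllib.parse import quote, unquote

query = "TOKİ"

# What a search form on a UTF-8 page would put in the URL.
utf8_form = quote(query, encoding="utf-8")
# What a search form on an ISO-8859-9 page would put in the URL.
latin5_form = quote(query, encoding="iso-8859-9")

print(utf8_form)    # TOK%C4%B0
print(latin5_form)  # TOK%DD

# A script that assumes UTF-8 cannot decode the ISO-8859-9 byte,
# so the query it sees no longer matches what the user typed.
misread = unquote(latin5_form, encoding="utf-8", errors="replace")
print(misread == query)  # False
```

The single byte `%DD` is not valid UTF-8 on its own, which is why the two forms cannot be treated as interchangeable on the server side.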

**Computeus** · Sep-09-2009, 11:12 AM

When I click the links you provided, the first link gives no results while the second one gives 1 result. But I have 4 pages that include TOKİ and one page that includes toki.

So there must be 5 results.

**Ray** · Sep-10-2009, 03:19 AM

Interestingly, the difference in behaviour we are seeing appears to be caused by browser differences. In IE8, the above two links both return 1 result. In Firefox, it behaves as you describe.

This difference is not caused by the search script (which has no way of determining which browser is used). It comes from how the two browsers submit the "İ" character when it is entered in the URL.

Having said that, the real issue at hand stems from an inherent problem with the dotted and dotless "I" character in the Turkish language in Unicode. This is described in detail here:
http://www.i18nguy.com/unicode/turkish-i18n.html

While we can address this with a partial solution, there appears to be no complete solution to the problem. That is, we can recognize that the lowercase version of "İ" is "i", but we cannot recognize that the lowercase version of "I" is "ı" (the undotted lowercase i character) instead of "i". This is because, in Unicode, a page can contain multiple languages and is not restricted to one particular language, so there is no practical way to determine which rule to apply.
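The ambiguity can be illustrated with a short Python sketch. It is only an illustration of the case-mapping rules, not how Zoom handles them, and the function names are hypothetical. Python's built-in `str.lower()` applies the language-neutral Unicode rules, so a Turkish-aware version has to be told explicitly that the text is Turkish:

```python
def lower_default(text: str) -> str:
    """Language-neutral lowercasing, using Unicode's default rules."""
    return text.lower()


def lower_turkish(text: str) -> str:
    """Lowercasing with the Turkish rules: İ -> i and I -> ı."""
    return text.replace("İ", "i").replace("I", "ı").lower()


# With the Turkish rules, both dotted and dotless forms come out right.
print(lower_turkish("TOKİ"))  # toki
print(lower_turkish("ISI"))   # ısı

# The default rules turn "I" into dotted "i", which is wrong for Turkish.
print(lower_default("ISI"))   # isi

# In Python, the default rules map "İ" to "i" plus a combining dot
# (U+0307), so the result is not the plain four-letter "toki".
print(lower_default("TOKİ") == "toki")  # False
```

Applying `lower_turkish` to English text would break it in the opposite direction ("INDEX" would become "ındex"), which is the reason a mixed-language index cannot simply pick one rule.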

This problem is also evident in the Windows API (as seen here under "Remarks"). Because of this, the changes necessary are a bit more significant than one would expect. We'll add this to our list of things to look at for V6.1.

**Ray** · Sep-10-2009, 04:13 AM

You might want to use iso-8859-9 in Zoom for now and add a synonym for "toki"="TOKİ".

**Computeus** · Sep-10-2009, 08:53 AM

Will the ISO-8859-9 encoding solve my problem? I have 1M pages and I do not want to convert them from UTF-8 to ISO-8859-9 for nothing.

**Ray** · Sep-11-2009, 12:28 AM

Just change the setting in the Zoom Indexer ("Configure"->"Languages"). This will make Zoom convert the content to ISO-8859-9 during indexing. It will also expect an ISO-8859-9 search template. But you should not need to change your content pages.
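As a rough picture of what converting at indexing time means (this is a generic Python sketch, not Zoom's actual code, and the sample text is made up), the UTF-8 source stays as it is and only an in-memory copy is re-encoded:

```python
# Content as stored on disk, in UTF-8. It is never modified.
utf8_bytes = "TOKİ ısı".encode("utf-8")

# Decode once, then re-encode the copy that goes into the index.
text = utf8_bytes.decode("utf-8")
latin5_bytes = text.encode("iso-8859-9")

# The Turkish letters each take one byte in ISO-8859-9, two in UTF-8.
print(len(utf8_bytes), len(latin5_bytes))  # 11 8

# The conversion is lossless for Turkish text.
print(latin5_bytes.decode("iso-8859-9") == text)  # True
```

Characters outside the ISO-8859-9 repertoire would raise a `UnicodeEncodeError` in this sketch, so it only applies to pages whose content fits that character set.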

**Computeus** · Sep-11-2009, 02:02 PM

Thanks for your help. I will test it today.
----------------------------------------
Edit: I tested the ISO-8859-9 based engine.

These two searches again return different results:
http://85.25.73.168/cgi-bin/rar/search.cgi?zoom_query=TOKİ
http://85.25.73.168/cgi-bin/rar/search.cgi?zoom_query=toki

Zoom Indexer still cannot map "i" as the lowercase of "İ" with the ISO-8859-9 encoding.

**Ray** · Sep-14-2009, 12:44 AM

Yes. Please read my previous posts carefully. I said:

Originally posted by Ray

You might want to use iso-8859-9 in Zoom for now and add a synonym for "toki"="TOKİ".

You still need to add a synonym ("Configure"->"Synonyms") to make this work. I've also explained why it is difficult to map "i" to "İ", and there are system-wide issues surrounding this problem (in the Windows API itself) that need to be worked around. This will have to wait until V6.1 at best.

**Computeus** · Sep-14-2009, 05:33 PM

Of course I read your last post. But this issue occurs for every Turkish word that contains "ı", "I", "i" or "İ". So I would have to add synonyms for every word.
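One general way to avoid per-word synonyms, sketched here in Python, is to fold all four i-variants to a single key before comparing words. This is an assumption about what a workaround could look like, not a Zoom feature, and `I_FOLD` and `fold_key` are hypothetical names:

```python
# Map every Turkish i-variant to plain "i" before comparing.
I_FOLD = str.maketrans({"İ": "i", "I": "i", "ı": "i"})


def fold_key(word: str) -> str:
    """Return a comparison key that ignores dotted/dotless i and case."""
    return word.translate(I_FOLD).lower()


words = ["TOKİ", "toki", "Toki", "TOKI", "tokı"]
keys = {fold_key(w) for w in words}
print(keys)  # {'toki'}
```

The trade-off is precision: genuinely different Turkish words that differ only in dotted versus dotless i (for example "ısı" and "isi") would also be treated as the same search term.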

I hope you will find a complete solution for this issue. Thanks for your help.

UTF8 Search problem with Turkish language.
