UTF8 Search problem with Turkish language.

  • UTF8 Search problem with Turkish language.

    I am using zoom search on my website.

    The address of the cgi script is http://85.25.73.168/cgi-bin/search.cgi

    I have a problem with the letters "i" and "İ" (capital "i" in Turkish). Searches for the keywords "toki" and "TOKİ" should return the same results in Turkish, but they do not.

    My pages are encoded in UTF-8, and the locale setting on my Ubuntu server is ISO-8859-9. I have also set the Zoom Indexer to use UTF-8 encoding.

  • #2
    I just tried searching for both "TOKİ" and "toki" on your website and they returned the same results.

    Searching for "TOKİ"
    http://85.25.73.168/cgi-bin/search.cgi?zoom_query=TOKİ

    Searching for "toki"
    http://85.25.73.168/cgi-bin/search.cgi?zoom_query=toki

    Where is the actual search page? If the page containing the search box is encoded in ISO-8859-9, it will submit the query in that encoding, instead of UTF-8. This would be a problem. Try using a search box on a page that is encoded in UTF-8.
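
To illustrate why the encoding of the search page matters, here is a minimal Python sketch of how the same query is percent-encoded under each encoding. It is only an illustration of the byte-level difference, not how Zoom or any particular browser builds the request; the variable names are made up for the example.

```python
from urllib.parse import quote

query = "TOKİ"

# Percent-encode the query as a UTF-8 page would submit it.
utf8_form = quote(query, encoding="utf-8")

# Percent-encode the same query as an ISO-8859-9 page would submit it.
latin5_form = quote(query, encoding="iso-8859-9")

print(utf8_form)    # TOK%C4%B0
print(latin5_form)  # TOK%DD
```

A script that expects UTF-8 cannot interpret the single byte `%DD` as "İ", so the two forms behave as different queries.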
    --Ray
    Wrensoft Web Software
    Sydney, Australia
    Zoom Search Engine


    • #3
      When I click the links you provided, the first link gives no results while the second one gives 1 result. But I have 4 pages that include "TOKİ" and one page that includes "toki".

      So there must be 5 results.


      • #4
        Interestingly, the difference in behaviour we are seeing appears to be caused by browser differences. In IE8, the above two links both return 1 result. In Firefox, it behaves as you describe.

        This difference is not caused by the search script (which has no way of determining which browser is used). It comes from how the two browsers submit the "İ" character when it is entered in the URL.
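
As a rough sketch of what this means on the receiving side, the snippet below decodes a percent-encoded query by trying UTF-8 first and falling back to ISO-8859-9. The helper `decode_query` is hypothetical, and the thread does not state which bytes each browser actually sends; the two sample inputs are simply the UTF-8 and ISO-8859-9 forms of "TOKİ".

```python
from urllib.parse import unquote_to_bytes


def decode_query(raw_query: str) -> str:
    """Decode a percent-encoded query, trying UTF-8 before ISO-8859-9."""
    raw_bytes = unquote_to_bytes(raw_query)
    try:
        return raw_bytes.decode("utf-8")
    except UnicodeDecodeError:
        # Not valid UTF-8, so assume the client sent ISO-8859-9 bytes.
        return raw_bytes.decode("iso-8859-9")


print(decode_query("TOK%C4%B0"))  # TOKİ
print(decode_query("TOK%DD"))     # TOKİ
```

The fallback is a heuristic: some ISO-8859-9 byte sequences are also valid UTF-8, so submitting the query from a page with a known encoding remains the more reliable fix.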

        Having said that, the real issue at hand stems from an inherent problem with the dotted and dotless "I" character in the Turkish language in Unicode. This is described in detail here:
        http://www.i18nguy.com/unicode/turkish-i18n.html

        While we can address this with a partial solution, there appears to be no complete solution to the problem. That is, we can recognize that the lowercase version of "İ" is "i", but we cannot recognize that in Turkish the lowercase version of "I" is "ı" (the undotted lowercase i character) instead of "i". This is because in Unicode a page can contain multiple languages and is not restricted to one particular language, so there is no practical way to determine which rule to apply.
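
The case-mapping problem can be reproduced in a few lines of Python. This is a minimal sketch, not anything from Zoom: `turkish_lower` and `turkish_upper` are hypothetical helpers that apply the Turkish rules unconditionally, which is exactly what cannot be done safely on mixed-language text.

```python
def turkish_lower(text: str) -> str:
    """Lowercase using Turkish rules: İ -> i and I -> ı."""
    return text.replace("İ", "i").replace("I", "ı").lower()


def turkish_upper(text: str) -> str:
    """Uppercase using Turkish rules: i -> İ and ı -> I."""
    return text.replace("i", "İ").replace("ı", "I").upper()


# Language-neutral lowercasing turns "İ" into "i" plus a combining dot
# (two code points), so the result does not equal plain "toki".
print("TOKİ".lower() == "toki")         # False
print(len("İ".lower()))                 # 2

# With the Turkish rules applied, the two spellings match.
print(turkish_lower("TOKİ") == "toki")  # True
print(turkish_upper("toki") == "TOKİ")  # True

# The same rules give the wrong answer for a non-Turkish word.
print(turkish_lower("INTERNET"))        # ınternet
```

The last line shows the ambiguity described above: without knowing the language of each word, neither rule is correct for every input.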

        This problem is also evident in the Windows API (as seen here under "Remarks"). Because of this, the changes necessary are a bit more significant than one would expect. We'll add this to our list of things to look at for V6.1.
        --Ray


        • #5
          You might want to use iso-8859-9 in Zoom for now and add a synonym for "toki"="TOKİ".
          --Ray


          • #6
            Will the ISO-8859-9 encoding solve my problem? I have 1M pages and I do not want to convert them from UTF-8 to ISO-8859-9 for nothing.


            • #7
              Just change the setting in the Zoom Indexer ("Configure"->"Languages"). This will make Zoom convert the content to iso-8859-9 during indexing. It will also expect an iso-8859-9 search template. But you should not need to change your content pages.
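
For a sense of what converting UTF-8 content to iso-8859-9 involves, here is a small Python sketch. It is not Zoom's implementation; `to_latin5` is a hypothetical helper, and replacing unsupported characters with "?" is an assumption made for the example.

```python
def to_latin5(utf8_bytes: bytes) -> bytes:
    """Re-encode UTF-8 bytes as ISO-8859-9, replacing unsupported characters."""
    text = utf8_bytes.decode("utf-8")
    return text.encode("iso-8859-9", errors="replace")


page = "TOKİ ığşçöü".encode("utf-8")
converted = to_latin5(page)

# Every Turkish letter fits in a single ISO-8859-9 byte.
print(len(page), len(converted))

# Characters outside ISO-8859-9, such as the euro sign, are lost.
print(to_latin5("€".encode("utf-8")))  # b'?'
```

The source pages stay untouched in this model; only the copy handed to the indexer is re-encoded.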
              --Ray


              • #8
                Thanks for your help. I will test it today.
                ----------------------------------------
                Edit: I tested the ISO-8859-9 based engine.

                These two searches again return different results.
                http://85.25.73.168/cgi-bin/rar/search.cgi?zoom_query=TOKİ
                http://85.25.73.168/cgi-bin/rar/search.cgi?zoom_query=toki

                Zoom Indexer still cannot map "i" as the lowercase of "İ" with the iso-8859-9 encoding.
                Last edited by Computeus; Sep-11-2009, 10:33 PM.


                • #9
                  Yes, please read my previous posts carefully, I said,

                  Originally posted by Ray
                  You might want to use iso-8859-9 in Zoom for now and add a synonym for "toki"="TOKİ".
                  You still need to add a synonym ("Configure"->"Synonyms") to make this work. I've also explained why it is difficult to map "i" to "İ": there are system-wide issues surrounding this problem (in the Windows API itself) that need to be worked around. This will have to wait until V6.1 at best.
                  --Ray


                  • #10
                    Of course I read your last post. But this issue occurs for every Turkish word that contains "ı", "I", "i" or "İ". So I would have to add a synonym for every such word.

                    I hope you will find a complete solution for this issue. Thanks for your help.
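
Because the issue affects every word containing one of these four letters, one commonly used alternative to per-word synonyms is to fold all four to a single letter both when indexing and when searching. The sketch below illustrates that idea only; it is not a Zoom feature, and `fold_turkish_i` and the tiny `pages` dictionary are hypothetical.

```python
# Map every dotted or dotless variant to plain "i" before lowercasing.
_I_FOLD = str.maketrans({"İ": "i", "I": "i", "ı": "i"})


def fold_turkish_i(word: str) -> str:
    """Return a search key that ignores the dotted/dotless distinction."""
    return word.translate(_I_FOLD).lower()


# Hypothetical pages and the words they contain.
pages = {"a.html": ["TOKİ"], "b.html": ["toki"], "c.html": ["IŞIK"]}

# Build the index using folded keys.
index: dict[str, set[str]] = {}
for page, words in pages.items():
    for word in words:
        index.setdefault(fold_turkish_i(word), set()).add(page)

# Both spellings of the query now hit the same entry.
print(sorted(index[fold_turkish_i("TOKİ")]))  # ['a.html', 'b.html']
print(sorted(index[fold_turkish_i("toki")]))  # ['a.html', 'b.html']
```

The trade-off is that words differing only in dotted versus dotless "i" become indistinguishable, so this buys recall at the cost of some precision.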
