Problem on result page with UTF-8 (Cyrillic)


  • Problem on result page with UTF-8 (Cyrillic)

    I've recently purchased the Pro version of Zoom Search 5, and I'm impressed by how straightforward it is to use and how easy it is to integrate into a website.

    I plan to use Zoom Search on a Russian (UTF-8) website. Nearly everything works perfectly in Cyrillic except for one thing:
    On the results page, the text "Results found for:" is displayed correctly in Cyrillic, but the user's search string next to it is displayed as question marks (?????). Searching for an English word works fine, so I suspect a character encoding problem.
    Is it possible to suppress this information (as a workaround), or can I fix the problem some other way?
    Thanks very much for any advice.

  • #2
    Can you post the URL of your website so that we can see the problem?

    Can you also tell me:
    - What character set are you using in the search_template.html file?
    - What character set did you select in the Zoom configuration window?
    - Which search script option are you using: PHP, ASP, JS or CGI?
    - What type of server do you have: IIS or Apache?


    • #3
      Yes, the URL is http://www.24swiss.ch/search/search.php
      It's not the final version, but I had Zoom Search up and running in half an hour. Great.

      You could use the word Маленький as a search example.

      - The charset in search_template.html is utf-8.
      - The charset / language in the Zoom config is Use unicode / utf-8 with Russian.zlang.
      - The search script language is PHP.
      - It's an Apache server under Linux.

      Thanks.


      • #4
        Hi,

        As I got no reply, I helped myself by commenting out the line

        print("<div class=\"searchheading\">" . $STR_RESULTS_FOR . " " . $queryForHTML)

        in the search.php script, so at least I now have a page without ??????????.
        However, the problem seems to be the PHP function htmlspecialchars, for which UTF-8 support was added in version 4.1. I'll try to find out which PHP version my provider runs.
        Surprisingly, htmlspecialchars is used several times in search.php, but the line mentioned above is the only one where it doesn't work.
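
To illustrate the symptom described in this post, here is a minimal Python sketch (not Zoom's search.php, whose internals are not shown in this thread). It assumes that the question marks come from a lossy conversion to a single-byte charset somewhere in the output path, which is only one plausible cause. HTML escaping on its own leaves Cyrillic text untouched, whereas converting the same string to ISO-8859-1 replaces every Cyrillic letter with "?".

```python
import html

# The example search word from post #3.
query = "Маленький"

# Escaping HTML metacharacters (&, <, >, quotes) does not alter Cyrillic letters.
escaped = html.escape(query)

# Assumption for illustration: some step converts the text to ISO-8859-1,
# which has no Cyrillic letters, so each one is replaced by "?".
lossy = query.encode("iso-8859-1", errors="replace").decode("iso-8859-1")

print(escaped)  # Маленький
print(lossy)    # ?????????
```

If the real cause is similar, the fix would be to keep the string in UTF-8 end to end rather than to suppress the heading.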


        • #5
          I just tested this but was unable to reproduce the behaviour. The word appeared correctly in the heading. I was using Zoom 5.0.1001, PHP 5.1, Apache 2.0.

          It may be an issue with a specific version of PHP. You can check what version of PHP you are running by uploading a simple .php script like the following:

          Code:
          <?php
          phpinfo();
          ?>
          --Ray
          Wrensoft Web Software
          Sydney, Australia
          Zoom Search Engine


          • #6
            Cyrillic character display problem

            Hi,

            I use Zoom to search PDF files in Ukrainian.
            It works fine except for one issue: the character "р" (CYRILLIC SMALL LETTER ER, Unicode: 0440, UTF8: D1 80) is replaced in the search results with "ç" (LATIN SMALL LETTER C WITH CEDILLA, Unicode: 00E7, UTF8: C3 A7).
            Because of that, finding a word containing "р" takes two searches: one in the Zoom index, and then a manual one in the PDF file itself.

            Any ideas why this happens and how to fix it?

            Thanks.


            • #7
              We might need to see the PDF file in question to confirm the problem. Can you e-mail it to us?

              One possibility is that PDF files often contain an "invisible" text layer alongside the actual visible image layer of the text (especially if the document was scanned from paper sources). In such cases, the text layer is usually created by an OCR program (such as Acrobat Paper Capture), and the OCR program may have recognized the wrong character and stored it in the PDF file. You should be able to confirm this by selecting the text in Acrobat Reader and copying and pasting it into a text editor.
              --Ray
              Wrensoft Web Software
              Sydney, Australia
              Zoom Search Engine


              • #8
                The character itself is OK: I can search individual pages with no problem.
                The problem happens with both PDF and DJVU files.
                But not always, only sometimes.


                • #9
                  Another question: I told Zoom not to index words shorter than two characters, but they still appear in the search results. Why?
                  Here is the link:
                  http://svoboda-news.com/arxiv.php


                  • #10
                    Sorry, but we can't really see the problem, since we don't speak Ukrainian. Can you narrow it down more specifically, pointing us to a specific PDF file, the word you are searching for, the results you expect, and the results you are seeing?

                    Regarding your second question, I don't see any words shorter than two characters being indexed on your website. Remember that words shorter than two characters will continue to appear in the search results as part of page titles and context descriptions. The setting just means that they are not indexed, so you will not be able to search for a single-letter word.
                    --Ray
                    Wrensoft Web Software
                    Sydney, Australia
                    Zoom Search Engine
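
The distinction in the reply above (short words are not searchable, yet still show up in titles and context) can be sketched with a toy indexer in Python. This is not Zoom's indexer; `build_index`, `search`, `min_len` and the sample page are hypothetical names chosen for illustration.

```python
import re

def build_index(pages, min_len=2):
    """Map each word of at least min_len characters to the pages containing it."""
    index = {}
    for url, text in pages.items():
        for word in re.findall(r"\w+", text.lower()):
            if len(word) >= min_len:  # shorter words are skipped, not indexed
                index.setdefault(word, set()).add(url)
    return index

def search(index, pages, term):
    """Return matching pages with their full, unfiltered text as context."""
    return {url: pages[url] for url in index.get(term.lower(), ())}

pages = {"a.html": "Я люблю Україну"}
index = build_index(pages, min_len=2)

print(search(index, pages, "Я"))      # no hits: the one-letter word was never indexed
print(search(index, pages, "люблю"))  # one hit whose context still contains "Я"
```

The minimum length filters what goes into the index only; the stored page text used for display is not touched, which is why short words remain visible in results.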


                    • #11
                      OK,
                      here is the problem:

                      I type into the search field any Cyrillic word that contains the character "р" (it looks like "p" in English but is encoded differently), for example: Україна
                      I click Submit and get the search results shown in picture 1 (first link, please download).
                      The red ovals mark places where the correct characters are displayed.
                      The red rectangle marks the place where the problem occurs. As you can see, the normal character is replaced with a different one that does not even exist in the Cyrillic alphabet.

                      Now, when we sort the results by relevance or date, the incorrect character appears in more places, as seen in picture 2, replacing the good ones. The problematic character in the search heading is still in the same place.

                      The context is returned with correct characters.

                      Links to download the pictures:
                      http://svoboda-news.com/ftp/Sites/filechute//1.pdf
                      http://svoboda-news.com/ftp/Sites/filechute//2.pdf

                      There is also a link to a DjVu page that contains the word.
                      Copy and paste the word to search for it: Україна
                      http://svoboda-news.com/ftp/Sites/fi...yrillic17.djvu

                      Please, let me know if you need any more information.


                      • #12
                        We are currently looking at this problem. It is related to how the CGI compares UTF-8 text case-insensitively. We'll update with more information as it becomes available.
                        --Ray
                        Wrensoft Web Software
                        Sydney, Australia
                        Zoom Search Engine
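
The CGI's source is not shown in this thread, so the following Python sketch only illustrates the general class of bug: case-folding UTF-8 text byte by byte, as if it were a single-byte charset, can corrupt multi-byte characters. It does not attempt to reproduce the exact "р" to "ç" substitution reported above. The function names are hypothetical.

```python
def latin1_lower_bytes(data: bytes) -> bytes:
    """Flawed approach: lowercase each byte as if it were an ISO-8859-1 character."""
    out = bytearray()
    for b in data:
        # ASCII A-Z and Latin-1 0xC0-0xDE (except 0xD7, the multiplication sign)
        # have their lowercase form 0x20 higher.
        if 0x41 <= b <= 0x5A or (0xC0 <= b <= 0xDE and b != 0xD7):
            out.append(b + 0x20)
        else:
            out.append(b)
    return bytes(out)

def utf8_casefold(data: bytes) -> bytes:
    """Safer approach: decode to code points, case-fold, then re-encode."""
    return data.decode("utf-8").casefold().encode("utf-8")

def is_valid_utf8(data: bytes) -> bool:
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

word = "Україна".encode("utf-8")

# The lead bytes 0xD0 and 0xD1 of Cyrillic letters fall in the Latin-1 uppercase
# range, so the byte-wise version rewrites them and breaks the encoding.
print(is_valid_utf8(latin1_lower_bytes(word)))  # False
print(utf8_casefold(word).decode("utf-8"))      # україна
```

Comparing text only after decoding it to code points avoids treating the bytes of one character as if they were separate letters.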


                        • #13
                          Thank you for the quick response.

                          Speaking of case insensitivity: when I search for a word with a capital first letter (a name, like "Часто"), the highlighted results include all words with the same spelling ("часто") except the word with the capital first letter.
                          Maybe this is related to the case handling.
                          Last edited by svoboda; Mar-04-2008, 02:54 PM.


                          • #14
                            Any success with that case-insensitivity problem, Ray?


                            • #15
                              We have fixed it for an upcoming build, for "CYRILLIC SMALL LETTER ER" specifically. If you would like to test it out, e-mail us and tell us which OS platform you are running the CGI on, and we can send you a test build.

                              Please also let us know if you find an issue with any other Cyrillic characters.

                              RE: the highlighting problem, no, that is unrelated. The highlighting function has always been limited in its ability to highlight non-English words with varying upper/lowercase forms. We are planning to implement a different highlighting method in V6, which should improve this.
                              --Ray
                              Wrensoft Web Software
                              Sydney, Australia
                              Zoom Search Engine
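
For readers curious what Unicode-aware highlighting can look like, here is a minimal Python sketch. It is not the method planned for V6, which this thread does not describe; the `highlight` function and the `<b>` markup are hypothetical choices for illustration.

```python
import re

def highlight(text: str, term: str) -> str:
    """Wrap every case-insensitive occurrence of term in <b> tags."""
    # On str patterns, re.IGNORECASE applies Unicode case rules, so "Часто"
    # and "часто" both match; the original casing is kept in the output.
    pattern = re.compile(re.escape(term), re.IGNORECASE)
    return pattern.sub(lambda m: f"<b>{m.group(0)}</b>", text)

print(highlight("Часто буває так, але часто й ні.", "часто"))
# <b>Часто</b> буває так, але <b>часто</b> й ні.
```

The key point is that matching happens on decoded text with Unicode case rules, so both the capitalized and lowercase forms from post #13 are highlighted.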
