Problems with Arabic diacritic marks

  • Problems with Arabic diacritic marks

    I am having a problem with Arabic-language searches: diacritic marks are not being stripped. I have selected "Strip Arabic diacritic marks from words" in the configuration and have reindexed the website.

    For example, I have content on the website that is entered as الأردن
    Most of our users would search for الاردن, which returns no results.

    Are there other settings I may need to adjust?

  • #2
    Are you using UTF-8 encoding?

    Can you give us the URL to the page in question?

    Also let us know if you are using PHP, ASP, ASP.NET, CGI, or JavaScript. And the version and build of Zoom you are using (click Help->About in the Indexer).

    Can you also give us the name of the diacritic mark in question? It's a little hard to recognize for those of us who don't read Arabic. Is that an "alif hamza"?
    Last edited by Ray; Apr-15-2010, 02:00 AM.
    --Ray
    Wrensoft Web Software
    Sydney, Australia
    Zoom Search Engine

    • #3
      Hi,
      I have the same problem. I use the PHP version of Zoom Search and tried two encodings:
      first UTF-8, then Windows-1256 (Arabic).

      Neither one strips the marks from my searches. For example,
      when I put this Arabic word in the search box:
      "ابليس"
      the results include every word that matches it exactly,
      but not these, for example:
      "إبليس"
      or this:
      "أبليس"

      Do you see the difference?
      Most Arabic speakers, when searching, don't bother to type this:
      "ء" which is called "hamza" in Arabic
      or
      "أ" which is "alif" with "hamza" above
      or
      "إ" which is "alif" with "hamza" below
      and just type a plain "alif"
      like this: "ا"
      CHM search matches the last two forms, "أ" and "إ",
      even if I type it like this: "ا"

      So I'm not sure whether I missed something.
      Last edited by mrbasserby; Apr-29-2012, 05:15 AM.

      • #4
        Hi, just one more thing.

        When you combine "ا" and "ل" and "ا" the result is: "الا"

        Likewise, when you combine "ا" and "ل" and "إ" the result is "الإ"

        In the same way we can write "الأ", as in this example:
        "الأردن"
        Last edited by mrbasserby; Apr-29-2012, 05:15 AM.
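
The hamza-bearing alifs described in posts #3 and #4 are single precomposed letters in Unicode, not a plain alif followed by a separately typed mark, which may be why an option that strips diacritic marks leaves them untouched. Below is a minimal JavaScript sketch of folding them to a plain alif. The function name `foldAlifForms` is hypothetical and is not part of Zoom, and folding the query alone would only help if the indexed words were folded the same way.

```javascript
// Hypothetical helper, not part of Zoom Search Engine.
// Folds U+0622 (madda), U+0623 (hamza above) and U+0625 (hamza below)
// to the plain alif U+0627, leaving other hamza carriers alone.
function foldAlifForms(text) {
  return text
    // NFD splits the precomposed letters, e.g. U+0623 -> U+0627 + U+0654.
    .normalize("NFD")
    // Drop a combining madda/hamza (U+0653..U+0655) that belongs to an alif,
    // allowing vowel marks that canonical ordering places before it.
    .replace(/(\u0627[\u064B-\u0652\u0670]*)[\u0653-\u0655]/g, "$1")
    // NFC recomposes everything that was not folded (for example U+0624).
    .normalize("NFC");
}

// "إبليس" folds to "ابليس"
console.log(foldAlifForms("\u0625\u0628\u0644\u064A\u0633") === "\u0627\u0628\u0644\u064A\u0633"); // true
```

The match is restricted to marks that follow an alif so that waw and ya with hamza keep their form.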

        • #5
          Hi, sorry for posting so many times.
          There is another issue. When I type a word in the search box, I expect the search engine to find both the diacritic and the non-diacritic forms of the word, whether or not the word I typed had diacritics. For example, if I put this word in the search box:
          "التّابُوتِ"
          it has diacritics, as you can see, and the search should
          return both of these forms:

          "التّابُوتِ"
          and
          "التابوت"

          But it does not. It only works that way when I enter this word:

          "التابوت"

          Then the results are:

          "التّابُوتِ"
          and
          "التابوت"

          • #6
            Did you enable the option to "Strip Arabic diacritic marks from words" under "Configure"->"Languages"?

            You will have to reindex (and upload your new index files) for it to take effect.
            --Ray
            Wrensoft Web Software
            Sydney, Australia
            Zoom Search Engine

            • #7
              Hi,
              Yes, I did, but it is not working the way I want. Here is another example to simplify:
              ------------
              With "Strip Arabic diacritic marks" enabled:
              do: index
              type: التابوت
              result: "التابوت" and "التّابُوتِ" /// OK, excellent

              type: التّابُوتِ
              result: sorry, no results found! /// it should give me the same result as above
              ------------
              With "Strip Arabic diacritic marks" disabled:
              do: index
              type: التابوت
              result: "التابوت" /// OK, good

              type: التّابُوتِ
              result: التّابُوتِ /// OK, good


              ////////////////////////////////////////////////////////
              Another example:
              --------------
              With "Strip Arabic diacritic marks" enabled:
              do: index
              type: الأردن
              result: "الأردن" /// it should also give me these: "الاردن" and "الإردن"

              ------------
              With "Strip Arabic diacritic marks" disabled:
              do: index
              type: الأردن
              result: "الأردن" /// OK, good



              • #8
                Just to clarify, it seems you did not have this feature enabled before, and the behaviour is now different with it enabled?

                From your most recent examples, it seems that it now matches all occurrences on a web page (with and without diacritics), so long as the user enters the non-diacritic version of the word into the search box.

                However, it will not match if the user enters the diacritic version of the word into the search box.

                Correct me if the above summary is inaccurate.

                If this is the case, then it is currently behaving as designed. The indexer can strip diacritic marks from Arabic text because it runs on your own computer. However, the search script (PHP, ASP, CGI, etc.) does not have this capability, because most hosting platforms are limited and it would be difficult to impose locale/regional settings on the web server (a lot of people are on shared hosting rather than dedicated servers).

                As you and the original poster of this thread stated, however, Arabic users usually do not type the diacritic marks when searching. So perhaps you can simply add some advice to the search_template.html page, before the search box, telling users to enter words without diacritic marks (and that this will match both the diacritic and non-diacritic versions found on pages).
                --Ray
                Wrensoft Web Software
                Sydney, Australia
                Zoom Search Engine
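
The asymmetry described in this post (stripping at index time but not at search time) can be pictured with a toy model. This is only a sketch in JavaScript: Zoom's real index format is not described in the thread, and the `MARKS` set, `stripMarks`, and `search` names are assumptions made for illustration.

```javascript
// Toy model only; not Zoom's actual index or search script.
// Assumed set of marks: tanwin, short vowels, shadda, sukun, superscript alif.
const MARKS = /[\u064B-\u0652\u0670]/g;
const stripMarks = (word) => word.replace(MARKS, "");

const withMarks = "\u0627\u0644\u062A\u0651\u0627\u0628\u064F\u0648\u062A\u0650"; // التّابُوتِ
const plain = "\u0627\u0644\u062A\u0627\u0628\u0648\u062A"; // التابوت

// Indexer side: every word is stored under its stripped form.
const index = new Map();
for (const word of [withMarks, plain]) {
  const key = stripMarks(word);
  index.set(key, (index.get(key) || []).concat(word));
}

// Search side: the query is looked up exactly as typed.
const search = (query) => index.get(query) || [];

console.log(search(plain).length); // 2
console.log(search(withMarks).length); // 0
console.log(search(stripMarks(withMarks)).length); // 2
```

The last line shows why stripping the query before it reaches the search script would make both inputs behave the same way.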

                • #9
                  Hi, sorry for my bad English. Yes, you understood exactly what I mean: when I type Arabic with diacritics and the strip option is enabled, no match is found even though the text exists.
                  And yes, you are right that people usually don't type diacritics. My case is an exception.
                  I have about 1330 pages, most of which contain Holy Quran text with Arabic diacritics. Visitors mostly look for diacritic text as it appears in the Holy Quran, as well as normal text, and they copy and paste Arabic text with diacritics into the search box to find the exact passage or statement they want. So maybe I should add a script that strips the diacritics from the text entered in the search box on the search page.
                  I need to do more tests first, and I will tell you the results.
                  Thank you.
                  Last edited by mrbasserby; May-01-2012, 03:41 AM.
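
A script of the kind proposed in this post could look roughly like the following JavaScript sketch. Everything here is an assumption rather than Zoom functionality: the helper name `stripArabicMarks`, the form id `search-form`, and the input name `q` are hypothetical, and the set of marks removed has to match whatever the indexer strips, which the thread does not list. Quranic annotation marks, for instance, fall outside the range used here.

```javascript
// Hypothetical helper; not part of Zoom Search Engine.
// Assumed mark set: U+064B..U+0652 (tanwin, short vowels, shadda, sukun)
// plus U+0670 (superscript alif). Adjust to match the indexer's behaviour.
const ARABIC_MARKS = /[\u064B-\u0652\u0670]/g;

function stripArabicMarks(query) {
  return query.replace(ARABIC_MARKS, "");
}

// Browser wiring, shown as a comment because it needs a page to run in.
// Assumed markup: a form with id "search-form" holding a text input named "q".
// document.getElementById("search-form").addEventListener("submit", (event) => {
//   const box = event.target.elements["q"];
//   box.value = stripArabicMarks(box.value);
// });

// Pasted text with marks becomes the plain word before it is submitted.
const pasted = "\u0627\u0644\u062A\u0651\u0627\u0628\u064F\u0648\u062A\u0650";
console.log(stripArabicMarks(pasted) === "\u0627\u0644\u062A\u0627\u0628\u0648\u062A"); // true
```

Stripping in the browser keeps the generated search script unmodified, so nothing is lost when the index is rebuilt.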

                  • #10
                    Hi, sorry if I ask too much. I really like your software; that's why I care.
                    Anyway, here is the situation. I did some more tests.
                    I indexed HTML files that are already in UTF-8 and generated the PHP version:
                    search.php
                    Then I indexed again and generated the JavaScript version:
                    search.html
                    For both I used UTF-8 as the character encoding option,
                    and I also set the encoding in
                    search_template.html
                    to UTF-8. Both files,
                    search.php
                    search.html
                    were generated with the diacritic stripping option disabled, just to make sure. Note that I used UTF-8 because, as I understand it, Zoom Search does not yet fully support Arabic Windows-1256 for content highlighting. Then I searched for a word that appears in the HTML content: "أبوي"
                    Here are the results:
                    --------------
                    For PHP the output was:
                    Search results for: "� بوي" No results found // whether the diacritic stripping option is enabled or disabled
                    --------------
                    You can see something is wrong with ARABIC LETTER ALEF WITH HAMZA ABOVE,
                    "أ": on the results page that letter came out like this: �
                    I am fairly sure this is a character encoding issue in the software.
                    ----------
                    For JavaScript, it did find an exact match for the word, and the word is highlighted on the content page, but it did not appear in the result description. That is OK, because I understand this is not yet supported for JavaScript.
                    ----------

                    So please, can you advise me how to fix this issue, or do I have to wait for a new version of the software? I ran these tests on Windows XP SP3 with a portable XAMPP localhost server.
                    Last edited by mrbasserby; May-06-2012, 12:40 PM.

                    • #11
                      We've confirmed that this is a bug in the current release [V6.0.1028]. It has been fixed in the V7 Alpha release.

                      If you wish to apply the fix manually by editing the PHP script, then search for this line in "search.php":

                      Code:
                      $query = preg_replace("/[\s\(\)\^\[\]\|\{\}\%\£\!]+|[\-._',:&\/\\\](\s|$)/u", " ", $query);
                      And replace it with this:

                      Code:
                      $query = preg_replace("/[\s\(\)\^\[\]\|\{\}\%\!]+|[\-._',:&\/\\\](\s|$)/u", " ", $query);
                      Note that you will have to be very careful when you're editing the PHP script and we would not advise doing this if you are uncomfortable with PHP scripting.

                      Note that the "search.php" file in the output folder will be rewritten when you re-index. You can modify the source copy under "C:\ProgramData\Wrensoft\Zoom Search Engine Indexer\scripts\PHP or ASP\" but note that modified scripts are difficult for us to support as functionality may be broken by incorrect modifications. So if you are uncomfortable with editing, then use V7 Alpha.
                      --Ray
                      Wrensoft Web Software
                      Sydney, Australia
                      Zoom Search Engine
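
The thread does not say why removing `\£` from the pattern fixes the garbled "أ". One plausible explanation, offered here only as a hypothesis, is that the pattern ended up being matched byte by byte: in UTF-8 the pound sign is the bytes C2 A3 and "أ" is D8 A3, so they share the byte A3. The JavaScript sketch below mimics such a byte-wise replacement; `replaceBytes` and `hex` are illustrative names and this is not the code in search.php.

```javascript
// Illustration of a hypothesis only; the root cause is not stated in the thread.
const encoder = new TextEncoder();
const decoder = new TextDecoder("utf-8");
const hex = (bytes) => Array.from(bytes, (b) => b.toString(16)).join(" ");

const poundBytes = encoder.encode("\u00A3");
console.log(hex(poundBytes)); // c2 a3
console.log(hex(encoder.encode("\u0623"))); // d8 a3

// Mimic a byte-wise "replace any of these bytes with a space" step.
function replaceBytes(text, bytesToBlank) {
  const bytes = encoder.encode(text);
  const out = bytes.map((b) => (bytesToBlank.includes(b) ? 0x20 : b));
  return decoder.decode(out);
}

// Searching for "أبوي": the A3 byte of the first letter is blanked,
// leaving an invalid lead byte (shown as U+FFFD) followed by a space.
const damaged = replaceBytes("\u0623\u0628\u0648\u064A", Array.from(poundBytes));
console.log(damaged === "\uFFFD \u0628\u0648\u064A"); // true
```

The damaged string has the same shape as the "� بوي" reported in post #10, which is what makes the hypothesis attractive, but it remains unconfirmed.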

                      • #12
                        Hi, thanks so much. I will try it and I hope everything goes right.
                        Thank you.

                        • #13
                          Hi guys, I have one more question: how can I make the search engine match all of the following letters when searching for any one of them?
                          "أ" alif with hamza above
                          "إ" alif with hamza below
                          "ا" plain alif
                          "آ" alif with madda above

                          The CHM search engine is able to find
                          ( أ , ا, إ )
                          in the same search. Is there any way I can do that with Zoom Search?
                          Thanks in advance.

                          • #14
                            While I haven't looked at this in detail, I would have assumed these characters would just work like any other character if you are using UTF-8 as the character set.

                            Is there something special about these characters compared to all other Arabic characters?

                            • #15
                              I use UTF-8 as my character encoding setting, and
                              the characters I mentioned in my post above often come at the start of Arabic words, for example:
                              أمي
                              Even with stripping of diacritics enabled, the search still can't find similar words like
                              امي
                              إمي
                              Sometimes when typing in Arabic we don't put the hamza on the alif, as in أ or إ.
                              We simply type " ا " (without the quotes). It would be nice to be able to match all three of these characters when searching for words that contain any one of them, as in this example:

                              search:
                              ان
                              results:
                              ان + إن + أن
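
One way to approximate the behaviour requested here without changing the index is to expand the typed word into every alif spelling and search for all of them. The JavaScript below is a minimal sketch of that idea. `alifVariants` is a hypothetical helper, not a Zoom feature, and it assumes the search page can be asked to match any one of several words.

```javascript
// Hypothetical helper; not part of Zoom Search Engine.
// Plain alif, alif with hamza above, alif with hamza below, alif with madda.
const ALIF_FORMS = ["\u0627", "\u0623", "\u0625", "\u0622"];

// Returns every spelling of `word` obtained by swapping each alif-like
// letter for each of the four forms.
function alifVariants(word) {
  let variants = [""];
  for (const ch of word) {
    const choices = ALIF_FORMS.includes(ch) ? ALIF_FORMS : [ch];
    variants = variants.flatMap((prefix) => choices.map((c) => prefix + c));
  }
  return variants;
}

// "ان" expands to four spellings, including "أن" and "إن".
console.log(alifVariants("\u0627\u0646").length); // 4
```

The list grows as 4 to the power of the number of alif-like letters in the word (16 spellings for "الأردن"), so a real page might vary only the letters that actually differ in its content.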
