Problems with Arabic diacritic marks


  • Ray
    replied
    Originally posted by mrbasserby
    now my suggestion is: why not make the search engine find words with or without diacritics? There is no need to give the user the option to strip diacritics from words; just enable it by default.
    We can't do that because it would very likely cause issues when the script is used to search sites in other languages. Some characters overlap between character sets, even with Unicode, so the stripping can cause problems when the script is used on a site that isn't using Unicode: it will match the wrong characters, or the middle of a multi-byte character.

    Having said that, we already have an option to toggle "Strip Arabic diacritic marks", so we can change the script behaviour according to this. There might still be complications with different charsets used for Arabic websites, e.g. UTF-8 or windows-1256 or iso-8859-6.

    So this is more involved to get it to work properly and for everybody.

    We've added this to our V7 todo list.
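A minimal sketch of the charset point above, not Zoom's actual code: one way to avoid matching the wrong bytes is to decode the text to UTF-8, strip the marks there, and encode it back to the page's charset. The function name `strip_diacritics_in_charset` is hypothetical, the set of marks is the nine that appear in the regular expression posted in this thread, and the sketch assumes PHP 7+ with an iconv extension that knows the charset names passed in.

```php
<?php
// Hypothetical helper, not part of Zoom's search.php.
function strip_diacritics_in_charset(string $text, string $charset): string
{
    $isUtf8 = strcasecmp($charset, 'UTF-8') === 0;

    // Decode to UTF-8 so the pattern sees whole characters instead of bytes.
    $utf8 = $isUtf8 ? $text : iconv($charset, 'UTF-8', $text);
    if ($utf8 === false) {
        return $text; // could not decode: leave the input untouched
    }

    // U+064B-U+0652: tanween, short vowels, shadda, sukun; U+0670: superscript alef.
    $stripped = preg_replace('/[\x{064B}-\x{0652}\x{0670}]/u', '', $utf8);
    if ($stripped === null) {
        return $text; // not valid UTF-8: leave the input untouched
    }

    if ($isUtf8) {
        return $stripped;
    }
    $encoded = iconv('UTF-8', $charset, $stripped);
    return $encoded === false ? $text : $encoded;
}

// Beh + fatha is C8 EE in ISO-8859-6 and D8 A8 D9 8E in UTF-8.
echo bin2hex(strip_diacritics_in_charset("\xC8\xEE", 'ISO-8859-6')), "\n";
echo bin2hex(strip_diacritics_in_charset("\xD8\xA8\xD9\x8E", 'UTF-8')), "\n";
```

The round trip costs two conversions per string, but it keeps a single pattern for every charset instead of one hand-written byte list per encoding.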



  • mrbasserby
    replied
    Also, I want to say that this regular expression will strip all the diacritics from Arabic text for sure:

    PHP Code:
    return preg_replace('/َ|ِ|ً|ٍ|ُ|ٌ|ّ|ْ|ٰ/', '', strtolower($string));
    The only thing that remains is being able to find Arabic characters that are merged with hamza ( ء ).
    In this case only the alif character has two types: long alif ( ا ) and short alif ( ى ).

    For example, for the input:
    أَبْتَغِى

    the output we want to match would be:
    أَبْتَغِى
    or
    (ا أ إ آ)بتغ(ا آ أ ئ ى)
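The matching described above can be sketched as a pattern builder. This is only an illustration under assumptions: `build_loose_pattern` is a hypothetical name, the two letter groups are copied from the example in this post rather than from any standard, and PHP 7+ with UTF-8 text is assumed. Each letter of the query becomes either itself or a character class of its variants, and optional diacritics are allowed after every letter.

```php
<?php
// Hypothetical helper: build a regex that ignores diacritics and treats the
// alif/hamza variants from the post as interchangeable.
function build_loose_pattern(string $query): string
{
    $marks = '[\x{064B}-\x{0652}\x{0670}]*'; // zero or more diacritics

    // Groups taken from the example in the post (an assumption, not a standard).
    $alif    = "[\u{0627}\u{0623}\u{0625}\u{0622}]";         // ا أ إ آ
    $maqsura = "[\u{0627}\u{0622}\u{0623}\u{0626}\u{0649}]"; // ا آ أ ئ ى
    $classes = [
        "\u{0627}" => $alif, "\u{0623}" => $alif,
        "\u{0625}" => $alif, "\u{0622}" => $alif,
        "\u{0649}" => $maqsura, "\u{0626}" => $maqsura,
    ];

    // Drop the diacritics the user typed, then split into whole characters.
    $bare = preg_replace('/[\x{064B}-\x{0652}\x{0670}]/u', '', $query);
    $letters = preg_split('//u', (string) $bare, -1, PREG_SPLIT_NO_EMPTY);

    $parts = [];
    foreach ($letters as $letter) {
        $parts[] = ($classes[$letter] ?? preg_quote($letter, '/')) . $marks;
    }
    return '/' . implode('', $parts) . '/u';
}

// The input from the post, written as code points: أَبْتَغِى
$typed = "\u{0623}\u{064E}\u{0628}\u{0652}\u{062A}\u{064E}\u{063A}\u{0650}\u{0649}";
$pattern = build_loose_pattern($typed);

var_dump(preg_match($pattern, "\u{0627}\u{0628}\u{062A}\u{063A}\u{0649}")); // int(1): ابتغى
var_dump(preg_match($pattern, $typed));                                     // int(1)
```

The same pattern could in principle drive highlighting, since it matches the text as written on the page instead of a stripped copy.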



  • mrbasserby
    replied
    In the search.php file it could look like this example:

    PHP Code:
    function strip_Dia($string)
    {
        // strip the Arabic diacritic marks from the string
        return preg_replace('/َ|ِ|ً|ٍ|ُ|ٌ|ّ|ْ|ٰ/', '', strtolower($string));
    }


    // we use the method=GET and 'query' parameter now (for sub-result pages etc)
    $IsZoomQuery = 0;
    if (isset($_GET['zoom_query']))
    {
        $inputStri = $_GET['zoom_query'];
        $outputStri = strip_Dia($inputStri);
        $query = $outputStri;
        $IsZoomQuery = 1;
    }
    else
        $query = "";
    The same principle applies to other languages in Unicode (UTF-8).
    I still haven't found a way to make highlight.js mark both types.



  • mrbasserby
    replied
    Now my suggestion is: why not make the search engine find words with or without diacritics? There is no need to give the user the option to strip diacritics from words; just enable it by default. If the words the user types into the search box contain diacritics, the script would first strip the diacritics from them and then try to find the matching words, whether those have diacritics or not. As users, we want the search system to find both types, and also to jump to and highlight both types of characters.

    For example, I want to find the character alif ا or أ or إ or even آ, or any alif with diacritics.

    I want the results to include all of these types no matter which type of alif I type.
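A minimal sketch of the behaviour requested above, assuming the usual approach of reducing both the indexed word and the typed query to one common form before comparing them. The names `fold_alif` and `normalize_for_search` are hypothetical, the diacritic set is the one from the regular expression posted in this thread, and PHP 7+ with UTF-8 text is assumed.

```php
<?php
// Hypothetical helper: map the hamza and madda forms of alif to plain alif.
function fold_alif(string $text): string
{
    return strtr($text, [
        "\u{0623}" => "\u{0627}", // أ -> ا
        "\u{0625}" => "\u{0627}", // إ -> ا
        "\u{0622}" => "\u{0627}", // آ -> ا
    ]);
}

// Hypothetical helper: strip diacritics, then fold the alif variants.
function normalize_for_search(string $text): string
{
    $stripped = preg_replace('/[\x{064B}-\x{0652}\x{0670}]/u', '', $text);
    return fold_alif($stripped ?? $text); // fall back if the text is not valid UTF-8
}

// Two spellings with different alif forms compare equal after normalizing.
$a = normalize_for_search("\u{0623}\u{064E}\u{0646}"); // أَن
$b = normalize_for_search("\u{0625}\u{0646}");         // إن
var_dump($a === $b); // bool(true)
```

Because the same function is applied to both sides, it does not matter which form of alif the user types.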



  • mrbasserby
    replied
    I found the script example above on the net, but it could help as an example for the JavaScript that highlights words with diacritics in the highlight script file. These are the standard Arabic characters:



    أ alif with above hamza
    ب baa
    ت taa
    ث close to thaa
    ج jaa
    ح haa or 7aa
    خ khaa
    د daa
    ذ thaa
    ر raa
    ز zaa
    س saa
    ش shaa
    ص close to saa
    ض close to daa
    ط close to taa
    ظ close to thaa
    ع ayin or close to aaa
    غ close to khaa
    ف faa
    ق close to kaa
    ك kaa
    ل laa
    م maa
    ن naa
    هـ haa
    و waa
    ي yaa



    And these are the diacritics used with them. I will attach each one to ( ـ ) as a stand-in for an Arabic character:


    ( ـُ )
    damma

    ( ـَ )
    fattha

    ( ـِ )
    kassra

    ( ـٌ )
    tanween damma or double damma

    ( ـً )
    tanween fattha or double fattha

    ( ـٍ )
    tanween kassra or double kassra

    ( ـْ )
    skoon

    ( ـّ )
    shadda

    ( ـَّ )
    fattha above shadda

    ( ـُّ )= -ّ + -ُ
    damma above shadda


    ( ـِّ )= -ّ + -ِ
    shadda above kassra

    ( ـَّ )= -ّ + -َ
    fattha above shadda

    ( ـٌّ )= -ّ + -ٌ
    double damma above shadda

    ( ـٍّ ) = -ّ + -ٍ
    shadda above tanween kassra


    إ = ا + ء
    stand alone character
    hamza under alif


    أ = ا + ء
    stand alone character
    hamza above alif


    آ = ا + ~
    stand alone character
    madda above alif


    لأ = ل + أ
    stand alone character
    laa with hamza above alif

    لإ = ل + إ
    stand alone character
    laa with hamza under alif

    لآ = ل + آ
    stand alone character
    laa with madda above alif

    ( ؤ )= و + ء
    stand alone character
    hamza above waw

    ئ = ى + ء
    stand alone character
    hamza above short alif

    ( ى ) short alif stand alone character

    ( ء ) just hamza consider as stand alone character

    You could use Notepad to see them better, and to understand how these characters sound you could use an Arabic text-to-speech program like this one:

    https://acapela-box.com/AcaBox/index.php

    If you guys need more information on how to type them on a keyboard, I'm glad to help with more details. Also check the Wikipedia page for images and more information:
    http://en.wikipedia.org/wiki/Arabic_diacritics
    Last edited by mrbasserby; Dec-03-2012, 04:23 PM.



  • Ray
    replied
    We can probably add something like that into V7. However, we're not familiar with Arabic lettering so I'm not entirely sure how universal the above suggestion is. Did you write that bit of code yourself, or is it from someone else? Are you aware that it simply strips the following 5 characters:







    From the two strings being compared? Is that enough to fix all issues with diacritic marks in Arabic or are there other marks that are not addressed by this approach?
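One way to approach the question above is to test a candidate list of marks against real text and see what is left. The sketch below is an assumption-laden illustration: `remaining_arabic_marks` is a hypothetical name, it relies on PCRE's Unicode property `\p{Mn}` (nonspacing marks) limited to the Arabic block U+0600-U+06FF, and the stripped set shown is the nine marks from the regular expression posted in this thread, since the five characters referred to above were not preserved. The Arabic block does contain other combining marks, for example U+0653 (maddah above) and U+0654 (hamza above).

```php
<?php
// Hypothetical helper: list the nonspacing marks from the Arabic block
// (U+0600-U+06FF) that are present in a UTF-8 string.
function remaining_arabic_marks(string $text): array
{
    // The lookahead limits \p{Mn} (nonspacing marks) to the Arabic block.
    if (!preg_match_all('/(?=[\x{0600}-\x{06FF}])\p{Mn}/u', $text, $m)) {
        return []; // no marks found, or the text is not valid UTF-8
    }
    $found = [];
    foreach (array_unique($m[0]) as $mark) {
        // Characters in this block are two bytes long in UTF-8.
        $codePoint = ((ord($mark[0]) & 0x1F) << 6) | (ord($mark[1]) & 0x3F);
        $found[] = sprintf('U+%04X', $codePoint);
    }
    return $found;
}

// beh, fatha, shadda, alif, maddah above
$sample = "\u{0628}\u{064E}\u{0651}\u{0627}\u{0653}";
$afterStrip = preg_replace('/[\x{064B}-\x{0652}\x{0670}]/u', '', $sample);
print_r(remaining_arabic_marks($afterStrip)); // only U+0653 is left
```

Running this over a sample of the indexed pages would show whether any marks survive a given stripping rule.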



  • mrbasserby
    replied
    Hi, I wonder if it's possible for highlight.js to highlight words with diacritics, like in this example:

    http://jsfiddle.net/FUg85/15/



  • mrbasserby
    replied
    Originally posted by wrensoft
    OK I see, you are asking for the 3 types of alif character to be treated as the same character. So when you search for one of them, it matches the other 2 versions of the character. Correct?

    Like we do for French accents, é and e for example.
    Exactly. There is also a fourth type of alif character, which is آ (alif with madda).
    Last edited by mrbasserby; Nov-30-2012, 10:11 PM.



  • David
    replied
    OK I see, you are asking for the 3 types of alif character to be treated as the same character. So when you search for one of them, it matches the other 2 versions of the character. Correct?

    Like we do for French accents, é and e for example.



  • mrbasserby
    replied
    Originally posted by wrensoft
    While I haven't looked at this in detail, I would have assumed these characters would just work like any other character if you are using UTF-8 as the character set.

    Is there something special about these characters compared to all other Arabic characters?
    I use UTF-8 as my character encoding setting, and the characters I mentioned in my post above appear at the start of Arabic words, for example:
    أمي
    Even when I enable the strip diacritics option, it still can't find similar words like
    امي
    إمي
    Sometimes when typing in Arabic we don't put the hamza on the alif like this أ or إ;
    we simply type it like this " ا " (without the quotes). It would be nice to be able to find all of them when searching for words that contain one of these 3 characters, like this example:

    search :
    ان
    results :
    ان + إن +أن



  • David
    replied
    While I haven't looked at this in detail, I would have assumed these characters would just work like any other character if you are using UTF-8 as the character set.

    Is there something special about these characters compared to all other Arabic characters?



  • mrbasserby
    replied
    Hi guys, I have one more question: how can I make the search engine match all of the following letters when I search for one of them?
    "أ" alif with hamza above
    "إ" alif with hamza below
    "ا" plain alif
    "آ" alif with madda above

    The CHM search engine is able to find:
    ( أ , ا, إ )
    all at the same time when searching. Is there any way I can do that with Zoom Search?
    Thanks in advance.



  • mrbasserby
    replied
    Hi, thanks so much, man. I will try it and I hope everything goes right.
    Thank you.



  • Ray
    replied
    We've confirmed that this is a bug in the current release [V6.0.1028]. It has been fixed in the V7 Alpha release.

    If you wish to apply the fix manually by editing the PHP script, then search for this line in "search.php":

    Code:
    $query = preg_replace("/[\s\(\)\^\[\]\|\{\}\%\£\!]+|[\-._',:&\/\\\](\s|$)/u", " ", $query);
    And replace it with this:

    Code:
    $query = preg_replace("/[\s\(\)\^\[\]\|\{\}\%\!]+|[\-._',:&\/\\\](\s|$)/u", " ", $query);
    Note that you will have to be very careful when you're editing the PHP script and we would not advise doing this if you are uncomfortable with PHP scripting.

    Note that the "search.php" file in the output folder will be rewritten when you re-index. You can modify the source copy under "C:\ProgramData\Wrensoft\Zoom Search Engine Indexer\scripts\PHP or ASP\" but note that modified scripts are difficult for us to support as functionality may be broken by incorrect modifications. So if you are uncomfortable with editing, then use V7 Alpha.
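The thread does not say why removing `\£` fixes the search. One explanation that is consistent with the reported output "� بوي", offered here only as an assumption, is that the pound sign reaches the regex engine as the single byte 0xA3 (its Latin-1 and Windows-1252 encoding) and is matched byte by byte. In UTF-8 the letter أ (U+0623) is the byte pair D8 A3, so a byte-wise match on 0xA3 would cut that letter in half and put a space in its place. The sketch below only demonstrates that byte overlap. `replace_pound_bytewise` is a hypothetical name and is not the code from search.php.

```php
<?php
// Assumption: the pound sign is the single byte 0xA3 and the pattern is
// applied byte by byte (no /u modifier). Not the actual search.php code.
function replace_pound_bytewise(string $query): string
{
    return (string) preg_replace("/[\xA3]+/", " ", $query);
}

// "أب" spelled out as UTF-8 bytes so the demo does not depend on file encoding.
$query = "\xD8\xA3\xD8\xA8";

echo bin2hex($query), "\n";                         // d8a3d8a8
echo bin2hex(replace_pound_bytewise($query)), "\n"; // d820d8a8
```

The leftover byte D8 is not a valid character on its own, which is how a browser would come to show � followed by a space.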



  • mrbasserby
    replied
    Hi, sorry if I ask too much; I really like your software, that's why I care.
    Anyway, here is the case. I did some more tests:
    I indexed HTML files that are already in UTF-8 and generated two sets of files, one for PHP:
    search.php
    and then I indexed again and generated the JavaScript version:
    search.html
    For both I used UTF-8 for the character encoding option,
    and I also changed the encoding in:
    search_template.html
    to UTF-8. Both files:
    search.php
    search.html
    have the strip diacritics option disabled, just to make sure. Note that I used UTF-8 because, as I understand it, Zoom Search still does not fully support Arabic windows-1256 for content highlighting. Then I searched for a word that is in the HTML content: "أبوي"
    Here are the results:
    --------------
    For PHP the output was:
    Search results for: "� بوي" No results found !!! // whether the strip diacritics option is enabled or disabled
    --------------
    As you can see, something is wrong with ARABIC LETTER ALEF WITH HAMZA ABOVE
    "أ": on the search results page that letter came out like this: �
    I'm really sure this is a character encoding issue in the software.
    ----------
    The JavaScript version did find an exact match for the word and highlighted it in the content page, but the match didn't appear in the content description, which is OK because I understand that is still not supported for JavaScript.
    ----------

    So please, can you advise me how to fix this issue, or do I have to wait for a new version of the software? I did these tests on Windows XP SP3 with a portable XAMPP localhost server.
    Last edited by mrbasserby; May-06-2012, 12:40 PM.

