Arabic text not getting indexed correctly


  • Arabic text not getting indexed correctly

    مصر
    I'm having some issues with Arabic-language search. It seems that not all of the pages are being indexed. A search for 'مصر' ('Egypt') returns only one result: http://www.carnegie-mec.org/publications/?fa=40709&lang=ar

    I know there are other pages in the site that should be getting indexed such as
    http://carnegie-mec.org/publications/?fa=40907 or http://carnegie-mec.org/publications/?fa=40868, which contain many occurrences of 'مصر'.

    I checked the log files and saw the following:
    14|06/09/10 12:09:47|Index Thread got ready buffer for http://www.carnegie-mec.org/publications/?fa=40907 (Content-type: HTML text)
    01|06/09/10 12:09:47|Skipping http://www.carnegie-mec.org/publications/?fa=40907 (Identical page found: CRC signature matched)

    14|06/09/10 12:09:54|DL Thread #1, got URL (http://www.carnegie-mec.org/publications/?fa=4086 off queue
    04|06/09/10 12:09:54|Downloading file http://www.carnegie-mec.org/publications/?fa=40868
    14|06/09/10 12:09:54|Index Thread got ready buffer for http://www.carnegie-mec.org/publications/?fa=40868 (Content-type: HTML text)
    01|06/09/10 12:09:54|Skipping http://www.carnegie-mec.org/publications/?fa=40868 (Identical page found: CRC signature matched)

    Is there a way to find out which page in the index it matched? And why isn't that matching page appearing in the search results?
    Last edited by dfisch; Jun-09-2010, 07:26 PM. Reason: added more information

  • #2
    Unfortunately, there is no indication or record of what the matching page is. Storing more history data means using more memory, which means being able to index fewer pages with a given amount of resources on a computer.
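
To illustrate the trade-off, here is a minimal Python sketch of CRC-based duplicate detection. It is not Zoom's actual implementation: the function name `find_duplicates`, the choice of CRC-32 via `zlib.crc32`, and the sample page bodies are all assumptions made for illustration. A plain set of signatures would be enough to skip duplicates; remembering which URL was matched needs a signature-to-URL mapping, which costs extra memory for every indexed page.

```python
import zlib

def find_duplicates(pages):
    """pages: iterable of (url, content_bytes) in crawl order.

    Returns (indexed, skipped), where skipped holds
    (duplicate_url, url_of_first_page_with_same_signature) pairs.
    """
    first_seen = {}  # CRC-32 signature -> first URL indexed with that content
    indexed, skipped = [], []
    for url, body in pages:
        signature = zlib.crc32(body)
        if signature in first_seen:
            # Same signature as an earlier page: treat as a duplicate.
            skipped.append((url, first_seen[signature]))
        else:
            first_seen[signature] = url
            indexed.append(url)
    return indexed, skipped

# Placeholder bodies: the real page content is not reproduced here.
same_body = "مصر (placeholder article text)".encode("utf-8")
pages = [
    ("http://carnegie-mec.org/publications/?fa=40907&lang=ar", same_body),
    ("http://carnegie-mec.org/publications/?fa=40907", same_body),
]

indexed, skipped = find_duplicates(pages)
for url, original in skipped:
    print(f"Skipping {url} (identical to {original})")
```

Running something like this over a saved copy of the crawled pages would show which URL each skipped page collided with, independently of the indexer's own log.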

    Usually it should be fairly obvious, although I agree in this case, I don't know why the matching page is not showing up in the search results. Perhaps it was filtered out for some other reason, e.g. it contains a word that matches your Content Filtering settings. Or the alternative URL was skipped because it matches your skip options?

    If you want us to take a closer look, send us your ZCFG configuration file via e-mail.
    --Ray
    Wrensoft Web Software
    Sydney, Australia
    Zoom Search Engine



    • #3
      I turned off the "check for duplicates" option in the Zoom config. I now get 111 matches. Unsurprisingly, there are two entries for each article.

      I can see which pages are duplicates: for example, http://carnegie-mec.org/publications/?fa=40907&lang=ar is a duplicate of http://carnegie-mec.org/publications/?fa=40907

      Currently there are no filters in place. There is a skip list consisting of:
      Code:
      /programs/arabic2/
      /programs/arabiccd/
      static/
      Static/
      New_vision
      Npp/
      npp/
      Publications1
      publications1
      newsletters/
      Newsletters/
      Zoomsearch
      Activeedit
      Qsets
      Carnegie China Insight
      programs/china/chinese
      programs/china/Chinese
      communications/
      fa=viewType
      fa=viewTitle
      fa=viewAuthor
      fa=viewTopic
      fa=viewProject
      fa=viewDate
      fa=listEvents
      %2Epdf
      fa=downloadArticlePDF
      So I don't see this filtering the &lang=ar pages.
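
As a quick sanity check, the skip list above can be tested against a URL with a short Python sketch. It assumes skip entries are matched as case-sensitive substrings of the URL, which the paired "static/" and "Static/" entries suggest but which is not confirmed here. The helper name `matching_skip_entries` and the second test URL are made up for illustration.

```python
# Skip list copied from the post above.
SKIP_LIST = [
    "/programs/arabic2/", "/programs/arabiccd/", "static/", "Static/",
    "New_vision", "Npp/", "npp/", "Publications1", "publications1",
    "newsletters/", "Newsletters/", "Zoomsearch", "Activeedit", "Qsets",
    "Carnegie China Insight", "programs/china/chinese",
    "programs/china/Chinese", "communications/", "fa=viewType",
    "fa=viewTitle", "fa=viewAuthor", "fa=viewTopic", "fa=viewProject",
    "fa=viewDate", "fa=listEvents", "%2Epdf", "fa=downloadArticlePDF",
]

def matching_skip_entries(url, skip_list=SKIP_LIST):
    """Return the skip entries found in the URL.

    Assumption: case-sensitive substring matching.
    """
    return [entry for entry in skip_list if entry in url]

arabic_url = "http://carnegie-mec.org/publications/?fa=40907&lang=ar"
listing_url = "http://carnegie-mec.org/publications/?fa=viewType"  # hypothetical URL

print(matching_skip_entries(arabic_url))
print(matching_skip_entries(listing_url))
```

Under that assumption, no entry matches the &lang=ar article URL, while the hypothetical listing URL is caught by "fa=viewType".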

      Would I be correct in assuming that if I set up a filter for &lang=ar the duplicate records would be removed?

      P.S. I am still waiting to hear whether I have permission to send you the ZCFG file.



      • #4
        You can e-mail us the ZCFG file whenever you are ready.

        While "http://carnegie-mec.org/publications/?fa=40907&lang=ar"
        is certainly a duplicate of "http://carnegie-mec.org/publications/?fa=40907" it is still odd that the former did not show up in the search results if it had been indexed prior to the latter duplicate being found.

        It would be a good idea to filter out the "&lang=ar" URLs only if all such pages are also linked with a URL that does not include this parameter. Remember that a spider can only find links that exist on your website*. So if there is a page that is only linked with "&lang=ar" at the end of it, the spider will not be able to index that page even though the same URL without "&lang=ar" may have worked to retrieve the same page.

        *You could, however, manually add links that the spider can't reach by clicking on the "More" button and adding them as additional start points. This is impractical if you have more than a few pages that aren't linked.
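
One rough way to check this before adding the filter is to take the list of crawled URLs (for example, from an index built with duplicate checking turned off) and look for "&lang=ar" URLs whose parameter-free form never appears. The Python sketch below uses only the standard `urllib.parse` module. The helper names and the `fa=99999` URL are hypothetical and serve only to show a page that would be lost.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

def has_param(url, name="lang"):
    """True if the URL's query string contains the given parameter."""
    query = urlsplit(url).query
    return any(key == name for key, _ in parse_qsl(query, keep_blank_values=True))

def strip_param(url, name="lang"):
    """Return the URL with every occurrence of the parameter removed."""
    parts = urlsplit(url)
    kept = [(k, v) for k, v in parse_qsl(parts.query, keep_blank_values=True)
            if k != name]
    return urlunsplit(parts._replace(query=urlencode(kept)))

def only_reachable_with_param(crawled_urls, name="lang"):
    """URLs carrying the parameter whose stripped form was never crawled."""
    seen = set(crawled_urls)
    return [url for url in crawled_urls
            if has_param(url, name) and strip_param(url, name) not in seen]

crawled = [
    "http://carnegie-mec.org/publications/?fa=40907&lang=ar",
    "http://carnegie-mec.org/publications/?fa=40907",
    "http://carnegie-mec.org/publications/?fa=99999&lang=ar",  # hypothetical
]

for url in only_reachable_with_param(crawled):
    print("Would be lost if &lang=ar is skipped:", url)
```

If the list comes back empty, skipping "&lang=ar" should be safe. Any URL it reports would need a plain link somewhere on the site, or to be added as an additional start point.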
        --Ray
        Wrensoft Web Software
        Sydney, Australia
        Zoom Search Engine
