مصر
I'm having some issues with the Arabic language search. It doesn't seem like all of the pages are being indexed. Performing a search for 'مصر' ('Egypt') returns only one result: http://www.carnegie-mec.org/publications/?fa=40709&lang=ar
I know there are other pages on the site that should be getting indexed, such as
http://carnegie-mec.org/publications/?fa=40907 or http://carnegie-mec.org/publications/?fa=40868, which have many occurrences of 'مصر'.
I checked the log files and saw the following:
14|06/09/10 12:09:47|Index Thread got ready buffer for http://www.carnegie-mec.org/publications/?fa=40907 (Content-type: HTML text)
01|06/09/10 12:09:47|Skipping http://www.carnegie-mec.org/publications/?fa=40907 (Identical page found: CRC signature matched)
14|06/09/10 12:09:54|DL Thread #1, got URL (http://www.carnegie-mec.org/publications/?fa=4086 off queue
04|06/09/10 12:09:54|Downloading file http://www.carnegie-mec.org/publications/?fa=40868
14|06/09/10 12:09:54|Index Thread got ready buffer for http://www.carnegie-mec.org/publications/?fa=40868 (Content-type: HTML text)
01|06/09/10 12:09:54|Skipping http://www.carnegie-mec.org/publications/?fa=40868 (Identical page found: CRC signature matched)
Is there a way to find out which page in the index each of these matched? And why aren't those other pages appearing in the search results?
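In case it helps anyone reproduce this outside the indexer, here is a minimal sketch of how pages with matching checksums could be grouped. It assumes the page bodies have already been downloaded, and it uses Python's `zlib.crc32` as a stand-in, since I don't know which CRC variant the indexer computes or whether it checksums the raw HTML or the extracted text. The function name `group_by_crc` and the sample bodies are made up for illustration.

```python
import zlib
from collections import defaultdict


def group_by_crc(pages):
    """Return {checksum: [urls]} for checksums shared by more than one page.

    `pages` maps a URL to the raw bytes downloaded from it.
    CRC-32 via zlib is an assumption; the indexer may use another variant.
    """
    groups = defaultdict(list)
    for url, body in pages.items():
        groups[zlib.crc32(body)].append(url)
    # Only checksums seen on two or more URLs indicate a duplicate.
    return {crc: urls for crc, urls in groups.items() if len(urls) > 1}


if __name__ == "__main__":
    # Hypothetical bodies, not the real page content.
    sample = {
        "page-a": b"<html>same body</html>",
        "page-b": b"<html>same body</html>",
        "page-c": b"<html>another body</html>",
    }
    for crc, urls in group_by_crc(sample).items():
        print(f"{crc:08x}: {urls}")
```

Feeding it the bytes fetched from the skipped URLs plus the one page that does show up in the results would show whether they really are byte-identical, or whether the match comes from something the indexer does before computing the signature.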