
**Ray** · Jun-10-2010, 05:23 AM

Unfortunately, there is no indication or record of what the matching page is (storing more history data uses more memory, which means being able to index fewer pages with a given amount of resources on a computer).

Usually it is fairly obvious which page is the match, although I agree that in this case I don't know why the matching page is not showing up in the search results. Perhaps it was filtered out for some other reason, e.g. it contains a word that matches your Content Filtering settings. Or perhaps the alternative URL was skipped because it matches your skip options?

If you want us to take a closer look, send us your ZCFG configuration file via e-mail.

**dfisch** · Jun-10-2010, 02:46 PM

I turned off the check for duplicates option in the Zoom config. I now get 111 matches. Unsurprisingly, there are two entries for each article.

I can see that http://carnegie-mec.org/publications/?fa=40907&lang=ar is a duplicate of http://carnegie-mec.org/publications/?fa=40907

Currently there are no filters in place. There is a skip list consisting of:

Code:

/programs/arabic2/
/programs/arabiccd/
static/
Static/
New_vision
Npp/
npp/
Publications1
publications1
newsletters/
Newsletters/
Zoomsearch
Activeedit
Qsets
Carnegie China Insight
programs/china/chinese
programs/china/Chinese
communications/
fa=viewType
fa=viewTitle
fa=viewAuthor
fa=viewTopic
fa=viewProject
fa=viewDate
fa=listEvents
%2Epdf
fa=downloadArticlePDF

So I don't see this filtering the &lang=ar pages.

Would I be correct in assuming that if I set up a filter for &lang=ar the duplicate records would be removed?
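As a rough way to check the point above outside of Zoom, the sketch below tests the duplicate URL against the skip list. It assumes that skip entries are matched as case-sensitive substrings of the full URL; that is an assumption about how the skip options behave (suggested by the paired entries such as "static/" and "Static/"), not something confirmed in this thread. The function name is made up for illustration.

```python
# Skip list copied from the post above.
SKIP_LIST = [
    "/programs/arabic2/", "/programs/arabiccd/",
    "static/", "Static/", "New_vision", "Npp/", "npp/",
    "Publications1", "publications1",
    "newsletters/", "Newsletters/",
    "Zoomsearch", "Activeedit", "Qsets",
    "Carnegie China Insight",
    "programs/china/chinese", "programs/china/Chinese",
    "communications/",
    "fa=viewType", "fa=viewTitle", "fa=viewAuthor", "fa=viewTopic",
    "fa=viewProject", "fa=viewDate", "fa=listEvents",
    "%2Epdf", "fa=downloadArticlePDF",
]


def matching_skip_entries(url, skip_list=SKIP_LIST):
    """Return every skip entry that occurs in the URL.

    ASSUMPTION: plain, case-sensitive substring matching.
    """
    return [entry for entry in skip_list if entry in url]


url = "http://carnegie-mec.org/publications/?fa=40907&lang=ar"

# Entries from the current skip list that would cause this URL to be skipped.
print(matching_skip_entries(url))

# The same check with a hypothetical extra "&lang=ar" entry.
print(matching_skip_entries(url, SKIP_LIST + ["&lang=ar"]))
```

Under that assumption, the first call finds no matching entry, so the current skip list would not explain the missing page, while the second shows that an added "&lang=ar" entry would catch the duplicate URL.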

PS: I am still waiting to hear whether or not I have permission to send you the ZCFG file.

**Ray** · Jun-11-2010, 01:03 AM

You can e-mail us the ZCFG file whenever you are ready.

While "http://carnegie-mec.org/publications/?fa=40907&lang=ar" is certainly a duplicate of "http://carnegie-mec.org/publications/?fa=40907", it is still odd that the former did not show up in the search results if it had been indexed prior to the latter duplicate being found.

It would be a good idea to filter out the "&lang=ar" URLs only if all such pages are also linked with a URL that does not include this parameter. Remember that a spider can only find links that exist on your website*. So if there is a page that is only linked with "&lang=ar" at the end of it, the spider will not be able to index that page even though the same URL without "&lang=ar" may have worked to retrieve the same page.

*You can manually add pages that the spider can't reach by clicking on the "More" button and adding them as additional start points, but this is impractical if more than a few pages aren't linked.
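Before filtering out "&lang=ar", it may help to check whether any page is reachable only through such a URL. The following is a minimal sketch, assuming you can export the list of URLs the spider found (for example with duplicate checking turned off). The function names and the second article ID are hypothetical and only for illustration.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def strip_param(url, name="lang"):
    """Return the URL with every occurrence of one query parameter removed."""
    parts = urlsplit(url)
    kept = [(k, v) for k, v in parse_qsl(parts.query, keep_blank_values=True)
            if k != name]
    return urlunsplit(parts._replace(query=urlencode(kept)))


def has_param(url, name="lang", value="ar"):
    """True if the URL's query string contains name=value."""
    return (name, value) in parse_qsl(urlsplit(url).query,
                                      keep_blank_values=True)


def only_linked_with_param(found_urls, name="lang", value="ar"):
    """List URLs carrying name=value whose plain counterpart was never found."""
    plain = {strip_param(u, name) for u in found_urls
             if not has_param(u, name, value)}
    return [u for u in found_urls
            if has_param(u, name, value) and strip_param(u, name) not in plain]


found = [
    "http://carnegie-mec.org/publications/?fa=40907",
    "http://carnegie-mec.org/publications/?fa=40907&lang=ar",
    # Hypothetical article that is linked only with the parameter:
    "http://carnegie-mec.org/publications/?fa=40908&lang=ar",
]
print(only_linked_with_param(found))
```

Any URL this reports would be lost from the index if "&lang=ar" were added to the skip list, unless it is added as an additional start point.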


Arabic text not getting indexed correctly
