Google Canonical Support

  • Google Canonical Support

    Is there any plan to support the Google canonical directive?

    I'm having problems with duplicate content, and CRC does not appear to be resolving the issue.
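
    For reference, the canonical directive is a link element placed in the page head, telling crawlers which URL is the preferred version of a page when several URLs serve the same content. A minimal sketch (the URL below is illustrative only):

        <head>
          <!-- point duplicate URLs at the one preferred version of this page -->
          <link rel="canonical" href="http://www.example.com/default.asp" />
        </head>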

  • #2
    Looks like an interesting feature. And probably not very hard to implement. We'll look at supporting it for the future V6.1 release (no date as yet).

    In the meantime the CRC function should filter pages that are exact duplicates. What are the URLs of the duplicate pages?

    Also note that V6.0 of Zoom has a new, improved method of CRC duplicate page detection: the CRC comparison is now made after stripping out HTML and ZOOMSTOP sections. This means that a page with ads excluded using ZOOMSTOP will now be recognized as a duplicate, despite having different dynamic ads on the page.

    You should also look at using the skip list to avoid duplicate pages. Using the skip list is MUCH more efficient than using the CRC or the Google canonical feature, as the skip list avoids downloading the page at all, saving a lot of indexing time and bandwidth.

    • #3
      Ray,

      Perhaps I need to look into Skip List a bit more...

      I am running 6.0 Build 1012. Here is the URL of my search page (this is a DEV site)...

      www.qomcorp.com/zoom_search.asp

      Search for "lorem" and you will get twenty-four (24) search results. Of the 24 results, the following pages are duplicate content...
      • http://www.qomcorp.com > http://www.qomcorp.com/default.asp
      You will notice that I have some other issues to resolve, one of which is normalization of URL case. However, why doesn't Zoom normalize URLs to lowercase? As I understand it, search engines perform such lowercase normalization, so why doesn't Zoom?

      BTW Ray...Zoom is not simply good work: IT IS BRILLIANT.

      Thank you.

      • #4
        Originally posted by Queue:
        Search for "lorem" and you will get twenty-four (24) search results. Of the 24 results, the following pages are duplicate content...
        • http://www.qomcorp.com > http://www.qomcorp.com/default.asp
        CRC duplicate detection would usually address this, but the problem here is that on each of your pages there is some content that changes every time. For example, the left and right side panels with dates, time, last topic posted, etc. And most importantly, at the bottom there is a "This page was generated in 0.4375 seconds" message, which, of course, is likely to change every time you visit the page.

        As mentioned by David above, you can use ZOOMSTOP and ZOOMRESTART tags to exclude these sections of the page. This way, only the middle content portion of the page is looked at when Zoom tries to determine whether it is a duplicate page. The other benefit of doing this is that you won't get every page returned should someone search for a word in the navigation menus or, for example, "page generated" (since this message is on every page). More information on how to use these tags is in the FAQ:
        Q. How do I prevent parts of my webpage from being indexed (eg. exclude navigation menus, or page footers)?
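
        As a rough sketch (the markup below is illustrative only, not taken from your actual pages), the changing side panels and the timing footer could be fenced off like this:

            <!--ZOOMSTOP-->
            <div class="sidepanel">
              <!-- dates, time, last topic posted, etc. -->
            </div>
            <!--ZOOMRESTART-->

            <div class="content">
              <!-- the real page content: this part is indexed and used for the CRC comparison -->
            </div>

            <!--ZOOMSTOP-->
            <p>This page was generated in 0.4375 seconds</p>
            <!--ZOOMRESTART-->

        Everything between a ZOOMSTOP and the following ZOOMRESTART is excluded from both indexing and the CRC duplicate comparison.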

        Originally posted by Queue:
        You will notice that I have some other issues to resolve, one of which is normalization of URL case. However, why doesn't Zoom normalize URLs to lowercase? As I understand it, search engines perform such lowercase normalization, so why doesn't Zoom?
        On the contrary, URLs are technically case sensitive and search engines such as Google do not perform "lowercase normalization".

        This means an uppercase URL is a different URL to its lowercase equivalent. This is evident on most Linux/BSD/etc. web servers, where a file "apple.html" can actually be a different file to "Apple.html".

        It is only Windows web servers which are case insensitive. But from a browser's or spider's point of view, there's no practical way of knowing this in advance.

        By web standards, URLs are, and have always been, case sensitive (apart from the scheme and host name), and it is good general practice to adhere to this. Even though you are on a Windows web server at the moment, you never know if one day you might need to move to a Linux based server, and then hundreds of links could break if you had depended on case insensitivity.

        Originally posted by Queue View Post
        BTW Ray...Zoom is not simply good work: IT IS BRILLIANT.
        Thanks for the positive feedback, it makes it all worthwhile!
        --Ray
        Wrensoft Web Software
        Sydney, Australia
        Zoom Search Engine

        • #5
          Thanks for all of this information Ray, it is very instructive.

          • #6
            A vote for a case insensitivity option for URLs

            Originally posted by Ray:
            How do I prevent parts of my webpage from being indexed (eg. exclude navigation menus, or page footers)?

            On the contrary, URLs are technically case sensitive and search engines such as Google do not perform "lowercase normalization".

            [..]

            It is only Windows web servers which are case insensitive. But from a browser's or spider's point of view, there's no practical way of knowing this in advance.
            I like Apache as much as the next guy... but many of us are forced to live in an IIS world and deal with clients who insist on randomly capitalizing things in their URLs when they write content for websites. As such, it would be great if the indexing engine had an option to treat URLs as case insensitive, or to force all URLs to lowercase, so that we don't get repeated search results. It seems like a fairly simple option to add -- put all the "DANGER WILL ROBINSON" warnings around the option that you like, even, to make sure folks really REALLY want to do it -- and it would help out those of us poor schmucks who have to deal with software that doesn't comply with the standards.

            Thanks for listening...
            chinkle

            • #7
              A fair request... we will add it to our list of things to consider for V7. We're still gathering requests at this point, so people should feel free to post them... the more demand for a feature, the more priority it gets.
              --Ray
              Wrensoft Web Software
              Sydney, Australia
              Zoom Search Engine

              • #8
                Originally posted by Ray:
                A fair request... we will add it to our list of things to consider for V7. We're still gathering requests at this point, so people should feel free to post them... the more demand for a feature, the more priority it gets.
                Thank you... We just (finally!) went to production with the Zoom tool integrated -- people love it, but the multiple results returned for the same page (because of different casing in the URL) are an annoyance. The CRC checking is enabled, but it doesn't seem to make a difference.

                • #9
                  Originally posted by chinkle:
                  Thank you... We just (finally!) went to production with the Zoom tool integrated -- people love it, but the multiple results returned for the same page (because of different casing in the URL) are an annoyance. The CRC checking is enabled, but it doesn't seem to make a difference.
                  There must be something on the pages that makes them different from one another for CRC not to consider them as duplicates.

                  For example, is there a current date or time rendered on the page? Or dynamically generated advertising (that is constantly changing)? Or "this page was generated in 0.2 seconds" sort of text?

                  With anything like this, you can wrap it in <!--ZOOMSTOP--> ... <!--ZOOMRESTART--> tags not only to exclude it from indexing (because it would be pointless to index and search anyway), but also to exclude it from the CRC calculation.

                  See this FAQ for more information:
                  Q. How do I prevent parts of my webpage from being indexed (eg. exclude navigation menus, or page footers)?
                  --Ray
                  Wrensoft Web Software
                  Sydney, Australia
                  Zoom Search Engine

                  • #10
                    Originally posted by Ray:
                    There must be something on the pages that makes them different from one another for CRC not to consider them as duplicates.
                    We are making use of the ZOOMSTOP and ZOOMRESTART tags, but the links within the rest of the content are dynamic and are constructed based on the path that appears in the URL... so if someone goes to http://myurl.foo.com/bar/page.aspx, the link inside might look like "/bar/page.aspx?print=1", whereas if someone went to http://myurl.foo.com/Bar/page.aspx, the link would be "/Bar/page.aspx?print=1". Of course, being IIS, these are all the same page.
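
                    To illustrate (the markup below is only a sketch, not our actual page), the two casings of the same URL come back with different links in the body, so the two downloads are not identical:

                        <!-- fetched as http://myurl.foo.com/bar/page.aspx -->
                        <a href="/bar/page.aspx?print=1">Printable version</a>

                        <!-- fetched as http://myurl.foo.com/Bar/page.aspx -->
                        <a href="/Bar/page.aspx?print=1">Printable version</a>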

                    So, yeah, CRC isn't failing really. :-/

                    • #11
                      And I guess you have links from various parts of your website which vary in uppercase/lowercase form, and these are scattered? Not just in a few key places, such as the menu linking to the uppercase URL while the main page links to the lowercase URL?

                      Uppercase and lowercase URL variations are a tricky area, because any client (be it a browser or a spider like Zoom) must assume that they can be completely different pages (because that is the standard and that is how servers like Linux work). All IIS is doing is serving the same page when a request is made for two different URLs -- the client has no way of knowing that it is "theoretically" the same page, and the distinction is reinforced when the content within it is actually different.

                      In general, as good practice for web design, you would want to maintain a consistent naming scheme for URLs throughout your website.

                      If there is only a small number of folder names which vary in upper/lower casing (as in your example), you could consider just skipping one form of them. E.g. add "/Bar/" to your skip page list and you will only index the "/bar/" pages. This assumes that there is no page which is only linked via the "/Bar/" style URLs and never via the "/bar/" form.
                      --Ray
                      Wrensoft Web Software
                      Sydney, Australia
                      Zoom Search Engine

                      • #12
                        Right, there are various parts of the website which vary in the casing of links, but this is because the site has a content management system that non-savvy users control. As for the same page appearing twice in the results due to casing issues even with the CRC check enabled, I see where that problem exists in my page rendering code and will take care of it tout de suite.

                        I understand that case-sensitivity is standard, and I agree that it is irksome that IIS doesn't follow these standards, but I have to use IIS for this project (and for most of my projects). So, standards notwithstanding, I would find it very useful to have an option for case insensitivity in the indexer.

                        • #13
                          It is indeed an option we are adding in V7: a toggle for treating URLs as case insensitive.
                          --Ray
                          Wrensoft Web Software
                          Sydney, Australia
                          Zoom Search Engine

                          • #14
                            Originally posted by Ray:
                            It is indeed an option we are adding in V7: a toggle for treating URLs as case insensitive.
                            W00T! Thank you!
