Google Canonical Support
-
It is indeed an option we are adding to V7, to toggle case-insensitive handling of URLs.
-
Right, there are various parts of the website which vary in the casing of links, but this is due to the fact that the site has a content management system that non-savvy users control. In terms of the same page appearing twice in the results due to casing issues and with the CRC check enabled, I see where that problem exists in my page rendering code and will take care of that tout de suite.
I understand that case-sensitivity is standard, and I agree that it is irksome that IIS doesn't follow these standards, but I have to use IIS for this project (and for most of my projects). So, standards notwithstanding, I would find it very useful to have an option for case insensitivity in the indexer.
-
And I guess you have links from various parts of your website which vary in uppercase/lowercase form, and this is scattered? Not just in a few key places like perhaps the Menu is linking to the uppercase URL but the main page is linking to the lowercase URL?
Uppercase and lowercase URL variations are a tricky area because any client (be it a browser or a spider like Zoom) must assume that they can be completely different pages (because that is the standard, and that is how web servers on Linux work). All IIS is doing is serving the same page when a request is made for two different URLs; the client has no idea that it is "theoretically" the same page, and the two look even more distinct when the content within them is actually different.
In general, as good practice for web design, you would want to maintain a consistent naming scheme for URLs throughout your website.
If there is only a small number of folder names which vary in upper/lower casing (like in your example), you could consider just skipping one form of them. E.g. add "/Bar/" to your skip page list and you will only index the "/bar/" pages. This assumes that there is no page which is only linked via the "/Bar/" style URLs and not the latter.
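To illustrate the idea, here is a minimal Python sketch of that kind of filtering. It is not Zoom's code: it assumes skip entries are matched as case-sensitive substrings of the URL, and the `should_skip` helper and the example.com URLs are made up for the example.

```python
def should_skip(url: str, skip_list: list[str]) -> bool:
    """Return True if any skip entry occurs in the URL.

    Assumption: entries are matched as case-sensitive substrings,
    so "/Bar/" matches ".../Bar/..." but not ".../bar/...".
    """
    return any(entry in url for entry in skip_list)


# Hypothetical skip list and crawl queue.
skip_list = ["/Bar/"]
urls = [
    "http://www.example.com/foo/bar/page.asp",
    "http://www.example.com/foo/Bar/page.asp",
]

# Only the lowercase "/bar/" form survives the filter.
to_index = [u for u in urls if not should_skip(u, skip_list)]
```

Note that a page reachable only through a skipped "/Bar/" link would never be discovered, which is the caveat mentioned above.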
-
Originally posted by Ray:
There must be something on the pages that makes them different from one another for CRC to not consider them as duplicates.
So, yeah, CRC isn't failing really. :-/
-
Originally posted by chinkle:
Thank you... We just (finally!) went to production with the Zoom tool integrated -- people love it, but the multiple results returned per page (because of different casing in the URL) are an annoyance. The CRC checking is enabled, but it doesn't seem to make a difference.
For example, is there a current date or time rendered on the page? Or dynamically generated advertising (that is constantly changing)? Or "this page was generated in 0.2 seconds" sort of text?
With anything like this, you can wrap it in <!--ZOOMSTOP--> ... <!--ZOOMRESTART--> tags. This not only excludes it from indexing (it would be pointless to index and search anyway), but also excludes it from the CRC calculation.
See this FAQ for more information:
Q. How do I prevent parts of my webpage from being indexed (eg. exclude navigation menus, or page footers)?
-
Originally posted by Ray:
A fair request... we will add it to our list of things to consider for V7. We're still gathering requests at this point, so people should feel free to post them... the more demand for a feature, the more priority it gets.
-
A fair request... we will add it to our list of things to consider for V7. We're still gathering requests at this point, so people should feel free to post them... the more demand for a feature, the more priority it gets.
-
A vote for a case insensitivity option for URLs
Originally posted by Ray:
How do I prevent parts of my webpage from being indexed (eg. exclude navigation menus, or page footers)?
On the contrary, URLs are technically case sensitive and search engines such as Google do not perform "lowercase normalization" (see here and here).
[..]
It is only Windows web servers which are case insensitive. But from a browser's or spider's point of view, there's no practical way of knowing this in advance.
Thanks for listening...
chinkle
-
Originally posted by Queue:
Search for "lorem" and you will get twenty-four (24) search results. Of the 24 results the following pages are duplicate content...
- http://www.qomcorp.com > http://www.qomcorp.com/default.asp
As mentioned by David above, you can use ZOOMSTOP and ZOOMRESTART tags to exclude these sections of the page. This way only the middle, content portion of the page is looked at when Zoom tries to determine if it is a duplicate page. The other benefit of doing this is, you won't get every page returned should someone search for a word in the navigation menus or, for example, "page generated" (since this message is on every page). More information on how to use these tags is in the FAQ:
Q. How do I prevent parts of my webpage from being indexed (eg. exclude navigation menus, or page footers)?
Originally posted by Queue:
You will notice that I have some other issues to resolve, one of which is normalization of URL case. However, why doesn't Zoom normalize URLs to lowercase? As I understand it, search engines perform such lowercase normalization...why doesn't Zoom?
This means an uppercase URL is a different URL to its lowercase equivalent. This is evident on most Linux/BSD/etc web servers, where a file "apple.html" can actually be a different file to "Apple.html".
It is only Windows web servers which are case insensitive. But from a browser's or spider's point of view, there's no practical way of knowing this in advance.
By W3C standards, URLs should be, and always have been, case sensitive, and it is generally good practice to adhere to this. Even though you are on a Windows web server at the moment, you never know if one day you might need to move to a Linux-based server, and then hundreds of links may be broken if you depended on case insensitivity.
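As a rough illustration of the distinction (a sketch only, not how Zoom works internally), the Python below builds a comparison key for a URL. The scheme and host are lowercased because those parts are case insensitive, while the path is left alone by default. The `url_key` function and its `ignore_path_case` flag are hypothetical names; the flag stands in for the kind of case-insensitivity option being asked about in this thread.

```python
from urllib.parse import urlsplit, urlunsplit


def url_key(url: str, ignore_path_case: bool = False) -> str:
    """Build a key for deciding whether two URLs refer to the same page.

    Scheme and host are always lowercased. The path is only lowercased
    when ignore_path_case is True (a hypothetical option that would suit
    a case-insensitive server such as IIS). The fragment is dropped.
    """
    parts = urlsplit(url)
    path = parts.path.lower() if ignore_path_case else parts.path
    return urlunsplit(
        (parts.scheme.lower(), parts.netloc.lower(), path, parts.query, "")
    )


# Default behaviour: the two paths differ in case, so the keys differ.
strict_a = url_key("HTTP://WWW.Example.com/Apple.html")
strict_b = url_key("http://www.example.com/apple.html")

# With the hypothetical option enabled, both collapse to one key.
loose_a = url_key("HTTP://WWW.Example.com/Apple.html", ignore_path_case=True)
loose_b = url_key("http://www.example.com/apple.html", ignore_path_case=True)
```

Leaving the path untouched by default is the safe choice for a spider, since it cannot tell in advance whether the server treats "Apple.html" and "apple.html" as the same file.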
Originally posted by Queue:
BTW Ray...Zoom is not simply good work: IT IS BRILLIANT.
-
Ray,
Perhaps I need to look into Skip List a bit more...
I am running 6.0 Build 1012. Here is the url to my search page (this is a DEV site)...
www.qomcorp.com/zoom_search.asp
Search for "lorem" and you will get twenty-four (24) search results. Of the 24 results the following pages are duplicate content...
- http://www.qomcorp.com > http://www.qomcorp.com/default.asp
BTW Ray...Zoom is not simply good work: IT IS BRILLIANT.
Thank you.
-
Looks like an interesting feature. And probably not very hard to implement. We'll look at supporting it for the future V6.1 release (no date as yet).
In the meantime the CRC function should filter pages that are exact duplicates. What are the URLs of the duplicate pages?
Also note that V6.0 of Zoom has a new, improved method of CRC duplicate page detection: the CRC comparison is now made after stripping out HTML and ZOOMSTOP sections. This means that a page with ads excluded using ZOOMSTOP will now be recognized as being duplicate, despite having different dynamic ads on the page.
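To make the idea concrete, here is a small Python sketch of a checksum computed after stripping ZOOMSTOP sections and HTML tags. It is only an approximation for illustration, not Zoom's implementation: the regular expressions, the whitespace handling and the `content_crc` helper are all assumptions, and the two sample pages are invented.

```python
import re
import zlib

# Everything between the two marker comments is ignored.
ZOOMSTOP_RE = re.compile(r"<!--ZOOMSTOP-->.*?<!--ZOOMRESTART-->", re.DOTALL)
# Crude tag stripper; a real indexer would use a proper HTML parser.
TAG_RE = re.compile(r"<[^>]+>")


def content_crc(html: str) -> int:
    """CRC32 of the page text with ZOOMSTOP sections and tags removed."""
    # Remove ZOOMSTOP sections first, because the tag stripper would
    # otherwise delete the marker comments and keep the text between them.
    text = ZOOMSTOP_RE.sub(" ", html)
    text = TAG_RE.sub(" ", text)
    text = " ".join(text.split())  # collapse runs of whitespace
    return zlib.crc32(text.encode("utf-8"))


# Two invented pages that differ only inside the excluded section.
page_a = ("<html><body><!--ZOOMSTOP-->Generated in 0.2 seconds"
          "<!--ZOOMRESTART--><p>Lorem ipsum</p></body></html>")
page_b = ("<html><body><!--ZOOMSTOP-->Generated in 0.7 seconds"
          "<!--ZOOMRESTART--><p>Lorem ipsum</p></body></html>")

same_content = content_crc(page_a) == content_crc(page_b)
same_raw = zlib.crc32(page_a.encode("utf-8")) == zlib.crc32(page_b.encode("utf-8"))
```

Here `same_content` is true while `same_raw` is false: a checksum over the raw HTML sees the changing "generated in" text, whereas the stripped version does not.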
You should also look at using the skip list to avoid duplicate pages. Using the skip list is MUCH more efficient than using the CRC or the Google canonical feature, as the skip list avoids downloading the page at all, saving a lot of indexing time and bandwidth.
-
Google Canonical Support
Is there any plan to support Google Canonical Directive?
I'm having problems with duplicate content and CRC does not appear to be resolving the issue.
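For anyone unfamiliar with the directive: a page declares its preferred URL with a `<link rel="canonical" href="...">` element in its head, and an indexer can file the page under that URL instead of the one it was fetched from. A minimal Python sketch of reading that element follows. The `CanonicalFinder` class and `find_canonical` function are hypothetical names for illustration and have nothing to do with Zoom's internals.

```python
from html.parser import HTMLParser


class CanonicalFinder(HTMLParser):
    """Collects the href of the first <link rel="canonical"> element."""

    def __init__(self):
        super().__init__()
        self.canonical = None

    def handle_starttag(self, tag, attrs):
        # Only the first canonical link is kept.
        if tag != "link" or self.canonical is not None:
            return
        attr = dict(attrs)
        # rel may hold several space-separated tokens.
        rel = (attr.get("rel") or "").lower().split()
        if "canonical" in rel and attr.get("href"):
            self.canonical = attr["href"]


def find_canonical(html):
    """Return the canonical URL declared by the page, or None."""
    finder = CanonicalFinder()
    finder.feed(html)
    return finder.canonical


# Invented page that names default.asp as its preferred URL.
page = ('<html><head><link rel="canonical" '
        'href="http://www.example.com/default.asp"></head>'
        '<body>Lorem ipsum</body></html>')
preferred = find_canonical(page)
```

With support like this, two URLs that serve the same page would both be indexed under the single URL the page itself names.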