I'm using v5 professional on a site that uses Vbulletin. I notice that when indexing, it often picks up two entries for the same page in the forum. This occurs because of differences in the additional GET fields in the URLs. In the case of Vbulletin, some links include a session code, while others do not. For example:
../forums/showthread.php?t=930
is the same as:
../forums/showthread.php?
s = a4e878bda678b6c6b8f0f093940e0ba3 &t=930
(ignore the extra spaces, as the thread editor mangles a real URL).
What I'd like is a filter option that is applied to each URL before it is considered for indexing, either by field name (best) or using some kind of regular expression. In this example, it would remove any occurrence of the "s=(hex digits)" portion of the string.
I don't want to just exclude URLs that have an "s=" in the URL (which I see is supported).
A global application of the filter rule is fine, as I don't use "s=" anywhere else, but it would be better if it just applied to a specific file/directory, such as "showthread.php".
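To illustrate the regular expression variant, here is a rough Python sketch of the rule I have in mind. It is not how the indexer works internally; the function name `strip_session`, the `only_for` parameter, and the assumption that the session code is always 32 hex digits are my own, made up for illustration.

```python
import re

# Assumed pattern: an "s" field holding exactly 32 hex digits, either at the
# start of the query string or right after an "&".
SESSION_FIELD = re.compile(r"(?:^|(?<=&))s=[0-9a-fA-F]{32}(?:&|$)")

def strip_session(url, only_for="showthread.php"):
    """Remove the session field, but only for URLs whose path ends in only_for."""
    path, sep, query = url.partition("?")
    if not sep or not path.endswith(only_for):
        return url  # no query string, or the rule does not apply to this file
    # Drop the field, then tidy up a trailing "&" left when it was the last field.
    query = SESSION_FIELD.sub("", query).rstrip("&")
    return path + "?" + query if query else path

print(strip_session(
    "../forums/showthread.php?s=a4e878bda678b6c6b8f0f093940e0ba3&t=930"))
```

With this rule both forms of the thread link reduce to the same URL, while a URL for any other file is passed through untouched.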
I also find duplicates due to quirks in my SEO optimization (unrelated to Vbulletin), where the same page is accessible through slightly different URLs. I have a "&n=SEO text here" type field. Some links have the "n=" option while others do not. This filter feature could be used here as well, since both forms point to the same page/content.
If you implement a "field remover" type function, it needs to understand that the first field is different than later fields. For example, if the field to be removed is the first one (i.e. starts with ?), then after removal the next field (if any) needs to change from starting with "&" to "?".
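The separator rule described above could be handled by splitting the query string into fields and rebuilding it, rather than cutting text out of the URL. This is only a minimal Python sketch under my own assumptions: `remove_fields` is a hypothetical name, fields are assumed to be separated by "&", and URL fragments ("#...") are not handled.

```python
def remove_fields(url, names):
    """Return url with every query field whose name is in names removed."""
    base, sep, query = url.partition("?")
    if not sep:
        return url  # no query string, nothing to remove
    kept = []
    for field in query.split("&"):
        name = field.split("=", 1)[0]
        if field and name not in names:
            kept.append(field)
    if not kept:
        return base  # every field was removed, so drop the "?" as well
    # Rebuilding guarantees the first remaining field follows "?" and the
    # rest follow "&", whichever field was removed.
    return base + "?" + "&".join(kept)

print(remove_fields("../forums/showthread.php?s=abc123&t=930", {"s", "n"}))
```

Because the query string is rebuilt from the surviving fields, removing the first field automatically promotes the next one to follow the "?".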