Option to remove part of URL before considering

  • Option to remove part of URL before considering

    I'm using v5 professional on a site that uses Vbulletin. I notice that when indexing, it often picks up two entries for the same page in the forum. This occurs because of the additional GET fields in the URLs. In the case of Vbulletin, some links use a session code, while others do not. For example:

    ../forums/showthread.php?t=930

    is the same as:

    ../forums/showthread.php?
    s = a4e878bda678b6c6b8f0f093940e0ba3 &t=930

    (ignore the extra spaces, as the thread editor mangles a real URL).

    What I'd like is a filter option on the URL before indexing consideration, by field name (best) or using some kind of regular expression. In this example, remove any occurrence of the "s=(hex digits)" portion of the string.

    I don't want to just exclude URLs that have a "s=" in the URL (which I see is supported).

    A global application of the filter rule is fine, as I don't use "s=" anywhere else, but it would be better if it just applied to a specific file/directory, such as "showthread.php".

    I also find duplicates due to quirks in my SEO optimization (unrelated to Vbulletin), where the same page is accessible through slightly different URLs. I have a "&n=SEO text here" type field. Some links have the "n=" option while others do not. This filter feature could be used here as well, since both forms point to the same page/content.

    If you implement a "field remover" type function, it needs to understand that the first field is different than later fields. For example, if the field to be removed is the first one (i.e. starts with ?), then after removal the next field (if any) needs to change from starting with "&" to "?".
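
    To illustrate the behaviour I have in mind, here is a rough Python sketch. It is not Zoom code: the function name `strip_query_fields` and its `only_paths` parameter are made up for this example, and it assumes plain "name=value" fields separated by "&". Because the remaining fields are rejoined after filtering, the "?" versus "&" problem takes care of itself.

```python
from urllib.parse import urlsplit, urlunsplit


def strip_query_fields(url, fields, only_paths=None):
    """Return url with the named query fields removed.

    Hypothetical helper for illustration only; not part of Zoom or Vbulletin.
    """
    parts = urlsplit(url)

    # Optionally restrict the rule to specific scripts, e.g. "showthread.php".
    if only_paths and not any(parts.path.endswith(p) for p in only_paths):
        return url

    # Keep every "name=value" pair whose name is not in the removal set.
    kept = [
        pair
        for pair in parts.query.split("&")
        if pair and pair.split("=", 1)[0] not in fields
    ]

    # Rejoining the kept pairs means the first one always follows "?",
    # regardless of which position the removed field occupied.
    return urlunsplit(parts._replace(query="&".join(kept)))


print(strip_query_fields(
    "../forums/showthread.php?s=a4e878bda678b6c6b8f0f093940e0ba3&t=930",
    {"s"},
    only_paths=["showthread.php"],
))
```

    With the session field stripped, both forms of the thread URL above reduce to the same string, so an indexer comparing URLs would see them as one page.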

  • #2
    We use Vbulletin for this forum and we also use Zoom with it. We don't have any issues with session IDs. See also this FAQ

    If you are seeing session IDs, it probably means you are not accepting cookies. (If you allow cookies, then session IDs in the URL are not required.)

    For the 2nd problem, you can use the existing skip list function to skip all URLs with "&n=" in them. This will remove the duplicates.


    • #3
      I think you are on the right track about cookies. Is there an option in Zoom search to enable cookies? I dug around and wasn't able to find one.

      I also checked Vbulletin options, under the cookies settings, and there doesn't appear to be any option to disable cookies. I'm using all their default values on that option page. Both browsers (IE and Firefox) have cookies enabled on the system.

      When using the forum in a browser and looking at the HTML links, I don't see any session codes in the URLs. Only when Zoom Search spiders the site do I get session codes in many, but not all, of the forum URLs.

      I did add all your suggested skip options from your FAQ, although I did have most of these in already.

      Any other ideas?


      • #4
        You can enable cookies on the "Authentication" tab of the Configuration window. The option is labelled "Use cookies from Windows and IE".

        More details in the Help file and Users Guide.
        --Ray
        Wrensoft Web Software
        Sydney, Australia
        Zoom Search Engine


        • #5
          Thanks - enabling the cookies fixed the issue - no more session codes!
