Duplicate Pages Revisited

  • Duplicate Pages Revisited

    I've read the posts regarding duplicate pages and adjusted all my settings (CRC skip duplicates, cookies, reload all files, etc.) to eliminate these references, but to no avail.

    As a matter of fact, Zoom can return hundreds of hits all referencing the exact same content, which turns an excellent search tool into a feeble waste of time. The explanation given here is that the URLs or CRCs are different, so nothing can be done.

    Granted, in a technical sense, the duplicate URLs are indeed dissimilar in that they can be differentiated by ID numbers, which are internally assigned by the vending scripts, in this case the PHP scripts used by the very popular vBulletin forum software. Aside from the different URLs, the content of the dynamically composed pages is usually different too, although the actual post may be the same. This is normally attributable to page composition layers and reference points that change according to the calling context. That is, the content being retrieved is displayed with other content and display attributes depending on how the PHP script was called, where it was called from, and even the time it was called.

    Apparently, Zoom feels these technical issues and explanations are beyond their control, but I disagree. Within these duplicate pages, vBulletin enforces a standard HTML title tag that is always the same, because it is keyed to the actual requested content regardless of the differences in the URLs and in the CRCs of the returned results. For instance, it's possible for vBulletin to return 100 page variations in response to the same display query depending on the calling context, yet each page will contain the requested post (usually bookmarked) and the same HTML title tag regardless of the differences in the visual aspects or ancillary content of the pages.

    The question then is: why doesn't the Zoom Indexer provide an option to ignore duplicate titles? Probably because, for one, there are many instances where duplicate titles actually represent completely different content. However, this could be remedied by modifying vBulletin to return additional parameters within the title. For instance, instead of returning “My Opinion” as a title, vBulletin could be modified to return “My Opinion by Joe Schome posted on June 12, 2006”. Setting Zoom to ignore duplicate PHP titles would then be the best option, since skipping these duplicates would lessen the overhead associated with indexing. But that puts the onus back on the user. Additionally, as in our case, we run vBulletin, a storefront, and several other PHP/MySQL-driven applications that affect Zoom in the same manner.

    Nonetheless, it would still be desirable, in our case, to disallow duplicate titles, even if we had to sacrifice a couple of references to valid and distinct content. The indexing overhead is otherwise staggering. No matter how high we set the page count, we always reach it before the URL queue is emptied. The index just keeps propagating more and more references to the same content while nothing new is really being added. What a calamity that is.

    On the other side of the coin, another option would be to pre-process search results, which could be accomplished in one of two ways. First, the entire result set could be read and processed to eliminate hits that contain the same title and the same abstract. For sure that would nail it. Second, a global “done that” array could manage the search results on a page-by-page basis. Each search result is checked against the array before display. If the title and abstract are already in the array, the result is skipped. However, that one plays havoc with the search result page count and navigation scheme.
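
    To make the first variation concrete, here is a minimal Python sketch. It is not how Zoom works internally; the `Hit` record and the `dedupe_hits` and `paginate` helpers are hypothetical names, and I'm assuming each result exposes a URL, a title, and an abstract. Because the whole result set is deduplicated before it is sliced into pages, the page count stays consistent.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    # Hypothetical result record; Zoom's real result format may differ.
    url: str
    title: str
    abstract: str


def _key(hit):
    # Normalize case and whitespace so trivial formatting differences still match.
    title = " ".join(hit.title.split()).lower()
    abstract = " ".join(hit.abstract.split()).lower()
    return (title, abstract)


def dedupe_hits(hits):
    """Keep the first hit for each (title, abstract) pair, preserving rank order."""
    seen = set()
    unique = []
    for hit in hits:
        key = _key(hit)
        if key in seen:
            continue  # same title and abstract already shown: skip it
        seen.add(key)
        unique.append(hit)
    return unique


def paginate(hits, page, per_page=10):
    """Deduplicate the whole result set first, then slice out one page.

    Returns the hits for the requested page and the total unique count.
    """
    unique = dedupe_hits(hits)
    start = (page - 1) * per_page
    return unique[start:start + per_page], len(unique)
```

    The second variation (a running “done that” array per page) would reuse the same key, but it only learns about duplicates as pages are rendered, which is why the page count drifts.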

    My point is, there are options. It’s not as if it can’t be done. Somewhere in all of this there is a solution. We actually know it can be done. Otherwise, Google would provide the same results as Zoom, yet it doesn’t.

    I favor being able to set the Zoom Indexer to avoid duplication of title tags. As I said before, this might adversely affect the results for some, however obscurely. For us, it’s definitely a worthwhile trade-off and less of a gamble. In any event, it avoids the staggering overhead of the indexing process and, most importantly, it definitely resolves the issue.
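
    As a rough illustration of what I mean by a "skip duplicate titles" option, here is a Python sketch using only the standard library. This is not an existing Zoom setting; `TitleParser`, `extract_title`, and `index_unique_titles` are hypothetical names, and the sketch assumes the pages have already been fetched as (URL, HTML) pairs.

```python
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Collects the text inside the first <title> element of a page."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self._done = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title" and not self._done:
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title" and self._in_title:
            self._in_title = False
            self._done = True

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def extract_title(page_html):
    parser = TitleParser()
    parser.feed(page_html)
    parser.close()
    # Collapse whitespace so "My  Opinion" and "My Opinion" compare equal.
    return " ".join(parser.title.split())


def index_unique_titles(pages):
    """Keep only the first page seen for each title.

    pages: iterable of (url, html) pairs. Pages with no title are always
    kept, since there is nothing to compare them on.
    """
    seen_titles = set()
    kept = []
    for url, page_html in pages:
        title = extract_title(page_html)
        if title and title in seen_titles:
            continue  # same title already indexed: treat as a duplicate view
        seen_titles.add(title)
        kept.append((url, title))
    return kept
```

    Note that the title is only known once a page has been downloaded, so a filter like this trims the index but not the crawl itself.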

    You might be asking, “Why are you indexing your vBulletin forum when it has its own search facility?” That’s a fair question, and it applies to all the other PHP-based apps we run as well. The answer is that we prefer to run everything through the same content frame so as to integrate all these apps into one seamless presentation. In support of that, it’s much easier to search from one location and be able to find anything and everything.

    I would appreciate any comments or suggestions.

  • #2
    If your pages don't have identical content and don't have identical URLs, they are just similar pages. Not duplicates.

    A few different scripts have this problem of generating a near-infinite number of different pages. Filtering on the title for uniqueness is not the best solution. Many sites don't have unique titles, so the usefulness is limited. There is also the secondary problem that filtering on the title still means a large number of unnecessary pages need to be downloaded (near-infinite again for some scripts). This is because we don't know what the title of the page will be until after the page is downloaded.

    We have a FAQ question that covers this issue and a better solution.
    "Q. How should I index my site if it features a message board, forum, or calendar and other similarly complex scripts?"

    -------
    David

    • #3
      My vBulletin Solution

      Originally posted by Wrensoft
      Thanks for that - I missed it in the search!

      I finally gave up on the notion of a one-size-fits-all search. I've excluded the vBulletin forum. I just couldn't find a balance with my skip list. Either I got nothing, or I got infinity.

      After thinking about this in greater depth, I now believe the solution is to write a custom PHP script that enumerates post data from the database into separate HTML pages while removing all URLs within each of those pages.
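
      I haven't written the PHP yet, but the idea can be sketched in a few lines of Python. Everything here is an assumption for illustration: the `post` table and its `postid`, `title`, `username`, and `pagetext` columns are stand-ins for the real forum schema, SQLite stands in for MySQL, and `strip_urls` and `export_posts` are hypothetical helper names. Each post becomes one static page with a distinct title and no links for the spider to follow.

```python
import html
import re
import sqlite3
from pathlib import Path

# BBCode url tags, e.g. [url=http://...] and [/url]; the link target goes with the tag.
URL_TAG_RE = re.compile(r"\[/?url[^\]]*\]", re.IGNORECASE)
# Bare http(s) links left in the text.
URL_RE = re.compile(r"https?://[^\s\]\[<>\"']+", re.IGNORECASE)


def strip_urls(text):
    """Remove BBCode url tags and bare links so the spider has nothing to follow."""
    text = URL_TAG_RE.sub("", text)
    return URL_RE.sub("", text)


def export_posts(conn, out_dir):
    """Write one static HTML page per post and return the list of paths written."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = []
    # Assumed schema: adjust the table and column names to the real database.
    rows = conn.execute(
        "SELECT postid, title, username, pagetext FROM post ORDER BY postid"
    )
    for postid, title, username, pagetext in rows:
        page_title = html.escape(f"{title} by {username}")
        body = html.escape(strip_urls(pagetext))
        path = out / f"post-{postid}.html"
        path.write_text(
            f"<html><head><title>{page_title}</title></head>"
            f"<body><h1>{page_title}</h1><p>{body}</p></body></html>",
            encoding="utf-8",
        )
        written.append(path)
    return written
```

      The indexer would then be pointed at the output folder instead of the live forum, so the page count is bounded by the number of posts.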

      Thanks again!
