Tricky duplicate page detection

  • Tricky duplicate page detection

    I'm indexing a bunch of CGI pages on a website, but it seems like there are a number of ways to get to the same content.

    For example, these are the first 4 results being returned:

    http://www.sasaki.com/what/portfolio.cgi?fid=280&service=3
    http://www.sasaki.com/what/portfolio.cgi?fid=280&service=3&page=2
    http://www.sasaki.com/what/portfolio.cgi?fid=280&page=5
    http://www.sasaki.com/what/portfolio.cgi?fid=280&page=7

    The only thing that matters here appears to be the "fid" number, but the spider crawls hundreds of variations of that page and they all show up in the search results.

    I was hoping that I could use the CRC duplicate page detection to solve this problem, but even though the pages look identical in the browser, the HTML source differs: some of the links embed the same "page=x" variations that appear in the URL, so the CRC values never match.

    I thought there might be a way to use "skip" file paths, and I have tried adding "&page=" to the skip list. This definitely helps reduce duplicates, but there are many variations of URL "get" variables, and it may also cause valid pages to be missed: a page may not need the page variable, but in some cases the only link the spider encounters to it may be one that contains that variable.
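
    For example (hypothetical markup, not taken from the actual site), the only link the spider follows to a given project could itself carry the page variable, so skipping "&page=" would hide that project entirely:

    <!-- hypothetical navigation fragment -->
    <a href="portfolio.cgi?fid=280&page=2&service=3">Next page</a>
    <a href="portfolio.cgi?fid=281&page=5">Another project, only reachable through this link</a>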

    So it looks like I'm stymied on this one. I may just have to live with the duplicates until the website is redesigned... unless someone can think of a clever way to get around this.

    Thanks,

    KG

  • #2
    Maybe you can use the ZOOMSTOP and ZOOMRESTART tags to exclude the sections of your pages that contain links and navigation. This way, only the middle content portion of the page is looked at when Zoom uses the CRC to determine if it is a duplicate page. This isn't the most efficient way to avoid indexing some pages, however.
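
    Something like this (a rough sketch; the navigation markup is made up for illustration, only the ZOOMSTOP/ZOOMRESTART comments are the actual Zoom tags):

    <!--ZOOMSTOP-->
    <!-- navigation and pagination links: excluded from indexing and the CRC check -->
    <div class="navigation">
      <a href="portfolio.cgi?fid=280&page=2">Next</a>
      <a href="portfolio.cgi?fid=280&page=3">Page 3</a>
    </div>
    <!--ZOOMRESTART-->

    <!-- main content: this is what gets indexed and compared -->
    <div class="content">
      Project description text ...
    </div>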

    Another option might be to make a single page with just links on it to all your valid pages (like an HTML-format sitemap). Then use this page as the start point for Zoom, and configure the start point to index only a single level deep. I am guessing that you don't know which pages are your valid 'main' pages and which are safe to leave out of the list? But if you don't know which pages are the valid ones, Zoom doesn't have much chance of working it out.
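
    A minimal sketch of such a page (the file name and link titles are just placeholders; one canonical "fid" URL per project):

    <!-- sitemap.html: set as the Zoom start point, spidered to a single level -->
    <html>
    <body>
      <a href="http://www.sasaki.com/what/portfolio.cgi?fid=280">Project 280</a>
      <a href="http://www.sasaki.com/what/portfolio.cgi?fid=281">Project 281</a>
      <!-- ... one link for each valid project ... -->
    </body>
    </html>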
