I'm indexing a bunch of CGI pages on a website, but it seems there are a number of ways to get to the same content.
For example these are the first 4 results being returned:
http://www.sasaki.com/what/portfolio.cgi?fid=280&service=3
http://www.sasaki.com/what/portfolio.cgi?fid=280&service=3&page=2
http://www.sasaki.com/what/portfolio.cgi?fid=280&page=5
http://www.sasaki.com/what/portfolio.cgi?fid=280&page=7
The only thing that matters here appears to be the "fid" number, but the spider crawls hundreds of variations of that page and they all show up in the search results.
I was hoping that I could use the CRC duplicate page detection to solve this problem, but even though the pages look identical in the browser, the HTML source differs: some of the links inside each page carry the same "page=x" variations as in the URL, so the checksums never match.
I thought there might be a way to use "skip" file paths, and I have tried adding "&page=" to the skip list. This definitely helps reduce duplicates, but there are many combinations of URL GET variables, and it may also remove valid content: a page may not require the page variable, but it's possible that in some cases the only link the spider encounters for it contains that variable...
So it looks like I'm stymied on this one. I may just have to live with the duplicates until the website is redesigned... unless someone can think of a clever way to get around this.
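To illustrate what I mean by a clever workaround, here is a rough Python sketch of the kind of URL canonicalization I'm imagining, where only the content-selecting parameter is kept when checking for duplicates. The parameter names are just taken from my example URLs above, and I don't know whether the spider actually exposes any hook like this:

```python
from urllib.parse import urlparse, parse_qsl, urlencode, urlunparse

# Parameters that only change presentation, not content (an assumption
# based on the example URLs above -- "fid" seems to be the only one
# that actually selects content).
IGNORED_PARAMS = {"page", "service"}

def canonical_url(url: str) -> str:
    """Return a canonical form of a portfolio.cgi URL for duplicate detection."""
    parts = urlparse(url)
    # Keep only the query parameters that select content.
    kept = [(k, v) for k, v in parse_qsl(parts.query) if k not in IGNORED_PARAMS]
    kept.sort()  # stable ordering so ?a=1&b=2 and ?b=2&a=1 collapse together
    return urlunparse(parts._replace(query=urlencode(kept)))

# All four example URLs collapse to the same canonical key:
urls = [
    "http://www.sasaki.com/what/portfolio.cgi?fid=280&service=3",
    "http://www.sasaki.com/what/portfolio.cgi?fid=280&service=3&page=2",
    "http://www.sasaki.com/what/portfolio.cgi?fid=280&page=5",
    "http://www.sasaki.com/what/portfolio.cgi?fid=280&page=7",
]
assert len({canonical_url(u) for u in urls}) == 1
```

Something along those lines would collapse all the "page=" and "service=" variations down to a single indexed entry per "fid", but as far as I can tell the spider only offers the skip list.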
Thanks,
KG