Newbie problem: Duplicate results from PHP site


  • Newbie problem: Duplicate results from PHP site

    I have indexed a PHP site, but when I do a search I get a ton of results that all have the same content, each with a different URL. This happens even when I turn the duplicate page detection function off.

    Also, the URLs are very long and quite different from the ones I see when I step through the site manually. For example, the page http://www2.algosolutions.com/?page=nsc&sid=10 is also listed as http://www2.algosolutions.com/?page=nsc&sid=10&PHPSESSID=87755b9f6a829cd023ed3c4608d669d1
    as well as a whole bunch of others, like:
    http://www2.algosolutions.com/?page=nsc&sid=10&PHPSESSID=ac9dd393f19d8e5708b20822c744e251

    Thanks

  • #2
    What you are seeing is simply PHP session IDs being passed in the URL. It appears that your website changes how it creates its links depending on whether the client/browser has cookies enabled.

    You can open the Configuration window, go to the "Authentication" tab, and enable "Use cookies from Windows and IE". This would allow the indexer to see the site the same way your browser does.
    --Ray
    Wrensoft Web Software
    Sydney, Australia
    Zoom Search Engine



    • #3
      Still have problem with php session ids

      Thanks. I have enabled "Use cookies from Windows and IE"; however, I still get search results that have the session ID in the URL, although it appears to be less of a problem. The result is that duplicates still show up, even with the duplicate page detection function checked. For example, here are two URLs that showed up in the search results for the same page:

      http://www2.algosolutions.com/?page=nsc&sid=10

      http://www2.algosolutions.com/?page=nsc&sid=10&PHPSESSID=3f3894b0fbb31fe6ddfdd2e17eba1a73

      Is it possible to configure the indexing process so that it excludes URLs containing certain text, such as "PHPSESSID"?
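
To illustrate what "excluding" the session ID amounts to, here is a minimal Python sketch that removes the PHPSESSID parameter from a URL and collapses URLs that differ only by it. This is not a Zoom feature or setting; the names `strip_session_id`, `unique_urls` and `SESSION_PARAMS` are made up for this example.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumption: the only session parameter on this site is PHPSESSID.
SESSION_PARAMS = {"phpsessid"}


def strip_session_id(url):
    """Return the URL with any session-ID query parameter removed."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key.lower() not in SESSION_PARAMS
    ]
    return urlunsplit(parts._replace(query=urlencode(kept)))


def unique_urls(urls):
    """Collapse URLs that differ only by session ID, keeping first-seen order."""
    seen = set()
    result = []
    for url in urls:
        clean = strip_session_id(url)
        if clean not in seen:
            seen.add(clean)
            result.append(clean)
    return result


if __name__ == "__main__":
    found = [
        "http://www2.algosolutions.com/?page=nsc&sid=10",
        "http://www2.algosolutions.com/?page=nsc&sid=10"
        "&PHPSESSID=3f3894b0fbb31fe6ddfdd2e17eba1a73",
    ]
    print(unique_urls(found))
```

Both example URLs from this post reduce to the same address, so only one entry is left.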



      • #4
        I think I solved my problem

        I checked the "Reload all files (do not use cache)" option under the General tab of the Configuration window. That seemed to work for a couple of indexing runs, but then, for some reason, the session ID URLs started to get included again. Sometimes it works, sometimes it doesn't. Am I doing things right?



        • #5
          I think it is just the cache acting up. Checking "reload all files (do not use cache)" AND "Use cookies from Windows and IE" should fix it in theory. It might be worth clearing your cache in IE to be sure.

          The other problem is that I can't be sure how your website decides whether to provide session IDs without seeing the actual source code of the backend. You might want to check that there's no other scenario where session IDs are given out instead (e.g. if there's a link to disable the use of cookies for this session and the spider follows it, or if the website changes behaviour for different browsers, etc.)

          As for why they do not get detected by the "duplicate page" function: that is because the pages are not actually identical. If you view the source, you will see that the PHPSESSID page links back to the index.php page (and various other pages) with this session ID appended. The other page does not have this. Also, they each use different banner images.
          --Ray
          Wrensoft Web Software
          Sydney, Australia
          Zoom Search Engine
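
The point about the pages not being identical can be shown with a small Python sketch. It assumes a duplicate check that compares an exact hash of the page source; how Zoom's own detection works is not described in this thread. The `fingerprint` helper and the two sample pages are hypothetical.

```python
import hashlib
import re

# Matches a PHPSESSID parameter inside a link, e.g. "?PHPSESSID=abc123"
# or "&amp;PHPSESSID=abc123". Used for fingerprinting only; the stripped
# text is not guaranteed to contain valid URLs.
SESSION_RE = re.compile(r"[?&](?:amp;)?PHPSESSID=[0-9a-fA-F]+")


def fingerprint(html, ignore_session_ids=False):
    """Hash the page source, optionally ignoring embedded session IDs."""
    if ignore_session_ids:
        html = SESSION_RE.sub("", html)
    return hashlib.sha256(html.encode("utf-8")).hexdigest()


if __name__ == "__main__":
    plain = '<a href="index.php">Home</a>'
    with_sid = (
        '<a href="index.php?PHPSESSID=3f3894b0fbb31fe6ddfdd2e17eba1a73">Home</a>'
    )
    # Exact comparison: the embedded session ID makes the pages differ.
    print(fingerprint(plain) == fingerprint(with_sid))
    # Ignoring session IDs: the two sample pages now match.
    print(fingerprint(plain, True) == fingerprint(with_sid, True))
```

Even with session IDs ignored, the real pages would still differ because of the different banner images mentioned above.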

