Spider won't follow links - Frames & Password protected

  • Spider won't follow links - Frames & Password protected

    I am starting from a site map page as the start URL (using the eval version). The base URL is the domain (root URL) of the site, e.g.
    Start URL - http://domain/folder/folder/sitemap.php
    Base URL - http://domain/

    Only the site map is indexed; the spider does not follow its links (all of which are relative).
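
For reference, relative links on the site map resolve against the start URL, not the base URL. The sketch below shows where a few made-up relative links would end up and whether each result still begins with the base URL. The link values are hypothetical, and the prefix check is only an assumption about how a spider might decide that a page belongs to the site.

```python
from urllib.parse import urljoin

START_URL = "http://domain/folder/folder/sitemap.php"
BASE_URL = "http://domain/"


def resolve_links(start_url, base_url, hrefs):
    """Return (absolute_url, inside_base) for each relative href."""
    resolved = []
    for href in hrefs:
        # Relative links are resolved against the page they appear on.
        absolute = urljoin(start_url, href)
        # Assumption: a page counts as "inside the site" if it starts
        # with the base URL.
        resolved.append((absolute, absolute.startswith(base_url)))
    return resolved


if __name__ == "__main__":
    # Hypothetical hrefs, as they might appear on a site map page.
    for url, inside in resolve_links(
        START_URL, BASE_URL, ["page.php", "../news/index.php", "/staff/list.php"]
    ):
        print(url, inside)
```

With these settings every resolved link stays under the base URL, so relative links on their own would not explain a one-page index.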

    When I use the root URL as the start point (frame-based site), it follows links only to the first level, e.g. the pages linked from the navigation frame. It won't go any deeper.

    I have checked all the exclusions etc. but the index results say 1 page indexed!

    The pages all use PHP/MySQL cookie-based authentication.

    Please help.

  • #2
    From a search point of view, you have a complex site. You have frames and authentication, both of which make a site search-engine unfriendly.

    To start with, read these FAQ entries.

    Authentication:
    http://www.wrensoft.com/zoom/support/auth.html

    Frames:
    http://www.wrensoft.com/zoom/support...to.html#frames

    Links not being followed:
    http://www.wrensoft.com/zoom/support...s.html#skipped
    http://www.wrensoft.com/zoom/support...avascriptmenus
    http://www.wrensoft.com/zoom/support...spider_finding


    Also, turning on verbose mode can give a lot of detail about what Zoom is doing and why it isn't doing what you expect. If you still have a problem, a verbose log file is what we'll need.

    -----
    David



    • #3
      Wow, now that's prompt service - on a Friday as well!

      Thanks for your help, but I had read all of those FAQs and tried to cover all bases before asking dumb questions. I have repeated a few index runs, one from the root and the other from the sub-page that is the site map.

      Funnily enough, however, I forgot to log in to the site for the first two index runs. The next two were done while logged in. The results in the logs seem the same.

      http://intranet.radford.act.edu.au/z...s/root_log.txt
      http://intranet.radford.act.edu.au/z...henticated.txt
      http://intranet.radford.act.edu.au/z...itemap_log.txt
      http://intranet.radford.act.edu.au/z...henticated.txt

      The resulting search (located at base URL/search/) seems to find pages based only on the file name and not on the content of the file, which I guess is a result of the navigation page!

      Authentication is PHP based and cookie based, and depends on user level: level 2 is student (partial access) and level 3 is staff (total access). Users are drawn from a combination of LDAP and a MySQL table.

      The navigation frame is HTML/CSS based.

      I appreciate your prompt reply and hope you can help, as it will solve some big dilemmas for me (and my users!).

      Tim



      • #4
        Well, my first guess would be that Zoom is not actually logged in to your site.

        For the sitemap indexing session: can you have a look in the zoom_dictionary.zdat file generated by Zoom and see what words have been indexed? Did it index the sitemap page or a login page?

        Does your authentication scheme work across multiple sessions of IE? For example, if you start IE and log in, then start a second copy of IE from the Start menu, is the second copy not logged in, even though the first one still is?
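
One way to run a similar check outside of Zoom is to fetch the start URL with a client that holds no session cookie and see whether a login form comes back instead of the site map. The following is a minimal Python sketch, assuming the login page contains an `<input type="password">` field; that is a guess about the site, not something confirmed in this thread, and the function names are made up for illustration.

```python
from html.parser import HTMLParser
from urllib.request import urlopen


class PasswordFieldFinder(HTMLParser):
    """Sets found=True when the page contains <input type="password">."""

    def __init__(self):
        super().__init__()
        self.found = False

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value may be None.
        if tag == "input" and (dict(attrs).get("type") or "").lower() == "password":
            self.found = True


def looks_like_login_page(html):
    """Heuristic: a page with a password field is probably a login form."""
    finder = PasswordFieldFinder()
    finder.feed(html)
    return finder.found


def fetch_without_cookies(url):
    """Fetch a page with a fresh client that sends no session cookie."""
    with urlopen(url, timeout=10) as response:
        return response.read().decode("utf-8", errors="replace")
```

Calling `looks_like_login_page(fetch_without_cookies(start_url))` on the spider's start URL would then suggest whether an unauthenticated client, which is roughly what the spider is, receives the login page rather than real content.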

        -----
        David



        • #5
          Only one logged in session can be active. Logging in again renders the original session inactive.

          You are on the right track: the dictionary file has indexed the words in the nav frame (no auth. required) but also the login page! So no, the indexing session could not log in. The search files can be found at http://intranet.radford.act.edu.au/search/

          I am hoping to remove frames from the site soon, and the authentication system can be tweaked as well. Any suggestions?

          Thanks, Tim.



          • #6
            1) If your login page can receive username and password information via the URL, then you can use a spider start point URL with this information specified as GET parameters (for example, "http://intranet.radford.act.edu.au/login.php?username=george&password=ringo"). Many login pages work like this even if they were never designed to handle a URL login.

            2) If you can modify the server-side script that does the authentication, you could change it so that it allows a user-agent containing the word "ZoomSpider" to bypass the login process. Similarly, you could also allow the IP address of the indexing computer to bypass the login process.

            3) If possible, consider using Offline mode to index your website. This requires a copy of the website to be accessible on your local hard disk, allowing Zoom to simply scan all the files without having to get past the security restrictions on your live site. Note, however, that Offline mode is not suited to websites which depend heavily on server-side scripting to deliver content (e.g. PHP or ASP driven websites). See the Users Guide for more information on Spider mode and Offline mode.
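
Options 1 and 2 above can be sketched roughly as follows. This is an illustration in Python rather than the site's own PHP, and several things in it are assumptions: the parameter names `username` and `password` are simply taken from the example URL and must match whatever the real login form expects, the IP address is a placeholder, and both function names are invented for this sketch.

```python
import ipaddress
from urllib.parse import urlencode, urlsplit, urlunsplit

# Hypothetical values for illustration only.
SPIDER_UA_TOKEN = "ZoomSpider"        # the word mentioned in option 2
INDEXER_NETWORKS = ["192.0.2.10/32"]  # placeholder address of the indexing PC


def login_start_url(login_url, username, password):
    """Option 1: append credentials to the login URL as GET parameters."""
    parts = urlsplit(login_url)
    query = urlencode({"username": username, "password": password})
    return urlunsplit((parts.scheme, parts.netloc, parts.path, query, ""))


def may_bypass_login(user_agent, remote_addr):
    """Option 2: let the indexer skip the login check by user agent or IP."""
    if SPIDER_UA_TOKEN in (user_agent or ""):
        return True
    try:
        addr = ipaddress.ip_address(remote_addr)
    except ValueError:
        # Not a parseable IP address, so it cannot match the allow list.
        return False
    return any(addr in ipaddress.ip_network(net) for net in INDEXER_NETWORKS)
```

A user-agent string is easy to forge, so a stricter variant would require both the user-agent match and the IP match before skipping the login. The same checks would need to be ported into the server-side authentication script.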

            ----
            David



            • #7
              Offline mode worked well. Thanks for your help. We have just purchased the Pro version for our school intranet. It will solve what could have been an expensive exercise in finding a new intranet solution.

              $140 Aussie dollars, bargain! If only I had found this product months ago, I could have saved myself hours of time doing more productive tasks!

              Thank you very much for all your assistance, it made the sale!

              Tim
