Password Protected Directories - Spider list

  • Password Protected Directories - Spider list

    As a follow up to this question that I previously posted in the wrong area (I hope I am doing it right this time by posting a new thread) and your reply:

    Quote:
    Originally Posted by WilliamJ
    Lastly, and this too is off the subject, did you know that Zoom can index 2 separate password protected directories (with the same log-in info) but they must be in a particular order in the SPIDER listing of directories? And I think this order may be alphabetical. In other words, if you want to index /file/A_protected and /file/B_protected, I think they need to be listed in this order in the SPIDER list. If I put /file/B_protected ahead of /file/A_protected in the spider list it will index /file/B_protected but /file/A_protected will error out with a 401 error. Reversing the order so that they are alphabetical results in a correct search of both directories with no errors. Now, someone may ask the obvious next question of indexing two directories that have 2 separate login passwords?

    No, the alphabetical order of the listing of URLs has no significance. I think what you observed would most likely be coincidence and is actually a behaviour caused by an unrelated issue.

    First question would be what type of authentication is your site using. Please see this support page for a detailed explanation:
    http://www.wrensoft.com/zoom/support/auth.html

    My guess is, if you're using session/cookie-based authentication - that your first URL/directory might contain a link to logout, and you have not added this to your skip list. Because of this, when you index the first URL, the spider is logged out, and unable to log back in to access the second URL. That's just purely guessing from what little information I have though. But you should look into it, and I can tell you that it's definitely not due to the alphabetical order of the directories.


    Initially I was able to change the order in which the protected directories were listed in the spider URL list, and it indexed OK. I thought the list simply needed to be alphabetical. Today we tried to index again and it would not index both directories: it failed with a 401 error on the 2nd directory in the list. We had made no changes to this list or to any files in those directories since the last index. What worked before was to delete the directory and change the order, so I tried that today: I deleted that directory listing from the spider listbox and added it back in at a different spot (changing the spider directory order), performed the index, and it worked fine. We use HTTP authentication and we do not have a LOG OFF page or field on any of the pages being indexed in those directories. I thought the order had to be alphabetical but, as you noted above, it does not. I think Zoom must somehow not be performing the login on the 2nd directory correctly, or may be getting out of sync. It is no big deal, as I worked around it, and now that I know what I can do to get it to index I can certainly live with it, but it may be something in the program that needs to be addressed. I am using V5 build 1004 PROFESSIONAL.

  • #2
    What a long post

    I think you are saying that you have two start points and the 2nd start point does not work because it gets an HTTP 401 error.

    Zoom only supports a single user name / password for HTTP authentication. So if you have two protected sites and they each need different logins, then it isn't going to work.

    Would it be possible to get access to the sites temporarily in order to reproduce the problem here?


    • #3
      Actually the logins are the same for each directory and the directories are on the same website.

      I will be happy to forward the login info for each directory - Just let me know an email address to contact you with the info.


      • #4
        Our e-mail address can be found on the Contact Us page.

        I presume you have entered the correct login information on the "Authentication" tab of the Configuration window, and checked the "Enable HTTP authentication" option?
        --Ray
        Wrensoft Web Software
        Sydney, Australia
        Zoom Search Engine


        • #5
          Yes, I have the login and HTTP information in the configuration settings. As I mentioned above, it will log in to the 1st protected directory but not the 2nd one. I am pretty sure it is something in the way the spider is searching, logging in, etc., because I had to re-sort the directories again yesterday when I went to do another index. It seems every time I index, I have to move these directories around in the spider list to get the spider to index both protected directories correctly. The problem is not with one of the directories because it has correctly spidered one or both at one time or another. It just seems not to want to do both in one single index unless I move the order around before indexing.

          I created a subscription for you and emailed the password just now to the info@ email address. You can access both of these directories with the same login info:

          http://www.georgiagrindingwheel.com/.../catalog_prot/
          http://www.georgiagrindingwheel.com/Catalog/tech_prot/

          Or, you can go to http://www.georgiagrindingwheel.com/catalogpages.htm and log in by clicking on the red button in the right hand column.

          You can also see ZOOM in action now for those directories.

          Thanks.


          • #6
            We had a look at your site and ran some tests, and agree with the behaviour reported. Zoom is being denied access to the second directory despite it being able to access the first - and when you switch to indexing the other directory first, it will be granted access to them both correctly.

            However, we noticed that your authentication system is not quite typical. First of all, the login behaviour is different when logging in from different browsers (we tested IE and Firefox, and each had a different login page - IE uses the HTTP authentication windows, while on Firefox, it appears to use a login form which processes the login via a Perl script called "adpass.pl").

            We then realized that you appear to be using a third-party security/member management product known as AdPass, developed by a company called Ascad Networks. This appears to be a complicated software package which is actually providing the authentication process on your site. We do not know exactly how this package has implemented the authentication procedure, and we would suspect from the information given, and the behaviour we have observed, that it is doing something unusual and different from the norm. We are unable to provide support as to why their product is denying access to an authenticated client unnecessarily but perhaps you can contact them for further information on this issue.

            If you do change over to a standard HTTP authentication implementation, and continue to have this problem, we will be happy to look into it further.
            --Ray
            Wrensoft Web Software
            Sydney, Australia
            Zoom Search Engine


            • #7
              Glad that you could duplicate the problem but I am not sure that the ADPASS member management program is the culprit.

              What you are referring to is a browser related issue with the ADPASS software - It should not affect the zoom spider should it? We have no trouble accessing the directories from our FTP or WINSCP software or even from within Microsoft Frontpage as these do not require a browser as the interface. I can also access these directories outside of ADPASS by typing in the URL in my browser - The directories are using plain .htaccess protection.

              The adpass software does not have any control over the HTTP login unless you try to access through our customized login screens (call the script). Try to access the directories directly with the passwords I gave you, you shouldn't have the issue that you describe (different logins). For example, if you type into your browser the following directory, it should present a straight HTTP login: http://www.georgiagrindingwheel.com/.../catalog_prot/
              Upon successful login, you would then be presented with the index.html page showing our catalogs. This all would be OUTSIDE of the adpass software. The only time the adpass software runs is when you go through a browser using our custom login pages (which direct you to one screen or another depending on the browser). The adpass software is simply a member management program that we use to handle subscriptions and passwords. Once the password is written to the .htaccess file, it has no bearing on access to those directories unless you go through our customized login pages.

              Thanks.


              • #8
                Originally posted by WilliamJ
                What you are referring to is a browser related issue with the ADPASS software - It should not affect the zoom spider should it?
                Actually, it does, because this indicates that the AdPass implementation will behave differently depending on the client's User-Agent ID. So while it may behave in one way for "Internet Explorer" and another for "Mozilla" (i.e. Firefox), it is likely to do something different again for "ZoomSpider" (which is what Zoom identifies itself as when you index in Spider Mode).
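
                To make the point concrete, here is a minimal Python sketch of the headers an HTTP client sends with a Basic authentication request. The credentials and the exact User-Agent strings are placeholders for illustration (the full string Zoom sends may differ from plain "ZoomSpider"). The credentials are identical in each request, so the User-Agent header is the only thing a server-side script could use to tell a browser from the spider.

```python
import base64


def basic_auth_headers(username, password, user_agent):
    """Build the request headers for HTTP Basic authentication.

    The Authorization value is "Basic " followed by the Base64
    encoding of "username:password".
    """
    token = base64.b64encode(f"{username}:{password}".encode("utf-8")).decode("ascii")
    return {
        "Authorization": f"Basic {token}",
        "User-Agent": user_agent,
    }


# Same (placeholder) credentials, different client identities.
browser_headers = basic_auth_headers("user", "secret", "Mozilla/5.0")
spider_headers = basic_auth_headers("user", "secret", "ZoomSpider")
```

                Both header sets carry the same Authorization value, so any difference in how the server responds would have to come from logic keyed on the User-Agent.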

                Originally posted by WilliamJ
                We have no trouble accessing the directories from our FTP or WINSCP software or even from within Microsoft Frontpage as these do not require a browser as the interface.
                There's an important distinction you're missing. Those applications are using different protocols (namely FTP and SCP) to access the directories. Your HTTP authentication does not apply when you access the files via a different protocol.

                When you use Zoom in Spider Mode (or indeed, any web crawler or spider), your files are accessed via HTTP (as is required by the nature of "spidering" since it needs to "follow" links), and that is why your HTTP authentication is in effect. This is essentially the same as accessing the files with a web browser, regardless of the interface. So a spider is not at all the same as using FTP/SCP software, and is much more similar to a browser.
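
                As a rough illustration of what "following links" means, the sketch below pulls the links out of a page's HTML and resolves them against the page URL, which is the step any spider repeats after fetching each page over HTTP. It uses only the Python standard library; the class and function names are made up for this example, and it is not a description of how Zoom itself is implemented.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect the href targets of <a> tags as absolute URLs."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links against the page they were found on.
                self.links.append(urljoin(self.base_url, value))


def extract_links(html, base_url):
    """Return every link found in html, resolved against base_url."""
    collector = LinkCollector(base_url)
    collector.feed(html)
    return collector.links
```

                Each URL returned here would have to be requested over HTTP in turn, which is why the server's HTTP authentication applies to every page a spider visits.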

                It just occurred to me that you may not have considered "Offline Mode" yet. Do you have a copy of the files that you wish to index on your local hard disk (or a networked drive)? If so, you could bypass all the authentication hullabaloo and simply tell Zoom to index all the files within certain directories. The only disadvantage to Offline Mode is that you cannot index dynamically generated pages (such as PHP or ASP pages), which need to be accessed via HTTP to be generated. Otherwise, it is much faster than Spider Mode.

                For more information on the Spider Mode and Offline Mode, please see our Users Guide:
                http://www.wrensoft.com/zoom/usersguide.html

                Originally posted by WilliamJ
                The adpass software does not have any control over the http login unless you try to access through our customized login screens (call the script). Try to access the directories directly with the passwords I gave you, you shouldn't have the issue that you describe (different logins).
                ...
                I have noticed that the directories do employ .htaccess style HTTP authentication. However, we can't be sure from looking at this remotely (or without dissecting the AdPass software and examining its implementation closely) if there is any additional server-side behaviour that we cannot see. For example, it is possible that the server is configured to redirect URLs internally, and it may be changing the .htaccess on the fly. The AdPass package claims to do a lot, and one presumes it may do something tricky like this if it is to offer any special functionality over the built-in authentication available on your web server. With these possibilities in place, it is very difficult for us to determine exactly what is happening, and it may take us a lot of time to debug and analyze an unrelated third-party product.

                However, I will attempt a quick test later today by simply creating two directories requiring HTTP authentication, on the same domain, and see if I can replicate this problem with the most common, de facto standard setup. I'll post my results and see if I can confirm whether this is a problem in Zoom or not.

                Oh, by the way, it might be worth confirming if you are using IIS or Apache? I thought I noticed some IIS error messages on your site before, but your mention of .htaccess seems to imply Apache now.

                Edit: Just realized the IIS error messages I was thinking of were on another user's site. So never mind about that; I presume you're using Apache and will set up my test on Apache as well.
                Last edited by Ray; Mar-07-2007, 12:29 AM.
                --Ray
                Wrensoft Web Software
                Sydney, Australia
                Zoom Search Engine


                • #9
                  Update: We tested the scenario on Apache, with two folders containing the same .htaccess file and requiring the same user authentication to be accessed. We then created two start points in Zoom, one to each folder, and indexed the site with the appropriate login details on the "Authentication" tab of the Configuration window.

                  Zoom successfully accessed and indexed both start points (and thus both directories) in Spider Mode without problems.
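
                  For anyone who wants to run a similar check against their own server outside of Zoom, here is a small Python sketch. It requests each protected URL in every possible order with the same Basic credentials and records the HTTP status codes. The function names, the "ZoomSpider" User-Agent value, and the idea of passing in the fetch function are assumptions made for this example. Each request here is independent (no cookies or shared connection), so it only shows whether the server itself answers differently depending on order; it does not reproduce whatever state the spider keeps between requests.

```python
import base64
import itertools
import urllib.error
import urllib.request


def fetch_status(url, username, password, user_agent="ZoomSpider"):
    """Request url with HTTP Basic credentials and return the status code."""
    token = base64.b64encode(f"{username}:{password}".encode("utf-8")).decode("ascii")
    request = urllib.request.Request(
        url,
        headers={"Authorization": f"Basic {token}", "User-Agent": user_agent},
    )
    try:
        with urllib.request.urlopen(request, timeout=30) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # 401, 403, 404 and similar arrive as exceptions; report the code.
        return err.code


def check_orders(urls, fetch):
    """Fetch the URLs in every possible order.

    Returns a dict mapping each ordering (a tuple of URLs) to the list
    of status codes received, in that order.
    """
    results = {}
    for order in itertools.permutations(urls):
        results[order] = [fetch(url) for url in order]
    return results
```

                  A call such as check_orders(urls, lambda u: fetch_status(u, "user", "secret")) with your own URLs and login would show a 401 in one ordering but not the other if the server really is order-sensitive.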
                  --Ray
                  Wrensoft Web Software
                  Sydney, Australia
                  Zoom Search Engine


                  • #10
                    We are using Apache/Linux. I will keep an eye on this - Not sure where the issue lies since the software program that you suspect is the problem really shouldn't affect access to those directories outside the program.

                    I am now wondering if it might be our Norton Firewall/Antivirus/Antispam software - I did not consider it before - I may try it again with it off the next time I index.

                    I have been busy with other projects the past few days. I may be able to find time to get back into this in a couple of weeks - Will keep you posted if I find anything.

                    BTW, do you recommend any particular JS or HTML software to use as a debugger?


                    • #11
                      I don't think there is any "JS or HTML software" which can be used as a debugger. Or did you mean what programs can be used to debug JS or HTML? Visual Studio 2005 offers JavaScript debugging. HTML shouldn't need debugging, just verification usually.

                      Also - would like to repeat and remind you to consider the "Offline Mode" option. If it suits your usage, it would allow you to index your site without having to worry about any HTTP authentication issues.
                      --Ray
                      Wrensoft Web Software
                      Sydney, Australia
                      Zoom Search Engine


                      • #12
                        Yes, sorry - software to aid in debugging JS.

                        Will look into Offline Mode in future if it becomes a problem. Not a big deal to me now because I can re-sort the list and move on. Since I doubt we will reindex that often, it shouldn't be a problem.

                        Just wanted to make you aware of the problem should it be on your end.

                        Thanks.


                        • #13
                          We use a lot of different tools depending on the problem, including Visual Studio 2005, Dreamweaver 8, Wget, UltraEdit 11, and several others.
