Rebuild CGI Search Engine without Re-Spidering?


  • Rebuild CGI Search Engine without Re-Spidering?

    Hi,

    I want to make a few changes to the CGI search but I can't see how to do this without re-spidering the whole site.

    How can I re-compile/build search.cgi without re-spidering?

    If this is in the manual, I apologise. I didn't see it on scanning through it, but I'm not finished reading it through all the way yet.
    My Zoom-searchable poetry archives web site.
    http://poetryx.com

  • #2
    In most cases you should re-index the site to apply any configuration changes. This is because most configuration changes can change the requirements of what needs to be indexed or what needs to be omitted from the index files.

    If it is absolutely necessary, you can make minor setting adjustments in the "settings.zdat" file without re-indexing. However, we do not recommend this: many users have made changes here without fully understanding them, broken the search engine as a result, and needed a complete re-index to fix it in the end. It is only worth doing when you are sure the change you wish to make does not require re-indexing. If you have any doubts at all, we suggest re-indexing the whole site to apply your configuration changes instead.
    --Ray
    Wrensoft Web Software
    Sydney, Australia
    Zoom Search Engine


    • #3
      If that's the case, then more of the configurable settings need to be in "settings.zdat", because needing 5+ hours of re-indexing to fix something as simple as the search page URL (I forgot a slash) is unacceptable for any but the most casual of users.

      I wouldn't mind having to edit the settings file manually temporarily if it meant that I could get there now, and then I could change the setting in the indexer for the next time I wanted to index.

      It would save me having to send the Zoom indexer and the resulting search.cgi to one of my programmers for disassembly and reverse engineering, which is what I have to do now to make a change in the compiled cgi file.
      My Zoom-searchable poetry archives web site.
      http://poetryx.com


      • #4
        What did you mean by "changing the search page URL"? Are you referring to the "Link back URL" on the Advanced tab of the Configuration window? This is specified in the settings file already, as "LinkBackURL".

        I'm not sure why you need to reverse engineer or disassemble the CGI file. What exactly are you trying to change? The search page URL (for the generated search form, and for links to the search file itself), can be overridden with the "Link back URL" mentioned above, which is a setting outside of the CGI file.

        We certainly neither recommend nor support hacking the CGI binary application, and we cannot see why it is necessary for what you are doing. Perhaps you can explain more of the context of what you are trying to do, e.g. whether you are including the output of the CGI file from another dynamically generated page.
        --Ray
        Wrensoft Web Software
        Sydney, Australia
        Zoom Search Engine


        • #5
          Well, "reverse engineer" was a bit of a misnomer - the search.cgi file is very easy to decompile, make a few changes to, and then recompile. It was just temporary for us until we ran the indexer again and overwrote the search.cgi with a new version anyway.

          The LinkBackURL field seems to only show up in the settings file if it wasn't previously empty. When we ran it on a smaller site with no link back url specified it didn't show up in the settings.zdat file.

          However, having added it manually to the settings file it worked fine.

          Most of the functions are configurable with a little effort to manually alter the search code or other files, but it would be easier if the GUI offered a way to rebuild the CGI without respidering, even if it were to just issue a warning that "Features which change the behaviour of the spider will NOT be supported in this new search.cgi - you will have to respider your site to make use of XX features."

          Also, the HTML search form is inside the CGI binary without the GUI offering a way to edit it (this isn't a problem of course with the PHP or ASP versions, since they're just plain text). It's not obvious when using the program that if you select the "Do not generate" option for the search form, you can include a search form yourself in the template file.

          So if one wanted to alter the search form without respidering you'd have to either try to figure out if that feature was specified in the settings file (and what values the constants there would accept, since settings.zdat is undocumented) or do what we did - decompile and change the text in the CGI manually.

          Having said all that, the CGI search is easily about 15x faster than the PHP search was on our server, and the results are more relevant than Google searching for the same text on our site.

          I guess I was just thrown by the seemingly "missing" features for the CGI version that are included for the other platforms the Indexer generates (the ability to edit the search script after spidering and indexing).
          My Zoom-searchable poetry archives web site.
          http://poetryx.com


          • #6
            Originally posted by jough
            The LinkBackURL field seems to only show up in the settings file if it wasn't previously empty. When we ran it on a smaller site with no link back url specified it didn't show up in the settings.zdat file.
            That's correct. When no "Link back URL" is specified, the search script (or the CGI in your case) will simply link back to itself - which is the default usage scenario. You only need to specify the "Link back URL" if you are trying to embed the search script within another dynamically generated page - you still haven't confirmed what you are trying to do with all your changes, but I'd assume this is it for now.

            Again, it would be a lot clearer if you could tell us what you are trying to achieve, as there is most likely a better way to do all this than the approach that you've taken.

            Most of the functions are configurable with a little effort to manually alter the search code or other files, but it would be easier if the GUI offered a way to rebuild the CGI without respidering, even if it were to just issue a warning that "Features which change the behaviour of the spider will NOT be supported in this new search.cgi - you will have to respider your site to make use of XX features."
            The problem is not a simple matter of having certain features enabled/disabled. Often the data format can be completely different when a single option is changed. Inconsistent data files can cause all sorts of unexpected behaviour, and we can not provide meaningful error messages when it gets to that point. So instead, we decided it was more important to ensure a consistent set of index files exist to avoid these problems, and to do so, we purposely designed it so that a re-indexing is required for all configuration changes.

            Having said all that, you might want to note that "search.cgi" does not get compiled during indexing. It is simply copied over from the scripts\CGI folder. As this would imply, the CGI file does not need to be recompiled or updated when you make a configuration change, so a feature to just update the CGI would do absolutely nothing.

            You may find that re-indexing can be tedious for some really big sites (you mentioned 5 hours, so yours must be quite huge), but we would recommend indexing a smaller portion of the same site first (using the configurable Limits tab in the Professional Edition) for the initial steps of getting everything set up as you like, before you ramp up the Limits and index the entirety of your site.

            So if one wanted to alter the search form without respidering you'd have to either try to figure out if that feature was specified in the settings file (and what values the constants there would accept, since settings.zdat is undocumented) or do what we did - decompile and change the text in the CGI manually.
            No. None of that is required or recommended. Please look at the documentation.

            From the FAQ,

            "How do I modify the appearance of the search form?":
            http://www.wrensoft.com/zoom/support...tml#modifyform

            "How do I put search forms on different pages of my website? (Or define my own search form?)":
            http://www.wrensoft.com/zoom/support...tml#searchform

            Other "how to... " FAQs:
            http://www.wrensoft.com/zoom/support/#howtos

            Also refer to Chapter 5.7 ("How do I modify the search form on the search page?") in the Users Guide:
            http://www.wrensoft.com/zoom/usersguide.html
            --Ray
            Wrensoft Web Software
            Sydney, Australia
            Zoom Search Engine


            • #7
              Originally posted by Ray
              You may find that re-indexing can be tedious for some really big sites (you mentioned 5 hours which must be quite huge)
              Even with it skipping 100,000 pages, it still indexes about 40,000. I'm trying to see how many of those can be further whittled down as not really needing to be searched. And of course I still need to add more stop/restart comment tags throughout the site to stop sidebars and things from being indexed and links being skipped 10,000 times, etc.

              No. None of that is required or recommended. Please look at the documentation.

              From the FAQ,

              "How do I modify the appearance of the search form?":
              http://www.wrensoft.com/zoom/support...tml#modifyform
              I saw that - but if I had indexed the site with a search form included, and then decided after the fact that I'd rather include my own search form in the template.html file, I'd need to re-index just to remove the search form first, right?

              I also had a "duh" moment last night wherein I realised that I could simply index a tiny portion of the site (say, one page) with the new configuration and then just re-upload the cgi and settings file manually. Problem solved, although again, it's annoying to have to jump through hoops to do what to me seems like something that should be quick and simple (including or not including a search form).

              Also, it'd be really nice to have the various search form templates available in editable text files. Ditto with the results layout.

              I understand why Wrensoft may have made the system work like it does, but just because something's hard to do (for you) doesn't mean that your users should suffer. Sometimes you gotta take one for the team.

              Anyway, now that I've got things set up (mostly) how I like and can re-index regularly I'm mostly satisfied. With a little more control and ease of configuration though I'd be better inclined to recommend Zoom for my clients to use.
              My Zoom-searchable poetry archives web site.
              http://poetryx.com


              • #8
                I also had a "duh" moment last night wherein I realised that I could simply index a tiny portion of the site (say, one page) with the new configuration and then just re-upload the cgi and settings file manually.
                This is a similar situation to what Ray warned about in an earlier post. You're asking for trouble doing this. The settings file contains information that defines the overall size of the database. The end result will be wrong search results and strange, unpredictable crashes. Don't play with the settings file.

                ...it's annoying to have to jump through hoops
                Sometimes the hoops are there for a good reason, even if as a user you might not realize it.

                -----
                David


                • #9
                  If the nature of the database and its effectiveness changes because I've decided to add or not add a search form using the CGI file on the search page, then the software is broken, plain and simple. I hope you were just overstating things, and this isn't really the case.

                  Anyway, I've been experimenting with the software a bit and by trying the same searches with the same flat files I've been unable to return different results by altering most settings in the settings.zdat file. Pretty much every setting that doesn't affect the spider (like the character set, min word length, etc.) is editable in the settings file with no change to the search results.

                  It's not just a matter of the many hours of respidering - it's that the presentation layer, database layer, and business logic are all far too dependent on each other.

                  What you're telling me is akin to taking your car to the shop to get a paint job and the mechanic tells you that you'll need a new engine because as it stands, the blue car won't run if you paint it red.

                  Sometimes the hoops are there for a good reason, even if as a user you might not realize it.
                  No, I understand that the hoops are there for a reason. It may be difficult to rebuild the engine and parameters at your user's will. If it wasn't difficult, your clients would write their own search engines.

                  But how is it difficult to change something like the LinkBackURL on the fly? In fact, as far as I can tell (without having really examined all of the CGI code or the PHP search script) the ONLY place that the LinkBackURL is stored is in the settings.zdat file. Please explain how editing it manually in 2 seconds rather than waiting to respider in 5 hours is there for a "good reason."
                  My Zoom-searchable poetry archives web site.
                  http://poetryx.com


                  • #10
                    Originally posted by jough
                    If the nature of the database and its effectiveness changes because I've decided to add or not add a search form using the CGI file on the search page, then the software is broken, plain and simple. I hope you were just overstating things, and this isn't really the case.
                    We provide several methods to customize the search form, and they are well documented in the links given above. The problem here is not the fact that you are trying to change the search form. The problem here is that you did not refer to the documented/supported methods available, and took the liberty of assuming your own methods of hacking a binary CGI application. We do not support modifications to the CGI application, and in this case, we see no purpose in doing so as it adds no functionality that is not already provided.

                    The supported, and designed methods of modifying the search form provide full customisation capabilities. It is a different approach to what you had in mind, and for good reason - because it does not require the user to hack a binary CGI application.

                    Anyway, I've been experimenting with the software a bit and by trying the same searches with the same flat files I've been unable to return different results by altering most settings in the settings.zdat file. Pretty much every setting that doesn't affect the spider (like the character set, min word length, etc.) is editable in the settings file with no change to the search results.
                    We can assure you that changes to the settings file can cause crashes or unexpected behaviour - there have been a good number of cases of this in the past. Note that there are many test scenarios, and a small set of test searches may not trigger the crash. I guess you'll have to take our word for it.

                    What you're telling me is akin to taking your car to the shop to get a paint job and the mechanic tells you that you'll need a new engine because as it stands, the blue car won't run if you paint it red.
                    I don't believe this is accurate. It is more akin to taking a car to the shop and requesting to get a new paint job but instead of following the recommended methods, you specifically wish to disassemble the entire body of the car and repaint each part - then you wonder why it would be difficult or necessary to put that car back together.

                    Again, we provide a number of methods to modify the search form completely. We have designed around those methods to ensure that they provide the easiest way for the end user to customize their search page, and the least likely to cause problems.

                    You made the incorrect assumption in thinking that you needed to hack the binary CGI directly to customize the appearance of the search page. We do not support this method because it is difficult for most users, and is prone to breaking the application. It also does not do anything that you can not achieve via our recommended methods.

                    But how is it difficult to change something like the LinkBackURL on the fly? In fact, as far as I can tell (without having really examined all of the CGI code or the PHP search script) the ONLY place that the LinkBackURL is stored is in the settings.zdat file. Please explain how editing it manually in 2 seconds rather than waiting to respider in 5 hours is there for a "good reason."
                    For one, it simplifies the process of updating the search files for the end user. If, for example, you were to keep some settings in a separate file which you can independently upload from the main set of index files - then you would have to make sure that this file coincides with the changes in the index files. It will become an additional configuration file, which every user will have to check and update - separately from the main configuration. The user (especially one that is uploading the files themselves) will have to ensure that his/her latest changes correspond to their other configuration settings - they might need to look at the file's last modified date/time to see if it is the correct settings for that set of index files, etc. We think this compromises the ease of having a "single point of maintenance", for little benefit.
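
The bookkeeping described above (checking last-modified times to confirm that a separately uploaded settings file belongs to a given set of index files) could be sketched roughly as follows. This is only an illustration: the function name `files_out_of_step` and the five-minute tolerance are assumptions, and the sketch relies on nothing but the file name settings.zdat from this thread, not on any knowledge of what the file contains.

```python
from pathlib import Path


def files_out_of_step(index_dir, settings_name="settings.zdat",
                      tolerance_seconds=300.0):
    """Return the names of files in index_dir whose last-modified time
    differs from the settings file's by more than tolerance_seconds.

    A non-empty result hints that the settings file and the other files
    were not produced by the same indexing run. It proves nothing about
    whether their contents are actually compatible.
    """
    index_dir = Path(index_dir)
    settings_mtime = (index_dir / settings_name).stat().st_mtime

    suspicious = []
    for entry in sorted(index_dir.iterdir()):
        # Skip sub-directories and the settings file itself.
        if not entry.is_file() or entry.name == settings_name:
            continue
        if abs(entry.stat().st_mtime - settings_mtime) > tolerance_seconds:
            suspicious.append(entry.name)
    return suspicious


if __name__ == "__main__":
    import sys

    # Usage: python check_index.py /path/to/uploaded/search/files
    for name in files_out_of_step(sys.argv[1]):
        print("modified at a different time than settings.zdat:", name)
```

A timestamp check like this is exactly the kind of extra manual step the reply is arguing against; it shows the maintenance cost rather than removing it.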

                    Yes, you don't want to take another 5 hours to make that update, but again, for a site that large (or slow) to index, we would recommend indexing a smaller portion of the site first (using the Limits in the Configuration window - you can limit this down to just 10 pages even), and getting the look+feel configured as you like first. Re-indexing would take less than a second at this point and it would become a non-issue. Once you have ironed out your LinkBackURLs and search forms, and what not, you can then bump up the Limits and get your whole site indexed.
                    --Ray
                    Wrensoft Web Software
                    Sydney, Australia
                    Zoom Search Engine


                    • #11
                      I understand all of that, Raymond, but I think we're mostly coming from different schools of thought about software. You seem to think that the user should conform to the exigencies of the software, and I believe that the tool should be made to suit the human, not the other way around.

                      Everything you're suggesting means that one would have to try to work around the deficiencies in the software. A search engine is a tool for other developers - so "ease of use" is a secondary consideration to "easy to configure, customize, and control every minutia of the functionality of the software."

                      I will admit that I'm spoiled by using mostly open source software, and used to having to customise Apache and other server software to suit the needs of a particular project.

                      It seems that Zoom is geared more towards those who fear the command line, rather than for enterprise-level implementations. Having to do the "expensive" actions (spidering, indexing, and building the database/flat file) for changing presentation-layer elements (search form, results pages, etc.) is unheard of in business applications, which is why I'm so incredulous.

                      Zoom is otherwise a very strong product, but its marriage of presentation and business logic is a weakness, planned or otherwise.
                      My Zoom-searchable poetry archives web site.
                      http://poetryx.com


                      • #12
                        Originally posted by jough
                        I understand all of that, Raymond, but I think we're mostly coming from different schools of thought about software. You seem to think that the user should conform to the exigencies of the software, and I believe that the tool should be made to suit the human, not the other way around.
                        It is funny you said that, and I guess this is open to subjectivity, but in our mind, the reason for our approach was exactly that - to suit the human user, and not to force users through unnecessary complexities and points of maintenance. One of our primary goals was to simplify the procedure as much as possible and make it easy for anyone to add an otherwise complicated search component to their website.

                        I think that software which is supposed to be "user-centric" in design should surely have "ease of use" as a primary consideration, as well as being easy to configure, customize and control. However your definition of ease is clearly different to ours, with yours being, "I should be able to write a few bash scripts and grep commands and make whatever modifications I want", as opposed to ease as in, "I should be able to get this running intuitively, and make configuration changes by clicking a few buttons". It's really the age old GUI vs command-line argument all over again. We think it is easier for a user to have a guided procedure, and well categorized and easy to understand options. But you think it is easier to memorize variables, and remember setting flags and strings in a settings file.

                        And you are correct in that we did aim to cater for users who do not necessarily have command-line experience. But at the same time, we also designed the product for advanced users, administrators and programmers - note the abundance of configuration features and different usage methods. We use this software ourselves, and we can be just as picky with having things done in the most efficient manner possible. So we have also made sure that it is flexible and convenient for "power users" as it were, and it is well suited for enterprise level use.

                        The approach is different to most open source projects in that you do not need to "customise", hack scripts or make mods to "change the software" to suit a project (or to get it up and running at all). I guess this is something that you're used to doing and expected as being necessary.

                        Our approach is different: the core functionality is well encapsulated and allows even the most inexperienced user to install and customize a working search engine. More advanced usages, from embedding within other CGI pages or even desktop applications to post-processing the search results, are all possible (they are documented in our FAQ) and they act as a layer on top - but you don't see them until you need them. However they are all there, and well supported. As seen in the above thread, the challenges you faced all had pretty simple solutions - it just wasn't what you're used to doing with other server-side packages. While it may be a different approach to what you're used to, those differences are actually one of the big usability advantages that people seem to appreciate about Zoom.

                        Maybe we should just agree to disagree and chalk one up on the "GUI vs command-line" religious wars.
                        --Ray
                        Wrensoft Web Software
                        Sydney, Australia
                        Zoom Search Engine


                        • #13
                          Wrensoft has done an excellent job on their product: it's a great price and easy to use. That's what you get for $99. If you want a ton of other options and for it to be highly configurable, then the cost of support is going to grow for Wrensoft and they will have to start charging more for their product and possibly charge for support.

                          There are other search engines out there; maybe you would be happier with another product. Since you seem to be happy with Apache and how configurable it is, maybe Lucene would fit your needs.


                          • #14
                            Well, I'd be all for a GUI point and click setup if you could do simple configuration changes without having to do lengthy spidering and indexing all over again. If Zoom had a button that said "Apply these changes without respidering." then we wouldn't be having this conversation.

                            Simple solutions for doing most of what I want/need to do, yes. But when you're paying for bandwidth, respidering a site can be an added expense (okay, bandwidth is cheap and this is a minor point - but what if my site had 6 GB of generated PDF files on it or something?) and takes time for something that has (or should have) nothing to do with the search results, and therefore shouldn't require spidering. Why do the unnecessary and waste your users' time, their servers' resources, etc.?

                            Also, something like modifying the HTML output of the results page is nigh impossible using the CGI version. You can check the boxes all you'd like, but there's no way (that I can tell) to alter some attribute of the anchor tag without altering the CGI programme. For instance, if I wanted to add a javascript event to the link, I'd have to use the PHP engine, right? And then modify that?
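
A minimal sketch of the post-processing idea mentioned in reply #12, assuming the HTML emitted by search.cgi has already been captured as a string by some wrapper page. The function name `add_onclick` and the `track(this)` handler are hypothetical, nothing here is part of Zoom, and a regular expression is a fragile way to edit HTML, so treat it as an illustration rather than a recommended method.

```python
import re

# Matches an opening anchor tag such as <a href="page.html">.
# The \b keeps tags like <abbr> from matching.
ANCHOR_OPEN = re.compile(r"<a\b[^>]*>", re.IGNORECASE)
HAS_ONCLICK = re.compile(r"\bonclick\s*=", re.IGNORECASE)


def add_onclick(html, handler):
    """Return html with onclick="handler" added to every opening <a> tag
    that does not already carry an onclick attribute."""

    def patch(match):
        tag = match.group(0)
        if HAS_ONCLICK.search(tag):
            return tag  # leave existing handlers untouched
        # Drop the closing ">" and re-add it after the new attribute.
        return tag[:-1] + ' onclick="' + handler + '">'

    return ANCHOR_OPEN.sub(patch, html)


if __name__ == "__main__":
    # Hypothetical fragment standing in for captured search results.
    sample = '<p><a href="poem1.html">First result</a></p>'
    print(add_onclick(sample, "track(this)"))
```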

                            I guess I'm saying not that Zoom is too point-and-click - it's that you can't do everything that one may need to do with the GUI interface, and there's no easy way to do some of those things without it. So it's not point and click enough.

                            Even with these points of basic philosophy, I was able to configure and have a relevant, fast, and for the future pretty much push button search solution set up within the span of less than a day, and that was with running experiments, testing the various platforms for speed, etc. So my complaints, while VALID, are minor.

                            The not being able to edit the results page is a problem, though. Maybe I'm just not finding that section in the FAQ.
                            My Zoom-searchable poetry archives web site.
                            http://poetryx.com


                            • #15
                              Originally posted by broman
                              Since you seem to be happy with Apache and how configurable it is, maybe Lucene would fit your needs.
                              We need a solution for a LAMP setup, but may consider Lucene if we were running a Java server and needed a Java-centric solution (although probably not).
                              My Zoom-searchable poetry archives web site.
                              http://poetryx.com
