Great tool, but... Word limit is too low

    Excellent tool, but my major problem with it is the limits imposed on the maximum pages to index, as well as the maximum words to index. I can't remember seeing these restrictions on any other search tool, and it is your Achilles heel for anyone doing serious indexing. It is just impossible to index a single multilingual site with an average number of documents. I've tried several, and I never make it past 10,000-15,000 documents before maxing out my index word limit of 150,000. Maybe it works for people indexing many small HTML files in English only, with no special codes like part numbers. I think you would do very well if you figured out a way around your memory bottleneck. Perhaps occasional writes to disk once a certain amount of content has been indexed?

    But again, it is a very nice product. Don't mean to rain on your parade!

  • #2
    The limits are configurable. If you have a large site, go to the Limits tab in the configuration window and increase the value from 150,000 to, for example, 300,000.

    Also remember that this limit is not for the total number of words. It is for the number of unique words. So if the same word appears on 500 different pages, it is still only counted as a single unique word.

    This also means that the rate of growth in the unique word count slows as you index more pages. On the first page indexed, almost every word is new to the index. On the 10th page maybe only 40% of the words are new; the other 60% have already appeared on the first nine pages. By the 10,000th page and beyond there are probably only one or two new words per page. So an increase in the unique word limit from 150,000 to 300,000 can increase the number of pages indexed by maybe 100,000 pages (depending on your site).
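
    You can see this effect with a toy simulation. The Python sketch below is only an illustration (the Zipf-like vocabulary and the 500-word page size are made up, and this is not our indexing code), but it shows how quickly the number of new words per page falls away:

        import random
        from itertools import accumulate

        # Hypothetical vocabulary with a Zipf-like skew: low-numbered words
        # are drawn far more often, roughly mimicking real text.
        vocab = ["word%d" % i for i in range(200_000)]
        cum_weights = list(accumulate(1.0 / (rank + 1) for rank in range(len(vocab))))

        unique_words = set()
        for page in range(1, 1001):
            # Each simulated page contains 500 words.
            page_words = random.choices(vocab, cum_weights=cum_weights, k=500)
            new_words = len(set(page_words) - unique_words)
            unique_words.update(page_words)
            if page in (1, 10, 100, 1000):
                print("page %d: %d new words, %d unique words in total"
                      % (page, new_words, len(unique_words)))

    Each checkpoint shows far fewer new words per page than the one before, which is the growth curve described above.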

    We make the limit configurable so that you have some control over how much RAM is used during indexing. Other solutions will either (a) use far more RAM than required, (b) be slower because they need to expand their storage many times during indexing, or (c) fail unexpectedly hours into the indexing process because they run out of RAM, having had no way of knowing how much they needed at the start.

    And a lot of the content is written to disk during indexing. There is a portion we need to hold in RAM, however, to avoid thrashing the disk and to keep indexing speed up.
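
    Conceptually (this is a simplified sketch, not our actual implementation), the in-RAM portion behaves like a word table whose capacity is fixed up front by the configured limit, so RAM usage is predictable from the start and the limit is enforced cleanly:

        # Simplified sketch of an in-RAM unique word table whose capacity
        # comes from the configured limit (the value on the Limits tab).
        class WordTable:
            def __init__(self, max_unique_words):
                self.max_unique_words = max_unique_words
                self.word_ids = {}  # word -> integer id, held in RAM

            def intern(self, word):
                """Return the id for a word, adding it to the table if new."""
                if word in self.word_ids:
                    return self.word_ids[word]
                if len(self.word_ids) >= self.max_unique_words:
                    # Stop cleanly at the configured limit instead of
                    # exhausting RAM hours into the indexing run.
                    raise RuntimeError("unique word limit of %d reached"
                                       % self.max_unique_words)
                self.word_ids[word] = len(self.word_ids)
                return self.word_ids[word]

        table = WordTable(max_unique_words=300_000)
        for word in ("the", "cat", "the"):  # "the" is counted only once
            table.intern(word)
        print(len(table.word_ids))  # -> 2 unique words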

    The PHP & ASP options are restricted to 300,000 unique words for performance reasons (about six times the number of words in the English language). The CGI option can go much higher and is effectively limited only by your PC hardware.

    So far from being our Achilles heel, it is a feature that gives people doing 'serious' indexing more control over the indexing process.

    -----
    David



    • #3
      Fair enough David, I didn't mean to sound cynical or overly critical. I already understood the mechanics of the configuration. It's just not a very scalable methodology; that is what I am trying to convey. You also have an anglophone bias: there are indeed other languages around, and many sites offer content in several languages that you want to index as well. So the unique word limit makes little sense in that setting.

      Have a look at Thunderstone's Webinator. I used it for a long time, but it is too expensive for the projects I work on now, hence the need for something a little more accessible in terms of price. That product does exactly what Zoom does (except for the price, if you don't use the free version) and never runs out of RAM no matter the size of the indexing project. Performance is top notch, and the search options are too numerous to mention.

      I noticed you mentioned in one post that you do development for additional requirements; perhaps we can have a sideline chat about the cost implications.



      • #4
        "It's just not a very scalable methodology"
        I disagree. If you need a bigger limit, then you enter a bigger limit. By telling Zoom in advance roughly how much material you plan to index, you allow it to be more efficient while indexing.

        "You also have an anglophone bias"
        True, English-speaking countries are our biggest markets. But we have put a lot of effort into international language support, and Zoom works equally well in most languages. (The exceptions are some Asian languages, where support could be better.)

        The unique word limit is really unrelated to international language support.

        BTW: I worked in France, Germany, Belgium & Japan for several years before doing this. Most of our staff are also bilingual, so we are well aware that there are other languages.

        "...and never runs out of RAM no matter the size"
        I am sure their salespeople want you to believe this, but just because they don't publish figures on RAM requirements doesn't mean their solution has infinite capacity.

        It would be fairly easy to make Zoom use more disk space and less RAM. But performance would be worse and the index files would be larger.

        We publish independently verifiable benchmarks for indexing and searching times. There is really only one reason for competitors not to publish figures on resource usage and performance: they aren't that good. Do you have any actual numbers on CPU usage, indexing speed, RAM usage and index file size?

        There is no doubt that Webinator has more features than the current Zoom release. But you are comparing a $10,000 product to a $99 product.

        Yes, happy to have a chat about custom development. Contact details are here: http://www.wrensoft.com/contactus.html

        ---
        David



        • #5
          It's not scalable because you have an upper limit that you reach quickly if you are indexing large multilingual sites. Not to mention that performance degrades as you progress (my experience, in any case). When I speak about multilingual, I am not talking about multilingual support: the more languages there are on a site you are indexing, the more of the unique word limit gets used up, since the words are different in each language.

          And you have to understand that there are things other than words being indexed that may be relevant for different applications, such as part numbers or document symbols. So saying there are 50,000 unique words in the English language doesn't address the real-world applications of search technology.
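
          To make that concrete, here is a trivial Python illustration (the word lists are made up): the same short sentence in three languages, plus a few part numbers, and nearly every token is a new unique word:

              # Made-up sample content: one sentence in three languages plus
              # part numbers. The vocabularies barely overlap, so each
              # language eats into the unique word limit almost independently.
              english = {"the", "search", "engine", "indexes", "pages"}
              french = {"le", "moteur", "de", "recherche", "indexe", "les", "pages"}
              german = {"die", "suchmaschine", "indiziert", "seiten"}
              part_numbers = {"PN-10442", "PN-10443", "DOC-EN-001", "DOC-FR-001"}

              unique = english | french | german | part_numbers
              print(len(unique))  # 19 of the 20 tokens are unique ("pages" repeats)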

          I've used Webinator in the past, and I don't care what salespeople say; the proof is always in putting a tool through its paces. I indexed thousands of massive sites with it and never had a glitch. It would use the available memory, but it would intelligently put the content it spidered into its own DB tables and build the index at the end. It's awesome, which is why eBay uses it as their search technology. But you are right, it's bloody expensive and not to be compared with Zoom. I was just trying to show you that there is another methodology, used by another search company, that might be useful in your own planning. At the end of the day I like the product, and my hat's off to you. Sorry for the suggestions.



          • #6
            There is no fixed upper limit when using the CGI option. So really I am a bit confused as to why you think there is.

            There are sites that we know of that can't be indexed because the PC doing the indexing doesn't have enough RAM, and this is an issue for those huge sites with hundreds of thousands of pages. But I am not aware of any customer that could not index a site due to the unique word limit (when using the CGI option).

            So just set the limit to 500,000 or 1,000,000 words and then you don't need to think about it anymore.

            The overall RAM usage on huge sites is a real issue and is something we can still improve on (and will). But the unique word limit is not really the main cause of this issue.

            If you want to publish some benchmark numbers on indexing speed and resource usage, so we could have an informed discussion, that would be great. I was going to download Webinator and run some benchmark tests myself, but after 45 minutes of playing around I still can't get it running, so I have given up.

            ----
            David



            • #7
              I did some more research and found that Webinator does in fact have similar memory usage problems, limits to set, and problems with slow indexing here and here (even though you might not have personally come across them).

              Now it would be unfair to claim this is the norm for Webinator. But it would be equally unfair to claim that there are no resource & indexing performance issues.

              Also, many of the Webinator users seem to be using much higher-end hardware (e.g. 64-bit Sparc servers) than our average user (a 32-bit office PC).

              ------
              David



              • #8
                Unfortunately the CGI version is not an option for me (server requirements), and I'm having a real problem since I hit the 300,000 word limit after indexing about 50% of my content (it's multilingual).

                What methods can I use in order to index all my content?



                • #9
                  Using the CGI search option is really the only solution for indexing such a large site. PHP and ASP tend to use too much RAM and run too slowly when searching really large amounts of data.

                  Also, you have posted this in the V4 section of the forum, so I assume you are using V4 of the software. You should consider upgrading to V5: it uses less RAM, has a larger capacity, indexes content faster, and searches faster than V4.
