
CGI and possible further speedups?


  • CGI and possible further speedups?

    Hello. I have a purely file-based intranet search set up using Zoom. We have about 30,000 files, mostly XLS and a few thousand DOCs.
    I have the system all set up just how I want it, except it's painfully slow. It's on a P4 2.8GHz w/1GB RAM, and it still takes 60 seconds or so for a basic query. I'm using the CGI engine, of course.

    I have a few questions regarding how we may speed the system up.

    First, is there any way (or could you modify the CGI, or whatever is needed) to have the engine skip indexing of numeric values? Since we are indexing XLS files, the total "unique" word count comes out to 1.7 million.
    While indexing the numbers would be nice if it were extremely fast (for quick employee ID number searches and the like), it is so slow that it's not worth the effort.
    I read the README with xlhtml.zip, and it seems that this is an open source program. I suppose I could modify and recompile it to ignore numeric values, if necessary.
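    Something along these lines is what I have in mind. This is only a rough sketch of my own (the idea of a filter step between the xlhtml conversion and the indexer is hypothetical, and real xlhtml output would probably need more careful handling):

        #!/usr/bin/env python
        # filter_numbers.py - strip purely numeric tokens from converted HTML
        # before it reaches the indexer. A rough sketch, not production code.
        import re
        import sys

        NUMERIC = re.compile(r'^[0-9][0-9.,:-]*$')   # "123", "1,234.56", "2005-01", ...

        def strip_numeric_tokens(html):
            # Rewrite only the text between tags, leaving the markup intact.
            def clean(m):
                kept = [w for w in m.group(0).split() if not NUMERIC.match(w)]
                return ' ' + ' '.join(kept) + ' '
            return re.sub(r'(?<=>)[^<]+(?=<)', clean, html)

        if __name__ == '__main__':
            sys.stdout.write(strip_numeric_tokens(sys.stdin.read()))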

    Also, would it be any faster if it ran on a true database backend? While I know your proprietary database system (whatever it may be) is probably faster for small sites, is there any chance of getting Zoom connected to MySQL, MSSQL, PostgreSQL, Firebird, etc. in the near future? That way you could pass off much of the query logic onto the optimized database server.
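    To illustrate the kind of thing I mean, here is a toy inverted index on SQLite. This is purely hypothetical (I have no idea what Zoom's actual format looks like; the schema and the search() function are just my own illustration of letting the database engine do the lookup work):

        import sqlite3

        # Toy inverted-index schema, purely illustrative; not Zoom's format.
        con = sqlite3.connect(':memory:')
        con.executescript("""
            CREATE TABLE words    (word_id INTEGER PRIMARY KEY, word TEXT UNIQUE);
            CREATE TABLE pages    (page_id INTEGER PRIMARY KEY, url TEXT);
            CREATE TABLE postings (word_id INTEGER, page_id INTEGER, hits INTEGER);
            CREATE INDEX idx_postings_word ON postings (word_id);
        """)

        def search(term):
            # The database's own B-tree index does the heavy lifting here.
            return con.execute("""
                SELECT p.url, po.hits
                FROM   words w
                JOIN   postings po ON po.word_id = w.word_id
                JOIN   pages p     ON p.page_id  = po.page_id
                WHERE  w.word = ?
                ORDER  BY po.hits DESC
            """, (term,)).fetchall()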

    Maybe I'm pushing your software too far and asking too much of a $99 product (when Google sells their Mini for $4,000 or so), but it really seems to have a lot of promise, even at the high end, and I appreciate the relatively open nature of it.

    Anything you can do will be appreciated.

    Thanks,

    Chuck

  • #2
    1.7 million unique words is a lot of words to search through. Just for our other readers, we should note that the English dictionary only contains around 40,000 unique words, and that we are talking about unique words rather than the total number of words in the files. It is unusual for a site's content to have this many unique words unless it contains large databases of serial numbers, product codes, etc.

    However, we should first confirm whether the size of the index correctly reflects the data, or whether other issues are causing the index to bloat out excessively. For example, it may be mistakenly indexing binary data (or malformed conversions of the XLS files), or indexing documents in a different codepage encoding (and failing to properly determine word boundaries). Is the search accessible online, and if so, can you give us the URL so we can take a closer look at it?

    Another thing to check would be whether you have enabled dots to join words in the "Indexing Options" tab of the configuration window. This (and the other word-join options) changes what is considered a single unique word: "123.1234.123" may be indexed as a single unique word (if dots are enabled) or as 3 different unique words (if dots are disabled). This setting may help reduce the number of unique words, as the sketch below shows.
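    To illustrate the effect (a simplified sketch, not our actual tokenizer):

        import re

        def tokenize(text, join_dots):
            # Simplified: letters/digits form words; dots optionally join them.
            pattern = r'[A-Za-z0-9.]+' if join_dots else r'[A-Za-z0-9]+'
            return re.findall(pattern, text)

        # 125 serial numbers built from only 5 distinct components
        serials = ' '.join('%d.%d.%d' % (a, b, c)
                           for a in range(5) for b in range(5) for c in range(5))

        print(len(set(tokenize(serials, join_dots=True))))    # 125 unique words
        print(len(set(tokenize(serials, join_dots=False))))   # 5 unique words

    With real serial numbers, the components repeat far more often than the full codes do, which is why toggling this option can shrink the unique word count dramatically.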

    Of course, if the data being searched really is that massive, then the above would not make much difference. We do plan to improve search performance for very large content, and will be researching further optimizations. We have some ideas in mind on how to speed up the searching process. Excluding numbers from the index would probably be a good short-term solution to your particular problem, and we will look into one of the above methods to address this.

    We also believe that our index database format is still advantageous compared to running over a third-party database backend such as MySQL. While there are further optimizations to be made, we believe its performance can match these backends because it is optimized specifically for our searching methods. We will continue optimizing in this area and will, of course, keep an eye on its comparative performance to ensure there is no disadvantage to not using an external database server.
    --Ray
    Wrensoft Web Software
    Sydney, Australia
    Zoom Search Engine



    • #3
      We've just performed some tests on the search performance of the CGI for a site of comparable size to your reported scenario, and the results were interesting. We think the performance issue is not with the CGI itself, which proved much faster in our tests than what you are seeing.

      First of all, we randomly generated a site containing 2 million unique words (in 20,000 files) and indexed this with the latest public build of Zoom (Version 4.0 Build 1016) using the CGI/Win32 option. We then put this on our local IIS test server (Athlon 3200+, 1 GB of RAM) and performed some searches.
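      For readers who want to reproduce a test set of this shape, a generator along these lines will do (a sketch only, not the exact tool we used):

          import os

          # Generate 20,000 files of 100 unique words each: 2 million unique
          # words in total. A sketch only, not the exact benchmark tool.
          os.makedirs('testsite', exist_ok=True)
          word = 0
          for n in range(20000):
              with open('testsite/page%05d.html' % n, 'w') as f:
                  f.write('<html><body>\n')
                  for _ in range(100):
                      f.write('w%07d ' % word)      # w0000000 ... w1999999
                      word += 1
                  f.write('\n</body></html>\n')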

      For a single-word query, our search times averaged in the 1 to 3 second range. Resource-intensive wildcard searches took 5 to 7 seconds (this includes the worst-case scenario of searching for every word containing the letter 'e').

      This seems to be vastly different from the 60 second searches you were getting, and we suspect that there is something else causing the degraded performance on your server. While our test server has a marginally faster CPU, the difference in performance should not be so significant.

      Can you tell us if you are running in a shared hosting environment? If your server hosts other websites, it may be under heavy load from other running tasks or processes.

      Other things to check would be if there are other server settings which may be affecting performance, such as average CPU load, restrictions on CPU time, caching issues, antivirus software, etc.
      --Ray
      Wrensoft Web Software
      Sydney, Australia
      Zoom Search Engine
