
CGI and possible further speedups?


  • CGI and possible further speedups?

    Hello. I have a purely file-based intranet search set up using Zoom. We have about 30,000 files, mostly XLS and a few thousand DOCs.
    I have the system all set up just how I want it, except it's painfully slow. It's on a P4 2.8GHz w/1GB RAM, and it still takes 60 seconds or so for a basic query. I'm using the CGI engine, of course.

    I have a few questions regarding how we may speed the system up.

    First, is there any way (or could you modify the CGI, or whatever is needed) to have the engine skip indexing of numeric values? Since we are indexing XLS files, the total "unique" word count comes out to 1.7 million.
    While indexing the numbers would be nice if it were extremely fast (for quick employee ID number searches and the like), it is so slow that it's not worth the effort.
    I read the README with xlhtml.zip, and it seems that this is an open source program. I suppose I could modify and recompile it to ignore numeric values, if necessary.
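    Something along these lines is what I have in mind. This is only a rough sketch of my own (the idea of a filter step between the xlhtml conversion and the indexer is hypothetical, and real xlhtml output would probably need more careful handling):

        #!/usr/bin/env python
        # filter_numbers.py - strip purely numeric tokens from converted HTML
        # before it reaches the indexer. A rough sketch, not production code.
        import re
        import sys

        NUMERIC = re.compile(r'^[0-9][0-9.,:-]*$')   # "123", "1,234.56", "2005-01", ...

        def strip_numeric_tokens(html):
            # Rewrite only the text between tags, leaving the markup intact.
            def clean(m):
                kept = [w for w in m.group(0).split() if not NUMERIC.match(w)]
                return ' ' + ' '.join(kept) + ' '
            return re.sub(r'(?<=>)[^<]+(?=<)', clean, html)

        if __name__ == '__main__':
            sys.stdout.write(strip_numeric_tokens(sys.stdin.read()))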

    Also, would it be any faster if it ran on a true database backend? While I know your proprietary database system (whatever it may be) is probably faster for small sites, is there any chance of getting Zoom connected to MySQL, MSSQL, PostgreSQL, Firebird, etc. in the near future? That way you could pass off much of the query logic onto the optimized database server.
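    To illustrate the kind of thing I mean, here is a toy inverted index on SQLite. This is purely hypothetical (I have no idea what Zoom's actual format looks like; the schema and the search() function are just my own illustration of letting the database engine do the lookup work):

        import sqlite3

        # Toy inverted-index schema, purely illustrative; not Zoom's format.
        con = sqlite3.connect(':memory:')
        con.executescript("""
            CREATE TABLE words    (word_id INTEGER PRIMARY KEY, word TEXT UNIQUE);
            CREATE TABLE pages    (page_id INTEGER PRIMARY KEY, url TEXT);
            CREATE TABLE postings (word_id INTEGER, page_id INTEGER, hits INTEGER);
            CREATE INDEX idx_postings_word ON postings (word_id);
        """)

        def search(term):
            # The database's own B-tree index does the heavy lifting here.
            return con.execute("""
                SELECT p.url, po.hits
                FROM   words w
                JOIN   postings po ON po.word_id = w.word_id
                JOIN   pages p     ON p.page_id  = po.page_id
                WHERE  w.word = ?
                ORDER  BY po.hits DESC
            """, (term,)).fetchall()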

    Maybe I'm pushing your software too far and asking too much of a $99 product (when Google sells their Mini for $4,000 or so), but it really seems to have a lot of promise, even at the high end, and I appreciate the relatively open nature of it.

    Anything you can do will be appreciated.

    Thanks,

    Chuck

  • #2
    1.7 million unique words is a lot of words to search through. Just for our other readers, we should note that the English dictionary only contains around 40,000 unique words, and that we are talking about unique words rather than the total number of words in the files. It is unusual for a site's content to have this many unique words unless it contains large databases of serial numbers, product codes, etc.

    However, we should first confirm whether the size of the index correctly reflects the data, or whether other issues are causing the index to bloat out excessively. For example, it may be mistakenly indexing binary data (or malformed conversions of the XLS files), or indexing documents in a different codepage encoding (and failing to properly determine word boundaries). Is the search accessible online, and if so, can you give us the URL so we can take a closer look at it?

    Another thing to check would be whether you have enabled dots to join words in the "Indexing Options" tab of the configuration window. This (and the other word-join options) changes what is considered a single unique word: "123.1234.123" may be indexed as a single unique word (if dots are enabled) or as 3 different unique words (if dots are disabled). This setting may help reduce the number of unique words, as the sketch below shows.
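    To illustrate the effect (a simplified sketch, not our actual tokenizer):

        import re

        def tokenize(text, join_dots):
            # Simplified: letters/digits form words; dots optionally join them.
            pattern = r'[A-Za-z0-9.]+' if join_dots else r'[A-Za-z0-9]+'
            return re.findall(pattern, text)

        # 125 serial numbers built from only 5 distinct components
        serials = ' '.join('%d.%d.%d' % (a, b, c)
                           for a in range(5) for b in range(5) for c in range(5))

        print(len(set(tokenize(serials, join_dots=True))))    # 125 unique words
        print(len(set(tokenize(serials, join_dots=False))))   # 5 unique words

    With real serial numbers, the components repeat far more often than the full codes do, which is why toggling this option can shrink the unique word count dramatically.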

    Of course, if the data being searched really is that massive, then the above would not make much difference. We do plan to improve search performance for very large content, and will be researching further optimizations. We have some ideas in mind on how to speed up the searching process. Excluding numbers from the index would probably be a good short-term solution to your particular problem, and we will look into one of the above methods to address this.

    We also believe that our index database format is still advantageous compared to running over a third-party database backend such as MySQL. While there are further optimizations to be made, we believe its performance can match these backends because it is optimized specifically for our searching methods. We will continue optimizing in this area and will, of course, keep an eye on its comparative performance to ensure there is no disadvantage to not using an external database server.
    --Ray
    Wrensoft Web Software
    Sydney, Australia
    Zoom Search Engine



    • #3
      We've just performed some tests on the search performance of the CGI for a site of comparable size to your reported scenario, and the results were interesting. We think the performance issue is not with the CGI itself, which proved much faster in our tests than what you are seeing.

      First of all, we randomly generated a site containing 2 million unique words (in 20,000 files) and indexed this with the latest public build of Zoom (Version 4.0 Build 1016) using the CGI/Win32 option. We then put this on our local IIS test server (Athlon 3200+, 1 GB of RAM) and performed some searches.
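      For readers who want to reproduce a test set of this shape, a generator along these lines will do (a sketch only, not the exact tool we used):

          import os

          # Generate 20,000 files of 100 unique words each: 2 million unique
          # words in total. A sketch only, not the exact benchmark tool.
          os.makedirs('testsite', exist_ok=True)
          word = 0
          for n in range(20000):
              with open('testsite/page%05d.html' % n, 'w') as f:
                  f.write('<html><body>\n')
                  for _ in range(100):
                      f.write('w%07d ' % word)      # w0000000 ... w1999999
                      word += 1
                  f.write('\n</body></html>\n')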

      For a single-word query, our search times averaged in the 1 to 3 second range. Resource-intensive wildcard searches took 5 to 7 seconds (this includes the worst-case scenario of searching for every word containing the letter 'e').

      This seems to be vastly different from the 60 second searches you were getting, and we suspect that there is something else causing the degraded performance on your server. While our test server has a marginally faster CPU, the difference in performance should not be so significant.

      Can you tell us if you are running in a shared hosting environment? If your server hosts other websites, it may be under heavy load from other running tasks or processes.

      Other things to check would be if there are other server settings which may be affecting performance, such as average CPU load, restrictions on CPU time, caching issues, antivirus software, etc.
      --Ray
      Wrensoft Web Software
      Sydney, Australia
      Zoom Search Engine
