But I don't think you'll be able to customise our CGI code to make it significantly faster than it already is (not without removing some functionality). And you'll need a good level of C/C++ knowledge before you can make any changes.
The main speed increase that we think is available to the CGI is switching to 'FastCGI'. But most web servers do not support it, so it would benefit only a few people (it is something we are looking at for V6 nonetheless).
Note that you don't need to purchase the SDK to use the CGI option.
We are using the CGI version, and implementing FastCGI would help us out.
We did get the SDK, and are looking at it.
Are there any quick tips if, say, we wanted to store all of the data in a database? We're examining the possibility of exporting the Zoom output to database tables, which we could query to grab the information we want more quickly.
Maybe you should describe the entire problem that you have: what hardware you have, how many documents you have, what your current search times are, and so on.
The CGI option is already very fast, searching 300,000 pages in 0.3 seconds (on hardware that is 3 years old). I would think that anything you do with a SQL-style database will only slow things down.
The box we use to search is a Windows 2003 2.5GHz machine with 4GB RAM. Normal CPU usage is under 6%, and page file usage is < 725MB.
We're seeing search times (depending on the word/phrase) anywhere from .25 to 12 seconds just for the search.cgi portion. Our .zdat files are anywhere between 30 and 150MB and the largest index contains 113k pages. Going by the "300k pages in 0.3 seconds" figure, I'd expect a sub-second response, but that's not what we're seeing.
We have the fast-slow slider set to slow (since we need accuracy for phrase searches). Setting this to fast did improve the search time, but decreased our accuracy too much.
Not only do we get results from Zoom, but we also have additional processing that must be done before results are returned to the user, so the final time from user request to display is anywhere from .5 to 20 seconds...unacceptable of course.
There are other ways we're working to decrease the time, but a major portion is the search.cgi, which is why we're looking into this option.
We have a lot of experience in using the SDK - we have it running on Linux ppc, ia64, x86, and x86_64. In all our experience, the speed comes from proper indexing. The CGI is very fast and we haven't needed to edit it other than getting the includes to work correctly with the different architectures. We have zdat files up to ~500MB, and we run it on top of the SHTTPD web server.
Much of the slowdown we see is the initial loading of zoom_dictionary.zdat by the web server, which might take a couple of seconds and occurs only the first time; after that, every search completes in a fraction of a second.
Most of the slow down would be the web server serving and loading, not the CGI (in my experience).
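If the one-off loading cost is the main issue, one thing that could be tried is reading the index files once at boot, so the operating system already has them in its file cache before the first real search arrives. Below is a minimal Python sketch of that idea. The function name `warm_file` and the example path are made up for illustration, and it assumes the machine has enough free RAM for the OS to keep the file cached. It is not part of Zoom or the SDK.

```python
from pathlib import Path

def warm_file(path, chunk_size=1024 * 1024):
    """Read a file sequentially in chunks and return the number of bytes read.

    The data is discarded; the only goal is to make the OS pull the file
    into its file cache (assumption: there is enough free RAM for that).
    """
    total = 0
    with Path(path).open("rb") as handle:
        while True:
            chunk = handle.read(chunk_size)
            if not chunk:
                break
            total += len(chunk)
    return total

# Hypothetical usage from a boot-time script:
# warm_file(r"C:\search\zoom_dictionary.zdat")
```

Whether this helps depends on how long the OS keeps the file cached, so it is worth timing the first search with and without it.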
D
Last edited by dbuck; Mar-25-2008, 06:20 PM.
Reason: spelling
If you 1) have a large index, 2) have selected Slow for deep searches in the configuration, 3) are doing exact phrase searches, and 4) this is the first search and nothing has been cached, then yes, search times can stretch into seconds.
The slightly simplified explanation is that exact phrase searches require a lot of extra disk access compared to single word, or even multiple word, searches. The word order is not stored in the fast (in RAM) part of the index, and we need word order for good exact phrase searches. The word order comes from the document context, which is left on the disk. Exact phrases that contain common words (that occur in many documents) are especially bad, and even worse with the "Slow" setting.
The good news is that most users don't do searches like this. On our web site, 90% of searches were for a single word and only 1% were exact phrases.
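To illustrate the general idea (this is a toy Python sketch, not how the Zoom CGI is actually implemented), imagine an in-memory index that only records which documents contain each word. A single word search is answered from memory alone, while an exact phrase search has to load the text of every candidate document to check the word order. The names `build_index`, `phrase_search` and `load_text` are invented for this example, and `load_text` stands in for the slow disk read.

```python
from collections import defaultdict

def build_index(docs):
    """Map each word to the set of document ids containing it (no positions)."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index

def phrase_search(phrase, index, load_text):
    """Return (matching doc ids, number of document texts that had to be loaded)."""
    words = phrase.lower().split()
    if not words:
        return [], 0
    # Step 1: cheap in-memory step, find documents containing every word.
    candidates = set.intersection(*(index.get(w, set()) for w in words))
    # Step 2: the index has no word order, so each candidate's text must be
    # fetched (the expensive "disk" step) and checked for the exact phrase.
    target = " " + " ".join(words) + " "
    matches = []
    for doc_id in sorted(candidates):
        text = " " + " ".join(load_text(doc_id).lower().split()) + " "
        if target in text:
            matches.append(doc_id)
    return matches, len(candidates)

docs = {
    1: "the quick brown fox",
    2: "the brown quick fox",
    3: "a quick brown dog",
    4: "slow green turtle",
}
index = build_index(docs)
matches, loads = phrase_search("quick brown", index, docs.__getitem__)
print(matches, loads)
```

The more documents a common word appears in, the more candidates survive step 1 and the more texts step 2 has to load, which is why phrases made of common words are the worst case.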
FastCGI won't help much with this scenario.
So some possible solutions would be:
1) Don't use the slow setting.
2) Set up two servers, the deep search server and the fast search server. Then let the user decide. But as noted, for 99% of searches this will make no difference.
3) Do a near total re-write on the CGI to improve performance in this scenario. But I think this would reduce the speed in other scenarios (like the common single word search).
4) Drop the Zoom index files on a small solid state drive. They are cheap now. Most of the disk access is seeking. The solid state drives are 100x faster for seeking. Or even better, allocate 1GB of your RAM as a RAM disk and copy the Zoom index onto the RAM disk upon boot.
I like option 4. It is an easy solution that should have a major impact in this particular scenario. If you do decide to do this, please post the results.
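For the copy-on-boot step, something along these lines could be run from a startup script. It is only a rough Python sketch: the function name `copy_index` and the example paths are hypothetical, it assumes the RAM disk is already created and mounted by the OS, and it assumes the index consists of the .zdat files mentioned earlier in this thread. The search CGI would still need to be set up to read the index from the RAM disk location.

```python
import shutil
from pathlib import Path

def copy_index(src_dir, dst_dir, pattern="*.zdat"):
    """Copy index files matching `pattern` from src_dir to dst_dir.

    Returns the sorted list of file names that were copied.
    """
    src, dst = Path(src_dir), Path(dst_dir)
    dst.mkdir(parents=True, exist_ok=True)
    copied = []
    for item in sorted(src.glob(pattern)):
        if item.is_file():
            # copy2 also preserves timestamps, which makes later checks easier.
            shutil.copy2(item, dst / item.name)
            copied.append(item.name)
    return copied

# Hypothetical usage, with R: being the RAM disk:
# copy_index(r"C:\inetpub\search", r"R:\search")
```

A RAM disk is emptied on every reboot, so the master copy of the index should stay on the normal disk and the copy should be repeated after each re-index.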
Last edited by David; Mar-26-2008, 10:25 PM.
Reason: Added point 4) about caching