64bit Edition of Zoom Search Engine


  • David
    replied
    This is getting rather off topic (as it has nothing to do with 64bit). But anyway, the CGI is a native compiled executable written in C++, not a script. We do sell the C++ source code as part of an SDK here, so you can edit it and re-compile it if you know C++. The CGI is slightly more complex than the PHP and ASP code, as we spent a fair amount of time optimizing it for very large sets of files (1M+ documents).



  • jcbeck
    replied
    I see this now... now that this is live (with just 18k documents), we're seeing queries that take many seconds in ASP (25+ seconds). Is there some way to customize the CGI the way I am customizing the ASP (http://www.wrensoft.com/forum/showthread.php?t=2712) to filter results based on the URL of the result and a variable set based on a cookie value?

    EDIT: From your forums, it appears my only option is to enable XML output, which I can parse and filter: http://www.wrensoft.com/forum/showthread.php?t=1461. Since the CGI search with XML output is very fast, the remaining filtering should be reasonably fast within ASP, so I can apply my filters there and update the match counts accordingly before generating the finished output.
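    For illustration, here is a rough sketch of that filter-then-recount step. It is only a sketch: the Result struct, the field names, the filterByPrefix helper and the example URLs are all hypothetical, it does not reflect Zoom's actual XML schema, and the real filtering would be done in ASP against the parsed XML output rather than in C++.

    #include <iostream>
    #include <string>
    #include <vector>

    struct Result {
        std::string url;   // URL of the matched document (hypothetical field)
        int matches;       // per-result match count (hypothetical field)
    };

    // Keep only results whose URL starts with the prefix allowed for this
    // visitor (derived from a cookie in the setup described above), and
    // re-total the match count so the displayed figure stays consistent.
    static int filterByPrefix(std::vector<Result>& results,
                              const std::string& allowedPrefix) {
        std::vector<Result> kept;
        int totalMatches = 0;
        for (const Result& r : results) {
            if (r.url.compare(0, allowedPrefix.size(), allowedPrefix) == 0) {
                kept.push_back(r);
                totalMatches += r.matches;
            }
        }
        results.swap(kept);
        return totalMatches;
    }

    int main() {
        std::vector<Result> results = {
            {"http://www.example.com/public/a.html", 3},
            {"http://www.example.com/private/b.html", 5},
        };
        int total = filterByPrefix(results, "http://www.example.com/public/");
        std::cout << results.size() << " results, " << total << " matches\n";
        return 0;
    }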
    Last edited by jcbeck; Aug-01-2008, 06:09 PM.



  • Ray
    replied
    Please note that this is stated in the message box when you configure a limit above 65,500 pages, and also on our "Which Edition?" page:

    *Dependent on memory and hardware resources available. More details here. Sites containing over 65,000 pages in total will need to use the higher-performance CGI platform option.



  • David
    replied
    jcbeck, this is not a 64bit issue.
    The problem is with the ASP language itself, which has terrible performance. So we limit the ASP script to 65,000 files in order to stop search times from being minutes long and putting a huge load on the server. If you switch to using the CGI option you'll get a 10-fold improvement in performance, higher capacity, and much less server load.



  • jcbeck
    replied
    ASP Limitation

    While the professional version claims to support 200,000 documents, it won't allow me to configure it to index more than 65,500 pages. Is this a 32-bit ASP issue? Would 64-bit Zoom using ASP on 64-bit Server 2003 allow a higher limit?



  • dbuck
    replied
    Just wanted to mention our success and show that getting the source code to the CGI was beneficial:

    We have successfully compiled, tested, and run the CGI on Linux 2.6 - x86, x86_64, ppc, and ia64; as well as Win32 and MacOSX - ppc.

    We had to adjust the Makefiles a little (especially for the ppc variants) and some includes in the source. But it is stable and we are happy.

    Thanks Zoom,
    D



  • David
    replied
    To date, the only 64bit platform we have compiled search.cgi for is Sun / Sparc hardware.

    We do, however, sell the source code for search.cgi if you want to re-compile it yourself.

    All 64bit Linux systems should (in theory) be able to run 32bit binaries, if configured correctly. However, I know from experience that different Linux distributions have serious binary compatibility issues.



  • andersmeyer
    replied
    Need 64 bit version

    Hi there,

    I am looking for a 64 bit version of search.cgi as I currently get a "/lib/ld-linux.so.2: bad ELF interpreter" error when trying to execute search.cgi. I tried symlinking from my lib64 to lib but that does not work.
    Do you have any 64 bit versions?

    Regards

    Anders



  • David
    replied
    Betty,

    If you are just asking about the future 64bit version of the software because you have a 64bit operating system, then you don't really need it.

    The 32bit version of the software will run fine on 64bit Windows (up to the memory limits, etc. discussed above).

    As far as we know, there are only one or two customers who would actually benefit from 64bit at the moment. And the above discussion hasn't really changed that, as no one seems to be running out of RAM. So demand for 64bit isn't strong (yet).

    The file system limit isn't really related to 64bit; it is more related to continued support for old operating systems and the current structure of internal pointers in the index files. In other words, files greater than 4GB are already possible in most versions of 32bit Windows and in newer versions of 32bit Linux.

    There is no date for the release of a 64bit version as yet. We are thinking that in addition to doing a 64bit code release we also need to change the index file formats (to remove both of the limits discussed above, rather than just one of them). But this is a bigger job.
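    To illustrate that last point about internal pointers (purely as a sketch; this is not Zoom's actual index layout), a 32-bit offset stored inside an index file can only address about 4GB, so removing the limit means widening the offsets and therefore changing the file format:

    #include <cstdint>
    #include <cstdio>

    // Hypothetical index entry layouts, for illustration only.
    struct Entry32 { uint32_t offset; };  // can address bytes 0 .. 2^32 - 1 (about 4 GB)
    struct Entry64 { uint64_t offset; };  // removes the 4 GB ceiling, but changes the file format

    int main() {
        std::printf("Ceiling with 32-bit offsets: %llu bytes\n",
                    (unsigned long long)UINT32_MAX + 1ULL);
        return 0;
    }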



  • betty
    replied
    Will the professional version have a 64bit edition too? When will it be released?



  • Ray
    replied
    Originally posted by will View Post
    Yes it is the file system limit not RAM I was referring to.

    Can you not allow 3GB for CGI? Seeing as you have separate Windows and Linux options for the CGI version.
    This is already the case. The 2 GB filesystem limit only applies to the Linux and BSD versions of the CGI output. Have you tried indexing with CGI/Win32 selected in the Indexer? What is the exact error message you are seeing when the Indexer hits the limit?

    Originally posted by will View Post
    - I'm not trying to index 100MB PDFs
    - I'm not trying to index the whole internet (just 100,000 PDFs)

    Av words per file is ~5000
    Av unique words per file ~ 40
    So that's about 4 million unique words. That's a large amount, despite the relatively small number of documents. Our references to the number of pages/files are approximations based on average document sizes. Your PDFs are clearly quite large, due to the number of chemical names and numeric values involved.

    Originally posted by valt View Post
    Without this flag your app is limited to 2GB of virtual memory, and this is not enough for indexing non-English text (chemistry-related articles, in our case) where the dictionary grows quickly. For our purposes, extending it to 3GB would suffice. All you have to do is specify the /LARGEADDRESSAWARE flag (if compiling with the Microsoft compiler). This could even be done without relinking, using some kind of binary editor, but it seems you're checking a CRC on startup to prevent file modifications. The good thing about this change is that you could get a 1.5× capacity improvement for free!
    According to Will above, the limit he is hitting is the filesystem limit, rather than the memory limit. So this would not get around that.

    While it is possible to compile the application with the above-mentioned linker flag enabled and extend it to use up to 3GB of virtual memory (in XP and Server 2003 only), it is not without risk or cost. The fact that Windows requires you to specify this flag (and apply a change in the boot.ini file), rather than making this the default behaviour, itself suggests that complications can occur and that such a change would need extensive testing. We also have internal memory management in some optimized areas of the code that may complicate things. On top of that, it is quite possible that the filesystem limitation will be reached before the memory limit, as seen in Will's scenario above.

    Having said that, if you like, we could provide you with a test build with this flag enabled to see if it helps much in your situation. We would be interested to hear the results, since we have not tested it with data of this size. Please e-mail us for more information.
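    For anyone experimenting along these lines, a minimal sketch (assuming the Microsoft toolchain; this is not Zoom's actual build configuration) of how the flag can be enabled at build time. On 32-bit XP / Server 2003 the /3GB switch in boot.ini is also needed, as noted above.

    // Embedding the linker option in source marks the resulting executable
    // as large-address-aware; passing /LARGEADDRESSAWARE on the link
    // command line does the same thing.
    #pragma comment(linker, "/LARGEADDRESSAWARE")

    #include <cstdio>

    int main() {
        // The flag changes nothing in the code itself; it only tells
        // Windows the process can cope with addresses above 2 GB.
        std::printf("Built as large-address-aware.\n");
        return 0;
    }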



  • valt
    replied
    2GB limit for 32-bit applications

    Guys

    The problem is that ZoomIndexer.exe is linked without the IMAGE_FILE_LARGE_ADDRESS_AWARE bit set in its PE header. Please refer to the following for details: http://www.microsoft.com/whdc/system...AE/PAEmem.mspx.

    Without this flag your app is limited to 2GB of virtual memory, and this is not enough for indexing non-English text (chemistry-related articles, in our case) where the dictionary grows quickly. For our purposes, extending it to 3GB would suffice. All you have to do is specify the /LARGEADDRESSAWARE flag (if compiling with the Microsoft compiler). This could even be done without relinking, using some kind of binary editor, but it seems you're checking a CRC on startup to prevent file modifications. The good thing about this change is that you could get a 1.5× capacity improvement for free!
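    As a quick way to confirm whether a particular build has the bit set, here is a small sketch (assuming a Windows/MSVC environment; running dumpbin /headers on the executable shows the same information). It checks the running module's own header, not an arbitrary file on disk.

    #include <windows.h>
    #include <cstdio>

    // Sketch: inspect this process's PE header for the
    // IMAGE_FILE_LARGE_ADDRESS_AWARE bit discussed above.
    int main() {
        HMODULE base = GetModuleHandle(NULL);  // image base == IMAGE_DOS_HEADER
        auto dos = reinterpret_cast<PIMAGE_DOS_HEADER>(base);
        auto nt  = reinterpret_cast<PIMAGE_NT_HEADERS>(
            reinterpret_cast<BYTE*>(base) + dos->e_lfanew);
        bool largeAware = (nt->FileHeader.Characteristics &
                           IMAGE_FILE_LARGE_ADDRESS_AWARE) != 0;
        std::printf("Large-address-aware: %s\n", largeAware ? "yes" : "no");
        return 0;
    }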



  • will
    replied
    Yes it is the file system limit not RAM I was referring to.

    Can you not allow 3GB for CGI? Seeing as you have separate Windows and Linux options for the CGI version.

    Also:
    - I'm not trying to index 100MB PDFs
    - I'm not trying to index the whole internet (just 100,000 PDFs)

    Av words per file is ~5000
    Av unique words per file ~ 40

    The high unique word count is mostly due to people's names, chemical names and numbers in the documents.

    There is some rubbish in some of the PDFs but not enough to alter the situation.



  • Ray
    replied
    Originally posted by will View Post
    w.r.t. what Ray says: I'm pretty sure they are PDFs, but how do I know whether binary garbage is being indexed? Also, how would I stop this from happening?
    If they are just PDF files, then there should be no problem unless your PDF files were created unusually (e.g. with a third-party application besides Acrobat) and they contain an invalid searchable text layer. You can confirm this by opening "zoom_dictionary.zdat" in a text editor and seeing whether or not it contains a list of normal-looking words (do NOT modify this file, however, or you'll break functionality).

    What I was suggesting before was more related to cases where you might be indexing unrecognized file types (e.g. "myfile.abc") which Zoom would treat as a text/HTML file. Another possibility is that your server is not serving the PDF files correctly and serves them with a text/HTML content-type. I can't guess all the possibilities without seeing anything or having more details, though.

    How big are some of these PDF files? As we have mentioned several times in this thread, it would help if you could describe your project: the file types indexed, the number of files, the sizes of the files, whether you are indexing a multi-lingual site, whether you are indexing multiple sites, etc.



  • David
    replied
    I think you are confusing the 2GB file system limit and the 2GB Virtual memory limit.

    Once 2GB limit is reached I guess your software cuts off indexing.
    From this description I am guessing you are hitting the file system limit and not the virtual memory limit. If you are hitting the virtual memory limit, then either indexing will not start at all (with a message about lack of RAM), or indexing will fail rather catastrophically with a memory allocation failure of some sort.

    There is no workaround for the file system limit until we drop support for old Linux, at which point we can use 4GB index files.

    It would have helped if you had answered my other questions from my 2nd post.

