Maximum performance best practices

  • David
    replied
    Once you get to the point of dealing with TB of data, everything starts to get exponentially harder and slower. Small mistakes, bugs or experiments take days to sort out. Dealing with incremental updates can also be problematic and requires good planning.

    Some other ideas.

    1) Set up a development / staging environment so you aren't experimenting on the live system. Potentially you can even build the search index on the staging machine, then just move the index to the live machine (without ever indexing the live environment). You could build the dev environment on a single machine, i.e. SQL, web server and indexer all on the same box with a fast M.2 SSD. With no network latency and no background load the indexing might be 5x faster (just a wild guess).

    2) Write a script that pre-generates all your web pages as very simple HTML files (no need for CSS, JS, graphics or any significant formatting). Just include the headings and text. This could be done in any programming language that can query your SQL database. Then, once all the HTML files are made, run the Zoom indexer in offline mode on those simple HTML files. Rewrite the URLs and job done! This has a bunch of advantages: zero network latency, no load on your DB during indexing, much faster indexing, and easier incremental changes. With offline indexing it can be 10x faster.

    Regarding MasterNode: as of May 2016, that product has been discontinued and is no longer supported. There wasn't enough demand for really, really big indexes, so we stopped development. However, it remains open source should anyone want to use it; no special license is required. Despite being open source, we have had no code contributions for the last 5 years, and no recent testing has been done. So I wouldn't be very confident of MasterNode being a good solution.



  • BluejacketSoftware
    replied
    Reading the FAQ, I have some questions:

    Split indexing process over multiple machines


    If it makes sense, split the source files into categories and perform indexing on smaller portions of the data using separate machines, and thus greatly reducing the amount of time required to index the complete data set. Wrensoft provides a free software tool known as Zoom MasterNode that could be used as a front-end to these distributed index files so that they can be collectively searched. MasterNode works by taking any search request and transparently dividing the work amongst its slave node machines (where the various actual indexes are stored), which can result in better search performance and greater search capability.


    Question: Does this not require a higher level of license to run the indexer on multiple machines then?


    And I finished the web server troubleshooting. Turns out it was an infrastructure problem. My Hyper-V cluster had a hiccup, causing all kinds of trouble. Everything is back online now and I'm going to resume the index.



  • BluejacketSoftware
    replied
    Hi David. Years behind? I just purchased and downloaded last week?

    I will attempt to get a crash dump for you should it fail again.

    The data is in the SQL server, but pages have to be assembled from blocks, which is all done in software, so I don't think there's a path to connect directly to the database (if that's where you were going). As it stands, I have a search feature built into the site, but unfortunately it relies on the SQL Server Full Text Search feature, and since there are hundreds of thousands of pages broken down into blocks (i.e. paragraphs, headings, etc.) the search is slow (the database is almost 3/4 TB).

    I specifically purchased Zoom to give me a fast, type-ahead search function to replace the slow, labored version. I don't necessarily need it to be 'unstructured', but the features of Zoom far exceed what we could build in-house in a reasonable amount of time.

    I will read the FAQ for indexing large sites and see if there's anything in there that I can implement.

    We're a C# development house, so we're comfortable with getting our hands dirty with software, but this is clearly not written in a managed (i.e. .NET) language, so other than digging through logs and Event Viewer, there's not much we can deconstruct on our end.

    I have turned on logging at the diagnostic level so hopefully that will give some insight.

    Unfortunately, I just got word that the web server is not responding at the moment. Not sure if it's related to the indexer or not yet. Going in to troubleshoot now. If it turns out that the indexer caused the web server to stop responding, I'll let you know and see if I can replicate the problem with enough diagnostic information for you to get to the root of the problem.

    I looked at the server briefly already and noticed that the indexer was running but apparently stuck. I performed an IIS reset, verified all the app pools were indeed running, and then paused/resumed the indexer. That seemed to re-awaken the indexer and it started processing again for a while, but the other sites appear to be unavailable still. It might require a server reboot to bring everything back online.

    The only thing that's different from normal operations is that the indexer was restarted yesterday afternoon. I'm not prepared to blame Zoom yet though, not enough info. Could be many other things. I'll let you know if I find a root cause.

    Scott



  • David
    replied
    Yes, you were a couple of years behind on the patches. So there is some chance that might help. Hard to be sure, as we don't know the cause of the crash.
    Exception code 0xc0000005 means "Access Violation". It is the most common form of software error and means that the software wrote to a memory address that it shouldn't have. Typically this is due to a software bug, but a hardware fault can also cause the same error.

    If you can get a crash dump that can sometimes help.

    Turning on logging in Zoom can also help (from the Logging options window). Especially if the crash happens each time on the same web page / document. But it will slow down indexing a bit.

    There is also a lot of hardware out there that just isn't stable enough to run anything under heavy load for a week.

    We have this page in the FAQ for indexing large sites. Some of the tips might help.
    Incremental indexing in particular might help for your site, especially if the indexing job can be broken up into discrete parts (e.g. different domains).
    A lot of the indexing speed is dependent on the speed of the web site you are indexing. If each web page takes 10 seconds to generate, then indexing will be very slow. If your web site can serve 100 pages per second, then indexing can be much quicker.
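    To put rough numbers on that difference (the 300,000-page count below is a hypothetical stand-in for a site with hundreds of thousands of pages):

```python
# Back-of-envelope indexing-time estimate at two page-serving rates.
PAGES = 300_000  # hypothetical page count

slow_seconds = PAGES * 10   # 10 seconds to generate each page
fast_seconds = PAGES / 100  # server delivers 100 pages per second

print(f"10 s/page:   {slow_seconds / 86_400:.1f} days")  # → 34.7 days
print(f"100 pages/s: {fast_seconds / 60:.0f} minutes")   # → 50 minutes
```

    Same crawl, a difference of weeks versus under an hour, which is why page-generation speed (or offline indexing) dominates everything else.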

    Is all the data already in SQL in a nice structured format? Maybe you don't really need an unstructured text search engine in that case?



  • BluejacketSoftware
    started a topic Maximum performance best practices


    I have purchased the Enterprise version because I have a VERY LARGE website to index and would like to know how to configure the indexer to maximize its indexing speed. The server that hosts the website is a VM with 8 cores and dynamically expanding memory (right now it's at approximately 24GB). The VM is a Windows Server 2019 install with IIS serving the website, with a dedicated SQL Server as another VM on the network. That machine is also 8 cores with dynamically expanding RAM, also sitting at approximately 24GB. Both the web server and SQL Server use a SAN for storage.

    The indexer is running on the web server itself and has the processors sitting at around 66% use, with memory at around 60% in use and a disk queue length of about 0.5. (It's using 10 threads right now - is that bad? With 8 cores and Hyper-V, my server should handle up to 16 threads - no?) As I said, the website is exceedingly large and I've allowed the indexer in spider mode to run for almost a week, checking on it about twice a day. Unfortunately, when I checked on it today, I got a message indicating that some component of the indexer wasn't responding. I did not take a screenshot, so I can't tell you exactly what it said, but I can tell you that it forced the indexer to restart, losing all the progress.

    I just updated to the latest version, hoping that might randomly fix the issue I ran into and have restarted the index from scratch.

    Two things I would like to know: is there a server configuration guide that I could follow to maximize the speed of the spider/indexer? Also, are there any logs (Event Viewer or otherwise) that would shed some light on the failure I experienced, so I can avoid another failure in a week? The only thing listed in Event Viewer is this:

    Faulting application name: ZoomEngine64.exe, version: 8.0.1004.0, time stamp: 0x5ffb9ff7
    Faulting module name: ZoomEngine64.exe, version: 8.0.1004.0, time stamp: 0x5ffb9ff7
    Exception code: 0xc0000005
    Fault offset: 0x0000000000091774
    Faulting process id: 0x2ab0
    Faulting application start time: 0x01d741a4e1d757fb
    Faulting application path: C:\Program Files\Zoom Search Engine 8.0\ZoomEngine64.exe
    Faulting module path: C:\Program Files\Zoom Search Engine 8.0\ZoomEngine64.exe
    Report Id: e76edd89-1396-49b3-b293-30cb00974dbc
    Faulting package full name:
    Faulting package-relative application ID:

    Any help to get this working at best possible velocity and reliability would be greatly appreciated.

    R/
    Scott