hi,
we use Zoom Search v6 Enterprise, and I have to crawl a large e-commerce site with nearly 60,000 items/pages.
I use some custom datafields for article number, EAN, price, and manufacturer.
I stripped out as much as possible via ZOOMSTOP/ZOOMRESTART, use a lot of noindex,follow meta tags, and adjusted the robots.txt so that only the needed article data gets indexed. I also use the CGI option.
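For illustration, the relevant bits of the page templates look roughly like this (the datafield names and values are placeholders for my setup, not the real shop markup):

```html
<!-- Category/listing pages: follow the links to the articles, but don't index the page itself -->
<meta name="robots" content="noindex, follow">

<!-- Article pages: custom datafields as meta tags (names are just examples) -->
<meta name="articlenumber" content="A-12345">
<meta name="ean" content="4006381333931">
<meta name="price" content="19.99">
<meta name="manufacturer" content="ExampleBrand">

<!-- Article pages: boilerplate wrapped so the indexer skips it -->
<!-- ZOOMSTOP -->
  ...navigation, footer, cross-selling boxes...
<!-- ZOOMRESTART -->
<h1>Article title and description text that should be indexed</h1>
```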
So far so good.
Indexing the whole site takes around two hours on the first crawl.
Sending a search query against this dataset takes up to 2 seconds. That seems like a lot, since your comparison of crawling/searching large websites lists lower query times.
On the other hand, I suspect it's because all the technical data being indexed bloats the index files. How can I optimize this?
So this is all OK when building the index for the first time. But I have one problem.
I tried incremental indexing, and it takes too long to add new and changed pages (and changes happen every 15 minutes).
The pages return a proper Last-Modified header, so it doesn't re-index the whole site, but it's still slow.
What can I do?
I saw the option to provide a text file with new/changed pages and use console mode.
Would it be faster to provide all changed/new pages in this text file?
I suppose I can't provide a URL that returns the list of all new pages?
I could set up a simple SQL query to print out all pages that are newer than the zoom_index file time.
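The SQL side would be trivial; a rough sketch (table and column names are made up, and :index_mtime would be the file time of the existing zoom_index files, passed in by whatever script runs the query):

```sql
-- Placeholder schema: a `products` table with a `last_modified` timestamp
-- and a `url_path` column holding each article page's path.
SELECT CONCAT('https://shop.example.com', url_path) AS url
FROM products
WHERE last_modified > :index_mtime
ORDER BY last_modified;
-- The output (one URL per line) would be dumped into the text file handed to the indexer.
```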
What is the best strategy to handle this case:
re-indexing all new/updated files every 15-30 minutes, provided via HTTP as a text file?
Thanks, and sorry for all the blabla ^^ I just wanted to describe my scenario as thoroughly as possible.