spidering speed

Collapse
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • spidering speed

    Hi guys
    We index 200,000 pages of our site, and it takes 12 hours or so.
    The spider also skips a lot of pages, maybe 200,000 or so, and every page contains at least 4 ZOOMSTOP/ZOOMRESTART sections, sometimes more.
    Do these affect the speed of the spider?

    I read in other threads that the main bottleneck is the spider's download speed. The file size of our pages has gone up over the last few weeks (by about 40 KB per page).
    Would that affect the speed?

    The spider is downloading the pages across the internet. I figure it would be faster if it could download them across our network. The spider is on a Windows server, and the website is on a Linux server. They are both in the same IP range.
    Is it possible to get the spider accessing the pages across the network instead of going via the internet? Is there a setting in Zoom that I need to change to do this? I can get my hosting company to help out.

    thanks

  • #2
    For the skipping, check whether you are skipping those pages via the Skip Options (page skip list), or via something else such as CRC checking, the Content Filter, or robots meta tags -- all of which require the spider to download the file entirely before it can decide to skip it.

    Make sure also that you do not have throttling enabled under "Configure"->"Spider options".

    Also check whether your robots.txt file specifies a crawl delay, which Zoom may be complying with.
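
    For example, a throttling directive in robots.txt typically looks like this (an illustrative sketch; the 10-second value is made up):

    Code:
    # robots.txt -- a 10 second pause between requests would make a
    # 200,000 page crawl take weeks, so remove or reduce it for your own indexer
    User-agent: *
    Crawl-delay: 10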

    Originally posted by boxoffice
    The spider is downloading the pages across the internet. I figure it would be faster if it could download them across our network. The spider is on a Windows server, and the website is on a Linux server. They are both in the same IP range.
    Is it possible to get the spider accessing the pages across the network instead of going via the internet? Is there a setting in Zoom that I need to change to do this? I can get my hosting company to help out.
    You can use a local IP address (from the local network) as your spider start URL, so the indexer will access the web server locally.

    You then go to "Configure"->"Indexing options" and check "Rewrite all indexed URLs..." and do a replacement of the local IP address with the external domain name. Click the "Help" button on that screen for more examples.
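
    For example (a hypothetical setup -- 192.168.0.50 and www.example.com stand in for your actual local IP and domain name):

    Code:
    Start URL:             http://192.168.0.50/
    Rewrite indexed URLs:  replace http://192.168.0.50/ with http://www.example.com/

    A page indexed as http://192.168.0.50/page.php would then appear in your search results as http://www.example.com/page.php.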
    --Ray
    Wrensoft Web Software
    Sydney, Australia
    Zoom Search Engine

    • #3
      thanks Ray,
      I am skipping pages using the skip list, plus I have CRC on, so that means it has to download those pages in their entirety.
      I have the spider throttle set to no throttling.
      And the robots.txt file doesn't have anything in it about throttling.
      I will try to get the hosts to set something up so the spider doesn't have to go out the router and then back in to the web server. Maybe the local IP will work.

      • #4
        Also what are the hardware specs of the machine doing the indexing?
        What is the current TCP/IP ping latency between the machines?
        What is the max download speed between the 2 machines?
        Is the server overloaded? (Might indexing at 3am help?)
        Are the pages dynamic, or are they mostly plain HTML / PDF? If the latter, offline mode will be faster.

        What is the URL of the site, so we can have a look?

        • #5
          Hardware specs
          Intel Xeon 2.6 GHz
          2GB RAM
          32-bit Windows Server 2007 Service Pack 2

          Ping time from the Windows server to the Linux web server - less than 6 milliseconds

          Max download speed - the host told me the 'connection' is 100 Mbps

          Server load - it does go up a bit. Our web server software is LiteSpeed, which is much better than Apache; it runs on 2 CPUs and the load is around 2 - 4. It might go up to 8 while the indexer is running, and we start to get a few 503 errors, maybe 1 every 5 - 10 minutes. I start the indexing at 2am and this helps a bit (we get fewer 503 errors), but since it takes about 12 hours to index 200k pages it runs into the main part of the day anyway.

          95% of the pages are .php and the other 5% are .shtml, so you could say they are all dynamic.

          This morning I kept the old sitemap.xml files (up to sitemap5) and just reindexed about 3k pages and reuploaded the .zdat files. This took Zoom about 30 minutes.
          If I look at Zoom's 'status' details, I see:
          Files indexed: 3023
          Files skipped: 14169
          Files filtered: 0
          Files downloaded: 3106

          I will PM you the URL if that's ok.

          • #6
            Sounds like pretty old hardware on the server. I assume the CPU doesn't have a model number. All the newer Xeons have model numbers, but the old ones didn't. They started making 2.6 GHz units in 2002, so it could be anywhere up to 9 years old now.

            If you are getting 503 errors (Service unavailable) then this is a sign that the server is really overloaded (or that there are other issues, like file corruption).

            Also, 2GB of RAM for a server machine is very low by today's standards. I just checked at Newegg: you can buy an additional 2GB of RAM for $12.99. It is a no-brainer of an upgrade for ~$13.

            I assume the load figures you are referring to are from the Linux 'uptime' command?
            Our Unix server typically runs with load like this,
            8:33PM up 29 days, 44 mins, 1 user, load averages: 0.20, 0.19, 0.20

            If you are getting load averages of 4 to 8, then this is way too high IMHO to provide good response times.

            In your 30-min test you were getting ~1.7 pages indexed per second, which implies that indexing 200,000 pages at this rate would take about 33 hours. So it must get quicker than this when you do the full job.

            By comparison, we were able to get ~11 pages per second in in-house testing on a local LAN, albeit on static HTML pages and with hardware that is now 2 years old.
            See,
            http://www.wrensoft.com/zoom/support...rge_sites.html

            You should also look into incremental indexing if you aren't already doing this. It can't work with all servers, but if it does work, it can save a lot of time.

            There is also an option for assisted incremental indexing. To do this, you write a custom script that queries your DB to get a list of new pages (e.g. from the last 24 hours). Then you feed this relatively small list of new and updated URLs into Zoom. Indexing might only last 1 minute in this case. Then maybe once a month you do a full re-index to pick up deleted content.
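
            As a rough sketch of that query script (hypothetical database, table, and column names -- a vBulletin-like schema is assumed here), the daily URL list could be generated on the web server like this:

            Code:
            # Run on the web server: dump URLs of threads updated in the last 24 hours
            mysql -u zoomuser -p forumdb -N -e "SELECT CONCAT('http://www.example.com/showthread.php?t=', threadid) FROM thread WHERE lastpost >= UNIX_TIMESTAMP(NOW() - INTERVAL 1 DAY)" > newpages.txt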

            Yet another option would be to back up the DB (I assume this already happens) and run the indexing process against the backup data on a separate server. It might be possible to do this in-house and thus not be dependent on the ISP's old, overloaded hardware. I would think with modern hardware you could get to ~20 pages / sec and index the entire site in 2 or 3 hours.

            • #7
              The model of the Xeon is E5530, sorry I didn't mention it before. Anyway, yes, it's not the most powerful server. We own it purely to run Zoom.

              Good news - I changed the connection so it uses a local IP and the local NIC. I edited the hosts file to map our domain name to the local IP, so that Zoom would use the new connection. (If I put the IP address directly into Zoom, it would only find one page, the cPanel / Apache page I think.) The time it takes to index 3000 pages has gone from 30 mins to 22 mins.
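
              For reference, the hosts file change described above would look something like this (192.168.0.50 stands in for the web server's actual local IP):

              Code:
              # C:\Windows\System32\drivers\etc\hosts on the indexing machine.
              # Map the public domain name to the web server's local IP so the
              # spider's requests stay on the LAN instead of going out the router.
              192.168.0.50    www.example.com.au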

              With incremental indexing, will it re-index a previously indexed page that has had content added to it? Many of our pages are forum threads, and if the indexer didn't check for new text on the page, we'd only be indexing the first post of each page.

              I don't think it would be possible for us to do the assisted incremental indexing. And doing the indexing on backup data on a separate server would require duplicating our current server and database. That could be a possibility, but it would be a lot of work; it might be better if I tweak the Zoom config.

              I notice when I run the indexer that in the 'status' window each thread is 'waiting' a fair bit, so that maybe only 1 or 2 threads are downloading a file at any one time. Does this mean they are waiting for the spider to find a file to download - that it's skipping files?

              I will look into getting a 64-bit server so we can run 64-bit Zoom; that should improve things, right?

              • #8
                The E5530 isn't such a bad CPU.

                With incremental indexing, will it re-index a previously indexed page that has had content added to it?
                Yes. Incremental indexing will potentially pick up all new and updated pages. This function uses the retrieved last-modified date and the file size. If these attributes are inaccurate, or do not reflect the changes to the file, then it will not be able to reliably find the files which have changed.

                See also these posts for more details,
                http://www.wrensoft.com/forum/showthread.php?t=1123
                http://www.wrensoft.com/forum/showthread.php?t=2991
                http://www.wrensoft.com/forum/showthread.php?t=2178

                How many threads do you have configured in Zoom? More threads will generally give better speed, but also place more load on the server and the network link. Maybe you could send us your Zoom configuration file.

                64-bit will help a lot if you are running out of RAM, as 64-bit systems can use more than 2GB of virtual address space and more than 4GB of physical RAM. If you aren't running out of RAM, there won't be much difference.

                If you have anyone who can write some code, you might want to look more deeply into the assisted incremental indexing option. The vBulletin forum software already has a function to dump out new posts in a date range. So you could get all the URLs from this function and then feed them into the indexer each day. For example, this link returns today's posts from our forum.
                http://www.wrensoft.com/forum/search.php?searchid=41808
                I know it is a lot of initial work, but the savings in CPU time are enormous once it is set up.

                • #9
                  I had a look at the configuration file that you sent. You are using 10 threads already.

                  In 1 min of indexing I was able to index 160 files (2.6 files / sec). My ping time to your server was 60ms. Our ADSL link speed is around 11 Mbit/sec.

                  So even though your latency is 10x lower and your link speed 10x higher, you were indexing fewer pages per second (1.7 files / sec). So this might warrant further study. It might be that your server was faster today, or it might be that there is some issue with the machine doing your indexing.

                  Also interesting is that to index these 160 files the spider had to visit 263 pages. There are many redirects happening on your site. Can these be avoided? Avoiding them would reduce the number of HTTP requests required to index the site and speed up indexing.

                  Take a look at this example from your site. In this case there is a double redirect, with two HTTP 301 status messages returned from your site.

                  So three HTTP requests are needed to download 1 page.
                  C:\> wget http://www.xxxxxx.com.au/servicesmidwives.php
                  --10:48:27-- http://www.xxxxxx.com.au:80/servicesmidwives.php
                  => `servicesmidwives.php'
                  Connecting to www.xxxxxx.com.au:80... connected!
                  HTTP request sent, awaiting response... 301 Moved Permanently

                  Location: http://www.xxxxxx.com.au:80/servicesmidwives.php/ [following]
                  --10:48:27-- http://www.xxxxxx.com.au:80/servicesmidwives.php/
                  => `index.html.18'
                  Connecting to www.xxxxxx.com.au:80... connected!
                  HTTP request sent, awaiting response... 301 Moved Permanently

                  Location: http://www.xxxxxx.com.au/directory/find/midwives/ [following]
                  --10:48:28-- http://www.xxxxxx.com.au:80/directory/find/midwives/
                  => `index.html.18'
                  Connecting to www.xxxxxx.com.au:80... connected!
                  HTTP request sent, awaiting response... 200 OK
                  Length: unspecified [text/html]
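
                  For what it's worth, collapsing a chain like this usually means redirecting straight to the final URL in a single hop. A hypothetical Apache-style rule (LiteSpeed also reads .htaccess files), assuming the mapping shown in the log above:

                  Code:
                  # .htaccess sketch: send the old URL straight to its final destination
                  # in one 301, instead of via the intermediate trailing-slash URL
                  RewriteEngine On
                  RewriteRule ^servicesmidwives\.php/?$ /directory/find/midwives/ [R=301,L]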

                  • #10
                    Thanks for testing the indexing on your computer. You had it indexing faster than us, even with the higher latency and slower link you mentioned. Maybe there is something wrong with the computer that runs our indexer.
                    And you're right about the redirects; I noticed them too after going through the logs. I will rewrite our site to address them.

                    I have run a few tests which have led to some more questions.

                    1 - I turned off 'Spider options' -> 'Reload all files' and then ran the indexer 4 times, indexing 5000 pages each time. I expected that at the end I would have a 20,000-page search index, but it seems that the same pages were indexed each time. The logs said as much, and the .zdat files and XML files were the same size.
                    Is there something else I need to do to use the cache, so it doesn't reindex the same pages each time the indexer is run?

                    2 - Maybe the computer that was doing the indexing is slow, so I thought I'd run the indexer on my local computer. After installing Zoom (the trial version, so I can only index 50 pages) I can't get it to actually run. When I press 'Start indexing' it gives me the error 'check that the url satisfies the settings in the configuration window'. The settings in 'Start options', numbers 1, 2 and 3, are the same as on the other indexing computer. Is this issue something to do with me installing Zoom on my local computer?

                    3 - I still have Zoom V5 installed, so I tried to make the settings the same on both V5 and V6 and then ran each one at separate times, indexing 1000 pages. V5 took 3 min 30 sec and skipped 139,609 files; V6 took 7 min 40 sec and skipped 39,463.
                    I'll PM you the config files, because I expected V6 to be faster. Maybe I've overlooked a setting.

                    thanks

                    • #11
                      In this version of the Zoom software the cache is shared with IE. This will change in the next major release. But for the moment the cache isn't very big for most people, and in many cases web servers give instructions not to cache pages. So the cache won't do much once you get beyond a few hundred pages (and might do nothing at all, depending on your site).

                      Caching will not affect which pages are indexed. It only affects whether the pages are downloaded or read from the local disk.

                      Zoom should work fine from your local PC. Try loading up the config file you are using on the other server (just remember to turn off auto-upload before you start indexing). Another possible issue is a firewall blocking internet access on your local machine.

                      I tried indexing with V5 and V6 using your config files. I found V6 to be faster over the first 100 pages.

                      V5 Run1: 2.4 pages / sec
                      V6 Run1: 4.1 pages / sec
                      V5 Run2: 2.6 pages / sec
                      V6 Run2: 4.6 pages / sec

                      Note however that the spider behaviour isn't identical between releases, so pages might be indexed in a different order and HTML may be parsed slightly differently.

                      As this contradicts your result, I also tested over the first 1000 pages.

                      In both V5 and V6 the indexer spent a lot of time waiting for your site to serve event pages from your DB, like this,
                      http://www.xxxxxxx.com.au/kids-activities/events/index.php?com=detail&eID=2538

                      Times for 1000 pages were:
                      6 min 46 sec in V6
                      7 min 39 sec in V5

                      So I don't think going back to V5 will help.

                      Version 5
                      ========
                      16:52:20 - INDEX SUMMARY
                      16:52:20 - Files indexed: 101
                      16:52:20 - Files skipped: 6958
                      16:52:20 - Files filtered: 0
                      16:52:20 - Files downloaded: 112
                      16:52:20 - Unique words found: 7930
                      16:52:20 - Total words found: 74965
                      16:52:20 - Avg. unique words per page: 78.51
                      16:52:20 - Avg. words per page: 742
                      16:52:20 - Start index time: 16:51:38 (2011/11/03)
                      16:52:20 - Elapsed index time: 00:00:42
                      16:52:20 - Errors: 0
                      16:52:20 - URLs visited by spider: 130
                      16:52:20 - URLs in spider queue: 1830
                      16:52:20 - Total bytes scanned/downloaded: 11640338
                      16:52:20 - File extensions:
                      16:52:20 - .htm indexed: 0
                      16:52:20 - .html indexed: 0
                      16:52:20 - .txt indexed: 0
                      16:52:20 - .php indexed: 53
                      16:52:20 - .asp indexed: 0
                      16:52:20 - .cgi indexed: 0
                      16:52:20 - .aspx indexed: 0
                      16:52:20 - .pl indexed: 0
                      16:52:20 - .php3 indexed: 0
                      16:52:20 - .shtml indexed: 25
                      16:52:20 - No extensions indexed: 23

                      Version 6
                      ========

                      16:53:21 - INDEX SUMMARY
                      16:53:21 - Files indexed: 100
                      16:53:21 - Files skipped: 653
                      16:53:21 - Files filtered: 0
                      16:53:21 - Files downloaded: 107
                      16:53:21 - Unique words found: 4366
                      16:53:21 - Variant words found: 5043
                      16:53:21 - Total words found: 68627
                      16:53:21 - Avg. unique words per page: 43.66
                      16:53:21 - Avg. words per page: 686
                      16:53:21 - Start index time: 16:52:57 (2011/11/03)
                      16:53:21 - Elapsed index time: 00:00:24
                      16:53:21 - Peak physical memory used: 54 MB
                      16:53:21 - Peak virtual memory used: 154 MB
                      16:53:21 - Errors: 1
                      16:53:21 - URLs visited by spider: 121
                      16:53:21 - URLs in spider queue: 1754
                      16:53:21 - Total bytes scanned/downloaded: 11503361
                      16:53:21 - File extensions:
                      16:53:21 - .htm indexed: 0
                      16:53:21 - .html indexed: 0
                      16:53:21 - .txt indexed: 0
                      16:53:21 - .php indexed: 53
                      16:53:21 - .asp indexed: 0
                      16:53:21 - .cgi indexed: 0
                      16:53:21 - .aspx indexed: 0
                      16:53:21 - .pl indexed: 0
                      16:53:21 - .php3 indexed: 0
                      16:53:21 - .shtml indexed: 23
                      16:53:21 - No extensions indexed: 24

                      • #12
                        Doh, I realised the reason Zoom wasn't indexing the site on my local computer: it was the free version, which is limited to indexing pages of up to 100 KB, and our site's pages are over 100 KB.

                        Anyway, I am thinking that the only option is to run some sort of incremental indexing like you suggested.

                        In Zoom, I start the index process and limit it to 50 files. Then if I go to the menu Index > Incremental Indexing > Update existing index, it says 'you can not perform an incremental update when the maximum number of pages has already been reached', so I have to manually increase the limit to 60 files, and then it indexes another 10 files.

                        Question 1 - Is there a way to tell the incremental indexer to update 'X' number of pages, so for example each time I run it, it will add 10 files to the index?

                        You said
                        If you have anyone who can write some code, you might want to look more deeply into the assisted incremental indexing option. The vBulletin forum software already has a function to dump out new posts in a date range. So you could get all the URLs from this function and then feed them into the indexer each day.

                        I have 5 different systems like vBulletin (a directory, event calendar, reviews, and articles) which get updated by users regularly. I could create a file which lists all of the URLs that have been updated or created, then run an 'incremental index' on the command line using the '-addpages' option, e.g.

                        Code:
                        ZoomIndexer.exe -s zoom.zcfg -addpages newpages.txt
                        Question 2 - If I created the 'newpages.txt' file on my web server, is it possible for the Zoom command to access it remotely, like ZoomIndexer.exe -s zoom.zcfg -addpages http://www.xxx.com.au/newpages.txt ?

                        Question 3 - The latest version of Zoom is 6 build 1027. I am running build 1025. Is it worth installing the newer build?

                        • #13
                          I had a closer look at your site's home page. You are right, it is huge.

                          You have:
                          342 KB of JavaScript
                          133 KB+ of HTML
                          etc...

                          So on that one page you have 169 objects, which result in around 1020 KB being transferred just for your home page. (It would have been 1300 KB, but there was some caching going on.)

                          I can't help but think this could be reduced a bit.

                          The HTML syntax is also a bit wonky on the page. You should run a few of your pages through the HTML validator and fix up the worst of the problems.
                          http://validator.w3.org/

                          In answer to your questions,
                          Q1. No easy way. It will index all new pages up until the max page limit set.

                          Q2. The -addpages switch needs to be given a local file on the local hard disk (not a remote file). So you would need to wget the list from the server first in your script -- see the sketch after these answers.

                          Q3. I don't think any of the bug fixes in the newer release will be noticed. See,
                          http://www.wrensoft.com/zoom/whatsnew.html
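
                          Putting the answer to Q2 together with the -addpages command you quoted, a nightly batch file might look something like this (a sketch only -- the URL and paths are placeholders for your own):

                          Code:
                          @echo off
                          rem 1. Fetch the list of new/updated URLs generated on the web server.
                          wget -O C:\zoom\newpages.txt http://www.example.com.au/newpages.txt
                          rem 2. Feed that list to Zoom as an incremental update.
                          "C:\Program Files\Zoom\ZoomIndexer.exe" -s C:\zoom\zoom.zcfg -addpages C:\zoom\newpages.txt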

                          • #14
                            I know, I know, our homepage is a bloated sack of digital protoplasm. Zoom only downloads the HTML files, not the JavaScript or images, right? Without all that stuff, the HTML is about 150 KB. Most of our pages are about this size.

                            I will get the assisted incremental indexing going with the -addpages switch and report back. This will probably take a couple of days.

                            • #15
                              Yes, Zoom will not download the JS files or the images, based on the configuration you have set up.
