
Spider Mode download file sizes vs real sizes


  • Spider Mode download file sizes vs real sizes

    Using spider mode, the file sizes appear in bytes against some, but not all, files on downloading. These sizes do not match the actual size of the files as reported online using an FTP program or locally using Windows or DOS programs. Why are they so different and not shown on every file downloaded? Is it because the bytes are the actual text rather than the whole file?

    The fact that these sizes do not match makes me nervous of the results. Please can you clarify as soon as possible?

  • #2
    It can depend on the server and file type (and caching, I believe) as to whether we know the size of the file in advance. For example, we pretty much never know how large a dynamically generated PHP page will be.

    Can you give a specific example of the type of file, the reported size and the actual size?

    Also, are you really using the old V4 software (or did you post in this section of the forum by accident)?



    • #3
      An example, one of many, is a .shtml file which contains one SSI call to a text file of 11 bytes. The .shtml file is 14.4 KB and the properties in IE report the rendered file as 14732 bytes, whereas the download in Spider Mode reports the size as 1024 bytes. There are similar effects with .htm pages.

      Of 18 .asp pages 10 file sizes match and 8 have variations in size:

      Zoom size = (6957 bytes) IE property size = 6321 bytes
      Zoom size = (3707 bytes) IE property size = 3511 bytes
      Zoom size = (4747 bytes) IE property size = 4541 bytes
      Zoom size = (4227 bytes) IE property size = 4026 bytes
      Zoom size = (7015 bytes) IE property size = 6350 bytes
      Zoom size = (3711 bytes) IE property size = 3513 bytes
      Zoom size = (8456 bytes) IE property size = 8308 bytes
      Zoom size = (4333 bytes) IE property size = 4079 bytes
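For reference, the gap on each of the eight mismatched pages can be worked out directly from the figures listed above. A minimal Python sketch (the variable names are only for illustration):

```python
# (Zoom size, IE property size) pairs in bytes, copied from the list above.
sizes = [
    (6957, 6321),
    (3707, 3511),
    (4747, 4541),
    (4227, 4026),
    (7015, 6350),
    (3711, 3513),
    (8456, 8308),
    (4333, 4079),
]

# Extra bytes reported by Zoom for each page.
differences = [zoom - ie for zoom, ie in sizes]

for (zoom, ie), diff in zip(sizes, differences):
    print(f"Zoom {zoom} - IE {ie} = {diff} bytes")
```

In every one of these cases Zoom reports the larger figure, by between 148 and 665 bytes.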

      All these files obtain information from a database dynamically, including those that have a size match.

      We are using version 4.2 build 1013.
      Last edited by webby; Aug-30-2007, 12:32 AM.



      • #4
        There are probably (at least) two issues at play.

        1) Our sizes include the HTTP header (which is never displayed and is not part of the page content, but must be downloaded).

        2) There might have been a display issue in V4 where the final size was not printed. Files are downloaded in chunks, so maybe the display never got updated with the final chunk.

        These are just guesses. It is hard to be sure without knowing the URL you are indexing and seeing the problem ourselves.
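To illustrate point 1 in general terms, here is a small Python sketch of how the bytes on the wire differ from the page content a browser reports. It uses a hand-written HTTP response rather than a real download, and the helper name `split_response` is made up for the example. It is not how Zoom itself counts bytes, only a picture of why a "header included" size is larger than the content size.

```python
# A hand-written HTTP response, used purely for illustration.
body = b"<html><body>Hello</body></html>"
headers = (
    b"HTTP/1.1 200 OK\r\n"
    b"Content-Type: text/html\r\n"
    b"Content-Length: " + str(len(body)).encode("ascii") + b"\r\n"
    b"\r\n"
)
raw_response = headers + body


def split_response(raw: bytes) -> tuple[bytes, bytes]:
    """Split a raw response into (header block incl. blank line, body)."""
    header_block, separator, payload = raw.partition(b"\r\n\r\n")
    return header_block + separator, payload


header_bytes, payload = split_response(raw_response)

total_size = len(raw_response)   # what a "header included" counter would see
body_size = len(payload)         # what a browser would report as the page size
header_size = len(header_bytes)  # the difference between the two

print(f"total={total_size} body={body_size} header={header_size}")
```

The header size depends on what the server sends (cookies, cache directives and so on), so the gap need not be the same from one page to the next.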



        • #5
          Reference your reply to my last posting, point 1 does not explain why some do match, with headers included, whereas others do not, refer to the first example zoom size 1024 bytes vs rendered 14732 bytes! A zoom shortfall of 13708 bytes, where there is only the possible reduction of a http header and the ssi call of an 11 bytes text file?

          Agreed the .asp page sizes listed all have an increase in Zoom against local versions, however, 18 files were compared and only 8 had a size mismatch. This means that either the headers in 10 are being ignored or that something else is being added to 8 files that we do not know about.

          Point 2 indicates that not all of the download is being recorded, so in a number of cases it looks like only one or two chunks are being either recorded or downloaded, which makes me wonder how much information may not be downloaded for scanning and indexing? As the example above shows, only 1024 bytes of a possible 14732 bytes was shown as being downloaded, and this by your explanation in point 1, may be missing the header. Also as I mentioned a number of files do not have any download values shown against them, does this mean that they have not been downloaded properly at all? Although there appears to be a queued and scanned process for each.

          I am happy to supply the URLs we are indexing, but as these are on our ISP's secure server I am reluctant to give any other information in an open forum. I have tried to contact you under the private message facility, but am blocked. The URLs we are scanning, and have provided examples from, are:

          www.ced.co.uk
          www2.ced.co.uk

          There are a number of exclusions in place as we have a number of languages on this site, each having its own search page section.



          • #6
            I looked at dozens of pages on the first site, but failed to find even a single ASP page.

            As you must already know, there was no point in posting the link to the second (www2) site, as access is restricted. I can't even see the home page.

            You missed the point about point 2). I was speculating that the final size might not have been displayed; the entire files are being downloaded in all cases.

            If you want to post actual URLs I am happy to look at the issue a bit more. If you can't post the details, then I suggest trying the new V5 Zoom software instead.



            • #7
              OK, so you are saying in point 2 that the files are downloading entirely, but ZoomIndexer is not reporting the full file size. How can I then be sure that the files have downloaded fully in preparation for the index scanning of all available information to produce a reliable index for searching?

              You don't explain why some files do not have any download size values, again how do I know these have downloaded fully?

              Our main site is hosted on a unix server, however, all our .asp pages are held on a windows server with a secondary url of www2.ced.co.uk and we target file cat01u.asp as the first file. Apologies for not making it more clear that the www2 is restricted. The paragraph above the urls in that posting indicated this, but not as clearly as it should have. Obviously this makes it more difficult, however, the .asp files are definitely being queued, downloaded and scanned, albeit with questionable file sizes. The files are linked via our http://www.ced.co.uk site using a series of framesets that consist of a top menu for switching between data types and side menus within each section providing links to the main documents. The various .asp pages are linked to through a number of these framesets, the first set is:

              http://www.ced.co.uk/pl/2pl01u.shtml
              http://www.ced.co.uk/pl/2pl02u.shtml
              http://www.ced.co.uk/pl/2pl03u.shtml
              http://www.ced.co.uk/pl/2pl04u.shtml
              http://www.ced.co.uk/pl/2pl05u.shtml
              http://www.ced.co.uk/pl/2pl06u.shtml
              http://www.ced.co.uk/pl/2pl08u.shtml
              http://www.ced.co.uk/pl/2pl10u.shtml
              http://www.ced.co.uk/pl/2pl12u.shtml

              Mirroring this is a second set, in another currency

              http://www.ced.co.uk/usa/2pl01y.shtml
              http://www.ced.co.uk/usa/2pl02y.shtml
              http://www.ced.co.uk/usa/2pl03y.shtml
              http://www.ced.co.uk/usa/2pl04y.shtml
              http://www.ced.co.uk/usa/2pl05y.shtml
              http://www.ced.co.uk/usa/2pl06y.shtml
              http://www.ced.co.uk/usa/2pl08y.shtml
              http://www.ced.co.uk/usa/2pl10y.shtml
              http://www.ced.co.uk/usa/2pl12y.shtml


              I am currently cross checking data between the ZoomIndexer log file and our online data, tracking links. When I have done this I may then be able to feed back further information to help resolve this query.
              Last edited by webby; Sep-03-2007, 11:53 PM.



              • #8
                Originally posted by webby View Post
                Reference your reply to my last posting, point 1 does not explain why some do match, with headers included, whereas others do not, refer to the first example zoom size 1024 bytes vs rendered 14732 bytes! A zoom shortfall of 13708 bytes, where there is only the possible reduction of a http header and the ssi call of an 11 bytes text file?
                It is possible that this URL actually failed to return the page you expected. It might be your custom 404 page (not reporting itself as 404, which is common) for example. We can't check if this is the case without seeing the web pages in question.

                Another possibility is the bug with the displaying of filesizes in this old version of Zoom you are using. Note that the bug only relates to the filesize displayed. The file is always entirely downloaded or completely skipped.

                Originally posted by webby View Post
                Agreed the .asp page sizes listed all have an increase in Zoom against local versions, ...
                You cannot compare the local filesize of an ASP page to the actual size of the page that is downloaded, spidered, or indexed. ASP pages are server-side scripts, so looking at the local copy of the file means you are merely looking at the filesize of the script source code. The actual size of the page that is sent to the client by the server can vary depending on the content generated.

                Originally posted by webby View Post
                OK, so you are saying in point 2 that the files are downloading entirely, but ZoomIndexer is not reporting the full file size.


                Yes. The files are always entirely downloaded. There was a known bug in the older version of Zoom you are using (V4.2) where the full size that has been downloaded was not being reported on the screen, in the Index Log.

                Originally posted by webby View Post
                How can I then be sure that the files have downloaded fully in preparation for the index scanning of all available information to produce a reliable index for searching?


                Zoom was designed to always download the entire file, and there is no known issue with this. Unless you have some specific reasons to believe Zoom is not downloading or indexing the entire file, then there is no need to suspect this.

                There are other reasons why the content in certain pages may not be completely indexed. For example, if you have misplaced ZOOMSTOP tags. Or if you have specified a "Limit words per file" setting. If you can point to a more specific symptom that is leading you to worry about your pages not being entirely indexed, we may be able to address the real cause of this.
                --Ray
                Wrensoft Web Software
                Sydney, Australia
                Zoom Search Engine



                • #9
                  Additional problems

                  OK

                  I have done a thorough check through the files online and those that appear to be indexed using spider mode. Apart from the file sizes not matching, which you have explained as being a bug, there are quite a number not being indexed at all.

                  I have done the following checks:

                  ZOOMSTOP/ZOOMRESTARTS are correctly implemented in all files containing this requirement and this gave the reason for some not to be indexed as all the content was stopped. However, some files that had content outside these tags were indexed whereas others were not.

                  I also accounted for some of the files not being indexed due to ssi links rather than a href links, which is OK as they are not likely to be indexed.

                  However, there are some important files that are not being indexed, although they have proper a href links from indexed files to them. Some of these links have a page target suffix on them, whereas others do not. For example

                  http://www.ced.co.uk/pow1401u.htm has links to
                  powd1u.htm#(variable)
                  powd2u.htm#(variable)
                  powd2u.htm#(variable)
                  powd2u.htm#(variable)
                  synchu.htm (no page target)

                  I have checked a number of keywords within search after running indexer, none of the above listed linked files appear. Also when checking the log for these file names, they do not appear as download, queuing, scanning or skipping!

                  Some of the missing files are listed on our sitemap with the following a href tags:-

                  <a href="pru.shtml?why324u.htm?id=5" target=_top>Why upgrade Spike2 from version 3 to 4?</a>

                  <a href="#" title="New window for the language support download" OnClick='window.open("uplanu.shtml","mywindow","width=500,height=137,innerwidth=500,innerheight=137,left=150,top=300,screenX=150,screenY=300,scrollbars=yes,resizable=yes");return false'>Language support</a> (this file contains ZOOMSTOP/ZOOMRESTART example)

                  Can you help to explain these problems?



                  • #10
                    Originally posted by webby View Post
                    I have done a thorough check through the files online and those that appear to be indexed using spider mode. Apart from the file sizes not matching, which you have explained as being a bug, there are quite a number not being indexed at all.
                    To clarify, the bug in V4.2 (since fixed in all later versions) only affects the file sizes displayed in the Index Log (the "Downloading ... " messages in dark green). These messages were only ever meant to indicate the progress of the files being downloaded by the spider. They do not indicate the data being indexed.

                    The other issues you are now describing are more typical usage questions regarding spidering and the following of links. These are mostly covered in the FAQ in questions such as:
                    Q. Why are some of my pages being skipped by the indexer?
                    Q. Why are links in my Javascript menus being skipped?
                    Q. I am indexing with spider mode but it is not finding all the pages on my web site

                    But I will try to answer your specific questions below, hopefully with more detail.

                    Originally posted by webby View Post
                    ZOOMSTOP/ZOOMRESTARTS are correctly implemented in all files containing this requirement and this gave the reason for some not to be indexed as all the content was stopped. However, some files that had content outside these tags were indexed whereas others were not.
                    ZOOMSTOP and ZOOMRESTART tags should only be used to exclude a portion of a page from being indexed. They do not serve to exclude an entire page from being indexed (instead, you should skip the URL altogether by adding the filename or path to the "Page Skip List").

                    If you have placed a ZOOMSTOP and ZOOMRESTART tag which surrounds the entire page, you will be excluding the content within these tags only. The file itself would still be downloaded and "indexed", including any content outside these tags, such as titles, meta description or filename (if these are enabled on the "Indexing Options" tab).
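As a rough picture of what this means, here is a short Python sketch that drops only the text between the two markers and keeps everything else, including the title. It assumes the markers are written as HTML comments (`<!--ZOOMSTOP-->` and `<!--ZOOMRESTART-->`); check the Users Guide for the exact syntax. The function name `strip_excluded` and the sample page are made up for the example, and this is not Zoom's own implementation.

```python
import re

# Assumed comment-style markers; the exact syntax should be checked
# against the Zoom Users Guide.
STOP_BLOCK = re.compile(
    r"<!--\s*ZOOMSTOP\s*-->.*?<!--\s*ZOOMRESTART\s*-->",
    re.DOTALL | re.IGNORECASE,
)


def strip_excluded(html: str) -> str:
    """Remove only the content between the stop and restart markers."""
    return STOP_BLOCK.sub("", html)


# Hypothetical page whose whole body is wrapped in the markers.
page = (
    "<html><head><title>Language support</title></head><body>"
    "<!--ZOOMSTOP--><p>Menu text that should not be searchable</p>"
    "<!--ZOOMRESTART-->"
    "</body></html>"
)

remaining = strip_excluded(page)
print(remaining)
```

The body text is gone, but the page itself and its title are still there to be indexed.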

                    For the files that you found where the content outside the tags was not indexed, it is possible that (a) the file was not indexed at all due to the link not being found, or (b) the content outside the tags was not considered text content, e.g. it was Javascript code. If you do not think this is the case, can you give us a URL to the page in question (or send us a copy if it is not available), and we can determine if there is a problem.

                    Originally posted by webby View Post
                    I also accounted for some of the files not being indexed due to ssi links rather than a href links, which is OK as they are not likely to be indexed.
                    This comment sounds like you have a misconception of sorts. Spider mode indexing means that the indexer will only see what a web client would see (eg. a browser) when requesting a page from a web server.

                    An SSI (Server Side Include) is not visible to the client as an individual file. It becomes part of the file that it is included into. An SSI link is never visible to a client, and should not be considered a separate file in any context outside of the development and maintenance of the site.
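A rough illustration in Python of what the server does before the client sees anything. The two "files" are hypothetical and held in a dictionary rather than on disk, and `render` is a made-up helper that handles only the simple `<!--#include file="..." -->` form.

```python
import re

# Hypothetical server-side files, held in a dict instead of on disk.
server_files = {
    "page.shtml": (
        "<html><body>"
        '<!--#include file="note.txt" -->'
        "<p>Main content</p>"
        "</body></html>"
    ),
    "note.txt": "Hello world",
}

INCLUDE = re.compile(r'<!--#include\s+file="([^"]+)"\s*-->')


def render(name: str) -> str:
    """Return what the client would receive: includes replaced by file contents."""
    source = server_files[name]
    return INCLUDE.sub(lambda match: server_files[match.group(1)], source)


rendered = render("page.shtml")
print(rendered)
```

The client, whether a browser or a spider, receives a single document. There is no link to `note.txt` for it to follow, and no way to tell that the text came from a separate file.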

                    Originally posted by webby View Post
                    However, there are some important files that are not being indexed, although they have proper a href links from indexed files to them. Some of these links have a page target suffix on them, whereas others do not. For example

                    http://www.ced.co.uk/pow1401u.htm has links to
                    powd1u.htm#(variable)
                    powd2u.htm#(variable)
                    powd2u.htm#(variable)
                    powd2u.htm#(variable)
                    synchu.htm (no page target)
                    Unfortunately that URL returned a 404 File Not Found, when I tried to look at it just then, so I'm unable to see what you mean exactly.

                    But I will add that Zoom can follow links with anchor parameters (in the style of "powd1u.htm#1234") without problems, and the target= parameter is also not an issue. So these are unlikely to be the reason why these links are not being crawled.
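As a general illustration of why the "#" part is harmless to a crawler, the Python sketch below resolves a few relative links against the page URL and drops the fragment before deciding what to fetch. The anchor values other than "1234" are invented, and this shows the usual approach for a crawler in general, not Zoom's internal code.

```python
from urllib.parse import urldefrag, urljoin

base = "http://www.ced.co.uk/pow1401u.htm"

# Relative links as they might appear in the page; the anchor values are invented.
links = ["powd1u.htm#1234", "powd2u.htm#5678", "powd2u.htm#9012", "synchu.htm"]

to_fetch = []
for href in links:
    absolute = urljoin(base, href)                   # resolve against the page URL
    document_url, fragment = urldefrag(absolute)     # drop the "#..." part
    if document_url not in to_fetch:                 # each document is fetched once
        to_fetch.append(document_url)

for url in to_fetch:
    print(url)
```

The fragment only tells a browser where to scroll within the page, so several links to different anchors in the same file collapse to one document to download.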

                    Originally posted by webby View Post
                    I have checked a number of keywords within search after running indexer, none of the above listed linked files appear. Also when checking the log for these file names, they do not appear as download, queuing, scanning or skipping!

                    Some of the missing files are listed on our sitemap with the following a href tags:-

                    <a href="pru.shtml?why324u.htm?id=5" target=_top>Why upgrade Spike2 from version 3 to 4?</a>
                    Zoom should have no trouble following this link. However, when I went to look at the page that this link points to (http://www.ced.co.uk/pru.shtml?why324u.htm?id=5), the problem quickly became evident.

                    This page does not actually contain the content that you would see when you load it in a browser. Instead, it contains Javascript which redirects the page content as necessary, and loads a different file (why324u.htm) in the main frame.

                    There are a couple of issues here. First of all, your site clearly depends heavily on Javascript. In general, Javascript is NOT spider friendly. There are many online resources that will explain this further (try a Google search for "javascript spider friendly"), but if you think about how Javascript works, this will become more obvious. Javascript is client-side scripting that requires execution and is often dependent on user interaction (eg. you may need to hover a mouse cursor over a button for several seconds before a link appears). There is no way for an automated client, such as a web spider, to run through all the possible scenarios of such scripts, so in general, spiders do not execute Javascript. This is also explained in our FAQ regarding Javascript links.

                    In general, using a little bit of Javascript for unessential features of the site is OK. But when navigation around your site depends entirely on Javascript, then that is a problem. It means that spiders for search engines will not be able to navigate around, and also many users with Javascript disabled browsers (commonly found at Internet Kiosks and more limited computers) would be unable to use your site.

                    There are many ways to improve this however. You can make better use of the <noscript> tag to specify HTML that is only displayed when a Javascript disabled client accesses the page. Here, you can specify the text links to navigate to the actual content page, and spiders will be able to follow them. Or change your site so that it does not depend on Javascript for navigation, but merely offers it as one method of getting around the site. This would be the recommended solution for improving the accessibility of your site and online presence (ie. rankings and presence on Internet-wide search engines such as Google).

                    In regards to indexing with Zoom, there is one additional alternative. You can use Offline Mode instead, to index a local copy of the pages on your hard disk. This eliminates the need for finding links to crawl, and Zoom will simply scan all files in a certain folder or subfolder which satisfy the indexing conditions. You can specify a list of files which it would not need to index, and only index the main content that is to be loaded within your frames and Javascript. This would be the best method to index what you currently have without changing your site, but note that it will not be able to index dynamically generated webpages.

                    There is more information on Offline Mode indexing in the Users Guide.

                    Originally posted by webby View Post
                    <a href="#" title="New window for the language support download" OnClick='window.open("uplanu.shtml","mywindow","width=500,height=137,innerwidth=500,innerheight=137,left=150,top=300,screenX=150,screenY=300,scrollbars=yes,resizable=yes");return false'>Language support</a> (this file contains ZOOMSTOP/ZOOMRESTART example)
                    Same thing applies here. This HREF link points to nowhere ("#" is just the current page). It is entirely dependent on Javascript (specified as part of the onClick attribute) to open the link. Note that while it may seem obvious from a human point of view, it is not so when parsing this automatically. The filename may not be the full path to the file, and the function called could add a path or change the filename completely. There is really no way to determine what the URL is in such links, and this is why they are not spider friendly.
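To make this concrete, here is a minimal Python sketch of what a simple link extractor gets out of the two links quoted in this thread when it reads HREF attributes and does not execute any script. The class name `LinkCollector` is made up, the link text and window.open arguments are shortened, and a real spider does considerably more than this.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag, without running any script."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.hrefs.append(value)


# Shortened versions of the two links quoted in this thread.
html = (
    '<a href="pru.shtml?why324u.htm?id=5" target=_top>Why upgrade?</a>'
    '<a href="#" onclick=\'window.open("uplanu.shtml","mywindow");'
    'return false\'>Language support</a>'
)

collector = LinkCollector()
collector.feed(html)
collector.close()

print(collector.hrefs)
```

The first link yields a URL that can be queued. The second yields only "#", and the filename inside the onClick handler never appears as a link.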

                    I hope that helps explain the problem here. I have tried to be as clear as possible.
                    --Ray
                    Wrensoft Web Software
                    Sydney, Australia
                    Zoom Search Engine

