Missing pages


  • Missing pages

    Hi-

    I'm a new user to the product. My site is still under construction, but is active at

    http://www.testsnh.com

    the search is under http://www.testsnh.com/search.html

    The search is missing all the pages that can be found under the Newsletter link.

    I am using the free version, but it is only searching 31 pages, so I haven't reached the limit yet.

    Any ideas?

  • #2
    Take a look at the following FAQs:
    Q. Why are some of my pages being skipped by the indexer?
    Q. Why are links in my Javascript menus being skipped?
    Q. I am indexing with spider mode but it is not finding all the pages on my web site
    --Ray
    Wrensoft Web Software
    Sydney, Australia
    Zoom Search Engine


    • #3
      I'm also having a problem indexing newly-added pages.

      The first time this occurred I immediately realized that I had not linked to the added pages yet from any existing page. As we say here in the US, "duh!"

      So yesterday I added a link on one of my main pages to a page that links to all the newly added pages. Today I did a full reindex without changing anything in the main config file.

      The files appear in the indexlog.txt file. But none of the new pages are found when I search from my search page using the title or main keywords (in this case, it's country names like Peru where the title of the page, content, filename, etc. have the word Peru in them). Those items (like title) are checked under Indexing options.

      I have read the pertinent FAQs quoted above and checked each item discussed, and also double-checked my config for ZOOM 5.

      I'm sure it's something minor but I can't figure out why it's not working.


      • #4
        It might be a caching issue. Zoom might be indexing a cached copy of the pages, one without the new link you just added.

        You can turn off caching from the "General" tab of the Zoom configuration window.

        But if the files appeared in your index log file, then that would indicate that the files were found and indexed (unless there is an error message, or a skipped message, also in the log). Can you post that section of the log?

        Or maybe there is some formatting issue on the new pages and they aren't valid HTML, so not all of the content was indexed.

        Or maybe all the new data is in the set of index files, but you didn't move the new files to your server?


        • #5
          I spent some more time looking at the indexlog.txt file. I think I found the problem and maybe you can tell me an easy way to fix it.

          I have links in my site like this:

          http://www.mysite.com/directory/directory/

          ZOOM says it can't find those files. It doesn't see the index.html file in there.

          If there is a link like this:

          http://www.mysite.com/directory/directory/index.html

          It finds the index.html file.

          Please don't tell me I need to go through and change every directory link to add the index.html... : (
          Last edited by Fork; Feb-09-2007, 09:43 PM.


          • #6
            There is a check box on the "scan options" tab, called, "Scan files with no extension". This check box should be ON by default. Did you turn it off?
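To see why a link such as http://www.mysite.com/directory/directory/ would fall under that option, here is a minimal Python sketch that checks whether a URL's path ends in a file extension. It only illustrates the idea; `has_extension` is a hypothetical helper and is not how Zoom itself classifies URLs.

```python
from posixpath import splitext
from urllib.parse import urlparse


def has_extension(url: str) -> bool:
    """Return True if the last path segment of the URL has a file extension."""
    # Only the path is inspected, so dots in the host name are ignored.
    path = urlparse(url).path
    # splitext looks for a dot after the last "/" in the path.
    return splitext(path)[1] != ""


# A directory link has no extension; the explicit index.html link does.
print(has_extension("http://www.mysite.com/directory/directory/"))            # False
print(has_extension("http://www.mysite.com/directory/directory/index.html"))  # True
```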


            • #7
              Indeed, somehow I must have clicked off that scan option without fully understanding its purpose.

              I turned it back on and turned off the cache and multithreading, then I ran the indexing again.

              It now states that it's indexing files it missed before but when I search for keywords and files I know it indexed, I'm not finding them. I ftp'd the files directly from ZOOM and the logs indicate everything went well.

              Here is an excerpt from indexlog.txt that actually shows the indexing of some pages taking place but searching for any of the main keywords for those pages does not return the page.

              07:58:16 - Downloading file http://www.globalgourmet.com/destina...carbonada.html
              07:58:16 - Indexing http://www.globalgourmet.com/destina...st/hummos.html
              07:58:17 - Downloading file http://www.globalgourmet.com/destina.../empanada.html
              07:58:17 - Indexing http://www.globalgourmet.com/destina...carbonada.html
              07:58:17 - Downloading file http://www.globalgourmet.com/food/sp...c/falafel.html
              07:58:17 - Indexing http://www.globalgourmet.com/destina.../empanada.html

              So the file at

              http://www.globalgourmet.com/destina...carbonada.html

              has the word Uruguay in it several times but a search for Uruguay turns up 0 items.

              http://www.globalgourmet.com/destina.../empanada.html

              has the word Venezuela in it several times and a search for Venezuela turns up 12 items, but NOT this file, which Zoom reports that it indexed.


              Here are the header and footer for the indexlog.txt file. It was run only on html files but with debug on:

              08:24:08 - Config file loaded: C:\PROGRAM FILES\ZOOM SEARCH ENGINE 5.0\zoom.zcfg
              08:41:12 - Start indexing (spider mode)
              08:41:12 - Maximum number of words: 200000
              08:41:12 - Maximum number of files: 20000
              08:41:12 - Will scan files with extensions
              08:41:12 - .html
              08:41:12 - Spider from: http://www.globalgourmet.com/index.html
              08:41:12 - Web site URL: http://www.globalgourmet.com/
              08:41:12 - Estimated RAM required during index process: 222425 KB
              08:41:13 - Initiating HTTP session (thread #1) ...
              08:41:13 - Downloading file http://www.globalgourmet.com/index.html
              08:41:13 - Initiating HTTP session (thread #2) ...
              08:41:14 - Downloading file http://www.globalgourmet.com/food/co...ins/index.html

              [main text edited out]

              12:17:20 - Indexing http://www.globalgourmet.com/food/fo...99/fff003.html
              12:17:20 - Writing index data for PHP search... (Please wait)
              12:17:20 - Created pagedata data file (zoom_pagedata.zdat)
              12:17:20 - Created pagetext data file (zoom_pagetext.zdat)
              12:17:20 - Created pageinfo data file (zoom_pageinfo.zdat)
              12:17:46 - Created spelling data file (zoom_spelling.zdat)
              12:17:46 - Deleting presaved index data...
              12:17:46 - Deleting pageinfo data...
              12:17:46 - Deleting miscellaneous buffers...
              12:17:46 - Deleting URL history...
              12:17:46 - Writing out the dictionary...
              12:17:48 - Created dictionary data file (zoom_dictionary.zdat)
              12:17:48 - Created wordmap data file (zoom_wordmap.zdat)
              12:17:48 - Created script settings file (settings.php)
              12:17:48 - Indexing completed
              12:17:48 - INDEX SUMMARY
              12:17:48 - Files indexed: 7833
              12:17:48 - Files skipped: 150860
              12:17:48 - Files filtered: 0
              12:17:48 - Files downloaded: 7833
              12:17:48 - Unique words found: 84470
              12:17:48 - Total words found: 4451321
              12:17:48 - Avg. unique words per page: 10
              12:17:48 - Avg. words per page: 568
              12:17:48 - Start index time: 11:27:54 (2007/01/10)
              12:17:48 - Elapsed index time: 00:49:54
              12:17:48 - Errors: 178
              12:17:49 - URLs visited by spider: 8023
              12:17:49 - URLs in spider queue: 0
              12:17:49 - Total bytes scanned/downloaded: 86589218
              12:17:49 - File extensions:
              12:17:49 - .html indexed: 7644
              12:17:49 - No extensions indexed: 189
              12:17:49 - Cleaning up memory used for index data... please wait.
              12:17:49 - Deleting wordmap data...
              12:17:49 - Deleting presaved index data...
              12:17:49 - Deleting pageinfo data...
              12:17:49 - Deleting miscellaneous buffers...
              12:17:49 - Deleting URL history...
              12:17:49 - Finished cleaning up memory.
              12:21:35 - Connecting to FTP server at www.globalgourmet.com on port 21 ...
              12:21:35 - Using PASV mode ...
              12:21:35 - Opening remote folder "public_html/ggsearch/logs" ...
              12:21:35 - Queuing files for upload ...
              12:21:35 - Total items to be queued: 9
              12:21:35 - Queueing for upload: C:\My Documents\ggzoom\search.php
              12:21:35 - Queueing for upload: C:\My Documents\ggzoom\settings.php
              12:21:35 - Queueing for upload: C:\My Documents\ggzoom\search_template.html
              12:21:35 - Queueing for upload: C:\My Documents\ggzoom\zoom_dictionary.zdat
              12:21:35 - Queueing for upload: C:\My Documents\ggzoom\zoom_wordmap.zdat
              12:21:35 - Queueing for upload: C:\My Documents\ggzoom\zoom_pagetext.zdat
              12:21:35 - Queueing for upload: C:\My Documents\ggzoom\zoom_pagedata.zdat
              12:21:35 - Queueing for upload: C:\My Documents\ggzoom\zoom_pageinfo.zdat
              12:21:35 - Queueing for upload: C:\My Documents\ggzoom\zoom_spelling.zdat
              12:21:36 - Created handle
              12:21:37 - C:\My Documents\ggzoom\search.php uploaded
              12:21:38 - Created handle
              12:21:38 - C:\My Documents\ggzoom\settings.php uploaded
              12:21:38 - Created handle
              12:21:38 - C:\My Documents\ggzoom\search_template.html uploaded
              12:21:39 - Created handle
              12:22:10 - C:\My Documents\ggzoom\zoom_dictionary.zdat uploaded
              12:22:11 - Created handle
              12:27:59 - C:\My Documents\ggzoom\zoom_wordmap.zdat uploaded
              12:27:59 - Created handle
              12:33:41 - C:\My Documents\ggzoom\zoom_pagetext.zdat uploaded
              12:33:42 - Created handle
              12:34:21 - C:\My Documents\ggzoom\zoom_pagedata.zdat uploaded
              12:34:21 - Created handle
              12:34:24 - C:\My Documents\ggzoom\zoom_pageinfo.zdat uploaded
              12:34:25 - Created handle
              12:34:29 - C:\My Documents\ggzoom\zoom_spelling.zdat uploaded
              12:34:30 - Files successfully uploaded


              • #8
                I can see the problem from the logs you posted. You have got confused and are uploading the files to the wrong folder on your server.

                From this log line,
                12:21:35 - Opening remote folder "public_html/ggsearch/logs" ...

                I can see you are uploading all the search files to a directory called '/logs'.

                So when you do a search, the search is working with an old set of index files.
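A quick way to run the same check on your own log is to pull out every "Opening remote folder" line and compare it with the folder your search page actually lives in. This is a rough Python sketch; the pattern is based only on the log line quoted above, and `upload_folders` is a made-up helper name.

```python
import re

# Pattern taken from the log line quoted in this thread;
# other Zoom versions may word the line differently (assumption).
REMOTE_FOLDER = re.compile(r'Opening remote folder "([^"]*)"')


def upload_folders(log_text: str) -> list[str]:
    """Return every remote folder named in the log, in order of appearance."""
    return REMOTE_FOLDER.findall(log_text)


sample = '12:21:35 - Opening remote folder "public_html/ggsearch/logs" ...'
print(upload_folders(sample))  # ['public_html/ggsearch/logs']
```

If the folder printed here is not the one that holds the search.php you are testing, the search is still reading the old index files.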


                • #9
                  Double "duh".

                  Not sure how that happened but once it did, I kept on doing it.

                  I always figured it was something simple (or stupid).

                  Thanks for the patience and great support. All seems to be working now.


                  • #10
                    I have been going through the errors in the indexlog.txt file discussed above and started to correct the links that are reported by the log as not working but the first few don't make any sense:

                    09:04:23 - Downloading file http://www.globalgourmet.com/food/recarch/rec0196.html
                    09:04:23 - Could not download file: http://www.globalgourmet.com/food/recarch/rec0196.html (File not found)

                    But if I copy and paste the URL from the log file into a browser, the file is indeed there. It's a basic html file, nothing odd about it -- it's quite old (1996) but it parses correctly in Firefox.

                    I tested eight similar files in a row like this, all with the same result, so I stopped testing any more.

                    When I search for a term that I know appears on that page, the file shows up as #6 on the list:

                    6. Recipe Index
                    ... Rib (eGGsalad) Carving a Turkey (eGGsalad) Celery Remoulade (The Artist's Table) Chicken Salad with ... and Avocado (The Great Hot Sauce Book) Chilli Salt Squid (A Taste Of Australia) Chipotle Red ...
                    Terms matched: 2 - Score: 56 - URL: http://www.globalgourmet.com/food/recarch/rec0196.html


                    What might cause that file to show up as an error in the indexlog.txt file?


                    • #11
                      I indexed the first 1000 pages of your web site from here and got only 3 errors, all of them "file not found" errors. You normally get this error when you have a broken link and the server returns a 404 error.

                      The 3 errors that I saw were all correct. They were all broken links on your site, like these:
                      http://www.globalgourmet.com/food/sp...ras/index.html
                      http://www.globalgourmet.com/food/sl...100/index.html

                      How far into the indexing process was it before you saw your errors?

                      There are other rarer issues which can result in a 404 error. e.g. A problem on the server, an internet connection problem, or some kind of load throttling on your server.

                      If you do get a "File not found" error, that file cannot be in your index. I think you must be looking at an older set of index files that was generated when this error did not occur.
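To go through the errors systematically rather than one at a time, the failed URLs can be pulled out of indexlog.txt with a short script. The sketch below is Python and assumes the error lines look exactly like the "Could not download file" line quoted earlier in this thread; `failed_downloads` is a hypothetical helper name.

```python
import re

# Matches lines such as:
#   09:04:23 - Could not download file: <url> (File not found)
# The format is assumed from the single log line posted in this thread.
FAILED = re.compile(r"Could not download file: (\S+) \(([^)]*)\)")


def failed_downloads(log_text: str) -> list[tuple[str, str]]:
    """Return (url, reason) pairs for every failed download in the log."""
    return FAILED.findall(log_text)


sample = (
    "09:04:23 - Downloading file http://www.globalgourmet.com/food/recarch/rec0196.html\n"
    "09:04:23 - Could not download file: "
    "http://www.globalgourmet.com/food/recarch/rec0196.html (File not found)\n"
)
for url, reason in failed_downloads(sample):
    print(reason, url)
```

Each URL in the result can then be checked against the site to see whether the link is really broken.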


                      • #12
                        .... or you might be looking at older "indexlog.txt" entries from a session where this problem DID occur - but is no longer a problem.

                        Remember that your index log file is automatically appended to, so it will contain log entries from various sessions if you do not delete/clear out the file yourself. Make sure you look from the bottom up if you wish to find the latest session - or always delete the file prior to indexing a new session if you only want entries from the latest session.
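If you would rather not delete the file, a small script can cut the log down to its most recent session. The Python sketch below assumes each session contains a "Start indexing" line, as in the log posted earlier in this thread; `latest_session` is a made-up helper and the sample URLs are invented for illustration.

```python
def latest_session(log_text: str, marker: str = "Start indexing") -> list[str]:
    """Return the log lines from the last line containing `marker` onward.

    If the marker never appears, the whole log is returned.
    """
    lines = log_text.splitlines()
    start = 0
    for i, line in enumerate(lines):
        if marker in line:
            start = i  # remember the most recent session start
    return lines[start:]


# Two appended sessions; only the second one should be returned.
log = "\n".join([
    "09:04:23 - Start indexing (spider mode)",
    "09:04:23 - Could not download file: http://www.example.com/a.html (File not found)",
    "09:10:00 - Indexing completed",
    "08:41:12 - Start indexing (spider mode)",
    "08:41:13 - Downloading file http://www.example.com/index.html",
    "08:50:00 - Indexing completed",
])
for line in latest_session(log):
    print(line)
```

Note that lines written before the marker in the same session, such as "Config file loaded", are not included in the result.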

                        We have plans to add an option to automatically clear out the log file per session in a future version. We will also add a date stamp (as opposed to just a time stamp) to the log file so that different sessions in the log file will stand out more clearly.
                        --Ray
                        Wrensoft Web Software
                        Sydney, Australia
                        Zoom Search Engine


                        • #13
                          I selected the menu option "Clear Index Log" but that didn't appear to work as the same files that show errors (even though they are no longer errors) are in there, and indeed, near the top of the log.

                          I assumed using the menu option was the easy way to do it but I'll manually delete the file on my next try.

                          I did not realize the sessions were appended so I'm sure that's the reason I'm seeing those old errors.

                          Is there a reason why the default is to append the sessions?


                          • #14
                            I manually deleted the indexlog.txt file, reindexed and everything seems back to normal. The remaining link errors appear to be real errors -- now I can get back to fixing them.

                            Thanks a lot.
                            Last edited by Fork; Feb-12-2007, 09:30 PM. Reason: sp


                            • #15
                              Originally posted by Fork:
                              I selected the menu option "Clear Index Log" but that didn't appear to work as the same files that show errors (even though they are no longer errors) are in there, and indeed, near the top of the log.
                              No, this only clears the index log which is displayed in the GUI window. This does not affect the saved log file.

                              Originally posted by Fork:
                              Is there a reason why the default is to append the sessions?
                              Largely because that's how most error log files work in general - for Apache, IIS, Windows, etc. It also means that you can look up previous index sessions and not only the latest one, without resorting to multiple files.

                              Glad you've got it worked out now.
                              --Ray
                              Wrensoft Web Software
                              Sydney, Australia
                              Zoom Search Engine
