Content not being included

  • Content not being included

    I'm having trouble getting page content indexed. The indexing log indicates the pages are being indexed, but many of the words contained on the pages are not found in a search. What am I missing in the configuration or operation?
    I am assuming the pages are indexed because they show up in the log. Using the spider mode, the indexed pages show up in the log as:
    1> Queued URL: (page URL)
    2> DL Thread #1, got URL (page URL) off queue
    3> Downloading file (page URL)
    4> Index Thread got ready buffer for (page URL)
    5> Spidering for links on (page URL)
    6> Indexing (page URL)
    The missing words do not appear in the 'zoom_dictionary.zdat' file. Any ideas or help will be appreciated.

  • #2
    If you see the "indexing" message, then yes, the page would have been indexed.

    Content might be missed if:
    1) The HTML on the page is invalid.
    2) The content is not in the HTML but is generated on the fly, client side, by a script of some sort (e.g. Javascript).
    3) You have excluded text from being indexed using ZOOMSTOP tags.
    4) You have configured Zoom not to index page content (on the indexing options tab).
    5) You have tricky server side browser sniffing which returns different content for different browsers.
    6) You have some authentication scheme running on your server and, if you are not logged in, the content of every page says something like, "Please login to view this page". So the real page content is not visible.

    If you still have a problem, can you post the URL for the page in question and some examples of words that you think should have been found on the page?
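
As a rough way to test for cause 2 above, the sketch below checks whether a word is present in the static HTML text, ignoring anything inside script or style blocks. It is only an illustration using Python's standard `html.parser`; the names `StaticTextExtractor` and `word_in_static_html` are made up for this example, and it does not reproduce how Zoom itself parses pages.

```python
from html.parser import HTMLParser


class StaticTextExtractor(HTMLParser):
    """Collect text present in the raw HTML, skipping script and style bodies."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0  # > 0 while inside <script> or <style>
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.chunks.append(data)


def word_in_static_html(html_text, word):
    """Return True if `word` occurs (case-insensitively) in the static page text."""
    parser = StaticTextExtractor()
    parser.feed(html_text)
    parser.close()
    text = " ".join(parser.chunks).lower()
    return word.lower() in text


if __name__ == "__main__":
    # Hypothetical page: one word is static, the other is written by a script.
    sample = (
        "<html><body><p>Family tree</p>"
        '<script>document.write("ProGenealogists")</script>'
        "</body></html>"
    )
    print(word_in_static_html(sample, "family"))
    print(word_in_static_html(sample, "ProGenealogists"))
```

If a word only shows up once the page is rendered in a browser, and not in the saved source, a check like this will report it as missing, which points to script-generated content.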


    • #3
      Content not being included

      Thanks for the quick response! The subject pages do normally require authentication using htaccess/htpasswd. I disabled the htaccess so you can take a look. An example page is:
      http://www.feuling.org/family/valentin/phg01.htm
      and an example word on that page is: ProGenealogists

      I have authentication set up in the Zoom Search configuration and it appears to be working properly. The results of the indexing process appear to be the same with the directory authentication enabled and with it disabled.


      • #4
        The HTML on the phg01.htm page is invalid. This is an extract from the HTML on your page.
        <html>
        <head>
        </head>
        <body background="images/background.jpg">
        <html>
        <head>
        <title>The Feuling Family Genealogy</title>
        </head>
        <body background="background.jpg">
        </body>
        </html>
        <html>
        <head>
        <title>Genealogy page footer</title>
        </head>
        <body background="family/background.jpg">
        </body>
        </html>
        </html>

        3 body tags, 2 title tags and 7 HTML tags. You also have large blocks of NULL characters (0x00 in hex) in the document, which will cause problems. In short it is a bit of a mess.

        The W3 HTML validator reports 277 errors on the page in question. See,
        http://validator.w3.org/check?uri=ht...Inline&group=0
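
For a quick local check before running a full validator, the structural problems described above (repeated html, head, body and title tags, plus NULL characters) can be counted with a short script. This is a minimal sketch, not a validator: it uses simple regular expressions rather than a real HTML parser, and the function names `summarize_html_structure` and `looks_suspicious` are invented for the example.

```python
import re

# Elements that should normally appear only once in a document.
TAGS = ("html", "head", "body", "title")


def summarize_html_structure(raw):
    """Count opening/closing tags and NUL bytes in the raw bytes of a page."""
    # latin-1 maps every byte to a character, so decoding never fails.
    text = raw.decode("latin-1")
    summary = {}
    for tag in TAGS:
        opens = len(re.findall(r"<" + tag + r"\b", text, flags=re.IGNORECASE))
        closes = len(re.findall(r"</" + tag + r"\s*>", text, flags=re.IGNORECASE))
        summary[tag] = (opens, closes)
    summary["null_bytes"] = raw.count(b"\x00")
    return summary


def looks_suspicious(summary):
    """Flag pages with NUL bytes or more than one opening tag of any checked kind."""
    return summary["null_bytes"] > 0 or any(summary[t][0] > 1 for t in TAGS)


if __name__ == "__main__":
    # Read a locally saved copy of the page; the file name is an assumption.
    with open("phg01.htm", "rb") as handle:
        report = summarize_html_structure(handle.read())
    print(report)
    print("suspicious:", looks_suspicious(report))
```

A page produced by stitching several complete documents together, as in the extract above, will show more than one opening tag for each of these elements.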


        • #5
          The invalid pages have existed for years, so I never thought to look
          closely at them (they are generated by a genealogy program). Thanks again
          for your quick help and getting me pointed in the right direction! I've
          figured out how to correct the HTML mess and now Zoom Search is working
          perfectly. Zoom Search is a great product with Awesome support!
