Suspected invalid html on page with version 6.0.1010


  • #16
    Originally posted by cardiogr:
    In version 6.0.1010 the pages with errors were skipped. This is why I am asking if in this version pages are also skipped.
    I can identify the HTML errors (it now gives more information in the warning description). It is very painful, however, to fix 2000 pages (even with regular expressions), so for now I had to downgrade to 6.0.1010.
    Are you sure the content of your pages with HTML errors was actually indexed significantly less than in build 6.0.1010? Note that while build 6.0.1011 may report more warning messages (because it does a better job of finding and reporting them), it also handles broken HTML better, so it will index more content than the previous build. So while there may be more warnings, it may actually have indexed more content. A good way to check is to compare the "number of unique words indexed" reported at the end of indexing by the two builds.

    We've added more functionality to tolerate certain bad HTML scenarios in the next build (6.0.1012), but that may still be a few weeks away. We're trying to find a balance between providing proper support for valid HTML (which is important) and not penalizing common broken/invalid HTML too harshly.
    --Ray
    Wrensoft Web Software
    Sydney, Australia
    Zoom Search Engine



    • #17
      Originally posted by firstrebel:
      You are right, Ray. Has this changed since 5.1? I am sure they were set up correctly in that version.
      You probably used a V6 build prior to 6.0.1009 which had this bug:


      • Fixed bug with importing V5 config files where file extensions with thumbnail settings are imported as file type "HTML page".
      And it probably carried itself over when the config file was saved with the old build. It has been fixed in build 6.0.1009 and after. More info here.
      --Ray
      Wrensoft Web Software
      Sydney, Australia
      Zoom Search Engine



      • #18
        Originally posted by Ray:
        Are you sure the content of your pages with HTML errors was actually indexed significantly less than in build 6.0.1010? Note that while build 6.0.1011 may report more warning messages (because it does a better job of finding and reporting them), it also handles broken HTML better, so it will index more content than the previous build. So while there may be more warnings, it may actually have indexed more content. A good way to check is to compare the "number of unique words indexed" reported at the end of indexing by the two builds.
        I didn't find out whether 1011 indexes pages with errors or not. Once I started seeing the warning messages, I assumed these pages were being omitted as in build 1010, so I stopped the indexing process and downgraded to 1010.

        Originally posted by Ray:
        We've added more functionality to tolerate certain bad HTML scenarios in the next build (6.0.1012), but that may still be a few weeks away. We're trying to find a balance between providing proper support for valid HTML (which is important) and not penalizing common broken/invalid HTML too harshly.
        Thank you. This looks promising. I will be waiting for version 1012. Until then 1010 works fine for me.



        • #19
          I see that a new build, 1012, is now available.
          One of the new features is:
          Added option to toggle "Log HTML warnings" on the "Index Log" configuration panel. By unchecking this option, you can suppress "Suspected invalid HTML..." type messages.
          Does deselecting "Log HTML warnings" only stop the error messages from being displayed, or does it also make the search engine less strict with HTML errors?

          Thank you.



          • #20
            It only stops displaying the error messages. There is no reason to be any less strict with the handling of HTML errors. We already try to handle errors as tolerantly as we can. Believe me, if we could easily make it more tolerant of HTML errors, and produce correct behaviour, we would!

            We're not enforcing HTML strictness for the sake of promoting standards compliance or anything like that. Our focus is solely on indexing a large amount of data accurately (or meaningfully) and as quickly as possible. Being as tolerant as most browsers are requires a much more intricate parsing process and much slower indexing.

            As explained earlier, our previous implementation which was comparatively "less strict" actually produced buggy behaviour whereby valid HTML was parsed incorrectly. So this is not an option.
            --Ray
            Wrensoft Web Software
            Sydney, Australia
            Zoom Search Engine
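
For readers who want to find malformed pages before indexing them, here is a minimal sketch of a strict tag-balance check using Python's standard `html.parser`. It is not how Zoom parses HTML; the class name `TagBalanceChecker`, the function `check_html`, and the warning wording are all hypothetical. It is deliberately stricter than HTML itself (for example, it flags an omitted `</p>`, which the HTML specification allows), so treat its output as hints rather than definite errors.

```python
from html.parser import HTMLParser

# Void elements never take a closing tag, so they are not tracked.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class TagBalanceChecker(HTMLParser):
    """Hypothetical checker: records tags that are closed out of order or never closed."""

    def __init__(self):
        super().__init__()
        self.open_tags = []   # stack of (tag name, line number where it opened)
        self.warnings = []

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.open_tags.append((tag, self.getpos()[0]))

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if self.open_tags and self.open_tags[-1][0] == tag:
            self.open_tags.pop()
        else:
            # Closing tag does not match the most recently opened tag.
            self.warnings.append(f"line {self.getpos()[0]}: unexpected </{tag}>")


def check_html(text):
    """Return a list of warning strings for the given HTML text."""
    checker = TagBalanceChecker()
    checker.feed(text)
    checker.close()
    # Anything still on the stack was opened but never closed.
    for tag, line in checker.open_tags:
        checker.warnings.append(f"line {line}: <{tag}> never closed")
    return checker.warnings
```

Running `check_html` over each file in a site directory and printing only the files with a non-empty result gives a worklist of pages to fix, which may be easier than hunting for errors with regular expressions.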



            • #21
              Originally posted by Ray:
              It only stops displaying the error messages. There is no reason to be any less strict with the handling of HTML errors. We already try to handle errors as tolerantly as we can. Believe me, if we could easily make it more tolerant of HTML errors, and produce correct behaviour, we would!

              We're not enforcing HTML strictness for the sake of promoting standards compliance or anything like that. Our focus is solely on indexing a large amount of data accurately (or meaningfully) and as quickly as possible. Being as tolerant as most browsers are requires a much more intricate parsing process and much slower indexing.

              As explained earlier, our previous implementation which was comparatively "less strict" actually produced buggy behaviour whereby valid HTML was parsed incorrectly. So this is not an option.
              Web page authors need to be more accurate with HTML now than ever: the more complex pages get, with more bells and whistles, the greater the chance of incorrect rendering.

              I check every page before making it live, and I have burnt much midnight oil in the past correcting many mistakes (mine) in the thousands of pages I have.

              Having the search engine indexer locate these errors is very helpful, but I do not think the developers should be burdened by our mistakes. They have enough to do.

              Just my 2c

              Bob
              Robert Isaac
              Volvo Owners Club
