HTML index error (strange of course)

  • HTML index error (strange of course)

    Hi again,
    allow me one (hopefully last) stupid question.
    I came across a "funny" HTML error message with some of our pages.
    Interesting: the HTML is not perfect, but according to various HTML checkers it is
    OK. 1127 errors were reported for 52644 files (see log).
    There are several things on the HTML pages which I initially thought might be the reason for the problem, but they are fine on other pages.

    ...any idea what this might be?...
    Thanks already...
    Greetings ...

    ---------------------
    14:17:40 - [ERROR] Invalid HTML found while spidering http://harvester.fzk.de/harvester/mouse/IPI00758/IPI00758369.htm, page aborted
    14:17:42 - [ERROR] Invalid HTML found while spidering http://harvester.fzk.de/harvester/mouse/IPI00759/IPI00759894.htm, page aborted
    14:17:43 - [ERROR] Invalid HTML found while spidering http://harvester.fzk.de/harvester/mouse/IPI00759/IPI00759928.htm, page aborted
    14:17:47 - [ERROR] Invalid HTML found while spidering http://harvester.fzk.de/harvester/mouse/IPI00760/IPI00760082.htm, page aborted
    14:17:50 - [ERROR] Invalid HTML found while spidering http://harvester.fzk.de/harvester/mouse/IPI00118/IPI00118096.htm, page aborted
    14:17:53 - [ERROR] Invalid HTML found while spidering http://harvester.fzk.de/harvester/mouse/IPI00118/IPI00118238.htm, page aborted
    14:17:53 - [ERROR] Invalid HTML found while spidering http://harvester.fzk.de/harvester/mouse/IPI00118/IPI00118271.htm, page aborted
    14:17:53 - [ERROR] Invalid HTML found while spidering http://harvester.fzk.de/harvester/mouse/IPI00118/IPI00118296.htm, page aborted
    14:17:53 - [ERROR] Invalid HTML found while spidering http://harvester.fzk.de/harvester/mouse/IPI00118/IPI00118304.htm, page aborted
    14:17:53 - [ERROR] Invalid HTML found while spidering http://harvester.fzk.de/harvester/mouse/IPI00118/IPI00118309.htm, page aborted
    14:19:22 - Indexing completed at Tue Mar 13 14:19:22 2007
    14:19:22 - INDEX SUMMARY
    14:19:22 - Files indexed: 52644
    14:19:22 - Files skipped: 4681074
    14:19:22 - Files filtered: 0
    14:19:22 - Files downloaded: 52644
    14:19:22 - Unique words found: 1778195
    14:19:22 - Total words found: 46694865
    14:19:22 - Avg. unique words per page: 33
    14:19:22 - Avg. words per page: 886
    14:19:22 - Start index time: 13:54:59 (2007/03/13)
    14:19:22 - Elapsed index time: 00:24:23
    14:19:22 - Errors: 1127
    14:19:22 - URLs visited by spider: 52644
    14:19:22 - URLs in spider queue: 0
    14:19:22 - Total bytes scanned/downloaded: 1793589489
    14:19:22 - File extensions:
    14:19:22 - .htm indexed: 52394
    14:19:22 - .html indexed: 250
    14:19:22 - Cleaning up memory used for index data... please wait.
    14:19:22 - Finished cleaning up memory.

  • #2
    What HTML checker did you use? The most common and reliable one is the W3C Validator at:
    http://validator.w3.org/

    Here is one of your pages put through the validator:
    http://validator.w3.org/check?uri=ht...PI00758369.htm

    It failed validation, and reported 161 errors.

    Despite this, the actual problem that Zoom picked up was not mentioned in the report. The cause of the Zoom error message is actually due to some extremely long URLs in the links on your page.

    Below is one of your long links. (I kid you not, that is the actual full link you have on the page):

    <A HREF="http://smart.embl-heidelberg.de/smart/show_motifs.pl?INCLUDE_SIGNALP=INCLUDE_SIGNALP&DO_ PFAM=DO_PFAM&SEQUENCE=MVALSLKICVRHCNVVKTMQFEPSTAVY DACRVIRERVPEAQTGQASDYGLFLSDEDPRKGIWLEAGRTLDYYMLRNG DILEYKKKQRPQKIRMLDGSVKTVMVDDSKTVGELLVTICSRIGITNYEE YSLIQETIEEKKEEGTGTLKKDRTLLRDERKMEKLKAKLHTDDDLNWLDH SRTFREQGVDENETLLLRRKFFYSDQNVDSRDPVQLNLLYVQARDDILNG SHPVSFEKACEFGGFQAQIQFGPHVEHKHKPGFLDLKEFLPKEYIKQRGA EKRIFQEHKNCGEMSEIEAKVKYVKLARSLRTYGVSFFLVKEKMKGKNKL VPRLLGITKDSVMRVDEKTKEVLQEWPLTTVKRWAASPKSFTLDFGEYQE SYYSVQTTEGEQISQLIAGYIDIILKKKQSKDRFGLEGDEESTMLEESVS PKKSTILQQQFNRTGKAEHGSVALPAVMRSGSSGPETFNVGSMPSPQQQV MVGQMHRGHMPPLTSAQQALMGTINTSMHAVQQAQDDLSELDSLPPLGQD MASRVWVQNKVDESKHEIHSQVDAITAGTASVVNLTAGDPADTDYTAVGC AITTISSNLTEMSKGVKLLAALMDDDVGSGEDLLRAARTLAGAVSDLLKA VQPTSGEPRQTVLTAAGSIGQASGDLLRQIGENETDERFQDVLMSLAKAV ANAAAMLVLKAKNVAQVAEDTVLQNRVIAAATQCALSTSQLVACAKVVSP TISSPVCQEQLIEAGKLVDRSVENCVRACQAATGDSELLKQVSAAASVVS QALHDLLQHVRQFASRGEPIGRYDQATDTIMCVTESIFSSMGDAGEMVRQ ARVLAQATSDLVNAMRSDAEAEIDMENSKKLLAAAKLLADSTARMVEAAK GAAANPENEDQQQRLREAAEGLRVATNAAAQNAIKKKIVNRLEVAAKQAA AAATQTIAASQNAAISNKNPSAQQQLVQSCKAVADHIPQLVQGVRGSQAQ AEDLSAQLALIISSQNFLQPGSKMVSSAKAAVPTVSDQAAAMQLSQCAKN LATSLAELRTASQKAHEACGPMEIDSALNTVQTLKNELQDAKMAAAESQL KPLPGETLEKCAQDLGSTSKGVGSSMAQLLTCAAQGNEHYTGVAARETAQ ALKTLAQAARGVAASTNDPEAAHAMLDSARDVMEGSAMLIQEAKQALIAP GDTESQQRLAQVAKAVSHSLNNCVNCLPGQKDVDVALKSIGEASKKLLVD SLPPSTKPFQEAQSELNQAAADLNQSAGEVVHATRGQSGELAAASGKFSD DFDEFLDAGIEMAGQAQTKEDQMQVIGNLKNISMASSKLLLAAKSLSVDP GAPNAKNLLAAAARAVTESINQLIMLCTQQAPGQKECDNALRELETVKGM LENPNEPVSDLSYFDCIESVMENSKVLGESMAGISQNAKTGGNPKAQHTH DAITEAAQLMKEAVDDIMVTLNEAASEVGLVGGMVDAIAEAMSKLDEGTP PEPKGTFVDYQTTVVKYSKAIAVTAQEMMTKSVTNPEELGGLASQMTTDY GHLALQGQMAAATAEPEEIGFQIRTRVQDLGHGCIFLVQKAGALQVCPTD SYTKRELIECARSVTEKVSLVLSALQAGNKGTQACITAATAVSGIIADLD TTIMFATAGTLNAENGETFADHRENILKTAKALVEDTKLLVSGAASTPDK LAQAAQSSAATITQLAEVVKLGAASLGSNDPETQVVLINAIKDVAKALSD LIGATKGAASKPADDPSMYQLKGAAKVMVTNVTSLLKTVKAVEDEATRGT RALEATIEYIKQELTVFQSKDIPEKTSSPEESIRMTKGITMATAKAVAAG 
NSCRQEDVIATANLSRKAVSDMLIACKQASFYPDVSEEVRTRALRYGTEC TLGYLDLLEHVLVILQKPTPELKHQLAAFSKRVAGAVTELIQAAEAMKGT EWVDPEDPTVIAETELLGAAASIEAAAKKLEQLKPRAKPKQADETLDFEE QILEAAKSIAAATSALVKSASAAQRELVAQGKVGSIPANAADDGQWSQGL ISAARMVAAATSSLCEAANASVQGHASEEKLISSAKQVAASTAQLLVACK VKADQDSEAMKRLQVMVTDAGGKILLLERAAGNAVKRASDNLVRAAQKAA FGKADDDDVVVKTKFVGGIAQIIAAQEEMLKKERELEEARKKLAQIRQQQ YKFLPTELREDEG"> ACTIVATE: SMART analysis</A>
    Now, I think you would agree that's a pretty long link. In fact, the URL is 2301 characters long. The maximum URL length in Internet Explorer is 2083 characters; this limit is imposed by the Windows Internet API. Other web browsers and servers have slightly different limits, but a practical limit is still enforced. That link is likely to fail in a variety of browsers and operating systems (I just checked, and it gets truncated if you try to enter that URL in Internet Explorer 7).

    At the moment, Zoom assumes that the page's HTML is broken due to the ridiculously long link, and skips indexing the page. We would recommend looking into why such a long link is needed, and changing your site to use shorter URLs.
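
To find the affected pages before re-indexing, a small script can list every link whose URL exceeds the limit. The following is only a rough sketch: it uses the 2083-character Internet Explorer figure quoted above, the names `LongLinkFinder` and `find_long_links` are made up for this example, and it is not how Zoom itself checks pages.

```python
from html.parser import HTMLParser

# 2083 is the Internet Explorer limit quoted in this thread; other clients differ.
MAX_URL_LENGTH = 2083


class LongLinkFinder(HTMLParser):
    """Collects <a href="..."> values that are longer than a given limit."""

    def __init__(self, limit=MAX_URL_LENGTH):
        super().__init__()
        self.limit = limit
        self.long_links = []  # list of (length, first 60 characters)

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names, so <A HREF=...> matches too.
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value is not None and len(value) > self.limit:
                self.long_links.append((len(value), value[:60]))


def find_long_links(html, limit=MAX_URL_LENGTH):
    """Return (length, url_prefix) for every over-long link in the HTML text."""
    finder = LongLinkFinder(limit)
    finder.feed(html)
    finder.close()
    return finder.long_links


if __name__ == "__main__":
    # Hypothetical page: one over-long link and one normal link.
    page = (
        '<A HREF="http://example.org/?SEQUENCE=' + "M" * 2300 + '">long</A>'
        '<a href="http://example.org/">short</a>'
    )
    for length, prefix in find_long_links(page):
        print(length, prefix)
```

Running this over the saved .htm files would show which pages contain links like the one above.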
    Last edited by Ray; Mar-13-2007, 11:45 PM.
    --Ray
    Wrensoft Web Software
    Sydney, Australia
    Zoom Search Engine


    • #3
      For a second opinion we also used the more in-depth 'CSE HTML Checker'.

      It found 179 HTML errors and warnings on your first page.


      • #4
        long links..

        Thanks Ray for the help...
        (w3 validator detects everything ...a bit too much )

        Unfortunately, the long links we use are "amino acid sequences" (proteins) that we have to pass to the corresponding bioinformatics server.

        cutting the link/ protein? Let me see...ah....we would have no eyes (probably)
        ... ...anyway...thanks for your help....we will find a way around the
        "long link problem" ....

        Thanks and greetings....


        • #5
          I don't mean to suggest that you should truncate the data. But some data is not suitable to be sent via the URL, and this is one such case. It would make more sense, for example, if the database/backend did not depend on the sequence as the identifying parameter. Usually, a database would be designed to contain a shorter internal ID (unique within your database only) that you can use instead.

          And if large data needs to be sent between pages, typically you should use HTTP POST (e.g. forms) instead of HTTP GET (parameters in URL).
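
As a rough illustration of the form-style approach (HTTP POST, which is what an HTML form with method="post" sends), the sketch below puts the sequence in the request body so the URL itself stays short. The endpoint `http://example.org/analyze` and the helper name `build_post_request` are placeholders, the `SEQUENCE` parameter name is taken from the link quoted earlier, and whether the target server accepts POST for this script is an assumption you would need to verify.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_post_request(endpoint, sequence):
    """Build (but do not send) a POST request carrying the sequence in the body."""
    # The data is form-encoded into the body instead of being appended to the URL.
    body = urlencode({"SEQUENCE": sequence}).encode("ascii")
    return Request(
        endpoint,
        data=body,
        method="POST",
        headers={"Content-Type": "application/x-www-form-urlencoded"},
    )


if __name__ == "__main__":
    # Hypothetical endpoint and a made-up 2300-character sequence.
    request = build_post_request("http://example.org/analyze", "MVALSLKICV" * 230)
    print(request.get_method())   # request method
    print(len(request.full_url))  # URL length, independent of the sequence length
    print(len(request.data))      # body length, which grows with the sequence
```

On the page itself, the equivalent would be a small form with a hidden field for the sequence and a submit button, rather than a plain link.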

          As mentioned before, the existing website implementation is already broken for Internet Explorer and probably many other web clients. So there may well already be some missing eyes and noses as it is.
          --Ray
          Wrensoft Web Software
          Sydney, Australia
          Zoom Search Engine
