ampersand & html-entity not encoded in urllist.txt


  • ampersand & html-entity not encoded in urllist.txt

    Hi,
    We're trying to index vBulletin 3.8.7, and in urllist.txt the ampersand '&' is written as the HTML entity '& # 38 ;' (spaces added, because without them the forum doesn't render it).


    If I copy and paste one of these links into a browser, the page doesn't load, but if I replace '& # 38 ;' with '&' it works.
    How can I fix this?



    Our Zoom installation is now only indexing about 1/5 of our entire forum for some reason. Would this be related?
    Last edited by boxoffice; Jun-14-2011, 12:35 AM. Reason: wouldnt render hash 38

  • #2
    I had to split the example over several lines, and I put spaces between the characters of the hash 38 entity:

    Code:
    /showthread.php
    ?s=
    01bce59246981295a95a5f385c0c33af
    & # 3 8 ;
    p=2913896
    I wouldn't mind if the ?s= query string wasn't there at all; it's not necessary. I suppose it's a session ID.
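If the session parameter is unwanted in a list of URLs, it can be stripped after the fact. The following is only a sketch using Python's standard library; the function name `drop_query_param` is made up for illustration, and the assumption that `s` is the session parameter comes from the guess above, not from any vBulletin or Zoom documentation.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def drop_query_param(url, name="s"):
    """Return `url` with every query parameter called `name` removed."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key != name
    ]
    # Rebuild the URL with the remaining parameters in their original order.
    return urlunsplit(parts._replace(query=urlencode(kept)))


if __name__ == "__main__":
    print(drop_query_param(
        "/showthread.php?s=01bce59246981295a95a5f385c0c33af&p=2913896"
    ))
```

Note that this expects a literal '&' between parameters, so any entity-encoded ampersands would need to be decoded first.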



    • #3
      Originally posted by boxoffice:
      We're trying to index vBulletin 3.8.7, and in urllist.txt the ampersand '&' is written as the HTML entity '& # 38 ;' (spaces added, because without them the forum doesn't render it).
      You're right. This is a bug of sorts. There's no official format for "urllist.txt", so it's hard to say what's really expected. The XML sitemap requires HTML entity encoded URLs, and we simply made the TXT sitemap (urllist.txt) use the same encoding. But there probably isn't a real need or expectation for this, and it should probably be unescaped. We'll change this in the next build release (V6.0 build 1026).
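Until a build with that change is installed, an existing urllist.txt could be post-processed by hand. This is a minimal sketch, not part of Zoom: it assumes the file is UTF-8 text with one URL per line, and the function names are hypothetical. It replaces only the `&#38;` reference reported in this thread and leaves everything else untouched.

```python
from pathlib import Path

AMP_REF = "&#38;"  # decimal character reference for "&"


def unescape_ampersands(line):
    """Replace every &#38; in one URL line with a literal ampersand."""
    return line.replace(AMP_REF, "&")


def fix_urllist(path, encoding="utf-8"):
    """Rewrite the file at `path` in place; return how many lines changed."""
    p = Path(path)
    original = p.read_text(encoding=encoding).splitlines()
    fixed = [unescape_ampersands(line) for line in original]
    p.write_text("\n".join(fixed) + "\n", encoding=encoding)
    return sum(1 for before, after in zip(original, fixed) if before != after)
```

Calling `fix_urllist("urllist.txt")` on a copy of the file first is the safer way to try it, since the rewrite happens in place.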

      Originally posted by boxoffice:
      Our Zoom installation is now only indexing about 1/5 of our entire forum for some reason. Would this be related?
      Unlikely, the sitemap wouldn't affect the indexing at all. Was it indexing more before?

      It might be a recent Skip List entry that matches too many URLs. You should check your Skip List, and check your Index Log for the reasons pages were skipped. See the FAQ:
      Q. Why are some of my pages being skipped by the indexer?

      These very forums are running that version of vBulletin at the moment, and it indexes fine for us.
      --Ray
      Wrensoft Web Software
      Sydney, Australia
      Zoom Search Engine



      • #4
        Thanks. To get around the bug I ticked 'use cookies from windows & IE to login and out of webpages' under 'authentication'. This removed the query string from the URL, so the hash 38 problem went away.

        I tried indexing again and watched the log, and I see 503 errors on certain forum URLs. The same URLs cause 503 errors every time I index.
        E.g.
        Code:
        Could not download file: http://www.aaa.au/forums/forumdisplay.php?f=453 (503 HTTP Error)
        I checked the server log (LiteSpeed server) and the 503 error is related to a cookie issue: the vBulletin cookie is getting too big to return in the HTTP header.
        E.g.
        Code:
        [111.29.0.111:57127-0#APVH_aaa.com.au] Cookie len: 5823,
        bbsessionhash=4ad0a16b8215727f1b3eed2450be103b;
        bbforum_view=d108eaafbe8ff8a4e8ab745cfa8c5fb0e42bd9e4a-211-%7Bi-211_i-
        1308016173_i-300_i-1308017267_i-69_i-1308016224_i-1_i-1308016033_i-125_i-
        1308027340_i-65_i-1308027354_i-134_i-1308016294_i-78_i-1308016045_i-
        80_i-1308094178_i-361_i-
        1308094240_i-101_i-1308031727_i-291_i-1308016069_i-385_i-
        1308094237_i-362_i-1308038686_i-45_i-1308094170_i-189_i-
        1308106284_i-140_i-1308094149_i-24_i-1308016118_i-129_i-
        1308016118_i-132_i-1308016641_i-343_i-1308094117_i-611_i-
        1308016120_i-661_i-1308016121_i-507_i-1308094128_i-693_i-
        1308016709_i-156_i-1308017177_i-457_i-1308094133_i-212_i-
        1308094134_i-695_i-1308016125_i-692_i-1308016126_i-149_i-
        1308094140_i-264_i-1308019934_i-100_i-1308094146_i-701_i-
        1308016129_i-162_i-1308094148_i-164_i-1308016806_i-285_i-
        1308094152_i-147_i-1308033522_i-703_i-1308016131_i-157_i-
        1308043730_i-2_i-1308016869_i-38_i-1308033599_i-
        44_i-1308016135_i-82_i-1308094175_i-81_i-1308033617_i-508_i-
        1308094181_i-699_i-1308016943_i-127_i-1308094184_i-226_i-
        1308016960_i-83_i-1308016139_i-85_i-1308094192_i-586_i-
        1308016996_i-302_i-1308094195_i-304_i-1308094197_i-714_i-
        1308017042_i-715_i-1308017029_i-711_i-1308017037_i-206_i-
        1308017064_i-403_i-1308094215_i-230_i-1308044014_i-225_i-
        1308016146_i-512_i-1308017098_i-509_i-1308048266_i-292_i-
        1308034149_i-723_i-1308094231_i-320_i-1308094236_i-323_i-
        1308094239_i-88_i-1308094242_i-676_i-1308094246_i-450_i-
        1308017210_i-63_i-1308017230_i-71_i-1308017319_i-
        70_i-1308041637_i-288_i-1308048602_i-99_i-1308094160_i-
        72_i-1308094153_i-84_i-1308094190_i-155_i-1308094160_i-
        95_i-1308094141_i-316_i-1308093476_i-208_i-1308017495_i-294_i-
        1308094229_i-67_i-1308094139_i-138_i-1308094158_i-102_i-
        1308094155_i-75_i-1308017472_i-74_i-1308017479_i-
        90_i-1308094121_i-98_i-1308041701_i-141_i-1308094163_i-324_i-
        1308016304_i-87_i-1308094149_i-163_i-1308019133_i-122_i-
        1308042557_i-73_i-1308019137_i-173_i-1308019138_i-148_i-
        1308094136_i-406_i-1308031745_i-296_i-1308094236_i-391_i-
        1308016541_i-397_i-1308016542_i-399_i-1308016543_i-401_i-
        1308016543_i-144_i-1308094207_i-145_i-1308094207_i-
        94_i-1308094159_i-136_i-1308019276_i-295_i-1308094235_i-346_i-
        1308106285_i-664_i-1308019404_i-338_i-1308106249_i-345_i-
        1308094118_i-344_i-1308094117_i-209_i-1308016656_i-616_i-
        1308094122_i-674_i-1308016669_i-641_i-1308032644_i-621_i-
        1308039034_i-609_i-1308039047_i-567_i-1308043012_i-533_i-
        1308039070_i-523_i-1308039113_i-516_i-1308039087_i-506_i-
        1308039097_i-494_i-1308039120_i-477_i-1308039132_i-464_i-
        1308039141_i-461_i-1308039170_i-456_i-1308043067_i-447_i-
        1308043102_i-443_i-1308039196_i-405_i-1308039242_i-256_i-
        1308039213_i-253_i-1308039232_i-233_i-1308032834_i-179_i-
        1308019743_i-176_i-1308019756_i-168_i-1308019773_i-119_i-
        1308019775_i-114_i-1308019797_i-109_i-1308019785_i-104_i-
        1308019798_i-501_i-1308094124_i-130_i-1308046187_i-237_i-
        1308016699_i-613_i-1308032948_i-525_i-1308016707_i-282_i-
        1308094131_i-284_i-1308019850_i-228_i-1308043322_i-694_i-
        1308016718_i-524_i-1308094132_i-214_i-1308094135_i-213_i-
        1308043412_i-137_i-1308094138_i-449_i-1308094135_i-497_i-
        1308016738_i-154_i-1308094138_i-150_i-1308094140_i-160_i-
        1308094141_i-66_i-1308019926_i-153_i-1308094137_i-151_i-
        1308043498_i-139_i-1308094144_i-728_i-1308016771_i-622_i-
        1308016771_i-529_i-1308033355_i-503_i-1308039860_i-465_i-
        1308033366_i-448_i-1308039874_i-402_i-1308033390_i-313_i-
        1308039887_i-259_i-1308033401_i-219_i-1308033405_i-187_i-
        1308020049_i-184_i-1308020055_i-218_i-1308033411_i-181_i-
        1308016780_i-261_i-1308020075_i-480_i-1308094146_i-
        86_i-1308016789_i-366_i-1308033477_i-93_i-1308094149_i-273_i-
        1308033494_i-146_i-1308094157_i-198_i-1308043729_i-161_i-
        1308040017_i-436_i-1308094162_i-372_i-1308094165_i-708_i-
        1308040032_i-48_i-1308033568_i-54_i-1308040033_i-267_i-
        1308033585_i-9_i-1308040037_i-12_i-1308033570_i-580_i-
        1308033571_i-34_i-1308033578_i-40_i-1308020255_i-696_i-
        1308016856_i-174_i-1308094166_i-26_i-1308094167_i-669_i-
        1308016866_i-681_i-1308016867_i-655_i-1308016868_i-593_i-
        1308033591_i-632_i-1308016874_i-576_i-1308016875_i-
        7_i-1308020326_i-39_i-1308020339_i-27_i-1308094168_i-678_i-
        1308016887_%7D; 
        bbthread_lastview=198a35fab544b76b1fa57bcb5a56dacb77e4b40ea-
        68-%7Bi-412517_i-1308016592_i-412540_i-1308016106_i-412573_i-
        1308016531_i-412444_i-1308019409_i-411084_i-1308016954_i-
        412588_i-1308019078_i-412587_i-1308016919_i-412591_i-
        1308017374_i-412592_i-1308025770_i-412589_i-1308018454_i-
        412598_i-1308018808_i-412159_i-1308031555_i-412596_i-
        1308032335_i-412599_i-1308021172_i-412601_i-1308026715_i-
        412617_i-1308026099_i-412607_i-1308024310_i-412612_i-
        1308023664_i-412597_i-1308020321_i-412621_i-1308026137_i-
        412605_i-1308027316_i-412637_i-1308032135_i-412618_i-
        1308027835_i-412603_i-1308034030_i-412595_i-1308028668_i-
        412602_i-1308026806_i-412627_i-1308031904_i-412632_i-
        1308033308_i-412639_i-1308036179_i-412633_i-1308031003_i-
        412614_i-1308024999_i-412626_i-1308029927_i-412638_i-
        1308031986_i-412630_i-1308036331_i-412655_i-1308042793_i-
        412656_i-1308038736_i-412657_i-1308041223_i-412648_i-
        1308042338_i-412642_i-1308039780_i-412662_i-1308045296_i-
        412666_i-1308041580_i-412667_i-1308044472_i-412675_i-
        1308043882_i-412676_i-1308047217_i-412682_i-1308045506_i-
        412677_i-1308046846_i-412674_i-1308047577_i-412679_i-
        1308046328_i-412683_i-1308046224_i-409864_i-1308051782_i-
        411373_i-1308023987_i-412575_i-1308058583_i-412631_i-
        1308042532_i-412493_i-1308089464_i-412708_i-1308055520_i-
        412114_i-1308064484_i-412609_i-1308029814_i-412733_i-
        1308090127_i-412304_i-1308035263_i-412490_i-1308031211_i-
        412690_i-1308059136_i-412698_i-1308052816_i-412485_i-
        1308031432_i-412371_i-1308023416_i-412145_i-1308091750_i-
        404322_i-1308035039_i-412616_i-1308030518_i-409789_i-
        1308016500_%7D; 
        __utma=3538595.1256908962.1264710487.1264710487.1264710487.1; 
        bblastvisit=1308015973; bblastactivity=0; bbstyleid=1; 
        bbreferrerid=29794; 
        PHPSESSID=5fc3080a5c7e7401abb6fe9a99bbecaa
        Any suggestions?



        • #5
          Can't say I've ever heard of a web server failing because of a cookie length.

          Can't say I've ever used Litespeed Server either.

          I presume manually going to that URL from your browser also returns a 503 error? We can't comment since you didn't give us the real URL.

          The problem here is LiteSpeed Server failing with vBulletin. This side of the problem no longer has anything to do with our product (Zoom), and we really aren't the most appropriate people to address it. We would recommend contacting the developers of LiteSpeed Server or looking in their forums for a hint.
          --Ray
          Wrensoft Web Software
          Sydney, Australia
          Zoom Search Engine



          • #6
            Thanks Ray for your help,
            I will check it out on LiteSpeed's forum.
            FYI, servers have a size limit for the cookies sent to them. If a request's cookies are over the limit, the server returns an error for that request.
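To see how close a set of cookies is to such a limit, the size of the Cookie header can be estimated offline. The sketch below is an illustration only: the helper names are made up, and `ASSUMED_LIMIT` is a placeholder value, not LiteSpeed's actual setting, which would have to be looked up in the server configuration.

```python
ASSUMED_LIMIT = 4096  # hypothetical limit in bytes; check the real server setting


def cookie_header_size(cookies):
    """Size in bytes of a 'Cookie: name=value; ...' request header line."""
    value = "; ".join(f"{name}={val}" for name, val in cookies.items())
    return len(f"Cookie: {value}".encode("utf-8"))


def largest_cookies(cookies, top=3):
    """Names of the cookies with the longest values, largest first."""
    return sorted(cookies, key=lambda name: len(cookies[name]), reverse=True)[:top]


if __name__ == "__main__":
    # Placeholder values standing in for a long bbforum_view cookie.
    cookies = {"bbsessionhash": "a" * 32, "bbforum_view": "x" * 5000}
    size = cookie_header_size(cookies)
    print(size, "bytes; over assumed limit:", size > ASSUMED_LIMIT)
    print("largest:", largest_cookies(cookies, top=1))
```

In the log above, bbforum_view and bbthread_lastview are by far the longest values, so they are the ones a check like this would flag.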



            • #7
              Hi Ray

              I had a look at the page http://www.sitemaps.org/protocol.php, which is referenced by Google's sitemap help pages: http://www.google.com/support/webmas...rom=40318&rd=1

              The protocol says that you're supposed to use & amp ; not & # 38 ; in place of ampersands (spaces added by me so you can see it):

              Code:
              Your Sitemap file must be UTF-8 encoded (you can generally do this when you save the file). As with all XML files, any data values (including URLs) must use entity escape codes for the characters listed in the table below.
              Character               Escape Code
              Ampersand       &       & amp ;
              Single Quote    '       & apos ;
              Double Quote    "       & quot ;
              Greater Than    >       & gt ;
              Less Than       <       & lt ;



              • #8
                Entity encoding exists in both forms: the abbreviated named form ("amp") as well as the numeric code point form ("#38", which is the decimal value).

                The XML sitemap format simply obeys the XML specification. The part of the page you quoted is just a brief explanation of entity encoding, and it gives those examples for simplicity's sake.
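The equivalence of the two forms can be checked with any XML parser. As a small sketch using Python's standard library, the snippet below parses the same `<loc>` value written with the named form, the decimal form, and the hexadecimal form (`&#x26;`); the example.com URL and the helper name `loc_text` are placeholders for illustration.

```python
import xml.etree.ElementTree as ET


def loc_text(escaped_url):
    """Parse a one-entry XML fragment and return the decoded <loc> text."""
    xml_doc = f"<url><loc>{escaped_url}</loc></url>"
    return ET.fromstring(xml_doc).findtext("loc")


BASE = "http://www.example.com/showthread.php?s=abc"  # placeholder URL

named = loc_text(BASE + "&amp;p=1")
decimal = loc_text(BASE + "&#38;p=1")
hexadecimal = loc_text(BASE + "&#x26;p=1")

if __name__ == "__main__":
    # All three spellings decode to the same URL with a literal ampersand.
    print(named == decimal == hexadecimal)
```

An XML consumer therefore sees the same URL either way; the difference only matters for plain-text files such as urllist.txt, where nothing decodes the reference.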

                More information:
                http://www.w3.org/International/questions/qa-escapes
                http://www.montana.edu/readme/xml/xml-entities.html
                http://en.wikipedia.org/wiki/List_of...ity_references
                --Ray
                Wrensoft Web Software
                Sydney, Australia
                Zoom Search Engine

