UTF-8 problem with ASP script


  • UTF-8 problem with ASP script

    Hi Wrensoft

    I've used Zoom on my site for a year or so with windows-1252 encoding. Now I've launched a re-design that uses UTF-8, but I can't get this to work with Zoom.

    I've selected UTF-8 in the config, and I've made sure that my pages, where the Zoom SERP is included, have the correct encoding and are saved in UTF-8 file format.

    However, the SERP is displaying the raw UTF-8 character pairs instead of the intended UTF-8 character. When it should be showing "spørgsmål" it shows "spørgsmÃ¥l", etc.

    Take a look at the search page here*
    http://www.kosmetiskguide.dk/soeg.asp?zoom_cat%5B%5D=-1&zoom_query=bryster

    (*) Notice how the heading "Søgning" displays "ø" correctly. This is a UTF-8-encoded string.

    It is only the indexed content that has problems. The surrounding part of the page, including header, footer, etc., displays UTF-8 correctly. The texts contained in settings.asp are HTML encoded, so they display correctly too. Only the content generated by the indexer cannot display the UTF-8 characters.

    I hope you can help, thanks!

    Patrick

  • #2
    The problem is that you have a wrapper page (named "soeg.asp") which is in fact calling our script ("search.asp"). Your wrapper script is corrupting the encoding returned by our script. If you go to the URL of the default script here, you will see that the encoding is originally correct:
    http://www.kosmetiskguide.dk/search.asp?zoom_cat%5B%5D=-1&zoom_query=bryster

    So the question is, what is "soeg.asp" doing (we don't know this since it is your script)? It seems to have taken the UTF-8 output returned by our script, treated it as windows-1252, and converted it to UTF-8 a second time, thus corrupting the output.
    --Ray
    Wrensoft Web Software
    Sydney, Australia
    Zoom Search Engine
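
The double conversion suspected in this reply can be reproduced outside ASP. The following is a minimal Python sketch, not the wrapper's actual code: it assumes the corruption is exactly "UTF-8 bytes misread as windows-1252, then re-encoded as UTF-8", and the helper names `double_encode` and `repair` are hypothetical.

```python
def double_encode(text: str) -> bytes:
    """Simulate the suspected bug: UTF-8 bytes are misread as windows-1252,
    and the misread text is then encoded as UTF-8 a second time."""
    utf8_bytes = text.encode("utf-8")        # correct output of the search script
    misread = utf8_bytes.decode("cp1252")    # wrapper wrongly assumes windows-1252
    return misread.encode("utf-8")           # second, unwanted conversion


def repair(corrupted: bytes) -> str:
    """Reverse the double conversion. Only valid if the assumption above holds."""
    return corrupted.decode("utf-8").encode("cp1252").decode("utf-8")


corrupted = double_encode("å")
shown_in_browser = corrupted.decode("utf-8")  # "Ã¥", as in the "spørgsmÃ¥l" example
restored = repair(corrupted)                  # "å"
```

If the sketch reproduces the symptom, that supports the diagnosis but does not identify which script triggers the extra conversion.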


    • #3
      Hi Ray

      Thanks for looking at this issue for me!

      I'm totally clueless here.. I can see what you mean, but I can't understand why your page displays fine on its own, but not wrapped in our other page.

      Our wrapper page is made up of a lot of include files, so pasting the code here wouldn't help much. However, the page doesn't do anything encoding-wise other than what can be seen in the HTML source, namely the meta tag:

      Code:
      <meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
      The asp script doesn't set any charset or codepage with the response object, so nothing is done behind the scenes.

      All our asp files are saved in UTF8 file format, and each page includes several UTF8-encoded files, and displays them without problem.

      In order to solve this strange issue, I tried every possible combination..

      Saved soeg.asp and search.asp both in ANSI and UTF8 file format
      Changing the file format for soeg.asp doesn't do anything. Saving search.asp in UTF-8 makes it fail to display UTF-8 characters even on its own.

      Included these lines at the top of soeg.asp:
      Code:
      Response.CharSet = "UTF-8"
      Response.CodePage = 65001
      Setting the charset this way didn't change anything. Setting the codepage made search.asp fail to display UTF-8 characters even on its own for the rest of my session.

      Commented out this line in search.asp
      Code:
      Response.Charset = Charset
      Did nothing

      Removed the encoding meta tag entirely from soeg.asp, so that this wrapper page would do nothing to the page encoding
      Did nothing

      After having tried every conceivable combination of the above steps, I'm clueless about this. I'll admit, I'm an idiot when it comes to encoding, but the rest of the site works fine with similar wrapper-include page relationships.

      I'm thinking, could it have something to do with what characters are actually indexed from the site, or maybe it is related to the index files?!

      I hope this information can help in finding the cause of this problem!


      • #4
        Originally posted by pblasone:
        Our wrapper page is made up of a lot of include files, so pasting the code here wouldn't help much.
        Actually, it would be interesting to note exactly how you are including the "search.asp" file. We can't give any meaningful advice unless we can see what's going on in "soeg.asp".

        Originally posted by pblasone:
        Saved soeg.asp and search.asp both in ANSI and UTF8 file format
        Changing the file format for soeg.asp doesn't do anything. Saving search.asp in UTF-8 makes it fail to display UTF-8 characters even on its own.
        Changing the format of the script source code will not help this situation. I suspect the problem is with the output produced by the ASP script, not the actual source code (then again, I'm not sure until I can confirm how the scripts are being included/wrapped).

        Changing "search.asp" is not advised, since we can't tell what changes you make and how it has affected the functionality of the script. Was "search.asp" changed at all prior to this issue appearing?

        EDIT: I just noticed you are using an old build (V5.1 build 1007). A similar issue was addressed a while ago (the fix prevents the web server's default encoding from forcing unnecessary automatic conversions in ASP... more information here).

        Download and upgrade to the latest version and build of Zoom available here. Re-index and upload your search files (make sure to use the latest default search script), and see if that helps.
        --Ray


        • #5
          I haven't changed search.asp itself. I did make some changes to a copy of it, saved as searcho.asp, which is the file we actually include in soeg.asp at the moment. The changes are only to the HTML of the results and the form layout, so we could achieve the desired presentation.

          I didn't mention this before, because I thought it would just add to the confusion. I uploaded search.asp anyway, but this acts just like searcho.asp in terms of encoding. What is important is, that the uploaded file search.asp is unchanged, and so we should base our debugging effort on this file.

          I guess I'll have to try the upgrade. Before I do, will it remember my config, or should I make some kind of backup?


          • #6
            When you install the new version, it will prompt you if you want to overwrite the existing "zoom.zcfg" (the default configuration file). Answer "No" if this is the configuration file you are using. Though it never hurts to have a backup.
            --Ray


            • #7
              Hi Ray

              I installed the new version, and it has now indexed my site and uploaded the files. Nothing has changed: search.asp still displays correctly on its own, but when it is included in soeg.asp the problem remains.

              You requested that I post the code of soeg.asp, so here goes...

              Code:
              <!--#include virtual="includes/pre_html_inc.asp"-->
              <!--#include virtual="lang/language_div.asp"-->
              <%
              strTitleTag = s859
              %>
              <html>
              	<head>
              	    <!--#include virtual="includes/head_tags_inc.asp"-->
              <style type="text/css">
              
              			.highlight { background: #FFFF40; }
              		.searchheading { margin-left: 10px; font-size: 1.5em; font-weight: normal; }
              		.summary { font-style: italic; }
              		.suggestion { font-size: 100%; }
              		.results { font-size: 100%; margin-left: 10px; margin-right: 30px; }
              		.category { color: #777777; }
              		.sorting { text-align: right; }
              
              		.result_title { font-size: 100%; }		
              		.description { font-size: 100%; color: #008000; }
              		.context { font-size: 100%; }
              		.infoline { font-size: 100%; font-style: normal; color: #13a89d;}
              
              		.zoom_searchform { display: block; background-color: #e5f8f6; border: 1px solid #13a89d; width: 606px; padding: 10px 0 10px 0; }
              		.zoom_results_per_page { font-size: 80%; margin-left: 10px; }
              		.zoom_match { font-size: 80%; margin-left: 10px; margin: 10px 0 0 30px; }				
              		.zoom_categories { font-size: 80%; }
                      .zoom_searchform table { margin: 10px 0 0 30px; }
              		.zoom_searchform ul { margin: 0px; padding: 0px; }
              		.zoom_searchform li { float: left; margin-left: 15px; list-style-type: none; width: 150px; }
              		
              		input.zoom_button {  }
              		input.zoom_searchbox {  }		
              		
              		.result_image { float: left; display: block; }
              		.result_image img { margin: 10px; width: 80px; border: 0px; }
              
              		.result_block { margin-top: 15px; margin-bottom: 15px; clear: left; font-family: 'Trebuchet MS', Verdana; font-size: 10pt; }
              		.result_altblock { margin-top: 15px; margin-bottom: 15px; clear: left; font-family: 'Trebuchet MS', Verdana; font-size: 10pt; }
              		
              		.result_pages { font-size: 100%; font-family: 'Trebuchet MS', Verdana; font-size: 10pt; }
              		.result_pagescount { font-size: 100%; font-family: 'Trebuchet MS', Verdana; font-size: 10pt; }
              		
              		.searchtime { font-size: 80%; }
              		
              		.recommended 
              		{ 
              			background: #DFFFBF; 
              			border-top: 1px dotted #808080; 
              			border-bottom: 1px dotted #808080; 
              			margin-top: 15px; 
              			margin-bottom: 15px; 
              		}
              		.recommended_heading { float: right; font-weight: bold; }
              		.recommend_block { margin-top: 15px; margin-bottom: 15px; clear: left; }		
              		.recommend_title { font-size: 100%; }
              		.recommend_description { font-size: 100%; color: #008000; }
              		.recommend_infoline { font-size: 80%; font-style: normal; color: #808080;}			
              
              </style>
              	</head>
              	<body>
              	<!--#include virtual="includes/temp_upper_inc.asp"-->
              
              <div id="breadcrumb"><a href="/"><%=s978%></a> / <%=s860%></div>
              <h1><%=s860%></h1>
              	
              	<div class="hvidTabBoks">
              		<div class="section">
              
              <!--#include file="search.asp"-->
              
              		</div>
              	</div>
              
              	<!--#include virtual="includes/temp_lower_inc.asp"-->
              </body>
              </html>
              <!--#include virtual="includes/post_html_inc.asp"-->
              This shows how search.asp is included, but not much more than this. Let me know if you want other code segments from the include files!

              Thanks, Patrick


              • #8
                The other ASP files you are including before the search script can certainly be affecting the resulting encoding used.

                I am pretty sure the problem is not that our ASP search script fails to encode properly when it is used with a #include. I've just tested the latest build doing this, and indexed some of the pages on your site, and the #included results came out fine. One of your other included ASP scripts is changing the encoding settings, and that is what causes the behaviour.

                You should try isolating the problem by temporarily removing all of the other #include's besides the one that includes "search.asp". See if the problem goes away. If it does, put back in the other #include's one at a time, and see which one it is that changes the behaviour of the page. Then you might want to let us know what's happening within this include file. You could potentially even isolate it down to a particular line that triggers the change in behaviour.

                If you need to provide more source code, you might want to consider e-mailing us rather than posting it in the forum.
                --Ray
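
To work through the includes one at a time, it helps to have them listed first. Below is a small Python sketch, assuming the wrapper uses the standard `<!--#include file|virtual="..."-->` directive form seen in the soeg.asp listing; the function name `list_includes` is hypothetical, and the sample holds three of the directives from that listing.

```python
import re

# Matches classic ASP server-side include directives, for example:
# <!--#include virtual="includes/head_tags_inc.asp"-->
INCLUDE_RE = re.compile(
    r'<!--\s*#include\s+(file|virtual)\s*=\s*"([^"]+)"\s*-->',
    re.IGNORECASE,
)


def list_includes(asp_source: str) -> list[tuple[str, str]]:
    """Return (kind, path) pairs for every #include directive, in document order."""
    return [(m.group(1).lower(), m.group(2)) for m in INCLUDE_RE.finditer(asp_source)]


sample = '''<!--#include virtual="includes/pre_html_inc.asp"-->
<!--#include virtual="lang/language_div.asp"-->
<!--#include file="search.asp"-->'''

includes = list_includes(sample)
for kind, path in includes:
    print(kind, path)
```

Each listed file may itself contain further #include directives, so the same function would need to be run on those files to get the full picture.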


                • #9
                  Hi Ray

                  Good idea you had there, don't know why I didn't try this in the first place?!

                  After playing a session of ASP mastermind, I've found out that the encoding breaks if soeg.asp includes a file that is saved with UTF-8 encoding.

                  A solution that comes to mind is then to take the content of the include files and put all code directly into soeg.asp, making it different from the rest of the files for the site.

                  The problem with this, however, is that some of the include files are language files, and they don't work if not saved in UTF8 encoding. So I'll have to come up with something else?!

                  But knowing what is causing this, I hope that you maybe have a suggestion?

                  Thanks!


                  • #10
                    We looked into this a bit more, and the problem is that when ASP finds a UTF-8 BOM (Byte Order Mark) in the ASP script code, it parses your ASP script as UTF-8 and forces a conversion on all strings (text) that come in and out of your script.

                    More information on the UTF-8 BOM can be found at the W3 support page here:

                    FAQ: Display problems caused by the UTF-8 BOM
                    http://www.w3.org/International/questions/qa-utf8-bom

                    The tricky part is that the BOM can be hard to "see", since many text editors hide it from the end user. If you have a text editor that can view files in hex, however, you will be able to see the three "hidden" bytes in question (EF BB BF, which often look like "ï»¿" at the start of the file when viewed as windows-1252).

                    What we have confirmed however, is that a UTF-8 ASP file without the BOM can be included on a page and not cause the above behaviour. It will not change the encoding of the script's string handling routines.

                    See the section on "Removing the BOM" from the above FAQ page for information on how to do this. We succeeded with editing the page in Dreamweaver, clicking on "Page Properties"->"Title/Encoding" and unchecking the option to "Include Unicode Signature (BOM)", then resaving the file.

                    (Note that some text editors (like UltraEdit) will automatically re-add the BOM into place even if you delete it in hex mode. If you are using a hex editor to do this, be sure to re-check the file in your hex viewer after making the change to be sure that you have successfully removed the BOM.)

                    Hope this is of help.
                    --Ray
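
For anyone without a hex viewer, the check and the removal can also be done with a few lines of script. This is a minimal Python sketch rather than a Wrensoft tool: `has_utf8_bom` and `strip_utf8_bom` are hypothetical helper names, and the sample ASP content is made up for illustration.

```python
import codecs

UTF8_BOM = codecs.BOM_UTF8  # the three bytes EF BB BF


def has_utf8_bom(data: bytes) -> bool:
    """True if the byte string starts with the UTF-8 BOM."""
    return data.startswith(UTF8_BOM)


def strip_utf8_bom(data: bytes) -> bytes:
    """Return the data without a leading UTF-8 BOM; other bytes are untouched."""
    return data[len(UTF8_BOM):] if has_utf8_bom(data) else data


# Made-up file content: a BOM followed by UTF-8 encoded ASP source.
with_bom = UTF8_BOM + '<% strTitleTag = "Søgning" %>'.encode("utf-8")
cleaned = strip_utf8_bom(with_bom)

bom_as_cp1252 = UTF8_BOM.decode("cp1252")  # what a non-UTF-8 editor displays
still_utf8 = cleaned.decode("utf-8")       # the file remains valid UTF-8
```

To apply this to a real file, read it with `open(path, "rb")`, strip, and write the result back in binary mode, then re-run `has_utf8_bom` on the saved file to confirm the editor or script did not put the BOM back.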


                    • #11
                      Hi Ray

                      I have a text editor that cannot handle UTF-8 and therefore shows the BOM in the beginning of files, so I can easily remove it.

                      Actually I thought that removing this would make the file non-UTF8. From reading the link you gave it seems that this BOM is more or less obsolete?!

                      But I was confused about which files I should remove the BOM from - was it....

                      - the files included in soeg.asp?
                      - soeg.asp itself?
                      - search.asp?

                      Patrick


                      • #12
                        I was under the impression that only one of the files included in your soeg.asp file had UTF-8 encoding, based on your previous post. But if the BOM is evident in more than one of the ASP scripts included in soeg.asp (soeg.asp included), then you need to remove them. "search.asp" does not contain a BOM by default.

                        A UTF-8 BOM is only useful if you plan to edit/open the file using a program which depends on the BOM to determine its encoding. Most programs like Dreamweaver can determine the file encoding by looking at the content, meta charset tag, (or by Project settings). Web browsers will pretty much always depend on the meta charset tag.
                        --Ray
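
When it is unclear which of the included scripts carry a BOM, a scan of the site folder can answer that in one pass. The following Python sketch assumes the scripts are reachable on a local disk and use the `.asp` extension; `find_bom_files` is a hypothetical helper, and the demo runs on a throwaway directory whose file names mirror the thread but whose contents are made up.

```python
import codecs
import tempfile
from pathlib import Path


def find_bom_files(root: str, pattern: str = "*.asp") -> list[Path]:
    """Return files under root matching pattern that start with the UTF-8 BOM."""
    hits = []
    for path in sorted(Path(root).rglob(pattern)):
        if not path.is_file():
            continue
        with path.open("rb") as handle:
            # Only the first three bytes are needed to recognise the BOM.
            if handle.read(3) == codecs.BOM_UTF8:
                hits.append(path)
    return hits


# Demo: one include saved with a BOM, one script saved without.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "lang").mkdir()
    (root / "lang" / "language_div.asp").write_bytes(codecs.BOM_UTF8 + b"<% s860 = 1 %>")
    (root / "search.asp").write_bytes(b"<% ' no BOM here %>")
    found_names = [p.name for p in find_bom_files(tmp)]

print(found_names)  # ['language_div.asp']
```

Every file the scan reports is a candidate for BOM removal; files it does not report, such as the default search.asp, can be left alone.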


                        • #13
                          Hi Ray

                          Wow, that worked perfectly! And removing the BOM didn't affect the rest of the site.

                          Thank you very much for solving this issue for us. We greatly appreciate your very high level of customer service.

                          Patrick
