ASCII Character Codes #38 in urllist.txt

  • ASCII Character Codes #38 in urllist.txt

    We are finding that many of the URLs we index have #38 appearing within them when they are viewed in the URL list or even in the delete list.

    They do not appear in the search results once these have been uploaded to our server.

    What does this mean and how is it affecting our indexes and searching?

  • #2
    The hash character "#" is most often used in a URL as a local anchor link (i.e. an intra-page link).

    Having a hash character in a URL is normal for many sites. For example, these two links in fact point to the same page:
    http://www.example.com/page1.html#38
    http://www.example.com/page1.html#39

    And it is normal that this page (page1.html) only appears in the search index once.
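
As a rough illustration of why both links count as one page, the sketch below strips the fragment with Python's standard `urllib.parse.urldefrag` and collects the results in a set. It is not Zoom's actual code; `page_key` is a hypothetical helper name used only for this example.

```python
from urllib.parse import urldefrag

# The two example links from the post above.
urls = [
    "http://www.example.com/page1.html#38",
    "http://www.example.com/page1.html#39",
]


def page_key(url: str) -> str:
    """Return the URL with any '#fragment' part removed (hypothetical helper)."""
    return urldefrag(url).url


# A set keeps one entry per distinct page, regardless of the anchor.
unique_pages = {page_key(u) for u in urls}
print(unique_pages)
```

Both links reduce to the same key, so an indexer that compares pages this way would store page1.html only once.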

    • #3
      Like a dynamically generated bookmark, I take it.

      Okay thanks!
      Last edited by RLF; May-02-2007, 03:17 PM.

      • #4
        No - this isn't the case. This is appearing in a lot of my indexed pages and is causing a problem. I am e-mailing you a copy of my urllist.txt file. When I do an incremental update, it sees these files as different, reindexes them, and then continues to propagate the #38 in the URL, so that it ends up as something like content.php?pid=4&38;38;mid=90. Each incremental index adds the files it finds to the index, so the file count grows by 300 or so URLs each time even though NO new pages are indexed.

        Please see my e-mail. I'll refer to this thread in the email.

        • #5
          The "#38" referred to in the original post actually turned out to be part of an HTML entity: since all URLs need to be entity-escaped in the XML sitemap, all ampersand characters ("&") should be escaped to their entity form ("&#38;").

          However, we've confirmed that there is a bug in which subsequent use of the Incremental Update feature on such URLs causes the entities to be converted to entities again (thus, "&#38;#38;"), which is wrong. We've found that this bug may also have affected other incremental features (e.g. add pages and delete), since the old URLs loaded in from the existing index files were not converted back from their entitized form.
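
A minimal sketch of the double-escaping described above, in Python. This is not Zoom's code: `entitize` and `de_entitize` are hypothetical helpers, the sample URL is modelled on the one quoted in post #4, and the sketch assumes the numeric ampersand entity is the only one involved.

```python
def entitize(url: str) -> str:
    """Escape every ampersand as a numeric entity (hypothetical helper)."""
    return url.replace("&", "&#38;")


def de_entitize(url: str) -> str:
    """Reverse entitize() by turning the numeric entity back into '&'."""
    return url.replace("&#38;", "&")


raw = "content.php?pid=4&mid=90"   # illustrative URL, not taken from a real index
stored = entitize(raw)             # the form written to the URL list

# Bug: the stored (already escaped) URL is escaped again on the next update,
# so it no longer matches the existing entry and is treated as a new page.
buggy = entitize(stored)

# One possible fix: convert stored URLs back to raw form before re-escaping.
fixed = entitize(de_entitize(stored))

print(stored)
print(buggy)
print(fixed == stored)
```

Because the unescape step makes the round trip idempotent, repeated updates would keep producing the same stored URL instead of a longer one each time.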

          We will fix this in the next release (V5.0 Build 1009). Thanks for bringing it to our attention.
          --Ray
          Wrensoft Web Software
          Sydney, Australia
          Zoom Search Engine
