parts of the url are chopped


  • parts of the url are chopped

    After working fine for four weeks, Zoom Search is now showing odd behaviour:
    The URLs of the search results are chopped.
    Sometimes only the h of the http is missing, sometimes most of the url is missing.
    I am running zoom search on a windows server. The indexing process is automated and as I said above, everything worked fine for about a month.

    Since I did not change a thing I have no explanation for all this.

    Thanks for your help.

  • #2
    Sounds like you have a corrupted set of index files. This can happen, for example:
    - if you only upload half the files
    - if you upload the files using a 3rd party FTP program and select ASCII mode (you need to select binary mode).
    - there is low level corruption on the hard disk

    Try a full re-index & reupload all the files.
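One way to check for a partial or ASCII-mode upload is to compare checksums of the local index files against a copy downloaded back from the server. The sketch below is only an illustration: the directory arguments and helper names are hypothetical and are not part of Zoom.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, read in binary mode."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_mismatches(local_dir: Path, copy_dir: Path) -> list[str]:
    """List file names that are missing from copy_dir or differ from local_dir."""
    mismatches = []
    for local_file in sorted(local_dir.iterdir()):
        if not local_file.is_file():
            continue
        other = copy_dir / local_file.name
        if not other.is_file() or sha256_of(local_file) != sha256_of(other):
            mismatches.append(local_file.name)
    return mismatches
```

An ASCII-mode transfer typically rewrites line-ending bytes, so a binary index file that went through one will no longer match its local checksum.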

    If you still have a problem, can you provide additional details? What script are you using? Are you using incremental indexing? What is the URL to your search function, so we can see the problem?


    • #3
      I'm running into a similar problem. One of the slashes in the URL goes missing, e.g. http:/www.someurl.com

      To see this, please go to http://www.siliconjournal.com and search for thestar.com and look at the results page. All of the results listed as "No Title" have a problem with the url.

      To view the link as it appears on the site (left hand side) visit:

      http://www.siliconjournal.com/index.php?pageNum_rsFeedsList=3&totalRows_rsFeedsList=233

      and click on any of the links for thestar.com and they work fine.

      Please note that I deleted the index files, re-indexed the site and uploaded the files to the webserver using Zoom for ftp.

      Any help is appreciated.
      Last edited by TSJ; Feb-23-2008, 03:28 AM.


      • #4
        I looked at the existing index and there are 25 damaged urls. Here are another 5 search terms to try and you'll see the results with "No title" have the / missing

        "1UP News and Reviews"

        AVRant.com

        "associated press"

        "cbc top stories"

        marketwatch

        Thanks again


        • #5
          These problems from FreeFlow and TSJ are separate and unrelated issues.

          FreeFlow didn't post the URL to his site, but it is very likely due to corruption of the index files.

          TSJ, in my opinion, just has a broken link on his site, which the spider follows to a custom error page (which hides the real 404 error). The custom error page has no title, so you end up with these "No Title" results in the index. Fixing the broken links on your site should remove these bad pages from the index.


          • #6
            The links are pulled from a database and then indexed by the spider. The link from the db works and the link from the spider doesn't.

            These are the only links that the spider can index:

            http://www.siliconjournal.com/links.php

            Click on any of them and they work fine. Click on the results returned by the spider and some of the links are broken.


            • #7
              TSJ,
              Most of the links on your links.php page are indexed correctly. The few links that don't index correctly are in the following form,
              Code:
              <a href="index.php?feed=http://rss.1up.com/rss?x=1">1UP News and Reviews</a>
              There appears to be a problem with all the URLs that have the double question mark in them. The question mark is used in URLs to delimit the file name from the parameters passed to the file. A few of your URLs have 2 question marks, leading to some ambiguity about the structure of the URL. (Zoom needs to parse the URL in order to convert it to a non relative path among other things).

              The indexer seems to mess up in this case, and constructs an invalid absolute URL (in effect a broken link).

              I am not sure whether double question marks in URLs are valid. Possibly the second one should be encoded like this,
              Code:
              <a href="index.php?feed=http://rss.1up.com/rss%3Fx=1">1UP News and Reviews</a>
              but valid URL or not, we probably need to have a closer look at the Zoom indexer's behaviour in this case.
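For comparison, here is a rough sketch of how a general-purpose URL parser resolves the same relative link. It uses Python's standard `urllib.parse`, not Zoom's own parser, and it assumes the link sits on the links.php page mentioned earlier in the thread. As far as I can tell, RFC 3986 allows a literal '?' inside the query component, so only the first '?' acts as the delimiter.

```python
from urllib.parse import urljoin, urlsplit

# Assumed base page; the hrefs are the two forms quoted in this post.
BASE = "http://www.siliconjournal.com/links.php"
RAW_HREF = "index.php?feed=http://rss.1up.com/rss?x=1"
ENCODED_HREF = "index.php?feed=http://rss.1up.com/rss%3Fx=1"


def resolve(base: str, href: str) -> tuple[str, str, str]:
    """Resolve href against base and return (absolute URL, path, query)."""
    absolute = urljoin(base, href)
    parts = urlsplit(absolute)
    return absolute, parts.path, parts.query


for href in (RAW_HREF, ENCODED_HREF):
    absolute, path, query = resolve(BASE, href)
    print(absolute)
    print("  path: ", path)
    print("  query:", query)
```

In both cases the path comes out as /index.php and everything after the first '?' stays in the query, with the embedded http:// left intact.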

              And I still believe the problem Freeflow was having is unrelated.


              • #8
                Thank you for looking into this and finding the problem. Unfortunately, I can't do anything about the second question mark, as the feed URLs are generated by the provider. Replacing the 2nd ? with a % won't work either, as the feed won't parse with the URL changed.

                If it was only a couple of results that ran into this problem it wouldn't be so bad, however I currently have 233 links and 25 of them are returning with damaged URLs in the results. That's nearly 11% of the results. Once testing is completed, we plan to have over 4000 links in the db. This is going to pose a serious problem moving forward.

                Do you have any other suggestions?

                Thanks again for the great support and sorry for hijacking the thread.


                • #9
                  The feed should still parse, as %3F is just the encoded form of '?', in the same way that %20 is the encoded form of the ' ' space character in a URL. It is your index.php file that is accepting the URL, so you should be able to fix it to accept %3F.
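To illustrate the point, here is a small sketch in Python (standing in for whatever index.php actually does, which I haven't seen). Once the query string is decoded, the raw and the encoded form give the same feed value. The `feed_param` helper is hypothetical.

```python
from urllib.parse import parse_qs


def feed_param(query: str) -> str:
    """Return the decoded value of the 'feed' parameter from a query string."""
    return parse_qs(query)["feed"][0]


raw = feed_param("feed=http://rss.1up.com/rss?x=1")
encoded = feed_param("feed=http://rss.1up.com/rss%3Fx=1")
print(raw == encoded)
```

PHP applies the same kind of decoding to the values in `$_GET`, so the script should normally receive the feed URL with the '?' restored.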

                  I haven't tested this, but you could try including full absolute URLs instead of relative URLs. This would be a poor workaround even if it did work.

                  Another workaround would be to make a code change in index.php to correct the URL.

                  Or make a code change to remove the URL as a parameter completely. Instead of having the double URLs like this
                  http://www.siliconjournal.com/index.php?feed=http://rss.1up.com/rss?x=1
                  you have this
                  http://www.siliconjournal.com/index.php?feedID=4
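A minimal sketch of the feedID idea, written in Python for illustration only. The table contents are an assumption (the ID 4 is simply taken from the example above and paired with the 1UP feed); on the real site the lookup would be a query against your database.

```python
from typing import Optional
from urllib.parse import parse_qs

# Hypothetical lookup table standing in for the feeds table in the database.
FEEDS = {4: "http://rss.1up.com/rss?x=1"}


def link_for(feed_id: int) -> str:
    """Build the link that the spider sees; no URL is embedded in it."""
    return f"index.php?feedID={feed_id}"


def feed_url_for(query: str) -> Optional[str]:
    """Map the feedID in a query string back to the stored feed URL."""
    values = parse_qs(query).get("feedID", [])
    if not values or not values[0].isdigit():
        return None
    return FEEDS.get(int(values[0]))


print(link_for(4))
print(feed_url_for("feedID=4"))
```

With this layout the spider only ever parses short links with a single '?', and the provider's feed URL never has to be altered.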


                  • #10
                    I missed the 3F after the % (%3F) in your previous example. Of course that works! I can easily have the code replace the ? from the second url with %3F.

                    I'll change the entries in the db and get the spider to re-index.
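Roughly what I have in mind for the replacement, sketched here in Python rather than the site's own code. Both function names are made up for this example. The first keeps the first '?' as the delimiter and encodes any later ones; the second encodes the whole feed URL, which is the more thorough option.

```python
from urllib.parse import quote


def encode_inner_question_marks(href: str) -> str:
    """Keep the first '?' as the query delimiter and encode any later ones."""
    head, sep, tail = href.partition("?")
    return head + sep + tail.replace("?", "%3F")


def link_with_quoted_feed(feed_url: str) -> str:
    """Alternative: percent-encode every reserved character in the feed URL."""
    return "index.php?feed=" + quote(feed_url, safe="")


print(encode_inner_question_marks("index.php?feed=http://rss.1up.com/rss?x=1"))
print(link_with_quoted_feed("http://rss.1up.com/rss?x=1"))
```

The first form matches the %3F example given earlier in the thread. The second also encodes the ':' and '/' characters of the embedded URL.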

                    Truly amazing support, even on weekends!
