Can't scan old site


  • Can't scan old site

    A while back we moved our site to a new URL (from www.waxie.com to info.waxie.com). The new URL seems to scan fine; there are a few issues, but that is not what this post is about...

    We have PDFs on the old site that we still want to show up in the search. But since there is a 301 redirect, Zoom will not scan the old site, and it does not work as a starting point. Only certain pages on the old site have the 301 redirect; the PDFs do not have redirects on them. How do I make Zoom scan the site anyway, but just ignore the redirected pages?

    This was not an issue with the Google search we have been paying for previously.

    Please advise! Thanks!

    Chris

  • #2
    I'm not entirely sure from your description, but I think you're saying that on the new site (info.waxie.com), there are redirects to pages on the old site (www.waxie.com) and you would like these links to still be crawled and indexed.

    You can do this by simply editing the Base URL (assuming you're using Spider Mode). Click on the "More" button next to the spider start URL, then click "Edit". The automatically determined base URL at the moment is probably "http://info.waxie.com/"; change this to: "http://info.waxie.com/;http://www.waxie.com/"

    Now links to either domain will be considered part of the site and will not be skipped as an "external site". You can add other domains here (semicolon delimited) if necessary.
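
To illustrate the idea of a semicolon-delimited base URL list, here is a minimal Python sketch of how a crawler might decide whether a link is "internal". This is not Zoom's actual implementation; the function names `parse_base_urls` and `is_internal` are hypothetical, and the matching rule (same scheme, same host, path prefix) is an assumption made for the example.

```python
from urllib.parse import urlsplit


def parse_base_urls(setting: str) -> list[str]:
    """Split a semicolon-delimited base URL setting into a clean list."""
    return [part.strip() for part in setting.split(";") if part.strip()]


def is_internal(link: str, base_urls: list[str]) -> bool:
    """Return True if the link falls under any of the base URLs.

    Assumed rule (hypothetical, for illustration only): scheme and host
    must match, and the link path must start with the base path.
    """
    target = urlsplit(link)
    target_path = target.path or "/"
    for base in base_urls:
        b = urlsplit(base)
        same_site = (
            target.scheme == b.scheme
            and target.netloc.lower() == b.netloc.lower()
        )
        if same_site and target_path.startswith(b.path or "/"):
            return True
    return False


bases = parse_base_urls("http://info.waxie.com/;http://www.waxie.com/")
print(is_internal("http://www.waxie.com/RCP-HYGEN-Waxie-Clean-Water-System.pdf", bases))  # True
print(is_internal("http://example.com/page.html", bases))  # False
```

With only "http://info.waxie.com/" in the list, the first check would come back False, which is the "external site" skip described above.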

    Let us know if this doesn't solve the problem.
    --Ray
    Wrensoft Web Software
    Sydney, Australia
    Zoom Search Engine



    • #3
      Ray,

      old site = www.waxie.com
      new site = info.waxie.com

      Scanning the new site is not a problem. Scanning the old site for content is the problem, because there is a 301 redirect on www.waxie.com. Zoom says the URL is not valid. Therefore all the PDFs hosted on that old domain are not being scanned and thus do not show up in the search. Is that more clear?

      Please advise!



      • #4
        Ray,

        old site = www.waxie.com
        new site = info.waxie.com

        The root page on the old site has a 301 redirect to the new site. But there are still other documents on the old site, like PDFs, that we want scanned so that they show up in searches. Currently, it is not possible to scan the old site; Zoom just says the URL is invalid. Please advise.



        • #5
          It's still not very clear. It would help to have some actual examples, instead of just the same information rephrased.

          First question, is the www.waxie.com site actually online? When you say it is the "old site", are all the files removed, such that an old URL such as www.waxie.com/somefolder/page.html (for example, assume some valid URL for the old site) no longer returns anything? (or a 404 file not found?)

          Second question, with the "301 redirect", I now understand that the root page:

          www.waxie.com/index.html

          is apparently redirecting to

          info.waxie.com/index.html

          But are there other redirects that are in place? What are they redirecting from, and to?

          You said in your original post that you want Zoom to ignore the redirects. Then what should it index? Note that in Spider Mode, we can only index what the server is willing to return. If the server returns a "redirect" and not the page requested, then Zoom cannot access the page that is being redirected. It can, at best, index the page it has been redirected TO (which is what my earlier post was allowing for).
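
One way to see what the server is "willing to return" for a given URL is to request it without following redirects. The following is a rough Python sketch using the standard library's `http.client`, which does not follow redirects on its own. The helper names `classify` and `probe` and the three labels are hypothetical, chosen for this example, and say nothing about how Zoom itself handles responses.

```python
import http.client
from urllib.parse import urlsplit


def classify(status: int) -> str:
    """Map an HTTP status code to a rough crawl outcome (labels are illustrative)."""
    if status in (301, 302, 303, 307, 308):
        return "redirect"      # server points elsewhere; original page is not returned
    if 200 <= status < 300:
        return "indexable"     # server returned the requested document
    return "unavailable"       # e.g. 404 file not found


def probe(url: str, timeout: float = 10.0):
    """Send a HEAD request and report (outcome, Location header or None)."""
    parts = urlsplit(url)
    if parts.scheme == "https":
        conn = http.client.HTTPSConnection(parts.netloc, timeout=timeout)
    else:
        conn = http.client.HTTPConnection(parts.netloc, timeout=timeout)
    try:
        path = parts.path or "/"
        if parts.query:
            path += "?" + parts.query
        conn.request("HEAD", path)
        response = conn.getresponse()
        return classify(response.status), response.getheader("Location")
    finally:
        conn.close()


# Example usage (needs network access, so it is left commented out):
# print(probe("http://www.waxie.com/index.html"))
# print(probe("http://www.waxie.com/RCP-HYGEN-Waxie-Clean-Water-System.pdf"))
```

A URL that reports "redirect" cannot be indexed at its original address; only the target named in the Location header can be fetched.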

          Now, if all your files are actually static and not dynamically generated (in other words, they are not PHP or ASP scripts), then you could consider just indexing a local copy of your site's files using OFFLINE mode. This would bypass any server redirects that are in place, and you can specify the Base URL to whatever you need (e.g. info.waxie.com).
          --Ray
          Wrensoft Web Software
          Sydney, Australia
          Zoom Search Engine



          • #6
            Ray,

            Apologies for the difficulty, your help is greatly appreciated.

            First question answer: Yes, www.waxie.com is actually online. The home page, and many internal pages, have 301s on them redirecting to the new site. This was done to maintain SEO ranking, and also in case someone had bookmarked a page. There are PDF files, though, that still live on www.waxie.com. We link to and reference them, and they currently show up successfully in our paid Google search service. Please go here to see this in action... info.waxie.com/search. Then do a search for something like "mop". In those results you will see a mixture of URL hits between info.waxie.com and PDFs on www.waxie.com such as... http://www.waxie.com/RCP-HYGEN-Waxie-Clean-Water-System.pdf

            Second question: should be answered in the above paragraph. We don't want the redirected pages to be scanned against the server's "willingness to return". I want all of the (non-301-redirected) PDFs on www.waxie.com to be scanned and show up like they currently do in the paid Google search service.

            As far as scanning in offline mode, I do not see how I can combine that with a singular scan of info.waxie.com.

            Thanks, Chris



            • #7
              OK, I see now. So you intend to have both the old and the new sites co-existing, and it's not a matter of migrating from one to the other and removing the old site in the future, nor do you plan to have all the content from the old site available in the new site. It's actually going to be two sites that will remain and it's more like a primary site and secondary site.

              So as you said, there are some PDFs on www.waxie.com which are not on info.waxie.com.

              Question: how are they linked to? Are there any links at all on info.waxie.com which go to these PDF files? Or is there no way for a user to access these PDF files except via the search engine?

              If there are links from the pages on info.waxie.com which go to the www.waxie.com PDF files, then my first solution proposed in post #2 above will address this.

              If there are no such links on info.waxie.com, then is there any page on www.waxie.com (presumably not the root page since you said this redirects, but presumably other pages on the site do not) which links to these PDF files? If so, you could add this page of links as an additional start point (click on the "More" button next to the start spider URL).

              If there are NO links at all from any live pages that point to these "lost" PDF files, then you can either create such a page (and use it as an additional start point as above), or add each of these PDF links as individual additional start points.

              In that last case, I presume that Google only has the links to these PDF files because it previously indexed your old site (which DID link to these files) and has those links in its database.
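
If a page of links has to be created for the "lost" PDF files, it can be generated from a plain list of URLs. Below is a minimal Python sketch; the function name `build_link_page` and the default title are hypothetical, and the single URL in the example is the one quoted earlier in this thread. The resulting HTML would be saved to a file on a non-redirected path and added as an additional start point.

```python
from html import escape


def build_link_page(pdf_urls, title="PDF index"):
    """Return a bare HTML page containing one link per PDF URL."""
    items = "\n".join(
        f'    <li><a href="{escape(url, quote=True)}">{escape(url)}</a></li>'
        for url in pdf_urls
    )
    return (
        "<!DOCTYPE html>\n"
        "<html>\n"
        f"<head><title>{escape(title)}</title></head>\n"
        "<body>\n"
        "  <ul>\n"
        f"{items}\n"
        "  </ul>\n"
        "</body>\n"
        "</html>\n"
    )


page = build_link_page([
    "http://www.waxie.com/RCP-HYGEN-Waxie-Clean-Water-System.pdf",
])
print(page)
```

The spider only needs the `href` targets, so the page can stay this plain; it does not have to be linked from anywhere a visitor would see it.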
              --Ray
              Wrensoft Web Software
              Sydney, Australia
              Zoom Search Engine

