Skipping: Blocked by extension list


  • Skipping: Blocked by extension list

    Hi there

    I tried indexing my store located here: http://www.astphotos.com/store/

    Now, when I scan, I remove my index page and put a link to
    http://www.astphotos.com/store/themes.php only b/c I don't want to index the browse page and all links thereafter. I also remove the outer shell of my pages to speed things up and only get the content and its links, in order to just get the categories, subcategories and products. That aside (I know it sounds screwy), I can't get my site to index the actual pages. For pages like this, it says:

    Sample Store Page

    Skipping: Blocked by extension list

    Is it looking at all the numbers as extensions, like the 327911 in astphotos.327911? I tried adding all the numbers I had, but it says the extension list is too long, and it's probably not the way to do it.

    The reason for my giant workaround with the outer templates is I don't want it to index my Shop By Product list, which will be found by any other outer navigation on my site. Anyway, I found the way to get it to the right places now, but now it's skipping the actual pages I want. Any way around this?

    Why am I not using things like Zoomboost for the headers of those pages? Well, the entire site is 1 template that grabs my store from Cafepress.com and inserts it in, adjusting the URLs and such. I can't boost a page b/c the pages are dynamic. It's the same on the Cafepress stores as well, those pages are dynamic and funky too. I found this workaround instead.

  • #2
    This seems to be a consequence of the URL re-writing that is being done on some of your pages.

    I don't have time to completely examine how you have built your complex site but it appears that a 'normal' URL to a product page on your site is like the following,

    http://www.astphotos.com/store/shop.cgi?storeid=astphotos.538812&page=1&trail=astphotos.247466~All%20Things%20Boston~1|astphotos.319617~Boston%20Designs%20and%20Collages~1&st=I%20%28Shamrock%29%20Boston

    Although long, Zoom will handle this URL without any problem.

    But some of the time (I am guessing) you are using the Apache mod_rewrite module to transform the URLs to try and hide your store ID. So the above URL becomes,

    http://www.astphotos.com/store/shop.cgi/astphotos.538812?page=1&trail=astphotos.247466~All%20Things%20Boston~1|astphotos.319617~Boston%20Designs%20and%20Collages~1&st=I%20%28Shamrock%29%20Boston

    Normally it is only worth using mod_rewrite if you want to pretend you are not running a database site and completely flatten your URLs (remove all parameters from them). But in your case you are only removing the Store-ID, making the mod_rewrite trick effectively pointless. It is just an unfortunate coincidence that your store ID now looks like a file name with the extension '538812'.
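To make the coincidence concrete, here is a minimal Python sketch of how a crawler might derive an "extension" from a URL. This is not Zoom's actual code; the function name `apparent_extension` is hypothetical, and the two URLs are shortened versions of the examples above.

```python
import posixpath
from urllib.parse import urlsplit


def apparent_extension(url: str) -> str:
    """Return the text after the last dot in the final path segment, or ''."""
    path = urlsplit(url).path               # the query string is not part of the path
    last_segment = path.rsplit("/", 1)[-1]  # e.g. "shop.cgi" or "astphotos.538812"
    _, ext = posixpath.splitext(last_segment)
    return ext.lstrip(".")


plain = "http://www.astphotos.com/store/shop.cgi?storeid=astphotos.538812&page=1"
rewritten = "http://www.astphotos.com/store/shop.cgi/astphotos.538812?page=1"

print(apparent_extension(plain))      # cgi
print(apparent_extension(rewritten))  # 538812
```

In the plain form the store ID sits in the query string, so only "cgi" is seen. In the rewritten form the store ID is the last path segment, so its numeric suffix is read as the extension.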

    So I see two possible solutions,

    1) Go all the way and transform the entire URL using mod_rewrite. So your URL would then become something like,
    http://www.astphotos.com/store/shop....on/........etc
    But given your long complex URLs I think this sounds like hard work.

    2) Stop using mod_rewrite and just use plain untransformed URLs. This should be easy. But I notice that Google has indexed some of your pages, so changing the URLs might affect that, which is probably not what you want.
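For option 2, the mapping from the rewritten form back to the plain form can be sketched in Python, for example to check offline which plain URL a rewritten link corresponds to. This is only an illustration: `to_plain_url` is a hypothetical helper, and the `storeid` parameter and `shop.cgi` layout are assumed from the two example URLs above, not from the CPshop script itself.

```python
from urllib.parse import urlsplit, urlunsplit


def to_plain_url(url: str, script: str = "shop.cgi") -> str:
    """Move a store ID that follows `script` in the path back into the query."""
    parts = urlsplit(url)
    marker = "/" + script + "/"
    if marker not in parts.path:
        return url  # already in plain form, or not a store URL
    base, store_id = parts.path.split(marker, 1)
    query = "storeid=" + store_id
    if parts.query:
        query += "&" + parts.query  # keep the remaining parameters unchanged
    return urlunsplit(
        (parts.scheme, parts.netloc, base + "/" + script, query, parts.fragment)
    )


rewritten = "http://www.astphotos.com/store/shop.cgi/astphotos.538812?page=1"
print(to_plain_url(rewritten))
# http://www.astphotos.com/store/shop.cgi?storeid=astphotos.538812&page=1
```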

    I have looked at your site for 30 minutes and still don't fully understand everything about its internal workings. So given the complexity of your site I would want to test as much as possible offline before changing your live site.

    There might be other, better solutions, but finding them could take several hours of investigating how your site, scripts, and database work.

    -------
    David



    • #3
      Hi and thank you for the reply.

      The script is called CPshop ( http://marty.net ), and what it does is grab from a store like http://www.cafepress.com/astphotos (which in turn is itself rewriting URLs from storeid=astphotos to be more friendly). The script scrapes the Cafepress store and caches the parts it needs, like the products, their links, text, etc., but leaves out the custom HTML from that site. I in turn have my template ready, filled with the same inner store, but with the links rewritten by the script so they work on my remote site. There isn't a database, just caching of the important product numbers, text, thumbnails, etc. so it can follow the same path. Every time I make major changes or add new products, I need to clear the cache on my server and it regrabs from the other site.

      It does have an option for 'clean urls' but you're right, even now, it doesn't seem to be clean for the products. Like you said, Google has grabbed my site already, so I will check to see if my links can still work if I turn the rewrite off and try it again.

      The URLs I do set are actually filters on my site. When you see things like all_white_tshirts, it's actually grabbing my cache and filtering out sections with that phrase, so hopefully they'll continue to work. I'll play around and try it out and come back. I want to thank you again for taking the time on a weekend to really follow up and check it all out. I'm very impressed.



      • #4
        The situation has come up again; well, it never really left. I kinda didn't rely on Zoom for a while, but I really want to now.

        If my main page links to this page:
        http://www.astphotos.com/store/shop....hotos.30759709

        Why does this page get skipped? Shouldn't Zoom be already beyond the server rewriting and just see a page with content on it, and then just index? Well, I get that it's really b/c the extension is a product or store number and not a typical extension, but is there any way to get Zoom to just index a page if a page is there, regardless of the extension?

        Google should be beyond the original cache now, having moved from the long variable-filled URLs to the much nicer and cleaner ones. Is there any way around this? Can you test it to see? I just start with
        http://www.astphotos.com/store/themes.php and let it run from there. Thanks!



        • #5
          Why does this page get skipped? Shouldn't Zoom be already beyond the server rewriting and just see a page with content on it, and then just index? Well, I get that it's really b/c the extension is a product or store number and not a typical extension, but is there any way to get Zoom to just index a page if a page is there, regardless of the extension?
          To determine why a page is skipped, turn on Verbose mode and locate the skipping message. There is a reason given in brackets after the URL.

          As you suggested, the reason is most likely because the file extension is not recognized as one which is entered in the 'scan extensions' list. The problem here is that your URL is really misleading in that you are using the format for a file extension to act as a product ID parameter. While you can allow Zoom to "scan files with no extensions", the fact of the matter here is that ".30759709" will be treated as a file extension.

          So you have a few options. One is to enter these numbers as scan extensions (in the Configuration window -> Scan Options tab), although if there are many numbers, this might hit the maximum number of extensions allowed. Or you can change your URLs to use something in place of the dot for the parameter (e.g. "astphotos?30759709").
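The behaviour described above can be sketched in Python as a generic extension filter. This is an assumption about the general logic, not Zoom's implementation: `should_scan`, the `SCAN_EXTENSIONS` list, and the example.com URLs are all hypothetical.

```python
from urllib.parse import urlsplit

# Assumed example list; the real list is whatever is configured in the indexer.
SCAN_EXTENSIONS = {"html", "htm", "php", "cgi"}


def should_scan(url, scan_extensions=SCAN_EXTENSIONS, scan_no_extension=True):
    """Decide whether a URL passes an extension-based filter."""
    last_segment = urlsplit(url).path.rsplit("/", 1)[-1]
    if "." not in last_segment:
        # No dot in the file name: governed by the "no extension" setting.
        return scan_no_extension
    ext = last_segment.rsplit(".", 1)[1].lower()
    return ext in scan_extensions


dotted = "http://www.example.com/store/shop.cgi/astphotos.30759709"
no_dot = "http://www.example.com/store/shop.cgi/astphotos?30759709"

print(should_scan(dotted))   # False: "30759709" is not in the list
print(should_scan(no_dot))   # True: the ID is now in the query, not the path
print(should_scan(dotted, SCAN_EXTENSIONS | {"30759709"}))  # True
```

The last call corresponds to the first option (adding each number to the list), which does not scale when there are many product IDs. The second URL corresponds to replacing the dot so that the file name has no extension at all.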
          --Ray
          Wrensoft Web Software
          Sydney, Australia
          Zoom Search Engine

