  • Scheduled Index Modification

    Hello,

    I would like to find a way to automatically delete certain pages from the index.

    To explain...

    I use Zoom 5.1 Professional edition. I have zoom scheduled to index my website automatically twice a week. One large section of the site is an archive of emails between members. The emails are converted to .html files and saved sequentially.

    The archive structure is such that all files are linked from each of three types of pages:
    .../author.htm
    .../date.htm
    .../subject.htm

    I need to include these three types of pages in the index, as the links to the 30,000+ .html pages they contain are needed. However, I would like to exclude those three pages from being returned as results (while still indexing them & following the links found on them).

    In the past, I have manually deleted these three types of pages from the index by going to Index > Manage existing index > View or delete pages from existing index. Now that I have Zoom set to perform indexing automatically server-side, I would like to find a way to automatically delete these three types of pages from the index.

    Thank you,
    Austin

  • #2
    You would be better off excluding these pages from indexing rather than trying to delete them afterwards.

    There are several ways you can exclude a page from indexing, but still follow the links on them.

    One way is to use the ZOOMSTOP and ZOOMRESTART tags. You can place these tags so that they enclose all the content on these pages. Links between the tags will still be followed, but since no words are indexed for these pages, they will not appear in search results.
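    As a sketch, one of the archive pages (say, author.htm — the page content shown here is made up for illustration; only the comment tags are Zoom's syntax) might look like this:

    ```html
    <html>
    <head><title>Archive by author</title></head>
    <body>
    <!--ZOOMSTOP-->
    <!-- Nothing between ZOOMSTOP and ZOOMRESTART is added to the index,
         but the spider still follows the links it finds here. -->
    <h1>Messages by author</h1>
    <ul>
      <li><a href="msg00001.html">Example message</a></li>
      <li><a href="msg00002.html">Another message</a></li>
    </ul>
    <!--ZOOMRESTART-->
    </body>
    </html>
    ```

    Since no indexable words remain outside the tags, the page itself never shows up as a search result, while msg00001.html and msg00002.html still get crawled.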

    Another way is to add each of these pages as an individual start point. You do this by clicking the "More" button on the Spider Mode tab. There, for each start point, you can set the spider option for a URL to "Follow links only". If you do this, make sure you specify these individual "follow links only" start points before your main start point (for the rest of the website).

    For more information on these options see Chapter 2.1.4 and Chapter 7.5 in the Users Guide:
    http://www.wrensoft.com/zoom/usersguide.html
    --Ray
    Wrensoft Web Software
    Sydney, Australia
    Zoom Search Engine



    • #3
      Thanks for your reply.

      I run Zoom on the server - so the spider options won't work for me. (Or will the spider options still function for offline scanning?)

      If I use the options in 7.5, the ZOOMSTOP & ZOOMSTART commands, will links still be followed between the tags?



      • #4
        If I use the options in 7.5, the ZOOMSTOP & ZOOMSTART commands, will links still be followed
        The correct tags are ZOOMSTOP and ZOOMRESTART. As Ray already mentioned, links will still be followed.

        If you are running a web server then you can use spider mode from any machine with TCP/IP, including the server itself.



        • #5
          Sorry, I guess I'm confused why I would want to crawl the site via TCP/IP when I can just do it straight on the server?



          • #6
            If you index files directly from your hard disk, then scripts such as PHP and ASP will not be executed, and you'll be indexing the source code of the script, not its output. You need to hit a web server to execute script files.
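            To illustrate with a hypothetical script (not from the site in question): indexed straight from disk, Zoom would see the raw source below, PHP code and all; fetched through a web server, the script runs first and only its HTML output reaches the indexer.

            ```php
            <html><body>
            <h1>Member archive</h1>
            <?php
            // Indexed from disk: this PHP source text is what gets indexed.
            // Fetched over HTTP: the server executes the script, so the
            // indexer only ever sees the echoed output.
            echo "Messages in archive: " . count(glob("msg*.html"));
            ?>
            </body></html>
            ```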



            • #7
              Originally posted by AustinM View Post
              I run zoom on the server - so the spider options won't work for me. (Or will the spider options still function for offline scanning?)
              Hang on... if you are using Offline Mode, as you now seem to be suggesting, then you don't need "links to be followed" at all, as you described in your original post. Offline Mode finds all the files within your folder (it has direct access to the filesystem and folder hierarchy). You would not need to be concerned with whether links are found, and you can safely just add "author.htm", "date.htm", and "subject.htm" to your skip list.
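              For example, the skip list page could contain something like the following (one pattern per line; the filenames are the ones from this thread, and any file whose path matches a pattern is skipped):

              ```text
              author.htm
              date.htm
              subject.htm
              ```

              In Offline Mode these files are then never indexed, yet the 30,000+ message pages in the same folders are still picked up directly from disk.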

              You might want to look up the difference between Offline Mode and Spider Mode in the Users Guide.
              --Ray
              Wrensoft Web Software
              Sydney, Australia
              Zoom Search Engine



              • #8
                Thanks again Ray.

                Here's the problem with that...

                The program that morphs our .eml files into .htm archive files links all the .htm files together with several navigation options. Each .htm file, then, has multiple links to .../date.htm, .../author.htm, and .../subject.htm.

                For .htm and .html files, does Zoom just look at the text on the page (as it appears in a browser), or does zoom also look at the source code for the page (as it appears in an editor or notepad)?



                • #9
                  In all cases we try to strip out source code and just index the content. But in the case of scripted pages (PHP, ASP, etc.), there will often be no content if you are indexing in offline mode, as the script never gets executed. So you need to use Spider Mode if you have dynamic pages.

                  If all your pages are static HTML pages, either spider or offline mode can work.



                  • #10
                    Originally posted by AustinM View Post
                    Here's the problem with that...

                    The program that morphs our .eml files into .htm archive files links all the .htm files together with several navigation options. Each .htm file, then, has multiple links to .../date.htm, .../author.htm, and .../subject.htm.
                    I'm not quite sure why you think these links are a problem for offline mode in your description above... can you elaborate on this? Zoom will only index the content from HTML pages (in Spider Mode, it will look at - but not index - the code to find links).

                    Does this program create the .htm files on the fly, that is, is it a PHP, ASP or CGI script that runs on the web server? Or is it just a program you run once, which creates a bunch of HTM files on disk?

                    Supposing it is the latter, then it shouldn't be a problem for the files to be indexed in offline mode because these are considered static HTML files.
                    --Ray
                    Wrensoft Web Software
                    Sydney, Australia
                    Zoom Search Engine
