Small Wishlist Item


  • Small Wishlist Item

    I love this product. The only thing I find a little cumbersome is managing a larger number of start points. I am slowly building a small vertical search engine for our niche market (maybe I am stretching the intended scalability a bit, since I expect to eventually have about a million pages, but why use Google Co-op when I can build my own vertical engine with 100% of my own branding?). It would be much easier to work with additional start points if they could be managed like a database within the program, as opposed to importing/exporting, which forces me to maintain an external text database.

    I am also finding that, more often than not, a little bit of confusion leads me to perform complete re-indexes instead of incremental updates with new start points.

    I currently have my start pages follow external links, and then I run a small script that scans urlist.txt for unique new URLs, which I add to the start list. I know it wouldn't take long before the start point list grows large, but a feature that automatically added these URLs to a start point (or base URL) database would let the user manually decide whether to flag each base URL for further indexing or delete it from the database.
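    The de-duplication step described above can be sketched roughly as follows. This is only an illustration, not the poster's actual script; the file names (urllist.txt for the crawl output, startpoints.txt for the exported start point list) are assumptions, and both files are assumed to contain one URL per line.

```python
def load_urls(path):
    """Read one URL per line, ignoring blank lines."""
    with open(path, encoding="utf-8") as f:
        return {line.strip() for line in f if line.strip()}

def new_start_points(found_path, known_path):
    """Return URLs seen by the crawl that are not yet start points."""
    return sorted(load_urls(found_path) - load_urls(known_path))

if __name__ == "__main__":
    # Hypothetical file names; substitute your own crawl output and
    # exported start point list.
    for url in new_start_points("urllist.txt", "startpoints.txt"):
        print(url)
```

    The resulting list can then be reviewed by hand before being imported back as additional start points.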

    The database would indicate whether each URL has already been indexed, so as to avoid doubling up.

    Would be great if we could see this in a future version.

  • #2
    Have you looked at the Incremental Indexing features? You can add new start points to an existing index (without having to re-index all your sites). Does this address the problem you are describing?

    You can find out more about Incremental Indexing in our Users Guide:
    http://www.wrensoft.com/zoom/usersguide.html
    --Ray
    Wrensoft Web Software
    Sydney, Australia
    Zoom Search Engine



    • #3
      I have thoroughly looked at and used the incremental indexing features, and they work well, but I still need to manage this outside the program: if I am scanning 200 base URLs, it is hard to tell whether a new URL I want to add is already on that list.

      If the base URL list had some additional functionality such as the following it would make it much nicer to use.

      1. Make the spider URL list a little more of a "live" database, with some kind of indicator against each item on the base URL list showing whether it has been indexed.

      2. The ability to sort this list would be nice too. Helps in managing a larger list.

      3. An option in the INDEX_AND_FOLLOW_ALL spidering mode to automatically add newly found domains to some kind of "to consider" list, which the user can then browse and choose whether to add to the spider list. I understand why the spider does not index the next level of links, but the follow-all pass finds a lot of useful domains that I would then want to add to the spider list and index further. At the moment I have written a small script that extracts all the unique domains found by the follow-all, compares them with an exported copy of the spider list, and creates a new file that I browse for new domains I would like to spider.
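      A minimal sketch of that domain-comparison script might look like the following. The file names are hypothetical, and both files are assumed to contain one full URL (with scheme) per line; lines that do not parse to a host name are skipped.

```python
from urllib.parse import urlparse

def unique_domains(path):
    """Collect the distinct host names from a file of URLs."""
    domains = set()
    with open(path, encoding="utf-8") as f:
        for line in f:
            host = urlparse(line.strip()).netloc
            if host:
                domains.add(host)
    return domains

def domains_to_consider(followall_path, spiderlist_path):
    """Domains found by the follow-all pass that are not yet in the spider list."""
    return sorted(unique_domains(followall_path) - unique_domains(spiderlist_path))
```

      The output is the "to consider" list: domains to review by hand before deciding which to add as new start points.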

      Would be nice if this could all be handled within the program.

      This is just some feedback from a user who wants to use the product for a relatively comprehensive vertical search engine, on what would make it a little more usable. I've only had the product for about five days and have probably already spent about 40 hours with it; these are just a few items that have come up that would improve usability.



      • #4
        Zoom will already skip any new start points you attempt to add (via the "Incremental Add Start Points" feature) if they have already been indexed.

        Originally posted by RLF View Post
        1. Make the spider URL list a little more of a "live" database, with some kind of indicator against each item on the base URL list showing whether it has been indexed.
        In theory, all of the start points in the list are supposed to have been successfully indexed. The only other possibility is that a start point failed to index completely because, halfway through, the server went down or there were connection problems on either end. In that case, it depends on your definition of what counts as successfully indexed. I don't know if this is what you are trying to cater for?

        If an invalid URL was entered and not indexed, it would have been completely omitted from the list.

        Originally posted by RLF View Post
        2. The ability to sort this list would be nice too. Helps in managing a larger list.
        The order of the list is important: it is the order in which the start points will be indexed. In some cases, you need to make sure that one URL precedes another because they cover the same site. Allowing users to sort this list for other reasons could make this confusing (they would have to realize they are viewing a different order than the indexing order). While we can see it being helpful in your case, note that there is a "search" function (the magnifying glass button) which lets you find a particular start point quickly.

        Also, for managing your URLs, there is the "View pages from existing index" window (click "Index"->"Manage existing index"->"View or delete pages..."), whose filtering functionality may help when you want to quickly check whether a URL is already indexed.

        Originally posted by RLF View Post
        3. An option in the INDEX_AND_FOLLOW_ALL spidering mode to automatically add newly found domains to some kind of "to consider" list, which the user can then browse and choose whether to add to the spider list. I understand why the spider does not index the next level of links, but the follow-all pass finds a lot of useful domains that I would then want to add to the spider list and index further. At the moment I have written a small script that extracts all the unique domains found by the follow-all, compares them with an exported copy of the spider list, and creates a new file that I browse for new domains I would like to spider.
        This is interesting but a little out of scope for the majority of our users.

        However, we are aware that such tools would be useful for people trying to build large vertical search engines spanning thousands of different websites. In fact, we have done custom development in the past for customers who wanted additional tools to harvest URLs from external websites. The harvesting tool searched across the web, creating a list of sites matching given criteria, which could then be fed into Zoom for indexing. We have considered expanding on this and turning it into a product or optional extension for Zoom, and it is still a possibility if there is enough user interest.
        --Ray
        Wrensoft Web Software
        Sydney, Australia
        Zoom Search Engine

