Categorizing Question: match pattern? (PDF)


  • Categorizing Question: match pattern? (PDF)

    Hi.
    QUESTION 1:
    I'm having trouble with the match patterns for categorizing.
    I am trying to categorize a lot of PDF files.

    I would like to know how to create a match pattern so that it can categorize the PDF files.
    From reading some of the threads,
    I gather that you have to somehow change the filename of each PDF file so that it has a distinct pattern.

    So if I wanted to categorize all the Surgery Tools in one category, (they're all PDF), then I would have to make sure each filename of the PDF had a pattern name, like
    ABC_surgerytools.pdf
    DEF_surgerytools.pdf
    GHI_surgerytools.pdf

    and have the match pattern as "surgerytools.pdf" ?
    below is what I am looking at and I'm not sure what to do in order to categorize the PDF files.
    http://img330.imageshack.us/my.php?i...ategoryah7.jpg


    QUESTION 2: Also, I am having trouble searching for the PDFs on my site.
    I ran the Zoom search indexer and it said it found the one PDF file I used as a test.
    When I try to type in anything with PDF, or the meta information of the PDF file, it still does not show.
    I used www.aplus.net and I have uploaded the PDF file into the folder I am indexing.


    Thanks in advance!

  • #2
    Originally posted by sunohc View Post
    So if I wanted to categorize all the Surgery Tools in one category, (they're all PDF), then I would have to make sure each filename of the PDF had a pattern name, like
    ABC_surgerytools.pdf
    DEF_surgerytools.pdf
    GHI_surgerytools.pdf

    and have the match pattern as "surgerytools.pdf" ?
    That would work. Note that you can also match by the folder/path name of the URL, so you don't necessarily have to match by the filename or rename your files if they are already under a categorizable folder or URL path.

    below is what I am looking at and I'm not sure what to do in order to categorize the PDF files.
    http://img330.imageshack.us/my.php?i...ategoryah7.jpg
    Your screenshot shows that you only have a match pattern of ".pdf". This would match every page containing ".pdf" anywhere in the URL (most of the time it'll be the file extension), so it cannot tell one group of PDF files from another.

    Tip: Do you have any other categories specified? A file can only belong to a single category. So if a file is already matched by a category earlier in the list of categories, it will not have a chance to be placed in a later category. Refer to the Help file (click the "Help" button) or the Users Guide chapter on Categories if you have not read them yet:
    http://www.wrensoft.com/zoom/usersguide.html
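
To illustrate the tip above, here is a minimal sketch of "first matching category wins". It is not Zoom's actual code: the function name `assign_category`, the category names, and the example.com URL are all made up for illustration, and it assumes a pattern matches when it appears anywhere in the URL.

```python
from typing import Optional

# Hypothetical model of an ordered category list: the first match wins.
def assign_category(url: str, categories: list[tuple[str, str]]) -> Optional[str]:
    """Return the name of the first category whose pattern occurs in the URL."""
    for name, pattern in categories:
        if pattern in url:  # assumed: plain substring test
            return name
    return None  # no category matched


# Illustrative categories; the broad ".pdf" pattern is listed first.
categories = [
    ("All PDFs", ".pdf"),
    ("Surgery Tools", "surgerytools.pdf"),
]
url = "http://www.example.com/ABC_surgerytools.pdf"

print(assign_category(url, categories))                  # All PDFs
print(assign_category(url, list(reversed(categories))))  # Surgery Tools
```

With the broad ".pdf" category listed first, the file never reaches "Surgery Tools". Listing the more specific pattern first avoids that.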

    QUESTION 2: Also, I am having trouble searching for the PDFs on my site.
    I ran the Zoom search indexer and it said it found the one PDF file I used as a test.
    When I try to type in anything with PDF, or the meta information of the PDF file, it still does not show.
    I used www.aplus.net and I have uploaded the PDF file into the folder I am indexing.
    Can you tell us the URL of the PDF file in question, and where it is linked on your site?

    Does this PDF file contain actual text data (as opposed to an image file which has been scanned in for example without any OCR layer)? Do you have "Use meta information from plugins when available" enabled (on the "Scan options" tab of the Configuration window)?

    I presume you are using Spider mode indexing. If so, Zoom will only find the PDF file if it is linked on your webpages, not simply because you uploaded the PDF file to the folder. But since you said that Zoom reported it found and indexed the file successfully, this should not be the problem. If Zoom reports any problems/errors when indexing this file, make sure to note what the problem is (e.g. if the file is password protected).
    --Ray
    Wrensoft Web Software
    Sydney, Australia
    Zoom Search Engine



    • #3
      Thanks for the reply, Ray!

      That helped a lot!

      I presume you are using Spider mode indexing. If so, Zoom will only find the PDF file if it is linked on your webpages, not simply because you uploaded the PDF file to the folder. But since you said that Zoom reported it found and indexed the file successfully, this should not be the problem. If Zoom reports any problems/errors when indexing this file, make sure to note what the problem is (e.g. if the file is password protected).
      I am using Spider mode indexing, but is there another way?
      The PDF files are not actually linked on the site, just uploaded using FTP.
      I found that it doesn't recognize any of the PDFs I uploaded using FTP. Is that because of Spider mode?

      Does this PDF file contain actual text data (as opposed to an image file which has been scanned in for example without any OCR layer)? Do you have "Use meta information from plugins when available" enabled (on the "Scan options" tab of the Configuration window)?
      It has actual text, and I have fixed it all in the Configuration window,
      as well as the meta information.



      • #4
        Originally posted by sunohc View Post
        I am using Spider mode indexing, but is there another way?
        The PDF files are not actually linked on the site, just uploaded using FTP.
        I found that it doesn't recognize any of the PDFs I uploaded using FTP. Is that because of Spider mode?
        Spider mode depends on the links it can find on your web pages to locate the other files on your website. If these files are not linked on your website, they will not be found in Spider mode. See this FAQ for more information:
        http://www.wrensoft.com/zoom/support...spider_finding

        The alternative is Offline mode. This indexes files in a local folder on your hard disk, and it will index all the files under a given folder. See the section on Offline mode in the Users Guide:
        http://www.wrensoft.com/zoom/usersguide.html

        So if you have a copy of the files you want to index on your hard disk, you should use Offline mode to do so. You should use Spider mode if you have to index dynamically generated pages (e.g. PHP/ASP pages).
        --Ray
        Wrensoft Web Software
        Sydney, Australia
        Zoom Search Engine
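
The difference between the two modes described above can be sketched with a toy model. This is only an illustration, not how Zoom is implemented: the `SITE_LINKS` table, the file names, and the `spider` and `offline` functions are all hypothetical, and the "folder" is just the set of files in the table.

```python
from collections import deque

# Hypothetical site: each file mapped to the links found on it.
SITE_LINKS = {
    "/index.html": ["/products.html", "/pdf/linked.pdf"],
    "/products.html": ["/index.html"],
    "/pdf/linked.pdf": [],
    "/pdf/uploaded_only.pdf": [],  # uploaded by FTP, linked from nowhere
}


def spider(start: str) -> set[str]:
    """Find only the files reachable by following links from the start page."""
    found = {start}
    queue = deque([start])
    while queue:
        page = queue.popleft()
        for link in SITE_LINKS.get(page, []):
            if link not in found:
                found.add(link)
                queue.append(link)
    return found


def offline() -> set[str]:
    """Find every file under the folder, whether or not anything links to it."""
    return set(SITE_LINKS)


print("/pdf/uploaded_only.pdf" in spider("/index.html"))  # False
print("/pdf/uploaded_only.pdf" in offline())              # True
```

The unlinked PDF is invisible to the link-following crawl but is picked up when every file in the folder is listed, which matches the behaviour described in this thread.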



        • #5
          Hi.
          Thanks for all the help.
          Things are starting to work now, thanks to you.

          However, I am still having trouble with the category part.
          I think I misunderstood what you meant by the "categorizable folder or URL path."
          Does this mean the folder in my FTP control (hosting control panel),
          or the folders on my hard drive?

          I'm having trouble working out how to configure the "match pattern".

          So, say I have files with exact internet URLs of...

          http://www.website.com/pdf/ABC_SUB.pdf

          http://www.website.com/pdf/DEF_SUB.pdf

          What would be the match pattern?


          http://www.website.com/pdf/ ?
          or SUB.pdf?
          Would that distinguish these files from the other PDFs?

          Or would the ".pdf" stand for ALL PDFs regardless?



          • #6
            The category pattern is matched against the full URL of each file (as it will appear on your website), not the folders on your hard drive (unless they are the same).

            A pattern of ".pdf" will match all PDF files (because they would all have ".pdf" in the URL).

            A pattern of "SUB.pdf" will only match PDF files with "SUB" at the end of their filename.

            A pattern of "http://www.website.com/pdf/" will match all files under the "pdf" folder on your website.
            --Ray
            Wrensoft Web Software
            Sydney, Australia
            Zoom Search Engine
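
The three patterns above can be tried against a handful of URLs. This is a rough sketch only: it assumes a pattern matches when it appears anywhere in the full URL, as described in this thread, and Zoom's real matching rules (case handling, for example) may differ. The helper names `matches` and `matching` and the last two URLs are made up for illustration.

```python
def matches(pattern: str, url: str) -> bool:
    """Assumed rule: a pattern matches if it occurs anywhere in the full URL."""
    return pattern in url


urls = [
    "http://www.website.com/pdf/ABC_SUB.pdf",
    "http://www.website.com/pdf/DEF_SUB.pdf",
    "http://www.website.com/pdf/GHI_OTHER.pdf",  # hypothetical extra file
    "http://www.website.com/docs/JKL_SUB.pdf",   # hypothetical extra file
]


def matching(pattern: str) -> list[str]:
    """Return the URLs from the list above that the pattern would match."""
    return [u for u in urls if matches(pattern, u)]


for pattern in (".pdf", "SUB.pdf", "http://www.website.com/pdf/"):
    print(pattern)
    for url in matching(pattern):
        print("   ", url)
```

Here ".pdf" matches all four URLs, "SUB.pdf" matches the three whose filename ends in "SUB" (including the one outside the pdf folder), and the folder URL matches the three files under /pdf/ regardless of filename.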



            • #7
              Ah hah.
              Thank you.
              This is great customer support ^^
              I also have another problem, and I think this could be fixed easily.
              When I search for just "pdf" in the search engine, all the PDF files appear, but the "index of [abc]" page appears as well.
              The URL will usually look like this:
              http://www.website.net/pdf/ABC/

              or, in my actual case:
              http://www.orscrubs.net/pdf/

              The results will include one of those directory pages,

              and it has no file extension.

              In the Configuration window, I unchecked the box that says "scan files with no extension", and yet it still seems to catch those.

              Am I supposed to disable the directory listing in the FTP control, or is there a way to have Zoom not index those pages?
              Thanks a lot, Ray!



              • #8
                I presume you are using Spider mode. Are you using http://www.orscrubs.net/pdf/ as your start URL? If so, you can simply click on "More" in the Spider mode tab, and then "Edit" the selected start point, and change it from "Index page and follow internal links" to "Follow links only". This prevents the URL of your start point from being indexed. It will only follow the links from it.

                For more information on excluding directory listings in Spider mode, see this other recent thread:
                http://www.wrensoft.com/forum/showthread.php?t=1203
                --Ray
                Wrensoft Web Software
                Sydney, Australia
                Zoom Search Engine
