How to increase the weight of a PDF file name?


  • How to increase the weight of a PDF file name?

    I'm guessing that this question might be a bit complex since it may affect all indexing so I've included a great deal of info...

    I have a variety of PDFs that have been indexed and that cross-reference each other. I used the MS Word 2007 metadata features to add each document's Title, Description and Keywords to the cover page, and when I converted the documents to PDF that data carried over into the corresponding PDF metadata. While the value of keywords in a PDF is questionable, this has worked well with Google.

    My site has about 7500 files (by Zoom Search results) and 65 PDFs. When a person searches for the name of a PDF, it would be helpful if that was at the top of the list. This is my problem. For example:

    My link text is: "PPP-B-601(H) - Cleated Plywood Box"
    My document name is: "http://www.woodencrates.org/standards/PPP-B-601.pdf"
    And my search term is: "PPP-B-601" or "PPP-B-601(H)"
    I have no adjustment for URL length.

    This term appears throughout my site and is referenced multiple times in other PDFs, but I need the actual document to come up first in the search results. Currently this particular one comes up #6, with positions 1 to 5 also being PDFs.

    Is there a way that I can weight this so the proper PDF (based on file name) comes up at the top of the search without generally affecting the results of web page searches?

    (With my weighting, it may be important to note that all my web pages use the Page Title as the <h1> tag and the meta Description as the <h2> tag.)



    Thanks in advance for any help.

  • #2
    You can find details of how the rankings work here,
    Q. How do I make some pages appear higher up in my search results? How does Zoom's page score system work?

    But if the file name is PPP-B-601.pdf, then try searching for the full file name, PPP-B-601.pdf, including the file extension. While the text PPP-B-601 might appear in other files, the full name is unlikely to.

    For this to work you need to have the "." dot joining words. You can check this on the indexing options configuration window.

    You might also want to boost the file name to +5

    You could also add the file name into the keyword meta data.

    Another option would be to make a category just for PDF files. This would cull ~99% of your results based on the numbers you gave above.



    • #3
      I had a look at your site (by guessing the search page URL).

      I think what hurts this scenario further is the fact that your PDF file (http://www.woodencrates.org/standards/PPP-B-601.pdf) actually only contains ONE single instance of the keyword "PPP-B-601" in the whole document (and it is only in the body).

      Instead, it contains 3 occurrences of the keyword "PPP-B-601H" (note the extra trailing letter "H") elsewhere in the body, and one in the title of the document. There is also one occurrence of "PPP-B-601G". These are not the same as "PPP-B-601" for search purposes. (There is a way to enable substring/partial matching, but this is generally not recommended because of how it affects normal English words: "cat" would match "category", etc.)
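To make the distinction concrete, here is a minimal sketch of exact versus substring matching against a list of indexed words. The function names and the word list are made up for illustration; this is not how Zoom is implemented internally, only a model of the behaviour described above.

```python
def exact_match(term, words):
    """Return indexed words equal to the search term (case-insensitive)."""
    term = term.lower()
    return [w for w in words if w.lower() == term]


def substring_match(term, words):
    """Return indexed words that contain the search term anywhere."""
    term = term.lower()
    return [w for w in words if term in w.lower()]


# Hypothetical index contents, loosely based on the words named in this thread.
indexed = ["PPP-B-601", "PPP-B-601H", "PPP-B-601G", "cat", "category"]

print(exact_match("PPP-B-601", indexed))      # only the exact word
print(substring_match("PPP-B-601", indexed))  # also picks up the H and G revisions
print(substring_match("cat", indexed))        # the side effect: "category" matches too
```

Substring matching would let "PPP-B-601" find the "H" and "G" variants, but it applies to every search term, which is why it tends to add noise for ordinary words.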

      And because you have "dots" enabled for joining words, the filename is indexed as "PPP-B-601.pdf" and not "PPP-B-601". So your current boosting of the filename is not helping searches with the keyword "PPP-B-601", only searches for "PPP-B-601.pdf".

      Either way, it is evident that your other results would outrank this document in terms of relevance because of all of this.

      I think you need to decide whether you mean (or need) "PPP-B-601H" and "PPP-B-601G" to be the same as "PPP-B-601". And whether you want the distinction with searching for the full filename (with the ".pdf" extension or not).

      If you disable dots from joining words, "PPP-B-601" will be indexed from the filename (being broken off from ".pdf") and it'll be boosted by your weighting configuration. BUT given how little else this word appears in the document, it doesn't guarantee it'll be #1.
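The effect of the "dots join words" setting can be sketched with a toy tokenizer. This is an assumed model, not Zoom's actual indexer: `tokenize` and its `join_chars` parameter are hypothetical names, and the sketch assumes hyphens are already treated as joining characters (which the thread implies, since "PPP-B-601" is handled as a single word).

```python
import re


def tokenize(text, join_chars="-"):
    """Split text into words; characters in join_chars stay inside a word."""
    pattern = "[A-Za-z0-9" + re.escape(join_chars) + "]+"
    return re.findall(pattern, text)


filename = "PPP-B-601.pdf"

# Dots join words: the whole file name becomes one indexed word.
print(tokenize(filename, join_chars="-."))

# Dots do not join words: the base name is indexed separately from "pdf".
print(tokenize(filename, join_chars="-"))
```

In the first case a search for "PPP-B-601" has no filename word to match, so any filename boost never applies; in the second case the base name is available as its own word.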
      --Ray
      Wrensoft Web Software
      Sydney, Australia
      Zoom Search Engine



      • #4
        I don't think either answer will work...

        Users won't necessarily know that the document exists on the site. Their search for the reference to that document (PPP-B-601...) needs to present the actual document first, in part so they know it exists, so they are unlikely to add ".pdf" to the search request.

        Also, I can't explicitly add the revision number, such as 'H', to the filename. This convention was created before computers, and leaving it out of the filename lets me revise the base document while users' bookmarks still take them to the most current version. The revision number generally assumes a person has a binder and throws the old copy away, so the 'H' only indicates that they should throw out the 'G' version.

        My need to weight the PDF file names heavily would be similar to indexing an actual movie along with thousands of reviews of it. The name of the movie may never be used in the transcript, so it would only appear in the title, but if someone were to search for the movie name, the one (and only one) occurrence of the movie should reasonably come first, while any reviews mentioning it can easily be found starting in the #2 position. So, as in this example, the movie itself would need to have a different precedence than the reviews/standard web pages.

        Unlike html pages, the content has very little relevance to the filename itself here. This document, PPP-B-601, for example would be considered more of a parent or senior document so any other documents that reference it would likely be of little to no benefit. I think the big difference here is between textual content and technical documentation.

        I think my problem really lies in the weighting relationship between these PDFs and the rest of the site. A search for PPP-B-601 would result in useful html pages but it would be most valuable for the desired PDF to show first.

        Since my first post, I read in the help files about .desc files (although I can't find the screen that it's on - I know I've seen it). I'm hesitant to use them, though, because over time I could have thousands of these documents, so something more automated would be useful.

        Any new ideas?

        Thanks much!

        Also, I'd like to say that even with this issue I have to deal with, it's a great product. It's miles above anything else that I've seen in the price range.



        • #5
          Ignore that part about not being able to find the .desc section in the program



          • #6
            Originally posted by woodencrates
            Users won't necessarily know that the document exists on the site. Their search for the reference to that document (PPP-B-601...) needs to present the actual document first, in part so they know it exists, so they are unlikely to add ".pdf" to the search request.
            In which case, the first thing you should do is disable "Dots" from joining words ("Configure"->"Indexing options") and re-index and see how that helps. You will then at least get "PPP-B-601" indexed from the filename as you wanted. And this will be boosted accordingly. So try this first.

            I'm not sure if you understood what I said before about this, but basically, since you currently have dots enabled for joining, "PPP-B-601" is not currently indexed from your filename so all the weighting in the world isn't helping.

            Originally posted by woodencrates
            Unlike html pages, the content has very little relevance to the filename itself here. This document, PPP-B-601, for example would be considered more of a parent or senior document so any other documents that reference it would likely be of little to no benefit. I think the big difference here is between textual content and technical documentation.

            I think my problem really lies in the weighting relationship between these PDFs and the rest of the site. A search for PPP-B-601 would result in useful html pages but it would be most valuable for the desired PDF to show first.
            Given the above requirements, I think your weightings should be changed to reflect this. That is, your filename should be +5 boost, and you can widen the gap with the other aspects of the page by dropping them into deboost values.

            Again, make sure you disable Dots from indexing first, take a look at your new results, and then decide how your weightings should change to assist in adjusting your results to your preference.
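As a rough illustration of why widening the gap between the filename weight and the other weights matters, here is a toy scoring model. Everything in it is an assumption made for the example: the `score` function, the per-field hit counts, and the multiplier values are invented, and Zoom's real scoring formula and boost scale are described in its FAQ, not here.

```python
def score(hits, weights):
    """Sum per-field hit counts, each scaled by that field's weight."""
    return sum(count * weights.get(field, 1.0) for field, count in hits.items())


# Hypothetical hit counts for the search word "PPP-B-601".
pdf_hits = {"filename": 1, "body": 1}               # the standard itself
html_hits = {"title": 1, "heading": 1, "body": 6}   # a page that discusses it

# Illustrative multipliers only; they are not Zoom's actual boost values.
heading_boosted = {"filename": 1.0, "title": 1.0, "heading": 4.0, "body": 1.0}
filename_boosted = {"filename": 16.0, "title": 1.0, "heading": 0.5, "body": 0.5}

for name, weights in [("heading boosted", heading_boosted),
                      ("filename boosted", filename_boosted)]:
    print(name, "PDF:", score(pdf_hits, weights), "HTML:", score(html_hits, weights))
```

Under these made-up numbers the HTML page wins while headings are boosted, and the PDF wins once the filename is boosted and the other fields are deboosted. The real outcome depends on the actual hit counts on your site, so re-index and check the results before settling on values.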

            Another thought - regarding your current weighting settings, and what you said here:

            Originally posted by woodencrates
            (With my weighting, it may be important to note that all my web pages use the Page Title as the <h1> tag and the meta Description as the <h2> tag.)
            It would seem to me that you shouldn't have "Heading" set to "+3 Boost" if this is the case. It sounds like you're saying page titles and meta descriptions are duplicated on the page in the form of headings, so what you have done is essentially boost these parts of the page twice, which further helps your HTML pages (with these words throughout the content) outrank your PDF file with its single matching filename.
            --Ray
            Wrensoft Web Software
            Sydney, Australia
            Zoom Search Engine
