Filtering category text from URLs?

  • Filtering category text from URLs?

    I'm setting up a search for a site that has 100ish PDFs which they want searchable by topic (category). Topics are stored in a many-to-many database. I've created a script which outputs all the PDF links with topic information attached, like so:
    http://site.com/paper1.pdf?topic=1&topic=4
    http://site.com/paper2.pdf?topic=2
    http://site.com/paper3.pdf?topic=1&topic=3&topic=5
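For anyone wanting to reproduce a list like the one above, here is a minimal Python sketch of such a script. The original script and database schema are not shown in this thread, so the `PAPER_TOPICS` mapping (a stand-in for the many-to-many table), `BASE_URL`, and the function name are all hypothetical.

```python
from urllib.parse import urlencode

# Hypothetical stand-in for the many-to-many table: filename -> topic IDs.
PAPER_TOPICS = {
    "paper1.pdf": [1, 4],
    "paper2.pdf": [2],
    "paper3.pdf": [1, 3, 5],
}

BASE_URL = "http://site.com/"  # assumed site root, taken from the example URLs


def build_spider_urls(paper_topics, base_url=BASE_URL):
    """Return one URL per PDF with a repeated 'topic' parameter per topic."""
    urls = []
    for filename, topic_ids in paper_topics.items():
        # A list of (key, value) pairs lets the same key repeat in the query.
        query = urlencode([("topic", topic_id) for topic_id in topic_ids])
        url = base_url + filename
        if query:
            url += "?" + query
        urls.append(url)
    return urls


if __name__ == "__main__":
    print("\n".join(build_spider_urls(PAPER_TOPICS)))
```

The printed list can then be handed to the spider as its list of start URLs.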

    Zoom spiders this list, categorization works perfectly, and the search is great. The only problem is, I've had to turn off meta information in the PDF options because they've been distilling from Word documents, and with the options they've been using, search results look like:

    Paper1Title-Aug31.doc
    Lorem ipsum...

    Paper2Title-Sept14.doc
    Lorem ipsum...

    Which is confusing, since they're PDFs. So instead, I'm pulling in the filename for the title in the search results. Which would be fine, except that now it's showing all the category info in the title:

    Paper2Filename.PDF?topic=2
    Lorem ipsum...

    Is there a way to filter out everything after the '?' (i.e., the artificial GET variables)? I'm trying to avoid asking them to redistill all of their PDFs.
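To make the desired transformation concrete, this is roughly what "everything before the '?'" looks like when done in a script outside of Zoom. It is only an illustration in Python; `title_from_url` is a hypothetical helper and not a Zoom setting or feature.

```python
import posixpath
from urllib.parse import unquote, urlsplit


def title_from_url(url):
    """Return the bare filename of a URL, without query string or fragment."""
    # urlsplit separates the path from anything after '?' or '#'.
    path = urlsplit(url).path
    # basename keeps only the last path segment; unquote decodes %20 and friends.
    return unquote(posixpath.basename(path))


if __name__ == "__main__":
    print(title_from_url("http://site.com/Paper2Filename.PDF?topic=2"))
```

A helper like this could be used wherever titles are generated by your own script rather than by the indexer.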

  • #2
    Maybe a better solution would be to use .desc files to override the PDF metadata? See the user's guide for details.

    Or you could just update the metadata in the PDF files so that it is correct. I think you can do this without re-creating the PDF.

    There is no way to filter out part of the file name if you are using the file name as the title.

    • #3
      Originally posted by mmoyes
      I'm setting up a search for a site that has 100ish PDFs which they want searchable by topic (category). Topics are stored in a many-to-many database. I've created a script which outputs all the PDF links with topic information attached, like so:
      http://site.com/paper1.pdf?topic=1&topic=4
      http://site.com/paper2.pdf?topic=2
      http://site.com/paper3.pdf?topic=1&topic=3&topic=5

      So does that mean you are actually using a PHP (or similar) script here, but with URL rewrite to make them appear as PDFs? And that your script is actually serving the PDF file?

      If this is the case, you may be able to send a Content-Disposition header from your script to specify the real filename of the PDF (i.e., "paper1.pdf" without the "?topic=1" parameters). See this previous thread for details:
      http://www.wrensoft.com/forum/showth...=2045#post7685

      I'm not sure if your URL rewriting would be an issue or not, as it may mislead Zoom to think they are PDF files to begin with and not look for the header (although Zoom generally looks at the content-type specified first). Nonetheless, it might be worth giving that a go.
      --Ray
      Wrensoft Web Software
      Sydney, Australia
      Zoom Search Engine
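As a rough illustration of the Content-Disposition suggestion above, here is a minimal sketch using Python's standard WSGI interface instead of PHP. It only applies if a script really is serving the PDFs. The `make_pdf_app` factory and the `pdf_dir` location are hypothetical, and whether the spider prefers `inline` or `attachment` in the header is an assumption that would need checking against the linked thread.

```python
import os


def make_pdf_app(pdf_dir):
    """Build a WSGI app that serves PDFs from pdf_dir (hypothetical location)."""

    def app(environ, start_response):
        # PATH_INFO never contains the query string, so "?topic=1&topic=4"
        # is already gone here; basename also blocks directory traversal.
        filename = os.path.basename(environ.get("PATH_INFO", ""))
        full_path = os.path.join(pdf_dir, filename)

        if not filename.lower().endswith(".pdf") or not os.path.isfile(full_path):
            start_response("404 Not Found", [("Content-Type", "text/plain")])
            return [b"Not found"]

        with open(full_path, "rb") as handle:
            body = handle.read()

        headers = [
            ("Content-Type", "application/pdf"),
            # Advertise the real filename, without the artificial GET variables.
            # Assumes filenames contain no double quotes.
            ("Content-Disposition", 'inline; filename="%s"' % filename),
            ("Content-Length", str(len(body))),
        ]
        start_response("200 OK", headers)
        return [body]

    return app


# To try it locally (this call blocks until interrupted):
#   from wsgiref.simple_server import make_server
#   make_server("", 8000, make_pdf_app("pdfs")).serve_forever()
```

A request for `/paper1.pdf?topic=1&topic=4` would then come back with `filename="paper1.pdf"` in the header.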

      • #4
        You're right...

        You were right: adding the fake GET variables (filename.pdf?category=name) confused the spider, so it didn't recognize the file as a PDF. (As an aside, if anyone tries this: if you end up generating .desc files, which is what I did, it won't substitute your <title> in the results.) Of course, if you're generating .desc files anyway, you can just include the ZOOMCATEGORY meta tag in your .desc file.
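For anyone following the same route, here is a rough Python sketch of generating .desc files in bulk. Several things in it are assumptions rather than facts from this thread: that a .desc file sits next to the PDF and is named after it with ".desc" appended, that it accepts an HTML-style `<title>` plus `<meta name="ZOOMCATEGORY">` tags, and that one meta tag per category is accepted. Check the users guide before relying on any of these. The function names and the `papers` mapping are hypothetical.

```python
import html
from pathlib import Path


def desc_text(title, categories):
    """Build .desc content: a title plus one ZOOMCATEGORY tag per category."""
    lines = ["<title>%s</title>" % html.escape(title)]
    for category in categories:
        # Assumed tag format; verify against the users guide.
        lines.append(
            '<meta name="ZOOMCATEGORY" content="%s">' % html.escape(category, quote=True)
        )
    return "\n".join(lines) + "\n"


def write_desc_files(pdf_dir, papers):
    """Write <pdf name>.desc next to each PDF; returns the paths written.

    papers maps a PDF filename to a (title, [category, ...]) pair.
    """
    written = []
    for filename, (title, categories) in papers.items():
        desc_path = Path(pdf_dir) / (filename + ".desc")  # assumed naming scheme
        desc_path.write_text(desc_text(title, categories), encoding="utf-8")
        written.append(desc_path)
    return written
```

With categories carried in the .desc files, the PDFs can be linked by their plain URLs, with no artificial GET variables.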

        Thanks for your help with this. We're up and running.

        Cheers.
