I'm setting up search for a site that has roughly 100 PDFs, which they want searchable by topic (category). Topics are stored in a many-to-many relationship in the database. I've created a script that outputs all the PDF links with the topic information attached, like so:
http://site.com/paper1.pdf?topic=1&topic=4
http://site.com/paper2.pdf?topic=2
http://site.com/paper3.pdf?topic=1&topic=3&topic=5
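For reference, here is a minimal Python sketch of how a list like this can be generated. It is not the actual script: the `paper_topics` mapping stands in for the result of the many-to-many query, and the function name `build_spider_urls` is made up for illustration.

```python
from urllib.parse import urlencode


def build_spider_urls(base_url, paper_topics):
    """Return one URL per PDF, with a repeated `topic` parameter per topic ID.

    `paper_topics` is a hypothetical stand-in for the many-to-many lookup:
    it maps each PDF filename to the list of topic IDs assigned to it.
    """
    urls = []
    for filename, topic_ids in paper_topics.items():
        # A list of (key, value) pairs lets the same key appear more than once.
        query = urlencode([("topic", topic_id) for topic_id in sorted(topic_ids)])
        url = f"{base_url}/{filename}"
        if query:
            url += "?" + query
        urls.append(url)
    return urls


if __name__ == "__main__":
    # Illustrative data only, mirroring the three example links.
    paper_topics = {
        "paper1.pdf": [1, 4],
        "paper2.pdf": [2],
        "paper3.pdf": [1, 3, 5],
    }
    for url in build_spider_urls("http://site.com", paper_topics):
        print(url)
```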
Zoom spiders this list, categorization works perfectly, and the search is great. The only problem is that I've had to turn off meta information in the PDF options: the PDFs were distilled from Word documents, and with the options that were used, the search results looked like this:
Paper1Title-Aug31.doc
Lorem ipsum...
Paper2Title-Sept14.doc
Lorem ipsum...
That is confusing, since the files are PDFs. So instead, I'm using the filename as the title in the search results. That would be fine, except that the title now shows all the category information as well:
Paper2Filename.PDF?topic=2
Lorem ipsum...
Is there a way to filter out everything after the '?' (i.e., the artificial GET variables) from the title? I'm trying to avoid asking them to redistill all of their PDFs.
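To make the desired result concrete, here is a minimal Python sketch of the transformation I'm after. It is not a Zoom setting or part of Zoom's configuration, just an illustration of the output I want; the function name `title_from_url` is made up for this example.

```python
import posixpath
from urllib.parse import urlsplit


def title_from_url(url):
    """Return only the filename part of a URL, without the query string."""
    # urlsplit separates the path from everything after the '?'.
    path = urlsplit(url).path
    # basename keeps the last path segment, i.e. the PDF filename.
    return posixpath.basename(path)


if __name__ == "__main__":
    print(title_from_url("http://site.com/paper2.pdf?topic=2"))
    print(title_from_url("Paper2Filename.PDF?topic=2"))
```

With this, a title such as `Paper2Filename.PDF?topic=2` would be shown as `Paper2Filename.PDF`, while the topic parameters stay in the URL for categorization.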