PHP using CGI index - Hitting 300,000 unique word limit


  • PHP using CGI index - Hitting 300,000 unique word limit

    I have a DVD product that I am using the Zoom indexer for, but I need to be able to run scripts in PHP. I have a huge set of around 6,000 PDFs to index, and some of them are very large. With the PHP indexer I keep running into the limits and am unable to index everything that I need.

    My question: can I use the CGI indexes via a PHP script, or is the data formatted so differently as to make that impossible?

    Thanks a ton,
    Sean

  • #2
    6,000 PDF files is not that many. It surprises me a little that you are hitting some type of limit in PHP. Which limit are you hitting? Maybe you can just increase the limit?

    No, you cannot use a set of index files that were generated with the PHP option turned on with the CGI script.

    With PHP, the binary data format of the index is different from the CGI format (for example, page numbers are 2 bytes in PHP and 4 bytes in CGI).
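    To illustrate the difference (in Python rather than PHP, and with little-endian byte order assumed purely for this sketch; only the 2-byte/4-byte field widths come from the post above):

```python
import struct

page = 61234  # an example page/offset value to store

# PHP-format index: pages stored as unsigned 2-byte integers,
# so nothing above 65535 can be represented in this field.
php_field = struct.pack("<H", page)   # 2 bytes

# CGI-format index: unsigned 4-byte integers, a much larger range.
cgi_field = struct.pack("<I", page)   # 4 bytes

print(len(php_field), len(cgi_field))  # 2 4
```

    Because the record widths differ, the same byte offsets point at different fields in the two formats, which is why one script cannot read the other's index files.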

    What you could do is,

    1) Just use the CGI. I don't know how you are running PHP on a DVD. Maybe using Server2Go or MicroWeb? But there is probably no reason why you can't mix PHP and CGI on your DVD.

    2) Wrap the CGI script in PHP code, if option 1) is not possible for some reason. You could use PHP code like this:
    <?php
    // Pass the browser's query string straight through to the CGI.
    // virtual() performs an Apache sub-request, so this requires
    // PHP running as an Apache module.
    $qstring = $_SERVER['QUERY_STRING'];
    virtual('/cgi-bin/search.cgi?' . $qstring);
    ?>

    ------
    David



    • #3
      Originally posted by Wrensoft
      Which limit are you hitting? Maybe you can just increase the limit?
      I am hitting the limit on unique words. I have maxed it out already. And I cannot reduce the number of words scanned per file without risking that too little of each file is scanned for the index to be useful.

      Originally posted by Wrensoft
      No you can not use a set of index files that where generated with the PHP option turned on, with the CGI script.
      Actually, what I meant to ask is the opposite: can I use the index files generated with the CGI option turned on with the PHP script? It sounds like this is not possible. Am I correct in that assumption?

      Originally posted by Wrensoft
      1) Just use the CGI. I don't know how you are running PHP on a DVD.
      A homegrown solution, but very similar to what the other outfits are doing.

      Originally posted by Wrensoft
      But there is probably no reason why you can't mix PHP and CGI on your DVD.
      So, I will probably try to run the CGI using your "virtual" method and see where I come out. I just thought I might see if there were an easier way than that, and who can blame anyone for trying to make a project go faster?

      Thanks,

      Sean



      • #4
        I am hitting the limit on the unique words
        The limit is 300,000 words. But there are only about 50,000 words in the English language. So it is worth a bit of investigation to know where the other 250,000 words are coming from. Are they huge lists of part numbers? Maybe multiple foreign languages?

        Can you have a look in the zoom_dictionary.zdat file with a good text editor and see what words are in it? Maybe you are indexing rubbish or binary data (which would appear as rubbish).
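        As a rough aid, and assuming a simple one-word-per-line text layout with a number after each word (the real .zdat layout may differ), a Python sketch like this could flag suspect entries:

```python
import re

# Words made only of letters (including accented Latin letters),
# apostrophes and hyphens look legitimate; anything else, such as
# hex runs like 4C15 or punctuation like *(/[, is suspect.
wordlike = re.compile(r"^[A-Za-z\u00C0-\u017F'-]+$")

def suspicious_entries(lines):
    bad = []
    for line in lines:
        parts = line.split()
        if not parts:
            continue
        word = parts[0]
        if not wordlike.match(word):
            bad.append(word)
    return bad

sample = ["crédit 12", "*(/[ 3", "4C15 7", "engine 42"]
print(suspicious_entries(sample))  # ['*(/[', '4C15']
```

        Skimming the output of something like this would quickly show whether the 300,000 words are real vocabulary or indexing rubbish.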

        Can I use the index files generated with the CGI option turned on with the PHP script? It sounds like this is not possible. Am I correct in that assumption?
        Yes, this is correct. The only way to do this is to call the CGI from PHP. Or just call the CGI directly from HTML without PHP.

        -----
        David



        • #5
          Originally posted by Wrensoft
          The limit is 300,000 words. But there are only about 50,000 words in the English language. So it is worth a bit of investigation to know where the other 250,000 words are coming from. Are they huge lists of part numbers? Maybe multiple foreign languages?

          Can you have a look in the zoom_dictionary.zdat file with a good text editor and see what words are in this file. Maybe you are indexing rubbish or binary data (which would appear as rubbish).
          Well, you appear to be right. There is a lot of garbage in there, and it seems to be binary data from the PDF files. Is this how the Zoom indexer views the data in the PDFs? Shouldn't it filter out "words" that contain only numbers, or ones with special characters? My index included things like *(/[ as a word, which is totally ludicrous.

          I'm sure that if I were to fix this simple problem I would be able to use the PHP indexer without any trouble.

          Thanks,
          Sean



          • #6
            On closer inspection of the dictionary file, I see there is a -1 next to many of the non-allowed words, as well as some of the allowed words, which may denote that they are not listed or indexed, like "To", "a", and "lineprinter".

            Am I correct about this identifier? Do these words (there are thousands in my file) count towards the 300,000 unique words, especially since they reside in this file?

            Thanks,
            Sean



            • #7
              Originally posted by seangates
              There is a lot of garbage in there, and it seems to be binary data from the PDF files. Is this how the Zoom indexer views the data in the PDFs? Isn't it parsing out "words" that have only numbers in them, and those that have special characters? My index included things like *(/[ as a word, which is totally ludicrous.
              This is not normal. Make sure you have the Standard or Professional Edition of Zoom, and that you have correctly installed the PDF plugin on this page:
              http://www.wrensoft.com/zoom/plugins.html

              If you are indexing in Spider mode, check whether your web server is returning the correct Content-Type (aka MIME type) for the PDF files being indexed. This is often incorrect when PDF files are served via a server-side script, such as a PHP file like "download.php?fileid=123" or similar.

              If your site is online, perhaps you can give us a URL to the PDF files in question and we can check if there is something unusual happening.

              Alternatively, you could e-mail us (at zoom [at] wrensoft [dot] com) one of the PDF files which causes this problem.

              Originally posted by seangates
              On closer look to the dictionary file that there is a -1 next to many of the non-allowed words ... Are these words (there are thousands in my file) counted towards the 300,000 unique words...
              No, they do not count towards the number of unique words.
              --Ray
              Wrensoft Web Software
              Sydney, Australia
              Zoom Search Engine




                • #9
                  Originally posted by Ray
                  This is not normal. Make sure you have the Standard or Professional Edition of Zoom, and that you have correctly installed the PDF plugin ...
                  I am using the Professional Edition and have installed the PDF plugin as directed.
                  Originally posted by Ray
                  If you are indexing in Spider mode ...
                  I am indexing in Offline Mode.
                  Originally posted by Ray
                  If your site is online ...
                  Totally offline. The index will be used for a distributed DVD.
                  Originally posted by Ray
                  Alternatively, you could e-mail us (at zoom [at] wrensoft [dot] com) one of the PDF files which causes this problem.
                  I cannot send the PDFs since they are proprietary and very sensitive to our business. But I'll email the dictionary file today.

                  Thanks!
                  Sean



                  • #10
                    Got the file.

                    You said in an earlier post that your index files contained 300,000 unique words. The file you sent us, however, contains fewer than 20,000 words. So I am a bit confused.

                    You also claimed the file was full of garbage binary data. I could find no evidence of this in the file you sent; it was all readable text. So this makes me even more confused.

                    There were some punctuation characters being indexed, but nothing abnormal.

                    In short, there is nothing wrong with the file you sent.

                    -----
                    David



                    • #11
                      I only sent a truncated version; I apologize for that. To the best of my knowledge this was binary data, so you can tell I'm not much of a low-level programmer. I'll rerun the indexer and send you the full dictionary file when it is completed. I won't be able to run it until Monday, so please stay tuned.

                      Also, sorry about any confusion.

                      Thanks,
                      Sean



                      • #12
                        I sent off the dictionary file just a couple of minutes ago. It is 5 MB, and it reached the 300,000-word limit again.

                        As a side note, the documents being searched are very technical in nature, with a lot of tables of data: error codes, acronyms, function codes, etc.

                        Please let me know if you need anything else.

                        Thanks,
                        Sean



                        • #13
                          Got the new file, with 300,000 unique words.

                          I think you really do have 300,000 unique words, because:

                          1) You are indexing English and French documents (CRÉDIT, AUTRES, D'OCCASION, IMPÔTS). This will double your count, as you have two entire languages.

                          2) You appear to be indexing technical documents that contain lots of hexadecimal numbers: 4C15, 4C21, 2A04, 2B04, 3C04, FF7FFFFF, etc.

                          3) You are using a huge number of made-up words, often with hyphens, like COMPENSATION-CLERICAL, BMW-SALARIES-CLERICAL, MIN-SALARIES-CLERICAL, C-HONDA, PARTS-WARRANTY-HONDA, Lienhldr3St1, Lienhldr3State, etc. Maybe because you are indexing some software source code?

                          4) You are using a huge number of different abbreviations across many documents, e.g. MECH-CUST, MECH-EXP, LSE-BLDG-INS.

                          5) You have a huge number of product part numbers being indexed. At least I assume they are part numbers; maybe they are line numbers from source code?

                          6) There are a huge number of nonsense words: GLUHFWRU, FKRVHQ, VKXW, FHUWLILHG, DVHUVWDWLRQ, etc. Maybe they contain encrypted text, maybe there is a problem with the PDF converter, maybe the PDFs contain uuencoded data or ROT13 data. It is not clear to me where these words are coming from without seeing the PDF files.
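                          For what it's worth, those particular words decode cleanly if every letter is shifted back three places in the alphabet (a Caesar shift; a PDF font-encoding offset would be one possible cause, but that is only a guess):

```python
def shift_back(word, n=3):
    # Shift each A-Z letter back by n positions, wrapping around.
    return "".join(chr((ord(c) - ord("A") - n) % 26 + ord("A")) for c in word)

for w in ["GLUHFWRU", "FKRVHQ", "VKXW", "FHUWLILHG"]:
    print(w, "->", shift_back(w))
# GLUHFWRU -> DIRECTOR, FKRVHQ -> CHOSEN, VKXW -> SHUT, FHUWLILHG -> CERTIFIED
```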

                          Can you send us an example file, like 212898_b13573.pdf, 184835_68781-21.pdf or fsg_kenworth.pdf? If you can't send it, can you broadly tell us what kind of text / numbers are in these files?

                          ------
                          David



                          • #14
                            Originally posted by Wrensoft
                            Can you send us an example file, like 212898_b13573.pdf, 184835_68781-21.pdf or fsg_kenworth.pdf. If you can't send it can you broadly tell us what kind of text / numbers are in these files.
                            Well Dave, I've been humbled. I didn't realize the extent of these documents until you pointed me to some perfect examples.

                            These three docs are all very cluttered (and I mean VERY) with codes, database records (with iterating IDs) and other metadata used in our products. Soooooo, I still have to index them, and I still have a dilemma. My next questions go like this:

                            Can I do any regular-expression skipping? Say, skipping numbers longer than 4 digits, and numbers where a digit occurs three consecutive times (e.g. 111 or 777)?
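                            To make the idea concrete, the kind of filter I have in mind would look something like this (a hypothetical pre-filter sketch in Python, not anything Zoom itself offers as far as I know):

```python
import re

# Drop tokens that are pure numbers longer than 4 digits, or that
# contain the same digit three times in a row (e.g. 111 or 777).
long_number = re.compile(r"^\d{5,}$")
triple_digit = re.compile(r"(\d)\1\1")

def keep(token):
    return not (long_number.match(token) or triple_digit.search(token))

tokens = ["4C15", "123456", "ab777cd", "1234", "engine"]
print([t for t in tokens if keep(t)])  # ['4C15', '1234', 'engine']
```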

                            Am I now limited to the CGI index?

                            Thanks, and I apologize for my pig-headedness earlier. It's all the stress, I tell ya!

                            Sean



                            • #15
                              Say skipping numbers longer than 4 digits, and numbers with digits that occur three consecutive times
                              No, the skipping options are not this sophisticated. The closest you could come would be to put the following in your skip list:
                              *111
                              *222
                              *333
                              *444
                              *555
                              *666
                              *777
                              *888
                              *999
                              *000
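                              Assuming the skip list uses shell-style wildcards (which the * suggests), you can see both the effect and the limitation of those patterns with a quick Python check:

```python
from fnmatch import fnmatch

patterns = ["*111", "*222", "*333", "*444", "*555",
            "*666", "*777", "*888", "*999", "*000"]

def skipped(word):
    # A word is skipped if it matches any pattern in the list.
    return any(fnmatch(word, p) for p in patterns)

print(skipped("90111"))   # True  - ends in 111
print(skipped("777123"))  # False - the repeated digits are not at the end
```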

                              And I don't think it would be the final solution in any case; it would not help remove the "nonsense words".

                              If you really need to index all this stuff you'll need to switch to using the CGI option. If you have this many words you'll want the speed of the CGI option in any case.

                              I quickly looked back through the old posts; you never actually said what the problem with using the CGI version was.

                              -----
                              David
