Hi,
I have a site with a large number of Office 97-2010 documents (doc, docx, ppt, pptx, xls and xlsx). I am indexing the site offline and had initially enabled the "Extract metadata" option for the Office 2007 file types; I am not indexing content. However, the resulting database seems to be corrupt: the same document title is shown for multiple unrelated files in the search results. For example, if 10 results are returned, most or all of them can have the same title even though their summaries and URLs differ. Switching off "Extract metadata" for Office 2007 files fixes the issue, but I would rather keep it on, because switching it off reveals another issue (see below).
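One thing that may be worth ruling out before blaming the index is whether the files themselves carry the same stored title, for instance because they were created from a common template. This is only a guess, not a diagnosis. The sketch below is a minimal check using the Python standard library, assuming the files are standard Office 2007+ packages that keep the title in the `dc:title` element of `docProps/core.xml`. The function name `ooxml_title` is made up for illustration and is not part of the indexer.

```python
import zipfile
import xml.etree.ElementTree as ET

# Dublin Core namespace used for dc:title in docProps/core.xml.
DC_NS = "http://purl.org/dc/elements/1.1/"


def ooxml_title(source):
    """Return the stored title of a docx/pptx/xlsx file, or None if absent.

    `source` is a file path or a binary file-like object.
    """
    with zipfile.ZipFile(source) as package:
        try:
            data = package.read("docProps/core.xml")
        except KeyError:
            # The package has no core properties part at all.
            return None

    title = ET.fromstring(data).find(f"{{{DC_NS}}}title")
    if title is None or not (title.text or "").strip():
        return None
    return title.text.strip()
```

Running this over a handful of the affected files and comparing the returned values would show whether the repeated title is already present in the documents or only appears after indexing.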
If I switch off the option to Extract metadata, the summary text appears to be corrupted, both in the search results and in the zdat files. I can see that the summary has been taken from the text within the document, but numbers have been inserted into it. For Word docx files, these numbers seem to be either "4207510 -948690 0 0" or "493395 36830 0 0 2238375 36830 0 0 4013835 36830 0 0". They appear where there is a carriage return or where the text has been laid out in a table. For example, a two-line heading which is presented as:
Age
Data Profile
in the Word document, is being extracted and displayed in the summary as "Age 4207510 -948690 0 0 Data Profile". Then there is a 4-column table where the text is presented as:
Level | Usage | Compilation | Product Availability
Person | Prospecting | Modelled | ConsumerView
and this is being extracted and displayed in the summary as "Level Usage Compilation Product Availability 493395 36830 0 0 2238375 36830 0 0 4013835 36830 0 0 ConsumerView Monthly Person Prospecting Modelled". Note that the order of the text is different to how it is presented in the document.
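To help narrow this down, the extracted summary can be compared with the text that is actually stored in the document's text runs. The sketch below is a minimal version of that comparison using the Python standard library. It assumes a standard docx package in which visible text lives in `w:t` elements of `word/document.xml`. One possible explanation, which is an assumption and not confirmed, is that the stray numbers are layout values held in other elements, and reading only `w:t` would leave them out. The function name `docx_paragraphs` is hypothetical.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by word/document.xml in a .docx package.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def docx_paragraphs(source):
    """Return the visible text of each non-empty paragraph, in document order.

    `source` is a file path or a binary file-like object. Only <w:t> text
    runs are read, so values stored in any other element are ignored.
    """
    with zipfile.ZipFile(source) as package:
        root = ET.fromstring(package.read("word/document.xml"))

    paragraphs = []
    for para in root.iter(f"{{{W_NS}}}p"):
        # Join every text run that belongs to this paragraph.
        text = "".join(t.text or "" for t in para.iter(f"{{{W_NS}}}t"))
        if text.strip():
            paragraphs.append(text.strip())
    return paragraphs
```

If the numbers are absent from this output, they are not part of the document's text runs. Note that this simple version may report text twice when a paragraph contains a text box with its own nested paragraphs.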
I also noticed that the summary text is not taken from where I would have expected. In the example, the heading has been extracted, followed by text from a table at the bottom of a page, even though there are more meaningful introductory paragraphs which could have been used instead. The same is true of other Office-based files.
Any thoughts or solutions on the above? I know that I could create ".desc" files to solve the issue, but I would need to create these for hundreds of files and also set up a MIME type on the server, so I really don't want to go down this route if at all possible. If I could solve the first issue and have Extract metadata switched on without the titles being repeated, this would be a step in the right direction. I have a rather tight deadline to meet this week, so I could really do with an early solution or fix.
Many thanks,
Russ