As a follow-up, and to conclude for the time being, I've included at the end of this message the results of two successive runs of the same offline configuration over the same data set. The only difference was the number of threads used: nine for the first run, one for the second.
For some reason, the multiple-thread run produced slightly better results. For some but not all file types, more files were reported as indexed than in single-thread mode, from well under 1 percent more (.txt: 706 vs. 704) to roughly 9 to 10 percent more (.doc: 436 vs. 400; .png: 1651 vs. 1501). Overall, 27204 files were indexed in multiple-thread mode against 26982 in single-thread mode. The same 127 files were reported as "Errors" in both runs. It is not clear what the status of the other unindexed files was.
Among the "hard-core set" of 127 problematic files:
- About 40 Excel files from a particular sub-group of files stored on our GitHub server turned out not to be real Excel files at all, but symbolic links. When downloaded from our GitHub server, they appear on the local file system as normal Excel files, but are only 1 KB in size.
- About 50 Word files from a particular sub-group of files stored on our SVN server are old Word .doc files which can be opened locally, but not indexed. This may be related to their format, as they were all created with legacy versions of Word about 10 years ago.
- All remaining files are various Excel, Word, or PowerPoint files, mostly unrelated, and stored in various places on our SharePoint server. Local copies of these can be opened, but not indexed, for reasons that remain unclear to us.
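For anyone wanting to screen a local copy for files like the 1 KB "Excel" links above, here is a minimal sketch in Python. It is not part of the indexer; the function names and the extension lists are my own assumptions. It only checks whether each Office file starts with the signature its extension implies: the OLE2 compound-file header for legacy .doc/.xls/.ppt, and the ZIP header for the newer .docx/.xlsx/.pptx family. Note that some legitimate files will fail this check (for example, password-protected .docx/.xlsx files are stored as OLE2 containers), so treat the output as a list of candidates to inspect, not a verdict.

```python
from pathlib import Path

# Well-known file signatures: OLE2 compound file (legacy Office) and ZIP (OOXML).
OLE2_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")
ZIP_MAGIC = b"PK\x03\x04"

# Assumed extension groups; adjust to match what your configuration indexes.
LEGACY_EXTS = {".doc", ".xls", ".ppt"}
OOXML_EXTS = {".docx", ".docm", ".xlsx", ".xlsm", ".pptx", ".pptm"}


def looks_like_office_file(path):
    """Return False if an Office extension does not match the file's header."""
    ext = Path(path).suffix.lower()
    with open(path, "rb") as fh:
        header = fh.read(8)
    if ext in LEGACY_EXTS:
        return header.startswith(OLE2_MAGIC)
    if ext in OOXML_EXTS:
        return header.startswith(ZIP_MAGIC)
    return True  # not an Office extension, so this check has no opinion


def find_suspect_files(root):
    """Walk root and list (path, size in bytes) for files failing the check."""
    suspects = []
    for p in sorted(Path(root).rglob("*")):
        if p.is_file() and not looks_like_office_file(p):
            suspects.append((str(p), p.stat().st_size))
    return suspects
```

Calling `find_suspect_files()` on the root of the offline copy and sorting the result by size should bring tiny placeholder files to the top.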
Log Summary Report:
20:13:42 - Start indexing (offline mode) at Tue Oct 1 20:13:42 2019
20:48:32 - Indexing completed at Tue Oct 1 20:48:32 2019
20:48:32 - INDEX SUMMARY
20:48:32 - Files indexed: 27204
20:48:32 - Files skipped: 79821
20:48:32 - Files filtered: 1594
20:48:32 - Emails indexed: 0
20:48:32 - Unique words found: 779655
20:48:32 - Variant words found: 584214
20:48:32 - Total words found: 52207012
20:48:32 - Avg. unique words per page: 28.66
20:48:32 - Avg. words per page: 1919
20:48:32 - Peak physical memory used: 1421 MB
20:48:32 - Peak virtual memory used: 41013 MB
20:48:32 - Errors: 127
20:48:32 - Total bytes scanned/downloaded: 4068560597
20:48:32 - File extensions:
20:48:32 - .asp indexed: 0
20:48:32 - .aspx indexed: 0
20:48:32 - .cc indexed: 1587
20:48:32 - .cenv indexed: 160
20:48:32 - .cgi indexed: 0
20:48:32 - .crdl indexed: 1144
20:48:32 - .doc indexed: 436
20:48:32 - .docm indexed: 39
20:48:32 - .docx indexed: 125
20:48:32 - .h indexed: 1428
20:48:32 - .htm indexed: 8418
20:48:32 - .html indexed: 6755
20:48:32 - .one indexed: 346
20:48:32 - .onetoc2 indexed: 132
20:48:32 - .pdf indexed: 507
20:48:32 - .php indexed: 0
20:48:32 - .php3 indexed: 0
20:48:32 - .php4 indexed: 0
20:48:32 - .png indexed: 1651
20:48:32 - .ppt indexed: 2
20:48:32 - .pptm indexed: 14
20:48:32 - .pptx indexed: 627
20:48:32 - .py indexed: 2405
20:48:32 - .txt indexed: 706
20:48:32 - .vsd indexed: 161
20:48:32 - .vsdx indexed: 2
20:48:32 - .wmf indexed: 1
20:48:32 - .xls indexed: 18
20:48:32 - .xlsm indexed: 261
20:48:32 - .xlsx indexed: 279
22:12:54 - Start indexing (offline mode) at Tue Oct 1 22:12:54 2019
23:03:12 - Indexing completed at Tue Oct 1 23:03:12 2019
23:03:12 - INDEX SUMMARY
23:03:12 - Files indexed: 26982
23:03:12 - Files skipped: 79780
23:03:12 - Files filtered: 1592
23:03:12 - Emails indexed: 0
23:03:12 - Unique words found: 779605
23:03:12 - Variant words found: 584169
23:03:12 - Total words found: 51634751
23:03:12 - Avg. unique words per page: 28.89
23:03:12 - Avg. words per page: 1913
23:03:12 - Peak physical memory used: 569 MB
23:03:12 - Peak virtual memory used: 9729 MB
23:03:12 - Errors: 127
23:03:12 - Total bytes scanned/downloaded: 4022275435
23:03:12 - File extensions:
23:03:12 - .asp indexed: 0
23:03:12 - .aspx indexed: 0
23:03:12 - .cc indexed: 1587
23:03:12 - .cenv indexed: 160
23:03:12 - .cgi indexed: 0
23:03:12 - .crdl indexed: 1144
23:03:12 - .doc indexed: 400
23:03:12 - .docm indexed: 39
23:03:12 - .docx indexed: 120
23:03:12 - .h indexed: 1428
23:03:12 - .htm indexed: 8418
23:03:12 - .html indexed: 6755
23:03:12 - .one indexed: 346
23:03:12 - .onetoc2 indexed: 132
23:03:12 - .pdf indexed: 500
23:03:12 - .php indexed: 0
23:03:12 - .php3 indexed: 0
23:03:12 - .php4 indexed: 0
23:03:12 - .png indexed: 1501
23:03:12 - .ppt indexed: 2
23:03:12 - .pptm indexed: 14
23:03:12 - .pptx indexed: 611
23:03:12 - .py indexed: 2405
23:03:12 - .txt indexed: 704
23:03:12 - .vsd indexed: 161
23:03:12 - .vsdx indexed: 2
23:03:12 - .wmf indexed: 1
23:03:12 - .xls indexed: 18
23:03:12 - .xlsm indexed: 255
23:03:12 - .xlsx indexed: 279
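
To compare the two summaries without reading them side by side, a small script can pull out the per-extension counts and list only the extensions that differ. This is a rough sketch in Python, assuming each run's summary has been saved as separate text and that the lines keep the "HH:MM:SS - .ext indexed: N" shape shown above; the function names are my own.

```python
import re

# Matches lines such as "20:48:32 - .doc indexed: 436".
LINE_RE = re.compile(r"^\d\d:\d\d:\d\d - (\.\w+) indexed: (\d+)$")


def parse_extension_counts(log_text):
    """Return {extension: files indexed} for one run's summary text."""
    counts = {}
    for line in log_text.splitlines():
        m = LINE_RE.match(line.strip())
        if m:
            counts[m.group(1)] = int(m.group(2))
    return counts


def compare_runs(multi, single):
    """List (ext, multi count, single count, percent more in multi) where counts differ."""
    rows = []
    for ext in sorted(set(multi) | set(single)):
        a, b = multi.get(ext, 0), single.get(ext, 0)
        if a != b:
            # Percent difference relative to the single-thread count.
            pct = (a - b) / b * 100 if b else float("inf")
            rows.append((ext, a, b, round(pct, 1)))
    return rows
```

Passing the nine-thread summary as `multi` and the one-thread summary as `single` gives one row per file type whose count changed between the runs.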