Indexing/Searching thousands of PDFs via network?


  • Hi all,

    I work for part of Western Michigan University and we're having a bit of trouble searching some documents.  What we have is about 60,000 PDF files, with about 1.5M pages combined, on a network SMB share.  We're trying to search these documents for keywords in the actual content, the bookmarks, and the comments.  We're also looking for features like searching within a search and the ability to select only certain files instead of entire directories.

    With that being said, I've been looking at different ways to search these documents: IFilters (Adobe and Foxit), PDF-XChange Viewer, Foxit Reader, Adobe Acrobat Pro, and a couple of others that have slipped my mind.  Given how many documents there are and how large they are, it takes about fifteen hours to run one search of the entire collection, provided the software doesn't crash before it finishes.  Indexing these documents would greatly reduce the search times, probably to a couple of minutes, but Windows Search combined with a PDF IFilter, which is the simplest option, forces the searches to be run on the server.  We have about 30 client machines, and any of them should be able to search these files.

    So I guess what I'm looking for is a way to build the index on the server and have the client machines read that index to find keywords within the PDFs.  Another option would be a back-end server that processes the PDFs, indexed or not, and pushes the results to a front-end interface.  I looked at Elasticsearch for this, but I'm not entirely sure how to set up Elasticsearch to read PDFs.
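
    To make the idea concrete, here is a rough sketch of the kind of index I have in mind.  It is only an illustration, not something we run: it assumes the text of each page has already been pulled out of the PDFs by some other tool, and the names build_index and search, along with the sample file names, are made up for the example.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(pages):
    """Build an inverted index: word -> set of (filename, page_number).

    `pages` maps (filename, page_number) to already-extracted page text.
    Extracting that text from the PDFs is assumed to happen elsewhere.
    """
    index = defaultdict(set)
    for location, text in pages.items():
        for word in tokenize(text):
            index[word].add(location)
    return index


def search(index, query, within=None, files=None):
    """Return the locations that contain every word in `query`.

    `within` restricts results to an earlier result set (search within a search).
    `files` restricts results to a chosen set of filenames.
    """
    words = tokenize(query)
    if not words:
        return set()
    hits = set.intersection(*(index.get(w, set()) for w in words))
    if within is not None:
        hits = hits & within
    if files is not None:
        hits = {loc for loc in hits if loc[0] in files}
    return hits


# Hypothetical sample data standing in for extracted PDF text.
pages = {
    ("budget.pdf", 1): "Annual budget report for the library",
    ("budget.pdf", 2): "Library staffing and budget notes",
    ("minutes.pdf", 1): "Meeting minutes: library renovation",
}
index = build_index(pages)
first = search(index, "library")                  # initial search
refined = search(index, "budget", within=first)   # search within that search
selected = search(index, "library", files={"minutes.pdf"})  # selected files only
```

    The server would build and store something like index once, and each client would only send a query and get back file and page locations, so none of the 30 machines would have to open the PDFs over the share just to search them.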

    Has anyone gone through this and found a solution?  Any ideas would be useful.

    Also, searching within a search and saving searches would be very nice.

    Wednesday, February 20, 2013 7:48 PM