Search indexing of BDC information containing database fields and PDF files

  • Question

  • We are building a new application using SharePoint including a customized search interface. The information that we want to make available through SharePoint Search comes from two sources. We have many PDF files that we want to be indexed and made available for full text search. For each of the PDF files, we have related metadata information that is stored separately in an Oracle database table. We would like to pass this information to the search crawler so that the PDF file along with its corresponding metadata information is treated as a single item.

    I have created a web service that cycles through the Oracle database for each item to be indexed. The metadata in Oracle contains a pointer to the PDF file on the file system. Within the web service, I have been able to package the required metadata together with the binary stream of the PDF document. Everything works well up to this point.
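
    For illustration, the packaging step described above might look roughly like the following. This is a minimal Python sketch under stated assumptions, not the actual web service: the `IndexItem` type, the `package_item` function, and the field names (`doc_id`, `pdf_path`) are hypothetical, and a plain dict stands in for a row read from the Oracle table.

```python
from dataclasses import dataclass
from pathlib import Path


@dataclass
class IndexItem:
    """One unit handed to the crawler: metadata plus the raw PDF bytes."""
    doc_id: str
    metadata: dict
    pdf_bytes: bytes


def package_item(row: dict, pointer_field: str = "pdf_path") -> IndexItem:
    """Combine one metadata row with the PDF file it points to.

    `row` stands in for a record from the Oracle table (hypothetical shape).
    """
    pdf_path = Path(row[pointer_field])
    # Everything except the file pointer is treated as searchable metadata.
    metadata = {k: v for k, v in row.items() if k != pointer_field}
    return IndexItem(
        doc_id=str(row["doc_id"]),
        metadata=metadata,
        pdf_bytes=pdf_path.read_bytes(),
    )
```

    The sketch only shows the shape of the packaged item. It says nothing about how the crawler would extract text from the byte array, which is the open question in this post.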

    I have also successfully created an application definition file (ADF) that maps the metadata and the binary stream for the BDC. The problem arises when the search crawler begins its work: it essentially does not know how to handle the PDF file, which it receives as a byte array.

    The reasoning behind this approach is that a full-text search might return several thousand matches, so it would be helpful for the user to refine the search results by selecting values for certain metadata fields from the database.
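
    As a rough illustration of the refinement idea, the sketch below filters a set of full-text hits by selected metadata values and counts the values available for one field. It is plain Python, not SharePoint search API code: the `refine` and `refiner_counts` functions and the sample fields (`department`, `year`) are hypothetical.

```python
from collections import Counter


def refine(hits: list, selected: dict) -> list:
    """Keep only hits whose metadata matches every selected field value."""
    return [
        hit for hit in hits
        if all(hit.get(field) == value for field, value in selected.items())
    ]


def refiner_counts(hits: list, field: str) -> Counter:
    """Count how many hits carry each value of one metadata field."""
    return Counter(hit[field] for hit in hits if field in hit)


# Hypothetical full-text hits, each already joined with its database metadata.
hits = [
    {"title": "Pump manual", "department": "Engineering", "year": 2008},
    {"title": "Pump invoice", "department": "Finance", "year": 2008},
    {"title": "Pump test report", "department": "Engineering", "year": 2009},
]

print(refiner_counts(hits, "department"))
print(refine(hits, {"department": "Engineering", "year": 2008}))
```

    This kind of filtering only works if each hit carries its metadata, which is why the PDF and its database row need to be indexed as one item.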

    1. Is it even possible to pass this type of packaged information into the BDC in a way that lets the crawler index both the metadata and the PDF file content as a single unit? If so, where can I find an example of how the ADF should be formatted for the search crawler to properly interpret the file and content?

    2. Is this a job for a custom protocol handler? Where can I find a good code example of what a custom protocol handler should look like, preferably in C#?

    3. Are there other options to consider that would make it easier to link the database content to the PDF file when searching?

    4. Our environment consists of MOSS 2007. Will this type of data merging work in this environment? I have seen the BCS for 2010, which appears as though it might meet the requirements, although we are not yet ready to upgrade.

    5. Are there other technical articles or resources that would help to complete the task? I have seen some products and tools from other vendors that sound like they provide this functionality, so it seems possible to do. I am just having a hard time finding the right resources online.

    Thanks in advance for your help. If more detailed information is necessary, please let me know.

    John

    Wednesday, January 6, 2010 4:48 PM

All replies

  • Hi John, this is an interesting scenario. It appears to me that, with the metadata in Oracle and the PDFs on the file system, it would not be possible to crawl them as a single item. Beyond that, I am not even sure whether we can crawl PDF files on the file system.

    1. Is it even possible to pass this type of packaged information into the BDC in a way that lets the crawler index both the metadata and the PDF file content as a single unit? If so, where can I find an example of how the ADF should be formatted for the search crawler to properly interpret the file and content?

    I am not sure whether this is possible, but to give it a try I would create a document library with some columns and look in the content database for the format in which the items are stored there. I would do this because search can crawl PDF files stored in a document library, so if I replicate the same format in my web service, it may work.

    2. Is this a job for a custom protocol handler? Where can I find a good code example of what a custom protocol handler should look like, preferably in C#?

    No idea.

    3. Are there other options to consider that would make it easier to link the database content to the PDF file when searching?

    There is one option, but it would be a complete shift from the architecture you described, and the data would be redundant. You can create a document library in SharePoint with your metadata as its columns, then upload the documents from the file system and fill in the columns with the metadata from Oracle. Search will work like a charm, handling the metadata and the PDF as a single item. However, the data will be redundant if you need to keep the PDFs on the file system and the Oracle data for some other purpose.
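
    Before attempting such a migration, it may help to check that every Oracle row still points at an existing PDF. The following is a minimal Python sketch of that planning step only. It does not call any SharePoint API: the `build_upload_plan` function and the `pdf_path` field name are hypothetical, and dicts stand in for database rows.

```python
from pathlib import Path


def build_upload_plan(rows: list, pointer_field: str = "pdf_path"):
    """Split metadata rows into uploadable items and rows whose PDF is missing."""
    plan, missing = [], []
    for row in rows:
        pdf_path = Path(row[pointer_field])
        if not pdf_path.is_file():
            # The pointer in the database no longer matches a file on disk.
            missing.append(row)
            continue
        # Every field other than the pointer becomes a library column value.
        columns = {k: v for k, v in row.items() if k != pointer_field}
        plan.append(
            {"source": pdf_path, "file_name": pdf_path.name, "columns": columns}
        )
    return plan, missing
```

    Each entry in `plan` pairs one file with the column values it would receive in the library, and `missing` lists the rows that need attention first.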

    4. Our environment consists of MOSS 2007. Will this type of data merging work in this environment? I have seen the BCS for 2010, which appears as though it might meet the requirements, although we are not yet ready to upgrade.

    It also seems to me that 2010 can handle this, because there we can create a list that references the data in the database.

    5. Are there other technical articles or resources that would help to complete the task? I have seen some products and tools from other vendors that sound like they provide this functionality, so it seems possible to do. I am just having a hard time finding the right resources online.
    No idea.

    I know my reply might not solve your problem, but it is an interesting issue and I thought I would share how I feel about it. Do keep this thread updated on your progress.

    Cheers
    Nitin Sablok
    Tuesday, January 12, 2010 6:51 PM