Problem is occurring in docx format; it is not printing “New Line” character in extracted txt using IFilter (offfiltx.dll) while with doc file IFilter (OffFilt.dll) is working fine.

  • Question

  • Problem: With docx files, the text extracted through IFilter (offfiltx.dll) contains no new-line characters, whereas with doc files the IFilter (OffFilt.dll) works fine.

    Environment:

    Operating System: Windows XP SP2 and Windows 7

    Language: C#

    MS Office Version: MS Office 2007

    Problem Description:

    We have a docx file that contains new-line characters. When we process this file with IFilter to extract its text, the output has all the lines concatenated.

    Contents of the docx file (Sample.docx):

    Test this music

    Word processing

    Testing docx file

    Output: Test this music Word processing Testing docx file

    Requirement: We need the text extracted from the docx file to keep its line breaks, in the following format, because the client uses the docx format only.

    Test this music

    Word processing

    Testing docx file

    Attempt:

    We tried many changes to the IFilter configuration, but none of them produced the required output. We then saved the same file in doc format (Sample.doc), and that gives the required output.

    Because this problem is specific to our application, kindly help us resolve it on priority. The IFilter paths we use to extract text from doc and docx files are given below.

    Doc filter location: %systemroot%\system32\OffFilt.dll

    Docx filter location: <Drive>:\PROGRA~1\COMMON~1\MICROS~1\Filters\offfiltx.dll

    Code snippet that sets the filter's initialization flags:

    internal static IFilter LoadAndInitIFilter(string fileName, string extension)
    {
        // Look up the filter registered for this file extension.
        IFilter filter = LoadIFilter(extension);
        if (filter == null)
            return null;

        IPersistFile persistFile = (filter as IPersistFile);
        if (persistFile != null)
        {
            // Load the document into the filter.
            persistFile.Load(fileName, 0);

            IFILTER_FLAGS flags;
            IFILTER_INIT iflags =
                IFILTER_INIT.CANON_HYPHENS |
                IFILTER_INIT.CANON_PARAGRAPHS |
                IFILTER_INIT.CANON_SPACES |
                IFILTER_INIT.APPLY_INDEX_ATTRIBUTES |
                IFILTER_INIT.HARD_LINE_BREAKS |
                IFILTER_INIT.FILTER_OWNED_VALUE_OK;

            if (filter.Init(iflags, 0, IntPtr.Zero, out flags) == IFilterReturnCode.S_OK)
                return filter;
        }

        // IPersistFile was unavailable or Init failed: release the COM object.
        Marshal.ReleaseComObject(filter);
        return null;
    }
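    One detail that may be relevant here: IFilter returns text chunk by chunk, and the STAT_CHUNK for each chunk carries a breakType that describes the break between the previous chunk and the current one. The sketch below is not a confirmed fix. It assumes (not verified in this thread) that the docx filter reports paragraph boundaries through breakType rather than as new-line characters inside the text. ChunkBreakType, ChunkJoiner and Join are hypothetical names used only for illustration; the enum values mirror the native CHUNK_BREAKTYPE enumeration.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Mirrors the native CHUNK_BREAKTYPE values from filter.h.
enum ChunkBreakType
{
    NoBreak = 0,        // CHUNK_NO_BREAK
    EndOfWord = 1,      // CHUNK_EOW
    EndOfSentence = 2,  // CHUNK_EOS
    EndOfParagraph = 3, // CHUNK_EOP
    EndOfChapter = 4    // CHUNK_EOC
}

static class ChunkJoiner
{
    // Hypothetical helper: joins (breakType, text) pairs that the caller has
    // already read from the filter. breakType describes the break that
    // separates the previous chunk from the current one.
    public static string Join(IEnumerable<KeyValuePair<ChunkBreakType, string>> chunks)
    {
        StringBuilder sb = new StringBuilder();
        foreach (KeyValuePair<ChunkBreakType, string> chunk in chunks)
        {
            if (sb.Length > 0)
            {
                switch (chunk.Key)
                {
                    case ChunkBreakType.EndOfParagraph:
                    case ChunkBreakType.EndOfChapter:
                        sb.Append(Environment.NewLine); // paragraph or chapter boundary
                        break;
                    case ChunkBreakType.EndOfWord:
                    case ChunkBreakType.EndOfSentence:
                        sb.Append(' ');                 // word or sentence boundary
                        break;
                    // NoBreak: append the text directly after the previous chunk.
                }
            }
            sb.Append(chunk.Value);
        }
        return sb.ToString();
    }
}

static class Program
{
    static void Main()
    {
        // Sample data standing in for chunks read from Sample.docx.
        List<KeyValuePair<ChunkBreakType, string>> chunks =
            new List<KeyValuePair<ChunkBreakType, string>>
            {
                new KeyValuePair<ChunkBreakType, string>(ChunkBreakType.NoBreak, "Test this music"),
                new KeyValuePair<ChunkBreakType, string>(ChunkBreakType.EndOfParagraph, "Word processing"),
                new KeyValuePair<ChunkBreakType, string>(ChunkBreakType.EndOfParagraph, "Testing docx file")
            };
        Console.WriteLine(ChunkJoiner.Join(chunks));
    }
}
```

    With the sample data above, the three strings are written on separate lines. If the filter turns out to report a different break type between paragraphs, the switch would need to be adjusted accordingly.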

    Kindly help us resolve this issue, and let us know if any further input is required.

    We would be really thankful for any help.

    Monday, November 24, 2014 6:40 AM

All replies

    Wednesday, November 26, 2014 3:33 PM
  • Hi,

    This question is now being discussed in the Windows Desktop Development forum:

    https://social.msdn.microsoft.com/Forums/windowsdesktop/en-US/60de8e65-23cd-4bce-ad4d-786fb23d4c50/problem-is-occurring-in-docx-format-it-is-not-printing-new-line-character-in-extracted-txt-using?forum=windowsgeneraldevelopmentissues#b5480072-c249-46d1-b8fa-555db13174d9

    Regards,

    Melon Chen
    TechNet Community Support


    Monday, December 8, 2014 6:37 AM