none
MS Word - Mass text conversion hanging on some files

    Question

  • So I am trying to pull the text out of a lot of documents and upload the text to Azure. The script does work quite well but occasionally it hits a file that just freezes MS Word, I have tried performing the process manually on these files and it freezes for me then too. Here is part of the script:

    ## Word object
    if (!($continue)) {
        if ($pdf){
            $files = (Get-ChildItem -force -recurse $documentFolder -include *.pdf).fullname
        }
        else {
            $files = (Get-ChildItem -force -recurse $documentFolder -include *.doc, *.docx).fullname
        }
    
        $files | Out-File (Join-Path $PSScriptRoot "\documents.txt")
        $i=0; Get-Content $documentFile -ReadCount $interval | %{$i++; $_ | Out-File (Join-Path $PSScriptRoot "\FileLists\documents_$i.txt")}
    }
    
    
    
    foreach ($list in (Get-ChildItem ( join-path $PSScriptRoot "\FileLists\*" ) -include *.txt )) {
    
        $word = New-Object -ComObject word.application
        $word.Visible = $false
        $saveFormat = [Enum]::Parse([Microsoft.Office.Interop.Word.WdSaveFormat], "wdFormatText")
        $word.DisplayAlerts = 0
    
    
        Write-Output ""
        Write-Output "################# Parsing $list"
        Write-Output ""
    
        $query = "INSERT INTO tmp_CachedText (tCachedText, tOID)
                  VALUES "
    
        foreach ($file in (Get-Content $list)) {
            if ($file -like "*-*" -and $file -notlike "*~*") {
    
                Write-Output "Processing: $($file)"
                Try {
                    $doc = $word.Documents.OpenNoRepairDialog($file, $false, $false, $false, "ttt")
    
                    if ($doc) {
                        $fileName = [io.path]::GetFileNameWithoutExtension($file)
                        $fileName = $filename + ".txt"
                        $doc.SaveAs("$env:TEMP\$fileName", [ref]$saveFormat)
                        $doc.Close()
    
    
                        $4ID = $fileName.split('-')[-1].replace(' ', '').replace(".txt", "")
                        $text = Get-Content -raw "$env:TEMP\$fileName"
                        $text = $text.replace("'", "''")
    
                        $query += "
                        ('$text', $4ID),"
                        Remove-Item -Force "$env:TEMP\$fileName"
    
                        <# Upload to azure #>
                        $query = $query.Substring(0,$query.Length-1)
                        $query += ";"
    
                        $params = @{
                            'Database' = $TRIS5DATABASENAME
                            'ServerInstance' = $($AzureServerInstance.FullyQualifiedDomainName)
                            'Username' = $AdminLogin
                            'Password' = $InsecurePassword
                            'query'    = $query
                        }
                        Invoke-Sqlcmd @params -ErrorAction "SilentlyContinue"
    
                        $query = "INSERT INTO tmp_CachedText (tCachedText, tOID)
                          VALUES "
                    }
                }
                Catch {
                    Write-Host "$($file) failed to process" -ForegroundColor RED;
                    continue
                }
            }
        }
        Remove-Item -Force $list.FullName
    
    
        Write-Output ""
        Write-Output "Uploading to azure"
        Write-Output ""
    
    
        $setQuery = "
        UPDATE tmp_CachedText SET tNID = OriginalREsumeID FROM Candidate WHERE tOID = OLDID;
        UPDATE document SET CachedText = tCachedtext FROM tmp_CachedText WHERE document.ID = tNID;
        DELETE FROM tmp_CachedText;"
    
        <# Upload to azure #>
        $params = @{
            'Database' = $TRIS5DATABASENAME
            'ServerInstance' = $($AzureServerInstance.FullyQualifiedDomainName)
            'Username' = $AdminLogin
            'Password' = $InsecurePassword
            'query'    = $setQuery
        }
        Invoke-Sqlcmd @params -ErrorAction "SilentlyContinue"
    
    
        $word.Quit()
        TASKKILL /f /PID WINWORD.EXE
    }

    So basically it creates a large list of files each containing x amount of file paths ( based on the $interval parameter ), uploads them to a table in an Azure SQL database and then updates the documents table. The problem is, is that some files just stop word from being able to save as .txt so it hangs at the following line:

    $doc.SaveAs("$env:TEMP\$fileName", [ref]$saveFormat)

    I just wanted to know if there was any way around this, it happens quite often and at the moment I can only really force close the MS Word process to get the script to continue but that skips any other files in the document file. It also means I have to keep monitoring it which isn't ideal as this whole process will take weeks on the number of files I have.

    Tuesday, November 6, 2018 4:44 AM

All replies

  • First why are you writing this to word but saving it as text.  There is no need for Word unless your files are all doc or docx files.

    I fixed some obvious issues but cannot understand what this is supposed to do and cannot test without your system.

    Do not keep recreating the word object. Just close the document and reopen the next document. DOing this repeatedly will cause corruption and conflict with Word.

    Keep all constants out of your loop so it is clean and easy to understand.

    ## Word object
    if (!($continue)) {
        if ($pdf){
            $files = (Get-ChildItem -force -recurse $documentFolder -include *.pdf).fullname
        }else {
            $files = (Get-ChildItem -force -recurse $documentFolder -include *.doc, *.docx).fullname
        }
    
        $files | Out-File (Join-Path $PSScriptRoot "\documents.txt")
        $i=0
        Get-Content $documentFile -ReadCount $interval | 
            ForEach-Object{
                $i++
                Out-File (Join-Path $PSScriptRoot "\FileLists\documents_$i.txt")
            }
    }
    
    $setQuery = @'
        UPDATE tmp_CachedText SET tNID = OriginalREsumeID FROM Candidate WHERE tOID = OLDID;
        UPDATE document SET CachedText = tCachedtext FROM tmp_CachedText WHERE document.ID = tNID;
        DELETE FROM tmp_CachedText;
    '@
    $querytmplt = "INSERT INTO tmp_CachedText (tCachedText, tOID) VALUES ('{0}','{1}')"
    
    
    $word = New-Object -ComObject word.application
    $word.Visible = $false
    $saveFormat = [Microsoft.Office.Interop.Word.WdSaveFormat]::wdFormatText
    
    $params = @{
        Database = $TRIS5DATABASENAME
        ServerInstance = $($AzureServerInstance.FullyQualifiedDomainName)
        Username = $AdminLogin
        Password = $InsecurePassword
        ErrorAction = 'Stop'
    }
    
    foreach ($list in (Get-ChildItem ( join-path $PSScriptRoot 'FileLists\*.txt' ))) {
    
        $word.DisplayAlerts = 0
    
    
        Write-Output ""
        Write-Output "################# Parsing $list"
        Write-Output ""
    
        foreach ($file in (Get-Content $list)) {
            if ($file -like '*-*' -and $file -notlike '*~*') {
    
                Write-Output "Processing: $($file)"
                Try {
                    $doc = $word.Documents.OpenNoRepairDialog($file, $false, $false, $false, 'ttt')
    
                    if ($doc) {
                        $fileName = ([io.FileInfo]$file).Basename + '.txt'
                        $doc.SaveAs("$env:TEMP\$fileName", [ref]$saveFormat)
                        $doc.Close()
    
    
                        $4ID = $fileName.split('-')[-1].replace(' ', '').replace(".txt", "")
                        $text = Get-Content -raw "$env:TEMP\$fileName"
                        $text = $text.replace("'", "''")
    
                        $query = $querytmplt -f $text, $4ID
                        Remove-Item -Force "$env:TEMP\$fileName"
    
                        <# Upload to azure #>
                        Invoke-Sqlcmd @params -Query $query
                    }
                }
                Catch {
                    Write-Host "$($file) failed to process" -ForegroundColor RED
                }
            }
        }
        
        Remove-Item -Force $list.FullName
    
        Write-Output ""
        Write-Output "Uploading to azure"
        Write-Output ""
    
        <# Upload to azure #>
        Invoke-Sqlcmd @params -Query $setQuery
    }
    
    $word.Quit()

    I also have shown some techniques to simplify code.


    \_(ツ)_/


    • Edited by jrv Tuesday, November 6, 2018 5:35 AM
    Tuesday, November 6, 2018 5:31 AM
  • Thanks for the reply. The reason for it is we are moving from a document index search to a database full-text search for documents so every existing document needs to have the text pulled out and uploaded to the database. Every file will be either .doc, .docx or .PDF and this is the best way I could find to extract the text from all 3 types.

    What you have done above is very helpful so thank you, it has sped up the process a bit. This is running over millions of documents and from memory the last time I tried this I had the creation of the word com object outside the loop entirely and after it ran for an hour or so I would get errors pointing to normal.dotm default template.




    • Edited by OwainEsau Tuesday, November 6, 2018 6:16 AM
    Tuesday, November 6, 2018 6:00 AM
  • If you use MSSearch service it can search the text in all Office file types and index it.  Also PDF files and most other file types can be indexed and efficiently searched.

    You post is corrupted. Please disable Grammerly for this site.   Contact Grammerly for a fix for this issue.


    \_(ツ)_/

    Tuesday, November 6, 2018 6:05 AM
  • We are moving away from MS search, documents are in sharepoint now and the searching is done via full text from an SQL database rather than a call to MSSearch.
    Tuesday, November 6, 2018 6:18 AM