Search a PDF and return specific text

  • Question

  • I need to be able to search a PDF for about 200 different reference numbers that I know to return a value I do not know. Examples of the reference numbers:

    • ABC-12-012
    • ABC-012-86
    • ABC-0512-10

    Where the reference number will always:

    1. Be at the beginning of a line
    2. Follow the word "References:"
    3. Start with ABC-
    4. Have a varying number of numeric characters between the hyphens.

    The data that I need is actually several lines above the "Reference". It is a series of dotted numbers followed by a description.  It resembles "9.8.1 Appendix A" but could just as easily be "9.1 Appendix D" or "9.2.8.63.4 Appendix C".

    Also, in case it matters, the known reference may not show up in every .PDF.

    Thanks for any help on this!

    Sample Text:

    ________________________________

    9.8.1 Appendix A

    Description:
    This is where a description would be.  there could be another header as well.

    Additional Information:
    One or more additional sections may exist between the 9.8.1 Appendix A (which is the text I need) and the ABC-0012-083 which is what I know to search for.

    References:
    ABC-0012-083

    9.8.2 Addendum 9

    Description:
    This is where a description would be.  there could be another header as well.

    Additional Information:
    One or more additional sections may exist between the 9.8.1 Appendix A (which is the text I need) and the ABC-0012-083 which is what I know to search for.

    References:
    ABC-021-19

    ________________________________

    Wednesday, February 26, 2014 10:45 PM

All replies

  • Hi Sure-man,

    Based on my research, to parse a .pdf file in PowerShell you need to load a third-party library such as iTextSharp, which provides the PdfReader class. It is described here:

    Using Powershell to Parse a PDF file

    The script below may be helpful. I have not tested it, but it should extract the words between "exist between the" and "and the".

    Add-Type -Path .\itextsharp.dll
    $reader = New-Object iTextSharp.text.pdf.pdfreader -ArgumentList "$pwd\test.pdf"
    for ($page = 1; $page -le $reader.NumberOfPages; $page++) {
        $lines = [char[]]$reader.GetPageContent($page) -join "" -split "`n"
        foreach ($line in $lines) {
            if ($line -match 'exist between the(.+?)and the') {
                $matches[1]
            }
        }
    }

    I hope this helps.
    Thursday, February 27, 2014 10:49 AM
    Moderator
  • Perhaps this is beyond my skill level.  I downloaded the source files from the "Using Powershell to Parse a PDF file" link, extracted them, and added them to the C:\Scripts directory.  I added my PDF to the same directory, and even named it test.pdf to keep the script as close to the original as possible.  I opened PowerShell and PowerShell ISE as administrator, created and ran the script you provided, and got the error below.

    _______________________________________________________
    Directory: C:\Scripts\PdfToText
    Mode                LastWriteTime     Length Name
    ----                -------------     ------ ----
    da---         2/27/2014  10:00 AM            Properties
    -a---         2/27/2014  10:18 AM    3567616 itextsharp.dll
    -a---         2/27/2014  10:51 AM        356 My_PDF_parser.ps1
    -a---         2/27/2014   9:45 AM        466 My_PDF_parser_v1.ps1
    -a---         2/27/2014  10:18 AM      10420 PDFParser.cs
    -a---         2/27/2014   9:59 AM       2286 PdfToText.csproj
    -a---         2/27/2014   9:59 AM        230 PdfToText.csproj.user
    -a---         2/27/2014  10:18 AM       2649 PowerShell.PDF.csproj
    -a---         2/27/2014   9:59 AM       1383 Program.cs
    -a---         2/26/2014   4:43 PM    2245919 Test.pdf
    PS C:\Scripts\PdfToText> .\My_PDF_parser.ps1
    _______________________________________________________

    Add-Type : Could not load file or assembly 'file:///C:\Scripts\PdfToText\itextsharp.dll' or one of its
    dependencies. Operation is not supported. (Exception from HRESULT: 0x80131515)
    At C:\Scripts\PdfToText\My_PDF_parser.ps1:1 char:1
    + Add-Type -Path .\itextsharp.dll
    + ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
        + CategoryInfo          : NotSpecified: (:) [Add-Type], FileLoadException
        + FullyQualifiedErrorId : System.IO.FileLoadException,Microsoft.PowerShell.Commands.AddTypeCommand
     New-Object : Cannot find type [iTextSharp.text.pdf.pdfreader]: make sure the assembly containing this type is loaded.
    At C:\Scripts\PdfToText\My_PDF_parser.ps1:2 char:11
    + $reader = New-Object iTextSharp.text.pdf.pdfreader -ArgumentList "$pwd\test.pdf"
    +           ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
        + CategoryInfo          : InvalidType: (:) [New-Object], PSArgumentException
        + FullyQualifiedErrorId : TypeNotFound,Microsoft.PowerShell.Commands.NewObjectCommand

    _______________________________________________________

    At this time, I would have results faster by doing the searches manually, but would still like to know how to do this as I suspect this will be a recurring request.

    Thursday, February 27, 2014 4:22 PM
  • iTextSharp.dll is a third-party assembly that you'd need to download from http://sourceforge.net/projects/itextsharp/.  In the other thread, the PowerShell code assumes that it's located in your current working directory.

    Thursday, February 27, 2014 4:32 PM
  • Thank you for the response David.  I had already downloaded iTextSharp.dll and placed it in the same directory as the script and pdf (illustrated in the "directory" section of my previous post).
    • Edited by Sure-man Thursday, February 27, 2014 4:41 PM
    Thursday, February 27, 2014 4:40 PM
  • Ah, I'd forgotten about this error.  Run this command and you should be all set (assumes you're running PowerShell 3.0 or later, for the Unblock-File cmdlet):

    Unblock-File -Path C:\Scripts\PdfToText\iTextSharp.dll

    That "Operation is not supported" error pops up when you try to load an assembly that's still flagged as having been downloaded.
    Thursday, February 27, 2014 5:55 PM
  • David, is there any down side to unblocking on every use? Or perhaps a way to query if it needs unblocking? My customers are constantly bombarded with downloaded patch files and such, and just adding this to automatically unblock everything behind the scenes would be sweet. Assuming it works the same for EXEs, MSIs, MSPs, etc.

    EDIT: Dang, just caught the PS3 ref. I am forced to support V2 only. Sad trombone.

    Gordon


    Thursday, February 27, 2014 6:09 PM
  • The information about whether a file is blocked or not is based on its Zone.Identifier alternate data stream.  PowerShell 2.0 didn't have any built-in ways of dealing with these streams, but there were several workarounds (calling cmd.exe from PowerShell, which can be used to empty out the Zone.Identifier stream, or using the streams.exe command-line utility, etc).  There are examples of these techniques on the web:

    http://stackoverflow.com/questions/1617509/unblock-a-file-with-powershell

    http://thewayeye.net/2012/march/2/bulk-removing-zoneidentifier-alternate-data-streams-downloaded-windows-files
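
    For PowerShell 3.0 or later (so not the V2 case above), a rough sketch of the "query if it needs unblocking" idea is to look for the Zone.Identifier stream directly. The function name Test-FileBlocked is made up for illustration, and the sketch assumes the -Stream parameter of Get-Item, which only exists on Windows with PowerShell 3.0+:

    ```powershell
    function Test-FileBlocked   # hypothetical helper name, not a built-in cmdlet
    {
        param (
            [Parameter(Mandatory = $true)]
            [string]
            $Path
        )

        # Alternate data streams are an NTFS feature, so report "not blocked" elsewhere.
        if ($env:OS -ne 'Windows_NT') { return $false }

        # A downloaded file carries a Zone.Identifier stream; no stream means nothing to unblock.
        $stream = Get-Item -Path $Path -Stream 'Zone.Identifier' -ErrorAction SilentlyContinue
        return [bool]$stream
    }

    # Usage with the path from this thread: only unblock when the marker is present.
    $dll = 'C:\Scripts\PdfToText\iTextSharp.dll'
    if ((Test-Path $dll) -and (Test-FileBlocked -Path $dll))
    {
        Unblock-File -Path $dll
    }
    ```

    Calling Unblock-File on a file that is not blocked should be harmless, so the check is mostly useful for reporting rather than as a required guard.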

    Thursday, February 27, 2014 6:30 PM
  • I added:

           Unblock-File -Path C:\Scripts\PdfToText\iTextSharp.dll

    And the script ran.  So, I would call that a success!

    Now, how do I get the findings when I have the known values in a text file, and write the output to a new text file?

    Example:

    ABC-12-012
    ABC-012-86
    ABC-0512-10

    I want the results in a separate text file (or .csv if possible) so I can then create a spreadsheet.  In the example text (between the lines), GREEN is what I want, RED is what I have:

    Example text within PDF (note the text in the pdf does not have the lines)
    _________________________________________

    9.8.1Appendix A

    Description:
    This is where a description would be.  there could be another header as well.

    Additional Information:
    One or more additional sections may exist between the 9.8.1 Appendix A (which is the text I need) and the ABC-0012-083 which is what I know to search for.

    References:
    ABC-0012-083

    9.8.2 Addendum 9

    Description:
    This is where a description would be.  there could be another header as well.

    Additional Information:
    One or more additional sections may exist between the 9.8.1 Appendix A (which is the text I need) and the ABC-0012-083 which is what I know to search for.

    References:
    ABC-021-19
    _________________________________________

    Then, I would like to see it (where \t is a tab) in a new text (or .csv) like this:

    ABC-0012-083   \t    9.8.1
    ABC-021-19       \t    9.8.2

    Could I make it any more complicated, probably...  Thanks Anna and David for your insight!

    Thursday, February 27, 2014 9:12 PM
  • Hi Sure-man,

    This script may not be the best one, but for your reference, it filters the content based on the PDF lines you posted:

    $output = @()

    for ($page = 1; $page -le $reader.NumberOfPages; $page++) {
        $lines = [char[]]$reader.GetPageContent($page) -join "" -split "`n"
        foreach ($line in $lines) {
            if ($line.length -gt 5) {
                if ($line.substring(0,5) -match "[0-9].[0-9].[0-9]") {
                    $number = $line.Substring(0,5)
                }

                if ($line.substring(0,5) -match "ABC-[0-9]") {
                    $ABC = $line
                    $Object = New-Object PSObject
                    $Object | add-member Noteproperty References $ABC
                    $Object | add-member Noteproperty number     $number
                    $output += $object
                }
            }
        }
    }

    Best Regards,

    Anna

    Friday, February 28, 2014 3:06 AM
    Moderator
  • Thanks again Anna.  So I reran the script and it completes without error.  It's not clear to me if it's producing any output (assuming there is any) because after running the script, it just gives me a PS command prompt. At this point it looks like this:

    _________________________________________

    Unblock-File -PathC:\Scripts\PdfToText\iTextSharp.dll
    Add-Type -Path .\itextsharp.dll
    $reader = New-Object iTextSharp.text.pdf.pdfreader -ArgumentList "$pwd\test.pdf"
    $output = @()

    for ($page = 1; $page -le $reader.NumberOfPages; $page++) {
        $lines = [char[]]$reader.GetPageContent($page) -join "" -split "`n"
        foreach ($line in $lines) {
            if ($line.length -gt 5) {
                if ($line.substring(0,5) -match "[0-9].[0-9].[0-9]") {
                    $number = $line.Substring(0,5)
                }

                if ($line.substring(0,5) -match "ABC-[0-9]") {
                    $ABC = $line
                    $Object = New-Object PSObject
                    $Object | add-member Noteproperty References $ABC
                    $Object | add-member Noteproperty number     $number
                    $output += $object
                }
            }
        }
    }

    _________________________________________

    I should also clarify points from my original post and subsequent example(s):

    1. It looks like the line.substring range is (0,5) because in my example there seem to be 5 lines between what I have and what I need, but that is not always the case.  In fact there could be dozens of lines and section breaks between them.
    2. It looks like the -match "[0-9].[0-9].[0-9]" is looking for any 3 single digits separated by 2 decimals, so I should point out that the actual numbers could have as many as 9 decimals/periods and each number in between could be up to 4 numeric characters.  I assume there's a regular expression that could take this variation into account, but my regex writing techniques are even more limited than my PowerShell experience; limited to new line and carriage return (\r\n) (thank you, Notepad++).  I can somewhat translate a regex, but writing one is a work in progress.
    Friday, February 28, 2014 3:03 PM
  • The SubString refers to the first 5 characters of each line, not the number of lines between the "number" and "reference".  Also, Anna's code was building an array called $output, but never actually displayed it on screen.
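
    To illustrate the point about Substring, here is a small sketch using the sample headings from the original post (nothing here touches the PDF, and the variable names are only for illustration). Substring(0,5) keeps the first five characters of a line, so a longer section number is cut short, and the unescaped "." in "[0-9].[0-9].[0-9]" matches any character, not just a period:

    ```powershell
    # Sample heading lines taken from the original post.
    $shortHeading = '9.8.1 Appendix A'
    $longHeading  = '9.2.8.63.4 Appendix C'

    # Substring(0, 5) returns the first five characters of the string.
    $shortPrefix = $shortHeading.Substring(0, 5)   # '9.8.1'
    $longPrefix  = $longHeading.Substring(0, 5)    # '9.2.8' (the rest of the number is lost)

    # An unescaped dot matches any character, so five plain digits also match.
    $looseMatch  = '12345' -match '[0-9].[0-9].[0-9]'     # True
    # Escaping the dot requires a literal period between the digits.
    $strictMatch = '12345' -match '[0-9]\.[0-9]\.[0-9]'   # False

    $shortPrefix, $longPrefix, $looseMatch, $strictMatch
    ```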

    I've tweaked Anna's code a bit to account for the comments you made.  Give this a try:

    Unblock-File -PathC:\Scripts\PdfToText\iTextSharp.dll
    
    Add-Type -Path .\itextsharp.dll
    
    $reader = New-Object iTextSharp.text.pdf.pdfreader -ArgumentList "$pwd\test.pdf"
    
    $number = ''
    
    for ($page = 1; $page -le $reader.NumberOfPages; $page++)
    { 
        $lines = [char[]]$reader.GetPageContent($page) -join "" -split "\r?\n"
     
        foreach($line in $lines)
        {
            switch -Regex ($line)
            {
                '^\s*(\d+(?:\.\d+)*)'
                {
                    $number = $matches[1]
                    break
                }
    
                '^\s*ABC-\d'
                {
                    New-Object psobject -Property @{
                        References = $line.Trim()
                        Number = $number
                    }
                }
            }
        }
    }
    

    Friday, February 28, 2014 3:24 PM
  • First, thank you David for explaining the substring and the array.
    The code ran, with the addition of a space between Path and C (for the benefit of anyone else who may try this), but again, without displaying an output.
    Friday, February 28, 2014 3:37 PM
  • Hmm... if no output was displayed in my version, then the lines of text aren't being represented the way we expected.  Here's how I tested it, using a copy and paste from your original post:

    $lines = @'
    9.8.1Appendix A
     
    Description:
     This is where a description would be.  there could be another header as well.
     
    Additional Information:
     One or more additional sections may exist between the 9.8.1 Appendix A (which is the text I need) and the ABC-0012-083 which is what I know to search for.
     
    References:
     ABC-0012-083
     
    9.8.2 Addendum 9
     
    Description:
     This is where a description would be.  there could be another header as well.
     
    Additional Information:
     One or more additional sections may exist between the 9.8.1 Appendix A (which is the text I need) and the ABC-0012-083 which is what I know to search for.
     
    References:
     ABC-021-19
    
    '@ -split '\r?\n'
    
    foreach($line in $lines)
    {
        switch -Regex ($line)
        {
            '^\s*(\d+(?:\.\d+)*)'
            {
                $number = $matches[1]
                break
            }
    
            '^\s*ABC-\d'
            {
                New-Object psobject -Property @{
                    References = $line.Trim()
                    Number = $number
                }
            }
        }
    }

    As you can see, I removed the PDF-related code and just tested the string manipulation portions of it successfully.

    At this point, I'd suggest dumping the text being returned by the iTextSharp library to a file so you can examine it and figure out what's going on.  Try this code to create that file:

    Unblock-File -Path C:\Scripts\PdfToText\iTextSharp.dll
    
    Add-Type -Path .\itextsharp.dll
    
    $reader = New-Object iTextSharp.text.pdf.pdfreader -ArgumentList "$pwd\test.pdf"
    
    & {
        for ($page = 1; $page -le $reader.NumberOfPages; $page++)
        { 
            [char[]]$reader.GetPageContent($page) -join "" -split "\r?\n"
        }
    } |
    Set-Content -Path .\testPdfText.txt

    On a side note, you only need to call Unblock-File once.  It doesn't necessarily have to remain in the script, unless you just want to double check each time it runs.
    Friday, February 28, 2014 6:18 PM
  • Running the second script you provided (to create the testPdfText.txt) produced a text file, but there are random parentheses around the text, which makes searching it impossible.  Example:

     EMC  /P <</MCID 28>> BDC BT
    /F1 12 Tf
    1 0 0 1 72.024 456.07 Tm
    [(Dig)4(it)-3(ally)4( signin)-3(g)4( S)-3(M)-4(B co)3(mm)4(unica)-2(ti)-3(on w)3(ill r)4(ed)3(uc)-6(e t)-3(he p)-3(r)4(oba)-3(b)-2(ili)-2(ty o)3(f )3(a su)10(ccess man i)-3(n t)-3(he )] TJ
    ET
    BT
    1 0 0 1 72.024 442.03 Tm
    [(mid)5(d)5(le at)-4(ta)-3(ck)5( b)-2(et)-3(w)4(ee)-3(n t)-3(he S)-3(M)-4(B cl)3(ie)-3(nt)-3( )10(an)-3(d)5( serv)6(er.)-2( )] TJ
    ET
    BT
    1 0 0 1 327.19 442.03 Tm
    [( )] TJ
    ET


    So, to skirt that, I simply opened the PDF and used Ctrl+A, Ctrl+C and Ctrl+V to paste it into a new Unicode .txt file...  That cleaned everything up and now I am ready to treat the content like any other file.  What should my script look like now?
    • Edited by Sure-man Friday, February 28, 2014 8:24 PM Adding more content.
    Friday, February 28, 2014 8:16 PM
  • I see.  Looks like you're getting the raw PDF markup language, instead of readable text, with this example code.

    I've worked with iTextSharp before, for a very similar requirement (extracting email addresses from PDF files), but I don't have the code handy right now.  I'll reply to this thread again later this evening after I get home and can look up how I did it.

    What I do remember is that iTextSharp's documentation is practically non-existent (though you can find a bit more information about the original iText library, which was for Java instead of .NET.)  It was a bit of a headache to figure out how to get it working at first.
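
    For anyone curious about what that dump contains: in the raw page content, each "(...)" inside a "[ ... ] TJ" array is a fragment of text, and the numbers between the fragments are spacing adjustments. The sketch below joins the fragments from one of the lines posted above. It is only a naive illustration (it ignores escaped parentheses, font encodings and text positioning) and is not how iTextSharp extracts text:

    ```powershell
    # One line of raw page content, copied from the dump posted above.
    $rawLine = '[(mid)5(d)5(le at)-4(ta)-3(ck)5( b)-2(et)-3(w)4(ee)-3(n t)-3(he S)-3(M)-4(B cl)3(ie)-3(nt)-3( )10(an)-3(d)5( serv)6(er.)-2( )] TJ'

    # Capture the text inside each pair of parentheses and discard the numbers in between.
    $fragments = [regex]::Matches($rawLine, '\(([^)]*)\)') |
        ForEach-Object { $_.Groups[1].Value }

    # Joining the fragments gives back the readable sentence.
    $text = $fragments -join ''
    $text
    ```

    This is why searching the dump for a reference number fails: the characters are there, but split into pieces.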

    Friday, February 28, 2014 8:27 PM
  • I took the entire contents of the Unicode .txt file I created (by copying and pasting the PDF into a .txt file) and pasted it into your sample code (between the $lines = @' and '@ -split '\r?\n').  It was 4000 lines, but it worked!  The only issue I see, after a quick perusal of the output, is that it's taking the page number in place of the dotted number I am interested in.  Suggestions?
    Friday, February 28, 2014 8:31 PM
  • Depends.  Can the numbers you're looking for contain digits with no . characters, such as "1407"?  If so, it's going to be very difficult to distinguish page numbers from section numbers.

    On the other hand, if you're always looking for "Something.Something[.Something[.Something]]", it's easy enough to tweak the regex pattern so that at least one period is required.  It would match "1.4", but not "1".

    To do that, change the regex pattern as follows:

    '^\s*(\d+(?:\.\d+)*)'
    
    # becomes:
    
    '^\s*(\d+(?:\.\d+)+)'
    

    All I did was change the * character to a + instead.
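
    A quick way to see the difference is to run both patterns against a few lines. The sample lines below are made up for illustration (a bare page number, two headings and a reference), and [pscustomobject] needs PowerShell 3.0 or later:

    ```powershell
    # Hypothetical sample lines: a page number, two section headings, and a reference.
    $sampleLines = '1407', '9.8.1 Appendix A', '  9.2.8.63.4 Appendix C', 'ABC-0012-083'

    $original = '^\s*(\d+(?:\.\d+)*)'   # zero or more ".digits" groups, so "1407" matches
    $tweaked  = '^\s*(\d+(?:\.\d+)+)'   # one or more ".digits" groups, so a period is required

    $results = foreach ($line in $sampleLines)
    {
        # Groups[1].Value is an empty string when the pattern does not match.
        [pscustomobject]@{
            Line     = $line
            Original = [regex]::Match($line, $original).Groups[1].Value
            Tweaked  = [regex]::Match($line, $tweaked).Groups[1].Value
        }
    }

    $results | Format-Table -AutoSize
    ```

    With these inputs only the first line behaves differently: the original pattern captures "1407", while the tweaked one captures nothing.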
    Friday, February 28, 2014 9:11 PM
  • PERFECT!  '^\s*(\d+(?:\.\d+)+)' worked.

    Now, how do I get rid of the 4000 lines (between the $lines = @' and '@ -split '\r?\n'), and call the file containing the PDF (now just text) data?

    Also, how do I output the entire finding to a text file (or CSV) instead of the screen?  Ideally without all the spaces that I see when I run the script now.

    Friday, February 28, 2014 9:28 PM
  • Give this a try. I've moved the existing code into a function to make it easier to manipulate the resulting objects however you'd like (in this case, by formatting them into a table without all the extra spaces, and sending the output to a file, as you requested.) I also modified it to use the PdfTextExtractor class, which is what worked for me in the previous project I mentioned.

    function Get-ReferencesFromPdf
    {
        [CmdletBinding()]
        param (
            [Parameter(Mandatory = $true)]
            [string]
            $Path
        )
    
        $Path = $PSCmdlet.GetUnresolvedProviderPathFromPSPath($Path)
    
        try
        {
            $reader = New-Object iTextSharp.text.pdf.pdfreader -ArgumentList $Path
        }
        catch
        {
            throw
        }
    
        $number = ''
    
        for ($page = 1; $page -le $reader.NumberOfPages; $page++)
        { 
            $lines = [iTextSharp.text.pdf.parser.PdfTextExtractor]::GetTextFromPage($reader, $page) -split "\r?\n"
     
            foreach ($line in $lines)
            {
                switch -Regex ($line)
                {
                    '^\s*(\d+(?:\.\d+)+)'
                    {
                        $number = $matches[1]
                        break
                    }
    
                    '^\s*ABC-\d'
                    {
                        New-Object psobject -Property @{
                            References = $line.Trim()
                            Number = $number
                        }
    
                        break
                    }
                }
            }
        }
    
        $reader.Close()
    }
    
    Unblock-File -Path C:\Scripts\PdfToText\iTextSharp.dll
    Add-Type -Path C:\Scripts\PdfToText\itextsharp.dll
    
    Get-ReferencesFromPdf -Path '.\test.pdf' |
    Format-Table -AutoSize |
    Out-File -FilePath '.\output.txt'
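
    If a tab-delimited file limited to the known reference numbers is preferred over a formatted table, something along these lines might work. This is only a sketch: the $found objects stand in for the output of Get-ReferencesFromPdf, and the $known list is a placeholder for values that could be read with Get-Content from a text file holding one reference per line:

    ```powershell
    # Hypothetical stand-ins for the objects that Get-ReferencesFromPdf emits.
    $found = @(
        New-Object psobject -Property @{ References = 'ABC-0012-083'; Number = '9.8.1' }
        New-Object psobject -Property @{ References = 'ABC-021-19';   Number = '9.8.2' }
        New-Object psobject -Property @{ References = 'ABC-999-99';   Number = '9.9' }
    )

    # Hypothetical list of known reference numbers.
    $known = 'ABC-0012-083', 'ABC-021-19', 'ABC-0512-10'

    $csvLines = $found |
        Where-Object { $known -contains $_.References } |   # keep only the known references
        Select-Object References, Number |                  # fix the column order
        ConvertTo-Csv -Delimiter "`t" -NoTypeInformation    # tab-separated text, one string per row

    $csvLines
    ```

    Swapping ConvertTo-Csv for Export-Csv with a -Path argument (and the same -Delimiter and -NoTypeInformation switches) writes the rows to a file that opens directly in a spreadsheet. The Select-Object step is there because properties created from a hashtable do not have a guaranteed order.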

    • Marked as answer by Sure-man Monday, March 3, 2014 10:33 PM
    Friday, February 28, 2014 10:08 PM
  • Running that produced the output.txt file (which is empty) perhaps because of the following error (about 50 times):

    Unable to find type [iTextSharp.text.pdf.parser.PdfTextExtractor]: make sure that the assembly containing this type is loaded.
    At line:25 char:9
    +         $lines = [iTextSharp.text.pdf.parser.PdfTextExtractor]::GetTextFromPage( ...
    + ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
        + CategoryInfo          : InvalidOperation: (iTextSharp.text...dfTextExtractor:TypeName) [], RuntimeException
        + FullyQualifiedErrorId : TypeNotFound

    Thanks for your help David. 

    Friday, February 28, 2014 10:28 PM
  • That's odd. The same code works fine for me.  I ran this as a test, and got the expected output:

    Add-Type -Path '.\itextsharp.dll'
    
    $reader = New-Object iTextSharp.text.pdf.PdfReader("$pwd\SomeTestFile.pdf")
    
    for ($page = 1; $page -le $reader.NumberOfPages; $page++)
    {
        $lines = [iTextSharp.text.pdf.parser.PdfTextExtractor]::GetTextFromPage($reader, $page) -split '\r?\n'
    
        for ($i = 0; $i -lt $lines.Count; $i++)
        {
            'Page {0}, Line {1}: {2}' -f $page, ($i + 1), $lines[$i]
        }
    }
    
    $reader.Close()

    I seem to remember reading that the PdfTextExtractor class wasn't always part of the iTextSharp libraries.  Is it possible that you're using an old version of iTextSharp.dll?  I'm using version 5.4.4.0, which may not be the most recent anymore (but was current as of late last year.)

    Saturday, March 1, 2014 1:04 AM
  • I tried it again and got the same error.  I have iTextSharp version 5.5.0. 

    After re-running it, I found PdfTextExtractor.cs and copied it (from C:\Scripts\PdfToText\iTextSharp\text\pdf\parser) to the same directory that the ps script, test.pdf, and itextsharp.dll are in (C:\Scripts\PdfToText\iTextSharp).  Reran the most recent script and got the same errors.

    Saturday, March 1, 2014 3:02 AM
  • I don't know what to tell you. It works fine for me, and I verified this with version 5.5.0 as well.  I used the iTextSharp.dll file from the itextsharp-dll-core.zip file contained in itextsharp-all-5.5.0.zip, downloaded from http://sourceforge.net/projects/itextsharp/files/latest/download
    Saturday, March 1, 2014 4:31 AM
  • Hi Sure-man,

    I’m just writing to check in and see if the suggestions were helpful. If you need further help, please feel free to reply to this post directly so we will be notified and can follow up.

    If you have any feedback on our support, please click here.

    Best Regards,

    Anna

    TechNet Community Support

    Monday, March 3, 2014 2:03 AM
    Moderator
  • I downloaded a new instance of iTextSharp (following your link); moved the iTextSharp.dll to my scripts directory and reran the code from Saturday, March 01, 2014 1:04 AM.  Got a new error this time.  Does this have anything to do with the OS as Win7?  I found some references to this exception 0x80131515 about files being blocked in Win7, but it doesn't appear to be blocked (based on properties).

    Add-Type : Could not load file or assembly 'file:///C:\Scripts\PdfToText\itextsharp.dll' or one of its
    dependencies. Operation is not supported. (Exception from HRESULT: 0x80131515)
    At line:1 char:1
    + Add-Type -Path '.\itextsharp.dll'
    + ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
        + CategoryInfo          : NotSpecified: (:) [Add-Type], FileLoadException
        + FullyQualifiedErrorId : System.IO.FileLoadException,Microsoft.PowerShell.Commands.AddTypeCommand

    New-Object : Cannot find type [iTextSharp.text.pdf.PdfReader]: make sure the assembly containing this type is
    loaded.
    At line:3 char:11
    + $reader = New-Object iTextSharp.text.pdf.PdfReader("$pwd\Test.pdf")
    +           ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
        + CategoryInfo          : InvalidType: (:) [New-Object], PSArgumentException
        + FullyQualifiedErrorId : TypeNotFound,Microsoft.PowerShell.Commands.NewObjectCommand

    You cannot call a method on a null-valued expression.
    At line:15 char:1
    + $reader.Close()
    + ~~~~~~~~~~~~~~~
        + CategoryInfo          : InvalidOperation: (:) [], RuntimeException
        + FullyQualifiedErrorId : InvokeMethodOnNull

    Monday, March 3, 2014 10:13 PM
  • Even after getting the error I mentioned about 20 minutes ago, I went ahead and retried the Get-ReferencesFromPdf script from Friday, February 28, 2014 10:08 PM and it worked without error.  I am satisfied with that.  I've marked this as the answer since I got what I requested.

    Thanks David Wyatt and AnnaWY for your perseverance!  I provided feedback (positive) to the link AnnaWY included yesterday.  You guys are rockstars!

    Monday, March 3, 2014 10:35 PM
  • Hi Sure-man,

    I'm glad we could be helpful, and also thanks for your positive feedback =)

    Tuesday, March 4, 2014 6:07 AM
    Moderator