Using PowerShell to Parse a PDF File

    Question

  • Hi All,

    I'm new to the scripting world and am loving how PowerShell is making my life easier.

    One task that I am still doing manually, however, is comparing numbers we get from a supplier in PDFs to the ones I have in our database. Ideally, I would be able to extract the information using regular expressions in PowerShell.

     

    I have found this blog post here: http://www.beefycode.com/post/ConvertFrom-PDF-Cmdlet.aspx

    It looks like an ideal solution for me, but I'm not an IT major, so C# is a little beyond me. I tried downloading Visual C# 2010 Express to compile the cmdlet, but I couldn't get the cmdlet working in PowerShell.

    Does anybody know of an alternate solution, or could someone explain how to get the above Cmdlet working?

     

    Wednesday, September 29, 2010 12:20 AM

Answers

  • This was an awesome little task!  I was able to get it working based on this project.

    You need the iTextSharp library from here

    Then you can run the following code to extract the text. Mind you, I just pieced it together and only tested it with two PDFs, but it seems to work fine. It will either be good enough for you, or you should be able to figure out the rest from it:

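    # Load the iTextSharp assembly and open the PDF (both paths are relative to the current directory)
    Add-Type -Path .\itextsharp.dll
    $reader = New-Object iTextSharp.text.pdf.PdfReader -ArgumentList "$pwd\test.pdf"
    
    for ($page = 1; $page -le $reader.NumberOfPages; $page++) {
     # GetPageContent returns the raw bytes of the page's content stream; turn them into lines
     $lines = [char[]]$reader.GetPageContent($page) -join "" -split "`n"
     foreach ($line in $lines) {
      # Keep only lines that start with "[", which is how the text arrays appear in the stream
      if ($line -match "^\[") {
       # Unescape backslash-escaped characters; '$1' is the character captured by this regex
       # ($matches[1] would be empty here, because the -match above has no capture group)
       $line = $line -replace "\\([\S])", '$1'
       # Strip the "[(" ... ")]TJ" wrapper and the kerning numbers between text fragments
       $line -replace "^\[\(|\)\]TJ$", "" -split "\)\-?\d+\.?\d*\(" -join ""
      }
     }
    }
    $reader.Close()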
    I'm pretty sure there will be PDFs this has problems with, but modifying the regexes should fix it up.  This definitely deserves a deeper look.  If the output is formatted incorrectly for your PDFs and you can't figure out the regexes, just send a link to the PDF, if possible.
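    To make it easier to adjust the regexes, here is a minimal sketch of what each one does, run against a single made-up line rather than a real PDF. The sample line is an assumption about what a text array in a content stream can look like; real files may differ.

```powershell
# Hypothetical raw content-stream line, similar to what $line holds in the loop.
$raw = '[(Invoice )-250(Total:)-333.5( 1,234.56 \(USD\))]TJ'

# Drop the leading "[(" and the trailing ")]TJ".
$text = $raw -replace '^\[\(|\)\]TJ$', ''

# Remove the ")<kerning number>(" separators that sit between the text fragments.
$text = $text -split '\)-?\d+\.?\d*\(' -join ''

# Unescape backslash-escaped characters such as \( and \).
$text = $text -replace '\\(\S)', '$1'

$text
```

    This sketch unescapes last rather than first, so that an escaped parenthesis inside the text is less likely to be mistaken for a fragment boundary. That ordering is a design choice of the sketch, not something the script above does.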
    write-host ((0..56)|%{if (($_+1)%3 -eq 0){[char][int]("116111101110117102102064103109097105108046099111109"[($_-2)..$_] -join "")}}) -separator ""
    • Edited by Tome Tanasovski Thursday, September 30, 2010 2:00 AM
    • Marked as answer by IamMred Tuesday, October 5, 2010 2:37 AM
    Wednesday, September 29, 2010 7:03 PM
  • btw, to see the PDF in its raw format, just inspect $line in the loop.  You'll see all the garbage in there that I'm stripping out.
    write-host ((0..56)|%{if (($_+1)%3 -eq 0){[char][int]("116111101110117102102064103109097105108046099111109"[($_-2)..$_] -join "")}}) -separator ""
    • Marked as answer by IamMred Tuesday, October 5, 2010 2:37 AM
    Wednesday, September 29, 2010 7:04 PM

All replies

  • Thanks Tome! That is working like a charm. Unfortunately, the regexes you provided miss the data I am trying to capture, but I was able to find it in the raw data, so I should be able to put together some regexes to pull it out. I can't post the actual PDFs I'm working on, as the boss would probably not approve, but you have got me 90% of the way there. Much appreciated!
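
    As a rough illustration of pulling values out of the extracted text with custom regexes, here is a minimal sketch. The line layout, the labels (Part, Qty, Price) and the values are all made up for the example; the real pattern depends entirely on what the raw data looks like.

```powershell
# Hypothetical lines of already-extracted text; labels and values are invented.
$lines = @(
    'Part A-100  Qty 12  Price 4.50'
    'Part B-200  Qty 3   Price 19.99'
)

# Named groups make the captured values easy to pick out of $matches.
$pattern = '^Part (?<part>\S+)\s+Qty (?<qty>\d+)\s+Price (?<price>\d+\.\d+)$'

$records = foreach ($line in $lines) {
    if ($line -match $pattern) {
        # [pscustomobject] needs PowerShell 3.0 or later.
        [pscustomobject]@{
            Part  = $matches['part']
            Qty   = [int]$matches['qty']
            Price = [decimal]$matches['price']
        }
    }
}

$records
```

    Converting the captures to [int] and [decimal] means they can be compared numerically with the figures from the database instead of as strings.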

     

     

    Tuesday, October 5, 2010 2:55 AM
  • Hi Tome,

    I'm having trouble duplicating your success. 

    When I run the script, there is an output of "[] 0 d" several times, and "[2 3] 11d".

    Do you know how I can extract my specific data from a PDF file? 
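
    One possible explanation, offered as an assumption rather than something checked against this file: in a PDF content stream, d is the operator that sets the line dash pattern, and its first operand is an array. Lines such as "[] 0 d" therefore begin with "[" and pass the "^\[" test in the script even though they contain no text. A minimal sketch of a stricter filter, which only accepts lines that look like a text array followed by TJ, using made-up sample lines:

```powershell
# Hypothetical mix of content-stream lines: two dash-pattern operators and one text operator.
$lines = @(
    '[] 0 d'
    '[2 3] 11 d'
    '[(Total )-250(42.00)]TJ'
)

$text = foreach ($line in $lines) {
    # Only treat the line as text when it looks like "[( ... )]TJ".
    if ($line -match '^\[\(.*\)\]\s*TJ$') {
        $line -replace '^\[\(|\)\]\s*TJ$', '' -split '\)-?\d+\.?\d*\(' -join ''
    }
}

$text
```

    If this filter returns nothing for a given PDF, the text is probably stored in a form these regexes do not cover, and inspecting the raw lines is the next step.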

    Tuesday, December 20, 2016 8:17 PM