PS to capture data from file to use in rename process

  • Question

  • I'm attempting to take a group of specifically named PDF files, capture specific pieces of data within the file (using a control file of the phrase items I need), and use these data pieces to create a new file name. I'm having some issues with properly catching the data elements within the PDF. Below is the code I've gleaned from the forums to get me as far as I am. If you can assist, I would appreciate it.

    To identify the files which match the control search criteria (this works well):

    Set-StrictMode -Version latest
    Set-ExecutionPolicy unrestricted -Scope process
    
    $tdydate    = get-date -format d
    $path     = Split-Path -parent $MyInvocation.MyCommand.Definition
    #$path     = "\\phpds\phpapda\mvh him roi"
    $files    = Get-Childitem $path DB_16877_P_*.PDF -Recurse | Where-Object { !($_.psiscontainer)}
    $controls = Get-Content ($path + "\control_file.DB_16877_P")
    $output   = $path + "\output_DB_16877_P.log"
    
    
    Function getStringMatch
    {
      # Loop through all DB_16877_P_*.PDF files in the $path directory
      Foreach ($file In $files)
      {
        # Loop through the search strings in the control file
        ForEach ($control In $controls)
        {
          $result = Get-Content $file.FullName | Select-String $control -quiet -casesensitive
          If ($result -eq $True)
          {
            $match = $file.FullName
            $filedt = $file.CreationTime
            "Match on string :  $control  in file :  $match   date : $filedt " | Out-File $output -Append
          }
        }
      }
    }
    
    getStringMatch

    I used a separate PS (for testing only) for the second step to determine the data elements and output it for review:

    Set-StrictMode -Version latest
    Set-ExecutionPolicy unrestricted -Scope process
    #get-executionpolicy -list
    
    $tdydate    = get-date -format d
    $path     = Split-Path -parent $MyInvocation.MyCommand.Definition
    #$path     = "\\phpds\phpapda\mvh him roi"
    $files    = Get-Childitem $path DB_16877_P_*.PDF -Recurse | Where-Object { !($_.psiscontainer)}
    $controls = Get-Content ($path + "\control_file.DB_16877_P")
    $output   = $path + "\output_DB_16877_P.csv"
    
    
    # Create an array for results
      $results = @()
    
      # Loop through the project directory
      Foreach ($file In $files) 
      { 
        # load the content once
        $content = Get-Content $file.FullName 
    
        # Check all keywords
        ForEach ($control In $controls) 
        { 
          # find the line containing the control string
          $result = $content | Select-String $control -casesensitive 
          If ($result) 
          { 
            # tidy up the results and add to the array
            $line = $result.Line -split ":"
            $results += New-Object PSObject -Property @{
                FileName = $file.FullName 
                Control = $line[0].Trim()
                Value = $line[1].Trim()
            }
          } 
        } 
      } 
    
      
      # return the results
      $results
    
      #Output Results array to CSV
      $results | Export-Csv $output -NoTypeInformation

    Results are  (notice the garbage data in the Value column, and please ignore the poor column formatting):

    Control                               FileName              Value
    -------                               --------              -----
    BT /F1 220 Tf 0 g 1800 -7218 Td(MRN   X:\DB_16877_P_40.PDF  <692>) Tj ET Q
    BT /F1 220 Tf 0 g 1800 -6712 Td(MRN   X:\DB_16877_P_41.PDF  <281>) Tj ET Q


    All I really want for this value column is the actual MRN (the example shows a number surrounded by < >, but a correction to the file creation is going to take care of that issue).  I need to get this value put into storage, then use it to rename the file to the new naming structure.

    Wednesday, July 2, 2014 2:53 PM

Answers

  • You can use the -replace operator and a regex to do this.

    '<692>) Tj ET' -replace '\s*<(\d+)>.+','$1' 
    692


    In your code, you should be able to use that when you set $value:

    $Value = $line[1]  -replace '\s*<(\d+)>.+','$1'

    The $1 is a "backreference", which designates the first capture group (inside the parens) in the regex.
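
    Applied to the loop in the question, that might look roughly like the sketch below. The function name ConvertTo-MrnRecord and the sample line are illustrative assumptions (the line is pieced together from the Control and Value columns of the posted output). Note that inside the -Property hashtable the key is written Value, without a leading $.

```powershell
# Hypothetical helper (not from the thread): splits a matched line on the
# first ':' and keeps only the digits found between '<' and '>'.
function ConvertTo-MrnRecord {
    param(
        [Parameter(Mandatory = $true)][string]$Line,
        [Parameter(Mandatory = $true)][string]$FileName
    )

    # Limit the split to two parts so a later ':' cannot truncate the value
    $parts = $Line -split ':', 2

    New-Object PSObject -Property @{
        FileName = $FileName
        Control  = $parts[0].Trim()
        # '$1' is the first capture group, i.e. the digits inside < >
        Value    = $parts[1] -replace '\s*<(\d+)>.+', '$1'
    }
}

# Sample line assumed from the question's output
$sample = 'BT /F1 220 Tf 0 g 1800 -7218 Td(MRN: <692>) Tj ET Q'
$record = ConvertTo-MrnRecord -Line $sample -FileName 'X:\DB_16877_P_40.PDF'
$record.Value
```

    For the sample line, $record.Value is 692. The Control property still carries the PDF drawing operators in front of MRN; trimming those would need a similar -replace.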


    [string](0..33|%{[char][int](46+("686552495351636652556262185355647068516270555358646562655775 0645570").substring(($_*2),2))})-replace " "



    • Proposed as answer by jrv Thursday, July 3, 2014 12:31 PM
    • Edited by mjolinorModerator Thursday, July 3, 2014 12:37 PM
    • Marked as answer by Cubepirategm Thursday, July 3, 2014 1:43 PM
    Thursday, July 3, 2014 12:19 PM
    Moderator

All replies

  • You cannot read the contents of most PDF files.  The contents are often encrypted.  Some control fields are visible, but the overall structure is NOT text; it is binary.

    To learn how to read these files post your questions in the Adobe developers forum.


    ¯\_(ツ)_/¯

    Wednesday, July 2, 2014 6:51 PM
  • I understand your thinking, but have to respectfully disagree, at least partially.  In the Results information shown in my original post, I am seeing the actual MRN value (in these examples:  <692> and <281>).  These are not binary values.  Please review and give additional response.  Thanks.

    Data posts from the actual PDF files being scanned:

    MRN: <692>

    MRN: <281>

    Thursday, July 3, 2014 11:50 AM
  • You might be able to extract some data on older PDF formats using RegEx.

    [void]('<692>) Tj ET Q ' -match '\<(\d+)\>');$matches[1]
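
    A slightly more defensive variant of the same idea, as a sketch: -match returns a Boolean, and $Matches keeps the groups from the last successful match, so it is safer to read $Matches only when the match succeeded. The function name Get-BracketedNumber is a hypothetical one chosen for illustration.

```powershell
# Hypothetical helper: returns the digits between '<' and '>', or $null
# when the text contains no such group.
function Get-BracketedNumber {
    param([string]$Text)

    # Only read $Matches after a successful match; otherwise it may still
    # hold the capture groups from an earlier, unrelated match.
    if ($Text -match '<(\d+)>') {
        return $Matches[1]
    }
    return $null
}

Get-BracketedNumber '<692>) Tj ET Q '   # 692
Get-BracketedNumber 'no number here'    # nothing is returned
```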


    ¯\_(ツ)_/¯

    Thursday, July 3, 2014 12:07 PM
  • Rob -

    Value lost its leading '$' in your example.


    ¯\_(ツ)_/¯

    Thursday, July 3, 2014 12:32 PM
  • Thanks.  Copy/paste error induced by caffeine deficiency.  I'll go try to fix that :).

    [string](0..33|%{[char][int](46+("686552495351636652556262185355647068516270555358646562655775 0645570").substring(($_*2),2))})-replace " "

    Thursday, July 3, 2014 12:39 PM
    Moderator
  • No, it's the web site.  Whenever I place a $ at the beginning of a line, the web site removes it.  Add a space in front and it remains.  It is intermittent: some days it does it every time; today it doesn't for me.  I believe it actually depends on what is in the comment before the one being entered.  I know mismatched quotes cause issues for the whole page.

    No space -

    value

    Space

     $value


    ¯\_(ツ)_/¯


    • Edited by jrv Thursday, July 3, 2014 12:44 PM
    Thursday, July 3, 2014 12:44 PM
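
The thread stops before the rename step that the question asks about. A minimal sketch of that step follows. The thread never states the target naming structure, so the pattern used here (original base name, then _MRN and the number, then the original extension) and the function name Get-NewPdfName are assumptions made only for illustration.

```powershell
# Hypothetical naming scheme: <original base name>_MRN<mrn><extension>.
# The real target structure is not given in the thread.
function Get-NewPdfName {
    param(
        [Parameter(Mandatory = $true)][string]$OriginalName,
        [Parameter(Mandatory = $true)][string]$Mrn
    )

    $base = [System.IO.Path]::GetFileNameWithoutExtension($OriginalName)
    $ext  = [System.IO.Path]::GetExtension($OriginalName)

    # Build the new name from the three parts
    '{0}_MRN{1}{2}' -f $base, $Mrn, $ext
}

Get-NewPdfName -OriginalName 'DB_16877_P_40.PDF' -Mrn '692'
# DB_16877_P_40_MRN692.PDF

# Possible use against a real file ($file is assumed to be a FileInfo from
# Get-ChildItem); -WhatIf previews the rename without performing it:
# Rename-Item -LiteralPath $file.FullName -NewName (Get-NewPdfName -OriginalName $file.Name -Mrn '692') -WhatIf
```

Keeping the name-building logic in its own function makes it easy to check the generated names before any file is touched.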