Reading large text files

    Question

  • I've seen several ideas about processing large text files with PowerShell.

         (gc $LogFile -ReadCount 1000000 |%{$_} |Measure-Object).Count
    is faster than
         (gc $LogFile).Count

    Not only am I unsure whether there is a better way to do the count, but once I have the total line count I then need to process all of the newly appended lines shortly afterwards:

         (gc $LogFile |Select -Skip $Count |Select-String $Regex)

    so I end up no further ahead. With v2, is there a better way to accomplish the above two procedures for large files?
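
    One variation on the first snippet sums the size of each -ReadCount batch instead of unrolling every line, something like:

         $total = 0
         gc $LogFile -ReadCount 1000000 |%{ $total += @($_).Count }
         $total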

    Thanks!

    Saturday, June 9, 2012 9:12 PM

Answers

  • A faster method would be:

    ([io.file]::ReadAllLines('c:\scripts\compnames.txt')).count
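
    For the second part of the question (processing only the newly appended lines), the same array can be reused; a rough sketch, assuming $LogFile, $Count and $Regex as in the original post:

    $lines = [io.file]::ReadAllLines($LogFile)
    $lines.Count                                           # total line count
    $lines | Select-Object -Skip $Count | Select-String $Regex

    Note that ReadAllLines loads the entire file into memory as an array of lines, and .NET resolves a relative path against the process working directory rather than the current PowerShell location, so a full path is safest.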


    Grant Ward, a.k.a. Bigteddy

    • Marked as answer by Ritmo2k Sunday, June 10, 2012 4:16 PM
    Sunday, June 10, 2012 4:54 AM

All replies

  • A faster method would be:

    ([io.file]::ReadAllLines('c:\scripts\compnames.txt')).count


    Grant Ward, a.k.a. Bigteddy

    • Marked as answer by Ritmo2k Sunday, June 10, 2012 4:16 PM
    Sunday, June 10, 2012 4:54 AM
  • Thanks Bigteddy.
    Sunday, June 10, 2012 4:17 PM
  • Thanks Grant, that is orders of magnitude faster.


    Richard Mueller - MVP Directory Services

    Sunday, June 10, 2012 9:13 PM
  • You can use StreamReader to read a file starting from an arbitrary position.

    From what I understand, you want to read a file, then that file is appended with some lines, and then you want to only read those appended lines. If that's the case, the first thing that needs to be done is to figure out where the appended lines will be located in the file. The shortest way is just to get the length of the file:

     
    $filePath = '.\test.txt'
    $startAppendedLines = (Get-Item $filePath).Length
    

    I'm making the assumption that the file length in bytes is equal to the last position of a file. If this isn't always the case, you can do something like this instead:

     
    $filePath = '.\test.txt'
    $fullPath = Resolve-Path $filePath
    $stream = New-Object System.IO.FileStream -ArgumentList $fullPath, 'Open'
    $startAppendedLines = $stream.Seek(0, 'End')
    $stream.Close()
    
    

    After lines of text have been appended to the file, we can position StreamReader to start reading lines of text starting from $startAppendedLines. Because there are some possible issues that must be dealt with when working with files (errors, encoding, etc.), I put calls to StreamReader inside a function called Get-TextFileContent. Then we can just call that function like this:

     
    Get-TextFileContent $filePath $startAppendedLines
    

    Here is the function:

    # Get the contents of the file starting from a position
    # specified as a byte offset from the beginning of the file
    #
    function Get-TextFileContent
    {
        Param([string]$FilePath, [int64]$StartPosition=0)
        
        try   {$fullPath = Resolve-Path $FilePath -ErrorAction Stop}
        catch {throw "Could not resolve path $FilePath"}
        
        if ($startPosition -lt 0) {
            $startPosition = 0
        }
    
        try   {$stream = New-Object System.IO.FileStream -ArgumentList $fullPath, 'Open' -ErrorAction Stop}
        catch {throw "Could not open file $fullPath"}
    
        $streamEnd = $stream.Seek(0, 'End')
        $streamStart = $stream.Seek(0, 'Begin')
        
        if (($streamStart -le $StartPosition) -and ($StartPosition -le $streamEnd)) {
    
            $reader = New-Object System.IO.StreamReader -ArgumentList $stream, $true
    
            # Get the reader to recognize the encoding by reading the first line 
            # of the text file. I found this to be necessary with Unicode files.
            #
            $reader.BaseStream.Seek(0, 'Begin') | Out-Null
            $reader.ReadLine() | Out-Null
            $reader.DiscardBufferedData()
    
            # Start reading file from start position
            #
            $reader.BaseStream.Seek($StartPosition, 'Begin') | Out-Null
            while (-not $reader.EndOfStream) {
                $reader.ReadLine()
            }
    
            $reader.Close()
        }
    
        # StreamReader.Close() above also closes the underlying stream, but this
        # call is still needed for the case where the position check fails and
        # no reader was ever created (closing a FileStream twice is harmless).
        #
        $stream.Close()
    }
    

    You can use the function above to start reading a file from an arbitrary position, but be careful when doing this with files that contain characters represented by more than one byte, such as Unicode. You can probably speed up the function by changing how StreamReader reads the text file, and for really large files, adding TotalCount and ReadCount parameters would also be useful, but this post is already long enough.
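
    For example, a ReadCount-style variant might swap the reading loop inside the function for something like the sketch below ($BatchSize is just a placeholder value), emitting arrays of lines instead of single lines, similar to Get-Content -ReadCount:

    $BatchSize = 10000
    $batch = New-Object 'System.Collections.Generic.List[string]'
    while (-not $reader.EndOfStream) {
        $batch.Add($reader.ReadLine())
        if ($batch.Count -ge $BatchSize) {
            $chunk = $batch.ToArray()
            ,$chunk                    # emit the batch as a single array object
            $batch.Clear()
        }
    }
    if ($batch.Count -gt 0) {
        $chunk = $batch.ToArray()
        ,$chunk                        # emit any remaining partial batch
    }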

     

    Monday, June 11, 2012 2:14 AM
  • Cool stuff Teddy, this is significantly faster than Get-Content

    Jaap Brasser
    http://www.jaapbrasser.com

    Monday, June 11, 2012 7:59 AM
  • Yup, as I found out, I needed this class because [IO.File] seems to have issues with files that are open, such as the log in question.
    Thanks for the detailed write up!
    Monday, June 11, 2012 12:28 PM
  • Cool stuff Teddy, this is significantly faster than Get-Content

    Jaap Brasser
    http://www.jaapbrasser.com

    I came across it when I was first learning PowerShell, and was amazed at the difference, especially on large text files. I still use Get-Content for most things, though, as these are usually small input files and the performance difference is not noticeable.

    Grant Ward, a.k.a. Bigteddy

    Monday, June 11, 2012 1:38 PM
  • I'm getting great results with this method when working on large files.

    I've got a script which goes through every file in a directory and counts the rows, works great.

    My only problem is that PowerShell gradually eats up all available system memory until it hits 100%, and then it starts paging out to disk. Once that happens, performance hits the floor and it starts to take an hour to count the rows in a file instead of 30 seconds.

    It seems that [io.file] loads a file into memory, counts it, then loads the next file into memory, counts it, and so on, but the memory is never released.

    Since I'm counting rows in over 900GB of files, this is a problem.

    Is there any way to make it release the file from memory after it's counted each file?

    Friday, December 21, 2012 3:00 PM
  • Try this:

    $recordcounts = @{}
    foreach ($file in Get-ChildItem <filespec>){
      Get-Content $file.FullName -ReadCount 10000 |
        foreach {$recordcounts[$file.FullName] += $_.Count}
    }

    I've been able to get faster read times on large files using that than streaming if I tune the readcount to match the files.  You want to find a "sweet spot" where it's reading the maximum number of records at a time without having to start paging.

    Memory usage doing a streaming read will increase with the size of the file.  Using readcount, it increases to accommodate the size needed for the readcount, and stops there.
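
    If tuning the readcount still isn't enough, a plain StreamReader loop only ever holds one line in memory at a time, at the cost of some speed; a minimal sketch ('C:\logs' is just a placeholder directory):

    $recordcounts = @{}
    foreach ($file in (Get-ChildItem 'C:\logs' | Where-Object {-not $_.PSIsContainer})) {
        $reader = New-Object System.IO.StreamReader -ArgumentList $file.FullName
        $count = 0
        while ($reader.ReadLine() -ne $null) { $count++ }
        $reader.Close()
        $recordcounts[$file.FullName] = $count
    }
    $recordcounts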


    [string](0..33|%{[char][int](46+("686552495351636652556262185355647068516270555358646562655775 0645570").substring(($_*2),2))})-replace " "


    • Edited by mjolinor Friday, December 21, 2012 3:18 PM
    Friday, December 21, 2012 3:14 PM
  • Interesting. It reminds me of a discussion a few years back about the fastest way to read large files in VBScript. The .ReadAll method turned out to be inefficient for large files, and the solution was to use the .Read method and specify the number of characters to read as the length of the file. This apparently caused the file to be read directly into the variable rather than being buffered through a system I/O buffer of limited size.

    Al Dunbar -- remember to 'mark or propose as answer' or 'vote as helpful' as appropriate.

    Friday, December 21, 2012 4:03 PM
  • I wrote this while experimenting with it.  It goes through a sequence of readcounts, measuring the amount of pagefile change for each one.  Windows memory management seems to cause some erratic results (occasional spurious negative readings), but I found that the readcounts at the end of a run of zero pagefile change produce the best results.  The performance timing seems to go in "steps" that I think correspond to memory allocation thresholds that trigger increases in the pagefile allocation.

    # P/Invoke SetProcessWorkingSetSize to trim the process working set, so that
    # the pagefile readings start from a clean baseline before each test run.
    #
    Function TrimWorkingSet {
        param([int]$procid)
        $sig = @"
    [DllImport("kernel32.dll")]
    public static extern bool SetProcessWorkingSetSize( IntPtr proc, int min, int max );
    "@
        # Compile the type only once; Add-Type complains if it already exists.
        if (-not ('User32.Util' -as [type])) {
            Add-Type -MemberDefinition $sig -Namespace User32 -Name Util -UsingNamespace System.Text
        }
        $apptotrim = (Get-Process -Id $procid).Handle
        [User32.Util]::SetProcessWorkingSetSize($apptotrim, -1, -1)
    }

    $f = '<filespec>'
    .{
        # Baseline passes: readcount 1, then 0 (0 = read the whole file at once).
        foreach ($i in @(.01, 0)) {
            [GC]::Collect()
            TrimWorkingSet $PID | Out-Null
            $rc = 100 * $i
            $start_pf = Get-Process -Id $PID | select -ExpandProperty PagedMemorySize64
            $time = (Measure-Command {gc $f -ReadCount $rc}).TotalMilliseconds
            $end_pf = Get-Process -Id $PID | select -ExpandProperty PagedMemorySize64
            $used_pf = $end_pf - $start_pf
            'Read Count {0}  {1} MS PageFile Change {2}' -f $rc, $time, $used_pf
        }
        # Sweep readcounts from 100 to 10000 in steps of 100, recording the
        # elapsed time and the pagefile change for each one.
        foreach ($i in 1..100) {
            [GC]::Collect()
            TrimWorkingSet $PID | Out-Null
            $rc = 100 * $i
            $start_pf = Get-Process -Id $PID | select -ExpandProperty PagedMemorySize64
            $time = (Measure-Command {gc $f -ReadCount $rc}).TotalMilliseconds
            $end_pf = Get-Process -Id $PID | select -ExpandProperty PagedMemorySize64
            $used_pf = $end_pf - $start_pf
            'Read Count {0}  {1} MS PageFile Change {2}' -f $rc, $time, $used_pf
        }
    }


    [string](0..33|%{[char][int](46+("686552495351636652556262185355647068516270555358646562655775 0645570").substring(($_*2),2))})-replace " "

    Friday, December 21, 2012 4:33 PM