PowerShell - How to speed up file read and write? Can it be buffered?

    Question

  • We get huge map files (up to 1.3 GB) of geo data. Out of these massive northing, easting, height files, I just need a small area (around 50-80 MB worth of data).

    After a lot of research on these forums, I managed to put together the script below. The filter reads the file, checks whether the coordinates are within the range I want, and if so, writes the line to another file.

    But this script takes days to run on huge files... I am not sure what slows it down, whether it is the read, the echo or the write (my bet is the write). Can I modify this script so it goes faster? Maybe chunk load and write (if there is such a thing)?

    $sr = New-Object System.IO.StreamReader("C:\NEHfiles\CountyX.NEH")
    while ($line = $sr.ReadLine()) {
        if ($line.Substring(0,7) -gt "6542880" -and $line.Substring(0,7) -lt "6544143" -and
            $line.Substring(11,6) -gt "612779" -and $line.Substring(11,6) -lt "613873") {
            echo $line
            $line >> Area.txt
        }
    }

    • Moved by Bill_Stewart Thursday, January 02, 2014 8:44 PM Question outside reasonable forum scope
    Thursday, November 14, 2013 3:01 PM

All replies

  • Is your file sorted in any way?  If not, then you're not going to get much better performance than you've already got, as you really do need to read and check every line of the file.

    If the data file is sorted or structured in some way that allows you to know exactly which bytes or lines to read, then you can improve performance somewhat by either breaking out of your loop early (not calling ReadLine() on the rest of the file, once you've got what you need), or possibly making use of FileStream.Seek (http://msdn.microsoft.com/en-us/library/system.io.filestream.seek(v=vs.110).aspx) to start reading the file from a specific byte, rather than reading all of the lines before that point.
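
    For example, if the file happened to be sorted by northing (purely a hypothetical sketch; this only works on sorted data), you could stop reading as soon as the northing passes your upper bound:

    $sr = New-Object System.IO.StreamReader("C:\NEHfiles\CountyX.NEH")
    while ($line = $sr.ReadLine()) {
        $northing = $line.Substring(0,7)
        # Past the upper bound; nothing later in a sorted file can match.
        if ($northing -gt "6544143") { break }
        if ($northing -gt "6542880" -and
            $line.Substring(11,6) -gt "612779" -and $line.Substring(11,6) -lt "613873") {
            $line >> Area.txt
        }
    }
    $sr.Close()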

    You can do some of your own buffering as well, if you'd like, but I don't know how much performance gain you'll see.  You'd use this method instead of ReadLine():  http://msdn.microsoft.com/en-us/library/9kstw824(v=vs.110).aspx , and write your own code to extract lines from that buffer.
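
    A rough sketch of what that might look like (the line-splitting part is deliberately left as a stub, since that's the code you'd have to write yourself):

    $sr = New-Object System.IO.StreamReader("C:\NEHfiles\CountyX.NEH")
    $buffer = New-Object 'char[]' 65536
    while (($read = $sr.Read($buffer, 0, $buffer.Length)) -gt 0) {
        # Turn the chunk we actually read into a string...
        $chunk = New-Object string($buffer, 0, $read)
        # ...then split $chunk into lines yourself, remembering that the
        # last line of a chunk is usually incomplete and must be carried
        # over and prepended to the next chunk.
    }
    $sr.Close()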

    Thursday, November 14, 2013 3:20 PM
  • A couple of thoughts: PowerShell is a good general-purpose tool, but it's designed to save time on script writing, accepting that it might run a bit slower than 'proper' code. If you need performance then break this out into C#.

    If you want good performance then break it out to C++.

    Now, let's see if we can't reduce the amount of processing:

    At the moment your code is nice and neat, but unless the compiler is smarter than I think, you might get better performance by nesting your tests. This version nests only the first of the tests:

    $sr = New-Object System.IO.StreamReader("C:\NEHfiles\CountyX.NEH")
    while ($line = $sr.ReadLine()) {
        if ($line.Substring(0,7) -gt "6542880") {
            if ($line.Substring(0,7) -lt "6544143" -and
                $line.Substring(11,6) -gt "612779" -and $line.Substring(11,6) -lt "613873") {
                Write-Host $line
                $line >> Area.txt
            }
        }
    }

    That way it won't bother taking the rest of the substrings.

    The second thought is that you can always add each $line to an array and only write that out once at the end.

    $sr = New-Object System.IO.StreamReader("C:\NEHfiles\CountyX.NEH")

    # New results array variable:
    $results = @()
    while ($line = $sr.ReadLine()) {
        if ($line.Substring(0,7) -gt "6542880" -and $line.Substring(0,7) -lt "6544143" -and
            $line.Substring(11,6) -gt "612779" -and $line.Substring(11,6) -lt "613873") {
            # Now just add the $line object to the array
            $results += $line
        }
    }
    $results >> "Area.txt"

    If it dies halfway through then you're stuffed, and you can't read partial results as it goes, but that'll work.
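
    Another option, just as a sketch: keep one System.IO.StreamWriter open for the whole run. It buffers its writes internally, so you still get output as you go without reopening Area.txt for every matching line the way >> does.

    $sr = New-Object System.IO.StreamReader("C:\NEHfiles\CountyX.NEH")
    # Full path: .NET resolves relative paths against the process directory,
    # not the current PowerShell location.
    $sw = New-Object System.IO.StreamWriter("C:\NEHfiles\Area.txt")
    while ($line = $sr.ReadLine()) {
        if ($line.Substring(0,7) -gt "6542880" -and $line.Substring(0,7) -lt "6544143" -and
            $line.Substring(11,6) -gt "612779" -and $line.Substring(11,6) -lt "613873") {
            $sw.WriteLine($line)   # buffered write; the file stays open throughout
        }
    }
    $sw.Close()
    $sr.Close()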

    Beyond that you could try splitting the file into sections and threading the process (rough sketch below). That's not too bad, but it's not a novice task either.
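
    Something along these lines, assuming you've already pre-split the big file into part files (the Parts folder and file names here are hypothetical):

    $parts = Get-ChildItem "C:\NEHfiles\Parts\*.NEH"
    $jobs = foreach ($part in $parts) {
        # One background job per part file, each running the same filter.
        Start-Job -ArgumentList $part.FullName -ScriptBlock {
            param($path)
            $sr = New-Object System.IO.StreamReader($path)
            while ($line = $sr.ReadLine()) {
                if ($line.Substring(0,7) -gt "6542880" -and $line.Substring(0,7) -lt "6544143" -and
                    $line.Substring(11,6) -gt "612779" -and $line.Substring(11,6) -lt "613873") {
                    $line   # emit matching lines from the job
                }
            }
            $sr.Close()
        }
    }
    # Wait for all jobs, collect their matching lines, and write them once.
    $jobs | Wait-Job | Receive-Job | Set-Content "Area.txt"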

    Finally, what's the RAM load on your box looking like? It might be that you need more memory to hold everything and that you're paging it to disk.

    Thursday, November 14, 2013 4:15 PM
  • Regarding the earlier suggestion to nest the tests so that "it won't bother taking the rest of the substrings":


    PowerShell does short-circuit evaluation, so that part shouldn't be necessary. If the expression on the left of an -and operator evaluates to $false, the rest of the expressions aren't executed.
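
    You can see this for yourself with a quick test (Test-RightSide is just a throwaway demo function):

    # -and stops as soon as the result is decided: the function below is
    # never called when the left-hand side is $false.
    function Test-RightSide { Write-Host 'right side evaluated'; $true }
    $false -and (Test-RightSide)   # Test-RightSide never runs; result is False
    $true -and (Test-RightSide)    # prints 'right side evaluated'; result is True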
    Thursday, November 14, 2013 4:39 PM
  • I wasn't sure, thanks for the tip.

    Thursday, November 14, 2013 4:43 PM
  • On a side note, you may find that it's better in the long run to import these files into a SQL database.  Properly configured, a SQL database with these values can be queried much faster than anything you're likely to develop yourself in a script.
    Thursday, November 14, 2013 4:59 PM
  • Thanks for the replies.

    The files are not ordered. It's just the northing, easting and height taken by a boat going in lines up and down waterways. Not straight north, east lines, but back and forth, and all over the place, so the data is not in any order.

    My scripting skills were good enough to find and open PowerShell, and get that script going, and it works fine :) (but a bit slow). But how do I open the C++ shell? I need to do a bit of research into how I can open the C++ script window.

    I do have MS Access, and Access can read and write text files... I think I'll give that a shot.

    Friday, November 15, 2013 12:01 PM
  • C++ isn't a scripting language. It's a programming language that must be compiled to an executable before you run it, and it doesn't sound like it would be a good option for you to attempt at this point.

    I've read up on the StreamReader class a bit, and it already performs buffered reads (with a buffer size of 2KB by default, I believe).  You can specify a larger internal buffer size in the constructor, if you want, but there's no need to manage an external buffer yourself.  Here's an example using a 4 megabyte buffer; no other lines of your code would need to be changed.

    $sr = New-Object System.IO.StreamReader("C:\NEHfiles\CountyX.NEH", [Text.Encoding]::Default, $true, 4MB)   

    I don't know how much of a difference this will make with your performance, though.  Since there's no order to the records in the file, you just have to process the entire thing.  Reading and processing 1.3GB of data takes some time.  If you can import the files into a database of some sort (Access might be ok; SQL would be better), you'll definitely have some better options.
    Friday, November 15, 2013 3:42 PM