Breaking Multiple large text files into smaller files with 1M rows max

  • Question

  • Hi everyone,

    I am new to Powershell and this forum and would appreciate your help.

    I regularly receive multiple text files with >1M rows each, which I need to split into chunks of 1M rows maximum. Someone told me to check out Powershell for this, and I managed to grab the code below from the internet and alter it to my needs.

    Any advice is very much appreciated. 

    Josh

    Settings:

    All source files are in the same folder. I would like the code to loop through each source file in the folder and divide it into multiple output files, each containing a subset of the source file's rows (partitioned by each row's value in the final column). The final column contains a text string followed by a number. The output files should be named after the value in that column.

    Unresolved issue:

    I've tested the code below on a folder that has two source files. It runs and cuts the first source file into smaller files, saving them where requested, but then throws an error and does not loop through the second source file in the folder.

    The error is as follows:

    new-object : Exception calling ".ctor" with "1" argument(s): "Access to the path
    'C:\Users\josh\Desktop\testing\powershell\results' is denied."
    At line:14 char:12
    +     $reader = new-object IO.StreamReader($src)  # Reader for input
    +               ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
        + CategoryInfo          : InvalidOperation: (:) [New-Object], MethodInvocationException
        + FullyQualifiedErrorId : ConstructorInvokedThrowException,Microsoft.PowerShell.Commands.NewObjectCommand

    Sample source data (this sample should create two output files, named DEenglish1 and DEenglish2, each containing one row):

    COLUMN1  COLUMN2  PARTITIONCOLUMN
    XXXX     YYYY     DEenglish1
    XXXX     YYYY     DEenglish2

    Code so far:

    $fileDirectory = "C:\Users\josh\Desktop\testing\powershell";


    foreach($file in Get-ChildItem $fileDirectory)
    {
        $src = "C:\Users\josh\Desktop\testing\powershell\" + $file.name  # Source file
        $dst = "C:\Users\josh\Desktop\testing\powershell\results\{0}.txt" # Output file(s)

        $reader = new-object IO.StreamReader($src)  # Reader for input

        $header = Get-Content -Path $src | select -First 1 # Get the header row

        $ht = @{}
        $line = $reader.ReadLine() # Skip the first line, it's already in $header

        while(($line = $reader.ReadLine()) -ne $null){ # Loop the input
            $match = [regex]::match($line, '(?i)(\w+\d)$') # Look for row that ends with text followed by number
            if($match.Success){

                $outFileName = $match.Groups[0].value # defines the output file name

                if(-not $ht.ContainsKey($outFileName)) { # Output file is not yet in hashtable
                    $ht.Add($outFileName, (new-object Text.StringBuilder) )
                    [void]$ht[$outFileName].Append($header)
                    [void]$ht[$outFileName].Append([Environment]::NewLine)
                } else { # Append data to existing file
                    [void]$ht[$outFileName].Append($line)
                    [void]$ht[$outFileName].Append([Environment]::NewLine)
                }
            }
        }
        $reader.Close() # Close the input file

        # Dump the hashtable contents to individual files
        $ht.GetEnumerator() | % {
            set-content $($dst -f $_.Name) ($_.Value).ToString()
        }
    }

    Wednesday, January 16, 2019 7:20 PM

All replies

  • Start with this.

    $name= 'newnamebase'
    Get-Content -ReadCount 1000000 $filename |
         ForEach-Object {$i=1} {
              $newfile = "$name$i.ext"
              $i++
              $_ | Out-File $newfile
         }

    Splits any file into smaller files of 1M lines or less.
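
    For a whole folder of source files, as in the original question, the same idea could be wrapped in a Get-ChildItem loop. This is only a sketch: the folder path, the "*.txt" filter, the "results" subfolder, and the output naming are placeholders, and the results folder is assumed to already exist.

    # Sketch only: apply the same -ReadCount split to every .txt file in a folder.
    foreach ($file in Get-ChildItem -Path 'C:\Users\josh\Desktop\testing\powershell' -Filter '*.txt') {
        $i = 1
        Get-Content -ReadCount 1000000 -Path $file.FullName |
            ForEach-Object {
                # each $_ here is an array of up to 1,000,000 lines
                $_ | Out-File (Join-Path $file.DirectoryName ("results\{0}-{1}.txt" -f $file.BaseName, $i))
                $i++
            }
    }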


    \_(ツ)_/

    Wednesday, January 16, 2019 7:46 PM
  • Is the code you posted what you found on the Internet? Or is it code you've subsequently modified? I ask because the posted code doesn't fulfill the "chunks of 1M rows maximum" requirement.

    Also, the "New-Object" exception states the exception occurs on line 14. Line 14 in the posted code contains "$reader.ReadLine()". The New-Object cmdlet appears on line 9.

    The code you posted seems to work with limited test data. However, I'd suggest removing this line:

    $header = Get-Content -Path $src | select -First 1 # Get the header row

    and modifying the 1st "$line = $reader.ReadLine()" to "$header = $reader.ReadLine()".

    There's no need to mix Powershell cmdlets with straight-up .Net code.

    Also, the code never adds the 1st line of the input file to the hash. It only adds the header!

    Try this (corrected) code and see if it works any better (adjust the top line to agree with your directory structure):

    $fileDirectory = "C:\test";
    
    foreach($file in Get-ChildItem -Path $fileDirectory -filter "*.csv")
    {
        $src = $fileDirectory + '\' + $file.name    # Source file
        $dst = $fileDirectory + "\results\{0}.txt"  # Output file(s)
    
        $reader = new-object IO.StreamReader($src)  # Reader for input
    
        $ht = @{}
        $header = $reader.ReadLine()                # store the header for later use
    
        while(($line = $reader.ReadLine()) -ne $null){     # Loop the input
            $match = [regex]::match($line, '(?i)(\w+\d)$') # Look for row that ends with text followed by number
            if($match.Success){
                $outFileName = $match.Groups[0].value      # defines the output file name
                if(-not $ht.ContainsKey($outFileName)) {   # Output file is not yet in hashtable
                    $ht.Add($outFileName, (new-object Text.StringBuilder) )
                    [void]$ht[$outFileName].Append($header)
                    [void]$ht[$outFileName].Append([Environment]::NewLine)
                    [void]$ht[$outFileName].Append($line)
                    [void]$ht[$outFileName].Append([Environment]::NewLine)
                } else { # Append data to existing file
                    [void]$ht[$outFileName].Append($line)
                    [void]$ht[$outFileName].Append([Environment]::NewLine)
                }
            }
        }
        $reader.Close() # Close the input file
        $reader.Dispose()
    
        # Dump the hashtable contents to individual files
        $ht.GetEnumerator() | ForEach {
            set-content $($dst -f $_.Name) ($_.Value).ToString() 
        } 
    } 


    --- Rich Matheisen MCSE&I, Exchange Ex-MVP (16 years)

    • Marked as answer by jt-89 Thursday, January 24, 2019 7:57 PM
    Wednesday, January 16, 2019 10:40 PM
  • Thank you both.

    The second code worked great; I only had to change -filter "*.csv" to -filter "*.txt". I was able to split 25M rows across two source files into files of 1M rows within 21 minutes.

    With regard to the 1M-row limitation, that is what I use the last column for. It has a number suffix that goes up by one for each increment of 1M rows (i.e. rows 1 to 999999 are marked 1, rows 1M to 1999999 are marked 2, etc.).
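
    For illustration only, assuming that numbering scheme (with row counting starting at 1 for the first data row), the suffix works out to something like:

    $rowNumber = 1500000                                  # hypothetical data-row index
    $suffix    = [math]::Floor($rowNumber / 1000000) + 1  # gives 2, matching the scheme above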

    I will inspect the first code above to learn how I can do things in Powershell without helper columns.

    Thanks a lot.

    Thursday, January 17, 2019 4:02 PM
  • Try this code. It should be faster (and less of a memory hog) because it isn't using hashes and arrays to store the data.

    If you want the header record counted toward the limit, set $maxRecords to 999999 instead of 1000000, because the code only counts data records.

    $fileDirectory = "C:\test";
    
    $maxRecords = 2
    
    foreach($file in Get-ChildItem -Path $fileDirectory -filter "*.csv")
    {
        $src = $fileDirectory + '\' + $file.name        # Source file
        $dst = $fileDirectory + "\results\{0}-{1}.txt"  # Output file(s)
    
        $reader = new-object IO.StreamReader($src)      # Reader for input
    
        $header = $reader.ReadLine()                    # store the header for later use
    
        $lineNumber = 0                                 # keep track of record count
        $fileSuffix = 0                                 # keep track of file name suffix
    
        while(($line = $reader.ReadLine()) -ne $null){     # Loop the input
            if ($lineNumber -eq 0)
            {
                Set-Content ($dst -f $file.BaseName,$fileSuffix) -Value $header
            }
            Add-Content ($dst -f $file.BaseName,$fileSuffix) -Value $line
            if (++$lineNumber -eq $maxRecords)
            {
                $lineNumber = 0
                $fileSuffix++
            }
        }
        $reader.Close() # Close the input file
        $reader.Dispose()
    
    } 


    --- Rich Matheisen MCSE&I, Exchange Ex-MVP (16 years)

    Thursday, January 17, 2019 7:23 PM
    Hi Rich, I've run it with $maxRecords set to 999999 and it works as expected.

    Unfortunately, it does not seem to be faster: it took 48 minutes to create only 1 segment of 1M rows, versus the 21 minutes for 25 segments with the original code you provided.

    In any case thank you for the help - I appreciate the guidance. 

    Friday, January 18, 2019 12:06 PM
  • Hi,

    Was your issue resolved?

    If you resolved it using our solution, please "mark it as answer" to help other community members find the helpful reply quickly.

    If you resolved it using your own solution, please share your experience and solution here. It will be very beneficial for other community members who have similar questions.

    If no, please reply and tell us the current situation in order to provide further help.

    Best Regards,

    Lee


    Just do it.

    Tuesday, January 22, 2019 6:45 AM
  • done. thank you
    Thursday, January 24, 2019 7:57 PM
    Replacing the "Set-Content" and "Add-Content" cmdlets with an IO.StreamWriter object and calls to "$obj.WriteLine($header)" and "$obj.WriteLine($line)" (as you did with the StreamReader) would probably correct that slowness.
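
    For illustration, a minimal sketch of that substitution, reusing the same loop as the earlier splitting script (the folder, the "*.txt" filter, and $maxRecords below are placeholders, and the "results" subfolder is assumed to exist):

    # Sketch only: the same chunking loop, but writing through an IO.StreamWriter
    # instead of Set-Content/Add-Content.
    $fileDirectory = "C:\test"
    $maxRecords    = 1000000

    foreach ($file in Get-ChildItem -Path $fileDirectory -Filter "*.txt")
    {
        $reader = New-Object IO.StreamReader($file.FullName)
        $header = $reader.ReadLine()                 # store the header for later use

        $lineNumber = 0                              # records written to the current chunk
        $fileSuffix = 0                              # suffix of the current chunk file
        $writer     = $null

        while (($line = $reader.ReadLine()) -ne $null)
        {
            if ($lineNumber -eq 0)                   # starting a new chunk
            {
                if ($writer) { $writer.Close() }     # finish the previous chunk, if any
                $outPath = Join-Path $fileDirectory ("results\{0}-{1}.txt" -f $file.BaseName, $fileSuffix)
                $writer  = New-Object IO.StreamWriter($outPath)
                $writer.WriteLine($header)
            }
            $writer.WriteLine($line)
            if (++$lineNumber -eq $maxRecords)
            {
                $lineNumber = 0
                $fileSuffix++
            }
        }
        if ($writer) { $writer.Close() }             # close the last chunk
        $reader.Close()                              # close the input file
    }

    With a StreamWriter, each output file is opened once per chunk rather than once per appended line, which is most likely where Add-Content was losing time.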

    --- Rich Matheisen MCSE&I, Exchange Ex-MVP (16 years)

    Thursday, January 24, 2019 8:07 PM