Combine Multiple CSV Files into Large CSV Files

  • Question

  • I have a BUNCH of CSV files that I'd like to condense into fewer files; however, I'd like to create a new file each time a file reaches 1,000,000 rows. So I'd end up with something like "outputfile-1.csv" containing the first million rows, then "outputfile-2.csv" containing the next million rows, and so on. How can this be accomplished?

    I have (gc *.csv) | Set-Content output.csv

    And that works but I don't know how to make it only do 1 million rows per csv file.


    John Marcum | Microsoft MVP - Enterprise Client Management
    My blog: System Center Admin | Twitter: @SCCM_Marcum | Linkedin: John Marcum

    Monday, August 31, 2015 3:56 PM

Answers

  • $csv = Get-ChildItem *.csv | Select -Expand FullName | Import-Csv
    $rows = 1000000
    $i = 0
    $filenum = 1
    do {
        $filename = "output{0:D3}.csv" -f $filenum
        $csv[$i..(($i+=$rows) - 1)] | Export-Csv $filename -NoType
        $filenum++
    } 
    until ($i -ge $csv.Count - 1)

    • Marked as answer by John Marcum Tuesday, September 1, 2015 2:28 PM
    Monday, August 31, 2015 7:00 PM

All replies

  • Import the CSV, then extract the rows:

    $csv=Import-Csv filename
    $csv[0..(1000000-1)]|Export-Csv file1


    \_(ツ)_/

    Monday, August 31, 2015 4:14 PM
  • I have a TON of CSVs, so could I do this rather than specifying a file name?

    And if so, how do I append a number to the end of each output file, like output1.csv, output2.csv, etc.?

    $csv = (gc *.csv)

    $csv[0..(1000000-1)] | Set-Content output.csv


    John Marcum | Microsoft MVP - Enterprise Client Management
    My blog: System Center Admin | Twitter: @SCCM_Marcum | Linkedin: John Marcum

    Monday, August 31, 2015 4:37 PM
  • No. You cannot use text operations to manage CSV files; concatenating the raw text would also copy the header row from every file into the output.

    If all the files are identical in structure you can do this:

    $csv = Import-Csv *.csv
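
    For example, a minimal sketch of that approach (the merged output name is just an illustration, not from this thread):

    # Import every CSV as objects - the header of each file is parsed rather than copied as text -
    # then write one merged file with a single header row.
    $csv = Get-ChildItem *.csv | Select-Object -ExpandProperty FullName | Import-Csv
    $csv | Export-Csv merged.csv -NoTypeInformation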


    \_(ツ)_/

    Monday, August 31, 2015 6:25 PM
  • $csv = Get-ChildItem *.csv | Select -Expand FullName | Import-Csv
    $rows = 1000000
    $i = 0
    $filenum = 1
    do {
        $filename = "output{0:D3}.csv" -f $filenum
        $csv[$i..(($i+=$rows) - 1)] | Export-Csv $filename -NoType
        $filenum++
    } 
    until ($i -ge $csv.Count - 1)

    • Marked as answer by John Marcum Tuesday, September 1, 2015 2:28 PM
    Monday, August 31, 2015 7:00 PM
  • Maybe what I am trying to do is simply not possible? I tried the code Leif-Arne posted. I left it running overnight, and 17 hours after I started it nothing has happened and PowerShell is "not responding".

    John Marcum | Microsoft MVP - Enterprise Client Management
    My blog: System Center Admin | Twitter: @SCCM_Marcum | Linkedin: John Marcum

    Tuesday, September 1, 2015 12:11 PM
  • I think you need a database or a bigger machine with more memory.

    SQL Express can handle millions of records even on small machines.

    The problem with CSVs, arrays, and other scripting structures is that they exist entirely in memory. This causes swapping, which slows the whole thing down to a crawl.

    Always use a database for large datasets.

    Back to the drawing board, heh?


    \_(ツ)_/

    Tuesday, September 1, 2015 12:22 PM
  • Hi John,

    jrv is absolutely right that this is really a job for a database. For the sake of PowerShell technique, though, here's an example of how you can do this without overloading your memory:

    The basic concept is to keep everything within a single pipeline, so that the lines of text flow through and get emitted by the final function as they come. This means all the data may only flow through the Process block of each function/cmdlet involved. Here's an example:

    function Select-Set
    {
    	[CmdletBinding()]
    	Param (
    		[Parameter(ValueFromPipeline = $true)]
    		$InputObject,
    		
    		[int]
    		$Count
    	)
    	
    	Begin
    	{
    		$list = New-Object PSObject -Property @{ List = @() }
    	}
    	
    	Process
    	{
    		foreach ($Object in $InputObject)
    		{
    			$list.List += $Object
    			if ($list.List.Length -ge $Count)
    			{
    				$list
    				$list = New-Object PSObject -Property @{ List = @() }
    			}
    		}
    	}
    	
    	End
    	{
    		if ($list.List.Length -gt 0)
    		{
    			$list
    		}
    	}
    }
    
    function Set-SerialContent
    {
    	[CmdletBinding()]
    	Param (
    		[Parameter(ValueFromPipeline = $true)]
    		$InputObject,
    		
    		[string]
    		$Path,
    		
    		[string]
    		$Name,
    		
    		[string]
    		$Extension
    	)
    	
    	Begin
    	{
    		$count = 0
    	}
    	
    	Process
    	{
    		foreach ($Object in $InputObject)
    		{
    			$Object.List | Set-Content "$Path\$($Name)$($count).$($Extension)"
    			$count++
    		}
    	}
    }
    
    
    Get-Content "*.csv" | Select-Set -Count 1000000 | Set-SerialContent -Path "C:\Output" -Name "Csv_" -Extension "csv"

    Please note that this is not designed to be the most aesthetic code, but a proof of concept. The two helper functions follow the basic rule I explained above: all objects passed through only do so on the pipeline. Of necessity, Select-Set keeps up to the number of lines specified in memory (so you need that much memory available).

    Furthermore, for a tailored fit, I could have combined the two functions into one (and saved myself some trouble with collections and pipelines), but I like to keep separate functionalities in separate functions. A bit of pointless vanity, perhaps?

    Anyway, this may work, but even if it does, there's no way its performance will top (or even approach) that of a database.

    Cheers,
    Fred


    There's no place like 127.0.0.1

    Tuesday, September 1, 2015 12:51 PM
  • I agree that a streaming method would work; however, these are CSVs with headers, and the individual files are not each 1,000,000 lines but need to be combined.

    Yes, there is a streaming solution, but it is complex. A database would remove the complexity and provide the performance. Writing a custom solution is unnecessary as long as we have databases.

    I can write this in a streaming format in about 15 minutes.  I won't.  Not interested.

    Any takers?

    Use a database.


    \_(ツ)_/

    Tuesday, September 1, 2015 1:05 PM
  • My end goal here is to get these into MS SQL. I couldn't find a way in MS SQL to import multiple CSV files at once, so I wanted to get the number of files down to as few as possible. I also found that the MS SQL import wizard parsed the headers better when the file had been converted to Excel, which is why I need the 1 million rows per file. Excel can only handle just over 1 million rows. :-)

    I ended up getting it to work by doing small chunks at a time and using the original code that Leif-Arne posted. I just separated my CSV files into folders containing 400 files per folder. The script can work through that in no time at all. (I'm running this on a pretty beefy system with lots of RAM and SSDs.)
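
    For reference, a rough sketch of that folder split (the batch folder names are just placeholders):

    # Move the CSVs into numbered subfolders of 400 files each
    $batchSize = 400
    $i = 0
    foreach ($file in (Get-ChildItem *.csv)) {
        $folder = 'batch{0:D3}' -f [int][math]::Floor($i / $batchSize)
        if (-not (Test-Path $folder)) { New-Item -ItemType Directory -Path $folder | Out-Null }
        Move-Item -Path $file.FullName -Destination $folder
        $i++
    }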


    John Marcum | Microsoft MVP - Enterprise Client Management
    My blog: System Center Admin | Twitter: @SCCM_Marcum | Linkedin: John Marcum

    Tuesday, September 1, 2015 2:27 PM
  • BulkCopy can load multiple CSV files at very high speed. I have loaded at more than one hundred thousand records per second. I built a system that bulk loaded batches of files; we loaded more than 250 million records a day between 6:00 and 7:00.

    BulkCopy is what you want to use. Don't pay attention to the idiots who claim it cannot be used with CSV files.


    \_(ツ)_/

    Tuesday, September 1, 2015 2:42 PM
  • Ugh... I was just trying to use the import wizard thing. Can you give an example of how I could load all these CSVs into SQL 2012 using bulk copy? Would it just be like this?

    BULK INSERT dbo.TableForBulkData
    FROM 'D:\MyDisk\*.csv'
    WITH
    (
    FIELDTERMINATOR = ',',
    ROWTERMINATOR = '\n'
    )


    John Marcum | Microsoft MVP - Enterprise Client Management
    My blog: System Center Admin | Twitter: @SCCM_Marcum | Linkedin: John Marcum

    Tuesday, September 1, 2015 3:16 PM
  • SQL questions should be posted in a SQL forum rather than here (scripting forum).

    -- Bill Stewart [Bill_Stewart]

    Tuesday, September 1, 2015 4:01 PM
  • Actually, you need to either create a custom bcp format file or use the ADO.NET SqlBulkCopy class:

    Here:

    # Copy-SQLBulk.ps1
    # This shows how to use BulkCopy in PowerShell by uploading a spreadsheet to an MSSQLServer data table.
    
    $filepath = 'C:\scripts\All.xlsx'
    $excelConnection="Provider=Microsoft.ACE.OLEDB.12.0;Data Source=$filepath;Extended Properties='Excel 12.0 Xml;HDR=YES;IMEX=1'"
    $sqlConnection='Integrated Security=SSPI;Persist Security Info=False;Initial Catalog=issue;Data Source=OMEGA\SQLEXPRESS;'
    $excelQuery='select * from [SearchResult$]'
    $tablename='SearchResult'
    
    Try{
        $conn = New-Object System.Data.OleDb.OleDbConnection($excelConnection) 
        $conn.open()
        $cmd=$conn.CreateCommand()
        $cmd.CommandText=$excelQuery
        $rdr=$cmd.ExecuteReader()
        
        # create the BC object
        $sqlbc=[System.Data.SqlClient.SqlBulkCopy]$sqlConnection
        $sqlbc.DestinationTableName=$tableName
        
        # add all columns - you can add as few  as you like.
        for($i=0; $i -lt $rdr.FieldCount;$i++){
            $fname=$rdr.GetName($i)
            Write-Host $fname -ForegroundColor green
            [void]$sqlbc.ColumnMappings.Add($fname, $fname)
        }
        
        # write all of the data to the table
        $sqlbc.WriteToServer($rdr)
    }
    Catch{
        Write-Host "$_" -ForegroundColor red
    }
    Finally{
        # close whatever was opened, even if an earlier step failed
        if($rdr){ $rdr.Close() }
        if($sqlbc){ $sqlbc.Close() }
        if($conn){ $conn.Close() }
    }
    
    
    

    Just change the connection to use a CSV file.
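
    For example, a CSV source would use the ACE text driver instead of the Excel driver; the folder and file names here are only placeholders. With the text driver the Data Source is the folder, and the "table" is the file name with '.' replaced by '#':

    $csvConnection = 'Provider=Microsoft.ACE.OLEDB.12.0;Data Source=C:\scripts;Extended Properties="Text;HDR=Yes;FORMAT=Delimited"'
    $csvQuery = 'select * from [All#csv]'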


    \_(ツ)_/

    Tuesday, September 1, 2015 5:35 PM
  • Here is another version that uses a CSV as an example:

    Param (
    	[parameter(Mandatory = $true)]
    	[string]$CsvPath,
    	[parameter(Mandatory = $true)]
    	[string]$CsvFile,
    	[parameter(Mandatory = $true)]
    	[string]$TargetServer,
    	[parameter(Mandatory = $true)]
    	[string]$TargetDatabase,
    	[parameter(Mandatory = $true)]
    	[string]$TargetTable,
    	[switch]$Truncate
    	#$ColumnMappings
    )
    
    Begin {
    	
    	Function Get-CsvReader {
    		Param (
    			$csvPath,
    			$csvFile
    		)
    		
    		$csvFile = $csvFile.Replace('.', '#')
    		$connStr = 'Provider=Microsoft.ACE.OLEDB.12.0;Data Source={0};Extended Properties="Text;HDR=Yes;FORMAT=Delimited"' -f $csvPath
    		$csvConnection = New-Object System.Data.OleDb.OleDbConnection($connStr)
    		$csvConnection.Open()
    		$csvCommand = $csvConnection.CreateCommand()
    		$csvCommand.CommandText = "select * from [$csvFile]"
    		$csvCommand.ExecuteReader()
    	}
    }
    
    Process {
    	Try {
    		$csvReader = Get-CsvReader -csvPath $CsvPath -csvFile $CsvFile
    		$targetConnStr = "Data Source=$TargetServer;Initial Catalog=$TargetDatabase;Integrated Security=True;"
    		$bulkCopy = New-Object Data.SqlClient.SqlBulkCopy($targetConnStr, [System.Data.SqlClient.SqlBulkCopyOptions]::KeepIdentity)
    		$bulkCopy.DestinationTableName = $TargetTable
    		$bulkCopy.WriteToServer($csvReader)
    		Write-Host "Table $SrcTable in $SrcDatabase database on $SrcServer has been copied to table $DestTable in $DestDatabase database on $DestServer"
    	} Catch {
    		throw $_
    	} Finally {
    		$csvReader.Close()
    		$bulkCopy.Close()
    	}
    }
    

    Between the two you should be able to see how to do this.
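
    A hypothetical invocation of that second example, assuming it is saved as Import-CsvBulk.ps1 (the file, database, and table names here are placeholders):

    # Load every CSV in the folder into the same table, one bulk copy per file
    Get-ChildItem 'D:\MyDisk\*.csv' | ForEach-Object {
        .\Import-CsvBulk.ps1 -CsvPath $_.DirectoryName -CsvFile $_.Name -TargetServer 'OMEGA\SQLEXPRESS' -TargetDatabase 'Staging' -TargetTable 'TableForBulkData'
    }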


    \_(ツ)_/

    Tuesday, September 1, 2015 5:38 PM
  • So you are actually leveraging PowerShell to call the bulk import into SQL? Okay, I'll give this a shot. Next time I should post my end goal and not just the baby step that I thought I needed to take to get there. :-)

    John Marcum | Microsoft MVP - Enterprise Client Management
    My blog: System Center Admin | Twitter: @SCCM_Marcum | Linkedin: John Marcum

    Tuesday, September 1, 2015 7:10 PM
  • You are going to have to modify the examples to fit your scenario.


    \_(ツ)_/

    Tuesday, September 1, 2015 7:18 PM