Introduction

This article will demonstrate how a “master csv file” can be split into multiple “child files” having approximately matching file size (bytes) and indexing file name using SSIS.

Problem

Exporting a fixed size file will be a difficult task as we cannot compromise file contents at the cost of file size accuracy. Moreover each row size can be different. Even if we read a source file, line by line and copy it to another file we may need to identify row size so that we can compromise row/rows rather than a file size. There can be other alternative of compressing a file such as by using WinZip however the output files will remain as a zip file. We might need to zip and unzip it to and fro.

Solution

The size of each Child file is compared to a predefine size (bytes) and each row is inserted from the Master file until the expected size of the Child file exceeds the threshold or the nearest size.

In this solution we are taking care of data truncation problem and taking care of the size variations in each row.

A log file has been created to capture the information of each Child file. It provides information like file name, size in bytes, total rows and extract date and time for each files.

 

Open SQL Server Data Tools or BIDS and Create an SSIS project. The Name of the package is “ExtractMultipleFile.dtsx”.


Step 1: Adding SSIS Variable

Add a SSIS variable "ExtractPath" to store the path where we can extract all Child Files.  Let us assign a default path say, "C:\\TESTING\\ExtractPath"

Step 2: Exporting Master CSV File from OLEDB Source

Add a data flow task to extract a Master file(csv) from OLEDB source. The name of the master file is "MasterFile.csv" and it is exported under "C:\\TESTING\\ExtractPath” i.e “ExtractPath”.

Step 3: Splitting Master File into Child Files

Add a script task to the package and set the ReadOnlyVariable property as “ExtractPath" variable. Refer below list of variables that are also declared to stored Master, Child and Log file path in the script task.

       

Dim MasterPath As String = CStr(Dts.Variables("ExtractPath").Value) & "\MasterFile.csv"

Dim ChildPath As String = CStr(Dts.Variables("ExtractPath").Value) & "\ChildFile.csv"

Dim LogPath As String = CStr(Dts.Variables("ExtractPath").Value) & "\log.txt"

       

The basic idea is to use System.IO namespace to create, read & write files. Each row of the Master file is read line by line and write the contents to a Child file stream until the split criteria is within the acceptable size. Refer Split File Condition below.

 

Create Child and Log File

We can create a new Child file by using the below procedure. It has two parameter i.e Path and fs i.e a file stream.  Once the file is created it will return a file stream object to the calling portion.

       
 
Sub CreateFile(ByVal Path As String, ByRef fs As FileStream)

        If File.Exists(Path) Then File.Delete(Path)

        fs = File.Create(Path)

End Sub

       
Each files has an indexed name to identify the sequence of the split process, for instance ChildFile_1.csv. The variable “fileCounter” identifies the nth file to be created.        


Dim childfs As FileStream

Dim logfs As FileStream

Call CreateFile(Replace(ChildPath, ".csv", "_" & fileCounter & ".csv"), childfs)

Call CreateFile(Logpath, logfs)

                 

 

Write Child and Log File

An integer variable “filesizeCounter” is used to store and measure the size of a Child File in bytes. Each time the line information from the Master File is written to a Child File stream, the size of the new line is calculated and stored incrementally to the variable “filesizeCounter”.

       
 Sub WriteFile(ByRef fs As FileStream,
               ByVal LineInfo As String, ByRef filesizeCounter As Integer)

        Dim info As Byte() = New Text.UTF8Encoding(True).GetBytes(LineInfo & vbNewLine)

        fs.Write (info, 0, info.Length)    ' Add some information to the file.

        filesizeCounter = filesizeCounter + info.Length

End Sub

       

 

Split File Condition

Suppose we want to create Child files of 400 KB each i.e SizeKB = 400. Even if the variable "FilesizeCounter" or the current Child File size exceeds the expected size (SizeKB * 1024 bytes),  the recent line read from the Master file will be written to the Child File and then reset the FilesizeCounter to zero. It means that some fractions of bytes can be up and down i.e ~400KB.
A new Child File will be created whenever the file size counter is reset to zero.


If filesizeCounter = 0 Then

     fileCounter = fileCounter + 1

     Call CreateFile(Replace(ChildPath, ".csv", "_" & fileCounter & ".csv"), childfs)

End If

 We can also increase the Child File size to MB by changing the below operand in IF statement.

       


If filesizeCounter < (SizeKB * 1024) Then

    Call WriteFile(childfs, Parentsr.ReadLine() & vbNewLine, filesizeCounter)

    If Parentsr.EndOfStream Then childfs.Close()  

Else

    Call WriteFile(childfs, Parentsr.ReadLine() & vbNewLine, filesizeCounter)

    childfs.Close()

    filesizeCounter = 0

End if

       

 

Complete SSIS Script Code in VB.NET

                           

Imports System.IO

Public Sub Main()

   Dim FileSize As Integer = 400   'specify in KB. Can be modified.

   Dim MasterPath As String = CStr(Dts.Variables("ExtractPath").Value) & "\MasterFile.csv"

   Dim ChildPath As String = CStr(Dts.Variables("ExtractPath").Value) & "\ChildFile.csv"

   Dim LogPath As String = CStr(Dts.Variables("ExtractPath").Value) & "\log.txt"

   Try

        Call SplitFile(MasterPath, ChildPath, LogPath, FileSize)

   Catch ex As Exception

        MsgBox(ex.Message)

    End Try

    Dts.TaskResult = ScriptResults.Success

End Sub

Sub SplitFile(ByVal MasterPath As String, ByVal ChildPath As String, ByVal Logpath As String, ByVal SizeKB As Integer)

    Dim filesizeCounter As Integer

    Dim fileCounter As Integer = 0

    Dim RowCount As Integer = 0
       
    'Open the stream and read it back.

    Dim Parentsr As StreamReader = File.OpenText (MasterPath)

    Dim childfs As FileStream

    Dim logfs As FileStream

    Call CreateFile(Logpath, logfs)  'Create Log File

    Do While Parentsr.Peek() >= 0    'Looping Master File stream

       If filesizeCounter = 0 Then

           fileCounter = fileCounter + 1

           Call CreateFile(Replace(ChildPath, ".csv", "_" & fileCounter & ".csv"), childfs)

       End If

       If filesizeCounter < (SizeKB * 1024) Then    ' 409600/1024=400 KB

            Call WriteFile(childfs, Parentsr.ReadLine() & vbNewLine, filesizeCounter)

            If Parentsr.EndOfStream Then

                childfs.Close()

                Call WriteFile(logfs, "---------", 0)

                Call WriteFile(logfs, "File Name:" & _
                             Replace(ChildPath, ".csv", "_" & fileCounter & ".csv") & _

                             vbNewLine & "Row Count:" & RowCount & _

                             vbNewLine & "Size(Bytes):" & filesizeCounter & _

                             vbNewLine & "Extract End:" & Now().ToString, 0)

            End If

            RowCount = RowCount + 1

        Else

            Call WriteFile(childfs, Parentsr.ReadLine() & vbNewLine, filesizeCounter)

            childfs.Close()   ' Close Child File

            Call WriteFile(logfs, "---------", 0)

            Call WriteFile(logfs, "File Name:" & 
                              Replace(ChildPath, ".csv", "_" & fileCounter & ".csv") & _

                              vbNewLine & "Row Count:" & RowCount & _

                              vbNewLine & "Size(Bytes):" & filesizeCounter & _

                              vbNewLine & "Extract End:" & Now().ToString, 0)

            RowCount = RowCount + 1

            filesizeCounter = 0    ' Reset file size counter

        End If

    Loop

    Parentsr.Close()  ' Close Master file

    logfs.Close()     ' Close Log File

End Sub

Sub CreateFile(ByVal Path As String, ByRef fs As FileStream)

        If File.Exists(Path) Then File.Delete(Path)  'Delete the if already exist.

        fs = File.Create(Path)

End Sub

Sub WriteFile(ByRef fs As FileStream, ByVal LineInfo As String, ByRef filesizeCounter As Integer)

        Dim info As Byte() = New Text.UTF8Encoding(True).GetBytes(LineInfo & vbNewLine)

        fs.Write (info, 0, info.Length)    ' Add some information to the file.

        filesizeCounter = filesizeCounter + info.Length

End Sub
        

 

Step 4: Run the package and Output review





Conclusion

The Master CSV File has been splitted into 5 Child files each of ~400 KB except the last file. The number of rows in each Child Files are different however the size of each files is approximately 400 KB.


References