locked
Parsing text file RRS feed

  • Question

  • I have a text file as the input file, I would like to generate a csv or text file, each line has only class name, student id, score,average. Basically, I want to collect these four type information only and standardize them. There will be many classe blockes in the input file, each class block will have many students. The number of classes and students are Not fixed. The followings are just an example.

    Any suggestion on how to write a script on windows to read through the input file and generate the output file.

    output file

    "math, 55, 4, 4.5

    math, 34,3,4.5

    science, 56, 4,4"


    Input file

    "

    this is a report of classes:

    class name: math 

    student id=55, score=4

    student id=34,score=3

     Average is 4.5

     here are some cooments on this class

    therer are another class

    class name: science

    student id=56, score=4

    average is 4

    this may the end of report"


    • Edited by mail8mz Thursday, March 6, 2014 5:03 AM
    Thursday, March 6, 2014 4:56 AM

Answers

  • You'll only be holding parts of the file in memory at a time. With the way your file is structured, there's really no way around that unless you read through the file more than once, but I doubt you have so many students in a single class that the memory usage will become an issue. 

    Here's a quick revision which reads the average value from the text file (and it is assumed to be the very last line of a class's information block; all student records for that class come first, as in your example.) It's also modified slightly to keep memory usage down and improve efficiency for large data sets (using an ArrayList instead of an array).

    function ConvertFrom-ClassDataFile
    {
        [CmdletBinding()]
        param (
            [Parameter(ValueFromPipeline = $true)]
            [string[]]
            $InputObject
        )
    
        begin
        {
            $className = $null
            $studentList = New-Object System.Collections.ArrayList
        }
    
        process
        {
            foreach ($string in $InputObject)
            {
                foreach ($line in $string -split '\r?\n')
                {
                    switch -Regex ($line)
                    {
                        '^\s*class\s*name:\s*(.+?)\s*$'
                        {
                            $className = $matches[1]
                            break
                        }
    
                        '^\s*student\s*id\s*=\s*(\d+)\s*,\s*score\s*=\s*([\d\.-]+)'
                        {
                            $student = New-Object psobject -Property @{
                                StudentID = $matches[1]
                                Score = $matches[2]
                            }
    
                            $null = $studentList.Add($student)
    
                            break
                        }
    
                        '^\s*average\s*is\s*([\d\.-]+)'
                        {
                            $avg = $matches[1]
    
                            for ($i = 0; $i -lt $studentList.Count; $i++)
                            {
                                $studentList[$i] |
                                Add-Member -MemberType NoteProperty -Name ClassName -Value $className -PassThru |
                                Add-Member -MemberType NoteProperty -Name Average -Value $avg -PassThru |
                                Select-Object -Property ClassName,StudentID,Score,Average
    
                                $studentList[$i] = $null
                            }
    
                            $studentList.Clear()
                        }
    
                    } # switch -Regex ($line)
    
                } # foreach ($line in $string -split '\r?\n')
    
            } # foreach ($string in $InputObject)
    
        } # process
    
    } # function ConvertFrom-ClassDataFile
    
    $inputFile = '.\test.txt'
    $outputFile = '.\output.csv'
    
    Get-Content -Path $inputFile |
    ConvertFrom-ClassDataFile |
    Export-Csv -NoTypeInformation -Path $outputFile
    
    

    • Marked as answer by mail8mz Friday, March 7, 2014 3:37 AM
    Thursday, March 6, 2014 6:23 AM

All replies

  • Is the "average is 4.5" type of line actually in the file, or do you want that to be calculated based on the student / score lines?  Will a particular class be duplicated in the file (more than one "math" class, for instance)?  If so, do you want the output file to distinguish between them in some way (such as Math#2, etc)?

    Here's an example of how you could do this with PowerShell (assuming that you want to calculate the average based on student scores, and aren't concerned with any possible duplicate names.

    function ConvertFrom-ClassDataFile
    {
        [CmdletBinding()]
        param (
            [Parameter(ValueFromPipeline = $true)]
            [string[]]
            $InputObject
        )
    
        begin
        {
            $className = $null
            $studentList = @()
    
            function FinishClass()
            {
                $avg = $studentList | Measure-Object -Property Score -Average | Select-Object -ExpandProperty Average
    
                foreach ($student in $studentList)
                {
                    $student.Average = $avg
                    $student
                }
            }
        }
    
        process
        {
            foreach ($string in $InputObject)
            {
                foreach ($line in $string -split '\r?\n')
                {
                    switch -Regex ($line)
                    {
                        '^\s*class\s*name:\s*(.+?)\s*$'
                        {
                            FinishClass
    
                            $studentList = @()
                            $className = $matches[1]
    
                            break
                        }
    
                        '^\s*student\s*id\s*=\s*(\d+)\s*,\s*score\s*=\s*([\d\.-]+)'
                        {
                            if ($null -eq $className)
                            {
                                Write-Warning 'Student records detected before class name.'
                            }
                            else
                            {
                                $studentList += New-Object psobject -Property @{
                                    ClassName = $className
                                    StudentID = $matches[1]
                                    Score = $matches[2]
                                    Average = $null
                                }
                            }
    
                            break
                        }
    
                    } # switch -Regex ($line)
    
                } # foreach ($line in $string -split '\r?\n')
    
            } # foreach ($string in $InputObject)
    
        } # process
    
        end
        {
            FinishClass
        }
    
    } # function ConvertFrom-ClassDataFile
    
    $inputFile = '.\test.txt'
    $outputFile = '.\output.csv'
    
    Get-Content -Path $inputFile |
    ConvertFrom-ClassDataFile |
    Export-Csv -NoTypeInformation -Path $outputFile
    

    Thursday, March 6, 2014 5:27 AM
  • Thanks for you solution!

    The 'average is 4.5' is actually in the file. we don't need calculate it, just parse it and copy to output.  However, I have a concern on the memory. If the file is very big, will it work? The file may contain many claases and each class have many students.

    Is the "average is 4.5" type of line actually in the file, or do you want that to be calculated based on the student / score lines?  Will a particular class be duplicated in the file (more than one "math" class, for instance)?  If so, do you want the output file to distinguish between them in some way (such as Math#2, etc)?

    Here's an example of how you could do this with PowerShell (assuming that you want to calculate the average based on student scores, and aren't concerned with any possible duplicate names.

    function ConvertFrom-ClassDataFile
    {
        [CmdletBinding()]
        param (
            [Parameter(ValueFromPipeline = $true)]
            [string[]]
    th        $InputObject
        )
    
        begin
        {
            $className = $null
            $studentList = @()
    
            function FinishClass()
            {
                $avg = $studentList | Measure-Object -Property Score -Average | Select-Object -ExpandProperty Average
    
                foreach ($student in $studentList)
                {
                    $student.Average = $avg
                    $student
                }
            }
        }
    
        process
        {
            foreach ($string in $InputObject)
            {
                foreach ($line in $string -split '\r?\n')
                {
                    switch -Regex ($line)
                    {
                        '^\s*class\s*name:\s*(.+?)\s*$'
                        {
                            FinishClass
    
                            $studentList = @()
                            $className = $matches[1]
    
                            break
                        }
    
                        '^\s*student\s*id\s*=\s*(\d+)\s*,\s*score\s*=\s*([\d\.-]+)'
                        {
                            if ($null -eq $className)
                            {
                                Write-Warning 'Student records detected before class name.'
                            }
                            else
                            {
                                $studentList += New-Object psobject -Property @{
                                    ClassName = $className
                                    StudentID = $matches[1]
                                    Score = $matches[2]
                                    Average = $null
                                }
                            }
    
                            break
                        }
    
                    } # switch -Regex ($line)
    
                } # foreach ($line in $string -split '\r?\n')
    
            } # foreach ($string in $InputObject)
    
        } # process
    
        end
        {
            FinishClass
        }
    
    } # function ConvertFrom-ClassDataFile
    
    $inputFile = '.\test.txt'
    $outputFile = '.\output.csv'
    
    Get-Content -Path $inputFile |
    ConvertFrom-ClassDataFile |
    Export-Csv -NoTypeInformation -Path $outputFile


    Thursday, March 6, 2014 5:46 AM
  • You'll only be holding parts of the file in memory at a time. With the way your file is structured, there's really no way around that unless you read through the file more than once, but I doubt you have so many students in a single class that the memory usage will become an issue. 

    Here's a quick revision which reads the average value from the text file (and it is assumed to be the very last line of a class's information block; all student records for that class come first, as in your example.) It's also modified slightly to keep memory usage down and improve efficiency for large data sets (using an ArrayList instead of an array).

    function ConvertFrom-ClassDataFile
    {
        [CmdletBinding()]
        param (
            [Parameter(ValueFromPipeline = $true)]
            [string[]]
            $InputObject
        )
    
        begin
        {
            $className = $null
            $studentList = New-Object System.Collections.ArrayList
        }
    
        process
        {
            foreach ($string in $InputObject)
            {
                foreach ($line in $string -split '\r?\n')
                {
                    switch -Regex ($line)
                    {
                        '^\s*class\s*name:\s*(.+?)\s*$'
                        {
                            $className = $matches[1]
                            break
                        }
    
                        '^\s*student\s*id\s*=\s*(\d+)\s*,\s*score\s*=\s*([\d\.-]+)'
                        {
                            $student = New-Object psobject -Property @{
                                StudentID = $matches[1]
                                Score = $matches[2]
                            }
    
                            $null = $studentList.Add($student)
    
                            break
                        }
    
                        '^\s*average\s*is\s*([\d\.-]+)'
                        {
                            $avg = $matches[1]
    
                            for ($i = 0; $i -lt $studentList.Count; $i++)
                            {
                                $studentList[$i] |
                                Add-Member -MemberType NoteProperty -Name ClassName -Value $className -PassThru |
                                Add-Member -MemberType NoteProperty -Name Average -Value $avg -PassThru |
                                Select-Object -Property ClassName,StudentID,Score,Average
    
                                $studentList[$i] = $null
                            }
    
                            $studentList.Clear()
                        }
    
                    } # switch -Regex ($line)
    
                } # foreach ($line in $string -split '\r?\n')
    
            } # foreach ($string in $InputObject)
    
        } # process
    
    } # function ConvertFrom-ClassDataFile
    
    $inputFile = '.\test.txt'
    $outputFile = '.\output.csv'
    
    Get-Content -Path $inputFile |
    ConvertFrom-ClassDataFile |
    Export-Csv -NoTypeInformation -Path $outputFile
    
    

    • Marked as answer by mail8mz Friday, March 7, 2014 3:37 AM
    Thursday, March 6, 2014 6:23 AM